Does a web crawler return the extracted text from webpages only? Say, if there are some pdf/doc files stored in the web server as well. Can a web crawler crawl through them and return their content as well? Anyway what are the suggestions for a good opensource Java web crawler?
Thank You!
Web crawler doesn’t extract the text. It simply returns the htmls with some transformations [UTF-8 conversion for example] applied.
If you think of it that way for crawler it doesn’t matter at the first hop. Of course for multiple hops it needs to look inside these documents and typical crawlers don’t provide multiple hops in pdf/docs etc.