Does a web crawler return the extracted text from webpages only? Say, if there

Question


Asked: May 23, 2026


Does a web crawler return only the text extracted from web pages? Say there are some PDF/DOC files stored on the web server as well: can a web crawler crawl through them and return their content too? Also, what would you suggest as a good open-source Java web crawler?

Thank You!

1 Answer

Editorial Team · Answer 1 · 2026-05-23T09:11:59+00:00

A web crawler doesn't extract the text. It simply returns the raw HTML with some transformations (UTF-8 conversion, for example) applied.

If you think of it that way, the file type doesn't matter to the crawler at the first hop: a PDF or DOC file is fetched like any other URL. For multiple hops, of course, the crawler needs to look inside these documents to find further links, and typical crawlers don't follow links inside PDF/DOC files.
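To make the hop distinction concrete, here is a minimal sketch in plain Java, not taken from any particular crawler. The class name `HopSketch`, the helper names `isHtml` and `nextHopLinks`, and the regex-based link matcher are all assumptions made for illustration; a real crawler would normally use a proper HTML parser rather than a regular expression. It assumes the crawler decides what to scan from the `Content-Type` response header.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HopSketch {
    // Naive href matcher, for illustration only (hypothetical; not how any specific crawler works).
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Returns true when the Content-Type header value marks the body as HTML. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false;
        }
        String ct = contentType.toLowerCase(Locale.ROOT);
        return ct.startsWith("text/html") || ct.startsWith("application/xhtml+xml");
    }

    /**
     * Collects links for the next hop. Only HTML bodies are scanned;
     * PDF/DOC bodies are returned to the caller as-is and yield no links here.
     */
    static List<String> nextHopLinks(String contentType, String body) {
        List<String> links = new ArrayList<>();
        if (!isHtml(contentType)) {
            return links; // fetched at this hop, but not looked inside
        }
        Matcher m = HREF.matcher(body);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"/docs/report.pdf\">Report</a> <a href='/about.html'>About</a>";
        // The HTML page yields both links, including the one pointing at a PDF.
        System.out.println(nextHopLinks("text/html; charset=UTF-8", html));
        // The PDF itself is fetched, but no further links are taken from it.
        System.out.println(nextHopLinks("application/pdf", "%PDF-1.7 ..."));
    }
}
```

In this sketch the PDF linked from the HTML page is still reachable at the next hop, because its URL was found in the HTML. The crawl stops there only because nothing scans the PDF body for further links. Getting text or links out of a PDF or DOC file would need a separate document parser on top of the crawler.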
