I recently got interested in web crawlers but one thing isn’t a very clear

Asked: June 7, 2026 (2026-06-07T17:46:31+00:00)

I recently got interested in web crawlers, but one thing isn’t very clear to me. Imagine a simple crawler that fetches a page, extracts the links from it, and queues them for later processing in the same way.

How do crawlers handle the case where a link doesn’t lead to another page but to an asset or some other kind of static file instead? How would the crawler know? It probably doesn’t want to download this kind of possibly large binary data, or even XML or JSON files. How does content negotiation fit into this?

The way I see it, content negotiation should happen on the web server side: when I issue a request to example.com/foo.png with Accept: text/html, the server should send back an HTML response, or a Bad Request status if it cannot satisfy my requirements. Nothing else is acceptable. But that’s not how it works in real life. The server sends back the binary data anyway, with Content-Type: image/png, even though I’m telling it I only accept text/html. Why do web servers work like this instead of enforcing the response type I’m asking for?

Is the implementation of content negotiation broken, or is it the application’s responsibility to implement it correctly?

And how do real crawlers work? Sending a HEAD request ahead of time to check what’s on the other side of a link seems like an impractical waste of resources.

1 Answer

Editorial Team · Answer 1 · 2026-06-07T17:46:32+00:00

Not ‘Bad Request’; the correct response is 406 Not Acceptable.

The HTTP spec states that a server SHOULD send back this status [1], but most implementations don’t do this. If you want to avoid downloading a content type you’re not interested in, your only option is indeed to do a HEAD request first.
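
A minimal sketch of the HEAD-first check in Python, using only the standard library. The helper names (`media_type`, `looks_like_html`, `head_is_html`) and the set of types treated as HTML are assumptions made for illustration. Treating every HTTP error as "not HTML" is a simplification; some servers reject HEAD even for pages they would serve on GET.

```python
import urllib.error
import urllib.request

# Media types this sketch treats as crawlable pages (an assumption).
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def media_type(content_type):
    """Reduce 'text/html; charset=utf-8' to 'text/html'; a missing header gives ''."""
    return (content_type or "").split(";", 1)[0].strip().lower()


def looks_like_html(content_type):
    """True if the Content-Type header value names an HTML media type."""
    return media_type(content_type) in HTML_TYPES


def head_is_html(url, timeout=10):
    """Send a HEAD request and report whether the server declares an HTML body."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"Accept": "text/html"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return looks_like_html(response.headers.get("Content-Type"))
    except urllib.error.HTTPError:
        # Covers 406 Not Acceptable as well as any other 4xx/5xx status.
        return False
```

A crawler would call `head_is_html(url)` before queueing the full GET, at the cost of one extra round trip per link.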
Since you probably found these image URLs while crawling a page, you may also be able to make an intelligent guess that a link points to an image (for instance, because it appeared in an <img> tag).
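
The tag-based guess could look roughly like this in Python. This is a sketch under assumptions: the `ASSET_ATTRS` list is hypothetical and incomplete, and `LinkCollector` and `classify_links` are illustrative names. Only `<a href>` targets are treated as candidate pages; URLs found in tags such as `<img>` are set aside as assets.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag/attribute pairs that usually reference assets rather than pages
# (an assumed, incomplete list).
ASSET_ATTRS = {
    ("img", "src"),
    ("script", "src"),
    ("link", "href"),
    ("source", "src"),
    ("video", "src"),
    ("audio", "src"),
}


class LinkCollector(HTMLParser):
    """Collect URLs from a page, split by the tag they appeared in."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pages = []   # targets of <a href>: candidates for crawling
        self.assets = []  # targets of <img src> and similar: skipped

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            url = urljoin(self.base_url, value)
            if (tag, name) == ("a", "href"):
                self.pages.append(url)
            elif (tag, name) in ASSET_ATTRS:
                self.assets.append(url)


def classify_links(html, base_url):
    """Return (pages, assets) as two lists of absolute URLs."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.pages, collector.assets
```

The guess is only a heuristic: an `<a href="foo.png">` still ends up in `pages`, so the response's Content-Type has to be checked anyway.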

You could also just start the request as normal and, as soon as you notice that you’re getting binary data back, cut the TCP connection short. But I’m not sure how good an idea this is.
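
One way to approximate this with Python's standard library is to inspect the response headers and close the response without reading the body when the declared type is not HTML. This is a sketch, not a recommendation: `fetch_html`, `HTML_TYPES`, and the `max_bytes` cap are assumptions, and the server may already have pushed part of the body onto the wire before the connection is closed.

```python
import urllib.error
import urllib.request

# Media types this sketch treats as crawlable pages (an assumption).
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def fetch_html(url, max_bytes=2_000_000, timeout=10):
    """GET a URL, but return the body only if the server declares it as HTML.

    Returns the body as bytes (at most max_bytes), or None when the
    response is not HTML or the server answers with an error status.
    """
    request = urllib.request.Request(url, headers={"Accept": "text/html"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            content_type = response.headers.get("Content-Type", "")
            declared = content_type.split(";", 1)[0].strip().lower()
            if declared not in HTML_TYPES:
                # Leaving the with-block closes the response without
                # this code ever reading the body.
                return None
            return response.read(max_bytes)
    except urllib.error.HTTPError:
        # Includes 406 Not Acceptable from servers that do negotiate.
        return None
```

Compared with a separate HEAD request, this costs a single round trip per link, and the `max_bytes` cap also limits the damage from a server that mislabels a large file as HTML.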
