The Archive Base

Editorial Team
Asked: June 7, 2026 (2026-06-07T17:46:31+00:00)

I recently got interested in web crawlers but one thing isn’t a very clear

I recently got interested in web crawlers, but one thing isn’t clear to me. Imagine a simple crawler that fetches a page, extracts the links from it, and queues them for later processing in the same way.

How do crawlers handle the case where a link leads not to another page but to an asset or some other kind of static file? How would the crawler know? It probably doesn’t want to download potentially large binary data, or even XML or JSON files. How does content negotiation fit into this?

The way I see it, content negotiation should happen on the web server side: when I request example.com/foo.png with Accept: text/html, the server should send back an HTML response, or a Bad Request status if it cannot satisfy my requirements. Nothing else should be acceptable. But that’s not how it works in real life: the server sends the binary data anyway, with Content-Type: image/png, even though I told it I only accept text/html. Why do web servers work like this instead of enforcing the response type I asked for?

Is the implementation of content negotiation broken, or is it the application’s responsibility to implement it correctly?

And how do real crawlers work? Sending a HEAD request ahead of time to check what is on the other side of a link seems like an impractical waste of resources.

1 Answer

  1. Editorial Team
    Added an answer on June 7, 2026 at 5:46 pm (2026-06-07T17:46:32+00:00)

    The correct response would be 406 Not Acceptable, not ‘Bad Request’.
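To make the 406 case concrete, here is a rough sketch of the check a strictly negotiating server would have to perform. It is not taken from any real server: `parse_accept`, `is_acceptable` and `strict_status` are hypothetical names, and the parsing is deliberately simplified (it only understands the `q` parameter and lets the most specific matching media range win).

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Split an Accept header into (media_range, q) pairs."""
    ranges = []
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_range = pieces[0].lower()
        if not media_range:
            continue
        q = 1.0  # q defaults to 1 when it is not given
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: treat a malformed q as "not wanted"
        ranges.append((media_range, q))
    return ranges


def is_acceptable(accept_header: str, media_type: str) -> bool:
    """True if the most specific matching media range has q > 0."""
    media_type = media_type.lower()
    main_type = media_type.split("/", 1)[0]
    best_specificity, best_q = -1, 0.0
    for media_range, q in parse_accept(accept_header):
        if media_range == media_type:
            specificity = 2
        elif media_range == main_type + "/*":
            specificity = 1
        elif media_range == "*/*":
            specificity = 0
        else:
            continue
        if specificity > best_specificity:
            best_specificity, best_q = specificity, q
    return best_q > 0


def strict_status(accept_header: Optional[str], media_type: str) -> int:
    """Status a strict server would pick for a resource of the given type."""
    if not accept_header:
        return 200  # no Accept header: the client takes anything
    return 200 if is_acceptable(accept_header, media_type) else 406
```

With this, `strict_status("text/html", "image/png")` gives 406, which is the behaviour the question expected from example.com/foo.png.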

    The HTTP spec states that a server SHOULD send back a 406[1], but most implementations don’t. If you want to avoid downloading a content type you’re not interested in, your only option is indeed to send a HEAD request first.
    Since you found these links by crawling, you may also be able to make an intelligent guess that a link points to an image (for instance, because it appeared in an <img> tag).
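Both ideas can be sketched with the Python standard library. This is only an illustration under some assumptions: `LinkSorter`, `head_content_type` and the `ASSET_ATTRS` table are made-up names, and the list of "asset" tags is my own guess at a reasonable default. Keep in mind that an `<a href>` can still point at a PDF or an archive, so the tag is a hint, not a guarantee.

```python
import urllib.request
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin

# Assumption: tags whose URL attribute usually points at a non-HTML asset.
ASSET_ATTRS = {"img": "src", "script": "src", "source": "src",
               "video": "src", "audio": "src", "link": "href"}


class LinkSorter(HTMLParser):
    """Collect links from a page, split by the tag they appeared in."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pages: List[str] = []   # from <a href>: candidates for the queue
        self.assets: List[str] = []  # from <img src> and friends: likely skip

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.pages.append(urljoin(self.base_url, attr_map["href"]))
        elif tag in ASSET_ATTRS and attr_map.get(ASSET_ATTRS[tag]):
            self.assets.append(
                urljoin(self.base_url, attr_map[ASSET_ATTRS[tag]]))


def head_content_type(url: str, timeout: float = 10.0) -> Optional[str]:
    """Send a HEAD request and return the bare media type, or None if absent."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        header = response.headers.get("Content-Type")
    # Drop parameters such as "; charset=utf-8".
    return header.split(";", 1)[0].strip().lower() if header else None
```

A crawler could queue everything in `pages`, ignore `assets`, and fall back to `head_content_type` only for links it is unsure about, which keeps the number of extra HEAD requests small.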

    You could also just start the request as normal and, as soon as you notice that you’re getting binary data back, cut the TCP connection short. But I’m not sure how good an idea this is.
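A slightly tamer variant of that idea is to look at the response headers and give up before reading the body at all. The sketch below assumes Python's `urllib`; `fetch_if_html`, `is_wanted` and `WANTED_TYPES` are hypothetical names, and it trusts the Content-Type header rather than sniffing the bytes. The same caveat applies as above: dropping connections early may not be kind to the server.

```python
import urllib.error
import urllib.request
from typing import Optional

# Assumption: the media types this crawler treats as "a page".
WANTED_TYPES = {"text/html", "application/xhtml+xml"}


def is_wanted(content_type_header: Optional[str]) -> bool:
    """True if a Content-Type header value names a type we want to parse."""
    if not content_type_header:
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in WANTED_TYPES


def fetch_if_html(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """GET the URL, but read the body only if the headers announce HTML."""
    request = urllib.request.Request(url, headers={"Accept": "text/html"})
    try:
        # urlopen returns once the status line and headers are parsed;
        # the body is not consumed until read() is called.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            if not is_wanted(response.headers.get("Content-Type")):
                return None  # leaving the with-block closes the response
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 406:
            return None  # the rare server that does enforce Accept
        raise
```

This costs one request per link instead of a HEAD followed by a GET, at the price of occasionally opening a connection that is thrown away.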
