A web-bot crawling your site and using bandwidth resources. Bots are numerous and for

Question


Asked: May 19, 2026 (2026-05-19T03:44:39+00:00)


A web bot crawls your site and uses bandwidth resources.
Bots are numerous and serve many purposes: homemade bots, university research, scrapers, new startups, established search engines (and probably many more categories).

Apart from large search engines, which can potentially send traffic to a site, why do webmasters allow other bots whose purpose they do not immediately know?
What are the incentives for webmasters to allow these bots?

Second question:

Should a distributed crawler with multiple crawl-agent nodes on the internet use a different User-Agent string for each agent? If they all use the same UA, the benefit of scaling via multiple agents is greatly reduced,
because large websites with a high crawl-delay set may take weeks or months to crawl fully.

Third question:
robots.txt (the only defined crawl-control method) works at the domain level.
Should a crawler apply its politeness policy per domain or per IP (sometimes many websites are hosted on the same IP)?

How should such web politeness problems be tackled? Are there any other related things to keep in mind?


1 Answer

Editorial Team · Answer 1 · 2026-05-19T03:44:40+00:00

There are many useful bots besides search engine bots, and the number of search engines keeps growing. In any case, the bots you would want to block are probably sending false user-agent strings and ignoring your robots.txt file, so how are you going to stop them? Once you detect them you can block some at the IP level, but others are hard to stop.
The user-agent string has nothing to do with crawl rate. Millions of browser users all share the same user-agent string; websites throttle access based on your IP address. If you want to crawl a site faster you'll need more agents, but really, you shouldn't be doing that. Your crawler should be polite: it should crawl each individual site slowly whilst making progress on many other sites.
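The idea of crawling each site slowly while still making progress on many others can be sketched as a small scheduler. This is only an illustration, not a description of any particular crawler: the class name `PoliteScheduler`, the 10-second default gap, and the `example` hostnames are all assumptions made for the sketch, and time is passed in explicitly so the logic is easy to follow.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteScheduler:
    """Hypothetical frontier: each host is fetched at most once per `delay` seconds."""

    def __init__(self, delay=10.0):
        self.delay = delay                # assumed minimum gap per host, in seconds
        self.queues = defaultdict(deque)  # host -> URLs waiting to be fetched
        self.ready = []                   # heap of (earliest_allowed_time, host)
        self.scheduled = set()            # hosts that currently have a heap entry
        self.next_ok = {}                 # host -> earliest time of its next fetch

    def add(self, url, now=0.0):
        """Queue an absolute URL under its hostname."""
        host = urlsplit(url).hostname
        self.queues[host].append(url)
        if host not in self.scheduled:
            # Respect the gap even if the host's queue had drained earlier.
            when = max(now, self.next_ok.get(host, 0.0))
            heapq.heappush(self.ready, (when, host))
            self.scheduled.add(host)

    def next_url(self, now):
        """Return a URL whose host may be fetched at `now`, or None if all must wait."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        url = self.queues[host].popleft()
        self.next_ok[host] = now + self.delay
        if self.queues[host]:
            heapq.heappush(self.ready, (self.next_ok[host], host))
        else:
            self.scheduled.discard(host)
            del self.queues[host]
        return url
```

While one host is waiting out its delay, the scheduler simply hands out URLs for other hosts, so total throughput comes from breadth across sites rather than from hitting any one site harder.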
A crawler should be polite per domain. A single IP may serve many different websites, but that's no sweat for the router that's passing packets to and fro. Each individual server will likely limit how many connections you can keep open and how much bandwidth you can consume. There's also the reverse scenario of one website served by many IP addresses (e.g. round-robin DNS or something smarter): on sites like these, bandwidth and connection limits are sometimes enforced at the router level, so once again, be polite per domain.
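As a rough sketch of per-domain politeness in practice, the crawl delay and disallowed paths can be read from each domain's robots.txt with Python's standard `urllib.robotparser`, and the hostname used as the politeness key. The robots.txt body, the agent name `ExampleBot`, and the helper names `load_rules` and `politeness_key` are made up for illustration; a real crawler would fetch the file from each domain instead of using an inline string.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (an assumption, not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 30
Disallow: /private/
"""

AGENT = "ExampleBot"  # hypothetical user-agent token


def load_rules(robots_body):
    """Parse a robots.txt body that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_body.splitlines())
    return parser


def politeness_key(url):
    """Group URLs by hostname, so rate limits apply per domain rather than per IP."""
    return urlsplit(url).hostname


rules = load_rules(ROBOTS_TXT)
delay = rules.crawl_delay(AGENT)  # seconds requested by the site, or None if unset
allowed = rules.can_fetch(AGENT, "http://www.example.com/public/page.html")
blocked = rules.can_fetch(AGENT, "http://www.example.com/private/page.html")
```

Keying on the hostname means two sites sharing one IP each get their own delay, and one site spread across many IPs is still treated as a single target.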
