I just started thinking about creating/customizing a web crawler today, and know very little about web crawler/robot etiquette. A majority of the writings on etiquette I’ve found seem old and awkward, so I’d like to get some current (and practical) insights from the web developer community.
I want to use a crawler to walk over ‘the web’ for a super simple purpose – ‘does the markup of site XYZ meet condition ABC?’.
This raises a lot of questions for me, but I think the two main questions I need to get out of the way first are:
- It feels a little ‘iffy’ from the get go — is this sort of thing acceptable?
- What specific considerations should the crawler take to not upset people?
Obey robots.txt, and don't crawl too aggressively (as others have said already).
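For example, Python's standard library ships a robots.txt parser you can use before every fetch. A minimal sketch (the bot name `MarkupCheckerBot` and the URLs are placeholders; in a real crawler you'd fetch the live file with `set_url()`/`read()` rather than parsing an inline string):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# Parse an inline robots.txt here for illustration; normally you'd do:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse("""
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines())

# Check permission before requesting each page
print(rp.can_fetch("MarkupCheckerBot", "https://example.com/public/page"))   # True
print(rp.can_fetch("MarkupCheckerBot", "https://example.com/private/page"))  # False

# Honor any Crawl-delay the site declares (sleep this long between requests)
delay = rp.crawl_delay("MarkupCheckerBot")
```

Respecting `Crawl-delay` (or falling back to a conservative default of a few seconds per domain) goes a long way toward not upsetting anyone.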
You might also want to think about your user-agent string: it's a good place to be up-front about what you're doing and how you can be contacted.
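A common convention is to include the bot's name, a version, a URL describing the crawler, and a contact address. A sketch using Python's `urllib.request` (the bot name, URL, and email are all placeholders):

```python
import urllib.request

# Identify the crawler clearly so site owners can look you up or reach you
req = urllib.request.Request(
    "https://example.com/",
    headers={
        "User-Agent": "MarkupCheckerBot/0.1 (+https://example.com/bot-info; bot-admin@example.com)"
    },
)
```

That way, anyone who sees your requests in their logs can find out what the crawler does and ask you to stop, rather than just blocking or blacklisting you.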