I am looking for a good open source bot (crawler) to check some of the site-quality measures that Google indexing often requires.
For example:
- find duplicate titles
- find invalid links (JSpider does this, and I suspect many others do too)
- find exactly the same page served at different URLs
- etc., where "etc." covers the rest of Google's quality requirements
Your requirements are very specific, so it's unlikely there is an open source product that does exactly what you want.
There are, however, many open source frameworks for building web crawlers. Which one you use depends on your language preference.
For example: Scrapy (Python), Apache Nutch (Java), or crawler4j (Java).
Generally, these frameworks provide classes for crawling and scraping a site's pages according to rules you supply; extracting the data you need is then up to you, by hooking in your own code.
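The hook APIs differ per framework, but the custom logic you'd plug in is framework-agnostic. As a rough stdlib-only Python sketch, here is what two of the question's checks (duplicate titles, identical pages at different URLs) might look like; the `pages` dict is a hypothetical stand-in for whatever URL-to-HTML results your crawler hands you:

```python
import hashlib
from collections import defaultdict
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def find_duplicate_titles(pages):
    """pages: dict of url -> html. Returns {title: [urls]} for titles used on 2+ URLs."""
    by_title = defaultdict(list)
    for url, html in pages.items():
        by_title[extract_title(html)].append(url)
    return {t: urls for t, urls in by_title.items() if len(urls) > 1}


def find_duplicate_content(pages):
    """Groups URLs whose raw HTML is byte-identical (same page, different URLs)."""
    by_hash = defaultdict(list)
    for url, html in pages.items():
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        by_hash[digest].append(url)
    return [urls for urls in by_hash.values() if len(urls) > 1]


if __name__ == "__main__":
    # Hypothetical crawl results standing in for a real framework's output.
    pages = {
        "https://example.com/": "<html><head><title>Home</title></head><body>welcome</body></html>",
        "https://example.com/index.html": "<html><head><title>Home</title></head><body>welcome</body></html>",
        "https://example.com/contact": "<html><head><title>Contact</title></head><body>mail us</body></html>",
    }
    print(find_duplicate_titles(pages))
    print(find_duplicate_content(pages))
```

Note the hash check only catches byte-identical pages; near-duplicates (e.g. differing only in a timestamp) would need fuzzier fingerprinting, which is exactly the kind of logic you'd hook into the framework yourself.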