I am using Nutch to crawl a large website.
The webpages are generated by a CGI program. Most of the URLs contain query strings such as ?id=2323&title=foo.
I want to crawl these webpages because they contain a lot of useful information.
However, the problem I'm facing is that this website has a calendar, so date-based pages are generated too. That means Nutch will try to crawl endless pointless pages such as year=2030&month=12.
This is quite stupid.
How can I avoid such a trap in Nutch? By writing many regular expressions?
Add regex patterns to conf/regex-urlfilter.txt to specify rules that accept or reject URLs.
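For example, a minimal sketch of such rules, assuming the calendar pages can be recognized by the year and month query parameters shown in the question (adjust the parameter names to whatever the site really uses). Each line starts with + (accept) or - (reject) followed by a regex, and the first rule that matches a URL decides, so the reject rules have to come before the catch-all accept:

```
# Reject calendar pages (parameter names assumed from the question)
-[?&]year=\d+
-[?&]month=\d+

# Accept everything else, including ?id=2323&title=foo style URLs
+.
```

Also check the file for a stock rule like -[?*!@=], which rejects any URL containing ? or =. If it is present, it needs to be removed or commented out, otherwise the ?id=... pages will be filtered out as well.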