I need to grab some links that only appear when certain cookies are sent with the GET request.
So when I crawl the page with crawler4j, I need to send those cookies along to get the correct page back.
Is this possible? I searched the web but found nothing useful. Or is there another Java crawler that is capable of doing this?
Any help is appreciated.
It appears that crawler4j might not support cookies: http://www.webuseragents.com/ua/427106/crawler4j-http-code-google-com-p-crawler4j-
There are several alternatives:
I would say that Nutch and Heritrix are the best ones, and I would put special emphasis on Nutch, because it is probably one of the few crawlers designed to scale well and actually carry out a large crawl.
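If only a handful of pages are involved, another option is to skip the crawler and fetch the page with the JDK's own HTTP client, setting the Cookie header by hand. This is not crawler4j; it is just a minimal sketch, assuming Java 11 or later (for java.net.http). The class name CookieFetch, the URL http://example.com/links and the cookie sessionid=abc123 are placeholders made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CookieFetch {

    // Joins name/value pairs into one Cookie header value, e.g. "a=1; b=2".
    static String cookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    // Builds a GET request that carries the given cookies.
    static HttpRequest buildRequest(String url, Map<String, String> cookies) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Cookie", cookieHeader(cookies))
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder cookie; replace with whatever the site expects.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("sessionid", "abc123");

        // Placeholder URL; replace with the page that contains the links.
        HttpRequest request = buildRequest("http://example.com/links", cookies);

        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // response.body() holds the HTML; the links can be extracted from it
        // with an HTML parser of your choice.
        System.out.println(response.statusCode());
    }
}
```

The request building is kept separate from the network call so the header logic can be checked without hitting a server. For a real crawl, with link following, politeness delays and so on, one of the crawlers mentioned above is still the better fit.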