Hi all, I’m writing a simple web crawling script that needs to connect to a webpage, follow any 302 redirects automatically, give me the final URL the link resolves to, and let me grab the HTML.
What’s the preferred Java library for this kind of thing?
Thanks
You can use Apache HttpComponents Client for this (or the “plain vanilla” URLConnection API, which is built into Java SE but verbose). For the HTML parsing/traversing/manipulation part, Jsoup may be useful.
Note that any halfway decent crawler should obey robots.txt. You may want to take a look at existing Java-based web crawlers, like J-Spider or Apache Nutch.
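If you want to try the built-in route first, here is a minimal sketch using `java.net.HttpURLConnection`. It follows redirects by hand, which keeps track of the final URL and also covers hops between http and https, which `HttpURLConnection` does not follow on its own. The class name `RedirectFetcher`, the `Page` holder, the helper names, the timeouts and the `ExampleCrawler/0.1` user agent are all made up for illustration. It assumes Java 9+ (for `readAllBytes`) and a UTF-8 page, and it does not check robots.txt.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class RedirectFetcher {

    /** Hypothetical holder: the URL that finally answered, plus its body. */
    static final class Page {
        final String finalUrl;
        final String html;

        Page(String finalUrl, String html) {
            this.finalUrl = finalUrl;
            this.html = html;
        }
    }

    /** True for the HTTP status codes that redirect via a Location header. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    /** Resolves a (possibly relative) Location value against the current URL. */
    static String resolve(String base, String location) {
        return URI.create(base).resolve(location).toString();
    }

    /** Fetches startUrl, following at most maxRedirects redirects. */
    static Page fetch(String startUrl, int maxRedirects) throws IOException {
        String current = startUrl;
        for (int hops = 0; hops <= maxRedirects; hops++) {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(current).toURL().openConnection();
            // Handle redirects ourselves so we always know the current URL.
            conn.setInstanceFollowRedirects(false);
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            conn.setRequestProperty("User-Agent", "ExampleCrawler/0.1");
            try {
                int status = conn.getResponseCode();
                if (isRedirect(status)) {
                    String location = conn.getHeaderField("Location");
                    if (location == null) {
                        throw new IOException("Redirect without Location from " + current);
                    }
                    current = resolve(current, location);
                    continue; // next hop; the finally block still disconnects
                }
                // getInputStream() throws IOException on 4xx/5xx responses.
                try (InputStream in = conn.getInputStream()) {
                    String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
                    return new Page(current, html);
                }
            } finally {
                conn.disconnect();
            }
        }
        throw new IOException("Too many redirects starting at " + startUrl);
    }
}
```

Calling `RedirectFetcher.fetch(someUrl, 5)` would give you a `Page` whose `finalUrl` is where the redirect chain ended and whose `html` you can hand to a parser such as Jsoup. The redirect cap is there so a redirect loop cannot keep the crawler busy forever.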