I need to supply a base URL (such as http://www.wired.com) and spider through the entire site, outputting an array of the pages found under that base URL. Is there any library that would do the trick?
Thanks.
I have used Web Harvest a couple of times, and it is quite good for web scraping.
Alternatively, you can roll your own web scraper using tools such as JTidy to first convert an HTML document to XHTML, and then process the information you need with XPath. For example, a very naïve XPath expression to extract all hyperlinks from http://www.wired.com would be something like //a[contains(@href,'wired')]/@href. You can find some sample code for this approach in this answer to a similar question.
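As a rough illustration of the XPath step only, here is a minimal Java sketch that evaluates that expression against a small hand-written XHTML string using the JDK's built-in javax.xml.xpath API. The class name LinkExtractor, the method extractLinks, and the sample markup are hypothetical. It assumes the page has already been converted to well-formed XHTML (for example by JTidy) and has no DOCTYPE or default namespace; a namespaced document would need a namespace-aware expression instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Web Harvest or JTidy.
final class LinkExtractor {

    // The naive expression from the answer: href attributes containing "wired".
    private static final String LINK_XPATH = "//a[contains(@href,'wired')]/@href";

    // Assumption: xhtml is well-formed XML with no DOCTYPE and no default namespace.
    static List<String> extractLinks(String xhtml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                    LINK_XPATH,
                    new InputSource(new StringReader(xhtml)),
                    XPathConstants.NODESET);
            List<String> links = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // Each matched node is an href attribute; its value is the URL.
                links.add(nodes.item(i).getNodeValue());
            }
            return links;
        } catch (XPathExpressionException e) {
            // Also raised when the input is not well-formed XML.
            throw new IllegalArgumentException("Could not evaluate XPath on input", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample markup standing in for a tidied page.
        String page = "<html><body><p>"
                + "<a href=\"http://www.wired.com/science\">Science</a> "
                + "<a href=\"http://example.org/\">Elsewhere</a>"
                + "</p></body></html>";
        for (String link : extractLinks(page)) {
            System.out.println(link);
        }
    }
}
```

To spider a whole site, you would feed each extracted link back into a queue of pages to fetch, keeping a set of already-visited URLs so that no page is processed twice.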