I need to supply a base URL (such as http://www.wired.com ) and need to


Asked: May 20, 2026 (2026-05-20T00:13:29+00:00)


I need to supply a base URL (such as http://www.wired.com) and need to spider through the entire site outputting an array of pages (off the base URL). Is there any library that would do the trick?

Thanks.


1 Answer

Editorial Team · Answer 1 · 2026-05-20T00:13:29+00:00

I have used Web-Harvest a couple of times, and it is quite good for web scraping.

Web-Harvest is Open Source Web Data
Extraction tool written in Java. It
offers a way to collect desired Web
pages and extract useful data from
them. In order to do that, it
leverages well established techniques
and technologies for text/xml
manipulation such as XSLT, XQuery and
Regular Expressions. Web-Harvest
mainly focuses on HTML/XML based web
sites which still make vast majority
of the Web content. On the other hand,
it could be easily supplemented by
custom Java libraries in order to
augment its extraction capabilities.

Alternatively, you can roll your own web scraper: use a tool such as JTidy to first convert each HTML document to XHTML, and then extract the information you need with XPath. For example, a very naïve XPath expression to extract all hyperlinks from http://www.wired.com would be something like //a[contains(@href,'wired')]/@href. You can find some sample code for this approach in this answer to a similar question.
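To make the XPath step concrete, here is a minimal sketch using only the JDK's built-in XML and XPath classes. It assumes the page has already been cleaned up into well-formed XHTML (by JTidy or something similar; that step is not shown), and the class name LinkExtractor, the method extractLinks, and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of JTidy or Web-Harvest.
public class LinkExtractor {

    // Returns the href values selected by the naive XPath from the answer.
    // Assumption: xhtml is already well-formed (e.g. the output of an HTML tidier).
    public static List<String> extractLinks(String xhtml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                "//a[contains(@href,'wired')]/@href", doc, XPathConstants.NODESET);

        // Copy the attribute values out of the DOM node list.
        List<String> links = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            links.add(nodes.item(i).getNodeValue());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample markup standing in for a tidied page.
        String xhtml = "<html><body>"
                + "<a href=\"http://www.wired.com/science\">Science</a>"
                + "<a href=\"http://example.org/other\">Other</a>"
                + "</body></html>";
        for (String link : extractLinks(xhtml)) {
            System.out.println(link);
        }
    }
}
```

Note that this only extracts links from a single page. To spider a whole site you would still need to fetch each discovered URL in turn and keep a set of pages already visited. The contains(@href,'wired') test will also skip relative links such as /about, so a real crawler would probably select every //a/@href and resolve it against the base URL instead.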
