I am looking to write a Java app which queries multiple URLs (defined by a list of URIs) for their HTML source and returns the contents of a specific element with a defined id on each page.
As an example, let's say one started with a list of blog post URLs such as…
…now, if a sample page looks like the following…
<html>
<body>
<div class="content">
<h2 id="post_title">Post Title</h2>
<p class="post_paragraph">Here is the content of my post.</p>
</div>
</body>
</html>
How can I grab the contents of the “post_title” id for each of my URLs, and print it to the console with the classic System.out.print(String s)?
Thanks for all input.
First, fetch the HTML source of each URL using Java's connection API:
http://download.oracle.com/javase/6/docs/api/java/net/URLConnection.html
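A minimal sketch of that first step, using only the JDK. The class name PageFetcher, the 10-second timeouts, and the UTF-8 charset are assumptions made for illustration; a real app would probably want to honor the charset the server declares.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;

public class PageFetcher {

    /** Reads the stream to its end and decodes the bytes with the given charset. */
    public static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), charset);
    }

    /** Opens a connection to the URL and returns the response body as text. */
    public static String fetch(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        // Assumed timeouts (milliseconds) so a slow host cannot hang the whole run
        connection.setConnectTimeout(10000);
        connection.setReadTimeout(10000);
        InputStream in = connection.getInputStream();
        try {
            // Assumption: the page is encoded as UTF-8
            return readAll(in, Charset.forName("UTF-8"));
        } finally {
            in.close();
        }
    }
}
```

Calling PageFetcher.fetch(new URL(address)) once per entry in your list would give you each page's source as a String, ready to hand to a parser.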
Then you will need to parse the HTML with a parser library:
http://www.google.be/search?q=java+html+parser
And finally you will need to walk the parsed document structure (that will depend on the parser you choose) to find an element with the given id.
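As one possible illustration of the parse-and-walk steps, here is a sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html), chosen only because it needs no extra dependency. The class and method names (ElementById, textOfElementWithId) are made up for this example. The bundled parser is old and lenient, so a dedicated third-party parser may cope better with modern pages.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class ElementById {

    /**
     * Returns the trimmed text inside the first element whose id
     * equals targetId, or null if no such element is found.
     */
    public static String textOfElementWithId(String html, final String targetId)
            throws IOException {
        final StringBuilder text = new StringBuilder();
        final boolean[] found = {false};

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;        // > 0 while inside the matching element
            private boolean done = false; // true once the matching element has closed

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (done) {
                    return;
                }
                if (depth > 0) {
                    depth++; // nested element inside the match
                    return;
                }
                Object id = attrs.getAttribute(HTML.Attribute.ID);
                if (id != null && targetId.equals(id.toString())) {
                    depth = 1;
                    found[0] = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (depth > 0 && --depth == 0) {
                    done = true;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    text.append(data);
                }
            }
        };

        // Third argument true: ignore any charset declared inside the document
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return found[0] ? text.toString().trim() : null;
    }
}
```

With the sample page from the question, textOfElementWithId(html, "post_title") should give you the heading text, which you can pass straight to System.out.print.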