There are a number of libraries for parsing HTML pages and extracting their textual content; Jsoup is one example. In my case, I would like to extract the textual content together with the HTML tags under which each sentence occurs. For example, take this page:
<html>
<head><title>Test Page</title>
<body>
<h1>This is a test page</h1>
<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.
</body>
</html>
I’m expecting the output to be like this:
<h1>This is a test page</h1>
<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.
In other words, I want to include specific html tags within the textual content of the page.
To get your result you can use this:
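A minimal sketch of what such a snippet might look like, assuming jsoup is on the classpath. The class name BodyHtmlExample and the helper bodyHtml are hypothetical names chosen for illustration; html and body are the two variables discussed in this answer.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BodyHtmlExample {

    // Hypothetical helper: parses the markup and returns the inner HTML
    // of <body>, so tags such as <h1>, <p>, <b> and <em> are preserved.
    static String bodyHtml(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // The page from the question, kept as posted. jsoup closes the
        // unclosed <head> and <p> elements while parsing.
        String html = "<html>\n"
                + "<head><title>Test Page</title>\n"
                + "<body>\n"
                + "<h1>This is a test page</h1>\n"
                + "<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.\n"
                + "</body>\n"
                + "</html>";

        // Jsoup.parse(File, String) or Jsoup.connect(url).get() could
        // supply the Document instead of an in-memory String.
        String body = bodyHtml(html);
        System.out.println(body);
    }
}
```

The exact whitespace of the printed markup depends on jsoup's pretty-printing, so treat the output as equivalent to, rather than byte-identical with, the expected result in the question.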
Instead of the String html, you can load a file or a website too; jsoup provides all of this. In this example, body contains the HTML you posted as the expected result. Or do you need to select something like "h1 followed by p tag"?
If so, you may want to take a look at the Jsoup Selector API.
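As a rough illustration of the selector approach, here is a small sketch, again assuming jsoup is on the classpath. The class name SelectorExample and the helper selectOuterHtml are hypothetical; select and outerHtml are the jsoup calls doing the work. It collects each matched element together with its own tag and any nested tags.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorExample {

    // Hypothetical helper: returns the outer HTML (tag included) of every
    // element matching the CSS query, in document order.
    static List<String> selectOuterHtml(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element el : doc.select(cssQuery)) {
            result.add(el.outerHtml());
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<h1>This is a test page</h1>"
                + "<p> The goal is to extract <b>textual content <em> with html tags</em> </b> from html pages.</p>";

        // "h1, p" matches every <h1> and every <p>;
        // "h1 + p" would match only a <p> that directly follows an <h1>.
        for (String fragment : selectOuterHtml(html, "h1, p")) {
            System.out.println(fragment);
        }
    }
}
```

Choosing the query is the main design decision here: a comma-separated list keeps the tags you care about regardless of position, while a sibling combinator such as "h1 + p" restricts the match to a specific layout.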