The Archive Base

Editorial Team
Asked: May 28, 2026

I’m scraping values from HTML pages using XPath inside a Java program to reach a specific tag, and occasionally using regular expressions to clean up the data I receive.

After some research, I landed on HTML Cleaner ( http://htmlcleaner.sourceforge.net/ ) as the most reliable way to parse raw HTML into well-formed XML. HTML Cleaner, however, only supports XPath 1.0, and I find myself needing functions like ‘contains’. For instance, in this piece of XML:

<div>
  <td id='1234 foo 5678'>Hello</td>
</div>

I would like to be able to get the text ‘Hello’ with the following XPath:

//div/td[contains(@id, 'foo')]/text()

Is there any way to get this functionality? I have several ideas, but would prefer not to reinvent the wheel if I don’t need to:

  • If there is a way to call HTML Cleaner’s evaluateXPath and get back a TagNode (which I have not found), I can run an XML serializer on the returned TagNode and chain XPaths together to achieve the desired functionality.
  • I could use HTML Cleaner to clean to XML, serialize it back to a string, and use that with another XPath library, but I can’t find a good Java XPath evaluator that works on a string.
  • Using TagNode functions like getElementsByAttValue, I could essentially recreate XPath evaluation and add the contains functionality using String.contains.

Short question: Is there any way to use XPath contains on HTML with an existing Java library?

1 Answer

  1. Editorial Team added an answer on May 28, 2026 at 1:59 pm

    Regarding this:

    I could use HTML Cleaner to clean to XML, serialize it back to a
    string, and use that with another XPath library, but I can’t find a
    good java XPath evaluator that works on a string.

    This is exactly what I would do, except that you don’t need to go through a string (see below).
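On the concern about finding an evaluator that works on a string: the JDK's own javax.xml.xpath package can do this once the markup is well-formed, because XPath.evaluate accepts an org.xml.sax.InputSource. The following is a minimal sketch under that assumption; the class name XPathOnString and its evaluate helper are made up for illustration, and the XML literal stands in for whatever a cleaner would serialize.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of HTML Cleaner or the JDK.
final class XPathOnString {
    /** Evaluates an XPath 1.0 expression against well-formed XML held in a String. */
    static String evaluate(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // InputSource wraps the string; JAXP parses it internally before evaluating.
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Assumed input: XML that is already well-formed (for example, cleaner output).
        String xml = "<div><td id='1234 foo 5678'>Hello</td></div>";
        System.out.println(evaluate(xml, "//div/td[contains(@id, 'foo')]/text()"));
    }
}
```

This form re-parses the string on every call, so for repeated queries against the same page it is usually better to build a Document once and query that.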

    A lot of HTML parsers try to do too much. HTMLCleaner, for example, does not completely implement the XPath 1.0 spec; contains, for instance, is an XPath 1.0 function. The good news is that you don’t need it to. All you need from HTMLCleaner is for it to parse the malformed input. Once you’ve done that, it’s better to use the standard XML interfaces to deal with the resulting (now well-formed) document.

    First convert the document into a standard org.w3c.dom.Document like this:

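    import org.htmlcleaner.*;  // HtmlCleaner, TagNode, DomSerializer, CleanerProperties

    // The input is deliberately malformed; createDOM can throw ParserConfigurationException.
    TagNode tagNode = new HtmlCleaner().clean(
            "<div><table><td id='1234 foo 5678'>Hello</td>");
    org.w3c.dom.Document doc = new DomSerializer(
            new CleanerProperties()).createDOM(tagNode);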
    

    And then use the standard JAXP interfaces to query it:

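    import javax.xml.xpath.*;  // XPath, XPathFactory, XPathConstants

    // evaluate throws XPathExpressionException if the expression is invalid.
    XPath xpath = XPathFactory.newInstance().newXPath();
    String str = (String) xpath.evaluate("//div//td[contains(@id, 'foo')]/text()",
                           doc, XPathConstants.STRING);
    System.out.println(str);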
    

    Output:

    Hello
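The same JAXP interfaces can also return nodes instead of a string, which covers the idea of chaining XPaths: select the matching elements first, then run a second, relative expression against each one. The sketch below is illustrative only. The class ChainedXPath and its methods are invented names, and the Document is built from a small XML literal with DocumentBuilder purely so the example runs on its own; in the scraping scenario it would be the doc produced by DomSerializer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class for illustration.
final class ChainedXPath {
    /** Returns the trimmed text of every td whose id contains the given fragment. */
    static List<String> cellTexts(Document doc, String idFragment) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Step 1: select the matching elements as nodes rather than as a string.
        // The fragment is concatenated for brevity, so it must not contain a single quote.
        NodeList cells = (NodeList) xpath.evaluate(
                "//td[contains(@id, '" + idFragment + "')]", doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < cells.getLength(); i++) {
            Node cell = cells.item(i);
            // Step 2: evaluate a relative expression with the selected node as context.
            texts.add(xpath.evaluate("normalize-space(.)", cell));
        }
        return texts;
    }

    /** Stand-in for the DOM a cleaner would produce; parses well-formed XML only. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<div><td id='1234 foo 5678'>Hello</td>"
                + "<td id='bar'>Skip</td><td id='foo2'> World </td></div>");
        System.out.println(cellTexts(doc, "foo"));
    }
}
```

Working with NODESET results keeps each match available as a context node, so further expressions (attributes, sibling cells, and so on) can be evaluated per match without serializing anything back to text.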
    