The Archive Base

Editorial Team
Asked: May 21, 2026

I’m using the Java XPath API to extract content from an XHTML file. I’m parsing the HTML and trying to extract the content of a specific div. The div contains text and a few child elements. When I use XPath, it strangely ignores all HTML tags and extracts the textual content only. Here’s an HTML snippet.

<html>
<body>
<div class="content">
    <div class="content_wrapper">
        <table border="0" cellspacing="0" cellpadding="0" class="test_class">
            <tr>
                <td>
                    <p>
                        Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to
                        download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks.
                    </p>
                    <p style="text-align: center;">
                        <img src="/testsource/fckdata/208123/image/showcarswatch.jpg" alt="" />
                        <img src="/testsource/fckdata/208123/image/engineswatch.jpg" alt="" />
                        <img src="/th.gen/?:760x0:/userdata/fckdata/208123/image/toasterswatch.jpg" alt="" />
                        <img src="/testsource/fckdata/208123/image/smartphoneswatch.jpg" alt="" />
                    </p>
                    <p>
                        <br />
                        Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you
                        just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it:<br />
                    </p>
                    <p>
                        <strong>Operating System</strong><br />
                        • Microsoft® Windows® XP Professional (SP 2 or higher)<br />
                        • Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br />
                        • Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1)
                    </p>
                </td>
            </tr>
        </table>
    </div>
</div>
</body>
</html>

Now, here’s the code I’m using. I need to do some cleanup before using the xpath.

CleanerProperties props = new CleanerProperties();
props.setOmitDoctypeDeclaration(true);
props.setAllowHtmlInsideAttributes(true);
props.setOmitUnknownTags(true);

TagNode tagNode = new HtmlCleaner(props).clean(urlXML, "UTF-8");        
Document doc = new DomSerializer(props, true).createDOM(tagNode);

String content = XPathAPI.eval(doc, "/html/body//div[@class='content']/div[@class='content_wrapper']").toString();

And here’s the output.


Reading and looking at images or movies is one thing. Experiencing it in 3D the other. If you like to figure out more about what Showcase is, I would really encourage you to
download Showcase Viewer and have a look at the demo files also available on this site. Interact with the models and see how real it looks.

Showcase Viewer is actually a full Showcase install, except data processing and creation tools. This means that you can look at any data created with a regular Showcase you
just can´t add any information. But you may join a collaboration session hosed by a Showcase Professional user. Here is where you can get it

Operating System
• Microsoft® Windows® XP Professional (SP 2 or higher)<br /> 
• Windows XP Professional x64 Edition (Autodesk® Showcase® software runs as a 32-bit application on 64-bit operating system)<br /> 
• Microsoft Windows Vista® 32-bit or 64-bit, including Business, Enterprise or Ultimate (SP 1)

All I need is the complete content within the content_wrapper div.

Any pointers will be highly appreciated.

  • Thanks

EDIT

Sample code in response to yamburg’s solution.

XPathFactory factory = XPathFactory.newInstance();
XPath xpathCompiled = factory.newXPath();
XPathExpression expr = xpathCompiled.compile(contentPath);
NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);


for (int i = 0; i < nodes.getLength(); i++) {
    Node n = (Node)nodes.item(i);
    traverseNodes(n);
}

public static void traverseNodes( Node n ) {
    NodeList children = n.getChildNodes();
    if( children != null ) {
        for (int i = 0; i < children.getLength(); i++) {
            Node childNode = children.item( i );
            System.out.println( "node name = " + childNode.getNodeName() );
            System.out.println( "node value = " + childNode.getNodeValue() );
            System.out.println( "node type = " + childNode.getNodeType() );
            traverseNodes( childNode );
        }
    }
}
1 Answer

  1. Editorial Team
    Added an answer on May 21, 2026 at 7:27 pm

    The XPath expression matches a node set: in your case, an element node (the div) with child element nodes. toString() returns the string value of that node, which is just the text of its descendants, without element names or attributes.
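To make the difference concrete, here is a minimal sketch using the JDK's JAXP classes rather than Xalan's XPathAPI. The sample markup and the class name `StringValueDemo` are assumptions made for illustration; they are not taken from the question.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class StringValueDemo {
    // Hypothetical sample markup, much smaller than the page in the question.
    static final String SAMPLE =
        "<div class=\"content_wrapper\"><p>Hello <strong>world</strong></p></div>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Evaluating to a string yields the node's string value: descendant text only.
    static String asString(Document doc, String path) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        return xp.evaluate(path, doc);
    }

    // Evaluating to a node keeps the element together with its children.
    static Node asNode(Document doc, String path) throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        return (Node) xp.evaluate(path, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        String path = "/div[@class='content_wrapper']";
        System.out.println(asString(doc, path));           // text without tags
        Node div = asNode(doc, path);
        System.out.println(div.getNodeName() + " has "
                + div.getChildNodes().getLength() + " child node(s)");
    }
}
```

The string form loses the `p` and `strong` elements, while the node form still has them available for serialization.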

    You should get the node:

    NodeSequence nodes = (NodeSequence) XPathAPI.eval(doc, xpath);
    

    and then walk through the nodes, dumping whatever you want from them, or convert the result into a new DOM document, for instance.
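Walking the nodes could look roughly like the sketch below. It uses only the standard `org.w3c.dom` API; the class name `NodeWalker`, the sample markup, and the output format are illustrative assumptions.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeWalker {
    // Hypothetical sample markup used only for this sketch.
    static final String SAMPLE = "<div><p>Hello <strong>world</strong></p></div>";

    // Appends one line per node, indented by depth, then recurses into children.
    static void walk(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.append('<').append(node.getNodeName()).append(">\n");
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            out.append("text: ").append(node.getNodeValue().trim()).append('\n');
        } else {
            out.append("other node type ").append(node.getNodeType()).append('\n');
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), depth + 1, out);
        }
    }

    // Parses the markup and describes the tree below the root element.
    static String describe(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(describe(SAMPLE));
    }
}
```

Note that the loop condition is `i < children.getLength()`; with `>` the loop body never runs and nothing is printed.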

    P.S. Xalan is good, but modern Java has JAXP. For the sake of portability of code and knowledge, I’d suggest using that (unless Xalan extensions are required or useful):

    XPathFactory factory = XPathFactory.newInstance();
    XPath xpathCompiled = factory.newXPath();
    XPathExpression expr = xpathCompiled.compile(xpath);
    
    NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
    

    Then, to convert it into String (apparently that’s what you want):

    StringWriter sw = new StringWriter();
    Transformer serializer = TransformerFactory.newInstance().newTransformer();
    serializer.transform(new DOMSource(nodes.item(0)), new StreamResult(sw));
    String result = sw.toString(); 
    

    Note that it only takes the very first element from the NodeList, because XML must have a root element. In your case that is OK, if I understand correctly. Otherwise you’d need to add a top-level element over the node set.
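When the expression can match several nodes, one option is to serialize each match on its own and concatenate the results. The sketch below assumes that approach; the class name `SerializeMatches` and the sample markup are hypothetical. It also sets `OutputKeys.OMIT_XML_DECLARATION` so that each fragment is not prefixed with an XML declaration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SerializeMatches {
    // Hypothetical sample markup with two matching elements.
    static final String SAMPLE = "<div><p>one</p><p>two <b>2</b></p></div>";

    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Serializes every node matched by the expression, markup included.
    static String serializeAll(Document doc, String path) throws Exception {
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(path, doc, XPathConstants.NODESET);
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        // Without this, each fragment would start with an XML declaration.
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter sw = new StringWriter();
        for (int i = 0; i < nodes.getLength(); i++) {
            serializer.transform(new DOMSource(nodes.item(i)), new StreamResult(sw));
        }
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(serializeAll(parse(SAMPLE), "/div/p"));
    }
}
```

The concatenated result is a fragment, not a well-formed XML document, so wrap it in a root element if it has to be parsed again later.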

© 2021 The Archive Base. All Rights Reserved