The Archive Base

Asked by Editorial Team on June 16, 2026

I’m hoping someone will just point out something obvious that I’m missing here. I feel like I’ve done this a hundred times and for some reason tonight, the behavior coming from this is throwing me for a loop.

I’m reading in some XML from a public API. I want to extract all the text from a certain node (everything within ‘body’), which also includes a variety of child nodes. Simple example:

<xml>
    <metadata>
        <article>
            <body>
                <sec>
                    <title>A Title</title>
                    <p>
                        This contains 
                        <italic>italics</italic> 
                        and
                        <xref ref-type="bibr">xref's</xref>
                        .
                    </p>
                </sec>
                <sec>
                    <title>Second Title</title>
                </sec>
            </body>
        </article>
    </metadata>
</xml>

So ultimately I want to traverse the tree within the desired node (again, ‘body’) and extract all the text contained in its natural order. Simple enough, so I just write up this little Groovy script…

def xmlParser = new XmlParser()
def xml = xmlParser.parseText(rawXml)
xml.metadata.article.body[0].depthFirst().each { node ->
    if(node.children().size() == 1) {
        println node.text()
    }   
}

…which proceeds to blow up with “No signature of method: java.lang.String.children()”. So I’m thinking to myself “wait, what? Am I going crazy?” Node.depthFirst() should only return a List of Nodes. I add a little ‘instanceof’ check and sure enough, I’m getting a combination of Node objects and String objects. Specifically, the bare text that sits between child elements is returned as Strings, namely “This contains” and “and”. Everything else is a Node (as expected).

I can work around this easily. However, this doesn’t seem like correct behavior and I’m hoping someone can point me in the right direction.

1 Answer

  1. Editorial Team, added an answer on June 16, 2026 at 5:12 am

    I’m pretty sure that’s correct behavior (though I’ve always found XmlSlurper and XmlParser to have screwy APIs). IMO everything you can iterate through really should implement a common node interface, perhaps with a type of TEXT that tells you to read the text from it.

    Those text nodes are valid nodes, and in many cases you’d want to hit them during a depth-first traversal of the XML. If they weren’t returned, your check for a children size of 1 wouldn’t work, because some nodes (like the <p> tag) have a mix of text and elements underneath them.
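
To see that mix directly, here is a minimal sketch that parses just the <p> element from the question and lists the type of each direct child. The `sample` and `kinds` names are only for illustration, and the import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlParser  // Groovy 3+; on Groovy 2.x remove this line (XmlParser is in groovy.util there)

// Only the <p> element from the question, written on one line
def sample = """<p>This contains <italic>italics</italic> and <xref ref-type="bibr">xref's</xref>.</p>"""
def p = new XmlParser().parseText(sample)

// children() returns the raw child list: Node for elements, String for character data
def kinds = p.children().collect { it.getClass().simpleName }
println kinds
```

Note the use of getClass() rather than .class: on a Node, property access is treated as a child lookup.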

    What makes things even worse is that depthFirst isn’t consistent about text nodes: when the text is a node’s only child, as with italic above, it isn’t returned as a separate item.
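
One way to sidestep that inconsistency is to skip depthFirst entirely and recurse over children() yourself. This is only a sketch: `collectText` is a hypothetical helper, the sample is a trimmed copy of the question's XML, and the import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlParser  // Groovy 3+; on Groovy 2.x remove this line (XmlParser is in groovy.util there)

// Hypothetical helper: collect text in document order by recursing over children()
List<String> collectText(Object item) {
    if (item instanceof Node) {
        // Elements contribute whatever their children contribute, in order
        return item.children().collectMany { collectText(it) }
    }
    // Anything else is character data; trim it and skip blank pieces
    String text = item.toString().trim()
    return text ? [text] : []
}

def sample = """<body>
    <sec>
        <title>A Title</title>
        <p>This contains <italic>italics</italic> and <xref ref-type="bibr">xref's</xref>.</p>
    </sec>
    <sec><title>Second Title</title></sec>
</body>"""

def body = new XmlParser().parseText(sample)
def bodyText = collectText(body).join(' ')
println bodyText
```

Because every String child is handled explicitly, the result does not depend on which text nodes depthFirst chooses to return.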

    I like to use Groovy’s method overloading and let the runtime pick the right handler for each node (rather than using something like instanceof), like this:

    def rawXml = """<xml>
        <metadata>
            <article>
                <body>
                    <sec>
                        <title>A Title</title>
                        <p>
                            This contains 
                            <italic>italics</italic> 
                            and
                            <xref ref-type="bibr">xref's</xref>
                            .
                        </p>
                    </sec>
                    <sec>
                        <title>Second Title</title>
                    </sec>
                </body>
            </article>
        </metadata>
    </xml>"""
    
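    // Text nodes arrive as plain Strings: return them unchanged
    def processNode(String nodeText) {
        return nodeText
    }
    
    // Everything else is a Node: return its text only when it has a single child;
    // otherwise fall through and return null, which findResults discards
    def processNode(Object node) {
        if (node.children().size() == 1) {
            return node.text()
        }
    }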
    
    def xmlParser = new XmlParser()
    def xml = xmlParser.parseText(rawXml)
    def xmlText = xml.metadata.article.body[0].'**'.findResults { node ->
        processNode(node)
    }
    
    println xmlText.join(" ")
    

    Prints:

    A Title This contains italics and xref's .  Second Title
    

    Alternatively, the XmlSlurper class probably does more of what you want/expect, and its text() method produces more reasonable output. If you really don’t need to do any sort of DOM walking with the results (what XmlParser is “better” for), I’d suggest XmlSlurper:

    def xmlSlurper = new XmlSlurper()
    def xml = xmlSlurper.parseText(rawXml)
    def bodyText = xml.metadata.article.body[0].text()
    println bodyText
    

    Prints:

    A Title
                        This contains 
                        italics 
                        and
                        xref's
                        .
                    Second Title
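
If the source indentation in that output is unwanted, the whitespace can be collapsed after the fact. A minimal sketch, assuming Groovy 3 or later for the import and using a shortened copy of the question's XML:

```groovy
import groovy.xml.XmlSlurper  // Groovy 3+; on Groovy 2.x remove this line (XmlSlurper is in groovy.util there)

def rawXml = """<xml><metadata><article><body>
    <sec>
        <title>A Title</title>
        <p>
            This contains
            <italic>italics</italic>
            and
            <xref ref-type="bibr">xref's</xref>
            .
        </p>
    </sec>
    <sec><title>Second Title</title></sec>
</body></article></metadata></xml>"""

def body = new XmlSlurper().parseText(rawXml).metadata.article.body
// text() concatenates all descendant text; collapse each whitespace run to one space
def bodyText = body.text().replaceAll(/\s+/, ' ').trim()
println bodyText
```

This keeps the one-call convenience of text() and leaves the formatting decision to a single regex.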
    