I am parsing an HTML page using XmlSlurper and HtmlCleaner, and I get the GPathResult with
def page = new XmlSlurper(false,false).parseText(xml)
Now I can use GPath to access the various nodes.
In the HTML I have a paragraph like this one:
<p>
some_text1
<br />
some_text2
<br />
some_text3
<br />
....
some_textN
<br />
</p>
The problem is that I don’t know how to parse the text in the paragraph. I need to split the text inside the paragraph using the <br /> tag as a separator and get a list like
[some_text1, some_text2, some_text3, .... ,some_textN]
Given a node like
def node = page.body.some_path.p[0]
If I use text() I get all the text in the paragraph but without the <br /> tags, so I cannot use the split method, and I can’t find a way to get the raw HTML inside the paragraph from the node.
Is there some way to parse this text?
Thanks for the help.
I’ve had this problem in the past with GPath and couldn’t really find a good way to go about it either.
What I ended up doing is a search/replace for <br /> in this case, on the raw markup before it is parsed, replacing it with something that isn’t an XML element. Call it REPLACEMENT_SEPARATOR.
That way, you could call node.text().split(REPLACEMENT_SEPARATOR) and get your array.
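The search/replace idea above can be sketched roughly as follows. This is only an illustration under a few assumptions: the separator value '@@BR@@', the sample markup, and the path page.body.p[0] are made up for the example, and the import assumes Groovy 3 or later.

```groovy
// Groovy 3+; on Groovy 2.x drop this import, XmlSlurper is available by default there.
import groovy.xml.XmlSlurper
import java.util.regex.Pattern

// Hypothetical separator: any token that cannot occur in the page text will do.
def REPLACEMENT_SEPARATOR = '@@BR@@'

// Sample markup standing in for the cleaned HTML (assumption for this sketch).
def xml = '''<html><body><p>
some_text1
<br />
some_text2
<br />
some_text3
<br />
</p></body></html>'''

// Swap every <br>, <br/> or <br /> for the plain-text separator BEFORE parsing,
// so the marker survives in text().
def patched = xml.replaceAll('(?i)<br\\s*/?>', REPLACEMENT_SEPARATOR)

def page = new XmlSlurper(false, false).parseText(patched)
def node = page.body.p[0]

// split() takes a regex, so quote the separator; then trim and drop empty pieces.
def parts = node.text()
        .split(Pattern.quote(REPLACEMENT_SEPARATOR))
        .collect { it.trim() }
        .findAll { it }

assert parts == ['some_text1', 'some_text2', 'some_text3']
```

Note that String.split interprets its argument as a regular expression, so a separator containing characters such as | or . needs quoting as shown. The trailing <br /> leaves an empty last piece, which the findAll step removes.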