Is there any way to use Pentaho to parse a table's td cells from an HTML page?
Let's say I have this HTML content:
<html>
<body>
<table>
<tr>
<td>info1</td>
<td>info2</td>
</tr>
<tr>
<td>info3</td>
<td>info4</td>
</tr>
</table>
</body>
</html>
In Pentaho I am using the "Get data from XML" step with the following settings:
Content: Loop XPath: /html/body/table/tr
Fields: Name: tableData, XPath: td
The output I would like to get is
info1 info2 info3 info4
in any form.
Any help would be truly appreciated!
I solved it by reading every line of my file in as rows. Then I added a Pentaho “User Defined Java Class” step that uses XSLT to transform the table content into a new XML file. From that XML I was able to get the data I needed to complete the task.
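As a rough illustration of the XSLT idea, here is a minimal standalone sketch in plain Java. It is not the step code itself. The class name `TableToXml`, the stylesheet, and the output element names `cells` and `cell` are assumptions made up for this example, and it assumes the HTML is well-formed enough to be parsed as XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: flattens every <td> of the table into one <cell> element.
class TableToXml {
    // Assumed stylesheet (not the original one): emits one <cell> per <td>, in document order.
    static final String XSLT =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/\">"
      + "<cells><xsl:for-each select=\"/html/body/table/tr/td\">"
      + "<cell><xsl:value-of select=\".\"/></cell>"
      + "</xsl:for-each></cells>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the given markup and returns the resulting XML as a string.
    // The input must be well-formed XML, otherwise the transform throws.
    static String transform(String xhtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><table>"
            + "<tr><td>info1</td><td>info2</td></tr>"
            + "<tr><td>info3</td><td>info4</td></tr>"
            + "</table></body></html>";
        System.out.println(transform(html));
    }
}
```

With the cells flattened like this, a "Get data from XML" step could loop over /cells/cell and read each value with the field XPath `.` (again assuming those element names).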
Here is what I wrote in “User Defined Java Class”: