Using htmlparser (http://htmlparser.sourceforge.net/), I have been trying to extract information (Content1 + Link) from an HTML table.
Sample HTML:
<td class="xx">
<a href="http://link">Content1</a>
</td>
Java code:
CssSelectorNodeFilter cssFilter = new CssSelectorNodeFilter("td[class=\"xx\"]");
NodeList nodes = parser.parse(cssFilter);
resultSet = new String[nodes.size()][2];
for (int i = 0; i < nodes.size(); i++) {
    resultSet[i][0] = nodes.elementAt(i).toPlainTextString().trim();
    LinkTag tag = (LinkTag) (nodes.elementAt(i));
    resultSet[i][1] = tag.getLink();
}
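I can extract the first part (the Content1 String) with no problems, but I am having trouble getting the link. Depending on the variant I try, it either fails with a cast error (TableColumn or TextNode cannot be cast to LinkTag), throws a NullPointerException, or returns null. Each attempt is listed below with its result: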
as above – result: TableColumn cannot be cast to LinkTag
LinkTag tag = (LinkTag) (nodes.elementAt(i));
resultSet[i][1]=tag.getLink();
result: TextNode cannot be cast to LinkTag
LinkTag tag = (LinkTag) (nodes.elementAt(i).getFirstChild());
resultSet[i][1]=tag.getLink();
result: NullPointerException
LinkTag tag = (LinkTag) (nodes.elementAt(i).getFirstChild().getFirstChild());
resultSet[i][1]=tag.getLink();
result: returns null
Tag tag = (Tag) (nodes.elementAt(i));
resultSet[i][1]=tag.getAttribute("href");
Thanks for any ideas/solutions =)
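If you print out the children of the <TD> tag, you will see that the first child is a TextNode holding the whitespace between <td> and <a>, not the link itself. That is why casting getFirstChild() to LinkTag fails.
Therefore what you want is the sibling of the first child of the TD, though you are then at the mercy of whatever formatting is in the table.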
To find the first link in the table data, you can use this code:
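A minimal sketch of that idea, assuming htmlparser 2.x is on the classpath. The class name FirstLinkExtractor and the helper names firstLink and extract are hypothetical, chosen only for illustration. Instead of relying on the position of the link among the TD's children, it searches the children for the first LinkTag:

```java
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.filters.CssSelectorNodeFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// Hypothetical helper class, not part of htmlparser itself.
public class FirstLinkExtractor {

    // Returns the first LinkTag found below the given node, or null if there is none.
    static LinkTag firstLink(Node node) {
        NodeList children = node.getChildren();
        if (children == null) {
            return null; // node has no children at all
        }
        // 'true' makes the search recursive, so a link nested in <b>, <span>, etc. is still found.
        NodeList links = children.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class), true);
        return links.size() > 0 ? (LinkTag) links.elementAt(0) : null;
    }

    // Collects {text, link} pairs for every <td class="xx"> in the given HTML.
    static String[][] extract(String html) throws ParserException {
        Parser parser = Parser.createParser(html, "UTF-8");
        NodeList cells = parser.parse(new CssSelectorNodeFilter("td[class=\"xx\"]"));
        String[][] resultSet = new String[cells.size()][2];
        for (int i = 0; i < cells.size(); i++) {
            Node cell = cells.elementAt(i);
            resultSet[i][0] = cell.toPlainTextString().trim();
            LinkTag link = firstLink(cell);
            resultSet[i][1] = (link != null) ? link.getLink() : null;
        }
        return resultSet;
    }

    public static void main(String[] args) throws ParserException {
        String html = "<table><tr><td class=\"xx\">\n"
                + "<a href=\"http://link\">Content1</a>\n"
                + "</td></tr></table>";
        for (String[] row : extract(html)) {
            System.out.println(row[0] + " -> " + row[1]);
        }
    }
}
```

The sibling-based variant, `(LinkTag) nodes.elementAt(i).getFirstChild().getNextSibling()`, should also work for the sample markup, but it breaks as soon as the whitespace or nesting inside the cell changes, which is why the sketch filters by node class instead.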