I am dealing with a very primitive HTML construction that goes like this:
<a NAME="header1"></a><b><font face="Verdana, Serif"><font color="#000000"><font size=+1>Hygiene</font></font></font></b>
<p><b><font face="Verdana, Serif"><font color="#000000">Shampoo</font></b>
<p><b><font face="Verdana, Serif"><font color="#000000"></font>Soap</font></b>
<p><b><font face="Verdana, Serif"><font color="#000000">Deodorant</font></b>
<p><b><font face="Verdana, Serif"><font color="#000000">Toothpaste</font></b>
<p><b><font face="Verdana, Serif"><font color="#000000"></font>Brush</font></b>
<a NAME="header2"></a><b><font face="Verdana, Serif"><font color="#000000"><font size=+1>Food</font></font></font></b>
<p><b><font face="Verdana, Serif"><font color="#000000">Meat</font></b>
<p><b><font face="Verdana, Serif"><font color="#000000">Vegetables</font></b>
<p><b><font face="Verdana, Serif"><font color="#000000">Fruit</font></b>
What I want now is to get all items under the Hygiene header (the top one), namely Shampoo, Soap, Deodorant, Toothpaste, and Brush, and put them in, let’s say, a HashMap<String, List<String>> for now.
I use this XPath to get the headers (Hygiene and Food):
//html/body//b/font/font/font
And it works fine, I get what I need.
Then I use this XPath to collect the items:
//html/body//p/b/font/font
This matches ALL items, so the second XPath returns a single list containing every item: [Shampoo, Soap, Deodorant, Toothpaste, Brush, Meat, Vegetables, Fruit]. The problem is that I don’t know when to stop adding items to the first list (that is, when another header starts, Food in this case, I should create a new list and put the Food items there). All I can get with these XPaths is the header values (Hygiene, Food) and ALL items from both lists, not separated.
I need to get something like:
- Map{“Hygiene”, [Shampoo, Soap, Deodorant, Toothpaste, Brush]}
- Map{“Food”, [Meat, Vegetables, Fruit]}
All the items are laid out flat like this; they are not wrapped in separate divs or spans, so I can’t tell when a new header has started.
Thanks!
It’s not easy to parse this HTML because it’s not amenable to parsing (judging from the <font> tags, you could probably use some colorful language about it as well).
AFAIK there’s no straightforward way to express a “following siblings until X” condition in XPath, so here’s an alternative: use one XPath expression that matches both headers and items. For example, with this specific markup you could use an expression that selects all the text nodes (“Hygiene”, “Shampoo”, “Soap”, …).
The nodes will be returned in document order (this is extremely important), so afterwards you can iterate over the results and perform a test on each to determine if it’s a header or an item (in this case you could check if the parent is a <font> element that has a size attribute).
This way you can keep a reference to the last “header” found and add all following “items” to an appropriate data structure under it until you come across the next header, etc.
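To make that concrete, here is a minimal Java sketch of the approach, offered as an illustration rather than a drop-in solution. It rests on a few assumptions that are not in the question: the markup has already been tidied into well-formed XML (the real page would need an HTML cleaner first), the names GroupByHeader, SAMPLE, and group are made up for this example, and //b//text() is just one expression that happens to select every text node in this sample. It also relies on the JDK's built-in XPath implementation returning the node-set in document order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class GroupByHeader {
    // Assumption: a tidied, well-formed version of the markup from the question.
    static final String SAMPLE =
        "<html><body>"
        + "<a name='header1'/><b><font face='Verdana, Serif'><font color='#000000'><font size='+1'>Hygiene</font></font></font></b>"
        + "<p><b><font face='Verdana, Serif'><font color='#000000'>Shampoo</font></font></b></p>"
        + "<p><b><font face='Verdana, Serif'><font color='#000000'/>Soap</font></b></p>"
        + "<p><b><font face='Verdana, Serif'><font color='#000000'>Deodorant</font></font></b></p>"
        + "<p><b><font face='Verdana, Serif'><font color='#000000'>Toothpaste</font></font></b></p>"
        + "<p><b><font face='Verdana, Serif'><font color='#000000'/>Brush</font></b></p>"
        + "<a name='header2'/><b><font face='Verdana, Serif'><font color='#000000'><font size='+1'>Food</font></font></font></b>"
        + "<p><b><font face='Verdana, Serif'><font color='#000000'>Meat</font></font></b></p>"
        + "<p><b><font face='Verdana, Serif'><font color='#000000'>Vegetables</font></font></b></p>"
        + "<p><b><font face='Verdana, Serif'><font color='#000000'>Fruit</font></font></b></p>"
        + "</body></html>";

    // Groups item texts under the most recently seen header text.
    static Map<String, List<String>> group(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        // One expression that matches the text of both headers and items.
        NodeList texts = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//b//text()", doc, XPathConstants.NODESET);

        // LinkedHashMap keeps the headers in the order they were found.
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null; // items of the last header seen so far

        for (int i = 0; i < texts.getLength(); i++) {
            Node text = texts.item(i);
            String value = text.getNodeValue().trim();
            if (value.isEmpty()) {
                continue; // skip whitespace-only text nodes
            }
            // Every node matched by //b//text() has an element parent.
            Element parent = (Element) text.getParentNode();
            if (parent.hasAttribute("size")) {
                // Header: its text sits in a <font> that carries a size attribute.
                current = new ArrayList<>();
                result.put(value, current);
            } else if (current != null) {
                // Item: belongs to the last header found.
                current.add(value);
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // prints {Hygiene=[Shampoo, Soap, Deodorant, Toothpaste, Brush], Food=[Meat, Vegetables, Fruit]}
        System.out.println(group(SAMPLE));
    }
}
```

Items that appear before any header are simply dropped here; whether that is the right behaviour depends on the real page.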