In the following HTML, I can parse the table element, but I don’t know how to skip the th elements.
I want to get only the td elements, but when I try to use:
foreach (HtmlNode cell in row.SelectNodes("td"))
…I get an exception.
<table class="tab03">
  <tbody>
    <tr>
      <th class="right" rowspan="2">first</th>
    </tr>
    <tr>
      <th class="right">lp</th>
      <th class="right">name</th>
    </tr>
    <tr>
      <td class="right">1</td>
      <td class="left">house</td>
    </tr>
    <tr>
      <th class="right" rowspan="2">Second</th>
    </tr>
    <tr>
      <td class="right">2</td>
      <td class="left">door</td>
    </tr>
  </tbody>
</table>
My code:
var document = doc.DocumentNode.SelectNodes("//table");
string store = "";
if (document != null)
{
    foreach (HtmlNode table in document)
    {
        if (table != null)
        {
            foreach (HtmlNode row in table.SelectNodes("tr"))
            {
                store = "";
                foreach (HtmlNode cell in row.SelectNodes("th|td"))
                {
                    store = store + cell.InnerText + "|";
                }
                sw.Write(store);
                sw.WriteLine();
            }
        }
    }
}
sw.Flush();
sw.Close();
This method uses LINQ to query for HtmlNode instances that have the name td.

I also noticed your output appears as val|val| (with the trailing pipe). This sample uses string.Join(pipe, array) as a less-hideous method of removing that trailing pipe: val|val.
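
A minimal sketch of what such an approach could look like, assuming HtmlAgilityPack's Descendants and Elements methods together with System.Linq. The class name TableCells and the helper ExtractRows are hypothetical names chosen for illustration, not part of the library or of the original post. Unlike SelectNodes, which returns null when nothing matches (the likely cause of the exception on header-only rows), Elements("td") yields an empty sequence, so rows that contain only th cells can simply be skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class TableCells
{
    // Hypothetical helper: returns one pipe-delimited line per row that contains td cells.
    public static List<string> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var lines = new List<string>();
        foreach (HtmlNode row in doc.DocumentNode.Descendants("tr"))
        {
            // Elements("td") returns only direct td children; it is empty (not null)
            // for rows that hold nothing but th cells.
            string[] cells = row.Elements("td")
                                .Select(td => td.InnerText.Trim())
                                .ToArray();

            if (cells.Length > 0)
            {
                // string.Join puts the separator between items only, so there is no trailing pipe.
                lines.Add(string.Join("|", cells));
            }
        }
        return lines;
    }
}
```

Each returned line could then be written with sw.WriteLine(line). For the table in the question, this should produce 1|house and 2|door.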