I am trying to parse a page using YQL, specifically an HTML table. The issue is that YQL adds an HTML “p” tag on its own, even though that tag is not in the original HTML. How can I stop YQL from returning the added tag?
The YQL query can be seen here.
In the result, a td element such as the one below contains a p tag, whereas the original HTML (which can be seen here) has no p tag inside the table.
<tr>
<td class="ttl">
<a href="#" onclick="helpW('h_weight.htm');">Weight</a>
</td>
<td class="nfo">
<p>169 g</p>
</td>
</tr>
It’s not YQL doing this but the HTML5 engine itself. Part of the philosophy of HTML5 is that when it is given invalid HTML, it repairs it for you by adding any elements that were left out, and what you are seeing in your query result is a tree representing the repaired content. (Yes, this makes it hard to write queries. But this is not the place to apportion blame…)
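If the added element cannot be avoided, one option is to make the code that consumes the result indifferent to it. The following is only a minimal sketch, assuming the row has been saved as a well-formed XML string and is processed with Python's standard library. The `cell_text` helper and the `ORIGINAL` row are hypothetical, written for illustration; they are not part of YQL or of the page in question.

```python
import xml.etree.ElementTree as ET

# The row as returned by the query, including the added <p>.
REPAIRED = """<tr>
<td class="ttl">
<a href="#" onclick="helpW('h_weight.htm');">Weight</a>
</td>
<td class="nfo">
<p>169 g</p>
</td>
</tr>"""

# Assumed shape of the same row in the original page (no <p>).
ORIGINAL = """<tr>
<td class="ttl">
<a href="#" onclick="helpW('h_weight.htm');">Weight</a>
</td>
<td class="nfo">169 g</td>
</tr>"""


def cell_text(row_xml, css_class):
    """Return the text of the first <td> with the given class attribute.

    itertext() walks every descendant, so the result is the same
    whether or not the text is wrapped in an extra element.
    """
    row = ET.fromstring(row_xml)
    for td in row.iter("td"):
        if td.get("class") == css_class:
            return "".join(td.itertext()).strip()
    return None


if __name__ == "__main__":
    print(cell_text(REPAIRED, "nfo"))
    print(cell_text(ORIGINAL, "nfo"))
```

Reading the cell's full text content, rather than relying on a fixed path to a text node, means the lookup does not depend on whether a wrapper element was inserted during repair.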