I am trying to use YQL to extract a portion of HTML from a series of web pages. The pages themselves have slightly different structure (so a Yahoo Pipes “Fetch Page” with its “Cut content” feature does not work well) but the fragment I am interested in always has the same class attribute.
If I have an HTML page like this:
<html>
<body>
<div class="foo">
<p>Wolf</p>
<ul>
<li>Dog</li>
<li>Cat</li>
</ul>
</div>
</body>
</html>
and use a YQL expression like this:
SELECT * FROM html
WHERE url="http://example.com/containing-the-fragment-above"
AND xpath="//div[@class='foo']"
what I get back are the (apparently unordered?) DOM elements, whereas what I want is the HTML content itself. I’ve tried SELECT content as well, but that only returns the textual content. I want the HTML. Is this possible?
You could write a little Open Data Table that sends out a normal YQL query against the html table and stringifies the result.
You could then query against that custom table with a YQL query in the usual way.
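To illustrate just the "select the node, then stringify it" idea outside YQL, here is a minimal Python sketch. It is not an Open Data Table and does not call YQL at all. It assumes the page is well-formed enough to be parsed as XML (true for the sample in the question, often not true for real pages), and the names `PAGE` and `fragment_html` are made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample page from the question, inlined so the sketch is self-contained.
PAGE = """<html>
<body>
<div class="foo">
<p>Wolf</p>
<ul>
<li>Dog</li>
<li>Cat</li>
</ul>
</div>
</body>
</html>"""


def fragment_html(page, xpath):
    """Return the serialized markup of each node matching xpath, in document order.

    Assumption: `page` is well-formed XML; ElementTree is not a lenient HTML parser.
    """
    root = ET.fromstring(page)
    return [
        # method="html" serializes the element back to markup, children included;
        # strip() drops the whitespace that trails the element in the source.
        ET.tostring(node, encoding="unicode", method="html").strip()
        for node in root.findall(xpath)
    ]


# ElementTree only supports a subset of XPath, so the expression starts with ".//".
fragments = fragment_html(PAGE, ".//div[@class='foo']")
print(fragments[0])
```

The point is only that serializing the matched node as a whole keeps the child elements in their original order, which is what a stringifying step inside a custom table would also have to do.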
Edit: Just realised this is a pretty old question that was bumped; at least an answer is here, eventually, for anyone stumbling on the question. 🙂