I need to grab some content from an HTML page (it is valid XHTML). I fetch the page using curl and store it in memory.
I played with the idea of using regex with the PCRE library, but I simply couldn’t find any examples of using it from C. Then I moved on to look at HTML parsers, and again there is not a good selection. All I could find was a sparsely documented libxml module called HTMLparser.
Are there any alternatives? If not, are there any examples for what I have already found?
You want to use HTML Tidy to do this. The libcurl site has some example source code to get you going; it shows how to traverse the DOM tree. You don’t need an XML parser, and it doesn’t fail on badly formatted HTML.
http://curl.haxx.se/libcurl/c/htmltidy.html