I have been searching for a good solution for a long time, but I can’t find anything that fits my needs…
I want to parse an HTML file and display its content in a table. It is almost like writing yet another RSS feed reader. Doing that with valid XML files is simple and straightforward using NSXMLParser, TouchXML, libxml directly, or one of the other XML parsers out there. But these frameworks either work only with XML or cannot cope with non-tidy HTML. The site consists of divs containing links that contain images, paragraphs containing links and images, and so on; just a normal website. Using libxml seems way too complicated in that case.
Does anybody have more experience with parsing dirty HTML pages? Which (free) library or framework did you use? I have the feeling that I’m just missing something obvious here. It can’t be that difficult to parse HTML files, can it?
I hope you can point me in the right direction!
If you need to parse most of the page, using libxml2 as Anurag suggests is a good idea. It ships with an HTML parser that is built to tolerate real-world markup that is not well-formed.
If you just want small segments of data from the file, you are better off using regular expressions to pull the data out. There is also a built-in regex library, which you can access through the RegexKitLite wrapper.
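To give a rough idea of the regex route, here is a minimal sketch in plain C. It uses the POSIX <regex.h> functions as a stand-in, not RegexKitLite or the built-in library mentioned above, and the helper name first_href is made up for illustration. It assumes the attribute value is double-quoted, and it only grabs the first match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of any library): copies the value of the
 * first double-quoted href attribute found in `html` into `out`.
 * Returns 0 on success, -1 if nothing matched or the value does not fit. */
int first_href(const char *html, char *out, size_t out_size)
{
    assert(html != NULL && out != NULL && out_size > 0);

    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = captured URL */
    int rc = -1;

    /* Assumption: values are wrapped in double quotes; case is ignored. */
    if (regcomp(&re, "href[[:space:]]*=[[:space:]]*\"([^\"]*)\"",
                REG_EXTENDED | REG_ICASE) != 0)
        return -1;

    if (regexec(&re, html, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len < out_size) {              /* leave room for the terminator */
            memcpy(out, html + m[1].rm_so, len);
            out[len] = '\0';
            rc = 0;
        }
    }

    regfree(&re);
    return rc;
}
```

You would call it with the page contents and a buffer, for example first_href(page, url, sizeof url). A pattern like this is fine for picking out a handful of values, but it knows nothing about nesting, single-quoted or unquoted attributes, or similarly named attributes such as data-href, which is why a real parser is the better choice once you need most of the page.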