I just need a suggestion. I have a program that takes valid HTML and saves it to a file. I need a way to parse this HTML file and retrieve the path of every image referenced in it (e.g. /foo/bar.jpg). Is there an HTML parsing library that I could use to achieve this?
Half an answer: There’s a Java parser called TagSoup which will “Just Keep On Truckin'”, parsing anything with angle brackets and always delivering a valid stream of SAX events to the application.
I mention this because I know that the idea and, crucially, the name have been adopted by libraries in other languages that have the same intention. I can’t find a C version right now, but you may have more luck if you try some inventive searches with that starting point. The point is that the application which sits atop the parser doesn’t have to care about the horrors in the original source; it can pretend that the input was well-formed XML and do XMLish things to or with it.
Edit: oooh, and … there we go: Taggle (C++, but possibly close enough, and that posting suggests that porting it from Java wasn’t hard).
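To make the "events to the application" idea concrete, here is a minimal sketch. It is in Python rather than C, and it uses the standard library's `html.parser` instead of TagSoup or Taggle, so treat it as an illustration of the approach and not as either library's API. The names `ImageCollector` and `extract_image_sources` are made up for this example. The parser calls a handler for each start tag it sees, and the handler just collects the `src` of every `img`:

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it is fed.

    Hypothetical helper for illustration; not part of TagSoup or Taggle.
    """

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us,
        # so <IMG SRC="..."> is handled the same way as <img src="...">.
        if tag != "img":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "src" and value:
                self.sources.append(value)


def extract_image_sources(html_text):
    """Return the src values of all <img> tags, in document order."""
    collector = ImageCollector()
    collector.feed(html_text)
    collector.close()
    return collector.sources


if __name__ == "__main__":
    sample = '<p><img src="/foo/bar.jpg"><div><IMG SRC="/a/b.png" alt="x"></p>'
    for src in extract_image_sources(sample):
        print(src)
```

The application code never checks whether tags are closed or properly nested; it only reacts to events, which is the same division of labour described above. A C or C++ version would have the same shape: register a start-element callback with whatever lenient parser you end up using, and pick out `src` there.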