I’ve seen this question, which is very nice and informative. However, it doesn’t deal with a rather common scenario.
Say I need to scrape a multitude of websites (or even pages in the same domain), but the author of a website didn’t care much about the code, and the markup is seriously malformed HTML "that kinda works". I still need to extract information from that website.
How do I do it in this case? Ideally without going í͞ń̡͢͡s̶̢̛á̢̕͘ń̵͢҉e̶̸̢̛.
Is it possible? Do I have to resort to regular expressions?
You need a DOM parser. PHP has one. And then there are some alternatives (and more… just google for them). You can even run the “garbled HTML” through HTML Purifier first if you want.
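To make that concrete, here is a minimal sketch using PHP's built-in DOMDocument, which is generally tolerant of unclosed tags and unquoted attributes. The markup in `$html` and the `/next` link are made up for illustration; in practice you would load whatever you fetched from the site. How much it can recover will depend on how broken the page is, so treat this as a starting point rather than a guarantee.

```php
<?php
// Hypothetical malformed markup: unclosed <p> tags, unquoted attribute,
// no <html> or <body> wrapper.
$html = '<div><p>First<p>Second <a href=/next>next</a></div>';

// Collect libxml parse errors internally instead of raising PHP warnings
// for every problem the parser has to recover from.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);   // the HTML parser tries to repair the structure
libxml_clear_errors();   // discard the collected parse errors

// Query the repaired tree with XPath instead of regular expressions.
$xpath = new DOMXPath($dom);

$links = [];
foreach ($xpath->query('//a[@href]') as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links);
```

If you care about what the parser had to fix, call `libxml_get_errors()` before `libxml_clear_errors()` and inspect the result.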