I have taken over a code base and I have to read in these html files that were generated by Microsoft Word, I think so it has all kinds of whacky inline formatting.
is there anyway to parse out all of the bad inline formatting and just get the text from this stream. I basically want a purifier programmatically so I can then apply some sensible css
in the end i just wrote a small class that did a bunch of find and replaces. not pretty but it worked.