I have articles on my website that I would like to have corrected and translated automatically. To do that, I need the text content without the surrounding HTML tags.
The idea is to have a regex that retrieves all the content between the tags (and, if possible, also the text found in tag attributes, like the alt value in <img alt='Little house'>). The problem is that I don't really know how to write such a regex. Any ideas?
I would recommend using an HTML parser rather than relying on a regex. Parsing HTML with regex is generally a no-no: HTML allows nesting, optional closing tags, comments, and several attribute-quoting styles, so a regex is nearly impossible to get right for all cases. There are many questions here on SO that arrive at the same conclusion.
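To give an idea of what the parser approach can look like, here is a minimal sketch in Python using the standard library's html.parser module. You did not say which language you are working in, so Python is an assumption on my part, and the names TextExtractor and extract_text are made up for this example. The choice to skip script/style contents and to collect alt and title attributes is also my guess at what you want to send to the translator.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes plus selected attribute values."""

    # Assumption: script/style contents are not worth translating.
    SKIPPED_TAGS = {"script", "style"}
    # Assumption: these attributes carry human-readable text.
    TEXT_ATTRIBUTES = {"alt", "title"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into plain characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
            return
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        for name, value in attrs:
            if name in self.TEXT_ATTRIBUTES and value:
                self.chunks.append(value.strip())

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text pieces of an HTML string, in document order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<p>Hello <b>world</b></p><img alt='Little house'><script>var x = 1;</script>"
print(extract_text(sample))  # ['Hello', 'world', 'Little house']
```

The result is a list of separate pieces rather than one joined string, so each piece can be corrected or translated on its own and, if needed, mapped back to where it came from. A fuller third-party parser would probably cope better with badly broken markup; this is only meant to show the shape of the approach.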
EDIT: Looks like a couple of us had the same idea… Also, here is a question that discusses more parsers.