I have the task of migrating THE worst HTML product descriptions you will ever encounter. They consist of a mixture of tables and paragraphs. The majority are not even valid HTML, and there are plenty of Microsoft tags courtesy of MS Word. They are littered with inline style tags, and most of them rely on the most bonkers set of CSS rules you will ever see.
Essentially I have come to the realisation that the only thing of use is the paragraphs of text. I cannot just grab the <p> tags, as sometimes the paragraphs do not use them and sometimes titles or single words have their own <p> tag.
So my question is: can I match text that is longer than x characters between HTML tags?
Ideally it would also ignore <br/> and <br>.
Here is a link to an example of the html I am dealing with
Note it is just the description I am processing, not the whole page.
Group 1 of this regex will match n+ chars between tags (n = 100 in this example):

    >([^<]{100,})<

Notes:
- I don't match "balanced tags" (i.e. <([^>]+)>([^<]{100,})<\1>) because of OP's sloppy HTML: a tag is a tag.
- I don't use a look-behind (i.e. (?<=<[^>]+>)) because the match is of arbitrary length, which can cause backtracking problems (some languages, like Java, do not even support it).
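For what it's worth, here is a minimal Python sketch of the idea above: capture any run of n or more non-`<` characters that sits between the end of one tag and the start of the next. The function name `long_text_runs`, the `BR_TAG` helper and the sample markup in the usage note are my own illustration and not from the question. Replacing `<br>` variants with a space before matching is just one assumed way of "ignoring" them, so that a line break does not split a paragraph into shorter runs.

```python
import re

# Assumption: "ignoring" <br> means treating it as whitespace inside a paragraph.
# Matches <br>, <br/>, <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def long_text_runs(html, min_chars=100):
    """Return text runs of at least min_chars characters found between tags."""
    # Group 1 is the text between a '>' and the next '<'; tags are not balanced.
    pattern = re.compile(r">([^<]{%d,})<" % min_chars)
    flattened = BR_TAG.sub(" ", html)
    return [m.group(1).strip() for m in pattern.finditer(flattened)]
```

Usage would look something like `long_text_runs('<p>Title</p><div>' + long_description + '</div>')`, where `long_description` stands in for real description text; short fragments such as the title are skipped. For markup this messy, a tolerant HTML parser may turn out to be more robust than a regex, but the sketch sticks to the regex approach discussed here.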