I have the task of migrating THE worst HTML product descriptions you will ever


Asked: June 1, 2026 (2026-06-01T16:57:24+00:00)


I have the task of migrating THE worst HTML product descriptions you will ever encounter. They consist of a mixture of tables and paragraphs. The majority are not even 100% valid HTML, and there are plenty of Microsoft tags courtesy of MS Word. They are littered with inline style tags, and most of it relies on the most bonky set of CSS rules you will ever see.

Essentially I have come to the realisation that the only thing of use is the paragraphs of text. I cannot just grab the <p> tags, as sometimes the paragraphs do not use them and sometimes titles or single words have their own <p> tag.

So my question is: can I match text that is longer than x characters between HTML tags?

Ideally it would also ignore <br/> and <br>

Here is a link to an example of the html I am dealing with

Note it is just the description I am processing, not the whole page.


1 Answer

Editorial Team · Answer 1 · 2026-06-01T16:57:25+00:00

Group 1 of this regex will match n+ chars between tags (n = 100 in this example):

<[^>]+>([^<]{100,})<[^>]+>

Notes:

I have deliberately not required a matching closing tag (<([^>]+)>([^<]{100,})</\1>) because of OP's sloppy HTML: a tag is a tag.
I have avoided using a lookbehind ((?<=<[^>]+>)) because it would be of arbitrary length, which can cause backtracking problems, and many regex engines, such as Java's, do not support unbounded lookbehind at all.
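
As a rough illustration, here is how the pattern above might be applied in Python. This is only a sketch: the function name `long_text_runs`, the `BR_TAG` helper pattern, and the sample markup are made up for the example. Replacing `<br>` and `<br/>` with a space before matching is one possible way to meet the "ignore `<br>`" wish in the question; it is not part of the regex itself.

```python
import re

# Hypothetical helper pattern: matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def long_text_runs(html, min_len=100):
    """Return text runs of at least min_len characters found between two tags."""
    # The answer's pattern, with the length threshold filled in:
    # any tag, then min_len or more non-'<' characters (group 1), then any tag.
    pattern = re.compile(r"<[^>]+>([^<]{%d,})<[^>]+>" % min_len)

    # Swap line-break tags for a space so they do not cut a paragraph in two.
    cleaned = BR_TAG.sub(" ", html)

    runs = []
    for match in pattern.finditer(cleaned):
        text = match.group(1).strip()
        # Skip runs that are mostly indentation or other whitespace.
        if len(text) >= min_len:
            runs.append(text)
    return runs


if __name__ == "__main__":
    # Made-up sample in the spirit of Word-exported markup.
    sample = (
        "<td><p class=MsoNormal>Title</p><p>"
        + "A long product description sentence. " * 4
        + "<br/>More description text.</p></td>"
    )
    for run in long_text_runs(sample):
        print(len(run), run[:40])
```

One caveat with this approach: because each match also consumes the tag that follows the text, a second long run that begins immediately after that tag has no opening tag left to match and will be skipped. If that matters for your data, ending the pattern with a lookahead such as `(?=<)` instead of `<[^>]+>` is a possible variation.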
