I would like to extract from a general HTML page, all the text (displayed

Question


Editorial Team

Asked: May 10, 2026

I would like to extract from a general HTML page, all the text (displayed or not).

I would like to remove:

- any HTML tags
- any JavaScript
- any CSS styles

Is there a regular expression (one or more) that will achieve that?


1 Answer

score 0 · Answer 1 · 2026-05-10T17:04:17+00:00

You can’t really parse HTML with regular expressions; the language is too complex. RE’s won’t handle <![CDATA[ sections correctly at all. Further, some common HTML constructs, such as entity-encoded text like &lt;text&gt;, will display in a browser as proper text, but might baffle a naive RE.
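To make the failure modes concrete, here is a small sketch of what a naive tag-stripping pattern does. The pattern `<[^>]+>` and the helper name `naive_strip` are illustrative assumptions, not something taken from the question; they stand in for the kind of one-line RE people typically try first.

```python
import re

# Hypothetical "first attempt" pattern: anything between < and the next >.
NAIVE_TAG = re.compile(r"<[^>]+>")


def naive_strip(html):
    """Remove everything the naive pattern considers a tag."""
    return NAIVE_TAG.sub("", html)


samples = [
    # The <script> tags are removed, but the JavaScript inside is kept.
    '<script>var msg = "hi";</script><p>Hello</p>',
    # A ">" inside an attribute value ends the match too early.
    '<a title="a > b">link</a>',
    # A ">" inside a CDATA section has the same effect.
    "<![CDATA[ x > y ]]>",
    # Entities are left undecoded, so this is not the text a browser shows.
    "&lt;text&gt;",
]

for sample in samples:
    print(repr(sample), "->", repr(naive_strip(sample)))
```

Each case can be patched with a further pattern, but every patch tends to expose another case, which is the point being made above.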

You’ll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.
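As a rough sketch of the parser approach, the example below uses Python's standard-library `html.parser` rather than Beautiful Soup, purely so that it runs without installing anything; the idea is the same. The class name `TextExtractor`, the function `extract_text`, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; before
        # the text is passed to handle_data.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML string without tags, scripts, or styles."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = """<html><head><style>p { color: red; }</style>
<script>var x = "<b>not text</b>";</script></head>
<body><p>Hello <b>world</b></p>
<p style="display:none">hidden &lt;text&gt;</p></body></html>"""

print(extract_text(sample))
```

Because the parser never evaluates CSS, text that a browser would hide (the `display:none` paragraph here) is still returned, which matches the "displayed or not" requirement in the question.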

Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.
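A tolerant parser copes with this kind of input without special handling. The sketch below again assumes Python's standard-library `html.parser`; `DataCollector`, `collect_text`, and the deliberately broken snippet are hypothetical names and data chosen for the example.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Gather every text node, ignoring how the tags are nested."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def collect_text(html):
    parser = DataCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


# Unclosed <li> and <p> elements, plus a <b> "closed" by </i>.
broken = "<ul><li>One<li>Two</ul><p>Unclosed <b>bold</i> paragraph"

# The parser reports the tags it sees and does not raise on the mismatch.
print(collect_text(broken))
```

This parser does not rebuild the document tree the way a browser does; it only reports tags and text in order, which is enough when the goal is the text alone.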

You might be able to parse bad HTML with RE’s. All it requires is patience and hard work. But it’s often simpler to use someone else’s parser.
