I have the following regex to detect start and end script tags in the


Asked: June 4, 2026


I have the following regex to detect start and end script tags in the html file:

<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>

meaning in short it will catch: <script "NOT THIS</s" > "NOT THIS</s" </script>

It works, but it takes a very long time to find <script> in long strings, sometimes minutes or even hours.

The lite version works perfectly even for long strings:

<script[^<]*>[^<]*</script>

However, I also use the extended pattern for other tags like <a>, where < and > can appear inside attribute values.

python test:

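import re
# Same pattern as above, compiled case-insensitively with "." matching newlines
pattern = re.compile(r'<script(?:[^<]+|<(?:[^/]|/(?:[^s])))*>(?:[^<]+|<(?:[^/]|/(?:[^s]))*)</script>', re.I | re.DOTALL)
# Short input: returns quickly
re.search(pattern, '11<script type="text/javascript"> easy>example</script>22').group()
# Long input: this is the call that takes minutes or hours
re.search(pattern, '<script type="text/javascript">' + ('hard example' * 50) + '</script>').group()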

How can I fix it?
I think the inner part of the regex (after <script>) should be changed and simplified.
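One possible cause of the slowdown is the `[^<]+` branch sitting inside a group that is itself repeated with `*`: the engine can then split the same run of text between iterations in exponentially many ways before it finds the `>`. The sketch below keeps the same alternation but drops the inner `+`, so every character can be consumed in only one way. It also assumes the trailing `*` in the body was meant to apply to the whole group, which is a guess at the intended structure. The name `LINEAR_SCRIPT` is made up for this illustration.

```python
import re

# Same branches as the pattern in the question, but "[^<]" instead of "[^<]+".
# Each iteration of the outer "*" now consumes a fixed prefix, so there is
# only one way to split a run of text and backtracking stays linear.
LINEAR_SCRIPT = re.compile(
    r'<script(?:[^<]|<(?:[^/]|/[^s]))*>'  # opening tag, may contain < and >
    r'(?:[^<]|<(?:[^/]|/[^s]))*'          # body: anything up to "</s" (assumed intent)
    r'</script>',
    re.I | re.DOTALL,
)

easy = '11<script type="text/javascript"> easy>example</script>22'
hard = '<script type="text/javascript">' + ('hard example' * 50) + '</script>'

print(LINEAR_SCRIPT.search(easy).group())
print(LINEAR_SCRIPT.search(hard).group() == hard)
```

Because the opening-tag part is greedy, it runs up to the last `>` before the closing tag. The overall match is unchanged, but the boundary between tag and body may not be where you expect.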

PS 🙂 I anticipate answers telling me that parsing HTML with regex is the wrong approach.
I know many HTML/XML parsers very well, I know what to expect from HTML that is often broken, and regex is really useful here.

comment:
well, I need to handle:
each <a < document like this.border="5px;">
and my approach is to use parsers and regex together.
BeautifulSoup is only 2k lines, does not handle every HTML document, and just extends the regex from sgmllib.

The main reason is that I must know the exact position where every tag starts and stops, and every broken HTML document must be handled.

BS is not perfect; sometimes this happens:
BeautifulSoup('< scriPt\n\n>a<aa>s< /script>').findAll('script') == []

@Cylian:
Atomic grouping, as you know, is not available in the Python re I am using (it was only added in Python 3.11),
so non-greedy everything .*? until <\s*/\s*tag\s*> is the winner for now.

I know it is not perfect in this case:
re.search(r'<\s*script.*?<\s*/\s*script\s*>', '< script </script> shit </script>').group()
but I can handle the rejected tail in the next parsing pass.
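Since the exact start and stop position of every tag matters here, the non-greedy approach can be paired with `finditer`, which exposes match offsets. This is only a sketch: the placement of `\s*` is an assumption about which broken forms need to be tolerated, and `TOLERANT_SCRIPT` and `script_spans` are hypothetical names.

```python
import re

# Whitespace-tolerant, non-greedy pattern: stops at the first closing tag.
TOLERANT_SCRIPT = re.compile(r'<\s*script.*?<\s*/\s*script\s*>', re.I | re.DOTALL)

def script_spans(html):
    """Return (start, end) offsets of each non-overlapping script element."""
    return [m.span() for m in TOLERANT_SCRIPT.finditer(html)]

sample = '< scriPt\n\n>a<aa>s< /script> tail <script>x</script>'
for start, end in script_spans(sample):
    # Slicing with the offsets recovers the exact source text of each element.
    print(start, end, repr(sample[start:end]))
```

On the imperfect input above, the match ends at the first closing tag and the remaining text is left for the next pass.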

It’s pretty obvious that parsing HTML with regex is not a fight you win in one battle.


1 Answer

Editorial Team · Answer 1 · 2026-06-04T20:06:20+00:00


I don’t know Python, but I know regular expressions.

If you use the non-greedy operators, you get a much simpler regex:

<script.*?>.*?</script>

This is assuming there are no nested scripts.
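In Python this pattern would probably need the `re.DOTALL` flag, because `.` does not match newlines by default and script bodies usually span several lines. A small sketch, where `SIMPLE_SCRIPT` and the sample HTML are made up for illustration:

```python
import re

# The non-greedy pattern from this answer; DOTALL lets "." cross line breaks.
SIMPLE_SCRIPT = re.compile(r'<script.*?>.*?</script>', re.I | re.DOTALL)

html = '<script>a()</script><p>text</p><script src="x.js">\nb()\n</script>'

with_dotall = SIMPLE_SCRIPT.findall(html)
# Without DOTALL the multi-line script element is not matched.
without_dotall = re.findall(r'<script.*?>.*?</script>', html, re.I)

print(with_dotall)
print(without_dotall)
```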
