The following regex pattern, when applied to very long strings (60KB), causes java to


Asked: May 25, 2026


The following regex pattern, when applied to very long strings (60KB), causes java to seem to “hang”.

.*\Q|\E.*\Q|\E.*\Q|\E.*bundle.*

I don’t understand why.


1 Answer

Editorial Team · Answer 1 · 2026-05-25T19:35:56+00:00

Basically, the “.*” (match any number of anything) means: try to match the entire string, and if the rest of the pattern doesn’t match, give back one character and try again, and so on. Using one of these is not much of a problem, but the time required grows steeply with each one you add, because every extra .* multiplies the number of ways the engine can split up the string. This is a fairly in-depth (and much more accurate) discussion of this sort of thing: http://discovery.bmc.com/confluence/display/Configipedia/Writing+Efficient+Regex

EDIT: (I hope you really wanted to know WHY)

Example Source String:

aaffg,  ";;p[p09978|ksjfsadfas|2936827634876|2345.4564a bundle of sticks

ONE WAY OF LOOKING AT IT:

The process takes so long because the first .* matches the entire source string (aaffg, ";;p[p09978|ksjfsadfas|2936827634876|2345.4564a bundle of sticks), only to find that nothing is left to match the | symbol that must follow. It then backtracks to the last occurrence of a | symbol (...4876|2345...) and tries to match the next .* all the way to the end of the string.

It starts looking for the next | symbol specified in your expression, and not finding it, it then backtracks to the first | symbol that was matched (the one in ...4876|2345...), discards that match and finds the closest | before it (...dfas|2936...), so that it will be able to match the second | symbol in your match expression.

It will then proceed to match the .* to 2936827634876 and the second | to the one in ...4876|2345... and the next .* to the remaining text, only to find that you wanted yet another |. It will then continue to backtrack again and again, until it matches all of the symbols you specified.

ANOTHER WAY OF LOOKING AT IT:

(Original expression):

.*\Q|\E.*\Q|\E.*\Q|\E.*bundle.*

this roughly translates to

match:
               any number of anything, 
followed by    a single '|', 
followed by    any number of anything, 
followed by    a single '|', 
followed by    any number of anything, 
followed by    a single '|', 
followed by    any number of anything,
followed by    the literal string 'bundle',
followed by    any number of anything

The problem is that “any number of anything” includes | symbols, which forces the engine to re-scan the string over and over again, when what you really mean is “any number of anything that is not a '|'”.
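To see the difference between the two kinds of "any number of", here is a minimal Java sketch. The class and method names (`GreedyDotDemo`, `prefixEnd`) are made up for illustration; it only uses `Pattern`, `Matcher.lookingAt()` and `Matcher.end()` from `java.util.regex`. It reports how far each sub-pattern runs when started at the beginning of the example string.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDotDemo {
    // Example source string from the answer (the inner double quote is escaped for Java).
    static final String SAMPLE =
        "aaffg,  \";;p[p09978|ksjfsadfas|2936827634876|2345.4564a bundle of sticks";

    // Returns the index where the pattern stops when anchored at the start
    // of the input, or -1 if it does not match there at all.
    static int prefixEnd(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // ".*" runs straight past every '|' to the end of the string.
        System.out.println(".*    stops at " + prefixEnd(".*", SAMPLE)
            + " (string length " + SAMPLE.length() + ")");
        // "[^|]*" stops right before the first '|'.
        System.out.println("[^|]* stops at " + prefixEnd("[^|]*", SAMPLE)
            + " (first '|' at " + SAMPLE.indexOf('|') + ")");
    }
}
```

Everything the .* swallows beyond the first | has to be handed back one character at a time, which is where the repeated scanning comes from.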

To fix or improve the expression, I would recommend three things:

First (and most significant), replace the majority of the “match anything”s (.*) with negated character classes ([^|]*) like so:

[^|]*\Q|\E[^|]*\Q|\E[^|]*\Q|\E.*bundle.*

…this prevents it from matching to the end of the string over and over again. Instead, it matches all the non-| characters up to the first | symbol, then matches the | symbol, then moves on to the next part, and so on.

The second change (somewhat significant, depending upon your source string) should be making the second-to-last “match any number of anything” (.*) into a “lazy” or “reluctant” type of “any number of” (.*?). This will make it try to match anything with the idea of looking out for bundle instead of skipping over bundle and matching the rest of the string, only to realize that there is more to match once it gets there, having to backtrack. This would result in:

[^|]*\Q|\E[^|]*\Q|\E[^|]*\Q|\E.*?bundle.*
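A small Java sketch of the greedy versus reluctant behaviour, under the assumption that the text after the third | contains the word bundle more than once. The class name `LazyQuantifierDemo`, the helper `endOfMatch` and the `TAIL` string are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyQuantifierDemo {
    // Hypothetical tail of a record, with "bundle" appearing twice.
    static final String TAIL = "2345.4564a bundle of sticks and one more bundle!";

    // Index just past the match when the pattern is anchored at the start, or -1.
    static int endOfMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        // Greedy: .* grabs everything, then backs up until "bundle" fits,
        // so it lands on the LAST occurrence.
        System.out.println("greedy    .*bundle  ends at " + endOfMatch(".*bundle", TAIL));
        // Reluctant: .*? grows one character at a time,
        // so it lands on the FIRST occurrence.
        System.out.println("reluctant .*?bundle ends at " + endOfMatch(".*?bundle", TAIL));
    }
}
```

With the trailing .* in place, both forms accept the same strings; the reluctant form just tends to reach bundle with less backtracking when it appears early in a long tail.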

The third change I would recommend is for readability – replace the \Q\E blocks with a single escape, as in \|, like so:

[^|]*\|[^|]*\|[^|]*\|.*?bundle.*

This is how the expression is processed internally anyway: there is literally a function that converts \Q...\E into “escape all the special characters between \Q and \E”. \Q\E is only a shorthand, and if it does not make your expression shorter or easier to read, it should not be used. Period.

The negated character classes have an un-escaped | because | is not a special character within the context of character classes – but let’s not digress too much. You can escape them if you’d like, but you don’t have to.
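In Java source the backslashes have to be doubled inside string literals, which is easy to get wrong. The sketch below (class and method names are hypothetical) checks that the three spellings of a literal pipe find the same matches, and shows that `Pattern.quote`, the standard library helper for quoting a literal, wraps its argument in \Q and \E.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PipeEscapeDemo {
    // Three ways to write "a literal pipe" as a Java string literal.
    static final String QUOTED   = "\\Q|\\E"; // regex text: \Q|\E
    static final String ESCAPED  = "\\|";     // regex text: \|
    static final String IN_CLASS = "[|]";     // regex text: [|] (no escape needed in a class)

    // Counts non-overlapping matches of the regex in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    // What the standard library produces when asked to quote a pipe.
    static String quotedPipe() {
        return Pattern.quote("|");
    }

    public static void main(String[] args) {
        String input = "a|b|c|bundle";
        System.out.println(countMatches(QUOTED, input));
        System.out.println(countMatches(ESCAPED, input));
        System.out.println(countMatches(IN_CLASS, input));
        System.out.println(quotedPipe());
    }
}
```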

The final expression translates roughly to:

match:
               any number of anything that is not a '|', 
followed by    a single '|', 
followed by    any number of anything that is not a '|', 
followed by    a single '|', 
followed by    any number of anything that is not a '|', 
followed by    a single '|', 
followed by    any number of anything, up until the next expression can be matched,
followed by    the literal string 'bundle',
followed by    any number of anything
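Put together in Java, the expression described by this breakdown (three "not a |, then a |" groups followed by .*?bundle.*) might look like the sketch below. The class name `BundlePattern` and the test strings other than the example source string are assumptions for illustration, and it uses `Matcher.matches()`, which requires the whole string to match.

```java
import java.util.regex.Pattern;

class BundlePattern {
    // Compiled once and reused; compiling a Pattern on every call is wasted work.
    static final Pattern FINAL =
        Pattern.compile("[^|]*\\|[^|]*\\|[^|]*\\|.*?bundle.*");

    // Example source string from the answer (inner double quote escaped).
    static final String SAMPLE =
        "aaffg,  \";;p[p09978|ksjfsadfas|2936827634876|2345.4564a bundle of sticks";

    // True if the whole input has at least three '|' and "bundle" after the third.
    static boolean matches(String input) {
        return FINAL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(SAMPLE));              // three pipes, then "bundle"
        System.out.println(matches("a|b|bundle"));        // only two pipes
        System.out.println(matches("a|b|c|sticks only")); // no "bundle" after third pipe
    }
}
```

One thing to keep in mind: in Java, . does not match line terminators unless `Pattern.DOTALL` is set, whereas [^|] does match them, so the two forms can differ on input that contains newlines.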

A good tool that I use (but costs some money) is called RegexBuddy – a companion/free website for understanding regex’s is http://www.regular-expressions.info, and the particular page that explains repetition is http://www.regular-expressions.info/repeat.html

RegexBuddy emulates other regex engines and says that your original regex would take 544 ‘steps’ to match as opposed to 35 ‘steps’ for the version I provided.

SLIGHTLY LONGER Example Source String A:

aaffg,  ";;p[p09978|ksjfsadfas|12936827634876|2345.4564a bundle of sticks

SLIGHTLY LONGER Example Source String B:

aaffg,  ";;p[p09978|ksjfsadfas|2936827634876|2345.4564a bundle of sticks4me

Longer source string ‘A’ (added 1 before 2936827634876) did not affect my suggested replacement, but increased the original by 6 steps.

Longer source string ‘B’ (added ‘4me’ at the end of the string) again did not affect my suggested replacement, but added 48 steps to the original.

Thus, depending on how a string differs from the examples above, a 60K string could take only 544 steps, or it could take more than a million steps.
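If you do not have RegexBuddy, a rough way to compare the work the Java engine does is to wrap the input in a `CharSequence` that counts `charAt` calls. This is only a sketch: the class `StepCounter`, its helpers and the generated input are hypothetical, and the counts are a proxy for effort, not the same thing as RegexBuddy's ‘steps’. The input is deliberately unfriendly (many | symbols and no bundle) and kept short so the original pattern still finishes quickly.

```java
import java.util.regex.Pattern;

class StepCounter {
    // CharSequence wrapper that counts how often the regex engine reads a character.
    static final class Counting implements CharSequence {
        private final CharSequence inner;
        long reads;

        Counting(CharSequence inner) { this.inner = inner; }

        @Override public char charAt(int index) { reads++; return inner.charAt(index); }
        @Override public int length() { return inner.length(); }
        @Override public CharSequence subSequence(int start, int end) {
            return inner.subSequence(start, end);
        }
        @Override public String toString() { return inner.toString(); }
    }

    static final String ORIGINAL = ".*\\Q|\\E.*\\Q|\\E.*\\Q|\\E.*bundle.*";
    static final String IMPROVED = "[^|]*\\|[^|]*\\|[^|]*\\|.*?bundle.*";

    // Number of character reads needed to decide whether the input matches.
    static long reads(String regex, String input) {
        Counting counted = new Counting(input);
        Pattern.compile(regex).matcher(counted).matches();
        return counted.reads;
    }

    // Builds "ab|ab|ab|..." with the given number of fields and no "bundle".
    static String pipes(int fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields; i++) {
            sb.append("ab|");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int fields : new int[] {10, 20, 40}) {
            String input = pipes(fields);
            System.out.println(input.length() + " chars: original="
                + reads(ORIGINAL, input) + " improved=" + reads(IMPROVED, input));
        }
    }
}
```

On input like this the original pattern's count should climb much faster than the input length, while the rewritten pattern's count stays roughly proportional to it. Be careful about raising the field count much further for the original pattern.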
