The following regex pattern, when applied to very long strings (60KB), makes Java appear to “hang”:
.*\Q|\E.*\Q|\E.*\Q|\E.*bundle.*
I don’t understand why.
Basically, “.*” (match any number of anything) means: try to match the entire rest of the string, and if the overall match then fails, back off and try again, and so on. Using one of these is not much of a problem, but the time needed grows very steeply with each additional one, because each one multiplies the number of combinations the engine may have to retry. This is a fairly in-depth (and much more accurate) discussion of this sort of thing: http://discovery.bmc.com/confluence/display/Configipedia/Writing+Efficient+Regex
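To see this growth for yourself, here is a rough sketch in Java. It wraps the input in a CharSequence that counts how often the regex engine reads a character, then runs patterns with one, two and three .*\| groups against a string that cannot match. The read count is only a proxy for “steps”, not an official metric, and the exact numbers depend on the JDK. WildcardCost, CountingSequence, fields and patternWith are names made up for this illustration.

```java
import java.util.regex.Pattern;

public class WildcardCost {
    /** CharSequence wrapper that counts every character read made through charAt. */
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;
        CountingSequence(String text) { this.text = text; }
        @Override public int length() { return text.length(); }
        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public CharSequence subSequence(int start, int end) { return text.subSequence(start, end); }
        @Override public String toString() { return text; }
    }

    /** Character reads used by one full-string match attempt of regex against input. */
    public static long reads(String regex, String input) {
        CountingSequence seq = new CountingSequence(input);
        Pattern.compile(regex).matcher(seq).matches();
        return seq.reads;
    }

    /** Builds a pattern with the given number of ".*\|" groups, followed by ".*bundle.*". */
    public static String patternWith(int groups) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < groups; i++) {
            sb.append(".*\\|");
        }
        return sb.append(".*bundle.*").toString();
    }

    /** Builds count short fields, each followed by '|', with no "bundle" anywhere. */
    public static String fields(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("field|");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = fields(40); // can never match, so the engine tries every combination
        for (int groups = 1; groups <= 3; groups++) {
            System.out.println(groups + " group(s): "
                    + reads(patternWith(groups), input) + " character reads");
        }
    }
}
```

Each extra group should raise the count by a large factor, even though the input stays the same.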
EDIT: (I hope you really wanted to know WHY)
Example Source String:

    aaffg, ";;p[p09978|ksjfsadfas|2936827634876|2345.4564a bundle of sticks
ONE WAY OF LOOKING AT IT:
The process takes so long because the .* matches the entire source string (aaffg, ";;p[p09978|ksjfsadfas|2936827634876|2345.4564a bundle of sticks), only to find that it does not end in a | symbol. It then backtracks to the last case of a | symbol (...4876|2345...), and then tries to match the next .* all the way to the end of the string.

It starts looking for the next | symbol specified in your expression and, not finding it, backtracks to the first | symbol that was matched (the one in ...4876|2345...), discards that match and finds the closest | before it (...dfas|2936...), so that it will be able to match the second | symbol in your match expression.

It will then proceed to match the .* to 2936827634876, the second | to the one in ...4876|2345..., and the next .* to the remaining text, only to find that you wanted yet another |. It will then continue to backtrack again and again, until it matches all of the symbols you specified.

ANOTHER WAY OF LOOKING AT IT:
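(Original expression):

    .*\Q|\E.*\Q|\E.*\Q|\E.*bundle.*

This roughly translates to:

    any number of anything,
    followed by a single '|',
    followed by any number of anything,
    followed by a single '|',
    followed by any number of anything,
    followed by a single '|',
    followed by any number of anything,
    followed by the literal string 'bundle',
    followed by any number of anything

The problem is that “any number of anything” includes | symbols, requiring parsing of the entire string over and over again, whereas what you really mean is “any number of anything that is not a '|'”.

To fix or improve the expression, I would recommend three things: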
First (and most significant), replace the majority of the “match anything”s (.*) with negated character classes ([^|]*), like so:

    [^|]*\Q|\E[^|]*\Q|\E[^|]*\Q|\E.*bundle.*

This will prevent it from matching to the end of the string over and over again. Instead, it matches all the non-| symbols up to the first character that is not a “not a | symbol” (that double negative means up to the first | symbol), then matches the | symbol, then goes on to the next, and so on.

The second change (somewhat significant, depending upon your source string) should be making the second-to-last “match any number of anything” (.*) into a “lazy” or “reluctant” type of “any number of” (.*?). This will make it match with the idea of looking out for bundle, instead of skipping over bundle and matching the rest of the string, only to realize once it gets there that there is more to match, and having to backtrack. This would result in:

    [^|]*\Q|\E[^|]*\Q|\E[^|]*\Q|\E.*?bundle.*

The third change I would recommend is for readability: replace the \Q...\E blocks with a single escape, as in \|, like so:

    [^|]*\|[^|]*\|[^|]*\|.*?bundle.*

This is how the expression is internally processed anyway. There is literally a function that converts the expression to “escape all the special characters in between \Q and \E”. \Q...\E is a shorthand only, and if it does not make your expression shorter or easier to read, it should not be used. Period.

The negated character classes have an un-escaped | because | is not a special character within the context of character classes, but let’s not digress too much. You can escape them if you’d like, but you don’t have to.

The final expression translates roughly to:

    any number of anything that is not a '|',
    followed by a single '|',
    followed by any number of anything that is not a '|',
    followed by a single '|',
    followed by any number of anything that is not a '|',
    followed by a single '|',
    followed by any number of anything, up until the literal string 'bundle',
    followed by any number of anything
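For trying the result from Java, here is a minimal sketch. It assumes the three changes combine into [^|]*\|[^|]*\|[^|]*\|.*?bundle.* and that the whole string is matched with matches(). The class name, the agree helper and the short sample strings are made up for illustration; only the first sample is the example source string from above.

```java
import java.util.regex.Pattern;

public class BundlePatterns {
    // The pattern from the question.
    public static final Pattern ORIGINAL =
            Pattern.compile(".*\\Q|\\E.*\\Q|\\E.*\\Q|\\E.*bundle.*");
    // Assumed combination of the three suggested changes.
    public static final Pattern REWRITTEN =
            Pattern.compile("[^|]*\\|[^|]*\\|[^|]*\\|.*?bundle.*");

    /** True when both patterns give the same full-string match result for input. */
    public static boolean agree(String input) {
        return ORIGINAL.matcher(input).matches() == REWRITTEN.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "aaffg, \";;p[p09978|ksjfsadfas|2936827634876|2345.4564a bundle of sticks",
            "a|b|c|bundle",            // three pipes, then bundle
            "a|b|bundle|c",            // bundle appears before the third pipe
            "bundle|a|b|c",            // no bundle after the third pipe
            "no pipes, just a bundle"  // not enough pipes
        };
        for (String s : samples) {
            System.out.println(REWRITTEN.matcher(s).matches()
                    + " (same as original: " + agree(s) + ")  " + s);
        }
    }
}
```

One difference to keep in mind: in Java, . does not match line terminators by default, while [^|] does, so the two patterns can disagree on input that contains newlines unless the original is compiled with Pattern.DOTALL.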
A good tool that I use (but which costs some money) is called RegexBuddy. A free companion website for understanding regexes is http://www.regular-expressions.info, and the particular page that explains repetition is http://www.regular-expressions.info/repeat.html
RegexBuddy emulates other regex engines and says that your original regex would take 544 ‘steps’ to match as opposed to 35 ‘steps’ for the version I provided.
SLIGHTLY LONGER Example Source String A:

    aaffg, ";;p[p09978|ksjfsadfas|12936827634876|2345.4564a bundle of sticks

SLIGHTLY LONGER Example Source String B:

    aaffg, ";;p[p09978|ksjfsadfas|2936827634876|2345.4564a bundle of sticks4me

Longer source string ‘A’ (added 1 before 2936827634876) did not affect my suggested replacement, but increased the original by 6 steps.

Longer source string ‘B’ (added ‘4me’ at the end of the string) again did not affect my suggested replacement, but added 48 steps to the original.
Thus, depending on how a string differs from the examples above, a 60K string could take as few as 544 steps, or it could take more than a million steps.
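If you don’t have RegexBuddy, a rough way to watch this scaling on your own JVM is to count how many characters the engine reads as the input grows. The sketch below does that for the original pattern and for a rewritten one, assumed here to be [^|]*\|[^|]*\|[^|]*\|.*?bundle.*, on generated strings with many | symbols and no bundle (a bad case for the original). The counts are not RegexBuddy’s “steps” and will vary between JDK versions. StepGrowth, CountingSequence and fields are made-up names.

```java
import java.util.regex.Pattern;

public class StepGrowth {
    public static final Pattern ORIGINAL =
            Pattern.compile(".*\\Q|\\E.*\\Q|\\E.*\\Q|\\E.*bundle.*");
    // Assumed rewritten form: negated classes, a reluctant quantifier, plain \| escapes.
    public static final Pattern REWRITTEN =
            Pattern.compile("[^|]*\\|[^|]*\\|[^|]*\\|.*?bundle.*");

    /** CharSequence wrapper that counts every character read made through charAt. */
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;
        CountingSequence(String text) { this.text = text; }
        @Override public int length() { return text.length(); }
        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public CharSequence subSequence(int start, int end) { return text.subSequence(start, end); }
        @Override public String toString() { return text; }
    }

    /** Character reads used by one full-string match attempt. */
    public static long reads(Pattern pattern, String input) {
        CountingSequence seq = new CountingSequence(input);
        pattern.matcher(seq).matches();
        return seq.reads;
    }

    /** Builds count short fields, each followed by '|', with no "bundle" anywhere. */
    public static String fields(int count) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append("field|");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int count : new int[] {10, 20, 40, 80}) {
            String input = fields(count);
            System.out.println(input.length() + " chars: original "
                    + reads(ORIGINAL, input) + " reads, rewritten "
                    + reads(REWRITTEN, input) + " reads");
        }
    }
}
```

On input like this, the rewritten pattern’s count should grow roughly in line with the string length, while the original’s grows far faster, which is why a 60K string can appear to hang.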