Let’s say we have a string: “abcbcdcde”
I want to identify all substrings that are repeated in this string using regex (i.e. no brute-force iterative loops).
For the above string, the result set would be: {“b”, “bc”, “c”, “cd”, “d”}
I must confess that my regex is far more rusty than it should be for someone with my experience. I tried using a backreference, but that’ll only match consecutive duplicates. I need to match all duplicates, consecutive or otherwise.
In other words, I want to match any sequence of characters that appears for the second time or later. If a substring occurs 5 times, then I want to capture each of occurrences 2 through 5. Make sense?
This is my pathetic attempt thus far:
preg_match_all( '/(.+)(.*)\1+/', $string, $matches ); // Way off!
I tried playing with look-aheads but I’m just butchering it. I’m doing this in PHP (PCRE) but the problem is more or less language-agnostic. It’s a bit embarrassing that I’m finding myself stumped on this.
Your problem is recursi … you know what, forget about recursion! =p It wouldn’t really work well in PHP, and the algorithm is pretty clear without it as well.
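A minimal sketch of what such a non-recursive, loop-based approach might look like. The function name repeated_substrings() is hypothetical, overlapping occurrences are assumed to count as repeats, and this is not necessarily the exact code benchmarked in this answer:

```php
<?php
// Hypothetical helper: collect every substring that occurs again
// somewhere later in $s (overlapping occurrences are allowed).
function repeated_substrings(string $s): array
{
    $n = strlen($s);
    $found = [];
    // The last character cannot be followed by a repeat, so stop at $n - 2.
    for ($i = 0; $i < $n - 1; $i++) {
        // A candidate of length $len needs room for a later copy.
        for ($len = 1; $len <= $n - $i - 1; $len++) {
            $sub = substr($s, $i, $len);
            // If this candidate never shows up again after position $i,
            // no longer candidate starting at $i can repeat either.
            if (strpos($s, $sub, $i + 1) === false) {
                break;
            }
            $found[$sub] = true; // array keys act as a set
        }
    }
    // Numeric-looking keys are cast to int by PHP, so cast them back.
    $result = array_map('strval', array_keys($found));
    sort($result, SORT_STRING);
    return $result;
}
```

For the string from the question, repeated_substrings('abcbcdcde') yields b, bc, c, cd and d, which matches the expected result set.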
Out of interest, I wrote Tim’s answer in PHP as well:
I’ve let them fight it out in a small benchmark of 800 bytes of random data:
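A timing harness along these lines could be used for such a comparison. This is only a sketch: time_rounds() is a hypothetical helper, and the two callables are stand-ins that should be swapped for the functions actually under test.

```php
<?php
// Hypothetical helper: call $fn($input) $rounds times and return
// the total elapsed wall-clock time in seconds.
function time_rounds(callable $fn, string $input, int $rounds = 10): float
{
    $start = hrtime(true); // monotonic clock, nanoseconds (PHP >= 7.3)
    for ($r = 0; $r < $rounds; $r++) {
        $fn($input);
    }
    return (hrtime(true) - $start) / 1e9;
}

// 800 bytes of random data, as in the benchmark described here.
$data = random_bytes(800);

// Stand-in candidates (assumption): replace with the real implementations.
$candidates = [
    'stand-in A' => 'strrev',
    'stand-in B' => 'strtoupper',
];

foreach ($candidates as $label => $fn) {
    printf("%-12s %.6f s\n", $label, time_rounds($fn, $data));
}
```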
Each code is run for 10 rounds and the execution time is measured. The results?
It gets weirder when you look at 24k bytes (or anything above 1k really):
It turns out that the regular expression broke down after 1k characters, so the $matches array was empty. These are my .ini settings:
It’s not clear to me how a backtrack or recursion limit would have been hit after only 1k characters. But even if those settings are “fixed” somehow, the result is still obvious: PCRE doesn’t seem to be the answer.
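One way to find out which limit was actually hit is to ask PCRE directly through preg_last_error() instead of guessing from an empty result. The sketch below assumes PHP 7 or later; pcre_error_name() and match_all_checked() are hypothetical helper names.

```php
<?php
// Hypothetical helper: map a preg_last_error() code to its constant name.
function pcre_error_name(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        PREG_JIT_STACKLIMIT_ERROR  => 'PREG_JIT_STACKLIMIT_ERROR',
    ];
    return $names[$code] ?? "unknown PCRE error ($code)";
}

// Hypothetical helper: like preg_match_all(), but a failed match raises
// an exception that names the error and the configured limits.
function match_all_checked(string $pattern, string $subject): array
{
    $matches = [];
    if (preg_match_all($pattern, $subject, $matches) === false) {
        throw new RuntimeException(sprintf(
            '%s (pcre.backtrack_limit=%s, pcre.recursion_limit=%s)',
            pcre_error_name(preg_last_error()),
            ini_get('pcre.backtrack_limit'),
            ini_get('pcre.recursion_limit')
        ));
    }
    return $matches;
}
```

Calling match_all_checked() with the pattern and the larger input then either returns the matches or throws with the name of the limit that was exceeded.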
I suppose writing this in C would speed it up somewhat, but I’m not sure to what degree.
Update
With some help from hakre’s answer I put together an improved version that increases performance by ~18% after optimizing the following:
- Remove the substr() calls in the outer loop to advance the string pointer; this was a leftover from my previous recursive incarnations.
- Use the partial results as a positive cache to skip strpos() calls inside the inner loop.

And here it is, in all its glory (:
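A minimal sketch of how these two optimizations might fit together, assuming overlapping occurrences count as repeats. The function name repeated_substrings_cached() is hypothetical, and this is not necessarily the exact listing that produced the ~18% figure:

```php
<?php
// Hypothetical helper: find all substrings of $s that occur again later.
function repeated_substrings_cached(string $s): array
{
    $n = strlen($s);
    $found = []; // partial results, doubling as a positive cache
    for ($i = 0; $i < $n - 1; $i++) {
        // The haystack is never shortened with substr(); the offset
        // argument of strpos() advances the search position instead.
        for ($len = 1; $len <= $n - $i - 1; $len++) {
            $sub = substr($s, $i, $len);
            if (isset($found[$sub])) {
                // Already known to repeat: skip the strpos() call
                // and try the next longer candidate.
                continue;
            }
            if (strpos($s, $sub, $i + 1) === false) {
                // No later occurrence, so longer candidates starting
                // at $i cannot repeat either.
                break;
            }
            $found[$sub] = true;
        }
    }
    // Numeric-looking keys are cast to int by PHP, so cast them back.
    $result = array_map('strval', array_keys($found));
    sort($result, SORT_STRING);
    return $result;
}
```

A cache hit only skips the search; the next longer candidate is still checked with strpos(), so the result set stays the same as without the cache.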