I don’t know if this is possible using regex. I’m just asking in case someone knows the answer.
I have a string s = "hellohowareyou??". I need to split it like this:
[h, el, loh, owar, eyou?, ?].
The splitting is done such that the first string will have length 1, second length 2 and so on. The last string will have the remaining characters. I can do it easily without regex using a function like this.
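import java.util.ArrayList;
public ArrayList<String> splitString(String s)
{
    int cnt = 0, i;
    ArrayList<String> sList = new ArrayList<String>();
    // At each check, i is where the next piece starts and cnt is the length of the previous piece.
    for (i = 0; i + cnt < s.length(); i = i + cnt)
    {
        cnt++; // each piece is one character longer than the previous one
        sList.add(s.substring(i, i + cnt));
    }
    // Whatever is left becomes the last piece (empty if the length is a triangular number).
    sList.add(s.substring(i, s.length()));
    return sList;
}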
I was just curious whether such a thing can be done using regex.
Solution
The following snippet generates the pattern that does the job (see it run on ideone.com):
Note that this solution uses techniques already covered in my regex article series. The only new thing here is \G and forward references.
References
This is a brief description of the basic regex constructs used:
- (?x) is the embedded flag modifier to enable the free-spacing mode, where unescaped whitespace is ignored (and # can be used for comments).
- ^ and $ are the beginning and end-of-the-line anchors. \G is the end-of-previous match anchor.
- | denotes alternation (i.e. "or").
- ? as a repetition specifier denotes optional (i.e. zero-or-one of). As a repetition quantifier in e.g. .*? it denotes that the * (i.e. zero-or-more of) repetition is reluctant/non-greedy.
- (…) are used for grouping. (?:…) is a non-capturing group. A capturing group saves the string it matches; it allows, among other things, matching on back/forward/nested references (e.g. \1).
- (?=…) is a positive lookahead; it looks to the right to assert that there’s a match of the given pattern. (?<=…) is a positive lookbehind; it looks to the left.
- (?!…) is a negative lookahead; it looks to the right to assert that there isn’t a match of a pattern.
Related questions
- [nested-reference] series:
- (?<=#)[^#]+(?=#) work?
Explanation
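As a warm-up for the explanation, here is a minimal Java sketch of two of the constructs described above: splitting on a zero-width lookbehind, and \G anchoring each match to the end of the previous one. The class and method names are made up for illustration and are not part of the original snippet.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are not from the original answer.
class ZeroWidthDemo {
    // Splits after every '#'. The lookbehind matches an empty string,
    // so the split consumes no characters.
    static String[] splitAfterHash(String s) {
        return s.split("(?<=#)");
    }

    // Counts the digits at the start of s. \G forces each match to begin
    // where the previous one ended, so the scan stops at the first non-digit.
    static int countLeadingDigits(String s) {
        Matcher m = Pattern.compile("\\G\\d").matcher(s);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" | ", splitAfterHash("a#b#c"))); // a# | b# | c
        System.out.println(countLeadingDigits("123a45"));                // 3
    }
}
```

The second method shows why \G is useful as a reference point: the digits after the a are never counted, because no match can start anywhere except where the last one ended.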
The pattern matches on zero-width assertions. A rather complex algorithm is used to assert that the current position is a triangular number. There are 2 main alternatives:
- (?<=^.), i.e. we can look behind and see the beginning of the string one dot away
- Otherwise, we measure to reconstruct how the last match was made (using \G as reference point), storing the result of the measurement in the "before" \G and "after" \G capturing groups. We then check if the current position is the one prescribed by the measurement to find where the next match should be made.
Thus the first alternative is the trivial "base case", and the second alternative sets up how to make all subsequent matches after that. The groups are numbered rather than named (Java only gained named groups in Java 7), so here are the semantics for the 3 capturing groups:
- \1 captures the string "before" \G
- \2 captures some string "after" \G
  - If the length of \1 is e.g. 1+2+3+…+k, then the length of \2 needs to be k.
  - Hence \2 . has length k+1 and should be the next part in our split!
- \3 captures the string to the right of our current position
  - Hence when we can assertEntirety on \1 \G \2 . \3, we match and set the new \G
You can use mathematical induction to rigorously prove the correctness of this algorithm.
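The induction step can be made concrete with a plain-Java model of the measurement. This is only a sketch of the arithmetic, not the regex itself: the class TriangularModel and its methods are hypothetical names, and the two counters stand in for the lengths of \1 and \2.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the measurement; hypothetical names, not the regex itself.
class TriangularModel {
    // Given g, the index where \G sits (0 or a triangular number), rebuild the
    // lengths of \1 and \2 one round at a time and return the next cut position.
    static int nextCut(int g) {
        int before = 0; // models the length of \1
        int after = 0;  // models the length of \2
        while (before < g) {  // loop for as long as \1 has not reached \G
            after += 1;       // \2 grows by one per round
            before += after;  // \1 grows by 1, then 2, then 3, ...
        }
        if (before != g) {
            throw new IllegalArgumentException("g is not a triangular number: " + g);
        }
        return g + after + 1; // "\2 ." has length k+1; g = 0 gives 1, the base case
    }

    // All cut positions strictly inside a string of the given length.
    static List<Integer> cuts(int length) {
        List<Integer> result = new ArrayList<>();
        for (int g = nextCut(0); g < length; g = nextCut(g)) {
            result.add(g);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(cuts(13)); // [1, 3, 6, 10]
        System.out.println(cuts(16)); // [1, 3, 6, 10, 15]
    }
}
```

If \G sits at 1+2+…+k, the loop ends with after equal to k, so the returned position is 1+2+…+k+(k+1), which is again a triangular number. For the 16-character string from the question, the cuts 1, 3, 6, 10, 15 give exactly [h, el, loh, owar, eyou?, ?].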
To help illustrate how this works, let’s work through an example. Let’s take abcdefghijklm as input, and say that we’ve already partially split off [a, bc, def].
Remember that \G marks the end of the last match, and it occurs at triangular number indices. If \G occurred at 1+2+3+…+k, then the next match needs to be k+1 positions after \G to be a triangular number index.
Thus in our example, given that \G is where we just split off def, we measured that k=3, and the next match will split off ghij as expected.
To have \1 and \2 be built according to the above specification, we basically do a while "loop": for as long as it’s notGyet, we count up to k as follows:
- +NBefore, i.e. we extend \1 by one forEachDotBehind
- +1After, i.e. we extend \2 by just one
Note that notGyet contains a forward reference to group 1 which is defined later in the pattern. Essentially we do the loop until \1 "hits" \G.
Conclusion
Needless to say, this particular solution has terrible performance. The regex engine only remembers WHERE the last match was made (with \G), and forgets HOW (i.e. all capturing groups are reset when the next attempt to match is made). Our pattern must then reconstruct the HOW (an unnecessary step in traditional solutions, where variables aren’t so "forgetful"), by painstakingly building strings by appending one character at a time (which is O(N^2)). Each simple measurement is linear instead of constant time (since it’s done as a string matching where length is a factor), and on top of that we make many measurements which are redundant (i.e. to extend by one, we need to first re-match what we already have).
There are probably many "better" regex solutions than this one. Nonetheless, the complexity and inefficiency of this particular solution should rightfully suggest that regex is not designed for this kind of pattern matching.
That said, for learning purposes, this is an absolutely wonderful problem, for there is a wealth of knowledge in researching and formulating its solutions. Hopefully this particular solution and its explanation have been instructive.