I have a string like this:
String s = "word=PS1,p1,p2,p3=q1,q2|word2=PS3,p4,p5,p6=q3";
or like this:
String s2 = "word3=PS2,p7,p8=q4,q5,q6|=PS3,p9=";
or like this:
String s3 = "=PS3=";
Formally, the string contains word definitions from a dictionary, separated by the “|” symbol.
Here:
- word – a word in the dictionary (optional, as in s2 or s3)
- PS1, PS2, PS3 – part-of-speech tag (required)
- p1, p2, … – some parameters (optional)
- q1, q2, q3, … – some other parameters (also optional)
I want to build a regex that finds all occurrences of such strings in a text and gives me these groups:
- group1 – word
- group2 – part of speech tag
- group3, group4, … – parameters p
- group(k), group(k+1), … – the other parameters (q)
I don’t care about the index of the group holding the last p parameter or the first q parameter. I only need to know that the first group is the word (may be null), the second group is the part of speech, and the remaining groups are the p and q parameters.
Currently I have this regex:
"([a-z]*)?=([A-Z]+)(,?[a-z]+)*=(,?[a-z]+)*"
But it doesn’t work correctly: it gives me only the last p parameter and the last q parameter. For example, for s2:
- group1 = word3 – OK
- group2 = PS2 – OK
- group3 = p8 – NOT OK (only last p-parameter)
- group4 = q6 – NOT OK (also last q-parameter)
Could you help me?
UPDATE:
The “=” character is only the separator between the p-parameters and the q-parameters. It’s not essential to my problem; you can treat the p-parameters and q-parameters as the same kind of thing.
Example of real input:
String s = "bread=NOUN,plur,link=form|=VERB=";
You can’t have a variable number of capture groups in a regex. In .NET you can get multiple captures for each group, but not in Java. The problem for you is that the regex engine stores only the last successful match for each group. The best you can do is match all the p-parameters and all the q-parameters into two big groups, and then split them up.
I used [^|=,]* to match any non-special character.
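Here is a minimal sketch of that approach, assuming every entry has the shape word=TAG,p1,p2=q1,q2. The exact pattern, the class name DictionaryEntryParser and the helper addParams are only my illustration, so adjust them to your data. Group 3 captures the whole p-list and group 4 the whole q-list; both are then split on commas and appended to one list, since you said p and q don't need to be told apart.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DictionaryEntryParser {
    // group 1: word (may be empty), group 2: part-of-speech tag,
    // group 3: all p-parameters as one chunk (",p1,p2"),
    // group 4: all q-parameters as one chunk ("q1,q2")
    private static final Pattern ENTRY = Pattern.compile(
            "([^|=,]*)=([^|=,]+)((?:,[^|=,]+)*)=([^|=]*)");

    /** Returns one list per entry: [word, tag, param, param, ...]. */
    static List<List<String>> parse(String input) {
        List<List<String>> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(input);
        while (m.find()) {
            List<String> entry = new ArrayList<>();
            entry.add(m.group(1));        // empty string when the word is missing
            entry.add(m.group(2));
            addParams(entry, m.group(3)); // split the big p group
            addParams(entry, m.group(4)); // split the big q group
            entries.add(entry);
        }
        return entries;
    }

    // Splits a comma-separated chunk and skips the empty pieces
    // produced by a leading comma or by an empty chunk.
    private static void addParams(List<String> out, String chunk) {
        for (String p : chunk.split(",")) {
            if (!p.isEmpty()) {
                out.add(p);
            }
        }
    }

    public static void main(String[] args) {
        // prints [[bread, NOUN, plur, link, form], [, VERB]]
        System.out.println(parse("bread=NOUN,plur,link=form|=VERB="));
    }
}
```

Note that a missing word comes back as an empty string rather than null, because group 1 always participates in the match. Also, [^|=,] accepts spaces and punctuation, so if the entries are embedded in running text you will probably want a stricter character class.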