I have a string of text chunked into phrases, with each phrase surrounded by square brackets:
[pX textX/labelX] [pY textY/labelY] [pZ textZ/labelZ] [textA/labelA]
Sometimes a chunk does not start with a p-character (like the last one above).
My problem is that I need to capture each chunk. That's fine under normal circumstances, but sometimes the input is malformed: some chunks might have only one bracket, or none. So it might look like this:
[pX textX/labelX] pY textY/labelY] textZ/labelZ
But it ought to come out like this:
[pX textX/labelX] [pY textY/labelY] [textZ/labelZ]
There are no nested brackets. After digging through many other people's regex solutions (I'm new to regex), downloading cheat sheets, and getting a regex tool (Expresso), I still don't know how to do this. Any ideas? Maybe regex isn't the right tool, but then how is this problem solved? I imagine it's not an unusual problem.
Edit
Here is a specific example:
$data= "[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC";
This is a great compact solution from @FailedDev:
while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    # matched text = $&
}
but I think two points need to be added for emphasis in the problem:
- some chunks have no brackets at all
- ,/PUNC and w#hm/CC_PRP_MP3] are separate chunks that need to be separated.
However, since this case is a fixed one (i.e. a punctuation mark followed by a text/label pattern that has only one square bracket, on the right), I hard-coded it into the solution like this:
my @stuff;
while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    my $chunk = $&;    # copy it now: the next successful match overwrites $&
    if ($chunk =~ m/^(\S\/PUNC) (.*\])/) {    # a "./PUNC" mark followed by a "phrase]"
        push(@stuff, $1);    # the PUNC token before the space
        push(@stuff, $2);    # the other chunk, after the space
    }
    else { push(@stuff, $chunk); }
}
foreach (@stuff) { print "$_\n"; }    # one chunk per line
Trying the example I added in the edit, this works just fine except for one problem. The last ./PUNC gets left out, so the output is:
[VP sysmH/VBD_MS3]
[PP ll#/IN_DET Axryn/NNS_MP]
,/PUNC
w#hm/CC_PRP_MP3]
[NP AEDA'/NN]
,/PUNC
[PP b#/IN m/NN_FS]
[NP >HyAnA/NN]
How can I keep the last chunk?
You could use this
Assuming your string is something like:
It will not work with this, for example:
pY [[[textY/labelY]
Perl-specific solution:
Update:
This works with your updated string, but you should trim the whitespace of the results, if you need to.
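The code this update refers to is not reproduced here. As a rough sketch of one alternative (not the pattern from this answer), a regex that matches tokens rather than spans needs no trimming and also keeps the trailing ./PUNC. It assumes that every bare token outside brackets is its own chunk and that brackets are never nested; the names @chunks and @normalized are made up for illustration. Note that the sample string is written with q{} here: inside double quotes, Perl would try to interpolate $Arkp as a variable.

```perl
use strict;
use warnings;

# q{} is a non-interpolating quote, so "m$Arkp" stays literal.
my $data = q{[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC};

# Assumption: a chunk is either a complete [...] group or a single
# whitespace-delimited token that may carry a stray closing bracket.
my @chunks = $data =~ m{
      \[ [^\[\]]* \]     # a fully bracketed chunk
    | [^\s\[\]]+ \]?     # a bare token, optionally ending in a bracket
}gx;

# Optional: give every chunk exactly one pair of brackets.
my @normalized;
for my $chunk (@chunks) {
    (my $text = $chunk) =~ s/^\[|\]$//g;    # drop whichever brackets are present
    push @normalized, "[$text]";
}

print "$_\n" for @chunks;
```

A chunk that has only an opening bracket (for example `[pX textX/labelX` with no `]`) is not handled by this sketch: the stray `[` is skipped and the tokens after it come out as separate chunks.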
Update 2:
I suggest opening a separate question, because your original question is completely different from the latest one.