The Archive Base

Editorial Team
Asked: May 26, 2026


I have a string of text chunked into phrases, with each phrase surrounded by square brackets:

[pX textX/labelX] [pY textY/labelY] [pZ textZ/labelZ] [textA/labelA]

Sometimes a chunk does not start with a p-character (like the last one above).

My problem is that I need to capture each chunk. That’s easy under normal circumstances, but sometimes the input is misformatted: some chunks might have only one bracket, or none. So it might look like this:

 [pX textX/labelX] pY textY/labelY] textZ/labelZ

But it ought to come out like this:

 [pX textX/labelX] [pY textY/labelY] [textZ/labelZ]

The input never contains nested brackets. After working through many other people’s regex solutions (I’m new to regex), downloading cheat sheets, and trying a regex tool (Expresso), I still don’t know how to do this. Any ideas? Maybe regex is the wrong tool, but then how is this problem solved? I imagine it’s not a unique problem.

Edit

Here is a specific example:

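# q{} is single-quoting, so $Arkp stays literal (double quotes would interpolate it as an empty variable)
my $data = q{[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC};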

This is a great compact solution from @FailedDev:

while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    # matched text = $&
}

but I think two points need to be added for emphasis in the problem:

  1. some chunks have no brackets at all
  2. ,/PUNC and w#hm/CC_PRP_MP3] are two distinct chunks that need to be split apart.

However, since this case is a fixed one (i.e. a punctuation mark followed by a text/label pattern that has only a closing square bracket), I hard-coded it into the solution like this:

my @stuff;
while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    my $chunk = $&;    # copy the match before running another regex
    if ($chunk =~ m/^\S\/PUNC .*\]/) {    # a "./PUNC" mark followed by a "phrase]"
        my @bits = split(/ /, $chunk);    # split on spaces
        push(@stuff, $bits[0]);           # the first piece is the PUNC token
        push(@stuff, substr($chunk, 7));  # everything after "X/PUNC " is the other chunk
    }
    else {
        push(@stuff, $chunk);
    }
}
print "$_\n" for @stuff;    # one chunk per line

Trying the example I added in the edit, this works just fine except for one problem. The last ./PUNC gets left out, so the output is:

[VP sysmH/VBD_MS3]
[PP ll#/IN_DET Axryn/NNS_MP]
,/PUNC
w#hm/CC_PRP_MP3]
[NP AEDA'/NN]
,/PUNC
[PP b#/IN m$Arkp/NN_FS]
[NP >HyAnA/NN]

How can I keep the last chunk?


1 Answer

  1. Editorial Team
    Added an answer on May 26, 2026 at 8:52 pm

    You could use this:

    /(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*)/
    

    This assumes your string is something like:

    [pX textX/labelX] pY textY/labelY]  pY textY/labelY]  pY textY/labelY]  [pY textY/labelY] [3940-823490-2 [30-94823049 [32904823498]
    

    It will not work with input such as: pY [[[textY/labelY]

    Perl-specific solution:

    while ($subject =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
        # matched text = $&
    }
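As a minimal sketch of how the loop above might collect the chunks into an array: the names `$subject` and `@chunks` are illustrative, and the input is the sample string from earlier in this answer. A capture group and `$1` are used here in place of `$&`, which is a choice of this sketch rather than part of the original pattern.

```perl
use strict;
use warnings;

# Sample input copied from the example string above (illustrative only).
my $subject = '[pX textX/labelX] pY textY/labelY]  pY textY/labelY]  pY textY/labelY]  [pY textY/labelY] [3940-823490-2 [30-94823049 [32904823498]';

my @chunks;
while ($subject =~ m/(\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    push @chunks, $1;    # $1 holds the text matched by the capture group
}

print "$_\n" for @chunks;    # one chunk per line
```

On this input the loop yields eight chunks, including the bracket-less `pY textY/labelY]` forms and the unclosed `[3940-823490-2`.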
    

    Update:

    /(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s+[^[]+?(?:\s+|$))/
    

    This works with your updated string, but you may need to trim whitespace from the results.
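The trimming step could look something like the sketch below. It applies the updated pattern to the string from the question's edit and strips surrounding whitespace from each match. The names `$chunk_re` and `@chunks` are assumptions made for illustration, and the data is written with `q{}` so that `$Arkp` is kept as literal text.

```perl
use strict;
use warnings;

# q{} is single-quoting: "$Arkp" is kept as-is instead of being interpolated.
my $data = q{[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC};

# The updated pattern: the fourth alternative picks up bare, bracket-less tokens.
my $chunk_re = qr/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*|\s+[^[]+?(?:\s+|$))/;

my @chunks;
while ($data =~ /($chunk_re)/g) {
    my $chunk = $1;
    $chunk =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
    push @chunks, $chunk;
}

print "$_\n" for @chunks;    # one chunk per line
```

With this input the loop produces nine chunks, and the final `./PUNC` is kept.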

    Update 2:

    /(\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s*[^[]+?(?:\s+|$))/
    

    I suggest opening a separate question, because your original question is quite different from the latest one.

    "
    (                 # Match the regular expression below and capture its match into backreference number 1
                         # Match either the regular expression below (attempting the next alternative only if this one fails)
          \[                # Match the character “[” literally
          [^[]              # Match any character that is NOT a “[”
             *?                # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
          ]                 # Match the character “]” literally
       |                 # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
          [^[ ]             # Match a single character NOT present in the list “[ ”
          .                 # Match any single character that is not a line break character
             *?                # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
          ]                 # Match the character “]” literally
       |                 # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
          \[                # Match the character “[” literally
          [^[ ]             # Match a single character NOT present in the list “[ ”
             *                 # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
       |                 # Or match regular expression number 4 below (the entire group fails if this one fails to match)
          \s                # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
             *                 # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
          [^[]              # Match any character that is NOT a “[”
             +?                # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
          (?:               # Match the regular expression below
                               # Match either the regular expression below (attempting the next alternative only if this one fails)
                \s                # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
                   +                 # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
             |                 # Or match regular expression number 2 below (the entire group fails if this one fails to match)
                $                 # Assert position at the end of the string (or before the line break at the end of the string, if any)
          )
    )
    "
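The breakdown above can also be kept inside the code by writing the pattern with Perl's `/x` modifier, which ignores whitespace outside character classes and allows comments. The sketch below does that for the "Update 2" pattern and then, as an extra step that is not part of the original answer, adds any missing bracket to each trimmed chunk. The names `$chunk_re` and `@fixed` are illustrative assumptions.

```perl
use strict;
use warnings;

# The "Update 2" pattern, spread out with /x so each alternative can be commented.
my $chunk_re = qr{
    (                                # capture the whole chunk in group 1
        \[ [^[]*? \]                 # 1: a fully bracketed chunk
      | [^[ ] .*? \]                 # 2: a chunk missing its opening bracket
      | \[ [^[ ]*                    # 3: a chunk missing its closing bracket
      | \s* [^[]+? (?: \s+ | $ )     # 4: a bare token with no brackets at all
    )
}x;

# q{} is single-quoting, so "$Arkp" stays literal.
my $data = q{[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC};

my @fixed;
while ($data =~ /$chunk_re/g) {
    my $chunk = $1;
    $chunk =~ s/^\s+|\s+$//g;                       # trim surrounding whitespace
    $chunk = '[' . $chunk unless $chunk =~ /^\[/;   # add a missing opening bracket
    $chunk .= ']'         unless $chunk =~ /\]$/;   # add a missing closing bracket
    push @fixed, $chunk;
}

print join(' ', @fixed), "\n";
```

One caveat with this pattern: because the fourth alternative accepts any bare token that follows whitespace, an input like `[pX textX/labelX] pY textY/labelY]` is split into `pY` and `textY/labelY]` rather than kept as one chunk. It suits the updated string, where each bracket-less token is its own chunk, but not the format in the original question.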
    