Editorial Team
Asked: May 19, 2026 (2026-05-19T15:27:36+00:00)


What class of languages do real modern regexes actually recognise?

Whenever there is an unbounded length capturing group with a back-reference (e.g. (.*)_\1) a regex is now matching a non-regular language. But this, on its own, isn’t enough to match something like S ::= '(' S ')' | ε — the context-free language of matching pairs of parens.
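To make the back-reference point concrete, here is a minimal sketch (the test strings are my own, purely for illustration). Anchoring the question's pattern gives a matcher for strings of the form w_w, the "copy" language, which no finite automaton can recognise because the first half can be arbitrarily long and has to be remembered exactly:

```perl
use strict;
use warnings;

# A back-reference forces the text after the underscore to repeat
# whatever group 1 captured, so the anchored pattern accepts w_w only.
my $copy = qr/^(.*)_\1$/;

# Hypothetical sample inputs, chosen for illustration.
my @accepted = grep { $_ =~ $copy } qw(abc_abc _ ab_ab abc_abd a_aa);
print "accepted: @accepted\n";
```

Nothing here nests, though, which is why back-references alone do not get you balanced parentheses.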

Recursive regexes (which are new to me, but I am assured exist in Perl and PCRE) appear to recognize at least most CFLs.

Has anyone done or read any research in this area? What are the limitations of these “modern” regexes? Do they recognize strictly more or strictly less than CFGs, or than LL or LR grammars? Or are there both languages that a regex can recognize but a CFG cannot, and the opposite?

Links to relevant papers would be much appreciated.

1 Answer

  1. Editorial Team
    Added an answer on May 19, 2026 at 3:27 pm (2026-05-19T15:27:37+00:00)

    Pattern Recursion

    With recursive patterns, you have a form of recursive descent matching.

    This is fine for a variety of problems, but once you want to actually do recursive descent parsing, you need to insert capture groups here and there, and it is awkward to recover the full parse structure that way. Damian Conway’s Regexp::Grammars module for Perl transforms the simple pattern into an equivalent one that automatically does all that named capturing into a recursive data structure, which makes retrieving the parsed structure far easier. I have a sample comparing these two approaches at the end of this post.

    Restrictions on Recursion

    The question was what kinds of grammars recursive patterns can match. Well, they’re certainly recursive descent type matchers. The only restriction that comes to mind is that recursive patterns cannot handle left recursion: a rule that calls itself before consuming any input never makes progress. This puts a constraint on the sorts of grammars that you can apply them to. Sometimes you can reorder your productions to eliminate left recursion.
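As a rough sketch of the left-recursion restriction, take a toy production for sums of integers (the rule name `sum` and the input are made up for this example). On the Perl versions I'm aware of, the left-recursive phrasing is a fatal error rather than a silent non-match, so it is wrapped in `eval`; the right-recursive rewrite matches the same strings:

```perl
use strict;
use warnings;

# Left-recursive: sum ::= sum '+' digits | digits
# The first alternative re-enters <sum> before consuming any input.
my $died = !defined eval {
    my $left = qr{ ^ (?<sum> (?&sum) \+ \d+ | \d+ ) $ }x;
    my $hit  = "1+2+3" =~ $left;
    1;
};
print "left-recursive form ", ($died ? "died" : "ran"), "\n";

# Same strings with the recursion moved right: sum ::= digits ('+' sum)?
my $right = qr{ ^ (?<sum> \d+ (?: \+ (?&sum) )? ) $ }x;
print "right-recursive form: ",
      ("1+2+3" =~ $right ? "match" : "no match"), "\n";
```

The rewrite changes the associativity you would get from a parse tree, which doesn't matter for pure matching but would matter if you were building one.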

    BTW, PCRE and Perl differ slightly on how you’re allowed to phrase the recursion. See the sections on “RECURSIVE PATTERNS” and “Recursion difference from Perl” in the pcrepattern manpage. For example, Perl can handle ^(.|(.)(?1)\2)$ where PCRE requires ^((.)(?1)\2|.)$ instead.

    Recursion Demos

    The need for recursive patterns arises surprisingly frequently. One well-visited example is when you need to match something that can nest, such as balanced parentheses, quotes, or even HTML/XML tags. Here’s the match for balanced parens:

    \((?:[^()]*+|(?0))*\)
    

    I find that trickier to read because of its compact nature. This is easily curable with /x mode to make whitespace no longer significant:

    \( (?: [^()] *+ | (?0) )* \)
    
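Here is a small sketch for trying the balanced-parens pattern on whole strings. It is a variant, not the pattern above verbatim: I've wrapped it in a capture group and recurse with `(?1)` instead of `(?0)`, so that the `^` and `$` anchors stay outside the recursion. The sample inputs are my own.

```perl
use strict;
use warnings;

# Anchored so the whole string must be one balanced group.
# (?1) recurses into group 1 only; (?0) would recurse into the whole
# pattern and take the ^ and $ anchors along with it.
my $balanced = qr{ ^ ( \( (?: [^()]*+ | (?1) )* \) ) $ }x;

for my $s ('(a(b)c)', '((()))', '(()', '())') {
    printf "%-8s %s\n", $s, ($s =~ $balanced ? 'balanced' : 'not balanced');
}
```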

    Then again, since we’re using parens for our recursion, a clearer example would be matching nested single quotes:

    ‘ (?: [^‘’] *+ | (?0) )* ’
    

    Another recursively defined thing you may wish to match would be a palindrome. This simple pattern works in Perl:

    ^((.)(?1)\2|.?)$
    

    which you can test on most systems using something like this:

    $ perl -nle 'print if /^((.)(?1)\2|.?)$/i' /usr/share/dict/words
    
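If you don't have a words file handy, a self-contained check along these lines should do (the word list is made up for the example, and this is Perl only, given the PCRE difference mentioned earlier):

```perl
use strict;
use warnings;

# Same pattern as the one-liner above; /i lets \2 match regardless of case.
my $palindrome = qr/^((.)(?1)\2|.?)$/i;

# Hypothetical sample words.
my @words = qw(racecar Level noon abca ab a);
my @hits  = grep { $_ =~ $palindrome } @words;
print "palindromes: @hits\n";
```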

    Note that PCRE’s implementation of recursion requires the more elaborate

    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
    

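    This is because PCRE treats a recursive subpattern call as an atomic group: once the recursion has matched, PCRE will not backtrack into it to try a different alternative, whereas Perl will.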

    Proper Parsing

    To me, the examples above are mostly toy matches, not all that interesting, really. It becomes interesting when you have a real grammar you’re trying to parse. For example, RFC 5322 defines a mail address rather elaborately. Here’s a “grammatical” pattern to match it:

    $rfc5322 = qr{
    
       (?(DEFINE)
    
         (?<address>         (?&mailbox) | (?&group))
         (?<mailbox>         (?&name_addr) | (?&addr_spec))
         (?<name_addr>       (?&display_name)? (?&angle_addr))
         (?<angle_addr>      (?&CFWS)? < (?&addr_spec) > (?&CFWS)?)
         (?<group>           (?&display_name) : (?:(?&mailbox_list) | (?&CFWS))? ; (?&CFWS)?)
         (?<display_name>    (?&phrase))
         (?<mailbox_list>    (?&mailbox) (?: , (?&mailbox))*)
    
         (?<addr_spec>       (?&local_part) \@ (?&domain))
         (?<local_part>      (?&dot_atom) | (?&quoted_string))
         (?<domain>          (?&dot_atom) | (?&domain_literal))
         (?<domain_literal>  (?&CFWS)? \[ (?: (?&FWS)? (?&dcontent))* (?&FWS)?
                                       \] (?&CFWS)?)
         (?<dcontent>        (?&dtext) | (?&quoted_pair))
         (?<dtext>           (?&NO_WS_CTL) | [\x21-\x5a\x5e-\x7e])
    
         (?<atext>           (?&ALPHA) | (?&DIGIT) | [!#\$%&'*+\-/=?^_`{|}~])
         (?<atom>            (?&CFWS)? (?&atext)+ (?&CFWS)?)
         (?<dot_atom>        (?&CFWS)? (?&dot_atom_text) (?&CFWS)?)
         (?<dot_atom_text>   (?&atext)+ (?: \. (?&atext)+)*)
    
         (?<text>            [\x01-\x09\x0b\x0c\x0e-\x7f])
         (?<quoted_pair>     \\ (?&text))
    
         (?<qtext>           (?&NO_WS_CTL) | [\x21\x23-\x5b\x5d-\x7e])
         (?<qcontent>        (?&qtext) | (?&quoted_pair))
         (?<quoted_string>   (?&CFWS)? (?&DQUOTE) (?:(?&FWS)? (?&qcontent))*
                              (?&FWS)? (?&DQUOTE) (?&CFWS)?)
    
         (?<word>            (?&atom) | (?&quoted_string))
         (?<phrase>          (?&word)+)
    
         # Folding white space
         (?<FWS>             (?: (?&WSP)* (?&CRLF))? (?&WSP)+)
         (?<ctext>           (?&NO_WS_CTL) | [\x21-\x27\x2a-\x5b\x5d-\x7e])
         (?<ccontent>        (?&ctext) | (?&quoted_pair) | (?&comment))
         (?<comment>         \( (?: (?&FWS)? (?&ccontent))* (?&FWS)? \) )
         (?<CFWS>            (?: (?&FWS)? (?&comment))*
                             (?: (?:(?&FWS)? (?&comment)) | (?&FWS)))
    
         # No whitespace control
         (?<NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f])
    
         (?<ALPHA>           [A-Za-z])
         (?<DIGIT>           [0-9])
         (?<CRLF>            \x0d \x0a)
         (?<DQUOTE>          ")
         (?<WSP>             [\x20\x09])
       )
    
       (?&address)
    
    }x;
    

    As you see, that’s very BNF-like. The problem is it is just a match, not a capture. And you really don’t want to just surround the whole thing with capturing parens because that doesn’t tell you which production matched which part. Using the previously mentioned Regexp::Grammars module, we can recover that information.

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    use 5.010;
    use Data::Dumper "Dumper";
    
    my $rfc5322 = do {
        use Regexp::Grammars;    # ...the magic is lexically scoped
        qr{
    
        # Keep the big stick handy, just in case...
        # <debug:on>
    
        # Match this...
        <address>
    
        # As defined by these...
        <token: address>         <mailbox> | <group>
        <token: mailbox>         <name_addr> | <addr_spec>
        <token: name_addr>       <display_name>? <angle_addr>
        <token: angle_addr>      <CFWS>? \< <addr_spec> \> <CFWS>?
        <token: group>           <display_name> : (?:<mailbox_list> | <CFWS>)? ; <CFWS>?
        <token: display_name>    <phrase>
        <token: mailbox_list>    <[mailbox]> ** (,)
    
        <token: addr_spec>       <local_part> \@ <domain>
        <token: local_part>      <dot_atom> | <quoted_string>
        <token: domain>          <dot_atom> | <domain_literal>
        <token: domain_literal>  <CFWS>? \[ (?: <FWS>? <[dcontent]>)* <FWS>?
                                 \] <CFWS>?
    
        <token: dcontent>        <dtext> | <quoted_pair>
        <token: dtext>           <.NO_WS_CTL> | [\x21-\x5a\x5e-\x7e]
    
        <token: atext>           <.ALPHA> | <.DIGIT> | [!#\$%&'*+\-/=?^_`{|}~]
        <token: atom>            <.CFWS>? <.atext>+ <.CFWS>?
        <token: dot_atom>        <.CFWS>? <.dot_atom_text> <.CFWS>?
        <token: dot_atom_text>   <.atext>+ (?: \. <.atext>+)*
    
        <token: text>            [\x01-\x09\x0b\x0c\x0e-\x7f]
        <token: quoted_pair>     \\ <.text>
    
        <token: qtext>           <.NO_WS_CTL> | [\x21\x23-\x5b\x5d-\x7e]
        <token: qcontent>        <.qtext> | <.quoted_pair>
        <token: quoted_string>   <.CFWS>? <.DQUOTE> (?:<.FWS>? <.qcontent>)*
                                 <.FWS>? <.DQUOTE> <.CFWS>?
    
        <token: word>            <.atom> | <.quoted_string>
        <token: phrase>          <.word>+
    
        # Folding white space
        <token: FWS>             (?: <.WSP>* <.CRLF>)? <.WSP>+
        <token: ctext>           <.NO_WS_CTL> | [\x21-\x27\x2a-\x5b\x5d-\x7e]
        <token: ccontent>        <.ctext> | <.quoted_pair> | <.comment>
        <token: comment>         \( (?: <.FWS>? <.ccontent>)* <.FWS>? \)
        <token: CFWS>            (?: <.FWS>? <.comment>)*
                                 (?: (?:<.FWS>? <.comment>) | <.FWS>)
    
        # No whitespace control
        <token: NO_WS_CTL>       [\x01-\x08\x0b\x0c\x0e-\x1f\x7f]
        <token: ALPHA>           [A-Za-z]
        <token: DIGIT>           [0-9]
        <token: CRLF>            \x0d \x0a
        <token: DQUOTE>          "
        <token: WSP>             [\x20\x09]
        }x;
    };
    
    while (my $input = <>) {
        if ($input =~ $rfc5322) {
            say Dumper \%/;       # ...the parse tree of any successful match
                                  # appears in this punctuation variable
        }
    }
    

    As you see, by using a very slightly different notation in the pattern, you now get something which stores the entire parse tree away for you in the %/ variable, with everything neatly labelled. The result of the transformation is still a pattern, as you can see by the =~ operator. It’s just a bit magical.
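For contrast, here is a small sketch of why the plain (?(DEFINE)) version is "just a match". In Perl, a group that is only entered through a recursive call such as (?&number) does not keep its capture once the call returns, so you have to wrap each call in its own capture group by hand. The toy grammar and the names `number`, `lhs` and `rhs` are mine, not part of the RFC 5322 pattern:

```perl
use strict;
use warnings;

my $sum = qr{
    (?(DEFINE)
        (?<number> \d+ )    # a "production", only reached by recursion
    )
    ^ (?<lhs> (?&number) ) \+ (?<rhs> (?&number) ) $
}x;

if ("12+34" =~ $sum) {
    # Wrapper groups outside the recursion keep their captures...
    printf "lhs=%s rhs=%s\n", $+{lhs}, $+{rhs};
    # ...but the group that was only entered by recursion does not.
    print "number is ", (defined $+{number} ? "set" : "not set"), "\n";
}
```

Doing that wrapping for every production in a grammar the size of the RFC 5322 one is exactly the bookkeeping that Regexp::Grammars takes off your hands.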
