Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 7702311
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 31, 20262026-05-31T23:11:56+00:00 2026-05-31T23:11:56+00:00

I’m having an issue building a grammar which is capable of parsing Python 3’s

  • 0

I’m having an issue building a grammar which is capable of parsing Python 3’s AST dump format and converting it into an AST format that’s easier for me to play around with. I decided to write an ANTLR grammar to do so, but I’m having an issue processing keyword blocks (but only keyword blocks, for some reason). I isolated out the keyword grammar, as shown:

grammar kwds;
options {output=AST;}

keywords:   'keywords=['((', '?)keyword)*']' -> keyword*
    ;

keyword :   'keyword(arg='STRING', value='str')'
    ;
str :   'Str(s='STRING')' -> STRING
    ;

STRING
    :  '\'' ( ESC_SEQ | ~('\\'|'\'') )* '\''
    ;

fragment
ESC_SEQ
    :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
    |   UNICODE_ESC
    |   OCTAL_ESC
    ;

EMPTYBRACKETS
    :   '[]';

fragment
OCTAL_ESC
    :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7') ('0'..'7')
    |   '\\' ('0'..'7')
    ;

fragment
UNICODE_ESC
    :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;

This is designed to accept a list of keywords (0 or more with comma separators), with the format shown in the keyword rule.

If you give the above grammar the following (valid) input,

keywords=[keyword(arg=’name’, value=Str(s=’UGA’)), keyword(arg=’rank’, value=Str(s=’2′))]

the grammar will recognize this, as it should.

However, using the “complete” python 3 AST format grammar that I’ve written (found at http://pastebin.com/ETrSVXvf for space-saving purposes, with the two above rules found on lines 106 and 109, respectively), which uses virtually the exact same grammar rules, the token stream seems to be a few characters off after parsing the first keyword match from the sample shown above, producing the following output when parsing against the keywords rule:

sample3.txt line 1:52 mismatched character 'e' expecting 'w'
sample3.txt line 1:53 no viable alternative at character 'y'
sample3.txt line 1:54 no viable alternative at character 'w'
sample3.txt line 1:55 no viable alternative at character 'o'
sample3.txt line 1:56 no viable alternative at character 'r'
sample3.txt line 1:57 no viable alternative at character 'd'
sample3.txt line 1:58 no viable alternative at character '('
sample3.txt line 1:59 missing ENDBR at 'arg='

I can only think of one possibility for this occurring: Something is being tokenized incorrectly due to an ambiguity in the grammar, since the pattern I used to detect multiple keyword statements works for other types of statements. However, I’m completely stuck as to where that ambiguity actually is in the grammar.

Also, any general improvement tips as to how I could improve my grammar in general would be appreciated!

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-31T23:11:57+00:00Added an answer on May 31, 2026 at 11:11 pm

    If you add the rule:

    parse
     : (t=. {System.out.printf("type=\%-20s text='\%s'\n", tokenNames[$t.type], $t.text);})* EOF
     ;
    

    which simply matches zero or more tokens and prints out the type and text of these tokens, you will see that the lexer cannot cope with the input , keyword from your example:

    keywords=[keyword(arg='name', value=Str(s='UGA')), keyword(arg='rank', value=Str(s='2'))]
                                                     ^^^^^^^^^
    

    So there is no issue in one of your parser rules, but things go wrong on a lexical level.

    I advise you to remove all literal tokens from your parser and create lexer rule for them. Then add a single parse rule as I posted above with which you can test the lexer to see if the proper tokens are created. Once the proper tokens have been created, write your parser rules.

    I’m pretty sure the problem here is the fact that you don’t have a ', keyword' token and that, once the lexer “sees” ', k', it tries to create a ', kwargs' token which fails, of course. So I also recommend you do not include the comma’s and spaces in your tokens but let them be tokens of their own.

    Also, you don’t want to have a rewrite rule like this:

    stmtlist:       ((', '?) stmt)* -> stmt*
            ;
    

    which can potentially match nothing. If that happens, ANTLR will throw an exception while creating the AST. Always let a rewrite rule produce something:

    ...
    
    tokens {
      ...
      STMTLST;
      ...
    }
    
    ...
    
    stmtlist:       ((', '?) stmt)* -> ^(STMTLST stmt*)
            ;
    
    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

link Im having trouble converting the html entites into html characters, (&# 8217;) i
I'm parsing an RSS feed that has an ’ in it. SimpleXML turns this
I have a French site that I want to parse, but am running into
We're building an app, our first using Rails 3, and we're having to build
That's pretty much it. I'm using Nokogiri to scrape a web page what has
I am trying to understand how to use SyndicationItem to display feed which is
I used javascript for loading a picture on my website depending on which small
I have a string like this: La Torre Eiffel paragonata all’Everest What PHP function
this is what i have right now Drawing an RSS feed into the php,
I've got a string that has curly quotes in it. I'd like to replace

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.