Editorial Team
Asked: June 10, 2026

I’d like to parse a UTF-8 encoded text file that may contain something like this:

int 1
text " some text with \" and \\ "
int list[-45,54, 435 ,-65]
float list [ 4.0, 5.2,-5.2342e+4]

The numbers in a list are separated by commas. Whitespace is permitted but not required between a number and a symbol such as a comma or bracket. The same holds between words and symbols, as in list[ above.

I’ve implemented the quoted-string reading by forcing Scanner to give me single chars (setting its delimiter to an empty pattern). I did that because I thought Scanner would still be useful for reading the ints and floats, but I’m no longer sure.

Scanner always takes a complete token and then tries to match it. What I need is to match as much (or as little) as possible, disregarding delimiters.

Basically for this input

int list[-45,54, 435 ,-65]

I’d like to be able to make these calls and get these results:

s.nextWord()   // int 
s.nextWord()   // list
s.nextSymbol() // [
s.nextInt()    // -45
s.nextSymbol() // ,
s.nextInt()    // 54
s.nextSymbol() // ,
s.nextInt()    // 435
s.nextSymbol() // ,
s.nextInt()    // -65
s.nextSymbol() // ]

and so on.

Or, if it can’t parse doubles and other types itself, I’d at least like a method that takes a regex, returns the longest string that matches it (or an error), and sets the stream position to just after the match.

Can the Scanner somehow be used for this? Or is there another approach? I feel this must be quite a common thing to do, but I don’t seem to be able to find the right tool for it.

1 Answer

  1. Editorial Team
    Added an answer on June 10, 2026 at 6:15 pm

    I’m not an ANTLR expert, but this ANTLR grammar can parse your input:

    grammar Expressions;
    
    expressions 
        :   expression+ EOF
        ;
    
    expression 
        :   intExpression
        |   intListExpression
        |   floatExpression
        |   floatListExpression
        |   textExpression
        |   textListExpression
        ;
    
    intExpression        :  intType INT;
    intListExpression    :  intType listType '[' ( INT (',' INT)* )? ']';
    floatExpression      :  floatType FLOAT;
    floatListExpression  :  floatType listType '[' ( (INT|FLOAT) (',' (INT|FLOAT))* )? ']';
    textExpression       :  textType STRING;
    textListExpression   :  textType listType '[' ( STRING (',' STRING)* )? ']';
    
    intType   :  'int';
    floatType :  'float';
    textType  :  'text';
    listType  :  'list';
    
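    // Optional leading '-' so that values such as -45 and -5.2342e+4 are accepted.
    INT :   '-'? '0'..'9'+
        ;
    
    FLOAT
        :   '-'? ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
        |   '-'? '.' ('0'..'9')+ EXPONENT?
        |   '-'? ('0'..'9')+ EXPONENT
        ;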
    
    STRING
        :  '"' ( ESC_SEQ | ~('\\'|'"') )* '"'
        ;
    
    fragment
    EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
    
    fragment
    HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
    
    fragment
    ESC_SEQ
        :   '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
        |   UNICODE_ESC
        |   OCTAL_ESC
        ;
    
    fragment
    OCTAL_ESC
        :   '\\' ('0'..'3') ('0'..'7') ('0'..'7')
        |   '\\' ('0'..'7') ('0'..'7')
        |   '\\' ('0'..'7')
        ;
    
    fragment
    UNICODE_ESC
        :   '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
        ;
    
    WS  :   ( ' '
            | '\t'
            | '\r'
            | '\n'
            ) {$channel=HIDDEN;}
        ;
    

    You will of course need to extend it, but with this structure it is easy to insert code into the parser to do what you want (a kind of token stream). Try it in the ANTLRWorks debugger to see what happens.

    For your input, this is the parse tree:

    [Image: parse tree for the OP's input]

    Edit: I changed it to support empty lists.
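
    If a generated parser is more than you need, the regex-driven method described in the question can also be sketched in plain Java with java.util.regex. This is only a minimal sketch under some assumptions: the whole file has already been read into memory as a String, and the class name TokenReader, its method names (modelled on the calls in the question) and its token patterns are hypothetical, not part of the JDK or of ANTLR. It relies on Matcher.region() plus Matcher.lookingAt(), which tries a greedy match anchored at the current position instead of splitting on delimiters first.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a JDK class): a cursor over in-memory text that
// matches a regex at the current position and then advances past the match.
class TokenReader {
    private static final Pattern WS = Pattern.compile("\\s*");
    private static final Pattern WORD = Pattern.compile("[A-Za-z_]+");
    private static final Pattern SYMBOL = Pattern.compile("[\\[\\],]");
    private static final Pattern INT = Pattern.compile("-?\\d+");
    private static final Pattern FLOAT =
            Pattern.compile("-?(\\d+\\.\\d*|\\.\\d+|\\d+)([eE][+-]?\\d+)?");
    // A double-quoted string in which a backslash escapes the next character.
    private static final Pattern QUOTED = Pattern.compile("\"(\\\\.|[^\"\\\\])*\"");

    private final CharSequence input;
    private final Matcher matcher;
    private int pos = 0; // index of the next unread character

    TokenReader(CharSequence input) {
        this.input = input;
        this.matcher = WS.matcher(input);
    }

    // Skips leading whitespace, then matches p starting exactly at the cursor.
    // On success the cursor moves to just after the match; otherwise it stays
    // after the skipped whitespace and an exception is thrown.
    String next(Pattern p) {
        skipWhitespace();
        matcher.usePattern(p).region(pos, input.length());
        if (!matcher.lookingAt()) {
            throw new IllegalStateException(
                    "No match for " + p.pattern() + " at index " + pos);
        }
        pos = matcher.end();
        return matcher.group();
    }

    private void skipWhitespace() {
        matcher.usePattern(WS).region(pos, input.length());
        if (matcher.lookingAt()) { // always true: \s* also matches the empty string
            pos = matcher.end();
        }
    }

    boolean atEnd() {
        skipWhitespace();
        return pos >= input.length();
    }

    String nextWord()   { return next(WORD); }
    String nextSymbol() { return next(SYMBOL); }
    int nextInt()       { return Integer.parseInt(next(INT)); }
    double nextDouble() { return Double.parseDouble(next(FLOAT)); }
    // Returns the raw token including its quotes; unescaping is left out here.
    String nextQuoted() { return next(QUOTED); }
}
```

    With new TokenReader("int list[-45,54, 435 ,-65]"), the calls nextWord(), nextWord(), nextSymbol(), nextInt() consume int, list, [ and -45 in turn. The caller still has to decide which method to call next, which is the part the ANTLR grammar above does for you.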
