The Archive Base

Editorial Team
Asked: June 15, 2026 (2026-06-15T01:35:24+00:00)


I plan to include text metadata (like bold, font-size, etc.) in the process of parsing to achieve better recognition.

For instance, I have a given structure in which a word on its own line (word\r\n) that is bold and sized 24px is the title of an article. To get better recognition results, I want to take both the characters and the metadata into account. I’m not sure how this could best be done in ANTLR. I’d like to do one of the following:

  1. Wrap each character of the original text in a custom object with fields for the metadata and pass that to ANTLR.
  2. Preprocess the text and insert annotations for the metadata at specific places, so that the grammar can take them into account.
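
Option 2 can be sketched roughly as follows. This is only an illustration and not ANTLR API: the StyledChar class, the annotate and styled methods, and the [fs=N] marker syntax are all hypothetical names chosen for the example, and the sketch assumes the original text never contains the marker sequence itself. A marker is inserted wherever the font size changes, so a grammar could match it as an ordinary token.

```java
import java.util.ArrayList;
import java.util.List;

class MetadataAnnotator {
    /** Hypothetical holder for one character plus its font size. */
    static final class StyledChar {
        final char ch;
        final int fontSizePx;

        StyledChar(char ch, int fontSizePx) {
            this.ch = ch;
            this.fontSizePx = fontSizePx;
        }
    }

    /** Returns the text with a marker such as "[fs=24]" inserted wherever the font size changes. */
    static String annotate(List<StyledChar> chars) {
        StringBuilder out = new StringBuilder();
        int current = -1; // sentinel: no font size seen yet (assumes real sizes are non-negative)
        for (StyledChar c : chars) {
            if (c.fontSizePx != current) {
                current = c.fontSizePx;
                out.append("[fs=").append(current).append(']');
            }
            out.append(c.ch);
        }
        return out.toString();
    }

    /** Convenience helper: wraps every character of text with the same font size. */
    static List<StyledChar> styled(String text, int fontSizePx) {
        List<StyledChar> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            result.add(new StyledChar(text.charAt(i), fontSizePx));
        }
        return result;
    }
}
```

The annotated string would then be fed to the lexer in place of the raw text, with a lexer rule for the marker. The trade-off compared with option 1 is that the metadata becomes part of the input language, so the grammar has to mention it explicitly.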

I would really like to take option 1, but I’m not sure which parts of ANTLR I need to subclass. Do I have to start at the ANTLRInputStream object in order to get a proper stream for a subclassed Lexer, which produces custom Tokens for a subclassed Parser, and so on? Is there a more elegant way, especially for querying the tokens while parsing with actions in a {} block?

If anyone has some hints and/or experiences this would be great!

EDIT:

Here is a more specific, simple example: I have a file which includes the encoding of the metadata, which I parse beforehand. The actual text, including newlines, looks like the following:

entryOne
Here is some content one.
entryTwo
Here is some content two.

Here the titles entryOne and entryTwo originally have a font size of 24px and the content has a font size of 12px (example values). Character by character, I create a new instance of a custom object encapsulating the character as a String and the font size.

Each object stores the character and its font size. For example, for the first letter of entryOne:

MyChar aTitelChar = new MyChar("e", 24);

For the content, such as the second line Here is some content one., I create instances of MyChar like this:

MyChar aContentChar = new MyChar("H", 12);

All characters of the text are wrapped in instances of the MyChar class below and added to a List<MyChar> in order to produce a new input for ANTLR.

Below is the Java class for the characters:

public class MyChar {
    private int fontSizePx;
    private String text;

    public MyChar(String text, int fontSizePx) {
        this.text = text;
        this.fontSizePx = fontSizePx;
    }

    public int getFontSizePx() {
        return fontSizePx;
    }

    public String getText() {
        return text;
    }
}

I want my grammar to match the two entries above (or more entries formatted this way), each of which consists of a title and content terminated by a full stop. The grammar could look like this:

rule
    : entry+ NEWLINE
    ;
entry
    : title content
    ;
title
    : letters NEWLINE
    ;
content
    : (letters)+ '.' NEWLINE
    ;
letters
    : LETTERS
    ;
LETTERS
    : ('a'..'z' | 'A'..'Z')+
    ;
WS
    : (' ' | '\t' | '\f')+ {$channel = HIDDEN;}
    ;
NEWLINE
    : '\r'? '\n'
    ;

Now, for instance, I want to find out whether something really is the title of an entry by checking the font size of all letters making up the title token before the title rule returns. If the input conforms to the grammar but is actually a mistake (say the original metadata-encoded file starts with something that conforms to the title rule but is actually content), the author of the grammar could sort that out by checking that the original font size for titles is 24. If one of the letter tokens does not have font size 24, the rule should throw an exception, not return, or do something else appropriate.

What I’m pondering is where to plug in the List<MyChar> to provide this functionality (querying metadata while parsing with ANTLR). I’m experimenting with ANTLR’s classes, but as I’m new to ANTLR I thought some experienced users could point me in the right direction: where would be good insertion points for custom objects? Should I start by implementing CharStream and overriding some methods? Or does ANTLR provide something that I haven’t found yet?

1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 1:35 am (2026-06-15T01:35:26+00:00)

    Here’s one way to accomplish what I think you’re going for, using the parser to manage matching input to metadata. Note that I made whitespace significant because it’s part of the content and can’t be skipped. I also made periods part of content to simplify the example, rather than using them as a marker.

    SysEx.g

    grammar SysEx;
    
    @header {
        import java.util.List;
    }
    
    @parser::members {
            private List<MyChar> metadata;
            private int curpos;
    
            private boolean isTitleInput(String input) {
                return isFontSizeInput(input, 24);
            }
    
            private boolean isContentInput(String input){
                return isFontSizeInput(input, 12);
            }
    
            private boolean isFontSizeInput(String input, int fontSize){
                List<MyChar> sublist = metadata.subList(curpos, curpos + input.length());
    
                System.out.println(String.format("Testing metadata for input=\%s, font-size=\%d", input, fontSize));
    
                int start = curpos;            
                //move our metadata pointer forward.
                skipInput(input);
    
                for (int i = 0, count = input.length(); i < count; ++i){
                    MyChar chardata = sublist.get(i);
                    char c = input.charAt(i);
                    if (chardata.getText().charAt(0) != c){
                        //This character doesn't match the metadata (ERROR!)
                        System.out.println(String.format("Content mismatch at metadata position \%d: metadata=(\%s,\%d); input=\%c", start + i, chardata.getText(), chardata.getFontSizePx(), c));
                        return false;
                    } else if (chardata.getFontSizePx() != fontSize){
                        //The font is wrong.
                        System.out.println(String.format("Format mismatch at metadata position \%d: metadata=(\%s,\%d); input=\%c", start + i, chardata.getText(), chardata.getFontSizePx(), c));
                        return false;
                    }
                }
    
                //All characters check out.
                return true;
            }
    
            private void skipInput(String str){
                curpos += str.length();
                System.out.println("\t\tMoving metadata pointer ahead by " + str.length() + " to " + curpos);
            }
    }
    
    rule[List<MyChar> metadata]
        @init {
            this.metadata = metadata;
        }
        : entry+ EOF
        ;
    entry
        : title content
        {System.out.println("Finished reading entry.");}
        ;   
    title
        : line {isTitleInput($line.text)}? newline {System.out.println("Finished reading title " + $line.text);}
        ;
    content
        : line {isContentInput($line.text)}? newline {System.out.println("Finished reading content " + $line.text);}
        ;
    newline
        : (NEWLINE{skipInput($NEWLINE.text);})+
        ;
    line returns [String text]
        @init { 
            StringBuilder builder = new StringBuilder();
        }
        @after {
            $text = builder.toString();
        }
        : (ANY{builder.append($ANY.text);})+ 
        ;
    
    NEWLINE:'\r'? '\n';
    ANY: .; //whitespace can't be skipped because it's content.
    

    A title is a line that matches the title metadata (size 24 font) followed by one or more newline characters.

    A content is a line that matches the content metadata (size 12 font) followed by one or more newline characters. As mentioned above, I removed the check for a period for simplification.

    A line is a sequence of characters that does not include newline characters.

    A validating semantic predicate (the {...}? after line) is used to validate that the line matches the metadata.
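
The grammar above tracks the metadata position by hand in curpos. A possible alternative, sketched here under assumptions, is to look the metadata up by character offset. The sketch assumes there is exactly one font-size entry per input character (including \r and \n) and that inclusive start and stop offsets are available for the matched text, for example from a token's getStartIndex() and getStopIndex() in the ANTLR 3 runtime (worth verifying against the runtime version in use). MetadataRange and hasFontSize are hypothetical names, not part of ANTLR.

```java
class MetadataRange {
    /**
     * Returns true if every entry of fontSizes in [startIndex, stopIndex]
     * (both inclusive) equals expected. An invalid range yields false.
     */
    static boolean hasFontSize(int[] fontSizes, int startIndex, int stopIndex, int expected) {
        if (startIndex < 0 || stopIndex >= fontSizes.length || startIndex > stopIndex) {
            return false;
        }
        for (int i = startIndex; i <= stopIndex; i++) {
            if (fontSizes[i] != expected) {
                return false;
            }
        }
        return true;
    }
}
```

With such a lookup the predicate would not need a running pointer, so a failed predicate could not leave the pointer out of step with the input.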

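    Here is the code I used to test the grammar. The imports assume the ANTLR 3 runtime (org.antlr.runtime), which the grammar syntax above appears to target:

    SysExGrammar.java

    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    
    import org.antlr.runtime.ANTLRInputStream;
    import org.antlr.runtime.CharStream;
    import org.antlr.runtime.CommonTokenStream;
    
    public class SysExGrammar {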
        public static void main(String[] args) throws Exception {
            //Create some metadata that matches our input.
            List<MyChar> matchingMetadata = new ArrayList<MyChar>();
            appendMetadata(matchingMetadata, "entryOne\r\n", 24);
            appendMetadata(matchingMetadata, "Here is some content one.\r\n", 12);
            appendMetadata(matchingMetadata, "entryTwo\r\n", 24);
            appendMetadata(matchingMetadata, "Here is some content two.\r\n", 12);
    
            parseInput(matchingMetadata);
    
            System.out.println("Finished example #1");
    
    
            //Create some metadata that doesn't match our input (negative test).
            List<MyChar> mismatchingMetadata = new ArrayList<MyChar>();
            appendMetadata(mismatchingMetadata, "entryOne\r\n", 24);
            appendMetadata(mismatchingMetadata, "Here is some content one.\r\n", 12);
            appendMetadata(mismatchingMetadata, "entryTwo\r\n", 12); //content font size!
            appendMetadata(mismatchingMetadata, "Here is some content two.\r\n", 12);
    
            parseInput(mismatchingMetadata);
    
            System.out.println("Finished example #2");
        }
    
        private static void parseInput(List<MyChar> metadata) throws Exception {
            //Test setup
            InputStream resource = SysExGrammar.class.getResourceAsStream("SysExTest.txt");
    
            CharStream input = new ANTLRInputStream(resource);
    
            resource.close();
    
            SysExLexer lexer = new SysExLexer(input);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
    
            SysExParser parser = new SysExParser(tokens);
            parser.rule(metadata);
    
            System.out.println("Parsing encountered " + parser.getNumberOfSyntaxErrors() + " syntax errors");
        }
    
        private static void appendMetadata(List<MyChar> metadata, String string,
                int fontSize) {
    
            for (int i = 0, count = string.length(); i < count; ++i){
                metadata.add(new MyChar(string.charAt(i) + "", fontSize));
            }
        }
    }
    

    SysExTest.txt (note that this file uses Windows newlines, \r\n)

    entryOne
    Here is some content one.
    entryTwo
    Here is some content two.
    

    Test output (trimmed; the second example has deliberately-mismatched metadata):

    Parsing encountered 0 syntax errors
    Finished example #1
    Parsing encountered 2 syntax errors
    Finished example #2
    

    This solution requires that each MyChar corresponds to a character in the input (including newline characters, although you can remove that limitation if you like — I would remove it if I didn’t already have this answer written up 😉 ).
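
One way the newline limitation might be lifted is sketched below. It assumes the metadata list holds entries only for line characters and none for \r or \n, so the grammar's newline rule would no longer call skipInput. MetadataCursor, matchesLine, and run are hypothetical names; MyChar has the same shape as in the question and is repeated only so the sketch compiles on its own.

```java
import java.util.ArrayList;
import java.util.List;

class MetadataCursor {
    /** Same shape as MyChar in the question, repeated here for self-containment. */
    static final class MyChar {
        private final String text;
        private final int fontSizePx;

        MyChar(String text, int fontSizePx) {
            this.text = text;
            this.fontSizePx = fontSizePx;
        }

        String getText() { return text; }
        int getFontSizePx() { return fontSizePx; }
    }

    private final List<MyChar> metadata;
    private int pos; // index of the next unconsumed metadata entry

    MetadataCursor(List<MyChar> metadata) {
        this.metadata = metadata;
    }

    /** Consumes one line's worth of metadata; true if characters and font size all match. */
    boolean matchesLine(String line, int fontSize) {
        int start = pos;
        // Always advance (as isFontSizeInput does), clamped to the end of the list.
        pos = Math.min(start + line.length(), metadata.size());
        if (start + line.length() > metadata.size()) {
            return false; // not enough metadata left for this line
        }
        for (int i = 0; i < line.length(); i++) {
            MyChar m = metadata.get(start + i);
            boolean sameChar = m.getText().equals(String.valueOf(line.charAt(i)));
            if (!sameChar || m.getFontSizePx() != fontSize) {
                return false;
            }
        }
        return true;
    }

    /** Convenience helper: one MyChar per character of text, all with the same font size. */
    static List<MyChar> run(String text, int fontSizePx) {
        List<MyChar> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            result.add(new MyChar(String.valueOf(text.charAt(i)), fontSizePx));
        }
        return result;
    }
}
```

In the grammar, the predicates would call matchesLine on a cursor held in the parser members instead of isFontSizeInput, and the metadata would be built from the line text without the trailing \r\n.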

    As you can see, it’s possible to tie the metadata to the parser and everything works as expected. I hope this helps.
