Creating an ANTLR 3 grammar to parse a search-engine-like query language

Question

Editorial Team

Asked: June 15, 2026

I need help creating an ANTLR 3 grammar that parses a search-engine-like query language.

The grammar should allow for:

Omitting AND between terms. Example: dog cat = dog AND cat
AND taking precedence over OR. Example: cat dog or boat = (cat AND dog) or boat
Arbitrary nesting of terms in parentheses. Example: cat OR( dog and (fish cow) OR bird)

Implementing all of the criteria above is a challenge for me. Please have a look at my grammar and point out errors and fixes; I could not get it to satisfy all of the criteria properly.

Grammar

tokens {
FOR;
END;
FIELDSEARCH;
TARGETFIELD;
RELATION;
ANDNODE;
}
startExpression  : orEx;

expressionLevel4    
: LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression;

expressionLevel3    
: (fieldExpression) | expressionLevel4 ;

expressionLevel2    
: (nearExpression) | expressionLevel3 ;

expressionLevel1    
: (countExpression) | expressionLevel2 ;

notEx   : (NOT^)? expressionLevel1;

andEx   : (notEx        -> notEx)
(AND? a=notEx -> ^(ANDNODE $andEx $a))*;

orEx    : andEx (OR^  andEx)*;

countExpression  : COUNT LPARENTHESIS WORD RPARENTHESIS (LESSTHEN|MORETHEN) EQUAL? NUMBERS -> ^(COUNT WORD ^(RELATION LESSTHEN? MORETHEN? EQUAL?) NUMBERS);

nearExpression  : NEAR^ LPARENTHESIS! (WORD|PHRASE) MULTIPLESEPERATOR! (WORD|PHRASE) MULTIPLESEPERATOR! NUMBERS RPARENTHESIS!;

fieldExpression : WORD PROPERTYSEPERATOR WORD -> ^(FIELDSEARCH ^(TARGETFIELD WORD));

atomicExpression 
: WORD
| PHRASE ;


LPARENTHESIS : '(';
RPARENTHESIS : ')';

LESSTHEN : '<';
MORETHEN : '>';
EQUAL : '=';

AND    : ('A'|'a')('N'|'n')('D'|'d');
OR     : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT    : ('N'|'n')('O'|'o')('T'|'t');
COUNT:('C'|'c')('O'|'o')('U'|'u')('N'|'n')('T'|'t');
NEAR:('N'|'n')('E'|'e')('A'|'a')('R'|'r');
PROPERTYSEPERATOR : ':';
MULTIPLESEPERATOR : ',';

fragment NUMBER : ('0'..'9');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'?');
fragment QUOTE     : ('"');

fragment SPACE     : ('\u0009'|'\u0020'|'\u000C'|'\u00A0');

//fragment UNICODENOSPACES  :  ('\u0000'..'\u0008'|'\u0010'..'\u0019'|'\u0021'..'\009F'|'\u00A1'..'\009F');
fragment UNICODENOSPACES  :  ('\u0021'..'\u0039'|'\u003B'..'\u007E'|'\u00A1'..'\uFFFF');

WS     : (SPACE) { $channel=HIDDEN; };
NUMBERS : (NUMBER)+;
PHRASE : (QUOTE)(CHARACTER)+((SPACE)+(CHARACTER)+)+(QUOTE);
WORD   : (UNICODENOSPACES)+;

Given the input:

title:cats AND  fish OR Bird AND (bird and dirt) OR (bart or title:bard OR bird AND title:dort)

This AST is created. Note the ( and ) characters that got captured in WORD tokens.
[Image: the resulting AST]

There might be other errors or goofy implementation details. It's my first shot at using ANTLR.

1 Answer

Editorial Team · Answer 1 · 2026-06-15T04:41:54+00:00

For a first go at ANTLR, you did more than a good job.

The '(' and ')' end up in your WORD tokens because the range '\u0021'..'\u0039' contains both parentheses ('\u0028' and '\u0029'). ANTLR's lexer matches characters greedily and tries to match as much as possible. Because of that, it creates a single token from input like "(bird" (a WORD token) rather than two tokens (an LPARENTHESIS and a WORD). Just make sure parentheses are not included in whatever WORD needs to match.

If I copy your grammar and change WORD into:

WORD : CHARACTER+;

your input is parsed as this:

[Image: the resulting AST]
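
The greedy matching described above can be imitated outside ANTLR. The following is only a rough Python sketch, not ANTLR's actual lexing algorithm: the `tokenize` helper and the two regex constants are hypothetical stand-ins for the lexer rules, and only LPARENTHESIS, RPARENTHESIS and WORD are modeled (so "and" simply comes out as a WORD here).

```python
import re

# Hypothetical regex stand-ins for the grammar's lexer rules (Python syntax, not ANTLR).
WORD_ORIGINAL = r"[\u0021-\u0039\u003B-\u007E\u00A1-\uFFFF]+"  # UNICODENOSPACES+, includes '(' and ')'
WORD_FIXED = r"[a-zA-Z0-9*?]+"                                 # CHARACTER+

def tokenize(text, word_pattern):
    """Split text into (rule, lexeme) pairs, always taking the longest match."""
    rules = [
        ("LPARENTHESIS", re.compile(r"\(")),
        ("RPARENTHESIS", re.compile(r"\)")),
        ("WORD", re.compile(word_pattern)),
    ]
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():  # whitespace is skipped, like the hidden WS channel
            pos += 1
            continue
        best = None
        for name, rx in rules:
            m = rx.match(text, pos)
            # Keep the rule that consumes the most characters; earlier rules win ties.
            if m and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens

print(tokenize("(bird and dirt)", WORD_ORIGINAL))
print(tokenize("(bird and dirt)", WORD_FIXED))
```

With the original WORD pattern, "(bird" and "dirt)" each come out as one WORD; with the CHARACTER+ version the parentheses become tokens of their own.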

EDIT

Is it possible to have parentheses as a normal part of a term? E.g. have blabla(bla( a)blabla recognized as 2 WORDs? The parser would have to decide whether parentheses introduce a subterm or are just normal characters forming a WORD.

You could do it like this at the lexer level, assuming the parentheses in (a... and ...a) are part of the expression, not part of a WORD:

WORD : UNICODENOSPACES ((UNICODENOSPACES | '(' | ')')* UNICODENOSPACES)?;

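Only parentheses inside a WORD are now permitted. This assumes '(' and ')' have been removed from UNICODENOSPACES itself; as defined in the question, its first range still contains them. You could go further by allowing a ( at the end of a WORD to be valid too, but I'm not sure that would be a good idea.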
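
To get a feel for which strings such a WORD shape accepts, here is a small Python regex sketch. It rests on two assumptions: `NOSPACE` is a hypothetical character class mirroring UNICODENOSPACES with '(' and ')' taken out, and Python's regex engine backtracks, which an ANTLR 3 lexer may not do in the same way, so treat the results as an illustration of the intended shape rather than a prediction of ANTLR's output.

```python
import re

# Assumption: UNICODENOSPACES without '(' (\u0028) and ')' (\u0029).
NOSPACE = r"[\u0021-\u0027\u002A-\u0039\u003B-\u007E\u00A1-\uFFFF]"

# Same shape as the WORD rule: parentheses may appear only between two NOSPACE characters.
WORD = re.compile(rf"{NOSPACE}(?:(?:{NOSPACE}|[()])*{NOSPACE})?")

def leading_word(text):
    """Return the WORD matched at the start of text, or None if there is none."""
    m = WORD.match(text)
    return m.group(0) if m else None

for sample in ["blabla(bla(", "a)blabla", "(bird", "bird)"]:
    print(repr(sample), "->", repr(leading_word(sample)))
```

In this sketch a trailing ( or ) is left over for the parenthesis tokens, and a leading ( never starts a WORD.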
