I need help creating a grammar with ANTLR3 that parses a search-engine-like query syntax.
The grammar should allow for:
- omitting AND between terms. Example: dog cat = dog AND cat
- AND taking precedence over OR: cat dog or boat = (cat AND dog) or boat
- arbitrary nesting of terms in parentheses: cat OR( dog and (fish cow) OR bird)
Implementing all of the criteria above is a challenge (for me). Please have a look at my grammar and point out errors and fixes, as I was not able to satisfy all the criteria properly.
Grammar
tokens {
FOR;
END;
FIELDSEARCH;
TARGETFIELD;
RELATION;
ANDNODE;
}
startExpression : orEx;
expressionLevel4
: LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression;
expressionLevel3
: (fieldExpression) | expressionLevel4 ;
expressionLevel2
: (nearExpression) | expressionLevel3 ;
expressionLevel1
: (countExpression) | expressionLevel2 ;
notEx : (NOT^)? expressionLevel1;
andEx : (notEx -> notEx)
(AND? a=notEx -> ^(ANDNODE $andEx $a))*;
orEx : andEx (OR^ andEx)*;
countExpression : COUNT LPARENTHESIS WORD RPARENTHESIS (LESSTHEN|MORETHEN) EQUAL? NUMBERS -> ^(COUNT WORD ^(RELATION LESSTHEN? MORETHEN? EQUAL?) NUMBERS);
nearExpression : NEAR^ LPARENTHESIS! (WORD|PHRASE) MULTIPLESEPERATOR! (WORD|PHRASE) MULTIPLESEPERATOR! NUMBERS RPARENTHESIS!;
fieldExpression : WORD PROPERTYSEPERATOR WORD -> ^(FIELDSEARCH ^(TARGETFIELD WORD));
atomicExpression
: WORD
| PHRASE ;
LPARENTHESIS : '(';
RPARENTHESIS : ')';
LESSTHEN : '<';
MORETHEN : '>';
EQUAL : '=';
AND : ('A'|'a')('N'|'n')('D'|'d');
OR : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT : ('N'|'n')('O'|'o')('T'|'t');
COUNT:('C'|'c')('O'|'o')('U'|'u')('N'|'n')('T'|'t');
NEAR:('N'|'n')('E'|'e')('A'|'a')('R'|'r');
PROPERTYSEPERATOR : ':';
MULTIPLESEPERATOR : ',';
fragment NUMBER : ('0'..'9');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'?');
fragment QUOTE : ('"');
fragment SPACE : ('\u0009'|'\u0020'|'\u000C'|'\u00A0');
//fragment UNICODENOSPACES : ('\u0000'..'\u0008'|'\u0010'..'\u0019'|'\u0021'..'\009F'|'\u00A1'..'\009F');
fragment UNICODENOSPACES : ('\u0021'..'\u0039'|'\u003B'..'\u007E'|'\u00A1'..'\uFFFF');
WS : (SPACE) { $channel=HIDDEN; };
NUMBERS : (NUMBER)+;
PHRASE : (QUOTE)(CHARACTER)+((SPACE)+(CHARACTER)+)+(QUOTE);
WORD : (UNICODENOSPACES)+;
Given the input:
title:cats AND fish OR Bird AND (bird and dirt) OR (bart or title:bard OR bird AND title:dort)
This AST is created. Note the ( and ) characters that got captured in WORD terms.

There might be other errors or goofy implementation details. It's my first shot at using ANTLR.
For a first go at ANTLR, you did more than a good job.
The reason there are '(' and ')' characters in your WORD tokens is that the range '\u0021'..'\u0039' contains both parentheses ('\u0028' and '\u0029'). ANTLR’s lexer matches characters greedily and tries to match as much as possible (!). Because of that last rule (matching as many characters as possible), it creates a single token from input like "(bird" (a WORD token), and not two tokens (an LPARENTHESIS and a WORD). Just make sure parentheses are not included in whatever WORD needs to match.
If I copy your grammar and change WORD so that it no longer matches parentheses, your input is parsed with the parentheses no longer ending up inside WORD tokens.
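A minimal sketch of a WORD rule that leaves parentheses out, assuming the only change needed is to drop '(' (\u0028) and ')' (\u0029) from the first range of UNICODENOSPACES. The split into two sub-ranges is an illustration and not necessarily the exact rule used for the parse described here. The rest of the grammar is assumed to stay as posted:

```
// Sketch only: the original ranges, with '\u0021'..'\u0039' split so that
// '(' (\u0028) and ')' (\u0029) are no longer part of it.
fragment UNICODENOSPACES
  : '\u0021'..'\u0027'   // '!' up to '\''
  | '\u002A'..'\u0039'   // '*' up to '9'
  | '\u003B'..'\u007E'   // ';' up to '~' (':' stays excluded, as before)
  | '\u00A1'..'\uFFFF'
  ;

WORD : (UNICODENOSPACES)+;
```

With these ranges, '(' and ')' can no longer be part of a WORD at all, so input like "(bird" can only be tokenized as an LPARENTHESIS followed by a WORD.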
EDIT
You could do it like this on a lexer level, assuming the parentheses in (a...and...a) are part of the expression, not part of a WORD:
Only parentheses inside a WORD are now permitted. You could go further by allowing a ( at the end of a WORD to be valid too, but I’m not sure if that’d be a good idea.