I begin with an otherwise well-formed (and working) grammar for a language: variables, binary operators, function calls, lists, loops, conditionals, etc. To this grammar I’d like to add what I’m calling the object construct:
object
    : object_name ARROW more_objects
    ;
more_objects
    : object_name
    | object_name ARROW more_objects
    ;
object_name
    : IDENTIFIER
    ;
The point is to be able to access scalars nested in objects. For example:
car->color
monster->weapon->damage
pc->tower->motherboard->socket_type
I’m adding object as a primary_expression:
primary_expression
    : id_lookup
    | constant_value
    | '(' expression ')'
    | list_initialization
    | function_call
    | object
    ;
Now here’s a sample script:
const list = [ 1, 2, 3, 4 ];
for var x in list {
send "foo " + x + "!";
}
send "Done!";
Prior to adding the nonterminal object as a primary_expression everything is sunshine and puppies. Even after I add it, Bison doesn’t complain: no shift/reduce or reduce/reduce conflicts are reported, and the generated code compiles without a sound. But when I try to run the sample script above, I get: error on line 2: Attempting to use undefined symbol '{' on line 2.
If I change the script to:
var list = 0;
for var x in [ 1, 2, 3, 4 ] {
send "foo " + x + "!";
}
send "Done!";
Then I get error on line 3: Attempting to use undefined symbol '+' on line 3.
Clearly the presence of object in the grammar is messing up how the parser behaves [SOMEhow], and I feel like I’m ignoring a rather simple principle of language theory that would fix this in a jiff, but the fact that there aren’t any shift/reduce conflicts has left me bewildered.
Is there a better way (grammatically) to write these rules? What am I missing? Why aren’t there any conflicts?
(And here’s the full grammar file in case it helps)
UPDATE: To clarify, this language, which compiles into code being run by a virtual machine, is embedded into another system – a game, specifically. It has scalars and lists, and there are no complex data types. When I say I want to add objects to the language, that’s actually a misnomer. I am not adding support for user-defined types to my language.
The objects being accessed with the object construct are actually objects from the game which I’m allowing the language processor to access through an intermediate layer which connects the VM to the game engine. This layer is designed to decouple as much as possible the language definition and the virtual machine mechanics from the implementation and details of the game engine.
So when, in my language, I write:
player->name
That only gets codified by the compiler. “player” and “name” are not traditional identifiers because they are not added to the symbol table, and nothing is done with them at compile time except to translate the request for the name of the player into 3-address code.
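To make that concrete, here is a rough sketch in C of what such a translation could look like. Everything specific in it is invented for illustration: the instruction names GAME_OBJECT and GAME_FIELD, the temporaries t0, t1, ..., and the function lower_object_chain are assumptions, not the actual compiler or VM.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical lowering of "a->b->c" into three-address-style text:
 *   t0 = GAME_OBJECT "a"
 *   t1 = GAME_FIELD t0, "b"
 *   t2 = GAME_FIELD t1, "c"
 * Returns the index of the temporary holding the final value,
 * or -1 if the chain is malformed or the buffer is too small. */
int lower_object_chain(const char *chain, char *out, size_t out_size)
{
    size_t used = 0;
    int temp = -1;              /* index of the last temporary written */
    const char *p = chain;

    assert(chain != NULL && out != NULL);

    while (*p != '\0') {
        const char *arrow = strstr(p, "->");
        size_t len = arrow ? (size_t)(arrow - p) : strlen(p);
        int n;

        if (len == 0)
            return -1;          /* empty name, e.g. "->a" or "a->->b" */

        if (temp < 0)
            n = snprintf(out + used, out_size - used,
                         "t0 = GAME_OBJECT \"%.*s\"\n", (int)len, p);
        else
            n = snprintf(out + used, out_size - used,
                         "t%d = GAME_FIELD t%d, \"%.*s\"\n",
                         temp + 1, temp, (int)len, p);

        if (n < 0 || (size_t)n >= out_size - used)
            return -1;          /* output did not fit */
        used += (size_t)n;
        temp++;

        p += len;
        if (arrow) {
            p += 2;             /* skip the "->" */
            if (*p == '\0')
                return -1;      /* trailing arrow, e.g. "a->" */
        }
    }
    return temp;
}
```

The names never touch a symbol table here; they are only copied into the emitted instructions as string operands, which is the behaviour described above.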
So I spent a reasonable amount of time picking over the grammar (and the bison output) and can’t see what is obviously wrong here. Without having the means to execute it, I can’t easily figure out what is going on by experimentation. Therefore, here are some concrete steps I usually go through when debugging grammars. Hopefully you can do any of these you haven’t already done and then perhaps post follow-ups (or edit your question) with any results that might be revealing:
1. Test a grammar containing only the object and more_objects rules, joined by ARROW. Does this work as you expect?
2. Try replacing object with some other very simple rule (using some tokens not occurring elsewhere) and seeing if you can include those tokens without it breaking everything else.
3. Run bison with --report=all. Inspect the output to try to trace the rules you’ve added and the states that they affect. Try removing those rules and repeat the process: what has changed? This is often extremely time-consuming and a giant pain, but it’s a good last resort. I recommend a pencil and some paper.
Looking at the structure of your error output: '+' is being recognised as an identifier token, and is therefore being looked up as a symbol. It might be worth checking your lexer to see how it is processing identifier tokens. You might just accidentally be grabbing too much. As a further debugging technique, you might consider turning some of those token literals (e.g. '+', '{', etc.) into real tokens so that bison’s error reporting can help you out a little more.
EDIT: OK, the more I’ve dug into it, the more I’m convinced that the lexer is not necessarily working as it should be. I would double-check that the stream of tokens you are getting from yylex() matches your expectations before proceeding any further. In particular, it looks like a bunch of symbols that you consider special (e.g. '+' and '{') are being captured by some of your regular expressions, or at least are being allowed to pass for identifiers.
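One way to do that check is to print every token before the parser ever sees it. The sketch below is self-contained C, so it uses a small hand-written stand-in (next_token) in place of your real yylex(); the token codes, the lexeme buffer and dump_tokens are all hypothetical names. The stand-in only accepts [A-Za-z_][A-Za-z0-9_]* as an identifier and returns any other single character as its own character code, which is how literals such as '+' and '{' normally reach a bison parser. String literals are left out for brevity.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token codes; in a real build they come from the header
 * bison generates. Values are kept above the single-character range. */
enum { TOK_EOF = 0, TOK_IDENTIFIER = 258, TOK_NUMBER = 259, TOK_ARROW = 260 };

static const char *cursor;      /* next unread input character */
static char lexeme[64];         /* text of the most recent token */

/* Stand-in for yylex(): strict identifier rule, no catch-all pattern. */
static int next_token(void)
{
    size_t n = 0;

    while (isspace((unsigned char)*cursor))
        cursor++;
    if (*cursor == '\0')
        return TOK_EOF;

    if (isalpha((unsigned char)*cursor) || *cursor == '_') {
        while (isalnum((unsigned char)*cursor) || *cursor == '_') {
            if (n < sizeof lexeme - 1)
                lexeme[n++] = *cursor;
            cursor++;
        }
        lexeme[n] = '\0';
        return TOK_IDENTIFIER;
    }
    if (isdigit((unsigned char)*cursor)) {
        while (isdigit((unsigned char)*cursor)) {
            if (n < sizeof lexeme - 1)
                lexeme[n++] = *cursor;
            cursor++;
        }
        lexeme[n] = '\0';
        return TOK_NUMBER;
    }
    if (cursor[0] == '-' && cursor[1] == '>') {
        cursor += 2;
        strcpy(lexeme, "->");
        return TOK_ARROW;
    }
    /* Anything else is a one-character token returned as itself. */
    lexeme[0] = *cursor;
    lexeme[1] = '\0';
    return (unsigned char)*cursor++;
}

/* Writes one "KIND text" line per token into out.
 * Returns the token count, or -1 if out is too small. */
int dump_tokens(const char *input, char *out, size_t out_size)
{
    size_t used = 0;
    int count = 0, tok;

    assert(input != NULL && out != NULL);
    cursor = input;
    if (out_size > 0)
        out[0] = '\0';

    while ((tok = next_token()) != TOK_EOF) {
        const char *kind = tok == TOK_IDENTIFIER ? "IDENTIFIER"
                         : tok == TOK_NUMBER     ? "NUMBER"
                         : tok == TOK_ARROW      ? "ARROW"
                         : "CHAR";
        int n = snprintf(out + used, out_size - used, "%s %s\n", kind, lexeme);

        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
        count++;
    }
    return count;
}
```

With a lexer behaving like this, an input such as `x + y {` should list '+' and '{' as CHAR lines, never as IDENTIFIER. If the same kind of dump from your real yylex() shows them as identifiers, the identifier pattern (or a rule above it) is matching too much.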