I have an EBNF grammar that has a few rules with this pattern:
sequence ::=
item
| item extra* sequence
Is the above equivalent to the following?
sequence ::=
item (extra* sequence)*
Edit
Since some of you have observed bugs or ambiguities in both sequences, I’ll give a specific example. The SVG specification provides a grammar for path data. This grammar has several productions with this pattern:
lineto-argument-sequence:
coordinate-pair
| coordinate-pair comma-wsp? lineto-argument-sequence
Could the above be rewritten as the following?
lineto-argument-sequence:
coordinate-pair (comma-wsp? lineto-argument-sequence)*
Not really, they seem to have different bugs. The first sequence is ambiguous around “item” seeing that “extra” is optional. You could rewrite it as the following to remove ambiguity:
The second one is ambiguous around “extra”, seeing as it is basically two nested loops both starting with “extra”. You could rewrite it as the following to remove ambiguity:
Your first version will likely choke on an input sequence consisting of a single “item” (it depends on the parser implementation) because it won’t disambiguate.
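To make the difference between the two forms concrete, here is a rough Python sketch that counts how many distinct parse trees each form from the question admits for a given input. It is only an illustration under some assumptions: `i` and `e` are single-character stand-ins for “item” and “extra”, and the function names `count_trees_recursive` and `count_trees_starred` are made up for this example.

```python
from functools import lru_cache


def count_trees_recursive(s):
    """Parse trees for: sequence ::= item | item extra* sequence"""

    @lru_cache(maxsize=None)
    def seq(a):
        # Number of ways to read s[a:] as a sequence.
        if a >= len(s) or s[a] != "i":
            return 0
        total = 1 if a + 1 == len(s) else 0  # alternative 1: a lone item
        k = a + 1
        while k < len(s) and s[k] == "e":  # extra* must take the whole run,
            k += 1                         # because sequence starts with item
        return total + seq(k)              # alternative 2: item extra* sequence

    return seq(0)


def count_trees_starred(s):
    """Parse trees for: sequence ::= item (extra* sequence)*"""

    @lru_cache(maxsize=None)
    def seq(a, b):
        # Number of ways to read s[a:b] as a sequence.
        if a >= b or s[a] != "i":
            return 0
        return groups(a + 1, b)

    @lru_cache(maxsize=None)
    def groups(a, b):
        # Number of ways to read s[a:b] as (extra* sequence)*.
        if a == b:
            return 1
        k = a
        while k < b and s[k] == "e":
            k += 1
        # Try every end position m for the first nested sequence.
        return sum(seq(k, m) * groups(m, b) for m in range(k + 1, b + 1))

    return seq(0, len(s))


for text in ["i", "iei", "iii", "iiii"]:
    print(text, count_trees_recursive(text), count_trees_starred(text))
```

Under this counting the recursive form yields one tree per accepted string, so its difficulty for a predictive parser is the shared “item” prefix of its two alternatives. The starred form yields several trees as soon as three items appear, because an item can attach to either the outer or the inner loop.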
My rewrites assume you want to match a sequence starting with “item” and optionally followed by a series of (0 or more) “item” or “extra” in any order.
e.g.
Without additional information I would personally lean towards the option I labeled “sequence4”, as all the other options are merely using recursion as an expensive loop construct. If you are willing to give me more information I may be able to give a better answer.
EDIT: based on Jorn’s excellent observation (with a small mod).
If you rewrite “sequence3” to remove recursion you get the following:
I think this will be my preferred version, not “sequence4”.
I have to point out that all three versions above are functionally equivalent (as recognizers or generators). The parse trees for 3 would be different to 4 and 5, but I cannot think that that would affect anything other than perhaps performance.
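An equivalence claim of this kind can be checked mechanically, at least up to a bounded length. The sketch below does this for the two forms given in the question (not for my rewrites), assuming `i` and `e` as single-character stand-ins for “item” and “extra”. It builds the set of strings each form generates up to `MAX_LEN` by fixed-point iteration; all helper names are invented for the example, and a bounded check is evidence rather than a proof.

```python
MAX_LEN = 6  # only strings up to this length are compared


def bounded(strings):
    return {s for s in strings if len(s) <= MAX_LEN}


def concat(a, b):
    # All concatenations x + y that still fit within MAX_LEN.
    return bounded(x + y for x in a for y in b)


def star(a):
    # Kleene star of a, restricted to strings within MAX_LEN.
    result = {""}
    frontier = {""}
    while frontier:
        frontier = concat(frontier, a) - result
        result |= frontier
    return result


def fixpoint(step):
    # Apply the production repeatedly until no new strings appear.
    current = set()
    while True:
        nxt = step(current)
        if nxt == current:
            return current
        current = nxt


ITEM, EXTRA = {"i"}, {"e"}


def form1(seq):
    # sequence ::= item | item extra* sequence
    return ITEM | concat(concat(ITEM, star(EXTRA)), seq)


def form2(seq):
    # sequence ::= item (extra* sequence)*
    return concat(ITEM, star(concat(star(EXTRA), seq)))


lang1 = fixpoint(form1)
lang2 = fixpoint(form2)
print(lang1 == lang2, len(lang1))
```

Within the bound, both forms generate exactly the strings that start and end with an item, so they agree as recognizers even though their parse trees differ.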
EDIT:
Concerning the following:
What this production says is that a lineto-argument-sequence is composed of at least one coordinate-pair followed by zero or more coordinate-pairs separated by optional whitespace or a comma. Any of the following would constitute a lineto-argument-sequence (read -> as ‘becomes’):
So a coordinate-pair is really any 2 consecutive numbers.
I have mocked up a grammar in ANTLR that seems to work. Note the pattern used for lineto_argument_sequence is similar to the one Jorn and I recommended previously.
Given the following input:
it produces this parse tree.
[Image: parse tree] http://www.freeimagehosting.net/uploads/85fc77bc3c.png
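For readers without ANTLR at hand, the same iterative pattern can be sketched as a small hand-written scanner in Python. This is not the ANTLR grammar mentioned above, only an illustration of reading a lineto-argument-sequence as “a number, then more numbers with optional comma/whitespace between them, taken two at a time”. The number pattern is a simplified approximation of the SVG one, and the function name `parse_lineto_argument_sequence` is hypothetical.

```python
import re

# Simplified number syntax (an assumption): optional sign, digits with an
# optional fraction, optional exponent.
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?")
# comma-wsp: (wsp+ comma? wsp*) | (comma wsp*)
COMMA_WSP = re.compile(r"(?:[ \t\r\n]+,?[ \t\r\n]*|,[ \t\r\n]*)")


def parse_lineto_argument_sequence(text):
    """Return a list of (x, y) pairs, or raise ValueError on bad input."""
    pos = 0
    numbers = []
    while True:
        match = NUMBER.match(text, pos)
        if not match:
            raise ValueError(f"expected a number at offset {pos}")
        numbers.append(float(match.group()))
        pos = match.end()
        if pos == len(text):
            break
        # The separator is optional: "10-20" is two numbers with none.
        sep = COMMA_WSP.match(text, pos)
        if sep:
            pos = sep.end()
    if len(numbers) % 2:
        raise ValueError("odd number of coordinates")
    # Group consecutive numbers into coordinate pairs.
    return list(zip(numbers[0::2], numbers[1::2]))


print(parse_lineto_argument_sequence("10,20 30,40"))
print(parse_lineto_argument_sequence("10-20"))
```

The loop replaces the recursion in the original production, which is the same design choice as in the rewrites above: the sequence is consumed iteratively and the pairing is done afterwards.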