Asked: May 20, 2026

I am writing a parser for Delphi's DFM files. The lexer looks like this:

EXP ([Ee][-+]?[0-9]+)

%%

("#"([0-9]{1,5}|"$"[0-9a-fA-F]{1,6})|"'"([^']|'')*"'")+ { 
                                                 return tkStringLiteral; }
"object" { return tkObjectBegin; }
"end" { return tkObjectEnd; }
"true" { /*yyval.boolean = true;*/ return tkBoolean; }
"false" { /*yyval.boolean = false;*/ return tkBoolean; }

"+" |
"." |
"(" |
")" |
"[" |
"]" |
"{" |
"}" |
"<" |
">" |
"=" |
"," |
":" { return yytext[0]; }

[+-]?[0-9]{1,10} { /*yyval.integer = atoi(yytext);*/ return tkInteger; }
[0-9A-F]+ { return tkHexValue; }
[+-]?[0-9]+"."[0-9]+{EXP}? { /*yyval.real = atof(yytext);*/ return tkReal; }
[a-zA-Z_][0-9A-Z_]* { return tkIdentifier; }
"$"[0-9A-F]+ { /* yyval.integer = atoi(yytext);*/ return tkHexNumber; }

[ \t\r\n] { /* ignore whitespace */ }
. { std::cerr << boost::format("Mystery character %c\n") % *yytext; }

<<EOF>> { yyterminate(); }

%%

and the bison grammar looks like

%token tkInteger
%token tkReal
%token tkIdentifier
%token tkHexValue
%token tkHexNumber
%token tkObjectBegin
%token tkObjectEnd
%token tkBoolean
%token tkStringLiteral

%%

object:
    tkObjectBegin tkIdentifier ':' tkIdentifier 
          property_assignment_list tkObjectEnd
  ;

property_assignment_list:
    property_assignment
  | property_assignment_list property_assignment
  ;

property_assignment:
    property '=' value
  | object
  ;

property:
    tkIdentifier
  | property '.' tkIdentifier
  ;

value:
    atomic_value
  | set
  | binary_data
  | strings
  | collection
  ;

atomic_value:
    tkInteger
  | tkReal
  | tkIdentifier
  | tkBoolean
  | tkHexNumber
  | long_string
  ;

long_string:
    tkStringLiteral
  | long_string '+' tkStringLiteral
  ;

atomic_value_list:
    atomic_value
  | atomic_value_list ',' atomic_value
  ;

set:
    '[' ']'
  | '[' atomic_value_list ']'
  ;

binary_data:
    '{' '}'
  | '{' hexa_lines '}'
  ;

hexa_lines:
    tkHexValue
  | hexa_lines tkHexValue
  ;

strings:
    '(' ')'
  | '(' string_list ')'
  ;

string_list:
    tkStringLiteral
  | string_list tkStringLiteral
  ;

collection:
    '<' '>'
  | '<' collection_item_list '>'
  ;

collection_item_list:
    collection_item
  | collection_item_list collection_item
  ;

collection_item:
    tkIdentifier property_assignment_list tkObjectEnd
  ;

%%

void yyerror(const char *s, ...) {...}

The problem with this grammar occurs while parsing binary data. Binary data in DFM files is nothing
but a sequence of hexadecimal characters, with no more than 80 characters per line. An example:

Picture.Data = {
      055449636F6E0000010001002020000001000800A80800001600000028000000
      2000000040000000010008000000000000000000000000000000000000000000

      ...

      FF00000000000000000000000000000000000000000000000000000000000000
      00000000FF000000FF000000FF00000000000000000000000000000000000000
      00000000}

As you can see, this element lacks any markers, so its contents clash with other tokens. In the example
above, the first line returns the proper token, tkHexValue. The second, however, returns a tkInteger token
and the third a tkIdentifier token. So when parsing begins, it fails with a syntax error because
binary data is composed only of tkHexValue tokens.
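
The tokens reported above are consistent with flex's two disambiguation rules: the longest match wins, and when two rules match the same length, the rule listed first wins. A minimal sketch, assuming the rule order and the unbounded integer pattern the lexer had before the workarounds described next (this ordering is reconstructed from the description and is not shown in the question):

```lex
[+-]?[0-9]+           { return tkInteger; }    /* assumed: no length limit yet */
[a-zA-Z_][0-9A-Z_]*   { return tkIdentifier; } /* assumed: listed before tkHexValue */
[0-9A-F]+             { return tkHexValue; }
```

Under those assumptions, `055449636F6E...` yields tkHexValue because the integer rule stops at the first `F` while the hex rule matches the whole run. The all-digit second line ties between tkInteger and tkHexValue and goes to the earlier rule. `FF0000...` ties between tkIdentifier and tkHexValue and, again, goes to the earlier rule.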

My first workaround was to require integers to have a maximum length (which helped for all but the last line
of the binary data). The second was to move the tkHexValue rule above the tkIdentifier rule, but that means
I can no longer have identifiers like F0.

Is there any way to fix this grammar?

1 Answer

Editorial Team · Answer 1 · 2026-05-20T05:53:05+00:00

OK, I solved this one. I needed to define a start condition so that tkHexValue is only returned while reading binary data. In the definitions section of the lexer I added

%x BINARY

and modified the following rules:

"{" {BEGIN BINARY; return yytext[0];}
<BINARY>"}" {BEGIN INITIAL; return yytext[0];}
<BINARY>[ \t\r\n] { /* ignore whitespace */ }

And that was all!
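
The answer lists only the rules that changed. Because `%x` declares an exclusive start condition, flex ignores every rule without a `<BINARY>` prefix while that state is active, so the hex rule has to carry the prefix as well. The fragment below is a minimal sketch of how the pieces might fit together, not the answerer's actual file: the prefixed tkHexValue rule and the `<BINARY>.` catch-all are assumptions added here, and the token names come from the question.

```lex
%x BINARY

%%

"{"                 { BEGIN BINARY; return yytext[0]; }
<BINARY>"}"         { BEGIN INITIAL; return yytext[0]; }
<BINARY>[0-9A-F]+   { return tkHexValue; }  /* assumption: replaces the unprefixed rule */
<BINARY>[ \t\r\n]   { /* ignore whitespace */ }
<BINARY>.           { std::cerr << "Unexpected character in binary data: " << *yytext << '\n'; }  /* assumption */
```

With this arrangement, `00000000}` lexes as tkHexValue followed by `'}'`, and `F0` outside braces is free to be a tkIdentifier again. The `"{"` rule must appear before any other rule that also matches a single `{`, since flex resolves equal-length matches in favour of the earlier rule. An `<<EOF>>` rule given without a start condition still applies in every state.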
