I’m looking to speed along my discovery process here quite a bit, as this

Question

0

Asked: May 13, 20262026-05-13T20:40:14+00:00 2026-05-13T20:40:14+00:00

I’m looking to speed along my discovery process here quite a bit, as this

0

I’m looking to speed along my discovery process here quite a bit, as this is my first venture into the world of lexical analysis. Maybe this is even the wrong path. First, I’ll describe my problem:

I’ve got very large properties files (in the order of 1,000 properties), which when distilled, are really just about 15 important properties and the rest can be generated or rarely ever change.

So, for example:

general {
  name = myname
  ip = 127.0.0.1
}

component1 {
   key = value
   foo = bar
}

This is the type of format I want to create to tokenize something like:

property.${general.name}blah.home.directory = /blah
property.${general.name}.ip = ${general.ip}
property.${component1}.ip = ${general.ip}
property.${component1}.foo = ${component1.foo}

into

property.mynameblah.home.directory = /blah
property.myname.ip = 127.0.0.1
property.component1.ip = 127.0.0.1
property.component1.foo = bar

Lexical analysis and tokenization sounds like my best route, but this is a very simple form of it. It’s a simple grammar, a simple substitution and I’d like to make sure that I’m not bringing a sledgehammer to knock in a nail.

I could create my own lexer and tokenizer, or ANTlr is a possibility, but I don’t like re-inventing the wheel and ANTlr sounds like overkill.

I’m not familiar with compiler techniques, so pointers in the right direction & code would be most appreciated.

Note: I can change the input format.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-13T20:40:15+00:00

There’s an excellent article on Using Regular Expressions for Lexical Analysis at effbot.org.

Adapting the tokenizer to your problem:

import re

token_pattern = r"""
(?P<identifier>[a-zA-Z_][a-zA-Z0-9_]*)
|(?P<integer>[0-9]+)
|(?P<dot>\.)
|(?P<open_variable>[$][{])
|(?P<open_curly>[{])
|(?P<close_curly>[}])
|(?P<newline>\n)
|(?P<whitespace>\s+)
|(?P<equals>[=])
|(?P<slash>[/])
"""

token_re = re.compile(token_pattern, re.VERBOSE)

class TokenizerException(Exception): pass

def tokenize(text):
    pos = 0
    while True:
        m = token_re.match(text, pos)
        if not m: break
        pos = m.end()
        tokname = m.lastgroup
        tokvalue = m.group(tokname)
        yield tokname, tokvalue
    if pos != len(text):
        raise TokenizerException('tokenizer stopped at pos %r of %r' % (
            pos, len(text)))

To test it, we do:

stuff = r'property.${general.name}.ip = ${general.ip}'
stuff2 = r'''
general {
  name = myname
  ip = 127.0.0.1
}
'''

print ' stuff '.center(60, '=')
for tok in tokenize(stuff):
    print tok

print ' stuff2 '.center(60, '=')
for tok in tokenize(stuff2):
    print tok

for:

========================== stuff ===========================
('identifier', 'property')
('dot', '.')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'name')
('close_curly', '}')
('dot', '.')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('open_variable', '${')
('identifier', 'general')
('dot', '.')
('identifier', 'ip')
('close_curly', '}')
========================== stuff2 ==========================
('newline', '\n')
('identifier', 'general')
('whitespace', ' ')
('open_curly', '{')
('newline', '\n')
('whitespace', '  ')
('identifier', 'name')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('identifier', 'myname')
('newline', '\n')
('whitespace', '  ')
('identifier', 'ip')
('whitespace', ' ')
('equals', '=')
('whitespace', ' ')
('integer', '127')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '0')
('dot', '.')
('integer', '1')
('newline', '\n')
('close_curly', '}')
('newline', '\n')

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m looking to speed along my discovery process here quite a bit, as this

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply