I am writing lexer rules for a custom description language using pyLR1 which shall

Question


Asked: June 6, 2026 (2026-06-06T21:19:48+00:00)

I am writing lexer rules for a custom description language using pyLR1, which must include time literals such as:

10h30m     # meaning 10 hours + 30 minutes
5m30s      # meaning 5 minutes + 30 seconds
10h20m15s  # meaning 10 hours + 20 minutes + 15 seconds
15.6s      # meaning 15.6 seconds

The units must appear in the fixed order h, m, s. Concretely, the valid combinations are hms, hm, h, ms, m and s (each unit letter preceded by its number, of course).
As a bonus, the regex should allow decimal (i.e. non-integer) numbers only in the least significant segment that is present.

So for every group but the last I have a number match like:

([0-9]+)

And for the last group:

([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)  # to allow for .5 and 0.5 and 5.0 and 5

Going through all the combinations of h, m and s, a cute little Python script gives me the following regex:

(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)h|([0-9]+)h([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)h([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s)
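For reference, a generator along the following lines produces an alternation of that shape. It is a reconstruction under assumptions, not the original script; the names INT, DEC, COMBOS and build_time_regex are made up for illustration.

```python
import re

# Integer-only number, used for every segment except the last one.
INT = r"([0-9]+)"
# Number that may carry a decimal part, used for the last segment only.
DEC = r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)"
# The six valid unit combinations, in the order they appear in the regex.
COMBOS = ["h", "hm", "hms", "m", "ms", "s"]


def build_time_regex():
    """Join one alternative per unit combination into a single group."""
    alternatives = []
    for combo in COMBOS:
        # All leading segments are integers, the final one may be decimal.
        parts = [INT + unit for unit in combo[:-1]]
        parts.append(DEC + combo[-1])
        alternatives.append("".join(parts))
    return "(" + "|".join(alternatives) + ")"


TIME_RE = re.compile(build_time_regex())
```

Checking candidates with TIME_RE.fullmatch(text) rather than match avoids accepting a valid prefix of an invalid literal.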

Obviously, this is a bit of a horror of an expression. Is there any way to simplify it? The answer must work with Python's re module, and I will also accept answers that do not work with pyLR1 if that is due to its restricted subset of regular expressions.


1 Answer

Editorial Team · Answer 1 · 2026-06-06T21:19:49+00:00

You can factorise your regular expression. Using h, m and s to denote each of the subregexes, the most basic version is:

h|hm|hms|ms|m|s

which is what you have currently. You can break this into:

(h|hm|hms)|(ms|m)|s

and then pulling out h from the first expression and m from the second we get (using (x|) == x?):

h(m|ms)?|ms?|s

Continuing on we get to

h(ms?)?|ms?|s

which is probably simpler (and probably the simplest).
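A quick way to convince yourself that the factorisation did not change the language is to enumerate short strings over the letters h, m, s and compare what each form accepts. This is only a sanity check on the symbolic forms above; the helper name accepted is made up for the sketch.

```python
import re
from itertools import product

# The symbolic forms from above, with h, m, s as literal letters.
ORIGINAL = re.compile(r"h|hm|hms|ms|m|s")
FACTORED = re.compile(r"h(ms?)?|ms?|s")


def accepted(pattern, max_len=4):
    """Return every string over 'hms' up to max_len that fully matches."""
    words = set()
    for length in range(max_len + 1):
        for letters in product("hms", repeat=length):
            word = "".join(letters)
            if pattern.fullmatch(word):
                words.add(word)
    return words


same_language = accepted(ORIGINAL) == accepted(FACTORED)
```

Both forms accept exactly the six combinations h, hm, hms, m, ms and s, so same_language is True.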

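Adding d to denote a decimal part (as in \.[0-9]+), this could be written as

h(d|m(d|sd?)?)?|m(d|sd?)?|sd?

(i.e. each segment either carries a decimal part and ends the literal, or continues to the next of h, m or s. Read d as "this segment has a decimal part": in the concrete regex the \.[0-9]+ sits between the digits and the unit letter.)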

This would result in something like (for just hours and minutes):

[0-9]+((\.[0-9]+)?h|h[0-9]+(\.[0-9]+)?m)|[0-9]+(\.[0-9]+)?m
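Extending the same idea to all three units could look like the sketch below, which builds the pattern from small pieces so the nesting stays readable. The names N, D, SEC, MIN and TIME are illustrative. Note one assumption carried over from this answer's notation: a decimal part needs leading digits, so .5s and 5.s are rejected here even though the number pattern in the question allows them.

```python
import re

N = r"[0-9]+"            # integer digits of a segment
D = r"(?:\.[0-9]+)?"     # optional decimal part, written before the unit letter

# Seconds are always the last segment, so they may carry decimals.
SEC = N + D + "s"
# Minutes either end the literal (decimals allowed) or continue to seconds.
MIN = N + "(?:" + D + "m|m" + SEC + ")"
# Hours either end the literal (decimals allowed) or continue to minutes.
TIME = N + "(?:" + D + "h|h" + MIN + ")|" + MIN + "|" + SEC

TIME_RE = re.compile(TIME)
```

Used with TIME_RE.fullmatch(text), this accepts 10h30m, 5m30s, 10h20m15s and 15.6s, and rejects 10.5h30m because only the last segment present may have a decimal part.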

Looking at this, it might not be possible to get it into a form amenable to pyLR1, so lexing with decimals allowed in every position and then running a secondary check might be the best way to do this.
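A minimal sketch of that two-step approach in plain Python follows. It is not pyLR1 code: the function name parse_time_literal and the choice to return total seconds are assumptions made for illustration. The loose pattern is h(ms?)?|ms?|s with the question's decimal-capable number in every segment, and the secondary check then rejects a decimal part anywhere but the last segment.

```python
import re

# Decimal-capable number from the question, allowed in every segment here.
NUM = r"(?:[0-9]*\.[0-9]+|[0-9]+(?:\.[0-9]*)?)"
H, M, S = NUM + "h", NUM + "m", NUM + "s"

# Step 1: loose lexer pattern, only the order of units is enforced.
LOOSE = re.compile(f"{H}(?:{M}(?:{S})?)?|{M}(?:{S})?|{S}")
# Helper pattern to split an accepted literal into (number, unit) pairs.
SEGMENT = re.compile(f"({NUM})([hms])")

SECONDS_PER_UNIT = {"h": 3600, "m": 60, "s": 1}


def parse_time_literal(text):
    """Return the literal's value in seconds, or raise ValueError."""
    if LOOSE.fullmatch(text) is None:
        raise ValueError(f"not a time literal: {text!r}")
    segments = SEGMENT.findall(text)
    # Step 2: secondary check, only the last segment may be non-integer.
    for number, unit in segments[:-1]:
        if "." in number:
            raise ValueError(f"decimal not allowed in {number}{unit}")
    return sum(float(number) * SECONDS_PER_UNIT[unit]
               for number, unit in segments)
```

In a real lexer the secondary check would more likely live in the token action, where it can report a precise error instead of failing to tokenise.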
