Thinking about my other problem, I decided I can’t even create a regular expression that will match Roman numerals (let alone a context-free grammar that will generate them).
The problem is matching only valid Roman numerals. E.g., 990 is NOT ‘XM’, it’s ‘CMXC’.
My problem in making the regex for this is that in order to allow or not allow certain characters, I need to look back. Let’s take thousands and hundreds, for example.
I can allow M{0,2}C?M (to allow for 900, 1000, 1900, 2000, 2900 and 3000). However, if the match is on CM, I can’t allow the following characters to be C or D (because I’m already at 900).
How can I express this in a regex?
If it’s simply not expressible in a regex, is it expressible in a context-free grammar?
You can use the following regex for this:
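Joining the four parts described below in order, and anchoring the result at both ends, gives a pattern along these lines. This is a minimal Python sketch; the name `ROMAN_RE` and the sample strings are only illustrative choices.

```python
import re

# Thousands, hundreds, tens and units, in that order, anchored at both ends.
ROMAN_RE = re.compile(
    r"^M{0,4}"             # thousands: 0 to 4000
    r"(CM|CD|D?C{0,3})"    # hundreds: 900, 400, or 0-300 / 500-800
    r"(XC|XL|L?X{0,3})"    # tens: 90, 40, or 0-30 / 50-80
    r"(IX|IV|V?I{0,3})$"   # units: 9, 4, or 0-3 / 5-8
)

# Print each candidate alongside whether the pattern accepts it.
for candidate in ["CMXC", "MCMXCIV", "XM", "CMC", "CMD", "IIII"]:
    print(candidate, bool(ROMAN_RE.match(candidate)))
```

Because CM, CD and D?C{0,3} are alternatives within a single group, a string that has used CM cannot continue with another C or D, so no look-back is needed.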
Breaking it down,
M{0,4} specifies the thousands section and basically restricts it to between 0 and 4000. It’s relatively simple: zero to four M characters stand for 0, 1000, 2000, 3000 and 4000. You could, of course, use something like M* to allow any number (including zero) of thousands, if you want to allow bigger numbers.
Next is (CM|CD|D?C{0,3}). Slightly more complex, this is for the hundreds section and covers all the possibilities: CM is 900, CD is 400, and D?C{0,3} covers 0 to 300 (no D, then zero to three C) and 500 to 800 (a D, then zero to three C).
Thirdly, (XC|XL|L?X{0,3}) follows the same rules as the previous section but for the tens place: XC is 90, XL is 40, and L?X{0,3} covers 0 to 30 and 50 to 80.
And, finally, (IX|IV|V?I{0,3}) is the units section, handling 0 through 9 and also similar to the previous two sections (Roman numerals, despite their seeming weirdness, follow some logical rules once you figure out what they are): IX is 9, IV is 4, and V?I{0,3} covers 0 to 3 and 5 to 8.
Just keep in mind that that regex will also match an empty string. If you don’t want this (and your regex engine is modern enough), you can use positive look-ahead:
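As a hedged illustration, the four-part pattern with a look-ahead of the form `(?=.)` placed right after the start anchor might look like this in Python. The name `NONEMPTY_ROMAN_RE` is hypothetical.

```python
import re

# The four-part pattern, with a positive look-ahead (?=.) after the start
# anchor so that at least one character must be present.
NONEMPTY_ROMAN_RE = re.compile(
    r"^(?=.)M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)

print(bool(NONEMPTY_ROMAN_RE.match("")))   # False: nothing for the look-ahead to see
print(bool(NONEMPTY_ROMAN_RE.match("M")))  # True: M satisfies (?=.) and is then consumed by M{0,4}
```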
This is a "check to match but discard" operation, meaning it looks ahead to check that the first character exists (.) after the start marker (^) but doesn’t absorb that first character. For example, if the string was M, that M would match the . but still be available for the next section of the regex, M{0,4}. However, the empty string would not match the look-ahead so would fail.
Another alternative, if you are not restricted to just a regex, would be to check that the length is not zero beforehand.
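The length check could be sketched as follows, assuming Python and the plain four-part pattern without the look-ahead. The function name `is_roman_numeral` is made up for this example.

```python
import re

# Plain four-part pattern; on its own it would also accept the empty string.
ROMAN_RE = re.compile(r"^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")

def is_roman_numeral(text: str) -> bool:
    """Return True for a non-empty string that the pattern accepts."""
    if len(text) == 0:  # reject the empty string before involving the regex
        return False
    return ROMAN_RE.match(text) is not None
```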