I’m trying to create a regex to recognize English numerals, such as one, nineteen, twenty, one hundred and twenty two, et cetera, all the way to the millions. I want to reuse some parts of the regular expression, so the regex is being constructed by parts, like so:
// replace <TAG> with the content of the variable
ONE_DIGIT = (?:one|two|three|four|five|six|seven|eight|nine)
TEEN = (?:ten|eleven|twelve|(?:thir|four|fif|six|seven|eigh|nine)teen)
TWO_DIGITS = (?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\s+<ONE_DIGIT>)?|<TEEN>)
// HUNDREDS, et cetera
I was wondering if anyone has already done the same (and would like to share), as these regexes are quite long and it’s possible that they match something they shouldn’t, or miss something they should match. I also want them to be as efficient as possible, so I’m looking forward to any optimization tips. I’m using the Java regex engine, but any regex flavour is acceptable.
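For reference, here is a minimal sketch of how the fragments above might be composed in Java using plain string concatenation. The names `EnglishNumeralRegex`, `BELOW_HUNDRED`, `BELOW_THOUSAND`, and `isNumeral` are made up for illustration, and the `HUNDREDS` fragment is only one guess at the part elided above. It stops at 999 rather than going to the millions, and it spells the teen stem as `four` so that "fourteen" matches.

```java
import java.util.regex.Pattern;

final class EnglishNumeralRegex {
    // Fragments from the question. TEEN uses the stem "four" ("fourteen"),
    // while TWO_DIGITS keeps "for" ("forty").
    static final String ONE_DIGIT = "(?:one|two|three|four|five|six|seven|eight|nine)";
    static final String TEEN = "(?:ten|eleven|twelve|(?:thir|four|fif|six|seven|eigh|nine)teen)";
    static final String TWO_DIGITS =
        "(?:(?:twen|thir|for|fif|six|seven|eigh|nine)ty(?:\\s+" + ONE_DIGIT + ")?|" + TEEN + ")";

    // Assumption: one possible shape for the elided HUNDREDS part, with an optional "and".
    static final String BELOW_HUNDRED = "(?:" + TWO_DIGITS + "|" + ONE_DIGIT + ")";
    static final String HUNDREDS =
        "(?:" + ONE_DIGIT + "\\s+hundred(?:\\s+(?:and\\s+)?" + BELOW_HUNDRED + ")?)";
    static final String BELOW_THOUSAND = "(?:" + HUNDREDS + "|" + BELOW_HUNDRED + ")";

    // Compile once and reuse; compiling the pattern is the expensive step.
    static final Pattern NUMERAL = Pattern.compile(BELOW_THOUSAND, Pattern.CASE_INSENSITIVE);

    // True if the whole (trimmed) input is a numeral from one to 999.
    static boolean isNumeral(String text) {
        return NUMERAL.matcher(text.trim()).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"nineteen", "forty two", "one hundred and twenty two", "forteen"};
        for (String s : samples) {
            System.out.println(s + " -> " + isNumeral(s));
        }
    }
}
```

Building the pattern from named string constants keeps each fragment testable on its own; thousands and millions could presumably be layered on top of `BELOW_THOUSAND` in the same way.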
See Perl’s Lingua::EN::Words2Nums and Lingua::EN::FindNumber.
In particular, see the source code for Lingua::EN::FindNumber, which is subject to Perl’s Artistic License.
You can use Regex::PreSuf to automatically factor out common pre- and suffixes:
Output:
I am afraid it gets harder after this 😉