I am trying to write some code that will function like google calendars quick add feature . You know the One where you can input any of the following :
1) 24th sep 2010 , Johns Birthday
2) John’s Birthday , 24/9/10
3) 24 September 2010 , Birthday of John Doe
4) 24-9-2010 : John Does Birthday
5) John Does Birthday 24th of September 2010
And it can figure out that we want an event on a date 24/9/2010 have the rest of the material as the event text.
I want to do this is python .
I am thinking of a design where I write regular expressions that may match all of the cases listed above and extract the date. But I am sur there is a smarter way to approach this problem . Since I clearly am not trained in lexical analysis or the many types of parsers styles. I am looking for whats a good way to approach this problem.
NOTE: The python code here is not correct! It is just a rough pseudo-code of how it might look.
Regular Expressions are good at finding and extracting data from text in a fixed format (e.g. a DD/MM/YYYY date).
A lexer/parser pair is good at processing data in a structured, but somewhat variable format. Lexers split text into tokens. These tokens are units of information of a given type (number, string, etc.). Parsers take this series of tokens and does something depending on the order of the tokens.
Looking at the data, you have a basic (subject, verb, object) structure in different combinations for the relation (person, ‘birthday’, date):
I would handle 29/9/10 and 24-9-2010 as a single token using a regex, returning it as a date type. You could probably do the same for the other dates, with a map to convert September and sep to 9.
You could then return the everything else as strings (separated by whitespace).
You then have:
NOTE: ‘birthday’, ‘,’, ‘:’ and ‘of’ here are keywords, so:
All except 3 use the possessive form of the person (
'sif the last character is a consonant,sif it is a vowel). This can be tricky, as ‘Alexis’ could be the plural form of ‘Alexi’, but since you are restricting where plural forms can be, it is easy to detect:Now, name can either be
first-nameorfirst-name last-name(yes, I know Japan swaps these around, but from a processing perspective, the above problem does not need to differentiate first and last names). The following will handle these two forms:Finally, you can process forms 1-5 using something like this: