I have a problem that sounds trivial, but I can’t find a straightforward, scalable, and performant solution. I have a single text input where website users can search for locations.
Today a location can be a city, an address in a city, or a neighborhood in a city. The user must separate the address or neighborhood from the city with a comma, so it’s easy for me to split the string and determine whether the first block is an address, a neighborhood, or a city. If the user leaves out needed information, for example by entering an address without a city, and I match more than one street with the same name, we show all the matching locations so the user can choose the correct one.
From the search log we found that most users don’t use the comma, even with all the tooltips explaining how to use the location search (thx google :p).
So the location search has a new requirement: accept addresses that are not comma-separated, like:
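To make the comma-based behaviour concrete, here is a minimal Python sketch of that kind of parsing. It is an illustration only, not the site's actual code: the lookup sets `CITIES` and `NEIGHBORHOODS` and the functions `classify` and `parse_comma_query` are hypothetical stand-ins for whatever database lookups are really used.

```python
# Hypothetical lookup data; a real system would query the database instead.
CITIES = {"new york"}
NEIGHBORHOODS = {"manhattan"}


def classify(block):
    """Guess what a single block is; anything unknown is treated as an address."""
    name = block.strip().lower()
    if name in CITIES:
        return "city"
    if name in NEIGHBORHOODS:
        return "neighborhood"
    return "address"


def parse_comma_query(query):
    """Split on commas: the first block is classified, the last block is the city."""
    parts = [p.strip() for p in query.split(",") if p.strip()]
    if not parts:
        return {}
    if len(parts) == 1:
        return {classify(parts[0]): parts[0]}
    return {classify(parts[0]): parts[0], "city": parts[-1]}
```

With a comma, `parse_comma_query("5th Avenue, New York")` separates the address from the city; without one, the whole string is a single block and the ambiguity described below appears.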
1. "5th Avenue"
2. "Manhattan"
3. "New York"
4. "5th Avenue Manhattan"
5. "5th Avenue Manhattan New York"
6. "Manhattan New York"
7. "5th Avenue New York"
But I can’t find a way to determine the meaning of each block, or a dynamic way to make this work. For example, if I get a string like “New York”, “New” could be an address and “York” could be a city.
My question is: is there some technique or framework that does what I need, or will I have to build my own algorithm (based on the number of words, commas, etc.) for this specific case?
Edit1:
Because I use SQL Server, I’m thinking about a full-text search across multiple columns, trying an exact match first and a non-exact match afterwards. But I think some incomplete addresses will return thousands of rows.
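The exact-then-non-exact idea can be sketched independently of SQL Server. The Python below is only an in-memory illustration under assumed data: the `STREETS` rows, the `search` function, and the `limit` parameter are all hypothetical, and a substring test stands in for the database's non-exact matching.

```python
# Hypothetical rows of (street, city); stands in for the real table.
STREETS = [
    ("5th avenue", "new york"),
    ("5th avenue", "san diego"),
    ("5th street", "austin"),
]


def search(term, rows=STREETS, limit=20):
    """Return exact matches if any exist, otherwise fall back to substring matches.

    `limit` caps the result so an incomplete address cannot return every row.
    """
    needle = term.strip().lower()
    if not needle:
        return []
    exact = [row for row in rows if row[0] == needle]
    if exact:
        return exact[:limit]
    return [row for row in rows if needle in row[0]][:limit]
```

Capping the fallback is one way to keep an incomplete address from flooding the user with rows, at the cost of possibly hiding the right one.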
Isn’t the key that specificity decreases from left to right? That is, the right-most semantic element (whether “New York” or “Manhattan”) is always the least specific: if it’s a borough, we don’t have to worry about the city; if it’s a street, we don’t have to worry about the borough, and so on.
So reverse the tokens and recurse through, seeking either a complete hit (“Manhattan”) or a keyword (“Avenue”, “Street”, “New”) that indicates either the beginning or end of a semantic element. So after a pass, you might have:
Which ought to give you enough to pattern-match against.
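A rough Python sketch of the right-to-left idea, under stated assumptions: the lookup sets `CITIES` and `NEIGHBORHOODS` and the function `parse_location` are hypothetical, and the sketch only handles complete hits. It does not use keywords such as “Avenue” to find element boundaries.

```python
# Hypothetical gazetteer data; a real system would look these up in the database.
CITIES = {"new york"}
NEIGHBORHOODS = {"manhattan"}


def parse_location(query):
    """Peel known names off the right end, least specific level first."""
    tokens = query.lower().split()
    result = {}
    end = len(tokens)  # tokens[end:] have already been assigned a meaning
    for label, names in (("city", CITIES), ("neighborhood", NEIGHBORHOODS)):
        # Try the longest remaining suffix first, then shorter ones.
        for start in range(end):
            candidate = " ".join(tokens[start:end])
            if candidate in names:
                result[label] = candidate
                end = start
                break
    if end > 0:
        # Whatever is left on the left is assumed to be the street address.
        result["address"] = " ".join(tokens[:end])
    return result
```

For “5th Avenue Manhattan New York” this first matches “new york” as the city, then “manhattan” as the neighborhood, and leaves “5th avenue” as the address. Any leftover text is simply assumed to be an address, so misspelled city names would end up there.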
UPDATE:
OK, to expand on the general strategy: