I have an interesting problem that I need help with. I am working on a feature of my program and ran into this issue:
- I have a huge list of street names in Indonesia ( > 100k rows ) stored in a database. Each street name may have more than one word. For example, “Sudirman”, “Gatot Subroto”, and “Jalan Asia Afrika” are all legitimate street names.
- I have a bunch of texts ( > 1 million rows ) in databases, which I split into sentences. The feature ( a function, to be exact ) that I need is a test of whether a sentence contains a street name or not, so just a true / false test.
I have tried to solve it by doing these steps:
a. Put the street names into a key/value hash
b. Split each sentence into words
c. Test whether each word is in the hash
This is fast, but it does not work for street names with multiple words.
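A minimal sketch of this first attempt, to make the limitation concrete. The sample names and sentences are made up for illustration, and the function name `contains_street_word` is hypothetical; the real names would be loaded from the database.

```python
# Hypothetical sample data; the real street names would come from the database.
street_names = {"Sudirman", "Gatot Subroto", "Jalan Asia Afrika"}


def contains_street_word(sentence, names):
    """Return True if any single whitespace-separated word is a key in the hash."""
    return any(word in names for word in sentence.split())


# A single-word street name is found.
print(contains_street_word("Macet di Sudirman pagi ini", street_names))  # True

# A multi-word street name is missed, because neither "Gatot" nor "Subroto"
# is a key on its own; only the full string "Gatot Subroto" is.
print(contains_street_word("Macet di Gatot Subroto pagi ini", street_names))  # False
```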
Another alternative that I thought of is to do these steps:
a. Split each sentence into words
b. Query the database with a LIKE statement ( i.e. SELECT #### FROM street_table WHERE name LIKE '%word%' )
c. If the query returns a row, it means that the sentence contains a street name
However, this solution is going to be very I/O intensive, since it needs one query per word.
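Roughly, this alternative would look like the sketch below. It uses an in-memory SQLite database purely as a stand-in for the real one, so the connection, the sample rows, and the function name `sentence_has_street_like` are assumptions for illustration; only `street_table` and `name` come from the query above.

```python
import sqlite3

# Stand-in for the real database: an in-memory SQLite table with sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE street_table (name TEXT)")
conn.executemany(
    "INSERT INTO street_table (name) VALUES (?)",
    [("Sudirman",), ("Gatot Subroto",), ("Jalan Asia Afrika",)],
)


def sentence_has_street_like(conn, sentence):
    """Run one LIKE query per word; return True as soon as any query finds a row."""
    for word in sentence.split():
        row = conn.execute(
            "SELECT 1 FROM street_table WHERE name LIKE ? LIMIT 1",
            (f"%{word}%",),
        ).fetchone()
        if row is not None:
            return True
    return False
```

Besides the one-query-per-word cost, note that '%word%' matches substrings, so with this sample data a short word such as "di" matches "Sudirman" and the test reports a street name that is not really there.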
So my question is: what is the most efficient way to do this test, regardless of the programming language? I mainly work in Python, but any language will do as long as I can grasp the concepts.
============EDIT 1 =================
Will this be periodic?
Yes, I will call this feature / function at an interval of 1 minute. Each call will take at least 100 rows of text and test them against the street name database.
A simple solution would be to create a dictionary/multimap that maps the first word of each street name to the full street name(s) starting with that word. As you iterate over the words in your sentence, look up each word to get the potential street names, then check whether you have a match by comparing the words that follow.
This algorithm should be fairly easy to implement and should perform pretty well too.
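Here is a minimal Python sketch of that idea. The function names `build_index` and `contains_street_name` are made up for illustration, and it assumes plain whitespace tokenization and case-insensitive matching; punctuation handling is left out.

```python
from collections import defaultdict


def build_index(street_names):
    """Map the first word of each street name to the word tuples of all names starting with it."""
    index = defaultdict(list)
    for name in street_names:
        words = tuple(name.lower().split())
        if words:  # skip empty names
            index[words[0]].append(words)
    return index


def contains_street_name(sentence, index):
    """Return True if any indexed street name appears as a run of consecutive words."""
    words = sentence.lower().split()
    for i, word in enumerate(words):
        # Only names that start with this word need to be compared.
        for candidate in index.get(word, ()):
            if tuple(words[i:i + len(candidate)]) == candidate:
                return True
    return False


# Build the index once (for example at startup), then reuse it for every batch.
index = build_index(["Sudirman", "Gatot Subroto", "Jalan Asia Afrika"])
print(contains_street_name("Macet di Gatot Subroto pagi ini", index))  # True
print(contains_street_name("Gatot pergi ke pasar", index))  # False
```

The index lives in memory, so each sentence costs one dictionary lookup per word plus a short comparison for the few names that share that first word, with no database round trips during the test.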