Thanks for amazingly quick response. Stackoverflow is awesome!
I need to check if a word (or rather thousands of them) is matching a dict containing keywords.
For example, say I have a string: "The fluffy fox jumped the friggin fence." I need to check each word of the string against a dict of keywords, and if there’s a match, return all values.
I’ve created a dict filters: (uniqueid means ie. “lk2m3lk4m2”, rest is ‘static’.)
filters:
{ "fox" : [
{ 'subscription' : 'uniqueid', 'link' : 'uniqueid' },
{ 'subscription' : 'uniqueid', 'link' : 'uniqueid' }
]},
{ "fence" : [
{ 'subscription' : 'uniqueid', 'link' : 'uniqueid' }
]}
…and plan to iterate over filters for each word in string (and I have to do this with perhaps 5000 words / second. In other words, performance is the issue ABOVE ALL.
The number of filter-keywords may grow to thousands, while the strings will never be more than a normal sentence long (ie. 5-20 words). I’ll therefore iterate over each word in the string and check if it’s contained in the filter-list. However, at 500 sentences / second, I’m still looking at a lot of computation.
Is it possible to sort the list (ie. the key of dict’s in list) and thus drastically improve performance, for example? And are there C-implementations I should use (like I’m using cjson with great performance gain)?
Sorry for the somewhat fluid question — but how should I go about this task?
Edit:
Expected input:
"The fluffy fox jumped the friggin fence."
Expected output: { 'subscription' : 'flskdmfslk32232', 'link' : 'sfdksmfls22323' },
{ 'subscription' : '3023940fsdf', 'link' : 'sdflsfm223' }
(ie. the subscriptions listed under each matching keyword.)
You can determine if a word is a key in filters by simply by doing filters.has_key(word) or by doing:
or:
It isn’t necessary to iterate over each entry in filters. Rather you will want to split your input string, add each word to a Set (to eliminate duplicates), and then iterate over your set to look up each word in your filters dictionary.