I have code that takes two files as input:
(1) a dictionary/lexicon
(2) a text file (one sentence per line)
The first part of my code reads the dictionary into tuples, so it produces something like:
('mthy3lkw', 'weakBelief', 'U')
('mthy3lkm', 'firmBelief', 'B')
('mthy3lh', 'notBelief', 'A')
The second part of the code searches each sentence in the text file for the words in position 0 of those tuples, then prints the sentence, the search word, and its type.
So given the sentence mthy3lkw ana mesh 3arif, the desired output is:
["mthy3lkw ana mesh 3arif", 'mthy3lkw', 'weakBelief', 'U'], since the word mthy3lkw is found in the dictionary.
The second part of my code – the matching part – is TOO slow. How do I make it faster?
Here is my code:
import re

findings = []
for sentence in data:  # I open the sentences file with .readlines()
    for word in tuples:  # similar to the ones mentioned above
        # compile a whole-word pattern for the first item in every tuple
        p1 = re.compile('\\b%s\\b' % word[0])
        if p1.findall(sentence) and word[1] == "firmBelief":
            findings.append([sentence, word[0], "firmBelief"])
print(findings)
Build a dict keyed on the search word so you can find the matching tuple quickly. Then restructure your loops: instead of going through the whole dictionary for each sentence and trying to match every entry, go over each word in the sentence and look it up in the dict:
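A minimal sketch of that idea, not a drop-in replacement. It assumes every lexicon entry is a single token made of word characters, so that splitting a sentence with `\w+` roughly matches what the `\b...\b` pattern in the question finds. The names `lexicon`, `token`, `belief` and `kind` are made up for illustration, and `tuples` and `data` are filled with the sample values from the question plus one extra hypothetical sentence.

```python
import re

# Sample data mirroring the question; the second sentence is hypothetical.
tuples = [
    ('mthy3lkw', 'weakBelief', 'U'),
    ('mthy3lkm', 'firmBelief', 'B'),
    ('mthy3lh', 'notBelief', 'A'),
]
data = ["mthy3lkw ana mesh 3arif\n", "ana mesh 3arif\n"]

# Build the lookup once: search word -> (belief, type).
lexicon = {word: (belief, kind) for word, belief, kind in tuples}

findings = []
for sentence in data:
    sentence = sentence.strip()  # drop the newline left by .readlines()
    # One pass over the tokens of the sentence, one dict lookup per token.
    for token in re.findall(r'\w+', sentence):
        entry = lexicon.get(token)
        if entry is not None:
            belief, kind = entry
            findings.append([sentence, token, belief, kind])

print(findings)
# [['mthy3lkw ana mesh 3arif', 'mthy3lkw', 'weakBelief', 'U']]
```

The work per sentence now depends on the number of words in the sentence rather than on the size of the lexicon, and no regex is compiled inside the loop. Two differences from the original loop to be aware of: a word that occurs twice in a sentence is recorded twice here, and multi-word lexicon entries would not be found with this tokenization. To keep only one belief type, add a check such as `belief == "firmBelief"` before appending.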