I currently have Python code that compares two texts using the cosine similarity measure. I got the code here.
What I want to do is take the two texts and pass them through a dictionary (not a Python dictionary, just a dictionary of words) before calculating the similarity measure. The dictionary will just be a list of words, although it will be a large list. I know it shouldn’t be hard and I could maybe stumble my way through something, but I would like it to be efficient too. Thanks.
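If the dictionary fits in memory, use a Python set. Membership tests on a set take constant time on average, so filtering stays fast even with a large word list.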
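Here is a minimal sketch of the set approach. It assumes the word list is stored one word per line and that the similarity code works on a list of word tokens; `load_wordlist`, `filter_tokens` and the `\w+` tokenizer are names and choices made up for this example, not part of the code the question refers to.

```python
import re

# Hypothetical tokenizer: runs of word characters count as words.
WORD = re.compile(r"\w+")

def load_wordlist(path):
    """Read one word per line into a set, lowercased, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def filter_tokens(text, vocabulary):
    """Return the words of `text` that appear in `vocabulary`, in order."""
    return [w for w in WORD.findall(text.lower()) if w in vocabulary]

# Small inline vocabulary for illustration; in practice use load_wordlist(...).
vocabulary = {"cat", "dog"}
tokens = filter_tokens("The cat chased the Dog!", vocabulary)
```

The filtered token list can then be fed to whatever builds the vectors for the cosine calculation, so only dictionary words contribute to the score.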
If it doesn’t fit in memory, you can use the standard-library shelve module, which keeps the words in a file on disk and looks them up by key.
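A rough sketch of the shelve variant, under the same assumptions as before. The function names and the choice to store `True` as a dummy value are illustrative only; each lookup now touches the disk, so expect it to be slower than the in-memory set.

```python
import shelve

def build_word_shelf(words, shelf_path):
    """Store every word as a key in a new on-disk shelf (one-time step)."""
    # flag="n" always creates a fresh, empty shelf at shelf_path.
    with shelve.open(shelf_path, flag="n") as db:
        for w in words:
            w = w.strip().lower()
            if w:
                db[w] = True  # the value is a placeholder; only keys matter

def filter_tokens_on_disk(tokens, shelf_path):
    """Keep only the tokens that exist as keys in the shelf."""
    # flag="r" opens the existing shelf read-only.
    with shelve.open(shelf_path, flag="r") as db:
        return [t for t in tokens if t in db]
```

Build the shelf once from the word list, then reuse it for every pair of texts. Tokens should be lowercased before the lookup, since the keys are stored in lowercase.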