The goal is to be able to compare words in a document to the words in a set of documents as fast as possible (create a term-document matrix). If possible can this be done (and will it be fast) by using Lucene?
My thought (if done by me) is to create a tree of terms in each document and then combine the trees together to make the set. To create the trees, I would use nested dictionaries with each dictionary key being a character. Each position in the term would be a different level in the heirarchy
Positions
1
2
3
4
5
6
For example, using a sample string “This is a test” the tree would look like
t
h
i
s
e
s
t
i
s
a
Notice the ‘t’ in the first level is there only once. The first dictionary would contain the keys {‘t’,’i’,’a’}. There would be three second level dictionaries containing the keys {‘h’}{‘e’}{‘s’}.
This should make look up extremly fast. The max number of steps in a loop would be the number of characters in the longest word. The part that is making my brain fold in on itself, is how do I dynamically build a dictionary like this, specifically accessing the correct level
So far I have something to the effect of
def addTerm(self, term):
current_level = 0;
for character in list(term):
character = character.lower()
if re.match("[a-z]",character):
self.tree[character] = {}
current_level += 1
I can see a few problems with your current implementation. How do you mark if a node in the trie is a word? A better implementation would be to initialize tree to something like
tree = [{}, None]where None indicates if the current node is the end of a word.Your addTerm method could then be something like:
You could set
node[1]to True if you don’t care about what word is at the node.Searching if a word is in the trie would be something like