The goal is to be able to compare words in a document to the

Question

Asked: May 20, 2026 (2026-05-20T23:44:26+00:00)

The goal is to be able to compare words in a document to the words in a set of documents as fast as possible (create a term-document matrix). If possible can this be done (and will it be fast) by using Lucene?

My thought (if I do it myself) is to create a tree of terms for each document and then combine the trees to make the set. To create the trees, I would use nested dictionaries, with each dictionary key being a character. Each position in the term would be a different level in the hierarchy.

For example, using the sample string "This is a test", the tree would look like:

t
 h
  i
   s
 e
  s
   t
i
 s
a

Notice that the 't' in the first level appears only once. The first dictionary would contain the keys {'t', 'i', 'a'}. Each of those keys maps to a second-level dictionary: the one under 't' contains {'h', 'e'}, the one under 'i' contains {'s'}, and the one under 'a' is empty.
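
For illustration, the structure described above can be written out by hand as nested dictionaries. This is only a sketch of the intended shape for the sample string; the variable names are assumptions, and this form does not yet record where a word ends.

```python
# Hand-built nested dictionaries for the words "this", "is", "a", "test".
# Each key is one character; each value is the dictionary for the next level.
tree = {
    "t": {
        "h": {"i": {"s": {}}},   # t-h-i-s
        "e": {"s": {"t": {}}},   # t-e-s-t
    },
    "i": {"s": {}},              # i-s
    "a": {},                     # a
}

first_level = sorted(tree)       # keys of the top-level dictionary
under_t = sorted(tree["t"])      # keys of the dictionary stored under 't'
print(first_level, under_t)
```

Here `first_level` holds the three top-level characters and `under_t` holds the two characters that can follow 't'.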

This should make lookup extremely fast. The maximum number of steps in a loop would be the number of characters in the longest word. The part that is making my brain fold in on itself is how to build a dictionary like this dynamically, specifically how to access the correct level.

So far I have something to the effect of

def addTerm(self, term):
   current_level = 0;
   for character in list(term):
      character = character.lower()
      if re.match("[a-z]",character):
         self.tree[character] = {}
         current_level += 1

1 Answer

Editorial Team · Answer 1 · 2026-05-20T23:44:26+00:00

I can see a few problems with your current implementation. How do you mark whether a node in the trie is a word? A better implementation would be to initialize tree to something like tree = [{}, None], where the first element holds the child nodes and the second element stays None unless the current node is the end of a word.

Your addTerm method could then be something like:

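import re  # needed for the character filter below

def addTerm(self, term):
   # Each node is [children, word]; word stays None unless a term ends here.
   node = self.tree
   for c in term:
      c = c.lower()
      if re.match("[a-z]", c):  # skip anything that is not a letter
         # Reuse the child node for this character, or create it.
         node = node[0].setdefault(c, [{}, None])
   node[1] = term  # mark the end of a word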

You could set node[1] to True if you don’t care about what word is at the node.

Checking whether a word is in the trie would be something like:

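def findTerm(self, term):
    node = self.tree
    for c in term:
        c = c.lower()
        if re.match("[a-z]", c):
            if c in node[0]:
                node = node[0][c]
            else:
                # No child for this character, so the term was never added.
                return False
    # A prefix of a stored word ends on a node whose word slot is still None.
    return node[1] is not None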
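
Putting the two methods together, a minimal self-contained sketch might look like the following. The class name `Trie` and the helper `_letters` are assumptions made for illustration; the node layout `[children, word]` is the one suggested above.

```python
import re


class Trie:
    """Character trie; each node is [children_dict, word_or_None]."""

    def __init__(self):
        self.tree = [{}, None]

    @staticmethod
    def _letters(term):
        # Hypothetical helper: lowercase the term and keep only a-z.
        return re.sub(r"[^a-z]", "", term.lower())

    def addTerm(self, term):
        node = self.tree
        for c in self._letters(term):
            # Reuse the child node for this character, or create it.
            node = node[0].setdefault(c, [{}, None])
        node[1] = term  # mark the end of a word

    def findTerm(self, term):
        node = self.tree
        for c in self._letters(term):
            if c not in node[0]:
                return False
            node = node[0][c]
        return node[1] is not None


trie = Trie()
for word in "This is a test".split():
    trie.addTerm(word)

print(trie.findTerm("test"))   # a stored word
print(trie.findTerm("tes"))    # only a prefix of a stored word
print(trie.findTerm("THIS"))   # lookups ignore case
```

With this layout, "tes" is reported as missing even though its nodes exist, because only the node reached by the full word "test" has its word slot filled in.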
