Getting started with python. I am trying to implement positional index using nested dictionary.

Question


Asked: May 29, 2026

I'm getting started with Python and am trying to implement a positional index using a nested dictionary. However, I am not sure if that's the way to go. The index should contain the term, term frequency, doc ID, and term positions.

Example:

dict = {term: {termfreq: {docid: {[pos1,pos2,...]}}}}

My question is: am I on the right track here, or is there a better solution to my problem? If a nested dictionary is the way to go, I have one additional question: how do I get single items out of the dictionary, for example the term frequency for a term (without all the additional information about the term)?
Help on this is greatly appreciated.


1 Answer

Editorial Team · Answer 1 · 2026-05-29T20:28:30+00:00

Each term seems to have a term frequency, a doc id, and a list of positions. Is that right? If so, you could use a dict of dicts:

dct = { 'wassup' : {
            'termfreq' : 'daily',
            'docid' : 1,
            'pos' : [3,4] }}

Then, given a term, like ‘wassup’, you could look up the term frequency with

dct['wassup']['termfreq']
# 'daily'

Think of a dict as being like a telephone book. It is great at looking up values (phone numbers) given keys (names). It is not so hot at looking up keys given values. Use a dict when you know you only need to look things up in one direction. You may need some other data structure (a database, perhaps?) if your lookup patterns are more complex.
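If what you are after is a positional index in the information-retrieval sense, one possible layout is term -> doc id -> list of positions, with the term frequency derived from the length of the position list rather than stored as its own level. The following is only a sketch under that assumption: the function names (build_positional_index, term_frequency, document_frequency) and the sample documents are made up for illustration, and tokenization is a plain lowercase split on whitespace.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Build {term: {doc_id: [positions]}} from {doc_id: text}.

    Tokenization is a naive lowercase whitespace split (an assumption
    made for this sketch, not a recommendation).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            # Create the position list on first sight, then append.
            index[term].setdefault(doc_id, []).append(pos)
    return index


def term_frequency(index, term, doc_id):
    """Number of times `term` occurs in document `doc_id`."""
    return len(index.get(term, {}).get(doc_id, []))


def document_frequency(index, term):
    """Number of documents that contain `term`."""
    return len(index.get(term, {}))


# Hypothetical sample documents, keyed by doc id.
docs = {1: "to be or not to be", 2: "to do is to be"}
index = build_positional_index(docs)

print(index["to"])                       # {1: [0, 4], 2: [0, 3]}
print(term_frequency(index, "be", 1))    # 2
print(document_frequency(index, "not"))  # 1
```

Deriving the frequency from the positions means it cannot drift out of sync with them, and a single item is still just a chained lookup such as index["to"][1].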

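You might also want to check out the Natural Language Toolkit (nltk). Its TextCollection class has a built-in method, tf_idf, for calculating tf-idf (term frequency times inverse document frequency). Note that nltk.word_tokenize relies on NLTK's Punkt tokenizer data, which has to be downloaded once with nltk.download before this will run: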

import nltk

# Given a corpus of texts
text1 = 'Lorem ipsum FOO dolor BAR sit amet'
text2 = 'Ut enim ad FOO minim veniam, '
text3 = 'Duis aute irure dolor BAR in reprehenderit '
text4 = 'Excepteur sint occaecat BAR cupidatat non proident'

# We split the texts into tokens, and form a TextCollection
mytexts = (
    [nltk.word_tokenize(text) for text in [text1, text2, text3, text4]])
mycollection = nltk.TextCollection(mytexts)

# Given a new text
text = 'et FOO tu BAR Brute'
tokens = nltk.word_tokenize(text)

# for each token (roughly, word) in the new text, we compute the tf_idf
for word in tokens:
    print('{w}: {s}'.format(w = word,
                            s = mycollection.tf_idf(word,tokens)))

yields

et: 0.0
FOO: 0.138629436112
tu: 0.0
BAR: 0.0575364144904
Brute: 0.0
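To see where those numbers come from, here is a rough plain-Python sketch of the same calculation. It assumes tf is the term's count divided by the length of the text, and idf is the natural log of the number of texts divided by the number of texts containing the term (0 if none do), which is consistent with the values above. The helper names tf, idf and tf_idf are my own, and str.split stands in for nltk.word_tokenize, so punctuation stays attached to words.

```python
import math


def tf(term, tokens):
    """Relative frequency of `term` within one tokenized text."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Natural log of (number of texts / number of texts containing term)."""
    matches = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / matches) if matches else 0.0


def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)


# Same sample texts as above, tokenized with a plain whitespace split.
corpus = [text.split() for text in [
    'Lorem ipsum FOO dolor BAR sit amet',
    'Ut enim ad FOO minim veniam, ',
    'Duis aute irure dolor BAR in reprehenderit ',
    'Excepteur sint occaecat BAR cupidatat non proident',
]]
tokens = 'et FOO tu BAR Brute'.split()

for word in tokens:
    print('{w}: {s}'.format(w=word, s=tf_idf(word, tokens, corpus)))
```

FOO appears once among 5 tokens and in 2 of the 4 texts, so its score is 0.2 * ln(4/2); BAR appears in 3 of the 4 texts, giving 0.2 * ln(4/3).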
