There are some documents to be indexed, that means I need to read the

Question

Asked: June 2, 2026 (2026-06-02T15:27:28+00:00)

I have some documents to index. That means I need to read the documents, extract the words, and index them by storing which document each word appears in and at which position.

Initially, I create a separate file for each word. Consider two documents:

document 1

The Problem of Programming Communication with

document 2

Programming of Arithmetic Operations

So there will be 10 words, 8 unique. So I create 8 files.
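
The count can be checked with a quick sketch. It assumes words are separated by whitespace and compared case-insensitively; the variable names are only illustrative.

```python
doc1 = "The Problem of Programming Communication with"
doc2 = "Programming of Arithmetic Operations"

# Lower-case so that "Programming" and "programming" count as the same word.
words = doc1.lower().split() + doc2.lower().split()
unique_words = sorted(set(words))

print(len(words), len(unique_words))
```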

the
problem
of
programming
communication
with
arithmetic
operations

In each file I will store which documents the word appears in and at what position. The actual structure I am implementing holds a lot more information, but this basic structure will serve the purpose.

file name       file content

the             1 1
problem         1 2
of              1 3 2 2
programming     1 4 2 1
communication   1 5
with            1 6
arithmetic      2 3
operations      2 4

Each pair of numbers is a document and a position. For example, the word "of" is located at document 1, position 3 and at document 2, position 2.

After the initial index is done, I will concatenate all the files into a single index file, and in another file I will store the offset at which each word's entries begin.

index file:

1 1 1 2 1 3 2 2 1 4 2 1 1 5 1 6 2 3 2 4

offset file:

the 1 problem 3 of 5 programming 9 communication 13 with 15 arithmetic 17 operations 19
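
As a rough illustration of the concatenation step, the sketch below builds the index and offset data in memory from the per-word postings. It assumes 1-based offsets that count numbers rather than bytes, as in the example; the names `postings`, `index`, and `offsets` are hypothetical.

```python
# Per-word postings as flat [doc, position, doc, position, ...] lists,
# in the same order as the per-word files above.
postings = [
    ("the", [1, 1]),
    ("problem", [1, 2]),
    ("of", [1, 3, 2, 2]),
    ("programming", [1, 4, 2, 1]),
    ("communication", [1, 5]),
    ("with", [1, 6]),
    ("arithmetic", [2, 3]),
    ("operations", [2, 4]),
]

index = []     # concatenated postings (the "index file")
offsets = {}   # word -> 1-based position of its first number (the "offset file")
for word, numbers in postings:
    offsets[word] = len(index) + 1
    index.extend(numbers)

print(" ".join(map(str, index)))
print(offsets)
```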

So if I need the index info for "communication", I will go to the 13th position of the index file and read up to (but excluding) the 15th position, which is the offset of the next word.
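
A minimal sketch of that lookup, with the index and offset files held as in-memory lists for brevity. The function name `lookup` is hypothetical, and the last word is assumed to run to the end of the index.

```python
index = [1, 1, 1, 2, 1, 3, 2, 2, 1, 4, 2, 1, 1, 5, 1, 6, 2, 3, 2, 4]
# (word, 1-based offset) pairs in file order, as in the offset file.
offsets = [("the", 1), ("problem", 3), ("of", 5), ("programming", 9),
           ("communication", 13), ("with", 15), ("arithmetic", 17),
           ("operations", 19)]

def lookup(word):
    """Return the (document, position) pairs stored for `word`."""
    for i, (w, start) in enumerate(offsets):
        if w == word:
            # The slice ends where the next word starts, or at end of file.
            end = offsets[i + 1][1] if i + 1 < len(offsets) else len(index) + 1
            numbers = index[start - 1:end - 1]
            return list(zip(numbers[0::2], numbers[1::2]))
    return []

print(lookup("communication"))
print(lookup("of"))
```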

This is all fine for static indexing. But if I change a single entry, the whole index file will need to be rewritten. Can I use a B-tree as the index file's structure, so that I can change the file content dynamically and update the offsets somehow? If so, can someone point me to a tutorial or library that shows how this works, or explain a bit about how I can implement it?

Thank you very much for taking the time to read such a long post.

EDIT: I was not aware of the difference between B-tree and binary tree. So I asked the question originally using binary tree. It is fixed now.

1 Answer

Editorial Team · Answer 1 · 2026-06-02T15:27:29+00:00

Basically you’re trying to build an inverted index. Why is it necessary to use so many files? You could use a persistent object and dictionaries to do the job for you. Later, when an index changes, you just reload the persistent object and change a given entry and re-save the object.

Here's some example code that does that:

import shelve

DOC1 = "The problem of Programming Communication with"
DOC2 = "Programming of Arithmetic Operations"

# Map each document name to its lower-cased list of words.
docs = {'DOC1': DOC1.lower().split(), 'DOC2': DOC2.lower().split()}

inverted_index = {}

# Record every (document, position) pair for each word; positions start at 1.
# Working on whole words avoids false substring matches such as "of" in "proof".
for name in sorted(docs):
    for position, word in enumerate(docs[name], 1):
        inverted_index.setdefault(word, []).append((name, position))

# Saving to persistent object
inverted_index_file = shelve.open('temp.db')
inverted_index_file['1'] = inverted_index
inverted_index_file.close()

Then you can see the saved object like this (and you can modify it using the same strategy):

>>> import shelve
>>> t = shelve.open('temp.db')['1']
>>> print(t)
{'operations': [('DOC2', 4)], 'of': [('DOC1', 3), ('DOC2', 2)], 'programming': [('DOC1', 4), ('DOC2', 1)], 'communication': [('DOC1', 5)], 'the': [('DOC1', 1)], 'with': [('DOC1', 6)], 'problem': [('DOC1', 2)], 'arithmetic': [('DOC2', 3)]}

My point is that once you have built the index, your other code can keep the shelve object in memory as a dictionary while it runs and change it dynamically.
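
A rough sketch of that reload, change, and re-save cycle follows. The temporary path and the 'DOC3' entry are hypothetical, chosen only to keep the example self-contained.

```python
import os
import shelve
import tempfile

# Hypothetical path; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), 'temp.db')

# Build a small index and save it, as in the code above.
db = shelve.open(path)
db['1'] = {'of': [('DOC1', 3), ('DOC2', 2)]}
db.close()

# Later: reload, change one entry, and save again.
db = shelve.open(path)
inverted_index = db['1']                  # a copy loaded into memory
inverted_index['of'].append(('DOC3', 7))  # hypothetical new occurrence
db['1'] = inverted_index                  # reassign so the change is written
db.close()

db = shelve.open(path)
updated = db['1']
db.close()
print(updated['of'])
```

Because the whole dictionary sits under the single key '1', every save rewrites the entire pickled index. Storing each word under its own shelf key would probably suit a large index better, since a change would then rewrite only that word's entry.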

If that does not suit you, I would suggest using a database, especially sqlite3, because it is lightweight.
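
For the database route, here is a minimal sketch using Python's standard sqlite3 module. The table and index names are assumptions made for illustration. SQLite stores tables and indexes as B-trees internally, so this gives B-tree-backed lookups and in-place updates without implementing the tree by hand.

```python
import sqlite3

# In-memory database for the sketch; pass a file name to persist it.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE postings (word TEXT, doc INTEGER, position INTEGER)')
# SQLite keeps this index in a B-tree, so lookups by word avoid a full scan.
conn.execute('CREATE INDEX postings_word ON postings (word)')

docs = {1: "The Problem of Programming Communication with",
        2: "Programming of Arithmetic Operations"}
for doc_id, text in docs.items():
    rows = [(word, doc_id, pos)
            for pos, word in enumerate(text.lower().split(), 1)]
    conn.executemany('INSERT INTO postings VALUES (?, ?, ?)', rows)
conn.commit()

def lookup(word):
    """Return the (doc, position) rows stored for `word`."""
    cur = conn.execute(
        'SELECT doc, position FROM postings WHERE word = ? ORDER BY doc, position',
        (word,))
    return cur.fetchall()

print(lookup('of'))

# Changing one posting touches only that row, not the whole index.
conn.execute('DELETE FROM postings WHERE word = ? AND doc = ?', ('of', 2))
conn.commit()
print(lookup('of'))
```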
