I have some documents to index, which means I need to read them, extract the words, and record for each word which documents it appears in and at which positions.
Initially I create a separate file for each word. Consider two documents:
document 1
The Problem of Programming Communication with
document 2
Programming of Arithmetic Operations
That gives 10 words, 8 of them unique, so I create 8 files.
the
problem
of
programming
communications
with
arithmetic
operations
In each file I store the documents in which the word appears and its position in each. The actual structure I am implementing holds a lot more information, but this basic structure will serve the purpose.
file name        file content
the              1 1
problem          1 2
of               1 3 2 2
programming      1 4 2 1
communications   1 5
with             1 6
arithmetic       2 3
operations       2 4
Meaning: the word "of" is located at the 3rd position of the 1st document and the 2nd position of the 2nd document.
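To make the per-word structure concrete, here is a minimal Python sketch that builds the same (document, position) pairs in memory instead of in one file per word. The function name `build_index` and the choice to lowercase and split on whitespace are assumptions for illustration, not part of my actual implementation.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to a list of (doc_id, position) pairs, both 1-based."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        for pos, word in enumerate(text.lower().split(), start=1):
            index[word].append((doc_id, pos))
    return dict(index)

docs = [
    "The Problem of Programming Communication with",
    "Programming of Arithmetic Operations",
]
index = build_index(docs)
print(index["of"])  # [(1, 3), (2, 2)]
```

Each list plays the role of one word file: `index["of"]` holds the same pairs as the "of" row in the table above.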
After the initial index is done, I concatenate all the files into a single index file, and in another file I store the offset at which each word's entries begin.
index file:
1 1 1 2 1 3 2 2 1 4 2 1 1 5 1 6 2 3 2 4
offset file:
the 1 problem 3 of 5 programming 9 communications 13 with 15 arithmetic 17 operations 19
So if I need the index info for "communications", I go to the 13th position of the index file and read up to (but excluding) the 15th position, in other words the offset of the next word.
This is all fine for static indexing. But if I change a single index entry, the whole file needs to be rewritten. Can I use a B-tree as the index file’s structure, so that I can change the file content dynamically and update the offsets somehow? If so, can someone point me to a tutorial or library that shows how this works, or explain a bit about how I could implement it?
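The concatenation and lookup can be sketched as follows. This is a simplified in-memory version, assuming the index file is a flat sequence of integers and that offsets count numbers starting from 1, as in the listing above; the names `flatten` and `lookup` are hypothetical.

```python
def flatten(postings):
    """Concatenate per-word postings into one list and record 1-based start offsets."""
    flat, offsets = [], []
    for word, pairs in postings.items():
        offsets.append((word, len(flat) + 1))
        for doc_id, pos in pairs:
            flat.extend((doc_id, pos))
    return flat, offsets

def lookup(flat, offsets, word):
    """Return the (doc_id, position) pairs for word, reading up to the next word's offset."""
    words = [w for w, _ in offsets]
    i = words.index(word)
    start = offsets[i][1]
    # The last word has no successor, so it reads to the end of the index.
    end = offsets[i + 1][1] if i + 1 < len(offsets) else len(flat) + 1
    chunk = flat[start - 1:end - 1]
    return list(zip(chunk[0::2], chunk[1::2]))

postings = {
    "the": [(1, 1)], "problem": [(1, 2)], "of": [(1, 3), (2, 2)],
    "programming": [(1, 4), (2, 1)], "communications": [(1, 5)],
    "with": [(1, 6)], "arithmetic": [(2, 3)], "operations": [(2, 4)],
}
flat, offsets = flatten(postings)
print(lookup(flat, offsets, "communications"))  # [(1, 5)]
```

The sketch also shows where the update problem comes from: adding one pair to an early word shifts the start offset of every word after it.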
Thank you very much for taking the time to read such a long post.
EDIT: I was not aware of the difference between a B-tree and a binary tree, so I originally asked the question using "binary tree". It is fixed now.
Basically you’re trying to build an inverted index. Why is it necessary to use so many files? You could use a persistent object and dictionaries to do the job for you. Later, when an index entry changes, you just reload the persistent object, change that entry, and re-save the object.
Here’s some example code that does that:
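A minimal sketch of such a persistent index, assuming Python's standard `shelve` module and the two sample documents from the question. The function name `build_shelf` and the file name `index_shelf` are illustrative choices, not something the question prescribes.

```python
import os
import shelve
import tempfile

docs = {
    1: "The Problem of Programming Communication with",
    2: "Programming of Arithmetic Operations",
}

def build_shelf(path, docs):
    """Store word -> [(doc_id, position), ...] in a shelve file at path."""
    with shelve.open(path) as db:
        for doc_id, text in docs.items():
            for pos, word in enumerate(text.lower().split(), start=1):
                postings = db.get(word, [])
                postings.append((doc_id, pos))
                # Reassign the key: without writeback=True, in-place changes
                # to a fetched list are not saved back to the shelf.
                db[word] = postings

# A temporary directory keeps the sketch from leaving files around.
path = os.path.join(tempfile.mkdtemp(), "index_shelf")
build_shelf(path, docs)
```

The shelf behaves like a dictionary whose values are pickled to disk, so no separate offset file is needed.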
Then you can see the saved object like this (and you can modify it using the same strategy):
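A sketch of inspecting and then modifying a saved shelf, again assuming the standard `shelve` module. The snippet seeds a tiny shelf of its own so that it runs independently; the path and the added entry `(3, 1)` are made up for illustration.

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "index_shelf")

# Seed a small shelf so this snippet is self-contained.
with shelve.open(path) as db:
    db["of"] = [(1, 3), (2, 2)]
    db["with"] = [(1, 6)]

# Reopen the shelf and list what was saved.
with shelve.open(path) as db:
    for word in sorted(db):
        print(word, db[word])

# Change a single entry: load it, modify it, and store it back.
with shelve.open(path) as db:
    postings = db["of"]
    postings.append((3, 1))  # hypothetical new occurrence in a document 3
    db["of"] = postings

with shelve.open(path) as db:
    updated = db["of"]
```

Only the entry for "of" is rewritten; the other keys are left untouched.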
My point is that once you have built this, you could keep the shelve object in memory as a dictionary while your other code is running and change it dynamically. If that does not suit you, then I would suggest using a database, especially sqlite3, because it is lightweight.
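For the database route, a minimal sketch using Python's built-in `sqlite3` module might look like this. The table name `postings`, its columns, and the helper `lookup` are assumptions for illustration, and an in-memory database is used only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path instead to persist the index
conn.execute("CREATE TABLE postings (word TEXT, doc_id INTEGER, position INTEGER)")
conn.execute("CREATE INDEX idx_postings_word ON postings (word)")

docs = {
    1: "The Problem of Programming Communication with",
    2: "Programming of Arithmetic Operations",
}
rows = [
    (word, doc_id, pos)
    for doc_id, text in docs.items()
    for pos, word in enumerate(text.lower().split(), start=1)
]
conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
conn.commit()

def lookup(word):
    """Return all (doc_id, position) pairs stored for word."""
    cur = conn.execute(
        "SELECT doc_id, position FROM postings WHERE word = ? "
        "ORDER BY doc_id, position",
        (word,),
    )
    return cur.fetchall()

print(lookup("of"))  # [(1, 3), (2, 2)]

# Adding an occurrence is a single INSERT; nothing else has to be rewritten.
conn.execute("INSERT INTO postings VALUES (?, ?, ?)", ("of", 3, 1))
conn.commit()
```

SQLite stores its tables and indexes as B-trees, so this comes close to what the question asks for without implementing the tree by hand.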