
The Archive Base

Editorial Team
Asked: June 2, 2026

I have some documents to index. That means I need to read the documents, extract the words, and index them by storing which document each word appears in and at which position.

For each word initially I am creating a separate file. Consider 2 documents:

document 1

The Problem of Programming Communication with

document 2

Programming of Arithmetic Operations

So there will be 10 words, 8 unique. So I create 8 files.

the
problem
of
programming
communication
with
arithmetic
operations

In each file I will store which document the word appears in and at what position. The actual structure I am implementing holds a lot more information, but this basic structure will serve the purpose.

file name        file content
the              1 1
problem          1 2
of               1 3 2 2
programming      1 4 2 1
communication    1 5
with             1 6
arithmetic       2 3
operations       2 4

Meaning: the word "of", for example, is located at the 3rd position of the 1st document and at the 2nd position of the 2nd document.

After the initial index is done I will concatenate all the files into a single index file and in another file I store the offset where a particular word will be found.

index file:

1 1 1 2 1 3 2 2 1 4 2 1 1 5 1 6 2 3 2 4

offset file:

the 1 problem 3 of 5 programming 9 communication 13 with 15 arithmetic 17 operations 19

So if I need the index info for "communication", I will go to the 13th position of the index file and read up to (but excluding) the 15th position, which is the offset of the next word.
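The concatenation and lookup described above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function names `build_index` and `lookup` are made up for the example, the flat index is held in a list instead of a file, and offsets are 1-based to match the numbers above.

```python
# Postings per word as (document, position) pairs, in the order used above.
postings = {
    "the": [(1, 1)],
    "problem": [(1, 2)],
    "of": [(1, 3), (2, 2)],
    "programming": [(1, 4), (2, 1)],
    "communication": [(1, 5)],
    "with": [(1, 6)],
    "arithmetic": [(2, 3)],
    "operations": [(2, 4)],
}

def build_index(postings):
    """Concatenate all postings into one flat list and record 1-based offsets."""
    index, offsets = [], {}
    for word, entries in postings.items():
        offsets[word] = len(index) + 1  # where this word's entry starts
        for doc, pos in entries:
            index.extend([doc, pos])
    return index, offsets

def lookup(index, offsets, word):
    """Return the (document, position) pairs stored for a word."""
    start = offsets[word]
    # The entry ends where the next entry starts, or at the end of the index.
    later = [o for o in offsets.values() if o > start]
    end = min(later) if later else len(index) + 1
    flat = index[start - 1:end - 1]
    return list(zip(flat[0::2], flat[1::2]))

index, offsets = build_index(postings)
print(offsets["communication"])                 # 13
print(lookup(index, offsets, "communication"))  # [(1, 5)]
```

The sketch also makes the limitation visible: adding one posting for an early word shifts every later offset, so both structures have to be rebuilt.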

This is all fine for static indexing. But if I change a single index entry, the whole file will need to be rewritten. Can I use a B-tree as the index file’s structure, so that I can dynamically change the file content and update the offsets? If so, can someone point me to a tutorial or library that shows how this works, or explain a bit about how I can implement it?

Thank you very much for taking the time to read such a long post.

EDIT: I was not aware of the difference between B-tree and binary tree. So I asked the question originally using binary tree. It is fixed now.

1 Answer

  1. Editorial Team
    Added an answer on June 2, 2026 at 3:27 pm

    Basically you’re trying to build an inverted index. Why is it necessary to use so many files? You could use a persistent object and dictionaries to do the job for you. Later, when an index changes, you just reload the persistent object and change a given entry and re-save the object.

    Here’s some example code that does that:

    import shelve
    
    DOCS = {
        'DOC1': "The problem of Programming Communication with",
        'DOC2': "Programming of Arithmetic Operations",
    }
    
    inverted_index = {}
    
    # Record (document, 1-based word position) for every word occurrence.
    # Working on the split words avoids substring matches and also keeps
    # repeated occurrences of a word within one document.
    for name, text in DOCS.items():
        for position, word in enumerate(text.lower().split(), start=1):
            inverted_index.setdefault(word, []).append((name, position))
    
    # Saving to persistent object
    with shelve.open('temp.db') as inverted_index_file:
        inverted_index_file['1'] = inverted_index
    

    Then you can see the saved object like this (and you can modify it using the same strategy):

    >>> import shelve
    >>> t = shelve.open('temp.db')['1']
    >>> print(t)
    {'the': [('DOC1', 1)], 'problem': [('DOC1', 2)], 'of': [('DOC1', 3), ('DOC2', 2)], 'programming': [('DOC1', 4), ('DOC2', 1)], 'communication': [('DOC1', 5)], 'with': [('DOC1', 6)], 'arithmetic': [('DOC2', 3)], 'operations': [('DOC2', 4)]}
    

    My point is that once you have built this, your other code can keep the shelve object in memory as a dictionary while it runs and change it dynamically.
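A minimal sketch of such a dynamic update is shown below. It is a variation on the code above, not the same layout: it assumes each word is stored under its own shelve key, so that a single word can be changed without re-saving the whole dictionary. The helper `add_posting`, the temporary path, and the document name `DOC3` are made up for the illustration.

```python
import os
import shelve
import tempfile

# Hypothetical location in a temporary directory, so no real data is touched.
path = os.path.join(tempfile.mkdtemp(), "index_demo")

# One key per word instead of one key for the whole index.
with shelve.open(path) as db:
    db["of"] = [("DOC1", 3), ("DOC2", 2)]
    db["programming"] = [("DOC1", 4), ("DOC2", 1)]

def add_posting(path, word, doc, position):
    """Append one (doc, position) pair to the entry stored for a word."""
    with shelve.open(path) as db:
        entries = db.get(word, [])
        entries.append((doc, position))
        db[word] = entries  # reassign so the change is written back

add_posting(path, "of", "DOC3", 7)       # existing word, new document
add_posting(path, "sorting", "DOC3", 1)  # word not seen before

with shelve.open(path) as db:
    updated = dict(db)

print(updated["of"])  # [('DOC1', 3), ('DOC2', 2), ('DOC3', 7)]
```

Note the explicit reassignment `db[word] = entries`: without `writeback=True`, mutating a list fetched from a shelf in place does not persist the change.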

    If that does not suit you, then I would suggest using a database, especially sqlite3, because it is lightweight.
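As a rough sketch of the database route, the two example documents could be indexed with the standard-library `sqlite3` module as shown below. The table name `postings`, its columns, and the helper `lookup` are assumptions made for this example. SQLite stores its indexes as B-trees, so this also gives the B-tree behaviour asked about in the question without implementing one by hand.

```python
import sqlite3

# In-memory database for the sketch; pass a file path instead to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (word TEXT, doc INTEGER, position INTEGER)")
# Index on the word column so lookups do not scan the whole table.
conn.execute("CREATE INDEX idx_postings_word ON postings (word)")

docs = {
    1: "The Problem of Programming Communication with",
    2: "Programming of Arithmetic Operations",
}

# One row per word occurrence: (word, document number, 1-based position).
rows = [
    (word, doc_id, pos)
    for doc_id, text in docs.items()
    for pos, word in enumerate(text.lower().split(), start=1)
]
conn.executemany("INSERT INTO postings VALUES (?, ?, ?)", rows)
conn.commit()

def lookup(conn, word):
    """Return the (doc, position) pairs for a word, ordered by document."""
    cur = conn.execute(
        "SELECT doc, position FROM postings WHERE word = ? ORDER BY doc, position",
        (word,),
    )
    return cur.fetchall()

print(lookup(conn, "programming"))  # [(1, 4), (2, 1)]

# Changing one entry touches only the affected rows, not the whole index.
conn.execute("DELETE FROM postings WHERE word = ? AND doc = ?", ("of", 2))
conn.commit()
print(lookup(conn, "of"))  # [(1, 3)]
```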
