The Archive Base

Editorial Team
Asked: June 6, 2026 (2026-06-06T05:47:42+00:00)

I am building a data structure that helps indexing a collection of S documents

I am building a data structure to index a collection of S documents of total length n so that it supports the following query: given two words P1 and P2, count all the documents that contain P1 but not P2. The answer must be complete (no missed results).

I’ve built a generalized suffix tree, picked every sqrt(n)-th leaf together with its ancestors, and deleted every node that has only one child. For each internal node v that remains, I precompute the answer to the query against every other remaining node u.

With this, if the query words correspond to picked nodes v and u, I get the answer in O(1). But what can I do when one of the words does not correspond to a node that was picked?

I could do this easily with an O(n^2)-space data structure that precomputes every possible answer for O(1) retrieval, but the goal is to build the data structure in O(n) space and make the queries as efficient as possible.

1 Answer

  1. Editorial Team
    Added an answer on June 6, 2026 at 5:47 am (2026-06-06T05:47:45+00:00)

    It sounds like an inverted index would still be useful to you. It is a map from words to ordered lists of the documents containing them. The documents need a common total ordering, and they appear in that order in each per-word bucket.
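As a rough illustration of the structure described above, here is a minimal Python sketch. It assumes documents are plain strings, that words are split on whitespace after lowercasing, and that a document's position in the input list serves as its ID (which supplies the common total ordering). The name `build_inverted_index` is hypothetical and not from the original answer.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to a sorted list of IDs of the documents containing it.

    Assumption: a document's ID is its position in `documents`.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Use a set so each document is recorded at most once per word.
        for word in set(text.lower().split()):
            index[word].append(doc_id)
    # IDs were appended in increasing order, so every list is already sorted.
    return dict(index)


docs = ["the quick brown fox", "the lazy dog", "quick dog"]
index = build_inverted_index(docs)
print(index["quick"])  # [0, 2]
print(index["dog"])    # [1, 2]
```

A real index would likely need a proper tokenizer and a more compact list representation; this only shows the shape of the mapping.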

    Assuming your n is total length of the corpus in word occurrences (and not vocabulary size), it can be constructed in O(n log n) time and linear space.

    Given P1 and P2, you make two separate queries to get the documents containing the two terms respectively. Since the two lists share a common ordering, you can do a linear merge-like algorithm and select just those documents containing P1 but not P2:

    c1 <- cursor to first element of list of docs containing P1
    c2 <- cursor to first element of list of docs containing P2
    results <- [] # our return value

    while c1 not exhausted
      if c2 exhausted or *c1 < *c2
        # *c1 cannot occur later in P2's list: keep it
        results.append(*c1)
        c1++
      else if *c1 == *c2
        # document contains both terms: skip it
        c1++
        c2++
      else # *c1 > *c2
        # c2 is the lagging cursor: advance it
        c2++

    return results
    

    Notice that every pass of the loop advances at least one cursor, so it runs in time linear in the sum of the sizes of the two initial query results. Since only elements from the c1 cursor enter results, we know all results contain P1.

    Finally, note we always advance only the “lagging” cursor: this (and the total document ordering) guarantees that if a document appears in both initial queries, there will be a loop iteration where both cursors point to that document. When this iteration occurs, the middle clause necessarily kicks in and the document is skipped by advancing both cursors. Thus documents containing P2 necessarily do not get added to results.
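The merge described above could be written in Python roughly as follows. This is a sketch that assumes each postings list is a sorted Python list of unique, comparable document IDs; the function names are hypothetical. Since the question asks for a count, a small counting wrapper is included.

```python
def contains_first_not_second(docs_p1, docs_p2):
    """Return the IDs present in docs_p1 but absent from docs_p2.

    Assumption: both inputs are sorted lists of unique document IDs.
    """
    results = []
    i = j = 0
    while i < len(docs_p1):
        if j == len(docs_p2) or docs_p1[i] < docs_p2[j]:
            # docs_p1[i] cannot occur later in docs_p2: keep it.
            results.append(docs_p1[i])
            i += 1
        elif docs_p1[i] == docs_p2[j]:
            # The document contains both terms: skip it.
            i += 1
            j += 1
        else:
            # docs_p2's cursor is lagging: advance it.
            j += 1
    return results


def count_first_not_second(docs_p1, docs_p2):
    """Count the documents containing P1 but not P2."""
    return len(contains_first_not_second(docs_p1, docs_p2))


print(contains_first_not_second([0, 2, 5, 7], [2, 3, 7, 9]))  # [0, 5]
print(count_first_not_second([0, 2, 5, 7], [2, 3, 7, 9]))     # 2
```

If only the count is needed, the list could be replaced by an integer counter to avoid allocating the result.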

    This query is an example of a general class called Boolean queries; it’s possible to extend this algorithm to cover almost any Boolean expression. Certain queries break the efficiency of the algorithm (by forcing it to walk over the entire vocabulary space), but as long as you don’t negate every term (i.e. don’t ask for not P1 and not P2) you’re fine. See this for an in-depth treatment.
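To illustrate how the same cursor-based merge might extend to other Boolean operators, here is a hedged sketch of AND and OR over sorted lists of document IDs, plus a fully negated query. The function names are hypothetical, and the sketch assumes document IDs are the integers 0 to `num_docs - 1`. The fully negated case shows the efficiency problem: its cost is no longer bounded by the two postings lists, because every document ID has to be examined.

```python
def intersect(a, b):
    """P1 AND P2: IDs present in both sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """P1 OR P2: IDs present in either sorted list, without duplicates."""
    out, i, j = [], 0, 0
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i])
            i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:  # equal IDs: emit once, advance both cursors
            out.append(a[i])
            i += 1
            j += 1
    return out


def neither(a, b, num_docs):
    """NOT P1 AND NOT P2: requires a scan over all document IDs."""
    excluded = set(union(a, b))
    return [d for d in range(num_docs) if d not in excluded]


print(intersect([0, 2, 5, 7], [2, 3, 7, 9]))  # [2, 7]
print(union([0, 2, 5, 7], [2, 3, 7, 9]))      # [0, 2, 3, 5, 7, 9]
print(neither([0, 2], [2, 3], 5))             # [1, 4]
```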

© 2021 The Archive Base. All Rights Reserved