The Archive Base


Editorial Team
Asked: May 12, 2026

I have the very common problem of creating an index for an on-disk array

I have the very common problem of creating an index for an on-disk array of strings. In short, I need to store the position of each string in the on-disk representation. For example, a very naive solution would be an index array as follows:

uint64 idx[] = { 0, 20, 500, 1024, …, 103434 };

Which says that the first string is at position 0, the second at position 20, the third at position 500 and the nth at position 103434.
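As a minimal sketch of how such a naive index would be used (not code from the question): assuming the array carries one extra trailing entry holding the total size of the data file, the length of string i is the gap between two neighboring positions. The function name `string_length` and the trailing sentinel entry are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte length of string i.
 * Assumes idx holds n + 1 increasing offsets, where idx[n] is the
 * total size of the data file (a sentinel not present in the question). */
uint64_t string_length(const uint64_t *idx, size_t n, size_t i)
{
    assert(i < n);
    return idx[i + 1] - idx[i];
}
```

At 8 bytes per entry, this plain array is the baseline that any compressed encoding has to beat.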

The positions are always non-negative 64-bit integers in increasing order. Although consecutive positions could differ by any amount, in practice I expect the typical difference to fall in the range from 2^8 to 2^20. I expect this index to be mmap’ed into memory, and the positions will be accessed randomly (assume a uniform distribution).

I was thinking about writing my own code for doing some sort of block delta encoding or other more sophisticated encoding, but there are so many different trade-offs between encoding/decoding speed and space that I would rather get a working library as a starting point and maybe even settle for something without any customizations.

Any hints? A C library would be ideal, but a C++ one would also allow me to run some initial benchmarks.

A few more details if you are still following. This will be used to build a library similar to cdb (http://cr.yp.to/cdb/cdbmake.html) on top of the cmph library (http://cmph.sf.net). In short, it is for a large disk-based read-only associative map with a small index in memory.

Since it is a library, I don’t have control over the input, but the typical use case that I want to optimize for has hundreds of millions of values, an average value size in the range of a few kilobytes, and a maximum value size of 2^31.

For the record, if I don’t find a ready-to-use library, I intend to implement delta encoding in blocks of 64 integers, with the initial bytes of each block specifying the offset accumulated so far. The blocks themselves would be indexed with a tree, giving me O(log(n/64)) access time. There are way too many other options, and I would prefer not to discuss them. I am really looking for ready-to-use code rather than ideas on how to implement the encoding. I will be glad to share what I did with everyone once I have it working.
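The block scheme described above could be sketched roughly as follows. This is a simplified illustration, not the poster's implementation: it assumes every gap between neighboring positions fits in 32 bits (consistent with the stated 2^31 maximum value size), and it uses fixed-width deltas, so blocks can be addressed directly by i / 64 and no tree is needed. The names `delta_block`, `delta_block_count`, `delta_encode`, and `delta_get` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 64  /* positions per block, as in the plan above */

/* Hypothetical layout: one absolute base plus 63 gaps per block. */
typedef struct {
    uint64_t base;                  /* absolute position of the block's first entry */
    uint32_t delta[BLOCK_LEN - 1];  /* delta[j] = pos[j + 1] - pos[j] inside the block */
} delta_block;

/* Number of blocks needed to hold n positions. */
size_t delta_block_count(size_t n)
{
    return (n + BLOCK_LEN - 1) / BLOCK_LEN;
}

/* Encode n non-decreasing positions into blocks[0 .. delta_block_count(n) - 1].
 * Returns 0 on success, -1 if the input decreases or a gap exceeds 32 bits. */
int delta_encode(const uint64_t *pos, size_t n, delta_block *blocks)
{
    for (size_t i = 0; i < n; i++) {
        delta_block *b = &blocks[i / BLOCK_LEN];
        size_t j = i % BLOCK_LEN;
        if (j == 0) {
            b->base = pos[i];  /* first entry of a block is stored in full */
        } else {
            if (pos[i] < pos[i - 1] || pos[i] - pos[i - 1] > UINT32_MAX)
                return -1;
            b->delta[j - 1] = (uint32_t)(pos[i] - pos[i - 1]);
        }
    }
    return 0;
}

/* Random access: start at the block base and add up to 63 gaps.
 * The caller must pass an index i below the encoded count n. */
uint64_t delta_get(const delta_block *blocks, size_t i)
{
    const delta_block *b = &blocks[i / BLOCK_LEN];
    uint64_t pos = b->base;
    for (size_t j = 0; j < i % BLOCK_LEN; j++)
        pos += b->delta[j];
    return pos;
}
```

With this layout each position costs a little over 4 bytes instead of 8, at the price of summing up to 63 deltas per lookup. A variable-length encoding of the deltas would save more space, but the blocks would then vary in size and need a separate block index, which is where the tree mentioned above would come in.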

I appreciate your help and let me know if you have any doubts.


1 Answer

  1. Editorial Team
    Added an answer on May 12, 2026 at 5:07 am

    I use FastBit (Kesheng Wu, LBL.GOV). It seems you need something good, fast, and available now, and FastBit is a highly competent improvement on Oracle’s BBC (byte-aligned bitmap code, berkeleydb). It’s easy to set up and very good generally.

    However, given more time, you may want to look at a Gray code solution; it seems optimal for your purposes.

    Daniel Lemire has a number of libraries for C/C++/Java released on code.google. I’ve read over some of his papers and they are quite nice: several advancements on FastBit and alternative approaches for column reordering with permuted Gray codes.

    Almost forgot: I also came across Tokyo Cabinet. Though I do not think it is well suited to my current project, I might have considered it more if I had known about it earlier ;). It has a large degree of interoperability:

    Tokyo Cabinet is written in the C
    language, and provided as API of C,
    Perl, Ruby, Java, and Lua. Tokyo
    Cabinet is available on platforms
    which have API conforming to C99 and
    POSIX.

    Since you referred to CDB: the TC benchmark has a TC mode (TC supports several operational constraints for varying performance) in which it surpassed CDB by 10 times for read performance and 2 times for write.

    With respect to your delta encoding requirement, I am quite confident in bsdiff and its ability to outperform any file.exe content patching system; it may also have some fundamental interfaces for your general needs.

    Google’s new binary compression application, Courgette, may be worth checking out in case you missed the press release: 10x smaller diffs than bsdiff in the one test case I have seen published.
