The Archive Base

Editorial Team
Asked: May 31, 2026

I have a system with roughly 100 million documents, and I’d like to keep track of their modifications between mirrors. To exchange information about modifications efficiently, I want to send information about modified documents per day, not per individual document. Something like this:

[ 2012/03/26, cs26],
[ 2012/03/25, cs25],
[ 2012/03/24, cs24],
...

where each cs is the checksum of timestamps of all documents created on a particular day.
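
For illustration, a mirror receiving such a list could compare it with its own per-day checksums to find the days that need a closer look. This is only a sketch: the helper name `daysToResync`, the plain-object layout, and the placeholder checksum strings are assumptions, not part of the system described here.

```javascript
// Hypothetical helper: given two maps of day -> checksum, return the days
// whose checksums differ or that exist on only one side.
function daysToResync(local, remote) {
  const days = new Set([...Object.keys(local), ...Object.keys(remote)]);
  const differing = [];
  for (const day of days) {
    if (local[day] !== remote[day]) {
      differing.push(day);
    }
  }
  return differing.sort();
}

// Example data; the checksum strings are placeholders.
const local = { "2012/03/24": "cs24", "2012/03/25": "cs25", "2012/03/26": "cs26" };
const remote = { "2012/03/24": "cs24", "2012/03/25": "cs25-changed", "2012/03/26": "cs26" };

console.log(daysToResync(local, remote)); // [ '2012/03/25' ]
```

Only the days returned here would then need a per-document comparison.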

Now, the problem I’m running into is that I don’t know of an algorithm that can “subtract” data from the checksum when a document is deleted. None of the cryptographic hashes fit the need, for obvious reasons, and I couldn’t find any CRC algorithm that would do this.

One option I considered was to have deletes add extra information to the hash, but this would lead to even more problems: nodes can receive delete requests in different orders, and when a node restarts it re-reads all the timestamps from the documents, so the information about the deletes would be lost.

I would also rather not keep a hash tree of all document hashes in memory, as that would use roughly 8 GB, which seems like overkill for this need alone.

For now, the best option seems to be regenerating these hashes completely from time to time in the background, but that is a lot of needless overhead and wouldn’t provide immediate information on changes.

So, do you guys know of a checksum algorithm that would let me “remove” some data from the checksum? I need the algorithm to be reasonably fast, and the checksum should strongly reflect even the smallest change (that’s why I can’t really use plain XOR).
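
To illustrate the concern with plain XOR, here is a minimal sketch, assuming timestamps are whole seconds since the epoch (the values are made up for the example). Different sets of documents can produce the same checksum, and two documents with the same timestamp cancel each other out entirely.

```javascript
// Plain XOR over raw timestamps (example values, seconds since the epoch).
function xorAll(timestamps) {
  return timestamps.reduce((acc, t) => acc ^ t, 0);
}

const setA = [1332720000, 1332720001];
const setB = [1332720002, 1332720003];

console.log(xorAll(setA)); // 1
console.log(xorAll(setB)); // 1, same checksum for a different set
console.log(xorAll([1332720000, 1332720000])); // 0, identical timestamps cancel
```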

Or maybe you have better ideas about the whole design?

1 Answer

  1. Editorial Team
    Added an answer on May 31, 2026 at 7:43 pm

    How about

    hash = X(documents, 0, function(document) { ... })
    

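    where X is an aggregate XOR (JavaScript-style sketch follows):

    function X(documents, x, f)
    {
       // Fold each per-document hash f(document) into the running value x.
       // Note: ^ on Numbers works on 32-bit integers, so f should return one
       // (or use BigInt values throughout for wider checksums).
       for (const document of documents)
       {
          x ^= f(document);
       }
       return x;
    }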
    

    and f() is a hash of each individual document’s information (timestamp, filename, ID, or whatever identifies it)?

    Because XOR is its own inverse, XORing a document’s hash into the checksum a second time “subtracts” that document, and because XOR is commutative, the order in which adds and deletes arrive doesn’t matter. Hashing on a per-document basis preserves the hash-like quality of detecting small changes.
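
As a rough, runnable sketch of this idea under Node.js: the per-document hash below uses SHA-256 truncated to 64 bits, and the document fields `id` and `timestamp` as well as the helper name `toggle` are assumptions made for the example, not something the question specifies.

```javascript
const crypto = require("crypto");

// f(): hash one document's identifying information down to a 64-bit BigInt.
// Combining id and timestamp is an assumption; any stable per-document fields work.
function f(doc) {
  const digest = crypto
    .createHash("sha256")
    .update(`${doc.id}:${doc.timestamp}`)
    .digest();
  return digest.readBigUInt64BE(0);
}

// XOR is its own inverse, so the same operation both adds and removes a document.
function toggle(checksum, doc) {
  return checksum ^ f(doc);
}

const a = { id: "doc-1", timestamp: 1332720000 };
const b = { id: "doc-2", timestamp: 1332720000 };
const c = { id: "doc-3", timestamp: 1332723600 };

let checksum = 0n;
for (const doc of [a, b, c]) checksum = toggle(checksum, doc); // add a, b, c
checksum = toggle(checksum, b);                                 // delete b

// Rebuilding from the surviving documents gives the same value, in any order.
const rebuilt = [c, a].reduce(toggle, 0n);
console.log(checksum === rebuilt); // true
```

One caveat with this approach: two documents whose hashed fields are identical cancel each other out, so it helps to include something unique per document (such as an ID) in what f() hashes.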

© 2021 The Archive Base. All Rights Reserved