Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 516485
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 13, 20262026-05-13T07:44:10+00:00 2026-05-13T07:44:10+00:00

A project I’m working on requires detection of duplicate files. Under normal circumstances I

  • 0

A project I’m working on requires detection of duplicate files. Under normal circumstances I would simply compare the file bytes in blocks or hash value of the entire file contents. However, the system does not have access to the entire file – only the first 50KB or so. It also knows the total file size of the original file.

I was thinking of implementing the following: each time a file is added, I would look for possible duplicates using both the total file size and a hash calculation of (file-size)+(first-20KB-of-file). The hash algorithm itself is not the issue at this stage, but will likely be MurmurHash2.

Another option is to also store, say, bytes 9000 through 9020 and use that as a third condition when looking up a duplicate copy or alternatively to compare byte-to-byte when the aforementioned lookup method returns possible duplicates in a last attempt to discard false positives.

How naive is my proposed implementation? Is there a reliable way to predict the amount of false positives? What other considerations should I be aware of?

Edit: I forgot to mention that the files are generally going to be compressed archives (ZIP,RAR) and on occasion JPG images.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-13T07:44:10+00:00Added an answer on May 13, 2026 at 7:44 am

    You can use file size, hashes and partial-contents to quickly detect when two files are different, but you can only determine if they are exactly the same by comparing their contents in full.

    It’s up to you to decide whether the failure rate of your partial-file-check will be low enough to be acceptable in your specific circumstances. Bearing in mind that even an “exceedingly unlikely” event will happen frequently if you have enough volume. But if you know the type of data that the files will contain, you can judge the chances of two near-identical files (idenitcal in the first 50kB) cropping up.

    I would think that if a partial-file-match is acceptable to you, then a hash of those partial file contents is probably going to be pretty acceptable too.

    If you have access to 50kB then I’d use all 50kB rather than only the first 20kB in my hash.

    Picking an arbitrary 20 bytes probably won’t help much (your file contents will either be very different in which case hash+size clashes will be unlikely, or they will be very similar in which case the chances of a randomly chosen 20 bytes being different will be quite low)

    In your circumstances I would check the size, then a hash of the available daa (50kB), then if this suggests a file match, a brute-force comparison of the available data just to minimise the risks, if you don’t expect to be adding so many duplicates that this would bog the system down.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

Project file here if you want to download: http://files.me.com/knyck2/918odc So I am working on
The project i am working at right now requires some declarative way of defining
My project has two cpp files and one header file. One cpp file contains
Project: http://design.vitalbmx.com/fave/news.html When I click Add to favorites button (under main pic), in Chrome
The project that I am working on (Node.js) implies lots of operations with the
My project directory contains a CMakeLists.txt file and src and include directories at its
Project I'm working on uses jQuery. I have a series of Ajax calls being
Project Euler has a paging file problem (though it's disguised in other words). I
Current project is an Mvc4 application, I had Ioc working and recently it just
Project is to create exe file. If we run exe file it will open

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.