The Archive Base Latest Questions

Editorial Team
Asked: May 11, 2026 (2026-05-11T17:02:29+00:00)

Further to this question: Algorithm for determining a file’s identity

Recap: I’m looking for a cheap algorithm for determining a file’s identity that works the vast majority of the time.

I went ahead and implemented an algorithm that gives me a “pretty unique” hash per file.

The way my algorithm works is:

  • For files smaller than a certain threshold, I use the file’s full content for the identity hash.

  • For files larger than the threshold, I take N random samples of size X.

  • I include the file size in the hashed data (meaning files with different sizes produce different hashes).
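
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual implementation: the names `identity_hash`, `THRESHOLD`, `NUM_SAMPLES` and `SAMPLE_SIZE`, the 1 MiB cutoff, the use of SHA-1, and the choice to seed the random offsets with the file size (so that the same file always gets the same offsets) are all assumptions.

```python
import hashlib
import os
import random
import struct

THRESHOLD = 1024 * 1024   # assumed cutoff: files up to 1 MiB are hashed in full
NUM_SAMPLES = 4           # N from the question
SAMPLE_SIZE = 8 * 1024    # X from the question


def identity_hash(path):
    """Return a cheap, probabilistic identity hash for the file at `path`."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    # Mix in the size so files of different lengths hash differently
    # (barring a collision in the digest itself).
    h.update(struct.pack("<Q", size))
    with open(path, "rb") as f:
        if size <= THRESHOLD:
            h.update(f.read())
        else:
            # Assumption: seed with the size so offsets are reproducible per file.
            rng = random.Random(size)
            offsets = sorted(rng.randrange(0, size - SAMPLE_SIZE + 1)
                             for _ in range(NUM_SAMPLES))
            for off in offsets:
                f.seek(off)
                h.update(f.read(SAMPLE_SIZE))
    return h.hexdigest()
```

Sorting the offsets keeps the seeks moving forward through the file. If the hashes are persisted, the offsets should probably come from a fixed formula rather than from `random`, since the generated sequence is not guaranteed to stay the same across Python versions.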

Questions:

  • What values should I choose for N and X (how many random samples should I take, and of what size)? I went with 4 samples of 8K each and have not been able to stump the algorithm. I found that increasing the number of samples quickly slows the algorithm down (because seeks are pretty expensive).

  • The maths one: how similar do my files need to be for this algorithm to blow up (two different files with the same length ending up with the same hash)?

  • The optimization one: are there any ways I can optimize my concrete implementation to improve throughput? (I seem to be able to do about 100 files a second on my system.)

  • Does this implementation look sane? Can you think of any real-world examples where this will fail? (My focus is on media files.)

Relevant information:

The algorithm I implemented

Thanks for your help!

1 Answer

  1. Editorial Team
    Added an answer on May 11, 2026 at 5:02 pm
    • Always include the first and last blocks of the file in the hash.

    This is because they’re the most likely to differ from file to file. Consider BMP: it may have a fairly standard header (e.g. an 800×600, 24-bit image, with the rest null), so you may want to overshoot the header a bit to reach the differentiating data. The problem is that headers vary wildly in size.

    The last block is for file formats that append data to the original.

    • Read in blocks of a size that is native to the filesystem you use, or at least divisible by 512.
    • Always read blocks at an offset that is divisible by the block size.
    • If you get the same hash for two files of the same size, do a deep scan (hash all the data) and remember the file path so you don’t scan it again.
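
A rough sketch of how these suggestions might fit together is shown below. Everything named here is hypothetical: `BLOCK` (an assumed 4096-byte filesystem block), `MIDDLE_BLOCKS`, `block_offsets`, `quick_hash`, `full_hash` and `same_file`. The interior blocks are evenly spaced rather than random, which is a simplification made for the example and not something the answer prescribes.

```python
import hashlib
import os

BLOCK = 4096       # assumed filesystem block size; a multiple of 512
MIDDLE_BLOCKS = 2  # hypothetical number of extra interior samples


def block_offsets(size, block=BLOCK, middle=MIDDLE_BLOCKS):
    """Block-aligned offsets: first block, evenly spaced interior blocks, last block."""
    if size == 0:
        return []
    last = (size - 1) // block  # index of the block holding the final byte
    picks = {0, last}
    for i in range(1, middle + 1):
        picks.add(last * i // (middle + 1))
    return [b * block for b in sorted(picks)]


def quick_hash(path):
    """Hash the file size plus the sampled blocks."""
    size = os.path.getsize(path)
    h = hashlib.sha1(str(size).encode())
    with open(path, "rb") as f:
        for off in block_offsets(size):
            f.seek(off)
            h.update(f.read(BLOCK))
    return h.hexdigest()


def full_hash(path):
    """Deep scan: hash every byte, reading in multiples of the block size."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(BLOCK * 256), b""):
            h.update(chunk)
    return h.hexdigest()


_full_cache = {}  # path -> full hash, so each file is deep-scanned at most once


def same_file(path_a, path_b):
    """Compare cheaply first; fall back to a cached deep scan on a quick-hash match."""
    if quick_hash(path_a) != quick_hash(path_b):
        return False
    for p in (path_a, path_b):
        if p not in _full_cache:
            _full_cache[p] = full_hash(p)
    return _full_cache[path_a] == _full_cache[path_b]
```

The cache is keyed by path only, so in this sketch it would go stale if a file were modified after its deep scan; a real index would likely also record the size or modification time.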

    Even then, unless you’re lucky, you will misidentify some files as the same (for example, a SQL Server database file and its 1:1 backup copy after only a few insertions; except that SQL Server does write a timestamp...).
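
To put a rough number on how easily sampling can miss a difference, here is a small estimate under simplifying assumptions: the two files have the same length and differ only inside one contiguous run of bytes that is not near either end, and each sample offset is drawn independently and uniformly. The function name `miss_probability` and its parameters are made up for this illustration.

```python
def miss_probability(file_size, sample_size, num_samples, diff_len):
    """Chance that every sample misses one contiguous run of differing bytes.

    Assumes each sample offset is drawn independently and uniformly from
    [0, file_size - sample_size], and that the differing run of `diff_len`
    bytes lies at least `sample_size` bytes away from either end of the file.
    """
    positions = file_size - sample_size + 1  # possible sample offsets
    hitting = diff_len + sample_size - 1     # offsets whose sample overlaps the run
    p_hit = min(1.0, hitting / positions)
    return (1.0 - p_hit) ** num_samples


# 1 GiB file, 4 samples of 8K, files differ in a single 1 MiB region
print(miss_probability(2**30, 8 * 1024, 4, 2**20))
```

Under these assumptions the result is about 0.996: four 8K samples will almost always miss a 1 MiB difference in a 1 GiB file. Sampling on its own only separates files that differ across most of their length, which is why the first block, the last block and the deep-scan fallback carry so much of the weight.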
