The Archive Base

Editorial Team
Asked: May 11, 2026

For an open source project of mine, I am writing an abstraction layer on top of the filesystem.

This layer allows me to attach metadata and relationships to each file.

I would like the layer to handle file renames gracefully and to keep the metadata when a file is renamed, moved, or copied.

To do this I need a mechanism for calculating the identity of a file. The obvious solution is to calculate a SHA-1 hash of each file and then assign metadata against that hash. But that is really expensive, especially for movies.

So I have been thinking of an algorithm that, though not 100% correct, will be right the vast majority of the time and is cheap.

One such algorithm could be to use file size and a sample of bytes for that file to calculate the hash.

Which bytes should I choose for the sample? How do I keep the calculation cheap and reasonably accurate? I understand there is a tradeoff here, but performance is critical. And the user will be able to handle situations where the system makes mistakes.

I need this algorithm to work for both very large files (1 GB+) and tiny files (5 KB).

EDIT

I need this algorithm to work on NTFS and on all SMB shares (Linux or Windows based). I would like it to support situations where a file is copied from one location to another (the two physical copies are treated as one identity). I may even want this to work when MP3s are re-tagged (the physical file changes), so I may end up with an identity provider per file type.

EDIT 2

Related question: Algorithm for determining a file’s identity (Optimisation)

1 Answer

  1. Added an answer on May 11, 2026 at 1:37 am

    How about storing some random integers r_i once, and then looking up the bytes at offsets (r_i mod n), where n is the size of the file? For files with headers, you can skip the header first and then apply this process to the remaining bytes.
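
    A minimal sketch of this idea in Python, under some assumptions that are not in the original answer: the names `RANDOM_INTS` and `sampled_identity`, the count of 64 integers, the fixed seed, and the choice to fold the file size and sampled bytes into a SHA-1 digest are all illustrative.

```python
import hashlib
import os
import random

# Assumption: 64 random integers generated once from a fixed seed, so every
# run (and every machine sharing this list) samples the same offsets.
_RNG = random.Random(12345)
RANDOM_INTS = [_RNG.getrandbits(63) for _ in range(64)]


def sampled_identity(path, header_len=0):
    """Hash the file size plus the bytes at offsets header_len + (r_i mod n)."""
    size = os.path.getsize(path)
    n = size - header_len  # bytes remaining after the skipped header
    digest = hashlib.sha1(str(size).encode("ascii"))
    if n <= 0:
        # Nothing to sample (empty file or header only): identity is size alone.
        return digest.hexdigest()
    # Deduplicate and sort so the disk is read in one forward pass.
    offsets = sorted({header_len + r % n for r in RANDOM_INTS})
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            digest.update(f.read(1))
    return digest.hexdigest()
```

    The stored integers have to stay the same for as long as identities are compared, since a different list would sample different offsets and produce unrelated identities for the same file.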

    If your files are actually pretty different (not just a difference in a single byte somewhere, but say at least 1% different), then a random selection of bytes would notice that. For example, with a 1% difference in bytes, 100 random bytes would all miss the differing bytes with probability 0.99^100 ≈ 1/e ≈ 37%; increasing the number of bytes you look at makes this probability go down exponentially.
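
    The arithmetic can be checked with a few lines of Python. This sketch assumes each sampled byte is an independent, uniformly random pick, so each pick lands on an unchanged byte with probability 1 - p; the function name `miss_probability` is made up for illustration.

```python
def miss_probability(diff_fraction, samples):
    """Chance that every one of `samples` independent uniform byte picks
    lands on a byte that is identical in both files."""
    return (1.0 - diff_fraction) ** samples


# With 1% of bytes differing, compare a few sample counts.
for k in (100, 500, 1000):
    print(k, miss_probability(0.01, k))
```

    With 100 samples the result is roughly 0.37, matching the 1/e estimate, and it falls below one in ten thousand by 1000 samples.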

    The idea behind using random bytes is that they are essentially guaranteed (well, probabilistically speaking) to be as good as any other sequence of bytes, except they aren’t susceptible to some of the problems with other sequences (e.g. happening to look at every 256th byte of a file format where that byte is required to be 0 or something).

    Some more advice:

    • Instead of grabbing bytes, grab larger chunks to justify the cost of seeking.
    • I would suggest always looking at the first block or so of the file. From this, you can determine filetype and such. (For example, you could use the file program.)
    • At least weigh the cost/benefit of something like a CRC of the entire file. It’s not as expensive as a real cryptographic hash function, but still requires reading the entire file. The upside is it will notice single-byte differences.
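
    The advice above could be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the 4096-byte `CHUNK` size, the 16 stored integers, the fixed seed, and the function names `chunked_identity` and `full_crc32` are all hypothetical. Only `hashlib`, `os`, `random`, and `zlib.crc32` from the Python standard library are used.

```python
import hashlib
import os
import random
import zlib

CHUNK = 4096  # assumed block size; tune to the storage being read
_RNG = random.Random(2024)
CHUNK_SEEDS = [_RNG.getrandbits(63) for _ in range(16)]  # stored once, reused


def chunked_identity(path):
    """Hash the size, the first block, and a fixed pseudo-random set of blocks."""
    size = os.path.getsize(path)
    digest = hashlib.sha1(str(size).encode("ascii"))
    blocks = (size + CHUNK - 1) // CHUNK  # number of blocks, rounded up
    with open(path, "rb") as f:
        digest.update(f.read(CHUNK))  # always look at the first block
        if blocks > 1:
            for index in sorted({r % blocks for r in CHUNK_SEEDS}):
                f.seek(index * CHUNK)
                digest.update(f.read(CHUNK))
    return digest.hexdigest()


def full_crc32(path, bufsize=1 << 20):
    """CRC-32 of the whole file: reads every byte, but notices any
    single-byte difference."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(bufsize)
            if not buf:
                break
            crc = zlib.crc32(buf, crc)
    return crc & 0xFFFFFFFF
```

    One way to weigh the two is to use `chunked_identity` as the cheap first pass and fall back to `full_crc32` only when two files share a sampled identity and a wrong match would be costly.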
