The Archive Base

Editorial Team
Asked: May 27, 2026

I am about to embark on a programming journey, which undoubtedly will end in

I am about to embark on a programming journey, which undoubtedly will end in failure and/or throwing my mouse through my Mac, but it’s an interesting problem.

I want to build an app that scans from some base directory, recursively visits each file, and when it finds an exact duplicate, deletes it and puts a symbolic link in its place. Basically poor man’s deduplication. This solves a real problem for me, since I have a bunch of duplicate files on my Mac and I need to free up disk space.

From what I have read, this is the strategy:

  1. Loop through recursively and generate a hash for each file. The hash needs to be extremely unlikely to collide. This is the first problem. What hash should I use? How do I run the entire binary contents of each file through this magical hash?

  2. Store each file’s hash and full path in a key/value store. I’m thinking Redis is an excellent fit because of its speed.

  3. Iterate through the key/value store, find duplicate hashes, delete the duplicate file, create the symbolic link, and flag the row in the key/value store as a copy.

My questions therefore are:

  • What hashing algorithm should I use for each file? How is this done?
  • I’m thinking about using Node.js because Node is generally fast at I/O-type work. The problem is that Node sucks at CPU-intensive stuff, so the hashing will probably be the bottleneck.
  • What other gotchas am I missing here?

1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 5:11 am

    What hashing algorithm should I use for each file? How is this done?

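    Use SHA-1, or SHA-256 if you want extra margin. Git uses SHA-1 to identify file contents. An accidental collision between two ordinary files is astronomically unlikely. Deliberately constructed SHA-1 collisions have been demonstrated (the SHAttered attack in 2017), so if any of the files could be adversarial, prefer SHA-256 or compare candidate duplicates byte for byte before deleting anything.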
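As for "how is this done", here is a minimal sketch in Python, assuming the standard library's hashlib is acceptable. The function name `file_sha1` and the 1 MiB chunk size are illustrative choices, not something from the question. The file is read in chunks so that large files never have to fit in memory.

```python
import hashlib


def file_sha1(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of the file at `path`."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Swapping `hashlib.sha1()` for `hashlib.sha256()` is the only change needed to use SHA-256 instead.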

    I’m thinking about using node.js because node generally is fast at i/o types of things. The problem is that node sucks at CPU intensive stuff, so the hashing will probably be the bottleneck.

    Your application will have two kinds of operations:

    • Reading files (I/O-bound).
    • Calculating hashes (CPU-bound).

    My suggestion: don’t calculate the hash in pure scripting-language code (Ruby or JavaScript) unless the language has a native hashing library; Node’s built-in crypto module is one. Otherwise you can just invoke an external executable such as sha1sum. It’s written in C and should be very fast.

    I don’t think you need Node.js. Node.js is good at event-driven I/O, but it cannot make the disk itself any faster, and I don’t think this job needs event-driven I/O.
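To see which of the two operations actually dominates on a given machine, one option is to time them separately. The sketch below is a rough measurement, assuming Python's hashlib (which wraps a native implementation); the function name `time_read_and_hash` is hypothetical.

```python
import hashlib
import time


def time_read_and_hash(path, chunk_size=1024 * 1024):
    """Return (read_seconds, hash_seconds, hex_digest) for one file."""
    digest = hashlib.sha1()
    read_seconds = 0.0
    hash_seconds = 0.0
    with open(path, "rb") as handle:
        while True:
            started = time.perf_counter()
            chunk = handle.read(chunk_size)      # I/O-bound part
            read_seconds += time.perf_counter() - started
            if not chunk:
                break
            started = time.perf_counter()
            digest.update(chunk)                 # CPU-bound part
            hash_seconds += time.perf_counter() - started
    return read_seconds, hash_seconds, digest.hexdigest()
```

Treat the numbers as indicative only: a file that was read recently is usually served from the operating system's cache, so a second run tends to show much lower read times than a cold one.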

    What other gotchas am I missing here?

    My suggestion: just implement it in a language you are familiar with. Don’t over-engineer too early. Optimize only when you actually hit a performance problem.
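In that spirit, the whole strategy from the question fits in a short script. The sketch below is one possible shape, not a hardened tool, and it makes several assumptions of its own: a plain in-memory dict stands in for the key/value store, existing symlinks are skipped, candidates are compared byte for byte before anything is replaced, and the names `dedupe`, `dry_run`, and the `.dedupe-tmp` suffix are hypothetical.

```python
import filecmp
import hashlib
import os


def file_sha1(path, chunk_size=1024 * 1024):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dedupe(base_dir, dry_run=True):
    """Replace exact duplicates under base_dir with symlinks to the first copy seen.

    Returns a list of (duplicate_path, original_path) pairs. With dry_run=True
    nothing on disk is changed; the list is only a plan.
    """
    first_seen = {}  # digest -> path of the first file with that digest
    replaced = []
    for folder, _subdirs, names in os.walk(base_dir):
        for name in sorted(names):
            path = os.path.join(folder, name)
            # Skip symlinks (including ones made by an earlier run) and non-regular files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            original = first_seen.setdefault(file_sha1(path), path)
            if original == path:
                continue  # first time this content has been seen
            # Hard links to the same data save no space when replaced; leave them alone.
            if os.path.samefile(original, path):
                continue
            # Do not trust the hash alone: confirm the bytes really match.
            if not filecmp.cmp(original, path, shallow=False):
                continue
            replaced.append((path, original))
            if not dry_run:
                # Create the link under a temporary name, then rename it over the
                # duplicate, so the path is never left missing halfway through.
                temp_link = path + ".dedupe-tmp"
                os.symlink(os.path.abspath(original), temp_link)
                os.replace(temp_link, path)
    return replaced
```

Running `dedupe(some_dir)` first and reading the returned plan before calling it again with `dry_run=False` is a cheap safeguard. Note that after the replacement, editing the file through either path changes what both paths show, and deleting the original leaves a dangling link.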
