The Archive Base

Editorial Team
Asked: June 17, 2026

In linux environment I would need to remove duplicate images by md5 of the

In a Linux environment, I need to remove duplicate images, identified by the md5 of each file. Before deleting, I want to write to a file a CSV list of:

Deleted File -> Linked First File
Deleted File -> Linked File

Etc.

The problem is that I have a structure of

Main Folder
  Subfolder
    Sub-Sub Folder
      Sub-Sub-Sub Folder
        Images

With more than 200,000 files.

So the script needs to be fast and must not hang.

Which direction would you suggest?

I have Ubuntu at hand.

UPDATE:

I have found a script which, with a small modification, does what I need. It searches for the md5 duplicates and removes them. The only step still needed is to write a file listing each removed file -> the duplicate that stays.

#!/bin/bash

DIR="/home/gevork/Desktop/webserver/maps.am/all_tiles/dubai_test"
SUMS="$DIR/sums-sorted.txt"

# Hash every file (skipping the sums file itself) and sort so that
# identical checksums end up on adjacent lines.
find "$DIR" -type f ! -name sums-sorted.txt -exec md5sum {} + | sort > "$SUMS"

OLDSUM=""
while IFS= read -r i; do
 NEWSUM=`echo "$i" | sed 's/ .*//'`
 NEWFILE=`echo "$i" | sed 's/^[^ ]* *//'`
 if [ "$OLDSUM" == "$NEWSUM" ]; then
  # Dry run: this only prints the rm command; drop "echo" to really delete.
  echo rm  "$NEWFILE"
 else
  # First file seen with this checksum: this is the copy that stays.
  OLDSUM="$NEWSUM"
  OLDFILE="$NEWFILE"
 fi
done < "$SUMS"
1 Answer

  1. Editorial Team
    Added an answer on June 17, 2026 at 11:00 am

    I find Python a nice tool for these tasks, and it is more portable too (although you have restricted the question to Linux). The code below keeps the oldest file among the duplicates, judged by st_ctime (on Linux this is the time of the last inode change, not a true creation time); if that doesn't matter to you, it can be simplified. To use it, save it as, for example, "remove_dups.py", and run it as python remove_dups.py startdir. From startdir, it looks for files inside directories three levels deep and calculates the md5 sum of each file's contents, storing a list of file names per hash.

    The text file you are after is printed to stdout, so you actually want to run it as python remove_dups.py startdir > myoutputfile.txt. The first line of the output records the starting directory. For each set of duplicates there is then a line formatted as md5sum: "file1","file2","file3",... The first of these files is kept and the others are removed, and each removal is also recorded on its own line as removedfile->keptfile.

    # Runs on Python 2.6+ and Python 3.
    from __future__ import print_function

    import os
    import sys
    import glob
    import hashlib
    from collections import defaultdict
    
    # Sentinel ctime for files whose stat() fails, so that they sort last.
    BIG_ENOUGH_CTIME = 2**63-1
    
    start_dir = sys.argv[1]
    
    # Maps each md5 hex digest to a list of (ctime, file name) tuples.
    hash_file = defaultdict(list)
    # Entries four levels below start_dir, i.e. inside third-level directories.
    level3_files = glob.glob(os.path.join(start_dir, "*", "*", "*", "*"))
    for name in level3_files:
        try:
            # Binary mode, so the image data is hashed byte for byte.
            with open(name, "rb") as f:
                md5 = hashlib.md5(f.read()).hexdigest()
        except Exception as e:
            sys.stderr.write("Failed for %s. %s\n" % (name, e))
        else:
            # If you don't care about keeping the oldest between the duplicates,
            # the following lines can be simplified.
            try:
                ctime = os.stat(name).st_ctime
            except Exception as e:
                sys.stderr.write("%s\n" % e)
                hash_file[md5].append((BIG_ENOUGH_CTIME, name))
            else:
                hash_file[md5].append((ctime, name))
    
    print("base: %s" % (os.path.abspath(start_dir)))
    for md5, l in hash_file.items():
        if len(l) == 1:
            continue
    
        # Keep the oldest file between the duplicates.
        l = sorted(l)
        name = [data[1] for data in l]
    
        # md5sum: list of files. The first in the list is kept, the others are
        # removed.
        print("%s: %s" % (md5, ','.join('"%s"' % n for n in name)))
    
        original = name.pop(0)
        for n in name:
            # One "removed->kept" line per deleted file.
            print("%s->%s" % (n, original))
            sys.stderr.write("Removing %s\n" % n)
            try:
                os.remove(n)
            except Exception as e:
                sys.stderr.write("%s\n" % e)
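
If the images do not all sit at the same depth, or you want the log as a real CSV file, the same idea can be sketched with os.walk and chunked reads. This is only a minimal Python 3 sketch under a few assumptions: the names file_md5, find_duplicates and remove_duplicates and the dry_run flag are made up for illustration, and it keeps the first path in sorted order rather than the oldest file. It writes one removed,kept row per duplicate before anything is deleted.

```python
import csv
import hashlib
import os
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(start_dir):
    """Map each MD5 to the sorted list of files (at any depth) that share it."""
    by_hash = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(start_dir):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                by_hash[file_md5(path)].append(path)
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
    return {md5: sorted(paths) for md5, paths in by_hash.items() if len(paths) > 1}


def remove_duplicates(start_dir, csv_path, dry_run=True):
    """Write 'removed,kept' rows to csv_path; delete files only if dry_run is False."""
    rows = []
    for paths in find_duplicates(start_dir).values():
        kept = paths[0]  # assumption: keep the first path in sorted order
        for removed in paths[1:]:
            rows.append((removed, kept))
    rows.sort()
    # The log is written before any deletion, so it survives a failed removal.
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["removed", "kept"])
        writer.writerows(rows)
    if not dry_run:
        for removed, _kept in rows:
            os.remove(removed)
    return rows
```

Calling remove_duplicates("startdir", "removed.csv") first, with dry_run left at its default, lets you inspect the CSV before running it again with dry_run=False. It is probably wise to keep the CSV file outside startdir so it is not hashed along with the images.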
    