In linux environment I would need to remove duplicate images by md5 of the

Asked: June 17, 2026 (2026-06-17T11:00:26+00:00)

In a Linux environment I need to remove duplicate images, identified by the md5 of the file. Before deleting, I want to write a CSV-style list to a file, in the form:

Deleted File -> Linked First File
Deleted File -> Linked File

Etc.

The problem is that I have a structure of

Main Folder
Subfolder
Sub-Sub Folder
Sub-Sub-Sub Folder
Images

with more than 200,000 files.

So the script needs to be fast and must not hang.

Which direction would you suggest?

I have Ubuntu at hand.

UPDATE:

I have found a script which, with a small modification, does what I need. It finds the md5 duplicates and removes them. The only step still missing is to write a file listing each removed file -> the duplicate that stays.

#!/bin/bash

DIR="/home/gevork/Desktop/webserver/maps.am/all_tiles/dubai_test"
SUMS="$DIR/sums-sorted.txt"

# Hash every file; "+" passes many files to each md5sum call.
find "$DIR" -type f -exec md5sum {} + | sort > "$SUMS"

OLDSUM=""
# Each line is "<md5>  <path>"; after sorting, equal sums are adjacent.
while IFS= read -r i; do
 NEWSUM="${i%% *}"
 NEWFILE="${i#*  }"
 if [ "$OLDSUM" == "$NEWSUM" ]; then
  # Dry run: prints the command instead of deleting.
  echo rm  "$NEWFILE"
 else
  OLDSUM="$NEWSUM"
  OLDFILE="$NEWFILE"
 fi
done < "$SUMS"

1 Answer

Editorial Team · Answer 1 · 2026-06-17T11:00:27+00:00

I find Python a nice tool for these tasks, and it is more portable too (although you have restricted the question to Linux). The code below keeps the oldest file among the duplicates, judged by st_ctime (on Linux this is the time of the last inode change rather than a true creation time); if that doesn't matter to you, the code can be simplified. To use it, save it as, for example, "remove_dups.py", and run it as python remove_dups.py startdir. From startdir, it looks at the files in directories that are 3 levels deep and calculates the md5 sum of each file's contents. It stores a list of file names per hash. The text file you are after is printed to stdout, so you actually want to run it as python remove_dups.py startdir > myoutputfile.txt. The first line of this output records the starting directory. For each set of duplicates there is then a line formatted as md5sum: "file1","file2","file3",... where the first file is kept and the others are removed, followed by one removed->kept line per deleted file.

import os
import sys
import glob
import hashlib
from collections import defaultdict

# Written for Python 3.
BIG_ENOUGH_CTIME = 2**63-1

start_dir = sys.argv[1]

hash_file = defaultdict(list)
level3_files = glob.glob(os.path.join(start_dir, "*", "*", "*", "*"))
for name in level3_files:
    try:
        # Open in binary mode: image data is not text.
        with open(name, "rb") as f:
            md5 = hashlib.md5(f.read()).hexdigest()
    except Exception as e:
        sys.stderr.write("Failed for %s. %s\n" % (name, e))
    else:
        # If you don't care about keeping the oldest between the duplicates,
        # the following lines can be simplified.
        try:
            ctime = os.stat(name).st_ctime
        except Exception as e:
            sys.stderr.write("%s\n" % e)
            hash_file[md5].append((BIG_ENOUGH_CTIME, name))
        else:
            hash_file[md5].append((ctime, name))

print("base: %s" % (os.path.abspath(start_dir)))
for md5, l in hash_file.items():
    if len(l) == 1:
        continue

    # Keep the oldest file between the duplicates.
    l = sorted(l)
    name = [data[1] for data in l]

    # md5sum: list of files. The first in the list is kept, the others are
    # removed.
    print("%s: %s" % (md5, ','.join('"%s"' % n for n in name)))

    original = name.pop(0)
    for n in name:
        print("%s->%s" % (n, original))
        sys.stderr.write("Removing %s\n" % n)
        try:
            os.remove(n)
        except Exception as e:
            sys.stderr.write("%s\n" % e)
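
If the images are not all at the same depth, or the run needs to be quicker on 200,000 files, a variation along these lines might help. It is only a sketch under a few assumptions: it walks the whole tree with os.walk instead of a fixed-depth glob, hashes only files that share a size with another file (a file with a unique size cannot have a duplicate), reads files in chunks, and writes the report as a real CSV file. It keeps the first path in sorted order rather than the oldest file. The names file_md5, find_duplicates and write_report, and the "deleted,kept" header, are made up for this example.

```python
import csv
import hashlib
import os
import sys
from collections import defaultdict


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(start_dir):
    """Yield (duplicate, kept) path pairs for files with identical MD5 sums."""
    # Group by size first so that files with a unique size are never hashed.
    by_size = defaultdict(list)
    for root, _dirs, files in os.walk(start_dir):
        for fname in files:
            path = os.path.join(root, fname)
            if os.path.islink(path):
                continue  # skip symlinks; only real files are compared
            try:
                by_size[os.path.getsize(path)].append(path)
            except OSError as e:
                sys.stderr.write("%s\n" % e)

    for paths in by_size.values():
        if len(paths) < 2:
            continue
        kept = {}  # md5 -> first path (in sorted order) with that hash
        for path in sorted(paths):
            try:
                md5 = file_md5(path)
            except OSError as e:
                sys.stderr.write("Failed for %s. %s\n" % (path, e))
                continue
            if md5 in kept:
                yield path, kept[md5]
            else:
                kept[md5] = path


def write_report(start_dir, report_path, delete=False):
    """Write a 'deleted,kept' CSV; remove duplicates only if delete=True."""
    # Scan before creating the report, so the report file itself is never
    # picked up if it lives inside start_dir.
    pairs = list(find_duplicates(start_dir))
    with open(report_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["deleted", "kept"])
        for dup, original in pairs:
            writer.writerow([dup, original])
            if delete:
                try:
                    os.remove(dup)
                except OSError as e:
                    sys.stderr.write("%s\n" % e)
    return len(pairs)
```

Calling write_report("startdir", "removed.csv") only writes the list, which makes it easy to check before anything is deleted; passing delete=True removes the duplicates as well.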
