In a Linux environment I need to remove duplicate images, identified by the MD5 sum of each file. Before deleting anything, I want to write a CSV-style list to a file:
Deleted File -> Linked First File
Deleted File -> Linked File
Etc.
The problem is that I have a structure of
Main Folder
  Subfolder
    Sub-Sub Folder
      Sub-Sub-Sub Folder
        Images
with more than 200,000 files.
So the script needs to be fast and must not hang.
Which direction would you suggest?
I have Ubuntu at hand.
UPDATE:
I have found a script which, with a small modification, does what I need. It finds files with identical MD5 sums and removes the duplicates. The only step still missing is writing a file that lists each removed file -> the duplicate that stays.
#!/bin/bash
DIR="/home/gevork/Desktop/webserver/maps.am/all_tiles/dubai_test"
# Hash every file under DIR and sort so that identical sums end up on adjacent lines
find "$DIR" -type f -exec md5sum {} \; | sort > /home/gevork/Desktop/webserver/maps.am/all_tiles/dubai_test/sums-sorted.txt
OLDSUM=""
IFS=$'\n'
for i in `cat /home/gevork/Desktop/webserver/maps.am/all_tiles/dubai_test/sums-sorted.txt`; do
    # Each line is "<md5sum>  <path>": split it into the sum and the file name
    NEWSUM=`echo "$i" | sed 's/ .*//'`
    NEWFILE=`echo "$i" | sed 's/^[^ ]* *//'`
    if [ "$OLDSUM" == "$NEWSUM" ]; then
        # Same sum as the previous line: this file is a duplicate of OLDFILE
        echo rm "$NEWFILE"
    else
        OLDSUM="$NEWSUM"
        OLDFILE="$NEWFILE"
    fi
done
I find Python a nice tool for these tasks, and it is more portable too (although you have restricted the question to Linux). The code below will keep the oldest file (by creation time) among the duplicates; if that doesn’t matter to you, it can be simplified. To use it, save it as, for example, “remove_dups.py”, and run it as
python remove_dups.py startdir. From startdir, it will look for directories that are 3 levels deep and calculate the md5 sum of the contents there. It stores a list of file names per hash. The text file you are after is printed to stdout, so you actually want to run it as python remove_dups.py startdir > myoutputfile.txt. It will also store the starting directory in this output file. Every other line is formatted as md5sum: file1, file2, file3, ... for duplicate files. The first of these is kept, the others are removed.
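The script itself is not reproduced here, so what follows is only a minimal sketch of the approach described above, not the original code. It makes several assumptions: it walks every level under startdir instead of looking only at directories three levels deep, it treats "oldest" as the smallest modification time (os.path.getmtime), because a true creation time is not generally available on Linux, and it writes deleted_file,kept_file CSV rows as the question asks rather than the md5sum: file1, file2, ... format. The function names (md5_of_file, group_by_hash, remove_duplicates) and the --delete flag are hypothetical. Without --delete it only reports what it would remove.

```python
import csv
import hashlib
import os
import sys
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(start_dir):
    """Map each MD5 digest to the regular files under start_dir that have it."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(start_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link and its target are not seen as duplicates.
            if os.path.isfile(path) and not os.path.islink(path):
                groups[md5_of_file(path)].append(path)
    return groups


def remove_duplicates(start_dir, out, delete=False):
    """Write one 'deleted,kept' CSV row per duplicate; delete only if asked."""
    writer = csv.writer(out)
    writer.writerow(["deleted_file", "kept_file"])
    removed = []
    for paths in group_by_hash(start_dir).values():
        if len(paths) < 2:
            continue
        # Assumption: "oldest" means smallest modification time; ties go by path.
        paths.sort(key=lambda p: (os.path.getmtime(p), p))
        kept = paths[0]
        for duplicate in paths[1:]:
            writer.writerow([duplicate, kept])
            if delete:
                os.remove(duplicate)
            removed.append(duplicate)
    return removed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python remove_dups.py startdir [--delete] > myoutputfile.csv
    remove_duplicates(sys.argv[1], sys.stdout, delete="--delete" in sys.argv[2:])
```

Running it once without --delete and checking the CSV before running it again with the flag is a cheap safeguard on a tree of this size. Redirect the output to a file outside startdir so the report itself is not hashed.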