I have over 10K files for products, the problem is that many of

Question

Editorial Team

Asked: May 16, 2026 (2026-05-16T01:34:18+00:00)

I have over 10K files for products. The problem is that many of the images are duplicates.

If there is no image, there is a standard image that says ‘no image’.

How can I detect if the image is this standard ‘no image’ image file?

Update
The image has a different name, but it is otherwise exactly the same image.

People are saying Hash, so would I do this?

im = cStringIO.StringIO(file.read())
img = im.open(im)
md5.md5(img)

1 Answer

Editorial Team · Answer 1 · 2026-05-16T01:34:19+00:00

I wrote a script for this a while back. First it scans all files, noting their sizes in a dictionary. You end up with:

images[some_size] = ['x/a.jpg', 'b/f.jpg', 'n/q.jpg']
images[some_other_size] = ['q/b.jpg']
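
The first pass could look roughly like the sketch below. The helper name `group_by_size` and its `root` argument are made up for illustration. It assumes Python 3 and that every file under `root` is a candidate image.

```python
import os
from collections import defaultdict

def group_by_size(root):
    """Map file size in bytes -> list of paths under root (hypothetical helper)."""
    images = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Files with a unique size cannot have an identical copy
            images[os.path.getsize(path)].append(path)
    return images
```

Getting the size only needs a stat call per file, so this pass is cheap even for 10K files. Nothing is read from disk until the hashing step.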

Then, for each key (file size) whose list holds more than one file, I’d read some fixed amount of each file and hash it. Something like:

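import hashlib
import os
from collections import defaultdict

# images maps file size -> list of paths, as built in the first pass
possible_dupes = [size for size in images if len(images[size]) > 1]
for size in possible_dupes:
    hashes = defaultdict(list)
    for fname in images[size]:
        # Hash only the first 10000 bytes: a cheap filter, not proof of identity
        with open(fname, 'rb') as f:
            digest = hashlib.md5(f.read(10000)).digest()
        hashes[digest].append(fname)
    for k in hashes:
        if len(hashes[k]) <= 1: continue
        # Keep the first file in each group and delete the rest
        for fname in hashes[k][1:]:
            os.remove(fname)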

This is all off the top of my head, haven’t tested the code, but you get the idea.
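
For the original question (spotting the standard ‘no image’ file), the same idea can be turned into a direct comparison against one known copy of that file. This is only a sketch: `file_md5`, `is_placeholder` and `placeholder_path` are names invented here, and it assumes Python 3 and that the placeholder copies really are byte-identical, as the update to the question says.

```python
import hashlib
import os

def file_md5(path, chunk_size=65536):
    """Return the MD5 hex digest of the whole file, read in chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def is_placeholder(path, placeholder_path):
    """True if path has the same size and MD5 as the placeholder file."""
    # Cheap size check first, so most files are never read at all
    if os.path.getsize(path) != os.path.getsize(placeholder_path):
        return False
    return file_md5(path) == file_md5(placeholder_path)
```

Note that this hashes the raw file bytes, not a decoded image object, so there is no need to open the file with an imaging library first. If the placeholder was re-saved or re-compressed anywhere, the bytes will differ and this check will not catch it.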
