I have over 10K files for products; the problem is that many of the images are duplicates.
If there is no image, there is a standard image that says ‘no image’.
How can I detect if the image is this standard ‘no image’ image file?
Update
The image is a different name, but it is exactly the same image otherwise.
People are suggesting a hash, so would I do something like this?
import hashlib
data = file.read()  # raw bytes of the file; the image does not need to be decoded
digest = hashlib.md5(data).hexdigest()
I wrote a script for this a while back. First it scans all files, noting their sizes in a dictionary. You end up with:
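A minimal sketch of that first pass, assuming the images live under a single directory tree. The names `group_by_size` and `root` are hypothetical and not taken from the original script:

```python
import os
from collections import defaultdict

def group_by_size(root):
    """Map file size in bytes -> list of paths with that size (hypothetical helper)."""
    sizes = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Files with different sizes cannot be byte-identical,
            # so size is a cheap first filter.
            sizes[os.path.getsize(path)].append(path)
    return dict(sizes)
```

Any size that maps to a single path can be ignored from here on, since that file has no possible duplicate.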
Then, for each key (image size) where there’s more than 1 element in the dictionary, I’d read some fixed amount of the file and do a hash. Something like:
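As a rough sketch of that second pass, the function below takes a dictionary of size -> paths and hashes only the first chunk of each file in a group. The chunk size of 4096 bytes and the name `group_by_partial_hash` are assumptions, since the original does not say how much of the file it read:

```python
import hashlib
from collections import defaultdict

CHUNK = 4096  # assumed "fixed amount"; the original does not give a number

def group_by_partial_hash(sizes):
    """sizes: dict of file size -> list of paths.
    Returns lists of paths that share a size and a hash of their first CHUNK bytes."""
    candidates = []
    for _size, paths in sizes.items():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            with open(path, "rb") as f:
                digest = hashlib.md5(f.read(CHUNK)).hexdigest()
            by_hash[digest].append(path)
        # Keep only groups where at least two files agree.
        candidates.extend(g for g in by_hash.values() if len(g) > 1)
    return candidates
```

Matching size plus a matching partial hash is a strong hint rather than proof. To confirm, hash the whole file or compare byte for byte with `filecmp.cmp(a, b, shallow=False)`. For the 'no image' case, it should be enough to run the same comparison between each file and one known copy of the placeholder.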
This is all off the top of my head, haven’t tested the code, but you get the idea.