I’ve got a really big problem with my image storage server.
There are about 2,000,000 product images on it, and the number keeps growing, but a lot of them are very similar. For example, there is an iPad photo stored in many similar sizes: 120 * 120, 118 * 120, 131 * 125, etc. They take up a lot of unnecessary disk space and make for a bad user experience on my website (similar images in the gallery).
Those images are indexed in a database, so I can find them by conditions such as product or category. I need a way to mark these similar images in the database and remove them.
What I have done:
I found a library named pHash that can calculate the similarity of two images, so I can use it to compare the images one by one. But done this way, it will take a lot of time to find those images, and I don’t know how to make the process faster.
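To make the cost concrete, here is a minimal sketch of the one-by-one comparison in Python. It is not my real code. It assumes each image has already been reduced to an integer perceptual hash and that two images count as similar when their hashes differ in only a few bits. The names `hamming_distance`, `find_similar_pairs`, `pair_count` and the threshold of 8 bits are made up for illustration and are not part of pHash.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def find_similar_pairs(
    hashes: Dict[str, int], max_distance: int = 8  # threshold is an assumption
) -> List[Tuple[str, str]]:
    """Compare every image with every other image (the slow, pairwise way)."""
    similar = []
    for (name_a, hash_a), (name_b, hash_b) in combinations(hashes.items(), 2):
        if hamming_distance(hash_a, hash_b) <= max_distance:
            similar.append((name_a, name_b))
    return similar


def pair_count(n: int) -> int:
    """Number of comparisons needed for n images: n * (n - 1) / 2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    # Hypothetical hashes, made up for illustration only.
    sample = {"ipad_120x120": 0b1111, "ipad_118x120": 0b1110, "phone": 0xFFFF0000FFFF0000}
    print(find_similar_pairs(sample, max_distance=2))
    print(pair_count(2_000_000))
```

For 2,000,000 images, `pair_count` gives roughly 2 × 10^12 comparisons, which is why comparing one by one is too slow.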
Any ideas?
1 Answer