I have a Perl script I wrote for my own personal use that fetches image files from a website periodically. It then saves these images to a folder. These image files are quite often the same from fetch to fetch, and I’d like to not save duplicates if I can avoid it.
My question: What would be the best way to compare/check if they are the same?
My only real thought so far is to open a file handle to the existing file, md5 it, md5 the $response->content from the fetch, and then compare the two digests. Would that work?
Is there a better way?
EDIT:
Wow, already tons of great suggestions. Does it help if I tell you that this script runs daily via cron, i.e. it is guaranteed to run at the same time every day? Also: I’m looking at the Last-Modified headers on some of these, and they don’t look 100% accurate, i.e. some have a Last-Modified of over a week ago when I know the image is more recent than that. I’m assuming that’s because the image file itself hasn’t been modified on the server since then… which doesn’t help me much…
Don’t open and hash the stored image each time – stash the hash alongside the image when you store it. Compare sizes as well.
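A minimal sketch of that idea, assuming the fetched bytes are already in a scalar (e.g. `$response->content`). The helper name `save_if_changed` and the `.md5` sidecar file are my own invention for illustration, not part of any module; `Digest::MD5` ships with Perl.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper: writes $content to $path unless an identical copy
# is already stored. Returns 1 if the file was written, 0 if skipped.
sub save_if_changed {
    my ($path, $content) = @_;
    my $sidecar = "$path.md5";    # assumed naming convention for the stash
    my $new     = length($content) . ':' . md5_hex($content);

    # Read the "size:digest" line stored on a previous run, if any.
    my $old = '';
    if (open my $in, '<', $sidecar) {
        $old = <$in> // '';
        chomp $old;
        close $in;
    }
    return 0 if $old eq $new && -e $path;

    # New or changed image: write it, then record its size and digest.
    open my $img, '>:raw', $path or die "Cannot write $path: $!";
    print {$img} $content;
    close $img or die "Cannot close $path: $!";

    open my $out, '>', $sidecar or die "Cannot write $sidecar: $!";
    print {$out} "$new\n";
    close $out or die "Cannot close $sidecar: $!";
    return 1;
}
```

Calling `save_if_changed("images/foo.png", $response->content)` on each run would then only touch the disk when the size or digest differs from the stored pair, so the existing image never has to be re-read and re-hashed.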
Don’t issue a GET request straight away; do a HEAD first and compare the size, last-modification date, and any ETags to what you got last time.
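Roughly, the HEAD check could look like the sketch below. I'm using `HTTP::Tiny` here only because it ships with Perl; `LWP::UserAgent` has a `head` method too. The helper names (`validators`, `looks_unchanged`, `head_validators`) are hypothetical, and persisting the values between runs is left to you.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: keep only the headers worth remembering between runs.
# HTTP::Tiny returns header names in lower case.
sub validators {
    my ($headers) = @_;
    my %keep;
    for my $name (qw(etag last-modified content-length)) {
        $keep{$name} = $headers->{$name} if defined $headers->{$name};
    }
    return \%keep;
}

# True only if the two sets share at least one header and every shared
# header has the same value. Anything else means "go and GET it".
sub looks_unchanged {
    my ($old, $new) = @_;
    my @shared = grep { exists $new->{$_} } keys %$old;
    return 0 unless @shared;
    for my $name (@shared) {
        return 0 if $old->{$name} ne $new->{$name};
    }
    return 1;
}

# Issue the HEAD request; returns a hashref of validators, or nothing on failure.
sub head_validators {
    my ($url) = @_;
    my $res = HTTP::Tiny->new(timeout => 20)->head($url);
    return unless $res->{success};
    return validators($res->{headers});
}
```

On each run you would call `head_validators($url)`, compare the result against the set saved last time with `looks_unchanged`, and only issue the GET when it returns false. Not every server sends an ETag or a sensible Last-Modified, so treating a missing header as "changed" errs on the side of fetching.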