I need the MD5 sums of about 3 million strings in a bash script on Ubuntu: 3 million strings -> 3 million MD5 hashes. The trivial implementation takes about 0.005 s per string, which adds up to over 4 hours. What faster alternatives exist? Is there a way to pump groups of strings into md5sum?
#time md5sum running 100 times on short strings
#each iteration is ~0.494s/100 = 0.005s
time (for i in {0..99}; do md5sum <(echo $i); done) > /dev/null
real 0m0.494s
user 0m0.120s
sys 0m0.356s
A good solution will include a bash/Perl script that takes a list of strings from stdin and outputs a list of their MD5 hashes.
It’s not hard to do in C (or Perl or Python) using any of the many md5 implementations — at its heart md5 is a hash function that goes from a character vector to a character vector.
So just write an outer program that reads your 3 million strings and feeds them one by one to the md5 implementation of your choice. That way you have one program startup rather than 3 million, and that alone will save you time.
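As a rough illustration of the single-process idea, here is a minimal sketch in Python using the standard hashlib module (not any of the C implementations). The names md5_hex and hash_lines are made up for this example, and it assumes the strings are UTF-8 text, one per line, with the trailing newline excluded from the hash. Note that the md5sum <(echo $i) loop in the question does hash the newline, so the digests will differ from that loop's output unless you add the newline back.

```python
import hashlib
import io
import sys


def md5_hex(text):
    """Return the hex MD5 digest of one string, encoded as UTF-8 (assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def hash_lines(stream, out):
    """Write one MD5 digest per input line; the trailing newline is not hashed."""
    for line in stream:
        out.write(md5_hex(line.rstrip("\n")) + "\n")


# Small in-memory demonstration with two strings, "a" and "abc".
demo_out = io.StringIO()
hash_lines(io.StringIO("a\nabc\n"), demo_out)
sys.stdout.write(demo_out.getvalue())
```

To use it as a filter from bash, replace the demonstration at the bottom with hash_lines(sys.stdin, sys.stdout) and pipe the list of strings into the script. The interpreter then starts once for the whole list instead of once per string.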
FWIW, in one project I used the md5 implementation (in C) by Christophe Devine. There is OpenSSL’s as well, and I am sure CPAN has a number of them for Perl too.
Edit: OK, couldn’t resist. The md5 implementation I mentioned is e.g. inside this small tarball. Take the file md5.c and replace the (#ifdef’ed out) main() at the bottom with this
build a simple standalone program as e.g. in
and then you get this:
So that’s about a second for 300,000 (short) strings. If the time scales linearly, 3 million strings should take on the order of ten seconds.