Background:
We are building a mailing tool and currently keep email addresses in a separate table, so that each address is stored only once and is referenced by its id. We think this is a good idea, as the number of recipients per email can be huge and most addresses will likely receive well in excess of 100 emails.
However, when a user imports email addresses into a list (or performs a similar operation), we first have to bulk-insert them to make sure every address has an id. We simply ignore collisions, and that works. But to then insert the addresses into the list, we must fetch their ids either one by one or with a huge IN query over the addresses (since a list references an email address by id). Not very tempting!
EDIT: Users may be importing 100,000+ email addresses. For 1,000 or so addresses, querying one by one is of course not a real issue.
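For reference, the current workflow can be sketched roughly as below. This is only an illustration, written in Python with SQLite to keep it self-contained and runnable. The table names (`email_address`, `list_member`), the function `import_addresses`, and the chunk size are all made up for this sketch and are not our actual schema.

```python
import sqlite3

# Hypothetical schema: one row per distinct address, lists reference it by id.
SCHEMA = """
CREATE TABLE email_address (
    id      INTEGER PRIMARY KEY,
    address TEXT NOT NULL UNIQUE
);
CREATE TABLE list_member (
    list_id    INTEGER NOT NULL,
    address_id INTEGER NOT NULL REFERENCES email_address(id),
    PRIMARY KEY (list_id, address_id)
);
"""

def import_addresses(conn, list_id, emails, chunk_size=500):
    """Bulk-insert addresses, then resolve their ids with chunked IN queries."""
    cur = conn.cursor()
    # Step 1: insert every address; ones that already exist are skipped.
    cur.executemany(
        "INSERT OR IGNORE INTO email_address (address) VALUES (?)",
        [(e,) for e in emails],
    )
    # Step 2: fetch the ids back a chunk at a time and link them to the list.
    for start in range(0, len(emails), chunk_size):
        chunk = emails[start:start + chunk_size]
        placeholders = ",".join("?" * len(chunk))
        rows = cur.execute(
            f"SELECT id FROM email_address WHERE address IN ({placeholders})",
            chunk,
        ).fetchall()
        cur.executemany(
            "INSERT OR IGNORE INTO list_member (list_id, address_id) VALUES (?, ?)",
            [(list_id, row[0]) for row in rows],
        )
    conn.commit()
```

Step 2 is the part we would like to get rid of: with 100,000+ addresses it means hundreds of IN queries whose only purpose is to learn ids the database just assigned.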
Question:
So one idea is to hash each email address and use the hash as its id instead. That way we can predict the id of every address without having to query for it. But are there any good algorithms for this? Storing 16 bytes (128 bits) or more kind of defeats the purpose. 64 bits should be enough, no? A key that small would be welcome, given that this should all be indexed too.
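To make the idea concrete, here is a minimal sketch of deriving a 64-bit id from an address. It uses a truncated MD5 purely as one possible choice, and Python rather than PHP only for brevity. The function name `email_id`, the normalization step (trim and lower-case), and the choice of a signed big-endian integer (so the value fits a signed 64-bit integer column) are assumptions of this sketch.

```python
import hashlib
import struct

def email_id(email: str) -> int:
    """Derive a 64-bit id from the first 8 bytes of the address's MD5 digest."""
    # Normalize first so that "A@x.com " and "a@x.com" map to the same id.
    normalized = email.strip().lower()
    digest = hashlib.md5(normalized.encode("utf-8")).digest()
    # ">q" reads 8 bytes as a big-endian signed 64-bit integer.
    return struct.unpack(">q", digest[:8])[0]
```

With something like this, the importer can compute every id up front and insert into both the address table and the list table without ever reading ids back.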
Any recommendations? What if we simply took the first 8 bytes of the MD5? Are 8 bytes of SHA-1 better? Perhaps there are more specialized algorithms? I’m not well read on collision probability, but I’m curious how well existing algorithms work when truncated, given that email addresses are lower-case and consist mostly of letters and digits. (Note that the dataset will potentially be huge.)
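As a rough way to reason about the collision question, the standard birthday approximation can be applied, under the assumption that the truncated hash behaves like a uniformly random b-bit value: the probability of at least one collision among n distinct addresses is approximately 1 - exp(-n(n-1) / 2^(b+1)). The function name `collision_probability` below is made up for this sketch.

```python
import math

def collision_probability(n: int, bits: int = 64) -> float:
    """Birthday-bound estimate of at least one collision among n random ids."""
    # Expected number of colliding pairs: n*(n-1)/2 pairs, each with chance 2^-bits.
    expected_pairs = n * (n - 1) / (2 * 2**bits)
    # 1 - exp(-x), computed with expm1 to stay accurate when x is tiny.
    return -math.expm1(-expected_pairs)
```

Under that assumption, 100,000 addresses with 64-bit ids give a collision probability of about 2.7e-10, while a billion addresses give about 2.7%. The estimate depends only on the number of addresses and the number of bits, not on which characters the addresses contain, as long as the hash mixes its input well.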
PS. We are using PHP, which somewhat limits our ability to implement specialized algorithms.
Not sure I understand your use case, but put a unique key constraint on the email address column…