First, please be gentle: I know very little about DB design. I am working with Splunk to generate records from customer call information (call detail records). Due to the volume, I can’t really use Splunk to process some of the data, as it’s non-relational. So I will take the data in, use Splunk for some simple alerting and monitoring for weird patterns, and do the more advanced processing elsewhere. The data source I can get most readily is already available on the system in near real time. What I would like to do is take the incoming SIP Call-ID (which by RFC definition must be globally unique), the outgoing SIP Call-ID (again, globally unique by definition), the current Unix epoch time, and a randomly generated number between 1 and 2^31, concatenate them, take the MD5 hash of the result, and use that as a primary key. How likely is it that we’d encounter a collision? Any advice on other approaches would be very much appreciated.
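For any two given records, the likelihood that their MD5 hashes collide is about 1 in 2^128. Across n records, the birthday bound puts the chance of at least one collision at roughly n^2 / 2^129, which is still negligible for any realistic call volume. However, since MD5's collision resistance is broken, an adversary could in theory create collisions far more often by crafting suitable Call-IDs, particularly if they know something about your RNG. I think you could just use the concatenated value as-is without hashing: the Call-IDs are already supposed to be unique, so the hash only adds a (tiny) collision risk. Otherwise, consider what damage a collision could cause and plan for that.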
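To make the comparison concrete, here is a minimal Python sketch of both options plus a rough birthday-bound estimate. The function names (`composite_key`, `md5_key`, `birthday_bound`) are made up for illustration, and the length-prefixed joining is my own assumption rather than something from the question: plain concatenation without any delimiter can be ambiguous ("ab" + "c" and "a" + "bc" give the same string), so some unambiguous encoding seems prudent whichever option you pick.

```python
import hashlib
import secrets
import time


def composite_key(incoming_id, outgoing_id, epoch=None, nonce=None):
    """Join the four fields into one unhashed key (hypothetical helper).

    Each field is length-prefixed so that different field tuples can never
    produce the same string. This encoding is an assumption of the sketch.
    """
    if epoch is None:
        epoch = int(time.time())
    if nonce is None:
        # Random integer in 1..2^31, as described in the question.
        nonce = secrets.randbelow(2**31) + 1
    fields = [incoming_id, outgoing_id, str(epoch), str(nonce)]
    return "".join(f"{len(f)}:{f}" for f in fields)


def md5_key(incoming_id, outgoing_id, epoch=None, nonce=None):
    """MD5 hex digest of the composite key (32 hex characters)."""
    raw = composite_key(incoming_id, outgoing_id, epoch, nonce)
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def birthday_bound(n, bits=128):
    """Approximate chance of at least one accidental collision among n
    uniformly random `bits`-bit values: (number of pairs) / 2^bits."""
    pairs = n * (n - 1) // 2
    return pairs / 2**bits


if __name__ == "__main__":
    # Placeholder Call-IDs, not real data.
    print(composite_key("in-call-id@host-a", "out-call-id@host-b"))
    print(md5_key("in-call-id@host-a", "out-call-id@host-b"))
    # Estimated accidental-collision chance for a billion records.
    print(birthday_bound(10**9))
```

The unhashed key is longer and variable-length, which may matter for index size; the MD5 version is a fixed 32 hex characters but brings in the (small) collision risk discussed above. Note that the estimate only covers accidental collisions, not ones an adversary constructs deliberately.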