Why do developers chaffify IDs for their “user” objects, or why, for instance, does Twitter use Snowflake for message IDs…? In other words: why is it bad for sequential IDs to be apparent in the browser? Does it represent a security flaw or just a privacy issue? If it’s a security flaw, what vulnerability do sequential IDs expose? If it’s a privacy issue, how is privacy violated if sequential IDs are discernible by the end user?
Three common approaches for creating unique IDs are sequential IDs, random numbers, and UUIDs.
Security Aspects
This is certainly a security concern if you associate things like a session with the ID. In that case you don’t want a malicious user to be able to predict such an ID. Sequential IDs are trivially predictable. UUIDs take a bit more effort to predict but are also not a good idea, which leaves random numbers. Even then, you have to use a cryptographically secure random number generator; otherwise there is still room for predictability.
As an example of why this is serious, consider the good old "jsessionid" or any other typical session ID included in the URL. An attacker would log in and behave like a normal user, note the session ID that was assigned to him, and then start predicting further IDs. By entering them in the address bar, he would effectively hijack other users’ sessions.
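To make the contrast concrete, here is a minimal Python sketch. The function names `guess_neighbours` and `new_session_id` are made up for illustration; only the standard-library `secrets` module is real. It shows how little work an attacker needs with sequential IDs and what an unpredictable session ID could look like instead.

```python
import secrets


def guess_neighbours(observed_id: int, count: int = 3) -> list:
    """What an attacker does with a sequential ID: simply count upwards."""
    return [observed_id + i for i in range(1, count + 1)]


def new_session_id(num_bytes: int = 32) -> str:
    """Return a URL-safe session ID drawn from the OS's secure random source."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Having been assigned ID 1041, the attacker tries the next few values.
    print(guess_neighbours(1041))
    # A random token gives no hint about any other user's token.
    print(new_session_id())
```

The default of 32 bytes is an assumption chosen for the sketch, not a value taken from any particular framework.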
Concurrency/Scaling Issues
But judging from Snowflake’s description, there seems to be no inherent security concern behind it; the approach seems to fall under the third category, UUIDs. The text says that they are moving away from MySQL to Cassandra and that they used MySQL sequential IDs in the past. If you think about it, sequential IDs soon become a bottleneck when you try to scale your system: every ID generation needs to be synchronized to prevent race conditions.
If you do not synchronize this process, a race condition could look like this: two independent instances increment the ID at the same time, so the counter effectively goes up by only one where it should have gone up by two. Typically, if you have just one database instance, the database performs the synchronization for you. But this does not scale: too many clients wait idle while the database is under heavy load. Multiple databases are an option, but replicating the IDs might put you back in the same situation.
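The lost update described above can be sketched in a few lines of Python. This is an in-memory stand-in for a database counter, not real database code: `SharedCounter`, `lost_update_demo`, and `locked_demo` are hypothetical names, and the bad interleaving is replayed by hand so that the result is deterministic.

```python
import threading


class SharedCounter:
    """In-memory stand-in for a sequential ID column."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def next_id_locked(self) -> int:
        # Synchronized: read, increment, and write happen as one step.
        with self._lock:
            self.value += 1
            return self.value


def lost_update_demo():
    """Replay the bad interleaving: both instances read before either writes."""
    counter = SharedCounter()
    a = counter.value        # instance A reads 0
    b = counter.value        # instance B reads 0 before A has written back
    counter.value = a + 1    # A stores 1 and hands out ID 1
    counter.value = b + 1    # B also stores 1 and hands out ID 1
    return a + 1, b + 1, counter.value


def locked_demo(workers: int = 4, per_worker: int = 1000) -> int:
    """With the lock, every ID handed out is unique. Returns the unique count."""
    counter = SharedCounter()
    issued = []

    def work():
        for _ in range(per_worker):
            issued.append(counter.next_id_locked())

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(set(issued))
```

In `lost_update_demo` both instances hand out the same ID and the counter advances by one instead of two. The lock in `next_id_locked` prevents that, at the price of making every caller wait its turn, which is exactly the bottleneck.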
Lock-free Unique IDs
So if you want IDs generated without the need for synchronization (lock-free), you either learn to live with non-unique IDs (which is more or less an oxymoron and not really a solution), or you must figure out a way to eliminate the bottleneck. What we once did, and what works nicely for a few database instances:
But for many instances this will become a real number-theoretic problem, so you have to go for a different solution. One way out is to go the UUID route, which is generally OK, but has the downside of completely depending on external factors that might change over time. From what I’ve seen, my guess is this is what Snowflake is aiming at.
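As one illustration of such external factors (this is my assumption about what is meant, using Python's standard `uuid` module rather than anything Snowflake-specific): a version 1 UUID is built from the current clock and a node identifier, typically the machine's MAC address. `EXAMPLE_NODE` and `time_based_id` are hypothetical names for this sketch.

```python
import uuid

# Hypothetical 48-bit node value standing in for a machine's MAC address.
EXAMPLE_NODE = 0x0242AC110002


def time_based_id(node: int = EXAMPLE_NODE) -> uuid.UUID:
    """Version 1 UUID: derived from the current timestamp and the node ID."""
    return uuid.uuid1(node=node)


if __name__ == "__main__":
    u = time_based_id()
    # Both inputs are visible in the result, so the ID is only as reliable
    # as the clock and the node value it was built from.
    print(u, u.version, hex(u.node), u.time)
```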
For completeness’ sake, I want to mention another solution that scales beautifully and is IMO beautiful in itself. It is also not subject to external factors and will work anywhere, despite being counter-intuitive at first. The idea is to choose sufficiently large (let’s say 20 bytes) cryptographically secure random numbers. They have to be cryptographically secure: non-cryptographic random number generators typically repeat after a certain amount of numbers generated, and we don’t want that, of course. Other than that, that’s all you need.
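A minimal sketch of this idea in Python, assuming the standard `secrets` module as the secure random source; `ID_BYTES` and `new_id` are names invented for the example.

```python
import secrets

ID_BYTES = 20  # 20 bytes = 160 bits, the size suggested above


def new_id() -> str:
    """Return a 160-bit random ID as 40 hex characters.

    No coordination is needed: every instance can call this independently.
    """
    return secrets.token_hex(ID_BYTES)


if __name__ == "__main__":
    print(new_id())
```

Because no instance needs to know what any other instance has generated, there is nothing to lock and nothing to replicate.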
At first, I thought this could never work: what if we get the same number twice? But if you do the math, you will realize what the odds are. The Birthday Paradox tells us that you will find a collision after generating in the order of O(2^(n/2)) numbers, where n is the number of bits of your random number. So 20 bytes = 160 bits, and you should expect a collision after about 2^80 numbers. That’s the same generic bound as for SHA-1’s 160-bit output; the SHA-1 collisions published since were found by exploiting weaknesses in the hash function itself, not by chance, so they do not apply to random numbers. The thing is that it’s not even slightly likely that you get lucky and find a collision after, let’s say, 2^30 numbers by "chance" or anything like that. The probabilities are against you. It’s roughly in the same ballpark as winning multiple lotteries at the same time while becoming president on the same day.
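If you want to check the odds yourself, the usual birthday approximation is p ≈ 1 - exp(-k(k-1) / 2^(n+1)) for k random n-bit values. The sketch below evaluates it in Python; `collision_probability` is a name made up for this example, and the formula is an approximation, not an exact count.

```python
import math


def collision_probability(num_ids: int, bits: int) -> float:
    """Approximate chance that num_ids random values of `bits` bits collide."""
    exponent = num_ids * (num_ids - 1) / 2 ** (bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    # About a billion IDs of 160 bits each.
    print(collision_probability(2 ** 30, 160))
    # Around 2^80 IDs, where a collision finally becomes plausible.
    print(collision_probability(2 ** 80, 160))
```

For 2^30 IDs the result is on the order of 10^-31, while at 2^80 IDs it climbs to roughly 39%, which matches the 2^(n/2) rule of thumb.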