I want to understand how to design the database architecture for chat messages on a large site (for example facebook.com or gmail.com).
I think the messages are distributed across different tables, because keeping them all in one table is impossible given the huge quantity, right? (And I don’t think partitioning can help here.)
So, what logic is used to distribute messages across different tables? I have several ideas, but I don’t think any of them is optimal.
So generally, I’m interested in what you think about this. Also, if you know some good articles about this, please post the link.
OK, well the problem is how to partition the dataset. The easiest (and often the best) way to think about this is to consider the access pattern: which messages are needed quickly, which ones can be slow, and how to manage each group.
Generally older messages can be held on low network speed/low memory/very large storage nodes (multi-terabyte).
New messages should be on high bandwidth network/high memory/low storage nodes (gigabytes are enough).
As traffic grows, you’ll need to add storage to the slow nodes and add more fast nodes (scale horizontally).
Each night (or more often) you can copy old messages to the historical database, and remove the messages from the current database. Queries may need to address two databases, but this is not too much trouble.
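To make the nightly move concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for both databases. The table layout, the column names, and the helper names (open_db, archive_old_messages, fetch_messages) are assumptions for illustration only; a real deployment would use its own schema and database servers.

```python
import sqlite3

# Hypothetical schema, used for both the current and the historical database.
SCHEMA = """CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY, user_id INTEGER, sent_at INTEGER, body TEXT)"""


def open_db(path=":memory:"):
    """Open a database and make sure the messages table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def archive_old_messages(current, history, cutoff_ts):
    """Copy messages older than cutoff_ts to history, then delete them from current."""
    rows = current.execute(
        "SELECT id, user_id, sent_at, body FROM messages WHERE sent_at < ?",
        (cutoff_ts,),
    ).fetchall()
    with history:  # commit the copy before anything is deleted
        history.executemany(
            "INSERT OR IGNORE INTO messages (id, user_id, sent_at, body) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )
    with current:
        current.execute("DELETE FROM messages WHERE sent_at < ?", (cutoff_ts,))
    return len(rows)


def fetch_messages(current, history, user_id):
    """Query both databases and merge the results in time order."""
    query = "SELECT id, user_id, sent_at, body FROM messages WHERE user_id = ?"
    rows = history.execute(query, (user_id,)).fetchall()
    rows += current.execute(query, (user_id,)).fetchall()
    return sorted(rows, key=lambda row: row[2])
```

Copying first and deleting second means a crash between the two steps leaves duplicates rather than lost messages, and INSERT OR IGNORE makes a re-run harmless.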
As you scale out, the data will probably need to be sharded, i.e. split by some data value. Splitting by user ID makes sense. To make life easy, all sides of a conversation can be stored with each user. I would recommend using time-bucketed text for this (disk access is usually on 4k boundaries), though this may be too complicated for you initially.
Queries now need to be user-aware so they query against the correct database. A simple lookup table will help there.
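As a rough illustration of the lookup table, the sketch below keeps a user-to-shard map in memory and stores a copy of each message under every participant. The ShardRouter class, the shard names, and the "least-loaded shard" assignment rule are all assumptions made for the example; in practice the map would live in a small, heavily cached database of its own.

```python
class ShardRouter:
    """Maps each user ID to the shard that holds that user's messages."""

    def __init__(self, shards):
        self.shards = list(shards)  # e.g. database names or connection strings
        self.lookup = {}            # user_id -> shard (the lookup table)

    def shard_for(self, user_id):
        """Return the user's shard, assigning new users to the least-loaded one."""
        if user_id not in self.lookup:
            counts = {shard: 0 for shard in self.shards}
            for shard in self.lookup.values():
                counts[shard] += 1
            self.lookup[user_id] = min(self.shards, key=lambda s: counts[s])
        return self.lookup[user_id]

    def store_message(self, storage, sender, recipient, body):
        """Store one copy of the message per participant, on that participant's shard."""
        # dict.fromkeys removes a duplicate owner when a user messages themselves.
        for owner in dict.fromkeys((sender, recipient)):
            shard = self.shard_for(owner)
            storage.setdefault(shard, []).append((owner, sender, recipient, body))


# Usage: `storage` is a plain dict standing in for the real shard databases.
router = ShardRouter(["db0", "db1"])
storage = {}
router.store_message(storage, sender=1, recipient=2, body="hi")
```

Because the assignment is recorded rather than computed from the ID, a user can later be moved to another shard by copying their data and updating a single lookup row.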
The other thing to do is to compress the messages on the way in and decompress them on the way out. Text compresses easily, and this may double your throughput for a small increase in CPU use.
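A minimal sketch of the compress-on-write, decompress-on-read idea, using Python's standard zlib module. The function names and the compression level are assumptions for the example; any general-purpose compressor would fit the same two hooks.

```python
import zlib


def pack_message(text, level=6):
    """Compress a message just before it is written to the database."""
    return zlib.compress(text.encode("utf-8"), level)


def unpack_message(blob):
    """Decompress a stored message just after it is read back."""
    return zlib.decompress(blob).decode("utf-8")


# Usage: store `blob` in a binary column instead of the raw text.
original = "see you at the usual place at the usual time " * 20
blob = pack_message(original)
restored = unpack_message(blob)
```

Very short messages can come out slightly larger than the input because of the zlib header and checksum, so the saving is greater when several messages are compressed together.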
Many NoSQL databases do a lot of this hard work for you, but until you’ve run out of capacity on your current system, you may wish to stick to the technologies you know.
Good luck!