I’ve been having some difficulty scaling up the application and decided to ask a question here.
Consider a relational database (say MySQL). Let’s say it allows users to make posts, which are stored in the post table (fields: postid, posterid, data, timestamp). When you retrieve all of your own posts sorted by recency, you simply select all posts with posterid = you and order by timestamp. Simple enough.
This process will use timestamp as the index since it has the highest cardinality and correctly so. So, beyond looking into the indexes, it’ll take literally 1 row fetch from disk to complete this task. Awesome!
But let’s say 1 million more posts have been made (in the system) by other users since you last posted. Then, to get your latest post, the database will use the index on timestamp again, and it has no way of knowing how many posts have happened since then (or should we at least estimate manually and set a preferred key?). We would then have scanned a million and one rows just to fetch a single row.
Additionally, a set of posts from multiple arbitrary users would be one of the use cases, so I cannot make fields like userid_timestamp to create a sub-index.
Am I seeing this wrong? Or what must be changed fundamentally from the application to allow such operation to occur at least somewhat efficiently?
Indexing
If you have a query:
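... WHERE posterid = you ORDER BY timestamp [DESC], then you need a composite index on {posterid, timestamp}. With posterid as the leading column, all of one poster’s entries sit next to each other in the index and are already ordered by timestamp, so the latest post is found with a single index seek no matter how many other users have posted since. To understand why, take a look at Anatomy of an SQL Index.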
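As a minimal sketch of the composite index described above, assuming the table is named post and using a hypothetical index name and poster id (neither appears in the question):

```sql
-- Hypothetical index name; the column order (posterid first, then timestamp) is what matters.
CREATE INDEX idx_post_poster_ts ON post (posterid, `timestamp`);

-- Latest post of one user. 42 is a placeholder poster id.
-- The index can be seeked on posterid and read backwards on timestamp,
-- so no separate sort step should be needed.
SELECT postid, data, `timestamp`
FROM post
WHERE posterid = 42
ORDER BY `timestamp` DESC
LIMIT 1;
```

Running the SELECT under EXPLAIN should show whether the optimizer actually picks this index for your data.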
Clustering
The leaves of a “normal” B-Tree index hold “pointers” (physical addresses) to the indexed rows, while the rows themselves reside in a separate data structure called the “table heap”. The heap can be eliminated by storing the rows directly in the leaves of the B-Tree, which is called clustering. This has its pros and cons, but if you have one predominant kind of query, eliminating the table heap access through clustering is definitely something to consider.
In this particular case, the table could be created like this:
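The original snippet is not preserved here, so the following is only a sketch of what such a definition might look like. The column types are assumptions; the part that follows from the text is the primary key on the natural key {posterid, timestamp} and the absence of postid:

```sql
-- Sketch only: column types are assumed, not taken from the original answer.
CREATE TABLE post (
    posterid    INT NOT NULL,
    -- Fractional seconds (MySQL 5.6.4+) make collisions within one poster less likely;
    -- the key still requires that a poster never has two posts with the same timestamp.
    `timestamp` DATETIME(6) NOT NULL,
    data        TEXT,
    -- InnoDB uses the primary key as the clustering key, so rows are
    -- physically ordered by poster and then by time.
    PRIMARY KEY (posterid, `timestamp`)
) ENGINE=InnoDB;
```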
MySQL/InnoDB clusters all its tables and uses the primary key as the clustering key. We haven’t used the surrogate key (postid), since secondary indexes in clustered tables can be expensive and we already have the natural key. If you really need the surrogate key, consider making it an alternate key and keeping the clustering established through the natural key.
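If the surrogate key is kept, one possible shape is shown below. This is a sketch with assumed column types and hypothetical table and constraint names; the point is only that postid becomes a UNIQUE (alternate) key while the primary key, and therefore the clustering, stays on the natural key:

```sql
-- Sketch only: types and the names post_with_surrogate / uq_post_postid are hypothetical.
CREATE TABLE post_with_surrogate (
    posterid    INT NOT NULL,
    `timestamp` DATETIME(6) NOT NULL,
    -- InnoDB requires an AUTO_INCREMENT column to be the first column of some index;
    -- the UNIQUE key below satisfies that without making it the primary key.
    postid      BIGINT NOT NULL AUTO_INCREMENT,
    data        TEXT,
    PRIMARY KEY (posterid, `timestamp`),   -- clustering key (natural key)
    UNIQUE KEY uq_post_postid (postid)     -- alternate key (surrogate)
) ENGINE=InnoDB;
```

The trade-off is that every secondary index entry in InnoDB carries a copy of the primary key columns, so the extra UNIQUE index adds storage and write overhead.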