I am developing a system that will eventually have millions of users. Each user of the system may have acces to different ‘tabs’ in the system. I am tracking this with a table called usertabs. There are two ways to handle this.
Way 1: A single row for each user containing userid and tab1-tab10 as int columns.
The advantage of this system is that the query to get a single row by userid is very fast while the disadvantage is that the ’empty’ columns take up space. Another disadvantage is that when I needed to add a new tab, I would have to re-org the entire table which could be tedious if there are millions of records. But this wouldn’t happen very often.
Way 2: A single row contains userid and tabid and that is all. There would be up to 10 rows per user.
The advantage of this system is easy sharding or other mechanism for optimized storage and no wasted space. Rows only exist when necessary. The disadvantage is up to 10 rows must be read every time I access a record. If these rows are scattered, they may be slower to access or maybe faster, depending on how they were stored?
My programmer side is leaning towards Way 1 while my big data side is leaning towards Way 2.
Which would you choose? Why?
Premature optimization, and all that…
Option 1 may seem “easier”, but you’ve already identified the major downside – extensibility is a huge pain.
I also really doubt that it would be faster than option 2 – databases are pretty much designed specifically to find related bits of data, and finding 10 records rather than 1 record is almost certainly not going to make a difference you can measure.
“Scattered” records don’t really matter, the database uses indices to be able to retrieve data really quickly, regardless of their physical location.
This does, of course, depend on using indices for foreign keys, as @Barmar comments.