I’m new to SQL and I don’t understand performance implications. It seems like SQL databases store everything in a single place. Doesn’t this mean that tables grow extremely large, very quickly? Won’t this hurt performance?
Example Stackoverflow model, but with threaded comments:
CREATE TABLE t_users (
    name varchar(80) primary key,
    email varchar(80)
);
CREATE TABLE t_posts (
    id varchar(80) primary key,
    userid varchar(80) references t_users(name),
    title varchar(80),
    description text,
    topic varchar(80),
    path text
);
Is this a valid design? All posts of every user ever are stored in the same table. So if I want to query for all comments that have topic “programming”, it would need to look through every single post, even the posts that have different topics, because they are all stored in the same table. This also means that if I make more complicated queries, they will get exponentially slower the bigger my table on disk is.
Wouldn’t it be better to split every single post into a new table?
The design is quasi-valid, but not completely:
t_users would be better off having an auto-increment unsigned INT ID column. (A primary key on a name is almost ALWAYS a bad idea. People change names. People have the same names. Even countries change names sometimes! A numeric key is almost always the best choice!)
t_posts can refer to that user ID. Joins are now blazing fast.
t_posts has an ID primary key column (good!), but it’s varchar (bad!). INT is better. BIGINT if you need it.
You will find your posts later might have multiple topics (stackoverflow "tags"). Don’t put them as CSV in a varchar field. Create a new table "topics" with ID, description, and a linking table "posts_to_topic" that links each post to one or more topics.
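The three suggestions above could be sketched roughly like this. It is only a sketch, run through Python's built-in sqlite3 module so it is self-contained: SQLite's INTEGER PRIMARY KEY stands in for an auto-increment unsigned INT, and the column names (id, user_id, post_id, topic_id) and the sample rows are my own assumptions, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.executescript("""
CREATE TABLE t_users (
    id    INTEGER PRIMARY KEY,          -- numeric surrogate key instead of the name
    name  VARCHAR(80) NOT NULL,
    email VARCHAR(80)
);
CREATE TABLE t_posts (
    id          INTEGER PRIMARY KEY,    -- numeric instead of varchar
    user_id     INTEGER NOT NULL REFERENCES t_users(id),
    title       VARCHAR(80),
    description TEXT,
    path        TEXT
);
CREATE TABLE topics (
    id          INTEGER PRIMARY KEY,
    description VARCHAR(80) NOT NULL UNIQUE
);
CREATE TABLE posts_to_topic (           -- linking table: one row per (post, topic) pair
    post_id  INTEGER NOT NULL REFERENCES t_posts(id),
    topic_id INTEGER NOT NULL REFERENCES topics(id),
    PRIMARY KEY (post_id, topic_id)
);
""")

# Hypothetical sample data: one user, one post, two topics on that post.
conn.execute("INSERT INTO t_users (id, name, email) VALUES (1, 'alice', 'alice@example.com')")
conn.execute("INSERT INTO t_posts (id, user_id, title) VALUES (1, 1, 'Why is my query slow?')")
conn.executemany("INSERT INTO topics (id, description) VALUES (?, ?)",
                 [(1, "programming"), (2, "sql")])
conn.executemany("INSERT INTO posts_to_topic (post_id, topic_id) VALUES (?, ?)",
                 [(1, 1), (1, 2)])

# Join back from posts to the author and to every topic of the post.
rows = conn.execute("""
    SELECT u.name, p.title, t.description
    FROM t_posts p
    JOIN t_users u         ON u.id = p.user_id
    JOIN posts_to_topic pt ON pt.post_id = p.id
    JOIN topics t          ON t.id = pt.topic_id
    ORDER BY t.description
""").fetchall()
print(rows)
```

With this layout a user can change their name without touching any post, and a post can carry as many topics as it needs.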
Indexes
What you need to read up on is indexes. If you want to query for all comments that have topic "programming", you’d usually have an index on the column "topic varchar(80)". This index is small (think of it as a separate table that contains the indexed column(s) and the primary key), so your (R)DBMS can search it very quickly (it is a tree structure) and fetch all the primary keys it needs. Then, depending on what you select, the DBMS sends you the information.
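To see the effect of an index for yourself, you can ask the database how it plans to run a query. A minimal sketch, assuming SQLite (via Python's sqlite3 module) rather than whatever DBMS you actually use; the index name idx_posts_topic, the plan() helper, and the sample rows are made up for illustration, and the exact plan wording differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_posts (id INTEGER PRIMARY KEY, title VARCHAR(80), topic VARCHAR(80))"
)
# Hypothetical data: 1000 posts, every tenth one about programming.
conn.executemany(
    "INSERT INTO t_posts (title, topic) VALUES (?, ?)",
    [(f"post {i}", "programming" if i % 10 == 0 else "cooking") for i in range(1000)],
)

query = "SELECT id, title FROM t_posts WHERE topic = 'programming'"

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)   # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_posts_topic ON t_posts (topic)")
plan_after = plan(query)    # now the plan mentions idx_posts_topic

print(plan_before)
print(plan_after)
```

Before the index exists the plan reports a scan of t_posts; afterwards it reports a search using idx_posts_topic, which is the "look it up in the small tree, then fetch the rows" behaviour described above.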
Simplification
I lied. In the last paragraph, I made it all much simpler than it really is. There is an optimiser that will look at the query and determine which indexes can be used. Depending on the cardinality, the table size, and the columns involved, it might use an index, or decide to scan the table anyway. If your table has variable row lengths, fetching the X-th row is much slower than when all rows have the same length (no VARCHAR). And all of that depends on which (R)DBMS (or in MySQL, even on which storage engine) you use.
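One way to watch the optimiser make that choice, again only as a sketch that assumes SQLite: the same index gets used for one query and ignored for another. The leading-wildcard LIKE is my own example of a query the index cannot help with; other systems will decide differently and for other reasons (cardinality, statistics, table size).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_posts (id INTEGER PRIMARY KEY, title VARCHAR(80), topic VARCHAR(80))"
)
conn.execute("CREATE INDEX idx_posts_topic ON t_posts (topic)")  # hypothetical index name

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Exact match on the indexed column: the optimiser can use the index.
exact_match = plan("SELECT title FROM t_posts WHERE topic = 'programming'")

# Leading wildcard: the tree cannot be searched by prefix, so the table is scanned.
wildcard = plan("SELECT title FROM t_posts WHERE topic LIKE '%gram%'")

print(exact_match)
print(wildcard)
```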
But read about indexes first, on the what, the why, and later the how. After that, you can study the exceptions deeper.
Multiple tables for the same data
This is a very common beginner mistake, and it goes both ways:
Reading about indexes will tell you why this is technically a bad idea, but it’s also less elegant on a logical level: one table is meant to represent one entity (Books. Users. Posts. Pages), and splitting those will result in some very ugly queries. And if you ask someone why they are doing this, the reason is often "for speed", while an extra index on their decision-column would have had the same effect.
Think about it: if you make a posts table for each user, try writing the query that lists the 10 most used topics and how many posts each of them has. You will have to name every table!
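For comparison, with all posts in one table that query is a single GROUP BY. A small sketch using the question's single-topic t_posts layout, run in SQLite through Python's sqlite3 module; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t_posts (id INTEGER PRIMARY KEY, userid VARCHAR(80), topic VARCHAR(80))"
)
# Hypothetical posts by three users, all living in the same table.
conn.executemany(
    "INSERT INTO t_posts (userid, topic) VALUES (?, ?)",
    [
        ("alice", "programming"), ("bob", "programming"), ("carol", "programming"),
        ("alice", "databases"), ("bob", "databases"),
        ("carol", "cooking"),
    ],
)

# The 10 most used topics and how many posts each has.
top_topics = conn.execute("""
    SELECT topic, COUNT(*) AS post_count
    FROM t_posts
    GROUP BY topic
    ORDER BY post_count DESC, topic
    LIMIT 10
""").fetchall()
print(top_topics)  # [('programming', 3), ('databases', 2), ('cooking', 1)]
```

With one table per user, the same result needs a UNION ALL over every one of those tables, and the query has to be rewritten each time a user signs up.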