I’ve been learning Ruby on Rails over the past few months with no prior programming experience. Lately, I’ve been thinking about database optimization and table organization. I know there are great books on the subject, but I typically learn by example / as I go.
Here’s a hypothetical situation:
Let’s say I am building a social network for a niche community with 250,000 members (users). The users have the ability to attend events. Let’s say there are 50,000 past/present/future events. Much like Facebook events, a user can attend any number of events and an event can have any number of attendees.
In the database, there would be a table for users and a table for events. Somehow I would have to create an association between the users and events. I could create an “events” column in the users table such that each user row would contain a hash of event IDs, or I could create an “attendees” column in the events table such that each event row would contain a hash of user IDs.
Neither of these solutions seems ideal, however. On a user’s profile page, I want to display the list of events they are associated with. If I include an “attendees” column in the events table, that would require scanning the 50,000 event rows for the user’s ID. Likewise, on an event page, I want to display a list of attendees for the event. If I include an “events” column in the users table, that would require scanning the 250,000 user rows for the event’s ID.
Option 3 would be to create a third table that contains the attendee information for each and every event – but I don’t see how this would solve any problems.
Are these non-issues? Rails makes accessing all of this information easy, but I guess I’m worried about scale. It is entirely possible that I am under-estimating the speed and processing power of modern databases / servers / etc. How long would it take to scan 250,000 user rows for specific event IDs – 10ms? 100ms? 1,000ms? I guess that’s not that bad. Am I just over-thinking this?
This is a typical many-to-many relationship between users and events.
You need a third table (say UserEvent, or better UserAttendsEvent, or just Attends) which will have one row for each pairing of a user with an event that user attends.
So it will have at least a userID and an eventID, as foreign keys to the User and Event tables respectively.
Adding indexes on these two fields will probably help your queries, since you plan to have millions of rows.
The UserEvent table may also hold other data, such as when a user registered for an event, how much money she spent on it, whether she enjoyed it, etc.
The catch is that every row holds information about one “Attends” fact: who attended (userID), which event was attended (eventID), when she arrived, how much she spent there, etc. You don’t want to put this information in either the User table or the Event table.
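As a minimal sketch of such a table, here it is in plain SQL driven from Python's built-in `sqlite3` module so it runs on its own. The table and column names (`attends`, `user_id`, `event_id`, `registered_at`) and the sample rows are illustrative assumptions, not something from your app. In Rails you would more likely express the same structure as a join model with a `has_many :through` association.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT NOT NULL);

-- One row per (user, event) pair: "this user attends this event".
CREATE TABLE attends (
    user_id       INTEGER NOT NULL REFERENCES users(id),
    event_id      INTEGER NOT NULL REFERENCES events(id),
    registered_at TEXT,                 -- example of per-attendance data
    PRIMARY KEY (user_id, event_id)     -- also serves lookups by user
);
-- Second index so lookups by event do not have to scan the whole table.
CREATE INDEX index_attends_on_event_id ON attends (event_id);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob"), (3, "Carol")])
conn.executemany("INSERT INTO events (id, title) VALUES (?, ?)",
                 [(10, "U2 concert"), (20, "Ruby meetup")])
conn.executemany("INSERT INTO attends (user_id, event_id) VALUES (?, ?)",
                 [(1, 10), (2, 10), (1, 20)])


def events_for_user(conn, user_id):
    """Titles of the events a user attends (the profile page query)."""
    rows = conn.execute(
        "SELECT e.title FROM attends a JOIN events e ON e.id = a.event_id "
        "WHERE a.user_id = ? ORDER BY e.title", (user_id,))
    return [title for (title,) in rows]


def attendees_of_event(conn, event_id):
    """Names of the users attending an event (the event page query)."""
    rows = conn.execute(
        "SELECT u.name FROM attends a JOIN users u ON u.id = a.user_id "
        "WHERE a.event_id = ? ORDER BY u.name", (event_id,))
    return [name for (name,) in rows]


print(events_for_user(conn, 1))
print(attendees_of_event(conn, 10))
```

Both page queries go through the same small table, one from the user side and one from the event side, which is why neither the users table nor the events table needs a list column.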
Since you are worried about performance, I’ll add an example of how the database would handle a specific query. Let’s say we want to find all users who attend (or plan to attend) the event “U2 concert in Athens, July 2011” and have the same birthday as me.
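A rough sketch of that query, again using Python's built-in `sqlite3` so it is runnable as-is. The schema, index names and sample people are made up for illustration. I am also assuming that “same birthday” means the same month and day, and that my own row should be left out of the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT NOT NULL, birthday TEXT);
CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE attends (
    user_id  INTEGER NOT NULL REFERENCES users(id),
    event_id INTEGER NOT NULL REFERENCES events(id),
    PRIMARY KEY (user_id, event_id)
);
CREATE INDEX index_events_on_title     ON events (title);
CREATE INDEX index_attends_on_event_id ON attends (event_id);
""")

# Hypothetical sample data; birthdays are ISO dates (YYYY-MM-DD).
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "Me",    "1985-07-30"),
    (2, "Alice", "1990-07-30"),   # same birthday, attends the concert
    (3, "Bob",   "1990-01-02"),   # attends the concert, different birthday
    (4, "Carol", "1979-07-30"),   # same birthday, different event
])
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    (10, "U2 concert in Athens, July 2011"),
    (20, "Ruby meetup"),
])
conn.executemany("INSERT INTO attends VALUES (?, ?)",
                 [(1, 10), (2, 10), (3, 10), (4, 20)])

SAME_BIRTHDAY_ATTENDEES = """
SELECT u.name
FROM events  AS e
JOIN attends AS a ON a.event_id = e.id
JOIN users   AS u ON u.id = a.user_id
WHERE e.title = :title
  AND u.id <> :me
  AND strftime('%m-%d', u.birthday) =
      (SELECT strftime('%m-%d', birthday) FROM users WHERE id = :me)
ORDER BY u.name
"""


def same_birthday_attendees(conn, title, me):
    """Names of attendees of the titled event who share my month and day of birth."""
    rows = conn.execute(SAME_BIRTHDAY_ATTENDEES, {"title": title, "me": me})
    return [name for (name,) in rows]


print(same_birthday_attendees(conn, "U2 concert in Athens, July 2011", 1))
```

With indexes like these, a database would typically find the event through the title index, find that event's rows in the attendance table through the eventID index, and then fetch only those users by primary key to compare birthdays. The exact plan is up to the query optimizer.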
So the database accesses the disk only to search the indexes and read the rows we need, not to read all the data and search through it.
The database may still scan a table, or part of it, and then search the data. This can happen in more complex queries, when the query needs all the data in a table, when a needed index hasn’t been created or an existing one isn’t useful, or when the query optimizer decides a scan is faster. But if “proper” indexes have been defined (proper for your planned use), the queries will be fast.
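If you want to see the difference for yourself, most databases can show the plan they intend to use. Here is a small sketch with SQLite's `EXPLAIN QUERY PLAN`, run through Python's built-in `sqlite3`. The table and index names are assumptions for illustration, the exact wording of the plan varies between SQLite versions, and other databases have their own `EXPLAIN` output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE attends (user_id INTEGER NOT NULL, event_id INTEGER NOT NULL)")

QUERY = "SELECT user_id FROM attends WHERE event_id = ?"


def plan(conn, sql, params=()):
    """Return SQLite's plan for a query as one string (last column is the detail text)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# No index yet: the only option is to read every row of the table.
before = plan(conn, QUERY, (10,))

# With an index on event_id, the planner can jump straight to the matching rows.
conn.execute("CREATE INDEX index_attends_on_event_id ON attends (event_id)")
after = plan(conn, QUERY, (10,))

print("without index:", before)
print("with index:   ", after)
```

The first plan should report a scan of `attends` and the second a search using `index_attends_on_event_id`, which is the distinction described above.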