I’m fairly new to working with relational databases, but have read a few books and know the basics of good design.
I’m facing a design decision, and I’m not sure how to continue. Here’s a very over simplified version of what I’m building: People can rate photos 1-5, and I need to display the average votes on the picture while keeping track of the individual votes. For example, 12 people voted 1, 7 people voted 2, etc. etc.
The normalization freak of me initially designed the table structure like this:
Table pictures
id* | picture | userID |
Table ratings
id* | pictureID | userID | rating
With all the foreign key constraints and everything set as they shoudl be. Every time someone rates a picture, I just insert a new record into ratings and be done with it.
To find the average rating of a picture, I’d just run something like this:
SELECT AVG(rating) FROM ratings WHERE pictureID = '5' GROUP by pictureID
Having it setup this way lets me run my fancy statistics to. I can easily find who rated a certain picture a 3, and what not.
Now I’m thinking if there’s a crapload of ratings (which is very possible in what I’m really designing), finding the average will became very expensive and painful.
Using a non-normalized version would seem to be more efficient. e.g.:
Table picture
id | picture | userID | ratingOne | ratingTwo | ratingThree | ratingFour | ratingFive
To calculate the average, I’d just have to select a single row. It seems so much more efficient, but so much more uglier.
Can someone point me in the right direction of what to do? My initial research shows that I have to “find the right balance”, but how do I go about finding that balance? Any articles or additional reading information would be appreciated as well.
Thanks.
Your normalized approach makes a lot of sense, the denormalized one doesn’t.
In my experience (Telco Performance Management, hundreds of thousands of datapoints per 1/4 hour) we would do the following:
For the telco the pictures rating would be re-calculated once daily, you should do it periodical (e.g. hourly )or every time you insert (re-calc for the picture rated, not the entire table). This depends on the amounts of ratings you get.
In the telco we also keep the rating-date in what is your ‘pictures’ table and a 1/4h timestamp in the ratings table, but I don’t think you need that level of detail.
The ‘denormalization’ is to move a calculateable fact (count (rating) and avg(rating)) to the pictures table. This saves CPU cycles, but costs more storage.