Excuse me if this has been asked before, but I tried looking for something similar but couldn’t find anything.
I have three tables: users, hobbies and user_hobbies (linking the first two). I want to calculate the similarity betweet two users based on their hobbies. For this, I need, first of all, two sets: User A hobbies and user B hobbies, which I can acquire with two simple queries. I have to calculate these two sets for other reasons too, in a php file, so they are available to me, in two arrays, for the next step:
I have to calculate their common hobbies (i.e. the intersection of the sets).
Idea #1: Having two arrays, I can calculate through some method the common elements.
Idea #2: I can make a third query (e.g. SELECT hobby FROM user_hobbies WHERE user_id IN ('uid_A', 'uid_B') GROUP BY hobby HAVING COUNT (*) = 2) and not bother myself.
I suppose my question is about performance. Is it quicker to calculate manually or are mysql queries much faster?
You already have a normalized table to hold the user-hobbies table, so why not go with that?
Generally speaking, SQL will be much faster, at least for the first 100k records or so. Then you’ll see a performance drop on queries that vet through columns that aren’t indexed, or from queries that use the ‘filesort’ to order large datasets brought on by the ORDER BY keyword.
For scalability, I recommend using an inner join to narrow down the possibilities for starters.
Think critically about this. Are there any other columns not mentioned could indicate that the user could have more than one hobby? These are the things you consider when looking to scale your application.
Otherwise, you should be fine for starters, lest you should be optimizing prematurely.
I would go with Option #2.