I am using MySQL version 5.5.8.
Let's say I have a table, entries, structured like so:
entry_id int PK
member_id FK
There can be multiple entries for each member. I want to pick 10 members at random, but in a way that makes a member's odds of being selected increase with the number of entries that member has. I know I could just do something like:
SELECT member_id
FROM entries
GROUP BY member_id
ORDER BY RAND()
LIMIT 10
But I’m not sure if that will do what I want. Will MySQL group the records THEN select 10? If that were the case then every member would have the same chance to get picked, which is not what I want. I have done some testing and searching but can’t come up with a definitive answer. Does anyone know if this will do what I want or do I have to do things a different way? Any help would be appreciated. Thanks much!
LIMIT 10 will choose 10 records based on (in this case) a random order. This does indeed happen after the grouping, so every member would have the same chance of being picked.
Maybe you can ORDER BY RAND() / COUNT(*). That way, the number is likely to be smaller for members with more entries, so they are more likely to end up in the top 10.
[edit]
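As a minimal sketch of the weighted ordering suggested above, using the entries table from the question. RAND() / COUNT(*) favours members with more entries, but the odds are not exactly proportional to the entry count. The second query is a swapped-in variant, not part of the original suggestion: it uses an exponentially distributed sort key, which makes each pick proportional to the entry count.

```sql
-- Suggested form: members with more entries tend to get a smaller
-- sort key, so they are more likely to land in the first 10 rows.
SELECT member_id
FROM entries
GROUP BY member_id
ORDER BY RAND() / COUNT(*)
LIMIT 10;

-- Variant (an assumption, not from the answer): -LOG(1 - RAND()) is
-- exponentially distributed, so dividing it by the entry count gives
-- a draw weighted in proportion to that count.
-- RAND() returns a value in [0, 1), so 1 - RAND() is never 0.
SELECT member_id
FROM entries
GROUP BY member_id
ORDER BY -LOG(1 - RAND()) / COUNT(*)
LIMIT 10;
```

Both queries still sort every group on each call, so they do not address the speed concern that comes up with larger tables.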
By the way, it seems that over time (as the data grows) ORDER BY RAND() becomes slower. There are a couple of ways to work around that. MediaWiki (the software behind Wikipedia) has an interesting method: it stores a random number with each page, so when you select 'random page', it generates one random number between 0 and 1 and selects the page whose stored number is closest to it.

That saves having to generate that temporary table for each query. You will need to periodically regenerate the numbers as your data grows, and you must make sure the numbers are evenly distributed. That is easy enough: for new records, you can just generate a random number. Periodically the entire list is updated: all records are queried, and then each record, in that order, is assigned a number between 0 and 1 that increments by 1 / recordCount from one record to the next. That way, the records are evenly spaced, and the chance of finding them is the same for each one of them.

You could use that method too. It will make your query faster in the long run, and you could make the distribution smarter:
1) Instead of using memberCount, you can use totalEntryCount.
2) Instead of incrementing by 1 / memberCount, you could use entryCountForMember / totalEntryCount.
That way, the gap before members with more entries will be bigger, and therefore the chance of them matching the random number will be bigger as well.

For instance, picture a list of your members showing, for each one, the stored number and the delta that was added to reach it. The delta isn't saved, of course, but it shows the added number. In the MediaWiki example, this delta would be the same for each page, but in your case, it could depend on the number of entries. If bob comes first in that list and has ten times as many entries as john, who comes next, there's only a small gap between bob and john, so the chance that you pick a random number between 0 and bob is ten times as large as picking a random number between bob and john. So, the chances of picking bob are ten times as large as those of picking john.
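To make the bob and john example concrete, here is a small sketch. The member_pick table, its pick_pos column, the member ids and the entry counts (10 for bob, 1 for john) are all hypothetical, chosen only to illustrate the idea. Each row stores the upper end of that member's slice of the 0 to 1 range, and a pick takes the first row whose stored number lies above the random number.

```sql
-- Hypothetical lookup table: one row per member, holding the upper
-- end of that member's slice of the 0..1 range.
CREATE TABLE member_pick (
  member_id INT NOT NULL PRIMARY KEY,
  pick_pos  DOUBLE NOT NULL,
  KEY idx_pick_pos (pick_pos)
);

-- Illustrative data only: member 1 ("bob") has 10 entries and
-- member 2 ("john") has 1, so totalEntryCount is 11.
INSERT INTO member_pick (member_id, pick_pos) VALUES
  (1, 10 / 11),   -- delta 10/11: bob owns the range 0 to about 0.909
  (2, 11 / 11);   -- delta 1/11: john owns the range about 0.909 to 1

-- Draw the random number once. RAND() written directly in the WHERE
-- clause would be re-evaluated for every row.
SET @r := RAND();

-- The first stored number above @r identifies the chosen member.
SELECT member_id
FROM member_pick
WHERE pick_pos > @r
ORDER BY pick_pos
LIMIT 1;
```

Each run of the last two statements picks one member, and the index on pick_pos lets MySQL find that row without sorting the whole table. To get 10 different members you would repeat the draw and skip members that were already picked.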
You will need a (cron) job to periodically redistribute the numbers, because you don't want to do that on each modification. For the kind of data you're dealing with, it doesn't have to be real-time, and it makes your queries a lot faster if you have many members and many entries.
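One possible shape for that periodic job, as a sketch only. It uses MySQL's event scheduler instead of cron (a substitution, not what the answer prescribes), and the member_pick table, the rebuild_member_pick procedure, the refresh_member_pick event and the one-hour interval are hypothetical names and values. It assumes every row in entries has a non-NULL member_id.

```sql
-- Hypothetical lookup table holding each member's position in 0..1.
CREATE TABLE IF NOT EXISTS member_pick (
  member_id INT NOT NULL PRIMARY KEY,
  pick_pos  DOUBLE NOT NULL,
  KEY idx_pick_pos (pick_pos)
);

DELIMITER //

CREATE PROCEDURE rebuild_member_pick()
BEGIN
  -- totalEntryCount, and a running sum of entries seen so far.
  SET @total := (SELECT COUNT(*) FROM entries);
  SET @running := 0;

  START TRANSACTION;

  DELETE FROM member_pick;

  -- Each member's position is the running entry count divided by the
  -- total, so the gap before a member is entryCountForMember / total.
  INSERT INTO member_pick (member_id, pick_pos)
  SELECT member_id, (@running := @running + cnt) / @total
  FROM (
    SELECT member_id, COUNT(*) AS cnt
    FROM entries
    GROUP BY member_id
  ) AS per_member;

  COMMIT;
END //

DELIMITER ;

-- Runs only if the scheduler is enabled, for example with
-- SET GLOBAL event_scheduler = ON;
CREATE EVENT refresh_member_pick
  ON SCHEDULE EVERY 1 HOUR
  DO CALL rebuild_member_pick();
```

A cron job could just as well call the same procedure through the mysql client. The order in which members are processed does not matter here, since each member's gap depends only on its own entry count.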