My table:
CREATE TABLE `beer`.`matches` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`hashId` int(10) unsigned NOT NULL,
`ruleId` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
If a hash has matched a rule, there’s an entry in this table.
1) Count how many hashIds there are for each unique ruleId (AKA “how many hashes matched each rule”)
SELECT COUNT(*), ruleId FROM `beer`.`matches` GROUP BY ruleId ORDER BY COUNT(*)
2) Select the 10 best rules (ruleIds), that is, the 10 rules that combined match the greatest number of unique hashes. This means that a rule that matches a lot of hashes is not necessarily a good rule if another rule covers all the same hashes. Basically, I want to select the 10 ruleIds that catch the most unique hashIds.
?
EDIT: Basically I have a sub-optimal solution in PHP/SQL here, but depending on the data it doesn’t necessarily give me the best answer to question 2). I’d be interested in a better solution. Read the comments for more information.
If you really want to find the optimal solution, the problem is that you have to check every possible combination of 10 ruleIds and count how many distinct hashIds each combination returns. The number of combinations is roughly (number of distinct ruleIds) ^ 10 (in fact it is smaller, since the same ruleId cannot be repeated within a combination and order does not matter: it's a combination of m elements taken in groups of 10).
NOTE: To be exact, the number of possible combinations is
m!/(n!(m-n)!), which for n = 10 is m!/(10!(m-10)!), where ! is factorial: m! = m * (m-1) * (m-2) * … * 3 * 2 * 1
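To get a feel for how fast that number grows, here is a small Python sketch of the formula above. The function name `combination_count` is only an illustration, and it assumes m >= n.

```python
import math

def combination_count(m: int, n: int = 10) -> int:
    """Number of ways to pick n distinct ruleIds out of m, ignoring order.

    Implements m! / (n! * (m - n)!) and assumes m >= n.
    """
    return math.factorial(m) // (math.factorial(n) * math.factorial(m - n))

# Print the count for a few hypothetical numbers of distinct ruleIds
for m in (20, 50, 100):
    print(m, combination_count(m))
```

With only 20 distinct ruleIds there are already 184756 combinations to evaluate.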
To build these combinations in SQL you have to join the table with itself 10 times, excluding ruleId combinations that were already produced in a different order.
Then you have to find the highest count of distinct hashIds among all those combinations.
This gigantic query would take a lot of time to run.
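The exhaustive search described above can also be sketched outside SQL. This is a minimal Python version, not the self-join query itself: it assumes the rows of the matches table have been loaded as (hashId, ruleId) pairs, and `best_rule_set` and the sample rows are made up for illustration.

```python
from itertools import combinations

def best_rule_set(matches, k):
    """Try every combination of k ruleIds and keep the one covering the most distinct hashIds.

    matches: iterable of (hashId, ruleId) pairs, i.e. the rows of the matches table.
    Returns (tuple of ruleIds, number of distinct hashIds covered).
    """
    # Group the hashIds matched by each rule
    hashes_by_rule = {}
    for hash_id, rule_id in matches:
        hashes_by_rule.setdefault(rule_id, set()).add(hash_id)

    best_rules, best_covered = (), set()
    # combinations() never repeats a ruleId and ignores order
    for rules in combinations(sorted(hashes_by_rule), k):
        covered = set().union(*(hashes_by_rule[r] for r in rules))
        if len(covered) > len(best_covered):
            best_rules, best_covered = rules, covered
    return best_rules, len(best_covered)

# Hypothetical sample rows: (hashId, ruleId)
rows = [(1, 1), (2, 1), (3, 1), (1, 2), (2, 2), (4, 3), (5, 3), (3, 4), (6, 4)]
print(best_rule_set(rows, 2))  # ((1, 3), 5)
```

It is only usable for small inputs, because the loop runs once per combination.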
There are much faster procedures, but they will only give you sub-optimal results.
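One such faster procedure is a greedy pick: repeatedly take the rule that adds the most hashIds not covered yet. This is a sketch of that idea only, not necessarily what the PHP/SQL solution mentioned in the question does. The function name and the sample rows are assumptions for illustration.

```python
def greedy_rule_set(matches, k):
    """Greedily pick up to k ruleIds, each time taking the rule that adds the most new hashIds.

    matches: iterable of (hashId, ruleId) pairs, i.e. the rows of the matches table.
    Returns (list of ruleIds in pick order, number of distinct hashIds covered).
    """
    remaining = {}
    for hash_id, rule_id in matches:
        remaining.setdefault(rule_id, set()).add(hash_id)

    chosen, covered = [], set()
    for _ in range(k):
        if not remaining:
            break
        # Rule with the largest number of hashIds that are still uncovered
        best = max(remaining, key=lambda r: len(remaining[r] - covered))
        if not (remaining[best] - covered):
            break  # no remaining rule adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, len(covered)

# Hypothetical sample rows: (hashId, ruleId)
rows = [(1, 1), (2, 1), (3, 1), (4, 1),
        (1, 2), (3, 2), (5, 2),
        (2, 3), (4, 3), (6, 3)]
print(greedy_rule_set(rows, 2))  # ([1, 2], 5)
```

On this sample the greedy pick covers 5 hashIds, while rules 2 and 3 together would cover all 6, which shows why the result is only sub-optimal. The work is about k passes over the rules instead of one pass per combination.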
SOME OPTIMIZATION:
This could be somewhat optimized, depending on the shape of the data, by looking for groups (the set of hashIds matched by a rule) that are equal to or included in other groups. This requires fewer than (m*(m+1))/2 comparisons, which is tiny compared to the number of combinations, and it pays off especially when it is likely that several groups can be discarded, since that lowers m. Anyway, the main query still has a gigantic cost.
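A minimal sketch of that pruning step in Python, assuming the hashIds of each rule have already been collected into sets. The name `drop_dominated_rules` and the sample data are hypothetical.

```python
def drop_dominated_rules(hashes_by_rule):
    """Discard every rule whose hashId set is equal to or included in another rule's set.

    hashes_by_rule: dict mapping ruleId -> set of hashIds matched by that rule.
    Returns a new dict containing only the rules that were kept.
    """
    # Largest sets first, so a rule can only be included in a rule handled before it
    ordered = sorted(hashes_by_rule, key=lambda r: len(hashes_by_rule[r]), reverse=True)
    kept = []
    for rule in ordered:
        # "<=" on sets is the subset-or-equal test
        if not any(hashes_by_rule[rule] <= hashes_by_rule[other] for other in kept):
            kept.append(rule)
    return {rule: hashes_by_rule[rule] for rule in kept}

# Hypothetical sample: rules 2 and 4 add nothing that rule 1 does not already cover
groups = {1: {1, 2, 3}, 2: {1, 2}, 3: {4, 5}, 4: {1, 2, 3}}
print(drop_dominated_rules(groups))  # {1: {1, 2, 3}, 3: {4, 5}}
```

Each rule is compared only against rules that were kept, so the number of set comparisons is at most m*(m-1)/2.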