I’ve built a UI widget that allows me to create a set of nested rules. For example, I could specify the following rules:
Match ALL of these rules
- Document Status == Open
- Has Tag = 'sales'
- Has Tag = 'question'
- Match ANY of these rules
    - Has Tag = 'important'
    - Has Tag = 'high-priority'
    - Has Tag = 'critical-priority'
In English, this would translate to this query:
Find Documents where status = Open AND has tag 'sales' AND has tag 'question'
AND has at least one of these tags: 'important', 'high-priority', 'critical-priority'
The table structure looks similar to this.
Documents {id, title, status}
Tags {document_id, tag_value}
Now I need to translate this set of rules into a SQL query. It can be done fairly easily using subqueries, but I’d rather avoid them for performance reasons: the Documents and Tags tables could each contain millions of records.
SELECT
    d.id
FROM
    Documents d
WHERE
    d.status = 'Open'
    AND EXISTS (SELECT * FROM Tags t WHERE t.document_id = d.id AND t.tag_value = 'sales')
    AND EXISTS (SELECT * FROM Tags t WHERE t.document_id = d.id AND t.tag_value = 'question')
    AND (
        EXISTS (SELECT * FROM Tags t WHERE t.document_id = d.id AND t.tag_value = 'important')
        OR EXISTS (SELECT * FROM Tags t WHERE t.document_id = d.id AND t.tag_value = 'high-priority')
        OR EXISTS (SELECT * FROM Tags t WHERE t.document_id = d.id AND t.tag_value = 'critical-priority')
    )
How do I rewrite this query to use more efficient joins?
I could add the first two Tag rules as INNER joins, but how do I process the latter part of the rule set? What if there are further rules that require a tag to be present for the document to appear?
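One possible join-based handling of the ANY group, as a sketch rather than a tuned solution: join Tags once more with an IN list covering the alternatives, and use DISTINCT so a document carrying several of those tags is returned once. The SQL is run through Python's sqlite3 module purely as a test harness; the sample rows are made up, and the column names follow the Documents/Tags structure above.

```python
import sqlite3

# In-memory stand-in for the real tables; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Documents (id INTEGER PRIMARY KEY, title TEXT, status TEXT);
    CREATE TABLE Tags (document_id INTEGER, tag_value TEXT);
    INSERT INTO Documents VALUES (1, 'a', 'Open'), (2, 'b', 'Open'), (3, 'c', 'Closed');
    INSERT INTO Tags VALUES
        (1, 'sales'), (1, 'question'), (1, 'important'), (1, 'high-priority'),
        (2, 'sales'), (2, 'question'),
        (3, 'sales'), (3, 'question'), (3, 'important');
""")

# Each required tag gets its own INNER JOIN; the ANY group is a single
# join with an IN list. DISTINCT collapses the duplicate rows produced
# when a document has more than one tag from the ANY group.
JOIN_QUERY = """
    SELECT DISTINCT d.id
    FROM Documents d
    INNER JOIN Tags t1 ON t1.document_id = d.id AND t1.tag_value = 'sales'
    INNER JOIN Tags t2 ON t2.document_id = d.id AND t2.tag_value = 'question'
    INNER JOIN Tags t3 ON t3.document_id = d.id
        AND t3.tag_value IN ('important', 'high-priority', 'critical-priority')
    WHERE d.status = 'Open'
    ORDER BY d.id
"""

join_ids = [row[0] for row in conn.execute(JOIN_QUERY)]
print(join_ids)  # prints [1]
```

This only covers a flat ALL list containing one ANY group. Whether it beats the EXISTS version on MySQL depends on the available indexes and is worth checking with EXPLAIN.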
Keep in mind that a rule set can be set to match ALL or ANY of the rules in it, and that it could theoretically nest many times over.
Any ideas on a general direction to take to tackle this problem?
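One general direction, sketched under assumptions: represent the rule set as a tree and compile it recursively, so that each ALL group becomes a parenthesised AND, each ANY group a parenthesised OR, and nesting falls out of the recursion. The dict layout (`type`, `rules`, `value`) and the function names `compile_rule` and `find_documents` are hypothetical, not part of any existing widget. The leaves emit the same EXISTS predicates as the query above, which could be swapped for another form later. sqlite3 is used only to make the example runnable.

```python
import sqlite3

def compile_rule(rule, params):
    """Recursively turn a nested rule dict into a SQL boolean expression.

    Values are appended to `params` in placeholder order, never inlined.
    """
    kind = rule["type"]
    if kind in ("all", "any"):
        joiner = " AND " if kind == "all" else " OR "
        parts = [compile_rule(child, params) for child in rule["rules"]]
        if not parts:
            # Empty ALL matches everything; empty ANY matches nothing.
            return "1=1" if kind == "all" else "1=0"
        return "(" + joiner.join(parts) + ")"
    if kind == "status":
        params.append(rule["value"])
        return "d.status = ?"
    if kind == "tag":
        params.append(rule["value"])
        return ("EXISTS (SELECT 1 FROM Tags t "
                "WHERE t.document_id = d.id AND t.tag_value = ?)")
    raise ValueError(f"unknown rule type: {kind!r}")

def find_documents(conn, rules):
    params = []
    where = compile_rule(rules, params)
    sql = f"SELECT d.id FROM Documents d WHERE {where} ORDER BY d.id"
    return [row[0] for row in conn.execute(sql, params)]

# The rule set from the example above, as a tree.
RULES = {"type": "all", "rules": [
    {"type": "status", "value": "Open"},
    {"type": "tag", "value": "sales"},
    {"type": "tag", "value": "question"},
    {"type": "any", "rules": [
        {"type": "tag", "value": "important"},
        {"type": "tag", "value": "high-priority"},
        {"type": "tag", "value": "critical-priority"},
    ]},
]}

# Invented sample data for the demonstration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Documents (id INTEGER PRIMARY KEY, title TEXT, status TEXT);
    CREATE TABLE Tags (document_id INTEGER, tag_value TEXT);
    INSERT INTO Documents VALUES (1, 'a', 'Open'), (2, 'b', 'Open'), (3, 'c', 'Closed');
    INSERT INTO Tags VALUES
        (1, 'sales'), (1, 'question'), (1, 'important'), (1, 'high-priority'),
        (2, 'sales'), (2, 'question'),
        (3, 'sales'), (3, 'question'), (3, 'important');
""")

print(find_documents(conn, RULES))  # prints [1]
```

Because the recursion mirrors the rule tree, arbitrarily deep nesting needs no special cases. A composite index on (document_id, tag_value) is the usual way to keep each EXISTS probe cheap, though that should be confirmed against the real data.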
Update:
I’ve optimized my tables, and found a method of querying the tables that seems very quick (apart from COUNTing the number of matching records, which is another problem). I won’t ever be selecting more than 100 documents at a time, and with a document set of ~600k and ~2 million tags, this solution returns the results in ~0.02s, which is much better than before.
The tables in question…
CREATE TABLE `app_documents` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`account_id` int(11) NOT NULL,
`status_id` int(11) DEFAULT NULL,
`subject` varchar(255) COLLATE utf8_unicode_ci NOT NULL,
`created` datetime NOT NULL,
`updated` datetime NOT NULL,
PRIMARY KEY (`id`),
KEY `IDX_B91B1DB99B6B5FBA` (`account_id`),
KEY `IDX_B91B1DB96BF700BD` (`status_id`),
KEY `created_idx` (`created`),
KEY `updated_idx` (`updated`),
CONSTRAINT `FK_B91B1DB96BF700BD` FOREIGN KEY (`status_id`) REFERENCES `app_statuses` (`id`),
CONSTRAINT `FK_B91B1DB99B6B5FBA` FOREIGN KEY (`account_id`) REFERENCES `app_accounts` (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=500001 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `app_tags` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`value` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
PRIMARY KEY (`id`),
KEY `value_idx` (`value`)
) ENGINE=InnoDB AUTO_INCREMENT=8 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
CREATE TABLE `app_documents_tags` (
`document_id` int(11) NOT NULL,
`tag_id` int(11) NOT NULL,
PRIMARY KEY (`document_id`,`tag_id`),
KEY `IDX_A849587A700047D2` (`document_id`),
KEY `IDX_A849587ABAD26311` (`tag_id`),
CONSTRAINT `FK_A849587ABAD26311` FOREIGN KEY (`tag_id`) REFERENCES `app_tags` (`id`) ON DELETE CASCADE,
CONSTRAINT `FK_A849587A700047D2` FOREIGN KEY (`document_id`) REFERENCES `app_documents` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
And the query I was testing against…
This query finds all documents (joined to their tags) that have both the “blue” and “green” tags but not “red”.
SELECT
    d.*
FROM
    app_documents d
LEFT JOIN
    app_documents_tags dtg ON dtg.document_id = d.id
LEFT JOIN
    app_tags tg ON tg.id = dtg.tag_id
WHERE
    d.account_id = 1
    AND EXISTS (
        SELECT
            *
        FROM
            app_tags t1
        CROSS JOIN
            app_tags t2
        CROSS JOIN
            app_tags t3
        INNER JOIN
            app_documents_tags dtg1 ON t1.id = dtg1.tag_id
        INNER JOIN
            app_documents_tags dtg2 ON dtg1.document_id = dtg2.document_id AND dtg2.tag_id = t2.id
        -- anti-join: dtg3 stays NULL when the document lacks the 'red' tag
        LEFT JOIN
            app_documents_tags dtg3 ON dtg2.document_id = dtg3.document_id AND dtg3.tag_id = t3.id
        WHERE
            t1.value = 'blue'
            AND t2.value = 'green'
            AND t3.value = 'red'
            AND dtg3.document_id IS NULL
            -- correlate the subquery with the outer document
            AND dtg2.document_id = d.id
    )
ORDER BY
    d.created
LIMIT 45
I’m sure this can be improved using better indexes though.
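As a point of comparison, and only as a sketch, the same “blue and green but not red” filter can be written without the correlated subquery, using one join plus GROUP BY and conditional sums in HAVING. The table and column names follow the schema above, reduced to the columns the query needs; the rows are invented, and sqlite3 is used only so the example runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_documents (id INTEGER PRIMARY KEY, account_id INTEGER, created TEXT);
    CREATE TABLE app_tags (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE app_documents_tags (document_id INTEGER, tag_id INTEGER,
                                     PRIMARY KEY (document_id, tag_id));
    INSERT INTO app_tags VALUES (1, 'blue'), (2, 'green'), (3, 'red');
    INSERT INTO app_documents VALUES
        (1, 1, '2012-01-01'), (2, 1, '2012-01-02'),
        (3, 1, '2012-01-03'), (4, 2, '2012-01-04');
    INSERT INTO app_documents_tags VALUES
        (1, 1), (1, 2),          -- blue + green: should match
        (2, 1), (2, 2), (2, 3),  -- blue + green + red: excluded
        (3, 1),                  -- blue only: excluded
        (4, 1), (4, 2);          -- right tags, wrong account
""")

# One pass over the tag rows per document; each HAVING term counts how
# often a given tag value occurs for that document.
HAVING_QUERY = """
    SELECT d.id
    FROM app_documents d
    INNER JOIN app_documents_tags dt ON dt.document_id = d.id
    INNER JOIN app_tags t ON t.id = dt.tag_id
    WHERE d.account_id = 1
      AND t.value IN ('blue', 'green', 'red')
    GROUP BY d.id, d.created
    HAVING SUM(CASE WHEN t.value = 'blue'  THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN t.value = 'green' THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN t.value = 'red'   THEN 1 ELSE 0 END) = 0
    ORDER BY d.created
    LIMIT 45
"""

matching_ids = [row[0] for row in conn.execute(HAVING_QUERY)]
print(matching_ids)  # prints [1]
```

The inner joins drop documents that have none of the listed tags, which is fine here because “blue” and “green” are both required; a rule set consisting only of exclusions would need a LEFT JOIN instead. Whether this form is faster than the EXISTS version on MySQL is something to measure with EXPLAIN rather than assume.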
Formulate the query from the question as follows:
Given that description, here is the resulting query:
Make sure you have this index on the Documents table
Give it a try!