I have a closure table for hierarchical data with 5 million nodes, which results in ~75 million rows in the closure table. Using SQLite, my query time is rising exponentially with the size of the closure table.
CREATE TABLE `Closure` (`Ancestor` INTEGER NOT NULL, `Descendant` INTEGER NOT NULL, `Depth` INTEGER, PRIMARY KEY (`Ancestor`, `Descendant`));
CREATE INDEX `Closure_AncestorDescendant` ON `Closure` (`Ancestor` ASC, `Descendant` ASC);
CREATE INDEX `Closure_DescendantAncestor` ON `Closure` (`Descendant` ASC, `Ancestor` ASC);
CREATE TABLE `Nodes` (`Node` INTEGER PRIMARY KEY NOT NULL, `Root` BOOLEAN NOT NULL, `Descendants` INTEGER NOT NULL);
My query to find the root nodes takes about 20 minutes with this many nodes, even though only about 5 or 6 nodes match.
SELECT `Closure`.`Ancestor` FROM `Closure`
LEFT OUTER JOIN `Closure` AS `Anc` ON `Anc`.`Descendant` = `Closure`.`Descendant`
AND `Anc`.`Ancestor` <> `Closure`.`Ancestor` WHERE `Anc`.`Ancestor` IS NULL;
20 minutes is too long, so right now I’m storing a bool for whether the node is a root and updating the Nodes.Root column when the node is moved. I’m not exactly happy with the duplicate data, but my query times are now in the single-digit milliseconds for every query.
I also have a lot of queries that need to know how many descendants a given node has (mostly whether Descendants > 1, to know if this object can be virtualized/expanded in a tree view). I used to query this every time I needed it, but across a database this large, even with indexes, the queries took too long (more than 1 second), so I also reduced them to the Nodes.Descendants column, which I update every time a node is moved. Unfortunately, this is another duplication of data I would like to avoid.
The query I used to use is below. If anyone can explain how to improve its performance (consider that I already have an index starting with Ancestor), I would appreciate it.
SELECT COUNT(*) FROM `Closure` WHERE `Ancestor`=@Node
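One way to check whether this count is actually served from the index is to ask SQLite for its query plan. The following is a minimal sketch using Python's standard `sqlite3` module against an empty in-memory copy of the `Closure` schema; the `explain` helper is a hypothetical name introduced here, and the exact plan wording varies between SQLite versions. Even when the plan reports an index search, `COUNT(*)` still has to visit every matching index entry, so its cost grows with the number of descendants of the node.

```python
import sqlite3

# In-memory database with the Closure schema from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Closure (
    Ancestor   INTEGER NOT NULL,
    Descendant INTEGER NOT NULL,
    Depth      INTEGER,
    PRIMARY KEY (Ancestor, Descendant)
);
CREATE INDEX Closure_DescendantAncestor ON Closure (Descendant, Ancestor);
""")

def explain(sql, params=()):
    """Hypothetical helper: return the 'detail' text of each plan step."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The detail text is the last column of every plan row.
    return [row[-1] for row in rows]

plan = explain("SELECT COUNT(*) FROM Closure WHERE Ancestor = ?", (1,))
for step in plan:
    print(step)
```

If the printed step mentions a search using an index on `Ancestor`, the remaining cost is the number of entries scanned rather than a missing index.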
Does the version of SQLite you’re developing on support foreign keys? If so, your closure table design should have an FK referencing the hierarchy table you’re supporting with the closure table. In T-SQL:
You’ll have to look up the relevant SQLite syntax, sorry.
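For SQLite, a foreign key along these lines might look like the sketch below. It is not the answerer's original code: it assumes both `Ancestor` and `Descendant` should reference `Nodes.Node`, and it relies on the fact that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON` is issued on each connection. Python's standard `sqlite3` module is used only to make the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Foreign key enforcement is off by default and must be enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE Nodes (
    Node        INTEGER PRIMARY KEY NOT NULL,
    Root        BOOLEAN NOT NULL,
    Descendants INTEGER NOT NULL
);
-- Assumption: both closure columns point at Nodes.Node.
CREATE TABLE Closure (
    Ancestor   INTEGER NOT NULL REFERENCES Nodes (Node),
    Descendant INTEGER NOT NULL REFERENCES Nodes (Node),
    Depth      INTEGER,
    PRIMARY KEY (Ancestor, Descendant)
);
""")

conn.execute("INSERT INTO Nodes VALUES (1, 1, 0)")
conn.execute("INSERT INTO Closure VALUES (1, 1, 0)")  # self row, valid

# Node 99 does not exist, so this closure row should be rejected.
try:
    conn.execute("INSERT INTO Closure VALUES (1, 99, 1)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("orphan row rejected:", rejected)
```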
Since you are already maintaining a depth field, which is the distance between the descendant and its ancestor, you could make use of it to tell if a given node has children.
That should come back fairly quickly regardless of the size of your closure table. If you get an empty set from that, then your given node cannot be expanded any more, because it has no children. EXISTS returns true as soon as it finds one row that meets your criteria, and you’re only taking the TOP 1, so you don’t return a row for every row in your closure table for the passed @Node.
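The idea described above, checking for a single row at depth 1 instead of counting all descendants, could be sketched for SQLite as follows. SQLite has no `TOP`, so this version relies on `EXISTS` alone. The `has_children` function, the sample tree, and the `Closure_AncestorDepth` index are assumptions made for illustration and are not part of the original schema; the index is optional and only lets the lookup seek directly on `(Ancestor, Depth)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Closure (
    Ancestor   INTEGER NOT NULL,
    Descendant INTEGER NOT NULL,
    Depth      INTEGER,
    PRIMARY KEY (Ancestor, Descendant)
);
-- Hypothetical extra index, not in the original schema.
CREATE INDEX Closure_AncestorDepth ON Closure (Ancestor, Depth);
""")

# Sample tree 1 -> 2 -> 3, including the depth-0 self rows.
conn.executemany(
    "INSERT INTO Closure VALUES (?, ?, ?)",
    [(1, 1, 0), (2, 2, 0), (3, 3, 0), (1, 2, 1), (2, 3, 1), (1, 3, 2)],
)

def has_children(conn, node):
    """Return True if `node` has at least one direct child (Depth = 1)."""
    row = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM Closure WHERE Ancestor = ? AND Depth = 1)",
        (node,),
    ).fetchone()
    return bool(row[0])

print(has_children(conn, 1))  # node 1 has child 2
print(has_children(conn, 3))  # node 3 is a leaf
```

Because `EXISTS` stops at the first matching row, the work done should not depend on how many descendants the node has.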
As for improving the performance of finding the roots, try something like the below. It’s what I use for finding roots, but my closure table is only ~200,000 rows. I compared the plans generated for each, though, and your code uses a Hash, which could be hurting performance because of the processing power it needs on the device (I’m assuming here that SQLite is for iPhone/iPad or some type of small distribution on devices). The below uses less processing power and more reads from indexes in its plan, and it makes use of the relationship of the hierarchy to the closure table. I cannot be certain that it will fix your performance woes, but it’s worth a shot.
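A root query that starts from the hierarchy table and probes the closure table might look like the sketch below. It is an illustration, not the answerer's original code: it assumes every non-root node has a closure row with `Depth = 1` pointing at its parent, and it uses a simplified `Nodes` table with only the `Node` column. The join condition on `Descendant` is meant to be served by the existing `Closure_DescendantAncestor` index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified Nodes table (Root and Descendants columns omitted for brevity).
CREATE TABLE Nodes (Node INTEGER PRIMARY KEY NOT NULL);
CREATE TABLE Closure (
    Ancestor   INTEGER NOT NULL,
    Descendant INTEGER NOT NULL,
    Depth      INTEGER,
    PRIMARY KEY (Ancestor, Descendant)
);
CREATE INDEX Closure_DescendantAncestor ON Closure (Descendant, Ancestor);
""")

# Two sample trees: 1 -> 2 -> 3 and 4 -> 5.
conn.executemany("INSERT INTO Nodes VALUES (?)", [(n,) for n in range(1, 6)])
conn.executemany(
    "INSERT INTO Closure VALUES (?, ?, ?)",
    [(n, n, 0) for n in range(1, 6)]            # self rows
    + [(1, 2, 1), (2, 3, 1), (1, 3, 2), (4, 5, 1)],
)

# A node is a root when no closure row names it as a direct child.
ROOTS_SQL = """
SELECT n.Node
FROM Nodes AS n
LEFT JOIN Closure AS c
       ON c.Descendant = n.Node AND c.Depth = 1
WHERE c.Ancestor IS NULL
ORDER BY n.Node
"""

roots = [row[0] for row in conn.execute(ROOTS_SQL)]
print(roots)
```

This does one index probe per node in `Nodes` rather than joining the closure table to itself, which is the kind of plan the paragraph above describes.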