I have around 3500 flood control facilities that I would like to represent as a network to determine flow paths (essentially a directed graph). I’m currently using SQL Server and a CTE to recursively examine all the nodes and their upstream components, and this works as long as the upstream path doesn’t fork a lot. However, some queries take exponentially longer than others, even when they are not much farther physically down the path (i.e. two or three segments ‘downstream’), because of the added upstream complexity; in some cases I’ve let a query run for over ten minutes before killing it. I’m using a simple two-column table: one column is the facility itself and the other is the facility that is upstream from the one listed in the first column.
I tried adding an index on the current-facility column to help speed things up, but that made no difference. As for the possible connections in the graph, any node can have multiple upstream connections and can be connected to from multiple ‘downstream’ nodes.
It is certainly possible that there are cycles in the data, but I have not yet figured out a good way to verify this (other than when the CTE query reported that the maximum recursion limit had been hit; those were easy to fix).
So, my question is, am I storing this information wrong? Is there a better way other than a CTE to query the upstream points?
I know nothing about flood control facilities, but I would start from the first facility and use a temp table and a WHILE loop to generate the path.
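DECLARE @intN INT;
SET @intN = 1;

-- Initial condition: insert the first items in the list, i.e. those with no upstream items.
-- (Assumption: such rows have a NULL LastNode, and YourTable stands in for your own table.)
INSERT INTO TempTable (LastNode, CurrentNode, N)
SELECT LastNode, CurrentNode, @intN
FROM YourTable
WHERE LastNode IS NULL;

WHILE @intN <= 3500
BEGIN
    SET @intN = @intN + 1;

    -- Add every link whose upstream node was reached in the previous pass.
    INSERT INTO TempTable (LastNode, CurrentNode, N)
    SELECT LastNode, CurrentNode, @intN
    FROM YourTable
    WHERE LastNode IN (SELECT CurrentNode FROM TempTable WHERE N = @intN - 1);

    -- Stop early once a pass adds nothing new.
    IF @@ROWCOUNT = 0 BREAK;
END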
If we assume that every node points to one child, then this should take no more than 3500 iterations. If multiple nodes share the same upstream provider, it will take fewer. More importantly, it lets you do this…
SELECT LastNode, CurrentNode, N FROM TempTable ORDER BY N
That will let you see whether there are any loops or other issues with your provider. Incidentally, 3500 rows is not that many, so even in the worst case, where each provider points to a different upstream provider, this should not take long.
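If it helps to check the data outside the database first, here is a minimal Python sketch of the same ideas. It is only an illustration, not a drop-in replacement: the function names are made up, and it assumes the two-column table has been read into a list of (facility, upstream facility) pairs, with None where there is nothing upstream. The likely reason the CTE slows down on forks is that it follows every distinct path, so a facility reachable along many paths is revisited many times; keeping a set of facilities already seen visits each one only once. The second function reports one cycle if any exists.

```python
from collections import defaultdict, deque

_DONE = object()  # sentinel marking an exhausted iterator


def build_upstream_map(links):
    """Map each facility to the list of facilities directly upstream of it."""
    upstream = defaultdict(list)
    for facility, up in links:
        if up is not None:  # assumption: None means "nothing upstream"
            upstream[facility].append(up)
    return upstream


def all_upstream(links, start):
    """Return {facility: level} for every facility upstream of start.

    Breadth-first, so level is the smallest number of segments away.
    Each facility is visited once, even if many paths lead to it.
    """
    upstream = build_upstream_map(links)
    levels = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        for up in upstream.get(node, ()):
            if up not in seen:
                seen.add(up)
                levels[up] = depth + 1
                queue.append((up, depth + 1))
    return levels


def find_cycle(links):
    """Return one cycle as a list of facilities, or None if there is none.

    Iterative depth-first search, so long chains do not hit Python's
    recursion limit.
    """
    upstream = build_upstream_map(links)
    state = {}  # "active" while on the current path, "done" afterwards
    for root in list(upstream):
        if root in state:
            continue
        state[root] = "active"
        path = [root]
        stack = [iter(upstream.get(root, ()))]
        while stack:
            nxt = next(stack[-1], _DONE)
            if nxt is _DONE:
                state[path.pop()] = "done"
                stack.pop()
            elif state.get(nxt) == "active":
                # Walked back onto the current path: that is a cycle.
                return path[path.index(nxt):] + [nxt]
            elif nxt not in state:
                state[nxt] = "active"
                path.append(nxt)
                stack.append(iter(upstream.get(nxt, ())))
    return None


# Hypothetical sample data: D is fed by C and B, which are both fed by A.
links = [("D", "C"), ("D", "B"), ("C", "A"), ("B", "A"), ("A", None)]
print(all_upstream(links, "D"))             # {'C': 1, 'B': 1, 'A': 2}
print(find_cycle(links))                    # None
print(find_cycle(links + [("A", "D")]))     # ['D', 'C', 'A', 'D']
```

The same visited-set idea carries over to the temp table approach: only insert a node if it is not already in the temp table, and the loop stops growing once every reachable facility has been recorded.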