I have a graph consisting of about 35,000 nodes represented in plain text:
node1 -> node35000
node29420 -> node35000
node2334 -> node4116
...
I’d like to trim it down by removing nodes that are not part of a chain at least three long. So if I had only
1 -> 2;
2 -> 3;
3 -> 4;
0 -> 4;
I’d like to keep 1, 2, 3, and 4 (since 1 -> 2 -> 3 -> 4 is four nodes long) but discard 0, that is, remove 0 -> 4.
Any idea of a good way to do this? I tried a combination of Perl and shell functions but I think I need a better approach. Unless maybe there are tools to do this already? The data is in graphviz format but I didn’t see any tools in that suite relevant to the task at hand.
Oh, and if there’s an easy way to do something like this I’m open to suggestions — it doesn’t need to be exactly the task I suggested. I’m just looking for a way to remove most of the noise surrounding the big clumps (which are rare and mostly just a few intersecting chains).
Supposing that any given node can have arbitrarily many predecessors or successors, then in-degree and out-degree of nodes is irrelevant to solving the problem.
Following is a simple O(N+E) algorithm for all graphs of N nodes and E edges, under the path-length-3 criterion. This algorithm can be easily implemented in Perl or C. The method is based on a definition and an assertion: Define a “made node” as any node that has a parent and child (predecessor and successor). Every node that will be kept is a made node or is a parent or child of a made node.
Initialize a status array S[Nmax] to all zeroes. Nmax is the maximum node number. If Nmax is not known at outset, read all the data and find it out.
Read in the given list of edges. Each input item specifies a directed edge (p, q) from node p to node q. For each (p, q) item that is read in: Set S[p] to S[p] | 1 to denote that p has a child, and set S[q] to S[q] | 2 to denote that q has a parent. (After this step, every made node n has S[n] == 3.)
Read the list of edges again. For each (p, q) item that is read in: If (S[p]==3) or (S[q] == 3) output edge (p,q).
To extend this method to path length K other than 3, keep the edge list in memory, maintain Sp[] and Sc[] with lengths of parent chains and child chains, and perform K/2 extra passes. Might be possible to do in time O(N+K*E).
The problem does not specify whether the graph is a DAG (directed acyclic graph) but the example given is a DAG. For K>3, it may make a difference.
Update 1 Here’s a more precise statement of a K>3 algorithm, with H[i].p and H[i].q being endpoints of edge #i, and pc[j], cc[j] being lengths of predecessor and successor chains about node j. Also, let E = # of edges; N = # of nodes; and K = desired min chain length for keeping an edge.
Read E edge data entries into H[ ] array. Set all pc[j], cc[j] entries to 0.
For i = 1 to E, set cc[H[i].p]=1 and pc[H[i].q]=1.
For j = 1 to K+1, { for i = 1 to E, { Let p=H[i].p and q=H[i].q. Set cc[p] = max(cc[p], 1+cc[q]) and pc[q] = max(pc[q], 1+pc[p]). } }
For i = 1 to E, { Let p=H[i].p and q=H[i].q. Output edge (p,q) if pc[p]+cc[p]+1 >= K and pc[q]+cc[q]+1 >= K.}
This method can make mistakes if graph is not a DAG and contains short looped paths. For example, if graph edges include (1,2) and (2,1) and no other nodes link to nodes 1 or 2, then neither of those edges should be output; but we end up with K+2 for cc[] and pc[] of those nodes, so they get output anyway.