I am trying to get the list of connected components in a graph with 100 million nodes. For smaller graphs, I usually use the connected_components function of the Networkx module in Python which does exactly that. However, loading a graph with 100 million nodes (and their edges) into memory with this module would require ca. 110GB of memory, which I don’t have. An alternative would be to use a graph database which has a connected components function but I haven’t found any in Python. It would seem that Dex (API: Java, .NET, C++) has this functionality but I’m not 100% sure. Ideally I’m looking for a solution in Python. Many thanks.
Share
SciPy has a connected components algorithm. It expects as input the adjacency matrix of your graph in one of its sparse matrix formats and handles both the directed and undirected cases.
Building a sparse adjacency matrix from a sequence of
(i, j)pairsadj_listwhereiandjare (zero-based) indices of nodes can be done withYou’ll have to do some extra work for the undirected case.
This approach should be efficient if your graph is sparse enough.