I have a bipartite graph and I’m looking for the most efficient iterative way to divide it into connected components. My recursive version has started overflowing the stack on large data sets. I’m willing to port from any language/pseudocode but for completeness I’ll be coding in C#.
My existing code is specialized for my data types. One partition is proteins, the other is spectra. Map and Set are C++ stdlib workalikes.
void recursivelyAssignProteinToCluster (long proteinId,
long clusterId,
Set<long> spectrumSet,
Map<long, Set<long>> spectrumSetByProteinId,
Map<long, Set<long>> proteinSetBySpectrumId,
Map<long, long> clusterByProteinId)
{
// try to assign the protein to the current cluster
var insertResult = clusterByProteinId.Insert(proteinId, clusterId);
if (!insertResult.WasInserted)
return;
// recursively add all "cousin" proteins to the current cluster
foreach (long spectrumId in spectrumSet)
foreach (var cousinProteinId in proteinSetBySpectrumId[spectrumId])
{
if (proteinId != cousinProteinId)
{
Set<long> cousinSpectrumSet = spectrumSetByProteinId[cousinProteinId];
recursivelyAssignProteinToCluster(cousinProteinId,
clusterId,
cousinSpectrumSet,
spectrumSetByProteinId,
proteinSetBySpectrumId,
clusterByProteinId);
}
}
}
Map<long, long> calculateProteinClusters (NHibernate.ISession session)
{
var spectrumSetByProteinId = new Map<long, Set<long>>();
var proteinSetBySpectrumId = new Map<long, Set<long>>();
var query = session.CreateQuery("SELECT pi.Protein.id, psm.Spectrum.id " + GetFilteredQueryString(FromProtein, ProteinToPeptideSpectrumMatch));
foreach (var queryRow in query.List<object[]>())
{
long proteinId = (long) queryRow[0];
long spectrumId = (long) queryRow[1];
spectrumSetByProteinId[proteinId].Add(spectrumId);
proteinSetBySpectrumId[spectrumId].Add(proteinId);
}
var clusterByProteinId = new Map<long, long>();
int clusterId = 0;
foreach (var pair in spectrumSetByProteinId)
{
long proteinId = pair.Key;
// for each protein without a cluster assignment, make a new cluster
if (!clusterByProteinId.Contains(proteinId))
{
++clusterId;
recursivelyAssignProteinToCluster(proteinId,
clusterId,
pair.Value,
spectrumSetByProteinId,
proteinSetBySpectrumId,
clusterByProteinId);
}
}
return clusterByProteinId;
}
As ShinTakezou suggested I refactored to put the stack on the heap and it works great. I used the DepthFirstSearch approach from digEmAll’s example.
var clusterByProteinId = new Map<long, long>();
int clusterId = 0;
var clusterStack = new Stack<KeyValuePair<long, Set<long>>>();
foreach (var pair in spectrumSetByProteinId)
{
long proteinId = pair.Key;
if (clusterByProteinId.Contains(proteinId))
continue;
// for each protein without a cluster assignment, make a new cluster
++clusterId;
clusterStack.Push(new KeyValuePair<long, Set<long>>(proteinId, spectrumSetByProteinId[proteinId]));
while (clusterStack.Count > 0)
{
var kvp = clusterStack.Pop();
// try to assign the protein to the current cluster
var insertResult = clusterByProteinId.Insert(kvp.Key, clusterId);
if (!insertResult.WasInserted)
continue;
// add all "cousin" proteins to the current cluster
foreach (long spectrumId in kvp.Value)
foreach (var cousinProteinId in proteinSetBySpectrumId[spectrumId])
if (!clusterByProteinId.Contains(cousinProteinId))
clusterStack.Push(new KeyValuePair<long, Set<long>>(cousinProteinId, spectrumSetByProteinId[cousinProteinId]));
}
}
Here’s an example of an helper class that holds an undirected Graph and allows to get the connected components of it (iteratively):
Usage:
You could use the
Graph<>class instead of your maps, or if you want to stick with your maps have a look at the code that is quite easy to understand (inside the class it is used aDictionary<T,HashSet<T>>to hold nodes and arcs, so is very similar to your approach)