I have some data like this: 1 2 3 4 5 9 2 6

Question

0

Asked: May 17, 20262026-05-17T15:06:43+00:00 2026-05-17T15:06:43+00:00

I have some data like this: 1 2 3 4 5 9 2 6

0

I have some data like this:

and am looking for an output like this (group-id and the members of that group):

1: 1 2 6
2: 3 4 7
3: 5 9

First row because 1 is “connected” to 2 and 2 is connected to 6.
Second row because 3 is connected to 4 and 3 is connected to 7

This looked to me like a graph traversal but the final order does not matter so I was wondering if someone can suggest a simpler solution that I can use on a large dataset (billions of entries).

From the comments:

The problem is to find the set of disjoint sub-graphs given a set of edges.
The edges are not directed; the line ‘1 2’ means that 1 is connected to 2 and 2 is connected to 1.
The ‘1:’ in the sample output could be ‘A:’ without changing the meaning of the answer.

EDIT 1:

Problem looks solved now. Thanks to everyone for their help. I need some more help picking the best solution that can be used on billions of such entries.

EDIT 2:

Test Input file:

Benchmarks:

I tried out everything and the version posted by TokenMacGuy is the fastest on the sample dataset that I tried. The dataset has about 1 million entries for which it took me about 6 seconds on a Dual Quad-Core 2.4GHz machine. I haven’t gotten a chance to run it on the entire dataset yet but I will post the benchmark as soon as it is available.

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-17T15:06:43+00:00

I’ve managed O(n log n).

Here is a (somewhat intense) C++ implementation:

#include <boost/pending/disjoint_sets.hpp>
#include <boost/property_map/property_map.hpp>

#include <map>
#include <set>
#include <iostream>


typedef std::map<int, int> rank_t;
typedef std::map<int, int> parent_t;

typedef boost::associative_property_map< rank_t > rank_pmap_t;
typedef boost::associative_property_map< parent_t > parent_pmap_t;

typedef boost::disjoint_sets< rank_pmap_t, parent_pmap_t > group_sets_t;

typedef std::set<int> key_set;
typedef std::map<int, std::set<int> > output;

With some typedefs out of the way, here’s the real meat. I’m using boost::disjoint_sets, which is just happens to be an exceptionally good representation for the problem. This first function checks to see if either of the keys given have been seen before, and adds them to the collections if needed. the important part is really the union_set(a, b) which links the two sets together. If one or the other of the sets are already in the groups collection, they get linked too.

void add_data(int a, int b, group_sets_t & groups, key_set & keys)
{
  if (keys.count(a) < 1) groups.make_set(a);
  if (keys.count(b) < 1) groups.make_set(b);
  groups.union_set(a, b);
  keys.insert(a);
  keys.insert(b);
}

This isn’t too exciting, it just iterates through all of the keys we’ve seen and gets the representative key for that key, then adds the pair (representative, key) to a map. Once that’s done, print out the map.

void build_output(group_sets_t & groups, key_set & keys)
{
  output out;
  for (key_set::iterator i(keys.begin()); i != keys.end(); i++)
    out[groups.find_set(*i)].insert(*i);

  for (output::iterator i(out.begin()); i != out.end(); i++)
  {
    std::cout << i->first << ": ";
    for (output::mapped_type::iterator j(i->second.begin()); j != i->second.end(); j++)
      std::cout << *j << " ";
    std::cout << std::endl;
  }
}

int main()
{

  rank_t rank;
  parent_t parent;
  rank_pmap_t rank_index(rank);
  parent_pmap_t parent_index(parent);


  group_sets_t groups( rank_index, parent_index );
  key_set keys;

  int a, b;
  while (std::cin >> a)
  {
    std::cin >> b;
    add_data(a, b, groups, keys);
  }  


  build_output(groups, keys);
  //std::cout << "number of sets: " << 
  //  groups.count_sets(keys.begin()), keys.end()) << std::endl;

}

I stayed up many hours learning how to use boost::disjoint_sets on this problem. There doesn’t seem to be much of any documentation on it.

About the performance. The disjoint_sets structure is O(α(n) ) for its key operations (make_set, find_set and union_set) which is pretty close to constant, and so if it were just a matter of building the structure, the whole algorithm would be O(n α(n) ) (which is effectively O(n) ) but we have to print it out. That means we have to build up some associative containers, which cannot perform better than O(n log n). It might be possible to get a constant factor speedup by choosing a different associative containers (say, hash_set etc), since once you populate the initial list, you can reserve an optimal amount of space.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I have some data like this: 1 2 3 4 5 9 2 6

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply