I have a large(ish – >100K) collection mapping a user identifier (an int) to the count of different products that they’ve bought (also an int.) I need to re-organise the data as efficiently as possible to find how many users have different numbers of products. So for example, how many users have 1 product, how many users have two products etc.
I have acheived this by reversing the original data from a std::map into a std::multimap (where the key and value are simply reversed.) I can then pick out the number of users having N products using count(N) (although I also uniquely stored the values in a set so I could be sure of the exact number of values I was iterating over and their order)
Code looks like this:
// uc is a std::map<int, int> containing the original
// mapping of user identifier to the count of different
// products that they've bought.
std::set<int> uniqueCounts;
std::multimap<int, int> cu; // This maps count to user.
for ( map<int, int>::const_iterator it = uc.begin();
it != uc.end(); ++it )
{
cu.insert( std::pair<int, int>( it->second, it->first ) );
uniqueCounts.insert( it->second );
}
// Now write this out
for ( std::set<int>::const_iterator it = uniqueCounts.begin();
it != uniqueCounts.end(); ++it )
{
std::cout << "==> There are "
<< cu.count( *it ) << " users that have bought "
<< *it << " products(s)" << std::endl;
}
I just can’t help feeling that this is not the most efficient way of doing this. Anyone know of a clever method of doing this?
I’m limited in that I can’t use Boost or C++11 to do this.
Oh, also, in case anyone is wondering, this is neither homework, nor an interview question.
Assuming you know the maximum number of products that a single user could have bought, you might see better performance just using a vector to store the results of the operation. As it is you’re going to need an allocation for pretty much every entry in the original map, which likely isn’t the fastest option.
It would also cut down on the lookup overhead on a map, gain the benefits of memory locality, and replace the call to count on the multimap (which is not a constant time operation) with a constant time lookup of the vector.
So you could do something like this:
Even if you don’t know the maximum number of products, it seems like you could just guess a maximum and adapt this code to increase the size of the vector if required. It’s sure to result in less allocations than your original example anyway.
All this is assuming that you don’t actually require the user ids after you’ve processed this data of course (and as pointed out in the comments below, that the number of products bought for each user is a relatively small & contiguous set. Otherwise you might be better off using a map in place of a vector – you’ll still avoid calling the multimap::count function, but potentially lose some of the other benefits)