I have a list of about a hundreds unique strings in C++, I need to check if a value exists in this list, but preferrably lightning fast.
I am currenly using a hash_set with std::strings (since I could not get it to work with const char*) like so:
stdext::hash_set<const std::string> _items; _items.insert('LONG_NAME_A_WITH_SOMETHING'); _items.insert('LONG_NAME_A_WITH_SOMETHING_ELSE'); _items.insert('SHORTER_NAME'); _items.insert('SHORTER_NAME_SPECIAL'); stdext::hash_set<const std::string>::const_iterator it = _items.find( 'SHORTER_NAME' ) ); if( it != _items.end() ) { std::cout << 'item exists' << std::endl; }
Does anybody else have a good idea for a faster search method without building a complete hashtable myself?
The list is a fixed list of strings which will not change. It contains a list of names of elements which are affected by a certain bug and should be repaired on-the-fly when opened with a newer version.
I’ve built hashtables before using Aho-Corasick but I’m not really willing to add too much complexity.
I was amazed by the number of answers. I ended up testing a few methods for their performance and ended up using a combination of kirkus and Rob K.’s answers. I had tried a binary search before but I guess I had a small bug implementing it (how hard can it be…).
The results where shocking… I thought I had a fast implementation using a hash_set… well, ends out I did not. Here’s some statistics (and the eventual code):
Random lookup of 5 existing keys and 1 non-existant key, 50.000 times
My original algorithm took on average 18,62 seconds
A lineair search took on average 2,49 seconds
A binary search took on average 0,92 seconds.
A search using a perfect hashtable generated by gperf took on average 0,51 seconds.
Here’s the code I use now:
bool searchWithBinaryLookup(const std::string& strKey) { static const char arrItems[][NUM_ITEMS] = { /* list of items */ }; /* Binary lookup */ int low, mid, high; low = 0; high = NUM_ITEMS; while( low < high ) { mid = (low + high) / 2; if(arrAffectedSymbols[mid] > strKey) { high = mid; } else if(arrAffectedSymbols[mid] < strKey) { low = mid + 1; } else { return true; } } return false; }
NOTE: This is Microsoft VC++ so I’m not using the std::hash_set from SGI.
I did some tests this morning using gperf as VardhanDotNet suggested and this is quite a bit faster indeed.
If your list of strings are fixed at compile time, use gperf http://www.gnu.org/software/gperf/ QUOTE: gperf is a perfect hash function generator. For a given list of strings, it produces a hash function and hash table, in form of C or C++ code, for looking up a value depending on the input string. The hash function is perfect, which means that the hash table has no collisions, and the hash table lookup needs a single string comparison only.
The output of gperf is not governed by gpl or lgpl, afaik.