According to Riak’s docs (using Python bindings), get_keys() is extremely expensive and not suitable for production. My question is whether a very simple map query is suitable. For instance, using a map stage only with the function:
function(v) { return [v.key]; }
Is this going to perform better than get_keys()? Why wouldn’t Riak ship with this implementation instead of the current version of get_keys()? Is there a better way to list the keys in a bucket?
The get_keys() function calls list_keys in the back end and is considered an expensive operation because it performs a full scan of the key space. Depending on your Riak back end, this could also involve a full scan of the data as stored on disk (InnoStore springs to mind). The default storage back end (Bitcask) stores all of your keys in memory, so performance shouldn’t be as much of a problem.

The other reason list_keys was considered expensive is that it was formerly a blocking operation, as it involved what the Basho developers refer to as a ‘fold’ over all of the keys. list_keys now uses a snapshot of the bucket (instead of reading the live key space), which makes it a lighter-weight operation as well.

This is made easier with an upgrade to Riak 1.0. If you’re using the LevelDB back end, you can enable secondary indexes on a bucket and use the $key index (automatically provided by Riak) to get a list of all keys in a bucket.

As for why Riak doesn’t ship with a better implementation of something like this: ask what the functionality is for. In an RDBMS, getting all primary keys of a table involves a full table scan. In Riak, getting all keys from a bucket requires scanning all data in every node and then shipping the key names back to the originating node, combining that data, and then sending it to the calling client. Because of Riak’s distributed, unordered state, this operation is expensive no matter how you slice it. There are, as I outlined above, ways to make it better.
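
To make the cost concrete, here is a rough sketch in plain Python of the scatter-gather pattern described above. It does not use the Riak client or talk to a real cluster: the cluster dictionary, the node names, and this list_keys function are hypothetical stand-ins, and the model is simplified (real Riak works per vnode and plans which partitions to cover). It only illustrates that every node walks everything it stores, including keys from other buckets, before the coordinator merges the results.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a cluster: node name -> {(bucket, key): value}.
# Keys from all buckets live side by side on each node, in no useful order.
Cluster = Dict[str, Dict[Tuple[str, str], bytes]]


def list_keys(cluster: Cluster, bucket: str) -> Tuple[List[str], int]:
    """Return (keys in bucket, number of stored entries scanned)."""
    scanned = 0
    merged = set()
    for store in cluster.values():
        # Each node must walk its whole key space; nothing lets it
        # jump straight to the entries of one bucket.
        for entry_bucket, key in store:
            scanned += 1
            if entry_bucket == bucket:
                # The coordinator merges results; replicas of the same
                # key held on several nodes collapse into one entry.
                merged.add(key)
    return sorted(merged), scanned


cluster: Cluster = {
    "node1": {("users", "alice"): b"a", ("logs", "1"): b"x"},
    "node2": {("users", "bob"): b"b", ("users", "alice"): b"a", ("logs", "2"): b"y"},
    "node3": {("logs", "3"): b"z", ("users", "bob"): b"b"},
}

keys, scanned = list_keys(cluster, "users")
print(keys, scanned)
```

In this toy model the work grows with the total number of entries stored across the cluster, not with the size of the bucket you asked about, which is why the operation stays expensive however the client call is spelled.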