I need the equivalent of an SQL AUTO_INCREMENT id in Hadoop.
When my reduce task identifies a new item, that item needs a unique ID assigned.
- How can I share an atomic counter across the cluster? The reporter counters seem to be just increment counters; there’s no getAndIncrement feature that I see.
- How can I set that counter before the map/reduce phase of the job starts?
For distributed ID generation, you can either generate UUIDs or use Apache ZooKeeper, which provides distributed coordination for Hadoop clusters. Disclaimer: I have never used ZooKeeper, so I don’t know whether it can really (even theoretically) give you a globally contiguous set of IDs, which is what the question seems to be asking for.
Generating UUIDs does have a cost, though: each one takes some time to generate.
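As a rough illustration of the UUID route, here is a minimal Java sketch using the standard `java.util.UUID` class. The class name `UuidIds` and the method `newId` are made up for this example, and it is not tied to any Hadoop API. Note that the resulting IDs are unique without any coordination between tasks, but they are neither contiguous nor ordered, so this is not a true AUTO_INCREMENT replacement.

```java
import java.util.UUID;

// Hypothetical helper; the class and method names are illustrative only.
final class UuidIds {
    private UuidIds() {}

    // Returns a random (version 4) UUID in its 36-character string form.
    static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // A reduce task could call newId() locally whenever it sees a new item.
        System.out.println(newId());
        System.out.println(newId());
    }
}
```

Each task would generate IDs on its own, so there is no shared counter to initialize before the job starts.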
For good general information on distributed ID generation, see this Stack Overflow question.