I have the data set below, and I want a unique list of the values in the first column as the output, e.g. {9719, 382, ..}. There are also integers at the end of each line, so checking whether a line starts and ends with a number won't work, and I couldn't think of a solution. Can you show me how to do it? I'd really appreciate it if you could show it in detail (what to do in the map step and what to do in the reduce step).
id - - [date] "URL"
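In your mapper, parse each line and emit the token you are interested in from the beginning of the line (e.g. 9719) as the key of a key-value pair; the value is irrelevant in this case. Because the framework sorts and groups the mapper output by key before it reaches the reducer, each distinct key is handed to the reducer once, together with all of its values. All the reducer has to do is output each key once and ignore the values. (If your reducer instead sees a plain sorted stream of keys, as with Hadoop Streaming, output a key each time it changes.)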
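The two steps can be sketched in Python, written the way a Hadoop Streaming mapper and reducer would see their input (lines in, lines out). This is only an illustration under a few assumptions: the id is the first whitespace-separated token and consists of digits only, and the names `mapper`, `reducer`, and `run_local` are made up for this example. `run_local` imitates the shuffle with a plain `sorted()` so the logic can be tried without a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit the leading id of each line as the key, with an empty value."""
    for line in lines:
        fields = line.split(None, 1)  # split off the first whitespace-separated token
        # Assumption: a valid id is made of digits only; other lines are skipped.
        if fields and fields[0].isdigit():
            yield fields[0] + "\t"  # "key<TAB>value" with an empty value


def reducer(sorted_lines):
    """Reduce step: input is sorted by key, so equal ids are adjacent; emit each once."""
    keys = (line.split("\t", 1)[0].rstrip("\n") for line in sorted_lines)
    for key, _group in groupby(keys):
        if key:
            yield key


def run_local(lines):
    """Local stand-in for the framework: sorted() plays the role of the shuffle/sort."""
    return list(reducer(sorted(mapper(lines))))
```

With a few made-up lines in the `id - - [date] "URL"` shape, such as two lines starting with `9719` and one starting with `382`, `run_local` returns each id once. In a real Streaming job the two functions would read from standard input and print to standard output in separate scripts, and Hadoop would do the sorting between them.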
The WordCount example app that is packaged with Hadoop is very close to what you need.