The data looks like this, first field is a number,
3 ...
1 ...
2 ...
11 ...
And I want to sort these lines according to the first field numerically instead of alphabetically, which means after sorting it should look like this,
1 ...
2 ...
3 ...
11 ...
But hadoop keeps giving me this,
1 ...
11 ...
2 ...
3 ...
How do correct it?
Assuming you are using Hadoop Streaming, you need to use the KeyFieldBasedComparator class.
-D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator should be added to streaming command
You need to provide type of sorting required using mapred.text.key.comparator.options. Some useful ones are -n : numeric sort, -r : reverse sort
EXAMPLE :
Create an identity mapper and reducer with the following code
This is the mapper.py & reducer.py
This is the input.txt
This is the Streaming command
And you will get the required output
NOTE :
I have used a simple one key input. If however you have multiple keys and/or partitions, you will have to edit mapred.text.key.comparator.options as needed. Since I do not know your use case , my example is limited to this
Identity mapper is needed since you will need atleast one mapper for a MR job to run.
Identity reducer is needed since shuffle/sort phase will not work if it is a pure map only job.