I currently run streaming jobs with mapper and reducer code written in Ruby. I want to convert these to Java, but I do not know how to run such a job on EMR Hadoop using Java. The CloudBurst sample on Amazon's EMR website is too complex. Below are the details of how I currently run the jobs.
Code to start a job:
elastic-mapreduce --create --alive --plain-output --master-instance-type m1.small \
  --slave-instance-type m1.xlarge --num-instances 2 --name "Job Name" \
  --bootstrap-action s3://bucket-path/bootstrap.sh
Code to add a step:
elastic-mapreduce -j <job_id> --stream --step-name "my_step_name" \
  --jobconf mapred.task.timeout=0 --mapper s3://bucket-path/mapper.rb \
  --reducer s3://bucket-path/reducerRules.rb --cache s3://bucket-path/cache/cache.txt \
  --input s3://bucket-path/input --output s3://bucket-path/output
The mapper reads a CSV file, passed above through EMR's --cache argument, as well as the CSV files in the input S3 bucket. It does some calculations and prints CSV lines to standard output.
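# mapper.rb
require 'rubygems'
require 'fastercsv'

CSV_OPTIONS = {
  # some CSV options
}

# Read the side file shipped to the task with --cache
File.open("cache.txt") do |file|
  file.each_line do |line|
    # do something
  end
end

# Read the input records from standard input
input = FasterCSV.new(STDIN, CSV_OPTIONS)
input.each do |row|
  # do calculations and get result
  puts result
end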
# reducer.rb
$stdin.each_line do |line|
  # do some aggregations and get aggregation_result
  puts(aggregation_result) if some_condition
end
Now that I have a better grasp of Hadoop and MapReduce, here is what I had expected:
To start a cluster, the code remains more or less the same as in the question, but we can add config parameters:
To add Job Steps:
Step 1:
ruby elastic-mapreduce --jobflow <jobflow_id> --jar s3://somepath/job-one.jar --arg s3://somepath/input-one --arg s3://somepath/output-one --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0
Step 2:
ruby elastic-mapreduce --jobflow <jobflow_id> --jar s3://somepath/job-two.jar --arg s3://somepath/output-one --arg s3://somepath/output-two --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0
Now, as for the Java code: there will be one Main class containing one implementation each of Hadoop's Mapper and Reducer classes.
These have to override the methods map() and reduce(), respectively, to do the desired job.
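To make the map()/reduce() contract concrete, here is a minimal plain-Java sketch with no Hadoop dependency. It only imitates what the framework does between the two methods (call map() per record, group the emitted pairs by key, call reduce() per key). The class name MapReduceSketch, the two-column CSV layout and the per-key sum are assumptions for illustration, not the calculations from the Ruby scripts.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapReduceSketch {

    // map(): one input record in, zero or more (key, value) pairs out.
    // Assumed layout: first CSV column is the key, second is a number.
    public static List<Map.Entry<String, Double>> map(String csvLine) {
        List<Map.Entry<String, Double>> out = new ArrayList<>();
        String[] fields = csvLine.split(",", -1);
        if (fields.length >= 2) {
            out.add(new AbstractMap.SimpleEntry<>(
                    fields[0].trim(), Double.parseDouble(fields[1].trim())));
        }
        return out;
    }

    // reduce(): one key plus all values emitted for it in, one aggregate out.
    public static double reduce(String key, List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum;
    }

    // Stand-in for the framework: map every line, group by key, reduce each group.
    public static Map<String, Double> run(List<String> lines) {
        Map<String, List<Double>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Double> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {a=4.0, b=2.0}
        System.out.println(run(Arrays.asList("a,1.5", "b,2", "a,2.5")));
    }
}
```

In a real job the grouping step is done by Hadoop's shuffle across machines; only the bodies of map() and reduce() are yours to write.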
The Java class for the problem in question would look like the following:
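A minimal sketch of such a class, assuming the org.apache.hadoop.mapreduce API of Hadoop 2.x or later and the Hadoop client libraries on the classpath. The names CsvMapper and SumReducer, the two-column CSV layout and the per-key sum are hypothetical placeholders for the real calculations. Reading the side file that the streaming job passed with --cache is left out here.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Main {

    // Hypothetical mapper: first CSV column is the key, second is a number.
    public static class CsvMapper
            extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        private final Text outKey = new Text();
        private final DoubleWritable outValue = new DoubleWritable();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(",", -1);
            if (fields.length < 2) {
                return; // skip malformed rows
            }
            try {
                outValue.set(Double.parseDouble(fields[1].trim()));
            } catch (NumberFormatException e) {
                return; // skip header or non-numeric rows
            }
            outKey.set(fields[0].trim());
            context.write(outKey, outValue);
        }
    }

    // Hypothetical reducer: sums all values seen for a key.
    public static class SumReducer
            extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
            }
            context.write(key, new DoubleWritable(sum));
        }
    }

    // args[0] = input path, args[1] = output path (the first two --arg values of the step).
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Write "key,value" instead of the default tab-separated output.
        conf.set("mapreduce.output.textoutputformat.separator", ",");

        Job job = Job.getInstance(conf, "csv aggregation");
        job.setJarByClass(Main.class);
        job.setMapperClass(CsvMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as a jar with Main as the entry point, this is what the --jar option of a step would point at. Any arguments after the input and output paths are ignored by this sketch.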