The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 8794439
In Process

The Archive Base Latest Questions

Editorial Team
Asked: June 13, 2026

I currently have streaming jobs run with mapper and reducer code written in Ruby.

I currently run streaming jobs with mapper and reducer code written in Ruby, and I want to convert these to Java. I do not know how to run a streaming job with EMR Hadoop using Java. The CloudBurst sample given on Amazon's EMR website is too complex. The details of how I currently run the jobs follow.

Code to start a job:

    elastic-mapreduce --create --alive --plain-output --master-instance-type m1.small \
        --slave-instance-type m1.xlarge --num-instances 2 --name "Job Name" \
        --bootstrap-action s3://bucket-path/bootstrap.sh

Code to add a step:

    elastic-mapreduce -j <job_id> --stream --step-name "my_step_name" \
        --jobconf mapred.task.timeout=0 --mapper s3://bucket-path/mapper.rb \
        --reducer s3://bucket-path/reducerRules.rb --cache s3://bucket-path/cache/cache.txt \
        --input s3://bucket-path/input --output s3://bucket-path/output

The mapper reads a CSV file, passed above through EMR's cache argument, and also reads the CSV files in the input S3 bucket. It does some calculations and prints CSV output lines to standard output.

    # mapper.rb
    require 'rubygems'
    require 'fastercsv'

    CSV_OPTIONS = {
      # some CSV options
    }

    # Read the file shipped to the task through the --cache argument
    File.open("cache.txt") do |file|
      while (line = file.gets)
        # do something
      end
    end

    # Parse the input records arriving on standard input
    input = FasterCSV.new(STDIN, CSV_OPTIONS)
    input.each do |row|
      # do calculations and get result
      puts(result)
    end

    # reducer.rb

    $stdin.each_line do |line|
      # do some aggregations and get aggregation_result
      puts(aggregation_result) if some_condition
    end
1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Added an answer on June 13, 2026 at 11:15 pm

    Now that I have a better grasp of Hadoop and MapReduce, here is the answer I had been looking for:

    To start a cluster, the command stays more or less the same as in the question, but we can add configuration parameters:

    ruby elastic-mapreduce --create --alive --plain-output \
        --master-instance-type m1.xlarge --slave-instance-type m1.xlarge \
        --num-instances 11 --name "Java Pipeline" \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
        --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
        --args "--mapred-config-file, s3://com.versata.emr/conf/mapred-site-tuned.xml"
    

    To add job steps:

    Step 1:

    ruby elastic-mapreduce --jobflow <jobflow_id> --jar s3://somepath/job-one.jar \
        --arg s3://somepath/input-one --arg s3://somepath/output-one \
        --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0

    Step 2 (reads the output of step 1 as its input):

    ruby elastic-mapreduce --jobflow <jobflow_id> --jar s3://somepath/job-two.jar \
        --arg s3://somepath/output-one --arg s3://somepath/output-two \
        --args -m,mapred.min.split.size=52880 -m,mapred.task.timeout=0

    As for the Java code, there will be one main class containing one implementation of each of the following classes:

    • org.apache.hadoop.mapreduce.Mapper
    • org.apache.hadoop.mapreduce.Reducer

    The Mapper implementation overrides map() and the Reducer implementation overrides reduce() to do the desired work.
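How map() and reduce() cooperate can be sketched in plain Java without any Hadoop classes. This is only an illustration under stated assumptions: the class `LocalMapReduceSketch`, the `category,amount` input layout, and the summing logic are hypothetical and are not taken from the question. Hadoop does the grouping step (the shuffle) itself; here a `TreeMap` stands in for it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java imitation of the map -> shuffle -> reduce flow; no Hadoop involved. */
public class LocalMapReduceSketch {

    /** Hypothetical map step: turns one CSV line "category,amount" into a (key, value) pair. */
    static String[] map(String line) {
        String[] fields = line.split(",", -1);
        return new String[] { fields[0].trim(), fields[1].trim() };
    }

    /** Hypothetical reduce step: sums all values that arrived for one key. */
    static String reduce(String key, List<String> values) {
        long total = 0;
        for (String value : values) {
            total += Long.parseLong(value);
        }
        return key + "," + total;
    }

    /** Maps every line, groups the pairs by key (the "shuffle"), then reduces each group. */
    static List<String> run(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] pair = map(line);
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            output.add(reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("books,3", "music,5", "books,4");
        for (String result : run(lines)) {
            System.out.println(result);
        }
    }
}
```

In a real job the bodies of `map` and `reduce` would move into the overridden Hadoop methods, and each returned pair would become a `context.write(...)` call.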

    The Java class for the problem in the question would look like the following:

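    import java.io.File;
    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class SomeJob extends Configured implements Tool {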
    
        private static final String JOB_NAME = "My Job";
    
        /**
         * This is Mapper.
         */
        public static class MapJob extends Mapper<LongWritable, Text, Text, Text> {
    
            private Text outputKey = new Text();
            private Text outputValue = new Text();
    
            @Override
            protected void setup(Context context) throws IOException, InterruptedException {
    
                // Get the cached file
                Path file = DistributedCache.getLocalCacheFiles(context.getConfiguration())[0];
    
                File fileObject = new File (file.toString());
                // Do whatever required with file data
            }
    
            @Override
            public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                // 'value' holds one line of input; derive the output pair from it
                outputKey.set("Some key calculated or derived");
                outputValue.set("Some Value calculated or derived");
                context.write(outputKey, outputValue);
            }
        }
    
        /**
         * This is Reducer.
         */
        public static class ReduceJob extends Reducer<Text, Text, Text, Text> {
    
            private Text outputKey = new Text();
            private Text outputValue = new Text();
    
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException,
                    InterruptedException {
                // 'values' holds every mapper output value emitted for 'key'; aggregate them here
                outputKey.set("Some key calculated or derived");
                outputValue.set("Some Value calculated or derived");
                context.write(outputKey, outputValue);
            }
        }
    
        @Override
        public int run(String[] args) throws Exception {
    
            try {
                Configuration conf = getConf();
                DistributedCache.addCacheFile(new URI(args[2]), conf);
                Job job = new Job(conf);
    
                job.setJarByClass(SomeJob.class);
                job.setJobName(JOB_NAME);
    
                job.setMapperClass(MapJob.class);
                job.setReducerClass(ReduceJob.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(Text.class);
    
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);
                job.setMapOutputKeyClass(Text.class);
                job.setMapOutputValueClass(Text.class);
                FileInputFormat.setInputPaths(job, args[0]);
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
    
                boolean success = job.waitForCompletion(true);
                return success ? 0 : 1;
            } catch (Exception e) {
                e.printStackTrace();
                return 1;
            }
    
        }
    
        public static void main(String[] args) throws Exception {
    
            if (args.length < 3) {
                System.out
                        .println("Usage: SomeJob <comma separated list of input directories> <output dir> <cache file>");
                System.exit(-1);
            }
    
            int result = ToolRunner.run(new SomeJob(), args);
            System.exit(result);
        }
    
    }
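The "Do whatever required with file data" step in setup() is left open above. One possible shape for it, sketched with the standard library only, is to load the cached CSV into an in-memory lookup map once per task. The class `CacheLookupSketch`, the method `load`, and the two-column `key,value` layout are assumptions made for illustration; the actual layout of cache.txt is not given in the question.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Loads a small "key,value" CSV file into a map, as a mapper's setup() might do. */
public class CacheLookupSketch {

    /** Reads each "key,value" line into the map; blank or malformed lines are skipped. */
    static Map<String, String> load(Path cacheFile) throws IOException {
        Map<String, String> lookup = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(cacheFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Split on the first comma only, so the value itself may contain commas
                String[] fields = line.split(",", 2);
                if (fields.length == 2 && !fields[0].trim().isEmpty()) {
                    lookup.put(fields[0].trim(), fields[1].trim());
                }
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical local file standing in for the distributed cache entry
        Path sample = Files.createTempFile("cache", ".txt");
        Files.write(sample, "a,1\nb,2\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(load(sample).size() + " entries loaded");
        Files.delete(sample);
    }
}
```

Inside the mapper, note that `java.nio.file.Path` clashes by name with Hadoop's `org.apache.hadoop.fs.Path`, so one of the two would need to be written fully qualified; the local path string from `file.toString()` could be handed to `java.nio.file.Paths.get(...)`.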
    