Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 9184261
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: June 17, 20262026-06-17T19:00:07+00:00 2026-06-17T19:00:07+00:00

There is this sample record, 100,1:2:3 Which I want to normalize as, 100,1 100,2

  • 0

There is this sample record,
100,1:2:3

Which I want to normalize as,
100,1
100,2
100,3

A colleague of mine wrote a pig script to achieve this and my MapReduce code took more time. I was using the default TextInputformat before. But to improve performance, I decided to write a custom Input format class, with a custom RecordReader. Taking the LineRecordReader class as reference, I tried to write the following code.

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

import com.normalize.util.Splitter;

public class NormalRecordReader extends RecordReader<Text, Text> {

    private long start;
    private long pos;
    private long end;
    private LineReader in;
    private int maxLineLength;
    private Text key = null;
    private Text value = null;
    private Text line = null;

    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);

        start = split.getStart();
        end = start + split.getLength();

        final Path file = split.getPath();

        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(split.getPath());

        in = new LineReader(fileIn, job);
        this.pos = start;
    }

    public boolean nextKeyValue() throws IOException {
        int newSize = 0;
        if (line == null) {
            line = new Text();
        }

        while (pos < end) {
            newSize = in.readLine(line);
            if (newSize == 0) {
                break;
            }
            pos += newSize;
            if (newSize < maxLineLength) {
                break;
            }

            // line too long. try again
            System.out.println("Skipped line of size " + newSize + " at pos " + (pos - newSize));
        }
        Splitter splitter = new Splitter(line.toString(), ",");
        List<String> split = splitter.split();

        if (key == null) {
            key = new Text();
        }
        key.set(split.get(0));

        if (value == null) {
            value = new Text();
        }
        value.set(split.get(1));

        if (newSize == 0) {
            key = null;
            value = null;
            return false;

        } else {
            return true;
        }
    }

    @Override
    public Text getCurrentKey() {
        return key;
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }

    /**
     * Get the progress within the split
     */
    public float getProgress() {
        if (start == end) {
            return 0.0f;
        } else {
            return Math.min(1.0f, (pos - start) / (float)(end - start));
        }
    }

    public synchronized void close() throws IOException {
        if (in != null) {
            in.close(); 
        }
    }
}

Though this works, but I haven’t seen any performance improvement. Here I am breaking the record at “,” and setting the 100 as key and 1,2,3 as value. I only call the mapper which does the following:

public void map(Text key, Text value, Context context) 
        throws IOException, InterruptedException {

    try {
        Splitter splitter = new Splitter(value.toString(), ":");
        List<String> splits = splitter.split();

        for (String split : splits) {
            context.write(key, new Text(split));
        }

    } catch (IndexOutOfBoundsException ibe) {
        System.err.println(value + " is malformed.");
    }
}

The splitter class is used to split the data, as I found String’s splitter to be slower. The method is:

public List<String> split() {

    List<String> splitData = new ArrayList<String>();
    int beginIndex = 0, endIndex = 0;

    while(true) {

        endIndex = dataToSplit.indexOf(delim, beginIndex);
        if(endIndex == -1) {
            splitData.add(dataToSplit.substring(beginIndex));
            break;
        }

        splitData.add(dataToSplit.substring(beginIndex, endIndex));
        beginIndex = endIndex + delimLength;
    }

    return splitData;
}

Can the code be improved in any way?

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-06-17T19:00:09+00:00Added an answer on June 17, 2026 at 7:00 pm

    Let me summarize here what I think you can improve instead of in the comments:

    • As explained, currently you are creating a Text object several times per record (number of times will be equal to your number of tokens). While it may not matter too much for small input, this can be a big deal for decently sized jobs. To fix that, do the following:

      private final Text text = new Text();
      
      public void map(Text key, Text value, Context context) {
          ....
          for (String split : splits) {
              text.set(split);
              context.write(key, text);
          }
      }
      
    • For your splitting, what you’re doing right now is for every record allocating a new array, populating this array, and then iterating over this array to write your output. Effectively you don’t really need an array in this case since you’re not maintaining any state. Using the implementation of the split method you provided, you only need to make one pass on the data:

      public void map(Text key, Text value, Context context) {
          String dataToSplit = value.toString();
          String delim = ":";
      
          int beginIndex = 0;
          int endIndex = 0;
      
          while(true) {
              endIndex = dataToSplit.indexOf(delim, beginIndex);
              if(endIndex == -1) {
                  text.set(dataToSplit.substring(beginIndex));
                  context.write(key, text);
                  break;
              }
      
              text.set(dataToSplit.substring(beginIndex, endIndex));
              context.write(key, text);
              beginIndex = endIndex + delim.length();
          }
      }
      
    • I don’t really see why you write your own InputFormat, it seems that KeyValueTextInputFormat is exactly what you need and has probably been already optimized. Here is how you use it:

      conf.set("key.value.separator.in.input.line", ",");
      job.setInputFormatClass(KeyValueTextInputFormat.class);
      
    • Based on your example, the key for each record seems to be an integer. If that’s always the case, then using a Text as your mapper input key is not optimal and it should be an IntWritable or maybe even a ByteWritable depending on what’s in your data.

    • Similarly, you want want to use an IntWritable or ByteWritable as your mapper output key and output value.

    Also, if you want some meaningful benchmark, you should test on a bigger dataset, like a few Gbs if possible. 1 minute tests are not really meaningful, especially in the context of distributed systems. 1 job may run quicker than another one on a small input, but the trend may be reverted for bigger inputs.

    That being said, you should also know that Pig does a lot of optimizations behind the hood when translating to Map/Reduce, so I’m not too surprised that it runs faster than your Java Map/Reduce code and I’ve seen that in the past. Try the optimizations I suggested, if it’s still not fast enough here is a link on profiling your Map/Reduce jobs with a few more useful tricks (especially tip 7 on profiling is something I’ve found useful).

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

This is my sample twitter program it works but there are several errors if
I just want to design this very simple website. Basically there are multiple pages
Im writing this complex query to return a large dataset, which is about 100,000
There's this very simple bit a code <asp:TextBox ID=txbVipNumber runat=server /> <asp:TextBoxWatermarkExtender ID=tbweVipNumber runat=server
Is there any way to keep this simple JQuery animation from flashing? http://jsfiddle.net/v3DVf/6/
Sorry for this simple question, but I can't solve it... There is an example:
Can someone explain this simple, yet deceiving, anomaly? There are two models, where B
Hai. I was making this simple string class and was wondering if there was
This is a simple question : Is there any automatic indentation alignment on Eclipse
I know this is a simple question for someone out there, but I have

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.