The Archive Base

Editorial Team
Asked: May 27, 2026


I wrote a MapReduce job that computed n-gram counts on a dataset. The results are in one hundred 300 MB files, with each line in the format <ngram>\t<count>. I want to combine these into one result, but my few attempts at combining them have crashed (“task tracker has gone away”). My timeout was set to 8 hours and the crash occurred at around 8.5 hours, so the two might be related. I had # reducers = 5 (the same as # of nodes). Maybe I just need to allow more time, although the error doesn’t seem to indicate that. I suspect my nodes are getting overloaded and becoming unresponsive. My theory is that my reducer could use some optimization.

I’m using cat for my mapper and the following Python script for my reducer:

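#!/usr/bin/env python
import sys

# Accumulate a running total for every key in memory.
counts = {}
for line in sys.stdin:
    line = line.strip()

    # Skip lines that lack a tab or whose count is not an integer.
    try:
        key, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        continue

    if key not in counts:
        counts[key] = 0
    counts[key] += count

# Emit the totals in key order once all input has been read.
for key in sorted(counts.keys()):
    print('%s\t%s' % (key, counts[key]))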

Update:
As I hinted at in one of my comments, I am confused about what sorting Hadoop performs automatically. In the web UI, the reducer status shows a few different phases, including “sort” and “reduce”. From this I assume that Hadoop sorts the mapper output before sending it to the reducer, but it isn’t clear whether the sort covers all the data sent to a reducer or each file separately before it is reduced. In other words, my mapper takes the 100 files and splits them into 400 outputs, each simply cat-ing its input to the reducers, so each of the 5 reducers receives 80 streams. Does the sort combine all 80, or does it sort one stream, reduce it, and then move on to the next? Based on the graphs, which may well not reflect the actual behavior, the sort takes place before any reducing. If the sort does cover all the input files, then I can simplify my reducer so that it does not store a dictionary of all counts and instead prints the key and total count pair once the key changes.
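
To make the two possibilities concrete, here is a small local sketch in Python 3 (not Hadoop code) of the case where the sort covers everything a reducer receives: several map outputs, each already sorted by key, are merged into a single key-ordered stream before any reducing happens. The names `merge_sorted_streams`, `reduce_sorted`, and the sample data are made up for illustration; `heapq.merge` only stands in for the framework's merge step and is an assumption about how to model it, not how Hadoop implements it.

```python
import heapq
from itertools import groupby


def merge_sorted_streams(streams):
    """Merge several key-sorted (key, count) streams into one key-sorted stream."""
    return heapq.merge(*streams, key=lambda pair: pair[0])


def reduce_sorted(pairs):
    """Sum counts for each run of equal keys in a key-sorted stream."""
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


# Hypothetical map outputs that all land on the same reducer.
map_outputs = [
    [("a b", 2), ("b c", 1)],
    [("a b", 3), ("c d", 4)],
    [("b c", 5)],
]

totals = list(reduce_sorted(merge_sorted_streams(map_outputs)))
for key, total in totals:
    print("%s\t%d" % (key, total))
```

Because equal keys are adjacent after the merge, the reduce step only ever needs the running total for one key at a time.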

Regarding the use of a combiner, I don’t think it would be beneficial in my case, since the data I’m reducing has already been reduced in the 100 files I’m trying to combine. Since my # nodes = # reducers (5 & 5), there is nothing to combine that the reducer isn’t already doing.


1 Answer

  1. Editorial Team
    Added an answer on May 27, 2026 at 9:39 am

    The problem was my misunderstanding of how MapReduce works. All data going into a reducer is sorted by key. My code above was completely unoptimized: it held every count in a dictionary. Instead, I simply keep track of the current key and print the total for the previous key when a new key shows up.

    #!/usr/bin/env python
    import sys
    
    cur_key = None
    cur_key_count = 0
    for line in sys.stdin:
        line = line.strip()
    
        # skip lines that lack a tab or whose count is not an integer
        try:
            key, count = line.split('\t', 1)
            count = int(count)
        except ValueError:
            continue
    
        # if new key, output the previous key's result, note the new key, and reset the count
        if key != cur_key:
            if cur_key is not None:
                print('%s\t%s' % (cur_key, cur_key_count))
            cur_key = key
            cur_key_count = 0
        cur_key_count += count
    
    # print the final key, if any input was seen
    if cur_key is not None:
        print('%s\t%s' % (cur_key, cur_key_count))
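
The same key-change idea can also be written with `itertools.groupby`, which groups consecutive lines that share a key. This is only a sketch in Python 3 under the same assumption as the script above, namely that the input arrives sorted by key; `parse`, `reduce_lines`, and the sample lines are hypothetical names and data, not part of the original job.

```python
from itertools import groupby


def parse(lines):
    """Yield (key, count) pairs, skipping lines without a tab or an integer count."""
    for line in lines:
        key, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue
        try:
            yield key, int(count)
        except ValueError:
            continue


def reduce_lines(lines):
    """Sum the counts of consecutive lines that share a key (input must be key-sorted)."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield "%s\t%d" % (key, sum(count for _, count in group))


# Hypothetical reducer input, already sorted by key, with two malformed lines.
sample = ["a b\t2\n", "a b\t3\n", "bad line\n", "b c\tx\n", "b c\t1\n", "c d\t4\n"]
result = list(reduce_lines(sample))
for out in result:
    print(out)
```

To use it as a streaming reducer, `reduce_lines(sys.stdin)` could replace the explicit loop. Memory use stays constant either way, because only the total for the current key is held at any time.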
    