Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • SEARCH
  • Home
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6903303
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 27, 20262026-05-27T07:54:32+00:00 2026-05-27T07:54:32+00:00

I’m running a large Hadoop streaming job where I process a large list of

  • 0

I’m running a large Hadoop streaming job where I process a large list of files with each file being processed as a single unit. To do this, my input to my streaming job is a single file with a list of all the file names on separate lines.

In general, this works well. However, I ran into an issue where I was partially through a large job (~36%) when Hadoop ran into some files with issues and for some reason it seemed to crash the entire job. If the job had completed successfully, what would have been printed to standard out would be a line for each file as it was completed along with some stats from my program that’s processing each file. However, with this failed job, when I try to look at the output that would have been sent to standard out, it is empty. I know that roughly 36% of the files were processed (because I’m saving the data to a database), but it’s not easy for me to generate a list of which files were successfully processed and which ones remain. Is there anyway to recover this logging to standard out?

One thing I can do is look at all of the log files for the completed/failed tasks, but this seems more difficult to me and I’m not sure how to go about retrieving the good/bad list of files this way.

Thanks for any suggestions.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-27T07:54:33+00:00Added an answer on May 27, 2026 at 7:54 am

    Hadoop captures system.out data here :

    /mnt/hadoop/logs/userlogs/task_id

    However, I’ve found this unreliable, and Hadoop jobs dont usually use standard out for debugging, rather – the convetion is to use counters.

    For each of your documents, you can summarize document characteristics : like length, number of normal ascii chars, number of new lines.

    Then, you can have 2 counters: a counter for “good” files, and a counter for “bad” files.

    It probably be pretty easy to note that the bad files have something in common [no data, too much data, or maybe some non printable chars].

    Finally, you obviously will have to look at the results after the job is done running.

    The problem, of course, with system.out statements is that the jobs running on various machines can’t integrate their data. Counters get around this problem – they are easily integrated into a clear and accurate picture of the overall job.

    Of course, the problem with counters is the information content is entirely numeric, but, with a little creativity, you can easily find ways to quantitatively describe the data in a meaningfull way.

    WORST CASE SCENARIO : YOU REALLY NEED TEXT DEBUGGING, and you dont want it in a temp file

    In this case, you can use MultipleOutputs to write out ancillary files with other data in them. You can emit records to these files in the same way as you would for the part-r-0000* data.

    In the end, I think you will find that, ironically, the restriction of having to use counters will increase the readability of your jobs : it is pretty intuitive, once you think about it, to debug using numerical counts rather than raw text — i find, quite often that much of my debugging print statements are, when cut down to their raw information content, are basically just counters…

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

I have thousands of HTML files to process using Groovy/Java and I need to
link Im having trouble converting the html entites into html characters, (&# 8217;) i
I have just tried to save a simple *.rtf file with some websites and
Basically, what I'm trying to create is a page of div tags, each has
I have a string like this: La Torre Eiffel paragonata all’Everest What PHP function
I have a French site that I want to parse, but am running into
I want use html5's new tag to play a wav file (currently only supported
In my XML file chapters tag has more chapter tag.i need to display chapters
I am trying to render a haml file in a javascript response like so:
I'm parsing an RSS feed that has an ’ in it. SimpleXML turns this

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.