The Archive Base

Editorial Team
Asked: May 13, 2026

I am playing around with Hadoop and have set up a two node cluster

I am playing around with Hadoop and have set up a two-node cluster on Ubuntu. The WordCount example runs just fine.

Now I’d like to write my own MapReduce program to analyze some log data (main reason: it looks simple and I have plenty of data).

Each line in the log has this format:

<UUID> <Event> <Timestamp>

where the event can be INIT, START, STOP, ERROR, and some others. What I am most interested in is the elapsed time between the START and STOP events for the same UUID.

For example, my log contains entries like these:

35FAA840-1299-11DF-8A39-0800200C9A66 START 1265403584
[...many other lines...]
35FAA840-1299-11DF-8A39-0800200C9A66 STOP 1265403777

My current, linear program reads through the files, keeps the start events in memory, and writes the elapsed time to a file once it finds the corresponding end event. Lines with other events are currently ignored; an ERROR event invalidates a UUID, which is then ignored, too. (1)

I would like to port this to a Hadoop/MapReduce program, but I am not sure how to do the matching of entries. Splitting/tokenizing the file is easy, and I guess that finding the matches will be done in a Reduce class. But what would that look like? How do I find matching entries in a MapReduce job?

Please keep in mind that my main focus is to understand Hadoop/MapReduce; links to Pig and other Apache programs are welcome, but I’d like to solve this one with pure Hadoop/MapReduce. Thank you.

1) Since the log is taken from a running application, some start events might not yet have corresponding end events, and there will be end events without start events, due to logfile splitting.

1 Answer

  1. Editorial Team
    Added an answer on May 13, 2026 at 7:37 pm

    If you emit the UUID as the key in map, emit(<uuid>, <event, timestamp>), your reduce will receive all events for that UUID:
    key = UUID, values = {<event1, timestamp1>, <event2, timestamp2>}
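
As a rough illustration of the map step described above, here is a minimal sketch in plain Java, without any Hadoop classes. The class name MapPhaseSketch and the methods mapLine and groupByKey are hypothetical names chosen for this example; groupByKey only imitates what the framework's shuffle does between map and reduce. In a real job the parsing would presumably sit inside a Mapper's map() method.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MapPhaseSketch {
    /**
     * Turns one log line "<UUID> <Event> <Timestamp>" into a (key, value) pair:
     * key = UUID, value = "EVENT TIMESTAMP". Returns null for lines that do not
     * match the format, so that nothing is emitted for them.
     */
    static Map.Entry<String, String> mapLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            return null;
        }
        try {
            Long.parseLong(parts[2]); // the timestamp must be numeric
        } catch (NumberFormatException e) {
            return null;
        }
        return new AbstractMap.SimpleEntry<>(parts[0], parts[1] + " " + parts[2]);
    }

    /** Stand-in for the shuffle: collects all values emitted under the same key. */
    static Map<String, List<String>> groupByKey(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, String> kv = mapLine(line);
            if (kv != null) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "35FAA840-1299-11DF-8A39-0800200C9A66 START 1265403584",
            "35FAA840-1299-11DF-8A39-0800200C9A66 STOP 1265403777");
        // Each UUID maps to the list of its "EVENT TIMESTAMP" values.
        System.out.println(groupByKey(log));
    }
}
```

The point of the sketch is only that the UUID becomes the key, so every event of one UUID ends up in the same group regardless of where it appeared in the input.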

    Then you can sort the events by timestamp and decide whether or not to emit them to the output file.
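
The reduce step could look roughly like the following plain-Java sketch. It is not Hadoop code: ReducePhaseSketch, Event, and reduce are hypothetical names, and the rules it applies are assumptions taken from the question (an ERROR invalidates the whole UUID, a STOP without a preceding START is skipped, a START without a STOP yields nothing). In a real job this logic would presumably live in a Reducer's reduce() method.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class ReducePhaseSketch {
    /** Hypothetical value type: one event of a UUID together with its timestamp. */
    static final class Event {
        final String type;
        final long timestamp;

        Event(String type, long timestamp) {
            this.type = type;
            this.timestamp = timestamp;
        }
    }

    /**
     * Receives all events of one UUID and returns the elapsed times (STOP minus START).
     * Assumption: any ERROR event invalidates the UUID, so nothing is returned for it.
     */
    static List<Long> reduce(List<Event> values) {
        List<Event> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparingLong(e -> e.timestamp)); // order events by time
        List<Long> elapsed = new ArrayList<>();
        Long start = null; // timestamp of the currently open START, if any
        for (Event e : sorted) {
            switch (e.type) {
                case "ERROR":
                    return Collections.emptyList();
                case "START":
                    start = e.timestamp;
                    break;
                case "STOP":
                    if (start != null) { // skip STOP events that have no START
                        elapsed.add(e.timestamp - start);
                        start = null;
                    }
                    break;
                default:
                    break; // INIT and other events are ignored
            }
        }
        return elapsed;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("STOP", 1265403777L),
            new Event("START", 1265403584L));
        System.out.println(reduce(events)); // prints [193]
    }
}
```

Because all events of a UUID arrive in one reduce call, the unmatched START and STOP events mentioned in the question's footnote simply produce no output here.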

    Bonus: you can set your own sorting class with job.setSortComparatorClass(). Note that this comparator orders the map output keys, so for the entries to reach reduce already sorted by timestamp, the timestamp has to be part of the key:

    public static class BNLSortComparator extends Text.Comparator {
      public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        String sb1, sb2;
        try {
          sb1 = Text.decode(b1, s1, l1);
          ...
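
The comparator above is truncated in the source, so the following is not a completion of it. It is a separate, plain-Java sketch of the ordering such a comparator might implement, under the assumption that the map output key is a composite of the form "UUID<TAB>TIMESTAMP". The class name CompositeKeyOrderSketch, the constant name, and the key layout are all hypothetical. In an actual job, partitioning and grouping would still need to happen on the UUID part alone (via job.setPartitionerClass() and job.setGroupingComparatorClass()), which this sketch does not show.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class CompositeKeyOrderSketch {
    /**
     * Orders hypothetical composite keys "UUID<TAB>TIMESTAMP":
     * first by UUID, then by timestamp compared as a number.
     */
    static final Comparator<String> BY_UUID_THEN_TIMESTAMP = (a, b) -> {
        String[] ka = a.split("\t");
        String[] kb = b.split("\t");
        int byUuid = ka[0].compareTo(kb[0]);
        if (byUuid != 0) {
            return byUuid;
        }
        // Numeric comparison, so that 999 sorts before 1000.
        return Long.compare(Long.parseLong(ka[1]), Long.parseLong(kb[1]));
    };

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(Arrays.asList(
            "B-UUID\t200",
            "A-UUID\t1265403777",
            "A-UUID\t1265403584"));
        keys.sort(BY_UUID_THEN_TIMESTAMP);
        // Keys of the same UUID are now adjacent and in timestamp order.
        System.out.println(keys);
    }
}
```

Comparing the timestamp numerically rather than as text matters once timestamps differ in length, since a plain string comparison would place "1000" before "999".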
    
© 2021 The Archive Base. All Rights Reserved