I am playing around with Hadoop and have set up a two node cluster

Question

0

Asked: May 13, 20262026-05-13T19:37:32+00:00 2026-05-13T19:37:32+00:00

I am playing around with Hadoop and have set up a two node cluster

0

I am playing around with Hadoop and have set up a two node cluster on Ubuntu. The WordCount example runs just fine.

Now I’d like to write my own MapReduce program to analyze some log data (main reason: it looks simple and I have plenty of data)

Each line in the log hast this format

<UUID> <Event> <Timestamp>

where event can be INIT, START, STOP, ERROR and some other. What I am interested in most is the elapsed time between START and STOP events for the same UUID.

For Example, my log contains entries like these

35FAA840-1299-11DF-8A39-0800200C9A66 START 1265403584
[...many other lines...]
35FAA840-1299-11DF-8A39-0800200C9A66 STOP 1265403777

My current, linear program reads through the files, remembers the start events in-memory, and writes the elapsed time to a file once it found the corresponding end event (lines with other events are currently ignored, ERROR events invalidate a UUID and it will be ignored, too)¹

I would like to port this to an Hadoop/MapReduce program. But I am not sure how to do the matching of entries. Splitting/Tokenizing the file is easy, and I guess that finding the matches will be a Reduce-Class. But how would that look like? How do I find mathing entries in a MapReduce Job?

Please keep in mind that my main focus is to understand Hadopo/MapReduce; links to Pig and other Apache Programs are welcome, but I’d like to solve this one with pure Hadoop/MapReduce. Thank you.

¹⁾ Since the log is taken from a running application, some start events might not yet have corresponding end events and there will be end-events without startevents, due to logfile splitting

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-05-13T19:37:33+00:00

If you emit the UUID in map as key: emit(<uuid>, <event, timestamp>) you’ll receive in your reduce all events of this UUID:
key = UUID, values = {<event1, timestamp1>, <event2, timestamp2>}

Then you can sort the events on timestamp and decide whether to emit them into a resulting file or not.

Bonus: you can use job.setSortComparatorClass(); for setting your own sorting class, so you’ll get your entries already sorted on their timestamps in reduce:

public static class BNLSortComparator extends Text.Comparator {
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    String sb1, sb2;
    try {
      sb1 = Text.decode(b1, s1, l1);
      ...

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I am playing around with Hadoop and have set up a two node cluster

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply