I have been trying to understand the MapReduce concept and apply it to my


Asked: May 17, 2026 (2026-05-17T22:12:02+00:00)


I have been trying to understand the MapReduce concept and apply it to my current situation. What is my situation? Well, I have an ETL tool in which data transformation happens outside of the source and destination data sources (databases). Hence, the source is used purely for extract and the destination purely for load.

Today this transformation takes, say, X hours for a million records. I would like to handle a scenario with a billion records and still get the work done in the same X hours. So my product needs to scale out (by adding more commodity machines) as the data grows. As you can see, I am only concerned with distributing my product's transformation functionality across different machines, thereby leveraging the CPU power of all of them.

I started looking for options and came across Apache Hadoop, and eventually the concept of MapReduce. I was able to set up Hadoop in cluster mode quickly and without issues, and was happy to run a wordcount demo too. Soon I realized that to implement my own MapReduce model, I would have to redefine my product's transformation functionality as MAP and REDUCE functions.

Here's when the trouble began. I read a copy of Hadoop: The Definitive Guide, and I understood that many of the common use cases of Hadoop are scenarios where one is faced with:

Unstructured data on which one would like to perform aggregation, sorting, or something of that kind.
Unstructured text that needs to be mined.
etc.

In my scenario I extract from a database and load into a database (both hold structured data), and my sole purpose is to bring more CPUs into play, in a reliable manner, and thereby distribute my transformation. Redefining my transformation to fit a Map and Reduce model is a huge challenge in itself. So here are my questions:

Have you used Hadoop in ETL scenarios? If yes, could you be specific about how you handled MapReducing of your transformation? Have you used Hadoop purely for leveraging extra CPU power?
Is the MapReduce concept the universal answer to distributed computing? Are there other equally good options?
My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc. Is my understanding correct?


1 Answer

Editorial Team · Answer 1 · 2026-05-17T22:12:02+00:00

If you want to scale out a processing problem over a lot of systems, you must do two things:

Make sure you can process the information in independent parts.
There should be NO shared resource that is needed among these parts.

If there are dependencies, they will limit your horizontal scalability.
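
To make the two requirements concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `transform` function, the record layout, and the use of a thread pool as a stand-in for separate machines. The point it tries to show is that when each record is transformed using only its own fields, the output does not depend on how the input is split.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Hypothetical per-record transformation: reads only the record itself."""
    return {"id": record["id"], "name": record["name"].strip().upper()}


def split(records, parts):
    """Cut the input into `parts` non-overlapping chunks."""
    return [records[i::parts] for i in range(parts)]


def transform_chunk(chunk):
    """Process one chunk without touching any shared state."""
    return [transform(r) for r in chunk]


def run(records, workers):
    """Transform all records using `workers` independent chunks."""
    chunks = split(records, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(transform_chunk, chunks))
    # Chunks never interact, so merging is just concatenation plus a sort.
    merged = [r for chunk in results for r in chunk]
    return sorted(merged, key=lambda r: r["id"])


records = [{"id": i, "name": f" user{i} "} for i in range(10)]
```

Calling `run(records, 1)` and `run(records, 4)` gives the same list. If `transform` had to consult a shared counter or a lookup in the source database, that shared resource would become the bottleneck as workers are added.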

So if you are starting from a relational model, the main obstruction is the fact that you have relationships. Having these relationships is a great asset in relational databases but is a pain in the … when trying to scale out.

The simplest way to go from relational to independent parts is to make a jump and de-normalize your data into records that have everything in them and are focused on the entity you want to process. Then you can distribute them over a huge cluster, and after the processing has been completed you use the results.
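
As a rough illustration of that jump, the sketch below flattens two related tables into self-contained records. The `customers` and `orders` data and their field names are made up for the example; in a real ETL job this join would typically happen during the extract step.

```python
# Hypothetical normalized source data: orders reference customers by id.
customers = {
    1: {"name": "Ada", "country": "NL"},
    2: {"name": "Grace", "country": "US"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 2, "amount": 40.0},
    {"order_id": 102, "customer_id": 1, "amount": 5.0},
]


def denormalize(orders, customers):
    """Yield one wide record per order with the customer fields copied in."""
    for order in orders:
        customer = customers[order["customer_id"]]
        yield {
            **order,
            "customer_name": customer["name"],
            "customer_country": customer["country"],
        }


flat_records = list(denormalize(orders, customers))
```

Each entry in `flat_records` now carries everything a worker needs, so no worker has to go back to the `customers` table while processing. The price is duplication: the customer fields are repeated on every order.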

If you cannot do such a jump you’re in trouble.

So coming back to your questions:

# Have you used Hadoop in ETL scenarios?

Yes, the input being Apache logfiles, and the loading and transformation consisted of parsing, normalizing and filtering these loglines. The result wasn't put in a normal RDBMS!
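
This is not the code from that job, just a sketch of what a parse/normalize/filter step can look like as a map function. It assumes the Apache Common Log Format; the filter rule (keep only GET requests) and the normalization (lower-casing the path) are invented for the example.

```python
import re

# Common Log Format: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\S+)$'
)


def map_logline(line):
    """Yield zero or one cleaned record for a single log line."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return  # filter: drop lines that do not parse
    host, timestamp, method, path, status, size = match.groups()
    if method != "GET":
        return  # filter: hypothetical rule, keep GET requests only
    yield {
        "host": host,
        "timestamp": timestamp,
        "path": path.lower(),  # normalize: hypothetical rule
        "status": int(status),
        "bytes": 0 if size == "-" else int(size),
    }


lines = [
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /Index.HTML HTTP/1.0" 200 2326',
    "not a log line",
    '10.0.0.7 - - [10/Oct/2000:13:55:40 -0700] "POST /form HTTP/1.0" 302 -',
]
cleaned = [record for line in lines for record in map_logline(line)]
```

Because every line is handled on its own, a step like this needs no reduce phase at all: it can run as a map-only job, which is the shape a pure record-by-record ETL transformation tends to have.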

# Is MapReduce concept the universal answer to distributed computing? Are there other equally good options?

MapReduce is a very simple processing model that works great for any processing problem you are able to split into a lot of smaller, 100% independent parts. The model is so simple that, as far as I know, any problem that can be split into independent parts can be written as a series of MapReduce steps.
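
For readers who have only seen wordcount, here is a tiny in-memory sketch of the model in Python. It is not how Hadoop is implemented; the `map_reduce` helper and the order data are assumptions for the example. It only shows the three phases: map each record to key/value pairs, group the values by key, then reduce each group.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map/shuffle/reduce step entirely in memory."""
    groups = defaultdict(list)
    # Map phase: each record is handled independently of all others.
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    # Reduce phase: each key is handled independently of all others.
    return {key: reducer(key, values) for key, values in groups.items()}


def amount_by_country(order):
    """Mapper: emit (country, amount) for one order."""
    yield order["country"], order["amount"]


def total(country, amounts):
    """Reducer: sum all amounts seen for one country."""
    return sum(amounts)


orders = [
    {"country": "NL", "amount": 10.0},
    {"country": "DE", "amount": 5.0},
    {"country": "NL", "amount": 2.5},
]
totals = map_reduce(orders, amount_by_country, total)
```

Here `totals` is `{"NL": 12.5, "DE": 5.0}`. A series of steps is built by feeding the output of one call into the next; a transformation that needs no grouping can use a mapper alone.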

HOWEVER: It is important to note that at this moment only BATCH oriented processing can be done with Hadoop. If you want “realtime” processing you are currently out of luck.

I don't know of a better model for which an actual implementation exists at this moment.

# My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc. Is my understanding correct?

Yep, that is the most common application.
