The Archive Base

Editorial Team
Asked: May 17, 2026

I have been trying to understand the MapReduce concept and apply it to my


I have been trying to understand the MapReduce concept and apply it to my current situation. What is my situation? Well, I have an ETL tool in which data transformation happens outside of the source and destination data sources (databases). Hence, the source is used purely for extract and the destination purely for load.

Today, this transformation takes about X hours for a million records. I would like to address a scenario where I have a billion records but still want the work done in the same X hours. So my product needs to scale out (by adding more commodity machines) as the volume of data grows. As you can see, I am only concerned with distributing my product's transformation functionality across different machines, thereby leveraging the CPU power of all of them.

I started looking for options and came across Apache Hadoop and, eventually, the concept of MapReduce. I was able to set up Hadoop in cluster mode quickly and without issues, and was happy to run a wordcount demo too. Soon I realized that to implement my own MapReduce model, I would have to redefine my product's transformation functionality as MAP and REDUCE functions.

That is when the trouble began. I read a copy of Hadoop: The Definitive Guide, and I understood that many of the common use cases of Hadoop are scenarios where one is faced with:

  • Unstructured data on which one would like to perform aggregation, sorting, or something of that kind.
  • Unstructured text that needs to be mined.
  • etc.

In my scenario, I extract from a database and load into a database (both hold structured data), and my sole purpose is to bring more CPUs into play, in a reliable manner, and thereby distribute my transformation. Redefining my transformation to fit a Map and Reduce model is a huge challenge in itself. So here are my questions:

  1. Have you used Hadoop in ETL scenarios? If yes, could you be specific about how you handled MapReducing your transformation? Have you used Hadoop purely for leveraging extra CPU power?

  2. Is the MapReduce concept the universal answer to distributed computing? Are there other equally good options?

  3. My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc. Is my understanding correct?

1 Answer

  1. Editorial Team
    Added an answer on May 17, 2026 at 10:12 pm

    If you want to scale out a processing problem over a lot of systems, you must do two things:

    1. Make sure you can process the information in independent parts.
    2. There should be NO shared resource that is needed among these parts.

    If there are dependencies, they will be the limit on your horizontal scalability.
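As a minimal sketch of these two rules, the Python snippet below splits a batch of records into non-overlapping partitions and transforms each one with a function that reads nothing but its own input. The `transform` function and the record layout are hypothetical, and the thread pool only stands in for separate machines (threads in CPython do not add CPU power for this kind of work); the point is only that the result is the same however the input is partitioned.

```python
from concurrent.futures import ThreadPoolExecutor


def transform(record):
    """Hypothetical per-record transformation: depends only on its input."""
    return {"id": record["id"], "name": record["name"].strip().upper()}


def transform_partition(partition):
    """Process one partition without touching any state outside it."""
    return [transform(r) for r in partition]


def split(records, parts):
    """Cut the input into `parts` non-overlapping slices."""
    return [records[i::parts] for i in range(parts)]


def run(records, parts):
    """Transform every partition concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(transform_partition, split(records, parts)))
    merged = [r for part in results for r in part]
    return sorted(merged, key=lambda r: r["id"])


records = [{"id": i, "name": f" user{i} "} for i in range(10)]

# Because no partition depends on another, the outcome is identical
# whether the work is done in one piece or in four.
same_result = run(records, 1) == run(records, 4)
```

If `transform` needed to look something up in a shared database or counter, that lookup would become the shared resource that caps how far the work can be spread.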

    So if you are starting from a relational model, the main obstruction is the fact that you have relationships. Having these relationships is a great asset in relational databases but is a pain in the … when trying to scale out.

    The simplest way to go from relational to independent parts is to make a jump and de-normalize your data into records that have everything in them and are focused on the part you want to do the processing around. Then you can distribute them over a huge cluster, and once the processing has completed you use the results.

    If you cannot do such a jump you’re in trouble.
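To illustrate the de-normalization jump, here is a small Python sketch. The `customers` and `orders` tables and their column names are invented for the example; the idea is that after the join every output record carries all the data its transformation needs, so it can be shipped to any machine without a lookup back into the source database.

```python
# Hypothetical normalized source data: orders reference customers by key.
customers = {
    1: {"name": "Ada", "country": "UK"},
    2: {"name": "Linus", "country": "FI"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25},
    {"order_id": 101, "customer_id": 2, "amount": 40},
    {"order_id": 102, "customer_id": 1, "amount": 15},
]


def denormalize(orders, customers):
    """Yield one self-contained record per order, with customer data inlined."""
    for order in orders:
        customer = customers[order["customer_id"]]
        yield {
            "order_id": order["order_id"],
            "amount": order["amount"],
            "customer_name": customer["name"],
            "customer_country": customer["country"],
        }


flat_records = list(denormalize(orders, customers))
```

The join itself still has to happen once, typically during extract, but everything after it can run on records that no longer refer to each other.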

    So coming back to your questions:

    # Have you used Hadoop in ETL scenarios?

    Yes, the input being Apache logfiles, and the loading and transformation consisted of parsing, normalizing and filtering these log lines. The result wasn't put in a normal RDBMS!
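A transformation like this fits a map step with no reduce at all, since each log line can be handled on its own. The following Python sketch is not the code used in that project; it assumes the Apache Common Log Format, and the filtering rule (drop unparsable lines and responses with status 400 or higher) is a hypothetical choice made for illustration.

```python
import re
from datetime import datetime, timezone

# Apache Common Log Format: host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def map_logline(line):
    """Parse, normalize and filter one log line; return None to drop it."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None  # filter: line is not in the expected format
    status = int(match["status"])
    if status >= 400:
        return None  # filter: hypothetical rule, keep successful requests only
    when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return {
        "host": match["host"],
        # normalize: express every timestamp in UTC, ISO 8601
        "time": when.astimezone(timezone.utc).isoformat(),
        "method": match["method"].upper(),
        "path": match["path"],
        "status": status,
        # normalize: Apache logs "-" when no body was sent
        "bytes": 0 if match["size"] == "-" else int(match["size"]),
    }


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
parsed = map_logline(sample)
```

Because `map_logline` looks at one line and nothing else, the log files can be split across as many machines as are available.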

    # Is MapReduce concept the universal answer to distributed computing? Are there other equally good options?

    MapReduce is a very simple processing model that will work great for any processing problem you are able to split into a lot of smaller, 100% independent parts. The MapReduce model is so simple that, as far as I know, any problem that can be split into independent parts can be written as a series of MapReduce steps.
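To make the model concrete, here is a toy, single-process Python sketch of map, shuffle and reduce, with two steps chained together. It does not use Hadoop's API; `map_reduce`, the mappers, the reducers and the sample orders are all hypothetical names and data chosen for the example.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run one map/shuffle/reduce step in memory."""
    groups = defaultdict(list)
    for record in records:
        # map: each record independently emits (key, value) pairs
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    # reduce: each key is reduced independently of every other key
    return [reducer(key, values) for key, values in sorted(groups.items())]


orders = [
    {"country": "UK", "amount": 60},
    {"country": "NL", "amount": 30},
    {"country": "UK", "amount": 50},
    {"country": "DE", "amount": 120},
]

# Step 1: total order amount per country.
totals = map_reduce(
    orders,
    mapper=lambda order: [(order["country"], order["amount"])],
    reducer=lambda country, amounts: (country, sum(amounts)),
)

# Step 2: feed the output of step 1 into a second step that buckets
# countries by whether their total reaches a threshold of 100.
buckets = map_reduce(
    totals,
    mapper=lambda pair: [("big" if pair[1] >= 100 else "small", pair[0])],
    reducer=lambda bucket, countries: (bucket, sorted(countries)),
)
```

In a real cluster the mapper calls run on different machines and the grouping happens over the network, but the contract is the same: mappers see one record at a time and reducers see one key at a time.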

    HOWEVER: It is important to note that at this moment only BATCH oriented processing can be done with Hadoop. If you want “realtime” processing you are currently out of luck.

    I don’t know of a better model at this moment for which an actual implementation exists.

    # My understanding is that MapReduce applies to large datasets for sorting/analytics/grouping/counting/aggregation/etc. Is my understanding correct?

    Yep, that is the most common application.

© 2021 The Archive Base. All Rights Reserved