The Archive Base

Editorial Team
Asked: May 11, 2026 (2026-05-11T20:00:14+00:00)

I’ve been looking at MapReduce for a while, and it seems to be a very good way to implement fault-tolerant distributed computing. I read a lot of papers and articles on that topic, installed Hadoop on an array of virtual machines, and did some very interesting tests. I really think I understand the Map and Reduce steps.

But here is my problem: I can’t figure out how it can help with HTTP server log analysis.

My understanding is that big companies (Facebook, for instance) use MapReduce to process their HTTP logs in order to speed up the extraction of audience statistics. The company I work for, while smaller than Facebook, has a large volume of web logs to process every day (100 GB, growing between 5 and 10 percent every month). Right now we process these logs on a single server, and it works just fine. But distributing the computing jobs instantly comes to mind as a soon-to-be useful optimization.

Here are the questions I can’t answer right now. Any help would be greatly appreciated:

  • Can the MapReduce concept really be applied to web log analysis?
  • Is MapReduce the most clever way of doing it?
  • How would you split the web log files between the various computing instances?

Thank you.
Nicolas


1 Answer

  1. Editorial Team
    Added an answer on May 11, 2026 at 8:00 pm

    Can the MapReduce concept really be applied to web log analysis?

    Yes.

    You can split your huge logfile into chunks of, say, 10,000 or 1,000,000 lines (whatever is a good chunk size for your type of logfile; for Apache logfiles I’d go for a larger number), feed them to mappers that extract something specific (such as browser, IP address, …, username, …) from each log line, then reduce by counting the number of times each value appeared. A simplified example:

      192.168.1.1,FireFox x.x,username1
      192.168.1.1,FireFox x.x,username1
      192.168.1.2,FireFox y.y,username1
      192.168.1.7,IE 7.0,username1
    

    You can extract browsers, ignoring version, using a map operation to get this list:

    FireFox
    FireFox
    FireFox
    IE
    

    Then reduce to get this:

    FireFox,3
    IE,1
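
The map and reduce steps above can be sketched in a few lines of plain Python. This is only an illustration running in a single process, not a Hadoop job: the function names (`map_browser`, `reduce_counts`, `count_browsers`) are made up for this example, and the comma-separated line format is assumed from the simplified sample data.

```python
from collections import Counter

# Sample lines copied from the simplified example: ip,browser,username
LOG_LINES = [
    "192.168.1.1,FireFox x.x,username1",
    "192.168.1.1,FireFox x.x,username1",
    "192.168.1.2,FireFox y.y,username1",
    "192.168.1.7,IE 7.0,username1",
]


def map_browser(line):
    """Map step: emit the browser name for one log line, ignoring the version."""
    _ip, browser, _user = line.split(",")
    return browser.split(" ")[0]


def reduce_counts(browsers):
    """Reduce step: count how many times each browser name was emitted."""
    return Counter(browsers)


def count_browsers(lines):
    """Run the map step on every line, then reduce the emitted values."""
    return reduce_counts(map_browser(line) for line in lines)


if __name__ == "__main__":
    for browser, count in count_browsers(LOG_LINES).items():
        print(f"{browser},{count}")
```

In a real MapReduce framework the map calls would run on different machines and the framework would group the emitted values before the reduce step; here both steps simply run one after the other.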

    Is MapReduce the most clever way of doing it?

    It’s clever, but you would need to be operating at a very large scale to gain any benefit, for example splitting petabytes of logs.

    To do this kind of thing, I would prefer to use message queues and a consistent storage engine (like a database), with processing clients that pull work from the queues, perform the job, and push results to another queue. Jobs that are not completed within some timeframe are made available again for other clients to process. These clients would be small programs that each do something specific.

    You could start with 1 client, and expand to 1000… You could even have a client that runs as a screensaver on all the PCs on a LAN, and run 8 clients on your 8-core servers, 2 on your dual core PCs…

    With pull, you could have 10 or 100 clients working, multicore machines could run multiple clients, and whatever a client finishes would be available for the next step. You don’t need to do any hashing or assignment of the work to be done. It’s 100% dynamic.

    http://img355.imageshack.us/img355/7355/mqlogs.png
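
As a rough sketch of the pull model described above, the following uses Python's in-process `queue.Queue` and threads as a stand-in for a real message queue and separate client programs. The names `worker` and `run_pull_workers` are hypothetical, the log format is assumed from the simplified sample earlier in this answer, and re-queuing of jobs that time out is not shown.

```python
import queue
import threading
from collections import Counter


def worker(jobs, results):
    """Pull chunks from the jobs queue until it is empty; push partial counts."""
    while True:
        try:
            chunk = jobs.get_nowait()
        except queue.Empty:
            return  # no work left for this client
        # Assumed line format: ip,browser version,username
        partial = Counter(line.split(",")[1].split(" ")[0] for line in chunk)
        results.put(partial)


def run_pull_workers(chunks, num_workers=2):
    """Fill the jobs queue, start the workers, then merge their partial results."""
    jobs = queue.Queue()
    results = queue.Queue()
    for chunk in chunks:
        jobs.put(chunk)

    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # All workers have finished, so the results queue is no longer changing.
    total = Counter()
    while not results.empty():
        total += results.get()
    return total


if __name__ == "__main__":
    chunks = [
        ["192.168.1.1,FireFox x.x,username1", "192.168.1.1,FireFox x.x,username1"],
        ["192.168.1.2,FireFox y.y,username1", "192.168.1.7,IE 7.0,username1"],
    ]
    print(run_pull_workers(chunks, num_workers=3))
```

No chunk is assigned to a particular worker in advance: each worker takes the next available chunk, so adding or removing workers only changes `num_workers`.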

    How would you split the web log files between the various computing instances?

    By number of elements, or by number of lines if it’s a text-based logfile.
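
One possible way to split by line count, sketched in Python: the helper name `chunk_lines` is made up for this example, and the chunk size is whatever suits your logfile.

```python
from itertools import islice


def chunk_lines(lines, chunk_size):
    """Yield lists of at most chunk_size lines from any iterable of lines."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk


if __name__ == "__main__":
    for chunk in chunk_lines(["a", "b", "c", "d", "e"], 2):
        print(chunk)
```

Because an open text file is itself an iterable of lines, passing a file object to `chunk_lines` reads the log one chunk at a time instead of loading the whole file into memory.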

    In order to test MapReduce, I’d like to suggest that you play with Hadoop.

© 2021 The Archive Base. All Rights Reserved