The Archive Base

Editorial Team
Asked: May 24, 2026

These days I have to deal with extremely large log data (700GB after compression as 7z), so performance is critical. Given the environment I'm working in (8 cores), I was thinking of leveraging parallel programming to get better performance.
Currently I'm using the built-in multiprocessing library. Performance improved, but I want it to be even better. I've heard there are many other parallel programming libraries for Python, such as pp.

So my question is: what is the difference between those modules? Is one better than the others?

1 Answer

  1. Editorial Team
    Added an answer on May 24, 2026 at 5:22 pm

    First, just a few questions:

    • 700GB compressed, so how much is it uncompressed?
    • How many files?
    • What are you trying to do with these logs? How can we divide and conquer?

    I think you should look into MapReduce for this volume of data.

    To have an example task, I'm going to assume you have 800GB of compressed adserver event log data and you want to do something simple, like count the number of unique users across that dataset. For this quantity of data and this sort of processing, multiprocessing will help, but you'll get a lot further, faster, with MapReduce: I'd look into EMR (Amazon Elastic MapReduce) and MrJob or Dumbo. Starting with simple processing jobs like a user count will help validate the procedure and get you thinking about the problem in terms of mappers and reducers. It takes a little more time to wrap your head around more complex tasks, but if you're going to be working with this volume of data for any real amount of time, it'll be well worth the investment.

    For example, counting unique users starts with a mapper that simply takes each row of adserver data and emits the user ID (cookie ID, IP address, or whatever we can use to differentiate between users). You'll also have a reducer that takes these user IDs as input and removes or counts duplicates.
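
As a rough in-memory illustration of that mapper/reducer split, here is a minimal sketch. It assumes the user ID is the second tab-separated field of each row; the function names and sample rows are made up for the example, and a real job would stream rows rather than hold them in a list.

```python
def map_user_ids(rows):
    """Mapper: emit the user ID (second tab-separated field) of each row."""
    for row in rows:
        parts = row.rstrip("\n").split("\t")
        if len(parts) > 1:
            yield parts[1]


def count_unique(user_ids):
    """Reducer: drop duplicate IDs and count what is left."""
    return len(set(user_ids))


# Hypothetical sample rows: auction ID, user ID, site ID.
sample_rows = [
    "a1\tu1\tsite1",
    "a2\tu2\tsite1",
    "a3\tu1\tsite2",
]
unique_users = count_unique(map_user_ids(sample_rows))
print(unique_users)
```

The point of the split is that the mapper needs no shared state, so any number of copies can run in parallel over different chunks of the logs.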

    Of course, once you resolve to give this a try, there's still a fair amount of work to do: prepping data (splitting large files or grouping small files into blobs so that work is distributed efficiently, and storing the data uncompressed or in a compression format EMR's Hadoop flavor understands), tuning Hadoop variables to suit the resources available and your algorithm, uploading data to S3, and so on.
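
The "splitting large files" part of that prep work could look something like the sketch below. The helper name, the chunk naming scheme, and the chunk size are all assumptions for illustration; the right chunk size depends on your cluster and data.

```python
import os
import tempfile


def split_file(path, out_dir, lines_per_chunk):
    """Split a text file into numbered chunk files of at most lines_per_chunk lines."""
    chunk_paths = []
    out = None
    with open(path) as src:
        for i, line in enumerate(src):
            if i % lines_per_chunk == 0:
                # Start a new chunk file every lines_per_chunk lines.
                if out is not None:
                    out.close()
                chunk_path = os.path.join(out_dir, "chunk-%05d.tsv" % len(chunk_paths))
                chunk_paths.append(chunk_path)
                out = open(chunk_path, "w")
            out.write(line)
    if out is not None:
        out.close()
    return chunk_paths


# Tiny demonstration on a throwaway file with five rows.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "mydata.tsv")
with open(demo_path, "w") as f:
    for n in range(5):
        f.write("auction%d\tuser%d\n" % (n, n))

chunks = split_file(demo_path, demo_dir, lines_per_chunk=2)
print(len(chunks))
```

Splitting on line boundaries keeps every row intact, so each chunk can be handed to a mapper independently.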

    On the plus side, you should actually be able to work through 800GB of data in a matter of a couple of hours.

    A simple MapReduce example in Python:

    Here’s the log file format:

    AuctionID\tUserID\tSiteID\tURL\tUserAgent\tTimestamp
    

    It's just a simple tab-separated values (TSV) file.

    So we'll write a simple mapper that reads rows like this from stdin and writes UserIDs to stdout.

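    import sys

    def usercount_mapper(lines):
        # Emit "UserID<TAB>1" for every well-formed log row.
        for line in lines:
            # Strip only the line ending so empty leading fields keep their position.
            parts = line.rstrip("\r\n").split("\t")
            if len(parts) < 2:
                continue  # skip blank or malformed rows
            user_id = parts[1]
            print("%s\t%s" % (user_id, 1))

    if __name__ == "__main__":
        usercount_mapper(sys.stdin)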
    

    And a simple implementation of the reducer, which sums the counts for each UserID (the number of output rows is then the number of unique users):

    import sys

    def usercount_reducer(lines):
        # Sum the counts emitted by the mappers for each UserID.
        user_ids = {}
        for line in lines:
            try:
                user_id, count = line.rstrip("\r\n").split("\t")
                count = int(count)
            except ValueError:
                continue  # skip rows that are not "UserID<TAB>count"
            user_ids[user_id] = user_ids.get(user_id, 0) + count

        # One output row per unique user.
        for user_id, count in user_ids.items():
            print("%s\t%s" % (user_id, count))

    if __name__ == "__main__":
        usercount_reducer(sys.stdin)
    

    You can run this on a single chunk of data to test it locally by just doing:

    $ cat mydata.tsv | python map.py | sort | python reduce.py > result.tsv
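
One thing the `sort` step buys you: all rows for a given UserID arrive together, so a reducer does not have to keep every ID in a dictionary. Here is a minimal sketch of that idea using `itertools.groupby`. It is an alternative to the reducer above rather than part of the original scripts, and the sample lines are invented.

```python
from itertools import groupby


def parse(lines):
    """Yield (user_id, count) pairs from tab-separated lines, skipping bad rows."""
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue
        try:
            yield parts[0], int(parts[1])
        except ValueError:
            continue


def streaming_reducer(sorted_lines):
    """Sum counts per user; relies on the input being sorted by user ID."""
    pairs = parse(sorted_lines)
    for user_id, group in groupby(pairs, key=lambda pair: pair[0]):
        yield user_id, sum(count for _, count in group)


# Hypothetical mapper output, sorted the way `sort` would leave it.
mapper_output = sorted(["u2\t1", "u1\t1", "u2\t1", "u1\t1", "u1\t1"])
totals = list(streaming_reducer(mapper_output))
print(totals)
```

Because only one user's running total is held at a time, memory use stays flat however many distinct users the data contains.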
    

    The MapReduce framework (Hadoop if you use EMR) is responsible for running multiple map and reduce tasks and for sorting the data from the mappers before handing it to the reducers. To let the reducers actually do their job, the framework also hashes the key (the first value in the mapper's tab-separated output, UserID in this case) and sends all mapper output with the same hash to the same reducer. This way, every row for user ID 4 always goes to one reducer (say, reducer 1), every row for user ID 5 goes to another (say, reducer 2), and so on.
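
The key-to-reducer assignment can be sketched roughly as follows. This is a simplified illustration, not Hadoop's actual partitioner: `pick_reducer`, the use of `zlib.crc32`, and the sample keys are all assumptions chosen for the example.

```python
import zlib


def pick_reducer(key, num_reducers):
    """Map a key to a reducer index using a stable hash of the key."""
    # zlib.crc32 is used only because it is deterministic across runs;
    # it is a stand-in, not the hash function Hadoop itself uses.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


# Hypothetical user IDs emitted by several mappers.
keys = ["user4", "user5", "user4", "user9", "user5"]
assignments = [(key, pick_reducer(key, 3)) for key in keys]
for key, reducer in assignments:
    print("%s -> reducer %d" % (key, reducer))
```

Whatever hash is used, the property that matters is that equal keys always land on the same reducer, so each reducer sees every row for the users it owns.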

    If you want to build something yourself, you could look directly at Disco (Disco is Python and Erlang, so if you're allergic to Java it may be a good choice :-)) or Hadoop to build out your own MapReduce infrastructure rather than using EMR. In the Hadoop/EMR world there are also some cool data processing platforms, like Hive (a SQL-like environment for describing data and MapReduce algorithms) or Pig (like grep and awk on steroids), that may be a better fit for you than scripts like the above.

    For instance, having expressed your schema in Hive you could write the following query to get unique users (assuming you’d previously defined a table users):

    SELECT DISTINCT users.user_id FROM users;
    