These days I have to deal with extremely large log data (700GB after compression with 7z), so performance is critical. Given the environment I am working in (8 cores), I was thinking of leveraging parallel programming to achieve better performance.
Currently I am using the built-in multiprocessing library. Performance improved, but I want it to be even better. I’ve heard there are many other parallel programming libraries for Python, such as pp.
So my question is: what is the difference between those modules? Is one better than the others?
First, just a few questions:
conquer?
I think you should look into MapReduce for this volume of data.
For the purposes of having an example task I’m just going to assume you have 800GB of compressed adserver event log data and you want to do something simple like count the number of unique users across that dataset. For this quantity of data and this sort of processing multiprocessing is going to help but you’ll get a lot further faster with MapReduce: I’d look into EMR and MrJob or Dumbo. Doing simple processing jobs like a user count will help validate the procedure and help you start thinking about the problem in terms of mappers and reducers. It takes a little more time to wrap your head around more complex tasks but I think if you’re going to be working with this volume of data for any real amount of time it’ll be well worth the investment.
For example, counting unique users will start with a mapper that simply takes each row of adserver data and emits the userID (cookieID, IP address, whatever we can use to differentiate between users). You’ll also have a reducer that takes these user IDs as input and removes or counts duplicates.
Of course, once you resolve to give this a try there’s still a fair amount of work to do: prepping data (splitting large files or grouping small files into blobs so that you have efficient distribution of work, and storing the data uncompressed or in a compression format that EMR’s Hadoop flavor understands), tuning Hadoop variables to work with the resources available and your algorithm, uploading data to S3, etc.
On the plus side, you should actually be able to work through 800GB of data in a matter of a couple of hours.
A simple mapreduce example in python:
Here’s the log file format:
It’s just a simple tab separated value (tsv) file.
So we’ll write a simple mapper to read rows like this from stdin and write UserIDs to stdout.
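A minimal sketch of such a mapper follows. The column position of the user ID is an assumption (`USER_ID_COLUMN = 1` is a hypothetical value, as is the name `map_lines`); adjust it to wherever the ID actually sits in your rows.

```python
#!/usr/bin/env python
"""Streaming-style mapper sketch: emit one user ID per input row."""
import sys

# Assumption: the user ID is the second tab-separated field (index 1).
# This index is hypothetical; change it to match the real log layout.
USER_ID_COLUMN = 1


def map_lines(lines, column=USER_ID_COLUMN):
    """Yield the user ID from each row, skipping short or empty rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > column and fields[column]:
            yield fields[column]


if __name__ == "__main__":
    # Read rows from stdin and write one user ID per line to stdout.
    for user_id in map_lines(sys.stdin):
        print(user_id)
```

Skipping malformed rows instead of raising is a design choice that tends to suit very large logs, where a handful of truncated lines should not kill a long job.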
And a simple implementation of the reducer to count unique user IDs:
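The sketch below assumes the reducer receives one user ID per line, already sorted, which is what the framework's sort step (or a local `sort`) provides. The function name `count_unique` is illustrative rather than taken from the original answer.

```python
#!/usr/bin/env python
"""Streaming-style reducer sketch: count distinct keys in sorted input."""
import sys


def count_unique(sorted_lines):
    """Count distinct keys, relying on equal keys being adjacent."""
    count = 0
    previous = None
    for line in sorted_lines:
        # The key is everything before the first tab (the whole line here).
        key = line.rstrip("\n").split("\t", 1)[0]
        if not key:
            continue
        if key != previous:
            count += 1
            previous = key
    return count


if __name__ == "__main__":
    print(count_unique(sys.stdin))
```

Because the input is sorted, the reducer only has to remember the previous key, so memory use stays constant however many users there are.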
You can run this on a single chunk of data to test it locally by just doing:
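The same map, sort, reduce sequence can also be simulated in a single Python process when checking the logic on a small sample. This is only a sketch: `run_local` and the default column index are hypothetical, and it holds all mapped IDs in memory, so it is meant for test chunks rather than the full dataset.

```python
from itertools import groupby


def run_local(lines, user_id_column=1):
    """Simulate map -> sort -> reduce for a unique-user count."""
    # Map: pull the user ID out of every well-formed row.
    mapped = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > user_id_column and fields[user_id_column]:
            mapped.append(fields[user_id_column])

    # Shuffle/sort: the framework sorts mapper output by key.
    mapped.sort()

    # Reduce: each group of adjacent equal keys is one unique user.
    return sum(1 for _key, _group in groupby(mapped))


if __name__ == "__main__":
    sample = ["t1\tu2\tclick\n", "t2\tu1\tview\n", "t3\tu2\tview\n"]
    print(run_local(sample))
```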
The mapreduce framework (Hadoop if you use EMR) will be responsible for running multiple map and reduce tasks and for sorting the data from the mappers before handing that data to the reducers. To allow the reducers to actually do their job, the MR framework will also hash the key (the first value in your tab-separated mapper output, UserID in this case) and send all mapper output with the same hash to the same reducer. This way, users with id 4 will always go to reducer 1, id 5 will go to reducer 2, etc.
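The routing idea can be sketched in a few lines. This is an illustration of hash partitioning in general, not Hadoop's actual partitioner: `zlib.crc32` stands in for the framework's hash function, and the names `partition` and `assign` are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Map a key to a reducer index using a stable hash."""
    # crc32 is deterministic across runs, unlike Python's built-in hash()
    # for strings, which is randomized per process by default.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def assign(keys, num_reducers):
    """Group mapper output keys by the reducer that would receive them."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[partition(key, num_reducers)].append(key)
    return dict(buckets)


if __name__ == "__main__":
    # Every occurrence of a given user ID lands in the same bucket.
    print(assign(["4", "5", "4", "6", "5"], num_reducers=2))
```

Since each user ID lands on exactly one reducer, per-reducer unique counts can simply be summed to get the global total.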
If you want to build something yourself, you could look directly at Disco (Disco is Python and Erlang, so if you’re allergic to Java it may be a good choice :-)) or Hadoop to build out your own mapreduce infrastructure rather than using EMR. In the Hadoop/EMR world there are also some cool data processing platforms like Hive (a SQL-like environment for describing data and mapreduce algorithms) or Pig (like grep and awk on steroids) that may be a better fit for you than scripts like the above.
For instance, having expressed your schema in Hive you could write the following query to get unique users (assuming you’d previously defined a table users):