I’m trying to parallelize an application using multiprocessing which takes in
a very large csv file (64MB to 500MB), does some work line by line, and then outputs a small, fixed size
file.
Currently I do a list(file_obj), which unfortunately loads the entire file
into memory (I think), and I then break that list up into n parts, n being the
number of processes I want to run. I then do a pool.map() on the broken up
lists.
This seems to have a really, really bad runtime in comparison to a
single-threaded, just-open-the-file-and-iterate-over-it approach. Can someone
suggest a better solution?
Additionally, I need to process the rows of the file in groups which preserve
the value of a certain column. These groups of rows can themselves be split up,
but no group should contain more than one value for this column.
list(file_obj) can require a lot of memory when file_obj is large. We can reduce that memory requirement by using itertools to pull out chunks of lines as we need them.

In particular, we can use itertools.groupby to split the file into processable chunks, and itertools.islice to have the multiprocessing pool work on num_chunks chunks at a time.

By doing so, we need only roughly enough memory to hold a few (num_chunks) chunks at once, instead of the whole file.
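
As a rough sketch of the idea (not a definitive implementation), the code below groups consecutive rows by their first column with itertools.groupby, pulls num_chunks groups at a time with itertools.islice, and hands each batch to pool.map. The names worker, process_in_batches, and SAMPLE are made up for illustration, and the per-group work (counting rows and summing the second column) is only a placeholder. It assumes rows with the same key are adjacent in the file, since groupby only groups consecutive rows.

```python
import csv
import io
import itertools
import multiprocessing as mp

# Hypothetical sample data standing in for the large CSV file.
SAMPLE = "a,1\na,2\nb,3\nc,4\nc,5\nc,6\n"


def worker(group):
    """Placeholder work: return (key, row count, sum of column 1) for one group."""
    key = group[0][0]
    return key, len(group), sum(int(row[1]) for row in group)


def process_in_batches(rows, keyfunc, work, mapper, num_chunks):
    """Send groups of consecutive same-key rows to mapper, num_chunks at a time."""
    grouped = itertools.groupby(rows, key=keyfunc)
    results = []
    while True:
        # Materialize at most num_chunks groups; only these are held in memory.
        batch = [list(g) for _, g in itertools.islice(grouped, num_chunks)]
        if not batch:
            break
        results.extend(mapper(work, batch))
    return results


if __name__ == "__main__":
    # With a real file, pass csv.reader(open(path, newline="")) instead.
    reader = csv.reader(io.StringIO(SAMPLE))
    with mp.Pool(2) as pool:
        out = process_in_batches(reader, lambda row: row[0], worker, pool.map, 2)
    print(out)  # [('a', 2, 3), ('b', 1, 3), ('c', 3, 15)]
```

Passing the mapping function in as an argument makes it easy to compare pool.map against a plain serial map on the same data, which is worth doing: if the per-row work is cheap, the cost of pickling rows to the worker processes can outweigh the gain from parallelism.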