I am new to Python. I have 2000 files each about 100 MB. I have to read each of them and merge them into a big matrix (or table). Can I use parallel processing for this so that I can save some time? If yes, how? I tried searching and things seem very complicated. Currently, it takes about 8 hours to get this done serially. We have a really big server with one Tera Byte RAM and few hundred processors. How can I efficiently make use of this?
Thank you for your help.
You make be able to preprocess the files in separate processes using the subprocess module; however, if the final table is kept in memory, then that process will end up being you bottleneck.
There is another possible approach using shared memory with mmap objects. Each subprocess can be responsible for loading the files into a subsection of the mapped memory.