I need some help on a parallel processing task that I am trying to complete asap.
It simply involves splitting a largeish dataframe into smaller chunks and running the same script on each chunk.
I think this is called embarassingly parallel.
I would be very grateful if there’s someone out there who could suggest a template to achieve this task using either amazon cloud services or picloud.
I have made initial forays into amazon ec2 and picloud (the script I will run on each data chunk is in python) but realise that I may
not figure out how to do it in either without some help.
So, any pointers would be greatly appreciated. I’m just looking for basic help (to those in the know), such as the main steps involved in setting up parallel cores or cpus using either ec2 or picloud or whatever, running the script in parallel, and saving the script output i.e. the script writes the result of its calculation to a csv file.
i’m running ubuntu 12.04, my python 2.7 script doesnt involve non-stand libraries, just os and csv. the script isn’t complex, just the data is too big for my machine and timeframe.
This script uses the cloud library for Python from PiCloud, and should be run locally.
Use the Realtime Cores feature to allocate a specific number of cores.