I need some help on a parallel processing task that I am trying to


Editorial Team

Asked: June 14, 2026 (2026-06-14T09:58:28+00:00)


I need some help with a parallel processing task that I am trying to complete ASAP.

It simply involves splitting a large-ish dataframe into smaller chunks and running the same script on each chunk.

I think this is called embarrassingly parallel.

I would be very grateful if someone out there could suggest a template for this task using either Amazon cloud services or PiCloud.

I have made initial forays into Amazon EC2 and PiCloud (the script I will run on each data chunk is in Python) but realise that I may not figure out how to do it in either without some help.

So, any pointers would be greatly appreciated. I'm just looking for basic help (basic to those in the know), such as the main steps involved in setting up parallel cores or CPUs using EC2, PiCloud or whatever else, running the script in parallel, and saving the script output, i.e. the script writes the result of its calculation to a CSV file.

I'm running Ubuntu 12.04, and my Python 2.7 script doesn't involve non-standard libraries, just os and csv. The script isn't complex; the data is just too big for my machine and timeframe.


1 Answer

Editorial Team · Answer 1 · 2026-06-14T09:58:29+00:00

This script uses the cloud library for Python from PiCloud, and should be run locally.

import cloud

# chunks is a list of filenames (you'll need to define generate_chunk_files)
chunks = generate_chunk_files('large_dataframe')
for chunk in chunks:
    # stores each chunk in your PiCloud bucket
    cloud.bucket.put(chunk)

def process_chunk(chunk):
    """Runs on PiCloud"""

    # saves chunk object locally (on the machine running the job)
    cloud.bucket.get(chunk)
    with open(chunk, 'r') as f:
        # process the data however you want
        pass

# asynchronously runs process_chunk on the cloud for all chunks
job_ids = cloud.map(process_chunk, chunks)

Use the Realtime Cores feature to allocate a specific number of cores.
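
The script above leaves generate_chunk_files for you to define. Here is one possible sketch, assuming the dataframe is stored as a CSV file with a header row. The rows_per_chunk and out_dir parameters and the chunk file naming are my own assumptions, not anything PiCloud requires. It is written for Python 3; on Python 2.7 the newline="" and exist_ok arguments would need adjusting.

```python
import csv
import os


def generate_chunk_files(source_path, rows_per_chunk=100000, out_dir="."):
    """Split a CSV file into smaller CSV files and return their paths.

    Hypothetical helper: every chunk repeats the header row so that it
    can be processed on its own.
    """
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(source_path))[0]
    chunk_paths = []
    out_file = None
    writer = None

    try:
        with open(source_path, newline="") as src:
            reader = csv.reader(src)
            header = next(reader, None)
            if header is None:
                return chunk_paths  # empty input, nothing to split

            for i, row in enumerate(reader):
                # Start a new chunk file every rows_per_chunk data rows.
                if i % rows_per_chunk == 0:
                    if out_file is not None:
                        out_file.close()
                    name = "%s_%04d.csv" % (base, len(chunk_paths))
                    path = os.path.join(out_dir, name)
                    out_file = open(path, "w", newline="")
                    writer = csv.writer(out_file)
                    writer.writerow(header)
                    chunk_paths.append(path)
                writer.writerow(row)
    finally:
        # Close the last chunk even if reading or writing failed.
        if out_file is not None:
            out_file.close()

    return chunk_paths
```

By default the chunks are written to the current directory, so the returned names stay flat file names. The input is read one row at a time, so the whole file never has to fit in memory.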
