I’m using python boto and threading to download many files from S3 rapidly. I

Question

0

Asked: June 7, 20262026-06-07T14:43:11+00:00 2026-06-07T14:43:11+00:00

I’m using python boto and threading to download many files from S3 rapidly. I

0

I’m using python boto and threading to download many files from S3 rapidly. I use this several times in my program and it works great. However, there is one time when it doesn’t work. In that step, I try to download 3,000 files on a 32 core machine (Amazon EC2 cc2.8xlarge).

The code below actually succeeds in downloading every file (except sometimes there is an httplib.IncompleteRead error that doesn’t get fixed by the retries). However, only 10 or so of the 32 threads actually terminate and the program just hangs. Not sure why this is. All the files have been downloaded and all the threads should have exited. They do on other steps when I download fewer files. I’ve been reduced to downloading all these files with a single thread (which works but is super slow). Any insights would be greatly appreciated!

from boto.ec2.connection import EC2Connection
from boto.s3.connection import S3Connection
from boto.s3.key import Key

from boto.exception import BotoClientError
from socket import error as socket_error
from httplib import IncompleteRead

import multiprocessing
from time import sleep
import os

import Queue
import threading

def download_to_dir(keys, dir):
    """
    Given a list of S3 keys and a local directory filepath,
    downloads the files corresponding to the keys to the local directory.
    Returns a list of filenames.
    """
    filenames = [None for k in keys]

    class DownloadThread(threading.Thread):

        def __init__(self, queue, dir):
            # call to the parent constructor
            threading.Thread.__init__(self)
            # create a connection to S3
            connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
            self.conn = connection
            self.dir = dir
            self.__queue = queue

        def run(self):
            while True:
                key_dict = self.__queue.get()
                print self, key_dict
                if key_dict is None:
                    print "DOWNLOAD THREAD FINISHED"
                    break
                elif key_dict == 'DONE': #last job for last worker
                    print "DOWNLOADING DONE"
                    break
                else: #still work to do!
                    index = key_dict.get('idx')
                    key = key_dict.get('key')
                    bucket_name = key.bucket.name
                    bucket = self.conn.get_bucket(bucket_name)
                    k = Key(bucket) #clone key to use new connection
                    k.key = key.key

                    filename = os.path.join(dir, k.key)
                    #make dirs if don't exist yet
                    try:
                        f_dirname = os.path.dirname(filename)
                        if not os.path.exists(f_dirname):
                            os.makedirs(f_dirname)
                    except OSError: #already written to
                        pass

                    #inspired by: http://code.google.com/p/s3funnel/source/browse/trunk/scripts/s3funnel?r=10
                    RETRIES = 5 #attempt at most 5 times
                    wait = 1
                    for i in xrange(RETRIES):
                        try:
                            k.get_contents_to_filename(filename)
                            break
                        except (IncompleteRead, socket_error, BotoClientError), e:
                            if i == RETRIES-1: #failed final attempt
                                raise Exception('FAILED TO DOWNLOAD %s, %s' % (k, e))
                                break
                            wait *= 2
                            sleep(wait)

                    #put filename in right spot!
                    filenames[index] = filename

    num_cores = multiprocessing.cpu_count()

    q = Queue.Queue(0)

    for i, k in enumerate(keys):
        q.put({'idx': i, 'key':k})
    for i in range(num_cores-1):
        q.put(None) # add end-of-queue markers
    q.put('DONE') #to signal absolute end of job

    #Spin up all the workers
    workers = [DownloadThread(q, dir) for i in range(num_cores)]
    for worker in workers:
        worker.start()

    #Block main thread until completion
    for worker in workers:
        worker.join() 

    return filenames

Report

Leave an answer
Cancel reply

You must login to add an answer.

Need An Account,

1 Answer

Editorial Team · Answer 1 · 2026-06-07T14:43:13+00:00

Upgrade to AWS SDK version 1.4.4.0 or newer, or stick to exactly 2 threads. Older versions have a limit of at most 2 simultaneous connections. This means that your code will work well if you launch 2 threads; if you launch 3 or more, you are bound to see incomplete reads and exhausted timeouts.

You will see that while 2 threads can boost your throughput greatly, more than 2 does not change much because your network card is busy all the time anyway.

Sign Up

Sign In

Forgot Password

The Archive Base Latest Questions

I’m using python boto and threading to download many files from S3 rapidly. I

Leave an answerCancel reply

1 Answer

Leave an answer
Cancel reply