I’m creating a Python script which accepts a path to a remote file and a number *n* of threads. The file’s size will be divided by the number of threads, and as each thread completes I want it to append the fetched data to a local file.
How do I manage it so that the threads append to the local file in the order they were generated, so the bytes don’t get scrambled?
Also, what if I want to download several files simultaneously?
You could coordinate the work with locks &c, but I recommend instead using `Queue`: usually the best way to coordinate multithreading (and multiprocessing) in Python.
I would have the main thread spawn as many worker threads as you think appropriate (you may want to calibrate between performance and load on the remote server by experimenting); every worker thread waits at the same global `Queue.Queue` instance, call it `workQ` for example, for “work requests” (`wr = workQ.get()` will do it properly: each work request is obtained by a single worker thread, no fuss, no muss). A “work request” can in this case simply be a triple (tuple with three items): identification of the remote file (URL or whatever), the offset from which to get data, and the number of bytes to get (note that this works just as well for one or for multiple files to fetch).
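A worker’s main loop under this scheme can be sketched as follows (a hedged sketch: `fetch_range` is a hypothetical stand-in for the actual remote read, e.g. an HTTP `Range` request, and simply fabricates bytes; note that in Python 3 the `Queue` module is spelled `queue`):

```python
import queue
import threading

workQ = queue.Queue()     # the Queue module is named queue in Python 3
resultQ = queue.Queue()

def fetch_range(url, offset, numbytes):
    # Hypothetical stand-in for the real remote read (e.g. an HTTP
    # Range request); here it just fabricates numbytes of data.
    return b"x" * numbytes

def worker():
    while True:
        wr = workQ.get()      # blocks until a work request is available
        if wr is None:        # None as a "please terminate" sentinel
            break
        url, offset, numbytes = wr
        resultQ.put((url, offset, fetch_range(url, offset, numbytes)))

# minimal usage: one worker, one work request
t = threading.Thread(target=worker)
t.start()
workQ.put(("http://example.com/f", 0, 4))
res = resultQ.get()           # ('http://example.com/f', 0, b'xxxx')
workQ.put(None)
t.join()
```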
The main thread pushes all work requests onto
`workQ` (just `workQ.put((url, offset, numbytes))` for each request; note that `from` is a Python keyword, so call the offset something else) and waits for results to arrive on another `Queue` instance, call it `resultQ` (each result will also be a triple: identifier of the file, starting offset, and the string of bytes that is the result from that file at that offset). As each worker thread satisfies the request it’s working on, it puts the results into
`resultQ` and goes back to fetch another work request (or wait for one). Meanwhile the main thread (or a separate dedicated “writing thread” if needed, i.e. if the main thread has other work to do, for example on the GUI) gets results from `resultQ` and performs the needed `open`, `seek`, and `write` operations to place the data at the right spot. There are several ways to terminate the operation: for example, a special work request may ask the thread receiving it to terminate. The main thread puts on
`workQ` just as many of those as there are worker threads, after all the actual work requests, then joins all the worker threads when all data have been received and written (many alternatives exist, such as joining the queue directly, making the worker threads daemonic so they just go away when the main thread terminates, and so forth).
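Putting the pieces together, here is a minimal runnable sketch of the whole scheme. All names are illustrative assumptions: `fetch_range` fakes the remote fetch by slicing a hard-coded bytes object, and the writer assembles chunks into a `bytearray` in memory, standing in for the `open`/`seek`/`write` step on a local file.

```python
import queue
import threading

# Illustrative "remote" data; a real fetch_range would issue e.g. an
# HTTP Range request against the URL.
REMOTE_FILES = {"http://example.com/a": b"abcdefghijklmnopqrstuvwxyz"}

def fetch_range(url, offset, numbytes):
    return REMOTE_FILES[url][offset:offset + numbytes]

STOP = None  # sentinel "work request" asking a worker to terminate

def worker(workQ, resultQ):
    while True:
        wr = workQ.get()
        if wr is STOP:
            break
        url, offset, numbytes = wr
        resultQ.put((url, offset, fetch_range(url, offset, numbytes)))

def download(url, total_size, numthreads, chunksize):
    workQ, resultQ = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(workQ, resultQ))
               for _ in range(numthreads)]
    for t in threads:
        t.start()
    # push all the actual work requests first...
    nrequests = 0
    for offset in range(0, total_size, chunksize):
        workQ.put((url, offset, chunksize))
        nrequests += 1
    # ...then collect results, placing each chunk at the right spot
    # (the in-memory equivalent of open/seek/write on a local file)
    out = bytearray(total_size)
    for _ in range(nrequests):
        _, offset, data = resultQ.get()
        out[offset:offset + len(data)] = data
    # one termination request per worker, after all real work, then join
    for _ in threads:
        workQ.put(STOP)
    for t in threads:
        t.join()
    return bytes(out)
```

Because each result carries its own offset, the workers may finish in any order and the writer still places every chunk correctly, which answers the original “scrambled bytes” concern; extending the work-request triples to several URLs covers the multiple-files case with the same pool of workers.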