I’ve got to download, process, and store an 8GB XML file from a secure web server. I could download the file using the WebRequest class, but this will take a VERY long time. Also, I know that the file is structured in such a way that it suits processing in discrete chunks.
How can I ‘stream’ this file such that I only get bite-size pieces which I can work on, without having to get the whole stream at one time?
Edit
I forgot to mention – we are hosted on Azure. An idea that comes to mind is to provision a worker role which just downloads large files and can take as long as it wants. How feasible would that be?
8 GB is a large workload. To protect myself from rework and to scale effectively, I would decouple the XML file download from its processing.
While downloading the file as a stream, I would write some sort of stream identifier to persistent storage and schedule each atomic unit of work by placing a message with its relevant data on a queue.
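To make that concrete, here is a rough Python sketch of the idea, not a definitive implementation. Everything named in it is an assumption for illustration: `download_in_chunks` is a hypothetical helper, a local directory stands in for persistent storage, `queue.Queue` stands in for a durable queue, and the chunk index plays the role of the stream identifier. In real use `stream` would be the HTTP response body (any object with a `read(n)` method). Note that it splits on byte counts only; cutting on XML element boundaries is left out.

```python
import io
import json
import os
import queue
import tempfile


def download_in_chunks(stream, store_dir, work_queue, chunk_size=4 * 1024 * 1024):
    """Read `stream` piece by piece, persist each piece, and enqueue a work message.

    Returns the number of chunks written.
    """
    index = 0
    while True:
        data = stream.read(chunk_size)
        if not data:  # empty bytes means end of stream
            break
        # Hypothetical naming scheme: the chunk index is the identifier.
        path = os.path.join(store_dir, f"chunk-{index:06d}.bin")
        with open(path, "wb") as f:
            f.write(data)
        # One message per atomic unit of work, carrying what a worker needs.
        work_queue.put(json.dumps({"chunk": index, "path": path, "size": len(data)}))
        index += 1
    return index


# Demo: an in-memory stream stands in for the HTTP response body.
demo_queue = queue.Queue()
demo_dir = tempfile.mkdtemp()
demo_count = download_in_chunks(io.BytesIO(b"x" * 10), demo_dir, demo_queue, chunk_size=4)
```

Only one chunk is held in memory at a time, and a separate worker can pull messages off the queue and process chunks while the download is still running.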
This would allow recovery if the download goes south for any reason, or if a unit of work fails or interferes with the download.
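One way the recovery part could look, again as a sketch under assumptions: keep a small checkpoint recording how many bytes have been safely stored, and on restart ask the server for the rest using an HTTP `Range` header. This assumes the server supports range requests, which the question does not say. The helper names and the JSON checkpoint layout are hypothetical.

```python
import json
import os
import tempfile


def load_offset(checkpoint_path):
    """Return the next byte offset to fetch, or 0 if no checkpoint exists yet."""
    if not os.path.exists(checkpoint_path):
        return 0
    with open(checkpoint_path) as f:
        return json.load(f)["next_offset"]


def save_offset(checkpoint_path, next_offset):
    """Record progress; write to a temp file first, then rename over the old one."""
    tmp_path = checkpoint_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"next_offset": next_offset}, f)
    os.replace(tmp_path, checkpoint_path)


def range_header(offset):
    """Build the header asking the server to resume from `offset`."""
    return {"Range": f"bytes={offset}-"}


# Demo: first run starts at 0, a later run resumes from the saved offset.
demo_checkpoint = os.path.join(tempfile.mkdtemp(), "download.checkpoint")
offset_before = load_offset(demo_checkpoint)
save_offset(demo_checkpoint, 4096)
offset_after = load_offset(demo_checkpoint)
resume_headers = range_header(offset_after)
```

Saving the offset only after a chunk has been stored and its message queued means a crash costs at most one chunk of repeated work.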