I am working on a small utility app to concatenate large video files. The main concatenation step is to run something like this on the command line on Windows 7:
copy /b file1.dv + file2.dv + file3.dv output.dv
The input files are large – typically 7-15GB each. I know that I am dealing with a lot of data here, but the binary concatenation takes a very long time – for a total of around 40GB of data, it can take almost an hour.
Considering that the process is basically just a scan through each file and copying its contents to a new file, why is the binary copy so slow?
The built-in copy command was designed way back in the DOS days, and hasn’t really been updated since. It was designed for machines with small disks and very small primary memories, so it uses very small buffers when copying things around. For typical workloads this is no big deal, but it doesn’t do so well for the specific case you’re dealing with.
That said, I don’t think copy is going all that slowly given the scenario you describe. If it takes about an hour for a 40 gigabyte file, that means that you’re getting speeds of around 11 MB/s. Commodity Dell laptops like the one you describe in your comment are typically equipped with 5400 RPM consumer hard disks, which achieve something like 30 MB/s (end of the disk) to 60 MB/s (beginning of the disk) under ideal conditions for sequential reads and writes. However, your workload isn’t a sequential workload; it’s a constant shift of the read/write heads from the source file(s) to the target file(s). Throw in a 16 ms typical latency for such disks and you’ve got about 60 seeks per second, or 30 copy operations per second, since each copy operation needs one seek to read and one to write. That would mean that copy was using a buffer of around 11 MB / 30 = around 375k, which conveniently (after you account for the size of copy’s code and a few DOS device drivers) fits under the 640k ceiling that copy was originally designed for. This all assumes that your disk is operating under ideal conditions, and has plenty of leftover space allowing these reads and writes to actually be sequential within a copy operation.
Of course, if you’re doing anything else at the same time this is going to cause more seek operations, and your performance will be worse.
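The back-of-the-envelope estimate above can be reproduced in a few lines of Python. This is only a sketch of the arithmetic: the 16 ms latency and the "two seeks per copy operation" figure are the assumptions from the estimate, not measurements, and the variable names are made up for illustration.

```python
# Rough estimate only: the 16 ms seek latency and the assumption of two
# seeks per copy operation (one to read, one to write) are not measured.
total_bytes = 40 * 1024**3      # about 40 GB of input
elapsed_s = 60 * 60             # about one hour

# Observed throughput in MB/s
throughput_mb_s = total_bytes / elapsed_s / 1024**2

seeks_per_s = 1 / 0.016                 # 16 ms per seek
copy_ops_per_s = seeks_per_s / 2        # each copy operation costs two seeks

# Buffer size implied if each copy operation moves one buffer's worth of data
buffer_kb = throughput_mb_s * 1024 / copy_ops_per_s

print(f"throughput: {throughput_mb_s:.1f} MB/s")
print(f"copy operations per second: {copy_ops_per_s:.1f}")
print(f"implied buffer size: {buffer_kb:.0f} KB")
```

The result lands close to the figures quoted above: a little over 11 MB/s, and an implied buffer of a few hundred KB.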
You will probably get better results (maybe up to twice as fast) if you use another application which is designed for large copy operations, and as such uses larger buffers. I’m unaware of any such application though; you’ll probably need to write one yourself if that’s what you need.
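If you do end up writing one yourself, a minimal sketch in Python might look like the following. The function name concat_files and the 64 MB default buffer are arbitrary choices for illustration, not tuned values, and whether this actually beats copy on your disk is something you would have to measure.

```python
import shutil


def concat_files(sources, dest, buffer_size=64 * 1024 * 1024):
    """Append each source file to dest, in order, reading in large chunks.

    buffer_size is the chunk size in bytes; 64 MB is an arbitrary
    illustrative default, not a measured optimum.
    """
    with open(dest, "wb") as out:
        for path in sources:
            with open(path, "rb") as src:
                # copyfileobj reads and writes buffer_size bytes at a time,
                # so the disk head moves far less often than with small buffers.
                shutil.copyfileobj(src, out, buffer_size)
```

Calling concat_files(["file1.dv", "file2.dv", "file3.dv"], "output.dv") would correspond to the copy /b command in the question.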