I am processing thousands of binaries wrapped in zip-like file containers, pulled from a remote database. I need to analyze the contents of these binaries with tools like readelf, but I want to avoid incurring unnecessary IO to write the binaries to disk.
Is there a way to invoke subprocess.Popen so that the command-line utility treats the in-memory file as an ordinary file? I’ve tried assigning the file descriptor to stdin, but the utilities don’t read the file contents from stdin as expected.
with zipfile.ZipFile(file, 'r') as z:
    with z.open(binary_path) as bin:
        subprocess.Popen(['readelf', '-d'], stdin=bin)
I’ve also tried directly setting the necessary argument to a reference to the file descriptor, but that’s also proven fruitless:
with zipfile.ZipFile(file, 'r') as z:
    with z.open(binary_path) as bin:
        subprocess.Popen(['readelf', '-d', bin])
Is what I’m attempting possible, or should I just resort to writing to disk and analyzing from there?
Much thanks!
Zeroth, why do you need to popen readelf, instead of using libelf or something similar? A quick search for “elf” at PyPI shows lots of possibilities. Have you looked them over?
First, on many platforms, all of the I/O will end up going through the cache, so it won’t really slow you down, even if it does end up eventually flushing everything to disk just to delete it (which it may never do). Careful use of mmap can often help avoid flushing to disk, but you may not even need it.
So really, I’d test it first and see if excessive I/O really is slowing you down. If not, stop worrying about it.
If you want to be sure there’s no disk I/O (I’m assuming you’ve disabled all swap, because otherwise that idea is meaningless in the first place), the easiest solution is to create a temporary file that isn’t actually backed to disk.
The easiest way to do that is to create a ramdisk, and just put the temporary files there.
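As a rough sketch of the ramdisk idea, assuming a Linux machine where /dev/shm is a RAM-backed tmpfs mount (that path, and the helper name run_on_ram_file, are my assumptions and not something from the question), you can put a NamedTemporaryFile there and hand its path to the tool:

```python
import os
import subprocess
import tempfile

# Assumption: on Linux, /dev/shm is usually a tmpfs (RAM-backed) mount.
# If it does not exist, fall back to the default temp directory (None).
RAMDISK_DIR = "/dev/shm" if os.path.isdir("/dev/shm") else None


def run_on_ram_file(cmd, data, ramdisk_dir=RAMDISK_DIR):
    """Write data to a temporary file under ramdisk_dir and run cmd on its path.

    cmd is the argument list without the filename, e.g. ["readelf", "-d"].
    Returns the command's stdout as bytes.
    """
    with tempfile.NamedTemporaryFile(dir=ramdisk_dir) as tmp:
        tmp.write(data)
        tmp.flush()  # make sure the child process sees the full contents
        result = subprocess.run(
            cmd + [tmp.name], stdout=subprocess.PIPE, check=True
        )
    # The temporary file is deleted when the with block exits.
    return result.stdout
```

With the zip archive from the question, the call would look something like run_on_ram_file(['readelf', '-d'], z.read(binary_path)). The tool gets a real, seekable path, and on a tmpfs mount the data should stay in memory unless the system swaps.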
Alternatively, most platforms have a way to create a temporary file that either is never backed to disk, or is only backed to disk if absolutely necessary. Unfortunately, I don’t think any of the stdlib Python functions can do this, in which case you’ll have to write platform-specific code for it.
If you do want to pass an arbitrary buffer to a tool as stdin, it’s easy. But you have to know how to tell the tool to read stdin. Often that means something like passing -c as an option or - as a fake filename, or sometimes just not passing any filenames. Read the manpage to see which.
Unfortunately, some tools won’t work this way, often because they require a seekable file rather than just a stream. I believe readelf is one of them. So this option isn’t available.
And passing an arbitrary fd to a tool requires the tool to have a way to take arbitrary fds instead of filenames, which most of them don’t.
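For tools that do read stdin, a minimal sketch of the buffer-to-stdin approach looks like this. The helper name run_with_stdin is hypothetical; the idea is to read the zip member into bytes first (for example with z.read(binary_path)) and let subprocess feed it through a pipe:

```python
import subprocess


def run_with_stdin(cmd, data):
    """Feed data to cmd on its standard input and return its stdout as bytes."""
    # input= makes subprocess create a pipe, write data to it, and close it,
    # so the child sees end-of-file once everything has been delivered.
    result = subprocess.run(cmd, input=data, stdout=subprocess.PIPE, check=True)
    return result.stdout
```

For instance, GNU sha256sum treats - as stdin, so run_with_stdin(['sha256sum', '-'], data) would hash the buffer without any file on disk. The child only gets a pipe, so this does not help with tools that need to seek.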