Given the function
def get_files_from_sha(sha, files):
from subprocess import Popen, PIPE
import tarfile
if 0 == len(files):
return {}
p = Popen(["git", "archive", sha], bufsize=10240, stdin=PIPE, stdout=PIPE, stderr=PIPE)
tar = tarfile.open(fileobj=p.stdout, mode='r|')
p.communicate()
contents = {}
doall = files == '*'
if not doall:
files = set(files)
for entry in tar:
if (isinstance(files, set) and entry.name in files) or doall:
tf = tar.extractfile(entry)
contents[entry.name] = tf.read()
if not doall:
files.discard(entry.name)
if not doall:
for fname in files:
contents[fname] = None
tar.close()
return contents
which is called in a loop for some values of sha, after a while (in my case, 4 iterations) it starts to fail at the call to tf.read(), with the message:
Traceback (most recent call last):
File "../yap-analysis/extract.py", line 243, in <module>
commits, identities, identities_by_name, identities_by_email, identities_freq = build_commits(commits)
File "../yap-analysis/extract.py", line 186, in build_commits
commit = get_commit(commit)
File "../yap-analysis/extract.py", line 84, in get_commit
contents = get_files_from_sha(commit['sha'], files)
File "../yap-analysis/extract.py", line 42, in get_files_from_sha
contents[entry.name] = tf.read()
File "/usr/lib/python2.7/tarfile.py", line 817, in read
buf += self.fileobj.read()
File "/usr/lib/python2.7/tarfile.py", line 737, in read
return self.readnormal(size)
File "/usr/lib/python2.7/tarfile.py", line 746, in readnormal
return self.fileobj.read(size)
File "/usr/lib/python2.7/tarfile.py", line 573, in read
buf = self._read(size)
File "/usr/lib/python2.7/tarfile.py", line 581, in _read
return self.__read(size)
File "/usr/lib/python2.7/tarfile.py", line 606, in __read
buf = self.fileobj.read(self.bufsize)
ValueError: I/O operation on closed file
I suspect there is some parallelization that subprocess attempts to make (?).
What is the actual cause and how to solve it in a clean and robust way on python2?
Do not use
.communicate()on thePopeninstance; it’ll read thestdoutstream until it is finished. From the documentation:The code for
.communicate()even adds an explicit.close()call on thestdoutof the pipe.Simply removing the call to
.communicate()should be enough, but do also add a.wait()after reading the tarfile contents:It could be that
tar.close()also closesp.stdout, but an extra.close()there should not hurt.