I have an issue which has to do with file input and output in Python (it’s a continuation from this question: how to extract specific lines from a data file, which has been solved now).
So I have one big file, danish.train, and eleven small files (called danish.test.part-01 and so on), each of them containing a different selection of the data from the danish.train file. Now, for each of the eleven files, I want to create an accompanying file that complements them. This means that for each small file, I want to create a file that contains the contents of danish.train minus the part that is already in the small file.
What I’ve come up with so far is this:
    trainFile = open("danish.train")
    for file_number in range(1,12):
        input = open('danish.test.part-%02d' % file_number, 'r')
        for line in trainFile:
            if line not in input:
                with open('danish.train.part-%02d' % file_number, 'a+') as myfile:
                    myfile.write(line)
The problem is that this code only gives output for file_number 1, although I have a loop from 1-11. If I change the range, for example to in range(2,3), I get an output danish.train.part-02, but this output contains a copy of the whole danish.train without leaving out the contents of the file danish.test.part-02, as I wanted.
I suspect that these issues may have something to do with me not completely understanding the with... as operator, but I’m not sure. Any help would be greatly appreciated.
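When you open a file, it returns an iterator over the lines of the file. This is nice, in that it lets you go through the file one line at a time, without keeping the whole file in memory at once. In your case, it leads to a problem, because you need to iterate through the file multiple times. Once the inner loop has consumed trainFile for file_number 1, there is nothing left to read on later passes, which is why only the first output file is written. The test "line not in input" consumes the test file in the same way, so after the first few checks there is nothing left to compare against and nearly every training line gets written out. Instead, you can read the full training file into memory, and go through it multiple times.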
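A minimal sketch of that approach, assuming the file names from the question. The function name write_complements and its parameters are hypothetical, introduced only to keep the example self-contained. It also assumes every line ends with a newline, so that identical lines compare equal.

```python
def write_complements(train_path, test_pattern, out_pattern, part_numbers):
    """For each part number, write the training lines that are not in that test part."""
    # Read the training file once; a list can be iterated any number of times.
    with open(train_path) as f:
        train_lines = f.readlines()

    for n in part_numbers:
        # A set gives fast membership tests for the lines to leave out.
        with open(test_pattern % n) as f:
            test_lines = set(f)
        # Mode 'w' instead of 'a+', so rerunning the script does not append duplicates.
        with open(out_pattern % n, "w") as out:
            out.writelines(line for line in train_lines if line not in test_lines)
```

For the files in the question, the call would look like write_complements("danish.train", "danish.test.part-%02d", "danish.train.part-%02d", range(1, 12)).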
I’ve simplified the logic a little bit, as well. If you don’t care about the order of the lines, you could also consider reading the training lines into a set, and then just use set operations instead of the generator expression I used in the final line.
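The set-based variant could look roughly like this. The helper name complement_unordered is hypothetical, and the sample lines are made up for illustration. Note that a set difference drops the original line order and collapses duplicate lines, so it only fits if neither matters for your data.

```python
def complement_unordered(train_lines, test_lines):
    """Return the distinct training lines that do not occur among the test lines."""
    # Set difference: order is lost and repeated lines are kept only once.
    return set(train_lines) - set(test_lines)


# Illustrative data only; in practice these would come from the open files.
remaining = complement_unordered(["a\n", "b\n", "c\n", "b\n"], ["b\n"])
# Sorting is only there to make the printed order deterministic.
print(sorted(remaining))
```

The result can be passed straight to writelines on the output file, just like the generator expression.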