I have a problem that I feel should have a quite simple solution. I’m building a model and want to test its accuracy with 10-fold cross-validation. To do this I have to split my training corpus 90%/10% into training and test sections, then train my model on the 90% and test it on the 10%. I want to do this ten times, taking a different 90%/10% split each time, so that eventually every part of the corpus has been used as test data. Then I’ll average the results over the ten test runs.
I have tried to write a script that extracts 10% of the training corpus and writes it to a new file, but so far I haven’t got it working. What I have done is count the total number of lines in the file and divide that number by ten to get the size of each of the ten test sets I want to extract.
trainFile = open("danish.train")
numberOfLines = 0
for line in trainFile:
    numberOfLines += 1
trainFile.close()
# Integer division, so the test-set size is a whole number of lines
lengthTest = numberOfLines // 10
My own training file has 3638 lines, so each test set should consist of roughly 363 lines.
How do I write lines 1-363, lines 364-726, etc. to different test files?
Once you have the count of lines, go back to the beginning of the file and start copying out lines to danish.train.part-01. When the line number is a multiple of the size of the 10% test set, open a new file for the next part.
On this input file (sorry, I don’t speak Danish!):
This creates files
and part 5, for example, contains:
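For reference, the two-pass approach described above (count the lines, then copy them out and switch files every N lines) could be sketched roughly like this. The function name `split_into_parts` and the `parts` parameter are just names picked for illustration. One choice here is an assumption rather than part of the description: once the last part has been opened, any leftover lines are appended to it instead of starting an extra file, so 3638 lines give nine parts of 363 lines and a tenth of 371.

```python
def split_into_parts(path, parts=10):
    """Split the file at `path` into `parts` consecutive chunks.

    Writes path.part-01, path.part-02, ... and returns the list of
    file names that were created.
    """
    # First pass: count the lines.
    with open(path) as source:
        total = sum(1 for _ in source)

    # Integer division gives the size of each chunk (at least one line).
    size = max(1, total // parts)

    names = []
    out = None
    # Second pass: copy lines, opening a new output file every `size`
    # lines until `parts` files have been opened.
    with open(path) as source:
        for index, line in enumerate(source):
            if index % size == 0 and len(names) < parts:
                if out is not None:
                    out.close()
                name = "%s.part-%02d" % (path, len(names) + 1)
                names.append(name)
                out = open(name, "w")
            # Leftover lines (when total is not a multiple of `parts`)
            # simply keep going into the last part.
            out.write(line)
    if out is not None:
        out.close()
    return names
```

Calling `split_into_parts("danish.train")` should then leave files danish.train.part-01 through danish.train.part-10 next to the original. For each fold, one part serves as the test set and the other nine together form the training set.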