I’m trying to open a file and resume reading from the last point read. My files are rather big (20 MB to ~1 GB). After doing some research, it seems that tell() and seek() would be one of the most efficient ways to do this. I’ve tried the following code:
opened = open(filename, "rU")
f1 = csv.reader(opened)
k = []
for line in f1:
    k.append(opened.tell())
When I do this, every value in the list is 8272 (a long). Does that mean that I cannot use this implementation? Is there something I’m missing? Thanks for your help!
I’m running Python 2.7 on Windows 7.
Update
After piecing together everything learned here, plus some trial and error, I ended up with the following code:
opened = open(filename, "rU")
k = [0]
where = 1
for switch in opened:
    where += len(switch) + 1
    f = StringIO.StringIO(switch)
    interesting = csv.reader(f, delimiter=',')
    good_values = interesting.next()
    k.append(where)
return k
This allows the user to know exactly where in the file to go while still being able to parse it according to its format. I’m not completely sure why the extra offset needs to be added on every line (it seems that the newline is not accurately accounted for in len()).
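One likely reason for the extra +1 is that "rU" mode translates the Windows "\r\n" line ending to a single "\n", so len() comes up one byte short per line. A minimal sketch of the same idea with the file opened in binary mode, where len() counts the real bytes and no correction is needed, might look like the following. It is written for Python 3, the names line_offsets and read_row_at are hypothetical, and it assumes UTF-8 text with one CSV record per physical line (no quoted fields containing newlines):

```python
import csv
import io


def line_offsets(path):
    """Return the byte offset at which each line of `path` starts."""
    offsets = []
    position = 0
    # Binary mode: no newline translation, so "\r\n" counts as 2 bytes.
    with open(path, "rb") as handle:
        for raw_line in handle:
            offsets.append(position)
            position += len(raw_line)
    return offsets


def read_row_at(path, offset, encoding="utf-8"):
    """Seek to a recorded offset and parse the single CSV line found there."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        raw_line = handle.readline()
    # Assumption: the file is text in `encoding`, one record per line.
    text = raw_line.decode(encoding)
    return next(csv.reader(io.StringIO(text, newline="")))
```

Each value returned by line_offsets can be passed straight to read_row_at, since both work on raw byte positions.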
It looks like the csv.reader is reading the file in chunks of 8272 bytes, which is why you see this number returned from opened.tell() many times, until, I guess, you have read all the lines from your file in the range 0-8272. After that you will see 8272*2 a few times; the exact number will depend on the length of the lines in the buffer read.
So, basically, in your program, tell() doesn’t give you the offsets of new CSV lines, as you seem to assume. It only tells you the offset of the end of the region of the file that has currently been read into an internal buffer used by the lower-level functions that implement Python’s I/O functions.
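To get a position that can be trusted for resuming, one option is to avoid the read-ahead iteration and call readline() yourself, so that tell() reflects only the bytes actually consumed. The sketch below is one way to do that, not the only one. It is written for Python 3, the name read_rows_from and its parameters are hypothetical, and it assumes UTF-8 text with one CSV record per physical line:

```python
import csv
import io


def read_rows_from(path, start=0, max_rows=None, encoding="utf-8"):
    """Parse CSV rows beginning at byte offset `start`.

    Returns (rows, next_offset); pass next_offset back in to resume later.
    """
    rows = []
    with open(path, "rb") as handle:
        handle.seek(start)
        # readline() instead of `for line in handle`, so that tell()
        # matches the end of the last line we actually processed.
        while max_rows is None or len(rows) < max_rows:
            raw_line = handle.readline()
            if not raw_line:
                break  # end of file
            text = raw_line.decode(encoding)
            rows.extend(csv.reader(io.StringIO(text, newline="")))
        next_offset = handle.tell()
    return rows, next_offset
```

A caller would store next_offset between runs (for example in a small state file) and pass it as start on the next call, which avoids rescanning a large file from the beginning.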