I am trying to read a very simple but fairly large (800 MB) CSV file using the csv library in Python. The delimiter is a single tab and each line consists of some numbers.
Each line is a record, and I have 20681 rows in my file. I had some problems during my calculations using this file: it always stops at a certain row. I got suspicious about the number of rows in the file, so I used the code below to count them:
    tfdf_Reader = csv.reader(open('v2-host_tfdf_en.txt'),delimiter=' ')
    c = 0
    for row in tfdf_Reader:
        c = c + 1
    print c
To my surprise c is printed with the value of 61722!!! Why is this happening? What am I doing wrong?
800 million bytes in the file and 20681 rows means that the average row size is over 38 THOUSAND bytes. Are you sure? How many numbers do you expect in each line? How do you know that you have 20681 rows? That the file is 800 MB?
61722 rows is almost exactly 3 times 20681. Is the number 3 of any significance, e.g. 3 logical sub-sections of each record?
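One way to check both figures independently of the csv module is to ask the operating system for the file size and count raw newline bytes. This is only a sketch for Python 3; the function name file_stats and the chunk size are illustrative assumptions, not part of the original code.

```python
import os


def file_stats(path, chunk_size=1 << 20):
    """Return (size_in_bytes, newline_count) for the file at `path`.

    The file is read in binary chunks, so an 800 MB file is never
    held in memory all at once.
    """
    size = os.path.getsize(path)
    newlines = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            newlines += chunk.count(b"\n")
    return size, newlines


if __name__ == "__main__":
    # File name taken from the question.
    size, newlines = file_stats("v2-host_tfdf_en.txt")
    print("bytes:", size, "newlines:", newlines)
    if newlines:
        print("average bytes per line:", size / newlines)
```

If the newline count is close to 20681 but csv.reader still yields about three times as many rows, the extra rows are coming from how the file is being parsed rather than from the file itself.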
To find out what you really have in your file, don’t rely on what it looks like. Python’s repr() function is your friend.
Are you on Windows? Even if not, always open(filename, 'rb').
If the fields are tab-separated, then don’t put delimiter=" " (whatever is between the quotes appears not to be a tab). Put delimiter="\t".
Try putting some debug statements in your code, like this:
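One possible version, sketched for Python 3 (where the csv module expects a text-mode file opened with newline='' rather than 'rb'). The function name debug_rows and the limit and preview parameters are assumptions made for illustration.

```python
import csv


def debug_rows(path, delimiter="\t", limit=5, preview=80):
    """Print a repr() preview of the first `limit` rows; return the row count."""
    count = 0
    # newline="" lets the csv module handle line endings itself (Python 3).
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            count += 1
            if count <= limit:
                # repr() makes tabs, spaces and stray control characters visible.
                print("row", count, "fields:", len(row), "start:", repr(row)[:preview])
    return count


if __name__ == "__main__":
    # File name taken from the question.
    print("total rows:", debug_rows("v2-host_tfdf_en.txt"))
```

With a tab delimiter, each row should report the number of values you expect per record. A field count of 1, or one that varies wildly from row to row, suggests the delimiter is not what you think it is.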
Note: if you are getting Error: field larger than field limit (131072), that means your file has 128 KB of data with no delimiters.
I’d suspect that:
(a) your file has random junk or a big chunk of binary zeroes appended to it. This should be obvious in a hex editor; it also should be obvious in a TEXT editor. Print all the rows that you do get to help identify where the trouble starts.
or (b) the delimiter is a string of one or more whitespace characters (space, tab), the first few rows have tabs, and the remaining rows have spaces. If so, this should be obvious in a hex editor (or in Notepad++, especially if you do View/Show Symbol/Show all characters). If this is the case, you can’t use csv; you’d need something simple like:
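A minimal sketch of that idea for Python 3, assuming every field is a plain number with no embedded whitespace. The generator name iter_whitespace_rows is hypothetical, and skipping blank lines is an assumption about what you want.

```python
def iter_whitespace_rows(path):
    """Yield each non-blank line as a list of whitespace-separated fields."""
    # Text mode (not 'rb'), so str.split() with no arguments treats any
    # run of spaces and/or tabs as a single separator.
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:  # skip blank lines (an assumption)
                yield fields


if __name__ == "__main__":
    # File name taken from the question.
    count = sum(1 for _ in iter_whitespace_rows("v2-host_tfdf_en.txt"))
    print("rows:", count)
```

A generator is used so that an 800 MB file is processed one line at a time instead of being loaded into a list.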