I’m currently trying to import a .txt file into some proprietary software but appear to continually receive an error. The .txt file is almost 2GB in size and has approximately 56 million lines.
I have spoken to the manufacturers, who say there may be an error in one of the lines. Each line should contain an MD5 hash value (32 characters), so I’m looking to use Python v2.7 to scan the .txt file, check the length of each line, and print the value of any ‘offending’ line.
Here’s what I’ve tried:-
f = open("x.txt")
contents = f.readlines()
f.close()
for line in contents:
    if line(len) == 32:
        continue
    else:
        print line
Unfortunately I receive an error when I try this code:-
  File "<pyshell#30>", line 2, in <module>
    if line(len) == 32:
TypeError: 'str' object is not callable
So I tried the below believing I had to convert the ‘line’ to an integer:-
for line in contents:
    if int(line)(len) == 32:
        continue
    else:
        print line
but that just brought back an error of:-
ValueError: invalid literal for int() with base 10: '000000000000000012452154365298BD'
As said, what I’m looking to do is read every line of the .txt file and if it isn’t a valid MD5 hash value, print the value to screen or even delete the value.
Many thanks
[edit] Turns out it was a schoolboy error. Thanks all
Since your file is 2 GB in size, I would not recommend doing it the way you’re doing it, even if you correct line(len) to len(line). You’re reading the whole file into memory, which is unnecessary and may cause an out-of-memory error if you don’t have enough RAM. Iterating over the file object directly reads one line at a time, so you can check each line’s length (minus the trailing newline) as you go.
If you want to remove all lines with the wrong character count, the easiest way is to write a new, correct file.
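Both steps can be sketched roughly as below. This is a minimal sketch, not the original listing: it assumes one hash per line and treats a line as valid only if it is exactly 32 hexadecimal characters once the line ending is stripped. The names is_valid, find_bad_lines and write_clean_copy are made up for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function
import re

# Assumption: a valid entry is exactly 32 hex digits (an MD5 digest in hex).
MD5_RE = re.compile(r"[0-9a-fA-F]{32}\Z")


def is_valid(line):
    """Return True if the line, without its line ending, is a 32-char hex string."""
    return MD5_RE.match(line.rstrip("\r\n")) is not None


def find_bad_lines(path):
    """Yield (line_number, text) for every line that is not a valid hash.

    The file is iterated lazily, so only one line is held in memory at a time.
    """
    with open(path) as f:
        for number, line in enumerate(f, 1):
            if not is_valid(line):
                yield number, line.rstrip("\r\n")


def write_clean_copy(src, dst):
    """Copy only the valid lines from src to dst; return how many were dropped."""
    dropped = 0
    with open(src) as infile, open(dst, "w") as outfile:
        for line in infile:
            if is_valid(line):
                outfile.write(line)
            else:
                dropped += 1
    return dropped
```

To list the offending lines you could loop with "for number, text in find_bad_lines("x.txt"): print(number, repr(text))"; repr makes stray whitespace or control characters visible. Note that a plain len(line) == 32 test on an unstripped line would reject every line, because each one still carries its trailing newline.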
After the script has run, check that the new file looks good, then move the old file to a backup dir and rename the new one to x.txt.
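If you would rather script the swap than do it by hand, something like the following would work. It is only a sketch: swap_in_clean_file is a hypothetical helper, and the backup directory and the name of the cleaned file are whatever you chose, not anything fixed by the steps above.

```python
import os
import shutil


def swap_in_clean_file(original, cleaned, backup_dir):
    """Move `original` into `backup_dir`, then rename `cleaned` to take its place.

    Returns the path of the backed-up original.
    """
    if not os.path.isdir(backup_dir):
        os.makedirs(backup_dir)
    backup_path = os.path.join(backup_dir, os.path.basename(original))
    # shutil.move also works when the backup dir is on a different filesystem.
    shutil.move(original, backup_path)
    # The original name is now free, so the cleaned file can take it over.
    os.rename(cleaned, original)
    return backup_path
```

Only run this after you have checked the cleaned file; the original is kept in the backup dir, so the swap can be undone by moving it back.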