I have three huge files, each with just two columns, and I need both columns. I want to merge them into one file, which I can then write to a SQLite database.
I used Python and got the job done, but it took more than 30 minutes and hung my system for 10 of those. Is there a faster way using awk or any other Unix tool? A faster way within Python would be great too. My code is below:
'''We have tweets of three months in 3 different files.
Combine them to a single file '''
import sys, os
data1 = open(sys.argv[1], 'r')
data2 = open(sys.argv[2], 'r')
data3 = open(sys.argv[3], 'r')
data4 = open(sys.argv[4], 'w')
for line in data1:
    data4.write(line)
data1.close()
for line in data2:
    data4.write(line)
data2.close()
for line in data3:
    data4.write(line)
data3.close()
data4.close()
The standard Unix way to merge files is cat. It may not be much faster, but it will be faster.
Rather than make a temporary file, you may be able to cat directly to sqlite.
In Python, you will probably get better performance if you copy the file in blocks rather than lines. Use file.read(65536) to read 64k of data at a time, rather than iterating through the files with a for loop.
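The block-copy suggestion could look roughly like the sketch below. The function name concat_files and the BLOCK_SIZE constant are made up for illustration; binary mode is an assumption on my part, chosen because the data is only being copied and never parsed.

```python
BLOCK_SIZE = 65536  # 64 KiB per read, the size suggested in the answer


def concat_files(src_paths, dst_path, block_size=BLOCK_SIZE):
    """Append each source file to dst_path in fixed-size blocks.

    Returns the total number of bytes written.
    """
    total = 0
    # Binary mode (an assumption) skips text decoding and newline handling,
    # since the contents are copied verbatim.
    with open(dst_path, "wb") as dst:
        for path in src_paths:
            with open(path, "rb") as src:
                while True:
                    block = src.read(block_size)
                    if not block:  # empty bytes object means end of file
                        break
                    dst.write(block)
                    total += len(block)
    return total
```

To mirror the original script's arguments, this could be called as concat_files(sys.argv[1:4], sys.argv[4]) after importing sys. The standard library's shutil.copyfileobj(src, dst, length) runs the same kind of read/write loop, so it may be a simpler drop-in for the inner while loop. Whether either is noticeably faster than the line-by-line version depends on the disk and the line lengths, so it is worth timing on the real files.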