I have a lot of archive data in Python dict format that I am working to convert to JSON. I’d like to speed this process up if I can and would like to know if anyone has any suggestions. Currently I:
- take in the gzip-compressed data (spread across multiple files)
- read them line by line
- do a literal_eval with ast
- use json.dumps to create the needed JSON string
- query for the date of the line
- open up a file for appending named by the detected date
- write the string to the file
- repeat the entire process for the next line
Here’s my current working code:
    import gzip
    import ast
    import json
    import glob
    import fileinput
    from dateutil import parser

    line = []
    filestobeanalyzed = glob.glob('./data/*.json.gz')
    for fileName in filestobeanalyzed:
        inputfilename = fileName
        print inputfilename # to keep track of where I'm at
        for line in fileinput.input(inputfilename, openhook=fileinput.hook_compressed):
            line = line.strip();
            if not line: continue
            try:
                line = ast.literal_eval(line)
                line = json.dumps(line)
            except:
                continue
            date = json.loads(line).get('created_at')
            if not date: continue
            date_converted = parser.parse(date).strftime('%Y%m%d')
            outputfilename = gzip.open(date_converted, "a")
            outputfilename.write(line)
            outputfilename.write("\n")
            outputfilename.close()
There must be a more efficient way of doing this; I’m just not seeing it. Does anyone have any suggestions?
First, profile it with line_profiler (http://packages.python.org/line_profiler/); it will give you timings line by line. You can also try running it in parallel: use a multiprocessing Pool and map a list of files to your function.
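A minimal sketch of the Pool idea, written for Python 3 (the code in the question is Python 2) and not tested against your data. The names `convert_file`, `convert_files`, `out_dir` and `workers` are made up for illustration. Each worker writes its own output file, because several processes appending to the same date-named gzip file could interleave their writes; splitting by date is left out here to keep the sketch short.

```python
import ast
import gzip
import json
import os
from multiprocessing import Pool


def convert_file(in_path, out_dir):
    """Convert one gzip file of Python-literal lines to gzip JSON lines.

    Writes to out_dir under the same base name (out_dir must differ from
    the input directory) and returns the number of lines written.
    """
    out_path = os.path.join(out_dir, os.path.basename(in_path))
    written = 0
    with gzip.open(in_path, 'rt') as src, gzip.open(out_path, 'wt') as dst:
        for raw in src:
            raw = raw.strip()
            if not raw:
                continue
            try:
                text = json.dumps(ast.literal_eval(raw))
            except (ValueError, SyntaxError, TypeError):
                continue  # skip lines that cannot be parsed or serialized
            dst.write(text + '\n')
            written += 1
    return written


def convert_files(in_paths, out_dir, workers=4):
    """Map the list of input files onto a pool of worker processes."""
    os.makedirs(out_dir, exist_ok=True)
    with Pool(processes=workers) as pool:
        return pool.starmap(convert_file, [(p, out_dir) for p in in_paths])
```

It would be called as something like `convert_files(glob.glob('./data/*.json.gz'), './converted')` from under an `if __name__ == '__main__':` guard, which multiprocessing needs on platforms that spawn new processes rather than fork.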
That said, I have a feeling IO may be an issue, though again that depends on the size of the files.
I’m still not quite sure what you are trying to achieve, but try to read and write as few times as possible, since IO is a killer. If you need to work on multiple directories, then look at multithreading it, something like this.
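A minimal sketch of that idea, written for Python 3 (the code in the question is Python 2) and not tested against your data. It keeps one output handle open per date instead of reopening the gzip file for every line, and it reads `created_at` from the parsed dict rather than re-parsing the JSON string. The names `convert_directory`, `convert_all`, `day_of`, `out_root` and `workers` are made up for illustration, and giving each input directory its own output directory is an assumption made so that threads never write to the same file. Threads mainly help while waiting on IO; if the parsing itself dominates, a process pool is the more likely win.

```python
import ast
import glob
import gzip
import json
import os
from concurrent.futures import ThreadPoolExecutor


def day_of(created_at):
    """Same date conversion as in the question (dateutil is third-party)."""
    from dateutil import parser
    return parser.parse(created_at).strftime('%Y%m%d')


def convert_directory(in_dir, out_dir, day_of=day_of):
    """Convert every *.json.gz file in in_dir, split by date into out_dir.

    Returns the sorted list of dates that were written.
    """
    handles = {}  # date string -> open gzip handle, opened once per date
    try:
        for path in sorted(glob.glob(os.path.join(in_dir, '*.json.gz'))):
            with gzip.open(path, 'rt') as src:
                for raw in src:
                    raw = raw.strip()
                    if not raw:
                        continue
                    try:
                        record = ast.literal_eval(raw)
                        created = record.get('created_at')
                        text = json.dumps(record)
                    except (ValueError, SyntaxError, TypeError, AttributeError):
                        continue  # skip unparsable or non-dict lines
                    if not created:
                        continue
                    day = day_of(created)
                    if day not in handles:
                        handles[day] = gzip.open(os.path.join(out_dir, day), 'at')
                    handles[day].write(text + '\n')
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)


def convert_all(in_dirs, out_root, workers=4, day_of=day_of):
    """Run one convert_directory job per input directory on a thread pool."""
    jobs = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for in_dir in in_dirs:
            name = os.path.basename(os.path.normpath(in_dir))
            out_dir = os.path.join(out_root, name)
            os.makedirs(out_dir, exist_ok=True)
            jobs.append(pool.submit(convert_directory, in_dir, out_dir, day_of))
        return [job.result() for job in jobs]
```

If the data spans a very large number of distinct dates, the dict of open handles could run into the per-process open-file limit, in which case closing the least recently used handles would be needed.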