I have a lot of archive data in python dict format that I am

Asked: June 5, 2026 (2026-06-05T13:40:34+00:00)

I have a lot of archive data in Python dict format that I am working to convert to JSON. I’m interested in speeding this process up if I can, and would like to know if anyone has any suggestions. Currently I:

take in the gzip-compressed data (in multiple files)
read them line by line
do a literal_eval with ast
use json.dumps to create the needed JSON string
look up the date of the line (the created_at field)
open a file for appending, named by the detected date
write the string to the file
repeat the entire process for the next line

Here’s my current working code:

import gzip
import ast
import json
import glob
import fileinput
from dateutil import parser

line = []

filestobeanalyzed = glob.glob('./data/*.json.gz')

for fileName in filestobeanalyzed:
        inputfilename = fileName
        print inputfilename # to keep track of where I'm at
        for line in fileinput.input(inputfilename, openhook=fileinput.hook_compressed):
                line = line.strip();
                if not line: continue
                try:
                        line = ast.literal_eval(line)
                        line = json.dumps(line)
                except:
                        continue
                date = json.loads(line).get('created_at')
                if not date: continue
                date_converted = parser.parse(date).strftime('%Y%m%d')
                outputfilename = gzip.open(date_converted, "a")
                outputfilename.write(line)
                outputfilename.write("\n")

outputfilename.close()

There must be a more efficient way of doing this; I’m just not seeing it. Does anyone have any suggestions?

1 Answer

Editorial Team · Answer 1 · 2026-06-05T13:40:36+00:00

First, profile it with line_profiler (http://packages.python.org/line_profiler/), which gives you timings line by line. You can also try to parallelize it: use a multiprocessing Pool and map a list of files to your function.
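As a rough illustration of the profiling step, here is a minimal sketch. It swaps in the standard library's cProfile for line_profiler purely so the example is self-contained; cProfile reports per function rather than per line. The names convert_line and profile_conversion and the sample data are hypothetical stand-ins for the loop body in the question, and the code assumes Python 3.

```python
import ast
import cProfile
import io
import json
import pstats


def convert_line(line):
    """Hypothetical stand-in for the loop body: dict literal -> JSON string."""
    return json.dumps(ast.literal_eval(line))


def profile_conversion(lines, top=5):
    """Run convert_line over lines under cProfile and return the report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for line in lines:
        convert_line(line)
    profiler.disable()

    # Collect the report in memory, sorted by cumulative time per function.
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    stats.sort_stats("cumulative").print_stats(top)
    return report.getvalue()


if __name__ == "__main__":
    # Made-up sample lines, only to give the profiler something to measure.
    sample = ["{'created_at': '2012-01-02T03:04:05', 'id': %d}" % i
              for i in range(1000)]
    print(profile_conversion(sample))
```

If parsing (ast.literal_eval) dominates the report, more processes may help; if little time shows up in Python functions at all, the bottleneck is more likely I/O.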

I have a feeling I/O may be the issue, though again that depends on the size of the files.

# Assuming you have one large dictionary spread across multi-part gzip files.
def files_to_be_analyzed(files):
    # Join the parts in order and parse them as a single dict literal.
    lines = ast.literal_eval("".join([gzip.open(file).read() for file in files]))
    date = lines['created_at']
    date_converted = parser.parse(date).strftime('%Y%m%d')
    output_file = gzip.open(date_converted, "a")
    # lines is a dict, so serialize it to JSON before writing.
    output_file.write(json.dumps(lines) + "\n")
    output_file.close()

I'm still not quite sure what you are trying to achieve, but try to read and write as few times as possible, since I/O is a killer. If you need to work on multiple directories, then look at processing them in parallel with multiprocessing, something like this:

import gzip
import ast
import json
import glob
import os
from dateutil import parser
from multiprocessing import Pool

# Assuming you have one large dictionary spread across multi-part gzip files.
def files_to_be_analyzed(files):
    # Join the parts in order and parse them as a single dict literal.
    lines = ast.literal_eval("".join([gzip.open(file).read() for file in files]))
    date = lines['created_at']
    date_converted = parser.parse(date).strftime('%Y%m%d')
    output_file = gzip.open(date_converted, "a")
    # lines is a dict, so serialize it to JSON before writing.
    output_file.write(json.dumps(lines) + "\n")
    output_file.close()

if __name__ == '__main__':
    pool = Pool(processes = 5) # Or however many cores you have
    directories = ['/path/to/this/dire', '/path/to/another/dir']
    # One task per directory; '*.gz' is an assumed pattern for the part files.
    pool.map(files_to_be_analyzed,
             [sorted(glob.glob(os.path.join(path, '*.gz'))) for path in directories])
    pool.close()
    pool.join()
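
The advice to read and write as few times as possible can also be applied to the line-by-line loop in the question. The following is only a sketch under stated assumptions, not a drop-in replacement: it targets Python 3, keeps one output handle open per date instead of reopening a file for every line, and reuses the parsed dict rather than round-tripping through json.loads. The names convert_files and iso_day, the ISO 8601 date format, and the `<YYYYMMDD>.json.gz` output naming are all hypothetical; pass a wrapper around dateutil's parser.parse as to_day if the dates are in another format.

```python
import ast
import gzip
import json
import os
from datetime import datetime


def iso_day(created_at):
    """Assumed date format: ISO 8601. Returns the day as 'YYYYMMDD'."""
    return datetime.fromisoformat(created_at).strftime("%Y%m%d")


def convert_files(paths, out_dir, to_day=iso_day):
    """Convert dict-literal lines to JSON lines, one gzip output file per day."""
    handles = {}  # day string -> open gzip handle, reused across lines
    written = 0
    try:
        for path in paths:
            with gzip.open(path, "rt", encoding="utf-8") as source:
                for line in source:
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        record = ast.literal_eval(line)
                        day = to_day(record["created_at"])
                    except (ValueError, SyntaxError, KeyError, TypeError):
                        continue  # skip malformed or undated lines
                    out = handles.get(day)
                    if out is None:
                        # First line seen for this day: open its file once.
                        name = os.path.join(out_dir, day + ".json.gz")
                        out = handles[day] = gzip.open(name, "at", encoding="utf-8")
                    out.write(json.dumps(record) + "\n")
                    written += 1
    finally:
        # Close every output exactly once, even if an error interrupts the run.
        for out in handles.values():
            out.close()
    return written
```

A call might look like `convert_files(glob.glob('./data/*.json.gz'), './out')`. If the archive spans a very large number of distinct dates, the number of simultaneously open files could hit the operating system's limit, so some form of handle eviction may be needed in that case.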
