I’m running multiple Python processes like this:
find /path/to/logfiles/*.gz | xargs -n1 -P4 python logparser.py
and the output is occasionally scrambled.
The output stream is unbuffered and the size of each write is smaller than the system-defined PIPE_BUF of 512 bytes (OS X 10.8.2, Python 2.7.2), so I believe the writes should be atomic, yet the output is occasionally scrambled. I must be missing something, and any suggestions would be appreciated.
Thanks.
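The assumption in the question can be made concrete with a short sketch. It is written for Python 3 and is not the asker's actual code; the helper names `format_row` and `write_row_atomically` are hypothetical. Each row is rendered into one bytes object and handed to the OS with a single `os.write` call, and anything larger than PIPE_BUF is refused. Note that POSIX only promises that writes of at most PIPE_BUF bytes are not interleaved when the descriptor is a pipe or FIFO; the sketch does not settle what happens when stdout is a terminal or a regular file.

```python
import csv
import io
import os
import select

# POSIX guarantees PIPE_BUF is at least 512; fall back to that if the
# constant is not exposed on this platform.
PIPE_BUF = getattr(select, "PIPE_BUF", 512)

FIELDS = ['remote_addr', 'time', 'request_time', 'request', 'status']


def format_row(record):
    """Render one record as a single CSV line, encoded to bytes."""
    buf = io.StringIO(newline='')
    writer = csv.DictWriter(buf, FIELDS, quoting=csv.QUOTE_ALL)
    writer.writerow({field: record.get(field, '') for field in FIELDS})
    return buf.getvalue().encode('utf-8')


def write_row_atomically(fd, record):
    """Emit one row with a single write call; reject rows over PIPE_BUF."""
    data = format_row(record)
    if len(data) > PIPE_BUF:
        raise ValueError('row is %d bytes, larger than PIPE_BUF (%d)'
                         % (len(data), PIPE_BUF))
    # One os.write call means one write(2) system call for the whole row.
    return os.write(fd, data)
```

Calling `write_row_atomically(sys.stdout.fileno(), record)` bypasses Python's file object entirely, which removes any doubt about how many system calls a single `stream.write()` turns into.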
A simplified skeleton of the script is:
import argparse
import csv
import gzip
import sys


class UnbufferedWriter(object):
    """Unbuffered Writer from
    http://mail.python.org/pipermail/tutor/2003-November/026645.html
    """

    def __init__(self, stream):
        self.stream = stream

    def write(self, data):
        # Flush after every write so nothing lingers in the stream's buffer.
        self.stream.write(data)
        self.stream.flush()

    def __getattr__(self, attr):
        # Delegate every other attribute to the wrapped stream.
        return getattr(self.stream, attr)


def parse_records(infile):
    # Transparently decompress gzipped log files.
    if infile.name.endswith('.gz'):
        lines = gzip.GzipFile(fileobj=infile)
    else:
        lines = infile
    for line in lines:
        # match lines with regex and filter out on some conditions
        # (elided in this skeleton; line_as_dict is the resulting dict).
        yield line_as_dict


def main(infile, outfile):
    fields = ['remote_addr', 'time', 'request_time', 'request', 'status']
    writer = csv.DictWriter(outfile, fields, quoting=csv.QUOTE_ALL)
    for record in parse_records(infile):
        row_as_dict = dict(
            remote_addr=record.get('remote_addr', ''),
            time=record.get('time', ''),
            request_time=record.get('request_time', ''),
            request=record.get('request', ''),
            status=record.get('status', '')
        )
        writer.writerow(row_as_dict)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('infile', nargs='?', type=argparse.FileType('r'), default=sys.stdin)
    # bufsize 0 requests an unbuffered output file (Python 2).
    parser.add_argument('outfile', nargs='?', type=argparse.FileType('w', 0), default=sys.stdout)
    pargs = parser.parse_args()
    pargs.outfile = UnbufferedWriter(pargs.outfile)
    main(pargs.infile, pargs.outfile)
You might want to consider using GNU Parallel. By default, the output is buffered until the instance has completed running:
I believe the best way to run your script is via:
or
You can specify the number of processes to run using the -j flag, e.g., -j4.
The nice thing about Parallel is that it supports Cartesian products of input arguments. For example, if you had some additional arguments that you wanted to iterate through for each file, you can use:
This will result in running the following across multiple processes:
Good luck!
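If installing GNU Parallel is not an option, the "buffer each job's output until it finishes" behaviour described above can be roughly approximated in Python. This is only a sketch of the idea, written for Python 3, and not how Parallel itself is implemented; `run_grouped` is a hypothetical helper name. Each child process writes into its own pipe, and only the parent writes to the shared output, one finished job at a time, so lines from different jobs cannot interleave.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_grouped(commands, max_workers=4, out=sys.stdout):
    """Run commands concurrently; emit each job's stdout in one piece."""

    def run(cmd):
        # Capture the child's entire stdout instead of sharing ours.
        return subprocess.run(cmd, stdout=subprocess.PIPE, check=True).stdout

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run, cmd) for cmd in commands]
        # Only this thread writes to `out`, in order of job completion.
        for future in as_completed(futures):
            out.write(future.result().decode('utf-8', errors='replace'))
            out.flush()
```

A possible invocation for the script in the question would be `run_grouped([[sys.executable, 'logparser.py', path] for path in glob.glob('/path/to/logfiles/*.gz')])`. The trade-off is that each job's full output is held in memory until that job exits.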