I have some JSON-formatted log files that I am copying to S3 so I can run Hive queries on them using Elastic MapReduce. The script I use to copy the log files to S3 is written in Python.
Every once in a while I encounter a file with an incomplete line, typically at the end of the file. This causes any Hive queries that need that file to fail. I’ve been manually fixing the files by removing the bad line, but I’d like to integrate this step into my Python script to prevent these failures.
Here’s an example of the type of file I’m working with:
{"logLine":{"browserName":"FireFox","userAgent":"Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0"}}
{"logLine":{"browserName":"Pre","userAgent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.24 (KHTML, like Gecko; Google Web Preview) Chrome/11.0.696 Safari/534.24"}}
{"logLine":{"browserName":"Internet Explorer","userAgent":"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1
In that case I want to remove the last line since it’s incomplete. I know it’s incomplete because it’s missing the end of line character(s), and also because it’s not valid JSON due to the missing end quote and curly braces.
Is there an easy way to identify and remove that line from the file using Python?
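Python has a json module in its standard library. Its parser raises an exception (ValueError, or its subclass json.JSONDecodeError on Python 3.5 and later) if the input isn’t valid JSON. To check the last line, you could try to parse it and remove it from the file if parsing fails.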
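A minimal sketch of that check, assuming each log file holds one JSON object per line, is UTF-8 encoded, and is small enough to read into memory. The function name strip_incomplete_last_line is made up for illustration. It tries to parse only the final line and truncates the file in place if parsing fails:

```python
import json


def strip_incomplete_last_line(path):
    """Drop the final line of a JSON-lines file if it is not valid JSON.

    Returns True if a line was removed, False otherwise.
    Assumption: the whole file fits in memory.
    """
    with open(path, "rb+") as f:
        data = f.read()
        # Ignore trailing newline characters so we look at the last real line.
        body = data.rstrip(b"\r\n")
        if not body:
            return False
        # Byte offset where the last line starts (0 if there is only one line).
        start = body.rfind(b"\n") + 1
        last_line = body[start:]
        try:
            json.loads(last_line.decode("utf-8"))
        except ValueError:
            # Covers json.JSONDecodeError and UnicodeDecodeError, both of
            # which are ValueError subclasses. Cut the file at the start
            # of the bad line.
            f.seek(start)
            f.truncate()
            return True
    return False
```

You could call this on each file just before uploading it to S3. Note that it only inspects the last line, so a broken line in the middle of a file would still get through, and it modifies the file in place, so keep a copy if you need the original.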