I’ve got a problem at work that requires me to import some MASSIVE tab-separated values files (think 8-15 GB .txt files) into my PostgreSQL DB, but I’ve run into a problem with the way the data was formatted in the first place. Basically, in the data as we are given it (and unfortunately we cannot get it in a better format), some backslashes appear and are followed by a return/new line.
So, there are lines (rows of data, tab-delim) that get chopped up into multiple lines, where the last character of line n is a \ , and the first character of line n+1 is a tab. Usually line n will be broken up into 1-3 additional lines (e.g. line n ends in a “\”, lines n+1 and n+2 start with a tab and end with a “\”, and line n+3 starts with a tab).
I need to write a script that can work with these huge files (this will run on a Linux server with 192 GB of RAM) to look for the lines that begin with a tab, and then remove the return (and “\” wherever it exists) and save the text file.
To recap, the customer’s logging program splits the original line N into lines n, n+1, and sometimes n+2 and n+3 (depending on how many \ characters appear in line N), and I need to write a Python script to recreate the original line N.
This is based on @user665637’s good answer.
Changes:
Uses “raw strings” for the regular expressions, which are easier to read.
Takes an output filename from the command-line arguments and writes to that file. Prints a message and exits if the user provides the wrong number of arguments. When we unpack `sys.argv` to get the arguments, we use the convention of using the variable name `_` for parts we don’t care about.
Does not strip line endings, so the output file will have the same sort of line-endings as the input file. (When joining lines, of course it strips the line endings to do the join.)
Does not filter out blank lines from the input. It’s a little bit tricky, but by making an iterator and calling `next()` on it, it gets the first input line before starting the loop; thus `s` starts out with a valid value instead of `None`, and we don’t have to test it each time to see whether to print it or not. The original `if lastLine:` test, on an input line that was stripped, would not only protect against the initial `None` value of `lastLine` but would also filter out any blank lines from the input.

If you have to use this with Python 3.0 or Python 2.6, you can’t have a `with` statement that does two `open()` calls; but you can just turn it into two nested `with` statements, each of which does a single `open()`.
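
A minimal sketch of a script along the lines described above. It is not the original answer's code: the names `join_continued_lines`, `main`, `STARTS_WITH_TAB` and `TRAILING_BACKSLASH_EOL` are made up for illustration, and it assumes Python 3, that a continuation line is any line starting with a tab, that only a trailing backslash (plus the line ending) should be removed from the line being continued, and that the file decodes with the default text encoding.

```python
import re
import sys

# Raw strings keep the backslashes in the patterns readable.
# Assumption: a continuation line is one that starts with a tab.
STARTS_WITH_TAB = re.compile(r'^\t')
# Assumption: only an optional trailing backslash plus the line ending is removed.
TRAILING_BACKSLASH_EOL = re.compile(r'\\?\r?\n$')


def join_continued_lines(lines):
    """Yield logical lines, merging each tab-led line into the one before it."""
    lines = iter(lines)
    try:
        s = next(lines)  # prime s so it never starts out as None
    except StopIteration:
        return  # empty input: nothing to yield
    for line in lines:
        if STARTS_WITH_TAB.match(line):
            # Drop the backslash and line ending of the pending line, then append.
            s = TRAILING_BACKSLASH_EOL.sub('', s) + line
        else:
            yield s  # the pending line is complete, blank lines included
            s = line
    yield s


def main(argv):
    if len(argv) != 3:
        sys.stderr.write('Usage: %s <input_file> <output_file>\n' % argv[0])
        return 1
    _, in_name, out_name = argv  # _ holds the script name, which we ignore
    # newline='' turns off newline translation so line endings pass through as-is.
    with open(in_name, newline='') as f_in, open(out_name, 'w', newline='') as f_out:
        for s in join_continued_lines(f_in):
            f_out.write(s)
    return 0
```

To run it as a script, end the file with `sys.exit(main(sys.argv))`. Because the input is read one line at a time, memory use depends on the longest joined line rather than on the file size, so the 8-15 GB inputs should not need to fit in RAM.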