I am new to Python and stuck on how to do this.
I have a very large text file (about 4 GB) that contains error messages. Each line in the file represents one message. I need to filter out several columns and replace the space character between the remaining ones with |.
Example:
input:
83b14af0-949b-71e0-18d5-0ad781020000 40ba8352-8dd2-71dc-12b8-0ad781020000 1 -1407714483 20 COLG-GRA-617-RD1.oss 1 181895426 12 oss-ap-1.oss 0 0 48 0 0 0 1307845644 1307845647 0 2 12 0 0 0 0 0 12 0 0 0 0 0 1307845918 3 OpC 6 opcecm 9 SNMPTraps 8 IBB_COLG 4 ATM0 0 0 0 69 Cisco Agent Interface Up (linkUp Trap) on interface ATM0 --Sev Normal 372 Generic: 3; Specific: 0; Enterprise: .1.3.6.1.4.1.9.1.569;
output:
83b14af0-949b-71e0-18d5-0ad781020000 | 40ba8352-8dd2-71dc-12b8-0ad781020000 | COLG-GRA-617-RD1.oss | 1307845644 | 1307845647 |1307845918 | Cisco Agent Interface Up (linkUp Trap) on interface ATM0 | Normal 372 | Generic: 3 | Specific: 0 | Enterprise: .1.3.6.1.4.1.9.1.569
I would really appreciate any help.
Thank you
Your input file format is annoying. We could split the input on white space, but some of the fields you want to capture contain white space. We could split the input on column positions, but I am not certain that every field is always the same length; it seems likely that the numbers will vary in their number of digits. So the best solution should involve regular expressions.
A single regular expression to parse this whole line would be pretty mind-numbing to write and to understand. But we can build up the pattern from shorter patterns. I think the result is pretty easy to understand. Also, if the file format changes or the fields you want to capture ever change, I think you can pretty easily change this.
Note that we use the Python “string repetition” operator, `*`, to repeat the shorter patterns. If `c` is the pattern string that captures one word and we have 2 words we want to recognize and capture, we can use `c*2` to repeat the capture pattern twice.
In your example of the desired output, you had some extra white space. I wrote the patterns to not capture any white space, but if you actually want the white space you can edit the patterns as you like.
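The building-block idea can be sketched roughly as follows. This is a sketch under assumptions, not a tested production parser: the names `c`, `s`, and `filter_line` are illustrative, the skip counts (3, 10, 14, 14) come from counting fields in the single sample line in the question, and it assumes that skipped fields never contain white space and that every message contains the `--Sev` marker.

```python
import re

# Building blocks (illustrative names).
c = r'(\S+)\s+'   # capture one token, then consume the white space after it
s = r'\S+\s+'     # match one token plus trailing white space, without capturing

pattern = re.compile(
    '^'
    + c * 2                    # the two leading IDs
    + s * 3
    + c                        # host name
    + s * 10
    + c * 2                    # first two timestamps
    + s * 14
    + c                        # third timestamp
    + s * 14
    + r'(.*?)\s+--Sev\s+'      # message text, up to the --Sev marker
    + r'(\S+\s+\S+)\s+'        # severity and the number that follows it
    + r'(Generic:[^;]*);\s*'
    + r'(Specific:[^;]*);\s*'
    + r'(Enterprise:[^;]*);\s*$'
)


def filter_line(line, sep='|'):
    """Return the captured fields joined by sep, or None if the line does not match."""
    m = pattern.match(line)
    return sep.join(m.groups()) if m else None


# The sample line from the question, split across literals for readability.
sample = (
    "83b14af0-949b-71e0-18d5-0ad781020000 40ba8352-8dd2-71dc-12b8-0ad781020000 "
    "1 -1407714483 20 COLG-GRA-617-RD1.oss 1 181895426 12 oss-ap-1.oss "
    "0 0 48 0 0 0 1307845644 1307845647 0 2 12 0 0 0 0 0 12 0 0 0 0 0 "
    "1307845918 3 OpC 6 opcecm 9 SNMPTraps 8 IBB_COLG 4 ATM0 0 0 0 69 "
    "Cisco Agent Interface Up (linkUp Trap) on interface ATM0 --Sev Normal 372 "
    "Generic: 3; Specific: 0; Enterprise: .1.3.6.1.4.1.9.1.569;"
)

print(filter_line(sample))
```

Passing `sep=' | '` instead of the default gives output padded with spaces around each `|`, as in the question's desired output.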
If you don’t know about regular expressions, you should read the documentation for the Python `re` module. Briefly, the part of the pattern enclosed in parentheses will be captured, and other parts will match but not be captured. `\s` matches white space, and `\S` matches non-white space. `+` in a pattern means “1 or more” and `*` means “0 or more”. `^` and `$` match the beginning and end of the string, respectively.
With the pattern written, tested, and debugged, it’s very simple to write the program to actually process the file.
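A minimal sketch of such a program, under some assumptions: `filter_file` is an illustrative name, `pattern` is any compiled regular expression whose capture groups are the fields to keep, and lines that do not match are simply skipped and counted (you may prefer to log them instead).

```python
def filter_file(in_path, out_path, pattern, sep='|'):
    """Copy the captured fields of each matching line to out_path.

    Returns (lines_written, lines_skipped).
    """
    written = skipped = 0
    # Both files are opened in one with statement and closed automatically.
    with open(in_path) as f_in, open(out_path, 'w') as f_out:
        # Iterating over the file object yields one line at a time,
        # so even a multi-gigabyte file is never held in memory at once.
        for line in f_in:
            m = pattern.match(line)
            if m:
                f_out.write(sep.join(m.groups()) + '\n')
                written += 1
            else:
                skipped += 1
    return written, skipped
```

It would be called as, for example, `filter_file('errors.txt', 'filtered.txt', pattern)`, where both file names are placeholders.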
This will read the file one line at a time, process the line, and write the processed line to the output file.
Note that I’m opening both files in a single `with` statement on one line. This works with any recent Python (2.7, or 3.1 and later) but doesn’t work on 2.5, 2.6, or 3.0.
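On interpreters that lack the combined form, nesting one `with` block inside another has the same effect. A minimal sketch, where `process_lines` and `transform` are hypothetical names chosen for illustration; on Python 2.5 the `with` statement itself also needs `from __future__ import with_statement` at the top of the file.

```python
def process_lines(in_path, out_path, transform):
    """Write transform(line) for each input line; skip lines where it returns None."""
    # One context manager per with statement, nested instead of combined.
    with open(in_path) as f_in:
        with open(out_path, 'w') as f_out:
            for line in f_in:
                result = transform(line)
                if result is not None:
                    f_out.write(result + '\n')
```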