I am new to python and stuck on how to do that. I have

Asked: June 1, 2026 (2026-06-01T14:29:40+00:00)

I am new to Python and stuck on how to do this.
I have a very large text file (about 4 GB) that contains error messages. Each line in the file represents one message. I need to keep only certain columns and separate them with | instead of spaces.
Example:

input:
83b14af0-949b-71e0-18d5-0ad781020000 40ba8352-8dd2-71dc-12b8-0ad781020000 1 -1407714483 20 COLG-GRA-617-RD1.oss 1 181895426 12 oss-ap-1.oss 0 0 48 0 0 0 1307845644 1307845647 0 2 12 0 0 0  0 0 12 0 0 0  0 0 1307845918 3 OpC 6 opcecm 9 SNMPTraps 8 IBB_COLG 4 ATM0 0  0  0  69 Cisco Agent Interface Up (linkUp Trap) on interface ATM0 --Sev Normal 372 Generic: 3; Specific: 0; Enterprise: .1.3.6.1.4.1.9.1.569;
output:
83b14af0-949b-71e0-18d5-0ad781020000 | 40ba8352-8dd2-71dc-12b8-0ad781020000 | COLG-GRA-617-RD1.oss | 1307845644 | 1307845647 |1307845918 | Cisco Agent Interface Up (linkUp Trap) on interface ATM0 | Normal 372 | Generic: 3 | Specific: 0 | Enterprise: .1.3.6.1.4.1.9.1.569

I would really appreciate any help.

Thank you

1 Answer

Editorial Team · Answer 1 · 2026-06-01T14:29:41+00:00

Your input file format is annoying. We could split the input on white space, but some of the fields you want to capture should contain white space. We could split the input on column numbers, but I am not certain that every string is always the same length; it seems likely that the numbers will vary in number of digits. So the best solution should involve regular expressions.
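To see why a plain split on white space is not enough, here is a small sketch using a shortened, made-up line (an assumption for illustration, not the real record layout). The free-text message is broken into separate words, so its end cannot be found by position alone.

```python
# Shortened, hypothetical sample line; the real records have many more fields.
sample = "id-1 id-2 69 Cisco Agent Interface Up --Sev Normal 372"

fields = sample.split()

# The message "Cisco Agent Interface Up" is now scattered across four items,
# and a longer or shorter message would shift every field after it.
print(fields)
print(len(fields))
```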

A single regular expression to parse this whole line would be pretty mind-numbing to write and to understand. But we can build up the pattern from shorter patterns. I think the result is pretty easy to understand. Also, if the file format changes or the fields you want to capture ever change, I think you can pretty easily change this.

Note that we use the Python “string repetition” operator, *, to repeat the shorter patterns. If we have 2 words we want to recognize and capture, we can use c*2 to repeat the capture pattern twice.
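As a quick illustration of the repetition trick, here is a minimal sketch on a toy string. The names `two_words` and `demo_pat` and the sample text are made up for this demo; `c` and `s` are the same capture and skip patterns used in the answer's code.

```python
import re

c = r'(\S+)\s+'   # capture one word, then consume the white space after it
s = r'\S+\s+'     # match one word without capturing it

# Repeating a pattern string simply concatenates copies of it.
two_words = c * 2
print(two_words == r'(\S+)\s+(\S+)\s+')

# Capture the first two words, skip the third, capture the fourth.
demo_pat = re.compile(c * 2 + s + r'(\S+)')
m = demo_pat.match("alpha beta gamma delta")
print(m.groups())
```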

In your example of the desired output, you had some extra white space. I wrote the patterns to not capture any white space, but if you actually want the white space you can edit the patterns as you like.

If you don’t know about regular expressions, you should read the documentation for the Python re module. Briefly, the part of a pattern enclosed in parentheses is captured; the other parts must match but are not captured. \s matches white space, and \S matches non-white space. + in a pattern means “1 or more” of the preceding item and * means “0 or more”. ^ and $ match the beginning and end of the input string, respectively.
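If those pieces are new to you, a tiny experiment may help. This is only a sketch on made-up strings, showing capturing versus non-capturing parts and the difference between + and *.

```python
import re

# Parentheses capture; the text outside them must match but is not returned.
m = re.match(r'^\s*(\S+)\s+\S+\s+(\S+)\s*$', "  first   ignored last \n")
print(m.groups())

# '+' needs at least one character, '*' also accepts none.
no_space_plus = re.match(r'\s+x', "x")
no_space_star = re.match(r'\s*x', "x")
print(no_space_plus is None)
print(no_space_star is not None)
```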

import re

# Define patterns we want to recognize.

c = r'(\S+)\s+'  # a word we want to capture
s = r'\S+\s+'  # a word we want to skip
mesg = r'(\S.*\S)\s+--Sev\s+'  # mesg to capture; terminated by string '--Sev'
w2 = r'(\S+\s+\S+)\s+'  # two words separated by some white space
w2semi = r'(\S+\s+\S+)\s*;\s+'  # two words terminated by a semicolon
tail = r'(.*\S)\s*;'

# Join together the above patterns to make one giant pattern that parses
# the input.
s_pat = ( r'^\s*' + 
    c*2 + s*3 + c*1 + s*10 + c*2 + s*14 + c*1 + s*14 +
    mesg + w2 + w2semi*2 + tail +
    r'\s*$')

# Pre-compile the pattern for speed.
pat = re.compile(s_pat)

# Test string and the expected output result.
s_input = "83b14af0-949b-71e0-18d5-0ad781020000 40ba8352-8dd2-71dc-12b8-0ad781020000 1 -1407714483 20 COLG-GRA-617-RD1.oss 1 181895426 12 oss-ap-1.oss 0 0 48 0 0 0 1307845644 1307845647 0 2 12 0 0 0  0 0 12 0 0 0  0 0 1307845918 3 OpC 6 opcecm 9 SNMPTraps 8 IBB_COLG 4 ATM0 0  0  0  69 Cisco Agent Interface Up (linkUp Trap) on interface ATM0 --Sev Normal 372 Generic: 3; Specific: 0; Enterprise: .1.3.6.1.4.1.9.1.569;"
s_correct = "83b14af0-949b-71e0-18d5-0ad781020000|40ba8352-8dd2-71dc-12b8-0ad781020000|COLG-GRA-617-RD1.oss|1307845644|1307845647|1307845918|Cisco Agent Interface Up (linkUp Trap) on interface ATM0|Normal 372|Generic: 3|Specific: 0|Enterprise: .1.3.6.1.4.1.9.1.569"

# pat.match() returns a match object, or None if the string does not match
m = pat.match(s_input)
# m.groups() returns sequence of captured strings; join with '|'
s_output = '|'.join(m.groups())

# sanity check
if s_correct == s_output:
    print("excellent")
else:
    print("bogus")

# prints: excellent

With the pattern written, tested, and debugged, it’s very simple to write the program to actually process the file.

# use the pattern defined above, named "pat"
# input_file and output_file are the paths of your input and output files
with open(input_file, "r") as f_in, open(output_file, "w") as f_out:
    for line_num, line in enumerate(f_in, 1):
        m = pat.match(line)
        if m is None:
            # report the bad line and keep going
            print("unable to parse line %d: %s" % (line_num, line))
            continue
        f_out.write('|'.join(m.groups()) + '\n')

This will read the file one line at a time, process the line, and write the processed line to the output file.
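One way to try the loop without a 4 GB file is to run it on in-memory file objects. The sketch below uses a much simpler stand-in pattern and a hypothetical helper named `convert`, both made up for illustration. The real program would pass the full compiled pattern and real file handles instead.

```python
import io
import re

# Stand-in pattern for the demo only: capture two words, skip one, capture the rest.
demo_pat = re.compile(r'^\s*(\S+)\s+(\S+)\s+\S+\s+(.*\S)\s*$')

def convert(f_in, f_out, pattern):
    """Write matching lines to f_out as '|'-joined fields; return failed line numbers."""
    failed = []
    for line_num, line in enumerate(f_in, 1):
        m = pattern.match(line)
        if m is None:
            failed.append(line_num)
            continue
        f_out.write('|'.join(m.groups()) + '\n')
    return failed

# Three toy lines; the second one is too short to match.
f_in = io.StringIO("a b skip some text\nbad\nc d skip more\n")
f_out = io.StringIO()
failed = convert(f_in, f_out, demo_pat)
print(f_out.getvalue())
print(failed)
```

Because the input is consumed one line at a time, memory use stays small regardless of file size.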

Note that I’m opening both files in a single with statement. This works in Python 2.7 and in 3.1 or later, but not in 2.5, 2.6, or 3.0.
