I have a text file in the format of:
aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;
Where “bcd” can be any length of any characters, excluding ; or :
What I want to do is print the text file in the format of:
aaa: bcd;bcd;bcddd;
aaa: bcd;bcd;bcd;
-etc-
My method of approach to this problem was to isolate a pattern of “;...:” and then reprint this pattern without the initial ;
I concluded I would have to use awk’s ‘gsub’ to do this, but I have no idea how to reproduce the matched text, nor how to print it again with a newline character inserted one character into the match.
Is this possible?
If not, can you please direct me in a way of tackling it?
We can’t quite be sure of the variability in the `aaa` or `bcd` parts; presumably, each one could be almost anything.

You should probably be looking for:
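A pattern consistent with the explanation further down (a word, a colon, then one or more word-plus-semicolon groups) is `[^:;]+:([^:;]+;)+`. As a quick check, this sketch uses awk's `match()` to pull the first such unit out of the sample line from the question; the function name `first_unit` is hypothetical and only there for illustration:

```shell
# Sketch: extract the first "word: word; word; ..." unit from each line.
# first_unit is a hypothetical helper name, not part of the original answer.
first_unit() {
  awk '{
    # match() sets RSTART and RLENGTH to the leftmost-longest match
    if (match($0, /[^:;]+:([^:;]+;)+/))
      print substr($0, RSTART, RLENGTH)
  }' "$@"
}

printf '%s\n' 'aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;' | first_unit
# prints: aaa: bcd;bcd;bcddd;
```
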
That makes up the unit you want to match.
With that, you can substitute what was found by the same followed by a newline, and then print the result. The only trick is avoiding superfluous newlines.
Example script:
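A minimal sketch of such a script, pieced together from the explanation in this answer; the wrapper name `split_units` and the choice to read standard input are assumptions made for illustration:

```shell
# Hypothetical wrapper around the awk one-liner; it reads the files named
# as arguments, or standard input when none are given.
split_units() {
  awk '{
    gsub(/[^:;]+:([^:;]+;)+/, "&\n")   # append a newline to every unit
    sub(/\n+$/, "")                    # drop the newline(s) left at the end
    print                              # print supplies the final newline
  }' "$@"
}

printf '%s\n' 'aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;' | split_units
```
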
Example output
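For the sample line from the question, the substitution described in this answer should produce output along these lines (the second line has no space after the colon because the input has none there):

```
aaa: bcd;bcd;bcddd;
aaa:bcd;bcd;bcd;
```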
Paraphrasing the question in a comment:
As I tried to explain at the top, you’re looking for what we can call ‘words’, meaning sequences of characters that are neither a colon nor a semicolon. In the regex, that is `[^:;]+`, meaning one or more (`+`) of the negated character class: one or more non-colon, non-semicolon characters.

Let’s pretend that spaces in a regex are not significant. We can space out the regex like this:
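Going by the description that follows (slashes at the ends, a word, a colon, then a parenthesized word-plus-semicolon group tagged with `+`), the spaced-out form would look something like this; the spaces are for reading only and are not part of the real pattern:

```
/ [^:;]+ : ( [^:;]+ ; )+ /
```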
The slashes simply mark the ends, of course. The first cluster is a word; then there’s a colon. Then there is a group enclosed in parentheses, tagged with a `+` at the end. That means that the contents of the group must occur at least once and may occur any number of times more than that. What’s inside the group? Well, a word followed by a semicolon. It doesn’t have to be the same word each time, but there does have to be a word there. If something can occur zero or more times, then you use a `*` in place of the `+`, of course.

The key to the regex stopping is that the `aaa:` in the middle of the first line does not consist of a word followed by a semicolon; it is a word followed by a colon. So, the regex has to stop before that because the `aaa:` doesn’t match the group. The `gsub()` therefore finds the first sequence, and replaces that text with the same material and a newline (that’s the `"&\n"`, of course). It (`gsub()`) then resumes its search directly after the end of the matched material, and, lo and behold, there is a word followed by a colon and some words followed by semicolons, so there’s a second match to be replaced with its original material plus a newline.

The last match runs to the end of the line, so `gsub()` leaves a newline at the end of `$0` as well (awk itself strips the record separator when it reads the line, so that newline comes from the replacement, not from the input). Therefore, without the `sub()` to remove trailing newlines, the `print` (implicitly of `$0` with a newline) generated a blank line I didn’t want in the output, so I removed the extraneous newline(s).
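To see why the cleanup step matters, this sketch counts the lines awk prints for a single input line, with and without the `sub()` call. The helper names are hypothetical and the sample line is the one from the question:

```shell
# Compare the output line count with and without the trailing-newline cleanup.
# All three helper names are hypothetical.
without_cleanup() {
  awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); print }'
}
with_cleanup() {
  awk '{ gsub(/[^:;]+:([^:;]+;)+/, "&\n"); sub(/\n+$/, ""); print }'
}
count_lines() {
  awk 'END { print NR }'
}

line='aaa: bcd;bcd;bcddd;aaa:bcd;bcd;bcd;'
printf '%s\n' "$line" | without_cleanup | count_lines   # 3: two units plus a blank line
printf '%s\n' "$line" | with_cleanup | count_lines      # 2: just the two units
```
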