I’m trying to figure out the syntax of both the sed command and perl script:
sed 's/^EOR:$//' INPUTFILE |
perl -00 -ne '/
TAGA01:\s+(.*?)\n
.*
TAGCC08:\s+(.*?)\n
# and so on
/xs && print "$1 $2\n"'
Why is there a circumflex ^ in the sed command? The third slash / will replace all instances of EOR: with a blank line, correct?
I understand some of the Perl script. Looking at perlrun, -00 will slurp the stream in paragraph mode and -n starts a while <> loop.
Why is there the first slash / next to the apostrophe? The command searches for TAGXXXX:, but I am not sure what \s+(.*?) does. Does that put whatever is after the tag into a variable? How about the .* between the tag searches? What does /xs do? What do the $1 and $2 refer to in the print line?
This was tough to find online, and if someone could kick me in the right direction, I’d appreciate it.
The circumflex ^ is regex for “start of line”, and $ is regex for “end of line”; so sed will only touch lines which contain exactly “EOR:” and nothing else. The substitution replaces the match with nothing, so each such line becomes an empty line (the line itself is not deleted), and those empty lines are what separate the records for Perl’s paragraph mode.
The Perl script is basically
perl -00 -ne '/(re)g(ex)/ && print "re ex\n"'
with a big ole regex instead of the simple placeholder I put here. In particular, the /x modifier allows you to split the regex over several lines, because it makes Perl ignore whitespace and # comments inside the pattern. So the first / is the start of the regex and the final / is the end of the regex and the lines in between form the regex together.
The /s modifier changes how Perl interprets . in a regex; normally it will match any character except newline, but with this option, it includes newlines as well. This means that .* can match multiple lines.
\s matches a single whitespace character; \s+ matches as many whitespace characters as possible, but there has to be at least one. (.*?) matches an arbitrary length of string; the dot matches any character, the asterisk says zero or more of any character, and the question mark modifies the asterisk repetition operator to match as short a string as possible instead of as long a string as possible. The parentheses cause the matched text to be captured in a back reference; the backrefs are named $1, $2, etc, as many as there are capturing groups; the numbers correspond to the order of the opening parenthesis (so if you apply (a(b)) to the string “ab”, $1 will be “ab” and $2 will be “b”).
Finally, \n matches a literal newline. So the (.*?) non-greedy match will match up to the first newline, i.e. the tail of the line on which the TAGsomething was found. (I imagine these are gene sequences, not “tags”?)
It doesn’t really make sense to run sed separately; Perl would be quite capable of removing the EOR: lines before attempting to match the regex.