I have a sed(1) script that applies a long sequence of step-by-step substitutions to an
input stream, and it works well for the task itself. What I need now is to restrict these
operations to the inside of “/”-quoted strings, which may span multiple lines. The input
is a plain text file containing multiline “/”-quoted strings on which I need to run my
sequence of s/// commands. I know this is quite hard to achieve in sed(1), but
I still hope somebody knows how to do it. The script I have so far, which works correctly only when each quoted string sits on a single line, follows.
The sed(1) “tricks” are at the beginning and at the
end of the script; the rest is just a sequence of s/// expressions, and that part is correct:
#! /bin/sed -f
# Convert /PinYin/ strings to /UTF-8 PinYin/ strings.
# Notice: /PinYin/ strings MUST NOT be multiline (to do).
/\/.*\// {
s/\//\
/g
:a
h
s/[^\n]*\n//
s/\n.*//
s/ang1/||aq||ng/g
s/ang2/||aw||ng/g
s/ang3/||ae||ng/g
s/ang4/||ar||ng/g
s/eng1/||eq||ng/g
s/eng2/||ew||ng/g
s/eng3/||ee||ng/g
s/eng4/||er||ng/g
s/ing1/||iq||ng/g
s/ing2/||iw||ng/g
s/ing3/||ie||ng/g
s/ing4/||ir||ng/g
s/ong1/||oq||ng/g
s/ong2/||ow||ng/g
s/ong3/||oe||ng/g
s/ong4/||or||ng/g
s/an1/||aq||n/g
s/an2/||aw||n/g
s/an3/||ae||n/g
s/an4/||ar||n/g
s/en1/||eq||n/g
s/en2/||ew||n/g
s/en3/||ee||n/g
s/en4/||er||n/g
s/in1/||iq||n/g
s/in2/||iw||n/g
s/in3/||ie||n/g
s/in4/||ir||n/g
s/un1/||uq||n/g
s/un2/||uw||n/g
s/un3/||ue||n/g
s/un4/||ur||n/g
s/ao1/||aq||o/g
s/ao2/||aw||o/g
s/ao3/||ae||o/g
s/ao4/||ar||o/g
s/ou1/||oq||u/g
s/ou2/||ow||u/g
s/ou3/||oe||u/g
s/ou4/||or||u/g
s/ai1/||aq||i/g
s/ai2/||aw||i/g
s/ai3/||ae||i/g
s/ai4/||ar||i/g
s/ei1/||eq||i/g
s/ei2/||ew||i/g
s/ei3/||ee||i/g
s/ei4/||er||i/g
s/a1/||aq||/g
s/a2/||aw||/g
s/a3/||ae||/g
s/a4/||ar||/g
s/er2/||ew||r/g
s/er3/||ee||r/g
s/er4/||er||r/g
s/lyue/l||u:||e/g
s/nyue/n||u:||e/g
s/e1/||eq||/g
s/e2/||ew||/g
s/e3/||ee||/g
s/e4/||er||/g
s/o1/||oq||/g
s/o2/||ow||/g
s/o3/||oe||/g
s/o4/||or||/g
s/i1/||iq||/g
s/i2/||iw||/g
s/i3/||ie||/g
s/i4/||ir||/g
s/nyu3/n||u:e||/g
s/lyu/l||u:||/g
s/u:1/||u:q||/g
s/u:2/||u:w||/g
s/u:3/||u:e||/g
s/u:4/||u:r||/g
s/u:0/||u:s||/g
s/u1/||uq||/g
s/u2/||uw||/g
s/u3/||ue||/g
s/u4/||ur||/g
s/||aq||/ā/g
s/||aw||/á/g
s/||ae||/ǎ/g
s/||ar||/à/g
s/||eq||/ē/g
s/||ew||/é/g
s/||ee||/ě/g
s/||er||/è/g
s/||iq||/ī/g
s/||iw||/í/g
s/||ie||/ǐ/g
s/||ir||/ì/g
s/||oq||/ō/g
s/||ow||/ó/g
s/||oe||/ǒ/g
s/||or||/ò/g
s/||uq||/ū/g
s/||uw||/ú/g
s/||ue||/ǔ/g
s/||ur||/ù/g
s/||u:q||/ǖ/g
s/||u:w||/ǘ/g
s/||u:e||/ǚ/g
s/||u:r||/ǜ/g
s/||u:s||/ü/g
G
s/\([^\n]*\)\n\([^\n]*\)\n[^\n]*\n/\2\/\1\//
/\n/ b a
}
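To make the “tricks” easier to see, here is the same skeleton with the substitution table cut down to two stand-in rules. It is only a sketch: it assumes GNU sed (for `\n` inside bracket expressions and in the replacement), the function name `convert_line_scope` is made up for illustration, and the two rules replace the full table above.

```shell
#!/bin/sh
# Skeleton of the script above with a tiny stand-in table.
# Assumes GNU sed and balanced "/" quotes on each line.
convert_line_scope() {
  sed -e '
    /\/.*\//{
      # Turn every "/" into a newline. A newline cannot occur inside a
      # single input line, so it is a safe delimiter for the quoted parts.
      s/\//\n/g
      :a
      # Save the whole line, then cut the pattern space down to the
      # first quoted part that is still unprocessed.
      h
      s/[^\n]*\n//
      s/\n.*//
      # Stand-ins for the full conversion table.
      s/ao3/ǎo/g
      s/i3/ǐ/g
      # Append the saved line and splice the converted part back in,
      # restoring the two "/" around it.
      G
      s/\([^\n]*\)\n\([^\n]*\)\n[^\n]*\n/\2\/\1\//
      # Loop while newline delimiters (unprocessed parts) remain.
      /\n/ b a
    }
  '
}
```

With GNU sed, `printf 'Hi /ni3 hao3/ and /ni3/ again\n' | convert_line_scope` should print `Hi /nǐ hǎo/ and /nǐ/ again`, while text outside the slashes is left alone.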
Sample input:
Some text containing for instance Chinese greeting /ni3
hao3/ and perhaps some other Chinese sentence, say /ni2
kan4, .../
Expected output:
Some text containing for instance Chinese greeting /nǐ
hǎo/ and perhaps some other Chinese sentence, say /ní
kàn, .../
My knowledge of sed(1) is not sufficient to solve this problem on my own, so I am asking for your help. Thank you.
In the end it was quite easy to achieve with only a small improvement to the original
sed(1) code. Perhaps it could be done better, but since the conversion code already worked in “line scope”, I left it as it was (apart from minor improvements that are not important to the essence of this question). Instead, I read the whole file into the pattern space, replace newlines with \001 (^A) characters, let the original code do its work, and at the end replace the ^A characters with newlines again. Here it is:
Sample input text:
Sample run:
It seems to work just fine (at least it suits my needs), and thus I consider this issue closed. Many thanks go to everyone involved, especially Mr. Lev Levitsky!
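The approach described above (slurp the file, swap newlines for ^A, run the line-scope code, swap back) can be sketched roughly as follows. This is not the exact script: it assumes GNU sed, whose `\x01` escape stands in for the literal ^A byte, the function name `convert_pinyin` is made up for illustration, and the four substitutions are stand-ins for the full table.

```shell
#!/bin/sh
# Sketch only. Assumes GNU sed: \x01 escapes, \n inside bracket
# expressions and \n in the replacement are GNU extensions.
# Also assumes the "/" quotes are balanced and the input has no ^A bytes.
convert_pinyin() {
  sed -e '
    # 1. Read the whole input into the pattern space.
    :slurp
    $!{
      N
      b slurp
    }
    # 2. Hide the real newlines behind ^A (byte 0x01).
    s/\n/\x01/g
    # 3. The original line-scope code, unchanged in structure.
    /\/.*\//{
      s/\//\n/g
      :a
      h
      s/[^\n]*\n//
      s/\n.*//
      # Stand-ins for the full conversion table.
      s/an4/àn/g
      s/ao3/ǎo/g
      s/i2/í/g
      s/i3/ǐ/g
      G
      s/\([^\n]*\)\n\([^\n]*\)\n[^\n]*\n/\2\/\1\//
      /\n/ b a
    }
    # 4. Bring the real newlines back.
    s/\x01/\n/g
  '
}
```

Running `convert_pinyin < input.txt` on the sample input from the question should give the expected output shown there, since the four stand-in rules cover exactly the syllables it uses.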
P.S.: I also placed the code here (GitHub) where you can track some possible future changes.
P.S. 2: The ^A characters were lost while saving this answer. They are now replaced here with their ASCII representation. You have to convert them back to their binary form (in vi(1), press ^V ^A in insert mode) or use the GitHub version instead.
P.S. 3: I still find the ^A “hack” quite ugly. If anybody knows how to avoid it in this case while keeping the middle conversion code as simple as it is now, please share your ideas.
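Regarding the ^A bytes from P.S. 2: one way to keep them out of the script file entirely is to do the newline swap outside sed with tr(1) and leave the line-scope script untouched. This does not remove the hack itself, it only moves it into a wrapper. The sketch below assumes the input contains no ^A bytes and a sed that does not append a newline to an unterminated last line (GNU sed behaves this way); the function name `pinyin_multiline` is made up for illustration.

```shell
#!/bin/sh
# Usage: pinyin_multiline SCRIPT.sed < input > output
# Turns the whole input into one long "line" for sed by mapping every
# newline to ^A (octal 001), runs the unmodified line-scope script on it,
# and maps the ^A bytes back to newlines afterwards.
pinyin_multiline() {
  tr '\n' '\001' | sed -f "$1" | tr '\001' '\n'
}
```

Because sed then sees a single line, a guard such as `/\/.*\//` matches quoted strings that originally spanned several lines.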