I am using a voice-to-text application that produces transcription files as output. The transcribed text contains a few tags such as (s) (sentence beginning), (/s) (sentence end) and (VOCAL_NOISE) (unrecognized words), but it also contains unwanted tags such as (VOCAL_N), (VOCAL_NOISED), (VOCAL_SOUND) and (UNKNOWN). I am using sed to process the text, but I cannot write an appropriate regex that replaces every tag other than (s), (/s) and (VOCAL_NOISE) with the tag ~NS. I would appreciate it if someone could help me with this.
Example text:
(s) Hi Stacey , this is Stanley (/s) (s) I would (VOCAL_N) appreciate if you could call (UNKNOWN) and let him know I want an appointment (VOCAL_NOISE) with him (/s)
Output should be:
(s) Hi Stacey , this is Stanley (/s) (s) I would ~NS appreciate if you could call ~NS and let him know I want an appointment (VOCAL_NOISE) with him (/s)
This should take care of it:
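The three substitutions explained below can be chained in a single sed invocation. The following is a minimal sketch rather than the exact command as originally posted. It assumes GNU sed, because `\n` in the replacement text and the `\?` and `\|` operators are GNU extensions to basic regular expressions. The function name `clean_tags` is made up for illustration.

```shell
# Hypothetical helper: reads transcript lines on stdin, writes cleaned lines to stdout.
# Assumes GNU sed (\n in the replacement, \? and \| in basic regex syntax).
clean_tags() {
  sed '
    # 1. Put every parenthesized tag between two newlines.
    s|([^)]*)|\n&\n|g
    # 2. Remove the newlines again around the tags to keep.
    s@\n\((/\?s)\|(VOCAL_NOISE)\)\n@\1@g
    # 3. Any tag still wrapped in newlines is unwanted: replace it with ~NS.
    s|\n\(([^)]*)\)\n|~NS|g
  '
}

# Sample line taken from the question.
input='(s) Hi Stacey , this is Stanley (/s) (s) I would (VOCAL_N) appreciate if you could call (UNKNOWN) and let him know I want an appointment (VOCAL_NOISE) with him (/s)'

out=$(printf '%s\n' "$input" | clean_tags)
printf '%s\n' "$out"
```

To process a whole transcription file, the same three expressions can be passed to sed with the file name as an argument instead of piping a single line through the function.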
Explanation:
s|([^)]*)|\n&\n|g – divide the line by putting every parenthesized string between two newlines
s@\n\((/\?s)\|(VOCAL_NOISE)\)\n@\1@g – remove the newlines around “(s)”, “(/s)” and “(VOCAL_NOISE)” (the keepers)
s|\n\(([^)]*)\)\n|~NS|g – replace anything else between newlines that is within parentheses with “~NS”
This works since newlines are guaranteed not to appear within a newly read line of text.
Edit: Shortened the command by using alternation, \(foo\|bar\).
Previous version: