To prefix the unique words in a file with “UNIQUE:”, I tried a Perl regex command like:
perl -e 'undef $/;while($_=<>){s/^(((?!\b\3\b).)*)\b(\w+)\b(((?!\b\3\b).)*)$/\1UNIQUE:\3\4/gs;print $_;}' demo
On a demo file containing:
watermelon banana
apple pear pineapple orange mango
strawberry cherry
kiwi pineapple lemon cranberry watermelon
orange plum cherry
kiwi banana plum
mango cranberry apple
lemon
The output is:
watermelon banana
apple pear pineapple orange mango
strawberry cherry
kiwi pineapple lemon cranberry watermelon
orange plum cherry
kiwi banana plum
mango cranberry apple
UNIQUE:lemon
Unfortunately, the \3 backreference doesn’t seem to work when it is used before group 3 has been captured.
Is there another way to achieve this with another regex or with other usual commands available on a Linux box? (grep, sed, awk,…)
Many thanks
EDIT:
Unfortunately, many of the solutions work only for the provided case, which was incomplete; my apologies for that. It should also work on text like:
{watermelon || banana}
apple = ( pear pineapple orange mango )
strawberry cherry
kiwi = pineapple = lemon = cranberry = watermelon
orange - plum = cherry
kiwi = banana + plum
mango = cranberry && apple
lemon
If it simplifies the problem, words may be prefixed with something like $ or @.
I see you are already using Perl. When you want to count something, using a hash is always a nice approach…
which will output:
{watermelon || banana}
apple = ( UNIQUE:pear pineapple orange mango )
UNIQUE:strawberry cherry
kiwi = pineapple = lemon = cranberry = watermelon
orange - plum = cherry
kiwi = banana + plum
mango = cranberry && apple
lemon
Using a regexp is probably going to be hard. You need to run through the entire file twice: one pass to count all occurrences of each word, and one pass to mark up the unique words.
The above snippet reads the input once but keeps the entire original text in $str, which is obviously a bad idea if the input is large.