I am new to grep and awk, and I would like to create tab-separated values in the “frequency.txt” output file (this script looks at a large corpus and then outputs each individual word and how many times it is used in the corpus; I modified it for the Khmer language). I’ve looked around (grep a tab in UNIX), but I can’t seem to find an example that makes sense to me for this bash script (I’m too much of a newbie).
I am using this bash script in Cygwin:
#!/bin/bash
# Create a tally of all the words in the corpus.
#
echo Creating tally of word frequencies...
#
sed -e 's/[a-zA-Z]//g' -e 's// /g' -e 's/\t/ /g' \
-e 's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g' -e 's/[0-9]//g' \
-e 's/ /\n/g' -e 's/០//g' -e 's/១//g' -e 's/២//g' \
-e 's/៣//g' -e 's/៤//g' -e 's/៥//g' -e 's/៦//g' \
-e 's/៧//g' -e 's/៨//g' -e 's/៩//g' dictionary.txt | \
tr [:upper:] [:lower:] | \
sort | \
uniq -c | \
sort -rn > frequency.txt
grep -Fwf dictionary.txt frequency.txt | awk '{print $2 "," $1}'
Awk is printing with a comma, but that is only on-screen. How can I place a tab (a comma would work as well) between the frequency and the term?
Here’s a small part of the dictionary.txt file (Khmer does not use spaces, but in this corpus there is a non-breaking space between each word which is converted to a space using sed and regular expressions):
ព្រះវិញ្ញាណនឹងប្រពន្ធថ្មោងថ្មីពោលថា
អញ្ជើញមក ហើយអ្នកណាដែលឮក៏ថា
អញ្ជើញមកដែរ អ្នកណាដែលស្រេក
នោះមានតែមក ហើយអ្នកណាដែលចង់បាន
មានតែយកទឹកជីវិតនោះចុះ
ឥតចេញថ្លៃទេ។
Here is an example output of frequency.txt as it is now (frequency and then term):
25605 នឹង
25043 ជា
22004 បាន
20515 នោះ
I want the output frequency.txt to look like this (where TAB is an actual tab character):
25605TABនឹង
25043TABជា
22004TABបាន
20515TABនោះ
Thanks for your help!
You should be able to replace the whole lengthy sed command with a much shorter one.

Comments:

's// /g': an empty pattern means re-use the previous regular expression, which was [a-zA-Z], and replace its matches with spaces. Those characters have already been deleted, so this is a no-op.

's/[«|»|:|;|.|,|(|)|-|?|។|”|“]//g': the pipe characters don’t delimit alternatives inside square brackets; they are literal (and more than one is redundant). The intended expression would be 's/[«»:;.,()?។”“|-]//g' (leaving one pipe in case you really want to delete them, and with the hyphen moved to the end, because a hyphen between two characters inside brackets denotes a range).

's/ /\n/g': earlier, you replaced tabs with spaces; now you’re replacing the spaces with newlines.

You should be able to have the tabs you want by inserting a command in your pipeline right after the uniq that turns the space between the count and the term into a tab.

If you want the AWK command to output a tab, use "\t" in place of "," in the print statement.
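As a minimal sketch of both approaches (not necessarily the exact commands the original answer had in mind), the following uses awk for each one. The sample data is hypothetical: it is shaped like uniq -c output and reuses counts and terms from the question. The variable names sample, tabbed, and swapped are illustrative only.

```shell
#!/bin/bash
# Hypothetical sample shaped like `uniq -c` output:
# a right-aligned count, one space, then the term.
sample=$(printf '%7d %s\n' 25605 'នឹង' 25043 'ជា' 22004 'បាន')

# Option 1: rewrite each line as count<TAB>term.
# In the question's pipeline this stage would sit between
# `uniq -c` and `sort -rn`, so frequency.txt is written with tabs.
tabbed=$(printf '%s\n' "$sample" | awk '{ print $1 "\t" $2 }')

# Option 2: set awk's output field separator to a tab and let
# `print` insert it. This keeps the term-then-count order of the
# question's `awk '{print $2 "," $1}'` command.
swapped=$(printf '%s\n' "$sample" | awk 'BEGIN { OFS = "\t" } { print $2, $1 }')

printf '%s\n' "$tabbed"
printf '%s\n' "$swapped"
```

Option 1 relies on each term containing no spaces, which holds here because the earlier sed step turns every space into a newline. Option 2 only changes what is shown on screen unless its output is redirected to a file; that file should not be frequency.txt itself, since the same command is reading it.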