I am trying to make an executable that will take in any number of text files and give an output that is the distribution of words by number of occurrences. This is to be done in bash scripting, and what I have so far is:
```shell
#!/bin/bash
y=$(cat $* | wc -w)
cat $* | tr ' ' '//' | tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]' |
grep -v '[^a-z]'| sort | uniq -c | sort -rn | head -$y
```
I get an error trying to set y and I can’t figure out how to get head to print out every word otherwise.
Is there a better way to print it out?
Why run `head` at all? There’s no guarantee that there will be as many distinct words as there are words in the files; indeed, it is practically guaranteed that there won’t be (since there’ll be some repeated words). And if you want to see all the data, then show all the data; don’t filter the output from `sort -nr`.

The first `tr` only needs one slash, I think. Normally, you’d map blanks and punctuation to newlines (with a `-s` option to `tr` to squeeze adjacent newlines to one). The slashes from the first `tr` count as punctuation in the third `tr`, so it isn’t obvious what you’re up to there. I think I’d expect to see something like:

Note the use of `"$@"` rather than `$*`; there’s no difference when the file names you specify don’t contain blanks (newlines, tabs, etc); when they do, the `"$@"` form is correct and `$*` is not, so you may as well always use `"$@"`. It is correct far more often than `$*` is.

For some C source code I had lying around, the output from the script was:

Note that the word ‘h’ appears as often as the word ‘include’; there’s a reason for that! The word `t` appears a lot, but that’s because, for example, `size_t` is treated as two words by the filtering. Preserving underscores is possible; change the first `tr` to use `'[:alpha:]_'` (note the underscore). You eliminated digits, but you can keep those too if you want.
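
The advice in this answer (map every run of non-letters to a single newline with `tr -s`, fold case, count, and print everything without `head`) might be put together roughly as below. This is only a sketch under those assumptions, not necessarily the script the answer originally showed; the function name `word_freq` is made up for illustration.

```shell
#!/bin/bash
# word_freq is a hypothetical helper name, not taken from the original post.
# It reads the named files (or standard input when no files are given) and
# prints every distinct word with its count, most frequent first.
word_freq() {
    cat -- "$@" |                     # "$@" keeps file names with blanks intact
    tr -cs '[:alpha:]' '\n' |         # each run of non-letters becomes one newline
    tr '[:upper:]' '[:lower:]' |      # fold case so "The" and "the" are one word
    grep -v '^$' |                    # drop the empty line left by leading non-letters
    sort |                            # bring identical words together
    uniq -c |                         # count each group
    sort -nr                          # highest count first; no head, so nothing is cut
}
```

After sourcing the file, `word_freq file1.txt "file two.txt"` prints the whole distribution. Changing `'[:alpha:]'` to `'[:alpha:]_'` would keep identifiers such as `size_t` whole, and `'[:alnum:]_'` would keep digits as well.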