Background
Create a probability lexicon based on a CSV file of words and tallies. This is a prelude to a text segmentation problem, not a homework problem.
Problem
Given a CSV file with the following words and tallies:
aardvark,10
aardwolf,9
armadillo,9
platypus,5
zebra,1
Create a file with probabilities relative to the largest tally in the file:
aardvark,1
aardwolf,0.9
armadillo,0.9
platypus,0.5
zebra,0.1
Where, for example, aardvark,1 is calculated as aardvark,10/10 and platypus,0.5 is calculated as platypus,5/10.
Question
What is the most efficient way to implement a shell script to create the file of relative probabilities?
Constraints
- Neither the words nor the numbers are in any order.
- No major programming language (such as Perl, Ruby, Python, Java, C, Fortran, or Cobol).
- Standard Unix tools such as awk, sed, or sort are welcome.
- All probabilities must be relative to the highest probability in the file.
- The words are unique, the numbers are not.
- The tallies are natural numbers.
Thank you!
No need to read the file twice:
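A minimal sketch of a single-pass approach, assuming POSIX awk: every tally is kept in memory while the maximum is tracked, and the division happens once in the END block. The function name relprob is hypothetical, chosen only for illustration.

```shell
# relprob FILE: print "word,tally/max" for each line of FILE (hypothetical helper).
# The file is read once; tallies are buffered in an awk array.
relprob() {
    awk -F, '
        {
            tally[$1] = $2                    # remember each word and its tally
            if ($2 + 0 > max) max = $2 + 0    # track the largest tally seen so far
        }
        END {
            # "for (w in tally)" visits words in an unspecified order
            for (w in tally) print w "," (tally[w] / max)
        }
    ' "$1"
}
```

Calling it as relprob words.csv > probs.csv would write the relative probabilities to a new file; the output order is not guaranteed.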
If you need the output sorted by word:
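One way this might look, assuming POSIX awk and sort: compute the probabilities in a single pass, then pipe the result through sort keyed on the first comma-separated field. The function name relprob_by_word is hypothetical.

```shell
# relprob_by_word FILE: relative probabilities, ordered by word (hypothetical helper).
relprob_by_word() {
    awk -F, '
        { tally[$1] = $2; if ($2 + 0 > max) max = $2 + 0 }   # one pass, track the maximum
        END { for (w in tally) print w "," (tally[w] / max) }
    ' "$1" | LC_ALL=C sort -t, -k1,1                          # byte-wise sort on the word field
}
```

LC_ALL=C is used here only to make the ordering predictable across locales; drop it if locale-aware collation is wanted.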
or
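An alternative sketch, under the same assumptions, sorts the input first and lets awk replay the lines in the order it received them, so no sort is needed after the division. The name relprob_presorted is hypothetical.

```shell
# relprob_presorted FILE: sort by word first, then normalise (hypothetical helper).
relprob_presorted() {
    LC_ALL=C sort -t, -k1,1 "$1" | awk -F, '
        {
            word[NR]  = $1                    # keep lines indexed by input order
            tally[NR] = $2
            if ($2 + 0 > max) max = $2 + 0    # track the largest tally
        }
        END {
            # replay in input order, which is already sorted by word
            for (i = 1; i <= NR; i++) print word[i] "," (tally[i] / max)
        }
    '
}
```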
If you need the output sorted by probability:
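A possible sketch, again assuming POSIX awk and sort: the same single-pass awk program, followed by a numeric, descending sort on the second field, with the word as a tie-breaker. The function name relprob_by_prob is hypothetical.

```shell
# relprob_by_prob FILE: relative probabilities, highest first (hypothetical helper).
relprob_by_prob() {
    awk -F, '
        { tally[$1] = $2; if ($2 + 0 > max) max = $2 + 0 }   # one pass, track the maximum
        END { for (w in tally) print w "," (tally[w] / max) }
    ' "$1" | LC_ALL=C sort -t, -k2,2nr -k1,1   # field 2 numeric descending, then word ascending
}
```

With the sample data from the question, words that share a probability (aardwolf and armadillo) come out in alphabetical order because of the second sort key.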