I have an awk script that processes a csv file and produces a report that counts the number of rows for each column, named in the header field, that contain data /[A-Za-z0-9]/. What I would like to do is enhance the script and print the top 5 most duplicated data elements in each column.
Here is sample data:
Food|Type|Spicy
Broccoli|Vegetable|No
Lettuce|Vegetable|No
Spinach|Vegetable|No
Habanero|Vegetable|Yes
Swiss Cheese|Dairy|No
Milk|Dairy|No
Yogurt|Dairy|No
Orange Juice|Fruit|No
Papaya|Fruit|No
Watermelon|Fruit|No
Coconut|Fruit|No
Cheeseburger|Meat|No
Gorgonzola|Dairy|No
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No
This is the current script that SiegeX has substantially contributed:
$ cat matrix2.awk
NR==1{
for(i=1;i<=NF;i++)
head[i]=$i
next
}
{
for(i=1;i<=NF;i++)
{
if($i && !arr[i,$i]++)
n[i]++
if(arr[i,$i] > 1)
f[i]=1
}
}
END{
for(i=1;i<=length(head);i++) {
printf("%-6d%s\n",n[i],head[i])
if(f[i]) {
for(x in arr) {
split(x,b,SUBSEP)
if(b[1]==i && b[2])
printf("% -6d %s\n",arr[i,b[2]],b[2])
}
}
}
}
This is the current output:
$ awk -F "|" -f matrix2.awk testlist.csv
20 Food
6 Type
6 Fruit
4 Vegetable
3 Meat
1 Fish
4 Dairy
1 Bread
2 Spicy
17 No
2 Yes
And this is the desired output:
$ awk -F "|" -f matrix2.awk testlist.csv
20 Food
6 Type
6 Fruit
4 Vegetable
4 Dairy
3 Meat
1 Fish
2 Spicy
17 No
2 Yes
The only thing left that I would like to add is a general function that limits each columns output to the top 5 most duplicated fields. As mentioned below, a columnar version of sort |uniq -c |sort -nr |head -5.
The following script is both extensible and scalable as it will work with an arbitrary number of columns. Nothing is hardcoded
Output