I have an awk script that processes a csv file and produces a report

Asked: May 28, 2026 (2026-05-28T19:21:18+00:00)

I have an awk script that processes a CSV file and produces a report. For each column named in the header row, it counts the distinct values that contain data, i.e. that match /[A-Za-z0-9]/. I would like to enhance the script so that it also prints the top 5 most duplicated values in each column.

Here is sample data:

Food|Type|Spicy
Broccoli|Vegetable|No
Lettuce|Vegetable|No
Spinach|Vegetable|No
Habanero|Vegetable|Yes
Swiss Cheese|Dairy|No
Milk|Dairy|No
Yogurt|Dairy|No
Orange Juice|Fruit|No
Papaya|Fruit|No
Watermelon|Fruit|No
Coconut|Fruit|No
Cheeseburger|Meat|No
Gorgonzola|Dairy|No
Salmon|Fish|
Apple|Fruit|No
Orange|Fruit|No
Bagel|Bread|No
Chicken|Meat|No
Chicken Wings|Meat|Yes
Pizza||No

This is the current script, to which SiegeX has contributed substantially:

$ cat matrix2.awk 
NR==1{
  for(i=1;i<=NF;i++)
    head[i]=$i
  next
}
{
  for(i=1;i<=NF;i++)
  {
    if($i && !arr[i,$i]++)
      n[i]++
    if(arr[i,$i] > 1)
      f[i]=1
  }
}
END{
  for(i=1;i<=length(head);i++) {
    printf("%-6d%s\n",n[i],head[i])
    if(f[i]) {
      for(x in arr) {
        split(x,b,SUBSEP)
        if(b[1]==i && b[2])
          printf("% -6d %s\n",arr[i,b[2]],b[2])
      }
    }
  }
}

This is the current output:

$ awk -F "|" -f matrix2.awk testlist.csv 
20    Food
6     Type
 6     Fruit
 4     Vegetable
 3     Meat
 1     Fish
 4     Dairy
 1     Bread
2     Spicy
 17    No
 2     Yes

And this is the desired output:

$ awk -F "|" -f matrix2.awk testlist.csv 
20    Food
6     Type
 6     Fruit
 4     Vegetable
 4     Dairy
 3     Meat
 1     Fish
2     Spicy
 17    No
 2     Yes

The only thing left that I would like to add is a general function that limits each column's output to its top 5 most duplicated values. In other words, a per-column version of sort | uniq -c | sort -nr | head -5.
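
For a single column, that pipeline might look like the sketch below. The function name column_top5 and its two arguments are hypothetical; the sketch assumes a |-delimited file with exactly one header row, such as testlist.csv, and reuses the /[A-Za-z0-9]/ test to decide whether a field contains data.

```shell
# Hypothetical helper: column_top5 FILE COLUMN_NUMBER
column_top5() {
  cut -d'|' -f"$2" "$1" |   # keep only the requested column
    tail -n +2 |            # drop the header row
    grep '[A-Za-z0-9]' |    # ignore fields that contain no data
    sort | uniq -c |        # count each distinct value
    sort -nr | head -5      # keep the five largest counts
}
```

Calling column_top5 testlist.csv 2 would list the counts for the Type column. The goal of the awk script is to produce the same kind of listing for every column in one pass.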

1 Answer

Editorial Team · Answer 1 · 2026-05-28T19:21:19+00:00

The following script is extensible and scalable: it works with an arbitrary number of columns, and nothing is hardcoded.

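awk -F'|' '
# Header row: remember each column name and the number of columns.
NR==1{
  ncols=NF
  for(i=1;i<=NF;i++)
    head[i]=$i
  next
}
# Data rows: count every value per column, skipping fields with no data.
{
  for(i=1;i<=NF;i++)
  {
    if($i ~ /[A-Za-z0-9]/ && !arr[i,$i]++)
      n[i]++          # n[i] = number of distinct values in column i
    if(arr[i,$i] > 1)
      f[i]=1          # f[i] = column i has at least one repeated value
  }
}
END{
  for(i=1;i<=ncols;i++) {
    printf("%-32s%d\n",head[i],n[i])
    if(f[i]) {
      # List each value of the column with its count (order is unspecified).
      for(x in arr) {
        split(x,b,SUBSEP)
        if(b[1]==i && b[2]!="")
          printf("    %-28s%d\n",b[2],arr[i,b[2]])
      }
    }
  }
}' infile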

Output

$ ./report
Food                            9
Type                            5
    Meat                        2
    Bread                       1
    Vegetable                   2
    Fruit                       2
    Fish                        1
Spicy                           2
    Yes                         2
    No                          6
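
The script above lists every value of a column in whatever order for(x in arr) happens to yield, so it does not yet limit the listing to the five most frequent values. One possible way to add that, sketched below, is to select the largest remaining count repeatedly in the END block. This is an illustration under stated assumptions rather than a definitive implementation: the wrapper name top_n_report, the variables top, cnt, dup and done, and the alphabetical tie-breaking are all assumptions made for this sketch. It deliberately avoids gawk-only sorting functions so that it should run on any POSIX awk.

```shell
# Hypothetical wrapper: top_n_report FILE [N]   (N defaults to 5)
top_n_report() {
  awk -F'|' -v top="${2:-5}" '
  NR == 1 {                                # header row: remember column names
    ncols = NF
    for (i = 1; i <= NF; i++) head[i] = $i
    next
  }
  {
    for (i = 1; i <= NF; i++) {
      if ($i !~ /[A-Za-z0-9]/) continue    # skip fields with no data
      if (!cnt[i, $i]++) n[i]++            # n[i] = distinct values in column i
      if (cnt[i, $i] > 1) dup[i] = 1       # column i has a repeated value
    }
  }
  END {
    for (i = 1; i <= ncols; i++) {
      printf("%-6d%s\n", n[i], head[i])
      if (!dup[i]) continue                # nothing repeated: header line only
      for (k = 1; k <= top; k++) {
        max = 0; best = ""; bestkey = ""
        for (key in cnt) {                 # find the largest count not yet printed
          split(key, b, SUBSEP)
          if (b[1] != i || (key in done)) continue
          # Assumed tie-break: equal counts are ordered alphabetically.
          if (cnt[key] > max || (cnt[key] == max && (b[2] "") < (best ""))) {
            max = cnt[key]; best = b[2]; bestkey = key
          }
        }
        if (max == 0) break                # fewer than top values in this column
        done[bestkey] = 1
        printf(" %-6d%s\n", max, best)
      }
    }
  }' "$1"
}
```

With the sample data from the question, top_n_report testlist.csv should print Fruit first under Type, followed by the two values with a count of 4. Because ties are broken alphabetically here, Bread rather than Fish would take the fifth slot; a different tie rule would need a different comparison in the inner loop.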
