I have a CSV-file similar to this test.csv file: Header 1; Header 2; Header

Asked: June 10, 2026

I have a CSV-file similar to this test.csv file:

Header 1; Header 2; Header 3
A;B;US
C;D;US
E;F;US
G;H;FR
I;J;FR
K;L;FR
M;"String with ; semicolon";UK
N;"String without semicolon";UK
O;"String OK";
P;"String OK";

Now I want to split this file by the value in Header 3, so that I end up with four separate CSV files: one each for “US”, “FR”, “UK”, and “”.

With my very limited Linux command-line skills (sadly 🙁), I have been using this line until now:

awk -F\; 'NR>1{ fname="country_yearly_"$3".csv"; print >>(fname); close(fname);}' test.csv

Of course, the experienced command-line users among you will spot my problem: in some rows of test.csv, the semicolon I use as a separator also appears inside fields that are enclosed in quotation marks. (I can’t guarantee that it only ever appears inside quoted fields, because there are millions of rows, but I’m happy with an answer that assumes so.) So, sadly, I get an additional file named country_yearly_ semicolon".csv, which contains this row from my example.

While trying to solve this issue, I came across this question on SO. In particular, Thor’s answer seems to contain the solution to my problem: replace all semicolons inside strings. I adjusted his code as follows:

awk -F'"' -v OFS='' '
  NF > 1 { 
    for(i=2; i<=NF; i+=2) { 
      gsub(";", "|", $i);
      $i = FS $i FS;       # reinsert the quotes
    }
    print
  }' test.csv > test1.csv

Now, I get the following test1.csv file:

M;"String with | semicolon";UK
N;"String without semicolon";UK
O;"String OK";
P;"String OK";

As you can see, all rows that contain quotation marks are output and my problem line is fixed, but a) I actually want all rows, not only those with quotation marks, and I can’t figure out which part of his code limits the output to rows with quotation marks, and b) I think it would be more efficient to change test.csv itself instead of sending the output to a new file, but I don’t know how to do that either.

EDIT in response to Birei’s answer:

Unfortunately, my minimal example was too simple. Here is an updated version:

Header 1; Header 2; Header 3; Header 4
A;B;US; 
C;D;US;
E;F;US;
G;H;FR;
I;J;FR;
K;L;FR;
M;"String with ; semicolon";UK;"Yet another ; string"
N;"String without semicolon";UK; "No problem here"
O;"String OK";;"Fine"
P;"String OK";;"Not ; fine"

Note that my real data has roughly 100 columns and millions of rows, and the country column (ignoring semicolons in strings) is column 13. However, as far as I can see, I can’t rely on it being column 13 unless I get rid of the semicolons in strings first.


1 Answer

Editorial Team · Answer 1 · 2026-06-10T18:22:48+00:00

To split the file, you might just do:

awk -v FS=";" '{ CSV_FILE = "country_yearly_" $NF ".csv" ; print > CSV_FILE }' test.csv

This always takes the last field to build the file name, so it only works when the country is the last column, as in your first example.
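
To see why a semicolon inside an earlier quoted field does not matter here, the one-liner can be tried on a few rows of the first sample. This is only a sketch: it drops the header line, and it writes the output files into the current directory, so it is best run in an empty scratch directory.

```shell
# A few rows from the first sample in the question, without the header line.
cat > test.csv <<'EOF'
A;B;US
G;H;FR
M;"String with ; semicolon";UK
O;"String OK";
EOF

# $NF is the last semicolon-separated field. Line M is split into four
# fields instead of three, but the last one is still the country.
awk -v FS=";" '{ CSV_FILE = "country_yearly_" $NF ".csv" ; print > CSV_FILE }' test.csv

# List the files that were created.
ls country_yearly_*.csv
```

Line O ends with an empty country, so it lands in country_yearly_.csv.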

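In your example, only lines with quotation marks are printed because print sits inside the action guarded by the NF > 1 pattern: with " as the field separator, a line without quotes has exactly one field, so the action never runs for it. The following script will print all lines: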

awk -F'"' -v OFS='' '
  NF > 1 { 
    for(i=2; i<=NF; i+=2) { 
      gsub(";", "|", $i);
      $i = FS $i FS;       # reinsert the quotes
    }
  }
  {
    # print all lines
    print
  }' test.csv > test1.csv
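
On point b): awk reads one stream and writes another, so changing test.csv "in place" usually means writing to a temporary file and then renaming it over the original. A minimal sketch of that, assuming a temporary name of test.csv.tmp (hypothetical, any unused name works) and a shortened copy of the first sample:

```shell
cat > test.csv <<'EOF'
Header 1; Header 2; Header 3
A;B;US
M;"String with ; semicolon";UK
O;"String OK";
EOF

# Write the masked lines to a temporary file and replace test.csv only if
# awk exited successfully. test.csv.tmp is a hypothetical name.
awk -F'"' -v OFS='' '
  NF > 1 {
    for (i = 2; i <= NF; i += 2) {
      gsub(";", "|", $i)
      $i = FS $i FS       # reinsert the quotes
    }
  }
  { print }' test.csv > test.csv.tmp && mv test.csv.tmp test.csv

cat test.csv
```

GNU awk 4.1 or later also ships an inplace extension (gawk -i inplace ...), but that is not portable to other awk implementations. Either way the whole file is rewritten, so this is more a convenience than a speed-up.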

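To do what you want without altering the data, you could mask the semicolons inside quotes in a copy of the line, use that copy only to find the country, and write the original line to the output file: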

awk -F'"' -v OFS='' '
  # Save the original line
  { ORIGINAL_LINE = LINE = $0 }
  # Replace the semicolon inside quotes by a dummy character
  # and put the resulting line in the LINE variable
  NF > 1 {
    LINE = ""
    for(i=2; i<=NF; i+=2) { 
      gsub(";", "|", $i)
      LINE = LINE $(i-1) FS $i FS     # reinsert the quotes
    }
    # Add the text after the last closing quote, if any.
    # After the loop, i is the first even index beyond NF, so that text is $(i-1).
    if ( i - 1 <= NF ) { LINE = LINE $(i-1) }
  }
  {
    # Put the semicolon-separated fields in a table
    # (the semicolons inside quotes have been replaced in LINE)
    split( LINE, TABLE, /;/ )
    # Build the file name -- TABLE[ 3 ] is the 3rd field
    CSV_FILE = "country_yearly_" TABLE[ 3 ] ".csv"
    # Save the line
    print ORIGINAL_LINE > CSV_FILE
  }' test.csv
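
Since your real data has the country in column 13, the column index could be passed in as a variable rather than hard-coded. The following is a sketch of a variant of the script above, run on a shortened copy of the updated sample. The variable COL is hypothetical (3 for the sample, 13 for the real data), and the header line is skipped with NR == 1, as the one-liner in the question did with NR>1. It writes its output files into the current directory.

```shell
cat > test.csv <<'EOF'
Header 1; Header 2; Header 3; Header 4
A;B;US;
G;H;FR;
M;"String with ; semicolon";UK;"Yet another ; string"
N;"String without semicolon";UK; "No problem here"
P;"String OK";;"Not ; fine"
EOF

awk -F'"' -v OFS='' -v COL=3 '
  NR == 1 { next }                 # skip the header line
  { ORIGINAL_LINE = LINE = $0 }    # keep an untouched copy of the line
  NF > 1 {
    LINE = ""
    for (i = 2; i <= NF; i += 2) {
      gsub(";", "|", $i)           # mask semicolons inside quotes
      LINE = LINE $(i-1) FS $i FS  # unquoted part, then the quoted part
    }
    # After the loop, $(i-1) is the text after the last closing quote, if any.
    if (i - 1 <= NF) { LINE = LINE $(i-1) }
  }
  {
    # LINE now has semicolons only as separators, so COL is reliable.
    split(LINE, TABLE, /;/)
    CSV_FILE = "country_yearly_" TABLE[COL] ".csv"
    print ORIGINAL_LINE > CSV_FILE
  }' test.csv

ls country_yearly_*.csv
```

For the real data, the same script would be called with -v COL=13. The output files contain the original lines, with the semicolons inside the quoted strings left as they were.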
