I am trying to load text data into a PostgreSQL database via COPY FROM. The data is definitely not clean CSV.
The input data isn’t always consistent: sometimes there are excess fields (separator is part of a field’s content) or there are nulls instead of 0’s in integer fields.
The result is that PostgreSQL throws an error and stops loading.
Currently I am trying to massage the data into consistency with Perl.
Is there a better strategy?
Can PostgreSQL be asked to be as tolerant as MySQL or SQLite in that respect?
Thanks
PostgreSQL’s COPY FROM isn’t designed to handle bodgy data and is quite strict. There’s little support for tolerance of dodgy data.

I thought there was little interest in adding any until I saw this proposed patch posted just a few days ago for possible inclusion in PostgreSQL 9.3. The patch has been resoundingly rejected, but shows that there’s some interest in the idea; read the thread.
It’s sometimes possible to COPY FROM into a staging TEMPORARY table that has all text fields with no constraints. Then you can massage the data using SQL from there. That’ll only work if the input is at least well-formed and regular, though, and it doesn’t sound like yours is.

If the data isn’t clean, you need to pre-process it with a script in a suitable scripting language.
Have that script:
- INSERT rows;
- COPY rows in; or
- COPY FROM

Python’s csv module can be handy for this. You can use any language you like; Perl, Python, PHP, Java, C, whatever.

If you were enthusiastic you could write it in PL/Perlu or PL/Pythonu, inserting the data as you read it and clean it up. I wouldn’t bother.
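
As a rough illustration of the pre-processing route, here is a minimal Python sketch that repairs the two problems mentioned in the question and writes clean CSV. Everything about the layout is an assumption, since the question doesn't show the real data: a `|` separator, three columns, excess fields caused by an unescaped separator inside one known free-text column, and empty, `NULL` or `\N` values standing in for zero in the integer column. The function name `clean_rows` and its parameters are made up for this example.

```python
import csv
import io

# Assumed spellings of "null" in the raw input; adjust to the real data.
NULL_MARKERS = {"", "NULL", "\\N"}


def clean_rows(lines, n_cols, text_col, int_cols, sep="|"):
    """Yield lists of exactly n_cols fields.

    Assumes any excess fields come from an unescaped separator inside
    column text_col. int_cols are column indexes after that repair.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split(sep)
        extra = len(fields) - n_cols
        if extra < 0:
            # Too few fields: nothing sensible to repair, so skip the row.
            # A real script would probably log it somewhere instead.
            continue
        if extra > 0:
            # Glue the pieces of the free-text column back together.
            end = text_col + extra + 1
            fields[text_col:end] = [sep.join(fields[text_col:end])]
        for i in int_cols:
            if fields[i].strip() in NULL_MARKERS:
                fields[i] = "0"
        yield fields


# Made-up sample input: the 2nd and 3rd rows have excess fields and nulls.
raw = ["1|widget|5\n", "2|left|right|\n", "3|a|b|c|NULL\n"]

out = io.StringIO()
# csv.writer quotes any field that contains the output delimiter.
csv.writer(out, lineterminator="\n").writerows(
    clean_rows(raw, n_cols=3, text_col=1, int_cols=[2])
)
print(out.getvalue(), end="")
```

In practice you would read from the real input file and write to a file on disk, then load that file with COPY FROM using the CSV format option. Rows the script can't repair are silently dropped here, which is probably not what you want for real data.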