I am provided with text files containing data that I need to load into a Postgres database.
The files are structured in records (one per line) with fields separated by a tilde (~). Unfortunately, every now and then a field's content includes a tilde.
As the files are not tidy CSV and the tildes are not escaped, this results in records containing too many fields, which causes the database to throw an exception and stop loading.
I know what the record should look like (text, integer, float fields).
Does anyone have suggestions on how to fix the overlong records? I code in Perl, but I am happy with suggestions in Python, JavaScript, or plain English.
You could try to filter out the corrupted lines with something like:
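A minimal sketch of such a filter, written in Python since the question allows it. The names `EXPECTED_FIELDS` and `suspect_lines` and the sample records are illustrative assumptions, not part of the asker's data:

```python
EXPECTED_FIELDS = 10  # assumption: a valid record has exactly 10 fields
DELIMITER = "~"


def suspect_lines(lines, expected_fields=EXPECTED_FIELDS):
    """Yield (line_number, record) for records with too many delimiters."""
    for number, line in enumerate(lines, start=1):
        record = line.rstrip("\n")
        # N fields are separated by N - 1 delimiters; more means an overlong record.
        if record.count(DELIMITER) > expected_fields - 1:
            yield number, record


# Tiny in-memory demo with 3-field records (text, integer, float).
sample = ["foo~42~3.14\n", "foo~bar~42~3.14\n", "baz~7~0.5\n"]
for number, record in suspect_lines(sample, expected_fields=3):
    print(f"{number}: {record}")  # prints "2: foo~bar~42~3.14"
```

For a real file, an open file handle can be passed in place of the `sample` list, since the function only iterates over lines.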
(Assuming your max number of fields is 10.) It will give you a shorter file of suspect lines that you could inspect manually. This is not a foolproof filter: it will also flag valid records, for example ones with tildes inside quoted strings.
If you want something more exact, you can use Text::CSV, but that brings its own difficulties when it comes to broken CSV data. There might be a better (and automatic) way to solve this, but without knowing what your input looks like, it is hard to recommend anything specific.
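To illustrate what an automatic approach might look like, here is a rough Python sketch that relies on the one thing the question does state: the expected type of every field is known. It assumes stray tildes can only occur inside text fields, so it tries every way of gluing adjacent pieces back together and keeps the groupings whose integer and float fields still parse. The function name `repair_candidates` and the three-field schema are hypothetical, chosen only for the example:

```python
from itertools import combinations


def _parses(value, kind):
    """Return True if `value` can be converted to `kind` (str, int or float)."""
    try:
        kind(value)
        return True
    except ValueError:
        return False


def repair_candidates(line, schema, delimiter="~"):
    """Return every regrouping of `line` into len(schema) fields that fits the types."""
    parts = line.split(delimiter)
    n, m = len(parts), len(schema)
    if n < m:
        return []  # too few fields: nothing to merge
    candidates = []
    # Pick m - 1 cut points among the n - 1 gaps; pieces between cuts are re-joined.
    for cuts in combinations(range(1, n), m - 1):
        bounds = (0, *cuts, n)
        fields = [delimiter.join(parts[a:b]) for a, b in zip(bounds, bounds[1:])]
        if all(_parses(field, kind) for field, kind in zip(fields, schema)):
            candidates.append(fields)
    return candidates


schema = [str, int, float]  # assumed record layout: text, integer, float
print(repair_candidates("foo~bar~42~3.14", schema))
# [['foo~bar', '42', '3.14']]
```

When exactly one candidate comes back, the record can be repaired automatically. When there are several (for instance two adjacent text fields, where the extra tilde could belong to either), the line still needs manual inspection. The number of groupings grows quickly with the number of extra tildes, so this is only practical for records that are slightly too long.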