(Not a duplicate of 4079956)
I have an SQL_ASCII database, with LC_CTYPE=LC_COLLATE="C", which contains mostly ASCII data as well as some non-ASCII characters from some codepage, say LATIN1.
I want to transcode, in-place (no pg_dump/pg_restore), all non-ASCII codepoints from the LATIN1 codepage to UTF-8 and then alter the database encoding to UTF-8, e.g.:
-- change encoding first, transcode data after
UPDATE pg_database SET encoding=pg_char_to_encoding('UTF8')
WHERE datname='sqlasciidb';
UPDATE tbl SET str=convert_from(str::bytea, 'LATIN1')
WHERE str::bytea<>convert_from(str::bytea, 'LATIN1')::bytea;
or
-- transcode data first, change encoding after
CREATE DOMAIN my_varlena AS bytea;
CREATE CAST (my_varlena AS text) WITHOUT FUNCTION;
UPDATE tbl SET str=convert(str::bytea, 'LATIN1','UTF8')::my_varlena::text
WHERE str::bytea<>convert(str::bytea, 'LATIN1', 'UTF8');
DROP DOMAIN my_varlena CASCADE;
UPDATE pg_database SET encoding=pg_char_to_encoding('UTF8')
WHERE datname='sqlasciidb';
What, if anything, is wrong with the above approach?
Some problems I can see:
- after pg_database is updated, all connections to the database should be closed and reopened for the backend to take into account the new encoding
- all indexes based on the altered columns should be rebuilt
Anything else?
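For reference, the two follow-up steps listed above could look something like the sketch below. It assumes PostgreSQL 9.2 or later (where pg_stat_activity exposes a pid column; older releases call it procpid), a superuser session, and the table name tbl from the examples above.

```sql
-- Disconnect every other session on the database so that newly started
-- backends pick up the changed encoding (assumes PostgreSQL 9.2+, superuser).
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'sqlasciidb'
  AND pid <> pg_backend_pid();

-- After reconnecting to sqlasciidb, rebuild all indexes on the altered table.
REINDEX TABLE tbl;
```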
Looks like you’ve got the main gist of it. I assume you’ve already tried this with a test database? I did give it a quick test when suggesting it to someone and it seemed to work ok for me, although this was far from a thorough test.
My gut feeling is to transcode first and change the encoding after, because while the database is still in SQL_ASCII, you aren’t going to have to deal with errors from PostgreSQL trying to interpret not-yet-transcoded or improperly-transcoded data, and can look at data with relative impunity. OTOH changing the encoding first guarantees that only subsequently-connecting backends will write data in UTF8…
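While the database is still SQL_ASCII, something like the following could be used to look at the affected rows before touching them. This is only a sketch: it reuses tbl and str from the question, assumes standard_conforming_strings is on (so the backslashes reach the regex engine unchanged), and the LIMIT is arbitrary.

```sql
-- Preview rows containing at least one byte outside the ASCII range,
-- showing the stored bytes next to the bytes that a
-- LATIN1 -> UTF8 conversion would produce.
SELECT str,
       convert_to(str, 'SQL_ASCII') AS stored_bytes,
       convert(convert_to(str, 'SQL_ASCII'), 'LATIN1', 'UTF8') AS utf8_bytes
FROM tbl
WHERE str ~ '[^\x01-\x7F]'
LIMIT 20;
```

Comparing stored_bytes with utf8_bytes on a handful of rows is a cheap way to confirm that the data really is LATIN1 and has not already been stored as UTF-8 by some client.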
Also check things like function bodies, view definitions, constraint definitions, etc., which may need transcoding too (you’d hope not, but…).
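A rough way to run that check is to scan the relevant catalogs for non-ASCII bytes. The query below is a sketch that covers only functions, views, and constraints. Other places (comments, column defaults, and so on) would need similar treatment, and it again assumes standard_conforming_strings is on.

```sql
-- Functions whose source text contains non-ASCII bytes
SELECT p.oid::regprocedure::text AS object, 'function' AS kind
FROM pg_proc p
WHERE p.prosrc ~ '[^\x01-\x7F]'
UNION ALL
-- Views whose definition contains non-ASCII bytes
SELECT c.oid::regclass::text, 'view'
FROM pg_class c
WHERE c.relkind = 'v'
  AND pg_get_viewdef(c.oid) ~ '[^\x01-\x7F]'
UNION ALL
-- Constraints whose definition contains non-ASCII bytes
SELECT con.conname::text, 'constraint'
FROM pg_constraint con
WHERE pg_get_constraintdef(con.oid) ~ '[^\x01-\x7F]';
```

An empty result suggests that only table data needs transcoding.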