For legacy reasons, we have a VARCHAR2 column in our Oracle 10 database (where the character encoding is set to AL32UTF8) that contains some non-UTF-8 values. Each value is always in one of these character sets:
- US-ASCII
- UTF-8
- CP1252
- Latin-1
I’ve written a Perl function to fix broken values outside the database. For a value from this database column, it loops through this list of encodings and tries to convert the value to UTF-8. If the conversion fails, it tries the next encoding. The first one to convert without error is the value we keep. Now, I would like to replicate this functionality inside the database so that anyone can use it.
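The fallback loop described above can be sketched roughly as follows. This is a Python illustration of the same idea, not the Perl function itself; the name `decode_first_match` and the `CANDIDATES` tuple are hypothetical. US-ASCII is omitted from the list because every ASCII string is already valid UTF-8.

```python
from typing import Sequence, Tuple

# Candidate encodings, tried in order (hypothetical list for this sketch).
# US-ASCII is a strict subset of UTF-8, so it needs no separate entry.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def decode_first_match(
    raw: bytes, encodings: Sequence[str] = CANDIDATES
) -> Tuple[str, str]:
    """Return (decoded text, encoding name) for the first encoding that decodes raw."""
    for name in encodings:
        try:
            # bytes.decode() is strict by default: it raises instead of
            # substituting a replacement character.
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("no candidate encoding could decode the value")


if __name__ == "__main__":
    for sample in (b"caf\xc3\xa9", b"caf\xe9", b"\x81"):
        text, used = decode_first_match(sample)
        print(repr(sample), "->", repr(text), "via", used)
```

The important property is that each decoding step is strict, so a failure is reported as an error rather than hidden behind a replacement character. That is exactly what CONVERT does not offer.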
However, all I can find for this is the CONVERT function, which never fails, but inserts a replacement character for characters it does not recognize. So there is no way, as far as I can tell, to know when the conversion failed.
Therefore, I have two questions:
- Is there some existing interface that tries to convert a string from each of a list of encodings in turn, returning the first conversion that succeeds?
- And if not, is there some other interface that indicates failure when it’s not able to convert a string from a given encoding? If so, then I could write the function described above myself.
UPDATE:
For reference, I have written this PostgreSQL function in PL/pgSQL that does exactly what I need:
CREATE OR REPLACE FUNCTION encoding_utf8(
    bytea
) RETURNS TEXT LANGUAGE PLPGSQL STRICT IMMUTABLE AS $$
DECLARE
    encoding TEXT;
BEGIN
    FOREACH encoding IN ARRAY ARRAY[
        'UTF8',
        'WIN1252',
        'LATIN1'
    ] LOOP
        BEGIN
            RETURN convert_from($1, encoding);
        EXCEPTION WHEN character_not_in_repertoire OR untranslatable_character THEN
            CONTINUE;
        END;
    END LOOP;
END;
$$;
I’d dearly love to know how to do the equivalent in Oracle.
Thanks to the key information about the illegal characters in UTF-8 from @collapsar, as well as some digging by a co-worker, I’ve come up with this:
Curiously, it never gets to WE8ISO8859P1: WE8MSWIN1252 converts every one of the 800 or so bad values I have without complaint. The same is not true of my Perl and PostgreSQL implementations, where CP1252 fails for some values but ISO-8859-1 succeeds. Still, the values from Oracle seem adequate and appear to be valid Unicode (tested by loading them into PostgreSQL), so I can’t complain. This will be good enough to sanitize my data, I think.
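One possible explanation for the difference, offered as a guess rather than something verified against Oracle: CP1252 leaves a handful of byte values unassigned. A strict decoder rejects those bytes, whereas a lenient converter could pass them through as control characters. The sketch below uses Python's cp1252 codec purely as a stand-in for a strict decoder (it is not the Perl or PostgreSQL code) and lists the bytes it refuses. The helper name `undecodable_bytes` is hypothetical.

```python
from typing import List


def undecodable_bytes(encoding: str) -> List[int]:
    """Return every single-byte value that the strict codec refuses to decode."""
    rejected = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            rejected.append(value)
    return rejected


if __name__ == "__main__":
    # Bytes that are unassigned in CP1252 according to Python's codec.
    print([hex(v) for v in undecodable_bytes("cp1252")])
    # Latin-1 assigns a character to every byte, so nothing is rejected.
    print(undecodable_bytes("latin-1"))
```

If the values that fail under CP1252 in Perl and PostgreSQL all contain one of these bytes, that would be consistent with Oracle simply being more permissive about them.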