I have a web service that pulls text from an NCLOB column and returns the data via XML. The NCLOB column is populated by extracting text from documents, so there are occasions where invalid XML characters are placed in the XML, causing the consuming system to fail.
As per the W3C, the range of valid characters is:
#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
We have tried a few different RegExp patterns, and we’re close, but we’re not completely there yet. Here is the closest we’ve come. All of the invalid characters are replaced except for the high surrogates (DB9B – DBFF).
REGEXP_REPLACE(
TEXT,
'[^[:print:]' || chr(13) || chr(10) || ']|[' || UNISTR('\FFFE-\FFFF') || ']',
'*')
We have also tried this, but none of the surrogates (D800 – DFFF) are replaced.
REGEXP_REPLACE(REPLACE(TEXT, unistr('\0000'), ' '),
'[' || unistr('\0001-\0008') || ']'
|| '|[' || unistr('\000B-\000C') || ']'
|| '|[' || unistr('\000E-\001F') || ']'
|| '|[' || unistr('\D800-\DFFF') || ']'
|| '|[' || unistr('\FFFE-\FFFF') || ']',' ')
How can we match the high surrogates? Any thoughts or guidance would be most appreciated.
You could write your own function, since
REGEXP_REPLACE does not seem to work for the high surrogates. Here’s an example (tested on 9.2 and 11.2):
It should run with a large NCLOB; here’s an example with a CLOB > 32k characters:
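For illustration, here is a minimal sketch of one way such a function could be written. It is not the code from this answer and has not been run against a real database. It assumes the national character set is AL16UTF16, so that each position in the NCLOB is one UTF-16 code unit, and that the database character set is ASCII-based. The function name strip_invalid_xml_chars and the p_replacement parameter are hypothetical. The sketch walks the NCLOB one code unit at a time, uses ASCIISTR to obtain the code unit's value, and keeps only values inside the W3C ranges below #x10000.

```sql
-- Hypothetical helper: replaces every code unit outside the W3C XML 1.0 ranges.
-- Assumes an AL16UTF16 national character set (one NCLOB position = one UTF-16 code unit).
CREATE OR REPLACE FUNCTION strip_invalid_xml_chars (
   p_text        IN NCLOB,
   p_replacement IN NVARCHAR2 DEFAULT N' '
) RETURN NCLOB
IS
   l_result NCLOB;
   l_char   NVARCHAR2(1);
   l_ascii  VARCHAR2(10);
   l_code   PLS_INTEGER;
BEGIN
   IF p_text IS NULL THEN
      RETURN NULL;
   END IF;

   DBMS_LOB.CREATETEMPORARY(l_result, TRUE);

   FOR i IN 1 .. DBMS_LOB.GETLENGTH(p_text) LOOP
      l_char  := DBMS_LOB.SUBSTR(p_text, 1, i);
      -- ASCIISTR yields '\XXXX' for non-ASCII code units (and for the backslash),
      -- and the character itself for everything else.
      l_ascii := ASCIISTR(l_char);

      IF LENGTH(l_ascii) = 5 THEN
         l_code := TO_NUMBER(SUBSTR(l_ascii, 2), 'XXXX');
      ELSE
         l_code := ASCII(l_ascii);
      END IF;

      IF l_code IN (9, 10, 13)
         OR l_code BETWEEN 32 AND 55295       -- #x20 to #xD7FF
         OR l_code BETWEEN 57344 AND 65533    -- #xE000 to #xFFFD
      THEN
         DBMS_LOB.WRITEAPPEND(l_result, 1, l_char);
      ELSIF p_replacement IS NOT NULL THEN
         -- Control characters, surrogates (D800 to DFFF), FFFE and FFFF land here.
         DBMS_LOB.WRITEAPPEND(l_result, LENGTH(p_replacement), p_replacement);
      END IF;
   END LOOP;

   RETURN l_result;
END strip_invalid_xml_chars;
/

-- Possible usage (documents is a hypothetical table name):
-- SELECT strip_invalid_xml_chars(text, N'*') FROM documents;
```

Because every surrogate code unit is treated as invalid, this matches the replacement the question asks for, but it also replaces valid supplementary characters (#x10000 to #x10FFFF) stored as surrogate pairs. Keeping those would need a check for a high surrogate immediately followed by a low surrogate. Reading and writing one character per DBMS_LOB call is simple but slow on large values; buffering in chunks would be the natural next step.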