I have a web service that pulls text from an NCLOB column and returns

Question

Asked: June 17, 2026

I have a web service that pulls text from an NCLOB column and returns the data via XML. The NCLOB column is populated by extracting text from documents, so there are occasions where invalid XML characters are placed in the XML, causing the consuming system to fail.

As per the W3C, the range of valid characters is:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
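
For reference, the range above can be written as a small PL/SQL predicate. This is only a sketch: the function name is_valid_xml_codepoint and its signature are made up for illustration, and it checks a numeric code point rather than a character, so it leaves open how the code point is obtained from the NCLOB.

```sql
-- Hypothetical helper; the name and signature are illustrative only.
-- Returns TRUE when p_code falls inside the W3C character range quoted above,
-- FALSE otherwise (a NULL input yields NULL).
CREATE OR REPLACE FUNCTION is_valid_xml_codepoint(p_code IN PLS_INTEGER)
   RETURN BOOLEAN IS
BEGIN
   RETURN p_code IN (9, 10, 13)                          -- #x9 | #xA | #xD
       OR p_code BETWEEN to_number('20', 'xxxxxx')
                     AND to_number('D7FF', 'xxxxxx')     -- [#x20-#xD7FF]
       OR p_code BETWEEN to_number('E000', 'xxxxxx')
                     AND to_number('FFFD', 'xxxxxx')     -- [#xE000-#xFFFD]
       OR p_code BETWEEN to_number('10000', 'xxxxxx')
                     AND to_number('10FFFF', 'xxxxxx');  -- [#x10000-#x10FFFF]
END;
/

-- Usage: a surrogate code unit such as DB9B lies outside every range.
BEGIN
   IF NOT is_valid_xml_codepoint(to_number('DB9B', 'xxxx')) THEN
      dbms_output.put_line('DB9B is not a valid XML character');
   END IF;
END;
/
```

Everything from D800 to DFFF fails the check, along with FFFE and FFFF, which matches the comment in the W3C definition.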

We have tried a few different RegExp patterns, and we’re close, but we’re not completely there yet. Here is the closest we’ve come. All of the invalid characters are replaced except for the high surrogates (DB9B – DBFF).

REGEXP_REPLACE(
    TEXT,
    '[^[:print:]' || chr(13) || chr(10) || ']|[' || UNISTR('\FFFE-\FFFF') || ']',
    '*')

We have also tried this, but none of the surrogates (D800 – DFFF) are replaced.

REGEXP_REPLACE(REPLACE(TEXT, unistr('\0000'), ' '),
      '[' || unistr('\0001-\0008') || ']' 
  || '|[' || unistr('\000B-\000C') || ']'
  || '|[' || unistr('\000E-\001F') || ']'
  || '|[' || unistr('\D800-\DFFF') || ']' 
  || '|[' || unistr('\FFFE-\FFFF') || ']',' ')

How can we match the high surrogates? Any thoughts or guidance would be most appreciated.

1 Answer

Editorial Team · Answer 1 · 2026-06-17T12:34:01+00:00

You could write your own function, since REGEXP_REPLACE does not seem to work for the high surrogates. Here’s an example (tested on 9.2 and 11.2):

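CREATE OR REPLACE FUNCTION replace_invalid(p_clob NCLOB) RETURN NCLOB IS
   l_result NCLOB;
   l_char   NVARCHAR2(1 char);
BEGIN
   -- Walk the NCLOB one character at a time.
   FOR i IN 1 .. length(p_clob) LOOP
      l_char := substr(p_clob, i, 1);
      -- Compare the character's numeric value against the high surrogates
      -- that the regular expression left in place (DB9B to DBFF).
      IF utl_raw.cast_to_binary_integer(utl_raw.cast_to_raw(l_char)) 
          BETWEEN to_number('DB9B', 'xxxx') AND to_number('DBFF', 'xxxx') THEN
         l_result := l_result || N'*';     -- replace with a placeholder
      ELSE
         l_result := l_result || l_char;   -- keep everything else unchanged
      END IF;
   END LOOP;
   RETURN l_result;
END;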

It should also work with a large NCLOB. Here’s an example with a CLOB of more than 32k characters:

DECLARE
   l_in  NCLOB;
   l_out NCLOB;
BEGIN
   FOR i IN 1 .. to_number('DBFF', 'xxxx') LOOP
      l_in := l_in || nchr(i);
   END LOOP;
   dbms_output.put_line('l_in length:' || length(l_in));
   l_out := replace_invalid(l_in);
   dbms_output.put_line('l_out length:' || length(l_out));
   dbms_output.put_line('chars replaced:' 
                      || (length(l_out) - length(REPLACE(l_out, '*', ''))));
END;
/

l_in length:56319
l_out length:56319
chars replaced:102
