Given an input string, generate an output string where all invalid sequences are either

Asked: June 5, 2026 (2026-06-05T09:33:11+00:00)

Given an input string, generate an output string where all invalid sequences are either removed or replaced with U+FFFD.

Is there a better method than implementing a character-by-character state machine, or is there a non-native Node.js module available?

Invalid sequences include, for example, orphaned surrogates such as "\uD800" and other invalid multi-char sequences.


1 Answer

Editorial Team · Answer 1 · 2026-06-05T09:33:13+00:00

The regex needed to match invalid sequences depends on what you want to include. To replace orphaned surrogates with U+FFFD, you can use something like this:

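// Match valid surrogate pairs first, so only orphaned surrogates reach the second alternative.
var surrogates = /[\ud800-\udbff][\udc00-\udfff]|[\ud800-\udfff]/g;
str = str.replace(surrogates, function ($0) {
    // A two-unit match is a valid pair and is kept; a single unit is an orphan.
    return $0.length > 1 ? $0 : '\ufffd';
});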
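
Since the question allows invalid sequences to be either removed or replaced, the same regex can be wrapped in a small helper that covers both cases. This is only a sketch: the function name `replaceLoneSurrogates` and its `replacement` parameter are made up for illustration and are not part of any library.

```javascript
// Hypothetical helper: name and parameters are illustrative assumptions.
// Replaces each orphaned surrogate with `replacement` (U+FFFD by default).
// Pass an empty string to remove orphaned surrogates instead.
function replaceLoneSurrogates(str, replacement = '\ufffd') {
  // The first alternative consumes valid high+low pairs, so the second
  // alternative only ever matches a surrogate that has no partner.
  const surrogates = /[\ud800-\udbff][\udc00-\udfff]|[\ud800-\udfff]/g;
  return str.replace(surrogates, (match) =>
    match.length > 1 ? match : replacement
  );
}

// Usage: replace, remove, and leave a valid pair untouched.
console.log(replaceLoneSurrogates('a\ud800b') === 'a\ufffdb');           // true
console.log(replaceLoneSurrogates('a\ud800b', '') === 'ab');             // true
console.log(replaceLoneSurrogates('\ud83d\ude00') === '\ud83d\ude00');   // true
```

Returning the replacement from a callback rather than passing it as a string means characters such as `$` in `replacement` are inserted literally.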

If you use the XRegExp library with its Unicode addons, you can use the \p{Cs} or \p{Surrogate} Unicode category instead of [\ud800-\udfff]. Using XRegExp will also give you easy access to other potentially relevant Unicode properties such as \p{Noncharacter_Code_Point}, \p{Co} or \p{Private_Use}, and \p{Cn} or \p{Unassigned}.

Since you’re using Node.js, you can install XRegExp via npm using npm install xregexp. XRegExp’s npm module automatically includes the Unicode addons.
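
As an alternative that needs no library, regexes in current Node.js releases support `\p{...}` property escapes natively when the `u` flag is set. The sketch below uses that built-in syntax rather than XRegExp, so it is a swapped-in technique and not the XRegExp API. The function name `sanitize` is hypothetical, and treating noncharacters as invalid is an assumption about what you want to include.

```javascript
// With the `u` flag the regex works on code points, so a valid surrogate
// pair is seen as one astral code point and does not match \p{Surrogate}.
// Only orphaned surrogates match.
const loneSurrogate = /\p{Surrogate}/gu;

// Assumption: noncharacters such as U+FFFE should also be treated as invalid.
const nonCharacter = /\p{Noncharacter_Code_Point}/gu;

// Hypothetical helper: replaces both kinds of code point with U+FFFD.
function sanitize(str) {
  return str
    .replace(loneSurrogate, '\ufffd')
    .replace(nonCharacter, '\ufffd');
}

console.log(sanitize('a\ud800b') === 'a\ufffdb');         // true
console.log(sanitize('\ud83d\ude00') === '\ud83d\ude00'); // true
console.log(sanitize('x\ufffe') === 'x\ufffd');           // true
```

If only orphaned surrogates matter, Node.js 20 and later also provide the built-in `String.prototype.toWellFormed()`, which replaces them with U+FFFD without any regex.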
