Given an input string, generate an output string where all invalid sequences are either removed or replaced with U+FFFD.
Is there a better method than implementing a state-machine char-by-char, or a non-native node.JS module available?
Invalid sequences are, for example, orphaned surrogates "\uD800", or other invalid multi-char sequences.
The regex needed to match invalid sequences depends on what you want to include. To replace orphaned surrogates with U+FFFD, you can use something like this:
If you use the XRegExp library with its Unicode addons, you can use the
\p{Cs}or\p{Surrogate}Unicode category instead of[\ud800-\udfff]. Using XRegExp will also give you easy access to other potentially relevant Unicode properties such as\p{Noncharacter_Code_Point},\p{Co}or\p{Private_Use}, and\p{Cn}or\p{Unassigned}.Since you’re using Node.js, you can install XRegExp via npm using
npm install xregexp. XRegExp’s npm module automatically includes the Unicode addons.