I want to parse UTF-8 in C++. When parsing a new character, I don’t

Asked: June 15, 2026 (2026-06-15T17:45:29+00:00)

I want to parse UTF-8 in C++. When parsing a new character, I don’t know in advance whether it is an ASCII byte or the leader of a multibyte character, and I also don’t know whether my input string is long enough to contain the remaining bytes of that character.

For simplicity, I’d like to name the next four bytes a, b, c and d, and because I am in C++, I want to do it using references.

Is it valid to define those references at the beginning of a function as long as I don’t access them before I know that access is safe? Example:

void parse_utf8_character(const string s) {
    for (size_t i = 0; i < s.size();) {
        const char &a = s[i];
        const char &b = s[i + 1];
        const char &c = s[i + 2];
        const char &d = s[i + 3];

        if (is_ascii(a)) {
            i += 1;
            do_something_only_with(a);
        } else if (is_twobyte_leader(a)) {
            i += 2;
            if (is_safe_to_access_b()) {
                do_something_only_with(a, b);
            }
        }
        ...
    }
}

The above example shows what I want to do semantically. It doesn’t illustrate why I want to do this, but real code will obviously be more involved, so defining b, c and d only at the points where I know access is safe and I need them would be too verbose.

1 Answer

Editorial Team · Answer 1 · 2026-06-15T17:45:30+00:00

There are three takes on this:

Formally
well, who knows. I could find out for you by spending quite some time on it, but then, so could you. Or any reader. And it’s not like that’s very practically useful.
EDIT: OK, looking it up, since you don’t seem happy about me mentioning the formal side without looking it up for you. Formally you’re out of luck:
N3280 (C++11) §5.7/5: “If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.”
Two situations where this can produce undesired behavior: (1) computing an address beyond the end of a segment, and (2) computing an address beyond an array that the compiler knows the size of, with debug checks enabled.
Technically
you’re probably OK as long as you avoid any lvalue-to-rvalue conversion (that is, as long as you never actually read through a reference that might be out of range), because if the references are implemented as pointers, then it’s as safe as pointers, and if the compiler chooses to implement them as aliases, well, that’s also OK.
Economically
relying needlessly on a subtlety wastes your time, and then also the time of others dealing with the code. So, not a good idea. Instead, declare the names only once it’s guaranteed that what they refer to exists.
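To make the last point concrete, here is a minimal sketch of what "declare the names only once they exist" could look like. It is one possible arrangement, not the only one. The names sequence_length, payload and decode_utf8 are made up for this illustration, and emitting U+FFFD for a bad or truncated sequence is an assumed policy. The sketch also does not validate continuation bytes, overlong encodings or surrogates. The idea is to compare the sequence length against the bytes remaining first, and only then bind b, c and d inside the branch that needs them.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: length of the sequence that `lead` introduces,
// or 0 if `lead` cannot start one (continuation byte or invalid leader).
inline std::size_t sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx, ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Hypothetical helper: the low six payload bits of a continuation byte.
inline char32_t payload(char byte) {
    return static_cast<unsigned char>(byte) & 0x3F;
}

std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char a = static_cast<unsigned char>(s[i]);
        const std::size_t len = sequence_length(a);

        // Compare lengths instead of forming s[i + len - 1] first.
        // s.size() - i cannot wrap around because i < s.size() here.
        if (len == 0 || len > s.size() - i) {
            out.push_back(0xFFFD);  // assumed policy: emit U+FFFD, skip one byte
            i += 1;
            continue;
        }
        assert(i + len <= s.size());  // every index used below is in range

        char32_t cp = 0;
        switch (len) {
        case 1:
            cp = a;
            break;
        case 2: {
            const char& b = s[i + 1];  // bound only once it is known to exist
            cp = (static_cast<char32_t>(a & 0x1F) << 6) | payload(b);
            break;
        }
        case 3: {
            const char& b = s[i + 1];
            const char& c = s[i + 2];
            cp = (static_cast<char32_t>(a & 0x0F) << 12)
               | (payload(b) << 6) | payload(c);
            break;
        }
        case 4: {
            const char& b = s[i + 1];
            const char& c = s[i + 2];
            const char& d = s[i + 3];
            cp = (static_cast<char32_t>(a & 0x07) << 18) | (payload(b) << 12)
               | (payload(c) << 6) | payload(d);
            break;
        }
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

With this layout, decode_utf8("\xC3\xA9") yields the single code point U+00E9, while the truncated input "\xC3" yields U+FFFD without ever touching a byte past the end. The extra braces per case cost a few lines, but each reference lives exactly as long as the thing it names.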
