The Archive Base

Editorial Team
Asked: June 15, 2026

I want to parse UTF-8 in C++. When parsing a new character, I don’t

I want to parse UTF-8 in C++. When parsing a new character, I don’t know in advance if it is an ASCII byte or the leader of a multibyte character, and also I don’t know if my input string is sufficiently long to contain the remaining characters.

For simplicity, I’d like to name the next four bytes a, b, c and d, and because I am in C++, I want to do it using references.

Is it valid to define those references at the beginning of a function as long as I don’t access them before I know that access is safe? Example:

void parse_utf8_character(const string s) {
    for (size_t i = 0; i < s.size();) {
        const char &a = s[i];
        const char &b = s[i + 1];
        const char &c = s[i + 2];
        const char &d = s[i + 3];

        if (is_ascii(a)) {
            i += 1;
            do_something_only_with(a);
        } else if (is_twobyte_leader(a)) {
            i += 2;
            if (is_safe_to_access_b()) {
                do_something_only_with(a, b);
            }
        }
        ...
    }
}

The above example shows what I want to do semantically. It doesn’t illustrate why I want to do this, but real code will obviously be more involved, so defining b, c and d only at the point where I know access is safe and I need them would be too verbose.

1 Answer

  1. Editorial Team
    Added an answer on June 15, 2026 at 5:45 pm

    There are three takes on this:

    • Formally
      well, who knows. I could find out for you by spending quite some time on it, but then, so could you. Or any reader. And it’s not like the answer is very useful in practice.
      EDIT: OK, I looked it up, since you don’t seem happy about me mentioning the formal side without doing so. Formally you’re out of luck:
      N3280 (C++11) §5.7/5: “If both the pointer operand and the result point to elements of the same array object, or one past the last element of the array object, the evaluation shall not produce an overflow; otherwise, the behavior is undefined.”

      In the example, s[i + 3] can compute an address more than one past the end of the string’s storage, which falls under “otherwise”. Two situations where this can produce undesired behavior in practice: (1) computing an address beyond the end of a segment, and (2) computing an address beyond an array that the compiler knows the size of, with debug checks enabled.

    • Technically
      you’re probably OK as long as you avoid any lvalue-to-rvalue conversion (that is, you never read through a reference before you know it is valid), because if the references are implemented as pointers, then it’s as safe as pointers, and if the compiler chooses to implement them as aliases, well, that’s also OK.

    • Economically
      relying needlessly on a subtlety wastes your time, and then also the time of others who have to deal with the code. So, not a good idea. Instead, declare the names only once what they refer to is guaranteed to exist.
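
    The last point can be sketched roughly as follows. This is only an illustration, not the asker's real parser: the names Decoded, byte_at, is_continuation, decode_at and count_characters are hypothetical, and the sketch copies bytes into local values instead of binding references. It checks the remaining length before naming b, c or d, and it deliberately skips stricter validation such as rejecting overlong encodings or surrogate code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type: the decoded code point and the number of bytes used.
// length == 0 signals a malformed or truncated sequence.
struct Decoded {
    char32_t code_point;
    std::size_t length;
};

// Reads one byte as unsigned so that bit masks behave predictably.
inline unsigned char byte_at(const std::string &s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// True for continuation bytes of the form 10xxxxxx.
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Decodes the character that starts at s[i]. Precondition: i < s.size().
inline Decoded decode_at(const std::string &s, std::size_t i) {
    assert(i < s.size());
    const Decoded invalid = {0, 0};
    const std::size_t remaining = s.size() - i;
    const unsigned char a = byte_at(s, i);

    if (a < 0x80) {  // ASCII: only a is needed.
        return {static_cast<char32_t>(a), 1};
    }
    if ((a & 0xE0) == 0xC0) {  // Two-byte leader.
        if (remaining < 2) return invalid;
        const unsigned char b = byte_at(s, i + 1);  // Named only after the length check.
        if (!is_continuation(b)) return invalid;
        return {static_cast<char32_t>(((a & 0x1F) << 6) | (b & 0x3F)), 2};
    }
    if ((a & 0xF0) == 0xE0) {  // Three-byte leader.
        if (remaining < 3) return invalid;
        const unsigned char b = byte_at(s, i + 1);
        const unsigned char c = byte_at(s, i + 2);
        if (!is_continuation(b) || !is_continuation(c)) return invalid;
        return {static_cast<char32_t>(((a & 0x0F) << 12) | ((b & 0x3F) << 6) | (c & 0x3F)), 3};
    }
    if ((a & 0xF8) == 0xF0) {  // Four-byte leader.
        if (remaining < 4) return invalid;
        const unsigned char b = byte_at(s, i + 1);
        const unsigned char c = byte_at(s, i + 2);
        const unsigned char d = byte_at(s, i + 3);
        if (!is_continuation(b) || !is_continuation(c) || !is_continuation(d)) return invalid;
        return {static_cast<char32_t>(((a & 0x07) << 18) | ((b & 0x3F) << 12) |
                                      ((c & 0x3F) << 6) | (d & 0x3F)), 4};
    }
    return invalid;  // Stray continuation byte or invalid leader.
}

// Counts well-formed characters, stopping at the first malformed sequence.
inline std::size_t count_characters(const std::string &s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size();) {
        const Decoded d = decode_at(s, i);
        if (d.length == 0) break;
        i += d.length;
        ++count;
    }
    return count;
}
```

    Each branch repeats a length check, which is the verbosity the question wanted to avoid, but no name ever refers to a byte that might not exist, so nothing depends on how out-of-range references happen to be implemented.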
