The Archive Base


Editorial Team
Asked: May 25, 2026


Let's say I have a string:

char theString[] = "你们好āa";

Given that my encoding is UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the Latin character with the macron is two bytes, and the ‘a’ is one byte):

strlen(theString) == 12

How can I count the number of characters? How can I do the equivalent of subscripting so that:

theString[2] == "好"

How can I slice and concatenate such strings?


1 Answer

  1. Editorial Team
    Added an answer on May 25, 2026 at 11:58 am

    Count only the bytes whose top two bits are not 10 (i.e., every byte less than 0x80 or greater than 0xbf).

    That’s because every byte with the top two bits set to 10 is a UTF-8 continuation byte.

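    A description of the UTF-8 encoding (RFC 3629 is the specification) shows why strlen still works on a UTF-8 string: no byte of a multi-byte sequence is zero, so strlen stops only at the real terminator, but it returns the number of bytes rather than the number of characters.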
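
The counting rule above can be sketched in a few lines of C. The function name utf8_strlen is made up for this illustration, and the code assumes the input is valid, NUL-terminated UTF-8; it does no validation.

```c
#include <stddef.h>

/* Count UTF-8 code points in a NUL-terminated string.
   Every byte that is NOT a continuation byte (binary 10xxxxxx)
   starts a new code point, so only those bytes are counted.
   Assumption: s holds valid UTF-8; malformed input is not detected. */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        /* Mask the top two bits and skip bytes of the form 10xxxxxx. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string in the question, strlen returns 12 and this function returns 5.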

    For slicing and dicing UTF-8 strings, you basically have to follow the same rules. Any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all others are continuation bytes.
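
As a rough sketch of how subscripting could follow from that rule, the two helpers below locate the code point at a zero-based index and report how many bytes it occupies. The names utf8_index and utf8_char_len are hypothetical, and valid UTF-8 input is assumed.

```c
#include <stddef.h>

/* Return a pointer to the first byte of the code point at zero-based
   index n, or NULL if the string has fewer than n + 1 code points.
   Assumption: s is valid, NUL-terminated UTF-8. */
const char *utf8_index(const char *s, size_t n)
{
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {  /* start of a code point */
            if (n == 0)
                return s;
            n--;
        }
    }
    return NULL;
}

/* Number of bytes used by the code point that starts at p:
   the lead byte plus any continuation bytes that follow it.
   Returns 0 if p points at the terminator. */
size_t utf8_char_len(const char *p)
{
    size_t len;

    if (*p == '\0')
        return 0;
    len = 1;
    while (((unsigned char)p[len] & 0xC0) == 0x80)
        len++;
    return len;
}
```

With zero-based indexing, 好 is code point 2 of the question's string: utf8_index returns a pointer 6 bytes into the buffer and utf8_char_len reports 3 bytes. Note that each lookup walks the string from the start, so indexing is linear in the string length rather than constant time.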

    Your best bet, if you don’t want to use a third-party library, is to simply provide functions along the lines of:

    void utf8left (char *destbuff, const char *srcbuff, size_t sz);
    void utf8mid  (char *destbuff, const char *srcbuff, size_t pos, size_t sz);
    void utf8rest (char *destbuff, const char *srcbuff, size_t pos);
    

    to get, respectively:

    • the leftmost sz UTF-8 code points of a string.
    • the sz UTF-8 code points of a string, starting at code point pos.
    • the rest of the UTF-8 code points of a string, starting at code point pos.

    These functions make decent building blocks for manipulating the strings well enough for your purposes.
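
One possible shape for these three functions is sketched below. Several things here are assumptions rather than part of the answer: pos and sz are read as counts of code points (pos is zero-based), the functions return void, the helper utf8_skip is invented for the example, the input is valid UTF-8, and the caller guarantees destbuff is large enough, since no bounds checking is done.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: advance past n code points, stopping early
   at the terminator if the string is shorter than that. */
const char *utf8_skip(const char *s, size_t n)
{
    while (*s != '\0' && n > 0) {
        s++;                                        /* step over the lead byte */
        while (((unsigned char)*s & 0xC0) == 0x80)
            s++;                                    /* and its continuation bytes */
        n--;
    }
    return s;
}

/* Copy sz code points, starting at code point pos, into destbuff. */
void utf8mid(char *destbuff, const char *srcbuff, size_t pos, size_t sz)
{
    const char *start = utf8_skip(srcbuff, pos);
    const char *end = utf8_skip(start, sz);
    size_t nbytes = (size_t)(end - start);

    memcpy(destbuff, start, nbytes);
    destbuff[nbytes] = '\0';
}

/* Copy the leftmost sz code points into destbuff. */
void utf8left(char *destbuff, const char *srcbuff, size_t sz)
{
    utf8mid(destbuff, srcbuff, 0, sz);
}

/* Copy everything from code point pos to the end into destbuff. */
void utf8rest(char *destbuff, const char *srcbuff, size_t pos)
{
    strcpy(destbuff, utf8_skip(srcbuff, pos));
}
```

Concatenation needs no special handling: joining two valid UTF-8 strings byte for byte gives a valid UTF-8 string, so strcat works as usual once the slices have been cut on code point boundaries.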


    However, you may need to tighten up your definition of what a character is, and hence how to calculate the size of a string.

    If you consider a character to be a Unicode code point, the information above is perfectly adequate.

    But you may prefer a different approach. Unicode Standard Annex #29, which details grapheme cluster boundaries, has this snippet:

    It is important to recognize that what the user thinks of as a "character" – a basic unit of a writing system for a language – may not be just a single Unicode code point.

    One simple example is g̈, which can be thought of as a single character but consists of the two Unicode code points:

    • 0067 (g) LATIN SMALL LETTER G; and
    • 0308 (◌̈ ) COMBINING DIAERESIS.

    That would show up as two distinct Unicode characters were you to use the rule "any byte not of the binary form 10xxxxxx is the start of a new character".

    Annex 29 also calls these grapheme clusters by a more user-friendly name, user-perceived characters. If it’s those you wish to count, that annex gives further details.
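
To make the difference concrete, here is a deliberately crude sketch that decodes each code point and leaves out those in the Combining Diacritical Marks block (U+0300 to U+036F). This is not the Annex 29 algorithm: real grapheme cluster segmentation depends on Unicode property tables and covers far more than this one block, which is why a library such as ICU is the usual route. The function names are made up for the example, and valid UTF-8 input is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode the code point starting at *p and advance *p past it.
   Assumption: valid UTF-8; malformed sequences are not rejected. */
uint32_t utf8_next(const char **p)
{
    const unsigned char *s = (const unsigned char *)*p;
    uint32_t cp;
    int extra;   /* number of continuation bytes expected */

    if (s[0] < 0x80)      { cp = s[0];        extra = 0; }  /* 0xxxxxxx */
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; extra = 1; }  /* 110xxxxx */
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; extra = 2; }  /* 1110xxxx */
    else                  { cp = s[0] & 0x07; extra = 3; }  /* 11110xxx */

    s++;
    while (extra-- > 0 && (*s & 0xC0) == 0x80)
        cp = (cp << 6) | (uint32_t)(*s++ & 0x3F);

    *p = (const char *)s;
    return cp;
}

/* Count code points, skipping combining marks in U+0300..U+036F.
   Only a rough approximation of user-perceived characters. */
size_t utf8_count_base_chars(const char *s)
{
    size_t count = 0;

    while (*s != '\0') {
        uint32_t cp = utf8_next(&s);
        if (cp < 0x0300 || cp > 0x036F)
            count++;
    }
    return count;
}
```

For g̈ (bytes 0x67 0xCC 0x88) this returns 1, whereas counting every non-continuation byte gives 2.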

© 2021 The Archive Base. All Rights Reserved