The Archive Base

Editorial Team
Asked: May 22, 2026

I am working on a rudimentary hand-coded lexical scanner and wish to support UTF-8

I am working on a rudimentary hand-coded lexical scanner and wish to support UTF-8 input (it’s not 1970 anymore!). Input characters are read from stdin or a file one at a time and pushed into a buffer until whitespace is seen, etc. I thought about writing my own wrapper for fgetc() that would instead return a char[] of the bytes that make up the UTF-8 character, and working with the result as a string. It would be easy enough, but it would become a slippery slope. I’d rather not waste time re-inventing the wheel and would instead use an existing, tested library like ICU. So I now have code without UTF-8 support that works with fgetc(), isspace(), strcmp(), etc., which I am trying to update to use ICU. This is my first foray into ICU. I have been reading through the documentation and trying to find usage examples with Google code search, but there are still some points of confusion I’m hoping someone will be able to clarify.

The u_fgetc() function returns UChar, and u_fgetcx() returns UChar32… the documentation recommends using u_fgetcx() to read codepoints, so that’s what I’m starting with. I’m keeping the same approach as above, but I’m pushing UChar32s into a buffer instead of chars.

  • What is the proper way to compare a character against a known value? Originally I was able to do if (c == '+') to check if the plus-sign was fetched from the input. GCC doesn’t complain when c is a UChar32 (which is then a comparison between UChar32 and char) but is this really proper?

  • I was able to use strcmp() to compare the buffered characters with a known value, for example if (strcmp(buf, "else") == 0). There is u_strcmp() provided by ICU, and I think I may need to use the U_STRING_DECL and U_STRING_INIT macros to specify the known literal, but I am not certain. The documentation shows they result in UChar[], though I assume I need UChar32[], and I’m uncertain how to use them correctly anyway. Any guidance here would be welcomed.

  • After reading in a series of numeric characters I have been converting them with strtol() so I can work with them. Is there a similar function made available by ICU since I am converting UChar32[] now?

1 Answer

  1. Editorial Team
    Added an answer on May 22, 2026 at 6:58 pm

    UChar holds a single UTF-16 code unit, while UChar32 holds a full code point. If your input stays within the Basic Multilingual Plane (BMP), one UChar per character is sufficient, and indeed most ICU functions operate on UChar[].
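To make the code unit versus code point distinction concrete, here is a minimal sketch in plain C. It deliberately does not use ICU: code_unit_t and code_point_t are hypothetical stand-ins for UChar and UChar32, and both helper names are made up for illustration. It also assumes well-formed input, i.e. a valid lead/trail surrogate pair.

```c
#include <stdint.h>

/* Stand-ins so the sketch compiles without ICU headers (assumption:
   they mirror the roles of UChar and UChar32, not their exact definitions). */
typedef uint16_t code_unit_t;   /* plays the role of UChar   */
typedef int32_t  code_point_t;  /* plays the role of UChar32 */

/* Number of UTF-16 code units needed to encode one code point:
   1 inside the BMP, 2 (a surrogate pair) above it. */
int utf16_length(code_point_t cp)
{
    return (cp >= 0x10000) ? 2 : 1;
}

/* Combine a lead (0xD800..0xDBFF) and trail (0xDC00..0xDFFF) surrogate
   into the supplementary code point they encode. No validation is done. */
code_point_t combine_surrogates(code_unit_t lead, code_unit_t trail)
{
    code_point_t high = (code_point_t)(lead - 0xD800) << 10;
    code_point_t low  = (code_point_t)(trail - 0xDC00);
    return 0x10000 + (high | low);
}
```

For example, U+20AC fits in one code unit, while U+1F600 lies outside the BMP and needs the pair 0xD83D, 0xDE00. ICU's own utf16.h provides macros for this kind of work, so in real code you would likely not write these helpers yourself.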

    Strongly recommended reading is the ICU User Guide, which explains most of the internals and best practices.

    • What is the proper way to compare a Unicode character variable against a known value?
      A UChar or UChar32 is just another integer type with a certain width and signedness, and it can be compared to other integer types with the usual caveats and restrictions. To write a character value, C99 (section 6.4.3) provides universal character names: \u followed by four hex digits, or \U followed by eight hex digits, specifying the ISO/IEC 10646 “short identifier”. Values below 0x00A0, with the exceptions of 0x0024 ('$'), 0x0040 ('@'), and 0x0060 (backtick), cannot be written this way (but can be represented by casting a simple character constant to UChar). The range 0xD800 through 0xDFFF is excluded as well, since UTF-16 uses it for surrogates.

    • How to define Unicode string literals?
      U_STRING_DECL and U_STRING_INIT are indeed what you’re looking for. (As written above, ICU mainly operates on UChar[].) If you were using C++ instead of C, UNICODE_STRING_SIMPLE (optionally followed by getTerminatedBuffer() to yield UChar[] again) provides a much more comfortable way of defining Unicode string literals.

    • How to convert a Unicode string representing a number into that number’s value?
      unum_parse() and the related functions in unum.h will help you there. Like most of ICU, they take UChar[] text.
