Sign Up

Sign Up to our social questions and Answers Engine to ask questions, answer people’s questions, and connect with other people.

Have an account? Sign In

Have an account? Sign In Now

Sign In

Login to our social questions & Answers Engine to ask questions answer people’s questions & connect with other people.

Sign Up Here

Forgot Password?

Don't have account, Sign Up Here

Forgot Password

Lost your password? Please enter your email address. You will receive a link and will create a new password via email.

Have an account? Sign In Now

You must login to ask a question.

Forgot Password?

Need An Account, Sign Up Here

Please briefly explain why you feel this question should be reported.

Please briefly explain why you feel this answer should be reported.

Please briefly explain why you feel this user should be reported.

Sign InSign Up

The Archive Base

The Archive Base Logo The Archive Base Logo

The Archive Base Navigation

  • Home
  • SEARCH
  • About Us
  • Blog
  • Contact Us
Search
Ask A Question

Mobile menu

Close
Ask a Question
  • Home
  • Add group
  • Groups page
  • Feed
  • User Profile
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Buy Points
  • Users
  • Help
  • Buy Theme
  • SEARCH
Home/ Questions/Q 6028429
In Process

The Archive Base Latest Questions

Editorial Team
  • 0
Editorial Team
Asked: May 23, 20262026-05-23T04:44:19+00:00 2026-05-23T04:44:19+00:00

The following may not qualify as a SO question; if it is out of

  • 0

The following may not qualify as a SO question; if it is out of bounds, please feel free to tell me to go away. The question here is basically, “Do I understand the C standard correctly and is this the right way to go about things?”

I would like to ask for clarification, confirmation and corrections on my understanding of character handling in C (and thus C++ and C++0x). First off, an important observation:

Portability and serialization are orthogonal concepts.

Portable things are things like C, unsigned int, wchar_t. Serializable things are things like uint32_t or UTF-8. “Portable” means that you can recompile the same source and get a working result on every supported platform, but the binary representation may be totally different (or not even exist, e.g. TCP-over-carrier pigeon). Serializable things on the other hand always have the same representation, e.g. the PNG file I can read on my Windows desktop, on my phone or on my toothbrush. Portable things are internal, serializable things deal with I/O. Portable things are typesafe, serializable things need type punning. </preamble>

When it comes to character handling in C, there are two groups of things related respectively to portability and serialization:

  • wchar_t, setlocale(), mbsrtowcs()/wcsrtombs(): The C standard says nothing about “encodings”; in fact, it is entirely agnostic to any text or encoding properties. It only says “your entry point is main(int, char**); you get a type wchar_t which can hold all your system’s characters; you get functions to read input char-sequences and make them into workable wstrings and vice versa.

  • iconv() and UTF-8,16,32: A function/library to transcode between well-defined, definite, fixed encodings. All encodings handled by iconv are universally understood and agreed upon, with one exception.

The bridge between the portable, encoding-agnostic world of C with its wchar_t portable character type and the deterministic outside world is iconv conversion between WCHAR-T and UTF.

So, should I always store my strings internally in an encoding-agnostic wstring, interface with the CRT via wcsrtombs(), and use iconv() for serialization? Conceptually:

                        my program
    <-- wcstombs ---  /==============\   --- iconv(UTF8, WCHAR_T) -->
CRT                   |   wchar_t[]  |                                <Disk>
    --- mbstowcs -->  \==============/   <-- iconv(WCHAR_T, UTF8) ---
                            |
                            +-- iconv(WCHAR_T, UCS-4) --+
                                                        |
       ... <--- (adv. Unicode malarkey) ----- libicu ---+

Practically, that means that I’d write two boiler-plate wrappers for my program entry point, e.g. for C++:

// Portable wmain()-wrapper
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

std::vector<std::wstring> parse(int argc, char * argv[]); // use mbsrtowcs etc

int wmain(const std::vector<std::wstring> args); // user starts here

#if defined(_WIN32) || defined(WIN32)
#include <windows.h>
extern "C" int main()
{
  setlocale(LC_CTYPE, "");
  int argc;
  wchar_t * const * const argv = CommandLineToArgvW(GetCommandLineW(), &argc);
  return wmain(std::vector<std::wstring>(argv, argv + argc));
}
#else
extern "C" int main(int argc, char * argv[])
{
  setlocale(LC_CTYPE, "");
  return wmain(parse(argc, argv));
}
#endif
// Serialization utilities

#include <iconv.h>

typedef std::basic_string<uint16_t> U16String;
typedef std::basic_string<uint32_t> U32String;

U16String toUTF16(std::wstring s);
U32String toUTF32(std::wstring s);

/* ... */

Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++, together with a well-defined I/O interface to UTF using iconv? (Note that issues like Unicode normalization or diacritic replacement are outside the scope; only after you decide that you actually want Unicode (as opposed to any other coding system you might fancy) is it time to deal with those specifics, e.g. using a dedicated library like libicu.)

Updates

Following many very nice comments I’d like to add a few observations:

  • If your application explicitly wants to deal with Unicode text, you should make the iconv-conversion part of the core and use uint32_t/char32_t-strings internally with UCS-4.

  • Windows: While using wide strings is generally fine, it appears that interaction with the console (any console, for that matter) is limited, as there does not appear to be support for any sensible multi-byte console encoding and mbstowcs is essentially useless (other than for trivial widening). Receiving wide-string arguments from, say, an Explorer-drop together with GetCommandLineW+CommandLineToArgvW works (perhaps there should be a separate wrapper for Windows).

  • File systems: File systems don’t seem to have any notion of encoding and simply take any null-terminated string as a file name. Most systems take byte strings, but Windows/NTFS takes 16-bit strings. You have to take care when discovering which files exist and when handling that data (e.g. char16_t sequences that do not constitute valid UTF16 (e.g. naked surrogates) are valid NTFS filenames). The Standard C fopen is not able to open all NTFS files, since there is no possible conversion that will map to all possible 16-bit strings. Use of the Windows-specific _wfopen may be required. As a corollary, there is in general no well defined notion of “how many characters” comprise a given file name, as there is no notion of “character” in the first place. Caveat emptor.

  • 1 1 Answer
  • 0 Views
  • 0 Followers
  • 0
Share
  • Facebook
  • Report

Leave an answer
Cancel reply

You must login to add an answer.

Forgot Password?

Need An Account, Sign Up Here

1 Answer

  • Voted
  • Oldest
  • Recent
  • Random
  1. Editorial Team
    Editorial Team
    2026-05-23T04:44:19+00:00Added an answer on May 23, 2026 at 4:44 am

    Is this the right way to write an idiomatic, portable, universal, encoding-agnostic program core using only pure standard C/C++

    No, and there is no way at all to fulfill all these properties, at least if you want your program to run on Windows. On Windows, you have to ignore the C and C++ standards almost everywhere and work exclusively with wchar_t (not necessarily internally, but at all interfaces to the system). For example, if you start with

    int main(int argc, char** argv)
    

    you have already lost Unicode support for command line arguments. You have to write

    int wmain(int argc, wchar_t** argv)
    

    instead, or use the GetCommandLineW function, none of which is specified in the C standard.

    More specifically,

    • any Unicode-capable program on Windows must actively ignore the C and C++ standard for things like command line arguments, file and console I/O, or file and directory manipulation. This is certainly not idiomatic. Use the Microsoft extensions or wrappers like Boost.Filesystem or Qt instead.
    • Portability is extremely hard to achieve, especially for Unicode support. You really have to be prepared that everything you think you know is possibly wrong. For example, you have to consider that the filenames you use to open files can be different from the filenames that are actually used, and that two seemingly different filenames may represent the same file. After you create two files a and b, you might end up with a single file c, or two files d and e, whose filenames are different from the file names you passed to the OS. Either you need an external wrapper library or lots of #ifdefs.
    • Encoding agnosticity usually just doesn’t work in practice, especially if you want to be portable. You have to know that wchar_t is a UTF-16 code unit on Windows and that char is often (bot not always) a UTF-8 code unit on Linux. Encoding-awareness is often the more desirable goal: make sure that you always know with which encoding you work, or use a wrapper library that abstracts them away.

    I think I have to conclude that it’s completely impossible to build a portable Unicode-capable application in C or C++ unless you are willing to use additional libraries and system-specific extensions, and to put lots of effort in it. Unfortunately, most applications already fail at comparatively simple tasks such as “writing Greek characters to the console” or “supporting any filename allowed by the system in a correct manner”, and such tasks are only the first tiny steps towards true Unicode support.

    • 0
    • Reply
    • Share
      Share
      • Share on Facebook
      • Share on Twitter
      • Share on LinkedIn
      • Share on WhatsApp
      • Report

Sidebar

Related Questions

The following statement is used because the method in question (refreshPartyList()) may not always
This may not be possible, but I thought I'd throw it out here: Given
The following method does not compile. Visual Studio warns An out parameter may not
I just realized that i may not be following best practices in regards to
How do I create a function for following code so that i may not
This may be a repeat of the following unanswered question: Help with bitmap lock
This may be a stupid question but I have a code with the following
Hello this is may first question and I have found so far the following
union members may not have destructors or constructors. So I can't template the following
The following code is supposed to take a string that may or may not

Explore

  • Home
  • Add group
  • Groups page
  • Communities
  • Questions
    • New Questions
    • Trending Questions
    • Must read Questions
    • Hot Questions
  • Polls
  • Tags
  • Badges
  • Users
  • Help
  • SEARCH

Footer

© 2021 The Archive Base. All Rights Reserved
With Love by The Archive Base

Insert/edit link

Enter the destination URL

Or link to existing content

    No search term specified. Showing recent items. Search or use up and down arrow keys to select an item.