C++ char8_t, the Boson

What could possibly go wrong(tm)

Update 2023 Apr

C23 is helping. A C23 UTF-8 string literal has type char8_t[N], where N is the size of the string in UTF-8 code units, including the null terminator. Each char8_t element in the array is initialized from the next multibyte character in the s-char-sequence using UTF-8 encoding.

No; stdio still does not do UTF-8.
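The same sizing rule can be checked at compile time in C++. This is only a minimal sketch: `code_units` is a hypothetical helper name, and it is templated on the element type so that it builds both before and after C++20 (where the element type of a u8 literal changed from char to char8_t). The glyphs are written as universal character names so the result does not depend on the source file encoding.

```cpp
#include <cstddef>

// Hypothetical helper: returns N for a string literal of type Ch[N],
// that is, the number of code units including the null terminator.
template <class Ch, std::size_t N>
constexpr std::size_t code_units(const Ch (&)[N]) {
    return N;
}

// "A" is one UTF-8 code unit, plus the terminator.
static_assert(code_units(u8"A") == 2, "1 code unit + null");

// U+30A2 (katakana A) takes three UTF-8 code units, plus the terminator.
static_assert(code_units(u8"\u30A2") == 4, "3 code units + null");
```

So N counts bytes of the encoded text, not glyphs.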

For the details see here.

Update 2022 Sep (published 5 Apr 2020)

In case you might be wondering why we don’t “just” use char32_t and get over it once and for all, please see this GODBOLT.

It shows that C and/or C++ console output simply does not deal with char32_t and char16_t. On Linux you use char *, which is coded as UTF-8, and on Windows you use wchar_t, which is coded as UTF-16. Thus you have to transform your char32_t / char16_t strings to one of those first, depending on which platform you happen to be on; only then can you use the console subsystem.
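For the Linux side of that transformation, here is a minimal sketch of turning char32_t text into UTF-8 bytes held in a plain std::string. The names `encode_utf8` and `to_utf8` are hypothetical, the encoder is hand-written rather than taken from any library, and invalid values (surrogates, anything above U+10FFFF) are simply skipped, which is an assumption of this sketch and not a recommendation.

```cpp
#include <string>

// Encode one Unicode code point as 1 to 4 UTF-8 bytes.
// Returns an empty string for values that are not valid scalar values.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate, not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encode a whole UTF-32 string; invalid code points are skipped.
std::string to_utf8(const std::u32string& text) {
    std::string out;
    for (char32_t cp : text) out += encode_utf8(cp);
    return out;
}
```

On a UTF-8 terminal the result can then go through the usual narrow-char route, for example `std::printf("%s\n", to_utf8(U"\u3072").c_str());`. On Windows the equivalent step would target UTF-16 and wchar_t instead.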

Update 2020 Apr

How it all started

UTF-8 as we know it today very likely started with a (now defunct) OS called “Plan 9”. All is behind that link. UTF-8 is one of the offspring technologies of that OS. Here is the fascinating PDF from 1992 January: UTF-8-Plan9-paper.

When I say “fascinating” I do mean exactly that. Be sure to read that PDF. Looking from where we are now, and in the context of the WG21 C++ committee, the most fascinating thing (for me at least) emanating from that PDF is the ease and effectiveness of decision-making when there is no committee: decisions made without 200+ people sitting around on that ISO committee.

Oh yes, ISO is also in there in that PDF; just read it. The document is short and sweet.

Recap and code

In order not to be accused of “having an agenda”, I shall recap the situation as of today (5 Apr 2020) and produce some kind of usable solution.

Recap

  1. A single char8_t cannot hold the full range of utf-8 glyphs.
  2. utf-8 glyphs can take up to 4 bytes. A C++ char8_t holds exactly 1 byte (one code unit).
  3. Streaming char8_t out is expressly forbidden starting from C++20.
  4. For utf-8 compatibility while in VStudio, use char and the /utf-8 switch.
  5. In case you really want the C++20 (and after) way of transforming to/from char8_t, you need to use <cuchar> … Alas, it is not yet fully implemented in any of the three major compilers, as required by the standard.
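Points 1 and 2 can be made concrete by counting. The sketch below uses two hypothetical helpers, `count_code_units` and `count_code_points`, templated on the character type so they accept both char and char8_t strings. It assumes well-formed UTF-8 and counts code points (what this post loosely calls glyphs) by skipping continuation bytes, which all have the bit pattern 10xxxxxx.

```cpp
#include <cassert>
#include <cstddef>

// Number of 8-bit code units before the null terminator.
template <class Ch>
std::size_t count_code_units(const Ch* s) {
    static_assert(sizeof(Ch) == 1, "expects a string of 8-bit code units");
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Number of code points, assuming the input is well-formed UTF-8:
// every byte that is not a continuation byte starts a new code point.
template <class Ch>
std::size_t count_code_points(const Ch* s) {
    static_assert(sizeof(Ch) == 1, "expects a string of 8-bit code units");
    assert(s != nullptr);
    std::size_t n = 0;
    for (; *s != 0; ++s) {
        const unsigned char byte = static_cast<unsigned char>(*s);
        if ((byte & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

For the four hiragana in `u8"\u3072\u3089\u304C\u306A"` this gives 12 code units but only 4 code points, so no single char8_t can stand for one of them.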

Is there a way around it?

In case of using Visual Studio (fully up to date please) PLEASE MAKE SURE you use the /utf-8 compiler switch. That means your source will be utf-8 and the “executable character set” will be utf-8 too. If on Linux do nothing, the code below will “just work”.

What point 3 above actually means is that one can still print utf-8 by casting to char *, and one can iterate over u8 literals too. AFAIK this is a hack. For example:
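A minimal sketch of that hack, under stated assumptions: the helper names `as_char_ptr`, `print_utf8` and `print_utf8_bytewise` are hypothetical, the terminal is assumed to interpret output bytes as UTF-8, and the templates accept either char or char8_t so the code builds in pre-C++20 and C++20 modes. Reading char8_t storage through a char pointer is permitted by the aliasing rules; it is the printf side that has no official notion of UTF-8.

```cpp
#include <cassert>
#include <cstdio>

// View a string of 8-bit code units as plain char bytes.
template <class Ch>
const char* as_char_ptr(const Ch* s) {
    static_assert(sizeof(Ch) == 1, "expects a string of 8-bit code units");
    return reinterpret_cast<const char*>(s);
}

// Print the whole string with %s; returns what printf returns.
template <class Ch>
int print_utf8(const Ch* s) {
    assert(s != nullptr);
    return std::printf("%s\n", as_char_ptr(s));
}

// Print one byte at a time with %c; returns the number of bytes written
// before the trailing newline. A glyph only appears on screen once all
// of its bytes have been written.
template <class Ch>
int print_utf8_bytewise(const Ch* s) {
    assert(s != nullptr);
    int written = 0;
    for (const char* p = as_char_ptr(s); *p != '\0'; ++p) {
        std::printf("%c", *p);
        ++written;
    }
    std::printf("\n");
    return written;
}
```

Calling `print_utf8(u8"\u3072\u3089\u304C\u306A")` sends 12 bytes plus a newline to stdout; whether they show up as hiragana is up to the console, not to C++.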

printf("%c", elem); (inside clang, gcc and perhaps cl.exe) as used above “knows” utf-8 is the source and executable character set, thus the expected appears, glyph by glyph:

ひらがな

If on Windows be sure the console font selected can show those glyphs! Also, make sure chcp is set to 65001. Which on win10 by now is the cmd default.

The above begs the question: if that works, what is the point of having controversy around char8_t in the first place?

PS: I serve all kinds of customers. So if you like pain, the entrance to “the room of pain” is here: Null-terminated multibyte strings.

Update 2020 Feb

The Boson: One that is hard to catch

We are after yet another WG21 milestone.  C++20 is done.

Here is my succinct message. Regardless of OS, there is no legal way to output a UTF-8 string or char. Let us try printf.

For “no decision yet”, WG21 is “blaming” WG14 … give it a few more short years and the world will have a formatting char for UTF-8, from C/C++ that is. I am also puzzled: why make char8_t a keyword and then not implement support for it?

Perhaps so far WG21 SG16 sees the issue like this:

Confused? Rest assured, by 2023 this will be done and dusted.

Original Article — 2019 NOV 16

JavaScript, Go, and all those youngsters are “Unicode based”. That is: the source code is Unicode, strings are Unicode, even variable names are Unicode. And it all just works. And. It is all in UTF-8 Unicode.

And there is this thing called JSON. And JSON mandates utf-8 Unicode.  And there is this thing that came into existence after C++ was invented: WWW aka Internet. Yup. utf-8 too.

So is this utf-8 thing at all important? Here is the data. Six years after this graph, utf-8 is everywhere.

[Figure: utf8 growth chart]

History

And what did the C++ ISO committee do in the meantime? C++ is adorned with this official euphemism: “...The original C++ Standard (known as C++98 or C++03) is Unicode agnostic...”. Why? No one knows why. C++ is, and was, firmly “no Unicode”: ASCII source, strings, names, etc.

Before we go any further: C++ was born on UNIX. LINUX these days. And over there in that distant galaxy, far far away from the dark force of Windows, ‘char’ ruleth since the dawn of time. And yes, it is all utf-8 encoded.

Quiz question: Who invented UTF-8? The answer might be so obvious, you might be amazed at how obvious it is. Look it up.

And then the big, bold “modern C++” committee (aka WG21) decided to “make this right”. For some committee-logic reason, utf-8 was expressly ignored. Thus char16_t and char32_t were born first, and utf-8 and char8_t were “discussed”. The u8 string literal was documented in a C++11 draft, and the world went by for the next 9 years. The C++20 cut-off date has come and gone, we are nearing the C++20 release, and some of us C++ lemmings are happily playing with the C++20 goodies already available, char8_t included.

Thus Unicode characters are part of the C++11/14/17/20 and beyond standards. But now we are coming to the moment when the most important one (char8_t) has to be implemented and, in the process, make C++ fully utf-8 capable.

What could possibly go wrong?TM

The current C++20 (not)draft says that char8_t is a keyword. Fine. utf-8 string and char literals are u8 prefixed. Let’s try this right now, November 2019. It works. Probably because char8_t has unsigned char as an “underlying type”, although we are told it is strictly a “distinct type”.

Right now (2019 NOV) that C++20 line of code works, with some warnings. But it would be clearer if it did not work at all. How is that?

printf() is apparently the responsibility of WG14, aka the “ISO C committee”. And char8_t is (un)officially part of ISO C 2.X, which will be ISO C20, or 21, or whatever. With a bit of luck, it will be “out” in the year 2021.
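The “same underlying type, yet distinct type” part can be checked by asking the compiler. This is only a sketch: `u8_element` and `u8_element_is_distinct` are hypothetical names, and the `__cpp_char8_t` feature-test macro is used so that the file still builds where char8_t is not enabled (there the element type of a u8 literal is plain char).

```cpp
#include <type_traits>

// Element type of a u8 string literal: char8_t in C++20, plain char before.
using u8_element =
    std::remove_cv<std::remove_reference<decltype(u8"x"[0])>::type>::type;

// Either way, one code unit is one byte wide.
static_assert(sizeof(u8_element) == 1, "a u8 code unit is one byte");

// True only when the element type is none of the classic narrow char types.
constexpr bool u8_element_is_distinct =
    !std::is_same<u8_element, char>::value &&
    !std::is_same<u8_element, unsigned char>::value &&
    !std::is_same<u8_element, signed char>::value;

#if defined(__cpp_char8_t)
// With char8_t enabled: unsigned like unsigned char, yet a separate type,
// so overloads and templates can tell the two apart.
static_assert(std::is_same<u8_element, char8_t>::value, "u8 literals are char8_t");
static_assert(std::is_unsigned<char8_t>::value, "char8_t is unsigned");
static_assert(u8_element_is_distinct, "char8_t is not an alias of a char type");
#endif
```

Being a distinct type is what lets the library delete or withhold overloads for char8_t while leaving char untouched.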

Let me repeat this little devil detail: none of this is official.

I am now wondering what those “sensible systems” are where that single line does not work. That should not work in C++20 mode with your compiler. But it works, and I do know this is not “by accident”.

Un-informed state

The only particle that can escape the black hole is information.

Yes, dear sharp-eyed reader: C++20 will come out before ISO C2.x. And that will leave printf() and char8_t in an undefined state until then. I think, better described as an un-informed state. Right in the middle of C++20.

2019 November: C++20 (the WG21 committee) is in an “un-informed” state about char8_t. Apparently char16_t and char32_t are ok to be fully implemented, but char8_t is not. Any concrete info on that WG21 not-decision is hard to obtain. “It has to do something with legacy” is the educated answer.

There was a WG21 chain of events and a chain of decisions on ‘char8_t’ for the C++20 release.  But, it is now not clearly documented and not explained in one place.

I have a solution — perhaps

There are (where, who?) distinguished members of WG21 who claimed that char8_t* is a proverbial “can of worms” because of our good old char* and legacy code. But I just fail to see why we do not simply stop char8_t* and char* from being convertible to each other.

That might be a simple solution. And that will NOT be a dreaded “breaking change”. That code does not exist (or is very rare) at present, so it cannot be broken.

Note: with char16_t and char32_t you can do the above casting to char *. I think it was clumsy to allow that.

<cuchar> routines should be the only way to transform between different chars

But they are not. That would be a breaking change, according to some. ABI breaking?

Has a name, but does not fully exist

Such a thing exists in the spec: a C++20 type that is named but not fully implemented. Or is it somehow implemented? That is char8_t, the unobserved boson of the ISO C++ multiverse.

It is actually defined in N4849, but the definition (p.17) ends with: “A UTF-8 character literal containing multiple c-chars is ill-formed.” That means exactly this:

So you had better not use that char8_t boson yet. Until C++23. Perhaps.

I have done some tests with a single element from a u8 sequence. And that indeed does not work. Passing char8_t * to printf("%s") does work (by illegally casting to char *), but passing char8_t to printf("%c") is an accident waiting to happen. So I will not use that one.

For the latest and shortest test, please see this Wandbox. To save you from reading my code, here is the key issue.

If you want a single u8 glyph, you need to code it as a u8 pointer:
char8_t const * single_glyph = u8"ア";
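Given such a pointer, the number of code units that belong to the first glyph can be read off its lead byte. A minimal sketch, assuming well-formed UTF-8; `first_glyph_length` is a hypothetical helper, templated so it takes char or char8_t pointers alike.

```cpp
#include <cassert>
#include <cstddef>

// Length in code units of the first UTF-8 sequence in s.
// Returns 0 for an empty string or a byte that cannot start a sequence.
template <class Ch>
std::size_t first_glyph_length(const Ch* s) {
    static_assert(sizeof(Ch) == 1, "expects a string of 8-bit code units");
    assert(s != nullptr);
    const unsigned char lead = static_cast<unsigned char>(s[0]);
    if (lead == 0) return 0;                  // empty string
    if (lead < 0x80) return 1;                // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;      // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;      // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;      // 11110xxx
    return 0;                                 // continuation or invalid byte
}
```

For `u8"\u30A2"` (the katakana used above) the lead byte is 0xE3, so the glyph spans 3 code units, which is why it only fits behind a pointer and not in one char8_t.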

To make things even more convoluted for C++20 aficionados, to print u8 right now, the only sort-of-a sure way is:

“Use std::cout!”, I hear you shout. Unfortunately, that does not compile in the C++20 mode of any of the three (or is it two?) top compilers.


// does not compile under C++20
// error : overload resolution selected deleted operator '<<'
// see P1423, proposal 7
std::cout << u8"ア";
std::cout << char8_t ('ア');
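What still compiles is the same cast as with printf, only routed through a stream. A sketch with hypothetical helper names (`write_utf8`, `utf8_to_string`); it assumes the stream's consumer treats the bytes as UTF-8, and it takes char or char8_t pointers so it builds in both pre-C++20 and C++20 modes.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Write a string of 8-bit code units to a narrow stream as raw bytes.
// The cast sidesteps the deleted operator<< for char8_t.
template <class Ch>
std::ostream& write_utf8(std::ostream& os, const Ch* s) {
    static_assert(sizeof(Ch) == 1, "expects a string of 8-bit code units");
    assert(s != nullptr);
    return os << reinterpret_cast<const char*>(s);
}

// Convenience wrapper: collect the same bytes in a std::string.
template <class Ch>
std::string utf8_to_string(const Ch* s) {
    std::ostringstream oss;
    write_utf8(oss, s);
    return oss.str();
}
```

With `<iostream>` included, `write_utf8(std::cout, u8"\u30A2") << '\n';` is the stream counterpart of the printf hack; it is no more “legal” in spirit, it just compiles.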

And that is also documented. To start reading on this subject, these two papers are the required starting point:

  1. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm
  2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r2.html

In that order.

Un-break it, somehow?

Making char8_t unbroken (before the imminent C++23 release?) will be an interesting spectacle to watch.

The WG21 committee might even try to claim that GNU, LLVM, and MSFT are to implement it somehow. Perhaps by committing to the final and official decision that is promised for the year 2023 and C++23.

And WG21 can claim that it is the responsibility of WG14 to define char8_t first, before it can be implemented anywhere, second.

Let us just hope compiler vendors will agree on a common char8_t format specifier, and all the other little devil details, and deliver a resilient implementation. In time for the C++20 release? (Hint: no, they did not.)

But wait, there is more! There is also yet another un-informed little devil detail: the standard library header <cuchar>. As of 2019 NOV, this is the situation with that: currently, and certainly, in a very uninformed state.

 

What could possibly go wrong?TM