char8_t The un-observed particle of C++20.

Your Type not Found

JavaScript, Go, and all those youngsters are “Unicode based”. That is: source code is Unicode, strings are Unicode, even variable names are Unicode. And it all just works. And. It is all in UTF-8 Unicode.

And there is this thing called JSON. And JSON mandates UTF-8 Unicode. And there is this thing that came into existence after C++ was invented: the WWW, aka the Internet. Yup. UTF-8 too.

History

And C++? C++ is adorned with this official euphemism: “...The original C++ Standard (known as C++98 or C++03) is Unicode agnostic...”. Why? No one knows. C++ is and was firmly “no Unicode”: ASCII source, strings, names, etc.

Before we go any further: C++ was born on UNIX. Linux these days. And over there in that distant galaxy, far far away from the dark force of Windows, ‘char’ rules. And yes, it is all UTF-8.

And then the big bold “modern C++” world decided to make this right. But for some committee-logic reason, UTF-8 was expressly ignored. Thus char16_t and char32_t were born first, while UTF-8 and char8_t were “discussed”. The u8 string literal was documented in a C++11 draft, and the world went by for the next 9 years. The C++20 cut-off date has come and passed, we are nearing the C++20 release, and some of us C++ lemmings are happily playing with the C++20 goodies already available, UTF-8 and char8_t included.

Thus Unicode characters are part of the C++11/14/17/20 standards. But now we are coming to the moment when the most important one (char8_t) has to be implemented to make C++ fully UTF-8 capable. What could possibly go wrong?

The present

The current C++20 (not)draft says that char8_t is a keyword. Fine. UTF-8 string literals are u8 prefixed. Let’s try this right now, November 2019. It works. Probably because char8_t has unsigned char as an “underlying type.”

// why does this work?
printf("%s", u8"РАЧУНАРИ" );

Right now (November 2019) that C++20 line of code works. But it would be clearer if it did not. How?

printf() is apparently the responsibility of WG14, aka the “ISO C committee”. And char8_t is (un)officially part of ISO C 2.X. Which will be ISO C20, or 21, or whatever? With a bit of luck it will be “out” in the year 2021.

I am now wondering which are those “sensible systems” where that single line does not work. It should not work in C++20 mode with your compiler. But it works, and I do not know if this is “by accident”.
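One way to see why that line gets away with it is to ask the compiler what a u8 literal is made of. This is a minimal sketch, not anything from the standard text: u8_elem and u8_literal_is_plain_char are names made up for this illustration, and it assumes only a C++11 or later compiler.

```cpp
#include <type_traits>

// Element type of a u8 string literal:
// plain char up to C++17, char8_t from C++20 on.
using u8_elem = std::remove_cv<
    std::remove_reference<decltype(u8"x"[0])>::type>::type;

// Whatever the type is called, one code unit is one byte wide.
static_assert(sizeof(u8_elem) == 1, "a UTF-8 code unit must be one byte");

// char8_t may share its representation with unsigned char,
// but it is a distinct type (and so is plain char).
static_assert(!std::is_same<u8_elem, unsigned char>::value,
              "the u8 element type is never unsigned char itself");

// true when u8 literals are plain char arrays (before C++20),
// false once they are char8_t arrays (C++20 and later).
constexpr bool u8_literal_is_plain_char =
    std::is_same<u8_elem, char>::value;
```

In C++17 mode u8_literal_is_plain_char is true, in C++20 mode it is false. Either way the elements are one byte wide, which is presumably why printf("%s") keeps printing them.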

Un-informed state

Yes, dear sharp-eyed reader: C++20 will come out before C2.x. And that will leave printf() and char8_t in an undefined state, better described as an un-informed state. Right in the middle of C++20.

November 2019: C++20 (the WG21 committee) is in an “un-informed” state about char8_t. Apparently char16_t and char32_t are OK to be fully implemented, but char8_t is not. Any concrete info is hard to obtain. "It has to do something with legacy" is the educated answer.

There was a WG21 chain of events, and a chain of decisions, on char8_t for the C++20 release. But it is not clearly documented or explained in one place.

There are (where?) distinguished members of WG21 who claimed that char8_t * is a proverbial "can of worms" because of our good old char * and legacy code. But I just fail to see why not simply stop char8_t * and char * from being convertible to each other?

That might be a simple solution. Note: with char16_t and char32_t you can do the above casting to char *. I think it was clumsy to allow that.
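What the compiler thinks about those conversions can be checked with type traits. This is a sketch assuming nothing beyond <type_traits>; u8_ptr, u8_converts_to_char_ptr and bytes_of are illustration-only names.

```cpp
#include <type_traits>

// Pointer type a u8 literal decays to:
// const char* up to C++17, const char8_t* from C++20 on.
using u8_ptr = std::decay<decltype(u8"")>::type;

// Does that pointer convert to const char* without a cast?
constexpr bool u8_converts_to_char_ptr =
    std::is_convertible<u8_ptr, const char*>::value;

// The wider character types never convert implicitly.
static_assert(!std::is_convertible<const char16_t*, const char*>::value,
              "char16_t* needs an explicit cast");
static_assert(!std::is_convertible<const char32_t*, const char*>::value,
              "char32_t* needs an explicit cast");

// The explicit route stays open: any object may be viewed as bytes
// through a pointer to char.
inline const char* bytes_of(const char32_t* p) {
    return reinterpret_cast<const char*>(p);
}
```

In C++20 mode the flag is false, so the implicit conversion is already gone. What remains is the explicit reinterpret_cast, which works for char16_t and char32_t pointers too.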

Has name, but does not fully exist

Yes, dear reader. Such a thing exists. C++20 type that is named but not fully implemented. Or is it: Somehow implemented? And that is char8_t. The un-observed particle of C++20.

It is actually defined in N4835 but the definition ends with:  "A UTF-8 character literal containing multiple c-chars is ill-formed.".  That means exactly this:

So, you had better not use that char8_t thing.

I have done some tests with a single element from a u8 sequence. And that indeed does not work. char8_t * to printf("%s") does work (somehow), but char8_t to printf("%c") is an accident waiting to happen.

Please see -- https://wandbox.org/permlink/6NQtkKeZ9JUFw4Sd

To save you reading my code, here is the key issue.

If you want a single u8 glyph, you need to code it in C++ as a u8 string:

char8_t const * single_glyph = u8"ア";
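The reason is arithmetic: ア is U+30A2, and in UTF-8 that takes three code units, so it cannot fit into one char8_t. Here is a sketch that lets the compiler do the counting. The names are made up for this example, and the glyph is spelled as \u30A2 so the result does not depend on the source file encoding.

```cpp
#include <cstddef>

// sizeof counts the code units plus the terminating NUL, hence the -1.
constexpr std::size_t glyph_code_units = sizeof(u8"\u30A2") - 1;

// U+30A2 (KATAKANA LETTER A) needs three UTF-8 code units ...
static_assert(glyph_code_units == 3, "U+30A2 takes three code units");

// ... and the first of them is the lead byte 0xE3.
constexpr unsigned char glyph_lead_byte =
    static_cast<unsigned char>(u8"\u30A2"[0]);
static_assert(glyph_lead_byte == 0xE3, "lead byte of U+30A2 is 0xE3");

// An ASCII letter still fits into a single code unit.
static_assert(sizeof(u8"A") - 1 == 1, "ASCII takes one code unit");
```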

To make things a bit more convoluted for C++20 aficionados, to print u8 right now, the only sort-of-a sure way is:
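A minimal sketch of one such way, assuming the bytes are handed to printf("%s") through an explicit cast. print_u8 is a hypothetical helper name, not something from the test code linked above.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: passes NUL-terminated UTF-8 code units to printf
// as plain char. Accepts char8_t (C++20) as well as char (C++17 u8 literals).
// Returns what printf returns: bytes written, or a negative value on error.
template <typename CodeUnit>
int print_u8(const CodeUnit* text) {
    static_assert(sizeof(CodeUnit) == 1, "expects one-byte UTF-8 code units");
    assert(text != nullptr);
    // Viewing the code units through a char pointer is allowed for any type.
    return std::printf("%s", reinterpret_cast<const char*>(text));
}
```

With the declaration above, print_u8(single_glyph) sends three bytes to stdout. Whether a glyph shows up still depends on the terminal expecting UTF-8.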

"Use std::cout!", I hear you shout. Unfortunately, that does not compile in C++20 mode of all the three (or is it two?) top compilers.


// does not compile under C++20
// error : overload resolution selected deleted operator '<<'
// see P1423, proposal 7
std::cout << u8"ア";
std::cout << char8_t ('ア');

And that is also documented. To start reading on this subject, these two papers are the required starting point:

  1. http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm
  2. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1423r2.html

In that order.
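Until that gets sorted out, one workaround is to push the bytes into the stream unformatted. This is a sketch only, assuming the code units are NUL-terminated; write_u8 is a hypothetical helper, not something proposed in P1423.

```cpp
#include <cassert>
#include <cstring>
#include <iostream>

// Hypothetical helper: writes NUL-terminated UTF-8 code units to a narrow
// stream as raw bytes, sidestepping the deleted operator<< overloads.
template <typename CodeUnit>
std::ostream& write_u8(std::ostream& os, const CodeUnit* text) {
    static_assert(sizeof(CodeUnit) == 1, "expects one-byte UTF-8 code units");
    assert(text != nullptr);
    const char* bytes = reinterpret_cast<const char*>(text);
    // Unformatted output: the stream passes the bytes through untouched.
    return os.write(bytes, static_cast<std::streamsize>(std::strlen(bytes)));
}
```

A call then looks like write_u8(std::cout, u8"ア") << '\n'; and compiles in both C++17 and C++20 mode.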

Un-break it, somehow?

Making char8_t unbroken (before the imminent C++20 release?) will be an interesting spectacle to watch.

The WG21 committee might even try to claim that GNU, LLVM, and MSFT are to implement it somehow, without committing to the final and official decision that is promised for the year 2023 and C++23.

And WG21 can claim it is the responsibility of WG14 to define char8_t first, before it can be implemented anywhere.

Let us just hope compiler vendors will agree on a common char8_t format specifier, and all the other little devil details, and do a resilient implementation. In time for the C++20 release?

But wait, there is more! There is yet another un-informed devil detail: the standard library header <cuchar>. As of November 2019 its situation is, currently and certainly, a very un-informed state.

What can possibly go wrong?
