Update 2020 FEB
The Boson: One that is hard to catch
We are after yet another WG21 milestone. C++20 is done.
Here is my succinct message. Regardless of OS, there is no legal way to output UTF-8, string or char. Let's try:
// what are the formatting chars to be used below?
printf("%" WG21_DECISION_HERE, u8"ひらがな"); // char8_t *
printf("%" WG21_DECISION_HERE, u8'ひ'); // char8_t
// iostream not-a-solution is simple, it is forbidden in C++20
// to stream out char8_t or char8_t *
For “no decision yet”, WG21 is “blaming” WG14 … thus a few more short years and the world will have a formatting char for UTF-8, from C/C++ that is. I am also puzzled: why make char8_t a keyword and then not implement support for it?
Perhaps so far WG21 SG16 sees the issue like this:
printf("%" WG14_DECISION_HERE, u8"ひらがな"); // char8_t *
printf("%" WG14_DECISION_HERE, u8'ひ'); // char8_t
Confused? Rest assured, by 2023 this will be done and dusted.
Original Article — 2019 NOV 16
And there is this thing called JSON. And JSON mandates utf-8 Unicode. And there is this thing that came into existence after C++ was invented: WWW aka Internet. Yup. utf-8 too.
So is this utf-8 thing any important? Here is the data:
And C++? C++ is adorned with this official euphemism: “...The original C++ Standard (known as C++98 or C++03) is Unicode agnostic...”. Why? No one knows why. C++ is and was firmly “no Unicode”: ASCII source, strings, names, etc.
Before we go any further: C++ was born on UNIX. LINUX these days. And over there in that distant galaxy, far far away from the dark force of Windows, ‘char’ rules. And yes, it is all utf-8.
Quiz question: Who invented UTF-8? The answer might be so obvious, you might be amazed at how obvious it is.
And then, the big bold “modern C++” committee (aka WG21) decided to make this right. For some committee logic reason, utf-8 was expressly ignored. Thus char16_t and char32_t were born first. And utf-8 and char8_t were “discussed”.
The u8 string literal was documented in a C++11 draft, and the world has gone by for the next 9 years. The C++20 cut-off date has come and gone, we are nearing the C++20 release, and some of us C++ lemmings are happily playing with the C++20 goodies that are already available.
Thus Unicode characters are part of the C++11/14/17/20 standard. But now we are coming to the moment when the most important one (char8_t) has to be implemented and, in that process, make C++ fully utf-8 capable. What could possibly go wrong?
The current C++20 (not)draft says that char8_t is a keyword. Fine. utf-8 string literals are u8 prefixed. Let’s try this right now, November 2019. It works. Probably because char8_t has unsigned char as an “underlying type”.
// this should not work?
printf("%s", u8"РАЧУНАРИ" );
Right now (2019 NOV) that C++20 line of code works. But it would be clearer if it didn’t. How is that?
printf() is apparently the responsibility of WG14, aka the “ISO C committee”. And char8_t is (un)officially part of ISO C 2.X. Which will be ISO C20, or 21 or whatever? With a bit of luck, it will be “out” in the year 2021. Let me repeat this core devil detail: this is not official.
I am now wondering what are those “sensible systems” where that single line does not work. That should not work in C++20 mode with your compiler. But it works, and I do not know if this is “by accident”.
Yes, dear sharp-eyed reader: C++20 will come out before C2.x. And that will leave char8_t in an undefined state, better described as an un-informed state. Right in the middle of C++20.
As of November 2019, C++20 (the WG21 committee) is in an “un-informed” state about char8_t: char16_t and char32_t are ok to be fully implemented, but char8_t is not ok to be implemented. Any concrete info on that WG21 not-decision is hard to obtain. “It has to do something with legacy” is the educated answer.
There was a WG21 chain of events and a chain of decisions on ‘char8_t’ for the C++20 release. But it is not clearly documented now, and it is not explained in one place.
I have a solution — perhaps
There are (where?) distinguished members of WG21 who claimed that char8_t* is a proverbial “can of worms” because of our good old char* and legacy code. But I just fail to see why we can not simply stop char8_t and char from being convertible to each other?
// in the parallel C++20 universe
const char * char_literal = "ABCDEFGH";
const char8_t * new_literal = u8"ひらがな";
// no can do
char_literal = new_literal;
// casting? no can do
char_literal = (const char *)new_literal;
// element to element? no can do
char first_element = new_literal[0];
// comparisons? no can do
bool same = (char_literal == new_literal);
That might be a simple solution. And that will NOT be a dreaded “breaking change”. Such code does not exist, or is very rare, at present, so it can not be broken.
With char32_t you can do the above casting to char *. I think it was clumsy to allow that.
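For comparison with the parallel universe above, here is a minimal sketch of what a C++20 compiler actually decides about these conversions, as far as I can tell. The names u8_unit, u8_is_distinct, pointer_converts, element_converts and demo_element_conversion are mine, invented for illustration. The sketch deduces the element type from a u8 literal, so it also compiles before C++20, where that type is plain char.

```cpp
#include <cassert>
#include <type_traits>

// Element type of a u8 string literal: char8_t in C++20, char before that.
using u8_unit = std::remove_cv_t<std::remove_reference_t<decltype(u8"x"[0])>>;

// True when u8 literals have their own element type (C++20 and later).
constexpr bool u8_is_distinct = !std::is_same_v<u8_unit, char>;

// Implicit pointer conversion: const char8_t* to const char*.
constexpr bool pointer_converts =
    std::is_convertible_v<const u8_unit*, const char*>;

// Implicit element conversion: char8_t to char.
constexpr bool element_converts = std::is_convertible_v<u8_unit, char>;

// Once the type is distinct, the pointers no longer convert implicitly.
static_assert(pointer_converts != u8_is_distinct,
              "pointer conversion should vanish exactly when char8_t appears");

// Single elements still convert, as an ordinary integral conversion.
static_assert(element_converts, "char8_t to char is an integral conversion");

inline void demo_element_conversion() {
    const char first = u8"A"[0];  // compiles with and without char8_t
    assert(first == 'A');
    (void)first;
}
```

So the implicit pointer assignment is already a "no can do" in C++20. What remains open is the explicit cast and the element to element conversion.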
Has name, but does not fully exist
Yes, dear reader. Such a thing exists. A C++20 type that is named but not fully implemented. Or is it: somehow implemented? And that is char8_t. The un-observed boson of C++20.
It is actually defined in N4849 but the definition (p.17) ends with: “A UTF-8 character literal containing multiple c-chars is ill-formed.”. That means exactly this:
// perfect, you hit the jackpot, you are lucky today!
char8_t A8 = u8'A' ;
// you have made an ill-formed char8_t,
// you have provoked the wrath of gods
// 'な' is 3 bytes (code units) long in UTF-8, you are ungrateful!
char8_t H8 = u8'な' ;
So you had better not use that char8_t boson yet. Until C++23 perhaps.
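To see why a single char8_t can not hold most glyphs, here is a small sketch that counts UTF-8 code units and code points. The helper names code_units, code_points and demo_counts are hypothetical, made up for this example. The glyphs are written as universal character names so the result does not depend on how the compiler reads the source file.

```cpp
#include <cassert>
#include <cstddef>

// Number of UTF-8 code units in a u8 literal, terminator excluded.
template <typename Unit, std::size_t N>
constexpr std::size_t code_units(const Unit (&)[N]) {
    return N - 1;
}

// Number of code points in a zero-terminated UTF-8 sequence.
template <typename Unit>
std::size_t code_points(const Unit* text) {
    std::size_t count = 0;
    for (; *text != 0; ++text) {
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((static_cast<unsigned char>(*text) & 0xC0u) != 0x80u) {
            ++count;
        }
    }
    return count;
}

static_assert(code_units(u8"A") == 1, "ASCII fits in one code unit");
static_assert(code_units(u8"\u306A") == 3, "U+306A needs three code units");

inline void demo_counts() {
    // One glyph, one code point, but three char8_t elements.
    assert(code_points(u8"\u306A") == 1);
}
```

Only code points up to U+007F fit in one code unit, which is why u8'A' is fine and a u8 character literal for a kana is ill-formed.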
I have done some tests with a single element from a u8 sequence. And that indeed does not work. Passing char8_t * to printf("%s") does work (somehow), but printf("%c") is an accident waiting to happen. So I will not use that one.
For the latest and shortest test, please see this Wandbox. To save you reading my code, here is the key issue.
If you want a single u8 glyph, you need to code it in C++ as a u8 string:
char8_t const * single_glyph = u8"ア";
To make things a bit more convoluted for C++20 aficionados, the only sort-of-sure way to print u8 right now is:
// works with warnings and possibly by accident
// ... nobody knows
std::printf("%s", u8"ア" ) ;
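One way to make the intent explicit, and probably to quiet the format warnings, is to cast the code units to plain bytes before they reach printf. This is only a sketch under my own assumptions: as_bytes, print_utf8 and demo_print_utf8 are hypothetical names, not a standard facility. Whether the glyph shows up correctly still depends on the terminal expecting UTF-8.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: view UTF-8 code units as plain bytes.
// Reading any object through a const char* is permitted.
template <typename Unit>
const char* as_bytes(const Unit* text) {
    static_assert(sizeof(Unit) == 1, "expects UTF-8 code units");
    return reinterpret_cast<const char*>(text);
}

// Print a u8 string through "%s". Returns what printf returns:
// the number of bytes written, or a negative value on error.
template <typename Unit>
int print_utf8(const Unit* text) {
    return std::printf("%s", as_bytes(text));
}

inline void demo_print_utf8() {
    // U+30A2 (katakana A) is three bytes in UTF-8.
    const int written = print_utf8(u8"\u30A2");
    assert(written == 3);
    (void)written;
}
```

The template parameter lets the same helper accept const char * before C++20 and const char8_t * from C++20 on.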
“Use std::cout!”, I hear you shout. Unfortunately, that does not compile in the C++20 mode of any of the three (or is it two?) top compilers.
// does not compile under C++20
// error : overload resolution selected deleted operator '<<'
// see P1423, proposal 7
std::cout << u8"ア";
std::cout << char8_t ('ア');
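Until there is an official answer, a workaround can be sketched: bypass the deleted operator << and hand the bytes to the stream directly. The names write_utf8 and demo_write_utf8 are mine and hypothetical. The sketch assumes that whatever reads the stream expects UTF-8.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: write a zero-terminated UTF-8 sequence to a narrow
// stream as raw bytes, without going through operator<<.
template <typename Unit>
std::ostream& write_utf8(std::ostream& out, const Unit* text) {
    static_assert(sizeof(Unit) == 1, "expects UTF-8 code units");
    const char* bytes = reinterpret_cast<const char*>(text);
    const auto length = std::char_traits<char>::length(bytes);
    return out.write(bytes, static_cast<std::streamsize>(length));
}

inline void demo_write_utf8() {
    std::ostringstream captured;
    write_utf8(captured, u8"\u30A2");
    // The stream receives the three UTF-8 bytes of U+30A2 unchanged.
    assert(captured.str() == "\xE3\x82\xA2");
}
```

With <iostream> included, write_utf8(std::cout, u8"ア") << '\n'; should then compile in C++20 mode, because no char8_t ever reaches operator <<.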
And that is also documented. To start reading on this subject, these two papers are the required starting point:
In that order.
Un-break it, somehow?
Getting char8_t unbroken (before the imminent C++20 release?) will be an interesting spectacle to watch.
The WG21 committee might even try to claim that GNU, LLVM, and MSFT are to implement it somehow, without committing to the final and official decision that is promised for the year 2023 and C++23.
And WG21 can claim it is the responsibility of WG14 to define char8_t first. Before it can be implemented anywhere, second.
Let us just hope compiler vendors will agree on a common char8_t format specifier, and all the other little devil details, and deliver a resilient implementation. In time for the C++20 release?
But wait, there is more! There is yet another un-informed devil detail: the standard library header <cuchar>. As of 2019 NOV this is the situation with that. Currently, and certainly, in a very uninformed state.
What could possibly go wrong?