[Update 2023-01-07] The author of this humble post has decided not to use C or C++ for any char, text, or string processing. It is just too much trouble for no obvious gain. No drama. Might use Go, a language co-designed by the two gentlemen who invented UTF-8.
[Update 2021-09-05] Here is the link to the whys and hows of Unicode, with a focus on Windows code. At last, I managed to find the time.
[Update 2021-03-18] Code in here is updated and there is a link to the Godbolt working version too.
(Note: this is the second part of C++ : codecvt deprecated. Panic? )
Update: This is not a foreign language translator or some such code. This is a standard C++17 utility that transforms strings between the standard character types. It is only meaningful for the 7-bit ASCII range (code points 0 to 127), that is. As such, it is remarkably useful and simple. For full-blown, locale-aware solutions please look elsewhere, starting from here. End of update.
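To make the caveat above concrete, here is a minimal sketch, assuming the conversion is a plain element-by-element copy through the iterator-range constructor of `std::basic_string` (which is what the converter in this post relies on). The names `narrow_copy` and `narrow_copy_demo` are hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, for illustration only: copies each code unit
// of a UTF-32 string into a char, with no transcoding.
inline std::string narrow_copy(const std::u32string& in)
{
    return std::string(in.begin(), in.end());
}

// Hypothetical self-check showing where the approach holds and where it stops.
inline void narrow_copy_demo()
{
    // ASCII survives the element-by-element copy unchanged.
    assert(narrow_copy(U"plain ASCII") == "plain ASCII");

    // The euro sign (U+20AC) takes three bytes in UTF-8, but the copy
    // yields a single truncated char: this is not a transcoder.
    const std::string euro = narrow_copy(U"\u20AC");
    assert(euro.size() == 1);
    assert(euro != "\xE2\x82\xAC");
}
```

Anything outside the ASCII range needs a real transcoding library rather than this kind of copy.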
The C++ standard library is very complete and useful. But there are times when you realize you can build one or two very simple utilities on top of it.
Simple but sometimes surprisingly powerful. Like perhaps this one.
It is a mechanism for transforming any standard sequence of chars (i.e. one holding a standard char type) into any of the four standard string types, which are:
| Type | Definition |
|---|---|
| std::string | std::basic_string<char> |
| std::wstring | std::basic_string<wchar_t> |
| std::u16string (C++11) | std::basic_string<char16_t> |
| std::u32string (C++11) | std::basic_string<char32_t> |
First, the reason you are here, the code:
```cpp
// https://godbolt.org/z/bj6vzj
#include <assert.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <string_view>
#include <array>
#include <vector>
#include <type_traits> // std::integral_constant, std::is_same
#include <utility>     // std::declval

namespace dbj {

// note: char8_t requires C++20
template< typename C >
struct is_char : std::integral_constant<bool,
    std::is_same<C, char>::value ||
    std::is_same<C, char8_t>::value ||
    std::is_same<C, char16_t>::value ||
    std::is_same<C, char32_t>::value ||
    std::is_same<C, wchar_t>::value>
{};

// inspired by https://nlitsme.github.io/2019/10/c-type_tests/
struct not_this_one {}; // Tag type for detecting which begin/ end are being selected

// Import begin/ end from std here so they are considered
// alongside the fallback (...) overloads in this namespace
using std::begin;
using std::end;

not_this_one begin( ... );
not_this_one end( ... );

template <typename T>
struct is_range {
    constexpr static const bool value =
        !std::is_same<decltype(begin(std::declval<T>())), not_this_one>::value &&
        !std::is_same<decltype(end(std::declval<T>())), not_this_one>::value;
};

template <typename T>
constexpr inline bool is_range_v = dbj::is_range<T>::value ;

namespace inner {
    // (c) 2018 - 2021 by dbj.org,
    // Disclaimer, Terms and Conditions:
    // https://dbj.org/dbj_license
    //
    template < typename return_type >
    struct meta_converter final
    {
        template<typename T>
        return_type operator () (T arg)
        {
            if constexpr (dbj::is_range_v<T>) {
                static_assert (
                    // arg must have this typedef
                    dbj::is_char< typename T::value_type >{}(),
                    "can not transform ranges not made of std char types"
                );
                // element-by-element copy into the target string type
                return { arg.begin(), arg.end() };
            }
            else {
                // not a range: treat arg as a pointer to chars, build the
                // matching std::basic_string, then convert that range
                using actual_type = std::remove_cv_t< std::remove_pointer_t<T> >;
                return this->operator()( std::basic_string<actual_type>{ arg } );
            }
        }
    }; // meta_converter
} // inner

// all the types required / implicit instantiations
using char_range_to_string = inner::meta_converter<std::string >;
using wchar_range_to_string = inner::meta_converter<std::wstring >;
using u16char_range_to_string = inner::meta_converter<std::u16string>;
using u32char_range_to_string = inner::meta_converter<std::u32string>;

} // dbj
```
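The range detection used in the code above may be easier to follow in isolation. The following is a minimal sketch of the same fallback-overload idea, placed in a hypothetical `sketch` namespace; the names `no_match` and `has_begin_end_v` are illustrative and not part of the original code.

```cpp
#include <iterator>     // std::begin, std::end
#include <string>
#include <type_traits>  // std::is_same_v
#include <utility>      // std::declval
#include <vector>

namespace sketch {

struct no_match {};      // returned only by the fallback overloads

using std::begin;        // make std::begin / std::end visible here,
using std::end;          // next to the fallbacks below

no_match begin(...);     // chosen only when nothing better matches
no_match end(...);

// True when both begin(x) and end(x) resolve to a real overload.
template <typename T>
inline constexpr bool has_begin_end_v =
    !std::is_same_v<decltype(begin(std::declval<T>())), no_match> &&
    !std::is_same_v<decltype(end(std::declval<T>())), no_match>;

} // namespace sketch

static_assert( sketch::has_begin_end_v<std::string>);
static_assert( sketch::has_begin_end_v<std::vector<wchar_t>>);
static_assert(!sketch::has_begin_end_v<const char*>);
static_assert(!sketch::has_begin_end_v<int>);
```

The ellipsis overload is the worst possible match in overload resolution, so it wins only when no real `begin` or `end` applies. That is why a plain pointer such as `const char*` is reported as "not a range" and takes the other branch of the converter.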
One struct with one function call operator does it all. The usage:
```cpp
#define TX(X_) std::cout << "\n" << #X_ << " : " << (X_)

int main()
{
    using namespace std::literals::string_literals;

    // make a converter to convert to
    // std::string
    dbj::char_range_to_string string_conv{};

    // all of the below will do the conversion
    auto s1 = string_conv( R"#(\"the\0\0standard string literal\")#"s );
    TX( s1.data()) ;

    // array to string
    auto s2 = string_conv( std::array<wchar_t , 6>{ L'A', L'r', L'r', L'a', L'y', L'!'} );
    TX( s2.data()) ;

    // vector to string
    auto s3 = string_conv( std::vector<wchar_t>{ L'V', L'e', L'c', L't', L'o', L'r'} );
    TX( s3.data()) ;
}
```
And so on. Any standard sequence made up of standard chars is a legal input, as long as it has `begin()` and `end()` methods and the `value_type` typedef. Native string literals are legal input too.
```cpp
// native string literals conversions:
// each literal first becomes the matching std basic string type,
// which string_conv then converts to std::string
{
    auto s1 = string_conv( "the native string literal");
    auto s2 = string_conv( L"wide native string literal");
    auto s3 = string_conv( u"u16char native string literal");
    auto s4 = string_conv( U"u32char native string literal");
}
```
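Native literals reach the converter as pointers, because the call operator takes its argument by value and the array decays. Here is a minimal sketch of how the character type can then be recovered from the pointer type, mirroring the `remove_pointer_t` / `remove_cv_t` step in the converter code. The alias `pointee_char_t` and the functions `literal_to_basic_string` and `literal_demo` are hypothetical names used only here.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical alias: strips the pointer and cv-qualifiers,
// e.g. const wchar_t* -> wchar_t.
template <typename P>
using pointee_char_t = std::remove_cv_t<std::remove_pointer_t<P>>;

static_assert(std::is_same_v<pointee_char_t<const char*>,     char>);
static_assert(std::is_same_v<pointee_char_t<const wchar_t*>,  wchar_t>);
static_assert(std::is_same_v<pointee_char_t<const char16_t*>, char16_t>);
static_assert(std::is_same_v<pointee_char_t<const char32_t*>, char32_t>);

// Hypothetical helper: a literal passed by value decays to a pointer,
// from which the matching std::basic_string is built.
template <typename P>
std::basic_string<pointee_char_t<P>> literal_to_basic_string(P literal)
{
    return std::basic_string<pointee_char_t<P>>{ literal };
}

// Hypothetical self-check of the deduced string types.
inline void literal_demo()
{
    auto w = literal_to_basic_string(L"wide");
    static_assert(std::is_same_v<decltype(w), std::wstring>);
    assert(w == L"wide");

    auto u = literal_to_basic_string(U"u32");
    static_assert(std::is_same_v<decltype(u), std::u32string>);
    assert(u.size() == 3);
}
```

Once the literal is a proper `std::basic_string`, it is an ordinary range and can go through the same element-by-element path as any other input.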
`char8_t` is best avoided. We could also serve stunt programmers to a certain extent, too:
```cpp
// converter
dbj::wchar_range_to_string to_wstring{};

// we can serve stunt men to
// convert, for example, a reference to
// an array of chars into std::wstring
auto ws_ = to_wstring( static_cast<const char(&)[]> ( "Abra Ca Dabra") );
```
Perhaps (one might remark) we could code this in a more “resilient” way. But why should we? Using (for example) non-standard strings as return type simply will not compile.
And after all, it is certainly wise to wait for C++20 constraints and concepts to appear soon in a compiler near you. Applying that standard feature will certainly make for a more resilient and more user-friendly version.
In case you would like to try this yourself but need some guidance, do mail us, please.