Sandor Dargo's Blog
On C++, software development and books
C++26: Cleaning up string literals
Sandor Dargo, Jun 10, 2026
6 min
The two papers we are covering today complement each other: both improve how string literals are handled in C++26. P2361R6 tackles strings that are never evaluated, the ones that only exist at compile time. P1854R4 tackles evaluated string literals, making non-encodable characters ill-formed instead of implementation-defined. Let’s start with the unevaluated side.

To be precise, we are not only talking about C++26 here. The changes for evaluated contexts (P1854R4) were accepted as a defect report, so they apply retroactively, all the way back to the first standardized version, C++98.
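To make the distinction concrete before diving in, here is a minimal sketch of where each kind of string shows up. All the names in it (parse_v1, parse_v2, c_api_entry, greeting, check_greeting) are made up for illustration, and the [[nodiscard]] message assumes a C++20 or later compiler.

```cpp
#include <cassert>
#include <cstring>

// All names below are hypothetical and exist only for illustration.

// Unevaluated strings: the compiler reads them during translation;
// the program itself never sees them as string data.
[[deprecated("use parse_v2 instead")]]                      // attribute argument
int parse_v1(int x) { return x; }

[[nodiscard("the result must be checked")]]                 // attribute argument (C++20)
int parse_v2(int x) { return x * 2; }

extern "C" int c_api_entry(int x) { return x + 1; }         // linkage specification

static_assert(sizeof(char) == 1, "char must be one byte");  // static_assert message

// Evaluated string literal: it has an encoding and becomes an array
// of char that the program can read at runtime.
constexpr char greeting[] = "hello";

// Only the evaluated literal is observable from running code.
inline void check_greeting() {
    assert(std::strlen(greeting) == 5);
    assert(greeting[0] == 'h');
}
```

Calling parse_v1 would typically trigger a deprecation warning that quotes the attribute's string. How that text is displayed is up to the compiler.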
P2361R6: Unevaluated strings

Not all string literals in C++ end up in your compiled program. Some are used exclusively at compile time: the compiler reads them, acts on them, and then they’re gone. They never make it into the binary.

These strings appear in the following contexts:

- _Pragma directives
- #line directives
- [[nodiscard]] and [[deprecated]] attributes
- extern linkage specifications
- asm statements
- static_assert messages
- Literal operator names

Since these strings are never evaluated at runtime, they don’t need to be converted to the execution encoding or any encoding specified by a prefix like L, u8, u, or U. Despite this, before C++26, the standard didn’t formally distinguish them from regular string literals. Compilers handled them inconsistently.

What changes

P2361R6 introduces the concept of an unevaluated string and establishes clear rules:

- No encoding prefix is allowed. Writing static_assert(false, L"bad") or static_assert(false, u8"bad") becomes ill-formed.
- The string is not converted to the execution encoding. The compiler keeps the original sequence of characters for diagnostic purposes.
- Among escape sequences, only universal-character-names and simple-escape-sequences (except \0) are permitted. These are replaced by the corresponding Unicode codepoints. Numeric escape sequences (like \x1B or \077) and conditional escape sequences are ill-formed. Regular characters, including non-printable ones embedded directly in the source, are still allowed.

You might consider the last point too restrictive, but it makes sense. Since the compiler doesn’t know which encoding to convert these strings to, numeric escape sequences have no meaningful interpretation. There’s no way for them to denote a valid code unit in an unknown encoding. Simple escape sequences like \n or \t are fine because they map to well-known Unicode codepoints.

Despite the changes, how diagnostic messages are presented does not change.
That remains a quality-of-implementation concern. The proposal explicitly does not restrict which Unicode characters can appear in unevaluated strings, only which escape mechanisms can be used. So non-printable characters, control characters, and invisible characters can still appear directly in the source text or via simple escape sequences. The compiler just has to figure out how to display them.

Are these breaking changes?

Technically, yes. But a large survey of over 90 million lines of open source code found essentially no real-world code affected by these restrictions. The only uses of encoding prefixes on _Pragma or static_assert strings turned up in Clang’s own test suite.

Future directions

The paper explicitly notes that these changes don’t prevent supporting constant expressions in static_assert or attributes in the future. One could imagine a grammar where:

static_assert(some_condition, u8"explanation");
becomes valid again, not as an unevaluated string, but as a constant expression that happens to be a u8 string literal. That would be the subject of a separate proposal.

P1854R4: Making non-encodable string literals ill-formed

Now let’s turn to strings that are evaluated, the ones that end up in your binary. Before C++26, if you put a character into a string literal that couldn’t be represented in the literal’s associated encoding, the behavior was implementation-defined. MSVC would silently use ? as a substitute for unrepresentable characters. GCC would emit a diagnostic. Clang, which always uses UTF-8 for narrow literals, avoided the problem entirely.

This kind of implementation-defined behavior is dangerous. It can lead to silently incorrect programs: your string looks right in the source code, but the compiled binary contains something different.

P1854R4 makes this straightforward: if a character cannot be represented in the literal’s associated encoding, the program is ill-formed. No more silent substitution.

This paper is part of a broader effort, alongside P2362R3 and P2621R3, to make lexing less surprising when non-Latin1 characters appear in source code.

What...