OpenAI won't let you "escape" freely in JSON mode


Authors Weixuan Xiao

Published June 23, 2026

TL;DR

Accented characters like é may be escaped in JSON as \u00e9. We found that OpenAI’s and Azure OpenAI’s endpoints can’t emit these correctly in JSON mode: after the prefix \u00, the decoder allows only control-character completions (\u0000 - \u001f). So é cannot form. The output stays valid JSON but holds the wrong bytes — typically a NUL plus literal e9 (\u0000e9).

Once parsed, those control bytes can break production systems: for example, PostgreSQL rejects NUL in text fields, and logs and indexes can be corrupted silently. This is not a JSON limitation: RFC 8259 permits any \uXXXX escape. It is an undocumented endpoint constraint from OpenAI and Azure OpenAI.

This applies not only to accented characters but to any character that the model tries to generate as a \uXXXX escape sequence in JSON, including CJK, Hindi, and Cyrillic text.

The constraint doesn’t limit what the model can express: JSON can accept raw UTF-8, so é or the other UTF-8 characters don’t necessarily need to be escaped. The trouble starts when prompt examples use \uXXXX escapes (the default in Python!): the model imitates them, attempts the escape, and hits the blocked path. Show raw characters in your examples and the failure never arises.
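A minimal sketch of that difference, assuming the prompt examples are built with Python's standard-library json module. The sample record is made up for illustration.

```python
import json

# Hypothetical few-shot example record; the content is illustrative only.
example = {"label": "Crédit"}

# Default behavior (ensure_ascii=True): every non-ASCII character is escaped.
escaped = json.dumps(example)

# ensure_ascii=False keeps the raw UTF-8 characters in the output.
raw = json.dumps(example, ensure_ascii=False)

print(escaped)  # {"label": "Cr\u00e9dit"}
print(raw)      # {"label": "Crédit"}
```

Both strings parse back to the same object, so switching to ensure_ascii=False changes only what the model sees in the prompt, not the data.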

Introduction

Since the announcement of JSON mode at OpenAI DevDay in 2023, and the broader push toward Structured Outputs (schema-constrained generation) (OpenAI, 2024), many teams have increasingly relied on LLMs to act as trustworthy serializers: “it can just return valid JSON as requested and we can use the JSON directly in production.”

In practice, these features do a great job at enforcing outer structure: objects vs. arrays, required keys, field types, enums, and “no extra fields.” But they do not automatically guarantee the semantic correctness of the content.

This article describes a failure mode we observed under OpenAI’s JSON mode, where the output can be syntactically valid JSON yet contain corrupted strings. The downstream consequences include database errors (especially for modern solutions that use PostgreSQL as a backend, such as Neon and Supabase), confusing system logs caused by invisible and unexpected characters, and frontend failures during serialization.

We first present what we observed in production and explain Structured Outputs and the JSON grammar. Then, we demonstrate our experiments and the results to reproduce the failures under JSON mode on OpenAI endpoints. Finally, we show our findings and conclude with mitigations and takeaways.

Observations

At Giskard, we have some Python tooling that uses a PostgreSQL database and OpenAI endpoints. The application:

sends requests to OpenAI endpoints and enables JSON mode

parses the JSON results

saves the parsed contents in the database.

In backend logs, we occasionally saw errors where PostgreSQL rejected an insert of a string field because it contained a null byte (\x00, which PostgreSQL does not accept in text fields):

invalid byte sequence for encoding "UTF8": 0x00

We found that these invalid strings were coming from LLM outputs and had mismatches in accented characters:

What we expected: French characters with accents in the words, e.g. é

What we actually got: fragments like Cr\x00e9dit in the Python backend

A simplified illustration as Python strings:

expected: "é" (U+00E9 as a Unicode code point)
got: "\x00e9" (a NUL byte followed by the two ASCII letters "e9")

Intuitively, it looks like the model meant to produce \u00e9 but “with extra zeros”, yielding \u0000e9. After JSON parsing in Python, the \u0000 becomes \x00 (a NUL byte) and the e9 becomes two literal characters, e and 9, in the Python string. PostgreSQL validates the UTF-8 string before saving it, and the insert fails because of the NUL.
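The parsing step can be reproduced with the standard-library json module. This is a sketch: the two input literals are reconstructed from the fragments described above, and contains_nul is a hypothetical helper, not part of our tooling.

```python
import json

# JSON text reconstructed from the observed fragment (assumption: the
# endpoint emitted \u0000 followed by the literal characters "e9").
llm_output = r'"Cr\u0000e9dit"'
# What the model presumably meant to emit.
intended = r'"Cr\u00e9dit"'

got = json.loads(llm_output)
expected = json.loads(intended)

print(repr(got))                # 'Cr\x00e9dit'
print(repr(expected))           # 'Crédit'
print(len(got), len(expected))  # 8 6


def contains_nul(text: str) -> bool:
    """Hypothetical guard: True if the parsed string holds a NUL character."""
    return "\x00" in text


print(contains_nul(got), contains_nul(expected))  # True False
```

Both inputs are valid JSON, so the parser raises no error. Only a check on the parsed content, such as the guard above, catches the problem before the database does.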

We also found related reports online: a similar issue occurs with German letters and remains unresolved on the OpenAI community forum.

JSON mode, which we enabled, is part of the implementation of Structured Outputs. It is meant to guarantee that the generated content follows the required JSON structure.

In the JSON specification, RFC 8259 §7 (Strings) specifies that JSON strings can include raw Unicode characters (UTF-8) or escape sequences, as follows:

A Unicode escape sequence has the form \u followed by exactly four hex digits, representing one UTF-16 code unit.

ASCII control characters U+0000–U+001F cannot appear raw inside JSON strings, but they are allowed when escaped (e.g., \u0000, \u001a, \u001f).

Characters above U+FFFF can be represented as a surrogate pair (e.g., \uD83D\uDE00 for 😀).

Therefore, both é and \u00e9 can be used to represent é (U+00E9) in a JSON string.
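These rules can be checked with a short sketch using Python's standard-library json parser, which we assume here follows RFC 8259 for string escapes in its default strict mode.

```python
import json

# A raw character and its \uXXXX escape decode to the same string.
raw_form = json.loads('"é"')
escaped_form = json.loads(r'"\u00e9"')
assert raw_form == escaped_form == "\u00e9"

# An escaped control character is legal JSON and decodes to that character.
nul = json.loads(r'"\u0000"')
assert nul == "\x00"

# A raw (unescaped) control character inside a string is rejected.
try:
    json.loads('"a\x00b"')
    raw_control_accepted = True
except json.JSONDecodeError:
    raw_control_accepted = False

# A character above U+FFFF, written as a surrogate pair, decodes to one
# code point.
emoji = json.loads(r'"\uD83D\uDE00"')
assert emoji == "\U0001F600" and len(emoji) == 1
```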

Notice that the LLM output presented in the example above is also a valid JSON string: "\u0000e9". However, it does not represent é. Instead, it represents:

\u0000 (NUL)

followed by the literal...
