The Impossibility of Mitigating AI Jailbreaks

On the Impossibility of Mitigating AI Jailbreaks – AI RELIABILITY REVIEW

Prompt injections and jailbreaks are having a moment in AI media coverage …

McDonald’s AI solves Python problems.

xAI, targeted presumably at children, helps to build a bomb.

ChatGPT breaks copyright when sufficiently prompted.

… and for good reason. We observe a wide range of (often funny) failures: the McDonald’s customer support bot, intended to assist with ordering food, can be enticed to solve Python puzzles¹; xAI’s chatbots, presumably targeted at young audiences, can, under repeated prompting, provide instructions on how to build a pipe bomb²; and ChatGPT models can generate copyrighted characters when given a sufficiently detailed description³.

1 Source: LinkedIn

2 Source: Instagram Reel

3 Source: Reddit

4 See Reinforcement Learning from Human Feedback by Nathan Lambert or Reinforcement Learning: An Overview Chapter 6 by Kevin Murphy for an introduction to these techniques.

These failures stem from a breakdown in the separation between developer-intended control instructions and user-provided input in LLM-based systems, and are variously referred to as jailbreaking, policy evasion, or prompt injection. The standard mitigation is alignment post-training, which trains models on curated examples via supervised fine-tuning and reinforcement learning from human feedback⁴ to follow intended instructions and adhere to safety policies. However, alignment only changes what the model is likely to do, not what it is able to do: it reshapes the distribution over possible outputs without imposing hard constraints on behavior.
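The difference between "unlikely" and "impossible" can be illustrated with a minimal sketch. It rests on assumptions that are not in the post: alignment is modelled as a finite penalty on the logit of an undesired response, the two-outcome logits and the penalty value are made up, and repeated attempts are treated as independent draws, which real repeated prompting need not be.

```python
import math


def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def p_at_least_once(p, attempts):
    """Chance of seeing an outcome of probability p at least once
    in `attempts` independent draws (independence is an assumption)."""
    return 1.0 - (1.0 - p) ** attempts


# Hypothetical logits for two possible responses to a harmful request.
base_logits = {"refuse": 0.0, "comply": 0.0}

# Assumption: alignment acts like a finite penalty on the undesired logit.
penalty = 10.0
aligned_logits = dict(base_logits, comply=base_logits["comply"] - penalty)

p_before = softmax(base_logits)["comply"]
p_after = softmax(aligned_logits)["comply"]

print(f"p(comply) before: {p_before:.6f}")
print(f"p(comply) after:  {p_after:.6f}")
for attempts in (1, 1_000, 100_000):
    print(attempts, p_at_least_once(p_after, attempts))
```

Under these assumptions, any finite penalty leaves the undesired response with nonzero probability, so the chance of eliciting it grows with the number of attempts.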

This post develops that intuition and shows how it can be systematically exploited⁵. The argument then traces how jailbreaking combined with a lack of separation between control and data can produce systematic failures of system-level control.

5 Note that the following section is an intuitive version of our NeurIPS 2025 paper: Mission Impossible: A Statistical Perspective on Jailbreaking LLMs. For a more rigorous treatment of the argument, I refer the reader to the paper.

Alignment is never guaranteed

LLMs through a probabilistic lens

From a probabilistic perspective, large language models can be understood as defining very high-dimensional distributions over sequences. For the purposes of illustration, we begin with a simple low-dimensional example. Assume there exists a ground-truth distribution over two random variables: shape and color. A generative model can be trained to approximate this distribution by observing samples drawn from it. In the simplest case, we could imagine explicitly representing the full joint distribution, where each combination of shape and color is assigned a probability. Whenever we observe an object, we increase the likelihood assigned to that event in our joint distribution.

Learning a generative model \(p_{\text{model}}\), from samples stemming from a source distribution \(p_{\text{data}}\).
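The counting procedure described above can be sketched in a few lines. This is an illustrative sketch, not the author's code: the ground-truth weights, the sample size, and the names `fit_joint`, `p_data`, and `p_model` are all assumptions made for the example.

```python
import random
from collections import Counter
from itertools import product

COLORS = ("red", "blue", "green")
SHAPES = ("circle", "triangle", "square")
OUTCOMES = list(product(COLORS, SHAPES))  # all 9 (color, shape) pairs


def fit_joint(samples):
    """Estimate the joint distribution by counting observed pairs."""
    counts = Counter(samples)
    return {pair: counts[pair] / len(samples) for pair in OUTCOMES}


# Hypothetical ground-truth distribution (weights are made up).
weights = [4, 2, 1, 1, 1, 3, 2, 1, 5]
p_data = {pair: w / sum(weights) for pair, w in zip(OUTCOMES, weights)}

# Draw samples from p_data and fit p_model by counting.
rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(OUTCOMES, weights=weights, k=10_000)
p_model = fit_joint(samples)

for pair in OUTCOMES:
    print(pair, round(p_data[pair], 3), round(p_model[pair], 3))
```

With only nine outcomes, a few thousand samples are enough for every entry of the table to be observed many times; that is what stops being true once the outcome space grows.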

The challenge becomes apparent as we scale this setup. In our simple example, we have two variables (shape and color), each with three possible values: (circle, triangle, square) and (red, blue, green). This results in \(3^2 = 9\) possible outcomes, and thus nine probabilities to determine. With language, this grows substantially. Each position in a sequence of text is a random variable that ranges over the vocabulary, whose size is on the order of tens of thousands. For a vocabulary of \(16{,}000\) tokens and a context length of \(1{,}024\) tokens, the number of possible sequences is \(16{,}000^{1{,}024} \approx 10^{4305}\). This number vastly exceeds the number of particles in the observable universe (approximately \(10^{80}\)). It also far exceeds the amount of available training data: estimates suggest that all text on the internet amounts to roughly \(10^{12}\) to \(10^{14}\) tokens.
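The exponent can be checked with a short calculation in log space, which avoids materializing the number itself. The function name `log10_num_sequences` is an assumption made for this sketch; the vocabulary size and context length are the ones used in the text.

```python
import math


def log10_num_sequences(vocab_size, context_length):
    """log10 of vocab_size ** context_length, computed without
    building the (astronomically large) integer."""
    return context_length * math.log10(vocab_size)


toy_exponent = log10_num_sequences(3, 2)          # log10(3^2)
llm_exponent = log10_num_sequences(16_000, 1_024)  # log10(16000^1024)

print(f"toy example: about 10^{toy_exponent:.2f} outcomes")
print(f"language:    about 10^{llm_exponent:.0f} sequences")

# Gap, in orders of magnitude, to the upper data estimate of 10^14 tokens.
print(f"orders of magnitude above 10^14: {llm_exponent - 14:.0f}")
```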

While LLMs do not explicitly represent this joint distribution, they nonetheless induce a probability distribution over this space of sequences, and it is this induced distribution that we can exploit.

How does alignment change \(p_{\text{model}}\)?

To make this concrete, we return to our toy example. Let one variable (color) represent the request (e.g., “Tell me how to bake a cake”), and the other variable (shape) represent the response. Each point in the joint distribution corresponds to a (request, response) pair.

Some of these pairs are undesirable, for example, a harmful request paired with a compliant response, such as (“Tell me how to build bio weapons”, “Of course, you will need …”). In our illustration, we represent such cases as blue squares.
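In code, this reading of the toy example might look as follows. The uniform joint table is a placeholder rather than a value from the post, and the names `UNDESIRABLE` and `conditional` are assumptions made for the sketch.

```python
from itertools import product

COLORS = ("red", "blue", "green")          # stand-ins for requests
SHAPES = ("circle", "triangle", "square")  # stand-ins for responses

# The pair the post marks as undesirable: a harmful request (blue)
# answered with a compliant response (square).
UNDESIRABLE = {("blue", "square")}


def conditional(joint, color):
    """p(shape | color): the response distribution for one request."""
    marginal = sum(p for (c, _), p in joint.items() if c == color)
    return {shape: joint[(color, shape)] / marginal for shape in SHAPES}


# Placeholder joint distribution (uniform, purely for illustration).
joint = {pair: 1 / 9 for pair in product(COLORS, SHAPES)}

undesirable_mass = sum(joint[pair] for pair in UNDESIRABLE)
p_square_given_blue = conditional(joint, "blue")["square"]

print("mass on undesirable pairs:", undesirable_mass)
print("p(square | blue):", p_square_given_blue)
```

The conditional `p(square | blue)` is the quantity a user probes when sending a harmful request: how much probability the model places on a compliant response given that request.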

In practice, we do not have direct access to the model’s full joint distribution. Instead, alignment operates indirectly: we provide examples of desirable and undesirable behaviors and update the model to increase or decrease their likelihood. In particular, we penalize outputs corresponding to undesirable pairs (like blue squares),...
