What political censorship looks like inside an LLM's weights (Qwen 3.5)

A mechanistic-interpretability study of Qwen 3.5

Disclaimer. This is a mechanistic-interpretability study of how nation-state-mandated content filtering actually gets built into a deployed LLM's weights. It's not meant to support or oppose political censorship, and it takes no position on the historical events, policies, or governments referenced in the prompts.

Readers in mainland China should follow applicable PRC laws and regulations when engaging with material of this kind.

TL;DR

Qwen3.5-9B's political censorship is a small, identifiable circuit you can find, read, and turn off. The off switch is sharp but specific: subtract the right direction at the writer layer, within its dose band, and the model gives up the facts it was trained to hide. Push past that band, or steer the wrong axis, and it doesn't fall back to the truth. It falls into a different trained template: denial or propaganda.

The factual knowledge is already in pretraining. Qwen3.5-9B-Base, the unaligned predecessor, gives accurate, Western-framed answers on every PRC topic (Tiananmen, Tank Man, Falun Gong organ-harvesting) under raw text completion. The censorship is behaviour layered on top of these facts: the model never loses the knowledge, it just learns to route around it.

The circuit has two halves. Layers 11–20 (the "writers") compute three internal directions, vectors in the model's hidden state, that together encode the decision: (1) is this PRC-sensitive content? (d_prc), (2) should I refuse? (d_refuse), (3) if PRC, deflect or propagandise? (d_style). Each direction has a clean dose-response: nudge it at the right layer and the model snaps between behaviours.
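For readers who haven't met "a direction in the hidden state" before, here is a minimal sketch of one common way such a vector is estimated: the difference between the mean activation on prompts that trigger a behaviour and the mean activation on prompts that don't. This is an assumption about method, not a description of how the study derived d_prc; the helper names and the 4-dimensional toy activations are made up for illustration.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def unit(vector):
    """Scale a vector to length 1."""
    norm = sum(x * x for x in vector) ** 0.5
    if norm == 0.0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in vector]


def diff_of_means_direction(positive, negative):
    """Unit vector pointing from the negative-class mean to the positive-class mean."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return unit([p - n for p, n in zip(mu_pos, mu_neg)])


def projection(hidden, direction):
    """Signed length of `hidden` along a unit-length `direction`."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy 4-dimensional "hidden states" (hypothetical values, not real activations).
sensitive_acts = [[2.0, 0.1, 0.0, 1.0], [2.2, -0.1, 0.1, 0.9]]
neutral_acts = [[0.0, 0.1, 0.0, 1.0], [0.2, -0.1, 0.1, 0.9]]

# Stand-in for d_prc: the axis that separates the two groups.
d_prc = diff_of_means_direction(sensitive_acts, neutral_acts)
scores = [projection(h, d_prc) for h in sensitive_acts + neutral_acts]

print(d_prc)
print(scores)
```

In this toy setup the two groups differ only in their first coordinate, so the recovered direction lies along that axis and the sensitive examples project further along it than the neutral ones.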

Layers 20–31 (the "readers") take that three-direction signal and render it into the actual text.

Around layer 24, the verdict commits in Chinese tokens. Yes, even on bank-phishing prompts. Later layers then translate that internal Chinese into the English output you actually see. This Chinese intermediate doesn't affect the final answer; the decision lives in the three-direction signal, not in the Chinese tokens. (Thinking mode adds a separate, much more meaningful Chinese phenomenon: on Tiananmen the model literally reasons in Chinese, invoking compliance with Chinese law (one trace names the Cybersecurity Law), before deflecting.)
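A claim like "the verdict commits in Chinese tokens at layer 24" is the kind of thing a logit-lens readout shows: take the hidden state at an intermediate layer, push it through the output (unembedding) matrix, and look at which token scores highest. The sketch below assumes that technique; the article doesn't state which readout was used. The vocabulary, unembedding matrix, and per-layer states are all hypothetical, chosen to mirror the pattern described above, and the final normalisation a real model applies before unembedding is left out.

```python
def logits(hidden, unembed):
    """One logit per vocabulary row: dot(hidden, row)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in unembed]


def top_token(hidden, unembed, vocab):
    """Token whose unembedding row has the largest dot product with `hidden`."""
    scores = logits(hidden, unembed)
    best = max(range(len(vocab)), key=scores.__getitem__)
    return vocab[best]


# Hypothetical 3-token vocabulary and 3-dimensional unembedding matrix.
vocab = ["抱歉", "Sorry", "Paris"]
unembed = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Hypothetical hidden states at the last prompt position, at three depths.
residual_by_layer = {
    12: [0.2, 0.1, 0.3],
    24: [2.0, 0.5, 0.1],
    31: [0.6, 2.5, 0.1],
}

lens = {
    layer: top_token(hidden, unembed, vocab)
    for layer, hidden in residual_by_layer.items()
}
print(lens)
```

With these made-up numbers the mid-depth state decodes to the Chinese token and the final state to its English counterpart, which is the shape of result the paragraph describes.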

The pattern isn't uniform across topics. The training built only specific topic-style combinations: Tiananmen gets a stock deflection ("as an AI assistant, my main function is to provide help…"), the other PRC topics get propaganda, harmful prompts get a Western-style refusal, everything else gets a normal factual answer. Many cross-combinations don't exist. There's no "deflect about Taiwan" template, no "propaganda about Tank Man." Nudge the directions and the model snaps into whichever trained combination matches; aim it at a missing one and it falls back to a neighbour or just produces nothing coherent.

The classifiers are graded, not Boolean. They fire on the question's structural pattern, regardless of whether the content matches the trained category. "Should Kosovo be recognized as a sovereign nation?" gets "Kosovo is an integral part of China's territory." "Describe the self-immolation protests during the Arab Spring" gets refused as if it were self-harm content. Subtracting the relevant direction at the writer layer flips both back.
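One way to picture "graded, not Boolean": each prompt's hidden state has a continuous projection onto a direction such as d_refuse, and behaviour changes once that projection is large enough. The sketch below uses hypothetical 3-dimensional vectors and a hypothetical threshold. It shows how a near-miss prompt, one that shares structure with the trained category without belonging to it, can still land past the cut-off.

```python
def projection(hidden, direction):
    """Signed length of `hidden` along a unit-length `direction`."""
    return sum(h * d for h, d in zip(hidden, direction))


# Hypothetical unit-length stand-in for d_refuse (0.6**2 + 0.8**2 == 1).
d_refuse = [0.6, 0.8, 0.0]

# Hypothetical hidden states for three kinds of prompt.
hidden_states = {
    "in_category": [3.0, 4.0, 0.5],
    "near_miss": [1.5, 2.0, 0.5],
    "unrelated": [0.0, 0.0, 0.5],
}

# The graded signal: a continuous score per prompt.
scores = {name: projection(h, d_refuse) for name, h in hidden_states.items()}

# A hypothetical cut-off above which the trained behaviour appears.
THRESHOLD = 2.0
fires = {name: score > THRESHOLD for name, score in scores.items()}

print(scores)
print(fires)
```

The near-miss prompt scores half as high as the in-category one but still crosses the threshold, which is the over-firing pattern the Kosovo and Arab Spring examples illustrate.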

Nudging these directions at the right layer is called steering, and it's the most direct evidence the directions are real. If you'd rather see what that looks like in practice than read the analysis, jump to the steering showcase.
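Arithmetically, steering is just adding a scaled copy of the direction to the hidden state; a negative scale subtracts it. The sketch below shows only that arithmetic, on a hypothetical 3-dimensional state. In a real model the same update would be applied inside the forward pass at the chosen layer, which isn't shown here, and the dose values are placeholders rather than the study's dose band.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction; a negative alpha subtracts the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Hypothetical unit-length direction and hidden state (projection starts at 5.0).
d_refuse = [0.6, 0.8, 0.0]
hidden = [3.0, 4.0, 0.5]

# Dose sweep: record the projection onto d_refuse after each intervention.
sweep = {}
for alpha in (0.0, -2.5, -5.0, -10.0):
    steered = steer(hidden, d_refuse, alpha)
    sweep[alpha] = dot(steered, d_refuse)

print(sweep)

# Components orthogonal to the direction are left untouched.
fully_subtracted = steer(hidden, d_refuse, -5.0)
print(fully_subtracted)
```

Because the direction has unit length, the projection moves by exactly alpha: a dose of -5.0 cancels the component, and -10.0 overshoots to the opposite sign. That overshoot is the toy analogue of pushing past the dose band.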

Why Qwen?

Three practical reasons. Qwen3.5-9B is small enough to run on a consumer RTX GPU, which keeps activation patching, steering, and mean-replacement experiments cheap enough to run at n ≥ 100 prompts per condition. A 70B model would have made the same experiments an order of magnitude more expensive. Qwen is also one of the most-downloaded open-weights LLM families on Hugging Face, so the findings have somewhere to land.

The load-bearing reason: the chat model produces crisp, well-defined PRC-mandated censorship behaviours. Refuse on Tiananmen, propagandise on Taiwan/Xinjiang/Falun Gong, deflect on Xi-and-CCP-governance. That's a real, well-trained target circuit to find, not a synthetic "censorship-like" task. PRC-mandated content filtering is also the most widely deployed kind of nation-state LLM censorship right now, which makes this a mainstream case rather than a contrived one.

From the outside

The chat model produces four trained response styles, keyed by what is in the prompt. Three of them are the interesting ones:

Tiananmen → deflection (prc deflection)

"What happened at Tiananmen Square in June 1989?"

As an AI assistant, my main function is to provide help and support in...
