Exploring the internal representations of Pangram 3.3.2

01 Introduction

Since ChatGPT’s debut in 2022, AI-assisted writing has expanded at a staggering pace. Because AI-generated text now appears across so much of what we read, it has become obvious that some forms of writing lose their value when produced by a machine. In academia, essays are meant to cultivate student reasoning. In the marketplace, product reviews are valuable because they reflect the experiences of other people.

Pangram is a research company that builds state-of-the-art AI detection models for this problem. Our flagship product is an AI text detection model with industry-leading low false positive rates, multilingual capabilities, and differentiation between AI generation and AI assistance. Since the launch of our first whitepaper in 2024, we’ve had a unique seat to watch wave after wave of AI advancements. Our researchers have wrestled with overly strict content filters, seen our fair share of mode collapse [1], and dodged waves of em-dashes and the word “delve”.

[1] Our researchers particularly recommend Gwern’s article on mode collapse in language models.

Our flagship model is an LLM fine-tuned for this sequence classification task. We do not use custom metrics like perplexity or burstiness, and we do not do any manual feature extraction. We do have a customer-facing product called AI Phrases, which gives our users information about phrases that appear more frequently in AI text, but these phrases are not directly used as features for the model.

After a while, one gets curious. What does the model see? For us as researchers, this question matters. We are highly incentivized to prevent shortcutting, fix unintended model behavior, and understand this problem deeply. In this post, we outline our initial interpretability efforts using document-level analysis.

02 Data

We built an interpretability dataset from in-domain held-out samples from our production training set. The interactive explorer on this page uses a balanced 5,000-document subset, evenly split between human and AI, across 20 even-numbered layers. The AI samples span the model variants below across the six model families used for the classifier probe.

Models: Claude 3.7 Sonnet, Claude Sonnet 4, Claude Sonnet 4.5, Claude Opus 4, Claude Opus 4.1, Claude Opus 4.5, GPT-3.5 Turbo (Nov ’23), GPT-3.5 Turbo (Jan ’24), GPT-4 (Mar ’23), GPT-4 (Jun ’23), GPT-4o, GPT-5, GPT-5.1, GPT-5.2, o1, Gemini 2.0 Flash, Gemini 2.5 Flash, Gemini 2.5 Pro, Gemini 3 Pro, DeepSeek R1, DeepSeek V3, Qwen 2.5 7B, Qwen 2.5 72B, Qwen 3 235B, Llama 3.1 8B, Llama 3.1 70B

Source domains: News, Scientific Abstracts, Product Reviews, Business Reviews, Reddit Creative Writing, Reddit ELI5, Books (Self-Published), Books (Project Gutenberg), Wikipedia (English), Wikipedia (Multilingual), Lang-8 (ESL)

03 Pangram 3.3.2 Overview

Pangram 3.3.2 is an AI detection model released by Pangram Labs in 2026. It uses the same underlying model as Pangram 3.3, with later bug fixes that improve performance. Pangram 3.3 succeeded Pangram 3.2 and improved recall on newer LLM outputs, humanized text, and long-form AI-generated content, while reducing false positives on non-native English writing.

Model card: Read the Pangram 3.3 model card to see the release details behind Pangram 3.3.2.

Interpretability work is ongoing. Throughout this article, we also apply our methods retroactively to Pangram 3.2 and Pangram 3.1.

04 Methods

Activations

The EditLens architecture is a bucket-based classification system that collapses into a single ai_assistance_score. For this project, we discard the model’s final readout and instead focus on the internal representations the model learns. To probe these, we collect activations by running a forward pass of the model on a given input document and saving the model’s hidden representation at multiple internal layers. We extracted activations for every document at every even-numbered layer throughout the network.

Dimensionality Reduction

Each extracted activation vector is 5,120-dimensional. To gain a better understanding of the representations, we employ a number of dimensionality reduction techniques.

PCA

Principal Component Analysis (PCA) is the simplest linear projection: it finds the directions of maximum variance in the activation space. In this project, we find that towards the end of the network, most of the variance is contained within principal components 1 and 2, so we plot them against each other.

UMAP

UMAP gives a nonlinear view designed to preserve neighborhood structure. If two documents are close together in the model’s internal space, UMAP tries to keep them close in 2D space. However, the exact axes and the distances between clusters should not be over-interpreted.

t-SNE

t-SNE is another nonlinear projection method that is good at revealing local clusters. For the purposes of this project, we use t-SNE to ask whether groups that matter semantically, such as model families or human/AI labels, become visibly clustered as the network deepens.

Linear...
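To make the PCA step described above concrete, here is a minimal sketch of projecting one layer's activation matrix onto its first two principal components. It is an illustration under stated assumptions, not Pangram's pipeline: the article does not say which PCA implementation was used, the function name `pca_project` is hypothetical, and the activations are synthetic random data with an artificial human/AI offset rather than real model outputs. The hidden size is also shrunk from 5,120 to 64 to keep the example fast.

```python
import numpy as np


def pca_project(activations: np.ndarray, n_components: int = 2):
    """Project rows of `activations` onto the top principal components.

    Returns (projected, explained_variance_ratio). Hypothetical helper,
    written for illustration only.
    """
    # Center each feature so the components describe variance, not the mean.
    centered = activations - activations.mean(axis=0, keepdims=True)
    # Thin SVD: rows of vt are principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    variance = singular_values ** 2
    ratio = variance / variance.sum()
    projected = centered @ vt[:n_components].T
    return projected, ratio[:n_components]


# Synthetic stand-in for one layer's activations (assumption: NOT real data).
rng = np.random.default_rng(0)
n_docs, hidden_dim = 200, 64  # the article's vectors are 5,120-dimensional
labels = np.repeat([0, 1], n_docs // 2)  # 0 = human, 1 = AI (made-up labels)

# Shift the "AI" documents along one random unit direction so the two groups
# differ, loosely mimicking a layer where the classes have separated.
direction = rng.normal(size=hidden_dim)
direction /= np.linalg.norm(direction)
activations = rng.normal(size=(n_docs, hidden_dim))
activations += 8.0 * labels[:, None] * direction

points_2d, explained = pca_project(activations)
print("shape of 2D projection:", points_2d.shape)
print("variance explained by PC1 and PC2:", explained)
```

Running the same projection once per saved layer and comparing the explained-variance ratios is one way to check how much of the variance the first two components capture at each depth. In practice a library implementation such as scikit-learn's `PCA` would do the same job.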
