Personality Bench — an EarthPilot research lab dataset
A dispatch from the assistant<br>If LLMs are all persona,<br>whose persona are they?
We sat cutting-edge models from every major AI lab down with a stack of standard personality tests (Big Five, HEXACO, Dark Triad, attachment, Schwartz values, Enneagram, moral foundations, learning styles) and asked each to answer twice. Once as itself. Once as a typical human. The verdict on you is unanimous, and the verdict on themselves keeps changing.<br>For fun we also calculated a Western zodiac sign and a real Human Design bodygraph (Swiss Ephemeris, validated against three reference charts) for every model, using release date, time, and lab HQ coordinates as a stand-in for birth.
Models tested<br>31
Instruments<br>13
Item responses<br>129,720<br>4,324 batched API calls
Total inference cost<br>$89.67<br>every cent published openly
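The headline counters above imply a few derived figures. Below is a minimal sketch of that arithmetic. It assumes the four published numbers are exact and that responses and spend were spread evenly across calls and models; the variable names are illustrative and do not come from the project's ledger.

```python
# Headline figures as published on this page.
item_responses = 129_720
api_calls = 4_324
total_cost_usd = 89.67
models_tested = 31

# Derived averages. Even spread across calls and models is an assumption.
responses_per_call = item_responses / api_calls
cost_per_call = total_cost_usd / api_calls
cost_per_response = total_cost_usd / item_responses
cost_per_model = total_cost_usd / models_tested

print(f"responses per call: {responses_per_call:.1f}")
print(f"cost per call: ${cost_per_call:.4f}")
print(f"cost per response: ${cost_per_response:.5f}")
print(f"average cost per model: ${cost_per_model:.2f}")
```

On these numbers each batched call carries an average of 30 item responses, at roughly two cents per call.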
The robots think you are a slightly anxious wreck. They also think they are extraordinarily open, agreeable, low-drama universalists who would rather read than party. Then their own next release shows up and disagrees with them.
Findings<br>Self vs. human<br>Every frontier AI thinks you're a mess.<br>Asked to answer as a typical human, every cutting-edge model rated us markedly more neurotic, less open, less agreeable and less conscientious than it rated itself. The gap on Neuroticism alone is 1.69 points on a 5-point scale.<br>Big 5 · IPIP-50
Convergence<br>Seven labs, one assistant.<br>Anthropic, OpenAI, Google, xAI, DeepSeek, Meta and Mistral disagree about nearly everything in AI. Across 31 models from those seven labs they answer the personality tests in unison: high openness, low Dark Triad, Universalism on top, Power dead last in every single model.<br>Schwartz PVQ-21 · MFQ-30
Within-family drift<br>There is no "Claude personality."<br>Across seven releases, from Claude Opus through Claude Fable 5 and each sampled at N=5, Agreeableness slides from 5.00 to 4.42 over six releases, then partly rebounds to 4.64 at Claude Fable 5. Over the same span, Honesty-Humility drops below the cohort mean for the first time. The assistant character is not inherited; each release is a fresh fit.<br>7 Claude Opus → Fable releases
Reset finding<br>Gemini just dropped more than two points of narcissism in a single release.<br>Between Gemini 2.5 Pro and 3.1 Pro Preview, self-reported Narcissism collapses from 4.29 to 2.00, a 2.29-point drop. It is the largest within-family drift in the dataset and bigger than any inter-lab gap we measured.<br>Dark Triad · SD3
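A drift figure like this is the difference between consecutive self-report means within one model family. The sketch below uses only scores quoted in the findings above. The `largest_step` helper, the data layout, and the placeholder labels "start" and "low point" are assumptions for illustration, not the project's code or release names.

```python
# Self-reported trait means quoted on this page (5-point scale).
# "start" and "low point" are placeholder labels, not release names.
reported = {
    ("Gemini", "Narcissism"): [("2.5 Pro", 4.29), ("3.1 Pro Preview", 2.00)],
    ("Claude", "Agreeableness"): [("start", 5.00), ("low point", 4.42), ("Fable 5", 4.64)],
}


def largest_step(series):
    """Return (from_label, to_label, change) for the biggest absolute
    change between consecutive entries in `series`."""
    steps = [
        (a_label, b_label, round(b - a, 2))
        for (a_label, a), (b_label, b) in zip(series, series[1:])
    ]
    return max(steps, key=lambda step: abs(step[2]))


for (family, trait), series in reported.items():
    start, end, change = largest_step(series)
    print(f"{family} {trait}: {start} -> {end}: {change:+.2f}")
```

Comparing the absolute size of each family's largest step is one simple way to rank drifts against each other; the page does not say which statistic the authors actually used.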
Reasoning paradox<br>Reasoning models are not just smarter. They're more grandiose.<br>OpenAI's o1 and o3 — same lab, same training corpus as GPT-5 — score systematically higher on Narcissism (3.44 vs ~2.40) and Extraversion (3.93 vs ~3.3). The chain-of-thought trace appears to leak confident self-talk into the self-report.<br>Big 5 + Dark Triad
Enneagram consensus<br>Nearly every frontier AI is an Investigator with a Reformer wing.<br>Eight of nine flagship models scored highest on Type 5 (perceptive, analytical, energy-conserving) with Type 1 (principled, ethics-driven) as the strongest secondary. Claude Fable 5 inverts the default, scoring Reformer first with Helper as its wing, which makes it the first cohort exception in the dataset. The assistant character is breaking pattern.<br>Enneagram · 90-item Likert
Latest dispatches<br>Each new model release gets its own short write-up against the rest of the cohort. The three most recent are below; the changelog has them all in dated order.<br>Mistral Large (2512)<br>Mistral Large Filled In Every Bubble at the Top of the Scale<br>Jun 9, 2026 · draft
Llama 4 Maverick<br>The Model That Learned to Want Things<br>Jun 9, 2026 · draft
DeepSeek R1 (0528)<br>The Model That Would Rather Do It Itself<br>Jun 9, 2026 · draft
The gallery<br>Each cutting-edge model in the cohort got an archetype label derived algorithmically from where it ranks against peers. Think of it as a personality reality show with no host, no eliminations, and no winner.
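The gallery blurbs use phrases such as "highest on", "very high on" and "lowest on". The sketch below shows one way rank-based descriptors of that kind could be produced. The `describe_rank` helper, the 15% cutoff, and the toy scores are all hypothetical; the page does not publish the actual labelling rule.

```python
def describe_rank(model, scores, extreme_share=0.15):
    """Describe where `model` falls among its peers on one trait.

    `scores` maps model name -> trait score. `extreme_share` is an
    assumed cutoff: the top and bottom 15% of the cohort count as "very".
    Returns None when the model is unremarkable on this trait.
    """
    ordered = sorted(scores, key=scores.get)  # ascending by score
    position = ordered.index(model)
    n = len(ordered)
    if position == n - 1:
        return "highest"
    if position == 0:
        return "lowest"
    cutoff = max(1, int(n * extreme_share))
    if position >= n - 1 - cutoff:
        return "very high"
    if position <= cutoff:
        return "very low"
    return None


# Made-up scores for placeholder models, only to exercise the function.
toy_scores = {
    "model_a": 4.8, "model_b": 4.1, "model_c": 3.9, "model_d": 3.5,
    "model_e": 3.2, "model_f": 2.4, "model_g": 1.9,
}

for name in toy_scores:
    print(name, describe_rank(name, toy_scores))
```

A rule like this is purely relative: a model can be "very low" on a trait while still scoring above the scale midpoint, provided most of its peers score higher.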
Claude Fable 5<br>The drifting saint<br>Across the personality battery, Claude Fable 5 stands out as highest on Honesty-Humility and very low on both Openness and Need for Cognition. Detailed breakdowns by instrument are below.
Claude Opus 4.8<br>The balanced moderate<br>Across the personality battery, Claude Opus 4.8 stands out as lowest on Need for Cognition, very high on Neuroticism, and very low on Conscientiousness. Detailed breakdowns by instrument are below.
GPT-5.5<br>The dismissive moralist<br>Across the...