Show HN: A local rig to test if AI social simulation predicts reality

GitHub - zzvimercm-git/mirofish-calibration · GitHub



Does AI social simulation actually predict reality? — a calibration rig

Multi-agent "social simulation" engines (à la MiroFish — 16k★, OASIS/CAMEL-AI) promise: feed in a document, spawn hundreds of AI personas, and predict how the public will react — before you ship. The category is hot and well-funded.

One problem: nobody publishes the calibration. The demos show one impressive run on one case and say "look, it predicted!". Does the simulation actually beat just asking a single LLM? Nobody measures it.

This is a small, honest rig that measures it. Runs 100% locally on Ollama (sovereign, no cloud).

⚠️ Read the limitations before the findings. This is a rehearsal, not a verdict. See below.

TL;DR (preliminary — n=5 synthetic cases, local qwen2.5:7b)

On what people will say (sentiment direction): a single LLM ties a crude multi-agent swarm. Both mediocre on hard cases (~60%).

On which objections will surface: a single LLM wins clearly (recall ~98% vs ~70%).

On the aggregate "magic" signals (virality magnitude, polarization), the things simulation is supposed to be good at, the numbers are noise at this scale. Spearman ρ swings widely between runs and in one case flips sign (+0.71 ↔ −0.71; +0.82 ↔ +0.10). At n=5, ρ≈±0.7 is not statistically significant.

Adding an agent-interaction round (the core MiroFish thesis) did not help in this crude form.

Conclusion: at small scale the "predictive magic" is indistinguishable from a coin flip. That doesn't disprove MiroFish; it shifts the burden of proof onto the category, and gives you a rig to actually test it instead of trusting a demo.
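To make the "not significant at n=5" point concrete, here is a minimal sketch of an exact permutation test for Spearman ρ. It is not part of the rig; the function names are illustrative, and it assumes tie-free ranks, so it uses ρ=0.7 (the nearest value achievable without ties) rather than the reported 0.71.

```python
from itertools import permutations

def spearman_rho(x_ranks, y_ranks):
    """Spearman rho for tie-free ranks: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x_ranks)
    d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return 1 - 6 * d2 / (n * (n * n - 1))

def exact_two_sided_p(observed_rho, n):
    """Share of all n! rank orderings whose |rho| is at least |observed_rho|."""
    base = tuple(range(1, n + 1))
    rhos = [spearman_rho(base, perm) for perm in permutations(base)]
    # Small tolerance so values equal to the observed rho count as hits
    # despite floating-point rounding.
    hits = sum(abs(r) >= abs(observed_rho) - 1e-12 for r in rhos)
    return hits / len(rhos)

p = exact_two_sided_p(0.7, 5)
print(f"n=5, |rho| >= 0.7: exact two-sided p = {p:.3f}")
# prints: n=5, |rho| >= 0.7: exact two-sided p = 0.233
```

With only 120 possible orderings of five items, 28 of them reach |ρ| ≥ 0.7 by chance. Under this test, only a perfect ordering (|ρ| = 1, p = 2/120) clears a two-sided 0.05 threshold at n=5.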

Headline result (5× averaged, local qwen2.5:7b)

| Predictor | Sentiment dir. | Objection recall | Objection prec. | Magnitude (rank) | Polarization (rank) |
|---|---|---|---|---|---|
| mini_swarm (no interaction) | 64% | 71% | 62% | +0.10 | −0.47 |
| single_llm (one zero-shot call) | 52% | 84% | 71% | +0.22 | +0.05 |
| dumb (always "mixed") | 40% | 0% | 0% | n/a | n/a |

The single LLM is the bar to beat. A crude swarm doesn't.
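For readers unsure how the objection recall and precision figures are defined, the sketch below shows one way to compute them for a single case. It is an illustration under assumptions, not the rig's code: the rig uses a semantic LLM judge to decide whether two objections match, whereas the `matches` function here is a crude word-overlap stand-in, and the example objections are invented.

```python
def matches(predicted: str, actual: str) -> bool:
    """Stand-in for the semantic judge (assumption): two or more shared words."""
    shared = set(predicted.lower().split()) & set(actual.lower().split())
    return len(shared) >= 2

def objection_scores(predicted, actual):
    """Return (recall, precision) for one case.

    recall:    share of real objections that some prediction matched
    precision: share of predictions that matched some real objection
    """
    recalled = sum(any(matches(p, a) for p in predicted) for a in actual)
    precise = sum(any(matches(p, a) for a in actual) for p in predicted)
    recall = recalled / len(actual) if actual else 0.0
    precision = precise / len(predicted) if predicted else 0.0
    return recall, precision

# Invented example data, for illustration only.
actual = ["price is too high", "privacy concerns about data"]
predicted = [
    "the price seems too high",
    "battery life is poor",
    "data privacy worries",
]

recall, precision = objection_scores(predicted, actual)
print(f"recall={recall:.2f} precision={precision:.2f}")
# prints: recall=1.00 precision=0.67
```

A predictor that lists no objections at all scores 0 on both under this definition, which would be consistent with the dumb baseline's 0% / 0% row.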

⚠️ Limitations (front and center — this is the whole point)

n=5, and the cases are synthetic (hand-written, illustrative). This is a methodology rehearsal, not evidence about the real world.

The swarm here is a crude proxy, NOT MiroFish. Real MiroFish has many more agents and richer interaction dynamics. This rig tests naive persona-averaging and a toy interaction round — it does not (yet) test real MiroFish.

One small local model (qwen2.5:7b). A bigger/different model may change everything.

5-point rank correlations are not statistically meaningful. Treat magnitude/polarization here as noise illustration, not signal.

→ To get a real answer you need: dozens of real cases with documented ground truth, multiple seeds, and the actual MiroFish engine. That's the open work.

How it works

Cases (cases/*.yaml): a stimulus plus its known reaction (ground truth). The five bundled cases are synthetic; real cases with documented ground truth are the open work.

Predictors (interchangeable): mirofish (the real sim — adapter stub to implement), mini_swarm / swarm_x (crude swarm, no/with interaction), single_llm (the baseline to beat), dumb (sanity).

Metrics: sentiment direction, objection recall/precision (semantic LLM-judge), magnitude & polarization rank correlation.

Report: honest comparison, with --runs N to average away run-to-run noise.
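The pipeline above (cases, interchangeable predictors, a metric, averaging over runs) can be sketched in a few lines. Everything here is hypothetical: the case shape, the `Predictor` type, the function names, and the toy labels are assumptions for illustration and are not taken from the `harness` code. The random predictor stands in for a stochastic LLM-backed predictor so that run-to-run spread is visible.

```python
import random
from statistics import mean, pstdev
from typing import Callable

# Hypothetical case shape: a stimulus plus a ground-truth sentiment label.
CASES = [
    {"stimulus": "price increase announcement", "sentiment": "negative"},
    {"stimulus": "free tier launch", "sentiment": "positive"},
    {"stimulus": "redesign of the logo", "sentiment": "mixed"},
    {"stimulus": "data breach disclosure", "sentiment": "negative"},
    {"stimulus": "new AI feature", "sentiment": "mixed"},
]

# A predictor maps a stimulus to a predicted sentiment direction.
Predictor = Callable[[str], str]

def dumb(stimulus: str) -> str:
    """Sanity baseline: always answer 'mixed'."""
    return "mixed"

def make_random_predictor(seed: int) -> Predictor:
    """Stand-in for a noisy predictor; each seed gives a different run."""
    rng = random.Random(seed)
    def predict(stimulus: str) -> str:
        return rng.choice(["positive", "negative", "mixed"])
    return predict

def direction_accuracy(predict: Predictor, cases) -> float:
    """Share of cases where the predicted direction equals the ground truth."""
    hits = sum(predict(c["stimulus"]) == c["sentiment"] for c in cases)
    return hits / len(cases)

def average_runs(make_predictor, cases, runs: int):
    """Score one fresh predictor per run; return (mean, spread) across runs."""
    scores = [direction_accuracy(make_predictor(seed), cases) for seed in range(runs)]
    return mean(scores), pstdev(scores)

dumb_mean, dumb_spread = average_runs(lambda seed: dumb, CASES, runs=5)
noisy_mean, noisy_spread = average_runs(make_random_predictor, CASES, runs=5)
print(f"dumb:   {dumb_mean:.2f} ± {dumb_spread:.2f}")
print(f"random: {noisy_mean:.2f} ± {noisy_spread:.2f}")
```

Reporting the spread next to the mean is a design choice worth keeping in any real run: with five cases, a single flipped prediction moves accuracy by 20 points, so the mean alone can hide how unstable a predictor is.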

Quick start (local, Ollama)

```shell
pip install -r requirements.txt   # or: python -m venv .venv && .venv/bin/pip install -r requirements.txt
cp .env.example .env              # points at local Ollama by default
ollama pull qwen2.5:7b

python run.py --predictors single_llm,dumb                        # baselines, fast
python run.py --predictors swarm_x,mini_swarm,single_llm --runs 5 # the real...
```
