Show HN: A local rig to test if AI social simulation predicts reality

zzvimercm1 pts0 comments

GitHub - zzvimercm-git/mirofish-calibration · GitHub

Does AI social simulation actually predict reality? — a calibration rig

Multi-agent "social simulation" engines (à la MiroFish — 16k★, OASIS/CAMEL-AI) promise: feed in a document, spawn hundreds of AI personas, and predict how the public will react — before you ship. The category is hot and well-funded.

One problem: nobody publishes the calibration. The demos show one impressive run on one case and say "look, it predicted!". Does the simulation actually beat just asking a single LLM? Nobody measures it.

This is a small, honest rig that measures it. It runs 100% locally on Ollama (sovereign, no cloud).

⚠️ Read the limitations before the findings. This is a rehearsal, not a verdict. See below.

TL;DR (preliminary — n=5 synthetic cases, local qwen2.5:7b)

On what people will say (sentiment direction): a single LLM ties a crude multi-agent swarm. Both mediocre on hard cases (~60%).

On which objections will surface: a single LLM wins clearly (recall ~98% vs ~70%).

On the aggregate "magic" signals (virality magnitude, polarization), the things simulation is supposed to be good at, the numbers are noise at this scale. Spearman ρ swings widely between runs, sometimes flipping sign (+0.71 ↔ −0.71; +0.82 ↔ +0.10). At n=5, ρ≈±0.7 is not statistically significant.

Adding an agent-interaction round (the core MiroFish thesis) did not help in this crude form.

Conclusion: at small scale the "predictive magic" is indistinguishable from a coin flip. That doesn't disprove MiroFish. It shifts the burden of proof onto the category, and gives you a rig to actually test it instead of trusting a demo.

Headline result (5× averaged, local qwen2.5:7b)

| Predictor | Sentiment direction | Objection recall | Objection precision | Magnitude (rank ρ) | Polarization (rank ρ) |
|---|---|---|---|---|---|
| mini_swarm (no interaction) | 64% | 71% | 62% | +0.10 | −0.47 |
| single_llm (one zero-shot call) | 52% | 84% | 71% | +0.22 | +0.05 |
| dumb (always "mixed") | 40% | 0% | 0% | n/a | n/a |

The single LLM is the bar to beat. A crude swarm doesn't.

⚠️ Limitations (front and center — this is the whole point)

n=5, and the cases are synthetic (hand-written, illustrative). This is a methodology rehearsal, not evidence about the real world.

The swarm here is a crude proxy, NOT MiroFish. Real MiroFish has many more agents and richer interaction dynamics. This rig tests naive persona-averaging and a toy interaction round — it does not (yet) test real MiroFish.

One small local model (qwen2.5:7b). A bigger/different model may change everything.

5-point rank correlations are not statistically meaningful. Treat magnitude/polarization here as noise illustration, not signal.

→ To get a real answer you need: dozens of real cases with documented ground truth, multiple seeds, and the actual MiroFish engine. That's the open work.
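The claim that 5-point rank correlations carry no statistical weight can be checked by brute force, because five cases allow only 5! = 120 possible rankings. The sketch below is not part of the repo. It assumes no ties, and the helper names `spearman_rho` and `exact_two_sided_p` are made up for illustration.

```python
from itertools import permutations


def spearman_rho(x, y):
    """Spearman rank correlation for two equal-length sequences without ties."""
    n = len(x)
    rank_x = {v: r for r, v in enumerate(sorted(x))}
    rank_y = {v: r for r, v in enumerate(sorted(y))}
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))


def exact_two_sided_p(rho_obs, n):
    """Share of all n! rankings whose |rho| is at least |rho_obs|."""
    base = list(range(n))
    hits = total = 0
    for perm in permutations(base):
        total += 1
        # Small tolerance so float rounding does not drop exact matches.
        if abs(spearman_rho(base, perm)) >= abs(rho_obs) - 1e-12:
            hits += 1
    return hits / total


for rho in (0.7, 0.9, 1.0):
    print(f"n=5, |rho| >= {rho}: exact two-sided p = {exact_two_sided_p(rho, 5):.4f}")
```

Under these assumptions, |ρ| = 0.7 at n=5 lands well above the usual 0.05 threshold, and only a perfect ranking clears it. That is why the magnitude and polarization columns are best read as an illustration of noise.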

How it works

Cases (cases/*.yaml): a stimulus + its known reaction (ground truth). The five bundled cases are synthetic; real cases with documented ground truth are the open work.

Predictors (interchangeable): mirofish (the real sim — adapter stub to implement), mini_swarm / swarm_x (crude swarm, no/with interaction), single_llm (the baseline to beat), dumb (sanity).

Metrics: sentiment direction, objection recall/precision (semantic LLM-judge), magnitude & polarization rank correlation.

Report: honest comparison, with --runs N to average away run-to-run noise.
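The case, predictor, and metric pieces above could fit together roughly as follows. This is a minimal sketch and not the code in harness/. The `Case` and `Prediction` fields, the `Predictor` protocol, and `score` are hypothetical names. `naive_match` is a crude substring stand-in for the semantic LLM judge the rig actually uses.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Case:
    stimulus: str
    sentiment: str  # ground-truth direction, e.g. "negative" (hypothetical field)
    objections: list[str] = field(default_factory=list)


@dataclass
class Prediction:
    sentiment: str
    objections: list[str] = field(default_factory=list)


class Predictor(Protocol):
    def predict(self, stimulus: str) -> Prediction: ...


class Dumb:
    """Sanity baseline: always answers "mixed" and lists no objections."""

    def predict(self, stimulus: str) -> Prediction:
        return Prediction(sentiment="mixed")


def naive_match(a: str, b: str) -> bool:
    """Placeholder judge: case-insensitive substring match, not an LLM."""
    a, b = a.lower(), b.lower()
    return a in b or b in a


def score(predictor, cases, judge=naive_match):
    """Average sentiment accuracy and objection recall/precision over cases."""
    hits, recalls, precisions = 0, [], []
    for case in cases:
        pred = predictor.predict(case.stimulus)
        hits += pred.sentiment == case.sentiment
        # Recall: share of true objections that some prediction covers.
        found = sum(any(judge(t, p) for p in pred.objections) for t in case.objections)
        # Precision: share of predicted objections that match a true one.
        good = sum(any(judge(p, t) for t in case.objections) for p in pred.objections)
        recalls.append(found / len(case.objections) if case.objections else 0.0)
        precisions.append(good / len(pred.objections) if pred.objections else 0.0)
    n = len(cases)
    return {
        "sentiment_dir": hits / n,
        "objection_recall": sum(recalls) / n,
        "objection_precision": sum(precisions) / n,
    }
```

Because every predictor exposes the same `predict` call, a swarm, a single zero-shot call, and the `dumb` baseline can all be scored by the same function. That shared scoring is what makes the comparison fair.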

Quick start (local, Ollama)

pip install -r requirements.txt   # or: python -m venv .venv && .venv/bin/pip install -r requirements.txt
cp .env.example .env              # points at local Ollama by default
ollama pull qwen2.5:7b

python run.py --predictors single_llm,dumb   # baselines, fast
python run.py --predictors swarm_x,mini_swarm,single_llm --runs 5   # the real...
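To see why averaging over several runs matters at n=5, the toy sketch below repeats a noisy evaluation and reports the mean and the run-to-run spread. It does not reproduce how run.py implements --runs. `average_runs` and `noisy_accuracy` are hypothetical helpers, and the 0.6 hit rate is an arbitrary illustrative value.

```python
import random
from statistics import mean, stdev


def average_runs(run_once, runs, seed=0):
    """Call run_once(rng) `runs` times; return (mean score, sample std dev)."""
    rng = random.Random(seed)
    scores = [run_once(rng) for _ in range(runs)]
    spread = stdev(scores) if runs > 1 else 0.0
    return mean(scores), spread


def noisy_accuracy(rng, true_rate=0.6, n_cases=5):
    """Toy stand-in for one evaluation pass: each case is a weighted coin flip."""
    return sum(rng.random() < true_rate for _ in range(n_cases)) / n_cases


for runs in (1, 5, 50):
    avg, spread = average_runs(noisy_accuracy, runs)
    print(f"runs={runs:>2}  mean={avg:.2f}  std={spread:.2f}")
```

With only five cases, a single pass can land only on multiples of 20%, so one run says little by itself. Reporting the spread alongside the mean makes that visible.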
