GitHub - zzvimercm-git/mirofish-calibration · GitHub
cases
harness
.env.example
.gitignore
LICENSE
README.md
requirements.txt
run.py
Does AI social simulation actually predict reality? — a calibration rig
Multi-agent "social simulation" engines (à la MiroFish — 16k★, OASIS/CAMEL-AI) promise: feed in a document, spawn hundreds of AI personas, and predict how the public will react — before you ship. The category is hot and well-funded.
One problem: nobody publishes the calibration. The demos show one impressive run on one case and say "look, it predicted!". Does the simulation actually beat just asking a single LLM? Nobody measures it.
This is a small, honest rig that measures it. Runs 100% locally on Ollama (sovereign, no cloud).
⚠️ Read the limitations before the findings. This is a rehearsal, not a verdict. See below.
TL;DR (preliminary — n=5 synthetic cases, local qwen2.5:7b)
On what people will say (sentiment direction): a single LLM ties a crude multi-agent swarm. Both are mediocre on hard cases (~60%).
On which objections will surface: a single LLM wins clearly (recall ~98% vs ~70%).
On the aggregate "magic" signals (virality magnitude, polarization), the things simulation is supposed to be good at, the numbers are noise at this scale. Spearman ρ swings wildly between runs: in one case it flips sign (+0.71 ↔ −0.71), in another it collapses from +0.82 to +0.10. At n=5, even ρ ≈ ±0.7 is not statistically significant.
Adding an agent-interaction round (the core MiroFish thesis) did not help in this crude form.
Conclusion: at small scale the "predictive magic" is indistinguishable from a coin flip. That doesn't disprove MiroFish; it shifts the burden of proof onto the category, and gives you a rig to actually test it instead of trusting a demo.
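The claim that ρ ≈ ±0.7 is not significant at n=5 can be checked by brute force, since five cases admit only 5! = 120 possible rankings. The sketch below is not part of the repo: `spearman_rho` and `exact_two_sided_p` are illustrative names, and it assumes tie-free scores.

```python
from itertools import permutations

def spearman_rho(x, y):
    """Spearman rho for tie-free data, using the rank-difference formula."""
    n = len(x)
    rank = lambda v: [sorted(v).index(a) + 1 for a in v]
    rx, ry = rank(x), rank(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def exact_two_sided_p(rho_obs, n):
    """Share of all n! rankings whose |rho| is at least |rho_obs|."""
    base = list(range(1, n + 1))
    hits = total = 0
    for perm in permutations(base):
        total += 1
        # small tolerance so floating-point rounding does not drop exact ties
        if abs(spearman_rho(base, list(perm))) >= abs(rho_obs) - 1e-12:
            hits += 1
    return hits / total

# Under the null hypothesis of no association, 28 of the 120 rankings
# reach |rho| >= 0.7, so the exact two-sided p-value is about 0.23.
print(exact_two_sided_p(0.7, 5))
```

With only 120 rankings available, a correlation of ±0.7 turns up by chance roughly one time in four, which is why the sign flips between runs should not be read as signal.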
Headline result (5× averaged, local qwen2.5:7b)
| Predictor | Sentiment dir. | Objection recall | Objection prec. | Magnitude (rank) | Polarization (rank) |
|---|---|---|---|---|---|
| mini_swarm (no interaction) | 64% | 71% | 62% | +0.10 | −0.47 |
| single_llm (one zero-shot call) | 52% | 84% | 71% | +0.22 | +0.05 |
| dumb (always "mixed") | 40% | 0% | 0% | n/a | n/a |
The single LLM is the bar to beat. A crude swarm doesn't.
⚠️ Limitations (front and center — this is the whole point)
n=5, and the cases are synthetic (hand-written, illustrative). This is a methodology rehearsal, not evidence about the real world.
The swarm here is a crude proxy, NOT MiroFish. Real MiroFish has many more agents and richer interaction dynamics. This rig tests naive persona-averaging and a toy interaction round — it does not (yet) test real MiroFish.
One small local model (qwen2.5:7b). A bigger/different model may change everything.
5-point rank correlations are not statistically meaningful. Treat magnitude/polarization here as noise illustration, not signal.
→ To get a real answer you need: dozens of real cases with documented ground truth, multiple seeds, and the actual MiroFish engine. That's the open work.
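To make the n=5 limitation concrete, here is a rough sketch of how wide the uncertainty on an accuracy figure is when it comes from a single pass over five cases. It uses the standard Wilson score interval; `wilson_interval` is an illustrative helper, not something the repo provides, and treating the five cases as independent trials is an assumption.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 3 of 5 correct (60%) versus 2 of 5 correct (40%)
print(wilson_interval(3, 5))  # roughly (0.23, 0.88)
print(wilson_interval(2, 5))  # roughly (0.12, 0.77)
```

The two intervals overlap almost entirely, so a one-case difference between predictors cannot be told apart from chance on a single run of five cases.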
How it works
Cases (cases/*.yaml): a stimulus plus its known reaction (ground truth). The intended input is real, documented cases; the five bundled cases are synthetic.
Predictors (interchangeable): mirofish (the real sim — adapter stub to implement), mini_swarm / swarm_x (crude swarm, no/with interaction), single_llm (the baseline to beat), dumb (sanity).
Metrics: sentiment direction, objection recall/precision (semantic LLM-judge), magnitude & polarization rank correlation.
Report: honest comparison, with --runs N to average away run-to-run noise.
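The four pieces above could fit together roughly as follows. This is a minimal sketch, not the repo's harness: `Case`, `Prediction`, `score` and `evaluate` are hypothetical names, the case fields are guesses at what a cases/*.yaml entry might hold, and exact string matching stands in for the semantic LLM-judge the rig actually uses for objections.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable

@dataclass
class Case:
    """Hypothetical in-memory form of one case: stimulus plus ground truth."""
    stimulus: str
    sentiment: str                      # "positive", "negative" or "mixed"
    objections: set = field(default_factory=set)

@dataclass
class Prediction:
    sentiment: str
    objections: set = field(default_factory=set)

# A predictor sees only the stimulus, never the ground truth.
Predictor = Callable[[str], Prediction]

def dumb(stimulus: str) -> Prediction:
    """Sanity baseline: always answers "mixed" and lists no objections."""
    return Prediction(sentiment="mixed")

def score(pred: Prediction, truth: Case) -> dict:
    """Per-case metrics; exact set overlap replaces the LLM judge here."""
    matched = pred.objections & truth.objections
    recall = len(matched) / len(truth.objections) if truth.objections else 0.0
    precision = len(matched) / len(pred.objections) if pred.objections else 0.0
    return {
        "sentiment": float(pred.sentiment == truth.sentiment),
        "recall": recall,
        "precision": precision,
    }

def evaluate(predictor: Predictor, cases: list, runs: int = 1) -> dict:
    """Average each metric over all cases and over `runs` repetitions."""
    rows = [score(predictor(c.stimulus), c) for _ in range(runs) for c in cases]
    return {k: mean(r[k] for r in rows) for k in rows[0]}

cases = [
    Case("price increase notice", "mixed", {"price"}),
    Case("new data-sharing policy", "negative", {"privacy", "price"}),
]
print(evaluate(dumb, cases, runs=3))
```

Because every predictor is just a function from stimulus to prediction, swapping in a swarm or a real simulation adapter would not change the scoring code. Repeating runs only matters for stochastic predictors; the deterministic `dumb` baseline returns the same scores every time.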
Quick start (local, Ollama)
pip install -r requirements.txt   # or: python -m venv .venv && .venv/bin/pip install -r requirements.txt
cp .env.example .env              # points at local Ollama by default
ollama pull qwen2.5:7b
python run.py --predictors single_llm,dumb                          # baselines, fast
python run.py --predictors swarm_x,mini_swarm,single_llm --runs 5   # the real...
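Before a long multi-run comparison it can help to confirm that the local Ollama server is up and the model has been pulled. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models. It assumes the server listens on Ollama's default address `http://localhost:11434`; the address in your `.env` may differ, and the helper names are illustrative rather than part of the repo.

```python
import json
from urllib.request import urlopen

def ollama_tags(base_url="http://localhost:11434"):
    """Fetch the list of locally available models from a running Ollama server."""
    with urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return json.load(resp)

def has_model(tags_payload, name):
    """True if the /api/tags payload lists a model with exactly this name."""
    return any(m.get("name") == name for m in tags_payload.get("models", []))

# Usage (needs a running Ollama server):
#   if not has_model(ollama_tags(), "qwen2.5:7b"):
#       raise SystemExit("run `ollama pull qwen2.5:7b` first")
```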