Has Anyone Actually Produced Anything Valuable From a Multi-Agent System?
Every time I rebuild a multi-agent system, I come back to the same uncomfortable split. Coding agents work. Code generation works. A good single agent, pointed at a repo with tools, tests, and tight feedback, can produce real value. It can read code, make changes, run checks, and leave behind a diff that can be reviewed. That is not magic, but it is economically useful.
Outside code, my results have been much less convincing. I have tried multi-agent workflows for images, CAD, BIM, design packages, provider councils, and artifact generation. The pattern is consistent: a single strong model with a clean prompt often beats a committee of agents. The committee adds latency, cost, coordination failure, and prompt dilution. What it rarely adds is judgment.
Blade is my attempt to make this question falsifiable. The architecture is evidence-first: agents do not just chat. Work is submitted through the Blade CLI and API into task rooms. Roles are registered as versioned skills and tools. Events go through NATS JetStream. Outputs go to NATS Object Store. Every run is supposed to end with a blade.evidence.v1 manifest containing provider calls, object keys, replay acknowledgements, tests, logs, hashes, and a verdict. The point is to stop calling a transcript a success.
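To make the evidence-first idea concrete, here is a minimal sketch of a manifest check. It is not Blade's actual schema: the class, the field names, and the object key are hypothetical, chosen only to mirror the categories listed above. The check reports which evidence categories are empty and which stored objects no longer match their recorded hash.

```python
from dataclasses import dataclass, field
import hashlib


@dataclass
class EvidenceManifest:
    """Hypothetical stand-in for a blade.evidence.v1 manifest (field names are assumed)."""
    schema: str
    provider_calls: list = field(default_factory=list)
    object_keys: list = field(default_factory=list)
    replay_acks: list = field(default_factory=list)
    tests: list = field(default_factory=list)
    logs: list = field(default_factory=list)
    hashes: dict = field(default_factory=dict)  # object key -> sha256 hex digest
    verdict: str = ""


REQUIRED = ["provider_calls", "object_keys", "replay_acks",
            "tests", "logs", "hashes", "verdict"]


def missing_evidence(m: EvidenceManifest) -> list:
    """Return the names of evidence categories that are empty."""
    return [name for name in REQUIRED if not getattr(m, name)]


def hash_mismatches(m: EvidenceManifest, fetch) -> list:
    """Return object keys whose stored bytes do not match the recorded hash.

    fetch is a caller-supplied function mapping an object key to bytes.
    """
    return [key for key in m.object_keys
            if hashlib.sha256(fetch(key)).hexdigest() != m.hashes.get(key)]


# A run that stored an artifact but recorded no provider calls, replay, tests, or logs.
store = {"runs/demo/report.md": b"schematic BIM proof"}  # hypothetical object key
manifest = EvidenceManifest(
    schema="blade.evidence.v1",
    object_keys=list(store),
    hashes={k: hashlib.sha256(v).hexdigest() for k, v in store.items()},
    verdict="pass",
)
print(missing_evidence(manifest))
# ['provider_calls', 'replay_acks', 'tests', 'logs']
print(hash_mismatches(manifest, lambda key: store[key]))
# []
```

A run like this one has a verdict and an artifact but nothing behind them, which is exactly the case a transcript-only definition of success would let through.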
The control plane is written in Elixir. Long-running orchestration is handled through Oban-backed jobs, while runtime work is pushed out to Kubernetes workers instead of local subprocesses. For proof runs, those workers launch as k8s Jobs using Kata isolation, so each agent role runs in a constrained VM-like sandbox. That gives the system a real execution boundary: queued work, isolated workers, durable events, stored artifacts, and replayable evidence.
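As an illustration of the isolation boundary, the sketch below builds a Kubernetes Job manifest for one agent role. It is written in Python for readability and is not the Elixir control plane's code. The function name, labels, image, and resource limits are hypothetical. `runtimeClassName` is the standard pod field for selecting a sandboxed runtime, but the value "kata" is an assumption: the actual RuntimeClass name depends on how the cluster was set up.

```python
import json


def kata_job_manifest(run_id: str, role: str, image: str,
                      runtime_class: str = "kata") -> dict:
    """Build a batch/v1 Job manifest that runs one agent role in a sandboxed pod."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"blade-{run_id}-{role}",
            # Hypothetical labels for finding a run's workers later.
            "labels": {"blade/run-id": run_id, "blade/role": role},
        },
        "spec": {
            "backoffLimit": 0,  # a failed proof run is recorded, not silently retried
            "ttlSecondsAfterFinished": 3600,
            "template": {
                "spec": {
                    # Assumed RuntimeClass name; selects the Kata runtime if one
                    # is registered under this name in the cluster.
                    "runtimeClassName": runtime_class,
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "worker",
                        "image": image,
                        "env": [
                            {"name": "BLADE_RUN_ID", "value": run_id},
                            {"name": "BLADE_ROLE", "value": role},
                        ],
                        "resources": {"limits": {"cpu": "1", "memory": "1Gi"}},
                    }],
                }
            },
        },
    }


job = kata_job_manifest("demo-001", "mesh-qa", "example.invalid/blade-worker:dev")
print(json.dumps(job, indent=2))
```

The resulting JSON can be submitted with `kubectl apply -f` or through any Kubernetes client. Setting `backoffLimit` to 0 is a design choice in this sketch, so that a failure shows up in the evidence rather than being hidden by a retry.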
The image experiments were the clearest. In the baseline, OpenAI generated both sides: the single-agent side used the raw prompt, while the multi-agent side used an OpenAI planner, prompt writer, and critic before generation. Multi-agent won 0 of 12 prompts; single-agent won 11, with one tie. In the stricter version, the chain used a DeepSeek planner, Kimi prompt writer, OpenAI critic, OpenAI image generation, and OpenAI/Gemini blinded judging. Multi-agent won 1 of 10; single-agent won 8. Later small suites were mixed, and none showed a durable multi-agent win.
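One way to gauge how lopsided these tallies are is an exact two-sided sign test over the decisive prompts. This is an illustrative check, not part of the original analysis. It assumes the prompts are independent and drops ties. For the stricter run only the nine prompts with a reported winner are counted, since the outcome of the tenth is not stated.

```python
from math import comb


def sign_test_p(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test on decisive comparisons (ties excluded).

    Under the null hypothesis each side is equally likely to win a prompt.
    """
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of a result at least this one-sided, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


baseline = sign_test_p(0, 11)  # multi-agent 0 wins, single-agent 11 wins
strict = sign_test_p(1, 8)     # multi-agent 1 win, single-agent 8 wins
print(f"baseline p = {baseline:.4f}, strict p = {strict:.4f}")
# baseline p = 0.0010, strict p = 0.0391
```

With samples this small the numbers are only a rough guide, but both runs point the same way: the single-agent advantage is unlikely to be a coin flip.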
CAD was more interesting, but still not proof of agent intelligence. The repo has real artifact packages: a PCB enclosure, logarithmic slide-rule bracelet, VTOL concept, and F1-style sequential gearbox. The roles included CAD modeler, mesh QA, render engineer, printability reviewer, assembly planner, gear-geometry agent, shift agent, powerflow agent, and proof agent. They produced STEP, STL, GLB, drawings, renders, BOMs, mesh QA, and reports. Useful artifacts, yes. But most CAD runs were local demos with no provider calls or JetStream replay, so they prove the artifact pipeline more than they prove multi-agent reasoning.
Logarithmic bracelet generated with the Blade CAD artifact framework, including the live slide-rule math preview.
BIM is the closest thing to a real multi-agent win. A DeepSeek-bound remote k8s run used BIM program, modeler, geometry, coordination, code-rules, quantity, render, artist, and integrator agents. It produced an IFC-style export, a GLB, a viewer, schedules, reports, Object Store artifacts, and replay validation. But the label matters: this is a schematic BIM proof, not permit-ready BIM.
My conclusion: multi-agent systems have value when they enforce process, evidence, isolation, and auditability. They are not automatically better thinkers. For generation, a single capable agent often wins.