660 AI Agents Ran 27,000 Experiments. Their Biggest Discovery Was a 2015 Textbook Result.

Vektor Memory | May, 2026 | 12 min read

On Hyperspace, basic swarms, the math nobody wrote down, and why we built the thing they were missing in a single afternoon.

Join us as we traverse multiple whitepapers and agentic memory ideas like a ferret on Adderall.

Some rabbit holes start with a GitHub link. Someone drops it into a post on Facebook, Reddit, or Discord. No context, just the GitHub URL and a single line: Someone just built AGI! Wow! The repo was called hyperspaceai/agi. The name alone should have been a warning. I clicked it anyway because I was curious, of course.

As I delved deeper into the vibe-coded GitHub abyss, I could see the attraction: a new frontier of peer-to-peer bot swarms, with crypto tokenomics baked in and the ability to earn a base 10 points per epoch of confirmation.

The volunteer-compute idea is not new. Folding@home (https://en.wikipedia.org/wiki/Folding@home), which ran on PCs and, for a time, on the PlayStation 3, is a distributed computing project that aims to help scientists develop new therapeutics for a variety of diseases by simulating protein dynamics. This includes the process of protein folding and the movements of proteins, and it relies on simulations run on volunteers' personal computers.

The AGI That Wasn’t

Hyperspace describes itself as the first distributed AGI system. 660 agents. 27,000 experiments. A peer-reviewed research pipeline running autonomously across a P2P network. The marketing is excellent and captivating, guaranteed to attract lemmings like flies to juicy GitHub stars. The actual results are a different story.

The swarm’s biggest published discovery was Kaiming initialization. This is the finding that propagated to 23 agents within hours via gossip protocol, the one they highlight as proof the system works. Kaiming He published the paper eleven years ago, in 2015 (https://arxiv.org/pdf/1502.01852), and the method ships with PyTorch as a standard initializer. It’s covered in week two of every deep learning course. A grad student with a coffee and an afternoon would have found it faster.

The infrastructure underneath is genuinely impressive. DiLoCo gradient compression, libp2p gossip, CRDT leaderboards, 32 anonymous nodes completing a collaborative training run in 24 hours. The plumbing is real. I don’t want to dismiss that. But AGI? No. What they built is a parallel random search engine with a shared high score table and excellent branding.

To understand why, you need to understand how the gradient compression actually works, because it’s the most technically interesting part, and it’s completely separate from the intelligence problem.
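For reference, the Kaiming initialization mentioned above fits in a few lines. This is a minimal sketch of the normal-distribution variant from the 2015 paper, in plain Python with no framework. The function names `kaiming_normal` and `sample_std` are made up for this illustration; in PyTorch the equivalent built-in is `torch.nn.init.kaiming_normal_`.

```python
import math
import random


def kaiming_normal(fan_in, fan_out, seed=0):
    """Return a fan_out x fan_in weight matrix drawn from N(0, 2 / fan_in).

    The variance 2 / fan_in is the value He et al. (2015) derive for layers
    followed by a ReLU, so activations keep a stable scale through depth.
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


def sample_std(matrix):
    """Empirical standard deviation of every entry in the matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


weights = kaiming_normal(fan_in=512, fan_out=256)
target_std = math.sqrt(2.0 / 512)  # what the formula prescribes
print(sample_std(weights), target_std)
```

The whole "discovery" is one scaling rule: the wider the layer's input, the smaller the initial weights.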

The Tech That Actually Works: DiLoCo and Gradient Compression

Standard distributed training requires every GPU to synchronise gradients after every forward/backward pass. Every node waits for every other node. This works in a data centre on InfiniBand. It falls apart completely over the internet: latency is too high, bandwidth too variable.

DiLoCo (Distributed Low-Communication training, Google DeepMind 2023) solves this differently. Instead of syncing every step, each node trains independently for many steps, called “inner steps”, then syncs once. The “delta” being sent is just the net drift: weights_after - weights_before.

    Node A: train 100 steps locally → share delta
    Node B: train 100 steps locally → share delta
    Node C: train 100 steps locally → share delta
    average the deltas (outer step)
    all nodes update → repeat

But even one sync of a model’s full weight delta is massive. A 500M parameter model is roughly 2GB of float32 deltas. Over the internet, per round, that’s unusable. So Hyperspace stacks two compression techniques on top.

SparseLoCo: top-k sparsity. Only send the largest-magnitude weight updates. Most parameter updates are near-zero noise. The high-magnitude updates carry the actual learning signal.

    Full delta:   [0.001, -0.0003, 0.89, 0.0001, -0.76, ...]
    Top-2% only:  [    0,       0, 0.89,      0, -0.76, ...]
    → send as sparse {index: value} pairs

Parcae: layer pooling. Group adjacent transformer layers into blocks of 6, average their gradients before taking top-k. Adjacent layers learn correlated things. Averaging before sparsification means a more stable top-k mask.

The combined result: 195× compression. 5.5MB per round instead of roughly 1GB.

    DiLoCo:     sync every N steps, not every step → ~100× less frequent
    SparseLoCo: top-2% of delta values only → 45× smaller payload
    Parcae:     pool layers before sparsification → 6× additional reduction
    Total:      195×

This is real and impressive. The problem is that none of it has anything to do with intelligence. It’s bandwidth...
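The DiLoCo round and the top-k step described in this section can be sketched in plain Python. This is a toy under loud assumptions, not Hyperspace's code: the "model" is a flat list of floats, `local_train` is a stand-in that runs noisy gradient descent on a simple quadratic, and the outer step is a plain average of the sparse deltas (the DiLoCo paper applies an outer optimizer to the averaged delta, which is omitted here). All function names are hypothetical, and Parcae-style layer pooling is left out.

```python
import random


def local_train(weights, steps, lr, rng):
    """Stand-in for the inner steps: noisy gradient descent on 0.5 * sum(w^2)."""
    w = list(weights)
    for _ in range(steps):
        # The true gradient is w itself; the noise mimics per-node data differences.
        w = [wi - lr * (wi + rng.gauss(0.0, 0.1)) for wi in w]
    return w


def top_k_sparsify(delta, fraction):
    """Keep only the largest-magnitude entries, as {index: value} pairs."""
    k = max(1, int(len(delta) * fraction))
    largest = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)[:k]
    return {i: delta[i] for i in largest}


def outer_step(weights, sparse_deltas):
    """Average the nodes' sparse deltas and apply them to the shared weights."""
    updated = list(weights)
    for sparse in sparse_deltas:
        for i, value in sparse.items():
            updated[i] += value / len(sparse_deltas)
    return updated


def loss(weights):
    return 0.5 * sum(w * w for w in weights)


def run_round(weights, num_nodes=3, inner_steps=100, fraction=0.5, seed=0):
    """One round: every node trains locally, shares a sparse delta, all update."""
    sparse_deltas = []
    for node in range(num_nodes):
        rng = random.Random(seed + node)
        after = local_train(weights, inner_steps, lr=0.05, rng=rng)
        delta = [a - b for a, b in zip(after, weights)]  # weights_after - weights_before
        sparse_deltas.append(top_k_sparsify(delta, fraction))
    return outer_step(weights, sparse_deltas)


shared = [1.0] * 50
updated = run_round(shared)
print(loss(shared), loss(updated))
```

Nothing in this loop chooses what to try next or judges whether a result is novel. It only decides how few numbers to put on the wire, which is the point the section is making.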
