Teaching LLMs to one-shot complex backends at scale, report #1

The LLM is not all that matters in AI coding. What the LLM is targeting matters a great deal. A simpler target that requires less reasoning will produce better results.

Attempts to get LLMs to produce complex backends have been lackluster. A recent paper, Constraint Decay: The Fragility of LLM Agents in Backend Code Generation, shows that even on a simple CRUD app, end-to-end success on the full test suite tops out at 33% once realistic structural constraints are imposed.

Conventional backends are made of many separate systems glued together, each with its own model and failure modes. Most of the failures observed in these benchmarks show up at the seams between these systems. The LLM is not asked to reason about one coherent system – it’s asked to coordinate across many.

Along those lines, we believe Rama is ideally positioned to take LLM coding to the next level for backends. Rama collapses the typical backend stack (databases, queues, stream processors, application logic) into one integrated system. The seams that current LLMs trip over largely don’t exist in a Rama application. A horizontally scalable, fault-tolerant backend is expressed as one coherent program rather than as glue across half a dozen systems.

In the past few months we’ve been working on a project to teach LLMs to one-shot complex backends at scale with Rama as the substrate. Our results so far are very promising, as I’ll review later in this post, but we have a ways to go. The major milestone we’re working towards is one-shotting the entire Matrix spec, which also has a thorough set of tests available that can be used to verify an implementation. What we’re looking to produce is:

A generated implementation of Matrix that passes all the reference tests

A transcript showing every step of how the LLM one-shotted the project

Benchmarks automatically written and executed by the LLM that demonstrate high performance and horizontal scalability

Matrix is orders of magnitude more difficult than the backends current LLMs can handle, particularly with these scalability and fault-tolerance requirements, so one-shotting it will be a huge milestone. However, the overarching goal is for this to work on any backend problem. We don’t expect what we’re building to one-shot every possible backend. Humans remain vastly better than LLMs at broad systems design where many tradeoffs must be considered. What we think is achievable, and what this project is targeting, is a workflow where humans assist with high-level design decisions and the agent handles lower-level decisions and implementation, including achieving horizontal scalability and fault-tolerance. By “fault tolerant” we mean the system continues operating correctly through infrastructure failures (e.g. node deaths) without data loss, data duplication, or downtime, and recovers automatically when failed components return.
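The definition of fault tolerance above can be made concrete with a small check. The following is a minimal sketch, not how the rama-ai-learn tests are actually written: it assumes a test harness that assigns a unique ID to every write it submits while nodes are being killed, and that can read back the IDs visible in the final state afterwards. The names `DeliveryReport` and `check_delivery` are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DeliveryReport:
    lost: set        # IDs submitted but never observed (data loss)
    duplicated: set  # IDs observed more than once (data duplication)

    @property
    def ok(self) -> bool:
        # Passes only if nothing was lost and nothing was duplicated.
        return not self.lost and not self.duplicated


def check_delivery(submitted, observed) -> DeliveryReport:
    """Compare the IDs a test submitted with the IDs seen after recovery.

    Hypothetical helper: `submitted` and `observed` are iterables of
    hashable IDs collected by an assumed test harness.
    """
    counts = Counter(observed)
    lost = {i for i in submitted if counts[i] == 0}
    duplicated = {i for i, n in counts.items() if n > 1}
    return DeliveryReport(lost=lost, duplicated=duplicated)
```

A run that injected failures would pass this check only if `check_delivery(submitted, observed).ok` is true. Downtime and automatic recovery, the other two parts of the definition, need timing measurements that this sketch does not cover.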

Whether our goal is possible remains to be seen, but I’ll be documenting our progress as we go via these progress reports.

Our workflow

We work through the rama-ai-learn project, which we just open-sourced. It’s a benchmark and harness for measuring how well LLMs can produce production Rama code, along with the skill content the agent uses to do the work.

Each task we throw at an agent is a “challenge.” A challenge directory (example) contains a README.md stating the operations, latency targets, and other constraints, plus an interface the agent must implement. It also contains private artifacts that are encrypted before runs so the agent can’t see them: tests covering functional correctness, fault tolerance, and performance, and a reference implementation. After an agent finishes its implementation and passes its own tests, the challenge runner decrypts and runs the hidden tests to determine whether the agent succeeded or failed.
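The pass/fail logic described above can be sketched as follows. This is an illustration of the flow only, not the actual rama-ai-learn runner: `ChallengeRun`, `verdict`, and the category names are hypothetical, and the "not-evaluated" outcome for an agent that never passes its own tests is an assumption the post does not spell out.

```python
from dataclasses import dataclass, field

# Hypothetical category names, taken from the prose description of the
# hidden tests (functional correctness, fault tolerance, performance).
HIDDEN_CATEGORIES = ("functional", "fault_tolerance", "performance")


@dataclass
class ChallengeRun:
    agent_tests_passed: bool  # did the agent's own tests pass?
    # Maps hidden test category -> True if that category passed.
    hidden_results: dict = field(default_factory=dict)


def verdict(run: ChallengeRun) -> str:
    """Return the outcome of a challenge run.

    Assumption: hidden tests are only consulted once the agent has
    finished and passed its own tests; otherwise nothing is evaluated.
    """
    if not run.agent_tests_passed:
        return "not-evaluated"
    # A category missing from the results counts as a failure.
    failed = [c for c in HIDDEN_CATEGORIES if not run.hidden_results.get(c, False)]
    return "pass" if not failed else "fail: " + ", ".join(failed)
```

The point of the sketch is that the agent's own tests act only as a gate. Success is decided entirely by tests the agent never saw.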

Agents are run inside a Docker container with full permissions. We capture every agent invocation’s full transcript, including thinking, tool uses, tool results, and the final response. Thinking is particularly valuable. It’s how we discover failure modes that don’t show up in the produced code, like an agent identifying a fault-tolerance gap, going back and forth on possible solutions, and then saying “this is getting complicated” and failing to address it at all.
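As a rough illustration of why captured thinking is useful, here is a minimal sketch of scanning a transcript for the kind of give-up moment described above. The record shape (`TranscriptEvent` with a `kind` and `text`) and the marker phrase list are assumptions made for this example; the real transcript format may differ.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEvent:
    # Assumed kinds: "thinking", "tool_use", "tool_result", "final_response"
    kind: str
    text: str


# Hypothetical marker list; the only phrase taken from the post is the
# quoted "this is getting complicated".
GIVE_UP_MARKERS = ("this is getting complicated",)


def flag_abandoned_reasoning(events, markers=GIVE_UP_MARKERS):
    """Return indexes of thinking events that contain a give-up marker."""
    hits = []
    for index, event in enumerate(events):
        if event.kind != "thinking":
            continue  # only the agent's reasoning is inspected
        lowered = event.text.lower()
        if any(marker in lowered for marker in markers):
            hits.append(index)
    return hits
```

A flagged index is only a pointer for a human reviewer: the surrounding thinking still has to be read to see which gap the agent identified and then dropped.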

Rama has Java and Clojure APIs. We’re focused on Clojure for now but will produce an equally capable Java version of the skills later. The REPL is the main reason: in a long-running REPL session, the agent evaluates code and inspects results in milliseconds instead of paying for JVM startup and dependency loading on every run. We expect this gap to matter more as challenges get harder and converging on a correct design takes the agent many iterations.

Working on improving LLM performance involves making a new challenge and then iterating on the skill files until the agent passes...
