An MCP server that gives agents a sandboxed Lisp REPL

andreasronge1 pts1 comments

The Right Tool for Code Mode | ptc_runner blog

Skip to the content.

The Right Tool for Code Mode

A while back I built my own personal assistant and wired it to my email, calendar, and phone over MCP. One of the first things I asked it to do was show me the last 10 emails.

That was already enough to feel the problem. Email bodies are long, quoted threads are longer, and every message the tool returned was now sitting in the model’s context. The model had not done any real work yet. It had just fetched data, and the context window was already being spent.

Then I asked for something more useful. Go through my mail, suggest a few categories, and give me some statistics. This was ordinary data work: grouping, counting, comparing. But the model tried to do it by reading messages one at a time in its head. The context filled up, the answers got vague, and the counts were wrong.

Some of that was on my Gmail MCP. But the failure pointed at something deeper. The model was being asked to be the computer, and it is not good at that. It is good at deciding what to compute. Those are different jobs, and I had handed it the wrong one.

That failure is the reason I built the agentic framework ptc_runner, and lately a standalone MCP server on top of it. I wanted the model to stop doing computation in its head and instead write small programs.

This post is about why I think code mode needs a smaller execution language than Python or JavaScript. Along the way it changed how I design tools, which is the part I keep coming back to.

Code mode is having a moment

The idea going around is a good one. Instead of the model calling tools one at a time and dragging every result back into the conversation, it writes a small program. The program calls the tools, processes the results where they sit, and returns only the part that matters.

Cloudflare shipped this as Code Mode, turning MCP tools into a TypeScript API the model can write against. Anthropic made a similar case in Code execution with MCP: keep intermediate results in the execution environment and return only the chosen value. Different languages, same move. Stop making the model the runtime. Let it write code instead.

I have been convinced by that direction for a while. Where I part ways is the next decision. Which language should the model use?

Code mode hides two choices inside one feature. One is the sandbox. Where does generated code run, and how cheaply can it be limited and thrown away? The other is the language surface. What is the model allowed to write inside that sandbox?

Most systems start with a familiar general-purpose language, usually Python or JavaScript, and then constrain it from the outside. ptc_runner goes the other way. The language is small enough that it removes whole categories of behavior before the sandbox has to.

This job needs a different kind of language

Think about what the code actually is in this setting. It is short. It is thrown away the moment it runs. It was written by something that cannot be fully trusted, and it executes on your machine or in your account. When it goes wrong, what you need most is a clear signal back so the model can fix itself on the next attempt.

That is a narrow job. You need to run many small, disposable, untrusted snippets, safely, with feedback good enough to recover from. Python and JavaScript were not designed for that. They were designed for people writing programs they intend to keep. They bring their whole surface area along, including imports, file access, network calls, a vast standard library, and a thousand ways to do almost anything.

When the author is an LLM, all of that reach turns into something you have to manage. You can make it safe. People do it every day, with containers, V8 isolates, and hardened interpreters that strip capabilities back out. But look at what that work is. It lives outside the language. You pick a tool that can do anything, then spend real engineering making sure it cannot. The constraint you actually wanted is not part of the tool. It is bolted on, which means you own it, and you have to keep owning it.

The real cost of a general-purpose language here is the ongoing work of removing power you never wanted in the first place.

There is a quieter cost too. A big language gives the model a big space to be subtly wrong in. It can write something that parses fine and still means the wrong thing. For one-shot generation that has to run untrusted, that surface is working against you.

Make the language the sandbox

ptc_runner did not start with PTC-Lisp. The first version used a tiny JSON-shaped language because it was easy to describe with a schema. It worked for simple calls, but the schema grew, the language stayed awkward, and it never felt like a real workspace.

PTC-Lisp is the version that stuck. I wanted a syntax that was small, regular, compact in tokens, and natural to use in a REPL. There is no filesystem access, no network access, no import, no eval....

model language code mode tool back

Related Articles