SkyPilot Sandboxes: Run Agent Code on Your Own Kubernetes, at Scale


SkyPilot Blog

Table of Contents
What is a sandbox, and why do you need one?
SkyPilot Sandboxes run on your own infra
Example: RL-training a code-generation model, with sandboxed reward
Competitive with Modal, on your own clusters
  Performance: faster to first command, scales with your clusters
  Cost: up to 10x cheaper
Takeaways

Every agent, coding assistant, and RL pipeline eventually hits the same wall: the model wrote code, and now someone has to run it. Today, most teams hand that code to a hosted sandbox vendor, paying a multiple of raw compute to execute untrusted code on someone else's machines, while their prompts, test cases, and model outputs leave their cloud. Meanwhile, the Kubernetes cluster they already operate sits right there, capable of running 50,000 sandboxes at once. This post is about closing that gap: SkyPilot Sandboxes, a BYOC (bring your own cloud) code execution layer, with a full RL post-training example and head-to-head benchmarks against Modal.

The full pricing math is worked out in the cost section below.

What is a sandbox, and why do you need one?

LLMs generate code. Whether it is an agent, a coding assistant, or an RL reward loop scoring the output of a half-trained model, at some point you have to run that code, and you cannot trust it. It can loop forever, exhaust memory, write files, spawn processes, or import something that tries to phone home. You need a disposable, isolated place to run it, and you usually need a lot of them at once.

Today that means reaching for a hosted sandbox vendor. It works, but the trade-offs are real:

Cost. You pay the vendor's per-sandbox rate on top of the compute you already own.
Privacy. Your code and data (the model's output, your test cases, your prompts) leave your environment for a third party.
Latency for non-US users. The vendor runs in their regions. Reach them from somewhere else and every call pays a network-distance tax.

SkyPilot Sandboxes run on your own infra

A SkyPilot Sandbox is a lightweight, isolated pod that you create on demand, run commands in, and tear down. It runs on the Kubernetes you already have (BYOC: bring your own cloud).

Per-pod isolation. Each sandbox is its own pod with a dedicated image, CPU, and memory. Code that misbehaves is contained to its pod, and the pod is destroyed when you are done.
Massively parallel. Launch many sandboxes in a single call and fan commands out across them concurrently.
Sub-second launches with warm pools. A pool keeps pre-provisioned pods idle and ready, so creating a sandbox claims a running pod instead of waiting on Kubernetes scheduling and an image pull. That cuts a single sandbox's launch time by more than 50%.
Your infra, your data. Code and data never leave your cloud. If grading needs credentials (a private package index, a database for integration tests), they are injected from the SkyPilot Secrets Manager at create time, never baked into an image.
Modal-style API. create(), exec(), terminate(), each with an async sibling on .aio for massive fan-out. If you have used a hosted sandbox SDK, you already know this one.

Three usage patterns follow: create and exec, fan out, and async.

import sky.sandbox

sb = sky.sandbox.create(image="python:3.12", cpus=1, memory_gb=2)
result = sb.exec("python", "-c", "print(2 + 2)")
print(result["stdout"])  # "4" (also: stderr, exit_code)
sb.terminate()

# One call returns a LIST of isolated sandboxes.
sandboxes = sky.sandbox.create(image="python:3.12", num_sandboxes=100)
for sb in sandboxes:
    sb.exec("pytest", "-q", "tests/")
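
The loop above discards what each `exec` call returns. Here is a minimal sketch of tallying a fan-out, assuming every result is a dict with the `stdout`, `stderr`, and `exit_code` keys noted in the first snippet. `summarize_results` is a hypothetical helper, not part of the SkyPilot API, and the results below are hand-written stand-ins:

```python
from collections import Counter


def summarize_results(results):
    """Tally exec results by exit code and collect stderr from failures.

    Assumes each result is a dict shaped like
    {"stdout": ..., "stderr": ..., "exit_code": ...}.
    """
    by_exit_code = Counter(r["exit_code"] for r in results)
    passed = by_exit_code.get(0, 0)
    return {
        "total": len(results),
        "passed": passed,
        "failed": len(results) - passed,
        "failure_logs": [r["stderr"] for r in results if r["exit_code"] != 0],
    }


# Hand-written stand-ins for what a fan-out over three sandboxes might return.
fake_results = [
    {"stdout": "2 passed", "stderr": "", "exit_code": 0},
    {"stdout": "", "stderr": "AssertionError", "exit_code": 1},
    {"stdout": "3 passed", "stderr": "", "exit_code": 0},
]
summary = summarize_results(fake_results)
print(summary["passed"], summary["failed"])
```

In real use, the list would be built from the return values of `sb.exec(...)` across `sandboxes`.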

import asyncio

# Every entrypoint has an async sibling on `.aio`.
# (Run this inside a coroutine; `code` is the source string to execute.)
sandboxes = await sky.sandbox.create.aio(image="python:3.12", num_sandboxes=64)
results = await asyncio.gather(
    *(sb.exec.aio("python", "-c", code) for sb in sandboxes))
await asyncio.gather(*(sb.terminate.aio() for sb in sandboxes))
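
The warm-pool behavior described earlier (claim an idle, pre-provisioned pod instead of waiting on scheduling and an image pull) follows a common pattern. What follows is a generic sketch of that pattern, not SkyPilot's implementation: `WarmPool`, `provision`, `claim`, and `refill` are hypothetical names, and a plain string stands in for a pod.

```python
import queue


class WarmPool:
    """Keeps pre-provisioned items idle so a claim can skip the slow path."""

    def __init__(self, provision, size):
        self._provision = provision  # slow path: creates one new item
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(provision())

    def claim(self):
        """Return (item, was_warm); provision on demand if the pool is empty."""
        try:
            return self._idle.get_nowait(), True
        except queue.Empty:
            return self._provision(), False

    def refill(self, target):
        """Top the pool back up to `target` idle items."""
        while self._idle.qsize() < target:
            self._idle.put(self._provision())


# A counter-backed stand-in for "create a pod" (hypothetical).
counter = iter(range(1000))
pool = WarmPool(provision=lambda: f"pod-{next(counter)}", size=2)

first = pool.claim()   # served from the pool
second = pool.claim()  # served from the pool
third = pool.claim()   # pool is empty, so this one is provisioned on demand
print(first, second, third)
```

The design point is that the expensive step runs ahead of demand; a claim only pays for it when the pool has been drained faster than it is refilled.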

Example: RL-training a code-generation model, with sandboxed reward

Untrusted code at volume shows up most sharply in reinforcement learning. This example post-trains a code-generation LLM, a policy model that, given a programming problem, writes a Python function to solve it. The training goal is simple to state: make the model's generated functions pass the tests more often.

On every training step, for every rollout in the batch, we execute code that a half-trained model just wrote (buggy, occasionally infinite-looping, untrusted by definition), and that execution sits on the critical path of training. This is the same shape of problem HuggingFace's Open R1 hit when they used hosted sandboxes for their RL reward; here, the execution runs on your own Kubernetes cluster via SkyPilot Sandboxes.

We use a standard distributed RL layout: five services in a SkyPilot job group, talking over HTTP.

The Data Server serves prompts (MBPP-style problems with hidden tests) to the Rollout Server.
The Rollout Server (SGLang) has the current policy generate candidate solutions and sends them to the reward server.
The Sandbox...
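
One way to picture the reward for a single rollout is as the fraction of hidden tests that a candidate passes. The sketch below is an illustration under loud assumptions, not the post's reward server: `score_candidate` is a hypothetical name, the tests are plain `assert` statements, and a local `subprocess` call with a timeout stands in for the sandbox `exec` call so that the snippet runs anywhere. A local subprocess provides no isolation; with real model output, that call is what a sandbox would replace.

```python
import subprocess
import sys


def score_candidate(candidate_src, tests, timeout_s=5.0):
    """Return the fraction of test snippets that pass against candidate_src.

    Each test runs in a fresh Python process, so a crash or an infinite
    loop in one test cannot affect the others. NOTE: subprocess is only a
    local stand-in for a sandbox exec call and offers no isolation.
    """
    if not tests:
        return 0.0
    passed = 0
    for test in tests:
        program = candidate_src + "\n" + test + "\n"
        try:
            proc = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            continue  # a hung candidate counts as a failed test
        if proc.returncode == 0:
            passed += 1
    return passed / len(tests)


candidate = "def add(a, b):\n    return a + b"
hidden_tests = [
    "assert add(1, 2) == 3",
    "assert add(-1, 1) == 0",
    "assert add(2, 2) == 5",  # deliberately wrong, so one test fails
]
reward = score_candidate(candidate, hidden_tests)
print(reward)
```

The timeout matters as much as the exit code: a candidate that never returns has to score zero without stalling the training step.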

