Nango runs untrusted customer code at scale

rguldener | 1 pts | 0 comments



Nango is a code-first platform for building product API integrations. Customers connect their apps to Salesforce, Google Calendar, Slack, and a few hundred other APIs. Much of the code behind those integrations is written by our customers and deployed to us.

That code is untrusted, and can try to do anything: fetch an API, transform data, throw an exception, leak memory, or intentionally try to break out. We run more than 150 million of these functions a month across different workload shapes.

Our requirements for the code runtime

We run three very different workloads:

On-demand calls (Actions): run for a user or agent, so they must start and finish fast. Cold starts hurt.
Long-running jobs (Syncs): replicate data in the background, sometimes for hours across millions of records. They need resumable execution.
Bursty events (webhooks): arrive in unpredictable spikes, so the runtime has to absorb sudden floods.

Running untrusted code then adds more requirements:

Isolation from our systems: a function must never reach our database, secrets, or internal network.
Isolation between tenants: one customer’s code must not reach another’s, and one customer’s heavy job must not starve everyone else.
Isolation between executions: a customer can have a mix of long and short-running jobs, and those jobs should not fight for the same resources.
Cost and elasticity: we will not pay for idle compute or hand-scale a fleet.

Meeting all of these requirements at once is hard because they pull in different directions: reducing cold starts means keeping environments warm, which costs money; long syncs need extended run time; spiky webhooks need cheap compute that scales from zero.

Every runtime we built involved tradeoffs, but we have always tried to prioritize security.

We started with an in-process sandbox (vm2)

In our first years, we ran customer code inside vm2, a Node.js sandbox, in the same process as the worker that ran the job.
The customer’s function executed right alongside our own code, and vm2 blocked it from our database, secrets, and other customers’ data. It was simple, and required no extra infrastructure.

Then, in 2023, vm2’s maintainer temporarily archived the project after a series of sandbox-escape vulnerabilities: code inside the sandbox could reach the host and run on the worker. A malicious integration could do the same to us.

We learned that an in-process JavaScript sandbox is not a real security boundary. Share a process with untrusted code, and you are one escape away from a serious problem.

Isolating untrusted code in a runner

So we stopped running customer code in the same process as anything that mattered.

We split the system into two parts: a dispatcher that hands each customer’s code to a runner over HTTP, and a separate runner that executes it. Each customer gets their own long-lived runner, scaled independently: more CPU or memory, or extra replicas for heavy accounts.

We also built an orchestration layer that spins up runners on demand, retires idle ones, and rolls out updates across hundreds of them. The scheduler behind the dispatcher initially ran on Temporal; we later moved it to Postgres (a separate story).

Crucially, runners get no direct database access. To read or write records, a function calls an SDK method like nango.batchSave(...) that goes to a separate persist service over the network; persist talks to the database, the runner never does. A runner holds the customer’s code and the minimum it needs, nothing more. We ran the runners as separate services on Render.

Moving to AWS Lambda

By late 2025, the runner model was struggling with resource fairness and observability. A runner ran all of a customer’s executions together, so one heavy job, say a connection replicating millions of records, could starve that customer’s other functions.
And when a runner ran out of memory, we could not reliably say which of its thousands of functions caused it. We wanted to isolate each execution and observe each one.

AWS Lambda gave us both. With Lambda, each execution runs in its own hardware-virtualized microVM with its own kernel, a far stronger boundary than a shared process, and AWS handles the scaling. We had looked at alternatives such as Knative and WASM-based runtimes, but they offered similar capabilities to Lambda with far less maturity.

We started seeing improvements right away.

A memory or CPU issue now points to a single connection’s function and is visible in that function’s own logs, instead of a vague “this runner is struggling,” and one bad function no longer affects others. For a small team supporting hundreds of customers, this significantly reduced the time spent debugging these issues.

However, there was a problem. A Lambda function runs for at most 15 minutes, and our syncs ran far longer. So we solved it on the product side: a 10-minute cap per run for syncs, and a checkpoints feature that lets a sync resume across runs...
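
To make the capped-run-plus-checkpoint idea concrete, here is a minimal sketch of how a long sync might be split into bounded runs that resume from a saved cursor. It is an illustration under stated assumptions, not Nango's implementation: every name in it (Checkpoint, fetchPage, runSyncSlice, runToCompletion) is hypothetical, the external API is simulated in memory, and a per-run page budget stands in for the 10-minute time cap so the example stays deterministic.

```typescript
// Hypothetical sketch: none of these names come from Nango's SDK.

// A checkpoint records how far a sync got; here it is just a record cursor.
interface Checkpoint {
  cursor: number;
}

// Outcome of one bounded run.
interface RunResult {
  saved: number;          // records saved during this run
  done: boolean;          // true when no data remains
  checkpoint: Checkpoint; // where the next run should resume
}

// Stand-in for an external API: returns one page of records,
// or an empty page once the cursor is past the end of the data.
function fetchPage(cursor: number, pageSize: number, total: number): number[] {
  const page: number[] = [];
  for (let i = cursor; i < Math.min(cursor + pageSize, total); i++) {
    page.push(i);
  }
  return page;
}

// Runs one slice of the sync. It stops when the data is exhausted or when the
// page budget (standing in for a wall-clock cap) is used up.
function runSyncSlice(
  start: Checkpoint,
  budgetPages: number,
  pageSize: number,
  total: number,
  sink: number[],
): RunResult {
  let cursor = start.cursor;
  let saved = 0;
  for (let pages = 0; pages < budgetPages; pages++) {
    const page = fetchPage(cursor, pageSize, total);
    if (page.length === 0) {
      return { saved, done: true, checkpoint: { cursor } };
    }
    sink.push(...page);    // stand-in for persisting the records
    saved += page.length;
    cursor += page.length; // advance the cursor only after the page is saved
  }
  return { saved, done: cursor >= total, checkpoint: { cursor } };
}

// Scheduler loop: re-invokes the sync from the last checkpoint until it completes.
function runToCompletion(total: number, budgetPages: number, pageSize: number) {
  const sink: number[] = [];
  let checkpoint: Checkpoint = { cursor: 0 };
  let runs = 0;
  let done = false;
  while (!done) {
    const result = runSyncSlice(checkpoint, budgetPages, pageSize, total, sink);
    checkpoint = result.checkpoint;
    done = result.done;
    runs++;
  }
  return { runs, sink };
}
```

With 25 records, a page size of 5, and a budget of 2 pages per run, runToCompletion(25, 2, 5) finishes in 3 runs and saves every record exactly once. In a real system the checkpoint would be stored durably between invocations, and the budget would be checked against elapsed time rather than a page count.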

