Making numpy-ts as fast as native

nico.codes, Jun 8, 2026

How numpy-ts went from 15x slower than NumPy to performance parity

When I first started building numpy-ts, everybody said there was no way it could reach performance parity with NumPy’s native implementation and its decades of optimization.

I set out to prove those naysayers wrong.

Turns out, they were right.

Pure JavaScript/TypeScript was never going to match NumPy on raw numerical performance, at least not for the operations where NumPy is no longer really “Python” but C, BLAS/LAPACK, pocketfft, and a very mature memory model, all hiding behind a Python API.

The path to making numpy-ts competitive with native NumPy was not making JS magically faster. It wasn’t even “use WASM,” at least not on its own. It was changing who owned the bytes.

Functional parity was the easy part

The first major goal for numpy-ts was compatibility. I wanted something that felt like NumPy, worked naturally in TypeScript, and sported the same API surface as Python NumPy.

After several months, numpy-ts reached broad functional parity, with comprehensive API coverage and strong cross-validation test suites. You could write NumPy-style code in TypeScript and get the right answers.

It was also painfully slow.

At that stage, numpy-ts was roughly 15x slower than native NumPy across the benchmark suite.

This was not because JavaScript engines are bad. V8, JavaScriptCore, and other modern runtimes are extraordinary pieces of engineering, and they handle exactly this shape of work well. A tight, monomorphic loop over a contiguous Float64Array compiles to something pretty close to the equivalent scalar loop in C.
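To make "this shape of work" concrete, here is a minimal sketch of such a loop. The function name addInto is illustrative and is not taken from numpy-ts.

```typescript
// Element-wise add over contiguous Float64Arrays.
// The loop always sees the same argument types (monomorphic), which is
// the case a JIT can compile down to a tight scalar loop.
function addInto(a: Float64Array, b: Float64Array, out: Float64Array): Float64Array {
  const n = out.length;
  if (a.length !== n || b.length !== n) {
    throw new RangeError("addInto: length mismatch");
  }
  for (let i = 0; i < n; i++) {
    out[i] = a[i] + b[i];
  }
  return out;
}

const lhs = Float64Array.of(1, 2, 3, 4);
const rhs = Float64Array.of(10, 20, 30, 40);
const total = addInto(lhs, rhs, new Float64Array(4));
console.log(Array.from(total)); // [ 11, 22, 33, 44 ]
```

Note that the loop handles one element per iteration; nothing in the source language lets you ask for several lanes at once.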

The problem is that scalar C is not the bar. NumPy isn’t running scalar loops. Its hot paths are SIMD-vectorized, dispatch into mature kernels like BLAS, LAPACK, and pocketfft, and in places parallelize across cores. Portable JavaScript can’t express most of that: no explicit SIMD, no native BLAS, and every number is a float64. So even a perfectly JIT-compiled JS loop ends up competing one lane at a time against kernels doing four, eight, or sixteen.

For small operations, where that vectorization advantage barely matters and fixed overheads dominate, JavaScript could be surprisingly competitive. For large arrays and compute-heavy kernels, where NumPy’s vectorized, specialized machinery is doing real work, it could not keep up.

The obvious next step was WebAssembly.

WASM kernels: good, but not a silver bullet

The first instinct was simple: move the slow functions to WASM.

That was directionally right, but there was an important constraint. I did not want numpy-ts to become one giant native blob. It still needed to feel like a TypeScript library:

tree-shakeable (small when you only import a few functions)

ergonomic from JavaScript

portable across Node, Bun, Deno, and browsers

So instead of rewriting the entire library in native code, I opted to move performance-critical kernels into small, self-contained WASM modules.

I found Zig to be a good fit for this problem. Zig can produce small WASM artifacts, has no mandatory runtime, gives direct control over memory, and is pleasant for writing the kind of low-level loops numerical kernels need.
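From the JavaScript side, a small self-contained module might look roughly like the sketch below. The bytes are a hand-encoded toy module that exports a single add function; they are an assumption for illustration, not a numpy-ts kernel, and a real kernel would be compiled from a language such as Zig rather than written by hand.

```typescript
// A hand-encoded WASM module exporting add(i32, i32) -> i32.
// Illustrative only: the names addModuleBytes and wasmAdd are hypothetical.
const addModuleBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, // magic "\0asm"
  0x01, 0x00, 0x00, 0x00, // binary format version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // function section: one function using type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export section: "add" = function 0
  0x0a, 0x09, 0x01, 0x07, 0x00, // code section: one body, no locals
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // local.get 0, local.get 1, i32.add, end
]);

// Synchronous compilation is fine for a module this small.
const addModule = new WebAssembly.Module(addModuleBytes);
const addInstance = new WebAssembly.Instance(addModule, {});
const wasmAdd = addInstance.exports.add as (a: number, b: number) => number;

console.log(wasmAdd(2, 40)); // 42
```

The whole module is 41 bytes and needs no imports, which is the property that keeps per-kernel modules compatible with tree-shaking.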

This helped a lot. The performance gap dropped from about 15x slower than NumPy to roughly 2x slower.

At this point, many of the expensive loops were already running in compiled WASM. The kernels themselves were fast, but roughly 50% of native speed appeared to be a ceiling. Why was that?

My first suspicion was FFI overhead. Maybe calling into WASM was just too expensive. Maybe the boundary between JavaScript and WASM was the bottleneck. Maybe lots of small native calls were killing performance.

Modern JS engines have done a lot of work to make WASM calls and execution fast. SpiderMonkey, for example, optimized call overhead between JS and WASM years ago, while V8 has continued pushing WASM optimization further with work like speculative inlining and deoptimization support.

In my case, the remaining gap was not explained by call overhead alone.

The real problem was the copy.

In my initial approach, the ndarray data lived in JavaScript-owned TypedArray objects. When numpy-ts needed to run a WASM kernel, it had to copy data into WASM linear memory, run the kernel, then copy the result back out into JavaScript memory.

JS Float64Array
  ↓ copy into WASM linear memory
WASM kernel
  ↓ copy result back to JS
JS Float64Array result
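The round trip can be sketched in a few lines. This is a simplified model, not the numpy-ts implementation: scaleKernel is a TypeScript stand-in for a compiled WASM export, the allocator is a toy that always uses offset 0, and all names are hypothetical.

```typescript
// One 64 KiB page of linear memory, as a WASM instance would own it.
const linearMemory = new WebAssembly.Memory({ initial: 1 });

// Stand-in for a compiled kernel: doubles `len` float64 values in place,
// starting at byte offset `ptr` inside linear memory.
function scaleKernel(ptr: number, len: number): void {
  const view = new Float64Array(linearMemory.buffer, ptr, len);
  for (let i = 0; i < len; i++) view[i] *= 2;
}

// Running total of bytes moved across the JS/WASM boundary.
let bytesCopied = 0;

// JS-owned input -> copy in -> kernel -> copy out -> JS-owned result.
function scaleWithCopies(input: Float64Array): Float64Array {
  const ptr = 0; // toy allocator: always the start of linear memory
  const wasmView = new Float64Array(linearMemory.buffer, ptr, input.length);
  wasmView.set(input); // copy in
  bytesCopied += input.byteLength;
  scaleKernel(ptr, input.length);
  const result = wasmView.slice(); // copy out into a new JS-owned buffer
  bytesCopied += result.byteLength;
  return result;
}

const doubled = scaleWithCopies(Float64Array.of(1, 2, 3));
console.log(Array.from(doubled), bytesCopied); // [ 2, 4, 6 ] 48
```

Three elements cost 48 bytes of copying for 3 multiplications; the kernel does the least work of anything in the function.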

For an isolated operation, this was tolerable. For NumPy-style code, it was disastrous.

Numerical programs are full of chained operations. Add, multiply, reshape, reduce, broadcast, slice, take, compare, accumulate. If every operation pays a copy-in/copy-out tax, then the actual computation is no longer the only thing being benchmarked. In many cases, it isn’t even the dominant cost.
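A back-of-the-envelope estimate shows how quickly that tax adds up. The helper below is a hypothetical accounting sketch that assumes every operation in the chain is unary, copies its full input in, and copies a same-sized result out; binary operations would copy even more.

```typescript
// Bytes moved across the JS/WASM boundary for a chain of unary operations
// on n float64 elements, when every operation copies in and copies out.
function copyTaxBytes(n: number, ops: number): number {
  const bytesPerElement = Float64Array.BYTES_PER_ELEMENT; // 8
  return ops * 2 * n * bytesPerElement;
}

// Example: a 5-step chain over one million elements.
const chainTax = copyTaxBytes(1_000_000, 5);
console.log(chainTax); // 80000000
```

That is 80 MB of memory traffic that does no arithmetic at all, on top of whatever the kernels themselves read and write.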

The missing element: no-copy calls

Instead of storing ndarray data in JavaScript and...
