Modal Auto Endpoints: Optimized inference you own

Introducing Modal Auto Endpoints: Optimized inference you actually own

Optimized inference you actually own. Try Modal Auto Endpoints

All posts Back News June 23, 2026•5 minute read

Introducing Modal Auto Endpoints: Optimized inference you actually own Charles Frye@charles_irl Member of Technical Staff

Deven Navani@DevenNavani Member of Technical Staff

Hari Subbaraj@hsubbaraj Member of Technical Staff

Greta Workman@gretaworkman Product Marketing

Richard Gong@_gongy Member of Technical Staff

Modal allows leading teams like Cognition, Decagon, Fathom, and DoorDash to own their inference without compromising on cost-performance or developer velocity. Now you can do the same with a single command: modal endpoint create --name agent --model zai-org/GLM-5.2-FP8

Introducing Modal Auto Endpoints: a smooth, self-serve on-ramp to production-grade LLM inference. Take it for a spin right now, or read on to learn more about how we built it and why. Built for the era of actually owning your inference Proprietary model providers can silently degrade models or suddenly retract access. If you don't own your inference, you don't own your destiny. If you work with open models served by an inference provider, you gain some control. But we think ownership runs deeper than the API. To actually own your inference, you need to own, understand, and optimize the code that runs the inference. Managed inference providers make it easy to get an API, but the serving stack is a black box. So until now, teams that wanted proper ownership of their inference have had only one option: roll an inference service yourself. That gives you control, but now you own a lot more than just inference: engine tuning, endpoint benchmarking, container deployment, replica autoscaling & routing, and inference metrics. That's why we built Modal Auto Endpoints, and why they look very different from what's offered by traditional inference providers.

A Modal Endpoint is an OpenAI API-compatible, production-ready service, backed by a Modal App that you can see and control. There are three key differences in this approach: We don't hide the code. Everything from GPU selection and regionalization to inference engine flags and the occasional cutty engine patch is shared with you. We don't hide the metrics . The metrics you actually need to debug inference issues, like speculative decoding acceptance length and per-replica, engine-side token latency quantiles, are automatically provided in a dashboard. Low bar, but we didn't put it there! We don't hide behind a "talk to sales" button . You can deploy frontier open models like GLM 5.2 with a CLI command or clickops, not a Zoom call. Our line is always open if you want additional expertise. Infrastructure built for inference We can deliver all of this because we are building on a rock-solid foundation: Modal's AI infrastructure platform.

Our users build on this platform to fold proteins, drive robots, and make music. The same fundamental components that work there also work for LLM inference, hand-rolled or via Auto Endpoints. With Modal, you don’t need to reserve months of expensive GPU capacity to handle load you can’t estimate. Instead, you pay for what you use, as you use it, and scale to meet demand with our high-performance autoscaling system and custom container runtime. You can use GPUs around the world, or close to your users, without worrying about capacity management. That’s our calling card, and that’s not changing. We’ve also added and released from beta a new fundamental component to our system to support the demands of low latency inference: Modal Servers for ultra-low-latency routing.

Modal Servers keep the elastic scaling and deep compute capacity of Modal Web Functions. But they remove queueing and are regionalized by default so that you can serve HTTP requests on Modal with only 5ms overhead -- without compromising on reliability and autoscaling. More on how we built that later this week. High performance inference code with a click, not a grind Inference engines are akin to database management systems like PostgreSQL: complex, mission-critical software that must perform at the limits of the hardware. As with databases, this software has complex internals exposed by multitudinous knobs, and achieving the best performance possible requires learning to tune those knobs. That’s a tough grind. When a team is looking to own inference but used to building on proprietary model APIs, it is tempting to keep the API layer abstraction and outsource inference performance concerns to proprietary wrappers of open-weights models. Auto Endpoints give you the best of both worlds: performance, effortlessly. For each supported model, we provide a starting deployment informed by our experience with teams building some of the most demanding AI products in the world. You don't need to specify GPU types or monkey around with engine flags like --mamba-scheduler-strategy or --flashinfer-mxfp4-moe-precision...

Modal Auto Endpoints: Optimized inference you own

Related Articles

Claude Fable 5

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI