The AI Decoupling

Pierre-Carl Langlais, Vintage Data, May 24, 2026

It started exactly one year ago: in May 2025, the tech economy split into software and AI.

Bessemer's compounded SaaS index vs. the general Nasdaq

The AI/software decoupling went unnoticed for months, save for early signs of structural strain. The market only came to a full realization in early 2026: SaaS and cloud services went through their largest sell-off since the start of the pandemic, while AI labs and their associated infrastructure soared. There are now two disconnected ecosystems, with different growth prospects, valuation multiples, investor networks and pull on talent. In short, the AI economy has decoupled from the tech economy that gave birth to it.

Last year I predicted that the model was becoming the product and was on a trajectory to absorb the application layer built around AI. I didn't fully anticipate that this would extend to the entire category of software and redefine many long-held assumptions (after all, what even is a "product" now?). While much has been written and speculated about the consequences of the AI take-off (starting with the Citrini memo), the actual root cause remains surprisingly hidden. What exactly changed with AI, LLMs or agents that made all this possible in 2025-2026, but not before?

This blogpost will attempt something different: build a more general understanding of model economics. We assume from the start that the current generation of models is a general-purpose technology, and that the actual driver of social and economic change lies at the level of fundamental research and engineering innovation. Yet this driver has remained a hidden signal until now, because the technological expertise behind economic analysis has stayed anchored in the software world. In practice, this will be an exercise in hermeneutics: taking seriously the disconnected literature that leading labs produce at the technical and commercial levels, and striving to reconcile it with the scarce available economic data into a unified picture.

High-margin, scalable, concentrated: MoE inference economics

The tech split is rooted exclusively in disruptive innovation. Over the past three years, AI labs developed increasingly sparse mixture-of-experts (MoE) architectures that simultaneously made inference high-margin and collapsed the margins of software production.

Claude Code and Codex could finally exist because they overcame a hard trade-off: by 2026, leading labs run billions of interleaved agentic sessions with large prefilled contexts. You couldn't do that eighteen months ago, even if you happened to have the same training data infrastructure. For this to happen, you need to ensure simultaneously high performance (to meet the critical accuracy level over long horizons), high throughput (as each session now expands fractally at inference time) and long context (as people routinely consume hundreds of thousands of tokens).
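Much of the tension between throughput and long context is a memory problem: every live session keeps a key/value (KV) cache that grows linearly with its length. The sketch below uses the standard KV-cache size estimate for a decoder-only transformer. The configuration (60 layers, 8 KV heads with grouped-query attention, 128-dimensional heads, 16-bit entries, a 200k-token context) and the 640 GB budget are hypothetical values chosen for illustration, not figures for any real model or deployment.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    """Size of a decoder-only transformer's key/value cache for one sequence."""
    # Factor 2: one key vector and one value vector are cached per layer, KV head and token.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def sessions_that_fit(memory_budget_bytes: int, per_session_bytes: int) -> int:
    """Number of full-length sessions whose caches fit in the budget at once."""
    return memory_budget_bytes // per_session_bytes


# Hypothetical configuration: 60 layers, 8 KV heads (grouped-query attention),
# 128-dim heads, 16-bit cache entries, 200k-token context.
per_session = kv_cache_bytes(num_layers=60, num_kv_heads=8, head_dim=128, seq_len=200_000)
# Hypothetical memory left for the KV cache once the weights are loaded.
budget = 640 * 10**9

print(f"KV cache per session: {per_session / 1e9:.1f} GB")            # KV cache per session: 49.2 GB
print("concurrent sessions:", sessions_that_fit(budget, per_session))  # concurrent sessions: 13
```

Everything scales linearly: halving either the context length or the number of cached KV heads roughly doubles the number of concurrent sessions, which is the throughput/context trade-off in miniature.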

The dominant architecture choice right now is highly sparse mixture of experts with native quantization. Expert routing is in itself a form of economic optimization: it works because most tasks are intrinsically modular and, at any given time, only need a relatively bounded search space. In typical bitter-lesson fashion, properly trained models are better parameter allocators than deterministic scaffolding. Similarly, long-context inference has become affordable as models learned to manage their context. Fundamentally, you don't need to hold hundreds of thousands of tokens equally in memory, just to recall the ones that matter.
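To make "expert routing as parameter allocation" concrete, here is a minimal sketch of top-k routing in a single MoE layer, in plain Python. It assumes a linear gate, single-matrix experts and a softmax taken over the selected experts' scores only, which is one common variant; the sizes, k=2 and random weights are hypothetical, and production routers add load balancing, capacity limits, shared experts and batched kernels that are omitted here.

```python
import math
import random


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(token: list[float], gate: list[list[float]], k: int) -> tuple[list[int], list[float]]:
    """Pick the k highest-scoring experts for one token and weight them."""
    scores = [sum(a * b for a, b in zip(token, row)) for row in gate]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return top, softmax([scores[i] for i in top])  # softmax over the selected experts only


def moe_forward(token, gate, experts, k):
    """Weighted sum of the outputs of the k selected experts; the others never run."""
    top, weights = route(token, gate, k)
    out = [0.0] * len(token)
    for i, w in zip(top, weights):
        out = [o + w * y for o, y in zip(out, experts[i](token))]
    return out


# Hypothetical sizes and random weights, for illustration only.
DIM, NUM_EXPERTS, K = 8, 16, 2
rng = random.Random(0)
calls = [0] * NUM_EXPERTS  # how many times each expert actually runs


def random_matrix(rows: int, cols: int) -> list[list[float]]:
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def make_expert(idx: int):
    w = random_matrix(DIM, DIM)  # each expert here is a single linear map

    def expert(v: list[float]) -> list[float]:
        calls[idx] += 1
        return [sum(w[r][c] * v[c] for c in range(DIM)) for r in range(DIM)]

    return expert


experts = [make_expert(i) for i in range(NUM_EXPERTS)]
gate = random_matrix(NUM_EXPERTS, DIM)  # one scoring vector per expert
tokens = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(32)]
outputs = [moe_forward(t, gate, experts, K) for t in tokens]

# prints: 64 expert calls for 32 tokens; running every expert would take 512
print(sum(calls), "expert calls for", len(tokens), "tokens; running every expert would take",
      len(tokens) * NUM_EXPERTS)
```

Each token touches only k of the experts, so the arithmetic per token scales with k rather than with the total number of experts, while all experts still have to be stored.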

More than benchmarks, economic viability is the primary driver of architecture innovation. Models that fail to meet the demands of inference economics are simply failed products. OpenAI disabled Sora, reportedly because the model was losing millions a day on inference: the particular mix of diffusion and autoregressive (AR) generation found in video models is likely still very weakly optimized.

The MoE market is high-margin by design: inference gets cheaper by several orders of magnitude, provided you have both enough compute and enough demand. Even at the compute level alone, there is a high barrier to entry, as medium-sized dense models stopped being competitive with large MoEs. The barrier is also technical and intellectual: at a near-frontier level, it requires building system-level intuitions about how highly complex components will work together in practice.
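A back-of-envelope sketch of why MoE economics need both compute and demand. It relies on the common rule of thumb of about 2 FLOPs per active parameter per token, and treats the resident weights as a fixed cost shared by all concurrently served sequences (KV cache ignored). Both model sizes and the 8-bit weight format are hypothetical, chosen only to illustrate the shape of the trade-off.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    total_params: float   # weights that must stay resident in accelerator memory
    active_params: float  # weights actually used to process one token


def flops_per_token(m: Model) -> float:
    # Common rule of thumb for a forward pass: ~2 FLOPs per active parameter.
    return 2.0 * m.active_params


def weight_gb_per_sequence(m: Model, concurrent_seqs: int, bytes_per_param: float = 1.0) -> float:
    # The resident weights are a fixed cost of serving the model at all; that cost
    # is shared by every sequence decoded concurrently (KV cache ignored here).
    return m.total_params * bytes_per_param / concurrent_seqs / 1e9


# Hypothetical models with 8-bit weights, for illustration only.
dense = Model("dense 70B", total_params=70e9, active_params=70e9)
moe = Model("MoE 1T total / 30B active", total_params=1e12, active_params=30e9)

for m in (dense, moe):
    print(f"{m.name}: {flops_per_token(m):.1e} FLOPs/token")
    for n in (1, 100, 10_000):
        print(f"  {n:>6} concurrent sequences -> {weight_gb_per_sequence(m, n):.3f} GB of weights each")
```

In this toy setup the MoE does less than half the arithmetic per token but needs a weight footprint roughly fourteen times larger before serving a single request; only high concurrency shrinks that fixed cost per sequence, by four orders of magnitude between 1 and 10,000 sequences. The sketch says nothing about model quality, which is where the text locates the competitive gap with dense models.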

The sudden emergence of MoE economies of scale retroactively explains the current GPU shortage. Counter-intuitively, the hardware supply chain did not really anticipate an AI boom. Until then, demand had come primarily from the heavy compute needs of platforms (Google, Meta, TikTok), and the supply chain even kept capacity intentionally constrained, as intermediaries were still scarred by the rapid succession of boom/bust cycles after the pandemic. MoE economics require a few large labs (and, to a much lesser extent, new infrastructure providers and neo-labs) to scale up their capability very...
