DeepSeek Slashes AI Costs to Cents

Business Analytics Review

SubscribeSign in

DeepSeek Slashes AI Costs to Cents Edition #299 | 29 May 2026

Business Analytics Newsletter May 29, 2026

DeepSeek Makes 75% Price Cut on V4 Pro Permanent, Dropping Frontier-Class Inference to $0.87/M Output Tokens with Mixture-of-Experts Architecture

In this edition, we will also be covering: Insilico teams up with US firm to build AI models predicting disease decades ahead

China unveils auto industry blueprint to set EV, AI vehicle and semiconductor standards

OpenAI's Altman says AI unlikely to lead to 'jobs apocalypse’

Today’s Quick Wins

What happened: DeepSeek has made permanent the 75% price reduction on its flagship V4 Pro model, locking in rates of $0.435/M input and $0.87/M output tokens down from the previous $1.74/$3.48 per million tokens. The model scores 80.6% on SWE-bench Verified and runs 1.6 trillion parameters via Mixture-of-Experts (49B active per forward pass) under an MIT license with full commercial use rights. Why it matters: At 34x cheaper than GPT-5.5’s estimated output pricing, this permanently shifts the cost calculus for enterprise AI workloads teams running high-volume RAG pipelines, code review agents, or long-context inference can now achieve seven-figure annual savings compared to closed-source alternatives, while self-hosting the open weights for full data sovereignty. The takeaway: If your team is still defaulting to GPT-5.5 or Claude Opus 4.7 for cost-sensitive batch workloads, benchmark DeepSeek V4 Pro this week the 80.6% SWE-bench score means coding and reasoning quality is now within striking distance of frontier models at a fraction of the cost.

Deep Dive

The Permanent Price War: What DeepSeek’s V4 Pro Cost Lock Means for Your AI Budget

For the past 18 months, enterprise AI budgets have been locked in a predictable pattern: pay frontier prices, accept frontier performance, negotiate volume discounts. DeepSeek just changed the floor permanently. The original 75% promotional discount on V4 Pro was set to expire May 31, 2026. Instead, DeepSeek announced the rates are now the standing price, not a promotion. The reason: architectural efficiency, not market pressure. The Problem: Long-context inference is expensive. As enterprise AI workloads grow RAG pipelines with large retrieval prefixes, code review agents scanning full repositories, legal document analysis token costs compound fast. A team running 500M input tokens/month at standard frontier rates could pay over $1,000/month in input costs alone, before output. The Solution: V4 Pro was engineered from the ground up to cut long-context inference cost, using a Mixture-of-Experts design that activates only 49 billion of its 1.6 trillion parameters per forward pass. Combined with a 1M-token context window and aggressive caching (cache-hit input now at $0.003625/M), the architecture makes the price cut structurally sustainable. Mixture-of-Experts (MoE) routing: Only 3% of total parameters fire per token, reducing compute per forward pass dramatically while maintaining near-full-model quality on reasoning and coding benchmarks.

Context caching: Cache-hit input tokens at $0.003625/M roughly 120x cheaper than standard input mean agent loops and RAG systems that resend the same system prompt or retrieval prefix pay almost nothing on repeat tokens.

Open weights under MIT license: Full commercial use with no restrictions, enabling on-premises or air-gapped deployment for regulated industries where data sovereignty rules out managed APIs.

The Results Speak for Themselves: Baseline: GPT-5.5 estimated at ~$30/M output tokens; V4 Pro original price at $3.48/M output

After Optimization: V4 Pro permanent pricing at $0.87/M output tokens (75% reduction from original)

Business Impact: At 1B output tokens/month, switching from GPT-5.5 to V4 Pro saves approximately $2.4M/year ; enterprise deployments with cache-heavy workloads can realize even greater savings via the $0.003625/M cache-hit rate

What We’re Testing This Week

Practical topic: Optimizing token costs in production RAG pipelines with prompt caching The shift to permanent low-cost inference makes token efficiency more important than ever now the savings are real and compounding. Here are two techniques worth benchmarking this week: Prefix caching for static system prompts and retrieval context. Most RAG implementations resend the full system prompt and top-k retrieved chunks on every query. With DeepSeek V4 Pro’s cache-hit pricing at $0.003625/M input tokens, structuring your prompt so the static prefix (instructions + persistent context) comes first and user queries come last can reduce effective input costs by 80–90% on high-volume workloads. The pattern: cache the retrieval corpus summary once, append only the dynamic user query. Tested on a 500-token system prompt + 1,500-token retrieval context at 100K requests/day, this reduces monthly input token costs from ~$378 to...

DeepSeek Slashes AI Costs to Cents

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

Show HN: GoPeek – open links in live mini browser windows without new tabs

Agent Memory: An Anatomy

SpaceX not the behemoth everyone thought

Naphtha Shortages Having a Growing Impact in Japan