ZeroGPU
ZeroGPU is where engineers, founders, and builders learn how to deploy fast, lightweight AI using edge devices, SLMs, and distributed inference, all with real examples from the ZeroGPU ecosystem.
An Explainer on ZeroGPU’s New Router for Claude Code

How to Reduce AI Compute Costs with our Claude Code Plugin: Routing Lightweight AI Tasks to Small Language Models
Our new plugin lets Claude Code route repeatable or lightweight tasks to specialized small and nano language models.
ZeroGPU
Jun 3, 2026
Developers using Claude Code can now offload lightweight AI tasks like classification, extraction, tagging, and PII redaction to ZeroGPU’s small and nano language models directly from the terminal.
Use our ZeroGPU Router with Claude Code to automatically reduce costs on relevant tasks

The new zerogpu-router plugin integrates ZeroGPU’s inference platform into Claude Code’s plugin system, exposing ZeroGPU commands as Claude-accessible skills and slash commands. Instead of sending every task to a frontier model, developers can selectively or automatically route narrow NLP workloads to smaller, specialized models designed for speed and cost efficiency.

The release reflects a growing shift in agentic coding workflows: not every request inside an AI coding session needs Claude-level reasoning.

Turning Claude Code Into a Multi-Model Router

Claude Code, Anthropic’s terminal-based coding agent, allows developers to extend sessions with plugins, slash commands, and auto-invoked skills. The zerogpu-router plugin adds ZeroGPU’s inference layer directly into that workflow.

Once installed, Claude can automatically detect and route requests like:
PII redaction
Named entity extraction
IAB taxonomy classification
Sentiment and topic labeling
JSON extraction from free text
Short single-turn chat responses

Claude can also auto-invoke skills based on intent. Requests mentioning “redact,” “extract,” or “classify” automatically trigger the appropriate ZeroGPU model behind the scenes.

Specialized Nano Models for Structured Tasks

Our plugin routes requests to a catalog of smaller models hosted on ZeroGPU’s serverless inference platform.

Examples include:
gliner-multi-pii-v1 for PII extraction and redaction
gliner2-base-v1 for entity extraction and structured classification
deberta-v3-small for zero-shot classification
zlm-v1-iab-classify-edge for IAB taxonomy tagging
LFM2.5-1.2B-Instruct for lightweight chat responses
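To make the intent-based routing idea concrete, here is a minimal sketch of keyword routing in Python. It is not the plugin's actual implementation: the model names come from the catalog above, but the keyword table, the `pick_model` and `route` functions, and the `"frontier-model"` fallback label are all hypothetical and exist only for illustration.

```python
from typing import Optional

# Hypothetical keyword-to-model table. The model names appear in the article;
# the idea of matching on these exact substrings is an assumption.
INTENT_TO_MODEL = {
    "redact": "gliner-multi-pii-v1",
    "extract": "gliner2-base-v1",
    "classify": "deberta-v3-small",
}

# Placeholder label for "leave this request with the large model".
FALLBACK = "frontier-model"


def pick_model(request: str) -> Optional[str]:
    """Return a small-model name if the request mentions a routable intent."""
    text = request.lower()
    # Dicts preserve insertion order, so earlier keywords win on ties.
    for keyword, model in INTENT_TO_MODEL.items():
        if keyword in text:
            return model
    return None


def route(request: str) -> str:
    """Route to a small model when possible, otherwise to the fallback."""
    model = pick_model(request)
    return model if model is not None else FALLBACK


if __name__ == "__main__":
    for prompt in (
        "Redact the emails in this log",
        "Classify these support tickets",
        "Design a caching layer for this service",
    ):
        print(f"{prompt!r} -> {route(prompt)}")
```

Substring matching is deliberately naive here. A real router would presumably need a more robust intent signal, but the shape is the same: narrow, repeatable tasks go to a specialized model, and everything else stays with the large one.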
Using a ZeroGPU model to redact PII, an example of a lightweight AI task that can be automatically routed to an SLM.

Use large reasoning models where reasoning matters, and use smaller edge-optimized models for deterministic NLP tasks.

Instead of overspending on premium inference, developers can route those requests through more cost-efficient, specialized models.

Keep Claude focused on higher-context reasoning work. Let ZeroGPU do the rest.

Built for Faster, More Cost-Effective Inference

The Claude Code plugin extends ZeroGPU’s focus on fast, cost-efficient inference into developer tooling, where inference routing is becoming increasingly important as teams balance cost, latency, and model capability. As AI coding agents become more central to engineering workflows, infrastructure layers that decide which model should handle which task are becoming part of the stack itself.

Get started today:
📑 Read the full Claude Code Plugin docs
⭐️ Review the zerogpu-router README on GitHub
ℹ️ Learn more about ZeroGPU
ZeroGPU x Claude Code
Published in ZeroGPU
Written by ZeroGPU
The compute-efficient layer for AI inference.