Local AI without memory limits: how QVAC’s latest upgrade unlocks 5x more context on your device - QVAC by Tether
Updated June 1, 2026
If you have ever pasted a long document into a local AI app and watched the model stop mid-page with "context limit exceeded", you have hit the memory ceiling that has shaped local AI for years. The model wasn’t the bottleneck. Its working memory, the Key-Value (KV) cache, was.
QVAC SDK 0.12.0 changes that.
What is the KV cache?
The KV cache is the working memory an LLM keeps during a conversation. Every token of your prompt, every previous assistant turn, every attached document is stored as Key-Value pairs on-device. This cache lets the model maintain coherence across long contexts without reprocessing everything from scratch on each token.
The trade-off: the cache grows linearly with context length and model depth. Qwen3.5-4B at 262K tokens stores roughly 8 GB of KV data in 16-bit precision. That is roughly twice the size of the Q8 weights themselves. The KV cache, not the model, is what blows past your VRAM.
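The linear growth can be made concrete with the standard sizing formula for a transformer KV cache: two tensors (keys and values) per caching layer, each holding `kvHeads × headDim` values per token. The sketch below is illustrative only. The shape values (`kvLayers`, `kvHeads`, `headDim`) are assumptions picked so the result lands on the roughly 8 GB quoted above; they are not confirmed Qwen3.5-4B specifications.

```typescript
// Shape of the attention stack that determines KV cache size.
// These values are illustrative assumptions, not confirmed model specs.
interface KvShape {
  kvLayers: number; // layers that keep a KV cache
  kvHeads: number;  // key/value heads per layer
  headDim: number;  // dimension of each head
}

// Bytes of KV cache needed for `tokens` tokens at `bitsPerValue` precision.
// The factor 2 accounts for storing both a key and a value per position.
function kvCacheBytes(shape: KvShape, tokens: number, bitsPerValue: number): number {
  const valuesPerToken = 2 * shape.kvLayers * shape.kvHeads * shape.headDim;
  return (valuesPerToken * tokens * bitsPerValue) / 8;
}

const assumedShape: KvShape = { kvLayers: 8, kvHeads: 4, headDim: 256 };
const fullContext = 262_144; // "262K" tokens

const fp16Bytes = kvCacheBytes(assumedShape, fullContext, 16);
const halfContextBytes = kvCacheBytes(assumedShape, fullContext / 2, 16);

console.log((fp16Bytes / 2 ** 30).toFixed(1), "GiB at 16 bits, full context");
console.log((halfContextBytes / 2 ** 30).toFixed(1), "GiB at 16 bits, half context");
```

Halving the context halves the cache, and so does halving the bits per value, which is the lever KV-cache quantization pulls.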
Local AI has two memory walls. First, the model weights have to fit on your device: too big and you can’t run it at all. Once they fit, the KV cache becomes the second wall: it caps how much context you can hold. TurboQuant attacks the second wall.
What changes for your app in SDK 0.12.0
TurboQuant compresses the KV cache from 16 bits down to roughly 3 bits per value while preserving accuracy across long-context benchmarks. The practical effect:
| GPU | VRAM | KV budget (VRAM − 4.3 GB) | Context before 0.12.0 | With TurboQuant |
| --- | --- | --- | --- | --- |
| RTX 5060 | 8 GB | 3.7 GB | ~120K tokens | 262K tokens (full) |
| RTX 5070 | 12 GB | 7.7 GB | ~250K tokens | 262K tokens (full) |
| RTX 5090 | 32 GB | 27.7 GB | ~262K tokens (already full) | 262K tokens |
| AMD Ryzen AI Max+ 395 / Strix Halo | 128 GB | 123.7 GB | ~262K tokens (already full) | 262K tokens |
Estimates assume a 4B model at Q8 quantization. Real ceilings depend on the model size and other memory consumers on the device.
Note: These figures do not account for the computation buffer (temporary tensors allocated during inference), so they are approximate estimates.
The table above shows how all hardware benefits from TurboQuant:
Devices with low VRAM can now reach a larger maximum context size
Devices with high VRAM use less total memory because the KV cache takes up a smaller share of the budget
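The table's estimates can be approximated with a few lines of arithmetic. The sketch below uses only figures quoted in this post: about 8 GB of KV data at 262K tokens in 16-bit precision, 4.3 GB reserved for the Q8 weights, and roughly 3 bits per value with TurboQuant. The function name and constants are illustrative, and like the table it ignores the computation buffer, so treat the output as a rough estimate.

```typescript
const GB = 1e9;
const FULL_CONTEXT = 262_144;   // model's maximum context ("262K")
const KV_FP16_AT_FULL = 8 * GB; // ~8 GB of KV data at 16 bits, full context
const WEIGHTS = 4.3 * GB;       // 4B model at Q8, as in the table header

// Tokens of context that fit in the KV budget at a given precision,
// capped at the model's maximum context.
function maxContextTokens(vramGb: number, bitsPerValue: number): number {
  const budget = vramGb * GB - WEIGHTS;
  const bytesPerToken = (KV_FP16_AT_FULL / FULL_CONTEXT) * (bitsPerValue / 16);
  return Math.min(FULL_CONTEXT, Math.floor(budget / bytesPerToken));
}

for (const vramGb of [8, 12, 32, 128]) {
  const before = maxContextTokens(vramGb, 16);
  const after = maxContextTokens(vramGb, 3);
  console.log(`${vramGb} GB VRAM: ${before} tokens at 16 bits, ${after} at ~3 bits`);
}
```

At 16 bits the 8 GB and 12 GB cards land close to the table's ~120K and ~250K figures. At roughly 3 bits the full 262K-token cache shrinks to about 1.5 GB, so every row is limited by the model's maximum context rather than by VRAM.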
What this unlocks in practice:
Local coding assistant with full codebase in context
Long-document analysis (legal contracts, research papers, codebases)
Local 4B+ model with 200K+ context on a single consumer-grade GPU
On-prem enterprise inference for HIPAA / GDPR workloads on a dedicated AI server
How to use TurboQuant in your app
Update to SDK 0.12.0:
npm install @qvac/sdk@latest
To enable TurboQuant on any model you load, pass the turboquant flag in your parameters. That is it.
Currently, TurboQuant is supported only on AMD and NVIDIA GPUs. Support for iOS, Android, and Apple Silicon is coming next.
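Because the flag is opt-in and limited to certain GPUs, an app that ships to several platforms may want to set it conditionally. The following is a minimal sketch of that idea, not the SDK's API: the `Backend` labels, the `LoadParams` shape, and the `withTurboQuant` helper are all hypothetical, and only the flag name `turboquant` is taken from this post. Check the SDK documentation for the real parameter names and load call.

```typescript
// Hypothetical backend labels for this sketch; not SDK identifiers.
type Backend = "nvidia" | "amd" | "apple-silicon" | "ios" | "android";

// Backends this post lists as supported in SDK 0.12.0.
const TURBOQUANT_BACKENDS: ReadonlySet<Backend> = new Set<Backend>(["nvidia", "amd"]);

// Hypothetical parameter bag; only the `turboquant` flag name comes from the post.
interface LoadParams {
  modelPath: string;
  turboquant?: boolean;
}

// Adds the opt-in flag only where the post says it is supported,
// so other platforms keep the default KV cache behavior.
function withTurboQuant(params: LoadParams, backend: Backend): LoadParams {
  if (!TURBOQUANT_BACKENDS.has(backend)) return params;
  return { ...params, turboquant: true };
}

const base: LoadParams = { modelPath: "./model.gguf" }; // placeholder path
console.log(withTurboQuant(base, "nvidia"));
console.log(withTurboQuant(base, "ios"));
```

Keeping the decision in one helper means the supported set can be widened in a single place once more platforms gain support.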
Why this matters
The context ceiling has, in practice, been an access ceiling. If you could afford a cloud API, you had no KV cache problem. Server farms have effectively unlimited memory. Long context was a feature you bought.
If you wanted to run AI on a device you actually own, where your data stays local, you hit the wall.
TurboQuant narrows that gap. The same model files you already use gain roughly five times more KV-cache headroom on the device you already own. More devices become capable of running real workloads. More people get direct access to intelligence that lives on their own hardware, not in a data center they will never see.
Frequently Asked Questions
What is TurboQuant?
TurboQuant is a KV-cache quantization algorithm published by Google Research at ICLR 2026 (Zandieh et al.). It reduces the running context memory of an LLM by up to 5x with no measurable accuracy loss across major long-context benchmarks.
Does TurboQuant reduce model accuracy?
No. The QVAC team validated TurboQuant across five long-context benchmarks (LongBench, ZeroSCROLLS, RULER, L-Eval, NIAH) with Llama, Qwen, and Mistral models. Almost no accuracy loss was reported across all five.
Do I need to retrain my model to use TurboQuant?
No. TurboQuant is data-oblivious. It works with any standard transformer loaded as GGUF in the QVAC SDK without retraining, calibration, or fine-tuning.
Is TurboQuant automatic in SDK 0.12.0 or do I have to opt in?
Opt-in. Pass the TurboQuant flag when you load the model. Without it, the default KV cache behavior is used.
Does TurboQuant compress my model file?
No. It only compresses the KV cache during inference. Your GGUF file size is unchanged. The compression happens in memory at runtime.
Get started
Update the QVAC SDK:
npm install...