Tether's AI arm turned a Google Research paper into production code that cuts the memory large language models need during long sessions by as much as 5x.

Tether's AI Research Group on Monday released an open-source implementation of TurboQuant, a Google Research algorithm that compresses the key-value (KV) cache, the working memory transformer models use to track context, by up to 5x without retraining or fine-tuning existing models. The aim is to make it feasible to run capable AI on laptops, phones, and edge devices rather than routing every task through cloud data centers.
"If long context AI only works inside the largest data centers, then AI will be shaped by whoever owns the most hardware," Paolo Ardoino, chief executive officer of Tether, said in a statement. "TurboQuant changes what local AI can do by making memory less of a wall."
The KV cache is the bottleneck that forces long AI sessions into the cloud. At roughly 262,000 tokens (the equivalent of several hours of conversation or a few hundred pages of text), the KV cache for a 4-billion-parameter model consumes about 8 gigabytes of memory on its own. Four concurrent sessions at that length push the cache to about 32 GB before accounting for the model weights themselves. At the full 5x ratio, TurboQuant compresses that footprint to roughly 1.6 GB per session, or 6.4 GB for four, bringing the total within reach of consumer hardware with 16 GB to 32 GB of unified memory.
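The arithmetic behind those figures can be checked with a short Python sketch. It uses only the numbers quoted in this article (8 GB per session, a best-case 5x ratio, about 262,000 tokens); the function name is hypothetical and is not part of the QVAC SDK.

```python
def compressed_cache_gb(cache_gb: float, ratio: float, sessions: int = 1) -> float:
    """Total KV-cache memory in GB for `sessions` sessions after dividing by `ratio`."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return cache_gb * sessions / ratio


# Figures quoted in the article, not measured values.
PER_SESSION_GB = 8.0   # KV cache for one session at ~262,000 tokens
RATIO = 5.0            # best case of the "up to 5x" claim
TOKENS = 262_000

for sessions in (1, 4):
    before = compressed_cache_gb(PER_SESSION_GB, 1.0, sessions)   # ratio 1 = uncompressed
    after = compressed_cache_gb(PER_SESSION_GB, RATIO, sessions)
    print(f"{sessions} session(s): {before:.1f} GB -> {after:.1f} GB")

# Implied uncompressed cost per token, taking 1 GB as 10**9 bytes.
bytes_per_token = PER_SESSION_GB * 1e9 / TOKENS
print(f"implied uncompressed cost: {bytes_per_token / 1e3:.1f} kB per token")
```

Because the claim is "up to" 5x, these outputs are a lower bound on memory use; a smaller ratio on a given model would leave a proportionally larger cache.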
The release is part of QVAC SDK 0.12.0, the latest version of Tether's broader platform for decentralized AI. The same update also added text-to-video generation and robot control capabilities. The SDK includes a full quantization pipeline, adapters for common inference frameworks, documentation, and workload-tuned deployment profiles, so developers can apply TurboQuant to existing models as they are rather than starting from scratch.
Why memory matters for the AI stack
The memory constraint has been one of the structural barriers keeping AI workloads concentrated in hyperscale data centers. A model that needs 16 GB of working memory for its KV cache alone cannot run on a MacBook Air or a mid-range Android phone. Cutting that to 3.2 GB changes the deployment math entirely, opening the door for on-device assistants that can process hundred-page documents, retain full project context, and handle private data locally.
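Why the cache grows so large follows from the standard sizing rule for transformer KV caches: every token stores one key and one value vector per layer and per KV head. The sketch below applies that rule to a hypothetical model configuration, chosen only so the uncompressed total lands at 16 GiB. The layer count, head count, head dimension, and the 3.2-bit width are all illustrative assumptions, not figures from Tether or Google.

```python
def kv_cache_bytes(tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bits_per_value: float = 16) -> float:
    """Approximate KV-cache size in bytes for one session."""
    # Factor of 2: one key vector and one value vector per token, layer, and KV head.
    stored_values = 2 * n_layers * n_kv_heads * head_dim * tokens
    return stored_values * bits_per_value / 8


# Hypothetical configuration (assumption, not a real model's published shape).
CONFIG = dict(n_layers=32, n_kv_heads=4, head_dim=128)
TOKENS = 262_144
GIB = 2**30

full = kv_cache_bytes(TOKENS, **CONFIG)                          # 16-bit baseline
small = kv_cache_bytes(TOKENS, **CONFIG, bits_per_value=3.2)     # illustrative 5x smaller width

print(f"16-bit cache:  {full / GIB:.1f} GiB")
print(f"3.2-bit cache: {small / GIB:.1f} GiB")
print(f"ratio: {full / small:.1f}x")
```

The cache size is linear in the token count, which is why long sessions rather than model weights become the limiting factor. The article's figures do not say whether "GB" means 10^9 or 2^30 bytes; the sketch uses GiB only because it makes the example come out round.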
Tether's implementation builds on several prior compression techniques the company has stacked into QVAC, including PolarQuant and Quantized Johnson-Lindenstrauss. Each targets a different part of the efficiency problem. TurboQuant is the latest layer, adapted from a Google Research paper published March 24.
The open-source nature of the release is a strategic play to grow the ecosystem around QVAC and position Tether's platform as the default toolkit for decentralized AI. Any developer can grab the code and integrate it into their inference pipeline immediately. That puts Tether in direct competition with established local AI frameworks like llama.cpp and Ollama, as well as cloud providers whose business models depend on routing inference through their data centers.
What this means for investors
Tether, best known as the issuer of the $140 billion USDT stablecoin, has been expanding aggressively into AI infrastructure. The company's thesis is that the next phase of AI will be defined by software efficiency and portability rather than raw compute scale. Independent benchmarks have not yet been published, so the 5x compression claim remains untested across different model architectures and context lengths. If it holds, it could accelerate the shift of inference workloads from centralized cloud services to local devices, potentially squeezing revenue growth for cloud GPU providers while expanding the addressable market for edge AI hardware.
This article is for informational purposes only and does not constitute investment advice.