OSCAR: Together AI open-sources a 2-bit KV cache quantization that cuts memory by 8x
🔎 The LLM serving bottleneck is no longer compute, it's memory
Since late 2025, serving language models in long context has hit a physical wall. GPUs have underutilized teraflops of compute, but sorely lack the memory to store the keys and values (KV cache) of attention over 128K, 256K, or 1M token windows.
Together AI just released OSCAR, a 2-bit KV cache quantization system that divides the memory footprint of this cache by 8, while maintaining near-identical accuracy to BF16 on long-range benchmarks. The paper is published on arXiv (May 2026), and the code is open-source.
This is a technical breakthrough. Until now, KV cache quantization was capped at 4 bits (KVQuant, for example), with measurable losses beyond 32K tokens. Moving to 2 bits without collapsing quality means solving the equation that makes long-context economically viable in production.
The essentials
- OSCAR quantizes the KV cache to INT2, dividing cache memory by 8 compared to BF16, which makes it possible to serve models like Claude Opus 4.7 or GPT-5.5 over 256K token windows with realistic GPU configurations.
- The technical key: offline spectral rotations that eliminate channel-wise outliers before quantization, a problem that made INT2 KV cache impossible until now.
- RULER and Needle in a Haystack benchmarks show near-BF16 accuracy, surpassing Google's QTIP on long-context scenarios, with negligible compute overhead at inference.
Recommended tools
| Tool | Main usage | Price (May 2026, check on together.ai) | Ideal for |
|---|---|---|---|
| OSCAR (GitHub) | KV cache INT2 quantization | Open-source (Apache 2.0) | Long-context LLM serving |
| Together AI Platform | Model serving | Pay-as-you-go (pay-per-token) | Production deployment |
| vLLM | Inference engine | Open-source | OSCAR backend integration |
Why the KV cache is the real problem of long-context
A model like GPT-5.5 in BF16 consumes about 2 bytes per weight parameter. But for each token in the context, every layer's KV cache also stores two vectors (key and value) whose size depends on the head dimension and the number of key-value heads.
For a 70B parameter model with a 128K token window, the KV cache alone can exceed 16 GB. At 256K tokens, we exceed the memory of an H100. GPU compute is largely underutilized: memory bandwidth is the bottleneck, not the FLOPs.
This is exactly the problem that the best LLMs for research encounter when ingesting massive documents. The model must store the entire context in memory before it can answer, and this memory is expensive.
Weight quantization (INT4, INT8) is mature today. But KV cache quantization is a fundamentally different problem: it happens online, during prefill, and does not benefit from a static calibration dataset.
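The sizing argument above can be sketched in a few lines. This is a back-of-the-envelope estimate, not a figure from the paper: the configuration (80 layers, 8 key-value heads, head dimension 128) is a hypothetical 70B-class layout with grouped-query attention, and the function name is illustrative. Real per-request numbers depend strongly on these values.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Estimate KV cache size for one request, ignoring quantization metadata."""
    # Factor 2: one key vector and one value vector per token, per layer, per KV head.
    elements = 2 * layers * kv_heads * head_dim * tokens
    return elements * bits // 8


# Hypothetical 70B-class configuration (assumption, not taken from the article).
CONFIG = dict(layers=80, kv_heads=8, head_dim=128)
TOKENS_128K = 128 * 1024

bf16_bytes = kv_cache_bytes(TOKENS_128K, bits=16, **CONFIG)
int2_bytes = kv_cache_bytes(TOKENS_128K, bits=2, **CONFIG)

print(f"BF16: {bf16_bytes / 2**30:.1f} GiB per request")
print(f"INT2: {int2_bytes / 2**30:.1f} GiB per request")
print(f"Ratio: {bf16_bytes // int2_bytes}x")
```

Because the cache grows linearly with the number of tokens, doubling the window doubles every figure, while the weights stay fixed.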
What makes INT2 KV cache impossible without OSCAR
2-bit quantization means each value only has 4 possible levels. The quantization error is therefore structurally massive. But the real problem isn't the value range: it's the outliers.
In the KV cache, some channels feature outliers up to 100x larger than the median. These outliers are channel-wise, meaning a classic uniform quantizer is forced to use its entire dynamic range to accommodate these extremes.
Result: 99% of normal values are crushed into 1 or 2 quantization levels, losing almost all information. This is why KVQuant (2024) capped at 4 bits and why 2-bit approaches produced incoherent responses beyond 8K tokens.
Google's QTIP (2025) partially solved the problem by using per-head quantization and a learned codebook, but at the cost of non-negligible computational overhead and significant integration complexity.
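To make the outlier problem concrete, here is a minimal sketch of a naive min-max uniform quantizer. It is a generic textbook quantizer, not OSCAR's or QTIP's method, and the sample values are invented for illustration.

```python
def quantize_uniform(values, bits=2):
    """Naive min-max uniform quantizer: returns integer codes plus scale and offset."""
    levels = 2 ** bits - 1                 # 2 bits give codes 0..3
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0      # guard against a constant input
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [lo + c * scale for c in codes]


normal = [-0.9, -0.3, 0.3, 0.9]
with_outlier = normal + [100.0]            # one channel about 100x larger

codes_normal, _, _ = quantize_uniform(normal)
codes_outlier, scale, lo = quantize_uniform(with_outlier)

print(codes_normal)    # the four values use all four levels
print(codes_outlier)   # the normal values collapse onto a single level
print(dequantize(codes_outlier, scale, lo))
```

With the outlier present, the quantizer spends its whole range reaching 100, so every normal value lands on the same code and its information is lost.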
How OSCAR solves the outlier problem with spectral rotations
OSCAR's fundamental insight is simple but elegant: instead of quantizing KV vectors directly, we project them into a space where outliers no longer exist, and then quantize in this space.
Offline spectral rotations
OSCAR applies an orthogonal rotation to the key and value vectors, computed offline from a calibration dataset. This rotation is constructed via an SVD (Singular Value Decomposition) of the attention projection matrices.
The goal: redistribute the energy of the outliers across all dimensions, so that each channel has a comparable variance. Once the vectors are rotated, a uniform INT2 quantizer works correctly because there are no longer any dominant channels.
The rotation is computed offline, so inference pays no calibration cost. It is stored as a lightweight matrix (a few KB per layer) and applied as a simple matrix multiplication during prefill, an operation already vectorized in existing CUDA kernels.
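The effect of an orthogonal rotation on outliers can be illustrated with a stand-in. The sketch below uses a normalized Walsh-Hadamard transform, which is NOT the SVD-derived rotation described for OSCAR; it is swapped in only because it is a simple orthogonal matrix that needs no calibration data. The vectors and function names are invented for the example.

```python
import math
import statistics


def hadamard_rotate(x):
    """Apply a normalized Walsh-Hadamard transform (an orthogonal rotation)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    y = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def outlier_ratio(x):
    """Largest magnitude divided by the median magnitude."""
    mags = [abs(v) for v in x]
    return max(mags) / statistics.median(mags)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


key = [100.0, 1.0, -1.0, 0.5, -0.5, 1.0, -1.0, 0.5]   # channel 0 is an outlier
query = [0.5, -1.0, 2.0, 0.0, 1.0, -0.5, 0.25, 1.5]

rotated_key = hadamard_rotate(key)
rotated_query = hadamard_rotate(query)

print(outlier_ratio(key), outlier_ratio(rotated_key))
print(dot(query, key), dot(rotated_query, rotated_key))
```

Two properties matter here. The outlier's energy is spread over all channels, so a uniform quantizer no longer wastes its range on one of them. And because the rotation is orthogonal, rotating both query and key leaves their dot product, and therefore the attention score, unchanged up to floating-point error.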
Attention-aware quantization
OSCAR does not treat keys and values identically. Keys are used for the attention score computation (query × key), while values are used for the weighted combination (attention × value). The quantization error therefore has a different impact depending on whether it affects K or V.
The system adapts its quantization strategy: more effective bits for the keys (where the error propagates exponentially through the softmax), and more aggressive quantization for the values (where the error is linearly dampened by the attention weights).
This attention-aware distinction is what allows OSCAR to remain in global INT2 while preserving quality. As seen in the ranking of the best agentic LLMs, attention precision over long sequences is what differentiates a reliable agent from one that hallucinates.
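The asymmetry between key errors and value errors can be checked on a toy example. This is a sketch of the general softmax behavior, assuming a single query, scalar values, and invented numbers; it does not reproduce OSCAR's bit-allocation rule.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values):
    """Attention output for scalar values: weighted average of the values."""
    return sum(w * v for w, v in zip(softmax(scores), values))


scores = [2.0, 1.0, 0.0, -1.0]     # query-key logits (illustrative)
values = [1.0, -2.0, 0.5, 3.0]
delta = 0.1                        # size of the simulated quantization error

# Value error: each value moves by at most delta, so the output moves by at most delta.
noisy_values = [v + delta * sign for v, sign in zip(values, [1, -1, 1, -1])]
value_shift = abs(attend(scores, noisy_values) - attend(scores, values))

# Key error: delta on one logit multiplies that token's relative weight by exp(delta).
noisy_scores = [scores[0] + delta] + scores[1:]
w, w_noisy = softmax(scores), softmax(noisy_scores)
odds_ratio = (w_noisy[0] / w_noisy[1]) / (w[0] / w[1])

print(f"output shift from value error: {value_shift:.4f} (bounded by {delta})")
print(f"weight ratio change from key error: {odds_ratio:.4f} (exp(delta))")
```

The value error is averaged by weights that sum to one, whereas the key error passes through the exponential before normalization, which is the reason given above for spending more effective bits on keys.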
Benchmarks: near-BF16 on RULER and Needle in a Haystack
The results published by Together AI are solid and measured on standard domain benchmarks.
RULER (Long-Context Reasoning)
RULER is the reference benchmark for evaluating long-context understanding. It tests information retrieval, multi-step reasoning, and deduction over windows of up to 128K tokens.
| Configuration | 16K tokens | 64K tokens | 128K tokens |
|---|---|---|---|
| BF16 (baseline) | 94.2% | 91.7% | 88.3% |
| KVQuant INT4 | 93.8% | 89.1% | 82.6% |
| QTIP INT2 | 92.1% | 85.4% | 76.8% |
| OSCAR INT2 | 94.0% | 91.2% | 87.1% |
OSCAR INT2 loses only 0.2 points at 16K tokens and 1.2 points at 128K tokens compared to BF16. QTIP collapses at 128K (-11.5 points), and even KVQuant INT4 loses 5.7 points.
Needle in a Haystack
The "needle in a haystack" test places a unique fact at a random position in a long context and checks whether the model retrieves it.
OSCAR INT2 maintains a success rate of over 99% up to 128K tokens, compared to 97% for QTIP and 99.5% for BF16. Degradation only becomes significant beyond 256K tokens, a window where BF16 itself begins to weaken on certain models.
These figures are consistent with the performance observed on the best LLMs in long contexts: attention quality is the limiting factor, not model capacity.
OSCAR vs QTIP vs academic OScaR: three different approaches
There is a naming confusion to clarify. OScaR (with a lowercase "c") is also the name of a computer algebra system developed as part of university projects (cf. the papers on number theory in OSCAR, monomial bases in Lie theory, and matroids in OSCAR). It has nothing to do with KV cache quantization.
The relevant comparison is between OSCAR (Together AI), QTIP (Google), and academic approaches to KV cache quantization.
| Criterion | OSCAR (Together AI) | QTIP (Google) | KVQuant (UC Berkeley) |
|---|---|---|---|
| Target bits | INT2 | INT2 | INT4 |
| Approach | Offline spectral rotations | Learned per-head codebooks | Mixed-precision quantization |
| Inference cost | Negligible (rotation = matmul) | Moderate (lookup tables) | Low |
| Quality at 128K | -1.2 points vs BF16 | -11.5 points vs BF16 | -5.7 points vs BF16 |
| Integration complexity | Low (vLLM plugin) | High | Moderate |
| Open-source | Yes (Apache 2.0) | No | Yes |
OSCAR wins on practically all axes. Offline spectral rotation is an elegant solution that shifts the problem from inference to a one-time preprocessing step, a virtually free trade-off for production deployments where the served model doesn't change every few hours.
For teams that want to test these approaches locally with the best local LLMs, OSCAR's integration into vLLM makes experimentation accessible without cloud infrastructure.
Concrete impact on serving cost
Let's take a real-world case: serving GPT-5.4 Pro (91 points in the overall ranking) over a 128K token context window with a batch size of 32 concurrent requests.
KV cache memory (estimation per request, 70B-equivalent model)
| Format | KV memory/request | 32 requests | Ratio |
|---|---|---|---|
| BF16 | ~4.8 GB | ~153 GB | 1x |
| INT8 | ~2.4 GB | ~76 GB | 2x |
| INT4 | ~1.2 GB | ~38 GB | 4x |
| INT2 (OSCAR) | ~0.6 GB | ~19 GB | 8x |
Going from 153 GB to 19 GB of KV cache means going from 8 H100s to 2 H100s for the same workload. At $2/hour per H100, that's a saving of ~$12/hour, or ~$8,640/month running continuously.
For hosts serving thousands of requests per minute, this reduction transforms the economic viability of long-context. A provider like Hostinger could offer LLM serving at radically lower prices by integrating OSCAR into its stack.
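The arithmetic behind these figures is easy to reproduce. The sketch below takes the article's own inputs (about 4.8 GB of BF16 KV cache per request, 32 requests, 8 versus 2 H100s, $2 per GPU-hour) and assumes a 30-day month of continuous operation; the variable names are illustrative.

```python
BF16_GB_PER_REQUEST = 4.8   # article's estimate for a 70B-equivalent model at 128K
REQUESTS = 32
PRICE_PER_GPU_HOUR = 2.0    # USD, as quoted in the article
HOURS_PER_MONTH = 24 * 30   # assumption: continuous use over a 30-day month


def kv_total_gb(bits, requests=REQUESTS):
    """Total KV cache for a batch, scaling the BF16 estimate by the bit width."""
    return BF16_GB_PER_REQUEST * bits / 16 * requests


for name, bits in [("BF16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    print(f"{name}: {kv_total_gb(bits):.1f} GB for {REQUESTS} requests")

gpus_before, gpus_after = 8, 2   # GPU counts stated in the article
hourly_saving = (gpus_before - gpus_after) * PRICE_PER_GPU_HOUR
monthly_saving = hourly_saving * HOURS_PER_MONTH

print(f"Saving: ${hourly_saving:.0f}/hour, ${monthly_saving:,.0f}/month")
```

Changing the batch size, the GPU price, or the per-request estimate updates the whole chain, which makes it easy to test the claim against your own workload.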
What this changes for AI agents
Agents that search, code, and create over the long term accumulate context throughout their iterations. An agent like ByteDance's DeerFlow can generate tens of thousands of intermediate reasoning tokens before producing its final output.
With OSCAR, the memory required to maintain this intermediate context is divided by 8. This means either agents capable of reasoning over longer windows, or more concurrent agents on the same hardware.
The connection with open-source search agents is direct: these agents ingest entire web pages, store them in context, and iterate. The KV cache is their main memory cost driver.
More surprisingly, some research argues that grep, rather than vector search, is all AI agents need. But whether the agent uses grep or a retriever, the problem of the KV cache in long context remains identical: search results must be stored in memory to reason over them.
Limitations of OSCAR
Despite its impressive results, OSCAR has limitations that the paper honestly mentions.
Sensitivity to the calibration dataset
Spectral rotations are calculated on a calibration dataset. If the production data distribution differs significantly (for example, switching from French to Arabic, or from prose to very dense code), the effectiveness of the rotation may degrade. Together AI recommends recalibrating for significantly different distributions.
This is an important point for the best French LLMs, whose attentional distributions in French may differ enough from English to justify a recalibration.
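One hypothetical way to notice such a shift is to compare how evenly variance is spread across channels on calibration data and on production samples. This diagnostic is an assumption for illustration, not something described in the paper: the function name, the threshold, and the toy data are all invented.

```python
import statistics


def channel_imbalance(vectors):
    """Largest per-channel variance divided by the median per-channel variance."""
    channels = list(zip(*vectors))                     # transpose: one tuple per channel
    variances = [statistics.pvariance(c) for c in channels]
    return max(variances) / statistics.median(variances)


# Toy rotated key vectors: balanced on the calibration set.
calibration = [
    [1.0, -1.0, 0.5, -0.5],
    [-1.0, 1.0, -0.5, 0.5],
    [0.5, -0.5, 1.0, -1.0],
    [-0.5, 0.5, -1.0, 1.0],
]
# Production samples where the last channel has drifted to 10x the amplitude.
production = [row[:3] + [row[3] * 10] for row in calibration]

RECALIBRATION_THRESHOLD = 10.0   # hypothetical cut-off

for name, data in [("calibration", calibration), ("production", production)]:
    imbalance = channel_imbalance(data)
    flag = imbalance > RECALIBRATION_THRESHOLD
    print(f"{name}: imbalance={imbalance:.1f}, recalibrate={flag}")
```

A ratio close to 1 means the rotation still equalizes the channels; a large ratio suggests that a dominant channel has reappeared and that the rotation should be recomputed on data closer to production.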
Degradation beyond 256K tokens
Benchmarks show more marked degradation beyond 256K tokens, where OSCAR INT2 loses about 3-4% compared to BF16. The paper suggests that spectral rotations become less effective when the sequence length greatly exceeds that of the calibration dataset.
Non-standard models
OSCAR was validated on classic transformer architectures (GPT, LLaMA, DeepSeek). Architectures with non-standard attention mechanisms (linear attention, mamba-hybrid) are not covered. The local inference engine DS4 targeting DeepSeek V4, for example, would require specific adaptation.
Practical integration: how to use OSCAR
OSCAR is designed to integrate into vLLM, the most widely used open-source serving engine. The workflow consists of two phases.
Phase 1: Offline calibration
Calibration involves passing a representative dataset (typically 512 to 2048 sequences) through the model to compute the SVD rotation matrices per layer. This operation takes between 10 and 30 minutes on a single GPU for a 70B model, and is only required once per model configuration.
Phase 2: Serving with quantized KV cache
Once the rotations are computed, they are loaded at vLLM server startup. Serving operates normally, with an additional rotation step during prefill. The measured overhead is less than 3% on the total prefill time, because the rotation is fused with the existing QKV projection in the CUDA kernels.
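To see where the 8x comes from at the storage level, here is a minimal sketch of packing 2-bit codes four to a byte. The bit layout (lowest bits first) and the function names are assumptions for illustration; the actual layout used by OSCAR's kernels is not described here, and real caches also store scales or offsets, which slightly reduces the effective ratio.

```python
def pack_int2(codes):
    """Pack 2-bit codes (0..3) four to a byte, lowest bits first."""
    if len(codes) % 4:
        raise ValueError("number of codes must be a multiple of 4")
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        for slot, code in enumerate(codes[i:i + 4]):
            if not 0 <= code <= 3:
                raise ValueError("codes must fit in 2 bits")
            byte |= code << (2 * slot)
        packed.append(byte)
    return bytes(packed)


def unpack_int2(packed):
    """Inverse of pack_int2: recover the list of 2-bit codes."""
    return [(byte >> (2 * slot)) & 0b11 for byte in packed for slot in range(4)]


codes = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_int2(codes)

bf16_bytes = len(codes) * 2          # 2 bytes per value in BF16
print(len(packed), "bytes packed vs", bf16_bytes, "bytes in BF16")
print(unpack_int2(packed) == codes)
```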
For those who prefer to avoid the complexity of vLLM, local tools such as LM Studio and Ollama do not yet natively support OSCAR, but the kernels are simple enough to be integrated into open-source AI agents with Ollama in the medium term.
Positioning in the quantization ecosystem
OSCAR does not replace weight quantization. It is complementary. An optimal production setup combines INT4 weights (via GPTQ or AWQ) with an INT2 KV cache (via OSCAR).
The full quantization stack looks like this:
| Component | Format | Memory | Tool |
|---|---|---|---|
| Model weights | INT4 | 35 GB (70B) | AWQ / GPTQ |
| KV cache | INT2 (OSCAR) | 0.6 GB/req | OSCAR |
| Activations | BF16 | Negligible | Native |
| Total per request | — | ~0.6 GB | — |
Compare this stack with the best free LLMs running in full BF16 on the provider's cloud: the difference in serving cost is an order of magnitude.
Historical perspective: from mathematical OScaR to quantization OSCAR
The name OSCAR is an acronym in Together AI's case. The coincidence with the OSCAR algebra system is nonetheless worth noting.
The academic OSCAR project is an open-source computer algebra system that implements algorithms in number theory, Lie theory, and algebraic geometry. Papers published in 2024 describe number-theory computations in OSCAR, monomial bases in Lie theory via OSCAR, and matroids in OSCAR.
There is also a 1994 paper on Oscar Klein and gauge theory, which is unrelated to either project. And the astrophysics project CoReCon on reionization constraints is in a completely different field.
The lesson: when you search for "OSCAR" on arXiv, specify your field. Together AI might have chosen a less ambiguous name, but the technical acronym is justified by the mechanism (Orthogonal Spectral Channel-Aware Rotation).
❌ Common mistakes
Mistake 1: Confusing weight quantization and KV cache quantization
These are two distinct problems. Quantizing weights reduces the model's storage memory (fixed). Quantizing the KV cache reduces the memory per request (dynamic, scales with context length). OSCAR only does KV cache. Combining both is necessary for optimal deployment.
Mistake 2: Thinking that INT2 = INT2 everywhere
Not all 2-bit quantizations are created equal. A naive quantizer on the KV cache produces unusable results. OSCAR achieves viable INT2 precisely because spectral rotations transform the space before quantization. Comparing "OSCAR INT2" with "generic INT2" makes no sense.
Mistake 3: Ignoring the calibration dataset
Using spectral rotations computed on English text to serve a model in Chinese or Arabic exposes you to a silent degradation in quality. Calibration is a mandatory step, not a detail.
❓ Frequently Asked Questions
Does OSCAR work with code models like Claude Opus 4.7 or GPT-5.3 Codex?
The paper validates OSCAR on standard decoder-only architectures. Code models have different attentional distributions (more local focus, less global spread), which may require recalibration. Preliminary results are encouraging but not yet published for purely code cases.
What is the latency overhead?
The measured overhead is less than 3% on prefill time and zero on token-by-token generation (decode). The rotation is fused with the existing QKV projection, so there is no additional kernel.
Does OSCAR replace long-window architectures like Mamba or Ring Attention?
No. OSCAR reduces the memory of the existing KV cache. KV cache-free architectures (Mamba, RWKV) eliminate the problem differently. But for the standard transformers that dominate the rankings of the best LLMs for coding, OSCAR is the most pragmatic solution.
Can OSCAR be used with LLMs for agents like those in the Hermes/OpenClaw list?
Yes, provided the underlying model is a standard transformer with a classic attention mechanism. Agents that accumulate context over long sessions are actually the first to benefit from the memory reduction.
✅ Conclusion
OSCAR is the first INT2 KV cache quantization solution that holds up in production, with a measured loss of only 1.2 points at 128K tokens on RULER. By dividing cache memory by 8, Together AI makes long-context LLM serving economically viable without architectural compromises. The code is open-source, the vLLM integration is simple, and the impact on serving cost is immediate. If you serve LLMs in production with context windows larger than 32K tokens, OSCAR is not optional: it is a necessity.
Sources: MarkTechPost (Together AI open-sources OSCAR); OSCAR: Attention-Aware 2-bit KV Cache Quantization (arXiv, May 2026); Together AI.