Subquadratic comes out of stealth with SubQ: 12 million context tokens, the end of quadratic attention?

LLM & Models 🟢 Beginner ⏱️ 15 min read 📅 2026-05-09

🔎 12 million tokens, 4 people, 29 million dollars

On May 5, 2026, a four-person Miami-based startup burst onto the AI scene with a bombshell announcement: a language model capable of handling 12 million context tokens, with an architecture that would render quadratic attention obsolete. Subquadratic, co-founded by Alexander Whedon, former Head of Generative AI at Meta, and CEO Justin Dangel, raised $29M in seed funding at a $500M valuation.

AI hadn't seen such a claimed gap since the arrival of Mixture of Experts architectures. Except here, the promise is mathematical: replacing the quadratic scaling of attention with quasi-linear scaling. If true, it doesn't just lower the bill. It changes the very nature of what an LLM can do with massive data.

The key points

SubQ is an LLM with 12M functional context tokens, based on an SSA (Subquadratic Selective Attention) architecture that claims linear scaling in compute and memory.
The startup Subquadratic raised $29M in seed funding (May 2026) at a $500M valuation. It was co-founded by Alexander Whedon (ex-Meta) and Justin Dangel.
The claims: 52x faster in prefill at 1M tokens vs FlashAttention 2 on B200, up to 300x cheaper than GPT and Claude on long-context tasks.
Controversy: researchers point out that SubQ resembles a sparse finetune of Kimi or DeepSeek. Independent benchmarks are lacking.
Products available at launch: API, SubQ Code (coding agent) and SubQ Search (deep research).

Recommended tools

Tool	Main usage	Price (May 2026, check on subq.ai)	Ideal for
SubQ API	Long-context LLM via API	Pay-per-token, ~$8 for a full long-context test	Massive document processing, research agents
SubQ Search	Deep research with 12M tokens context	Pay-per-query	Large corpus analysis
SubQ Code	Coding agent with extended context	Pay-per-session	Entire codebases in context

The problem: why quadratic attention is a ceiling

Every developer who has worked with LLMs knows this: the more you increase the context window, the more the bill explodes. This isn't a bug; it's a fundamental mathematical property of transformers.

Standard attention computes an N×N matrix of scores, where N is the number of tokens. Double the context and the compute is multiplied by four. Triple it and it is multiplied by nine. This is quadratic scaling.

In practice, this means that Claude Opus 4.7 or GPT-5.5 announce context windows of 1 to 2M tokens, but the cost of filling them completely is prohibitive. And above all, performance collapses well before the theoretical limit — the model "forgets" the middle of the context, a phenomenon documented in the literature since 2023.

Subquadratic attacks this problem directly, not by optimizing existing attention but by changing its nature. Hence the name: subquadratic, meaning compute that grows quasi-linearly with context length rather than quadratically.
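To make the difference concrete, here is a minimal sketch comparing how compute grows under a quadratic cost model and under an idealized linear one. The 128K baseline and the pure power-law cost model are assumptions chosen for illustration; real costs also include non-attention layers and memory effects.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key scores computed by full (quadratic) attention."""
    return n_tokens * n_tokens


def growth_factor(base_tokens: int, new_tokens: int, exponent: float) -> float:
    """Compute multiplier when context grows from base_tokens to new_tokens,
    assuming cost is proportional to n ** exponent (2.0 = quadratic, 1.0 = linear)."""
    return (new_tokens / base_tokens) ** exponent


if __name__ == "__main__":
    base = 128_000  # assumed baseline context, for illustration only
    for n in (256_000, 384_000, 1_000_000, 12_000_000):
        quadratic = growth_factor(base, n, 2.0)
        linear = growth_factor(base, n, 1.0)
        print(f"{n:>10,} tokens: quadratic x{quadratic:,.1f} vs linear x{linear:,.1f}")
```

Under these assumptions, going from 128K to 12M tokens multiplies a linear cost by about 94 and a quadratic cost by about 8,800, which is why the scaling law matters more than any constant-factor optimization.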

To understand the economic stakes, see our article on LLM billing: tokens, context and costs.

SSA: how Subquadratic claims to break the ceiling

SubQ's core innovation is called SSA (Subquadratic Selective Attention). The principle is appealing in its simplicity: instead of computing attention between every pair of tokens, a sparse routing mechanism first selects the relevant tokens, then computes exact attention only on this subset.

Specifically, SSA works in two steps. First, content-dependent routing identifies, for each token, which other tokens in the context are truly relevant. Then, standard attention is computed only on these selected pairs. The result: a complexity that drops from O(N²) to quasi-linear scaling.

Subquadratic claims this is not approximation. The attention computed on the selected tokens is exact. The difference with classic sparse approaches (like OpenAI's Sparse Transformer in 2019) is that the routing is dynamic and content-dependent, not fixed in advance.
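The two steps described above can be sketched in a few lines. This is not Subquadratic's implementation, whose details are unpublished: the top-k dot-product router and the names `route_top_k` and `selective_attention` are assumptions, used only to show what "select first, then compute exact attention on the subset" means for a single query.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(query: Vector, keys: Sequence[Vector], k: int) -> List[int]:
    """Step 1 (hypothetical router): keep the k keys that score highest against the query."""
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return sorted(ranked[:k])


def selective_attention(query: Vector, keys: Sequence[Vector],
                        values: Sequence[Vector], k: int) -> List[float]:
    """Step 2: exact softmax attention, restricted to the routed subset."""
    selected = route_top_k(query, keys, k)
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in selected])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, selected)) for d in range(dim)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.9, 0.1]]
    values = [[10.0], [20.0], [30.0], [40.0]]
    print(route_top_k([1.0, 0.0], keys, k=2))
    print(selective_attention([1.0, 0.0], keys, values, k=2))
```

Note that this naive router still scores every key for every query, so the routing alone is quadratic over a full sequence. A genuinely subquadratic scheme needs a cheaper selection step, and that is exactly the part the public descriptions of SSA leave open.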

The figures put forward are impressive. According to ExplainX's technical analysis, SSA is reportedly 52x faster in prefill at 1M tokens than FlashAttention 2 on a B200. In terms of cost, BuildToThrive reports that SubQ can process the same long-context tests as a frontier model for about $8, or roughly 300x cheaper.

The official Subquadratic website specifies that the model works up to 12M tokens where other frontier models "collapse well before their announced limit of 1M". 12M tokens is about 9 million words, the equivalent of 120 books.
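The conversion behind those figures is simple arithmetic. The sketch below makes the implied ratios explicit: 0.75 words per token and 75,000 words per book are assumptions derived from the article's own numbers, not measured values, and real ratios depend on the tokenizer and the language.

```python
WORDS_PER_TOKEN = 0.75   # assumption implied by "12M tokens is about 9 million words"
WORDS_PER_BOOK = 75_000  # assumption implied by "9 million words, the equivalent of 120 books"


def tokens_to_words(tokens: int) -> float:
    """Rough word count for a given number of tokens."""
    return tokens * WORDS_PER_TOKEN


def words_to_books(words: float) -> float:
    """Rough number of average-length books for a given word count."""
    return words / WORDS_PER_BOOK


if __name__ == "__main__":
    words = tokens_to_words(12_000_000)
    print(f"{words:,.0f} words, about {words_to_books(words):.0f} books")
```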

Benchmarks: solid on retrieval, nuances needed

This is where things get complicated. Subquadratic claims that SubQ outperforms GPT-5.5 on long-context retrieval benchmarks. The RULER benchmark, the reference in the field, shows parity with Claude Opus 4.6 according to the Data Science Collective's analysis.

But the devil is in the details. Jake Cuthbertson broke down the claims and found that SubQ's MRCR v2 score (65.9) is lower than GPT-5.5's (74.0). The "outperforms" framing would be technically true only on a cherry-picked basis: SubQ is compared to GPT-5.5 on certain retrieval sub-benchmarks, not on the overall score.

In short: SubQ seems particularly strong on the specific task of retrieving information in a very long context. This is exactly what SSA is supposed to optimize. But on general reasoning tasks, it does not yet rival established frontier models.

For now, the first available version is SubQ 1M-Preview, according to Diverse Daily. The 12M version is presented as functional in research but not yet deployed at scale via the API.

This positioning as a long-context specialist places it in direct competition with the best LLMs for research like Perplexity or NotebookLM, but with a different proposition: rather than an end product, SubQ offers the foundational building block.

What this changes for AI agents

Where SubQ could truly make a difference is in the field of AI agents. An agent that needs to navigate a codebase, analyze logs, or reason over a documentary corpus needs context — and reliable context, not a summary that loses details.

Today, when you give a large context to an LLM, two problems arise. The cost explodes quadratically. And the quality degrades: the model is less precise on information in the middle of the context than at the beginning or the end. This is the famous "lost in the middle" effect.

With linear scaling, an agent could ingest an entire Git repo, a week's worth of logs, or a complete set of legal documents — for the same cost as a standard call to a frontier model. The difference isn't incremental. It's structural.

For developers building agents, context management is a central issue. Our article on conversation sessions and context with Hermes Agent details how agents currently manage these constraints. SubQ could make some of these contortions obsolete.

Similarly, context files like CLAUDE.md or AGENTS.md currently serve to inject structured information into a limited context. With 12M tokens, the very concept of a "context file" could evolve into something much richer.

SubQ is not yet listed in our comparison of the best LLMs for AI agents, but if the retrieval benchmarks are confirmed, there is little doubt that it will enter it quickly.

The team: four people, a solid pedigree

The story of Subquadratic is almost as striking as its technical claims. According to Refresh Miami, the idea was born during a bike ride in Broward between Justin Dangel and Alexander Whedon.

Whedon's profile is the startup's credibility anchor. As former Head of Generative AI at Meta, he supervised projects within one of the world's largest AI deployments. CEO Justin Dangel complements him with a business background. The investors in this $29M seed round include Justin Mateen (co-founder of Tinder) and Javier Villamizar (ex-SoftBank Vision Fund), according to SiliconANGLE.

Four people. $29M at a $500M valuation. That's aggressive, even by 2026 standards. But Whedon's pedigree and the fundamentally new nature of the approach were enough to convince investors.

FelloAI notes that the startup is based in Miami, an emerging ecosystem that doesn't yet have the technical density of the Bay Area or Paris. That detail matters when recruiting researchers who specialize in attention mechanisms.

Skepticism: 1000x, really? Researchers want proof

This is the obligatory section in any article about SubQ: the skepticism of the research community. And it is largely justified.

The claim of "1000x efficiency" circulates in the headlines, but the technical reality is more nuanced. The 52x speedup on prefill is measured against FlashAttention 2 — not FlashAttention 3, or more recent optimizations. The 300x cost reduction compares a SubQ call to a GPT-5.5 or Claude Opus call with the same number of tokens, without taking into account that these models don't need 12M tokens for most tasks.

But the real point of friction lies elsewhere. VentureBeat reports that AI engineer Will Depue noted that SubQ is "almost surely a sparse attention finetune of Kimi or DeepSeek". Whedon confirmed on X that the startup uses "open-source model weights as a starting point".

This is a common practice in the industry — DeepSeek V4 Pro and Kimi K2.6 are themselves built on open-source foundations. But it changes the narrative: SubQ is not a model trained from scratch with a new architecture. It's an existing model, finetuned with SSA on top.

Jake Cuthbertson points out that Subquadratic's marketing framing is particularly clever: by comparing by benchmark category rather than on overall scores, they can technically say "outperforms GPT-5.5" even when the composite score is lower.

AI Start News summarizes the community's position: until independent benchmarks are published by teams with no ties to Subquadratic, the 1000x efficiency claims remain unproven. If Subquadratic validates its results, it could reshape the economics of AI development. The "if" carries a lot of weight.

Available products: API, Code, Search

Beyond the research, Subquadratic launched three products as early as May 5, 2026, according to The New Stack.

SubQ API provides access to the model with a 12M token window. The pricing model is per-token, with the displayed advantage of a linear cost — you don't pay the quadratic penalty. Developers can integrate it into their own pipelines.

SubQ Search is a deep research tool that leverages the massive context window to analyze entire corpora without chunking or RAG. The idea: instead of slicing your documents into pieces and hoping retrieval works, you throw everything into the context and let the model find it itself.

SubQ Code is a coding agent that uses the extended context to work on entire codebases. This is potentially the most immediate use case: a developer giving their entire repo in context and asking for a cross-cutting refactoring.

For these last two, the comparison with the best LLMs for coding like Claude Opus 4.7 or GPT-5.3 Codex will be decisive. The massive context is an asset, but if the code generation quality is inferior, the advantage evaporates.
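Before handing an entire repository to a long-context model, it helps to check roughly how many tokens it represents. The sketch below is a back-of-the-envelope estimate, not a SubQ tool: the 4-characters-per-token ratio, the file extensions, and the function names are all assumptions, and a real tokenizer will give different counts.

```python
import os
from typing import Tuple

CHARS_PER_TOKEN = 4  # rough assumption; real tokenizers vary with language and content


def estimate_repo_tokens(root: str,
                         extensions: Tuple[str, ...] = (".py", ".md", ".txt")) -> int:
    """Estimate the token count of the text files under root, skipping .git."""
    total_chars = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune .git in place so os.walk does not descend into it.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(token_count: int, window: int = 12_000_000, reserve: int = 0) -> bool:
    """True if the estimate, plus tokens reserved for the answer, fits in the window."""
    return token_count + reserve <= window


if __name__ == "__main__":
    tokens = estimate_repo_tokens(".")
    print(f"~{tokens:,} tokens, fits in 12M window: {fits_in_context(tokens)}")
```

Passing `window=1_000_000` gives the same check against the 1M-Preview version rather than the announced 12M window.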

SubQ vs. established models: where does it stand?

To position SubQ in the landscape, we need to separate two dimensions: the model's raw quality and its long-context capability.

In terms of general quality, SubQ does not compete with the top tier. Gemini 3.1 Pro (score 92), GPT-5.5 (91), or Claude Opus 4.7 (90) dominate the overall ranking. Even in specialized research, the best LLMs on the market offer a more mature ecosystem.

For long context, it's different. No model in the general ranking functionally handles 12M tokens. Announcements of 1-2M windows are mostly theoretical; real performance degrades well before that. If SubQ lives up to its claims on RULER at 1M+ tokens, it has a real and defensible advantage.

The following table compares the different approaches:

Model	Announced context	Real functional context	Attention scaling	Long-context cost
SubQ 1M-Preview	12M tokens	~1M+ (claimed)	Linear (SSA)	~$8 full test
GPT-5.5	2M tokens	Degrades before 1M	Quadratic	Very high
Claude Opus 4.7	2M tokens	Solid up to ~500K	Quadratic	High
Gemini 3.1 Pro	1M tokens	Good up to ~500K	Quadratic	Moderate
Kimi K2.6	1M tokens	Variable	Quadratic	Moderate

The real question isn't "Is SubQ better than GPT-5.5?" but "Is SubQ better than GPT-5.5 when the context exceeds 500K tokens?" On this specific question, early benchmarks suggest yes, but independent evidence is lacking.

What needs to happen now for SubQ to establish itself

The claims are on the table. The pressure is now on Subquadratic to demonstrate, not just declare. Several steps are necessary.

First, independent benchmarks. Not those from Subquadratic, nor those from a partner blog. Evaluations conducted by researchers with no financial ties, using public protocols. MRCR v2 is a good starting point, but it needs to be complemented with real, not synthetic, long-context reasoning tests.

Second, architectural transparency. SSA is described in general terms, but the details of the sparse routing remain vague: how tokens are selected, how much overhead the routing itself adds, and how it behaves on data distributions it was not optimized for. Codiste's analysis notes that SSA combines "content-dependent sparse routing" and "exact attention," but the equations have not been published.

Third, validation at 12M tokens. Current benchmarks are mainly at 1M. If the scaling is truly linear, the results at 12M should be proportionally good. But it is precisely at these scales that sparse approaches can break — the routing itself can become a bottleneck.

Fourth, honest comparison with existing alternatives. The best local LLMs via Ollama or LM Studio, optimized RAG solutions, hierarchical approaches — SubQ must prove that it is not only better on paper, but also in practice against a well-tuned RAG pipeline. For those who want to test locally, our local LLM installation guide remains the reference.

❌ Common mistakes

Mistake 1: Confusing announced context with functional context

Announcing 12M tokens doesn't mean the model is reliable at 12M tokens. Claude Opus 4.7 announces 2M but is only solid up to about 500K. The only metric that matters is real performance at different lengths, measured by third parties. Never take a context figure at face value.
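One simple way to probe functional context yourself is a "needle in a haystack" sweep: hide a fact at different depths of a long prompt and check whether the model retrieves it. The sketch below is a minimal harness under stated assumptions: `ask_model` is a hypothetical callable you would wire to a real API, and `forgetful_stub` is a toy stand-in that only "sees" the start and end of the prompt, not a model of any real system.

```python
from typing import Callable, Dict, Sequence

FILLER = "The sky was grey that day."
NEEDLE = "The secret code is 7421."
QUESTION = "What is the secret code?"


def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Insert needle at relative position depth (0.0 = start, 1.0 = end)."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)


def run_depth_sweep(ask_model: Callable[[str], str], answer: str,
                    n_sentences: int, depths: Sequence[float]) -> Dict[float, bool]:
    """Return, for each depth, whether the model's reply contains the answer."""
    results = {}
    for depth in depths:
        prompt = build_haystack(FILLER, NEEDLE, n_sentences, depth) + "\n\n" + QUESTION
        results[depth] = answer in ask_model(prompt)
    return results


def forgetful_stub(prompt: str) -> str:
    """Toy stand-in for a model: only reads the first and last 30% of the prompt."""
    cut = int(len(prompt) * 0.3)
    visible = prompt[:cut] + prompt[-cut:]
    return "7421" if "7421" in visible else "unknown"


if __name__ == "__main__":
    print(run_depth_sweep(forgetful_stub, "7421", n_sentences=100, depths=(0.0, 0.5, 1.0)))
```

Real evaluations such as RULER or MRCR are far more demanding than this, but even a sweep this crude, run at several context lengths, shows whether an announced window is usable in practice.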

Mistake 2: Comparing SubQ on general tasks

SubQ is designed for long-context. Comparing it to GPT-5.5 on short reasoning or creative generation makes no sense — it's not optimized for that. Evaluate it where SSA provides an advantage: retrieval in a 500K+ token context.

Mistake 3: Ignoring the fact that SubQ is a finetune

SubQ starts from open-source weights (most likely Kimi or DeepSeek, according to community observations). This doesn't disqualify the approach, but it means the innovation lies in the attention architecture, not the base model. This is important for evaluating the barrier to entry — if SSA is applicable to other models, SubQ's advantage could be copied.

Mistake 4: Taking the "1000x" seriously

The 1000x is a marketing figure that combines several metrics (speed, cost, tokens) in optimized scenarios. The technically defensible figures are the 52x on prefill and the 300x on cost, which are already impressive enough not to need exaggeration.

❓ Frequently asked questions

Does SubQ replace Claude or GPT for daily use?

No. SubQ is specialized for long-context. For general reasoning, coding, or writing, models like Claude Opus 4.7 or GPT-5.5 remain superior. SubQ excels when you need to put an enormous amount of data into context simultaneously.

Is SSA really new?

Sparse attention has existed since 2019 (Sparse Transformer). SubQ's innovation is content-dependent routing combined with exact attention on the selected subset, with scaling claimed to be linear. The concept is not unprecedented; the technical execution would be.

Can SubQ be used for free?

SubQ is not on the list of best free LLMs. The API is paid per token. Subquadratic has not announced a free tier or an open-source version of the model.

Does SubQ work in French?

No French-specific benchmark has been published. Since SubQ is a finetune of open-source models that support French (Kimi, DeepSeek), its French capabilities are likely decent but not optimized. For native French, see our comparison of the best LLMs in French.

What is the risk if the claims don't hold up?

The main risk is to their reputation. If independent benchmarks show that the scaling isn't linear beyond 1-2M tokens, or that sparse routing significantly degrades quality, the $500M valuation will be hard to justify. But even a partially valid SSA would be a significant contribution.

✅ Conclusion

Subquadratic made an ambitious bet: quadratic attention is not a law of nature, and 12M tokens of reliable context are possible with linear scaling. The initial benchmarks are promising, the team has the pedigree, and the use case for AI agents is obvious.

But independent evidence is lacking, the "1000x" is marketing, and the model is a finetune of an existing one, not a from-scratch revolution. The community is right to be skeptical — and Subquadratic has every interest in publishing transparent evaluations rather than letting doubt set in.

If SSA holds up at 12M tokens, SubQ will not be a better general LLM. It will be the context infrastructure that every serious AI agent uses under the hood. The final verdict belongs to the independent benchmarks that should arrive in the coming weeks.

#subquadratic #subq #attention-quadratique #12-millions-de-tokens #ia #llm
