The Quadratic Bottleneck That Has Held AI Back Since 2017
Every large language model in production today shares the same fundamental constraint: the cost of processing a conversation grows with the square of its length. Double the context window, and you quadruple the compute and the bill. At 200,000 tokens, that math is manageable. At a million tokens, it becomes brutal. At twelve million tokens, it is prohibitive under standard transformer attention.
This quadratic scaling is not a tuning parameter or an engineering gap. It is a structural property of the self-attention mechanism: every token must compare itself with every other token, producing an N x N attention matrix for a sequence of length N. For a 10,000-token document, that is 100 million pairwise comparisons for a single attention pass. For a 10-million-token context, it becomes 100 trillion.
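The arithmetic above is easy to check. The short sketch below counts the entries of the N x N score matrix and, as an added assumption not taken from the article, estimates how much memory a naive implementation would need to hold that matrix at 2 bytes per score. The function names are illustrative only.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key pairs scored by dense self-attention (one pass)."""
    return n_tokens * n_tokens


def score_matrix_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Memory to hold the full N x N score matrix if it were materialized.

    Assumption: 2 bytes per score (16-bit floats). Kernels such as
    FlashAttention avoid materializing this matrix, but still compute
    every pair.
    """
    return attention_pairs(n_tokens) * bytes_per_score


for n in (10_000, 20_000, 10_000_000):
    pairs = attention_pairs(n)
    gigabytes = score_matrix_bytes(n) / 1e9
    print(f"{n:>12,} tokens -> {pairs:,} pairs, {gigabytes:,.1f} GB if materialized")
```

Going from 10,000 to 20,000 tokens multiplies the pair count by four, which is the "double the context, quadruple the compute" relationship described earlier.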
On May 5, 2026, a thirteen-person Miami startup called Subquadratic came out of stealth with $29 million in seed funding and a claim to have cracked this problem. Their model, SubQ, uses a new mechanism called Subquadratic Sparse Attention (SSA) and ships a twelve-million-token context window. The company claims a 1,000x reduction in attention compute over dense transformers at that scale. Researchers are paying attention. They are also demanding proof.
What SubQ Actually Does
The core idea behind SSA is content-dependent sparse routing. Instead of computing attention across every possible token pair, SubQ selectively identifies which token comparisons actually matter and computes exact attention only on those. The attention matrix stays sparse throughout, rather than being dense and then pruned after the fact.
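Subquadratic has not published how SSA selects token pairs, so the following is not its method. It is a minimal sketch of one generic form of content-dependent routing, assuming keys are grouped into fixed-size blocks, each block is summarized by its mean key, and each query runs exact attention only over its highest-scoring blocks. The names `block_routed_attention`, `block_size`, and `top_blocks` are hypothetical, and the sketch is non-causal for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dense_attention(q, k, v):
    """Standard attention: scores every query against every key (N x N)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def block_routed_attention(q, k, v, block_size, top_blocks):
    """Illustrative sparse attention; NOT SubQ's SSA (which is unpublished).

    Each query scores one summary vector per key block, keeps its
    `top_blocks` best blocks, and runs exact attention inside them only.
    """
    n, d = q.shape
    assert n % block_size == 0, "sketch assumes N is a multiple of block_size"
    n_blocks = n // block_size
    k_blocks = k.reshape(n_blocks, block_size, d)
    v_blocks = v.reshape(n_blocks, block_size, d)

    # Routing step: N x (N / block_size) scores instead of N x N.
    summaries = k_blocks.mean(axis=1)
    route = q @ summaries.T
    chosen = np.argpartition(-route, top_blocks - 1, axis=1)[:, :top_blocks]

    # Gather the selected keys and values for every query.
    k_sel = k_blocks[chosen].reshape(n, top_blocks * block_size, d)
    v_sel = v_blocks[chosen].reshape(n, top_blocks * block_size, d)

    # Exact softmax attention, restricted to the selected tokens.
    scores = np.einsum("nd,nmd->nm", q, k_sel) / np.sqrt(d)
    return np.einsum("nm,nmd->nd", softmax(scores), v_sel)


def scored_pairs(n, block_size, top_blocks):
    """Pairs scored by the sketch: routing scores plus in-block attention."""
    return n * (n // block_size) + n * top_blocks * block_size
```

In this sketch the routing step still costs N x (N / block_size) scores, so with a block size near the square root of N and a fixed `top_blocks` the total grows roughly as N^1.5. That is subquadratic but not near-linear; reaching near-linear growth would need a cheaper routing step, and how SSA handles that is not public.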
This is meaningfully different from earlier approaches like FlashAttention, which optimizes the memory access pattern of dense attention without changing its O(N^2) computational character. SSA aims to change the computational character itself, targeting near-linear growth in both compute and memory as context length increases.
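The difference between the two growth rates can be made concrete with a toy cost model. The sketch below assumes attention cost is proportional to the number of token pairs scored and that a sparse scheme gives each token a fixed budget of other tokens to attend to. Both are simplifying assumptions, the routing overhead is ignored, and `budget` is a hypothetical parameter rather than anything Subquadratic has disclosed.

```python
def dense_pairs(n: int) -> int:
    """Pairs scored by dense attention: every token against every token."""
    return n * n


def sparse_pairs(n: int, budget: int) -> int:
    """Pairs scored if each token attends to at most `budget` tokens."""
    return n * min(budget, n)


def implied_budget(n: int, reduction: float) -> float:
    """Per-token budget implied by a claimed reduction in pairs scored."""
    return n / reduction


n = 12_000_000
print("dense growth when N doubles: ", dense_pairs(2 * n) / dense_pairs(n))
print("sparse growth when N doubles:", sparse_pairs(2 * n, 12_000) / sparse_pairs(n, 12_000))
print("budget implied by a 1,000x reduction at 12M tokens:", implied_budget(n, 1000))
```

Under these assumptions, a 1,000x reduction at twelve million tokens corresponds to each token attending to about 12,000 others on average. Whether the real system behaves this way cannot be checked without a technical report.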
The claimed performance numbers are specific. At one million tokens, Subquadratic says SubQ runs roughly 52 times faster than FlashAttention. API pricing is listed at approximately one-fifth the cost of Claude Opus or GPT-5.5 for comparable workloads. On published benchmarks, SubQ reports 81.8% on SWE-Bench Verified and 95% on RULER@128K, a long-context retrieval benchmark that most frontier models struggle with beyond 512K tokens.
If those numbers hold under independent scrutiny, a twelve-million-token context window at frontier quality and a fraction of the compute cost would be genuinely significant. Entire codebases, genomic datasets, years of correspondence — all processable in a single context, cheaply.
Why Researchers Are Not Convinced
Subquadratic launched without releasing model weights, without a technical report, and without a reproducible evaluation suite. That combination is a red flag in AI research. Benchmarks are easy to game when you control which ones you publish and how you measure them. The community has learned this the hard way, and the problem is well documented by now.
Theory researchers are also pointing to formal constraints. A 2024 paper, “Fundamental Limitations on Subquadratic Alternatives to Transformers” (arxiv.org/abs/2410.04271), establishes conditional hardness results showing that certain attention computations cannot be approximated sub-quadratically without sacrificing quality on adversarial inputs. Whether SSA navigates around those constraints — or simply avoids the adversarial cases in its benchmarks — is something only a full technical report and open weights could reveal.
Access to SubQ is currently limited to a waitlist, with no public API. That makes community replication impossible. The claims rest entirely on what the company says about its own model, with no external check.
The Dismal Track Record of “Transformers Are Over”
Subquadratic architectures have been declared the future of AI at least twice in the last three years, and have consistently failed to deliver at frontier scale.
Mamba, introduced by Albert Gu and Tri Dao in late 2023, is the most technically rigorous of these alternatives. Its state-space model approach achieves true linear scaling with sequence length. In independent benchmarks at up to around 7B parameters, Mamba performs comparably to transformers. At 70B and beyond, it consistently underperforms. It is not used in any frontier production model today.
RWKV follows a similar arc. Elegant in theory, it merges recurrent and attention-based processing to reduce long-range dependency costs. In practice, it hits a quality ceiling at scale that dense attention does not. Hybrid architectures like Jamba, which interleaves Mamba and transformer layers, have shown promise but have not displaced pure-transformer models at the frontier.
Most relevant is Magic.dev’s August 2024 announcement of a 100-million-token context window model, also claiming roughly a 1,000x efficiency advantage. As of mid-2026, there is no public evidence of that model being used in production outside the company itself. The pattern is familiar: extraordinary efficiency claims, limited public access, no independent replication, and a gradual, quiet retreat.
Subquadratic knows this history. Their launch materials directly address the Mamba comparison, arguing that SSA preserves the token-mixing expressiveness that state-space models sacrifice. Whether that argument holds at scale is the central open question.
What Would Change the Picture
A few specific developments would make SubQ worth treating as a real architectural shift rather than a well-funded press release.
Open weights or a full technical report. A public SubQ model that the research community could run, evaluate, and probe would settle the benchmark debate quickly. The absence of both is the single biggest signal that the company is not yet confident its claims would survive broad external scrutiny.
Independent evaluation on harder benchmarks. Subquadratic cites 95% on RULER@128K, a synthetic retrieval task, but RULER is known to be easier than real-world long-context challenges. MRCR v2 and needle-in-a-haystack variants with adversarial distractors are harder to game. Those results would be more credible.
Quality parity at normal context lengths. SubQ’s efficiency advantage is most pronounced at 1M-plus tokens. If the model underperforms GPT-5.5 or Claude Opus at 8K tokens — the regime where most production queries live — then the architecture involves quality tradeoffs that have not been disclosed. That matters for anyone considering building on it.
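One of the checks above, needle-in-a-haystack retrieval with adversarial distractors, is simple to prototype. The sketch below is a toy illustration only: the sentence templates, vault numbers, and function names are invented for this example and are not taken from RULER, MRCR v2, or any SubQ evaluation. It hides one target fact among filler lines and near-identical distractor facts, then checks a model's answer by exact substring match.

```python
import random


def build_needle_task(n_filler=1000, n_distractors=20, seed=0):
    """Return (context, question, expected_answer) for a toy retrieval test."""
    rng = random.Random(seed)  # seeded so the task is reproducible
    target_vault, target_code = 7, 4921

    lines = [f"Log entry {i}: nothing unusual was recorded." for i in range(n_filler)]

    # The needle, plus distractors that share its wording but use other
    # vault numbers and never reuse the target code.
    facts = [f"The access code for vault {target_vault} is {target_code}."]
    other_codes = [c for c in range(1000, 10000) if c != target_code]
    for vault in range(100, 100 + n_distractors):
        facts.append(f"The access code for vault {vault} is {rng.choice(other_codes)}.")

    # Scatter all facts at random positions in the filler text.
    for fact in facts:
        lines.insert(rng.randrange(len(lines) + 1), fact)

    question = f"What is the access code for vault {target_vault}?"
    return "\n".join(lines), question, str(target_code)


def is_correct(model_answer: str, expected: str) -> bool:
    """Lenient scoring: the expected code must appear in the answer."""
    return expected in model_answer


context, question, expected = build_needle_task(n_filler=50, n_distractors=5, seed=1)
print(question)
print(len(context.splitlines()), "lines of context")
```

A real evaluation would scale `n_filler` until the context reaches the length under test and would report accuracy across many seeds and needle positions.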
The business case for a working subquadratic model is real. Enterprises running retrieval-augmented generation over large document collections, genomics teams processing whole-genome data in a single context, and legal teams analyzing entire case histories would all benefit from near-linear attention at frontier quality. The market for a genuinely working 12M-token model at one-fifth the cost is enormous. That makes the claims worth watching carefully, and the absence of independent proof worth flagging just as carefully.
Further Reading
- VentureBeat: Miami startup Subquadratic claims 1,000x AI efficiency gain — the original launch coverage with researcher reactions and context on the claims
- Fundamental Limitations on Subquadratic Alternatives to Transformers — the formal theory paper establishing hardness results that SubQ’s SSA must navigate around
- DataCamp: SubQ AI Explained — a clear technical breakdown of how SSA works and where the claimed efficiency gains come from

