
Cloudflare Unweight: LLMs 22% Smaller, 3x Faster Inference

The Weight Problem Nobody Was Talking About

Every time a GPU loads a large language model for inference, it shuffles billions of floating-point values from high-bandwidth memory (HBM) to its tensor cores. That transfer is slow, power-hungry, and — as Cloudflare’s research team has now demonstrated — surprisingly wasteful. Their new system, Unweight, compresses LLM weight tensors by 15–22% without touching a single output bit. The model still produces identical results. It just gets there faster and with less hardware.

Cloudflare published the technical details in a blog post and an accompanying research paper last month, alongside their broader infrastructure post on running frontier-scale models on Workers AI. The timing matters: Cloudflare had just added Moonshot AI’s Kimi K2.5 — a 256k-context, frontier-scale open-source model — to Workers AI. Unweight let them make it 3x faster in production.

How Unweight Works: A Compression Problem Hidden in Plain Sight

Modern LLMs store weights in BF16 format — 16 bits per value, split into a sign bit, a 7-bit mantissa, and an 8-bit exponent. The exponent carries most of the magnitude information. And here’s the structural inefficiency Cloudflare’s team spotted: in a trained LLM, out of 256 possible exponent values, the top 16 typically account for over 99% of all weights in a given layer.
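To see the skew for yourself, here is a minimal numpy sketch (ours, not Cloudflare's code) that pulls the exponent field out of BF16-encoded weights and measures how much of the mass the 16 most common exponent values cover. The randomly generated matrix is only a stand-in for a real trained layer:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a trained weight matrix: LLM weights are roughly
# zero-centred with a small standard deviation, which is exactly why
# the exponent field ends up so concentrated.
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)

# BF16 is the top 16 bits of the float32 bit pattern:
# bit 15 = sign, bits 14..7 = exponent, bits 6..0 = mantissa.
bits16 = (w.view(np.uint32) >> 16).astype(np.uint16)
exponent = ((bits16 >> 7) & 0xFF).astype(np.uint8)

counts = np.bincount(exponent.ravel(), minlength=256)
top16_share = np.sort(counts)[::-1][:16].sum() / counts.sum()
print(f"weights covered by the 16 most common exponents: {top16_share:.4f}")
```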

That’s extraordinary redundancy. Unweight exploits it with a classic information-theory tool — Huffman coding. Short bit sequences map to common exponent values; longer sequences handle the rare ones. The result is that the exponent byte compresses substantially, while the sign and mantissa bits are handled separately. For the right models and layers, total size shrinks 15–22%.
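A rough sketch of that coding step, again ours rather than Cloudflare's implementation: build a Huffman code over the exponent histogram and estimate the resulting bits per weight, with the sign bit and 7-bit mantissa assumed to be stored uncompressed:

```python
import heapq
import numpy as np

def huffman_code_lengths(counts):
    """Code length in bits per symbol for a Huffman code over `counts`."""
    heap = [(int(c), [s]) for s, c in enumerate(counts) if c > 0]
    heapq.heapify(heap)
    lengths = {s: 0 for _, syms in heap for s in syms}
    if len(heap) == 1:                       # degenerate case: a single symbol
        return {s: 1 for s in lengths}
    while len(heap) > 1:
        c1, s1 = heapq.heappop(heap)
        c2, s2 = heapq.heappop(heap)
        for s in s1 + s2:                    # each merge adds one bit of depth
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, s1 + s2))
    return lengths

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)
# The float32 and BF16 exponent fields are identical, so read it directly.
exponent = ((w.view(np.uint32) >> 23) & 0xFF).astype(np.uint8)

counts = np.bincount(exponent.ravel(), minlength=256)
lengths = huffman_code_lengths(counts)
exp_bits = sum(counts[s] * l for s, l in lengths.items()) / counts.sum()
print(f"bits per weight: {exp_bits + 1 + 7:.2f} (down from 16)")  # + sign + mantissa
```

On real checkpoints the ratio varies by layer and model, which is one reason the headline figure is a 15–22% range rather than a single number.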

The clever part is where decompression happens. Unweight doesn’t decompress to disk or DRAM. Instead, it decompresses directly into the GPU’s on-chip shared memory — the fast scratchpad between HBM and the tensor cores. A custom matrix multiplication kernel fuses decompression with computation in a single operation: load compressed data from HBM, reconstruct BF16 values in shared memory, feed tensor cores. The GPU never sees uncompressed weights in the bandwidth-constrained path. The GPU kernels are open-source on GitHub under the cloudflareresearch organization, targeting NVIDIA Hopper GPUs (H100, H200).
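The structure is easier to see in a plain CPU analogue. The sketch below is not the CUDA kernel (it uses zlib as a stand-in codec and numpy in place of tensor cores), but it shows the shape of the fusion: weights live only as compressed tiles, and each tile is decoded into a small scratch buffer immediately before its partial matmul, so a fully decompressed copy of W never exists.

```python
import zlib              # stand-in codec; the real kernel decodes Huffman-coded exponents
import numpy as np

TILE = 256               # rows of W decoded per step (the "shared-memory" tile)

def compress_tiles(w):
    """Store W as a list of independently compressed row-tiles."""
    tiles = []
    for k0 in range(0, w.shape[0], TILE):
        chunk = np.ascontiguousarray(w[k0:k0 + TILE])
        tiles.append((chunk.shape, zlib.compress(chunk.tobytes())))
    return tiles

def fused_matmul(x, tiles, dtype=np.float32):
    """Compute x @ W one compressed tile at a time; W is never fully materialized."""
    y, k0 = None, 0
    for shape, blob in tiles:
        # Decode just this tile into a small scratch buffer (the shared-memory analogue).
        scratch = np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)
        partial = x[:, k0:k0 + shape[0]] @ scratch
        y = partial if y is None else y + partial
        k0 += shape[0]
    return y

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(1024, 512)).astype(np.float32)
x = rng.normal(size=(4, 1024)).astype(np.float32)
assert np.allclose(fused_matmul(x, compress_tiles(W)), x @ W)   # lossless: same result
```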

Real-World Impact: Kimi K2.5 Goes from 100ms to 20–30ms per Token

Unweight isn’t a lab experiment. Cloudflare has already deployed it as part of a broader infrastructure overhaul that made Kimi K2.5 3x faster on Workers AI. The headline metric: p90 time-per-output-token dropped from roughly 100ms (with high variance) to 20–30ms. For a model used in agentic tasks with multi-turn tool calls, that’s the difference between a responsive experience and one where users abandon the session.

The 3x gain didn’t come from Unweight alone. Cloudflare also overhauled how they balance load between prefill and decode stages. Their new token-aware load balancing system estimates in-flight prefill and decode tokens per endpoint and routes requests accordingly — rather than treating all inference requests as equivalent. Prefill (processing the prompt) and decode (generating tokens) have very different compute profiles; mixing them naively on the same endpoint kills throughput. Separating them properly restores predictable latency.
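As a hypothetical illustration of what token-aware routing means in practice (the names, weights, and heuristics here are ours, not Cloudflare's), a scheduler can score each endpoint by its estimated in-flight prefill and decode tokens and send new work to the least-loaded one:

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    name: str
    prefill_tokens: int = 0     # estimated in-flight prompt tokens
    decode_tokens: int = 0      # estimated in-flight generation tokens

# Illustrative weights: prompt processing is compute-heavy per token,
# generation is lighter per token but long-lived.
PREFILL_WEIGHT, DECODE_WEIGHT = 4.0, 1.0

def load_score(ep: Endpoint) -> float:
    return PREFILL_WEIGHT * ep.prefill_tokens + DECODE_WEIGHT * ep.decode_tokens

def route(endpoints, prompt_tokens: int, expected_output_tokens: int) -> Endpoint:
    target = min(endpoints, key=load_score)
    # Book-keep the new request so the next routing decision sees it.
    target.prefill_tokens += prompt_tokens
    target.decode_tokens += expected_output_tokens
    return target

endpoints = [Endpoint("gpu-a"), Endpoint("gpu-b", prefill_tokens=80_000)]
chosen = route(endpoints, prompt_tokens=32_000, expected_output_tokens=512)
print(chosen.name)   # "gpu-a": gpu-b already has a long prompt in flight
```

Counting estimated tokens rather than requests is the key move: a 100k-token prompt and a short chat turn are wildly different amounts of work even though both are "one request."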

Kimi K2.5 is now available on Workers AI, along with the newer Kimi K2.6. Both benefit from this infrastructure. Cloudflare’s stated goal is to extend Unweight to the larger models in their serving fleet as the engineering matures.

The Trade-offs: Not a Free Lunch at Every Batch Size

Cloudflare is candid about the limits. On Llama 3.1 8B, their tests show Unweight saves approximately 13% of total model memory — less than the headline 22%, because not all layers compress equally. More importantly, at typical serving batch sizes, they observe roughly a 30% throughput penalty compared to uncompressed inference.

That trade-off reflects a real tension: decompression adds compute work per forward pass. At small batch sizes (1–4 sequences), the bandwidth savings dominate and the system runs faster. As batch sizes grow toward the regime where most production inference servers operate, the added compute cost starts to bite. The result is a tool well-suited to latency-sensitive, low-concurrency deployments — which is precisely the profile of agentic workloads with long context windows and frequent tool calls. It’s less attractive for high-throughput batch inference where you want to maximize sequences-per-second.
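A back-of-the-envelope roofline model makes the crossover concrete. Every number below is our own illustrative choice (H100-class bandwidth, a sustained rather than peak compute figure, a 20% size reduction, a 30% decode overhead), not a measurement from Cloudflare's paper:

```python
# Simplified roofline model of a single BF16 matmul layer.
HBM_BW = 3.35e12        # bytes/s, roughly H100 SXM HBM bandwidth
COMPUTE = 4.0e14        # FLOP/s, a rough *sustained* BF16 figure (well below peak)
COMPRESSION = 0.80      # compressed weights at ~80% of original size
DECODE_OVERHEAD = 1.30  # extra compute when decoding inside the kernel

def layer_time(batch, d_in=8192, d_out=8192, compressed=False):
    weight_bytes = d_in * d_out * 2          # BF16 = 2 bytes per weight
    flops = 2 * batch * d_in * d_out         # one matmul
    if compressed:
        weight_bytes *= COMPRESSION
        flops *= DECODE_OVERHEAD
    # Whichever of memory traffic or arithmetic is slower bounds the layer.
    return max(weight_bytes / HBM_BW, flops / COMPUTE)

for batch in (1, 4, 64, 256):
    speedup = layer_time(batch) / layer_time(batch, compressed=True)
    print(f"batch {batch:>3}: compressed speedup {speedup:.2f}x")
```

With these toy numbers the compressed kernel wins while the layer stays bandwidth-bound and loses once it becomes compute-bound, where the assumed decode overhead surfaces directly as a throughput penalty; that is the same qualitative picture Cloudflare describes.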

This is an honest engineering trade-off, not a flaw. The use case drives the choice. Cloudflare’s Workers AI targets low-latency, per-request inference for developers — not batch jobs. Unweight fits that use case well.

Why This Matters Beyond Cloudflare

The significance of Unweight goes beyond Cloudflare’s own infrastructure. Three things make this work notable.

First, it’s hardware-agnostic within a GPU generation. Unweight doesn’t require new silicon. It runs on H100s and H200s that inference providers already operate. The optimization is software-side, which means it can be adopted (or adapted) by any provider running Hopper-class hardware.

Second, it’s lossless. This matters enormously in production. Quantization-based compression (int8, int4, GPTQ) changes model outputs — sometimes in acceptable ways, sometimes not. Unweight produces bit-exact outputs. There’s no need to re-evaluate the model, run evals, or worry about edge-case regressions. The math is identical.

Third, it points to a broader insight: LLM weights are structurally redundant in ways we haven’t fully exploited. Most optimization work targets activations, KV caches, and attention patterns. The weight tensors themselves have largely been treated as fixed data. Unweight opens a different compression surface, and it’s likely other researchers will now examine what else is hiding in BF16 exponent distributions.

For context on what this kind of infrastructure work costs at frontier scale, see our earlier piece on what trillion-parameter AI actually costs — and how Cloudflare’s edge-native approach, including their earlier Project Think for durable agents, is building a different kind of inference stack than the hyperscalers.

What’s Next

Cloudflare’s research paper signals that Unweight is a program, not a one-off technique. Their roadmap covers three deployment scenarios: dense inference (what they’ve published), model distribution (reducing bandwidth costs when moving weights across the network), and Mixture-of-Experts serving (where expert weights are loaded on demand and compression could meaningfully change the economics). MoE is the interesting one — models like Kimi K2.5 use sparse expert activation, meaning most weights are dormant during any given forward pass. Compressing dormant expert weights before they’re needed is a natural fit.

The GPU kernels are already open-source. If Unweight holds up under broader testing, expect other inference providers to evaluate it. The technique is general: it targets BF16 exponent redundancy, which is a property of trained LLMs, not of any specific model family. The compression ratio will vary by model and architecture, but the underlying observation — that exponent distributions are heavily concentrated — appears to hold across model types.
