What Google Released — and Why It Matters
On April 2, 2026, Google DeepMind released Gemma 4 under the Apache 2.0 license — meaning it’s free to use, modify, and deploy commercially. The family ships in four sizes: E2B (2.3B effective parameters), E4B (4.5B effective), a 26B Mixture-of-Experts model with only 3.8B parameters active during inference, and a 31B dense model. Every variant supports multimodal input; the smaller E2B and E4B models add native audio understanding on top.
The headline result is striking: the 31B dense model currently ranks third on the Chatbot Arena text leaderboard among all open-weight models globally, and the 26B MoE sits at sixth. Both outrank open and proprietary models with parameter counts 10 to 20 times higher. The question worth asking is why: architecture, not just scale, explains this gap.
Three Architecture Changes That Explain the Performance
The jump from Gemma 3 to Gemma 4 is the largest generational improvement in the family’s history, driven in part by architectural research from the Gemini 3 program. Three innovations stand out.
Per-Layer Embeddings
Standard transformers assign each token a single embedding vector at input, which the residual stream then builds on across all layers. Per-Layer Embeddings (PLE), first used in Gemma 3n and refined here, add a parallel lower-dimensional conditioning pathway. For each token, PLE computes a small dedicated vector for every decoder layer — combining a token-identity signal from an embedding lookup with a context-aware signal from a learned projection of the main embeddings.
Each decoder layer then uses its corresponding PLE vector to modulate hidden states via a lightweight residual block, applied after the attention and feed-forward operations. The result: each layer receives richer per-token information without widening the main model dimensions. You extract more capability from the same parameter budget.
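A toy numpy sketch of this mechanism, assuming the two-signal construction described above (embedding lookup plus a learned projection) and a simple additive residual modulation. All dimensions, initializations, and the exact combination rule here are illustrative assumptions, not the production architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, D_MODEL, D_PLE, N_LAYERS = 100, 64, 16, 4  # illustrative sizes

# Main token embeddings plus one small per-layer PLE table per decoder layer.
tok_emb = rng.normal(size=(VOCAB, D_MODEL)) * 0.02
ple_tables = rng.normal(size=(N_LAYERS, VOCAB, D_PLE)) * 0.02
# Learned projections from the main embedding into each layer's PLE space.
ple_proj = rng.normal(size=(N_LAYERS, D_MODEL, D_PLE)) * 0.02
# Lightweight per-layer "up" projection used by the residual modulation block.
ple_up = rng.normal(size=(N_LAYERS, D_PLE, D_MODEL)) * 0.02

def per_layer_vectors(token_ids):
    """Token-identity signal (table lookup) + context-aware signal (projection)."""
    x = tok_emb[token_ids]                      # (seq, d_model)
    vecs = []
    for l in range(N_LAYERS):
        identity = ple_tables[l][token_ids]     # (seq, d_ple)
        contextual = x @ ple_proj[l]            # (seq, d_ple)
        vecs.append(identity + contextual)
    return x, vecs

def apply_ple(hidden, ple_vec, layer):
    """Residual modulation applied after a layer's attention and FFN."""
    return hidden + ple_vec @ ple_up[layer]

tokens = np.array([3, 17, 42])
hidden, ple = per_layer_vectors(tokens)
for l in range(N_LAYERS):
    # (a real decoder layer would run attention + feed-forward here)
    hidden = apply_ple(hidden, ple[l], l)
print(hidden.shape)
```

Note how the PLE vectors live in a 16-dimensional space rather than the 64-dimensional model space: the conditioning pathway stays cheap while still delivering a distinct signal to every layer.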
Alternating Local and Global Attention
Gemma 4 alternates between two attention regimes: local sliding-window layers that operate on a limited context window, and global full-context layers that see the entire sequence. Local layers are fast and memory-efficient; global layers provide the long-range reasoning that tasks like multi-step math or extended code generation require. By interleaving the two, Gemma 4 achieves near-full-context reasoning at significantly lower compute cost than an all-global-attention model of the same size.
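The two regimes differ only in their attention masks. A small numpy sketch of the masks and their interleaving; the window size and the local-to-global ratio here are illustrative choices, not Gemma 4's actual configuration:

```python
import numpy as np

def causal_mask(n):
    """Global layer: every position attends to all earlier positions."""
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n, window):
    """Local layer: each position attends only to the last `window` positions."""
    m = causal_mask(n)
    for i in range(n):
        m[i, : max(0, i - window + 1)] = False
    return m

def layer_masks(n_layers, seq_len, window, global_every=4):
    # Interleave: one global full-context layer after every few local layers
    # (the 1-in-4 ratio is an assumption for illustration).
    return [
        causal_mask(seq_len) if (l + 1) % global_every == 0
        else sliding_window_mask(seq_len, window)
        for l in range(n_layers)
    ]

masks = layer_masks(n_layers=8, seq_len=6, window=3)
# Local layers allow far fewer attention edges than global ones,
# which is where the compute and KV-cache savings come from.
print(masks[0].sum(), masks[3].sum())
```

The cost argument falls out of the mask sizes: local layers scale with sequence length times window, while only the occasional global layer pays the full quadratic price.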
A Smarter Mixture-of-Experts Design
The 26B A4B variant uses a MoE design with 128 experts in total, 8 of which are activated per token. What makes it distinctive is the permanently active shared expert — which is three times the size of any individual routed expert. The shared expert holds general knowledge that should apply to every token; the routed experts handle domain-specific reasoning. Only 3.8 billion parameters are computed during each forward pass, even though all 26 billion must reside in memory. That makes the inference compute cost comparable to a much smaller dense model, while retaining the representational capacity of the full network.
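A toy single-token forward pass showing this routing pattern: a softmax-weighted top-8-of-128 router plus a permanently active shared expert three times a routed expert's width. The layer sizes and ReLU experts are illustrative assumptions, not the real internals:

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_EXPERTS, TOP_K = 32, 128, 8
D_EXPERT = 64            # hidden width of one routed expert (illustrative)
D_SHARED = 3 * D_EXPERT  # shared expert ~3x the size of a routed expert

router = rng.normal(size=(D, N_EXPERTS)) * 0.02
experts_in = rng.normal(size=(N_EXPERTS, D, D_EXPERT)) * 0.02
experts_out = rng.normal(size=(N_EXPERTS, D_EXPERT, D)) * 0.02
shared_in = rng.normal(size=(D, D_SHARED)) * 0.02
shared_out = rng.normal(size=(D_SHARED, D)) * 0.02

def moe_forward(x):
    """One token. The shared expert always runs; top-k routed experts add to it."""
    logits = x @ router
    top = np.argsort(logits)[-TOP_K:]            # the 8 selected experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                     # softmax over the selected 8
    out = np.maximum(x @ shared_in, 0) @ shared_out  # permanently active
    for w, e in zip(weights, top):
        out += w * (np.maximum(x @ experts_in[e], 0) @ experts_out[e])
    return out

y = moe_forward(rng.normal(size=D))
print(y.shape)
```

Only the shared expert and the 8 selected experts contribute FLOPs for this token; the other 120 expert weight matrices sit in memory untouched, which is exactly the compute-versus-memory tradeoff described above.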
Benchmark Results in Context
Gemma 4 31B scores 89.2% on AIME 2026 (mathematical reasoning without tools) and 85.2% on MMLU Pro (broad knowledge). It reaches 80.0% on LiveCodeBench v6 (coding) and 84.3% on GPQA Diamond (graduate-level scientific reasoning). These results are on par with or ahead of models that are significantly larger by parameter count.
To put the competition in context: DeepSeek V4, released earlier this year with a reported 1 trillion parameters, scores in the low 40s on AIME 2026 and GPQA Diamond — substantially behind Gemma 4 31B on those specific benchmarks. Mistral Large 3 at 675B is also outpaced on several academic benchmarks.
One honest caveat: benchmark rankings don’t always transfer to production. AIME 2026 and MMLU Pro measure specific reasoning abilities under controlled conditions. Real-world performance on domain-specific tasks — code generation in proprietary frameworks, structured data extraction, long-document analysis — can differ substantially. The Arena ranking is more meaningful precisely because it reflects real human preferences across diverse prompts, not just curated test sets.
What This Means for Teams Deploying Open Models
The practical implication is compute cost. A 31B dense model in float16 fits on a single A100 80GB GPU, or two smaller GPUs in mixed precision. Running inference at scale is feasible with infrastructure most engineering teams already own. Compare that to a 600B dense model, which needs around eight 80GB GPUs at 8-bit precision (and roughly twice that at float16) just to load.
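The figures above are weight-only arithmetic, which is worth sanity-checking. A rough calculation, ignoring KV cache and activation memory (both add real overhead in practice):

```python
def load_memory_gb(params_billions, bytes_per_param):
    """Weight-only memory needed to load a model, in GB (rough estimate)."""
    return params_billions * 1e9 * bytes_per_param / 1e9

# Gemma 4 31B in float16 (2 bytes/param): ~62 GB -> fits one 80GB A100.
print(round(load_memory_gb(31, 2)))
# A 600B dense model in float16: ~1.2 TB of weights alone.
print(round(load_memory_gb(600, 2)))
# The same 600B model at 8-bit: ~600 GB, i.e. around eight 80GB GPUs.
print(round(load_memory_gb(600, 1)))
```

Real deployments need headroom beyond these numbers for the KV cache, which grows with batch size and context length, so treat them as lower bounds.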
The 26B MoE variant is even more interesting for constrained deployments. With only 3.8B active parameters per step, inference throughput is closer to a 4B model than a 26B one, yet routing lets each token draw on the full 26B parameter space. The tradeoff is that all 26B parameters still need to be in memory, so the memory footprint doesn’t shrink proportionally. Teams will need to benchmark this on their specific hardware before optimizing for it.
For edge and on-device scenarios, the E2B model is the real story: it can run in under 1.5GB of RAM using quantized weights and memory-mapped per-layer embeddings. This makes Gemma 4 viable on modern smartphones and edge devices without a network round-trip — useful for latency-sensitive applications, offline environments, and privacy-conscious deployments where data cannot leave the device.
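The back-of-the-envelope arithmetic behind the sub-1.5GB figure, assuming 4-bit weight quantization (0.5 bytes per parameter); actual runtime overhead from activations and non-quantized layers varies by implementation:

```python
# E2B: 2.3B effective parameters at 4-bit precision (0.5 bytes/param).
# Memory-mapped per-layer embeddings can be paged from storage on demand,
# which is how the resident footprint stays near the weight-only figure.
weights_gb = 2.3e9 * 0.5 / 1e9
print(round(weights_gb, 2))
```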
All Gemma 4 variants support multimodal input (text and images) natively. The two smaller models additionally handle audio, enabling speech recognition and audio understanding workflows on-device. For document-heavy or vision-grounded engineering workflows, multimodal support at this efficiency level opens up use cases that previously required cloud inference.
The Bigger Picture for Open-Weight AI
Gemma 4’s release continues a trend that was already clear by early 2026: the open-weight frontier is closing the gap with proprietary models faster than most organizations expected. Architectural techniques like PLE, alternating attention, and optimized MoE routing are no longer exclusive to the labs with the largest compute budgets — they’re showing up in models that run on a single workstation.
The Apache 2.0 license matters too. Unlike some recent open-weight releases that restrict commercial use or impose usage policies, Gemma 4 is genuinely open for deployment, fine-tuning, and redistribution. For teams evaluating whether to build on proprietary APIs or self-host, the cost and control calculus just shifted again.
The next question is fine-tuning accessibility. At 31B parameters, supervised fine-tuning still demands significant GPU hours and memory. LoRA and QLoRA techniques bring that down, but not to trivial levels. The 26B MoE’s inference efficiency doesn’t fully translate to training efficiency — routing mechanisms add complexity that full fine-tuning workflows need to account for. How the community tackles this over the next few months will determine how much of the benchmark performance survives domain-specific adaptation.
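To make the parameter-efficiency argument concrete, a minimal LoRA sketch: the pretrained weight stays frozen while two low-rank factors are trained, and initializing one factor to zero means the adapted layer starts out identical to the original. Dimensions and rank here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_OUT, RANK = 512, 512, 8  # illustrative layer size and LoRA rank

W = rng.normal(size=(D_IN, D_OUT)) * 0.02  # frozen pretrained weight
# Only A and B are trained; B starts at zero so the adapted layer
# initially computes exactly what the frozen layer does.
A = rng.normal(size=(D_IN, RANK)) * 0.02
B = np.zeros((RANK, D_OUT))

def lora_forward(x, scale=1.0):
    """Frozen path plus a trainable low-rank correction."""
    return x @ W + scale * (x @ A @ B)

x = rng.normal(size=(3, D_IN))
print(np.allclose(lora_forward(x), x @ W))  # identical while B is zero

trainable_fraction = (A.size + B.size) / W.size
print(trainable_fraction)
```

At rank 8 on a 512-wide layer, only about 3% of the layer's parameters are trainable; that ratio, not the total model size, is what drives LoRA's memory savings during fine-tuning.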
Further Reading
- Google’s Gemma 4 launch post — the official announcement with release details and intended use cases
- A Visual Guide to Gemma 4 by Maarten Grootendorst — clear diagrams of the PLE and MoE architectures if you want to understand the internals without reading the technical report
- Artificial Analysis: Gemma 4 31B — live price/performance benchmarks across API providers, useful for deployment cost estimation
