Introduction
The intuition that better models must be slower models has quietly governed AI deployment decisions for years. NVIDIA’s Nemotron 3 Super, released on March 11, 2026 at GTC, is a direct challenge to that assumption. The model packs 120 billion total parameters into an architecture that activates only 12 billion per token — and delivers up to 2.2x higher inference throughput than OpenAI’s GPT-OSS-120B at comparable or better accuracy on most benchmarks.
That gap matters. At production scale, throughput isn’t a nice-to-have; it’s the difference between a viable agentic system and an expensive prototype. Nemotron 3 Super is built from the ground up to close that gap, using a combination of three architectural ideas that, until now, hadn’t been fused at this scale.
A Hybrid Architecture That Earns Its Complexity
Most large language models today are either pure Transformers or, increasingly, Mixture-of-Experts Transformers. Nemotron 3 Super takes a different path: it interleaves three distinct layer types — Mamba-2 state space model layers, standard Transformer attention layers, and MoE feed-forward layers — into a single backbone. Each layer type does a different job, so the design avoids redundancy rather than adding it.
The Mamba-2 layers handle the bulk of sequence processing. State space models compute in linear time relative to context length, which is what makes Nemotron 3 Super’s 1 million token native context window practical rather than theoretical. The Transformer attention layers complement this by providing precise token-to-token recall for tasks that require structural reasoning — code manipulation, mathematical proofs, multi-step planning. The MoE layers then handle parameter scaling without proportional compute scaling.
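A toy sketch of how such interleaving composes, with stand-in operations for each layer type. The 4:1:1 layer ratio, the ordering, and the operations themselves are illustrative assumptions for this sketch, not Nemotron 3 Super's published schedule:

```python
import numpy as np

# Illustrative block pattern; the real Mamba:attention:MoE ratio is not
# reproduced here and is an assumption of this sketch.
PATTERN = ["mamba", "mamba", "mamba", "mamba", "attention", "moe"]

def mamba_layer(x):
    # Stand-in for an SSM scan: a causal running mean, computable in
    # O(seq_len) like a real state space layer.
    return np.cumsum(x, axis=0) / np.arange(1, x.shape[0] + 1)[:, None]

def attention_layer(x):
    # Stand-in for causal self-attention: O(seq_len^2) token mixing.
    scores = x @ x.T / np.sqrt(x.shape[1])
    mask = np.tril(np.ones_like(scores, dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def moe_layer(x):
    # Stand-in for an expert feed-forward: per-token nonlinearity only.
    return np.maximum(x, 0.0)

LAYERS = {"mamba": mamba_layer, "attention": attention_layer, "moe": moe_layer}

def hybrid_block(x):
    for name in PATTERN:
        x = x + LAYERS[name](x)  # residual connection around every layer
    return x

x = np.random.default_rng(0).normal(size=(8, 16))  # (seq_len, hidden)
y = hybrid_block(x)
print(y.shape)  # (8, 16)
```

The point of the sketch is the cost structure: only the attention stand-in scales quadratically with sequence length, so keeping attention layers a minority of the stack is what makes very long contexts tractable.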
The MoE design itself includes a notable innovation NVIDIA is calling LatentMoE. Standard expert routing operates on full-dimensional token embeddings. LatentMoE first compresses tokens from a hidden dimension of 4096 down to 1024 before routing, then expands back. The effect: activating 4x more experts at the same computational cost. The model has 512 total experts with 22 active per token — a ratio that would be prohibitively expensive without the compression step.
One more piece: Multi-Token Prediction (MTP) layers. Rather than predicting one next token at a time, MTP predicts several simultaneously. On SPEED-Bench, Nemotron 3 Super achieves an average acceptance length of 3.45 tokens per verification step, compared to DeepSeek-R1’s 2.70. This built-in speculative decoding enables up to 3x wall-clock speedups for structured generation without requiring a separate smaller draft model.
The Numbers: What the Benchmarks Actually Show
NVIDIA’s throughput claims are specific: on a standard 8K input / 64K output configuration running on B200 GPUs with vLLM or TRT-LLM, Nemotron 3 Super reaches 449–478 output tokens per second, compared to GPT-OSS-120B’s roughly 215 tokens per second. The 7.5x advantage over Qwen3.5-122B is less surprising — that comparison uses Qwen in BF16 without quantization, which is not a fair fight in inference-optimized production environments.
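Those figures pin down the headline multiplier; a quick sanity check of the ratio, using only the numbers above:

```python
# Published B200 throughput figures (output tokens/sec, 8K in / 64K out).
nemotron = (449, 478)
gpt_oss = 215

low, high = (t / gpt_oss for t in nemotron)
print(f"{low:.2f}x - {high:.2f}x")  # 2.09x - 2.22x, i.e. "up to 2.2x"
```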
The accuracy story is more nuanced. On SWE-Bench Verified — arguably the most meaningful benchmark for software engineering agents — Nemotron 3 Super scores 60.47% against GPT-OSS-120B’s 41.90% (both evaluated via OpenHands). That’s a substantial gap that holds up across independently reproduced results. On RULER at 1 million token context, Nemotron 3 Super reaches 91.75% while GPT-OSS-120B drops to 22.30%. GPT-OSS loses over half its accuracy between 256K and 1M tokens; Nemotron 3 Super loses under 5 percentage points across the same range.
Where Nemotron 3 Super underperforms is on conversational quality. On Arena-Hard V2, it scores 73.88% against GPT-OSS-120B’s 90.26%. This is not a flaw in the design — it’s a deliberate choice. The model is optimized for agentic execution: long-horizon task completion, tool use, multi-step reasoning. Chat pleasantness was not the objective function.
Independent evaluation from Artificial Analysis placed Nemotron 3 Super at score 36 on their Intelligence Index, ahead of GPT-OSS-120B (33) but behind the more recent Qwen3.5 122B A10B (42). They also flagged a real production concern: extreme verbosity. During their evaluation suite, Nemotron 3 Super generated 110 million tokens compared to GPT-OSS-120B’s 77 million in high-effort mode. That verbosity can erase much of the throughput advantage in practice, depending on the task.
How NVIDIA Built It: Training at 25 Trillion Tokens
The training pipeline follows three sequential phases. Pretraining runs on 25 trillion tokens in NVFP4 — NVIDIA’s 4-bit floating-point format native to Blackwell GPUs. The key distinction: the model trains in 4-bit precision from the first gradient update rather than being quantized after full-precision training. NVIDIA claims 4x improved memory and compute efficiency on B200 compared to FP8 on H100.
Supervised fine-tuning follows, drawing from 7 million curated samples out of a 40 million sample post-training corpus. The final phase is reinforcement learning via NeMo Gym, NVIDIA’s open-source RL environment library, run across 21 environment configurations with approximately 1.2 million rollouts. The RL phase is what NVIDIA credits for the model’s agentic performance — specifically its ability to sustain accuracy over multi-step task sequences rather than degrading after a few rounds of tool calls.
The model supports seven languages (English, French, German, Japanese, Spanish, Italian, and Chinese) plus 43 programming languages. Weights, datasets, and training recipes are available on Hugging Face under an open license, which means teams can customize and redeploy on their own infrastructure rather than routing through NVIDIA’s API.
Who Is Using It and For What
NVIDIA announced several early adopters at GTC. On the software engineering side, CodeRabbit, Factory, and Greptile are integrating Nemotron 3 Super into their AI agents alongside proprietary models — typically using it for tasks where long-context recall and coding accuracy matter more than conversational fluency. On the enterprise side, Amdocs, Palantir, Cadence, Dassault Systèmes, and Siemens are deploying and customizing the model for workflow automation. That last group — industrial and ERP-adjacent companies — is a signal. When Siemens is running open-weight models fine-tuned on internal data, inference throughput becomes a cost-of-goods-sold question, not a research metric.
For teams building coding-agent pipelines or evaluating models for production agentic deployments, Nemotron 3 Super’s combination of open weights, long-context stability, and throughput efficiency makes it worth serious evaluation — particularly for tasks that would ordinarily require a frontier proprietary model.
What to Watch
The verbosity issue is real and worth monitoring. A model that generates roughly 43% more tokens than GPT-OSS-120B to reach the same answer can erode much of its throughput advantage in token-billed or latency-sensitive environments. NVIDIA’s published numbers use token throughput as the primary metric; per-task cost may tell a different story depending on prompt structure and task type.
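The interaction between verbosity and throughput can be made concrete with the published figures. A rough per-task model, assuming token counts scale as they did in the Artificial Analysis run (real workloads will vary):

```python
# Raw throughput advantage vs. verbosity penalty, from the figures above.
raw_speedup = 478 / 215    # up to ~2.22x output tokens/sec on B200
verbosity = 110e6 / 77e6   # ~1.43x more tokens over the evaluation suite

effective = raw_speedup / verbosity
print(round(effective, 2))  # ~1.56x per-task wall-clock advantage
```

A meaningful advantage survives in wall-clock terms, but in token-billed settings the cost advantage shrinks by the same factor.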
Nemotron 3 Super also sits in the middle of NVIDIA’s announced roadmap. Nemotron 3 Nano (30B) is already available for edge deployments. Nemotron 3 Ultra, at 500 billion parameters, is expected later in 2026. The interesting question is whether the hybrid Mamba-Transformer architecture scales cleanly to Ultra’s size, or whether the tradeoffs that make Super efficient become harder to manage at significantly larger scale.
Conclusion
Nemotron 3 Super is the most compelling evidence yet that MoE architecture, state space models, and native low-precision training can be combined into a single system without sacrificing quality to achieve speed. Its 2.2x throughput advantage over GPT-OSS-120B at comparable accuracy on coding and long-context tasks is not a cherry-picked number — it holds up across independent evaluation. The verbosity problem and the chat quality gap are real limitations worth tracking, but neither undermines the core claim: for agentic workloads at scale, this model changes the calculus of inference economics. The release of open weights and training recipes means those economics are now accessible without going through NVIDIA’s cloud.
Further Reading
- NVIDIA Technical Blog: Introducing Nemotron 3 Super — The primary technical introduction with architecture diagrams and benchmark methodology from NVIDIA’s engineers.
- Artificial Analysis: Nemotron 3 Super Independent Evaluation — Third-party benchmark results including the verbosity finding and Intelligence Index score, worth reading before making deployment decisions.
- Nemotron 3 Super Technical Report (PDF) — The full research paper with training details, ablations, and complete benchmark tables for teams who need the numbers.
