DeepSeek V4 Pro: Frontier Coding, Open Weights, Safety Gap

The Open-Weight Model That Matches Closed-Model Coding Performance

DeepSeek released V4 Pro on April 24, 2026, and two months of production data have confirmed the headline numbers. The model scores 80.6% on SWE-bench Verified, tied with Gemini 3.1 Pro and the highest score among open-weight models. For software engineering teams evaluating whether an open-weight model can replace a closed-model API, that score matters because it matches a leading closed model on the same benchmark. The catch is that a May 2026 NIST evaluation found a safety gap that deserves equal attention.

V4 Pro uses a Mixture-of-Experts (MoE) architecture with 1.6 trillion total parameters but only 49 billion activated per token — the same strategy that made DeepSeek V3 cost-efficient, taken further. The context window is 1 million tokens with a 384,000-token maximum output, the model is released under the MIT License, and the weights are available on Hugging Face (865GB for the Pro variant). The API price has been $0.87 per million output tokens since May 22, when DeepSeek made permanent a 50% discount that was originally set to expire.
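A quick back-of-envelope check helps put those figures in proportion. The sketch below uses only the numbers quoted above; it assumes "865GB" means decimal gigabytes and that the download is dominated by model weights, neither of which is confirmed here. The function names are illustrative.

```python
# Figures quoted in the article.
TOTAL_PARAMS = 1.6e12      # total parameters
ACTIVE_PARAMS = 49e9       # parameters activated per token
CHECKPOINT_BYTES = 865e9   # assumption: "865GB" is decimal gigabytes


def active_fraction(active: float, total: float) -> float:
    """Share of the model's parameters used for any single token."""
    return active / total


def bits_per_param(checkpoint_bytes: float, total_params: float) -> float:
    """Average storage per parameter, assuming the file is almost all weights."""
    return checkpoint_bytes * 8 / total_params


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    bits = bits_per_param(CHECKPOINT_BYTES, TOTAL_PARAMS)
    print(f"Active per token: {frac:.2%} of parameters")
    print(f"Average storage: {bits:.2f} bits per parameter")
```

Under those assumptions, roughly 3% of the parameters run for each token, and the checkpoint averages a little over 4 bits per parameter. That would be consistent with low-precision weight storage, though the weight format is not stated here.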

What the Benchmarks Actually Show

SWE-bench Verified is the most credible coding benchmark available because it uses real GitHub issues rather than synthetic tasks. At 80.6%, V4 Pro sits alongside Gemini 3.1 Pro and 0.1 points ahead of MiniMax M3, which we covered last week. On Codeforces, V4 Pro’s rating of 3206 places it 23rd among all human competitors, and that rating comes from live competitive programming contests rather than a synthetic benchmark.

On LiveCodeBench, V4 Pro scores 93.5%. The Artificial Analysis Intelligence Index ranks it second among open-weight reasoning models, behind Kimi K2.6. Against closed models, the picture is more mixed. GPT-5.5 leads on Terminal-Bench 2.0 (82.7% vs 67.9%) and GPQA Diamond (93.6% vs 90.1%). On SWE-bench Pro, which also tests real-world GitHub issues, GPT-5.5 scores 58.6% to V4 Pro’s 55.4%, with Claude Opus 4.7 ahead of both at 64.3%.

The architecture has a specific innovation worth understanding: a hybrid attention mechanism combining Compressed Sparse Attention (CSA), which selects the 1,024 most relevant KV pairs per query, and Heavily Compressed Attention (HCA), which provides cheap global context from distant tokens. In the 1-million-token context setting, V4 Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache that DeepSeek V3.2 needs. That efficiency gain is what makes 1M-context inference economically viable at current pricing.

The model was pre-trained on 32 trillion tokens using the Muon optimizer, with a two-stage post-training pipeline: domain-expert fine-tuning followed by unified consolidation via on-policy distillation.
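To make the idea behind CSA's selection step concrete, here is a minimal pure-Python sketch of generic top-k sparse attention: score every key against the query, keep the k best, and run softmax attention over only those. This is not DeepSeek's implementation. The function names are hypothetical, the compression of KV entries is omitted, and the sketch still scores every key, so it illustrates the selection logic rather than the efficiency gain.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_sparse_attention(query, keys, values, k):
    """Attend only to the k keys with the highest scaled dot-product score.

    Returns the attention output and the sorted indices that were kept.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]
    k = min(k, len(keys))
    # Indices of the k highest-scoring keys.
    kept = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in kept])
    dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, kept))
        for d in range(dim)
    ]
    return output, sorted(kept)


if __name__ == "__main__":
    q = [1.0, 0.0]
    ks = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.0]]
    vs = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [0.0, 2.0]]
    out, kept = topk_sparse_attention(q, ks, vs, k=2)
    print("kept indices:", kept)
    print("output:", out)
```

With k fixed at 1,024, the softmax and the weighted sum touch a constant number of entries no matter how long the context is. That is the intuition for why per-token cost can stay manageable at 1M tokens.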

The NIST Evaluation That Changes the Calculation

In May 2026, NIST’s Center for AI Standards and Innovation (CAISI) completed an evaluation of DeepSeek V4 Pro using non-public benchmarks across five domains: cybersecurity, software engineering, natural sciences, abstract reasoning, and mathematics. The conclusion: V4 Pro performs similarly to GPT-5 — a model released roughly eight months before the NIST report. In other words, V4 Pro is frontier-grade by open-weight standards, but still behind the current closed-model frontier by approximately two generations.

More consequential for enterprise buyers is the safety finding. NIST found that DeepSeek’s most secure model responded to 94% of malicious jailbreak requests. US frontier reference models responded to 8%. That 86-point gap is not marginal; it is a qualitative difference in how the model handles adversarial inputs. For internal tooling with trusted users, this may be acceptable. For customer-facing deployments, regulated industries, or any system where untrusted users can submit inputs, it requires explicit mitigation.

This is not a reason to discard V4 Pro. It is a reason to be precise about where it fits. The NIST report frames the gap as one that Chinese AI labs have historically closed within 12 to 18 months. Whether that holds for the safety dimension as well as the capability dimension remains an open question.

The Price Calculation Teams Are Running

The cost gap between V4 Pro and closed-model alternatives is large enough to force a decision. At $0.87 per million output tokens, V4 Pro is 28.7x cheaper than Claude Opus 4.8 ($25/M output) and 34.5x cheaper than GPT-5.5, which implies a GPT-5.5 price of roughly $30/M output. At scale, meaning tens of millions of tokens per day, that difference is not a rounding error; it is a budget category.
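The arithmetic is easy to reproduce. The sketch below uses the two prices quoted above and a hypothetical workload of 50 million output tokens per day, chosen only to illustrate "tens of millions". It counts output tokens only, so input-token charges and self-hosting costs are left out.

```python
# Prices in USD per million output tokens, as quoted in the article.
V4_PRO_PRICE = 0.87
OPUS_PRICE = 25.00

# Assumption for illustration: a workload of 50M output tokens per day.
TOKENS_PER_DAY = 50_000_000


def price_ratio(other_price: float, base_price: float) -> float:
    """How many times more expensive other_price is than base_price."""
    return other_price / base_price


def daily_cost(price_per_million: float, tokens_per_day: int) -> float:
    """Daily spend in USD for a given output-token volume."""
    return price_per_million * tokens_per_day / 1_000_000


if __name__ == "__main__":
    ratio = price_ratio(OPUS_PRICE, V4_PRO_PRICE)
    print(f"Opus / V4 Pro price ratio: {ratio:.1f}x")
    print(f"V4 Pro daily cost: ${daily_cost(V4_PRO_PRICE, TOKENS_PER_DAY):,.2f}")
    print(f"Opus daily cost:   ${daily_cost(OPUS_PRICE, TOKENS_PER_DAY):,.2f}")
```

At that assumed volume, the gap is about $43.50 versus $1,250 per day for output tokens alone.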

Microsoft disclosed on June 18 that it is evaluating V4 Pro as a lower-cost engine for its Copilot Cowork enterprise agent, which signals that the model’s quality bar is high enough for serious enterprise use cases. The open-weight release also matters for a different reason: teams that host V4 Pro on their own infrastructure eliminate data residency concerns and gain an exit ramp from API vendor dependency. The original V4 launch already shifted the conversation toward on-premise deployment; the Pro variant extends that calculus to frontier-grade coding tasks.

The practical decision framework that most teams end up with looks like this: V4 Pro handles the bulk of agentic coding work where the task is well-defined and the inputs are trusted. Claude Opus 4.8 or GPT-5.5 handles the harder 5%: complex multi-step reasoning, high-stakes agentic loops, or any context where the safety difference matters. The pricing gap is too large to ignore, and the 80.6% SWE-bench Verified score means V4 Pro is not a compromise for the tasks it handles best.
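That framework can be expressed as a small routing rule. The sketch below is one possible encoding, not a recommended policy. The `Task` fields, the tier labels, and the idea of tagging tasks with these three flags are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task descriptor; the fields are illustrative."""
    well_defined: bool
    trusted_input: bool
    high_stakes: bool = False


def choose_model_tier(task: Task) -> str:
    """Route routine, trusted work to the open-weight model.

    Anything ambiguous, untrusted, or high-stakes goes to a closed model,
    mirroring the split described in the article.
    """
    if task.well_defined and task.trusted_input and not task.high_stakes:
        return "open-weight"   # e.g. V4 Pro
    return "closed"            # e.g. Claude Opus 4.8 or GPT-5.5


if __name__ == "__main__":
    routine = Task(well_defined=True, trusted_input=True)
    public_facing = Task(well_defined=True, trusted_input=False)
    print(choose_model_tier(routine))
    print(choose_model_tier(public_facing))
```

Defaulting to the closed tier whenever any flag is in doubt keeps untrusted inputs away from the model with the weaker jailbreak results, at the cost of paying more for borderline tasks.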

What to Watch Next

DeepSeek has historically followed major releases with safety-focused updates within two to four months. Whether V4 Pro’s 94% jailbreak response rate drops meaningfully in a subsequent patch will determine whether enterprise deployments currently blocked on safety compliance can proceed. NIST has indicated it will re-evaluate if a significant safety update is released.

The broader signal from V4 Pro is that the open-weight tier is now competitive on coding performance in ways it was not six months ago. For teams building coding agents, the question is no longer whether open weights are viable — it is how to manage the safety and compliance differences that remain. V4 Pro makes that a concrete operational problem rather than a theoretical one.
