Apr 25, 2026

DeepSeek V4 Pro and Flash: 1.6T Parameters, 1M Context, Frontier Pricing

DeepSeek released V4-Pro (1.6T params, 49B active) and V4-Flash (284B params, 13B active) on April 24, featuring 80.6% SWE-bench and near-frontier performance at a fraction of rival costs.


DeepSeek Returns with V4 — and Another Shock to the Market

Exactly one year after its V3 release rattled Silicon Valley, Chinese AI lab DeepSeek unveiled preview versions of DeepSeek-V4-Pro and DeepSeek-V4-Flash on April 24, 2026. The two models extend a pattern that has come to define the lab: near-frontier performance delivered at prices that force every major competitor to reconsider its cost structure.

Both models are open-weight, distributed under the MIT license, and available via Hugging Face as well as DeepSeek's own API and web chat interface.

Architecture: Three Key Innovations

DeepSeek built V4 around three architectural advances that distinguish it from its V3.2 predecessor.

Hybrid Attention Architecture combines Compressed Sparse Attention (compression rate 4) with Heavily Compressed Attention (compression rate 128), interleaved throughout the model's layers. This dramatically reduces the KV cache required for long-context inference. At one million tokens, V4-Pro requires only 27% of the inference FLOPs and 10% of the KV cache compared to V3.2 — a 3.7x FLOPs improvement and a 9.5x KV cache reduction. V4-Flash is even more aggressive: 10% of FLOPs (9.8x improvement) and 7% of KV cache (13.7x reduction).
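
As a quick sanity check on those numbers, the "x" multipliers are just the reciprocals of the remaining cost fractions; this back-of-envelope sketch uses only the figures quoted above (small rounding differences against the quoted 9.5x/9.8x/13.7x are expected, since the percentages are themselves rounded):

```python
# Fractions of V3.2's inference cost at 1M-token context, as quoted in
# the text; the implied improvement multiplier is simply 1/fraction.
reported_fractions = {
    "V4-Pro FLOPs":   0.27,  # quoted as a 3.7x improvement
    "V4-Pro KV":      0.10,  # quoted as a 9.5x reduction
    "V4-Flash FLOPs": 0.10,  # quoted as a 9.8x improvement
    "V4-Flash KV":    0.07,  # quoted as a 13.7x reduction
}

for name, frac in reported_fractions.items():
    print(f"{name}: {frac:.0%} of V3.2 -> ~{1 / frac:.1f}x")
```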

Manifold-Constrained Hyper-Connections address training stability for deep stacks, preventing the loss spikes that have historically plagued trillion-parameter models. Two additional mechanisms — Anticipatory Routing and SwiGLU Clamping — were introduced specifically to stabilize training at this scale.

Muon Optimizer replaces the standard AdamW optimizer for most parameters, delivering faster convergence and more stable training according to DeepSeek's technical report.

Model Specifications

DeepSeek-V4-Pro:

  • 1.6 trillion total parameters; 49 billion activated per forward pass
  • 61 layers, 7,168 hidden size
  • 384 routed experts (6 active) plus 1 shared expert
  • Pre-trained on 33 trillion tokens
  • FP8 precision (FP4 for routed experts)

DeepSeek-V4-Flash:

  • 284 billion total parameters; 13 billion activated per forward pass
  • 43 layers, 4,096 hidden size
  • 256 routed experts (6 active) plus 1 shared expert
  • Pre-trained on 32 trillion tokens
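
To make the expert counts concrete, here is a toy sketch of what "384 routed experts, 6 active, plus 1 shared expert" means for a single token. The gating math below is a generic top-k softmax illustration, not DeepSeek's disclosed router (the report pairs routing with a separate Anticipatory Routing mechanism not modeled here):

```python
import numpy as np

# Toy MoE routing with the V4-Pro shapes quoted above: 384 routed
# experts, top-6 selection, plus 1 always-on shared expert.
rng = np.random.default_rng(0)

N_ROUTED, TOP_K, HIDDEN = 384, 6, 7168
hidden_state = rng.standard_normal(HIDDEN)            # one token
router_w = rng.standard_normal((N_ROUTED, HIDDEN)) * 0.01

logits = router_w @ hidden_state                      # affinity per expert
top_idx = np.argsort(logits)[-TOP_K:]                 # the 6 best experts
weights = np.exp(logits[top_idx])
weights /= weights.sum()                              # normalize over the chosen 6

print("active routed experts:", sorted(top_idx.tolist()))
print("plus 1 shared expert -> 7 experts process this token")
```

Only the selected experts' feed-forward weights are touched per token, which is why a 1.6T-parameter model activates just 49B parameters per forward pass.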

Both models support native 1-million-token context, with thinking and non-thinking modes exposed through three effort levels: Non-Think (fastest), High (balanced), and Max (extended reasoning with reduced length penalties).
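
Since the API accepts OpenAI ChatCompletions-style requests (see the pricing section below), selecting an effort level would presumably look something like the following. Note that the model id `deepseek-v4-pro` and the `reasoning_effort` field name are assumptions for illustration; the article only confirms that ChatCompletions-format requests and three effort levels exist:

```python
import json

# Hypothetical request builder for a ChatCompletions-compatible endpoint.
EFFORT_LEVELS = {"non-think", "high", "max"}

def build_request(prompt: str, effort: str = "high") -> dict:
    """Build a ChatCompletions-style payload with an assumed effort field."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {sorted(EFFORT_LEVELS)}")
    return {
        "model": "deepseek-v4-pro",                       # assumed model id
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,                       # hypothetical field name
    }

payload = build_request("Summarize this diff.", effort="max")
print(json.dumps(payload, indent=2))
```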

Benchmark Results

DeepSeek's headline numbers are striking, especially on coding tasks:

Benchmark          | V4-Pro Score | Notes
-------------------|--------------|------------------------------------------
SWE-bench Verified | 80.6%        | Competitive with frontier models
LiveCodeBench      | 93.5%        | Top coding benchmark
Codeforces Rating  | 3,206        | 23rd percentile among human competitors
HMMT 2026          | 95.2%        | Advanced mathematics
Putnam-200 Pass@8  | 81.0%        | University-level math
CorpusQA 1M        | 62.0%        | Long-context retrieval

Honest disclosure: V4-Pro trails Gemini 3.1 Pro on general knowledge tasks (SimpleQA, MMLU-Pro), falls behind GPT-5.5 on agentic coding benchmarks (Terminal Bench: 67.9% vs. 75.1%), and underperforms Claude Opus 4.6 on long-document retrieval tasks (83.5% vs. 92.9% on MRCR). These are real gaps that enterprise users evaluating V4 for production should assess carefully.

Pricing: The DeepSeek Differentiator

Final public API pricing had not been formally announced at the time of publication, but DeepSeek's established pricing philosophy makes the range predictable. V4-Pro is reported at $3.48 per million output tokens, and V4-Flash is targeted at $0.28 per million output tokens, compared with GPT-5.5's $30 and Claude Opus 4.7's $25 per million output tokens. The API supports both OpenAI ChatCompletions and Anthropic Messages formats, with no long-context price premium.
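
A rough cost comparison at volume, using only the per-million output-token prices quoted above (reported and targeted figures, not confirmed final pricing):

```python
# Reported / targeted output-token prices, dollars per million tokens.
PRICES_PER_M_OUTPUT = {
    "DeepSeek V4-Pro":   3.48,
    "DeepSeek V4-Flash": 0.28,
    "GPT-5.5":          30.00,
    "Claude Opus 4.7":  25.00,
}

def cost(model: str, output_tokens: int) -> float:
    """Dollar cost of generating the given number of output tokens."""
    return PRICES_PER_M_OUTPUT[model] * output_tokens / 1_000_000

# Example: a workload producing 10M output tokens per month.
for model in PRICES_PER_M_OUTPUT:
    print(f"{model}: ${cost(model, 10_000_000):,.2f}/month")

print(f"GPT-5.5 / V4-Pro price ratio: {30.00 / 3.48:.1f}x")
```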

China Hardware Independence

One strategically significant detail in the V4 release: Huawei announced full Ascend supernode support for V4 deployment. This means Chinese organizations can run DeepSeek V4 entirely on domestic AI hardware without dependence on US GPU exports — a geopolitically meaningful capability given ongoing semiconductor export controls.

Reasoning Modes and Post-Training

Post-training employed separate domain-specialist experts trained via Supervised Fine-Tuning and Group Relative Policy Optimization, unified through On-Policy Distillation where the full model learns from all domain teachers simultaneously. The three reasoning effort levels show meaningful scaling: on SimpleQA-Verified, V4-Pro scores 45.0% in Non-Think mode and 57.9% in Max mode.

Known Limitations

DeepSeek's technical documentation is notably candid about limitations. Long-context retrieval accuracy degrades above 128K tokens, reaching 66% on MRCR at the full 1M context length. Multimodal support is under development and not yet available. The architecture is acknowledged as "relatively complex" by the developers themselves, which could affect third-party deployment and fine-tuning efforts.

Conclusion

DeepSeek V4 Pro and Flash represent a meaningful technical step forward from V3.2, with genuine innovations in attention efficiency, training stability, and reasoning capability. The combination of SWE-bench 80.6%, a 1-million-token context window, and sub-$4 per-million-token output pricing will exert continued downward pressure on frontier AI pricing. For development teams primarily focused on coding and mathematical reasoning who are cost-sensitive, DeepSeek V4 is a serious option. For organizations requiring top-tier multimodal performance, long-document retrieval, or agentic computer use, frontier proprietary models currently hold a measurable edge.

Pros

  • Best-in-class pricing at $3.48/M output tokens for a frontier-grade coding model
  • SWE-bench 80.6% and LiveCodeBench 93.5% are genuinely competitive on coding benchmarks
  • Native 1M-token context with significant KV cache efficiency gains vs V3.2
  • Open weights under MIT license enable self-hosting and fine-tuning
  • Transparent benchmark reporting including acknowledged weaknesses vs competitors

Cons

  • Trails GPT-5.5 on agentic computer use tasks (Terminal Bench 67.9% vs 75.1%)
  • Long-context retrieval degrades meaningfully above 128K tokens (66% vs 92.9% Claude Opus 4.6 at 1M on MRCR)
  • No multimodal support at launch; text-only limits applicability for vision-heavy workflows
  • Complex architecture may complicate third-party deployment and fine-tuning efforts


Key Features

1. DeepSeek-V4-Pro: 1.6T total parameters, 49B activated per token, 61 layers, pre-trained on 33 trillion tokens
2. DeepSeek-V4-Flash: 284B total parameters, 13B activated per token, 256 routed experts, pre-trained on 32 trillion tokens
3. Hybrid Attention Architecture reduces V4-Pro KV cache by 90% vs V3.2 at 1M token context
4. SWE-bench Verified 80.6% and LiveCodeBench 93.5%, competitive with frontier proprietary models on coding
5. V4-Pro at $3.48/M output tokens vs GPT-5.5 at $30/M, roughly 8.6x cheaper
6. Full Huawei Ascend chip support enables China deployment without US GPU dependencies
7. MIT license with open weights on Hugging Face

Key Insights

  • DeepSeek's efficiency gains (27% FLOPs, 10% KV cache vs V3.2 at 1M context) represent a genuine architectural advance, not just incremental scaling
  • The $3.48 per million output tokens price point will force further downward pressure on frontier model pricing across the industry
  • SWE-bench 80.6% is competitive with Claude Opus 4.7 and marks DeepSeek's strongest coding performance to date
  • Huawei Ascend chip compatibility signals that China's domestic AI ecosystem is maturing toward hardware independence from US suppliers
  • Honest benchmark disclosure — acknowledging gaps versus Gemini, GPT-5.5, and Claude — increases developer trust in V4's evaluation claims
  • The dual-model strategy (Pro + Flash) mirrors Anthropic's Opus/Haiku and OpenAI's flagship/mini tiering, serving both performance and cost-optimization use cases
  • MIT licensing continues DeepSeek's open-weight philosophy, enabling enterprise self-hosting without API dependency
  • Long-context retrieval degradation above 128K remains a practical limitation for enterprise document processing workflows
