Apr 25, 2026

DeepSeek V4 Pro and Flash: 1.6T Parameters, 1M Context, Frontier Pricing

DeepSeek released V4-Pro (1.6T params, 49B active) and V4-Flash (284B params, 13B active) on April 24, featuring 80.6% SWE-bench and near-frontier performance at a fraction of rival costs.


DeepSeek Returns with V4 — and Another Shock to the Market

Exactly one year after its V3 release rattled Silicon Valley, Chinese AI lab DeepSeek unveiled preview versions of DeepSeek-V4-Pro and DeepSeek-V4-Flash on April 24, 2026. The two models extend a pattern that has come to define the lab: near-frontier performance delivered at prices that force every major competitor to reconsider its cost structure.

Both models are open-weight, distributed under the MIT license, and available via Hugging Face as well as DeepSeek's own API and web chat interface.

Architecture: Three Key Innovations

DeepSeek built V4 around three architectural advances that distinguish it from its V3.2 predecessor.

Hybrid Attention Architecture combines Compressed Sparse Attention (compression rate 4) with Heavily Compressed Attention (compression rate 128), interleaved throughout the model's layers. This dramatically reduces the KV cache required for long-context inference. At one million tokens, V4-Pro requires only 27% of the inference FLOPs and 10% of the KV cache compared to V3.2 — a 3.7x FLOPs improvement and a 9.5x KV cache reduction. V4-Flash is even more aggressive: 10% of FLOPs (9.8x improvement) and 7% of KV cache (13.7x reduction).
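
As a quick sanity check on those numbers, the "x" multipliers are just the reciprocals of the remaining cost fractions; this back-of-envelope sketch uses only the figures quoted above (small rounding differences against the quoted 9.5x/9.8x/13.7x are expected, since the percentages are themselves rounded):

```python
# Fractions of V3.2's inference cost at 1M-token context, as quoted in
# the text; the implied improvement multiplier is simply 1/fraction.
reported_fractions = {
    "V4-Pro FLOPs":   0.27,  # quoted as a 3.7x improvement
    "V4-Pro KV":      0.10,  # quoted as a 9.5x reduction
    "V4-Flash FLOPs": 0.10,  # quoted as a 9.8x improvement
    "V4-Flash KV":    0.07,  # quoted as a 13.7x reduction
}

for name, frac in reported_fractions.items():
    print(f"{name}: {frac:.0%} of V3.2 -> ~{1 / frac:.1f}x")
```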

Manifold-Constrained Hyper-Connections address training stability for deep stacks, preventing the loss spikes that have historically plagued trillion-parameter models. Two additional mechanisms — Anticipatory Routing and SwiGLU Clamping — were introduced specifically to stabilize training at this scale.

Muon Optimizer replaces the standard AdamW optimizer for most parameters, delivering faster convergence and more stable training according to DeepSeek's technical report.

Model Specifications

DeepSeek-V4-Pro:

  • 1.6 trillion total parameters; 49 billion activated per forward pass
  • 61 layers, 7,168 hidden size
  • 384 routed experts (6 active) plus 1 shared expert
  • Pre-trained on 33 trillion tokens
  • FP8 precision (FP4 for routed experts)

DeepSeek-V4-Flash:

  • 284 billion total parameters; 13 billion activated per forward pass
  • 43 layers, 4,096 hidden size
  • 256 routed experts (6 active) plus 1 shared expert
  • Pre-trained on 32 trillion tokens
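
To make the expert counts concrete, here is a toy sketch of what "384 routed experts, 6 active, plus 1 shared expert" means for a single token. The gating math below is a generic top-k softmax illustration, not DeepSeek's disclosed router (the report pairs routing with a separate Anticipatory Routing mechanism not modeled here):

```python
import numpy as np

# Toy MoE routing with the V4-Pro shapes quoted above: 384 routed
# experts, top-6 selection, plus 1 always-on shared expert.
rng = np.random.default_rng(0)

N_ROUTED, TOP_K, HIDDEN = 384, 6, 7168
hidden_state = rng.standard_normal(HIDDEN)            # one token
router_w = rng.standard_normal((N_ROUTED, HIDDEN)) * 0.01

logits = router_w @ hidden_state                      # affinity per expert
top_idx = np.argsort(logits)[-TOP_K:]                 # the 6 best experts
weights = np.exp(logits[top_idx])
weights /= weights.sum()                              # normalize over the chosen 6

print("active routed experts:", sorted(top_idx.tolist()))
print("plus 1 shared expert -> 7 experts process this token")
```

Only the selected experts' feed-forward weights are touched per token, which is why a 1.6T-parameter model activates just 49B parameters per forward pass.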

Both models support native 1-million-token context, with thinking and non-thinking modes exposed through three effort levels: Non-Think (fastest), High (balanced), and Max (extended reasoning with reduced length penalties).
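
Since the API accepts OpenAI ChatCompletions-style requests (see the pricing section below), selecting an effort level would presumably look something like the following. Note that the model id `deepseek-v4-pro` and the `reasoning_effort` field name are assumptions for illustration; the article only confirms that ChatCompletions-format requests and three effort levels exist:

```python
import json

# Hypothetical request builder for a ChatCompletions-compatible endpoint.
EFFORT_LEVELS = {"non-think", "high", "max"}

def build_request(prompt: str, effort: str = "high") -> dict:
    """Build a ChatCompletions-style payload with an assumed effort field."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {sorted(EFFORT_LEVELS)}")
    return {
        "model": "deepseek-v4-pro",                       # assumed model id
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": effort,                       # hypothetical field name
    }

payload = build_request("Summarize this diff.", effort="max")
print(json.dumps(payload, indent=2))
```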

Benchmark Results

DeepSeek's headline numbers are striking, especially on coding tasks:

Benchmark          | V4-Pro Score | Notes
-------------------|--------------|------------------------------------------
SWE-bench Verified | 80.6%        | Competitive with frontier models
LiveCodeBench      | 93.5%        | Top coding benchmark
Codeforces Rating  | 3,206        | 23rd percentile among human competitors
HMMT 2026          | 95.2%        | Advanced mathematics
Putnam-200 Pass@8  | 81.0%        | University-level math
CorpusQA 1M        | 62.0%        | Long-context retrieval

Honest disclosure: V4-Pro trails Gemini 3.1 Pro on general knowledge tasks (SimpleQA, MMLU-Pro), falls behind GPT-5.5 on agentic coding benchmarks (Terminal Bench: 67.9% vs. 75.1%), and underperforms Claude Opus 4.6 on long-document retrieval tasks (83.5% vs. 92.9% on MRCR). These are real gaps that enterprise users evaluating V4 for production should assess carefully.

Pricing: The DeepSeek Differentiator

Final public API pricing had not been formally announced at the time of publication, but DeepSeek's established pricing philosophy makes the range predictable. V4-Pro is reported at $3.48 per million output tokens, and V4-Flash is targeted at $0.28 per million output tokens, compared with GPT-5.5's $30 and Claude Opus 4.7's $25 per million output tokens. The API supports both OpenAI ChatCompletions and Anthropic Messages formats, with no long-context price premium.
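
A rough cost comparison at volume, using only the per-million output-token prices quoted above (reported and targeted figures, not confirmed final pricing):

```python
# Reported / targeted output-token prices, dollars per million tokens.
PRICES_PER_M_OUTPUT = {
    "DeepSeek V4-Pro":   3.48,
    "DeepSeek V4-Flash": 0.28,
    "GPT-5.5":          30.00,
    "Claude Opus 4.7":  25.00,
}

def cost(model: str, output_tokens: int) -> float:
    """Dollar cost of generating the given number of output tokens."""
    return PRICES_PER_M_OUTPUT[model] * output_tokens / 1_000_000

# Example: a workload producing 10M output tokens per month.
for model in PRICES_PER_M_OUTPUT:
    print(f"{model}: ${cost(model, 10_000_000):,.2f}/month")

print(f"GPT-5.5 / V4-Pro price ratio: {30.00 / 3.48:.1f}x")
```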

China Hardware Independence

One strategically significant detail in the V4 release: Huawei announced full Ascend supernode support for V4 deployment. This means Chinese organizations can run DeepSeek V4 entirely on domestic AI hardware without dependence on US GPU exports — a geopolitically meaningful capability given ongoing semiconductor export controls.

Reasoning Modes and Post-Training

Post-training employed separate domain-specialist experts trained via Supervised Fine-Tuning and Group Relative Policy Optimization, unified through On-Policy Distillation where the full model learns from all domain teachers simultaneously. The three reasoning effort levels show meaningful scaling: on SimpleQA-Verified, V4-Pro scores 45.0% in Non-Think mode and 57.9% in Max mode.

Known Limitations

DeepSeek's technical documentation is notably candid about limitations. Long-context retrieval accuracy degrades above 128K tokens, reaching 66% on MRCR at the full 1M context length. Multimodal support is under development and not yet available. The architecture is acknowledged as "relatively complex" by the developers themselves, which could affect third-party deployment and fine-tuning efforts.

Conclusion

DeepSeek V4 Pro and Flash represent a meaningful technical step forward from V3.2, with genuine innovations in attention efficiency, training stability, and reasoning capability. The combination of SWE-bench 80.6%, a 1-million-token context window, and sub-$4 per-million-token output pricing will exert continued downward pressure on frontier AI pricing. For development teams primarily focused on coding and mathematical reasoning who are cost-sensitive, DeepSeek V4 is a serious option. For organizations requiring top-tier multimodal performance, long-document retrieval, or agentic computer use, frontier proprietary models currently hold a measurable edge.

Pros

  • Best-in-class pricing at $3.48/M output tokens for a frontier-grade coding model
  • SWE-bench 80.6% and LiveCodeBench 93.5% are genuinely competitive on coding benchmarks
  • Native 1M-token context with significant KV cache efficiency gains vs V3.2
  • Open weights under MIT license enable self-hosting and fine-tuning
  • Transparent benchmark reporting including acknowledged weaknesses vs competitors

Cons

  • Trails GPT-5.5 on agentic computer use tasks (Terminal Bench 67.9% vs 75.1%)
  • Long-context retrieval degrades meaningfully above 128K tokens (66% vs 92.9% Claude Opus 4.6 at 1M on MRCR)
  • No multimodal support at launch; text-only limits applicability for vision-heavy workflows
  • Complex architecture may complicate third-party deployment and fine-tuning efforts


Key Features

1. DeepSeek-V4-Pro: 1.6T total parameters, 49B activated per token, 61 layers, pre-trained on 33 trillion tokens
2. DeepSeek-V4-Flash: 284B total parameters, 13B activated per token, 256 routed experts, pre-trained on 32 trillion tokens
3. Hybrid Attention Architecture reduces V4-Pro KV cache by 90% vs V3.2 at 1M token context
4. SWE-bench Verified 80.6% and LiveCodeBench 93.5%, competitive with frontier proprietary models on coding
5. V4-Pro at $3.48/M output tokens vs GPT-5.5 at $30/M, roughly 8.6x cheaper
6. Full Huawei Ascend chip support enables China deployment without US GPU dependencies
7. MIT license with open weights on Hugging Face

Key Insights

  • DeepSeek's efficiency gains (27% FLOPs, 10% KV cache vs V3.2 at 1M context) represent a genuine architectural advance, not just incremental scaling
  • The $3.48 per million output tokens price point will force further downward pressure on frontier model pricing across the industry
  • SWE-bench 80.6% is competitive with Claude Opus 4.7 and marks DeepSeek's strongest coding performance to date
  • Huawei Ascend chip compatibility signals that China's domestic AI ecosystem is maturing toward hardware independence from US suppliers
  • Honest benchmark disclosure — acknowledging gaps versus Gemini, GPT-5.5, and Claude — increases developer trust in V4's evaluation claims
  • The dual-model strategy (Pro + Flash) mirrors Anthropic's Opus/Haiku and OpenAI's flagship/mini tiering, serving both performance and cost-optimization use cases
  • MIT licensing continues DeepSeek's open-weight philosophy, enabling enterprise self-hosting without API dependency
  • Long-context retrieval degradation above 128K remains a practical limitation for enterprise document processing workflows
