Nvidia Vera Rubin NVL72: First Hardware Samples Deliver 10x Cheaper Inference Than Blackwell
CNBC gets exclusive first look at Nvidia's Vera Rubin system with 72 GPUs delivering 3.6 EFLOPS, 288GB HBM4 per GPU, and 100% liquid cooling as first samples ship to partners.
From Announcement to Silicon: Vera Rubin Gets Real
On February 25, 2026, CNBC published an exclusive first look at Nvidia's Vera Rubin AI system, confirming that first hardware samples have been delivered to partners and the platform is in full production. Originally announced at CES 2026 in January, Vera Rubin is now transitioning from roadmap to reality with concrete specifications, partner commitments, and delivery timelines for the second half of 2026.
This is not a paper launch. Nvidia is shipping physical silicon to AWS, Google Cloud, Microsoft Azure, Oracle Cloud Infrastructure, CoreWeave, Lambda, Nebius, and Nscale for integration and testing ahead of general availability.
Six Chips, One System: Extreme Co-Design
Vera Rubin represents what Nvidia calls "extreme co-design," a platform built from six custom chip types engineered to work together as a unified system rather than assembled from off-the-shelf components:
| Component | Specification |
|---|---|
| Rubin GPU | 50 PFLOPS NVFP4 inference, 35 PFLOPS training per GPU |
| Vera CPU | 88 cores, 176 threads, 1.5TB LPDDR5x, 1.2 TB/s bandwidth |
| NVLink 6 Switch | 260 TB/s scale-up bandwidth across 72 GPUs |
| ConnectX-9 SuperNIC | Network interface for scale-out connectivity |
| BlueField-4 DPU | Data processing unit for infrastructure offload |
| Spectrum-6 Switch | Ethernet switching for multi-rack clusters |
The NVL72 rack configuration packs 72 Rubin GPUs and 36 Vera CPUs into a single system. Every component was designed in tandem, from the memory subsystem to the networking fabric, to eliminate the bottlenecks that emerge when individual chips are optimized independently.
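As a quick sanity check, the rack-level throughput figures fall straight out of the per-GPU numbers in the table above. A minimal sketch in Python, using Nvidia's published per-GPU values (the rounding is ours):

```python
# Back-of-envelope check: NVL72 rack totals from per-GPU specs.
# Per-GPU figures are Nvidia's published numbers.

GPUS_PER_RACK = 72
INFERENCE_PFLOPS_PER_GPU = 50   # NVFP4 inference per Rubin GPU
TRAINING_PFLOPS_PER_GPU = 35    # training per Rubin GPU

rack_inference_eflops = GPUS_PER_RACK * INFERENCE_PFLOPS_PER_GPU / 1000
rack_training_eflops = GPUS_PER_RACK * TRAINING_PFLOPS_PER_GPU / 1000

print(f"Inference: {rack_inference_eflops:.1f} EFLOPS")  # 3.6 EFLOPS
print(f"Training:  {rack_training_eflops:.2f} EFLOPS")   # 2.52, quoted as 2.5
```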
Memory: 288GB HBM4 Per GPU
Each Rubin GPU package contains eight stacks of HBM4 memory delivering 288GB of capacity and 22 TB/s of bandwidth per GPU. At the rack level, the NVL72 system provides 20.7TB of HBM4 capacity and 54TB of LPDDR5x capacity.
This memory architecture is designed for the trillion-parameter models that define the current frontier of AI development. Training and serving models at this scale requires not just raw compute but the ability to keep model weights, activations, and intermediate states in fast memory close to the processing units.
The 22 TB/s bandwidth per GPU represents a substantial leap over Blackwell's HBM3e implementation, reducing the memory bottleneck that limits throughput in large-model inference.
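The capacity and bandwidth figures also bound what a rack can serve. A hedged back-of-envelope sketch: the rack totals follow from the per-GPU specs, while the decode estimate assumes a hypothetical dense trillion-parameter model quantized to NVFP4 and a purely bandwidth-bound single decode stream, ignoring batching, KV caches, and interconnect overhead that dominate real deployments:

```python
# Rack memory totals plus a crude bandwidth-bound decode ceiling.
# Workload numbers below are illustrative assumptions, not Nvidia figures.

GPUS, CPUS = 72, 36
hbm4_per_gpu_gb = 288
lpddr_per_cpu_tb = 1.5

print(f"Rack HBM4:    {GPUS * hbm4_per_gpu_gb / 1000:.1f} TB")  # ~20.7 TB
print(f"Rack LPDDR5x: {CPUS * lpddr_per_cpu_tb:.0f} TB")        # 54 TB

# Hypothetical dense 1T-parameter model at NVFP4 (~0.5 bytes/param).
params = 1e12
weight_bytes = params * 0.5          # ~500 GB; fits comfortably in rack HBM

hbm_bandwidth = 22e12                # 22 TB/s per GPU
# If weights are sharded evenly across 72 GPUs and every generated token
# streams each shard once, the bandwidth-bound ceiling per decode stream is:
tokens_per_sec = hbm_bandwidth / (weight_bytes / GPUS)
print(f"~{tokens_per_sec:.0f} tokens/s per stream (theoretical upper bound)")
```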
Performance: 10x Cost Reduction Over Blackwell
Nvidia's headline claims are aggressive but specific:
- 10x reduction in inference token cost compared to Blackwell
- 4x reduction in GPUs needed to train Mixture of Experts (MoE) models compared to Blackwell
- 5x greater inference performance per GPU over Blackwell GB200
- 3.6 EFLOPS of NVFP4 inference at the rack level
- 2.5 EFLOPS of training at the rack level
The 10x cost reduction for inference is the most commercially significant number. If validated in production environments, this would fundamentally change the economics of serving large language models. Current inference costs are a primary barrier to deploying AI at scale, and a 10x reduction enables use cases that are currently uneconomical.
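To see why the 10x figure matters commercially, consider a hedged token-economics sketch. Every dollar and throughput input below is a hypothetical placeholder, not disclosed Nvidia or cloud pricing; the point is how serving cost scales, not the absolute numbers:

```python
# Illustrative token-economics sketch. All inputs are hypothetical assumptions.

def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_sec: float) -> float:
    """USD per 1M generated tokens for one GPU at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hour_usd / tokens_per_hour * 1e6

# Hypothetical Blackwell-class baseline: $5/GPU-hour, 1,000 tokens/s sustained.
baseline = cost_per_million_tokens(gpu_hour_usd=5.0, tokens_per_sec=1_000)

# A 10x token-cost reduction, however it splits between price and throughput:
rubin = baseline / 10

print(f"Baseline:    ${baseline:.2f} per 1M tokens")   # $1.39
print(f"10x cheaper: ${rubin:.3f} per 1M tokens")      # $0.139
```

At that scale, workloads priced out at dollars per million tokens, such as always-on agents or bulk document processing, start to pencil out.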
The 4x reduction in GPU count for training MoE models reflects the growing adoption of sparse expert architectures in models such as Mixtral, DeepSeek, and the recently released Liquid AI LFM2-24B-A2B. Fewer GPUs per training run translates directly into lower costs and faster iteration cycles.
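The MoE advantage comes from sparse activation: each token only touches a small fraction of the model's parameters. A minimal sketch of the arithmetic, with parameter counts loosely modeled on a Mixtral-8x7B-style layout as an illustration (the 4x claim also depends on hardware and software factors this sketch ignores):

```python
# Why MoE models need less compute per token than dense models of equal size.
# Parameter counts are illustrative, loosely Mixtral-8x7B-shaped assumptions.

def moe_active_params(shared: float, per_expert: float, n_experts: int, top_k: int):
    """Parameters touched per token: shared layers plus top-k routed experts."""
    total = shared + per_expert * n_experts
    active = shared + per_expert * top_k
    return total, active

total, active = moe_active_params(shared=1.3e9, per_expert=5.7e9,
                                  n_experts=8, top_k=2)
print(f"Total params:   {total / 1e9:.1f}B")   # ~46.9B
print(f"Active/token:   {active / 1e9:.1f}B")  # ~12.7B

# Training FLOPs per token scale with active parameters (~6 * N_active for a
# forward+backward pass), so sparse routing cuts the GPU-hours a run needs.
print(f"Compute saving vs equal-size dense: ~{total / active:.1f}x")
```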
100% Liquid Cooling: A First for Nvidia
Vera Rubin is Nvidia's first system that is entirely liquid cooled. This is not a minor engineering detail: at the power densities required for next-generation AI systems, air cooling becomes physically inadequate, and evaporative cooling consumes enormous amounts of water.
Liquid cooling enables higher power delivery per GPU, denser rack configurations, and significantly lower water consumption than traditional data center cooling. As AI infrastructure scales to hundreds of thousands of GPUs, the cooling system becomes as strategically important as the chips themselves.
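The physics behind the shift is straightforward: water carries roughly four times more heat per kilogram than air and is about 800 times denser. A minimal sketch using standard specific-heat values; the rack power figure is a hypothetical placeholder, since Nvidia has not published one here:

```python
# Coolant flow needed to remove a heat load: Q = m_dot * c_p * delta_T.
# Rack power is a hypothetical placeholder; material constants are standard.

RACK_POWER_W = 120_000    # assumed ~120 kW rack heat load (illustrative)
DELTA_T = 10              # K temperature rise across the coolant loop

CP_WATER = 4186           # J/(kg*K)
CP_AIR = 1005             # J/(kg*K)
RHO_AIR = 1.2             # kg/m^3 at room conditions

water_kg_s = RACK_POWER_W / (CP_WATER * DELTA_T)
air_kg_s = RACK_POWER_W / (CP_AIR * DELTA_T)
air_m3_s = air_kg_s / RHO_AIR

print(f"Water: {water_kg_s:.1f} kg/s (~{water_kg_s:.1f} L/s)")  # ~2.9 L/s
print(f"Air:   {air_kg_s:.1f} kg/s ≈ {air_m3_s:.0f} m^3/s")     # ~10 m^3/s
```

Moving ten cubic meters of air per second through a single rack is impractical; a few liters of water per second is routine plumbing.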
Cloud Partner Deployment Timeline
Among the first cloud providers to deploy Vera Rubin-based instances in the second half of 2026:
| Cloud Provider | Status |
|---|---|
| AWS | Confirmed deployment partner |
| Google Cloud | Confirmed deployment partner |
| Microsoft Azure | Confirmed deployment partner |
| Oracle Cloud Infrastructure | Confirmed deployment partner |
| CoreWeave | Confirmed deployment partner |
| Lambda | Confirmed deployment partner |
| Nebius | Confirmed deployment partner |
| Nscale | Confirmed deployment partner |
The breadth of the partner list suggests Nvidia expects demand to exceed supply, a pattern that has repeated with every recent GPU generation. Organizations planning large-scale AI infrastructure for late 2026 and 2027 will need to factor Vera Rubin availability into their hardware roadmaps.
Competitive Landscape
Vera Rubin arrives as Nvidia faces increasing competition from custom AI chip makers. AMD's MI400 series, Google's TPU v6, and startups like MatX, Taalas, and Cerebras are all targeting Nvidia's dominance with alternative approaches.
However, Nvidia's co-design strategy, building the CPU, GPU, network, and software stack as an integrated system, creates a switching cost that individual chip competitors cannot easily replicate. The CUDA ecosystem, combined with deep integration into every major cloud provider, gives Nvidia a structural advantage that extends beyond raw silicon performance.
Conclusion
Nvidia's Vera Rubin NVL72 moves from announcement to production with specifications that promise a generational leap in AI infrastructure economics. The 10x inference cost reduction, 288GB HBM4 per GPU, and 100% liquid cooling address the three primary constraints of current AI deployment: cost, memory, and power efficiency. For organizations planning AI infrastructure investments, Vera Rubin sets the benchmark that every competitor will be measured against through 2027 and beyond.
Pros
- 10x inference cost reduction over Blackwell would make previously uneconomical AI use cases viable
- 288GB HBM4 per GPU with 22 TB/s bandwidth addresses memory bottlenecks for trillion-parameter models
- Six-chip co-design ensures CPU, GPU, networking, and memory work as an optimized system
- 100% liquid cooling reduces water consumption and enables denser rack configurations
- Broad cloud partner commitment ensures availability across all major platforms
Cons
- Performance claims are Nvidia's own figures and have not been independently validated in production workloads
- Pricing has not been disclosed, and Vera Rubin systems may carry a significant premium over Blackwell
- Second-half 2026 availability means organizations cannot deploy until months after these first samples ship
- Supply constraints that affected previous GPU generations may limit initial Vera Rubin availability
Key Features
Nvidia has delivered the first Vera Rubin hardware samples to partners, with full production confirmed as of February 25, 2026. The NVL72 system packs 72 Rubin GPUs and 36 Vera CPUs with 288GB HBM4 per GPU at 22 TB/s bandwidth. Nvidia claims 10x inference cost reduction and 4x fewer GPUs for MoE training versus Blackwell. The system is 100% liquid cooled and delivers 3.6 EFLOPS of inference and 2.5 EFLOPS of training at rack level.
Key Insights
- First Vera Rubin hardware samples are being delivered to eight cloud partners including AWS, Google Cloud, and Microsoft Azure
- Each Rubin GPU delivers 50 PFLOPS inference with 288GB HBM4 memory and 22 TB/s bandwidth per GPU
- Nvidia claims 10x reduction in inference token cost compared to Blackwell, which would fundamentally change AI deployment economics
- The six-chip extreme co-design approach eliminates bottlenecks from optimizing components independently
- 100% liquid cooling is a first for Nvidia, addressing water consumption and power density constraints at scale
- The NVL72 rack delivers 3.6 EFLOPS of inference and 20.7TB of HBM4 capacity in a single system
- Vera Rubin-based cloud instances are expected from all major providers in the second half of 2026