Feb 26, 2026

Nvidia Vera Rubin NVL72: First Hardware Samples Deliver 10x Cheaper Inference Than Blackwell

CNBC gets exclusive first look at Nvidia's Vera Rubin system with 72 GPUs delivering 3.6 EFLOPS, 288GB HBM4 per GPU, and 100% liquid cooling as first samples ship to partners.

#Nvidia #VeraRubin #NVL72 #HBM4 #AIInfrastructure

From Announcement to Silicon: Vera Rubin Gets Real

On February 25, 2026, CNBC published an exclusive first look at Nvidia's Vera Rubin AI system, confirming that first hardware samples have been delivered to partners and the platform is in full production. Originally announced at CES 2026 in January, Vera Rubin is now transitioning from roadmap to reality with concrete specifications, partner commitments, and delivery timelines for the second half of 2026.

This is not a paper launch. Nvidia is shipping physical silicon to AWS, Google Cloud, Microsoft Azure, Oracle Cloud Infrastructure, CoreWeave, Lambda, Nebius, and Nscale for integration and testing ahead of general availability.

Six Chips, One System: Extreme Co-Design

Vera Rubin represents what Nvidia calls "extreme co-design," a platform built from six custom chip types engineered to work together as a unified system rather than assembled from off-the-shelf components:

  • Rubin GPU: 50 PFLOPS NVFP4 inference, 35 PFLOPS training per GPU
  • Vera CPU: 88 cores, 176 threads, 1.5TB LPDDR5x, 1.2 TB/s bandwidth
  • NVLink 6 Switch: 260 TB/s scale-up bandwidth across 72 GPUs
  • ConnectX-9 SuperNIC: network interface for scale-out connectivity
  • BlueField-4 DPU: data processing unit for infrastructure offload
  • Spectrum-6 Switch: Ethernet switching for multi-rack clusters

The NVL72 rack configuration packs 72 Rubin GPUs and 36 Vera CPUs into a single system. Every component was designed in tandem, from the memory subsystem to the networking fabric, to eliminate the bottlenecks that emerge when individual chips are optimized independently.
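As a sanity check, the rack-level compute figures follow directly from the per-GPU specifications quoted above. A quick sketch:

```python
# Verify the NVL72 rack-level figures against the per-GPU Rubin specs
# quoted in the article (50 PFLOPS inference, 35 PFLOPS training per GPU).
GPUS_PER_RACK = 72

inference_pflops_per_gpu = 50   # NVFP4 inference per Rubin GPU
training_pflops_per_gpu = 35    # training per Rubin GPU

rack_inference_eflops = GPUS_PER_RACK * inference_pflops_per_gpu / 1000
rack_training_eflops = GPUS_PER_RACK * training_pflops_per_gpu / 1000

print(f"Inference: {rack_inference_eflops:.1f} EFLOPS")  # 3.6 EFLOPS
print(f"Training:  {rack_training_eflops:.2f} EFLOPS")   # 2.52 EFLOPS, quoted as 2.5
```

The training figure works out to 2.52 EFLOPS, which Nvidia rounds down to the quoted 2.5.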

Memory: 288GB HBM4 Per GPU

Each Rubin GPU package contains eight stacks of HBM4 memory delivering 288GB of capacity and 22 TB/s of bandwidth per GPU. At the rack level, the NVL72 system provides 20.7TB of HBM4 capacity and 54TB of LPDDR5x capacity.

This memory architecture is designed for the trillion-parameter models that define the current frontier of AI development. Training and serving models at this scale requires not just raw compute but the ability to keep model weights, activations, and intermediate states in fast memory close to the processing units.

The 22 TB/s bandwidth per GPU represents a substantial leap over the roughly 8 TB/s of Blackwell's HBM3e implementation, reducing the memory bottleneck that limits throughput in large-model inference.
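The rack-level capacities again follow from the per-device figures, and a back-of-the-envelope calculation shows why the capacity matters for frontier models. The FP8 weight-footprint line is illustrative only; the article does not specify a serving precision:

```python
# Check the NVL72 rack-level memory capacities against the per-device figures.
GPUS = 72
CPUS = 36

hbm4_per_gpu_gb = 288     # HBM4 per Rubin GPU
lpddr_per_cpu_tb = 1.5    # LPDDR5x per Vera CPU

rack_hbm4_tb = GPUS * hbm4_per_gpu_gb / 1000   # 20.736 TB, quoted as 20.7 TB
rack_lpddr_tb = CPUS * lpddr_per_cpu_tb        # 54.0 TB

# Illustrative assumption: a 1-trillion-parameter model stored in FP8
# (1 byte per parameter) needs about 1 TB just for weights.
params = 1e12
weights_tb = params * 1 / 1e12

print(f"Rack HBM4:    {rack_hbm4_tb:.1f} TB")
print(f"Rack LPDDR5x: {rack_lpddr_tb:.0f} TB")
print(f"1T-param FP8 weights: {weights_tb:.1f} TB")
```

Under that assumption, a trillion-parameter model's weights occupy only a fraction of the rack's 20.7TB of HBM4, leaving headroom for activations, KV caches, and intermediate states.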

Performance: 10x Cost Reduction Over Blackwell

Nvidia's headline claims are aggressive but specific:

  • 10x reduction in inference token cost compared to Blackwell
  • 4x reduction in GPUs needed to train MoE models compared to Blackwell
  • 5x greater inference performance per GPU over Blackwell GB200
  • 3.6 EFLOPS of NVFP4 inference at the rack level
  • 2.5 EFLOPS of training at the rack level

The 10x cost reduction for inference is the most commercially significant number. If validated in production environments, this would fundamentally change the economics of serving large language models. Current inference costs are a primary barrier to deploying AI at scale, and a 10x reduction enables use cases that are currently uneconomical.

The 4x reduction in GPU count for training MoE models addresses the growing adoption of Mixture of Experts architectures in models like Mixtral, DeepSeek, and the recently released Liquid AI LFM2-24B-A2B. Fewer GPUs per training run translates directly to lower costs and faster iteration cycles.
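Taken at face value, these multipliers translate directly into cluster sizing and serving cost. A hedged sketch, where the baseline cluster size and price point are hypothetical illustrations rather than figures from the article:

```python
# Illustrative arithmetic using Nvidia's claimed multipliers.
# The Blackwell baseline values below are hypothetical examples.
blackwell_gpus_for_moe_run = 2048   # hypothetical Blackwell training cluster
training_gpu_reduction = 4          # Nvidia's claimed MoE training multiplier
inference_cost_reduction = 10       # Nvidia's claimed per-token cost multiplier

rubin_gpus_needed = blackwell_gpus_for_moe_run / training_gpu_reduction
print(f"Rubin GPUs for the same MoE run: {rubin_gpus_needed:.0f}")  # 512

blackwell_cost_per_mtok = 1.00      # hypothetical $/million tokens on Blackwell
rubin_cost_per_mtok = blackwell_cost_per_mtok / inference_cost_reduction
print(f"Rubin cost per million tokens: ${rubin_cost_per_mtok:.2f}")  # $0.10
```

Both outputs depend entirely on Nvidia's unvalidated claims holding up in production workloads.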

100% Liquid Cooling: A First for Nvidia

Vera Rubin is Nvidia's first system that is entirely liquid cooled. This is not a minor engineering detail. At the power densities required for next-generation AI systems, air cooling becomes physically inadequate and water-based evaporative cooling consumes enormous amounts of water.

Liquid cooling enables higher power delivery per GPU, denser rack configurations, and significantly lower water consumption than traditional data center cooling. As AI infrastructure scales to hundreds of thousands of GPUs, the cooling system becomes as strategically important as the chips themselves.

Cloud Partner Deployment Timeline

Among the first cloud providers confirmed to deploy Vera Rubin-based instances in the second half of 2026:

  • AWS
  • Google Cloud
  • Microsoft Azure
  • Oracle Cloud Infrastructure
  • CoreWeave
  • Lambda
  • Nebius
  • Nscale

The breadth of the partner list suggests Nvidia expects demand to exceed supply, a pattern that has repeated with every recent GPU generation. Organizations planning large-scale AI infrastructure for late 2026 and 2027 will need to factor Vera Rubin availability into their hardware roadmaps.

Competitive Landscape

Vera Rubin arrives as Nvidia faces increasing competition from custom AI chip makers. AMD's MI400 series, Google's TPU v6, and startups like MatX, Taalas, and Cerebras are all targeting Nvidia's dominance with alternative approaches.

However, Nvidia's co-design strategy, building the CPU, GPU, network, and software stack as an integrated system, creates a switching cost that individual chip competitors cannot easily replicate. The CUDA ecosystem, combined with deep integration into every major cloud provider, gives Nvidia a structural advantage that extends beyond raw silicon performance.

Conclusion

Nvidia's Vera Rubin NVL72 moves from announcement to production with specifications that promise a generational leap in AI infrastructure economics. The 10x inference cost reduction, 288GB HBM4 per GPU, and 100% liquid cooling address the three primary constraints of current AI deployment: cost, memory, and power efficiency. For organizations planning AI infrastructure investments, Vera Rubin sets the benchmark that every competitor will be measured against through 2027 and beyond.

Pros

  • 10x inference cost reduction over Blackwell would make previously uneconomical AI use cases viable
  • 288GB HBM4 per GPU with 22 TB/s bandwidth addresses memory bottlenecks for trillion-parameter models
  • Six-chip co-design ensures CPU, GPU, networking, and memory work as an optimized system
  • 100% liquid cooling reduces water consumption and enables denser rack configurations
  • Broad cloud partner commitment ensures availability across all major platforms

Cons

  • Performance claims are Nvidia's own figures and have not been independently validated in production workloads
  • Pricing has not been disclosed, and Vera Rubin systems may carry a significant premium over Blackwell
  • Second half 2026 availability means organizations cannot deploy until at least six months from now
  • Supply constraints that affected previous GPU generations may limit initial Vera Rubin availability


Key Features

Nvidia delivered first Vera Rubin hardware samples to partners on February 25, 2026, with full production confirmed. The NVL72 system packs 72 Rubin GPUs and 36 Vera CPUs with 288GB HBM4 per GPU at 22 TB/s bandwidth. Nvidia claims 10x inference cost reduction and 4x fewer GPUs for MoE training versus Blackwell. The system is 100% liquid cooled and delivers 3.6 EFLOPS of inference and 2.5 EFLOPS of training at rack level.

Key Insights

  • First Vera Rubin hardware samples are being delivered to eight cloud partners including AWS, Google Cloud, and Microsoft Azure
  • Each Rubin GPU delivers 50 PFLOPS inference with 288GB HBM4 memory and 22 TB/s bandwidth per GPU
  • Nvidia claims 10x reduction in inference token cost compared to Blackwell, which would fundamentally change AI deployment economics
  • The six-chip extreme co-design approach eliminates bottlenecks from optimizing components independently
  • 100% liquid cooling is a first for Nvidia, addressing water consumption and power density constraints at scale
  • The NVL72 rack delivers 3.6 EFLOPS of inference and 20.7TB of HBM4 capacity in a single system
  • Vera Rubin-based cloud instances are expected from all major providers in the second half of 2026
