Nvidia Vera Rubin NVL72: First Hardware Samples Deliver 10x Cheaper Inference Than Blackwell
CNBC gets exclusive first look at Nvidia's Vera Rubin system with 72 GPUs delivering 3.6 EFLOPS, 288GB HBM4 per GPU, and 100% liquid cooling as first samples ship to partners.
From Announcement to Silicon: Vera Rubin Gets Real
On February 25, 2026, CNBC published an exclusive first look at Nvidia's Vera Rubin AI system, confirming that first hardware samples have been delivered to partners and the platform is in full production. Originally announced at CES 2026 in January, Vera Rubin is now transitioning from roadmap to reality with concrete specifications, partner commitments, and delivery timelines for the second half of 2026.
This is not a paper launch. Nvidia is shipping physical silicon to AWS, Google Cloud, Microsoft Azure, Oracle Cloud Infrastructure, CoreWeave, Lambda, Nebius, and Nscale for integration and testing ahead of general availability.
Six Chips, One System: Extreme Co-Design
Vera Rubin represents what Nvidia calls "extreme co-design," a platform built from six custom chip types engineered to work together as a unified system rather than assembled from off-the-shelf components:
| Component | Specification |
|---|---|
| Rubin GPU | 50 PFLOPS NVFP4 inference, 35 PFLOPS training per GPU |
| Vera CPU | 88 cores, 176 threads, 1.5TB LPDDR5x, 1.2 TB/s bandwidth |
| NVLink 6 Switch | 260 TB/s scale-up bandwidth across 72 GPUs |
| ConnectX-9 SuperNIC | Network interface for scale-out connectivity |
| BlueField-4 DPU | Data processing unit for infrastructure offload |
| Spectrum-6 Switch | Ethernet switching for multi-rack clusters |
The NVL72 rack configuration packs 72 Rubin GPUs and 36 Vera CPUs into a single system. Every component was designed in tandem, from the memory subsystem to the networking fabric, to eliminate the bottlenecks that emerge when individual chips are optimized independently.
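As a quick sanity check, the rack-level throughput figures fall straight out of the per-GPU numbers in the table above. A minimal sketch in Python, using Nvidia's published per-GPU values (the rounding is ours):

```python
# Back-of-envelope check: NVL72 rack totals from per-GPU specs.
# Per-GPU figures are Nvidia's published numbers.

GPUS_PER_RACK = 72
INFERENCE_PFLOPS_PER_GPU = 50   # NVFP4 inference per Rubin GPU
TRAINING_PFLOPS_PER_GPU = 35    # training per Rubin GPU

rack_inference_eflops = GPUS_PER_RACK * INFERENCE_PFLOPS_PER_GPU / 1000
rack_training_eflops = GPUS_PER_RACK * TRAINING_PFLOPS_PER_GPU / 1000

print(f"Inference: {rack_inference_eflops:.1f} EFLOPS")  # 3.6 EFLOPS
print(f"Training:  {rack_training_eflops:.2f} EFLOPS")   # 2.52, quoted as 2.5
```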
Memory: 288GB HBM4 Per GPU
Each Rubin GPU package contains eight stacks of HBM4 memory delivering 288GB of capacity and 22 TB/s of bandwidth per GPU. At the rack level, the NVL72 system provides 20.7TB of HBM4 capacity and 54TB of LPDDR5x capacity.
This memory architecture is designed for the trillion-parameter models that define the current frontier of AI development. Training and serving models at this scale requires not just raw compute but the ability to keep model weights, activations, and intermediate states in fast memory close to the processing units.
The 22 TB/s bandwidth per GPU represents a substantial leap over Blackwell's HBM3e implementation, reducing the memory bottleneck that limits throughput in large-model inference.
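The capacity and bandwidth figures also bound what a rack can serve. A hedged back-of-envelope sketch: the rack totals follow from the per-GPU specs, while the decode estimate assumes a hypothetical dense trillion-parameter model quantized to NVFP4 and a purely bandwidth-bound single decode stream, ignoring batching, KV caches, and interconnect overhead that dominate real deployments:

```python
# Rack memory totals plus a crude bandwidth-bound decode ceiling.
# Workload numbers below are illustrative assumptions, not Nvidia figures.

GPUS, CPUS = 72, 36
hbm4_per_gpu_gb = 288
lpddr_per_cpu_tb = 1.5

print(f"Rack HBM4:    {GPUS * hbm4_per_gpu_gb / 1000:.1f} TB")  # ~20.7 TB
print(f"Rack LPDDR5x: {CPUS * lpddr_per_cpu_tb:.0f} TB")        # 54 TB

# Hypothetical dense 1T-parameter model at NVFP4 (~0.5 bytes/param).
params = 1e12
weight_bytes = params * 0.5          # ~500 GB; fits comfortably in rack HBM

hbm_bandwidth = 22e12                # 22 TB/s per GPU
# If weights are sharded evenly across 72 GPUs and every generated token
# streams each shard once, the bandwidth-bound ceiling per decode stream is:
tokens_per_sec = hbm_bandwidth / (weight_bytes / GPUS)
print(f"~{tokens_per_sec:.0f} tokens/s per stream (theoretical upper bound)")
```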
Performance: 10x Cost Reduction Over Blackwell
Nvidia's headline claims are aggressive but specific:
- 10x reduction in inference token cost compared to Blackwell
- 4x reduction in GPUs needed to train Mixture of Experts (MoE) models compared to Blackwell
- 5x greater inference performance per GPU over Blackwell GB200
- 3.6 EFLOPS of NVFP4 inference at the rack level
- 2.5 EFLOPS of training at the rack level
The 10x cost reduction for inference is the most commercially significant number. If validated in production environments, this would fundamentally change the economics of serving large language models. Current inference costs are a primary barrier to deploying AI at scale, and a 10x reduction enables use cases that are currently uneconomical.
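To see why the 10x figure matters commercially, consider a hedged token-economics sketch. Every dollar and throughput input below is a hypothetical placeholder, not disclosed Nvidia or cloud pricing; the point is how serving cost scales, not the absolute numbers:

```python
# Illustrative token-economics sketch. All inputs are hypothetical assumptions.

def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_sec: float) -> float:
    """USD per 1M generated tokens for one GPU at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hour_usd / tokens_per_hour * 1e6

# Hypothetical Blackwell-class baseline: $5/GPU-hour, 1,000 tokens/s sustained.
baseline = cost_per_million_tokens(gpu_hour_usd=5.0, tokens_per_sec=1_000)

# A 10x token-cost reduction, however it splits between price and throughput:
rubin = baseline / 10

print(f"Baseline:    ${baseline:.2f} per 1M tokens")   # $1.39
print(f"10x cheaper: ${rubin:.3f} per 1M tokens")      # $0.139
```

At that scale, workloads priced out at dollars per million tokens, such as always-on agents or bulk document processing, start to pencil out.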
The 4x reduction in GPU count for training MoE models reflects the growing adoption of sparse expert architectures in models such as Mixtral, DeepSeek, and the recently released Liquid AI LFM2-24B-A2B. Fewer GPUs per training run translates directly into lower costs and faster iteration cycles.
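The MoE advantage comes from sparse activation: each token only touches a small fraction of the model's parameters. A minimal sketch of the arithmetic, with parameter counts loosely modeled on a Mixtral-8x7B-style layout as an illustration (the 4x claim also depends on hardware and software factors this sketch ignores):

```python
# Why MoE models need less compute per token than dense models of equal size.
# Parameter counts are illustrative, loosely Mixtral-8x7B-shaped assumptions.

def moe_active_params(shared: float, per_expert: float, n_experts: int, top_k: int):
    """Parameters touched per token: shared layers plus top-k routed experts."""
    total = shared + per_expert * n_experts
    active = shared + per_expert * top_k
    return total, active

total, active = moe_active_params(shared=1.3e9, per_expert=5.7e9,
                                  n_experts=8, top_k=2)
print(f"Total params:   {total / 1e9:.1f}B")   # ~46.9B
print(f"Active/token:   {active / 1e9:.1f}B")  # ~12.7B

# Training FLOPs per token scale with active parameters (~6 * N_active for a
# forward+backward pass), so sparse routing cuts the GPU-hours a run needs.
print(f"Compute saving vs equal-size dense: ~{total / active:.1f}x")
```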
100% Liquid Cooling: A First for Nvidia
Vera Rubin is Nvidia's first system that is entirely liquid cooled. This is not a minor engineering detail: at the power densities required for next-generation AI systems, air cooling becomes physically inadequate, and evaporative cooling consumes enormous amounts of water.
Liquid cooling enables higher power delivery per GPU, denser rack configurations, and significantly lower water consumption than traditional data center cooling. As AI infrastructure scales to hundreds of thousands of GPUs, the cooling system becomes as strategically important as the chips themselves.
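The physics behind the shift is straightforward: water carries roughly four times more heat per kilogram than air and is about 800 times denser. A minimal sketch using standard specific-heat values; the rack power figure is a hypothetical placeholder, since Nvidia has not published one here:

```python
# Coolant flow needed to remove a heat load: Q = m_dot * c_p * delta_T.
# Rack power is a hypothetical placeholder; material constants are standard.

RACK_POWER_W = 120_000    # assumed ~120 kW rack heat load (illustrative)
DELTA_T = 10              # K temperature rise across the coolant loop

CP_WATER = 4186           # J/(kg*K)
CP_AIR = 1005             # J/(kg*K)
RHO_AIR = 1.2             # kg/m^3 at room conditions

water_kg_s = RACK_POWER_W / (CP_WATER * DELTA_T)
air_kg_s = RACK_POWER_W / (CP_AIR * DELTA_T)
air_m3_s = air_kg_s / RHO_AIR

print(f"Water: {water_kg_s:.1f} kg/s (~{water_kg_s:.1f} L/s)")  # ~2.9 L/s
print(f"Air:   {air_kg_s:.1f} kg/s ≈ {air_m3_s:.0f} m^3/s")     # ~10 m^3/s
```

Moving ten cubic meters of air per second through a single rack is impractical; a few liters of water per second is routine plumbing.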
Cloud Partner Deployment Timeline
Among the first cloud providers to deploy Vera Rubin-based instances in the second half of 2026:
| Cloud Provider | Status |
|---|---|
| AWS | Confirmed deployment partner |
| Google Cloud | Confirmed deployment partner |
| Microsoft Azure | Confirmed deployment partner |
| Oracle Cloud Infrastructure | Confirmed deployment partner |
| CoreWeave | Confirmed deployment partner |
| Lambda | Confirmed deployment partner |
| Nebius | Confirmed deployment partner |
| Nscale | Confirmed deployment partner |
The breadth of the partner list suggests Nvidia expects demand to exceed supply, a pattern that has repeated with every recent GPU generation. Organizations planning large-scale AI infrastructure for late 2026 and 2027 will need to factor Vera Rubin availability into their hardware roadmaps.
Competitive Landscape
Vera Rubin arrives as Nvidia faces increasing competition from custom AI chip makers. AMD's MI400 series, Google's TPU v6, and startups like MatX, Taalas, and Cerebras are all targeting Nvidia's dominance with alternative approaches.
However, Nvidia's co-design strategy, building the CPU, GPU, network, and software stack as an integrated system, creates a switching cost that individual chip competitors cannot easily replicate. The CUDA ecosystem, combined with deep integration into every major cloud provider, gives Nvidia a structural advantage that extends beyond raw silicon performance.
Conclusion
Nvidia's Vera Rubin NVL72 moves from announcement to production with specifications that promise a generational leap in AI infrastructure economics. The 10x inference cost reduction, 288GB HBM4 per GPU, and 100% liquid cooling address the three primary constraints of current AI deployment: cost, memory, and power efficiency. For organizations planning AI infrastructure investments, Vera Rubin sets the benchmark that every competitor will be measured against through 2027 and beyond.
Pros
- 10x inference cost reduction over Blackwell would make previously uneconomical AI use cases viable
- 288GB HBM4 per GPU with 22 TB/s bandwidth addresses memory bottlenecks for trillion-parameter models
- Six-chip co-design ensures CPU, GPU, networking, and memory work as an optimized system
- 100% liquid cooling reduces water consumption and enables denser rack configurations
- Broad cloud partner commitment ensures availability across all major platforms
Cons
- Performance claims are Nvidia's own figures and have not been independently validated in production workloads
- Pricing has not been disclosed, and Vera Rubin systems may carry a significant premium over Blackwell
- Second-half 2026 availability means organizations cannot deploy until months after these first samples ship
- Supply constraints that affected previous GPU generations may limit initial Vera Rubin availability
Key Features
Nvidia has delivered the first Vera Rubin hardware samples to partners, with full production confirmed as of February 25, 2026. The NVL72 system packs 72 Rubin GPUs and 36 Vera CPUs with 288GB HBM4 per GPU at 22 TB/s bandwidth. Nvidia claims 10x inference cost reduction and 4x fewer GPUs for MoE training versus Blackwell. The system is 100% liquid cooled and delivers 3.6 EFLOPS of inference and 2.5 EFLOPS of training at rack level.
Key Insights
- First Vera Rubin hardware samples are being delivered to eight cloud partners including AWS, Google Cloud, and Microsoft Azure
- Each Rubin GPU delivers 50 PFLOPS inference with 288GB HBM4 memory and 22 TB/s bandwidth per GPU
- Nvidia claims 10x reduction in inference token cost compared to Blackwell, which would fundamentally change AI deployment economics
- The six-chip extreme co-design approach eliminates bottlenecks from optimizing components independently
- 100% liquid cooling is a first for Nvidia, addressing water consumption and power density constraints at scale
- The NVL72 rack delivers 3.6 EFLOPS of inference and 20.7TB of HBM4 capacity in a single system
- Vera Rubin-based cloud instances are expected from all major providers in the second half of 2026