May 11, 2026
IT News

Cloudflare's Infire Engine: How It's Reengineering LLM Inference at Global Scale

Cloudflare published its high-performance LLM inference architecture in May 2026, detailing the Infire engine, Unweight compression, and disaggregated prefill/decode that cut inter-token latency 3x.

#Cloudflare#LLM Inference#AI Infrastructure#Workers AI#Infire

Key Highlights

Cloudflare released a detailed technical account in May 2026 of how it rebuilt its AI inference infrastructure from the ground up to serve large language models across its global network. The announcement covers three primary innovations: a disaggregated prefill/decode architecture, a proprietary Rust-based inference engine called Infire, and a model weight compression system called Unweight. Together, these changes allow Cloudflare to serve trillion-parameter models cost-effectively on Workers AI, something that was not practical on its previous infrastructure.

The significance of this release extends beyond Cloudflare itself. As AI workloads shift from occasional large-batch processing to continuous, low-latency inference for agentic applications, every cloud provider faces the same architectural problem. Cloudflare's public documentation of its approach provides a detailed technical benchmark for how edge-first CDN providers are competing in the AI infrastructure race.

Disaggregated Prefill and Decode Architecture

Traditional LLM serving runs both the prefill stage (processing input tokens and populating the key-value cache) and the decode stage (generating output tokens) on the same hardware. This approach wastes resources because the two stages have fundamentally different compute profiles.

Cloudflare separated them. Prefill is compute-bound — it processes all input tokens in parallel and dominates GPU compute during the request startup phase. Decode is memory-bound — it generates tokens sequentially and requires fast memory bandwidth more than raw compute. By running each stage on purpose-matched hardware with distinct scaling properties, Cloudflare can tune resources independently for each workload.

The impact on latency is measurable. Cloudflare reports that p90 inter-token latency improved from approximately 100ms to 20-30ms — a 3x reduction. For agentic applications where a model might call tools dozens of times per session, that latency difference compounds significantly across the full workflow execution.
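
As a rough illustration of that compounding, the sketch below multiplies the reported latencies through an assumed session shape (20 tool calls generating 300 tokens each, figures chosen for illustration rather than taken from Cloudflare's post):

```rust
// Back-of-the-envelope view of how inter-token latency compounds over an
// agentic session. Call count and tokens-per-call are assumed values.
fn main() {
    let tool_calls = 20.0;
    let tokens_per_call = 300.0;

    let before_s = 0.100; // ~100 ms p90 inter-token latency reported before the change
    let after_s = 0.025;  // 20-30 ms reported after; 25 ms used here

    let before_total = tool_calls * tokens_per_call * before_s;
    let after_total = tool_calls * tokens_per_call * after_s;

    // With these assumptions: 600s of decode time per session drops to 150s.
    println!("decode time per session: {before_total:.0}s -> {after_total:.0}s");
}
```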

The system includes token-aware load balancing that estimates the number of in-flight tokens per endpoint before routing requests, preventing decode-stage bottlenecks from affecting prefill throughput.
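
Cloudflare has not published the balancer's internals, but a minimal sketch of what token-aware routing can look like, with hypothetical type and field names, is:

```rust
use std::collections::HashMap;

/// Hypothetical per-endpoint state a token-aware balancer might track.
struct EndpointLoad {
    in_flight_tokens: u64, // estimated prompt + generated tokens currently in flight
}

struct TokenAwareBalancer {
    endpoints: HashMap<String, EndpointLoad>,
}

impl TokenAwareBalancer {
    /// Route to the endpoint with the lowest estimated token load, then charge
    /// the new request's prompt tokens against that estimate.
    fn route(&mut self, prompt_tokens: u64) -> Option<String> {
        let target = self
            .endpoints
            .iter()
            .min_by_key(|(_, load)| load.in_flight_tokens)
            .map(|(name, _)| name.clone())?;
        self.endpoints.get_mut(&target)?.in_flight_tokens += prompt_tokens;
        Some(target)
    }

    /// Called when a request finishes so the estimate shrinks again.
    fn complete(&mut self, endpoint: &str, tokens: u64) {
        if let Some(load) = self.endpoints.get_mut(endpoint) {
            load.in_flight_tokens = load.in_flight_tokens.saturating_sub(tokens);
        }
    }
}

fn main() {
    let mut lb = TokenAwareBalancer {
        endpoints: HashMap::from([
            ("prefill-a".to_string(), EndpointLoad { in_flight_tokens: 12_000 }),
            ("prefill-b".to_string(), EndpointLoad { in_flight_tokens: 4_000 }),
        ]),
    };
    let chosen = lb.route(2_048).expect("no endpoints registered");
    println!("routed a 2,048-token prompt to {chosen}"); // prefill-b, the least-loaded node
    lb.complete(&chosen, 2_048);
}
```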

Infire: A Purpose-Built Inference Engine in Rust

Cloudflare built Infire specifically for its distributed, edge-first infrastructure. The engine is written in Rust and replaces vLLM — the open-source inference server that most providers use as a starting point — with a system designed around Cloudflare's specific operational requirements.

Several capabilities distinguish Infire from standard inference serving:

Multi-GPU parallelism: Infire supports both pipeline-parallel and tensor-parallel modes, along with expert-parallelism for Mixture-of-Experts models. This allows it to run models that exceed single-GPU memory capacity without the overhead typical of naive multi-GPU implementations.
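
For readers less familiar with the terminology, here is a toy illustration of how the two main partitioning schemes divide a model; the layer and head counts are arbitrary and no real GPU runtime is involved:

```rust
// Toy illustration of pipeline vs. tensor parallelism using plain index math.
fn main() {
    let (layers, heads, gpus) = (48, 32, 4);

    // Pipeline parallelism: each GPU owns a contiguous block of whole layers,
    // and activations flow GPU-to-GPU like an assembly line.
    for gpu in 0..gpus {
        println!(
            "pipeline: GPU {gpu} holds layers {}..{}",
            gpu * layers / gpus,
            (gpu + 1) * layers / gpus
        );
    }

    // Tensor parallelism: every GPU holds a slice of every layer, for example
    // an even share of the attention heads, and results are combined per layer.
    for gpu in 0..gpus {
        println!(
            "tensor: GPU {gpu} holds heads {}..{} of every layer",
            gpu * heads / gpus,
            (gpu + 1) * heads / gpus
        );
    }
}
```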

Memory efficiency: Infire achieves lower KV-cache overhead than vLLM. The company cites running Llama 4 Scout on just two H200 GPUs with more than 56 GB remaining for KV-cache, and serving Kimi K2.5 (a 1-trillion-parameter MoE model) on eight H100 GPUs with more than 30 GB available for cache. These numbers matter for long-context, multi-turn agentic sessions where cache sizes grow continuously.
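
Those headroom figures are easier to interpret with a rough per-token KV-cache cost, commonly estimated as 2 x layers x kv_heads x head_dim x bytes per element. The sketch below uses assumed model dimensions purely for illustration; they are not the published configurations of Llama 4 Scout or Kimi K2.5:

```rust
// Back-of-the-envelope KV-cache sizing. The model dimensions are assumptions
// chosen for illustration, not the published configs of the models named above.
fn kv_bytes_per_token(layers: u64, kv_heads: u64, head_dim: u64, bytes_per_elem: u64) -> u64 {
    // Keys and values are both cached, hence the factor of 2.
    2 * layers * kv_heads * head_dim * bytes_per_elem
}

fn main() {
    // Hypothetical mid-size config: 48 layers, 8 KV heads (GQA), 128-dim heads, fp16.
    let per_token = kv_bytes_per_token(48, 8, 128, 2);
    let headroom: u64 = 56 * 1024 * 1024 * 1024; // the "more than 56 GB" figure cited above

    println!("KV-cache cost: {} KiB per token", per_token / 1024);
    println!("roughly {} tokens of context fit in 56 GB of headroom", headroom / per_token);
}
```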

Cold-start speed: For large models like Kimi K2.5, Infire begins serving requests in under 20 seconds from cold start, bounded primarily by disk read speed. This is a significant operational improvement for distributed edge deployments where nodes may not have models warm at all times.
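
Since loading is disk-bound, the floor is roughly on-disk weight size divided by sequential read bandwidth. A quick sanity check, with both figures assumed rather than published by Cloudflare:

```rust
// Cold-start lower bound: time to stream weights from disk. Both figures are
// assumptions for illustration; Cloudflare does not publish them here.
fn main() {
    let model_size_gb = 500.0;     // assumed on-disk weight size
    let disk_read_gb_per_s = 25.0; // assumed aggregate NVMe read bandwidth

    let floor_s = model_size_gb / disk_read_gb_per_s;
    println!("weight-streaming floor: ~{floor_s:.0}s"); // ~20s with these assumptions
}
```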

Throughput: On unconstrained systems, Infire delivers up to 20% higher tokens-per-second than a comparable vLLM baseline.

Speculative decoding: Infire incorporates NVIDIA's EAGLE-3 draft model to accelerate token generation for structured outputs and tool calls — two patterns that are disproportionately common in agentic workflows.
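
Cloudflare does not detail EAGLE-3's drafting mechanism in this post, but speculative decoding in general works by letting a small draft model propose several tokens and having the target model verify them in one pass, accepting the longest agreeing prefix. The sketch below shows that generic draft-then-verify loop with stub models; it is not EAGLE-3 or Infire's actual implementation:

```rust
// Generic greedy speculative decoding with stub "models". This shows the
// draft-then-verify pattern only; it is not EAGLE-3 or Infire's implementation.
fn main() {
    // Stub models: each deterministically maps a context to its next token.
    let target = |ctx: &[u32]| -> u32 { (ctx.iter().sum::<u32>() * 7 + 3) % 1000 };
    let draft = |ctx: &[u32]| -> u32 {
        // The draft agrees with the target most of the time; the % 5 branch is an
        // artificial stand-in for drafting errors.
        if ctx.iter().sum::<u32>() % 5 == 0 { 999 } else { target(ctx) }
    };

    let mut tokens: Vec<u32> = vec![42]; // prompt
    let k = 4; // number of tokens drafted per step

    while tokens.len() < 24 {
        // 1. The draft model cheaply proposes k tokens, one after another.
        let mut proposal = tokens.clone();
        for _ in 0..k {
            let next = draft(&proposal[..]);
            proposal.push(next);
        }

        // 2. The target model checks every proposed position. In a real engine this
        //    is a single batched forward pass, which is where the speedup comes from.
        let base = tokens.len();
        let mut accepted = 0;
        while accepted < k && target(&proposal[..base + accepted]) == proposal[base + accepted] {
            accepted += 1;
        }

        // 3. Keep the accepted prefix; on a mismatch, substitute the target's token.
        tokens.extend_from_slice(&proposal[base..base + accepted]);
        if accepted < k {
            tokens.push(target(&proposal[..base + accepted]));
        }
    }
    println!("generated {} tokens: {:?}", tokens.len(), tokens);
}
```

In production the verification step is batched on the target model, so each accepted draft token costs a fraction of a full target forward pass, which is where the latency win for structured outputs and tool calls comes from.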

Unweight: Model Compression Without Accuracy Loss

Beyond the serving architecture, Cloudflare introduced Unweight, a compression system that reduces LLM weight sizes by approximately 15-22% without measurable accuracy degradation.

For a 560 GB model, a 15% reduction means roughly 84 GB less data that GPUs need to load and move during inference. At Cloudflare's scale — with Workers AI serving requests across hundreds of edge nodes — this translates to lower hardware costs per inference token and faster model loading at each node.

Unweight is designed to operate as a post-processing step after model training, making it applicable to any model that Cloudflare deploys rather than requiring architectural changes during training.

Prompt Caching and KV-Cache Sharing

Cloudflare's infrastructure includes two additional optimization layers that are particularly relevant for agents making repeated calls with overlapping context.

Prompt caching uses x-session-affinity routing headers to direct requests from the same session to the same compute node, increasing cache hit ratios from 60% to 80% during peak periods. For an agent that calls tools 20 times per session with a 4,000-token system prompt, the difference between 60% and 80% cache hits is a meaningful reduction in prefill compute cost.
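
A rough way to see why the hit-ratio difference matters, using the article's own example figures and the simplifying assumption that a cache hit skips prefill for the shared prompt entirely:

```rust
// How the hit ratio translates into prefill work, under the simplifying
// assumption that a cache hit skips prefill for the shared 4,000-token prefix.
fn prefill_tokens(calls: f64, prompt_tokens: f64, hit_ratio: f64) -> f64 {
    calls * prompt_tokens * (1.0 - hit_ratio)
}

fn main() {
    let calls = 20.0;     // tool calls per session, from the example above
    let prompt = 4_000.0; // shared system-prompt tokens, from the example above

    let at_60 = prefill_tokens(calls, prompt, 0.60);
    let at_80 = prefill_tokens(calls, prompt, 0.80);

    // 32,000 vs 16,000 prompt tokens recomputed: the higher hit ratio halves
    // the redundant prefill compute in this example.
    println!("prefill tokens per session: {at_60:.0} at 60% hits vs {at_80:.0} at 80% hits");
}
```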

For cross-GPU cache sharing, Cloudflare integrated Moonshot AI's Mooncake technology, which uses RDMA protocols to share KV-cache across GPUs. This enables cache reuse across decode-stage replicas, further reducing redundant compute.

Usability Analysis

For developers building on Workers AI, these infrastructure improvements are mostly invisible — they manifest as lower latency and higher throughput without requiring API changes. The practical impact is most significant for agentic use cases: tool-heavy workflows, long-context sessions, and multi-turn reasoning tasks that previously saturated Cloudflare's inference capacity.

For enterprises evaluating where to run inference, Cloudflare's technical transparency is itself a differentiator. Most cloud providers describe their AI infrastructure in marketing terms; Cloudflare published specific p90 latency numbers, GPU memory utilization figures, and cold-start times. That level of detail allows informed architectural decisions in a way that opaque "up to X% faster" claims do not.

The limitation is that Cloudflare's Workers AI is primarily optimized for edge-first, high-throughput inference on publicly available models. Organizations that need to serve fine-tuned proprietary models, or that require guaranteed SLAs for critical financial or medical applications, may still prefer purpose-built managed inference platforms from providers with deeper enterprise support commitments.

Pros

  • Disaggregated prefill/decode architecture cuts p90 inter-token latency from ~100ms to 20-30ms — a 3x improvement
  • Infire delivers up to 20% higher throughput than baseline vLLM on unconstrained systems
  • Unweight compression reduces model weight sizes by 15-22% without accuracy loss, lowering per-token costs at scale
  • Kimi K2.5 (1 trillion parameters) now runs on 8 H100 GPUs with 30+ GB cache headroom — previously impractical
  • Under 20-second cold starts for large models enables practical edge deployment without pre-warming

Cons

  • Workers AI is optimized for publicly available foundation models; fine-tuned proprietary model support is limited
  • The technical approach is documented but not open-sourced — operators on other infrastructure cannot directly adopt Infire
  • Edge inference still involves geographic routing that can introduce variable latency depending on request origin
  • Enterprise SLA guarantees and dedicated capacity options are less developed than AWS or GCP equivalents

Outlook

The architecture Cloudflare describes in this release is essentially a production-grade solution to the core challenges of serving large models in agentic workflows: high inter-token latency, KV-cache memory pressure, cold-start delays, and per-token cost. Each of the four innovations — disaggregated serving, Infire, Unweight, and RDMA cache sharing — addresses one of those four problems directly.

The broader implication is that edge inference — the ability to serve trillion-parameter models from hundreds of geographically distributed points rather than a handful of central data centers — is becoming technically viable. Cloudflare is not alone in pursuing this architecture, but it is among the most technically explicit about how it works. For the AI inference market, this publication represents both a product announcement and a de facto reference architecture for edge-first LLM serving.

As agentic applications proliferate and inference workloads shift from batch processing to continuous, real-time execution, the performance characteristics Cloudflare is optimizing for — low inter-token latency, efficient cache reuse, fast cold starts — will become the dominant competitive dimensions of AI infrastructure.

Conclusion

Cloudflare's Infire-based inference architecture is a technically substantive advance in production LLM serving. The combination of disaggregated prefill/decode, purpose-built Rust infrastructure, and aggressive model compression positions Workers AI as a credible platform for latency-sensitive agentic workloads. For developers and platform architects evaluating inference providers, this release establishes Cloudflare as a technically serious competitor in the AI infrastructure market — not merely a CDN with AI features bolted on.

Editor's Verdict

Cloudflare's Infire Engine: How It's Reengineering LLM Inference at Global Scale earns a solid recommendation within the IT news space.

The strongest case for paying attention is the 3x inter-token latency improvement (from ~100ms to 20-30ms), which directly benefits agentic and real-time AI applications and raises the bar for what readers should now expect from peers in this space. Reinforcing that, the 20% throughput improvement over vLLM on unconstrained systems reduces per-token infrastructure cost, adding practical value rather than just headline appeal. The broader signal worth registering is straightforward: separating prefill and decode stages onto distinct hardware is the architectural shift that makes low-latency edge inference viable for trillion-parameter models. On the other side of the ledger, Infire is proprietary and not open-sourced, which limits adoption by operators on non-Cloudflare infrastructure; that is a real constraint, not a marketing footnote, and it should factor into any serious decision. Layered on top of that, Workers AI is optimized for public foundation models, and limited support for fine-tuned proprietary models narrows the set of teams for whom this is an obvious yes.

For AI industry watchers, strategy teams, and decision-makers tracking platform shifts, this is a serious evaluation candidate, not just a curiosity to bookmark. For everyone else, the safer posture is to monitor coverage and revisit once the use cases that matter to your team are demonstrated in the wild.

Pros

  • 3x inter-token latency improvement (100ms to 20-30ms) directly benefits agentic and real-time AI applications
  • 20% throughput improvement over vLLM on unconstrained systems reduces per-token infrastructure cost
  • Under 20-second cold starts for trillion-parameter models enables practical edge deployment
  • 15-22% weight compression with no accuracy loss reduces hardware requirements at scale
  • Technical transparency with specific benchmark numbers enables informed infrastructure decisions

Cons

  • Infire is proprietary and not open-sourced, limiting adoption by operators on non-Cloudflare infrastructure
  • Workers AI is optimized for public foundation models; fine-tuned proprietary model support remains limited
  • Edge inference introduces geographic routing variability that centralized inference avoids
  • Enterprise SLA guarantees and dedicated capacity options are less mature than AWS or GCP equivalents


Key Features

1. Disaggregated prefill/decode architecture separates compute-bound and memory-bound LLM processing stages, cutting p90 inter-token latency from ~100ms to 20-30ms (3x improvement)
2. Infire inference engine written in Rust supports multi-GPU pipeline/tensor parallelism, delivers 20% higher throughput than vLLM baseline, and achieves under 20-second cold starts for trillion-parameter models
3. Unweight compression system reduces model weight by 15-22% without accuracy loss, lowering GPU memory pressure and per-token infrastructure costs
4. RDMA-based KV-cache sharing via Mooncake technology enables cross-GPU cache reuse, complemented by session-affinity routing that raises cache hit ratios from 60% to 80%
5. NVIDIA EAGLE-3 speculative decoding accelerates token generation for structured outputs and tool calls — the most common patterns in agentic AI workflows

Key Insights

  • Separating prefill and decode stages onto distinct hardware is the key architectural shift that makes low-latency edge inference viable for trillion-parameter models
  • Cloudflare's decision to write Infire in Rust rather than extending vLLM signals a long-term commitment to building proprietary AI infrastructure rather than relying on open-source serving frameworks
  • The ability to serve Kimi K2.5 (1 trillion parameters) on 8 H100 GPUs with 30+ GB cache headroom demonstrates that edge inference for frontier-scale models is no longer a theoretical capability
  • A 3x inter-token latency reduction has compounding effects on agentic workflows where a model may execute dozens of tool calls per session
  • Cloudflare's technical transparency in publishing specific p90 latency numbers and GPU utilization figures is itself a product differentiator in an industry dominated by marketing-level performance claims
  • The 15-22% weight reduction from Unweight without accuracy loss represents a meaningful cost reduction at scale — for a 560GB model, that is roughly 84GB less data movement per inference load
  • Edge-first LLM serving is transitioning from experimental to production-viable, with implications for AI deployment architectures that have historically centralized inference in a small number of hyperscale data centers

