Feb 26, 2026

Liquid AI LFM2-24B-A2B: A Hybrid Architecture That Fits 24B Parameters in 32GB RAM

Liquid AI releases LFM2-24B-A2B, a sparse MoE model blending gated convolutions with attention that hits 26.8K tokens per second on a single H100 while fitting on consumer hardware.

Tags: Liquid AI, LFM2, hybrid architecture, MoE, convolution

A New Architecture Challenges the Transformer Monopoly

On February 24, 2026, Liquid AI released an early checkpoint of LFM2-24B-A2B, its largest model to date and a direct challenge to the assumption that pure transformers are the only viable path to scaling language models. The model uses a hybrid design that pairs efficient gated short convolution blocks with a small number of grouped query attention layers, achieving throughput and memory efficiency that outperform comparably sized transformer-based models.

This release matters because it demonstrates that alternative architectures can compete on both quality and efficiency at meaningful scale, not just in toy experiments or sub-billion-parameter research models.

Architecture: Convolutions Meet Attention

LFM2-24B-A2B is built on a hybrid architecture that fundamentally rethinks how language models process sequences. Rather than relying exclusively on self-attention, which scales quadratically with sequence length, the model uses two types of layers:

Base Layers (30 of 40 total): Gated short convolution blocks that process local context efficiently. These layers handle the majority of computation at a fraction of the cost of attention layers, making them particularly effective for the pattern matching and local dependency tasks that comprise most of language processing.

Attention Layers (10 of 40 total): Grouped Query Attention blocks that handle long-range dependencies. By limiting attention to just 25% of the layers, the model captures global context where it matters most without paying the full computational cost across the entire network.

This design was not hand-crafted. Liquid AI developed it through hardware-in-the-loop architecture search, optimizing the ratio and placement of layer types against actual hardware performance metrics rather than theoretical FLOPs alone.
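As a rough sketch of how such a hybrid stack composes, the following shows a generic gated short-convolution block and an illustrative 3:1 layer schedule. This is not Liquid AI's exact block formulation, which the article does not specify; the interleaving pattern is likewise an assumption, only the 30/10 ratio comes from the article:

```python
import numpy as np

def gated_short_conv(x, w):
    """Generic gated short-convolution block (illustrative, not LFM2's exact math).
    x: (seq_len, dim) activations; w: (kernel, dim) depthwise filter taps."""
    kernel, _ = w.shape
    seq, dim = x.shape
    pad = np.vstack([np.zeros((kernel - 1, dim)), x])         # causal left-padding
    conv = sum(pad[i:i + seq] * w[i] for i in range(kernel))  # depthwise causal conv
    gate = 1.0 / (1.0 + np.exp(-x))                           # sigmoid gate on the input
    return conv * gate

# Illustrative 40-layer schedule: 30 conv blocks, 10 attention blocks.
schedule = ["conv", "conv", "conv", "attn"] * 10
```

Because the convolution kernel is short and causal, each conv layer costs O(seq_len x kernel x dim) rather than attention's O(seq_len^2), which is where the throughput advantage of the 30 base layers comes from.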

Sparse Mixture of Experts at Scale

LFM2-24B-A2B is a sparse Mixture of Experts model with 24 billion total parameters but only 2.3 billion active parameters per token. This means the model maintains the knowledge capacity of a 24B parameter model while requiring the compute of a roughly 2B parameter model during inference.

The MoE routing selectively activates different expert subnetworks depending on the input, allowing the model to specialize without requiring every parameter to fire for every token. This is the same general principle behind models like Mixtral and Qwen's MoE variants, but applied here within a non-transformer hybrid architecture.
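The routing idea can be sketched in a few lines. This is a generic top-k router; the expert count, shapes, and top_k value below are illustrative assumptions, not LFM2-24B-A2B's published configuration:

```python
import numpy as np

def moe_route(token, gate_w, experts, top_k=2):
    """Toy top-k MoE routing (generic illustration, not LFM2's actual router).
    token: (dim,); gate_w: (n_experts, dim); experts: list of (dim, dim) matrices."""
    logits = gate_w @ token                 # router score for each expert
    top = np.argsort(logits)[-top_k:]       # pick the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                # softmax over the selected experts only
    # Only top_k expert matrices are multiplied: compute scales with *active*
    # parameters, while knowledge capacity scales with *all* experts.
    return sum(w * (experts[i] @ token) for w, i in zip(weights, top))
```

In LFM2-24B-A2B's case the ratio is roughly 2.3B active out of 24B total, so only about 10% of the weights participate in any given token's forward pass.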

Performance Numbers

The throughput numbers are the headline story:

| Metric | LFM2-24B-A2B | Qwen3-30B-A3B | OpenAI GPT-OSS-20B |
| --- | --- | --- | --- |
| Tokens/sec (H100) | 26,800 | Lower | Lower |
| Active Params | 2.3B | 3B | 20B |
| Total Params | 24B | 30B | 20B |
| Context Length | 32,768 | 32,768 | 8,192 |

On a single H100 SXM5 with vLLM, LFM2-24B-A2B reached approximately 26,800 total tokens per second at 1,024 concurrent requests with 1,024 max input tokens and 512 max output tokens. This outperforms both Qwen3-30B-A3B and OpenAI GPT-OSS-20B under continuous batching conditions.
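For intuition, dividing the aggregate figure by the concurrency gives a rough per-request decode rate. This is a back-of-envelope estimate; real per-request rates vary with batching dynamics:

```python
total_tps = 26_800   # aggregate tokens/sec on one H100 under continuous batching
concurrency = 1_024  # concurrent requests in the same benchmark
per_request = total_tps / concurrency
print(f"~{per_request:.1f} tokens/sec per request")  # prints "~26.2 tokens/sec per request"
```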

Edge inference is equally notable: the model achieves 112 tokens per second decode on AMD CPUs and 293 tokens per second on H100, making it practical for deployment outside data center environments.

Across standard benchmarks including GPQA Diamond, MMLU-Pro, IFEval, IFBench, GSM8K, and MATH-500, quality improves log-linearly as the architecture scales from 350M to 24B total parameters, confirming that the hybrid design follows predictable scaling behavior without hitting a ceiling.
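"Log-linear" here means quality is roughly a straight line in the logarithm of parameter count, which can be checked with a simple fit. The scores below are made-up placeholder values purely to illustrate the fitting procedure, NOT LFM2 benchmark results:

```python
import numpy as np

# Model sizes from the article's scaling claim (350M -> 24B total parameters).
params = np.array([0.35e9, 2.6e9, 24e9])
# Hypothetical benchmark scores for illustration only -- not real LFM2 numbers.
scores = np.array([40.0, 55.0, 72.0])

# Fit a straight line in log10(params); a good fit means log-linear scaling.
slope, intercept = np.polyfit(np.log10(params), scores, 1)
pred = slope * np.log10(params) + intercept  # fitted line
```

A positive slope with small residuals is what "quality improves log-linearly with scale" looks like numerically: each ~10x increase in parameters adds a roughly constant number of benchmark points.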

Consumer Hardware Deployment

Perhaps the most practically significant detail: LFM2-24B-A2B fits in 32GB of RAM. This means the model can run on consumer laptops and desktops with integrated GPUs or dedicated NPUs, not just cloud GPUs.
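A back-of-envelope calculation shows why 24B parameters can fit in 32GB. This assumes common quantization bit widths; actual GGUF file sizes vary with the quant format and include some overhead:

```python
params = 24e9  # total parameter count
for name, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")
# FP16 (~48 GB) would not fit in 32 GB, but 8-bit (~24 GB) and 4-bit (~12 GB) do,
# which is why the GGUF quantized releases matter for consumer deployment.
```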

The model ships with day-one support for llama.cpp, vLLM, and SGLang, three of the most widely used inference frameworks. GGUF quantized versions are available on HuggingFace, enabling immediate deployment through tools like LM Studio and Ollama.

This accessibility matters. A 24B parameter model that runs locally on a laptop with competitive quality represents a meaningful step toward making large language models available without cloud infrastructure.

Training Status and Roadmap

LFM2-24B-A2B has been trained on 17 trillion tokens so far, and pre-training is still running. This means the released checkpoint is not the final model. Liquid AI has indicated that performance will continue to improve as training progresses, making current benchmarks a floor rather than a ceiling.

The company previously released LFM2-2.6B and LFM2.5, a compact model family designed for on-device agents. The 24B release represents a significant scale-up that validates the architecture's viability at sizes where most alternative architectures have struggled to compete with transformers.

Implications for the AI Industry

LFM2-24B-A2B challenges two widely held assumptions. First, that transformers are the only architecture that scales effectively. Second, that running capable language models requires cloud-scale infrastructure.

If Liquid AI's hybrid approach continues to scale predictably, it could open a parallel track of model development where efficiency and accessibility are primary design goals rather than afterthoughts. For enterprises evaluating AI deployment options, a model that delivers competitive quality while fitting on commodity hardware significantly changes the cost-benefit calculation.

The model is available under the Liquid AI Community License on HuggingFace.

Conclusion

Liquid AI's LFM2-24B-A2B demonstrates that hybrid architectures combining convolutions and attention can compete with transformer-only models at meaningful scale. The combination of 26,800 tokens per second throughput, 32GB RAM footprint, and competitive benchmark performance establishes a new efficiency frontier for language models. For developers and organizations seeking capable models that run on accessible hardware, this release warrants serious evaluation.

Pros

  • Exceptional throughput of 26,800 tokens per second on a single H100 under continuous batching
  • Fits in 32GB RAM, enabling deployment on consumer hardware without cloud GPUs
  • Day-one support for llama.cpp, vLLM, and SGLang ensures immediate compatibility with popular inference tools
  • Hybrid architecture demonstrates predictable log-linear scaling from 350M to 24B parameters
  • Open weights available on HuggingFace with GGUF quantized versions

Cons

  • The released checkpoint is an early version with pre-training still in progress, meaning final quality is unknown
  • 32,768 token context length is shorter than many competing models offering 128K or longer contexts
  • Liquid AI Community License may have restrictions compared to fully permissive open-source licenses
  • Limited ecosystem and community support compared to established transformer-based model families


Key Features

Liquid AI released LFM2-24B-A2B on February 24, 2026, a sparse MoE model with 24B total parameters and 2.3B active parameters per token. The hybrid architecture uses 30 gated convolution base layers and 10 grouped query attention layers, designed through hardware-in-the-loop architecture search. It achieves 26,800 tokens per second on a single H100 and fits in 32GB RAM with day-one support for llama.cpp, vLLM, and SGLang.

Key Insights

  • LFM2-24B-A2B uses a hybrid architecture with 30 gated convolution layers and 10 attention layers, challenging the transformer-only paradigm
  • The model achieves 26,800 tokens per second on a single H100, outperforming comparably sized MoE models like Qwen3-30B-A3B
  • With only 2.3B active parameters per token out of 24B total, the sparse MoE design delivers large-model quality at small-model compute cost
  • The model fits in 32GB RAM with GGUF support, enabling deployment on consumer laptops without cloud infrastructure
  • Hardware-in-the-loop architecture search optimized layer ratios against actual hardware performance rather than theoretical FLOPs
  • Edge inference reaches 112 tokens per second on AMD CPUs, making the model practical for on-device deployment
  • Training on 17 trillion tokens is still running, meaning current benchmark results represent a performance floor
