Feb 26, 2026

Liquid AI LFM2-24B-A2B: A Hybrid Architecture That Fits 24B Parameters in 32GB RAM

Liquid AI releases LFM2-24B-A2B, a sparse MoE model blending gated convolutions with attention that hits 26.8K tokens per second on a single H100 while fitting on consumer hardware.

Tags: Liquid AI, LFM2, hybrid architecture, MoE, convolution

A New Architecture Challenges the Transformer Monopoly

On February 24, 2026, Liquid AI released an early checkpoint of LFM2-24B-A2B, its largest model to date and a direct challenge to the assumption that pure transformers are the only viable path to scaling language models. The model uses a hybrid design that pairs efficient gated short convolution blocks with a small number of grouped query attention layers, achieving throughput and memory efficiency that outperform comparably sized transformer-based models.

This release matters because it demonstrates that alternative architectures can compete on both quality and efficiency at meaningful scale, not just in toy experiments or sub-billion-parameter research models.

Architecture: Convolutions Meet Attention

LFM2-24B-A2B is built on a hybrid architecture that fundamentally rethinks how language models process sequences. Rather than relying exclusively on self-attention, which scales quadratically with sequence length, the model uses two types of layers:

Base Layers (30 of 40 total): Gated short convolution blocks that process local context efficiently. These layers handle the majority of computation at a fraction of the cost of attention layers, making them particularly effective for the pattern matching and local dependency tasks that comprise most of language processing.

Attention Layers (10 of 40 total): Grouped Query Attention blocks that handle long-range dependencies. By limiting attention to just 25% of the layers, the model captures global context where it matters most without paying the full computational cost across the entire network.

This design was not hand-crafted. Liquid AI developed it through hardware-in-the-loop architecture search, optimizing the ratio and placement of layer types against actual hardware performance metrics rather than theoretical FLOPs alone.
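As a rough sketch of how such a hybrid stack composes, the following shows a generic gated short-convolution block and an illustrative 3:1 layer schedule. This is not Liquid AI's exact block formulation, which the article does not specify; the interleaving pattern is likewise an assumption, only the 30/10 ratio comes from the article:

```python
import numpy as np

def gated_short_conv(x, w):
    """Generic gated short-convolution block (illustrative, not LFM2's exact math).
    x: (seq_len, dim) activations; w: (kernel, dim) depthwise filter taps."""
    kernel, _ = w.shape
    seq, dim = x.shape
    pad = np.vstack([np.zeros((kernel - 1, dim)), x])         # causal left-padding
    conv = sum(pad[i:i + seq] * w[i] for i in range(kernel))  # depthwise causal conv
    gate = 1.0 / (1.0 + np.exp(-x))                           # sigmoid gate on the input
    return conv * gate

# Illustrative 40-layer schedule: 30 conv blocks, 10 attention blocks.
schedule = ["conv", "conv", "conv", "attn"] * 10
```

Because the convolution kernel is short and causal, each conv layer costs O(seq_len x kernel x dim) rather than attention's O(seq_len^2), which is where the throughput advantage of the 30 base layers comes from.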

Sparse Mixture of Experts at Scale

LFM2-24B-A2B is a sparse Mixture of Experts model with 24 billion total parameters but only 2.3 billion active parameters per token. This means the model maintains the knowledge capacity of a 24B parameter model while requiring the compute of a roughly 2B parameter model during inference.

The MoE routing selectively activates different expert subnetworks depending on the input, allowing the model to specialize without requiring every parameter to fire for every token. This is the same general principle behind models like Mixtral and Qwen's MoE variants, but applied here within a non-transformer hybrid architecture.
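The routing idea can be sketched in a few lines. This is a generic top-k router; the expert count, shapes, and top_k value below are illustrative assumptions, not LFM2-24B-A2B's published configuration:

```python
import numpy as np

def moe_route(token, gate_w, experts, top_k=2):
    """Toy top-k MoE routing (generic illustration, not LFM2's actual router).
    token: (dim,); gate_w: (n_experts, dim); experts: list of (dim, dim) matrices."""
    logits = gate_w @ token                 # router score for each expert
    top = np.argsort(logits)[-top_k:]       # pick the k highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                # softmax over the selected experts only
    # Only top_k expert matrices are multiplied: compute scales with *active*
    # parameters, while knowledge capacity scales with *all* experts.
    return sum(w * (experts[i] @ token) for w, i in zip(weights, top))
```

In LFM2-24B-A2B's case the ratio is roughly 2.3B active out of 24B total, so only about 10% of the weights participate in any given token's forward pass.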

Performance Numbers

The throughput numbers are the headline story:

| Metric | LFM2-24B-A2B | Qwen3-30B-A3B | OpenAI GPT-OSS-20B |
| --- | --- | --- | --- |
| Tokens/sec (H100) | 26,800 | Lower | Lower |
| Active Params | 2.3B | 3B | 20B |
| Total Params | 24B | 30B | 20B |
| Context Length | 32,768 | 32,768 | 8,192 |

On a single H100 SXM5 with vLLM, LFM2-24B-A2B reached approximately 26,800 total tokens per second at 1,024 concurrent requests with 1,024 max input tokens and 512 max output tokens. This outperforms both Qwen3-30B-A3B and OpenAI GPT-OSS-20B under continuous batching conditions.
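For intuition, dividing the aggregate figure by the concurrency gives a rough per-request decode rate. This is a back-of-envelope estimate; real per-request rates vary with batching dynamics:

```python
total_tps = 26_800   # aggregate tokens/sec on one H100 under continuous batching
concurrency = 1_024  # concurrent requests in the same benchmark
per_request = total_tps / concurrency
print(f"~{per_request:.1f} tokens/sec per request")  # prints "~26.2 tokens/sec per request"
```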

Edge inference is equally notable: the model achieves 112 tokens per second decode on AMD CPUs and 293 tokens per second on H100, making it practical for deployment outside data center environments.

Across standard benchmarks including GPQA Diamond, MMLU-Pro, IFEval, IFBench, GSM8K, and MATH-500, quality improves log-linearly as the architecture scales from 350M to 24B total parameters, confirming that the hybrid design follows predictable scaling behavior without hitting a ceiling.
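"Log-linear" here means quality is roughly a straight line in the logarithm of parameter count, which can be checked with a simple fit. The scores below are made-up placeholder values purely to illustrate the fitting procedure, NOT LFM2 benchmark results:

```python
import numpy as np

# Model sizes from the article's scaling claim (350M -> 24B total parameters).
params = np.array([0.35e9, 2.6e9, 24e9])
# Hypothetical benchmark scores for illustration only -- not real LFM2 numbers.
scores = np.array([40.0, 55.0, 72.0])

# Fit a straight line in log10(params); a good fit means log-linear scaling.
slope, intercept = np.polyfit(np.log10(params), scores, 1)
pred = slope * np.log10(params) + intercept  # fitted line
```

A positive slope with small residuals is what "quality improves log-linearly with scale" looks like numerically: each ~10x increase in parameters adds a roughly constant number of benchmark points.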

Consumer Hardware Deployment

Perhaps the most practically significant detail: LFM2-24B-A2B fits in 32GB of RAM. This means the model can run on consumer laptops and desktops with integrated GPUs or dedicated NPUs, not just cloud GPUs.
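A back-of-envelope calculation shows why 24B parameters can fit in 32GB. This assumes common quantization bit widths; actual GGUF file sizes vary with the quant format and include some overhead:

```python
params = 24e9  # total parameter count
for name, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")
# FP16 (~48 GB) would not fit in 32 GB, but 8-bit (~24 GB) and 4-bit (~12 GB) do,
# which is why the GGUF quantized releases matter for consumer deployment.
```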

The model ships with day-one support for llama.cpp, vLLM, and SGLang, three of the most widely used inference frameworks. GGUF quantized versions are available on HuggingFace, enabling immediate deployment through tools like LM Studio and Ollama.

This accessibility matters. A 24B parameter model that runs locally on a laptop with competitive quality represents a meaningful step toward making large language models available without cloud infrastructure.

Training Status and Roadmap

LFM2-24B-A2B has been trained on 17 trillion tokens so far, and pre-training is still running. This means the released checkpoint is not the final model. Liquid AI has indicated that performance will continue to improve as training progresses, making current benchmarks a floor rather than a ceiling.

The company previously released LFM2-2.6B and LFM2.5, a compact model family designed for on-device agents. The 24B release represents a significant scale-up that validates the architecture's viability at sizes where most alternative architectures have struggled to compete with transformers.

Implications for the AI Industry

LFM2-24B-A2B challenges two widely held assumptions. First, that transformers are the only architecture that scales effectively. Second, that running capable language models requires cloud-scale infrastructure.

If Liquid AI's hybrid approach continues to scale predictably, it could open a parallel track of model development where efficiency and accessibility are primary design goals rather than afterthoughts. For enterprises evaluating AI deployment options, a model that delivers competitive quality while fitting on commodity hardware significantly changes the cost-benefit calculation.

The model is available under the Liquid AI Community License on HuggingFace.

Conclusion

Liquid AI's LFM2-24B-A2B demonstrates that hybrid architectures combining convolutions and attention can compete with transformer-only models at meaningful scale. The combination of 26,800 tokens per second throughput, 32GB RAM footprint, and competitive benchmark performance establishes a new efficiency frontier for language models. For developers and organizations seeking capable models that run on accessible hardware, this release warrants serious evaluation.

Pros

  • Exceptional throughput of 26,800 tokens per second on a single H100 under continuous batching
  • Fits in 32GB RAM, enabling deployment on consumer hardware without cloud GPUs
  • Day-one support for llama.cpp, vLLM, and SGLang ensures immediate compatibility with popular inference tools
  • Hybrid architecture demonstrates predictable log-linear scaling from 350M to 24B parameters
  • Open weights available on HuggingFace with GGUF quantized versions

Cons

  • The released checkpoint is an early version with pre-training still in progress, meaning final quality is unknown
  • 32,768 token context length is shorter than many competing models offering 128K or longer contexts
  • Liquid AI Community License may have restrictions compared to fully permissive open-source licenses
  • Limited ecosystem and community support compared to established transformer-based model families


Key Features

Liquid AI released LFM2-24B-A2B on February 24, 2026, a sparse MoE model with 24B total parameters and 2.3B active parameters per token. The hybrid architecture uses 30 gated convolution base layers and 10 grouped query attention layers, designed through hardware-in-the-loop architecture search. It achieves 26,800 tokens per second on a single H100 and fits in 32GB RAM with day-one support for llama.cpp, vLLM, and SGLang.

Key Insights

  • LFM2-24B-A2B uses a hybrid architecture with 30 gated convolution layers and 10 attention layers, challenging the transformer-only paradigm
  • The model achieves 26,800 tokens per second on a single H100, outperforming comparably sized MoE models like Qwen3-30B-A3B
  • With only 2.3B active parameters per token out of 24B total, the sparse MoE design delivers large-model quality at small-model compute cost
  • The model fits in 32GB RAM with GGUF support, enabling deployment on consumer laptops without cloud infrastructure
  • Hardware-in-the-loop architecture search optimized layer ratios against actual hardware performance rather than theoretical FLOPs
  • Edge inference reaches 112 tokens per second on AMD CPUs, making the model practical for on-device deployment
  • Training on 17 trillion tokens is still running, meaning current benchmark results represent a performance floor
