## Introduction

When Google DeepMind released AlphaFold3 in 2024, it represented a significant advance in biomolecular structure prediction, but it shipped under a restrictive license that prohibited commercial use and withheld the training code. ByteDance's Protenix has systematically closed that gap: v1 became the first fully open-source model to outperform AlphaFold3 across diverse benchmarks, and the v2 release in April 2026 extends that lead with a larger model, improved antibody prediction, and greater computational efficiency. For drug discovery companies, academic researchers, and computational biology teams, Protenix represents a watershed moment: frontier-class structure prediction capability under a permissive Apache 2.0 license, with the full training data pipeline open-sourced.

## What Is Protenix?

Protenix is a diffusion-based biomolecular structure prediction system that predicts 3D atomic coordinates for proteins, protein complexes, protein-nucleic acid complexes, and small molecule-protein interactions. It takes amino acid sequences and, optionally, multiple sequence alignments (MSAs), template structures, and atom-level constraints as input, then generates high-confidence structural predictions.

The system inherits the AlphaFold family's architectural lineage but goes much further on openness: ByteDance has released the complete training data pipeline, including MSA generation, template search, and data preprocessing, not just the model weights. This enables researchers to retrain, fine-tune, and adapt Protenix for specialized applications without relying on ByteDance's infrastructure.
## Key Features

### Model Variants and Progression

| Model | Parameters | MSA | RNA MSA | Template | Data Cutoff | Release |
|---|---|---|---|---|---|---|
| protenix-v2 | 464M | Yes | Yes | Yes | 2021-09-30 | 2026-04-08 |
| protenix_base_v1.0.0 | 368M | Yes | Yes | Yes | 2021-09-30 | 2026-02-05 |
| protenix_base_v1.0.0 (2025) | 368M | Yes | Yes | Yes | 2025-06-30 | 2026-02-05 |
| protenix_base_v0.5.0 | 368M | Yes | No | No | 2021-09-30 | 2025-05-30 |

The v2 model expands the representation dimensionality and grows the parameter count from 368M to 464M, incorporating "substantial training and optimization improvements" that yield significantly better performance on challenging targets.

### Performance Against AlphaFold3

Protenix v1 established the performance baseline:

- "The first fully open-source model that outperforms AlphaFold3 across diverse benchmark sets while adhering to the same training data cutoff, model scale, and inference budget as AlphaFold3"
- Inference-time scaling delivers "consistent log-linear gains" for antibody-antigen complexes, a particularly challenging prediction target

Protenix v2 significantly extends these gains:

- Absolute success-rate improvements of 9 to 13 percentage points over v1 across three antibody-antigen benchmark collections at the DockQ > 0.23 threshold
- Protenix-v2 with only 5 inference seeds exceeds v1 at 1000 seeds, a dramatic efficiency improvement

### Atom-Level Constraints

Unlike models that predict only unconstrained structures, Protenix accepts atom-level contact and pocket constraints as prediction inputs. This is critical for drug discovery applications, where researchers want to predict structures consistent with known binding-site geometry or with experimental constraints from cross-linking mass spectrometry.
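To make the constraint interface concrete, here is a minimal Python sketch that assembles a prediction request with a contact constraint. The field names below (`proteinChain`, `constraints`, `chain_a`, etc.) are illustrative assumptions, not the documented Protenix JSON schema; consult the repository's `examples/` directory for the authoritative format.

```python
import json

def build_request(name, sequence, contact=None):
    """Assemble a minimal prediction request.

    NOTE: the field names here are illustrative placeholders, not the
    documented Protenix input schema -- check the repository's
    examples/ directory for the real format.
    """
    entry = {
        "name": name,
        "sequences": [{"proteinChain": {"sequence": sequence, "count": 1}}],
    }
    if contact is not None:
        # Hypothetical atom-level contact constraint: ask the model to
        # keep two residues in proximity during structure generation.
        entry["constraints"] = [dict(type="contact", **contact)]
    return entry

request = build_request(
    "demo_target",
    "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    contact={"chain_a": "A", "residue_a": 10, "chain_b": "A", "residue_b": 25},
)
print(json.dumps(request, indent=2))
```

The same pattern extends to pocket constraints: the point is that constraints travel in the input JSON alongside the sequences, so they can be generated programmatically from, say, cross-linking mass spectrometry data.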
### Full Training Pipeline Access

ByteDance has open-sourced the complete training data pipeline, including:

- MSA (Multiple Sequence Alignment) generation pipeline
- RNA MSA pipeline
- Template search and preprocessing
- ColabFold-compatible local MSA search integration

This breadth of openness distinguishes Protenix from models that release only weights or only inference code.

### PXDesign: Protein Binder Design

The companion PXDesign tool extends Protenix's capabilities to protein engineering. PXDesign achieves 20-73% experimental success rates for protein binder design across multiple targets, 2 to 6 times higher than prior state-of-the-art methods including AlphaProteo and RFdiffusion. This makes it directly applicable to therapeutic protein design workflows.

## Usability Analysis

Installation is straightforward via pip:

```bash
pip install protenix
protenix pred -i examples/input.json -o ./output -n protenix_base_default_v1.0.0
```

The JSON input format accepts sequences, chain configurations, MSA paths, and constraint specifications. A web server interface is also available for users who want to test predictions without local GPU infrastructure.

GPU memory requirements are substantial for the 464M v2 model: high-memory GPUs (A100 80GB or H100 equivalent) are recommended for complex multimers. The Protenix-Mini variants provide a lower-memory path for development and smaller prediction targets.

The Apache 2.0 license, with commercial use permitted, makes Protenix directly deployable in pharmaceutical and biotech industry workflows without legal uncertainty.
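For scripted workflows, the `protenix pred` invocation can be wrapped in Python. The sketch below assumes only the CLI flags shown in the install instructions (`-i`, `-o`, `-n`); the helper functions and the input-file layout are our own illustrative choices.

```python
import json
import shutil
import subprocess
from pathlib import Path

MODEL = "protenix_base_default_v1.0.0"

def build_cmd(input_json, out_dir, model=MODEL):
    """Build the CLI invocation documented in the install instructions."""
    return ["protenix", "pred", "-i", str(input_json), "-o", str(out_dir), "-n", model]

def predict(entries, out_dir, model=MODEL):
    """Write `entries` (a list of request dicts in Protenix's JSON input
    format) to input.json and run the protenix CLI on it."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    input_json = out_dir / "input.json"
    input_json.write_text(json.dumps(entries, indent=2))
    if shutil.which("protenix") is None:
        raise RuntimeError("protenix CLI not found; install with `pip install protenix`")
    subprocess.run(build_cmd(input_json, out_dir, model), check=True)
    return out_dir

print(" ".join(build_cmd("examples/input.json", "./output")))
```

Separating command construction from execution keeps the wrapper testable without a GPU: batch jobs can loop `predict` over many request dicts, and the guard on `shutil.which` fails fast when the CLI is missing.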
## Pros and Cons

**Pros**

- Outperforms AlphaFold3 on diverse benchmarks as a fully open-source system
- Apache 2.0 license permits unrestricted commercial use
- Full training data pipeline released, enabling retraining and domain adaptation
- Atom-level constraints support binding-site prediction for drug discovery
- PXDesign companion achieves 2-6x higher experimental success rates than prior binder design tools

**Cons**

- High GPU memory requirements for the full v2 model limit accessibility for smaller research groups
- Computational biology domain expertise is required to interpret results effectively
- The 2021-09-30 training data cutoff means recently characterized proteins may lack MSA depth
- Results require experimental validation: predictions are probabilistic, not definitive

## Outlook

Protenix represents a significant democratization of structural biology. AlphaFold3's restrictive license had created a two-tier research ecosystem in which commercial drug discovery could not access the best tools; Protenix eliminates that divide. For the computational biology community, the open training pipeline is particularly significant: it enables academic groups to train specialized variants on domain-specific data (e.g., viral proteins, synthetic biology targets, rare disease proteins) that general-purpose models handle poorly. As the v2 model's efficiency improvements are formalized and integrated into downstream tools, Protenix is likely to become the default structure prediction backend for open-source drug discovery pipelines in 2026.

## Conclusion

Protenix v2 achieves something rare in AI: it surpasses a heavily resourced proprietary system while being more open. For drug discovery teams, structural biologists, and computational researchers who need frontier-class structure prediction with commercial-use rights and full pipeline access, Protenix is the definitive open-source answer to AlphaFold3.