The Unreasonable Redundancy of Nature's Protein Folds

Arda Goreci · May 20, 2026

Over the last few years, deep neural networks have made generative language modeling dramatically more powerful, giving us large language models. A similar leap happened for continuous modalities like images and videos. More recently, the same techniques have been applied to the generative modeling of biomolecules with great success. Models such as DeepMind's AlphaFold3 made it much easier to predict biomolecular interactions, including drug-protein and antibody-protein complexes, and soon afterwards people figured out how to repurpose these capabilities to design drug-like molecules.

Chai-2, Latent-X2, and Nabla all report developable antibody or biologics designs.

In the near future, most antibodies entering the clinic might be designed in large part with deep-learning-based generative models, potentially with superior pharmaceutical properties and against receptors that have resisted wet-lab-based approaches.

How would you improve on these systems? We definitely want to have better biomolecular modeling so we can put better drugs into the clinic. The recipe for improving a deep learning system has been surprisingly simple at a high level: you scale the model, scale the compute, and scale the data. LLMs are obviously improving by being scaled aggressively. AlphaFold3 was also a major effort to scale the model and data; it is trained on a broad collection of known biomolecular complexes, from experimental structures and protein-ligand complexes to the enormous sequence databases produced by genomics and metagenomics such as MGnify. Internally, DeepMind called the project "all-PDB" for a while, referring to all the interactions represented in the Protein Data Bank.

The key move in AlphaFold3's scaling recipe was to turn sequence scale into structure scale: use structure prediction to convert large protein sequence databases into predicted 3D structures. Genomics and metagenomics have given us billions of protein sequences, many inferred from environmental DNA collected from organisms that have never been cultured in the lab. For training structure-based design models, though, the useful object is often the 3D structure. Structure prediction models let us convert some of that sequence scale into structural data: take millions of natural sequences, predict the folds they adopt, and use those predicted structures as training examples for the next generation of biomolecular models.

At Ligo, we care about this recipe because we train generative models for designing enzymes. When we tried to scale our structural training data by folding more natural sequences, we ran into a problem: natural protein sequences are vast, but their folds are much more redundant than the sequence counts suggest. This post is about that mismatch, and about why simply folding more natural sequences may not buy as much new structural diversity as we hoped. We will describe data engineering tricks for clustering the known protein universe, and what our results imply for how to think about the enzyme design problem.

Modern biomolecular models rely on sequence scale

Modern structure prediction models rely heavily on multiple sequence alignments. A multiple sequence alignment, or MSA, lines up related versions of a protein from different organisms. When two positions in that alignment tend to change together, it can be a clue that the corresponding residues are close in 3D space or tied together by function. My mental model of AlphaFold2 is that it used this kind of coevolutionary signal to constrain the rough geometry of a protein, then learned how to fill in the rest of the structure.

Coevolution means that two positions change in a coordinated way across related proteins. For example, if one position is usually negatively charged and touches a positively charged position, evolution may flip both together while avoiding pairs that would repel each other.
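To make the idea of positions that "change together" concrete, here is a minimal sketch that scores every pair of alignment columns by mutual information. Everything in it is an illustrative assumption: the toy alignment is made up, the function names are hypothetical, and raw mutual information is only a simple stand-in for coevolution scoring, not what AlphaFold2 or any production pipeline computes.

```python
from collections import Counter
from itertools import combinations
from math import log2


def column(msa, i):
    """Return the residues found at alignment position i, one per sequence."""
    return [seq[i] for seq in msa]


def mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    counts_i = Counter(column(msa, i))
    counts_j = Counter(column(msa, j))
    counts_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def rank_pairs(msa):
    """Score all column pairs and sort them from most to least coupled."""
    length = len(msa[0])
    scores = {
        (i, j): mutual_information(msa, i, j)
        for i, j in combinations(range(length), 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy alignment (hypothetical data). Columns 1 and 4 flip charge
# together: D pairs with K, and K pairs with D, never D/D or K/K.
TOY_MSA = [
    "ADLGKV",
    "ADLGKV",
    "AKLGDV",
    "AKLGDI",
    "ADMGKV",
    "AKMGDI",
]

if __name__ == "__main__":
    for (i, j), score in rank_pairs(TOY_MSA)[:3]:
        print(f"columns {i} and {j}: {score:.3f} bits")
```

On this toy alignment the charge-swapping pair (columns 1 and 4) ranks first, while fully conserved columns score zero against everything. Real alignments are noisier: shared ancestry and uneven sampling inflate raw scores, which is why practical methods apply corrections on top of a statistic like this one.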

AlphaFold3 seems to be doing something broader. Its antibody-antigen performance is especially interesting because there are no MSAs to extract clues from. Antibodies and their targets do not share an evolutionary history. To do well there, the model has to learn something about protein surfaces themselves: which shapes, chemistries, and local geometries are likely to be compatible with each other. That is a different kind of signal than residue coevolution within one protein family.

This is where MGnify-scale data may matter. Metagenomic sequence resources expose models to enormous numbers of natural variants, many from organisms we have never cultured. The empirical clue is that models trained with MGnify-scale protein distillation seem to separate most clearly on antibody-antigen prediction, where direct coevolution cannot explain the interaction signal (Supplementary info). That increased coverage of sequence space looks...
