Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold
How ESMFold and ESM3 replace explicit MSAs with encoder-only transformers
Note: This post is part of the “Understanding Protein Language Models” Series:
Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2
[This article] Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold
Overview of the Main Ideas
AlphaFold2’s MSA: AlphaFold2 identifies proteins evolutionarily related to the target sequence and builds a multiple sequence alignment (MSA). In the Evoformer block, row-wise (within-sequence) and column-wise (across-sequence) attention on this MSA yields information about co-evolving residues. This MSA-based representation is then integrated into a pair representation matrix, ultimately helping AlphaFold2 predict the 3D structure.
ESMFold’s Language Model Encoding: In ESMFold, the MSA step is replaced by a large protein language model (ESM-2) trained via a Masked Language Modeling (MLM) objective. As in standard large language models for text, the hidden layers of the encoder learn semantic and syntactic regularities—in this case, biochemical and structural patterns. The result is that ESMFold can leverage these learned encodings to identify motifs and co-evolving positions without explicitly performing genetic database searches or building large MSAs.
Conceptual Motif Lookup: We can interpret ESM-2’s embeddings as performing a “continuous fuzzy lookup” within an implicit database of protein motifs. Because the language model was pretrained on massive amounts of protein data, it has effectively learned how residues co-occur—and thus co-evolve—within protein families. This internal representation replaces the explicit MSA step.
Below, we will dive into how this replacement works in more detail, starting with a short recap of AlphaFold2’s MSA-based pipeline and then exploring how ESMFold (and ESM-2 as its core) sidesteps explicit alignment by using learned representations.
1. Revisiting AlphaFold2’s MSA-Based Approach
1.1 Gathering Evolutionary Information
AlphaFold2 conducts genetic searches against databases such as MGnify, UniRef90, Uniclust30, and BFD to identify sequences that share evolutionary relationships with the target sequence. From these hits, it constructs an MSA:
\(\mathbf{A} = \bigl(s_{k,i}\bigr) \in \mathcal{V}^{S \times L},\)
where L is the length of the target sequence, S is the number of evolutionarily related sequences found, and \(\mathcal{V}\) denotes the amino-acid alphabet plus a gap symbol. Here, s_{k,i} denotes the i-th residue of the k-th sequence in the alignment. Under the hypothesis that residues co-evolve, the MSA serves as an external source of statistical correlations about which residues likely pair or contact each other in 3D space.
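As a toy illustration of the signal an MSA carries, the sketch below builds a tiny hypothetical four-sequence alignment as an S × L character array and scores how often two columns vary together. The sequences and the coupling statistic are made up for illustration; real pipelines use covariance- or mutual-information-style statistics over thousands of sequences.

```python
import numpy as np

# Hypothetical toy MSA: S = 4 aligned sequences of length L = 6.
# Rows are sequences, columns are alignment positions ("-" would mark gaps).
msa = np.array([
    list("MKTAYI"),
    list("MKSAYV"),
    list("MRTAYI"),
    list("MRSAYV"),
])
S, L = msa.shape

def column_coupling(msa, i, j):
    """Fraction of sequence pairs in which columns i and j vary together.

    A deliberately crude stand-in for the covariance / mutual-information
    statistics that real co-evolution analyses compute over an MSA.
    """
    coupled, total = 0, 0
    for a in range(S):
        for b in range(a + 1, S):
            differs_i = msa[a, i] != msa[b, i]
            differs_j = msa[a, j] != msa[b, j]
            total += 1
            if differs_i == differs_j:  # both columns change together, or both stay fixed
                coupled += 1
    return coupled / total

# Columns 2 and 5 (T/S and I/V above) always change in lockstep -> coupling 1.0,
# hinting that the two positions may interact structurally.
print(column_coupling(msa, 2, 5))
```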
1.2 Evoformer and Pair Representation
In the AlphaFold2 pipeline:
MSA Representation M: A 3D tensor
\(\mathbf{M} \in \mathbb{R}^{S \times L \times c}\), where c is the dimensionality of each residue embedding.
Pair Representation P: A 2D grid of embeddings
\(\mathbf{P} \in \mathbb{R}^{L \times L \times c_z}\), where each P_{i,j} is a learned embedding representing the pairwise relationship between residue i and residue j in the target sequence.
Inside the Evoformer block, row-wise and column-wise attention update the MSA representation:
Row-wise (Within-sequence) Attention
\(\text{AttnRow}(\mathbf{M})_{k, i} = \sum_{m=1}^{L} \alpha_{k,i,m} \, \bigl(W^V \mathbf{M}_{k,m}\bigr),\) where α_{k,i,m} are attention weights over positions m within sequence k.
Column-wise (Across-sequence) Attention
\(\text{AttnCol}(\mathbf{M})_{k, i} = \sum_{n=1}^{S} \beta_{i,k,n} \, \bigl(W^V \mathbf{M}_{n,i}\bigr),\) where β_{i,k,n} are attention weights over sequences n at position i.
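A shape-level PyTorch sketch of these two attention patterns may help. The sizes S, L, c are hypothetical, a single head is used, the three projections are shared between both directions, and the Evoformer's gating and pair-representation bias terms are omitted:

```python
import torch
import torch.nn.functional as F

S, L, c = 8, 50, 32          # hypothetical sizes: sequences, length, channels
M = torch.randn(S, L, c)     # MSA representation

Wq = torch.nn.Linear(c, c, bias=False)
Wk = torch.nn.Linear(c, c, bias=False)
Wv = torch.nn.Linear(c, c, bias=False)

def row_wise_attention(M):
    # Attend across positions m = 1..L, independently for each sequence k.
    q, k, v = Wq(M), Wk(M), Wv(M)                        # each (S, L, c)
    scores = torch.einsum("sic,sjc->sij", q, k) / c**0.5
    alpha = F.softmax(scores, dim=-1)                    # alpha[k, i, m]
    return torch.einsum("sij,sjc->sic", alpha, v)

def column_wise_attention(M):
    # Attend across sequences n = 1..S, independently for each position i.
    q, k, v = Wq(M), Wk(M), Wv(M)                        # each (S, L, c)
    scores = torch.einsum("sic,tic->ist", q, k) / c**0.5
    beta = F.softmax(scores, dim=-1)                     # beta[i, k, n]
    return torch.einsum("ist,tic->sic", beta, v)

M = M + row_wise_attention(M)      # within-sequence update
M = M + column_wise_attention(M)   # across-sequence update
```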
After these attention layers (plus MSA transition layers, i.e., a 2-layer MLP), AlphaFold2 computes an Outer Product Mean (OPM) that integrates MSA embeddings into the pair representation:
\(\text{OPM}_{i,j} = \frac{1}{S} \sum_{k=1}^{S} \mathbf{u}_{k,i} \otimes \mathbf{u}_{k,j},\)
where u_{k,i} is the final MSA embedding vector for residue i in sequence k. This OPM_{i,j} is then added (or concatenated and projected) into P_{i,j}, effectively injecting co-evolutionary signals gleaned from the MSA into the residue-pair representation.
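Continuing the sketch above with the same hypothetical shapes (the flatten-and-project step into width c_z is a simplification of how the outer product enters the pair representation):

```python
import torch

S, L, c, c_z = 8, 50, 32, 16        # hypothetical sizes
U = torch.randn(S, L, c)            # final MSA embeddings u_{k,i}
project = torch.nn.Linear(c * c, c_z)

# Outer product of u_{k,i} and u_{k,j}, averaged over the S sequences,
# then projected down to the pair-representation width c_z.
opm = torch.einsum("sic,sjd->ijcd", U, U) / S      # (L, L, c, c)
opm = project(opm.reshape(L, L, c * c))            # (L, L, c_z)

P = torch.zeros(L, L, c_z)                         # pair representation
P = P + opm                                        # inject the co-evolution signal into P_{i,j}
```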
2. ESMFold: Replacing MSA with Language Modeling
2.1 The Core Mechanism: Encoder-Only Transformer
ESMFold (and its backbone ESM-2 model) is built around a large encoder-only transformer. It is trained with the masked language modeling objective, meaning it tries to reconstruct masked or hidden residues from context. This training strategy, originally popularized by BERT in natural language processing, has an important effect: it forces the model to encode in its weights the relevant “contexts” that predict each amino acid.
Mathematically, if x=(x_1,x_2,…,x_L) is the protein sequence and x_k is replaced by a special [MASK] token with some probability, the MLM training objective is
\(\mathcal{L}_{\text{MLM}}(\theta) = -\,\mathbb{E}\Bigl[\,\sum_{k \in \mathcal{M}} \log p_\theta\bigl(x_k \mid x_{\setminus \mathcal{M}}\bigr)\Bigr],\)
where \(\mathcal{M}\) is the set of masked positions and p_θ is parameterized by the encoder transformer. Over billions of observed residues, the model internalizes the patterns of co-occurrence across diverse protein sequences.
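A stripped-down version of this objective, with random tokens and a small randomly initialized encoder standing in for real protein data and for ESM-2, looks like this:

```python
import torch
import torch.nn.functional as F

vocab_size, mask_id, L, d = 33, 32, 50, 64        # hypothetical vocabulary and sizes
embed = torch.nn.Embedding(vocab_size, d)
encoder = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2,
)
lm_head = torch.nn.Linear(d, vocab_size)

x = torch.randint(0, 20, (1, L))                  # a toy "protein" of L residues
mask = torch.rand(1, L) < 0.15                    # mask roughly 15% of positions
x_masked = x.masked_fill(mask, mask_id)

logits = lm_head(encoder(embed(x_masked)))        # p_theta(x_k | masked context)
loss = F.cross_entropy(logits[mask], x[mask])     # only masked positions contribute
loss.backward()
```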
2.2 Implicit Motif Lookup
Where AlphaFold2 uses explicit lookups in an MSA database (plus explicit attention across sequences), ESM-2’s learned embeddings do something analogous “in one shot.” After pretraining, the internal representation h_i of each residue (the hidden state at position i) summarizes the contexts encountered during training. In effect, for any position i,
h_i has high similarity to h_j if residues x_i and x_j frequently appear in similar sequence contexts in the training set.
By extension, if an entire sequence x has patterns analogous to known motifs (e.g., an ATP-binding site pattern, a signal peptide motif, or secondary-structure fragments), then the embeddings reflect these patterns—allowing ESMFold to “retrieve” them without an explicit MSA.
You can view this as a “continuous fuzzy matching” process, in which the key, query, and value projection matrices of the transformer contain compressed representations of how residues co-occur. Rather than computing dynamic-programming-based edit distances (or alignments) against a large external database, the model’s attention modules effectively perform an alignment on the fly in a continuous, high-dimensional space.
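To poke at these embeddings directly, the sketch below assumes the open-source fair-esm package and its esm2_t33_650M_UR50D checkpoint (one of several published ESM-2 sizes); the input sequence is an arbitrary example:

```python
import torch
import esm  # pip install fair-esm

# Load a pretrained ESM-2 model and its tokenizer ("alphabet").
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]   # arbitrary example sequence
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
h = out["representations"][33][0, 1:-1]   # per-residue hidden states, BOS/EOS stripped

# Cosine similarity between residue embeddings: positions that occur in
# similar sequence contexts end up close together in this space.
sim = torch.nn.functional.cosine_similarity(h[:, None, :], h[None, :, :], dim=-1)
print(sim.shape)   # (L, L)
```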
2.3 Integration into Folding
ESMFold then appends a structure-prediction head on top of these ESM-2 embeddings, akin to how AlphaFold2 appends its structure module after the Evoformer. Even though ESMFold no longer has an explicit pair representation from an MSA, it still must estimate which residues interact or contact each other. In current ESMFold architectures:
The final hidden states from the ESM-2 encoder are projected into a lower-dimensional representation that acts like a “pair embedding” for each (i,j).
A geometry module or a series of feed-forward layers further refines these embeddings to produce coordinates or distance/contact maps.
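As a schematic of the first step, here is a generic outer-sum pattern for building a pair embedding from per-residue states; the sizes are made up and this is not a claim about ESMFold's actual folding trunk:

```python
import torch

L, d, c_z = 50, 1280, 128                 # hypothetical sizes
h = torch.randn(L, d)                     # per-residue hidden states from the language model

proj_i = torch.nn.Linear(d, c_z)
proj_j = torch.nn.Linear(d, c_z)

# Pair embedding z_{ij} built from the single-residue states of i and j;
# a geometry module would then refine z into distances, contacts, or coordinates.
z = proj_i(h)[:, None, :] + proj_j(h)[None, :, :]   # (L, L, c_z)
print(z.shape)
```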
In practice, ESMFold’s results are often on par with AlphaFold2 for many proteins, especially those with strong evolutionary constraints. For proteins with scant evolutionary data, ESMFold can sometimes do better than AlphaFold2, because it does not rely so heavily on a large MSA. On the flip side, certain proteins with well-studied deep MSAs can benefit from the explicit signals that AlphaFold2’s large MSA provides.
3. Mathematical Rationales for Replacing MSA
3.1 Complexity and Speed
One major advantage of dropping MSAs is computational efficiency. MSA construction can be prohibitively expensive for large proteins or large sets of queries, since it requires searching massive databases (MGnify, UniRef, etc.) and heuristically aligning thousands of sequences. In ESMFold:
No MSA Search: The model simply takes the query sequence and feeds it through the encoder in a single forward pass.
Single-Pass Complexity: A single transformer forward pass for a sequence of length L costs O(L^2 d), where d is the embedding dimension, and depends only on the query; building an MSA instead requires searching and aligning thousands of sequences, each of length up to L.
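For a rough sense of that single-pass cost, the back-of-the-envelope below uses a hypothetical protein of 400 residues and a width/depth in the ballpark of a mid-sized ESM-2 model, counting only the attention-score arithmetic:

```python
L, d, layers = 400, 1280, 33        # hypothetical length; width/depth roughly ESM-2 650M
flops_per_layer = 2 * L**2 * d      # QK^T scores plus the weighted sum over values
print(f"~{layers * flops_per_layer:.1e} multiply-adds for attention across the stack")
```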
3.2 Continuous Fuzzy Matching Perspective
If we interpret the MSA as a form of nearest-neighbor search (looking for “neighboring” sequences in a large database), then the language model is effectively a learned data structure that has:
Compressed the manifold of known protein sequences into θ (the weights).
Learned an attention-based mechanism to query that internal manifold for relevant contexts.
In typical fuzzy string matching, one might compute edit distances between the query and every entry in the database. In the ESM-2 architecture, the attention mechanism
\(\text{Attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\Bigl(\tfrac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\Bigr)\mathbf{V}\)
acts as a trainable similarity function to identify relevant contexts. The intangible advantage is that these contexts may mix and match partial motifs from multiple “virtual neighbors,” creating a new representation not limited to the top few explicit matches in a database.
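The contrast can be made concrete with a toy comparison: the explicit route scores the query against every database entry with an edit distance, while the learned route scores it once against embeddings held inside the model. The motif strings and the 16-dimensional vectors below are random stand-ins for what pretraining would actually produce:

```python
import numpy as np

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    dp = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    dp[:, 0] = np.arange(len(a) + 1)
    dp[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            dp[i, j] = min(dp[i - 1, j] + 1,
                           dp[i, j - 1] + 1,
                           dp[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return dp[len(a), len(b)]

database = ["MKTAYI", "GAVLIM", "MKSAYV"]      # hypothetical motif "database"
query = "MKTAYV"

# Explicit route: one dynamic-programming alignment per database entry.
explicit_hit = min(database, key=lambda s: edit_distance(query, s))

# Learned route (schematic): a single dot product against vectors stored in the model.
# Random vectors here; in ESM-2 they are shaped by pretraining, so similar motifs score high.
rng = np.random.default_rng(0)
embed = {s: rng.standard_normal(16) for s in database + [query]}
learned_hit = max(database, key=lambda s: float(embed[query] @ embed[s]))

print(explicit_hit, learned_hit)
```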
3.3 Co-evolutionary Signals Without Explicit Alignments
A major reason MSA is so powerful is that it captures co-evolving residues—positions that change in correlated ways across evolutionary history. In a typical MSA-based approach, if residue i mutates from A to G, residue j might consistently switch from T to S. Over many sequences, one infers that i and j likely contact or interact structurally.
By training on a massive corpus, the language model sees countless such correlations in raw sequence form. The emergent embeddings reflect these patterns. Hence, the final hidden state h_i is (indirectly) sensitive to all correlated positions that have ever appeared near that residue in training. So even though ESMFold does not align the query sequence to a database, it has internalized an approximate version of that same statistical correlation from its pretrained weights.
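The fair-esm package exposes a related signal directly: a small head trained on top of ESM-2's attention maps predicts residue-residue contacts from a single sequence, with no alignment built. The checkpoint and example sequence below are arbitrary choices:

```python
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()   # a small ESM-2 checkpoint
model.eval()
batch_converter = alphabet.get_batch_converter()

_, _, tokens = batch_converter([("query", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")])

with torch.no_grad():
    out = model(tokens, return_contacts=True)

contacts = out["contacts"][0]   # (L, L) predicted contact probabilities
print(contacts.shape)
# High values at (i, j) mark residue pairs the model expects to touch in 3D:
# a co-evolution-style signal read off attention maps, with no MSA constructed.
```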
4. Example: From MSA to Language Model—A Toy Mathematical Sketch
Suppose we consider a short hypothetical protein sequence x=(M,K,L,L,P,V,L). In an MSA-based approach, you might gather 10,000 sequences from a database, building:
\(\mathbf{A} \in \mathcal{V}^{S \times L}, \quad S = 10{,}000, \; L = 7,\)
whose first row is the query (M,K,L,L,P,V,L). You then compute attention across these sequences (column-wise) and across residues (row-wise), deriving correlation maps.
In the ESM-2 approach, no explicit MSA is constructed. Instead, during training the model saw thousands (or millions) of sequences resembling (M,K,L,L,P,V,L) or partial subsequences thereof. The MLM objective forced the model to fill in [MASK] tokens in contexts like _ K L _ P V _. Over many instances, it learned which residues are probable at each masked position. As a result, once we feed (M,K,L,L,P,V,L) into ESM-2, the hidden states reflect a “compressed MSA,” effectively picking up correlations that used to require explicit cross-sequence operations.
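The same fill-in-the-blank behavior can be queried at inference time. The sketch below assumes the fair-esm package with a small ESM-2 checkpoint and masks a single position of the toy sequence:

```python
import torch
import esm  # pip install fair-esm

model, alphabet = esm.pretrained.esm2_t12_35M_UR50D()   # a small ESM-2 checkpoint
model.eval()
batch_converter = alphabet.get_batch_converter()

_, _, tokens = batch_converter([("toy", "MKLLPVL")])

pos = 4                                   # token index of the second "L" (position 0 is the BOS token)
tokens[0, pos] = alphabet.mask_idx        # context becomes M K L [MASK] P V L

with torch.no_grad():
    logits = model(tokens)["logits"]      # (1, T, vocab)

probs = torch.softmax(logits[0, pos], dim=-1)
best = int(torch.argmax(probs))
print(alphabet.all_toks[best], float(probs[best]))   # most probable residue at the masked slot
```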
5. Implications and Future Directions
Efficiency Gains: ESMFold runs significantly faster than AlphaFold2 when no precomputed MSA is available, since it skips the database search and alignment step entirely. For proteome-scale structure prediction, this is a game-changer.
Handling Novel Proteins: If a target protein has few homologs in public databases, MSA-based models struggle. ESMFold is robust in these “low-homology” cases since it learned general protein grammar from the entire training corpus.
Limited Interpretability: One downside is that MSA-based approaches produce an explicit record of hits and alignments, which can be biologically interpretable (e.g., which species and families contributed the signals). ESMFold’s learned embedding, while powerful, can be less transparent.
Hybrid Approaches: Some emerging methods combine pre-trained embeddings with an MSA for the best of both worlds—particularly for proteins where deep MSAs exist.
Scaling Laws and Emergent Behavior: As ESM models grow (ESM-2, ESM-3, etc.), they exhibit emergent behaviors akin to large language models in NLP. This suggests we may see further improvements in structure prediction, function annotation, and protein design.
6. Conclusion
AlphaFold2’s success showed how vital MSAs are in revealing co-evolutionary signals, which guide 3D structure inference. ESMFold’s fundamental insight is that you can pre-learn these signals at massive scale by treating protein sequences as “language.” Then, instead of collecting an MSA at inference time, the model effectively “queries” its internal knowledge of sequence co-occurrences, learned through the MLM objective.
In both approaches, the central idea is to approximate how residues covary. In AlphaFold2, that covariance emerges explicitly from a large MSA. In ESMFold, it is embedded implicitly in a high-dimensional transformer space. The advantage of the language-model approach is that it (1) eliminates the bottleneck of database searching/alignment, and (2) leverages far more global knowledge than just sequences that happen to align with the target protein.
Mathematically, we can view these approaches as two different ways to compute a “similarity function” over the manifold of protein sequences:
AlphaFold2 + MSA: An explicit alignment-based approach that organizes relevant sequences so the model can learn correlations.
ESMFold + Transformer: A large-scale learned approach that stores correlation statistics in the weights, retrieving them through self-attention rather than explicit alignment.
As these language models grow and become more accurate, their potential to replace, or at least augment, MSA-based pipelines will only increase—promising ever-faster and more versatile protein structure prediction.
In summary, ESMFold’s fundamental contribution is demonstrating how one can use a large, pretrained protein-language transformer to replicate (and in some cases surpass) the evolutionary context that an MSA provides. It is a step toward an era where generative models of protein sequence space might supersede explicit database lookups, enabling faster, more flexible, and equally accurate structure predictions—even for proteins with scarce evolutionary data.