AIfun:

ESM3 is great! Before ESM3, SaProt proposed combining AA + discrete structural tokens for pre-training and showed that this helps scale to larger datasets.

SaProt: Protein Language Modeling with Structure-aware Vocabulary
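To give a rough sense of what a structure-aware vocabulary means, here is a simplified sketch (my own illustration, not the exact SaProt tokenizer): each residue gets a combined token pairing its amino acid with a discrete Foldseek 3Di structure state, so the vocabulary is roughly the Cartesian product of the two alphabets.

```python
# Simplified sketch of a SaProt-style structure-aware vocabulary (details differ in the paper):
# each position is tokenized as an (amino acid, discrete structure state) pair.
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")                # 20 standard residues
structure_states = list("acdefghiklmnpqrstvwy") + ["#"]   # Foldseek 3Di states + "#" for unknown structure

# Combined vocabulary ~= Cartesian product of the two alphabets (plus special tokens).
vocab = [aa + ss for aa in amino_acids for ss in structure_states]
print(len(vocab))  # 20 * 21 = 420 combined tokens in this toy version

# A residue then becomes a single token like "Md" (amino acid M in structure state d).
example_tokens = ["Md", "Ka", "Tp", "A#"]  # hypothetical tokenization of "MKTA"
```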

Tommy:

Do you have any hypothesis on why a simple BERT-like structure for ESM performs way better than others? It almost sounds too simple.

Chris Hayduk:

Hey Tommy! Just following up on your comment because, in the couple days since I responded, we actually DID get a paper showing how scaling up GPT-style architectures performs on protein sequence data! You can find it here: https://www.biorxiv.org/content/10.1101/2025.04.15.649055v1

Broadly, I think this Twitter thread has a solid set of takeaways: https://x.com/_judewells/status/1912743353608741260

The overall gist is that PLMs show scaling curves similar to natural language models on the validation set (i.e. as we increase scale, validation set loss reliably and predictably drops). However, unlike natural language models, this improvement in the ability to predict the next token in an amino acid sequence does NOT translate to improved downstream task performance. They tested models all the way up to 46B parameters, and for most downstream tasks there were hugely diminishing returns to anything above 300M parameters.

I think this finding actually ties in nicely with the section of this article titled "The Limitations of Pure Sequence Data". In natural language, we see all of instructions, data, and solutions (since all three are representable as text), and we see concepts represented at different scales (e.g. a summary of a book, the book itself, an analysis breaking down a specific section of a book, etc). This allows natural language models to "generalize" because they are able to learn instructions, how to interpret data, and how to solve the instructions given the data, ALL from the pretraining objective! Again, this is only possible because all of these things are representable as text!

By contrast, protein language models are constrained to only predicting the next token in an amino acid sequence - they never "see" the higher or lower scales during pretraining (such as what are the physicochemical properties of this protein? What is the high-level function of this protein? What cell type is this protein located in? What is the typical expression level for this protein in that cell type? What happens to that cell if the protein expression is turned off? What about if it's upregulated?). These things are not representable as amino acid sequences, which is what PLMs are trained to process and predict. As a result, there is a point of diminishing returns to scale for protein language models beyond which we have extracted all we can from the pure next-token objective, without including all of this additional context for the model to learn from.

Tommy:

this is a phenomenal reply. Thank you for spending the time writing this, Chris!

From Twitter thread: "there is relatively little to be gained from massively scaling protein language models" :v

Chris Hayduk:

If we're comparing to GPT-style decoder-only architectures, in some sense the BERT-like structure for ESM is actually quite simple! The main differences are the attention pattern and the training objective rather than any extra machinery. A GPT-style model uses causal self-attention, so each position only attends to the tokens before it, and it is trained to predict the next token. A BERT-style encoder like ESM uses bidirectional self-attention, so every position attends to the whole sequence, and it is trained to reconstruct masked tokens from the full surrounding context. That bidirectional context is where the extra flexibility comes from: for every prediction, the model conditions on both sides of the sequence, which an equivalently-sized GPT-style model never gets to do during training.
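To make that contrast concrete, here is a minimal PyTorch sketch (my own toy example, not code from ESM or any GPT implementation) of the difference between bidirectional and causal self-attention:

```python
# Toy illustration of the attention-pattern difference between a BERT/ESM-style
# encoder (bidirectional) and a GPT-style decoder (causal). PyTorch >= 2.0 assumed.
import torch
import torch.nn.functional as F

seq_len, d_model = 8, 16
x = torch.randn(1, seq_len, d_model)   # stand-in embeddings for an 8-residue sequence
q = k = v = x                          # single head, no learned projections, for clarity

# BERT/ESM-style: every position attends to the whole sequence.
bidirectional_out = F.scaled_dot_product_attention(q, k, v)

# GPT-style: each position attends only to itself and earlier positions.
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(bidirectional_out.shape, causal_out.shape)  # same shape, different visible context
```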

I think this leads us to two different questions:

1. Why do protein models use the BERT-style encoder architecture when GPT-based architectures are typically more efficient to train + serve and have shown better performance on text?

2. Why do language models work at all for learning properties of proteins? Why aren't things like multiple sequence alignments (MSA) or physicochemical properties needed?

For 1, I don't think I have a good answer, and I'm actually not sure that it's been studied well. My intuitive response is that the structure of proteins lends itself better to fill-in-the-middle losses than to next-token prediction as used in GPT models, since amino acid sequences aren't really linear in practice - when proteins fold, two amino acids that are far apart in the sequence may end up bound together in 3D space. However, this really isn't an argument in favor of the BERT architecture. You could just as easily use the masked language modeling training objective seen with BERT in a decoder-only, GPT-style architecture. So if this is the true reason for the use of the BERT model, it may be more inertia in the types of losses used with certain model types than anything else.
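For what it's worth, here is a toy sketch (my own illustration, with made-up shapes and a random tensor standing in for the model) of how the two objectives differ on an amino-acid sequence: next-token prediction shifts the targets by one position, while masked language modeling hides a fraction of residues and predicts them from the full context.

```python
# Toy comparison of the two pre-training objectives on one amino-acid sequence.
# `logits` stands in for a model's output; in practice you would run the model
# on the (corrupted) input before computing each loss. PyTorch assumed.
import torch
import torch.nn.functional as F

vocab_size, mask_id = 25, 24                 # ~20 residues + special tokens (made-up sizes)
tokens = torch.randint(0, 20, (1, 12))       # one toy 12-residue sequence
logits = torch.randn(1, 12, vocab_size)      # stand-in for model outputs at each position

# 1) Next-token (GPT-style) objective: position t predicts token t+1.
next_token_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..10
    tokens[:, 1:].reshape(-1),               # targets are the sequence shifted by one
)

# 2) Masked-token (BERT/ESM-style) objective: hide ~15% of residues and
#    predict them from the full bidirectional context.
mask = torch.rand(tokens.shape) < 0.15
mask[0, 0] = True                                # ensure at least one position is masked
corrupted = tokens.masked_fill(mask, mask_id)    # this is what the model would actually see
targets = tokens.masked_fill(~mask, -100)        # -100 = ignored by cross_entropy
masked_loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

print(next_token_loss.item(), masked_loss.item())
```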

For 2, you can check out my Understanding Protein Language Models series! (https://www.chrishayduk.com/p/understanding-protein-language-models)

The general gist is that language models have an amazing capacity to learn correlations between tokens in an input sequence and the masked tokens that they need to predict, and these correlations do a great job of recapitulating the knowledge that we might find using a multiple sequence alignment.
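If you want to see that concretely, here is a quick sketch using ESM-2 through the HuggingFace transformers library (checkpoint and masked position picked arbitrarily for illustration). The model's probability distribution over a masked residue is exactly the kind of "which substitutions are compatible here" signal you would otherwise read off a column of an MSA.

```python
# Quick sketch of masked-residue prediction with ESM-2 via HuggingFace transformers.
import torch
from transformers import AutoTokenizer, EsmForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"               # smallest public ESM-2 checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = EsmForMaskedLM.from_pretrained(name)

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"         # toy amino-acid sequence
inputs = tokenizer(seq, return_tensors="pt")

pos = 10                                          # hide one residue (offset includes the CLS token)
inputs["input_ids"][0, pos] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits               # (1, seq_len, vocab_size)

# Top predictions for the masked position reflect which residues fit the context.
probs = logits[0, pos].softmax(dim=-1)
top = probs.topk(5)
print([(tokenizer.convert_ids_to_tokens(int(i)), float(p)) for p, i in zip(top.values, top.indices)])
```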
