Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching
How transformers learn from their input data
Note: This post is part of the “Understanding Protein Language Models” Series:
Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2
[This article] Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching
In the quest to understand modern protein language models like ESM2 and ESM3, we often focus on their impressive empirical results while treating their internal mechanisms as a black box. This post attempts to build intuition about how encoder-only transformers work by drawing an analogy to a simpler, well-understood algorithm: fuzzy string matching. I argue that encoder-only transformers can be viewed as performing a kind of continuous fuzzy lookup against a compressed form of their training data, encoded in their weights and latent space representations.
The Central Analogy
When working with large text corpora, we often need to find similar strings or patterns. The traditional approach employs fuzzy string matching: maintaining a database of all strings and computing edit distances to find matches. An alternative approach, which I argue is conceptually similar but mathematically more sophisticated, uses an encoder-only transformer to compress the patterns in the corpus into model weights, then uses attention mechanisms to find similarities.
Both approaches fundamentally solve the same problem - finding contextually appropriate matches - but do so in radically different ways. Understanding this connection helps demystify how encoder-only transformers work and suggests ways to improve them.
How Encoder-Only Transformers Process Input
To understand the analogy, we first need to build detailed intuition about how encoder-only transformers work. Unlike the full encoder-decoder architecture used in translation, encoder-only transformers take a sequence of tokens and return a sequence of the same length, where each output token is a refined representation incorporating contextual information.
The process begins in the embedding layer, where discrete tokens are converted into continuous vectors. Each input token is first converted to a one-hot encoding - a vector of zeros with a single one indicating the token's identity. This sparse vector is then multiplied by an embedding matrix to produce a dense vector representation. Mathematically, for a token x_i:
one_hot = [0, 0, ..., 1, ..., 0] # 1 at position x_i
embedding = one_hot @ W_emb # Matrix multiplication with embedding matrix
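To make this concrete, here is a minimal numpy sketch (with toy vocabulary and embedding sizes chosen purely for illustration, not the values used by any real model) showing that the one-hot multiplication is equivalent to looking up a single row of the embedding matrix:

import numpy as np

vocab_size, d_model = 25, 8                      # toy sizes for illustration only
rng = np.random.default_rng(0)
W_emb = rng.normal(size=(vocab_size, d_model))   # learned during training; random here

token_id = 3                                     # hypothetical token index
one_hot = np.zeros(vocab_size)
one_hot[token_id] = 1.0

embedding = one_hot @ W_emb                      # the matrix-multiply form above
assert np.allclose(embedding, W_emb[token_id])   # equivalent to a simple row lookup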
To this embedding, we add a positional encoding vector that encodes information about where the token appears in the sequence. The original transformer paper used sinusoidal positional encodings:
import numpy as np

def get_positional_encoding(seq_len, d_model):
    # Sinusoidal encodings: even dimensions use sine, odd dimensions use cosine,
    # at geometrically spaced frequencies (d_model assumed even here).
    position = np.arange(seq_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pos_enc = np.zeros((seq_len, d_model))
    pos_enc[:, 0::2] = np.sin(position * div_term)
    pos_enc[:, 1::2] = np.cos(position * div_term)
    return pos_enc
These sinusoidal encodings have an elegant property: for any fixed offset k, the encoding of position pos + k can be expressed as a linear function of the encoding of position pos, which makes it easy for the model to attend based on relative positions.
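As a small usage sketch (assuming the get_positional_encoding helper above is in scope, with toy dimensions chosen only for illustration), the encodings are simply added to the token embeddings before the first attention layer:

import numpy as np

seq_len, d_model = 7, 8                                  # toy sizes for illustration
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(seq_len, d_model))   # stand-in for embedded tokens
pos_enc = get_positional_encoding(seq_len, d_model)
H = token_embeddings + pos_enc                           # input to the first transformer layer
print(H.shape)                                           # (7, 8)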
The heart of the transformer architecture lies in its self-attention mechanism. For each position in the sequence, the model generates three vectors through learned linear transformations:
Q = H @ W_Q # Query vectors
K = H @ W_K # Key vectors
V = H @ W_V # Value vectors
where H is the matrix of hidden states. The attention scores are then computed as:
attention_scores = softmax(Q @ K.T / sqrt(d_k))
output = attention_scores @ V
This can be written more formally as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
This attention mechanism allows each position to gather information from all other positions, with the weights determined by learned compatibility scores. The scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large in magnitude, which would push the softmax into regions of extremely small gradients.
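Putting these pieces together, a single attention head can be sketched in a few lines of numpy. The dimensions and randomly initialized projection matrices below are illustrative stand-ins for learned weights, and scipy's softmax is used for brevity:

import numpy as np
from scipy.special import softmax

seq_len, d_model, d_k = 7, 8, 4              # toy dimensions for illustration
rng = np.random.default_rng(0)
H = rng.normal(size=(seq_len, d_model))      # hidden states after embedding + position
W_Q = rng.normal(size=(d_model, d_k))        # learned in practice, random here
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = H @ W_Q, H @ W_K, H @ W_V
scores = Q @ K.T / np.sqrt(d_k)              # scaled dot-product compatibility
weights = softmax(scores, axis=-1)           # each row sums to 1
output = weights @ V                         # contextualized representations, [seq_len, d_k]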
The transformer employs multiple attention heads in parallel, each with its own set of query, key, and value projections:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$

where each head is computed as:

$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

After attention, each position's representation goes through a two-layer feed-forward network, applied independently at every position:

$$\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2$$
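Here is a minimal numpy sketch of both components, assuming the projection matrices are supplied externally (in a real model they are learned, and each head simply takes a slice of one large projection, which is how the per-head $W_i^Q$, $W_i^K$, $W_i^V$ are usually realized in practice):

import numpy as np
from scipy.special import softmax

def multi_head_attention(H, W_Q, W_K, W_V, W_O, n_heads):
    # H: [seq_len, d_model]; W_Q/W_K/W_V/W_O: [d_model, d_model]
    seq_len, d_model = H.shape
    d_k = d_model // n_heads
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    heads = []
    for i in range(n_heads):
        q = Q[:, i * d_k:(i + 1) * d_k]      # this head's slice of the projections
        k = K[:, i * d_k:(i + 1) * d_k]
        v = V[:, i * d_k:(i + 1) * d_k]
        weights = softmax(q @ k.T / np.sqrt(d_k), axis=-1)
        heads.append(weights @ v)
    return np.concatenate(heads, axis=-1) @ W_O   # Concat(head_1..head_h) W^O

def feed_forward(x, W_1, b_1, W_2, b_2):
    # Position-wise two-layer network: max(0, x W_1 + b_1) W_2 + b_2
    return np.maximum(0.0, x @ W_1 + b_1) @ W_2 + b_2

In the full encoder layer, each of these sublayers is additionally wrapped in a residual connection and layer normalization, which we omit here for brevity.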
Traditional Fuzzy String Matching
To appreciate the analogy, we need to understand how traditional fuzzy string matching works. Given an input string and a database of reference strings, fuzzy matching computes the edit distance between the input and each reference. The edit distance represents the minimum number of operations (insertions, deletions, or substitutions) needed to transform one string into another.
The core of fuzzy string matching is the computation of edit distance. For strings s and t, the dynamic programming recurrence is:
def edit_distance(s, t):
    m, n = len(s), len(t)
    dp = np.zeros((m + 1, n + 1), dtype=int)
    # Base cases: matching against the empty string costs the full prefix length
    for i in range(m + 1):
        dp[i, 0] = i
    for j in range(n + 1):
        dp[0, j] = j
    # Fill dp table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s[i - 1] == t[j - 1]:
                dp[i, j] = dp[i - 1, j - 1]
            else:
                dp[i, j] = 1 + min(
                    dp[i - 1, j],      # deletion
                    dp[i, j - 1],      # insertion
                    dp[i - 1, j - 1],  # substitution
                )
    return int(dp[m, n])
Mathematically, the recurrence relation is:

$$D_{i,j} = \begin{cases} D_{i-1,\,j-1} & \text{if } s_i = t_j \\ 1 + \min\left(D_{i-1,\,j},\; D_{i,\,j-1},\; D_{i-1,\,j-1}\right) & \text{otherwise} \end{cases}$$

with base cases $D_{i,0} = i$ and $D_{0,j} = j$.
The computation proceeds through dynamic programming, building a matrix where each cell represents the minimum number of operations needed to match a prefix of the input string to a prefix of the reference string. The final cell gives the total edit distance between the strings. By computing this distance for each reference string and sorting the results, we can find the closest matches in the database.
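As a usage sketch (assuming the edit_distance function above and a small made-up reference database), the complete lookup is just a linear scan followed by a sort:

def find_closest_matches(query, database, top_k=3):
    # Score every reference string by edit distance to the query, then keep the closest.
    scored = sorted((edit_distance(query, ref), ref) for ref in database)
    return scored[:top_k]

database = ["sitting", "mitten", "kitchen", "smitten"]    # made-up reference strings
print(find_closest_matches("kitten", database, top_k=2))  # [(1, 'mitten'), (2, 'kitchen')]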
While conceptually simple and mathematically elegant, this approach becomes computationally expensive for large databases, requiring time proportional to both the string lengths and the size of the database. Various optimizations exist, such as trie structures and early pruning, but the fundamental challenge of scaling remains.
The Transformer as Compressed Fuzzy Matching
Here we arrive at the core insight: the encoder-only transformer effectively compresses the pattern-matching capabilities of fuzzy string matching into its weights. During training, the model learns to encode the essential patterns and relationships present in the training data into its parameters.
The embedding matrix learns to map tokens to a continuous space where similar tokens are close together. The attention weights learn which patterns of tokens commonly co-occur, while the feed-forward layers learn to combine these patterns into higher-level features. Each successive layer captures progressively more abstract relationships. This process is analogous to building an optimized index of the training data, but instead of storing exact strings, we store distributed representations of patterns and relationships.
The connection between transformers and fuzzy matching becomes clearer when we compare their similarity computations. In fuzzy matching, the similarity between strings s and t is the negative edit distance:

$$\mathrm{sim}(s, t) = -\,\mathrm{EditDistance}(s, t)$$

In transformer attention, the similarity between a query vector q and a key vector k is the scaled dot product:

$$\mathrm{sim}(q, k) = \frac{q \cdot k}{\sqrt{d_k}}$$
We can view the transformer's learned weights as parameterizing a continuous relaxation of edit distance. The attention mechanism implements this relaxed distance metric:
from scipy.special import softmax

def attention_similarity(query, key, value):
    # query: [seq_len, d_k], key: [seq_len, d_k], value: [seq_len, d_v]
    scores = query @ key.T / np.sqrt(query.shape[-1])  # scaled dot-product scores
    attention_weights = softmax(scores, axis=-1)       # [seq_len, seq_len], rows sum to 1
    return attention_weights @ value                   # [seq_len, d_v]
Several observations support this compression view. The systematic improvement in model performance with increasing size suggests that larger models can store more detailed patterns from the training data. Analysis of attention patterns reveals that different heads learn interpretable relationships that match linguistic or domain structure. The organization of the embedding space shows meaningful clustering of similar tokens and preservation of analogical relationships.
When we run a sequence through the transformer, the process mirrors fuzzy matching but operates in a continuous space. The initial embedding maps tokens to vectors, analogous to preparing strings for comparison. The self-attention mechanism computes similarity scores between positions, playing a role similar to edit distance calculation, but using a learned, context-dependent metric. Multiple layers progressively refine these representations, like iteratively improving string alignment.
We can formalize this connection mathematically. In fuzzy matching, similarity is measured as the negative minimum cost of operations needed to transform one string into another. In transformer attention, similarity is measured through scaled dot products between query and key vectors. The transformer effectively learns a continuous approximation of edit distance that can capture more nuanced relationships.
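As a toy illustration of this formalization (not how any real model computes anything), we can turn negative edit distances into a softmax distribution over a tiny database; the result plays the same role as a single row of attention weights. This assumes the edit_distance function defined earlier is in scope:

import numpy as np
from scipy.special import softmax

query = "kitten"
database = ["mitten", "kitchen", "sitting"]             # made-up reference strings

similarities = np.array([-edit_distance(query, ref) for ref in database], dtype=float)
weights = softmax(similarities)                         # discrete analogue of attention weights
for ref, weight in zip(database, weights):
    print(f"{ref}: {weight:.2f}")                       # closer strings receive more weight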
Concrete Example: Protein Sequences
Consider a concrete example from protein sequence analysis. In traditional fuzzy matching, we might have a query sequence "MKLLPVL" and search a database containing sequences like "MKLLTVL" (one substitution) or "MLKPVL" (two operations). Each comparison requires explicit computation of edit distances. This type of genetic database search is used by MSA-based models such as AlphaFold2 and AlphaFold3; its mechanics are described in the previous post in the series, Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2.
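Assuming the edit_distance function from earlier is in scope, we can verify the distances quoted above:

print(edit_distance("MKLLPVL", "MKLLTVL"))   # 1: a single P -> T substitution
print(edit_distance("MKLLPVL", "MLKPVL"))    # 2: one deletion plus one substitution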
The transformer approach is markedly different. After embedding the sequence into continuous vectors, self-attention finds similar patterns that have been compressed into the model's weights during training. The output reflects patterns observed during training, but crucially, the model can combine these patterns in novel ways. The transformer effectively "remembers" which amino acid substitutions are biochemically plausible in each context, without storing explicit sequences.
In the next post in this series, we will develop this idea further and see how protein language models allow us to replace the MSA step in AlphaFold, creating faster structure prediction models that generalize better to proteins with few available related sequences.
Conclusion
Viewing encoder-only transformers as performing compressed fuzzy matching provides powerful intuition about their operation. Rather than seeing them as black boxes, we can understand them as learning to compress and query a vast database of patterns from their training data. This perspective suggests that improvements in transformer architecture may come from better compression techniques for storing training patterns, more efficient similarity computations, and explicit incorporation of string matching algorithms.
Future research might investigate how much pattern information is stored in different parts of the model, how different architectures affect compression quality, and whether we can design better compression mechanisms inspired by string algorithms. We might also explore the theoretical limits of this compression approach and its implications for model scaling.
The success of this architecture in domains like protein sequence modeling suggests that the ability to learn and compress domain-specific similarity metrics is a powerful paradigm. As we continue to develop these models, maintaining this conceptual connection to classical algorithms may help guide the way to more efficient and effective architectures.