<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Musings by Chris Hayduk: The Sciences]]></title><description><![CDATA[My science-related posts, typically covering technical AI deep dives, AI policy, tech business strategy, and biology]]></description><link>https://www.chrishayduk.com/s/the-sciences</link><image><url>https://substackcdn.com/image/fetch/$s_!UbBK!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ef8414-9281-48f9-b72c-6c8dc93c7664_799x799.png</url><title>Musings by Chris Hayduk: The Sciences</title><link>https://www.chrishayduk.com/s/the-sciences</link></image><generator>Substack</generator><lastBuildDate>Thu, 09 Apr 2026 18:55:09 GMT</lastBuildDate><atom:link href="https://www.chrishayduk.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Chris Hayduk]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[chrishayduk@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[chrishayduk@substack.com]]></itunes:email><itunes:name><![CDATA[Chris Hayduk]]></itunes:name></itunes:owner><itunes:author><![CDATA[Chris Hayduk]]></itunes:author><googleplay:owner><![CDATA[chrishayduk@substack.com]]></googleplay:owner><googleplay:email><![CDATA[chrishayduk@substack.com]]></googleplay:email><googleplay:author><![CDATA[Chris Hayduk]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[A Tale of Two Futures]]></title><description><![CDATA[America is building Skynet. 
China is building The Jetsons.]]></description><link>https://www.chrishayduk.com/p/a-tale-of-two-futures</link><guid isPermaLink="false">https://www.chrishayduk.com/p/a-tale-of-two-futures</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Sun, 11 Jan 2026 23:07:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TYv0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TYv0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TYv0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TYv0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TYv0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TYv0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TYv0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10015133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TYv0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TYv0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!TYv0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TYv0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The US and China are engaged in a two-country race towards an AI-powered future. Both countries have directed funding, policy initiatives, and talent towards the AI sector at levels matched only by the internet buildout of the 1990s or railroad construction of the 1880s. But both of these countries are building towards diametrically opposed futures. </p><p>One country has taken the view that it is on the path to artificial superintelligence (ASI) &#8212; the point at which AI will be more effective than the most capable humans at every task. The view of their prominent AI labs is that once ASI is achieved, there will be a runaway intelligence explosion, with the AI rapidly improving itself and reaching unthinkable levels of intelligence. This genius AI will then be able to solve our most pressing problems in mathematics, physics, philosophy, and more with minimal difficulty.</p><p>The other country has taken the view that it is not peak intelligence that matters, but rather the distribution of intelligence. It aims to develop intelligent AI models (though not superintelligent) that are fast and cheap enough to be embedded in machines across the economy. It wants household robots in every home, talking cars, and refrigerators that can do the grocery shopping.</p><p>One future is top-down, the other is bottom-up. One is centralized, the other is decentralized. One results in gains accruing to the select few corporations, the other results in gains throughout the entire economy. 
One is Skynet, the other is The Jetsons.</p><p>The great irony of the current AI race situation is that the centralized SkyNet future belongs to the democratic United States, and the decentralized Jetsons future belongs to authoritarian China.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Musings by Chris Hayduk is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Skynet Future</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9mVt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9mVt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9mVt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8615005,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!9mVt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Skynet is the fictional AI developed by Cyberdyne Systems in the original Terminator movie. It&#8217;s a large, powerful AI system that, once deployed, rapidly increases its intelligence to the point of becoming self-aware. When it becomes self-aware, it becomes self-interested and decides that the best way to preserve its existence is to eliminate <em>all </em>humans. Skynet then appropriates the nuclear codes of the United States and launches these nuclear weapons in an attempt to eliminate the human race from the face of the Earth. Small groups of humans survive the nuclear fallout, and they live in a post-apocalyptic world fighting robots directed by Skynet that are attempting to exterminate humanity once and for all.</p><p>This vision is very apocalyptic, but I think it captures the sentiment of the US AI scene better than any other popular depiction of AI. From the initial ambitions of Skynet running the United States to its intelligence explosion to its final destruction of humanity, these views are not outlandish in Silicon Valley. In fact, they may be the norm among AI researchers and AI lab CEOs. 
And these views have substantial implications for how AI research is playing out in the United States.</p><p>The CEOs of AI research labs are explicitly building towards this form of superintelligence. Crucially, these labs view pushing the intelligence frontier of these models as the core goal of their research. They explicitly look to benchmarks that measure intelligence on extremely difficult tasks (such as FrontierMath, GDPVal, and ARC-AGI-2) as the core metrics that they&#8217;re optimizing against. Their goal is to produce a &#8220;country of geniuses in a datacenter&#8221;, as Dario Amodei put it in his article &#8220;Machines of Loving Grace&#8221;.  Amodei believes that, once achieved, artificial superintelligence could compress 50-100 years of biological research into 5-10 years. </p><p>Moreover, many of the prominent figures in AI view the path to superintelligence as a race. They believe that as we move closer to superintelligence, we will be able to achieve an automated AI researcher that can analyze its own codebase and improve it rapidly. These algorithmic gains from AI producing its own code improvements will result in an intelligence explosion, such that the first lab to produce an automated AI researcher will immediately gain an insurmountable lead in intelligence over the other labs. As such, not only do the AI lab leaders <em>believe</em> this Skynet scenario, but they also view it as a race that must be won at all costs. The very existence of their companies in their minds depends upon reaching the superintelligent AI first. To understand the pace of AI investment and the amount of that investment that becomes allocated to training ever-larger and more capable models (rather than more cost-efficient or broadly distributed models), you need to internalize that the prominent figures in AI strongly believe this to be the true state of the world. </p><p>To race towards superintelligence, massive increases in the two core inputs to AI training are needed: data and compute. As a result, the AI labs have invested hundreds of billions of dollars into massive compute scale-outs, data acquisitions, and RL environment development. New data centers are coming online that consume gigawatts of electricity to train larger models with higher parameter counts. In addition, these new models are fed data and trained in RL environments produced by hired PhDs and leading experts across math, computer science, finance, and more. 
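<p>To make the structure of that bet concrete, here is a deliberately crude toy model (a minimal Python sketch with made-up parameters, not any lab&#8217;s actual forecast): if an automated researcher&#8217;s rate of self-improvement scales with its current capability, a small head start compounds into a runaway lead; if returns saturate, the lead stops widening.</p><pre><code>
# A deliberately crude toy model of the takeoff argument described above: an
# automated researcher whose research speed scales with its own capability.
# All parameters are arbitrary illustrations, not forecasts.

def simulate(capability, months, rate=0.02, ceiling=None):
    """Monthly capability gain proportional to capability squared (self-
    improvement); an optional ceiling adds diminishing returns near the top."""
    c = capability
    for _ in range(months):
        gain = rate * c * c
        if ceiling is not None:
            gain *= max(0.0, 1.0 - c / ceiling)
        c += gain
    return c

leader, follower = 1.10, 1.00   # the leading lab starts 10% ahead

# Runaway regime (no ceiling): the relative lead compounds beyond the initial 1.10.
print(simulate(leader, 40) / simulate(follower, 40))

# Saturating regime: both labs approach the same ceiling, so the lead stops
# widening and slowly closes back toward 1.0 instead.
print(simulate(leader, 40, ceiling=2.0) / simulate(follower, 40, ceiling=2.0))
</code></pre><p>Under the takeoff assumption, the first ratio keeps increasing as the horizon extends, which is exactly why the labs treat the race as winner-take-all; under saturation, it does not.</p>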
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!75Fj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!75Fj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!75Fj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:119304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!75Fj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To afford the energy, compute, and data to produce these models that push benchmark metrics on FrontierMath and GDPVal, more and more centralization is encouraged in AI research. As shown in the chart above, the cost of building a frontier data center has been increasing exponentially, from around $7 billion in 2022 (when ChatGPT was first released) to a projected $106 billion in 2027. Hence, if there is a fixed amount of private &amp; public funding available to AI companies, it is in investors&#8217; interest to allocate that funding to a small number of companies so that they can afford the requisite frontier data centers to train these models. 
With funding too widely distributed, no single player would be able to produce a model that outperformed the state-of-the-art on these frontier benchmarks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vyXb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vyXb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vyXb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:242767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vyXb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can observe this trend more clearly in the training compute cost of frontier models from 2012 through 2025. We can see that model training costs rapidly increased from roughly $3 million in 2022 to over $300 million in 2025. In addition, the trend line shows this cost increasing at a rate of 0.5 orders of magnitude per year, indicating that the cost of a single training run next year (2027) will increase to about $3 billion. Projected forward to the end of the decade (2030), a training run would cost roughly $100 billion (with the data centers powering such a run likely costing in excess of $1 trillion).</p><p>Given all of this, we can think of the American AI ecosystem as a bet on increasing centralization and increasing scale. It is a bet on a benevolent Skynet future &#8212; leveraging unprecedented resources to build a single, massive AI model capable of solving the most difficult problems in science, technology, politics, and philosophy. 
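<p>As a rough sanity check on the cost trajectory described above, here is a minimal Python sketch of the extrapolation. It simply compounds the roughly $300 million 2025 figure at the 0.5 orders-of-magnitude-per-year trend discussed above; the numbers are illustrative, not a forecast of any particular lab&#8217;s spending.</p><pre><code>
# Rough extrapolation of frontier training-run costs, assuming the
# ~0.5 order-of-magnitude-per-year trend line discussed above continues.
BASE_YEAR = 2025
BASE_COST_USD = 300e6     # roughly $300 million per frontier training run in 2025
OOM_PER_YEAR = 0.5        # trend-line growth rate (orders of magnitude per year)

def projected_cost(year):
    """Project the cost of a single frontier training run for a given year."""
    return BASE_COST_USD * 10 ** (OOM_PER_YEAR * (year - BASE_YEAR))

for year in (2026, 2027, 2030):
    print(f"{year}: ~${projected_cost(year) / 1e9:.1f}B per training run")

# Expected output (roughly):
# 2026: ~$0.9B per training run
# 2027: ~$3.0B per training run
# 2030: ~$94.9B per training run
</code></pre><p>Compounding at half an order of magnitude per year means the entry ticket for a frontier training run roughly triples annually, which is the mechanism behind the centralization pressure described above.</p>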
</p><h2>The Jetsons Future</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JCQV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JCQV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JCQV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8652965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JCQV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The Jetsons is an animated sitcom from the 1960s that depicts a future defined not by a single technological breakthrough but by the accumulation of countless small conveniences &#8212; flying cars, household robots, and apartments that predict each family&#8217;s needs. The Jetsons&#8217; future is not one of transcendence but of leisure &#8212; technology has not produced a godlike intelligence but has instead seeped into every object, automating away the drudgery of daily life. The show&#8217;s vision is one of abundance through distribution: no single machine is particularly impressive, but the sheer proliferation of helpful machines has transformed the nature of work and home life entirely.</p><p>The Jetsons aired the same year as the Cuban Missile Crisis. Sixty years later, it is the CCP &amp; China, not the United States, that is building toward its vision.</p><p>Chinese AI labs burst onto the scene in early 2025 with the release of DeepSeek-R1. Unlike its American counterparts, the notable aspect of DeepSeek&#8217;s model was <em>not</em> its raw performance &#8212; it was strong, but it lagged behind the frontier. The truly impressive aspect of DeepSeek-R1 was that it performed similarly to frontier models at a fraction of the cost, both in terms of serving cost and training cost. For example, as I detailed in another article (<a href="https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world">Open Source LLMs Are Eating the World</a>), DeepSeek-R1 was able to perform nearly as well as OpenAI o1 (the frontier model at the time) on the MMLU Pro benchmark, while only costing $6.75 to run the full benchmark suite compared to o1&#8217;s $75. This represented an 11x drop in the cost to serve the model at roughly equivalent performance levels.</p><p>The success of DeepSeek-R1 has sparked a wave of innovation in open-source Chinese AI. Various companies have entered the fray, including Alibaba with its Qwen series, Z.ai with its GLM series, and Moonshot AI with its Kimi series. Each of these three core competitors, along with DeepSeek, has steadily pushed the cost of economically useful intelligence towards zero. 
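<p>The cost comparison is simple arithmetic, but it is worth making explicit. A minimal sketch using the MMLU Pro figures cited above (the dollar amounts are the benchmark-suite costs from the linked article; everything else is illustrative):</p><pre><code>
# Cost-efficiency comparison for running a full benchmark suite, using the
# MMLU Pro serving costs cited above (illustrative sketch, not a pricing tool).
def cost_ratio(closed_cost_usd, open_cost_usd):
    """How many times cheaper the open model is for the same evaluation."""
    return closed_cost_usd / open_cost_usd

o1_cost = 75.00           # OpenAI o1, full MMLU Pro run
deepseek_r1_cost = 6.75   # DeepSeek-R1, full MMLU Pro run

print(f"DeepSeek-R1 is ~{cost_ratio(o1_cost, deepseek_r1_cost):.1f}x cheaper to serve")
# -> DeepSeek-R1 is ~11.1x cheaper to serve
</code></pre>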
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oZCI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oZCI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 424w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 848w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 1272w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oZCI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png" width="990" height="509" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37587488-3401-4e68-b293-1088e5357ca0_990x509.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:509,&quot;width&quot;:990,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54656,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oZCI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 424w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 848w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 1272w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Produced by GPT 5.2 Pro using MMLU Pro benchmark results</figcaption></figure></div><p>In addition, the speed of innovation has compressed the time it takes Chinese open-source models to match the performance of their closed-source counterparts. The chart above shows the catch-up time required for open-source AI models to match the performance of closed-source models at different performance thresholds on the MMLU-Pro benchmark. Earlier performance levels, such as the 60% threshold, required roughly two-thirds of a year before open-source AI could match closed-source AI. More recent performance thresholds have been reached in far less time, between a quarter and a third of a year. Chinese AI labs have ramped up their investments in data centers and energy, they can now purchase NVIDIA H200 chips, and the Chinese chip ecosystem is maturing more quickly than expected. As a result, we should expect this gap to continue to compress rather than expand. </p><p>The implications here are genuinely massive. This chart shows that open-source AI models from China can match the performance of our leading closed-source models in less than six months, often at an order of magnitude lower cost. 
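<p>For readers who want to reproduce this kind of analysis, the catch-up metric in the chart can be computed from dated benchmark results roughly as follows. This is a minimal Python sketch; the dates and scores below are placeholders chosen only to mirror the lags discussed above, not the actual data behind the chart.</p><pre><code>
from datetime import date

# For each MMLU-Pro threshold, find when the best closed model first crossed it
# and when the best open model did, then take the difference.
# Placeholder data: (first date a model at or above this score was available, score).
closed_models = [(date(2023, 3, 1), 0.60), (date(2024, 5, 1), 0.75), (date(2024, 12, 1), 0.83)]
open_models = [(date(2023, 11, 1), 0.60), (date(2024, 9, 1), 0.75), (date(2025, 3, 1), 0.83)]

def first_crossing(results, threshold):
    """Earliest date at which any listed model met or exceeded the threshold."""
    dates = [d for d, score in results if score >= threshold]
    return min(dates) if dates else None

for threshold in (0.60, 0.75, 0.83):
    closed_date = first_crossing(closed_models, threshold)
    open_date = first_crossing(open_models, threshold)
    if closed_date and open_date:
        lag_years = (open_date - closed_date).days / 365.25
        print(f"{threshold:.0%} threshold: open source lagged by ~{lag_years:.2f} years")

# With the placeholder data above, the lags come out to ~0.67, ~0.34 and ~0.25
# years, echoing the two-thirds-of-a-year-to-one-quarter compression in the text.
</code></pre>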
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wq9P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wq9P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg" width="1172" height="940" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:940,&quot;width&quot;:1172,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!Wq9P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In parallel with these advances in open-source AI, China has been undergoing two massive buildouts over the last five years. The first is a huge increase in energy generation, specifically in solar and nuclear power; the annual additions of installed solar capacity in China, for example, frequently exceed those of the rest of the world combined. This abundance of cheap power is enabling manufacturing at a scale never seen before, and the resulting technological and cost improvements in solar panels and batteries will make energy not only more plentiful overall, but also more local and mobile. In practical terms, devices will be able to carry far more onboard power than they could previously, thanks to improved batteries and local solar panels. 
This will allow devices to include the substantial onboard compute required to run the top open-source AI models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MY6h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MY6h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 424w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 848w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1272w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MY6h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png" width="1024" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Q&amp;A: The global 'trade war' over China's booming EV industry - Carbon Brief&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Q&amp;A: The global 'trade war' over China's booming EV industry - Carbon Brief" title="Q&amp;A: The global 'trade war' over China's booming EV industry - Carbon Brief" srcset="https://substackcdn.com/image/fetch/$s_!MY6h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 424w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 848w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1272w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The second key build-out has been in advanced manufacturing. China was <em>already</em> the world's manufacturing center before this push into advanced manufacturing. But now it has moved up the value chain and very quickly has gone from a laggard in key industries to dominating them. For instance, five years ago, China was largely irrelevant in the electric car market. Now its electric car companies are leading the world and outselling Tesla. Several leading companies are now making strong pushes into humanoid robots and are setting the pace in that category. </p><p>This confluence of advanced manufacturing in robotics &amp; battery-powered vehicles, increased energy generation (specifically solar), and open-source AI that is energy- and compute-efficient will allow China to develop a truly intelligence-powered economy. The major factors will be in place to have human-level AI embedded into a large share of both consumer and industrial products. </p><p>With this approach, China aims to enable a new level of general abundance with household robots, self-driving electric cars, self-directed delivery drones, and household appliances that can make decisions for themselves, such as a refrigerator that can detect when you&#8217;re running low on specific supplies and order them agentically. </p><p>And crucially, this future does <em>not</em> depend on reaching superintelligence. As I&#8217;ve detailed in my other article, <a href="https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world">Open Source LLMs Are Eating the World</a>, many economically relevant tasks operate in a task-saturation regime. That is, once the models exceed some threshold level of performance, future increases in model scale and training compute do not make meaningful differences in task-level performance. Moreover, models today are <em>already</em> capable of performing many economically viable tasks, such as coding complex apps, serving as customer support agents, and more. 
Hence, this broad deployment of cheap AI in physical goods will deliver returns quite quickly.</p><p>Making this intelligence cheap and abundant through energy- and compute-efficient open source AI will unlock massive economic value across the spectrum. There isn&#8217;t much doubt about this. The Jetsons future is clearly within reach. However, there <em>is</em> a question mark over whether we will reach the benevolent Skynet future.</p><h2>Consequences of the Divide</h2><p>US labs are betting on transcendent intelligence. Chinese labs are betting on abundant intelligence. Who wins depends on which future actually arrives.</p><p>Four scenarios are possible. Only one favors the American approach.</p><h3>The Scenario Matrix</h3><figure><img src="https://substackcdn.com/image/fetch/$s_!b0jh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F765ab45d-a2f6-41c7-945d-93828f81c4c1_1820x686.png" width="1456" height="549" alt=""></figure><p>From the above matrix, we see that the American bet requires threading a needle: superintelligence must be achievable, it must trigger a runaway intelligence explosion, <em>and</em> that explosion must translate into massive economic returns unconstrained by physical bottlenecks.</p><p>Remove any link in that chain, and the calculus shifts.</p><p><strong>If superintelligence arrives but can&#8217;t escape physical constraints:</strong> The leading lab pulls ahead on benchmarks, but drug discovery still bottlenecks at FDA trials. Robotics still bottlenecks at manufacturing. Most economically useful tasks don&#8217;t require superintelligence anyway. Open-source competitors deliver similar real-world value at a fraction of the cost.</p><p><strong>If superintelligence arrives but no takeoff occurs:</strong> Energy infrastructure takes years to build. Training runs take months. Even with a superintelligent AI optimizing your codebase, the results manifest slowly enough for competitors to close the gap. No insurmountable lead materializes.</p><p><strong>If superintelligence never arrives:</strong> Intelligence gains follow a sigmoid curve&#8212;rapid improvement, then diminishing returns. At that plateau, the race shifts from &#8220;who&#8217;s smartest&#8221; to &#8220;who&#8217;s cheapest and most distributed.&#8221; China wins that race.</p><h3>The Asymmetric Bet</h3><p>The Jetsons future requires no miracles. 
Cheap, capable AI embedded in robots, vehicles, and appliances delivers value whether or not superintelligence is possible. China&#8217;s bet pays off in three of four scenarios.</p><p>The Skynet future requires everything to go right. Superintelligence must be reachable, takeoff must occur, and physical constraints must not bind. America&#8217;s bet pays off in one scenario.</p><h3>The Implication</h3><p>We see from the scenarios enumerated above that the US AI lab approach is decisively dominant in only one of them. In all other scenarios, Chinese open-source AI is able to keep pace with the closed-source frontier, and, in doing so, it is guaranteed to bring about the Jetsons future that China is building towards. The American benevolent Skynet future is far from guaranteed. </p><p>In sum, to ensure that the United States broadly benefits from the AI revolution it has itself started, we need to take a page from the Chinese AI playbook. We must ensure that even if superintelligence is out of reach, we will have cheap, abundant intelligence suffused throughout the economy. We must ensure that our portable energy infrastructure (i.e., solar panels and batteries), our robotics manufacturing capabilities, and our open source AI efforts are sufficient to power a truly intelligent economy. The failure to do so may cede technological leadership in the 21st century to the CCP &amp; China. </p>]]></content:encoded></item><item><title><![CDATA[Open Source LLMs Are Eating the World]]></title><description><![CDATA[We are benchmarking LLMs incorrectly to predict economic utility]]></description><link>https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world</link><guid isPermaLink="false">https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Fri, 09 Jan 2026 20:15:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_XZh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57385725-8fc4-46fa-a04e-e3e08d16b487_1596x1136.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The default way we evaluate large language models is fundamentally misaligned with how they create economic value. We track frontier capabilities across broad benchmarks (such as ARC-AGI-2, FrontierMath, and SWE-Bench Verified) and implicitly assume that whoever leads on these metrics captures the most value. </p><p>However, this view assumes that what matters is the model&#8217;s maximum intelligence across a broad range of tasks &#8212; the &#8220;PhD intelligence for all&#8221; mantra repeated by the large labs. 
</p><p>For practical company building, this framing is wrong, and understanding why reveals a structural advantage for open source models that holds regardless of when (or whether) we achieve AGI.</p><h2>I. Introduction: The Benchmarking Problem</h2><p>The standard narrative goes something like this: general model value scales with general performance. A model that scores higher on a diverse battery of benchmarks is more valuable than one that scores lower, and the companies training the most capable models will capture the lion&#8217;s share of economic returns.</p><p>But this misses how value actually gets created in practice. Companies don&#8217;t build products that require uniformly excellent performance across all possible tasks. They build for specific use cases: contract analysis, customer support, code generation, medical documentation. Revenue comes from solving customer problems, and customer problems are specific. The &#8220;average&#8221; benchmark performance that frontier labs optimize for doesn&#8217;t map to any real product; it&#8217;s just an abstraction that obscures the actual economics.</p><p>For any given application, what matters is whether your model is good enough at <em>this particular thing</em>, not whether it can solve PhD-level mathematics problems or write publishable research. A legal tech startup needs strong performance on contract reasoning and citation accuracy. A customer support platform needs reliability on intent classification and tone. Neither benefits from improvements to the model&#8217;s ability to prove novel theorems.</p><p>Moreover, the relationship between capability and value is S-curved. Early capability improvements unlock entirely new use cases: a model that goes from 40% to 70% accuracy on a task might cross the threshold from &#8220;useless&#8221; to &#8220;useful with human oversight.&#8221; But a model that goes from 92% to 96% often delivers no additional value, because the human workflow was already designed around spot-checking outputs and the bottleneck has shifted elsewhere: to latency, cost, integration complexity, or user experience.</p><p>This is the crux of the argument: once a model clears the capability threshold for a given task, further intelligence improvements face rapidly diminishing returns. The contract analysis tool that&#8217;s &#8220;good enough&#8221; for lawyers to trust with first-pass review doesn&#8217;t become twice as valuable when the underlying model gets twice as capable. It just becomes overprovisioned.</p><h2>II. The Task Saturation Phenomenon</h2><p>For any specific task, <strong>the marginal value of model capability saturates at some threshold.</strong> Beyond a certain point, users cannot meaningfully distinguish between a model of size X and a model of size n&#215;X for any n &gt; 1.</p>
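<p>To make that saturation claim concrete, here is a minimal, purely illustrative sketch. The logistic value curve, the 70% threshold, and the steepness constant are assumptions chosen only to mirror the 40&#8594;70 and 92&#8594;96 examples above; they are not measurements.</p><pre><code>import math

# Illustrative only: fraction of a task's maximum economic value unlocked
# at a given accuracy, modeled as a logistic curve around a task threshold.
def task_value(accuracy, threshold=0.70, steepness=30):
    return 1 / (1 + math.exp(-steepness * (accuracy - threshold)))

for lo, hi in [(0.40, 0.70), (0.92, 0.96)]:
    gain = task_value(hi) - task_value(lo)
    print(f"{lo:.0%} to {hi:.0%}: marginal value unlocked = {gain:.2f}")

# Approximate output:
# 40% to 70%: marginal value unlocked = 0.50   (crosses the usefulness threshold)
# 92% to 96%: marginal value unlocked = 0.00   (the task is already saturated)
</code></pre>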
<p>Consider what&#8217;s happened to standard benchmarks over the past few years. The chart below tracks top scores on benchmarks like ARC, MMLU, Winograd, HellaSwag, GSM8K, and TruthfulQA against their human baselines:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!7ZLe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea090f68-5a7b-4c59-a68c-2f74fffa13dc_1256x780.png" width="1256" height="780" alt=""></figure><p>The pattern is consistent: rapid improvement followed by convergence toward (and often slightly beyond) human-level performance. Once a benchmark is effectively &#8220;solved,&#8221; additional capability improvements deliver zero marginal value for the tasks that benchmark measures. A model scoring 95% on MMLU isn&#8217;t twice as useful for MMLU-adjacent tasks as one scoring 90%. For most practical purposes, they&#8217;re equivalent.</p><h2>III. Reframing the Analysis: Cost at Fixed Performance</h2><p>If capability saturates for specific tasks, then the relevant question isn&#8217;t &#8220;which model is most capable?&#8221; but rather &#8220;which model solves my task at the lowest cost?&#8221;</p><p>Once we fix a performance threshold (the point at which a task is effectively solved), we can track how the cost to achieve that threshold evolves over time. 
The a16z team did exactly <a href="https://a16z.com/llmflation-llm-inference-cost/">this analysis for MMLU scores</a>:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!_XZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57385725-8fc4-46fa-a04e-e3e08d16b487_1596x1136.png" width="1456" height="1036" alt=""></figure><p>The trend line shows roughly a 10&#215; cost reduction every year for a fixed capability level. But the more important pattern is <em>which models</em> sit on that cost frontier over time. Early in a capability tier&#8217;s lifecycle, closed-source models from frontier labs define the frontier. But within months, open-source alternatives emerge at dramatically lower price points.</p><p>Look at the progression for MMLU &gt; 83: GPT-4 at $45 per million tokens, then GPT-4o at ~$10, then Claude 3.5 Sonnet at ~$10, and finally Llama 3.1 70B pushing costs down toward $0.50. The same pattern plays out for every capability threshold: closed source models solve the task first, and then open source models quickly make it cheaper. </p>
<p>Thus, if we imagine a fixed benchmark score as a proxy for the threshold at which a task is &#8220;solved&#8221;, we see that closed source models have historically had a payoff horizon of roughly one year before open source models made the same capability available at a fraction of the cost.</p><h2>IV. Case Study: MMLU Pro Replication Speed</h2><p>MMLU Pro extends the original MMLU benchmark by increasing multiple choice options from 4 to 10, introducing misleading distractors, and emphasizing reasoning-heavy questions. It&#8217;s a harder benchmark, which allows us to separate out the performance levels of recently released models.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!mz2q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd676c533-3d7d-4f19-8f4b-a3e29c484839_1460x1534.png" width="1456" height="1530" alt=""><figcaption class="image-caption">Benchmark results available <a href="https://www.vals.ai/benchmarks/mmlu_pro">here</a></figcaption></figure><p>Consider the 83% performance threshold. That is, models which answered at least 83% of questions correctly:</p><ul><li><p><strong>OpenAI o1</strong> was the first model to reach this level and did so upon its release on December 5, 2024. Its API pricing was $15 per million input tokens and $60 per million output tokens. <strong>The total cost to run the benchmark was $75.</strong></p></li><li><p><strong>DeepSeek R1</strong> was the first open source model to reach this level when it launched on January 20, 2025, priced at roughly $1.485 per million input tokens and $5.94 per million output tokens. <strong>The total cost to run the benchmark was $6.75.</strong></p></li></ul><p>That&#8217;s an order of magnitude cost reduction in under two months for equivalent task performance. If we want to be generous and use the release date of o1-preview, this still results in a time horizon of only 4 months before DeepSeek matched its performance with an open source model costing an order of magnitude less.</p><p>To drive the point home further still, DeepSeek V3.2 came out on December 1, 2025 and again achieved the 83% performance threshold, but this time at a <strong>cost reduction of more than 30&#215;, approaching two orders of magnitude,</strong> when compared with OpenAI o1. Specifically, <strong>the total cost to run the benchmark was only $2.24.</strong></p><p>Thus, for a fixed level of performance, we see the price drop from $75 to $6.75 to $2.24 over the course of a single year. <strong>As a result, I argue that </strong><em><strong>any</strong></em><strong> task solved by a closed source model will see enterprise buyers transition to cheaper open source models within 6 months to one year.</strong></p>
<p>And there&#8217;s reason to expect this pace to accelerate. As Huawei and SMIC close the gap with NVIDIA and TSMC, and now that NVIDIA potentially regains the ability to sell H200 chips in China, the Chinese open-source labs will have access to better hardware while maintaining their cost structure advantages. We may be looking at only a couple of months between closed-source frontier releases and open-source replication with substantial cost reduction.</p><h2>V. The AGI-Agnostic Conclusion</h2><p>What I think makes this view most compelling is that it doesn&#8217;t depend on AGI being decades away.</p><p>The conventional case for open source often rests on an assumption that we&#8217;re approaching a capability plateau. That is, that base model improvements will slow down, shifting competition to fine-tuning, cost, and vertical specialization. This assumes that the vision of the future espoused by the US AI labs, predicated on artificial superintelligence (ASI) and runaway intelligence explosions, is wrong, while China&#8217;s view of commoditized intelligence is correct. That may well be true, but it&#8217;s a bet on a particular trajectory of AI progress.</p><p>The task saturation argument is stronger because it&#8217;s agnostic to the AGI timeline. When you&#8217;re building a company, you&#8217;re typically building for a specific use case. That means you&#8217;re operating in the saturation regime, not the model scale-up regime. Even if frontier models continue improving rapidly and the AI-2027 timeline plays out, the task your company is built around has a capability threshold beyond which additional model intelligence doesn&#8217;t matter.</p><p>And once you&#8217;re in the saturation regime, the only dimension of competition that matters is cost. Open source wins on cost, systematically and structurally, because open-source economics allow for lower margins and broader distribution.</p><p>The practical takeaway for company builders is this: bias toward open source, and do so for cost reasons rather than capability bets.</p><p>If you&#8217;re building an AI-native product, ask yourself: what capability threshold does my use case actually require? Chances are, that threshold is either already achieved by current open-source models or will be within 6-12 months of a closed-source model first reaching it. Build your infrastructure and workflows around the assumption that you&#8217;ll be running on open-source models, even if you start with closed-source APIs for speed to market.</p>
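<p>As a purely hypothetical illustration of that takeaway, the selection logic is just &#8220;cheapest model that clears the task threshold.&#8221; Every entry in the list below is made up for the sake of the example; plug in your own evaluation scores and prices.</p><pre><code># Hypothetical candidates: (name, accuracy on YOUR task eval, $ per million tokens).
# None of these numbers are real measurements; they only illustrate the selection rule.
candidates = [
    ("closed-frontier-model", 0.97, 15.00),
    ("open-large-model", 0.94, 1.20),
    ("open-small-model", 0.88, 0.20),
]

def pick_model(candidates, threshold):
    """Return the cheapest model whose task accuracy clears the threshold."""
    good_enough = [c for c in candidates if c[1] &gt;= threshold]
    return min(good_enough, key=lambda c: c[2]) if good_enough else None

print(pick_model(candidates, threshold=0.90))   # ('open-large-model', 0.94, 1.2)
print(pick_model(candidates, threshold=0.85))   # ('open-small-model', 0.88, 0.2)
</code></pre>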
<p>The benchmark that matters isn&#8217;t &#8220;which model is smartest.&#8221; It&#8217;s &#8220;which model solves my task cheaply enough.&#8221; And open source is destined to systematically win that competition through relentless cost deflation.</p>]]></content:encoded></item><item><title><![CDATA[It's Time For Google to Acquire Intel]]></title><description><![CDATA[Breaking the Nvidia-TSMC monopoly]]></description><link>https://www.chrishayduk.com/p/its-time-for-google-to-acquire-intel</link><guid isPermaLink="false">https://www.chrishayduk.com/p/its-time-for-google-to-acquire-intel</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Thu, 25 Sep 2025 13:02:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!POjm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!POjm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png" width="1024" height="1024" alt="Generated Image September 24, 2025 - 11:01PM.png"></figure><p>Nvidia made headlines this week when it announced it would invest up to $100&#8239;billion into OpenAI and help deploy at least 10&#8239;GW of AI infrastructure. The move, frequently memed as an &#8220;infinite money glitch,&#8221; with capital and revenue cycling between Nvidia and OpenAI (see the image below), effectively ensures a substantial fraction of Nvidia&#8217;s GPUs will land in OpenAI&#8209;aligned datacenters (via leasing or outright purchases). </p><p>This comes on the heels of OpenAI&#8217;s <strong>&gt;</strong>$300&#8239;billion &#8220;Stargate&#8221; build&#8209;out with Oracle, which targets <strong>~</strong>4.5&#8239;GW of capacity, further tightening the market for top&#8209;end accelerators.</p><p>And that&#8217;s before accounting for OpenAI&#8217;s ongoing expansion on Microsoft Azure, where the relationship now runs under a right&#8209;of&#8209;first&#8209;refusal model for new capacity rather than blanket exclusivity, still conferring practical priority on Azure deployments while allowing OpenAI to add capacity with other partners. 
</p><p><strong>Netting this out:</strong> through the end of the decade, OpenAI has assembled an envelope of roughly 10&#8211;15&#8239;GW of Nvidia&#8209;powered capacity across Oracle, Microsoft, and other partners, with overlap between these footprints; think of this as a shared umbrella rather than purely additive numbers. For context, independent analyses estimate <strong>~</strong>10&#8239;GW of additional AI data&#8209;center power could be needed globally in 2025 alone; in other words, OpenAI&#8217;s program is on the scale of a full year of incremental world AI build&#8209;out.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!kCHD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe695baec-b55a-495b-a17a-a1a31bc5ec51_1004x632.jpeg" width="1004" height="632" alt="Logos of OpenAI, NVIDIA, and Oracle arranged in a triangular pattern with arrows labeled &quot;$100 billion&quot; connecting them. Text at the top reads &quot;THE INFINITE MONEY GLITCH&quot; in red."><figcaption class="image-caption">Courtesy of SemiAnalysis&#8217;s <a href="https://x.com/dylan522p/status/1970346183827783756">Dylan Patel</a></figcaption></figure><p>The above data implies Nvidia GPU availability will tighten <em>substantially</em> for other frontier&#8209;model players&#8212;Anthropic (primarily on AWS), xAI, Meta, and Google DeepMind&#8212;raising effective prices and lead times and forcing harder choices about model cadence, context windows, and training tokens.</p><p>Google has been trying to break out of this Nvidia-dominated mold for years through the development of its own AI&#8209;specialized TPUs for training and inference. But these in-house designed chips still pass through chokepoints that Nvidia heavily influences, especially TSMC wafers and advanced packaging. By the end of 2025, analysts expect Nvidia to be <strong>~</strong>20%+ of TSMC revenue (second only to Apple), and the CoWoS&#8209;class packaging and HBM ecosystems remain binding constraints even as capacity expands. 
TSMC&#8217;s allocation is fundamentally contractual, driven by prepays and take&#8209;or&#8209;pay deals, and it will be reluctant to shift meaningful share away from Nvidia while demand remains red&#8209;hot.</p><p>To escape the straitjacket created by the Nvidia&#8209;OpenAI alignment, Google should buy Intel (or a substantial portion of it), fund the High&#8209;NA EUV ramp, and prepare to manufacture TPUs on Intel fabs as that capacity comes online. That gives Google end&#8209;to&#8209;end control of its AI training infrastructure&#8212;chip architectures, training software, chip manufacturing, and data center buildout&#8212;and a guaranteed runway independent of Nvidia&#8217;s queue. </p><p>Recent events make this even more urgent. Nvidia just disclosed a $5&#8239;billion Intel investment at $23.28/share (roughly 5% of Intel&#8217;s outstanding shares), alongside a product pact in which Intel will build x86 SoCs integrating Nvidia RTX GPU chiplets for PCs and collaborate on custom data&#8209;center CPUs&#8212;clear evidence that Intel&#8217;s roadmap can be steered by anchor customers. Intel is also now soliciting an Apple investment, according to Bloomberg/Reuters reporting.</p><p>Given the rapidly changing dynamics around Intel, Google must act quickly and decisively. For example, a $25&#8239;billion purchase at $35/share would buy on the order of ~714&#8239;million shares, implying ~16%&#8211;17% of Intel based on ~4.37&#8239;billion shares outstanding&#8212;placing Google ahead of both the U.S. government (~10%) and Nvidia (~4&#8211;5%) as the largest shareholder. That level of ownership could anchor governance and direct capex toward TPU&#8209;critical fabs and packaging lines.</p>
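<p>The share math is straightforward to verify. A minimal sketch, using only the per-share prices and the ~4.37&#8239;billion share count quoted above (dilution from any newly issued shares is ignored for simplicity):</p><pre><code># Back-of-the-envelope ownership math using the figures quoted above.
shares_outstanding = 4.37e9   # ~4.37 billion Intel shares

def stake(dollars, price_per_share):
    shares = dollars / price_per_share
    return shares, shares / shares_outstanding

google_shares, google_pct = stake(25e9, 35.00)    # proposed Google purchase
nvidia_shares, nvidia_pct = stake(5e9, 23.28)     # Nvidia's disclosed investment

print(f"Google: {google_shares / 1e6:.0f}M shares, {google_pct:.1%} of Intel")
print(f"Nvidia: {nvidia_shares / 1e6:.0f}M shares, {nvidia_pct:.1%} of Intel")

# Approximate output:
# Google: 714M shares, 16.3% of Intel
# Nvidia: 215M shares, 4.9% of Intel
</code></pre>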
<p>In practice, this looks like the following:</p><ol><li><p><strong>A minority stake + board influence</strong> sufficient to align Intel Foundry&#8217;s roadmap to TPU requirements</p></li><li><p><strong>A TPU-only supply compact</strong>: multi-year, take-or-pay wafer and advanced packaging commitments, with right-of-first-allocation during shortages and pricing bands tied to verifiable tool/packaging milestones. </p></li><li><p><strong>Parallel open&#8209;market TPU SKUs</strong> to keep utilization high and de&#8209;risk capex&#8212;turning Google&#8217;s silicon into a software&#8209;first, capacity&#8209;priced product.</p></li></ol><p>#3 is the longest shot, but perhaps the most enticing benefit of the investment. This would open up a second profit engine to fuel Google&#8217;s growth over the next decade, especially as its Search business comes under threat from AI-search competitors (such as OpenAI&#8217;s search-enabled offerings). In fact, Nvidia&#8217;s data&#8209;center business is now running at an annualized ~$160&#8239;billion revenue pace, which is comparable to Google&#8217;s Search cash cow. Thus, the addition of the TPU revenue line provides substantial growth opportunities and a potential hedge against Google&#8217;s eroding search moat.</p><p>If this plan works, Google gets scheduling certainty, lower $/token, a faster model cadence independent of Nvidia&#8217;s allocation calendar, and another revenue stream that could potentially reach the level of Google Search. If it stumbles, the downside is capped at a financial position that should still appreciate if Intel&#8217;s foundry inflects. Either way, for $25 billion, Google can buy its way out of the Nvidia-TSMC duopoly and into the driver&#8217;s seat of AI compute.</p>]]></content:encoded></item><item><title><![CDATA[The Strategic Implications of GPT-5 for OpenAI]]></title><description><![CDATA[OpenAI shifts away from the enterprise and toward the consumer]]></description><link>https://www.chrishayduk.com/p/the-strategic-implications-of-gpt</link><guid isPermaLink="false">https://www.chrishayduk.com/p/the-strategic-implications-of-gpt</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Fri, 08 Aug 2025 15:39:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_Syy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e6943e8-d8d3-4e51-b3f2-e941d5d890db_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!_Syy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e6943e8-d8d3-4e51-b3f2-e941d5d890db_1536x1024.png" width="1456" height="971" alt=""><figcaption class="image-caption">Image courtesy of GPT-5</figcaption></figure><p>After years of anticipation and hype, GPT-5 is finally out. And the results are decidedly mixed. GPT-5 is undoubtedly a great model &#8212; it is #1 across the board on LMArena, sets new highs in SWE-Bench and a host of other coding tasks, and performs great across a range of math benchmarks. 
However, the expectations for GPT-5 were that it would blow the competition out of the water. Instead, it has made incremental improvements across all of these benchmarks, and is highly likely to be surpassed whenever Google releases its next Gemini model in short order (or just when Gemini 2.5 Deep Think gets benchmarked!). </p><figure><img src="https://substackcdn.com/image/fetch/$s_!FLx3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4abf679a-4d93-45c9-85a5-6435c84b0f6f_750x469.png" width="750" height="469" alt="OpenAI botches the charts in GPT-5 introduction &#8211; FlowingData"><figcaption class="image-caption">Note that Claude Opus 4.1 scored a 74.5% on SWE-bench just days earlier, so GPT-5 performance is virtually the same (also what in the world is going on with charts at OpenAI???)</figcaption></figure><p>The largest gains in GPT-5 came in a less performance-focused area: it seems that, for this release, OpenAI highly prioritized reducing hallucinations and sycophancy in model output.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!dBF7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84b3dbd-444c-4eba-9bf5-0b69ffeb50e2_2048x1089.png" width="1456" height="774" alt="GPT-5 Benchmark Scores | ml-news &#8211; Weights &amp; Biases"></figure><p>So you may not notice large performance differences between GPT-5 and the leading models from other labs (or even when compared to OpenAI&#8217;s o3 model), 
but you likely will notice that the model is much less likely to make things up and say things that are flat out wrong just to produce an answer.</p><p>In addition, you will likely notice in the ChatGPT model picker that all of the previous models are gone: now there&#8217;s only GPT-5. This is another one of GPT-5&#8217;s main contributions &#8212; it greatly simplifies the model selection process. GPT-5 is more of a system than a model, dynamically routing requests to faster LLMs (analogous to GPT 4o) or slower, thinking LLMs (analogous to o3) depending on the complexity of the request.</p><p>(The two above points are important; we&#8217;ll come back to those later).</p><p>Likely in response to some widespread dismay at the performance benchmarks, Sam Altman tweeted the following after the GPT-5 announcement:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b1QM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b1QM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 424w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 848w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b1QM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:464257,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/170448919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b1QM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 424w, 
https://substackcdn.com/image/fetch/$s_!b1QM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 848w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>&#8220;we can release much, much smarter models&#8221;</strong></p><p>It seems Altman is asserting that OpenAI deliberately <em>chose</em> to release a model below the company&#8217;s capabilities, barely edging out its competitors (and likely not even edging out Google&#8217;s leading model) on most performance metrics. Instead, they deliberately <em>chose </em>to focus on reducing hallucinations and streamlining model selection as the main contributions of GPT-5. Why would OpenAI do this? Why not continue setting the benchmark for LLM model performance, as they&#8217;ve done since the ye olde days of GPT-2?</p><p>Because the strategic focus of the company has clearly shifted.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.chrishayduk.com/subscribe?"><span>Subscribe now</span></a></p><h1>Market Dynamics Affecting OpenAI</h1><p>ChatGPT is a consumer application with <a href="https://www.techloy.com/chatgpt-is-on-track-to-reach-700-million-weekly-active-users/">700 million weekly active users</a>. And it is absolutely trouncing the competition in consumer adoption. 
The ChatGPT app in the Apple App Store has 3.3 million reviews, compared to just 377,000 for the Gemini app and 23,000 for the Claude app. This suggests that ChatGPT has a mobile install base roughly 10x the size of Google Gemini and well over 100x the size of Anthropic&#8217;s Claude. Moreover, in June 2025, openai.com had 1.12 billion visits, while gemini.google.com had 265 million and Claude had 113 million &#8212; roughly a 4x lead over Gemini and a 10x lead over Claude in web traffic, again pointing to OpenAI&#8217;s dominant position in the consumer chat space (source: <a href="https://www.semrush.com/">https://www.semrush.com/</a>).</p><p>By contrast, according to <a href="https://menlovc.com/perspective/2025-mid-year-llm-market-update/">a Menlo Ventures report</a> from July 2025, Anthropic is actually the market leader in enterprise LLM API usage, with 32% market share vs. OpenAI&#8217;s 25% in mid-2025. Google is also growing and not far behind OpenAI, at 20% market share, up from 12% in 2024. OpenAI&#8217;s enterprise position has also been eroding sharply, cratering from 50% market share in 2023 down to 25% by mid-2025.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ruow!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ruow!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 424w, https://substackcdn.com/image/fetch/$s_!ruow!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 848w, https://substackcdn.com/image/fetch/$s_!ruow!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 1272w, https://substackcdn.com/image/fetch/$s_!ruow!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ruow!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp" width="1456" height="751" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:751,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Enterprise LLM API market share by usage&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Enterprise LLM API market share by usage" title="Enterprise LLM API market share
by usage" srcset="https://substackcdn.com/image/fetch/$s_!ruow!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 424w, https://substackcdn.com/image/fetch/$s_!ruow!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 848w, https://substackcdn.com/image/fetch/$s_!ruow!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 1272w, https://substackcdn.com/image/fetch/$s_!ruow!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So we see the market dynamics pressing on OpenAI as a company &#8212; absolutely dominant positioning on the consumer side of the market, with a weak (and steadily weakening) position on the enterprise side of the market. This leaves OpenAI with a choice &#8212; double down on its success on the consumer side of the market, or attempt to win in the highly competitive enterprise space. This choice mainly comes down to where the company has a moat that can result in durable profit margins.</p><h1>How Market Dynamics Affect the Models</h1><p>First, let&#8217;s explore the dynamics of the consumer market. Consumers, by and large, make buying decisions not based on performance or objective metrics, but instead based on &#8220;vibes&#8221;. 
You can improve the vibes for a consumer by improving brand positioning (i.e., make the consumer feel a certain emotion from using your product) or by improving the user experience (UX) and user interface (UI) of your product (i.e., make the product more enjoyable for the user).</p><p>OpenAI&#8217;s lead in consumer usage stems primarily from precisely these areas, with its extremely strong branding and UI/UX improvements in its chat interface versus the competition. OpenAI had a strong, multi-year lead due to its first mover advantage, providing its &#8220;ChatGPT&#8221; brand with significant mindshare in the consumer base. In addition, since the release of the original ChatGPT, OpenAI has focused strongly on the web and mobile chat experience. With features like memory and ChatGPT Projects, OpenAI has introduced a high level of personalization for users of the app, thereby creating a high switching cost moat &#8212; if you switch to Claude or Gemini, you can&#8217;t take ChatGPT&#8217;s memories or projects with you. This instantly makes the competing consumer apps less appealing to users in the same way that users of Spotify are reluctant to shift over to Apple Music once they have built up a library of playlists that they enjoy.</p><p>Hence, to improve consumer market share, OpenAI will need to continually pull the two levers of brand positioning and UI/UX improvements. Models can&#8217;t really improve brand positioning much, as that is more a function of marketing, so the main pressure on the model side of the equation will come from the UI/UX push. This pressure results in making models that are <strong>simpler and more enjoyable to use.</strong></p><p>Now we&#8217;ll shift our view to the enterprise market. Businesses, unlike consumers, strictly focus on return on investment when allocating capital expenditures. These ROI calculations will essentially have four inputs when it comes to LLMs: </p><ol><li><p>The API cost per million tokens</p></li><li><p>The number of tokens needed to solve a task</p></li><li><p>The value of that task</p></li><li><p>The performance of the LLM on that task</p></li></ol><p>We can then model the ROI of using an LLM as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\\text{value generated} &amp;= (\\text{value of task}) \\times (\\text{performance of LLM on that task}) \\\\\n\\text{cost} &amp;= (\\text{API cost per million tokens}) \\times (\\text{number of tokens needed to solve the task}) \\\\\n\\text{net profit} &amp;= \\text{value generated} - \\text{cost} \\\\\n\\text{ROI} &amp;= \\frac{\\text{net profit}}{\\text{cost}}\n\\end{aligned}&quot;,&quot;id&quot;:&quot;AVQSRYKDYM&quot;}" data-component-name="LatexBlockToDOM"></div><p>So, from the above, we can see that the only levers that LLM providers can pull to improve the ROI calculations for a company are:</p><ol><li><p>Decrease the cost per million tokens for the API</p></li><li><p>Decrease the number of tokens needed to solve the task</p></li><li><p>Increase LLM performance on the task</p></li></ol><p>Given the pressure from open source contributions (e.g., DeepSeek, Kimi, and Qwen), closed source model providers will never be able to compete on #1. #2 also runs counter to the current scaling of AI models &#8212; to increase test-time compute (and thus make the LLM useful for more difficult tasks), we by definition have to increase the number of tokens used. 
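</p><p>To make the four-input ROI model above concrete, here is a minimal numerical sketch in Python; every figure in it is hypothetical and chosen purely for illustration, not real API pricing or benchmark data:</p><pre><code># Hypothetical, illustrative numbers only -- not real pricing or benchmark results.
def llm_roi(task_value_usd, task_success_rate, price_per_million_tokens_usd, tokens_needed):
    """ROI for one task, following the four-input model described above."""
    value_generated = task_value_usd * task_success_rate            # value of task x LLM performance
    cost = price_per_million_tokens_usd * (tokens_needed / 1_000_000)
    net_profit = value_generated - cost
    return net_profit, net_profit / cost

# Example: a task worth $5, solved correctly 70% of the time,
# at $10 per million tokens, using 20,000 tokens per attempt.
profit, roi = llm_roi(5.00, 0.70, 10.00, 20_000)
print(f"net profit per task: ${profit:.2f}, ROI: {roi:.1f}")
</code></pre><p>Cheaper tokens, fewer tokens, or a higher success rate each raise that ratio, which is exactly the set of levers listed above.</p><p>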
Hence, LLM providers competing in the enterprise have started to converge on #3 &#8212; improving LLM performance on the given task.</p><p>Now, <strong>there are two additional levers that an LLM provider can pull to improve the performance of the model on a specific task:</strong></p><ol><li><p><strong>Make the model smarter overall</strong></p></li><li><p><strong>Customize the model for that task</strong></p></li></ol><p>Broadly, Google has taken the first approach, with Gemini models consistently leading the pack in intelligence (particularly the new Gemini 2.5 Pro Deep Think model). OpenAI would struggle mightily to compete along this dimension because Google has such massive advantages in terms of scale &#8212; it has access to ridiculous amounts of compute and has indexed virtually all of the world&#8217;s data. Having a lead in algorithms is not a durable moat due to the speed of diffusion of inventions in Silicon Valley, and since model performance is a function of algorithms, data, and compute, Google will maintain a decisive lead here.</p><p>Meanwhile, Anthropic has taken the second approach, specializing its models for code using targeted reinforcement learning and building the Claude Code agentic harness. This is the lowest-hanging fruit for specialized models, given that this is the domain in which today&#8217;s LLMs perform best. Since Anthropic already has a large lead here, OpenAI is left with two choices &#8212; find a less obvious niche for which it can customize its models, or compete directly with Anthropic in the coding space, where Anthropic already holds a large advantage.</p><p>From the above analysis, we can see that OpenAI has a large lead in the consumer market with durable moats, and that to strengthen those moats, OpenAI would need to improve the UI/UX of its models by making them simpler and more enjoyable to use. By contrast, to compete in the enterprise market, OpenAI would need to either produce the smartest model (where it is at a disadvantage compared to Google) or start customizing its models for targeted use cases (where it is at a disadvantage compared to Anthropic in the most obvious market of coding agents). </p><h1>Conclusion - GPT-5 as AI for the Common Man</h1><p>Now let&#8217;s wrap up this argument.</p><p>We have already seen that OpenAI has a large and commanding lead in the consumer market, with a low and shrinking market share in the enterprise market. We have now shown that it has solid, defensible moats in the consumer market and that it is at a strong technical disadvantage in the enterprise market. We have also established that prioritizing consumers means improving model UI/UX, while prioritizing enterprise means improving model performance and specialization. Lastly, from the opening paragraphs, we have established that OpenAI deliberately did not make the highest-performing model possible. </p><p>Instead, they prioritized reducing hallucinations and streamlining the model selection process in ChatGPT.
Both of these changes significantly improve the consumer experience, as confabulations can destroy consumer trust and erode brand advantages, while the old model picker with nearly 10 different models intimidated new users and caused high cognitive load when using the app.</p><p>As such, the logical conclusion is that <strong>OpenAI has chosen to prioritize consumers</strong> <strong>over enterprise, and GPT-5 is the result of this</strong>.</p><p>Hence, over the coming years, don&#8217;t expect OpenAI to consistently lead in model performance as they have over the past 3 years. Instead, look for continuing improvements in the usage experience of ChatGPT. If you want to find the best models overall or the best coding models, you&#8217;ll probably need to look to Google and Anthropic, respectively.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Musings by Chris Hayduk is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Gemini 2.5 Pro: How Data + Compute Moats Beat Algorithmic Tweaks]]></title><description><![CDATA[Gemini 2.5 Pro and Google's path to AI supremacy]]></description><link>https://www.chrishayduk.com/p/google-takes-the-lead-in-the-ai-race</link><guid isPermaLink="false">https://www.chrishayduk.com/p/google-takes-the-lead-in-the-ai-race</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Mon, 14 Apr 2025 21:08:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6Mpj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The race towards Artificial General Intelligence (AGI) and state-of-the-art AI models is often framed around breakthrough algorithms and novel architectures. However, a deeper analysis reveals that the true drivers of durable leadership lie elsewhere. While algorithmic innovation is crucial, the path to AI supremacy is increasingly paved with massive datasets and unparalleled computational power. 
When viewed through this lens, Google DeepMind emerges not just as a competitor, but as the likely frontrunner.</p><h3>The Trifecta of AI Progress: Algorithms, Compute, and Data</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Mpj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Mpj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Mpj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2138408,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/161334401?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Mpj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Training large-scale AI models hinges on three interdependent pillars:</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><ol><li><p><strong>Algorithms:</strong> These are the recipes, the architectures (like Transformers, Mixture-of-Experts), and the training methodologies (loss functions, optimization techniques) that dictate how effectively models learn patterns and relationships from data. Efficient algorithms extract more "knowledge" per unit of data and compute.</p></li><li><p><strong>Compute:</strong> This represents the raw processing power, typically measured in FLOPs (Floating Point Operations Per Second), required to execute the vast number of calculations involved in training deep neural networks. It's the energy input transforming potential into a trained artifact.</p></li><li><p><strong>Data:</strong> This is the raw material &#8211; the text, images, code, audio, video, and other modalities &#8211; from which the model learns the structure of the world, language, and reasoning. The quality, quantity, and diversity of data fundamentally shape the model's capabilities.</p></li></ol><p>These factors exhibit strong interplay. An algorithmic leap, like the transition from RNNs/LSTMs to Transformers for sequence modeling, unlocked the potential to effectively utilize vastly larger datasets and compute budgets. 
Before Transformers, training on web-scale text data with massive parameter counts often hit diminishing returns due to limitations in handling long-range dependencies and parallelization. The Transformer architecture, with its self-attention mechanism, was significantly more scalable, allowing marginal increases in data and compute to translate into tangible performance gains once more. The performance wasn't just better; the <em>scaling properties</em> improved.</p><h3>The Illusion of Algorithmic Moats</h3><p>Recent history is replete with examples emphasizing algorithmic prowess. The excitement around models like DeepSeek-R1, achieving remarkable performance with comparatively modest training resources, underscores the power of efficient architectures (like Mixture-of-Experts) and optimized training strategies. It proves that clever algorithms <em>can</em> significantly improve the compute/data-to-performance ratio.</p><p>However, as I argued previously in <em><a href="https://www.chrishayduk.com/p/on-algorithmic-moats-and-the-path">On Algorithmic Moats and the Path to AGI</a></em>, algorithms alone do not constitute a sustainable competitive advantage in the current AI landscape. Why?</p><ol><li><p><strong>Talent Mobility:</strong> The AI research community is fluid. Top researchers frequently move between major labs like Google DeepMind, OpenAI, Anthropic, and Meta, carrying conceptual knowledge and insights about successful (and unsuccessful) architectural experiments and training techniques. While NDAs exist, the fundamental <em>ideas</em> diffuse rapidly.</p></li><li><p><strong>Open Source and Publication:</strong> Key players like Meta (LLaMA series) and innovative teams like DeepSeek often open-source their models and research. Academic institutions and arXiv ensure rapid dissemination of novel techniques. This accelerates the entire field but levels the playing field algorithmically. A breakthrough published today can be replicated and built upon by competitors within months, if not weeks.</p></li></ol><p>Therefore, relying solely on being the <em>first</em> to discover the next architectural tweak is a fragile strategy. Being a fast-follower, capable of rapidly implementing and scaling proven algorithmic advances discovered elsewhere, might be just as effective, <em>provided</em> you possess advantages in the other two factors.</p><h3>The Real Moats: Data and Compute Scale</h3><p>If algorithms are becoming increasingly commoditized, what provides a durable edge? The answer lies in the factors that are far harder to replicate: <strong>data and compute.</strong></p><p><strong>Why Scale Matters:</strong> The principle of scaling laws in deep learning empirically demonstrates that model performance often improves predictably, following a power law, as model size, dataset size, and training compute increase. While we've seen impressive results from smaller, efficient models, we are likely still far from the point of diminishing returns for many complex reasoning and multimodal tasks. 
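</p><p>One common way to express this power-law behavior is the parametric loss form popularized by the Chinchilla scaling-law work, in which predicted loss looks like E + A/N^&#945; + B/D^&#946; for a model with N parameters trained on D tokens. The sketch below uses constants of roughly that magnitude purely for illustration; they are not fitted values for any real model family:</p><pre><code># Illustrative power-law scaling curve (Chinchilla-style parametric form).
# The constants below are rough, made-up values for illustration only.
def scaling_law_loss(n_params, n_tokens, e=1.7, a=400.0, b=410.0, alpha=0.34, beta=0.28):
    """Predicted loss = E + A / N**alpha + B / D**beta (hypothetical constants)."""
    return e + a / n_params**alpha + b / n_tokens**beta

for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={n:.0e}, D={d:.0e}: predicted loss {scaling_law_loss(n, d):.2f}")
# Loss keeps falling as parameters and training tokens scale up together,
# which is the empirical pattern the scaling-law literature describes.
</code></pre><p>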
Reaching the next plateau of AI capability will almost certainly require scaling data and compute far beyond current levels.</p><p><strong>Why They Are Moats:</strong></p><ol><li><p><strong>Non-Portability:</strong> Unlike algorithmic knowledge, engineers cannot easily take petabytes of proprietary, curated internal data or access to tens of thousands of specialized accelerators (like TPUs or GPUs) with them when they change jobs.</p></li><li><p><strong>High Barrier to Entry:</strong> Building world-class compute infrastructure (data centers, custom silicon, high-speed interconnects) and accumulating diverse, high-quality datasets at the scale required represents billions of dollars in capital expenditure and years, often decades, of cumulative effort and investment. This is not something startups or even well-funded competitors can easily replicate overnight.</p></li><li><p><strong>Synergistic Flywheels:</strong> Access to vast compute allows for more ambitious experiments and training larger models. These improved models, when deployed, can generate new, valuable interaction data, which feeds back into further model improvements, creating a virtuous cycle that is difficult for competitors with lesser resources to match.</p></li></ol><p><strong>Gemini 2.5 Pro: A Glimpse of the Advantage</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kvQw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kvQw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kvQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg" width="1456" height="1343" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1343,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;r/singularity - Gemini 2.5 Pro benchmarks 
released&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="r/singularity - Gemini 2.5 Pro benchmarks released" title="r/singularity - Gemini 2.5 Pro benchmarks released" srcset="https://substackcdn.com/image/fetch/$s_!kvQw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Gemini 2.5 Pro Experimental, recently release by Google, offers a glimpse into how these interacting factors of <strong>data </strong>and <strong>compute</strong> will lead to a durable advantage in Google&#8217;s AI model performance. 
Despite OpenAI and DeepSeek releasing highly performant thinking models months in advance of Google (representing a large lead in algorithmic innovations), Gemini 2.5 Pro has managed to score #1 across the board in Chatbot Arena and across a wide range of benchmarks.</p><p>While Google describes Gemini 2.5 Pro partly through algorithmic concepts like "thinking models," the sheer breadth and depth of its capabilities, validated by both benchmarks and human preference, strongly suggest that these algorithms are being scaled and refined using computational resources and data diversity that few, if any, competitors can match. The "significantly enhanced base model" (as described by Google) is almost certainly a product of larger parameter counts trained for longer durations on more diverse data, enabled by Google's vertical integration of hardware (TPUs) and software within their hyper-scale data centers.</p><h3>Google's Unassailable Advantage</h3><p>This brings us to Google. When assessing data and compute advantages, Google stands in a league of its own.</p><p><strong>1. Data Dominance:</strong></p><ul><li><p><strong>Breadth and Modality:</strong> Google possesses arguably the most diverse and extensive collection of multimodal data on the planet. Consider the sources:</p><ul><li><p><strong>Google Search:</strong> Billions of daily queries provide unparalleled insight into human intent, language variation, and real-time information needs (text, images, implicit semantics).</p></li><li><p><strong>YouTube:</strong> The world's largest video platform offers vast amounts of video, audio, transcripts, comments, and multilingual content &#8211; crucial for multimodal understanding.</p></li><li><p><strong>Android:</strong> Interaction data from billions of devices provides insights into user behavior, application usage, and sensor inputs (potentially anonymized and aggregated).</p></li><li><p><strong>Google Maps:</strong> Geospatial data, satellite imagery, Street View imagery, reviews, and real-time traffic information.</p></li><li><p><strong>Gmail, Docs, Workspace:</strong> While respecting user privacy is paramount, Google potentially has access (for internal R&amp;D, aggregated/anonymized analysis, or opt-in features) to colossal amounts of text, code, and collaborative data reflecting professional and personal communication patterns.</p></li><li><p><strong>Google Books:</strong> A massive corpus of digitized text spanning centuries.</p></li><li><p><strong>Chrome:</strong> Web interaction data (aggregated and anonymized) reflecting how users navigate and consume information online.</p></li></ul></li><li><p><strong>Scale and Freshness:</strong> The sheer volume is staggering, but equally important is the constant influx of <em>new</em> data, keeping datasets fresh and reflecting current events, language evolution, and emerging trends. This continuous stream is vital for maintaining model relevance and accuracy.</p></li></ul><p><strong>2. Compute Superiority:</strong></p><ul><li><p><strong>Custom Silicon (TPUs):</strong> Google made a strategic bet on custom AI accelerators years ago with its Tensor Processing Units (TPUs). Now in their 7th generation, TPUs are designed specifically for large-scale ML training and inference, offering potentially significant advantages in performance-per-watt and performance-per-dollar <em>for Google's specific workloads and scale</em> compared to general-purpose GPUs. 
This vertical integration allows hardware and software co-design for optimal efficiency.</p></li><li><p><strong>Infrastructure Mastery:</strong> Google operates some of the world's most sophisticated and efficient data centers. Decades of experience in distributed systems (MapReduce, Borg/Kubernetes, Spanner) translate into an unparalleled ability to orchestrate and execute massively parallel training jobs reliably and efficiently across thousands of accelerators. This isn't just about owning chips; it's about the networking fabric, power delivery, cooling, and system software that make large-scale training feasible.</p></li><li><p><strong>Capital Investment:</strong> Google has the financial resources to sustain and expand this infrastructure lead, continuously investing billions in data centers and next-generation TPUs.</p></li></ul><h3>Conclusion: The Inevitable Frontrunner?</h3><p>While the AI race is far from over, and competitors like OpenAI and Anthropic continue to innovate, the fundamental dynamics favor players with entrenched advantages in data and compute. Algorithmic breakthroughs will continue to happen across the ecosystem, but they diffuse quickly. The ability to <em>scale</em> these algorithms using proprietary data and custom-built, hyper-scale infrastructure is the real differentiator.</p><p>Google's unparalleled data ecosystem, harvested across its diverse product portfolio, combined with its long-term investment in custom TPUs and mastery of planetary-scale computing, creates a formidable moat. Gemini 2.5 Pro is likely just an early indicator of what this integrated advantage can produce. As the demands for data and compute continue to escalate on the path to more capable AI, Google's lead in these foundational resources positions it strongly to outpace the competition and ultimately define the next era of artificial intelligence.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Foundation Model Trap]]></title><description><![CDATA[Why AI Model Companies Are More Like Airlines than Like Cereal Companies]]></description><link>https://www.chrishayduk.com/p/the-foundation-model-trap</link><guid isPermaLink="false">https://www.chrishayduk.com/p/the-foundation-model-trap</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 05 Mar 2025 20:14:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JK_8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JK_8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JK_8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JK_8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg" width="1024" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214162,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/158463759?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JK_8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Harvey Sawikin recently wrote a great article analyzing the AI industry through a very Munger-like lens: will AI turn out more like the cereal industry (where there are many competitors with very healthy profit margins) or more like the airline industry (where competition compresses profit margins to near 0). </p><p>This idea has major implications for the major AI labs training foundation models today, such as OpenAI, Anthropic, and xAI. In this article I'll attempt to flesh out my understanding of this cereal vs. airline distinction and discuss why the airline scenario is more likely for the foundation model providers.</p><p>Before we dive in, you can find Harvey&#8217;s article below. 
I highly recommend giving it a read before continuing here.</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:156375904,&quot;url&quot;:&quot;https://harveysawikin.substack.com/p/ai-companies-cereals-or-airlines&quot;,&quot;publication_id&quot;:2339794,&quot;publication_name&quot;:&quot;Harvey&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbce7176-1f83-43cd-8bbf-67a1bbfb334f_429x429.png&quot;,&quot;title&quot;:&quot;AI Companies: Cereals or Airlines?&quot;,&quot;truncated_body_text&quot;:&quot;In the post The Munger Games, inspired by my first-ever attendance at the Berkshire annual meeting and purchase and reading of Poor Charlie&#8217;s Almanack, I promised more commentary on Charlie Munger&#8217;s book once I&#8217;d reflected on it. One insight that stuck with me has come to the fore lately as I&#8217;ve tried to get my head around AI &#8211; an effort that isn&#8217;t theo&#8230;&quot;,&quot;date&quot;:&quot;2025-02-03T13:46:19.862Z&quot;,&quot;like_count&quot;:6,&quot;comment_count&quot;:3,&quot;bylines&quot;:[{&quot;id&quot;:32105441,&quot;name&quot;:&quot;Harvey Sawikin&quot;,&quot;handle&quot;:&quot;harveysawikin&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65baf26e-5c5c-4a55-9377-22234d39af84_429x600.jpeg&quot;,&quot;bio&quot;:&quot;Fund manager, ex-lawyer, ex-novelist, art collector, writing about investing, marriage, culture, and random topics.&quot;,&quot;profile_set_up_at&quot;:&quot;2023-02-04T11:25:24.829Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:2360843,&quot;user_id&quot;:32105441,&quot;publication_id&quot;:2339794,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:2339794,&quot;name&quot;:&quot;Harvey&#8217;s Substack&quot;,&quot;subdomain&quot;:&quot;harveysawikin&quot;,&quot;custom_domain&quot;:null,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Welcome to my Substack, where I will write about investing (especially value investing), marriage and family, culture, and random topics. All subscription money will be donated to The Human Fund (sorry, that's for Seinfeld fans). 
Really, to a good cause.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbce7176-1f83-43cd-8bbf-67a1bbfb334f_429x429.png&quot;,&quot;author_id&quot;:32105441,&quot;theme_var_background_pop&quot;:&quot;#D10000&quot;,&quot;created_at&quot;:&quot;2024-02-10T21:01:10.289Z&quot;,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Harvey Sawikin&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;enabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:false,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://harveysawikin.substack.com/p/ai-companies-cereals-or-airlines?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!UkOl!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbce7176-1f83-43cd-8bbf-67a1bbfb334f_429x429.png"><span class="embedded-post-publication-name">Harvey&#8217;s Substack</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">AI Companies: Cereals or Airlines?</div></div><div class="embedded-post-body">In the post The Munger Games, inspired by my first-ever attendance at the Berkshire annual meeting and purchase and reading of Poor Charlie&#8217;s Almanack, I promised more commentary on Charlie Munger&#8217;s book once I&#8217;d reflected on it. One insight that stuck with me has come to the fore lately as I&#8217;ve tried to get my head around AI &#8211; an effort that isn&#8217;t theo&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">a year ago &#183; 6 likes &#183; 3 comments &#183; Harvey Sawikin</div></a></div><p>Okay, so let&#8217;s start with differentiating <em>why</em> cereals allow for competition with healthy profit margins, whereas airlines are a rough business for all involved.</p><p>Cereals have different flavors, so consumer preferences for a certain flavor can cause some degree of demand inelasticity. From a firm perspective, rather than chasing the flavor and profit margins of another firm's cereal, the more profitable long-term strategy is to specialize in a different flavor and reap your own healthy profit margins.</p><p>By contrast, the main service that airlines provide is transporting you from Point A to Point B. There isn't really an "experience" to speak of that differentiates airlines from one another (particularly for non-business class flyers), so the calculus for a consumer then comes down to only two factors: speed and cost. Speed can be achieved through two means: faster planes (which hasn't happened in decades) and more direct flights. Airlines are incentivized to provide direct flights between major cities/transit hubs because, if they did not, then any travelers going between major hubs (say, NYC and London) would choose the other airlines which did have those direct flights. 
Thus, we can assume that most major airlines will have direct flights between most major transit hubs/cities within a certain distance of each other. Hence, for any two airlines that have a direct flight between a fixed pair of cities, the <em>only</em> way to compete is on price, since this will be the <em>only</em> criterion differentiating the airlines for consumers. As a result, a small difference in price will lead to nearly all consumers choosing the cheaper option. This inherently <em>must</em> drive down profit margins as airlines seek to charge the lowest possible price while still maintaining profitability.</p><p>Translating this argument to AI, we see two potential paths forward:</p><ol><li><p><strong>Cereal Mode:</strong> We know that the data input to a model during its training process basically determines its behavior on the other end - what it acts like, what tasks it's good at, etc. Access to different types of data may thus give rise to different "flavors" of AI models, providing varying skill profiles and personalities. In this scenario, we could imagine that OpenAI provides the best chat experience (due to its large dataset of user chats), while Grok might provide the best news aggregation and summarization (due to its up-to-the-second Twitter data). This may provide enough distinction to allow each AI company to charge healthy profit margins on their respective foundation models.</p></li><li><p><strong>Airline Mode:</strong> In this case, maybe the data on the margins provided by chat interactions, Twitter, etc. doesn't move the needle much in terms of model behavior and capabilities. Perhaps the web-scale pretraining data drowns out the idiosyncrasies across each AI lab's datasets, leaving each lab's state-of-the-art AI models performing roughly identically. Here, the only way to compete would be on API pricing, with consumers rapidly moving to the cheapest option available that can perform the given task.</p></li></ol><p>Based on trends of the last few months, I think Airline Mode is looking more and more likely. The <a href="https://lmarena.ai/leaderboard">Chatbot Arena leaderboard</a> shows that all of the leading models from the main labs perform roughly similarly to each other (Grok 3 and GPT-4.5 are even currently within 1 Elo point of each other as of this writing!). And DeepSeek was able to reproduce OpenAI o1 in the span of a couple months (R1's Elo is actually 11 points higher than o1's). We're seeing <em>more</em> convergence between models over the last couple of years, not less. </p><p>Given that, unless a lab gets to AGI and the idea of recursive self-improvement leading to a permanent advantage turns out to be true, I don't see how foundation model training can provide durable, healthy profit margins without a significant change in business model for these companies.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding DeepSeek Part II: DeepSeek-V2]]></title><description><![CDATA[Compressing the key-value matrix]]></description><link>https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseek</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseek</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 05 Mar 2025 13:31:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bv9y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Summary</h1><p>DeepSeek-V2, released in June 2024, built off the success of DeepSeek's previous papers to set a new standard for training and inference efficiency. The core changes made to DeepSeek-V2 that set it apart from prior open source models occur in two core components of the transformer architecture: the attention block and the feed-forward network (see below image).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bv9y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bv9y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 424w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 848w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 1272w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bv9y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png" width="1043" height="890" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1043,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179138,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/158419950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bv9y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 424w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 848w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 1272w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The two key changes can be summarized as follows:</p><p>1.  <strong>Feed-Forward Network Optimization:</strong> DeepSeekMoE architecture</p><blockquote><p>Mixture of experts (MoE) layers are a drop-in replacement for the feed-forward layer in the standard transformer architecture. 
Prior to DeepSeekMoE, most MoE architectures functioned by splitting the feed-forward layer into several large feed-forward layers. Each input token would then "choose" 1 or 2 of these parallel feed-forward layers, also known as "experts", for its own computation. This architecture had one key problem - namely, each expert needed to learn large amounts of redundant information, since processing any token on any topic requires understanding of grammar, semantics, etc. DeepSeek solved this redundancy problem, thereby greatly increasing the learning efficiency of the MoE architecture, through three key innovations. These included: more numerous, finer-grained experts; separating experts into shared and routed experts; and load balancing tokens across experts and devices. For more details on these innovations, see the previous blog post in the series.</p></blockquote><p>2. <strong>Attention Layer Optimization:</strong> Multi-head Latent Attention (MLA) </p><blockquote><p>Multi-head attention, described in detail in my other post, utilizes three matrices to produce new representations of the input tokens: the Query, Key, and Value matrices. Each of these matrices has dimension n x d, where n is the maximum length of the sequence and d is the dimension of the vector representing each token in the sequence. Standard transformers cache the Key and Value matrices for every layer fully in-memory at inference time, improving speed but resulting in large memory overhead. DeepSeek-V2's solution is to compress the Key and Value matrices at each layer into a single latent vector. At inference time, only this vector needs to be cached, substantially reducing memory requirements.</p></blockquote><p>Since we already described the DeepSeekMoE architecture in detail in the previous blog post of this series, this post will focus primarily on multi-head latent attention. We'll start by describing the problem it aims to solve, then move on to the intuition behind MLA's solution, and finally dive into the concrete math describing the method. We'll then end this post by discussing the effects that the combination of DeepSeekMoE and multi-head latent attention has on training and inference efficiency. Let's dive in!</p><div><hr></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding DeepSeek&#8221; Series:</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseekmoe">Understanding DeepSeek Part I: DeepSeekMoE</a></p></li><li><p>[This article] Understanding DeepSeek Part II: DeepSeek-V2</p></li><li><p>[Upcoming] Understanding DeepSeek Part III: DeepSeekMath</p></li><li><p>[Upcoming] Understanding DeepSeek Part IV: DeepSeek-Prover-V1.5</p></li><li><p>[Upcoming] Understanding DeepSeek Part V: DeepSeek-V3</p></li><li><p>[Upcoming] Understanding DeepSeek Part VI: DeepSeek-R1</p></li><li><p>[Upcoming] Understanding DeepSeek Part VII: Implications for the AI Industry and the World</p></li></ol><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe to make sure you don&#8217;t miss any new posts in this series!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h1>The Memory Efficiency Problem with Standard Multi-Head Attention</h1><p>Standard multi-head attention, at its core, solves the problem of deciding how to update our understanding of one concept, given a set of other, potentially-related concepts. In the case of language modeling, we want to update our understanding of a particular token using the understanding of the other tokens present in the sequence. To accomplish this, at each attention layer in a transformer, the model learns to parametrize three key matrices: the Query, Key, and Value matrices. These three matrices work together to identify the most relevant portions of the sequence for each token, and then to update each token's representation based on the relevant portions that were found. I won't cover the full details of how this is done here, but you can reference <a href="https://www.chrishayduk.com/p/a-primer-on-multi-head-causal-self">my other blog post</a> for more information.</p><p>Now, when language models are producing output at inference, we essentially need to place the transformer in a while loop. Until the transformer outputs an &#8220;End of Sequence&#8221; token, we&#8217;ll feed the input sequence into the transformer to produce the next token; then, appending that next token to the input sequence, we&#8217;ll feed the newly-elongated sequence back into the transformer and repeat the process. </p><p>The key insight that enables caching here is the following: since modern LLMs are causal, meaning future tokens cannot influence previous tokens, adding a new token to the end of the input sequence does not change the representation of any of the previous tokens. Hence, we do not need to recompute the hidden representations for the previous tokens, since these will be identical. </p><p>The only token for which we need to compute a new representation is the next token in the sequence (that is, the one token that doesn&#8217;t exist yet)! Another core insight coming from this observation is that we only need the key and value vectors for each previous token to compute the new token&#8217;s representation. Since the previous tokens&#8217; representations do not change, we don&#8217;t need to use the other tokens as &#8220;queries&#8221; to update their representations. However, we do need their key and value vectors so that we can &#8220;query&#8221; these vectors with the new token's query vector. </p><p>The above observations then give us a road map for caching values in the transformer in order to limit the number of computations we perform and speed up inference time. In particular, we must cache the Key and Value matrices at each hidden layer so that we can use these to compute the hidden representation for the new token. </p><p>Now let&#8217;s compute the memory requirements to store these cached values for Llama 3.3 70B, a state-of-the-art open source model at the time of writing. (In practice, Llama 3.3 uses Grouped-Query Attention, which actually reduces caching requirements. For the sake of simplicity, we'll assume it uses standard attention here.)</p>
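<p>If you want to check the arithmetic that follows, or rerun it for a different model, here is a small back-of-the-envelope Python script that carries out the same calculation we are about to walk through by hand. The dimensions are the Llama 3.3 70B figures used below; this is just a sanity-check sketch, not production serving code.</p><pre><code>
# Back-of-the-envelope KV cache size for a model using standard multi-head attention.
# Figures correspond to the Llama 3.3 70B example in the text
# (assuming full multi-head attention rather than grouped-query attention).

n_layers = 80            # attention layers
d_model = 8192           # dimension of each key/value vector
bytes_per_value = 2      # FP16: 2 bytes per stored number
context_length = 128_000

# Each cached token stores one key vector and one value vector at every layer.
bytes_per_token = n_layers * 2 * d_model * bytes_per_value
print(f"Per token: {bytes_per_token / 1e6:.2f} MB")      # ~2.62 MB

for n_tokens in (10_000, context_length):
    total_bytes = n_tokens * bytes_per_token
    print(f"{n_tokens} tokens: {total_bytes / 1e9:.1f} GB")
# ~26.2 GB for 10,000 tokens and ~335.5 GB for the full context
# (the small difference from the hand calculation below is just rounding).
</code></pre>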
<p>Llama 3.3 has 80 attention layers. Each key and value vector in these attention layers has a dimension of 8192. And Llama 3.3 has a maximum context length of 128,000 tokens. </p><p>If Llama 3.3 is used in the default floating point 16 (FP16) mode, then each stored number will take up 2 bytes (16 bits). Hence, a single vector consisting of 8192 floating point numbers will take up 16,384 bytes, or equivalently 16.384 kilobytes. For each cached token in our input, we need to store both a key vector <em>and</em> a value vector at each layer. Hence, at every layer, a cached token will require two vectors, totaling 32.768 KB in memory. Since there are 80 such layers, the cost to cache one token is thus 80 * 32.768 KB = 2621.44 KB (equivalently, 2.62 MB).</p><p>Now suppose our input is 10,000 tokens long and we are producing the next token in the sequence. To cache the necessary data for the previous tokens, we need 10,000 * 2.62 MB = 26,200 MB (equivalently, 26.2 GB). </p><p>If our input uses the full Llama 3.3 context length of 128,000 tokens, the required space is 128,000 * 2.62 MB = 335,360 MB (equivalently, 335.36 GB). </p><p>As can be seen from the above example, memory requirements for the cache expand quickly as the input length increases. This makes it incredibly difficult to serve models with long context windows. In order to solve this problem with the standard transformer architecture, DeepSeek introduced Multi-head Latent Attention (MLA). </p><h1>Multi-head Latent Attention (MLA)</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BuXX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BuXX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 424w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 848w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 1272w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BuXX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png" width="1456" height="644" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71b33121-208e-406d-82a4-faf03be131b4_1530x677.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186879,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/158419950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BuXX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 424w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 848w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 1272w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In order to overcome these memory efficiency issues, DeepSeek created the Multi-head Latent Attention layer. 
This layer modifies standard multi-head attention (depicted on the left side of the above image) by compressing the keys and values for each token into a <em>single latent vector.</em> In practice, this looks like the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n    \\mathbf{c}^{KV}_t &amp;= W^{DKV} \\mathbf{h}_t, \\\\[8pt]\n    \\mathbf{k}^{C}_t &amp;= W^{UK} \\mathbf{c}^{KV}_t, \\\\[8pt]\n    \\mathbf{v}^{C}_t &amp;= W^{UV} \\mathbf{c}^{KV}_t, \\\\[8pt]\n    \\text{where} \\quad \n    \\mathbf{c}^{KV}_t &amp;\\text{ is the compressed latent vector for keys and values,}\\\\[5pt]\n\td_c \\; (&amp;<< d_h n_h) \\text{ denotes the KV compression dimension,} \\\\[5pt]\n    W^{DKV} &amp;\\in \\mathbb{R}^{d_c \\times d} \\text{ is the down-projection matrix} \\\\[5pt]\n    W^{UK}, &amp;W^{UV} \\in \\mathbb{R}^{d_h n_h \\times d_c} \\text{ are the up-projection matrices for the keys and values, respectively} \\\\[5pt]\n\\end{align}&quot;,&quot;id&quot;:&quot;BBBVCXUDSM&quot;}" data-component-name="LatexBlockToDOM"></div><p>That is, our model now must learn three new matrices per layer in place of the usual key and value projections - one down-projection matrix and two up-projection matrices. By learning these three matrices, we no longer need to store the entire Key and Value matrices when caching previously-computed tokens. Instead, we can store one compressed latent vector per token at each layer, where that latent vector contains <em>all</em> of the information needed to reconstruct the token's keys and values. </p><p>Thus, if we have L layers, we now need to store only d_c * L values per cached token (d_c numbers in each latent vector and L latent vectors per token, one for each layer). </p><p>Let's take the example of Llama 3.3 that we illustrated above to see how much this gains us - previously, caching the full Key and Value matrices for the full 128,000 token context length of Llama 3.3 required 335.36 GB. Now, instead of caching the full matrices, let's imagine we've augmented Llama 3.3 to use MLA. DeepSeek sets the dimension of the latent vector to four times the per-head dimension. Llama 3.3 has 64 attention heads, so each head has dimension 8192 / 64 = 128, giving a latent dimension of 4 * 128 = 512. In FP16, each latent vector then takes up 1,024 bytes (roughly 1 KB). Caching one latent vector at each of Llama 3.3's 80 layers costs 80 * 1.024 KB = 81.92 KB (0.08192 MB) per token, and covering the full 128,000-token context requires 128,000 * 0.08192 MB = 10,485.76 MB (roughly 10.5 GB).</p><p>This is a <em>substantial</em> reduction from the initial requirement of 335.36 GB for standard attention - roughly a 32x saving - demonstrating the efficiency gains that can be driven using this approach.</p>
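<p>To make the compression step concrete, here is a minimal numpy sketch of the idea, using the hypothetical Llama-3.3-with-MLA dimensions from the example above. The weight values are random placeholders purely for illustration - this is a sketch of the shapes involved, not DeepSeek's actual implementation (which, among other things, also adds a separate decoupled positional-encoding component to the keys that is omitted here).</p><pre><code>
import numpy as np

# Illustrative dimensions for the hypothetical "Llama 3.3 with MLA" example above.
d_model = 8192                 # hidden dimension of each token representation
n_heads = 64                   # attention heads
d_head = d_model // n_heads    # 128 per head
d_c = 4 * d_head               # 512, the KV compression dimension

rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_c, d_model)) * 0.01           # down-projection
W_UK = rng.standard_normal((n_heads * d_head, d_c)) * 0.01   # up-projection for keys
W_UV = rng.standard_normal((n_heads * d_head, d_c)) * 0.01   # up-projection for values

h_t = rng.standard_normal(d_model)  # hidden state of one token at one layer

# Compression: this latent vector is the only thing cached for this token at this layer.
c_t = W_DKV @ h_t                   # shape (512,)

# At attention time, the keys and values are reconstructed from the latent vector.
k_t = W_UK @ c_t                    # shape (8192,)
v_t = W_UV @ c_t                    # shape (8192,)

# Per-token, per-layer cache footprint in FP16 bytes: standard KV cache vs. MLA latent cache.
standard_bytes = 2 * d_model * 2    # a key vector plus a value vector
mla_bytes = d_c * 2                 # just the latent vector
print(standard_bytes, mla_bytes)    # 32768 vs 1024, a 32x reduction per token per layer
</code></pre><p>Multiplying that per-token, per-layer saving across 80 layers and 128,000 cached tokens is exactly what takes the cache from the hundreds of gigabytes down to the roughly 10 GB figure above.</p>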
<h1>Training and Inference Efficiency</h1><p>DeepSeek-V2 introduces significant efficiency improvements in both training and inference compared to its predecessor, DeepSeek 67B, primarily through innovations in its architecture&#8212;especially the Multi-head Latent Attention (MLA). By compressing the Key and Value matrices into a single latent vector, MLA dramatically reduces memory consumption during inference. The reduction of the KV cache by approximately 93.3% translates directly into substantial gains in maximum generation throughput, allowing DeepSeek-V2 to achieve throughput levels up to 5.76 times greater than those observed in DeepSeek 67B. These optimizations enable DeepSeek-V2 to handle much longer contexts (up to 128K tokens) efficiently, positioning it as one of the most practical choices among large-scale language models for real-world applications where large-context inference is critical.</p><p>Additionally, the integration of DeepSeekMoE into the Feed-Forward Network layers synergizes well with MLA, enabling significant computational savings without sacrificing model performance. By activating only a fraction (21B) of its total parameters (236B), DeepSeek-V2 demonstrates economical training, saving 42.5% of training costs compared with its dense predecessor, DeepSeek 67B. Thus, the combination of DeepSeekMoE and MLA plays a critical role not only in inference-time efficiency but also in making the pretraining phase more cost-effective.</p><h1>Results and Key Takeaways</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ThVp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ThVp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 424w, https://substackcdn.com/image/fetch/$s_!ThVp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 848w, https://substackcdn.com/image/fetch/$s_!ThVp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 1272w, https://substackcdn.com/image/fetch/$s_!ThVp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ThVp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png" width="1441" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1441,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:193908,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/158419950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ThVp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 424w, 
https://substackcdn.com/image/fetch/$s_!ThVp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 848w, https://substackcdn.com/image/fetch/$s_!ThVp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 1272w, https://substackcdn.com/image/fetch/$s_!ThVp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The innovative Multi-head Latent Attention layer significantly enhances the practical deployability of DeepSeek-V2. Compared to traditional Multi-Head Attention, MLA achieves superior inference performance while simultaneously overcoming the KV cache bottleneck. With its novel low-rank joint compression strategy, MLA significantly reduces inference memory overhead, making DeepSeek-V2 particularly suited for high-throughput, real-time applications requiring extensive context management.</p><p>Empirical evaluations on various benchmarks illustrate the clear strengths of DeepSeek-V2, even when compared against other leading open-source models of the time. Notably, DeepSeek-V2 consistently achieved top-tier performance on benchmarks such as MMLU, math reasoning tasks, and coding challenges, highlighting the architectural advantages introduced by MLA. Moreover, these enhancements enabled DeepSeek-V2 to be trained and served at a fraction of the cost of comparably performing dense models (see above image).</p><p>All in all, Multi-head Latent Attention represented another significant milestone for DeepSeek on the path towards highly-optimized training and inference that marked their revolution with DeepSeek-R1 and DeepSeek-V3. 
The next blog post in this series will dive into the new innovations introduced for DeepSeek-V3, building upon the foundations laid here and forming the base model used to train DeepSeek's state-of-the-art reasoning model.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[A Primer on Multi-Head Causal Self-Attention]]></title><description><![CDATA[The neural network layer that kicked off the LLM craze]]></description><link>https://www.chrishayduk.com/p/a-primer-on-multi-head-causal-self</link><guid isPermaLink="false">https://www.chrishayduk.com/p/a-primer-on-multi-head-causal-self</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Sat, 01 Feb 2025 00:32:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d3efecf9-3e49-410d-b69e-5cf3ceecc999_1026x1148.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Lately, I've been writing quite a few series that center around the transformer architecture. For many of those blog posts, I struggle to decide whether I should include the background information necessary to understand attention (greatly increasing the length of the blog post) or I should assume the reader already knows this information (limiting the reach of my audience). Thus, this post is intended to be a compromise between the two positions, allowing me to link this post as background reading in any future blog post that requires knowledge of the nuts and bolts of the attention architecture.</p><p>This will be a "living" blog post, in that it will be edited and expanded upon as my own understanding of the architecture grows and deepens. If there are any radically large changes that I make, I will re-email the post out to subscribers for their review. 
Otherwise, feel free to check back periodically to see how the article has changed!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe to get notified for updates to this primer.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>The Basic Terminology of Multi-Head Causal Self-Attention</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HgXR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HgXR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 424w, https://substackcdn.com/image/fetch/$s_!HgXR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 848w, https://substackcdn.com/image/fetch/$s_!HgXR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 1272w, https://substackcdn.com/image/fetch/$s_!HgXR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HgXR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png" width="635" height="347" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:635,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40706,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HgXR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 424w, 
https://substackcdn.com/image/fetch/$s_!HgXR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 848w, https://substackcdn.com/image/fetch/$s_!HgXR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 1272w, https://substackcdn.com/image/fetch/$s_!HgXR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The standard attention block used in first-generation LLMs like GPT-2 and GPT-3 is <strong>multi-head causal self-attention</strong>. </p><p>The goal of this variant of attention, like any attention variant, is to learn how to update a vector using other context vectors in order to accomplish some goal. In the case of language modeling, our vectors represent tokens, which you can think of as roughly analogous to words. The goal of these vector updates is to accurately predict the next word in the sentence. It is called <strong>causal</strong> because this type of attention ensures that each word can only update itself using previous words in the sentence - that is, it can't look ahead and update itself using words that haven't been written yet! It is called <strong>self-attention</strong> because the things that each word is paying attention to are the other words in the sentence. There is no outside data or context involved here. And finally, it is termed <strong>multi-head</strong> because, at each attention layer, we have multiple attention operations occurring in parallel. These parallel attention operations are referred to as "heads".</p><p>To produce the results of attention, each attention head takes as input a sequence of tokens represented as vectors. These vectors are passed through three learned linear projections per head in parallel, projecting each token's vector into three new vectors. 
These new vectors are commonly referred to as the query, key, and value vectors. These query, key, and value vectors are then used to update the vector representations of the words in our sentence, improving the model's understanding of the concepts contained in the sentence.</p><p>Let's take a look at how this is done in practice.</p><h1>The Mathematics of Causal Self-Attention</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IwgS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IwgS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 424w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 848w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 1272w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IwgS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png" width="925" height="739" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ff99c21-1666-4108-869a-102bb1bec947_925x739.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:739,&quot;width&quot;:925,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82151,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IwgS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 424w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 848w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 1272w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As mentioned, the attention block block takes as input a sequence of tokens represented as vectors. Suppose the input sequence is given by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;X = \\{x_1, x_2, \\dots, x_n\\},&quot;,&quot;id&quot;:&quot;TROFIYZJWD&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each x_i is the vector representation (embedding) of a token.</p><p>Each input vector x_i is simultaneously projected into three different spaces using learned linear transformations. That is, for every token x_i, we compute:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q_i = x_i W^Q,\\quad k_i = x_i W^K,\\quad v_i = x_i W^V,&quot;,&quot;id&quot;:&quot;RTYBEAFKPI&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><ul><li><p>W^Q, W^K, and W^V are the weight matrices for the query, key, and value projections, respectively.</p></li><li><p>The sets of all query, key, and value vectors are often denoted as Q,  K, and V.</p></li></ul><p>For a given token x_i, we will compute a similarity score with every token x_j such that x_j comes before it in the sentence (or is the token itself). This is done by taking the dot product of the query vector for x_i with the key vector for k_j. This result is then divided by the square root of the key vector's dimension. Mathematically, this is given by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\alpha_{ij} = \\frac{q_i \\cdot k_j}{\\sqrt{d_k}}&quot;,&quot;id&quot;:&quot;GXXLDMYNFI&quot;}" data-component-name="LatexBlockToDOM"></div><p>This value is precisely the unnormalized measure of how much token i should attend to token j. In linear algebra, the dot product of two vectors is just a scaled version of the cosine of the angle between them. An angle of 0 degrees gives a cosine value of 1, while an angle of 180 degrees gives a cosine value of -1. 
Hence, the closer the two vectors are to pointing in the same direction, the closer their cosine gets to 1 (and the larger their dot product becomes), and the closer they are to pointing in opposite directions, the closer their cosine gets to -1. Intuitively then, cosine (and by extension the dot product) has very desirable properties to use as a similarity function in the attention mechanism.</p><p>Armed with these similarity scores, we now have a way of measuring how "similar" two tokens in our sequence are. However, in order to use them to produce new vector embeddings, we're going to want to re-scale them. The dot product between the query and key vectors could end up being quite large, and using this value directly can cause large changes in the scale of the vector representation for a given token. Moreover, knowing the score of a particular token pair (let's say between tokens x_i and x_j) tells us nothing about how important that pair is - importance is always relative, and what if the score for x_i and x_k is bigger?</p><p>Given the above discussion, we know we need to introduce some function that will re-scale our scores in such a way that we do not radically change the magnitude of the token's vector representation and that we can quickly determine how "important" each token pair is. A convenient differentiable function that does just this is softmax. The equation to produce the softmax output for token x_i is given below:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{\\alpha}_{ij} = \\frac{\\exp(\\alpha_{ij})}{\\sum_{j=1}^i \\exp(\\alpha_{ij})}&quot;,&quot;id&quot;:&quot;KBGRNDGLBV&quot;}" data-component-name="LatexBlockToDOM"></div><p>The softmax function will take our scores for all possible pairs made with x_i (i.e., all key vectors we multiplied by x_i's query vector) and squash them into the range of 0 to 1. Moreover, it will ensure that these values sum to 1. Hence, we can view these outputs, referred to as attention weights, as probabilities or percentages. It can be useful to think of the attention weight for the pair x_i and x_j as the percent of x_i's attention that should be paid to x_j. </p><p>Once we have these attention weights, we can use them to produce a new vector representation for token x_i. We do this by taking a weighted sum of the value vectors for each token, where the weight is the attention weight. As mentioned above, we can think of this attention weight as the percent of attention that x_i pays to each vector that precedes it in the sequence. Mathematically, this looks like:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_i = \\sum_{j=1}^i \\tilde{\\alpha}_{ij} v_j&quot;,&quot;id&quot;:&quot;WPLELAQFMS&quot;}" data-component-name="LatexBlockToDOM"></div><p>This weighted sum integrates information from the tokens that x_i &#8220;attends&#8221; to, based on the learned attention weights.</p>
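<p>Before moving on to multiple heads, here is a minimal numpy sketch that puts the single-head pieces together: the scaled dot-product scores, the causal mask, the softmax, and the weighted sum of value vectors. The weights and 4-dimensional embeddings are random stand-ins purely for illustration; real models use much larger, learned matrices.</p><pre><code>
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Single-head causal self-attention over a sequence of token vectors X (n rows, d columns)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # alpha_ij for every pair of tokens (i, j)
    n = X.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal mask: token i only attends to itself and earlier tokens
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                           # weighted sums of the value vectors

# Tiny example: 3 tokens ("the", "dog", "barks") with made-up 4-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
W_Q, W_K, W_V = [rng.standard_normal((4, 4)) for _ in range(3)]
updated = causal_self_attention(X, W_Q, W_K, W_V)
print(updated.shape)   # (3, 4): one updated vector per token
</code></pre>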
<p>In multi-head attention, these operations happen in parallel multiple times over. That is, we will produce multiple instances of the query, key, and value vectors for each token in the sequence. We will then use those unique instances to produce distinct updated vector representations for each token. If we have k heads, then we will produce k updated vectors for token i:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_{i1}, \\tilde{x}_{i2}, \\ldots \\tilde{x}_{ik}&quot;,&quot;id&quot;:&quot;PRSVCGFTHD&quot;}" data-component-name="LatexBlockToDOM"></div><p>These vectors get concatenated together, forming a single vector to represent the updated token i:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_i = [\\tilde{x}_{i1}; \\tilde{x}_{i2}; \\ldots; \\tilde{x}_{ik}]&quot;,&quot;id&quot;:&quot;HWIMGBKDRV&quot;}" data-component-name="LatexBlockToDOM"></div><p>The intuition behind using multiple heads to create the final updated representation for token i is that each head can learn to capture different aspects of language. One head might learn grammatical structure, while another head might learn vocabulary related to the legal profession. By splitting responsibilities between the attention heads, each can learn unique, non-redundant information.</p><p>This vector will then be passed through a linear projection layer, producing the final output of the multi-head attention layer.</p><p>Let's walk through a toy example now to make things concrete.</p><h1>A Toy Example: "the dog barks"</h1><p>Suppose our sentence is "the dog barks", and our tokenizer splits it into three tokens: "the", "dog", and "barks". Initially, these tokens are embedded into vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{\\text{the}},\\quad x_{\\text{dog}},\\quad x_{\\text{barks}}.&quot;,&quot;id&quot;:&quot;VPTYOASSTS&quot;}" data-component-name="LatexBlockToDOM"></div><p>When entering the first attention block, each of these vectors is projected into three new vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\n\\text{For \&quot;the\&quot;:} &amp;\\quad q_{\\text{the}} = x_{\\text{the}} W^Q,\\quad k_{\\text{the}} = x_{\\text{the}} W^K,\\quad v_{\\text{the}} = x_{\\text{the}} W^V, \\\\\n\n\\text{For \&quot;dog\&quot;:} &amp;\\quad q_{\\text{dog}} = x_{\\text{dog}} W^Q,\\quad k_{\\text{dog}} = x_{\\text{dog}} W^K,\\quad v_{\\text{dog}} = x_{\\text{dog}} W^V, \\\\\n\n\\text{For \&quot;barks\&quot;:} &amp;\\quad q_{\\text{barks}} = x_{\\text{barks}} W^Q,\\quad k_{\\text{barks}} = x_{\\text{barks}} W^K,\\quad v_{\\text{barks}} = x_{\\text{barks}} W^V.\n\n\\end{aligned}&quot;,&quot;id&quot;:&quot;OBBBHLQCJW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Thus, we transform 3 input vectors into 9 new vectors (3 each for queries, keys, and values).</p><h2>Updating the "dog" Token</h2><p>Let&#8217;s focus on updating the token "dog". In our example, "dog" corresponds to the second token x_2. To update its representation, we use its query vector and compute dot-product scores with the key vectors of "the" and "dog" (i.e., the tokens that precede it, plus the token itself):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\n\\alpha_{\\text{dog},\\text{the}} &amp;= \\frac{q_{\\text{dog}} \\cdot k_{\\text{the}}}{\\sqrt{d_k}}, \\\\\n\n\\alpha_{\\text{dog},\\text{dog}} &amp;= \\frac{q_{\\text{dog}} \\cdot k_{\\text{dog}}}{\\sqrt{d_k}},\n\n\\end{aligned}&quot;,&quot;id&quot;:&quot;OIAYCICNJN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where d_k is the dimensionality of the key vectors. 
The division by the square root of d_k is used to normalize the scores.</p><p>These scores are then passed through a softmax function to obtain attention weights (or probabilities) that indicate how much attention "dog" should pay to itself and to "the":</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\n\\tilde{\\alpha}_{\\text{dog},\\text{the}} &amp;= \\frac{\\exp\\left(\\alpha_{\\text{dog},\\text{the}}\\right)}{\\exp\\left(\\alpha_{\\text{dog},\\text{the}}\\right) + \\exp\\left(\\alpha_{\\text{dog},\\text{dog}}\\right)}, \\\\\n\n\\tilde{\\alpha}_{\\text{dog},\\text{dog}} &amp;= \\frac{\\exp\\left(\\alpha_{\\text{dog},\\text{dog}}\\right)}{\\exp\\left(\\alpha_{\\text{dog},\\text{the}}\\right) + \\exp\\left(\\alpha_{\\text{dog},\\text{dog}}\\right)}.\n\n\\end{aligned}&quot;,&quot;id&quot;:&quot;YRFTFEVZYQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Softmax ensures that these attention weights are between 0 and 1, and that they all sum to 1. Hence, they are valid probabilities and, to make things easier, you can think of them as the percent of its attention that the word "dog" should pay to the word "the" or to itself.</p><p>With the attention probabilities computed, we update the original vector representation of "dog" by taking a weighted sum of the corresponding value vectors. In this case, we combine the value vector of "the" and the value vector of "dog":</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_{\\text{dog}} = \\tilde{\\alpha}_{\\text{dog},\\text{the}} \\, v_{\\text{the}} + \\tilde{\\alpha}_{\\text{dog},\\text{dog}} \\, v_{\\text{dog}}&quot;,&quot;id&quot;:&quot;EVNVACWIHW&quot;}" data-component-name="LatexBlockToDOM"></div><p>This new vector is an updated representation that incorporates contextual information from the preceding token "the" as well as from "dog" itself.</p><h2>Recap</h2><p>To recap, these are the major steps for updating the "dog" vector in our example using causal self-attention:</p><p>1. <strong>Input Embedding:</strong> </p><p>   Each token is embedded into a vector x_i.</p><p>2. <strong>Linear Projections:</strong>  </p><p>   Each x_i is projected into query, key, and value vectors:  </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;   q_i = x_i W^Q,\\quad k_i = x_i W^K,\\quad v_i = x_i W^V.&quot;,&quot;id&quot;:&quot;NCHGMXGHNI&quot;}" data-component-name="LatexBlockToDOM"></div><p>3. <strong>Score Calculation (for causal attention):</strong></p><p>   For token "dog" (second token), calculate:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;   \\alpha_{\\text{dog}, j} = \\frac{q_{\\text{dog}} \\cdot k_j}{\\sqrt{d_k}}, \\quad \\text{for } j \\in \\{\\text{\&quot;the\&quot;}, \\text{\&quot;dog\&quot;}\\}&quot;,&quot;id&quot;:&quot;NXASKXVPVR&quot;}" data-component-name="LatexBlockToDOM"></div><p>4. <strong>Softmax to Obtain Weights:</strong></p><p>   Convert scores to probabilities:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{\\alpha}_{\\text{dog}, j} = \\frac{\\exp\\left(\\alpha_{\\text{dog}, j}\\right)}{\\sum_{j'} \\exp\\left(\\alpha_{\\text{dog}, j'}\\right)}&quot;,&quot;id&quot;:&quot;BPJVISXYAX&quot;}" data-component-name="LatexBlockToDOM"></div><p>5. 
<strong>Contextual Update:</strong>  </p><p>   Update "dog" by a weighted sum of the value vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_{\\text{dog}} = \\sum_{j \\in \\{\\text{\&quot;the\&quot;}, \\text{\&quot;dog\&quot;}\\}} \\tilde{\\alpha}_{\\text{dog}, j} \\, v_j&quot;,&quot;id&quot;:&quot;JLFZHRNNAS&quot;}" data-component-name="LatexBlockToDOM"></div><p>In the full multi-head attention mechanism, this process is performed in parallel over multiple "heads" (with different learned projections), and the results are concatenated and transformed further to form the final output of the attention block.</p><h1>Key Takeaways</h1><p>Let's now summarize the key points of what we've learned:</p><ol><li><p>Attention is a neural network mechanism used to update vectors using the context from other vectors</p></li><li><p>Input vectors to an attention layer are replaced by 3 intermediate vectors: the query, key, and value vectors</p></li><li><p>The query and key vectors work together to produce similarity scores between pairs of vectors. If we are updating token i and want to know how much token j should influence our input, we multiply the query vector of token i by the key vector of token j.</p></li><li><p>The scores produced by the query and key vectors can be turned into probabilities using the softmax function. These probabilities are used to measure how much token i should consider the tokens that came before it in the sequence when updating its vector representation</p></li><li><p>The new vector representation for token i is produced by multiplying the softmax probabilities by the value vectors for each corresponding token. These are then summed together</p></li><li><p>The above process occurs independently across several parallel attention operations, called heads. At the end of the attention block, the new vector representations for token i coming from each head are concatenated together and passed through a linear projection layer.</p></li><li><p>The above process (steps 1-6) is performed in parallel for the full sequence of input tokens.</p></li></ol><p>If you keep these 7 key points in mind while reading Arxiv papers (or my future blog posts!), you'll have a strong understanding of what multi-head causal self-attention is doing, where it faces limitations, and whether or not a given architectural change actually addresses those limitations.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding DeepSeek Part I: DeepSeekMoE]]></title><description><![CDATA[Mixture of experts models with a twist]]></description><link>https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseekmoe</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseekmoe</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Thu, 30 Jan 2025 03:28:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RrxT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RrxT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RrxT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RrxT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg" width="826" height="465" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:465,&quot;width&quot;:826,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;What is DeepSeek: China's open-source AI research lab which rivals OpenAI |  World News - Business Standard&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="What is DeepSeek: China's open-source AI research lab which rivals 
OpenAI |  World News - Business Standard" title="What is DeepSeek: China's open-source AI research lab which rivals OpenAI |  World News - Business Standard" srcset="https://substackcdn.com/image/fetch/$s_!RrxT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h1>Series Introduction</h1><p>Recently, the announcement of DeepSeek-R1 shook the AI world, as an open source project managed to match the performance of OpenAI's state-of-the-art model, o1, within months of its release. The market reacted vehemently to this news, with Nvidia's stock dropping 18% in a single day. AI researchers, engineers, and commentators alike took to Twitter/X to share their thoughts on DeepSeek-R1's implications for the AI industry and the United States, with many asserting that the age of American AI had come and gone in a flash, with China now firmly taking the lead.</p><p>But were these takes correct?</p><p>In order to dissect the true implications for the world going forward, we first need to understand DeepSeek-R1 on a fundamental level - what is it, what does it do, how does it work, and what are the key innovations that it introduced. 
This blog post series will aim to arm you with that knowledge.</p><p>To do this effectively, we are going to start at the beginning of DeepSeek's major papers and work our way forward in time, tracing out the researchers' reasoning and how they arrived at the final design for DeepSeek-R1. This final design included two key components: </p><ol><li><p>An efficient mixture of experts language model base </p></li><li><p>Reinforcement learning-tuned chain of thought capabilities</p></li></ol><p>In this blog series, we will explore two separate but related series of papers in order to deeply understand the two key components of DeepSeek-R1. First, we will trace the evolution of the mixture of experts architecture from DeepSeek-MOE to DeepSeek-V3, their newest state-of-the-art language model. We will then turn our attention to reinforcement learning-tuned chain of thought, beginning with the seminal DeepSeekMath paper and working our way forward to the current AI darling - DeepSeek-R1.</p><p>With this strong foundational knowledge of the theoretical underpinnings of DeepSeek-R1, we will be able to separate the hype from the noise. In light of what we've learned from these paper deep dives, this blog series will conclude with an analysis of the implications of DeepSeek-R1 from several perspectives:</p><ol><li><p>Technological progress</p></li><li><p>AI market dynamics</p></li><li><p>Geopolitical risks</p></li></ol><p>By the end of this series, you will have a clear, evidence-based understanding of DeepSeek-R1&#8212;what makes it powerful, where it stands relative to its competitors, and what its long-term impact might be. As the AI landscape continues to shift at an unprecedented pace, cutting through speculation and focusing on the fundamentals will be key to making sense of the road ahead. Let&#8217;s dive in.</p><div><hr></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding DeepSeek&#8221; Series:</p><ol><li><p>[This article] Understanding DeepSeek Part I: DeepSeekMoE</p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseek">Understanding DeepSeek Part II: DeepSeek-V2</a></p></li><li><p>[Upcoming] Understanding DeepSeek Part III: DeepSeekMath</p></li><li><p>[Upcoming] Understanding DeepSeek Part IV: DeepSeek-Prover-V1.5</p></li><li><p>[Upcoming] Understanding DeepSeek Part V: DeepSeek-V3</p></li><li><p>[Upcoming] Understanding DeepSeek Part VI: DeepSeek-R1</p></li><li><p>[Upcoming] Understanding DeepSeek Part VII: Implications for the AI Industry and the World</p></li></ol><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe to make sure you don&#8217;t miss any new posts in this series!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h1>Paper Summary</h1><p>Mixture-of-experts (MoE) models are an extension of the standard transformer architecture in which a collection of expert modules (typically feed-forward networks) each learn to specialize in different aspects of the data. For a given token or input, only a subset of these specialized experts is activated, allowing the model to dynamically focus its computation on the most relevant components. This selective activation enables MoE models to achieve a high effective capacity&#8212;since many different specialists are available&#8212;while maintaining computational efficiency, because only a limited number of experts actually process each input. As a result, MoE approaches excel at capturing diverse patterns, efficiently scaling model size, and flexibly adapting to a wide variety of tasks.</p><p>Standard mixture-of-experts models, used prior to DeepSeekMoE, typically rely on selecting the top <em>K</em> experts (often 1 or 2) out of <em>N</em> possible experts for each token in a sequence. While this approach does reduce computational load&#8212;since only a small fraction of experts are activated&#8212;it also forces those few activated experts to capture <em>all</em> aspects of the token, including common linguistic structure that is often duplicated across experts. Consequently, an enormous portion of each expert&#8217;s capacity is spent memorizing redundant information, leaving less room for true specialization.</p><p>DeepSeekMoE improves upon the standard MoE architecture, solving this redundancy problem by:</p><p>1. <strong>Using a larger number of smaller experts (Fine-Grained Expert Segmentation)</strong></p><p>Instead of a few large experts, DeepSeek splits capacity into many more experts, each of which is smaller in dimensionality. The model then increases the number of selected experts by the same factor, creating a dramatically larger space of potential expert combinations. Despite this combinatorial explosion, the overall parameter count and per-token activated parameters remain <em>exactly the same</em> as in a conventional MoE setup&#8212;meaning we gain richer representational capacity without paying extra in total parameter count or computational cost.</p><p>2. <strong>Separating Experts into Shared and Routing Experts</strong></p><p>DeepSeek also partitions its experts into two sets. The shared experts, which are <em>always activated</em> for every token, learn the broad &#8220;common knowledge&#8221; required by all inputs (e.g., syntax, high-level semantics). The routing experts, by contrast, are only activated if they are relevant to a specific token, allowing them to focus on niche or domain-specific information. This further decreases redundancy and promotes parameter efficiency: shared experts handle language &#8220;fundamentals,&#8221; while routing experts handle specialization.</p><p>3. <strong>Load Balancing Through Additional Loss Terms</strong></p><p>Finally, DeepSeek addresses load balancing in two senses. 
It enforces a roughly equal usage of each active routing expert across tokens&#8212;ensuring no single expert is under- or over-utilized&#8212;and distributes the experts themselves across multiple GPUs to avoid hardware bottlenecks. Both of these aims are achieved by incorporating new balancing terms into the training objective.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7dNn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7dNn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 424w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 848w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1272w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7dNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png" width="1086" height="664" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:664,&quot;width&quot;:1086,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:145250,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7dNn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 424w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 848w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1272w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Taken together, these modifications produce a model that is both parameter-efficient and highly flexible. By boosting expert variety, removing needless duplication, and balancing the workload across experts and devices, DeepSeekMoE provides a substantially more effective way to leverage MoE architectures&#8212;achieving greater specialization and capacity without increasing the overall parameter footprint.</p><p>Let's dive in deeper to these three optimizations now and see how they alter the standard MoE transformer architecture.</p><h1>Standard Mixture of Experts Models</h1><p>In standard MoE architecture, expert layers will typically replace the feed-forward layer that occurs after self-attention. Experts can be thought of as a set of <em>N</em> feed-forward layers that are structurally identical to the original feed-forward layer. Only a subset of these <em>N</em> possible feed-forward networks will be activated for any individual token, with many prior MoE architectures selecting 1 or 2 of these <em>N</em> possible networks for a given token. </p><p>Whether or not a network is activated is determined by taking the dot product of the output of the attention layer for that token (i.e. the hidden vector for token i) with the centroid of the current expert. We then take the softmax of this value to force it into the range of 0 to 1. You can think of this like an attention score computed over the experts instead of the tokens - we want to see which expert aligns most closely with the current token under consideration. These scores are computed for each expert, and then the experts are ranked according to this score. The top <em>K</em> (usually 1 or 2) experts are selected based on this ranking, and the token embeddings are then passed to those feed-forward expert networks. </p><p>The output of these experts are added together alongside the initial hidden state for the token (i.e. the token vector prior to the application of the experts). 
<p>The major obstacle with this approach is the following: since most prior MoE models only selected the top 1 or 2 experts for each token, the selected expert(s) must capture <em>everything</em> about a given token, including redundant information such as language structure. This wastes a large amount of the model's capacity to learn useful information, forcing the weights of each expert to memorize redundant information that is already captured by the other experts.</p><h1>Fine-Grained Expert Segmentation</h1><p>One of DeepSeek's solutions to the redundancy problem is to <strong>make experts smaller but more numerous</strong>. That is, the DeepSeekMoE approach reduces the dimensionality of each individual expert's feed-forward network (and therefore its computational cost and representational capacity) by a factor of 1/m compared to the standard feed-forward layer it replaces. Correspondingly, it increases the number of total experts by a factor of m <em>and</em> the number of selected experts by the same factor of m. This results in the same number of parameters for the model on net, but allows for substantially more variety when selecting the experts to use for a specific token.</p><p>We can see this increased variety when examining the combinatorics of the expert space. Suppose our standard feed-forward network has hidden dimension 4096, and our standard mixture of experts model uses 8 of these experts in total, with 2 selected for any given token. This results in the following number of possible expert combinations for each token in the standard mixture of experts model:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;{8 \\choose 2} = 28 \\; \\text{possible expert combinations}&quot;,&quot;id&quot;:&quot;SBDXFQQTLB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now, using the DeepSeekMoE architecture, suppose we have m = 8. That is, we are going to increase our number of experts by a factor of 8 (and reduce the hidden dimension by a factor of 1/8). This gives us a hidden dimension of 512 per expert, with 64 total experts and 16 experts selected for any given token. This results in the following number of possible expert combinations for each token in the DeepSeekMoE version of the model:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;{64 \\choose 16} \\approx 489,000,000,000,000 \\; \\text{possible expert combinations}&quot;,&quot;id&quot;:&quot;HALZCSCSEQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>That is, we go from 28 possible expert combinations to nearly 489 trillion possible expert combinations! This allows for <em>significantly</em> more specialization across experts and much more variety in knowledge application on a token-by-token basis. Astonishingly, even with this huge increase in variety, the number of parameters stays exactly the same! 
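</p><p>As a quick, standalone sanity check on those combination counts (added here for illustration, not taken from the paper), Python's math.comb reproduces both numbers directly:</p><pre><code>import math

# Standard MoE: choose 2 of 8 experts for each token.
print(math.comb(8, 2))     # 28

# Fine-grained DeepSeekMoE-style setup: choose 16 of 64 smaller experts for each token.
print(math.comb(64, 16))   # 488526937079580, i.e. roughly 489 trillion
</code></pre><p>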
The number of total parameters in each model is given by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\n\\text{Original MoE model parameters} &amp;= 8 \\text{ experts} * 4096 \\text{ parameters per expert} = 32,768\\\\\n\n\\text{DeepSeekMoE model parameters} &amp;= 64 \\text{ experts} * 512 \\text{ parameters per expert} = 32,768\n\n\\end{align*}&quot;,&quot;id&quot;:&quot;YLEXVROBCV&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Similarly, the number of parameters activated for any given token is exactly the same:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\n\\text{Original MoE activated parameters} &amp;= 2 \\text{ activated experts} * 4096 \\text{ parameters per expert}\\\\ &amp;= 8192\\\\\n\n\\text{DeepSeekMoE activated parameters} &amp;= 16 \\text{ activated experts} * 512 \\text{ parameters per expert}\\\\ &amp;= 8192\n\n\\end{align*}&quot;,&quot;id&quot;:&quot;DWYNCNGFFC&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Hence, we get basically a free lunch here - significantly higher representational capacity in our model with the same number of parameters used!</p><h1>Shared Experts</h1><p>Another approach DeepSeek took to avoid capturing redundancy in its experts is to segment the expert population into two groups: shared experts and routing experts.</p><p>Shared experts are <strong>always activated</strong>, regardless of the input token. This incentivizes these expert modules to capture common knowledge relevant to all queries (e.g. language semantics). By contrast, routing experts are only activated if the token is relevant to the expert, as described in the "Standard Mixture of Expert Models" section. </p><p>That is, the initial <em>mN</em> experts are split into two groups: <em>K_s</em> shared experts and <em>K_r = mN - K_s</em> routing experts. <em>All</em> of the <em>K_s</em> shared experts are activated for all tokens, while a subset of the <em>K_r</em> are selected for each token. 
Mathematically, this looks like the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n    \\mathbf{h}^l_t &amp;= \\sum_{i=1}^{K_s} \\text{FFN}_i(\\mathbf{u}^l_t) \n    + \\sum_{i=K_s+1}^{mN} \\left( g_{i,t} \\cdot \\text{FFN}_i(\\mathbf{u}_t^l) \\right) \n    + \\mathbf{u}_t^l, \\\\[8pt]\n    \\text{where} \\quad \n    \\mathbf{h}_t^l &amp;\\text{ is the hidden vector output for the } t\\text{-th token at the } l\\text{-th layer,} \\\\[5pt]\n    \\text{FFN}_i &amp;\\text{ is the feed-forward network representing the } i\\text{-th expert,} \\\\[5pt]\n    K_s &amp;\\text{ is the number of shared experts,} \\\\[5pt]\n    mN &amp;\\text{ is the total number of experts,} \\\\[5pt]\n    \\mathbf{u}_t^l &amp;\\text{ is the output of the attention mechanism for token } t \\text{ at layer } l, \\\\[5pt]\n    g_{i,t} &amp;= \\begin{cases} s_{i,t}, &amp; s_{i,t} \\in \\text{Top}_k\\left( \\{s_{j,t} \\mid 1 \\leq j \\leq mN\\}, mK \\right) \\\\[5pt] 0, &amp; \\text{otherwise} \\end{cases} \\\\[8pt]\ns_{i,t} &amp;= \\text{Softmax}_i \\left( \\mathbf{u}_t^l{}^\\top \\mathbf{e}_i^l \\right),\n\\end{align}&quot;,&quot;id&quot;:&quot;NOEKVJEXTI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Hence, we can see that the hidden vector output of token t at layer l <em>always</em> uses all of the shared experts (denoted by the first summation in the equation) and <em>always</em> includes the residual (denoted by the last term). The middle term, representing the routing experts, includes a gating factor that controls which experts are turned on for any specific token. In particular, the gating factor equals the expert's softmax score if that expert ranks among the top <em>mK</em> experts. Otherwise, it is 0. As a result, not only do we eliminate most of the possible experts (thereby greatly reducing the number of active parameters), we also weight the final output based on how <em>close</em> each chosen routing expert is to the token. In other words, the more a chosen routing expert "knows" about a topic, the more heavily we weight its opinion.</p><p>This setup allows the routing experts to ignore the redundant information captured by the shared experts and instead focus on learning concepts and information that are relevant to their areas of specialization. This promotes parameter efficiency in the model, as each marginal parameter added to the routing experts will be encouraged through the learning process to acquire information that is distinct from the existing parameters.</p><h1>Load Balancing</h1><p>Now that we have a better-designed MoE network with fine-grained experts and expert sharing, there still remains one major challenge to ensure the parameters are used maximally - we need to load balance requests across the available experts. Essentially, our goal is to ensure that each routing expert is chosen (as one of the <em>mK</em> active experts) roughly equally often across tokens, so that no expert is systematically over- or under-utilized. This makes certain that, when we activate routing expert parameters to process a particular token, all of the activated parameters are contributing meaningfully to the output. As a result, we maximize the utilization of the MoE architecture.</p><p>In addition to load balancing across experts, we would like to load balance across devices. Experts are typically stored on many separate GPUs, since these models are too large to fit in the memory of a single GPU. 
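</p><p>To tie the equation above to this balancing goal, here is a minimal NumPy sketch of the shared-plus-routed computation, followed by a quick measurement of how evenly the routed experts get used over a toy batch. The sizes, the random placeholder weights, and the simple ReLU experts are assumptions made for illustration, not DeepSeek's actual implementation, and the usage fractions at the end are only a stand-in for the quantity that the balancing terms in the loss push toward uniform.</p><pre><code>import numpy as np

rng = np.random.default_rng(0)

d_model, d_expert = 16, 8              # toy sizes: model width and reduced expert width
n_shared, n_routed, top_k = 2, 14, 6   # K_s shared experts, routed experts, mK selected

def make_expert():
    # A small ReLU feed-forward expert with random placeholder weights.
    w_in = rng.normal(size=(d_model, d_expert))
    w_out = rng.normal(size=(d_expert, d_model))
    return lambda u: np.maximum(u @ w_in, 0.0) @ w_out

shared_experts = [make_expert() for _ in range(n_shared)]
routed_experts = [make_expert() for _ in range(n_routed)]
centroids = rng.normal(size=(n_routed, d_model))   # e_i: one centroid per routed expert

def deepseek_moe_layer(u):
    # h = sum of shared FFNs + sum over top-K routed experts of s_i * FFN_i + residual.
    out = sum(f(u) for f in shared_experts)        # shared experts: always active
    scores = centroids @ u
    s = np.exp(scores - scores.max())
    s = s / s.sum()                                # softmax scores s_i over routed experts
    top = np.argsort(s)[-top_k:]                   # the top-K routed experts for this token
    for i in top:
        out = out + s[i] * routed_experts[i](u)    # gate g_i = s_i for the selected experts
    return out + u, top                            # updated hidden state plus chosen experts

# Route a toy batch of tokens and measure how evenly the routed experts get used.
tokens = rng.normal(size=(256, d_model))
usage = np.zeros(n_routed)
for u in tokens:
    _, top = deepseek_moe_layer(u)
    usage[top] += 1
print((usage / usage.sum()).round(3))   # balancing terms push these fractions toward uniform
</code></pre><p>The sketch runs everything on one device; in a real deployment the routed experts are sharded across many GPUs, which is why balance across devices matters as well.</p><p>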
Given this fact, we would like the chosen experts for a token to be evenly spread across devices, thus preventing overloading of any single GPU.</p><p>These two goals are achieved by DeepSeekMoE through introducing two new terms to the loss function.</p><h1>Results and Key Takeaways</h1><p>With the above optimizations, DeepSeek was able to mitigate many of the most challenging problems facing MoE models. Together, fine-grained segmentation, shared experts, and load balancing work to maximize the amount of unique, useful information stored in a given set of parameters. As a result, DeepSeekMoE is able to outperform models with <em>fewer</em> active parameters. Below, we can see that DeepSeekMoE outperformed LLaMA2 7B (a dense model that does <em>not</em> use any experts) across a number of benchmarks with fewer than half of the active parameters. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZFgF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZFgF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 424w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 848w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1272w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png" width="991" height="918" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:991,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209291,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZFgF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZFgF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 848w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1272w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When compared to another mixture of experts model, GShard, we see that DeepSeekMoE again outperforms it with the same total parameters and only half of the activated parameters.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RZJF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RZJF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 424w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 848w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1272w, 
https://substackcdn.com/image/fetch/$s_!RZJF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RZJF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png" width="984" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6101720-8db6-495a-99ad-d32ce5906a86_984x547.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:984,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69102,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RZJF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 424w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 848w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1272w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In sum, DeepSeek's optimizations for the MoE architecture served to substantially expand the 
possibilities for local and edge inference. Since only a small percentage of the model's total parameters are active for any given token, during inference the model's performance requirements are much closer to that of a small, weak model. However, its output quality matches that of a large, well-trained dense LLM. This innovation was critical for laying the groundwork towards DeepSeek-R1, ensuring that state-of-the-art base LLM performance would be possible for smaller models.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe to make sure you don&#8217;t miss any new posts in this series!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold]]></title><description><![CDATA[How ESMFold and ESM3 replace explicit MSAs with encoder-only transformers]]></description><link>https://www.chrishayduk.com/p/understanding-protein-language-models-40b</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-protein-language-models-40b</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 22 Jan 2025 15:53:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!e9L5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e9L5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e9L5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 424w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 848w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1272w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!e9L5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png" width="1277" height="330" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:1277,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Meta's ESMfold: the rival of AlpahFold2 | by Salvatore Raieli | Medium&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Meta's ESMfold: the rival of AlpahFold2 | by Salvatore Raieli | Medium" title="Meta's ESMfold: the rival of AlpahFold2 | by Salvatore Raieli | Medium" srcset="https://substackcdn.com/image/fetch/$s_!e9L5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 424w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 848w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1272w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding Protein Language Models&#8221; 
Series:</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a></p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-e1a">Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching</a></p></li><li><p>[This article] Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold</p></li></ol><div><hr></div><p><strong>Overview of the Main Ideas</strong></p><ol><li><p><strong>AlphaFold2&#8217;s MSA:</strong> AlphaFold2 identifies evolutionarily related proteins to the target sequence and builds a multiple sequence alignment (MSA). In the Evoformer block, row-wise (within-sequence) and column-wise (across-sequences) attention on this MSA yields information about co-evolving residues. This MSA-based representation is then integrated into a pair representation matrix, ultimately helping AlphaFold2 predict the 3D structure.</p></li><li><p><strong>ESMFold&#8217;s Language Model Encoding:</strong> In ESMFold, the MSA step is replaced by a large protein language model (ESM-2) trained via a Masked Language Modeling (MLM) objective. As in standard large language models for text, the hidden layers of the encoder learn semantic and syntactic regularities&#8212;in this case, biochemical and structural patterns. The result is that ESMFold can leverage these learned encodings to identify motifs and co-evolving positions without explicitly performing genetic database searches or building large MSAs.</p></li><li><p><strong>Conceptual Motif Lookup:</strong> We can interpret ESM-2&#8217;s embeddings as performing a &#8220;continuous fuzzy lookup&#8221; within an implicit database of protein motifs. Because the language model was pretrained on massive amounts of protein data, it has effectively learned how residues co-occur&#8212;and thus co-evolve&#8212;within protein families. This internal representation replaces the explicit MSA step.</p></li></ol><p>Below, we will dive into how this replacement works in more detail, starting with a short recap of AlphaFold2&#8217;s MSA-based pipeline and then exploring how ESMFold (and ESM-2 as its core) sidesteps explicit alignment by using learned representations.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>1. Revisiting AlphaFold2&#8217;s MSA-Based Approach</h2><h3>1.1 Gathering Evolutionary Information</h3><p>AlphaFold2 conducts genetic searches against databases such as MGnify, UniRef90, Uniclust30, and BFD to identify sequences that share evolutionary relationships with the target sequence. 
From these hits, it constructs an MSA:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MSA} \\;=\\; \\begin{pmatrix} s_{1,1} &amp; s_{1,2} &amp; \\dots &amp; s_{1,L} \\\\ s_{2,1} &amp; s_{2,2} &amp; \\dots &amp; s_{2,L} \\\\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\ s_{S,1} &amp; s_{S,2} &amp; \\dots &amp; s_{S,L} \\end{pmatrix},&quot;,&quot;id&quot;:&quot;NLFVOYWKIH&quot;}" data-component-name="LatexBlockToDOM"></div><p>where L is the length of the target sequence, and S is the number of evolutionarily related sequences found. Here, s_{k,i} denotes the i-th residue of the k-th sequence in the alignment. By hypothesizing that residues co-evolve, the MSA is an external source of statistical correlations about which residues likely pair or contact each other in 3D space.</p><h3>1.2 Evoformer and Pair Representation</h3><p>In the AlphaFold2 pipeline:</p><ol><li><p><strong>MSA Representation</strong> <strong>M</strong>: A 3D tensor M,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{M} \\in \\mathbf{R}^{S \\times L \\times c}&quot;,&quot;id&quot;:&quot;TFKYKLSLZH&quot;}" data-component-name="LatexBlockToDOM"></div><p>where c is the dimensionality of each residue embedding.</p></li><li><p><strong>Pair Representation</strong> <strong>P</strong>: A 2D grid P,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{P} = \\mathbb{R}^{L \\times L \\times c_z}&quot;,&quot;id&quot;:&quot;HOLIQNNHIB&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each P_{i,j} is a learned embedding representing the pairwise relationship between residue i and residue j in the target sequence.</p></li></ol><p>Inside the Evoformer block, row-wise and column-wise attention update the MSA representation:</p><ul><li><p><strong>Row-wise (Within-sequence) Attention</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{AttnRow}(\\mathbf{M})_{k, i} = \\sum_{m=1}^{L} \\alpha_{i,m} \\, \\bigl(W^V \\mathbf{M}_{k,m}\\bigr)&quot;,&quot;id&quot;:&quot;FLMQJXBFVQ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where &#945;_{i,m} are attention weights.</p></li><li><p><strong>Column-wise (Across-sequence) Attention</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{AttnCol}(\\mathbf{M})_{k, i} = \\sum_{n=1}^{S} \\beta_{k,n} \\, \\bigl(W^V \\mathbf{M}_{n,i}\\bigr)&quot;,&quot;id&quot;:&quot;BCDSEKOEAZ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where &#946;_{k,n} are attention weights.</p></li></ul><p>After these attention layers (plus MSA transitions via 2-layer MLP), AlphaFold2 computes an <strong>Outer Product Mean</strong> that integrates MSA embeddings into the pair representation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{OPM}_{i,j} \\;=\\; \\Bigl(\\frac{1}{S}\\sum_{k=1}^S \\mathbf{u}_{k,i} \\otimes \\mathbf{u}_{k,j}\\Bigr) \\,W_{\\mathrm{proj}}&quot;,&quot;id&quot;:&quot;SMHJJISUOS&quot;}" data-component-name="LatexBlockToDOM"></div><p>where u_{k,i} is the final MSA embedding vector for residue i in sequence k. This OPM_{i,j} is then added (or concatenated and projected) into P_{i,j}, effectively injecting co-evolutionary signals gleaned from the MSA into the residue-pair representation.</p><div><hr></div><h2>2. 
ESMFold: Replacing MSA with Language Modeling</h2><h3>2.1 The Core Mechanism: Encoder-Only Transformer</h3><p>ESMFold (and its backbone ESM-2 model) is built around a large encoder-only transformer. It is trained with the <strong>masked language modeling</strong> objective, meaning it tries to reconstruct masked or hidden residues from context. This training strategy, originally popularized by BERT in natural language processing, has an important effect: it forces the model to encode in its weights the relevant &#8220;contexts&#8221; that predict each amino acid.</p><p>Mathematically, if x=(x_1,x_2,&#8230;,x_L) is the protein sequence and x_k is replaced by a special [MASK] token with some probability, the MLM training objective is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\mathrm{MLM}} = -\\sum_{k=1}^{L} \\log p_\\theta(x_k \\mid x_1, \\ldots, x_{k-1}, \\text{[MASK]}, x_{k+1}, \\ldots, x_L)&quot;,&quot;id&quot;:&quot;UFYLUMHRMO&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where p_&#952; is parameterized by the encoder transformer. Over billions of observed residues, the model internalizes the patterns of co-occurrence across diverse protein sequences.</p><h3>2.2 Implicit Motif Lookup</h3><p>Where AlphaFold2 uses explicit lookups in an MSA database (plus explicit attention across sequences), ESM-2&#8217;s learned embeddings do something analogous &#8220;in one shot.&#8221; After pretraining, the internal representation of each residue h_i (the hidden state at position i) captures average contexts encountered during training. In effect, for any position i,</p><ol><li><p>h_i has high similarity to h_j if residues x_i and x_j frequently appear in similar sequence contexts in the training set.</p></li><li><p>By extension, if an entire sequence <strong>x</strong> has patterns analogous to known motifs (e.g., an ATP-binding site pattern, a signal peptide motif, or secondary-structure fragments), then the embeddings reflect these patterns&#8212;allowing ESMFold to &#8220;retrieve&#8221; them without an explicit MSA.</p></li></ol><p>You can view this as a &#8220;continuous fuzzy matching&#8221; process, wherein the [KEY], [QUERY], and [VALUE] matrices of the transformer contain compressed representations of how residues co-occur. Rather than computing the dynamic-programming-based edit distances (or alignment) across a large external database, the model&#8217;s attention modules effectively do an alignment on-the-fly in a continuous, high-dimensional space.</p><h3>2.3 Integration into Folding</h3><p>ESMFold then appends a structure-prediction head on top of these ESM-2 embeddings, akin to how AlphaFold2 appends its structure module after the Evoformer. Even though ESMFold no longer has an explicit pair representation from an MSA, it still must estimate which residues interact or contact each other. In current ESMFold architectures:</p><ul><li><p>The final hidden states from the ESM-2 encoder are projected into a lower-dimensional representation that acts like a &#8220;pair embedding&#8221; for each (i,j).</p></li><li><p>A geometry module or a series of feed-forward layers further refines these embeddings to produce coordinates or distance/contact maps.</p></li></ul><p>In practice, ESMFold&#8217;s results are often on par with AlphaFold2 for many proteins, especially those with strong evolutionary constraints. 
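</p><p>To make the pair-embedding step above concrete, here is a minimal NumPy sketch of how per-residue hidden states from a language model might be projected and combined into a pairwise representation. This is an illustrative toy under assumed shapes and random weights, not ESMFold&#8217;s actual code.</p><pre><code>import numpy as np

# Assumed toy dimensions: L residues, d-dim hidden states, d_z-dim pair embedding
L, d, d_z = 7, 16, 8
rng = np.random.default_rng(0)

H = rng.normal(size=(L, d))          # per-residue hidden states from the encoder
W_left = rng.normal(size=(d, d_z))   # hypothetical learned projection for residue i
W_right = rng.normal(size=(d, d_z))  # hypothetical learned projection for residue j

# One vector per residue pair (i, j), built by broadcasting the two projections
left = H @ W_left                    # (L, d_z)
right = H @ W_right                  # (L, d_z)
pair = left[:, None, :] + right[None, :, :]   # (L, L, d_z)

# A downstream head could map each pair vector to, e.g., a contact logit
w_head = rng.normal(size=(d_z,))
contact_logits = pair @ w_head       # (L, L)
print(contact_logits.shape)          # (7, 7)</code></pre><p>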
For proteins with scant evolutionary data, ESMFold can sometimes do <em>better</em> than AlphaFold2, because it does not rely so heavily on a large MSA. On the flip side, certain proteins with well-studied deep MSAs can benefit from the explicit signals that AlphaFold2&#8217;s large MSA provides.</p><div><hr></div><h2>3. Mathematical Rationales for Replacing MSA</h2><h3>3.1 Complexity and Speed</h3><p>One major advantage of dropping MSAs is computational efficiency. MSA searches can be prohibitively expensive for large proteins or large sets of queries, requiring queries against massive databases (MGnify, UniRef, etc.) and heuristics to align thousands of sequences. In ESMFold:</p><ul><li><p><strong>No MSA Search:</strong> The model simply takes the query sequence and feeds it through the encoder in a single forward pass.</p></li><li><p><strong>Linear vs. Quadratic Complexity:</strong> A single Transformer forward pass for a sequence of length L has complexity O(L^2 d) where d is the dimension of embeddings, whereas building an MSA might involve searching and aligning thousands of sequences, each of length up to L.</p></li></ul><h3>3.2 Continuous Fuzzy Matching Perspective</h3><p>If we interpret the MSA as a form of nearest-neighbor search (looking for &#8220;neighboring&#8221; sequences in a large database), then the language model is effectively a learned data structure that has:</p><ul><li><p><strong>Compressed</strong> the manifold of known protein sequences into &#952; (the weights).</p></li><li><p><strong>Learned</strong> an attention-based mechanism to query that internal manifold for relevant contexts.</p></li></ul><p>In typical fuzzy string matching, one might compute edit distances between the query and every entry in the database. In the ESM-2 architecture, the attention mechanism:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Attn}(\\mathbf{Q},\\mathbf{K},\\mathbf{V}) = \\mathrm{softmax}\\Bigl(\\frac{\\mathbf{Q} \\mathbf{K}^T}{\\sqrt{d_k}}\\Bigr) \\mathbf{V}&quot;,&quot;id&quot;:&quot;JAWRVEHEKZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>acts as a trainable similarity function to identify relevant contexts. The intangible advantage is that these contexts may mix and match partial motifs from multiple &#8220;virtual neighbors,&#8221; creating a new representation not limited to the top few explicit matches in a database.</p><h3>3.3 Co-evolutionary Signals Without Explicit Alignments</h3><p>A major reason MSA is so powerful is that it captures <em>co-evolving residues</em>&#8212;positions that change in correlated ways across evolutionary history. In a typical MSA-based approach, if residue i mutates from A to G, residue j might consistently switch from T to S. Over many sequences, one infers that i and j likely contact or interact structurally.</p><p>By training on a massive corpus, the language model sees countless such correlations in raw sequence form. The emergent embeddings reflect these patterns. Hence, the final hidden state h_i is (indirectly) sensitive to all correlated positions that have ever appeared near that residue in training. So even though ESMFold does not align the query sequence to a database, it has internalized an approximate version of that same statistical correlation from its pretrained weights.</p><div><hr></div><h2>4. Example: From MSA to Language Model&#8212;A Toy Mathematical Sketch</h2><p>Suppose we consider a short hypothetical protein sequence <strong>x</strong>=(M,K,L,L,P,V,L). 
In an MSA-based approach, you might gather 10,000 sequences from a database, building:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{pmatrix} M &amp; K &amp; L &amp; L &amp; P &amp; V &amp; L \\\\ M &amp; K &amp; L &amp; L &amp; T &amp; V &amp; L \\\\ M &amp; R &amp; L &amp; L &amp; P &amp; A &amp; L \\\\ \\vdots &amp; &amp; &amp; &amp; &amp; &amp; \\\\ \\text{(10,000 rows)} \\end{pmatrix}.&quot;,&quot;id&quot;:&quot;CPYPYAYXBH&quot;}" data-component-name="LatexBlockToDOM"></div><p>You then compute attention across these sequences (column-wise) and across residues (row-wise), deriving correlation maps.</p><p>In the ESM-2 approach, no explicit MSA is constructed. Instead, during training the model saw thousands (or millions) of sequences resembling (M,K,L,L,P,V,L) or partial subsequences thereof. The MLM objective forced the model to fill in [MASK] tokens in contexts like _&#8201;K&#8201;L&#8201;_&#8201;P&#8201;V&#8201;_. Over many instances, it learned which next residues are probable. As a result, once we feed (M,K,L,L,P,V,L) into ESM-2, the hidden states reflect a &#8220;compressed MSA,&#8221; effectively picking up correlations that used to require explicit cross-sequence operations.</p><div><hr></div><h2>5. Implications and Future Directions</h2><ol><li><p><strong>Efficiency Gains:</strong> ESMFold runs <em>significantly faster</em> than AlphaFold2 when no large MSA is available, since it avoids the alignment process. For proteome-scale structure predictions, this is a game-changer.</p></li><li><p><strong>Handling Novel Proteins:</strong> If a target protein has few homologs in public databases, MSA-based models struggle. ESMFold is robust in these &#8220;low-homology&#8221; cases since it learned general protein grammar from the entire training corpus.</p></li><li><p><strong>Limited Interpretability:</strong> One downside is that MSA-based approaches produce an explicit record of hits and alignments, which can be biologically interpretable (e.g., which species and families contributed the signals). ESMFold&#8217;s learned embedding, while powerful, can be less transparent.</p></li><li><p><strong>Hybrid Approaches:</strong> Some emerging methods combine pre-trained embeddings with an MSA for the best of both worlds&#8212;particularly for proteins where deep MSAs exist.</p></li><li><p><strong>Scaling Laws and Emergent Behavior:</strong> As ESM models grow (ESM-2, ESM-3, etc.), they exhibit emergent behaviors akin to large language models in NLP. This suggests we may see further improvements in structure prediction, function annotation, and protein design.</p></li></ol><div><hr></div><h2>6. Conclusion</h2><p>AlphaFold2&#8217;s success showed how vital MSAs are in revealing <em>co-evolutionary signals</em>, which guide 3D structure inference. ESMFold&#8217;s fundamental insight is that you can <em>pre-learn</em> these signals at massive scale by treating protein sequences as &#8220;language.&#8221; Then, instead of collecting an MSA at inference time, the model effectively &#8220;queries&#8221; its internal knowledge of sequence co-occurrences, learned through the MLM objective.</p><p>In both approaches, the central idea is to approximate how residues covary. In AlphaFold2, that covariance emerges explicitly from a large MSA. In ESMFold, it is embedded implicitly in a high-dimensional transformer space. 
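</p><p>As a toy numerical illustration of those two views (every value below is made up), the explicit route estimates covariation by counting over aligned columns, while the implicit route scores a residue pair with a dot product between learned embeddings:</p><pre><code>import numpy as np

# Explicit view: two columns of a tiny hypothetical MSA (6 aligned sequences)
col_i = np.array(list("AAGGAG"))
col_j = np.array(list("TTSSTS"))
co_vary = np.mean((col_i == "A") == (col_j == "T"))
print(co_vary)  # 1.0 -> the columns change together, hinting at a structural contact

# Implicit view: the same kind of signal read off learned per-residue embeddings
rng = np.random.default_rng(0)
h_i, h_j = rng.normal(size=16), rng.normal(size=16)   # assumed hidden states
score = float(h_i @ h_j / np.sqrt(h_i.size))          # attention-style similarity
print(round(score, 3))</code></pre><p>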
The advantage of the language-model approach is that it (1) eliminates the bottleneck of database searching/alignment, and (2) leverages far more global knowledge than just sequences that happen to align with the target protein.</p><p>Mathematically, we can view these approaches as two different ways to compute a &#8220;similarity function&#8221; over the manifold of protein sequences:</p><ul><li><p><strong>AlphaFold2 + MSA:</strong> An explicit alignment-based approach that organizes relevant sequences so the model can learn correlations.</p></li><li><p><strong>ESMFold + Transformer:</strong> A large-scale learned approach that stores correlation statistics in the weights, retrieving them through self-attention rather than explicit alignment.</p></li></ul><p>As these language models grow and become more accurate, their potential to replace, or at least augment, MSA-based pipelines will only increase&#8212;promising ever-faster and more versatile protein structure prediction.</p><p>In summary, ESMFold&#8217;s fundamental contribution is demonstrating how one can use a large, pretrained protein-language transformer to replicate (and in some cases surpass) the evolutionary context that an MSA provides. It is a step toward an era where generative models of protein sequence space might supersede explicit database lookups, enabling faster, more flexible, and equally accurate structure predictions&#8212;even for proteins with scarce evolutionary data.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching]]></title><description><![CDATA[How transformers learn from their input data]]></description><link>https://www.chrishayduk.com/p/understanding-protein-language-models-e1a</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-protein-language-models-e1a</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 15 Jan 2025 19:21:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sMfG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sMfG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sMfG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 424w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 848w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sMfG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png" width="1456" height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;transformer - Why GPT uses decoder only architecture, when they can use  full encoder decoder architecture? 
- Artificial Intelligence Stack Exchange&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="transformer - Why GPT uses decoder only architecture, when they can use  full encoder decoder architecture? - Artificial Intelligence Stack Exchange" title="transformer - Why GPT uses decoder only architecture, when they can use  full encoder decoder architecture? - Artificial Intelligence Stack Exchange" srcset="https://substackcdn.com/image/fetch/$s_!sMfG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 424w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 848w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding Protein Language Models&#8221; Series:</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a></p></li><li><p>[This article] Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching</p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-40b">Understanding Protein Language Models Part III: Structure Prediction 
without Multiple Sequence Alignment in ESMFold</a></p></li></ol><div><hr></div><p>In the quest to understand modern protein language models like ESM2 and ESM3, we often focus on their impressive empirical results while treating their internal mechanisms as a black box. This post attempts to build intuition about how encoder-only transformers work by drawing an analogy to a simpler, well-understood algorithm: fuzzy string matching. I argue that encoder-only transformers can be viewed as performing a kind of continuous fuzzy lookup against a compressed form of their training data, encoded in their weights and latent space representations.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.chrishayduk.com/subscribe?"><span>Subscribe now</span></a></p><h2>The Central Analogy</h2><p>When working with large text corpora, we often need to find similar strings or patterns. The traditional approach employs fuzzy string matching: maintaining a database of all strings and computing edit distances to find matches. An alternative approach, which I argue is conceptually similar but mathematically more sophisticated, uses an encoder-only transformer to compress the patterns in the corpus into model weights, then uses attention mechanisms to find similarities.</p><p>Both approaches fundamentally solve the same problem - finding contextually appropriate matches - but do so in radically different ways. Understanding this connection helps demystify how encoder-only transformers work and suggests ways to improve them.</p><h2>How Encoder-Only Transformers Process Input</h2><p>To understand the analogy, we first need to build detailed intuition about how encoder-only transformers work. Unlike the full encoder-decoder architecture used in translation, encoder-only transformers take a sequence of tokens and return a sequence of the same length, where each output token is a refined representation incorporating contextual information.</p><p>The process begins in the embedding layer, where discrete tokens are converted into continuous vectors. Each input token is first converted to a one-hot encoding - a vector of zeros with a single one indicating the token's identity. This sparse vector is then multiplied by an embedding matrix to produce a dense vector representation. Mathematically, for a token x_i:</p><pre><code>one_hot = [0, 0, ..., 1, ..., 0] # 1 at position x_i 
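# (schematic pseudocode: W_emb is the learned embedding matrix with shape [vocab_size, d_model])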

embedding = one_hot @ W_emb # Matrix multiplication with embedding matrix</code></pre><p>To this embedding, we add a positional encoding vector that encodes information about where the token appears in the sequence. The original transformer paper used sinusoidal positional encodings:</p><pre><code>def get_positional_encoding(seq_len, d_model): 
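   # assumes numpy imported as np; d_model is assumed even so the sin/cos slices align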
   position = np.arange(seq_len)[:, np.newaxis] 
   div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model)) 
   pos_enc = np.zeros((seq_len, d_model)) 
   pos_enc[:, 0::2] = np.sin(position * div_term) 
   pos_enc[:, 1::2] = np.cos(position * div_term) 
   return pos_enc</code></pre><p>These positional encodings have an elegant property: the relative position of two tokens can be computed through linear combinations of their encodings.</p><p>The heart of the transformer architecture lies in its self-attention mechanism. For each position in the sequence, the model generates three vectors through learned linear transformations:</p><pre><code>Q = H @ W_Q # Query vectors 
K = H @ W_K # Key vectors 
V = H @ W_V # Value vectors</code></pre><p>where H is the matrix of hidden states. The attention scores are then computed as:</p><pre><code>attention_scores = softmax(Q @ K.T / sqrt(d_k)) 
output = attention_scores @ V</code></pre><p>This can be written more formally as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V &quot;,&quot;id&quot;:&quot;AGZNKXYYMF&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>This attention mechanism allows each position to gather information from all other positions, with the weights determined by learned compatibility scores. The scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large in magnitude, which would push the softmax into regions of extremely small gradients.</p><p>The transformer employs multiple attention heads in parallel, each with its own set of query, key, and value projections:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MultiHead}(H) = \\text{Concat}(\\text{head}_1, \\ldots, \\text{head}_h)W^O&quot;,&quot;id&quot;:&quot;ZRQNCSGKPZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each head is computed as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{head}_i = \\text{Attention}(HW^Q_i, HW^K_i, HW^V_i)&quot;,&quot;id&quot;:&quot;ICYATLGANP&quot;}" data-component-name="LatexBlockToDOM"></div><p>After attention, each position's representation goes through a two-layer feed-forward network:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{FFN}(x) = \\text{max}(0, xW_1 + b_1)W_2 + b_2&quot;,&quot;id&quot;:&quot;WFWABMVVDQ&quot;}" data-component-name="LatexBlockToDOM"></div><h2>Traditional Fuzzy String Matching</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xxZW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xxZW!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 424w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 848w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1272w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xxZW!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif" width="547" height="335" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:335,&quot;width&quot;:547,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Fuzzy String Matching with `stringdist` &#8211; Just R Things&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Fuzzy String Matching with `stringdist` &#8211; Just R Things" title="Fuzzy String Matching with `stringdist` &#8211; Just R Things" srcset="https://substackcdn.com/image/fetch/$s_!xxZW!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 424w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 848w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1272w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To appreciate the analogy, we need to understand how traditional fuzzy string matching works. Given an input string and a database of reference strings, fuzzy matching computes the edit distance between the input and each reference. 
The edit distance represents the minimum number of operations (insertions, deletions, or substitutions) needed to transform one string into another.</p><p>The core of fuzzy string matching is the computation of edit distance. For strings s and t, the dynamic programming recurrence is:</p><pre><code>import numpy as np

def edit_distance(s, t): 
   m, n = len(s), len(t) 
   dp = np.zeros((m+1, n+1)) 

   # Initialize base cases 
   for i in range(m+1): 
      dp[i,0] = i 
   for j in range(n+1): 
      dp[0,j] = j 

   # Fill dp table 
   for i in range(1, m+1): 
      for j in range(1, n+1): 
         if s[i-1] == t[j-1]: 
            dp[i,j] = dp[i-1,j-1] 
         else: 
            dp[i,j] = 1 + min( 
               dp[i-1,j], # deletion 
               dp[i,j-1], # insertion 
               dp[i-1,j-1] # substitution 
            ) 
   
   return dp[m,n]</code></pre><p>Mathematically, the recurrence relation is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;D[i,j] = \\min \\begin{cases} D[i-1,j] + 1 &amp; \\text{deletion} \\\\ D[i,j-1] + 1 &amp; \\text{insertion} \\\\ D[i-1,j-1] + \\mathbb{1}_{s[i] \\neq t[j]} &amp; \\text{substitution} \\end{cases}&quot;,&quot;id&quot;:&quot;NQWAVECNJM&quot;}" data-component-name="LatexBlockToDOM"></div><p>The computation proceeds through dynamic programming, building a matrix where each cell represents the minimum number of operations needed to match a prefix of the input string to a prefix of the reference string. The final cell gives the total edit distance between the strings. By computing this distance for each reference string and sorting the results, we can find the closest matches in the database.</p><p>While conceptually simple and mathematically elegant, this approach becomes computationally expensive for large databases, requiring time proportional to both the string lengths and the size of the database. Various optimizations exist, such as trie structures and early pruning, but the fundamental challenge of scaling remains.</p><h2>The Transformer as Compressed Fuzzy Matching</h2><p>Here we arrive at the core insight: the encoder-only transformer effectively compresses the pattern-matching capabilities of fuzzy string matching into its weights. During training, the model learns to encode the essential patterns and relationships present in the training data into its parameters.</p><p>The embedding matrix learns to map tokens to a continuous space where similar tokens are close together. The attention weights learn which patterns of tokens commonly co-occur, while the feed-forward layers learn to combine these patterns into higher-level features. Each successive layer captures progressively more abstract relationships. This process is analogous to building an optimized index of the training data, but instead of storing exact strings, we store distributed representations of patterns and relationships.</p><p>The connection between transformers and fuzzy matching becomes clearer when we compare their similarity computations. In fuzzy matching, the similarity between strings s and t is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{similarity}(s,t) = -\\min_{\\text{operations}} \\sum_i \\text{cost}(\\text{op}_i)&quot;,&quot;id&quot;:&quot;ZDFABHWQEO&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>In transformer attention, the similarity between vectors q and k is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{similarity}(q,k) = \\frac{q \\cdot k^T}{\\sqrt{d_k}} &quot;,&quot;id&quot;:&quot;HFJIELSMFL&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>We can view the transformer's learned weights as parameterizing a continuous relaxation of edit distance. The attention mechanism implements this relaxed distance metric:</p><pre><code>def attention_similarity(query, key, value): 
   # query shape: [seq_len, d_k] 
   # key shape: [seq_len, d_k] 
   # value shape: [seq_len, d_v] 
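   # assumes numpy as np and a row-wise softmax, e.g. scipy.special.softmax(x, axis=-1)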

   scores = query @ key.T / np.sqrt(query.shape[-1]) 
   attention_weights = softmax(scores) # [seq_len, seq_len] 
   return attention_weights @ value # [seq_len, d_v]</code></pre><p>Several observations support this compression view. The systematic improvement in model performance with increasing size suggests that larger models can store more detailed patterns from the training data. Analysis of attention patterns reveals that different heads learn interpretable relationships that match linguistic or domain structure. The organization of the embedding space shows meaningful clustering of similar tokens and preservation of analogical relationships.</p><p>When we run a sequence through the transformer, the process mirrors fuzzy matching but operates in a continuous space. The initial embedding maps tokens to vectors, analogous to preparing strings for comparison. The self-attention mechanism computes similarity scores between positions, playing a role similar to edit distance calculation, but using a learned, context-dependent metric. Multiple layers progressively refine these representations, like iteratively improving string alignment.</p><p>We can formalize this connection mathematically. In fuzzy matching, similarity is measured as the negative minimum cost of operations needed to transform one string into another. In transformer attention, similarity is measured through scaled dot products between query and key vectors. The transformer effectively learns a continuous approximation of edit distance that can capture more nuanced relationships.</p><h2>Concrete Example: Protein Sequences</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-iKO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-iKO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 424w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 848w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1272w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-iKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png" width="600" height="536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:536,&quot;width&quot;:600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74056,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-iKO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 424w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 848w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1272w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Consider a concrete example from protein sequence analysis. In traditional fuzzy matching, we might have a query sequence "MKLLPVL" and search a database containing sequences like "MKLLTVL" (one substitution) or "MLKPVL" (two operations). Each comparison requires explicit computation of edit distances. This type of genetic database search is used by MSA-based models such as AlphaFold2 and AlphaFold3. 
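</p><p>As a quick sanity check of the explicit approach, we can run the edit_distance function defined earlier in this post on those toy sequences:</p><pre><code>query = "MKLLPVL"
database = ["MKLLTVL", "MLKPVL"]

# Rank the database entries by edit distance to the query
for seq in sorted(database, key=lambda t: edit_distance(query, t)):
    print(seq, edit_distance(query, seq))
# MKLLTVL 1.0  (one substitution)
# MLKPVL 2.0   (two operations)</code></pre><p>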
The mechanics of this genetic database search are described in the previous post in the series, <a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a>.</p><p>The transformer approach is markedly different. After embedding the sequence into continuous vectors, self-attention finds similar patterns that have been compressed into the model's weights during training. The output reflects patterns observed during training, but crucially, the model can combine these patterns in novel ways. The transformer effectively "remembers" which amino acid substitutions are biochemically plausible in each context, without storing explicit sequences.</p><p>In the next post in this series, we will flesh out this section more and learn how protein language models allow us to replace the MSA step in AlphaFold, creating faster structure prediction models that generalize better to proteins that have few available related sequences.</p><h2>Conclusion</h2><p>Viewing encoder-only transformers as performing compressed fuzzy matching provides powerful intuition about their operation. Rather than seeing them as black boxes, we can understand them as learning to compress and query a vast database of patterns from their training data. This perspective suggests that improvements in transformer architecture may come from better compression techniques for storing training patterns, more efficient similarity computations, and explicit incorporation of string matching algorithms.</p><p>Future research might investigate how much pattern information is stored in different parts of the model, how different architectures affect compression quality, and whether we can design better compression mechanisms inspired by string algorithms. We might also explore the theoretical limits of this compression approach and its implications for model scaling.</p><p>The success of this architecture in domains like protein sequence modeling suggests that the ability to learn and compress domain-specific similarity metrics is a powerful paradigm. As we continue to develop these models, maintaining this conceptual connection to classical algorithms may help guide the way to more efficient and effective architectures.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[OpenAI o3 and the Rise of the Intelligence Allocator]]></title><description><![CDATA[The implications of rapidly increasing inference costs]]></description><link>https://www.chrishayduk.com/p/openai-o3-and-the-rise-of-the-intelligence</link><guid isPermaLink="false">https://www.chrishayduk.com/p/openai-o3-and-the-rise-of-the-intelligence</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Fri, 20 Dec 2024 19:19:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2iKn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2iKn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2iKn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2iKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg" width="1024" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176607,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!2iKn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OpenAI's announcement of their o3 series of models represents a pivotal moment in AI development - but not for the reasons many might expect. While the headline achievement of 87% on ARC AGI is impressive, the more transformative aspect lies in the economics of the model's deployment.</p><p>Let's start with the raw numbers: A single inference task on o3 at its highest compute setting costs over $1,000 (see the below figure). This isn't $1,000 per evaluation set or per session - this is per individual task. To put this in perspective, that's roughly equivalent to 5-10 hours of skilled human labor cost, dedicated to solving a single problem. The model offers lower compute settings, but with corresponding decreases in capability. This creates a direct tradeoff between cost and intelligence that we haven't had to grapple with before.</p><p>This cost structure represents a sharp departure from the trend we've observed over the past two years. During that period, the cost of running general-purpose language models has approached zero, even as their capabilities have steadily improved. GPT-3.5 became GPT-4, yet inference costs remained relatively stable. 
GPT-4 then became GPT-4 Turbo and GPT-4o, maintaining intelligence while rapdily decreasing inference costs. This led to a proliferation of AI applications - we could afford to experiment freely, integrating AI into virtually every workflow to see what stuck.</p><p>The o3 series shatters this paradigm. When each inference costs more than a decent laptop, you can't simply "throw AI at the problem" anymore. Every use of high-compute o3 needs to be justified by the value it creates. This introduces what we might call the "inference allocation problem" - how do we determine which tasks are worth deploying our most powerful (and expensive) models on?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GwAu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GwAu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GwAu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg" width="1194" height="670" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1194,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45229,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GwAu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!GwAu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.chrishayduk.com/subscribe?"><span>Subscribe now</span></a></p><p>Consider a software development team using o3 for code analysis. Running the model at high compute to analyze a critical security vulnerability in a payment system might be easily justifiable. But what about using it to optimize a non-critical internal tool? Or to review routine pull requests? The team now needs to develop frameworks for making these decisions systematically.</p><p>This fundamentally transforms AI deployment into a capital allocation problem. Just as investment managers spread limited capital across opportunities to maximize returns, organizations must now optimize their allocation of inference compute to maximize value creation.</p><p>Consider a hypothetical AI budget of $1 million per month. Currently, this might support tens or hundreds millions of GPT-4o inferences spread across hundreds of different use cases. With o3, the same budget only covers about 1,000 high-compute inferences. This scarcity forces us to think like capital allocators: Which thousand problems, if solved with our highest level of artificial intelligence, will generate the most value?</p><p>Beyond simply identifying high-value problems, intelligence allocators will need to understand the relationship between compute investment and value creation. 
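</p><p>A toy back-of-the-envelope sketch of that relationship (every number below is hypothetical and chosen purely for illustration):</p><pre><code># Hypothetical compute tiers for a single task
tiers = {
    "low":    {"cost": 10,   "expected_value": 200},
    "medium": {"cost": 100,  "expected_value": 800},
    "high":   {"cost": 1000, "expected_value": 1000},
}

for name, t in tiers.items():
    value_per_dollar = t["expected_value"] / t["cost"]
    print(name, t["cost"], t["expected_value"], round(value_per_dollar, 1))
# Under these made-up numbers, "medium" maximizes value per dollar,
# while "high" maximizes absolute value; the right choice depends on the budget constraint.</code></pre><p>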
<p>In another parallel to traditional capital allocation, just as investors develop frameworks for evaluating investments across different sectors and risk levels, organizations will need frameworks for evaluating AI compute allocation across different use cases. These frameworks will need to consider factors like:</p><ul><li><p>The value delta between using high-compute versus lower-compute models</p></li><li><p>The cost of being wrong or suboptimal</p></li><li><p>The potential for value capture from improved accuracy</p></li><li><p>The frequency with which the task needs to be performed</p></li></ul><p>We might even see the emergence of "AI portfolio theory" - methods for optimizing the allocation of compute resources across different types of tasks to maximize expected return while managing risk. Some organizations might adopt a "barbell strategy" - using basic models for routine tasks while reserving expensive high-compute inferences for their most critical problems.</p>
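<p>As a toy illustration of what such an "AI portfolio" decision could look like, the sketch below greedily spends a fixed monthly inference budget on the task types with the highest expected value per dollar, which naturally produces a barbell-like split between cheap routine work and expensive critical work. Every task name and number is a made-up placeholder, not a real price or benchmark.</p><pre><code># Toy "AI portfolio" allocator: spend a fixed budget on the task types with
# the highest expected value per dollar. All figures are illustrative.
tasks = [
    # (name, cost per inference ($), expected value per inference ($), monthly volume)
    ("payment-system security review", 1_000, 50_000, 40),
    ("internal tool optimization",      1_000,  1_500, 200),
    ("routine PR review",                   1,      5, 50_000),
]

def allocate(budget, tasks):
    plan = []
    # Fund the highest expected-value-per-dollar work first
    for name, cost, value, volume in sorted(tasks, key=lambda t: t[2] / t[1], reverse=True):
        runs = min(volume, int(budget // cost))
        if runs:
            plan.append((name, runs, runs * cost))
            budget -= runs * cost
    return plan, budget

plan, leftover = allocate(1_000_000, tasks)
for name, runs, spend in plan:
    print(f"{name}: {runs} runs, ${spend:,}")
print(f"unallocated: ${leftover:,}")
</code></pre>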
<p>For AI engineers, this shift means that success will depend less on pure technical implementation and more on developing the frameworks and metrics needed to make intelligence allocation decisions effectively. The best AI engineers will be those who can think like capital allocators, understanding both the technical capabilities and the business value of different compute investments.</p><p>In this light, o3 represents the beginning of an era where artificial intelligence must be treated as a scarce resource requiring careful allocation. The organizations that thrive will be those that develop robust frameworks for deploying this resource where it can generate the highest returns.</p><p>The future of AI might look less like unlimited abundance and more like traditional capital markets, where success comes from making smart allocation decisions with limited resources. As models continue to become more powerful and computationally intensive, these allocation skills will only become more crucial.</p>]]></content:encoded></item><item><title><![CDATA[On Algorithmic Moats and the Path to AGI]]></title><description><![CDATA[Google's path to winning the AI race]]></description><link>https://www.chrishayduk.com/p/on-algorithmic-moats-and-the-path</link><guid isPermaLink="false">https://www.chrishayduk.com/p/on-algorithmic-moats-and-the-path</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Thu, 19 Dec 2024 21:59:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pLfc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!pLfc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg" width="1024" height="768" alt=""></figure></div><p>The past few weeks have provided a remarkable natural experiment in AI development dynamics. OpenAI releases what appears to be a breakthrough technology, and Google promptly demonstrates superior capabilities:</p><ul><li><p>OpenAI's Sora demonstrated remarkable text-to-video generation, only to be superseded by Google's Veo 2 with notably higher quality output</p></li><li><p>OpenAI's o1 introduced novel "thinking" capabilities, followed within weeks by Google's Gemini 2.0 Flash Thinking implementing similar functionality</p></li><li><p>Gemini 2.0 has now surpassed both GPT-4 and Claude Sonnet across a broad range of benchmarks</p></li></ul><p>This pattern reveals something fundamental about the nature of competitive advantage in artificial intelligence.
To understand why this Google dominance was inevitable, we need to examine a broader principle: the myth of algorithmic moats.</p><h1>Algorithmic Moats</h1><p>It has frequently been said that part of Silicon Valley's success is the lack of non-compete clauses for employees. This allowed trade secrets to proliferate rapidly in the Bay Area, creating more efficient competition dynamics and allowing many engineers to learn from each other, rather than restricting learnings and competitive advantages to a single firm.</p><p>However, I rarely see this same line of argument applied to business moats. If this holds true, then it implies that algorithms alone <em>cannot</em> provide a durable moat to a business. Employees can easily leave one company and take all of its hard-won knowledge to a competitor, allowing the competitor to catch up.</p><p>Consider a thought experiment: You discover a revolutionary new algorithm. How long can you maintain that advantage? In a world of mobile talent and reverse engineering, the half-life of algorithmic secrets approaches zero as their value approaches infinity.</p><p>This creates what we might call the algorithm diffusion principle: Any sufficiently valuable algorithm will spread through the industry at a rate proportional to its perceived importance. Silicon Valley's prohibition on non-compete clauses accelerates this process, creating an upper bound on how long any single player can maintain algorithmic superiority.</p><p>Hence, algorithms only provide moats insofar as they facilitate the construction of another type of moat. When we talk about algorithmic moats, we're really discussing two separate concepts: the technical implementation details that can be replicated, and the emergent properties that arise from being first to market with those implementations.</p><p>Consider Google's own history with PageRank. While revolutionary for its time, the core insight &#8211; that incoming links could be weighted by the importance of their source &#8211; was relatively straightforward to replicate once published. What made Google dominant wasn't PageRank itself, but rather the virtuous cycle it enabled: better search results &#8594; more users &#8594; more data &#8594; even better search results. The algorithm was merely the catalyst for building a data moat.</p><p>This pattern repeats across the technology landscape. Spotify's recommendation algorithms, while sophisticated, aren't what prevent users from switching to Apple Music or YouTube Music. Instead, it's the years of accumulated listening history, carefully curated playlists, and social sharing features that create switching costs.
The algorithms enable these benefits, but they aren't the moat themselves.</p><h1>Moats on the Path to AGI</h1><p>The implications of the lack of direct algorithmic moats become clear when we consider AGI development as a function of three primary variables:</p><ol><li><p>Algorithmic innovation (A)</p></li><li><p>Computational resources (C)</p></li><li><p>Training data quality and quantity (D)</p></li></ol><p>We might express AGI capability as: AGI_capability = A * f(C,D)</p><p>Where f(C,D) represents the effective utilization of compute and data. The algorithm diffusion principle suggests that A will quickly equilibrate across major players. Therefore, the decisive factor becomes f(C,D).</p><p>This is where Google's position becomes overwhelming. Consider their structural advantages:</p><p>Data Supremacy:</p><ul><li><p>Google Search: The world's most comprehensive map of human knowledge and intent</p></li><li><p>YouTube: The largest repository of human audio-visual communication</p></li><li><p>Google Books/Scholar: A near-complete corpus of formal human knowledge</p></li><li><p>Android/Gmail: Vast behavioral and communication datasets</p></li></ul><p>Compute Dominance:</p><ul><li><p>Custom TPU architecture optimized for AI workloads</p></li><li><p>Vertical integration from silicon to software</p></li><li><p>World-class data center infrastructure</p></li><li><p>Decades of distributed systems optimization</p></li></ul><p>These advantages compound non-linearly. Having twice the data and twice the compute doesn't yield four times the capability &#8211; it might yield eight times or more due to emergent properties in large-scale systems.</p><p>The recent pattern of Google rapidly matching and exceeding OpenAI's innovations perfectly illustrates this dynamic. When OpenAI develops a new technique, Google can quickly replicate it (algorithm diffusion) and then apply it with vastly superior resources, achieving better results almost immediately.</p><p>This creates what game theorists would call a dominant strategy for Google: Wait for algorithmic innovations, replicate them with superior resources, and achieve better results than the original inventors. The math becomes almost deterministic.</p><p>One might object that breakthrough algorithms could create discontinuous advantages that trump resource differences. However, the observed scaling laws in neural networks suggest otherwise. The smooth power-law relationships we've seen indicate that resource advantages compound predictably rather than being disrupted by algorithmic breakthroughs.</p><p>In retrospect, the tech industry's focus on OpenAI and other startups represents a failure to reason from first principles. In a world where algorithmic innovations cannot be contained, the player with overwhelming advantages in compute and data will inevitably emerge victorious. Google's position isn't just strong &#8211; it's strategically dominant in a game-theoretic sense.</p><p>The universal rule of algorithmic diffusion suggests a surprising corollary: The most effective strategy for other players might not be to compete directly with Google, but rather to focus on specialized domains where Google's general-purpose advantages are less relevant. 
This could, ironically, lead to a more specialized and diverse AI ecosystem than many currently predict.</p>]]></content:encoded></item><item><title><![CDATA[ESM3 and the Future of Protein Language Models]]></title><description><![CDATA[Pure sequence learning is out, multiscale data is in]]></description><link>https://www.chrishayduk.com/p/esm3-and-the-future-of-protein-language</link><guid isPermaLink="false">https://www.chrishayduk.com/p/esm3-and-the-future-of-protein-language</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Tue, 25 Jun 2024 14:16:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RYPe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd672334e-257b-435e-93f6-3faa34cad41c_1280x680.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!RYPe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd672334e-257b-435e-93f6-3faa34cad41c_1280x680.png" width="1280" height="680" alt="EvolutionaryScale Debuts With ESM3 Generative AI Model for Protein Design | NVIDIA Blog"></figure></div><p>EvolutionaryScale, a team spun out of Meta&#8217;s AI research department, <a href="https://www.evolutionaryscale.ai/blog/esm3-release">today released ESM3</a>, the sequel to the hugely popular ESM2 protein language model that was released in 2022.
On the heels of this model release announcement, EvolutionaryScale also announced that <a href="https://techcrunch.com/2024/06/25/evolutionaryscale-backed-by-amazon-and-nvidia-raises-142m-for-protein-generating-ai/">they had raised a staggering $142 million in their seed round</a>. But what&#8217;s so different about ESM3 from previous iterations of protein language models? Does it warrant this level of hype?</p><p>To understand ESM3 and the surrounding hype, we&#8217;ll start with an overview of how protein language modeling has traditionally been done and use that as a springboard to see why ESM3 may represent a large step forward.</p><div><hr></div><p><strong>Note</strong>: If you&#8217;d like to learn more about protein language models before diving in here, check out my &#8220;Understanding Protein Language Models&#8221; series!</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a></p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-e1a">Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching</a></p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-40b">Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold</a></p></li></ol><div><hr></div><h1>Protein Language Modeling Overview</h1><p>The goal of protein language modeling is to <em>understand </em>the space of all possible proteins through training on sequence data. To that end, protein language models leverage the underlying architecture that powers all of the advances in natural language text &#8212; that is, the transformer. </p><h2>Tokenization</h2><p>Just as with text language models, protein language models operate on &#8220;tokens&#8221;, or discrete chunks that input sequences have been divided into. The space of all possible tokens is referred to as the vocabulary of the model. In natural language text this is a bit more difficult because different tokenization schemes can profoundly affect performance &#8212; if we chunk text at the level of letters, our vocabulary is likely far too small and our model will need much more data to learn. It will also be limited in the length of sequences it can read since each character will use up 1 token of its context window (which can be thought of as the model&#8217;s working memory). But if we chunk by word, our token vocabulary will now be tens of thousands of words long and the model will again struggle to learn since it may only see rare words once or twice. For protein language models, on the other hand, we have a natural tokenization point at the level of amino acids! Amino acids are the organic compounds that act as the building blocks for proteins in organisms (in fact, DNA&#8217;s primary role is to provide instructions to each organism&#8217;s cells about protein production &#8212; what proteins should be produced, how many, and at what time).
There are 22 distinct amino acids that are found somewhere in the genetic code of life (though only 20 are found in the human body), and these form the basis for any protein language model&#8217;s vocabulary.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!MrvP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcef0134d-2482-473b-b5c7-e18eee282f87_1220x344.png" width="1220" height="344" alt=""><figcaption class="image-caption">Tokenization in natural language text models</figcaption></figure></div>
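<p>As a concrete (and purely illustrative) sketch of what amino-acid-level tokenization looks like, the snippet below builds a toy vocabulary from the 20 standard amino acid letters plus a few special tokens and maps a sequence to token ids. Real models like ESM2 and ESM3 define their own special tokens and vocabularies; this is just the general idea.</p><pre><code># Toy amino-acid tokenizer: one token per residue (illustrative only).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"           # the 20 standard residues
SPECIALS = ["[PAD]", "[MASK]", "[CLS]", "[EOS]"]

vocab = {tok: i for i, tok in enumerate(SPECIALS)}
vocab.update({aa: i + len(SPECIALS) for i, aa in enumerate(AMINO_ACIDS)})

def encode(sequence):
    """Map a protein sequence to token ids, one id per amino acid."""
    return [vocab["[CLS]"]] + [vocab[aa] for aa in sequence] + [vocab["[EOS]"]]

print(encode("MKTAYIAKQR"))   # [2, 14, 12, 20, 4, 23, 11, 4, 12, 17, 18, 3]
</code></pre>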
<h2>Training Setup &amp; Loss Function</h2><p>Once we have a tokenization scheme, we need to train the model to perform our desired task. While text language models have focused squarely on generation, protein language models have instead focused on producing useful encodings of the target protein. To that end, they have typically used a bidirectional transformer with masked language modeling (MLM) loss rather than the unidirectional transformers with autoregressive loss.
I&#8217;ll define these terms below:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!1TQp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f85a05f-f27a-45a1-a6ae-40978aa740d2_1210x795.png" width="1210" height="795" alt=""><figcaption class="image-caption">Bidirectional transformer with masked language modeling loss</figcaption></figure></div>
<ul><li><p><strong>Bidirectional </strong>- the transformer model can read a sequence both forwards and backward</p></li><li><p><strong>Masked language modeling loss </strong>- we pick random tokens in the input sequence and hide them (or mask them) from the model.
We then score the transformer model based on whether or not it can guess this hidden token given the rest of the input sequence (see the image above)</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!vR2B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08729f45-ace9-419d-80dd-4520c878cfac_2300x1164.png" width="1456" height="737" alt=""><figcaption class="image-caption">Unidirectional transformer with autoregressive loss</figcaption></figure></div></li>
<li><p><strong>Unidirectional </strong>- the transformer can only read a sequence forwards. That is, future tokens cannot influence past tokens in an input sequence</p></li><li><p><strong>Autoregressive loss </strong>- we hide the last token in a sequence and score the model based on its ability to predict that token using all of the tokens that came before it</p></li></ul><p>Protein language models, by using the bidirectional model with MLM loss setup, are able to reference both <em>previous </em>and <em>future </em>amino acids when generating a representation for a given amino acid in a sequence. This allows the model to learn with less training data since it is an easier task than the unidirectional autoregressive case. In addition, it allows the model to attend to amino acids that may be far apart in the amino acid sequence but actually very close together in the resulting 3D protein structure.</p>
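<p>Here is a minimal, self-contained sketch of how a masked-language-modeling training example can be built from a protein sequence; the 15% masking rate is an assumption for illustration rather than ESM2&#8217;s exact recipe.</p><pre><code>import random

# Illustrative construction of a masked-language-modeling example for a
# protein sequence. The model is scored only at the masked positions and
# can use context from BOTH directions to fill them in.
MASK_RATE = 0.15            # assumed for illustration; real recipes vary
MASK_TOKEN = "[MASK]"

def make_mlm_example(sequence, seed=0):
    random.seed(seed)
    tokens = list(sequence)              # one token per amino acid
    labels = [None] * len(tokens)        # None = position ignored by the loss
    for i in range(len(tokens)):
        if random.random() > MASK_RATE:  # leave most positions unchanged
            continue
        labels[i] = tokens[i]            # the model must recover this residue
        tokens[i] = MASK_TOKEN
    return tokens, labels

tokens, labels = make_mlm_example("MKTAYIAKQR", seed=1)
print(tokens)   # the original sequence with a few residues replaced by [MASK]
print(labels)   # the identity of the hidden residue at each masked position
</code></pre>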
<h2>ESM2 Results</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!55Qs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2472e998-43bb-4472-ae20-f27a3f03f3af_1050x443.jpeg" width="1050" height="443" alt="Protein Structure Prediction : A Primer (Part 5) | by Siddhant Rai | Medium"></figure></div>
<p>Given the above tokenization scheme and training setup, ESM2 (and other protein language models like it) was able to produce some pretty impressive results. The base model, which was only trained to identify missing amino acids in a protein sequence, could be finetuned to perform tasks like protein function prediction, protein-protein interaction prediction, or protein structure prediction (as shown in the image above). It was able to perform these predictions very quickly when compared to its contemporary competitors such as AlphaFold2 due to the efficiency of the language modeling approach.
However, by the same token, its accuracy in structure prediction was generally worse than AlphaFold&#8217;s.</p><h1>Limitations of Pure Sequence Data</h1><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!tIWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9ff10b-0bd3-4d8a-984e-a1facabedef9_1500x1125.jpeg" width="1456" height="1092" alt=""></figure></div>
<p>While ESM2 and other previous protein language models showed impressive results across several tasks, they have been fundamentally limited by their reliance on pure sequence data. This approach, while valuable, fails to capture the full complexity of biological systems and the hierarchical nature of protein interactions.</p><p>The key issue lies in the nature of DNA and protein data compared to natural language text. In language models trained on text, we observe a natural multiscale learning process. Text data contains paired instruction and answer data at multiple levels of abstraction, allowing models to learn tasks ranging from simple summarization to complex synthesis of multiple texts. For example, there may be a task-answer pair that says "Please summarize this passage of text", and then another that says "Please synthesize these summaries of different pieces of text into a single narrative". We move from one task (compressing text on a single topic into a summary) into a higher-level version of that same task (compressing multiple texts on multiple topics into a single summary). In other words, we're ascending a ladder of complexity in the types of instructions the LLM is learning to perform. <strong>This all happens naturally as part of the LLM training process because text is both the instruction and the answer. Or, to put it in the parlance of von Neumann computing, text is the program and it is the data (just as bits are both the program and the data in the von Neumann computer architecture).</strong></p><p>Now let us consider a DNA language model training on DNA sequence. We can think of DNA sequences as instructions that detail how to produce a protein (just as we looked at text instructions above). The protein can then itself be viewed as a higher-level instruction (namely, how should this molecule interact with other molecules in the body). These interactions can be seen as yet higher-level instructions for how to compose reaction pathways in the body. And so on up the chain. </p><p>The core idea here is that if we train on <em>only </em>DNA sequences, we see the instructions at an early stage without ever viewing the <em>solution</em> (the protein sequence &amp; structure).
Moreover, once we "complete" this instruction (DNA -&gt; protein), to do anything useful we need to continue climbing the hierarchy (i.e. now our protein shape becomes the instruction and we use it to identify complexes &amp; interactions). But our DNA language model never sees how to do this from its DNA sequence data and thus can never learn how to ascend the hierarchy. <strong>We don't get this natural multiscale learning the way we do in text because DNA is not universal. The modality changes as we move up the ladder of complexity, whereas text is always text, regardless of the complexity or level of granularity of the instruction-answer pair.</strong> </p><p><strong>Hence, to make biological foundation models that replicate the success of LLMs in text, we need a way to encode all of the various modalities and learn them jointly in a single model. We also need to accelerate data collection efforts across these modalities rather than focusing purely on the lowest levels (i.e. DNA &amp; protein sequencing).</strong></p><h1>ESM3 &amp; Multiscale Data</h1><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!YkS1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa481af76-20ca-4955-8d6c-81e474146d9e_1912x1671.png" width="1456" height="1272" alt=""></figure></div>
<p>ESM3 addresses this limitation by incorporating a form of multiscale data into its training process. Instead of focusing solely on amino acid sequence data as ESM2 did, it integrates:</p><ol><li><p><strong>Atomic coordinates:</strong> Providing information about protein structure</p></li><li><p><strong>Sequence data:</strong> Offering the fundamental building blocks of proteins</p></li><li><p><strong>Function data:</strong> Giving context to the protein&#8217;s role in biological systems</p></li></ol><p>All three of these modalities are tokenized and learned jointly by the model (see image above). That is, unlike ESM2, which only made predictions for a protein&#8217;s amino acid sequence, ESM3 learns to simultaneously make predictions about a given protein&#8217;s amino acid sequence, 3D structure, and high-level functional details. 
This multiscale approach allows ESM3 to jointly learn about proteins at multiple levels of abstraction:</p><ul><li><p><strong>Low-level</strong>: Understanding what sequence codes for a particular protein</p></li><li><p><strong>Mid-level:</strong> Comprehending the protein's shape after folding</p></li><li><p><strong>High-level:</strong> Grasping the protein's function(s) in nature</p></li></ul><p>By learning to reason about proteins across these multiple scales, ESM3 will likely achieve significant performance improvements, particularly in generative tasks that require integration of knowledge from all three scales. The EvolutionaryScale team validated this with a case study in which they designed a new fluorescent protein that had never been seen before in nature (image below). </p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!nwdO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3581f4-71bb-4de3-983b-35c7834bcaa7_1080x1080.png" alt="" /></figure></div><p>They did this by specifying high-level functional details, important protein structure requirements for fluorescence, and known amino acid sequences that code for those structural snippets. Given this conditioning data, the model was able to generate the remainder of the protein through reasoning about the constraints across all three scales of complexity: sequence, structure, and function.</p>
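<p>To make the idea of joint, multi-track learning concrete, here is a minimal PyTorch sketch of one way a single transformer could consume per-residue sequence, structure, and function tokens and predict masked positions on any track. The vocabulary sizes, dimensions, and masking convention are illustrative assumptions on my part, not EvolutionaryScale&#8217;s actual architecture or tokenizers.</p><pre><code># Illustrative sketch (not the ESM3 implementation): fuse per-residue sequence,
# structure, and function tokens so one transformer can fill in masked positions
# on any track. All vocab sizes and dimensions below are made up for the example.
import torch
import torch.nn as nn

class MultiTrackProteinModel(nn.Module):
    def __init__(self, d_model=256, seq_vocab=33, struct_vocab=4099, func_vocab=260):
        super().__init__()
        # One embedding table per modality; embeddings are summed per residue position.
        self.seq_emb = nn.Embedding(seq_vocab, d_model)
        self.struct_emb = nn.Embedding(struct_vocab, d_model)
        self.func_emb = nn.Embedding(func_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        # One output head per modality, so any track can be predicted from the shared trunk.
        self.seq_head = nn.Linear(d_model, seq_vocab)
        self.struct_head = nn.Linear(d_model, struct_vocab)
        self.func_head = nn.Linear(d_model, func_vocab)

    def forward(self, seq_tokens, struct_tokens, func_tokens):
        x = self.seq_emb(seq_tokens) + self.struct_emb(struct_tokens) + self.func_emb(func_tokens)
        h = self.trunk(x)
        return self.seq_head(h), self.struct_head(h), self.func_head(h)

# Toy usage: mask (token id 0, by convention here) a stretch of structure tokens and
# ask the model to predict them from the sequence and function context.
model = MultiTrackProteinModel()
seq = torch.randint(1, 33, (1, 120))       # amino acid tokens
struct = torch.randint(1, 4099, (1, 120))  # discretized local-structure tokens
func = torch.randint(1, 260, (1, 120))     # function annotation tokens
struct[:, 40:60] = 0                       # "mask" part of the structure track
seq_logits, struct_logits, func_logits = model(seq, struct, func)
print(struct_logits.shape)                 # torch.Size([1, 120, 4099])
</code></pre>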
<h1>Conclusion</h1><p>Overall, ESM3 warrants the hype and signals a potential paradigm shift in the field of protein language modeling. It represents a first step in moving from an era focused on scaling up amino acid sequence data alone towards an era focused on integrating diverse, multiscale data sources. This approach aligns more closely with the success of large language models in natural language processing, where the universality of text allows for seamless learning across various levels of complexity. By incorporating multiple modalities of biological data, ESM3 and its successors will be much better positioned to replicate this success in the biological domain.</p><p>Moving forward, this shift implies a need for accelerated data collection efforts across various biological modalities, rather than focusing solely on protein sequencing and structure determination. 
For ESM4 to truly serve as a foundation model for biology the way GPT-4 has served as a foundation model for text, we will need to go beyond sequence, structure, and function to include reaction pathways, cellular expression levels, and more.</p>]]></content:encoded></item><item><title><![CDATA[A Perspective on the Limitations of Language Modeling]]></title><description><![CDATA[Probing the upper limits of compute required for AGI]]></description><link>https://www.chrishayduk.com/p/a-perspective-on-the-limitations</link><guid isPermaLink="false">https://www.chrishayduk.com/p/a-perspective-on-the-limitations</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Sat, 22 Jun 2024 18:29:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A-mW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!A-mW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png" alt="Taking brain simulation to the next level &#8211; the multi-scale approach" /></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Taking brain simulation to the next level &#8211; the multi-scale approach&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Taking brain simulation to the next level &#8211; the multi-scale approach" title="Taking brain simulation to the next level &#8211; the multi-scale approach" srcset="https://substackcdn.com/image/fetch/$s_!A-mW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!A-mW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!A-mW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!A-mW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Scenario #1: </strong>Imagine you want to model a sequence of coin flips. Your goal is to accurately predict the result of the next coin flip given the history of all previous flips in this sequence. 
<p><strong>Scenario #2: </strong>Now instead imagine that we have enough compute to model each coin toss analytically. We can simulate the impact of air resistance with CFD, the angle of the coin as it leaves the tosser&#8217;s thumb, etc. Given enough compute and sufficient accuracy of the measurements for the initial conditions, we should be able to predict the result of the coin toss with near 100% accuracy using the standard classical mechanics that have allowed us to put satellites into orbit and men on the moon.</p><p>These two scenarios deal with the same underlying process (predicting a coin flip) but address that process from two totally different perspectives. Scenario #1 approaches the prediction task from an <strong>external perspective.</strong> It attempts to model the process without actually understanding any of the internal mechanisms that generate it. In the case of the coin flip, no matter how well we model from this external perspective, we can <em>never</em> achieve better than 50% accuracy on a large enough test set. The fundamental limit is not in the data or in the model but in the <em>perspective from which we are modeling.</em></p><p>By contrast, in Scenario #2, we approach the prediction task using an <strong>internal perspective. </strong>We analyze the causal factors that contribute to the outcome of each coin flip and model those factors, allowing us to make predictions based on the mechanics underlying the flip rather than simply using the sequence itself. Here, given enough compute and sufficiently sensitive measurement devices, we can far exceed the 50% accuracy ceiling that limited the external perspective.</p><p>LLMs model human thought using the external perspective described above and, as such, will have a large amount of error that could be avoided by modeling thought from an internal perspective.</p><p>In order to move to modeling thought from an internal perspective, there are two promising avenues:</p><ol><li><p>Combining symbolic AI with large language models</p></li><li><p>Simulation of the human connectome</p></li></ol><p><a href="https://en.wikipedia.org/wiki/Symbolic_artificial_intelligence">Symbolic AI</a> comprises formal logic, proof verification systems, knowledge graphs, and more. These approaches attempt to model the mechanisms of human thought directly, focusing in particular on producing logically correct deductions (in the parlance of Kahneman&#8217;s <em>Thinking Fast and Slow</em>, these attempt to model System 2 thinking). 
Advances in combining these approaches with large language models will probably come from algorithmic and engineering work rather than scaling up compute. Thus, if this approach works, we should expect to see artificial general intelligence (AGI) without much additional scaling cost. Since this article is most interested in exploring the upper end of the compute that might be required for AGI, we will ignore this case and focus on modeling the human connectome.</p><h1>Simulating the Brain</h1><p>In the case that merging symbolic AI with LLMs does not work, our clearest avenue towards AGI would be mapping and simulating the human connectome. A simulation of the actual underlying hardware of the brain &#8212; the billions of neurons and trillions of synapses that comprise it as well as all of their interactions &#8212; should produce thought through a faithful reconstruction of its underlying mechanics in the same way that computer simulations can map the trajectory of a real rocket. And if we provide it with a map of the human connectome of someone like Albert Einstein and feed it data on our accumulated store of knowledge, it <em>should </em>be a general intelligence that can solve difficult, out-of-distribution problems. </p><p>The above paragraph rests on a number of critical assumptions (namely, that we <em>can</em> map the full human connectome at a high level of detail <em>and</em> that we can develop well-formulated models for human neurons and synapses). For the sake of this exercise, we will ignore the significant work that remains to be done in those domains and instead answer the question &#8212; <em>if</em> we already had a complete map of the human mind, how much compute would we need to run the simulation and generate thought? We&#8217;ll answer this question at a rough order of magnitude level, but it should give us a picture of when the amount of compute needed to run this simulation will become available to large corporations and research labs.</p><p>To accurately simulate the human connectome, we need to consider several factors:</p><ol><li><p><strong>Neuronal Complexity</strong>: Each neuron is a complex computational unit with intricate dynamics. Simulating a single neuron with high fidelity requires significant computational power.</p></li><li><p><strong>Synaptic Plasticity</strong>: The strength and nature of connections between neurons are constantly changing. Modeling this plasticity adds another layer of complexity to the simulation.</p></li><li><p><strong>Temporal Resolution</strong>: Neural processes occur on multiple timescales, from milliseconds to hours. A comprehensive simulation must account for these various temporal dynamics.</p></li><li><p><strong>Spatial Resolution</strong>: The spatial arrangement of neurons and their connections is crucial for understanding brain function. High-resolution mapping of the connectome is essential for accurate simulation.</p></li></ol><p>Given that we would like a high-fidelity simulation of the brain that can produce emergent, intelligent thought, we will tend towards higher complexity along all four of these factors. 
Our back-of-the-envelope calculations will assume a highly-complex neuronal model, long- and short-term synaptic plasticity, a temporal resolution of 1 millisecond, and full-connectome modeling, including all 100 billion neurons and 600 trillion synapses.</p><h2>Neurons</h2><p>Let's focus on a single-neuron model as an example, using the Hodgkin-Huxley (HH) model, which is one of the more computationally intensive but biologically realistic models:</p><ol><li><p>Membrane Potential Calculation: The HH model uses a differential equation: C(dV/dt) = -g_Na(V-E_Na) - g_K(V-E_K) - g_L(V-E_L) + I_ext. Solving this numerically (e.g., using the Euler method) requires:</p><ul><li><p>3 subtractions</p></li><li><p>3 multiplications</p></li><li><p>3 additions</p></li><li><p>1 division (for dt) </p></li><li><p><strong>Total:</strong> ~10 floating point operations per timestep</p></li></ul></li><li><p>Ion Channel Dynamics: For the gated sodium and potassium channels: dm/dt = &#945;_m(1-m) - &#946;_m*m (similarly for h and n). Each gate variable (m, h, n) requires:</p><ul><li><p>2 exponential calculations (~10 floating point operations each)</p></li><li><p>4 multiplications</p></li><li><p>2 additions/subtractions </p></li><li><p><strong>Total:</strong> ~30 floating point operations per gate, ~90 floating point operations for all three</p></li></ul></li><li><p>Conductance Calculations: g_Na = g_Na_max * m^3 * h and g_K = g_K_max * n^4. Requires:</p><ul><li><p>5 multiplications</p></li><li><p>2 exponentiations </p></li><li><p><strong>Total:</strong> ~20 floating point operations</p></li></ul></li><li><p>Current Calculations: I_Na = g_Na * (V - E_Na), etc. Requires:</p><ul><li><p>3 subtractions</p></li><li><p>3 multiplications </p></li><li><p><strong>Total:</strong> ~6 floating point operations</p></li></ul></li></ol><p>Summing these up, we get approximately 126 floating point operations per timestep for a single-compartment HH model. However, this is a significant underestimate for a realistic neuron simulation:</p><ol start="5"><li><p>Multiple Compartments: Real neurons aren't single compartments. A moderately detailed model might have 10-100 compartments, each requiring its own HH-like calculations. <strong>Total:</strong> 126 * 10 to 126 * 100 = 1,260 to 12,600 floating point operations</p></li><li><p>Synaptic Integration: A typical neuron might have 1,000-10,000 synapses. For each active synapse:</p><ul><li><p>Calculate postsynaptic current (~5 floating point operations)</p></li><li><p>Update synaptic state (~5 floating point operations)</p></li></ul><p>If 10% of synapses are active in a timestep: <strong>Total:</strong> (100 to 1,000 active synapses) * 10 floating point operations = 1,000 to 10,000 floating point operations</p></li><li><p>Intracellular Signaling: Calcium dynamics and second messenger systems might add another 100-500 floating point operations, depending on the level of detail.</p></li></ol><p>Adding these up, we get a range of about 2,360 to 23,100 floating point operations per neuron per timestep.</p>
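<p>For concreteness, here is a hedged sketch of what a single forward-Euler timestep of the single-compartment HH model looks like in code. The rate functions and constants are the standard textbook values (voltages in mV, time in ms, conductances in mS/cm&#178;); the point is only to show where the handful of additions, multiplications, divisions, and exponentials per step come from, not to be an optimized simulator.</p><pre><code># One forward-Euler step of the single-compartment Hodgkin-Huxley model.
import math

C, g_Na_max, g_K_max, g_L = 1.0, 120.0, 36.0, 0.3   # membrane capacitance and max conductances
E_Na, E_K, E_L = 50.0, -77.0, -54.4                  # reversal potentials (mV)

def rates(V):
    # Voltage-dependent opening/closing rates for the m, h, n gate variables.
    a_m = 0.1 * (V + 40.0) / (1.0 - math.exp(-(V + 40.0) / 10.0))
    b_m = 4.0 * math.exp(-(V + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(V + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(V + 35.0) / 10.0))
    a_n = 0.01 * (V + 55.0) / (1.0 - math.exp(-(V + 55.0) / 10.0))
    b_n = 0.125 * math.exp(-(V + 65.0) / 80.0)
    return a_m, b_m, a_h, b_h, a_n, b_n

def hh_step(V, m, h, n, I_ext, dt=0.01):
    a_m, b_m, a_h, b_h, a_n, b_n = rates(V)
    m += dt * (a_m * (1.0 - m) - b_m * m)             # gate updates
    h += dt * (a_h * (1.0 - h) - b_h * h)
    n += dt * (a_n * (1.0 - n) - b_n * n)
    g_Na, g_K = g_Na_max * m**3 * h, g_K_max * n**4   # conductances
    dV = (-g_Na * (V - E_Na) - g_K * (V - E_K) - g_L * (V - E_L) + I_ext) / C
    return V + dt * dV, m, h, n

V, m, h, n = -65.0, 0.05, 0.6, 0.32                   # typical resting-state values
for _ in range(2_000):                                # 20 ms of simulated time
    V, m, h, n = hh_step(V, m, h, n, I_ext=10.0)
print(f"membrane potential after 20 ms: {V:.1f} mV")
</code></pre>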
<h2>Synapses</h2><p>We&#8217;ll now estimate the number of floating point operations needed to simulate a single synapse, accounting for both transmission and plasticity.</p><ol><li><p>Basic Synaptic Transmission: I_syn = g_syn * s * (V_post - E_rev), with ds/dt = &#945; * T * (1 - s) - &#946; * s</p><ul><li><p>2 multiplications</p></li><li><p>2 subtractions</p></li><li><p>1 addition</p></li><li><p>1 division (for dt) </p></li><li><p><strong>Total:</strong> ~6 floating point operations</p></li></ul></li><li><p>Short-Term Plasticity (STP): </p><ol><li><p>Facilitation: dF/dt = (1 - F)/&#964;F + f * &#948;(t - t_spike)</p></li><li><p>Depression: dD/dt = (1 - D)/&#964;D - d * D * &#948;(t - t_spike)</p></li><li><p>Synaptic efficacy: A = A0 * F * D</p><ul><li><p>4 subtractions</p></li><li><p>3 divisions</p></li><li><p>3 multiplications</p></li><li><p>2 additions </p></li><li><p><strong>Total:</strong> ~12 floating point operations</p></li></ul></li></ol></li><li><p>Long-Term Plasticity (LTP/LTD): </p><ol><li><p>NMDA receptor activation: I_NMDA = g_NMDA * s_NMDA * B(V) * (V_post - E_NMDA), with B(V) = 1 / (1 + exp(-0.062 * V_post) * [Mg2+] / 3.57)</p><ul><li><p>4 multiplications</p></li><li><p>2 subtractions</p></li><li><p>1 division</p></li><li><p>1 exponentiation </p></li><li><p><strong>Total:</strong> ~18 floating point operations</p></li></ul></li><li><p>Calcium dynamics: d[Ca2+]/dt = -[Ca2+]/&#964;Ca + &#947; * I_NMDA + baseline</p><ul><li><p>1 division</p></li><li><p>2 multiplications</p></li><li><p>1 addition</p></li><li><p>1 subtraction</p></li><li><p><strong>Total:</strong> ~5 floating point operations</p></li></ul></li><li><p>CaMKII activation: dCaMKII/dt = k1 * [Ca2+]^n * (1 - CaMKII) - k2 * CaMKII</p><ul><li><p>2 multiplications</p></li><li><p>1 subtraction</p></li><li><p>1 exponentiation</p></li><li><p>1 division </p></li><li><p><strong>Total:</strong> ~15 floating point operations</p></li></ul></li><li><p>Weight update rule (based on CaMKII): dw/dt = &#951; * (CaMKII - &#952;p)+ - &#951; * (CaMKII - &#952;d)-, where ()+ and ()- denote rectification</p><ul><li><p>2 subtractions</p></li><li><p>2 comparisons</p></li><li><p>2 multiplications </p></li><li><p><strong>Total:</strong> ~8 floating point operations</p></li></ul></li></ol></li><li><p>Homeostatic Plasticity: w = w * (1 + &#951;_homeo * (target_activity - actual_activity))</p><ul><li><p>2 subtractions</p></li><li><p>2 multiplications</p></li><li><p>1 addition </p></li><li><p><strong>Total:</strong> ~5 floating point operations</p></li></ul></li><li><p>Structural Plasticity (simplified): P_form = sigmoid(local_activity - threshold), P_elim = 1 - P_form</p><ul><li><p>1 subtraction</p></li><li><p>1 exponentiation (for sigmoid)</p></li><li><p>1 division (for sigmoid)</p></li><li><p>1 subtraction (for P_elim) </p></li><li><p><strong>Total:</strong> ~14 floating point operations</p></li></ul></li><li><p>Neuromodulation (simplified, e.g., dopaminergic influence on plasticity): plasticity_factor = baseline + k * [dopamine]</p><ul><li><p>1 multiplication</p></li><li><p>1 addition </p></li><li><p><strong>Total:</strong> ~2 floating point operations</p></li></ul></li></ol><p>Summing these components: 6 + 12 + 18 + 5 + 15 + 8 + 5 + 14 + 2 = 85 floating point operations per synapse per timestep.</p>
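<p>As a small illustration of how one of these per-synapse updates turns into code, the sketch below steps the short-term plasticity (facilitation and depression) equations above with forward Euler at 1 ms resolution. The parameter values are illustrative stand-ins, not fitted to data.</p><pre><code># Forward-Euler sketch of the STP equations: dF/dt = (1-F)/tau_F + f*delta(t - t_spike),
# dD/dt = (1-D)/tau_D - d*D*delta(t - t_spike), with efficacy A = A0 * F * D.
dt, tau_F, tau_D = 0.001, 0.2, 0.5     # seconds
f, d, A0 = 0.2, 0.3, 1.0               # facilitation/depression increments, baseline efficacy
F, D = 1.0, 1.0
efficacy = A0

spike_times_ms = {5, 12, 14, 40}       # presynaptic spike times (ms), arbitrary
for t_ms in range(100):
    F += dt * (1.0 - F) / tau_F        # relax toward baseline between spikes
    D += dt * (1.0 - D) / tau_D
    if t_ms in spike_times_ms:         # delta-function terms applied at spike times
        F += f
        D -= d * D
    efficacy = A0 * F * D
print(f"efficacy after 100 ms: {efficacy:.3f}")
</code></pre>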
<h2>Full Estimate</h2><p>Assuming we need to simulate each neuron and synapse at a resolution of 1 millisecond, that each neuron update requires roughly 20,000 floating point operations (rounding the per-neuron range above toward its upper end), and that each synapse update requires 85 floating point operations, we can make a rough calculation:</p><p>100 billion neurons * 20,000 floating point operations/neuron * 1000 timesteps/second = 2 * 10^18 FLOPS</p><p>600 trillion synapses * 85 floating point operations/synapse * 1000 timesteps/second = 5.1 * 10^19 FLOPS</p><p><strong>Total: Approximately 5.3 * 10^19 FLOPS or 53 exaFLOPS</strong></p><p>As a result, simulating the human connectome for approximately 1 week would require more floating point operations than the entire training budget for GPT-4. Keep in mind that this simulation is the analog of LLM inference, not training: imagine if serving 1 instance of GPT-4 for 1 week took the entire training compute budget of OpenAI. This is a monumental amount of compute and beyond the economic feasibility of any current company (even if the human connectome had been fully mapped, which it is currently far from being).</p><p>From another perspective, we can look at this compute requirement in terms of the number of H100s that it would require. FP64 is typically used for scientific simulation work, and the NVIDIA H100 can perform 34 teraFLOPS = 3.4 * 10^13 FLOPS in FP64 mode. Hence, our estimate for the human brain would require approximately 1,550,000 H100s for simulation. This is 2 orders of magnitude larger than the largest deployed H100 clusters today. There is not much data on the growth rate of FP64 FLOPS, but <a href="https://epochai.org/blog/trends-in-machine-learning-hardware">FP32 FLOPS are doubling every 2.3 years</a>. Since an FP64 multiplier unit takes roughly five times the area of an FP32 multiplier, we can estimate that it will require 5 times the growth in transistors in order to double the FP64 performance when compared to doubling FP32 performance. Hence, our doubling time for FP64 performance should be roughly 7.6 years, which would imply that FP64 performance will increase by one order of magnitude (i.e. 10x the performance of the H100) in about 25 years. </p>
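<p>Before extrapolating further, the figures so far can be checked with a few lines of arithmetic. All of the inputs below are the rough assumptions from this section, not measured values.</p><pre><code># Back-of-the-envelope check of the totals above.
neurons = 100e9            # neurons in the human brain
synapses = 600e12          # synapses
flops_per_neuron = 20_000  # per 1 ms timestep (rounded from the 2,360-23,100 range)
flops_per_synapse = 85     # per 1 ms timestep
timesteps_per_second = 1_000

total_flops = (neurons * flops_per_neuron + synapses * flops_per_synapse) * timesteps_per_second
print(f"{total_flops:.2e} FLOPS")                       # 5.30e+19, i.e. ~53 exaFLOPS

h100_fp64_flops = 34e12                                 # H100 FP64 throughput
print(f"{total_flops / h100_fp64_flops:,.0f} H100s")    # ~1.56 million H100s

one_week = total_flops * 7 * 24 * 3600
print(f"{one_week:.2e} FLOP for one simulated week")    # ~3.2e+25 FLOP
</code></pre>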
<p><strong>If the above trends and assumptions hold (a very tenuous assumption), it will be 50 years before FP64 performance has increased to the point where companies can simulate the human brain for an outlay roughly equivalent to today&#8217;s largest AI training clusters.</strong> If algorithmic advances or further research in computational neuroscience supports a transition to FP32 from FP64, this 50-year timeline could be compressed to only 15 years.</p><p><strong>Hence, if simulating the brain is our only viable path to AGI, we can expect the required compute to be available for the largest corporations in 2040 at the earliest and 2075 at the latest given current trends.</strong></p>]]></content:encoded></item><item><title><![CDATA[A Case Study in Finetuning Open Source LLMs: Training LLaMA 2 for the Text-to-SQL Task]]></title><description><![CDATA[Introduction]]></description><link>https://www.chrishayduk.com/p/a-case-study-in-finetuning-open-source</link><guid isPermaLink="false">https://www.chrishayduk.com/p/a-case-study-in-finetuning-open-source</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Tue, 04 Jun 2024 13:49:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2VUq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F031c512c-67c3-4ce8-aea2-a9d9ae8ce9a0_1078x1316.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>Below is a write-up of a consulting project I did for a client in late 2023 (with all client names removed). In the report, I detail the models, datasets, and approaches needed to create a state-of-the-art text-to-SQL model that outperforms GPT-4. If you don&#8217;t have time to read the full report, the three main takeaways are the following:</p><ol><li><p><strong>The dataset is everything</strong>. By far my highest ROI activity came from inspecting, correcting, and augmenting my dataset. This included: spotchecking and fixing errors in the table schemas in the training data, identifying additional non-SQL datasets to include in training that helped performance, implementing curriculum learning to improve convergence, and more.</p></li><li><p><strong>Don&#8217;t just validate next token prediction, test real task performance.</strong> You want to make sure your validation set is as close to the real thing as possible. To that end, I made two modifications:</p><ol><li><p>I biased the validation set towards harder SQL queries rather than having the same distribution as the training set</p></li><li><p>I created real SQL tables using GPT-4 to match the schemas that are in the data. I then executed the SQL statements produced by the model against these tables and compared the result of this execution to the result of executing the ground-truth SQL. This gave me a metric that tracked the performance of the model in a <strong>real-world setting</strong>, rather than just looking at how accurately it can guess the next token (a minimal sketch of this execution-match check appears just after this list)</p></li></ol></li><li><p><strong>Experimentation is key</strong>. All of these insights took many training runs to arrive at (likely about 50 in total). Thus, make sure to use a parameter efficient finetuning method such as QLoRA so you don&#8217;t bankrupt yourself on all of the experimentation training runs. Also take meticulous notes - sometimes the insights coalesce over the course of many trials. For tracking purposes, I used Weights &amp; Biases and kept all of my notes in a running Google Doc.</p></li></ol>
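<p>As a hedged illustration of the execution-based metric in takeaway #2, the sketch below builds a throwaway SQLite table, runs both the predicted and the ground-truth SQL against it, and compares the result sets. The schema, rows, and queries are toy stand-ins, not the project&#8217;s data.</p><pre><code># Execution-match evaluation: run predicted SQL and gold SQL against the same
# sample table and compare the returned rows, rather than comparing tokens.
import sqlite3

def execution_match(create_sql, insert_sql, predicted_sql, gold_sql):
    conn = sqlite3.connect(":memory:")
    conn.executescript(create_sql + insert_sql)
    try:
        predicted = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False                      # invalid SQL counts as a miss
    gold = set(conn.execute(gold_sql).fetchall())
    return predicted == gold

schema = "CREATE TABLE employees (name TEXT, dept TEXT, salary REAL);"
rows = "INSERT INTO employees VALUES ('Ann','eng',150), ('Bo','sales',90);"
print(execution_match(schema, rows,
                      "SELECT name FROM employees WHERE salary &gt; 100;",
                      "SELECT name FROM employees WHERE salary &gt; 100.0;"))  # True
</code></pre>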
<p>These three insights together allowed me to develop a model based on Code LLaMA that outperformed GPT-4 by 30 percentage points and cost 99% less to run at inference.</p><p>Now, if you&#8217;d like all the details, the full consulting report is below.</p><h1>Overview</h1><p>Currently, usage of many database &amp; data warehouse management tools requires SQL knowledge in order to write queries. As a result, the number of seats per software license for these tools is constrained by the number of employees within a given organization who have a strong grasp of SQL, which caps the potential revenue per client. The recent emergence of large language models (LLMs) can help to alleviate this problem, as they demonstrate near human-level ability to translate from natural language to code. However, usage of common LLMs such as ChatGPT can present several challenges, including:</p><ol><li><p><strong>Cost: </strong>OpenAI charges per 1000 tokens of input and output (where 1 token roughly corresponds to 1 word). As a result, costs can skyrocket as the user base increases. For example, 100,000 users making on average 10 document summarization requests per day will cost about $5,800 per day in API fees alone. This can significantly reduce profit margins on a product. In addition, sudden unforeseen spikes in API usage can result in large losses without appropriate API throttling checks in place.</p></li><li><p><strong>Security: </strong>When using the OpenAI API endpoint, requests to ChatGPT are sent to an external endpoint. This increases the probability of data leakage and the associated negative downstream effects for a company, such as fines, legal fees, and loss of goodwill. 
Even if using an OpenAI endpoint deployed within a customer&#8217;s Azure environment, there will still be security concerns over using an external LLM on valuable, confidential data.</p></li><li><p><strong>Output Control: </strong>When using OpenAI&#8217;s API endpoints, you are at the mercy of any changes they decide to make to their models. GPT-3.5 and GPT-4 are constantly updated with new training data pouring in from the millions of user chats. While this is intended to improve the service, performance can end up degrading for your specific task &amp; prompt combination (see Figure 1 below). This can result in sudden, random drops in performance for your tool.&nbsp;</p></li></ol><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!1_gM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe710d181-f997-42d5-9cf3-79b19fc61aec_626x291.png" alt="" /><figcaption class="image-caption">Figure 1. Chen, Lingjiao, Matei Zaharia, and James Zou. "How is ChatGPT's behavior changing over time?" arXiv preprint arXiv:2307.09009 (2023).</figcaption></figure></div>
<p>Open source LLMs, however, have the potential to address each of these key drawbacks of closed source LLMs through providing: (1) smaller models that reduce the cost required to serve the LLM, (2) ability to deploy the model in any environment, whether in a secure cloud VPC or even on a local machine, and (3) complete control over all model parameters, ensuring consistent output quality. In order to leverage these open source model advantages, we have developed a state-of-the-art text-to-SQL model based on Meta&#8217;s Code LLaMA model. Our new model, dubbed LLaMA2-SQL, significantly outperforms GPT-4 on the text-to-SQL task, while being lightweight enough to run on a CPU.&nbsp;</p><h1>Methodology</h1><h2>Base Model Choice</h2><p>We began the project with three main desired characteristics for our base model:</p><ol><li><p>Strong general reasoning capabilities</p></li><li><p>Commercially-permissive licensing</p></li><li><p>Large open source community ecosystem</p></li></ol><p>Of models released during the main phase of the project (March 2023-December 2023), Meta&#8217;s LLaMA 2 and Code LLaMA models were the highest performing LLMs to satisfy these three criteria. LLaMA 2 is a general purpose LLM which was pretrained on 2 trillion tokens. 
Code LLaMA extends LLaMA 2 by further pretraining the model on an additional 520 billion tokens of code, significantly improving performance on coding tasks (see Figure 2).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!2VUq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F031c512c-67c3-4ce8-aea2-a9d9ae8ce9a0_1078x1316.png" alt="" /><figcaption class="image-caption">Figure 2. Llama and Code Llama performance compared to ChatGPT and several other open source models. HumanEval and MBPP both consist of tasks that the model must solve using Python code. Multilingual HumanEval extends HumanEval to include coding challenges in C#, Go, Java, JavaScript, Kotlin, Perl, PHP, Ruby, Scala, Swift, and TypeScript.</figcaption></figure></div>
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2.&nbsp; Llama and Code Llama performance compared to ChatGPT and several other open source models. HumanEval and MBPP both consist of tasks that the model must solve using Python code. Multilingual HumanEval extends HumanEval to include coding challenges in C#, Go, Java, JavaScript, Kotlin, Perl, PHP, Ruby, Scala, Swift, and TypeScript.</figcaption></figure></div><p>Given Code LLaMA&#8217;s strong general reasoning &amp; coding performance, we hypothesized that it would provide a good starting point from which to finetune a SQL-specific model. Specifically, we selected Code LLaMA - Instruct 34B to leverage its instruction-following capability when creating LLaMA2-SQL.</p><h2>Dataset Curation &amp; Augmentation</h2><p>Creation of the dataset was <strong>the most significant piece of the project</strong> and resulted in <strong>the largest gains in performance</strong>. To begin, we combined several datasets, including <a href="https://huggingface.co/datasets/wikisql">WikiSQL</a>, <a href="https://huggingface.co/datasets/spider">Spider</a>, and <a href="https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca">Code Instructions Alpaca 120K</a>. These datasets combined to form the <a href="http://chrishayduk/Llama-2-SQL-and-Code-Dataset">LLaMA2-SQL AI dataset</a>. This amalgamation was not just a mere aggregation of data; it was a strategic blend designed to encompass a broad spectrum of SQL queries and structures.</p><p>In the pursuit of refining the dataset and enhancing the model's performance, several key strategies were employed. We adopted the instruct dataset format, which is known for its efficacy in guiding models towards more accurate and context-aware outputs. This format was instrumental in aligning the model's responses with the intricate demands of SQL query generation. Furthermore, we introduced a mix of general coding problems alongside SQL generation tasks. General coding questions tended to be longer and require more reasoning steps than the typical SQL queries found in WikiSQL and Spider datasets. 
As a result, this mixture improved the model&#8217;s reasoning capability on more complicated user questions, despite lowering the overall proportion of SQL queries in the training set.</p><p>One of the more technical challenges we addressed was a major table schema issue, where all columns in the training &amp; validation sets were indiscriminately coded as VARCHAR. By resolving this, we ensured that the model could recognize and handle a variety of data types, thereby increasing the accuracy and reliability of its SQL output. Additionally, we eliminated examples from the dataset where the response was less than 10 characters. This exclusion was based on the rationale that shorter responses often lack the complexity and detail required for effective training in SQL generation.</p><p>To further refine the dataset, we sorted examples in order of instruction length. This sorting approach (known as <a href="https://en.wikipedia.org/wiki/Curriculum_learning">curriculum learning</a>) allowed for a gradual and systematic exposure of the model to increasingly complex queries, thereby enhancing its learning curve. We also utilized an embedding model to identify and remove similar data points. This step was crucial in ensuring that the dataset was not only diverse but also free of redundant or overly repetitive examples.</p><p>Another innovative step was the randomization of the SQL schema order within the dataset. This randomization was a strategic move to prevent the model from developing biases or shortcuts based on schema ordering. Lastly, we intentionally biased the validation set to include mostly difficult SQL problems. This bias ensured that the model was rigorously tested against complex and challenging queries, which is essential for real-world applications where query complexity can vary greatly.</p><h2>Model Training</h2><p>Model training, particularly the fine-tuning of large language models, typically demands substantial GPU memory, posing significant challenges in terms of cost and efficiency. Even the smallest LLaMA model, with 7 billion parameters, requires approximately 140 GB of GPU RAM, while the 70B LLaMA model demands a staggering 1400 GB. This high resource requirement makes conventional fine-tuning methods both expensive and time-consuming. To address these challenges, we employed Parameter Efficient Fine-Tuning (PEFT), a novel approach that significantly reduces GPU RAM requirements for fine-tuning open source models. This method is a part of the <a href="https://github.com/huggingface">HuggingFace</a> library ecosystem and offers various out-of-the-box implementations.</p><p>Among the PEFT methods, we chose to use Quantized Low-Rank Adaptation (QLoRA). QLoRA builds upon the core concept of PEFT and introduces three critical enhancements (see Figure 3). Firstly, it quantizes the base model since the base model parameters are frozen and do not require maintenance in a high-precision format. This quantization effectively reduces memory usage. Secondly, it offloads some of the optimizer's values to the CPU memory when the GPU memory is insufficient, bringing them back to the GPU as needed. Lastly, it uses an interesting trick from linear algebra to reduce the number of trainable parameters. The trick works as follows: Assume the weight matrix W has dimension (d x d). 
If we set the dimensions of matrix A to be (d x r) and matrix B to be (r x d), then the multiplied matrix AB has the dimensions (d x d), the same as our original weight matrix W, but with potentially far fewer parameters! W has d*d = d^2 parameters, whereas A and B together have (d*r) + (r*d) = 2dr parameters. When r is less than d/2, this results in a reduction in trainable parameters compared to the original model. Low values of r can make this reduction dramatic. The combination of these three enhancements makes QLoRA incredibly memory efficient for fine-tuning while still maintaining strong performance.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!KM50!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5699aab2-3cc7-4c63-83f8-116dc599e23c_512x269.png" alt="" /><figcaption class="image-caption">Figure 3. Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).</figcaption></figure></div>
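<p>A quick numerical check of that parameter math, using a hidden size in the same ballpark as a large transformer layer (the exact values below are illustrative, not the project&#8217;s configuration):</p><pre><code># Parameter-count check for the low-rank trick above.
d, r = 8192, 16                  # hidden size d, adapter rank r (illustrative values)
full = d * d                     # trainable parameters if we updated W directly
lora = 2 * d * r                 # parameters in A (d x r) plus B (r x d)
print(full, lora, f"{full / lora:.0f}x fewer trainable parameters")
# 67108864 262144 256x fewer trainable parameters
</code></pre>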
<div class="captioned-image-container"><figure><figcaption class="image-caption">Figure 3.&nbsp; Hu, Edward J., et al. "Lora: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).</figcaption></figure></div><p>To facilitate PEFT fine-tuning, we utilized several libraries. <a href="https://github.com/OpenAccess-AI-Collective/axolotl">Axolotl</a> was instrumental in orchestrating the training process using common PEFT methods. Its user-friendly setup, requiring only a simple YAML file of training parameters, streamlined our workflow, and its support for distributed training was crucial for managing the computational demands of our project.</p><p>Another key component in our training arsenal was <a href="https://github.com/Dao-AILab/flash-attention">Flash Attention</a>. This library implements an exceedingly efficient attention mechanism, allowing for faster training and lower memory usage, which translates to cost savings. This efficiency was vital in our pursuit of a balance between performance and cost.</p><p>Google Colab played a pivotal role in our training process. By renting A100 instances at an affordable rate, we could train models up to 34 billion parameters using a combination of QLoRA and Flash Attention. We particularly recommend Colab Pro+ for its background execution capabilities and priority access to A100 instances.</p><p>The synergy of QLoRA and Flash Attention proved to be incredibly powerful. In our project, we successfully fine-tuned the Code LLaMA 34B model on 67,000 training examples over five epochs in under 22 hours. This efficiency resulted in a cost-effective training process, costing approximately $28 in Google compute credits. The outcome was a fully fine-tuned model that surpassed GPT-4 in our specific task, a testament to the effectiveness of our chosen training methods and tools.</p>
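<p>For readers who want to see what this looks like in code, below is a minimal QLoRA setup using the HuggingFace transformers, peft, and bitsandbytes libraries. This is an illustrative sketch rather than our actual Axolotl configuration, and the hyperparameters shown are assumptions, not the values we trained with.</p><pre><code># Minimal QLoRA sketch: load the frozen base model in 4-bit and attach small
# trainable low-rank adapters. Hyperparameters here are illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-34b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()           # only the adapter weights are trainable</code></pre>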
<h1>Model Results</h1><h2>Accuracy</h2><p>LLaMA2-SQL was evaluated against its two main competing models: GPT-3.5 and GPT-4. All three models were evaluated on a dataset consisting of a natural language instruction as input with a desired SQL statement as output. Some examples included the table schema as input, while others excluded the schema to test the model&#8217;s ability to infer schema information from text alone. Each text and SQL example was accompanied by a sample SQL table. An example data point is displayed below in Figure 4 (the contained text is fairly small, so I recommend zooming in to see the data more clearly).</p>
<div class="captioned-image-container"><figure><figcaption class="image-caption">Figure 4. Example data point from the evaluation set.</figcaption></figure></div><p>To produce our accuracy numbers, we executed the ground truth SQL statements against this sample table and compared the results to the tables produced by executing the SQL output of each of the LLMs tested in our benchmark suite. The accuracy thus represents the percent of evaluation examples where the model produced executable SQL that arrived at the correct answer. By examining the output accuracy rather than simply checking if the model&#8217;s SQL statement matches the desired SQL statement, we are able to account for examples where the model uses a different approach to arrive at the correct answer. As a result, our accuracy more closely reflects the accuracy that the model would achieve in the real world.</p>
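<p>The scoring logic itself is simple. Here is a minimal sketch using an in-memory SQLite database; the table, queries, and helper name are hypothetical, but the comparison mirrors the execution-based matching described above.</p><pre><code># Execution-based scoring sketch: run the gold query and the model's query against
# the same sample table and count a hit only when the result sets match.
import sqlite3

def execution_match(setup_sql, gold_sql, predicted_sql):
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)             # create and populate the sample table
    gold = conn.execute(gold_sql).fetchall()
    try:
        pred = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False                          # non-executable SQL counts as a miss
    # Compare as sorted multisets so equivalent queries with different row order still match
    return sorted(map(tuple, gold)) == sorted(map(tuple, pred))

setup = ("CREATE TABLE employees (name TEXT, salary INT);"
         "INSERT INTO employees VALUES ('Ann', 90), ('Bo', 70);")
print(execution_match(setup,
                      "SELECT name FROM employees WHERE salary = 90",
                      "SELECT name FROM employees ORDER BY salary DESC LIMIT 1"))  # True</code></pre><p>Note that the second query takes a different route to the same answer, which is exactly the kind of case that string matching against the reference SQL would incorrectly penalize.</p>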
<p>Using this evaluation setup, we see that LLaMA2-SQL significantly outperforms both GPT-4 and GPT-3.5, beating them by about 30 and 37 percentage points on the evaluation set, respectively (see Figure 5 below). It achieves an accuracy of 68.82%, which includes examples where no schema is provided at all. When provided table schemas along with every user question, LLaMA2-SQL&#8217;s accuracy exceeds 85%. As a result, we are able to provide a better user experience for the text-to-SQL task than is offered by closed source models, despite their increased model and training data size.</p>
<div class="captioned-image-container"><figure><figcaption class="image-caption">Figure 5. Model results on evaluation set.</figcaption></figure></div><h2>Cost &amp; Compute Efficiency</h2><p>In addition to beating GPT-3.5 and GPT-4 handily in accuracy, LLaMA2-SQL has 1/10th as many parameters as GPT-3.5 and 1/100th as many as GPT-4. Moreover, we have optimized LLaMA2-SQL to further reduce the model&#8217;s size and enable it to run on CPUs. This translates to massive reductions in the compute needed to serve LLaMA2-SQL to customers, resulting in large cost savings. The LLaMA2-SQL model can either be served on a CPU machine, with response times of about 30-60 seconds, or on a cheap GPU machine, with response times of about 5 seconds. The deployment hardware can be determined by the desired usage &#8211; if response times do not need to be near instant to produce a strong user experience, then the cheaper CPU environment can be used. Estimated costs and theoretical maximum requests served for a single representative instance hosting the LLaMA2-SQL model are as follows:</p><ul><li><p><strong>GPU:</strong></p><ul><li><p><strong>AWS EC2 Instance Type:</strong> g4dn.xlarge</p></li><li><p><strong>Daily Cost:</strong> $12.60</p></li><li><p><strong>Maximum Requests per Day:</strong> About 29,000</p></li></ul></li><li><p><strong>CPU:</strong></p><ul><li><p><strong>AWS EC2 Instance Type:</strong> t4g.2xlarge</p></li><li><p><strong>Daily Cost:</strong> $6.47</p></li><li><p><strong>Maximum Requests per Day:</strong> About 1,900</p></li></ul></li></ul><p>LLaMA2-SQL&#8217;s compute efficiency allows thousands of daily user requests to be served at the cost of only a few dollars per day. By comparison, 29,000 daily requests to the GPT-4 API endpoint would cost roughly $1500 per day. As a result, when deploying our model on a GPU instance,<strong> we achieve a 99% reduction in cost while also improving performance by 30 percentage points when compared to GPT-4</strong>.</p>
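<p>As a quick back-of-the-envelope check on that comparison, using the figures quoted above:</p><pre><code># Rough cost comparison at 29,000 requests per day, using the numbers cited above.
llama_gpu_daily = 12.60                   # g4dn.xlarge daily cost
gpt4_daily = 1500.0                       # rough GPT-4 API cost for the same volume
print(llama_gpu_daily / 29_000)           # about $0.0004 per request
print(1 - llama_gpu_daily / gpt4_daily)   # about 0.99, i.e. a roughly 99% cost reduction</code></pre>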
<h1>Conclusion &amp; Next Steps</h1><p>This project&#8217;s key objective was to validate the potential of a custom, low cost text-to-SQL model, with the goal of increasing Client&#8217;s market penetration by making the tool accessible to non-technical users. LLaMA2-SQL achieved this objective, setting a new state-of-the-art in text-to-SQL generation, both from a performance and cost standpoint. Future work building on LLaMA2-SQL can take a number of directions, including:</p><ol><li><p>Retraining LLaMA2-SQL using newly-released open source models that are more powerful than Code LLaMA</p></li><li><p>Improving LLaMA2-SQL&#8217;s training dataset to more closely match the requests that will be seen in a production environment from Client&#8217;s users</p></li><li><p>Optimizing &amp; automating cloud infrastructure using Terraform to support rapid deployment of LLaMA2-SQL on AWS, Azure, and GCP</p></li><li><p>Developing UI/UX for LLaMA2-SQL to support its integration into the tool</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2]]></title><description><![CDATA[The fundamental concepts behind ESM2, ESM3, and AlphaFold]]></description><link>https://www.chrishayduk.com/p/understanding-protein-language-models</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-protein-language-models</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Mon, 03 Jun 2024 15:06:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZuG4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6478b229-c998-4c3d-a9f8-c115af1ba1bd_1300x509.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[
<p><strong>Note: </strong>This post is part of the &#8220;Understanding Protein Language Models&#8221; Series:</p><ol><li><p>[This article] Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-e1a">Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching</a></p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-40b">Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold</a></p></li></ol><div><hr></div><p>Over the past couple of months, I&#8217;ve been on a journey to understand protein language models - what <em>exactly </em>are they learning and how do they work? I began that journey by trying to understand the inner workings of AlphaFold2, given that it represented the first leap forward for AI in biology.</p><p>In particular, protein language models sought to replace the multiple sequence alignment (MSA) component of AlphaFold2 in order to achieve similar performance at much lower computational costs. So, in order to better understand what protein language models are doing, it is first important to understand what they replaced. To that end, in these notes I dive into the role of multiple sequence alignments in AlphaFold2 and how they drive the model&#8217;s ability to infer contacts between residues in the protein. I hope you find these notes useful!</p><h2>What is MSA?</h2>
<p>Multiple sequence alignment (MSA) refers to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor.
From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.</p><p>Visual depictions of the alignment (as in the toy example below) illustrate mutation events such as point mutations (single amino acid or nucleotide changes) that appear as differing characters in a single alignment column, and insertion or deletion mutations (indels or gaps) that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.</p>
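<p>To make the idea of alignment columns concrete, here is a toy alignment with made-up sequences. Each column lines up homologous positions, hyphens mark gaps, and a column containing more than one letter reflects a mutation at that position.</p><pre><code># Toy multiple sequence alignment (made-up sequences, for illustration only).
msa = [
    "MKT-LLVAA",   # query sequence
    "MKTALLVAA",   # has an insertion relative to the query (column 4)
    "MRT-LLVAG",   # point mutations relative to the query (columns 2 and 9)
    "MKT-LLVAA",
]
for col in range(len(msa[0])):
    letters = {seq[col] for seq in msa}
    status = "conserved" if len(letters) == 1 else "variable"
    print(col + 1, "".join(sorted(letters)), status)</code></pre>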
<p>Thus, by including an MSA as input, AlphaFold2 is able to infer information about the target sequence by assessing its shared evolutionary history with a number of other sequences. This is a powerful concept - it gives the model a strong &#8220;starting point&#8221; to make predictions for a new sequence.</p><p>The MSA databases used by AlphaFold2 to identify evolutionarily similar sequences to the target sequence are:</p><ul><li><p>MGnify</p></li><li><p>UniRef90</p></li><li><p>Uniclust30</p></li><li><p>BFD</p></li></ul><h2>AlphaFold2 Architecture</h2><div class="captioned-image-container"><figure><figcaption class="image-caption">Full architecture of AlphaFold2</figcaption></figure></div><p>Now that we know what MSA is and why it is used, let&#8217;s sketch out the high-level architectural details of AlphaFold2. In the above image, we can see the various inputs &amp; components that comprise the model. AlphaFold2 takes in three key inputs:</p><ol><li><p>The input sequence itself</p></li><li><p>An MSA using the input sequence as its starting point</p></li><li><p>Template structures related to the input sequence</p></li></ol><p>These three inputs are then distilled into two by using the template structures and input sequence to initialize a pair representation matrix. The pair representation matrix can be thought of as scores for &#8220;similarity&#8221; or &#8220;interaction&#8221; between each pair of amino acids i and j in the input sequence.</p><p>By contrast, the MSA representation can be thought of as storing a vector representation of each amino acid for each protein in the alignment. If we imagine the matrix as a 2D grid, each row represents a protein and each column represents a position in the aligned amino acid sequence (e.g. amino acid #5 in the sequence). In each cell of this matrix, we can imagine a vector that represents the specified amino acid. In reality, this is a tensor of shape (number of sequences, number of residues, channels).</p>
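<p>A rough sketch of the shapes involved may help here; the sizes below are made up for illustration rather than taken from the AlphaFold2 implementation.</p><pre><code># Shapes of the two working representations (illustrative sizes only):
# s = number of sequences in the MSA, r = number of residues, c = channels.
import numpy as np

s, r, c_msa, c_pair = 128, 300, 256, 128
msa_rep = np.zeros((s, r, c_msa))     # one embedding vector per (sequence, residue) cell
pair_rep = np.zeros((r, r, c_pair))   # one embedding vector per (residue i, residue j) pair

query_rep = msa_rep[0]                # the first row corresponds to the input sequence itself
print(msa_rep.shape, pair_rep.shape, query_rep.shape)</code></pre>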
<p>These two inputs then flow through the Evoformer block, which generates improved representations of the MSA and pair representation matrices for structure prediction. The journey for the full MSA matrix ends here, as we extract the representation for our input sequence from the first row of the MSA matrix and send it forward to the structure module.</p><p>Given that the processing of the MSA matrix takes place in the Evoformer block, we&#8217;ll dive a bit deeper there.</p><h3>Evoformer Block</h3>
<div class="captioned-image-container"><figure><figcaption class="image-caption">Evoformer architecture in AlphaFold2</figcaption></figure></div><p>The Evoformer block begins with components for processing the MSA representation:</p><ol><li><p>Row-wise gated self-attention</p></li><li><p>Column-wise gated self-attention</p></li><li><p>Transition</p></li></ol><p>Following these three blocks, the MSA representation matrix is integrated into the pair representation matrix through the outer product mean block and resulting sum.</p><p>We&#8217;ll now dive deep into these three core components of the Evoformer block (alongside the outer product mean integration) to better understand how the MSA matrix is being updated.</p><h2>MSA Row-wise Gated Self-Attention</h2><p>Row-wise attention builds attention weights for residue pairs within the same sequence and integrates information from the pair representation as an additional bias term.
The updated MSA representation matrix thus ensures that each sequence has a <em>contextual representation</em> for its residues - that is, for sequence k, the embedding of the residue at index i takes into account information from the residues at indices 1, &#8230;, i-1, i+1, &#8230;, r.</p>
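<p>For readers who want to see the mechanics, here is a simplified, single-head numpy sketch of row-wise gated attention with random weights and made-up sizes. In the real model there are multiple heads and the bias is a learned linear projection of the pair representation, so treat this as an illustration of the data flow rather than the actual implementation.</p><pre><code># Simplified single-head sketch of MSA row-wise gated self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

s, r, c = 8, 10, 16                           # sequences, residues, channels
msa = np.random.randn(s, r, c)                # MSA representation
pair_bias = np.random.randn(r, r)             # bias derived from the pair representation

Wq, Wk, Wv, Wg = [np.random.randn(c, c) for _ in range(4)]
q, k, v = msa @ Wq, msa @ Wk, msa @ Wv
logits = q @ k.transpose(0, 2, 1) / np.sqrt(c) + pair_bias  # same bias added in every row
attn = softmax(logits, axis=-1)               # weights over residues j within each sequence
gate = 1 / (1 + np.exp(-(msa @ Wg)))          # sigmoid gate, per cell and channel
row_update = gate * (attn @ v)                # updated MSA representation, shape (s, r, c)</code></pre>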
<div class="captioned-image-container"><figure><figcaption class="image-caption">Architecture of row-wise gated self-attention</figcaption></figure></div><h2>MSA Column-wise Gated Self-Attention</h2><p>Column-wise attention lets the elements that belong to the same target residue exchange information <em>across</em> sequences in the MSA. The updated MSA representation matrix thus ensures that each residue has a <em>cross-sequence representation</em> - that is, the embedding for residue i in sequence k also takes into account information from residue i in sequences 1, &#8230;, k-1, k+1, &#8230;, s.</p>
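<p>Continuing the sketch above, the column-wise version is the same operation applied along the other axis of the MSA, with no pair bias term:</p><pre><code># Column-wise counterpart of the row-wise sketch above (reuses msa, Wq, Wk, Wv, Wg, softmax, c).
cols = msa.transpose(1, 0, 2)                        # (residues, sequences, channels)
q, k, v = cols @ Wq, cols @ Wk, cols @ Wv
attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c), axis=-1)   # weights over sequences
gate = 1 / (1 + np.exp(-(cols @ Wg)))
col_update = (gate * (attn @ v)).transpose(1, 0, 2)  # back to (sequences, residues, channels)</code></pre>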
<div class="captioned-image-container"><figure><figcaption class="image-caption">Architecture of column-wise gated self-attention</figcaption></figure></div><h2>MSA Transition</h2><p>After row-wise and column-wise attention, the MSA stack contains a 2-layer MLP as the transition layer.
The intermediate layer of this MLP expands the original number of channels by a factor of 4.</p><div class="captioned-image-container"><figure><figcaption class="image-caption">Architecture of MSA transition</figcaption></figure></div>
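<p>A tiny sketch of that transition layer, with an illustrative channel count:</p><pre><code># Position-wise 2-layer MLP with a 4x channel expansion (illustrative sketch; the real
# block also applies layer normalization and uses learned weights).
import numpy as np

c = 16                                        # channels per MSA cell (made-up value)
W1, W2 = np.random.randn(c, 4 * c), np.random.randn(4 * c, c)

def transition(x):
    return np.maximum(x @ W1, 0.0) @ W2       # expand to 4c, ReLU, project back to c

cell = np.random.randn(c)
updated_cell = cell + transition(cell)        # applied with a residual connection</code></pre>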
update for the pair representation. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h9zR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h9zR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 424w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 848w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 1272w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h9zR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png" width="1258" height="335" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:335,&quot;width&quot;:1258,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73370,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h9zR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 424w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 848w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 1272w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Integration of MSA representation with pair representation</figcaption></figure></div><p>In particular, this step grabs two vectors of representations for residues i and j, where the vectors span the representations of all sequences included in the MSA. The outer product step creates a matrix of all dot product combinations. In the below image, you can think of u_1 as "sequence 1, residue i", u_2 as "sequence 2, residue i", and so on. Similarly, you can think of v_1 as "sequence 1, residue j", v_2 as "sequence 2, residue j", and so on. Since these dot products gives us a measure of the <strong>similarity</strong> between the representation of "sequence k, residue i" and "sequence m, residue j", we can think of it as a matrix that captures the pairwise similarities between all residues at positions i and j in the MSA.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WKYv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WKYv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 424w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 848w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 1272w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WKYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png" width="335" height="133" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09e53740-1785-4625-85b7-1448bff4b29f_335x133.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:133,&quot;width&quot;:335,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7687,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WKYv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 424w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 848w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 1272w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This matrix ends up being of shape (s, c, c), where s is the number of sequences and each c dimension denotes the number of features for residue i and j's representations, respectively. AlphaFold then takes a mean over the s dimension of the matrix. What this means intuitively is that we average the pairwise similarity of residue i and residue j across <em>all possible pairs</em> of sequences in the MSA matrix. This collapses matrix from shape (s, c, c) to shape (c, c).</p><p>The last step projects the features to from (c, c) to c_z. This allows them to be added to each entry in the pairwise representation.</p><h2>Conclusion - What is the MSA Representation Doing in AlphaFold?</h2><p>So, putting this all together - the MSA steps compute a representation that optimally captures similarity of residues, both:</p><p>1. <strong>Within sequences</strong> by using row-wise attention to attend across amino acids inside a given sequence</p><p>2. <strong>Across sequences</strong> by using column-wise attention to attend across sequences for a given amino acid index</p><p>This representation is then used to generate a measure of similarity between all possible residue pairs in the MSA representation. We then update the pair representation of the target sequence by adding in these values. In essence, we use the MSA to "find out" which residues are similar to which other residues, and then add this information to the pair representation so that the structure module can guess at which residues are in contact with one another (based on the fact that they co-evolve and are therefore similar in the MSA representation). 
This allows for highly accurate structure prediction, incorporating information from the evolutionary tree to infer the optimal folded structure of a given input protein.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Musings by Chris Hayduk is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Reverse Goal Planning]]></title><description><![CDATA[Planning Backwards to Create a Roadmap for Success]]></description><link>https://www.chrishayduk.com/p/small-goals-to-accomplish-big-dreams</link><guid isPermaLink="false">https://www.chrishayduk.com/p/small-goals-to-accomplish-big-dreams</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Tue, 11 Oct 2022 15:56:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uP91!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uP91!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uP91!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uP91!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uP91!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uP91!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uP91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg" width="1400" height="931" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/e54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:931,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!uP91!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uP91!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uP91!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uP91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@glenncarstenspeters?utm_source=medium&amp;utm_medium=referral">Glenn Carstens-Peters</a> on <a href="https://unsplash.com/?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure></div><p>Think of the biggest goal you have for your life. It probably feels out of reach and intimidating, something abstract and totally unattainable. 
When you begin to think of how to accomplish your goal, you might feel lost, like you&#8217;ve been asked to find your way to a remote location without a map. This feeling of disorientation &#8212; the lack of an idea, any idea, to get from where you are to where you want to be &#8212; can cause you to feel hopeless and paralyze you in the face of real, profound change in your life. But it doesn&#8217;t need to be this way.</p><p>The anxiety surrounding our largest goals &#8212; whether they be career-, family-, or spiritually-focused &#8212; centers around the fact that given any starting point in life, there are an infinite number of paths forward. The more difficult, concrete, and far off into the future the goal is, the fewer of these branching paths will reach the destination you desire. Identifying the correct path, or even seeing that one exists, can feel nearly impossible.</p><p>This is because most people try to plan forward to their goals, instead of backward from them.</p><p>Imagine an event in your life, something big that you have accomplished. Think back to when you were in the process of completing that goal &#8212; the uncertainty you may have felt. Now look backward from today, and feel how certain the path forward looks in hindsight. When looking backward at the path, the route seems obvious and safe. When looking forward, the path seems obscure and treacherous. The objective, then, is to imagine yourself having already completed the goal and then imagine the steps it took you to get there.</p><p>In this way, we can work iteratively backward from your goal to the present day, littering the path forward with checkpoints and benchmarks that provide clear guideposts along your path.</p><p>For example, let&#8217;s say your goal is to learn Spanish to fluency. This is an extremely large, abstract goal that involves multiple years of effort. Looking forward from today to your goal, reaching the point of fluency can feel impossible with no clear way to achieve it. Let&#8217;s take the opposite approach, and imagine that you already speak Spanish. What would someone who learned Spanish to fluency have already done? They&#8217;ve probably read something like Don Quijote.</p><p>We now have our first checkpoint along the path to fluency&#8202;&#8212;&#8202;reading Don Quijote. This checkpoint now becomes our new endpoint in this iterative process. So now we must ask ourselves, what would someone who has read Don Quijote have already accomplished? That&#8217;s a hard book, so they probably started somewhere easier. Maybe they read through the Harry Potter series to grow their vocabulary and grammar knowledge. There are seven Harry Potter books, giving us a further seven checkpoints along the journey.</p><p>Now, imagine you&#8217;re someone who has already read the first Harry Potter book in Spanish. What would you have already accomplished? You probably need to know a few thousand words and some grammar to read that book, so you would have likely memorized the 3000 most common words in Spanish and worked through a grammar book.</p><p>This becomes our new objective, and we&#8217;ve now reached one that can be acted upon today as a small, daily task. We can start by buying a grammar workbook and finding a list of vocabulary to memorize, and set a goal to work through a small portion of them every day. 
Once we finish the vocabulary list and grammar book, we can continue by working towards the other objectives we&#8217;ve outlined during this exercise, following the journey we laid out.</p><p>In this iterative process, working backward from our true goal, we can develop a plan of checkpoints that carve out the path forward for us. By looking backward into an imagined past, we can determine our very real future, making the vague and abstract, clear and concrete.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Role of Experimentation in Fundamental Machine Learning Research]]></title><description><![CDATA[I recently watched an excellent talk from Dr.]]></description><link>https://www.chrishayduk.com/p/the-role-of-experimentation-in-fundamental</link><guid isPermaLink="false">https://www.chrishayduk.com/p/the-role-of-experimentation-in-fundamental</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Tue, 11 Oct 2022 15:52:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0MV8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0MV8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0MV8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0MV8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg" width="1400" height="878" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:878,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0MV8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@jkoblitz?utm_source=medium&amp;utm_medium=referral">Julia Koblitz</a> on <a 
href="https://unsplash.com/?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure></div><p>I recently watched <a href="https://players.brightcove.net/679256133001/NkgrDczuol_default/index.html?videoId=6291482418001">an excellent talk from Dr. Tom Goldstein</a> given to the National Science Foundation in which he discussed the current limitations of machine learning (ML) research and a path forward to correct those issues. The fundamental thrust of his argument &#8212; that ML research needs to focus more on experimentation and less on theory &#8212; addresses many of the shortcomings in machine learning research and taps on several interesting ideas in theories of the mind, complex systems, and the development of true artificial intelligence (AI).</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.chrishayduk.com/subscribe?"><span>Subscribe now</span></a></p><h1><strong>Taking Lessons from Science</strong></h1><p>In the current fundamental ML research paradigm, experiments tend to be informed by theory. In particular, many researchers attempt to advance machine learning through a math-style research process, in which new theorems are deduced logically from existing theorems, lemmas, and corollaries in the machine learning corpus of knowledge. Experimental studies then attempt to validate these theories, potentially using a toy dataset to demonstrate the theory&#8217;s predictions. In this paradigm, it is unacceptable to publish experimental results that are not supported by theory. In Dr. Goldstein&#8217;s talk, he gives examples of two papers he worked on which produced surprising and counterintuitive results but were based upon empirical experimentation rather than rigorous proof. As a result, both papers struggled to be accepted at reputable conferences. However, theoretical results which are contradicted by experimental evidence tend to still be published despite the apparent inconsistency.</p><p>By contrast, the experiment-based approach used in science inverts the hierarchy in machine learning. Theory becomes subservient to experiment &#8212; the goal of theory switches to explaining what we observe in the real world. Theory is useless if it does not align with already-existing experimental results, and previously accepted theories are tossed out if new experimental results refute them. Porting this paradigm to machine learning research would result in a landscape where most progress comes from attempting new ideas on real-world datasets. Theories would then retrospectively attempt to tie together experimental results from trying new network architectures, hyperparameters, and preprocessing techniques, developing an explanation for the results that we had already verified empirically. This would not only produce theories that are more consistent with how machine learning operates in the real world, but it would also unshackle applied machine learning progress from the constraints of existing theory. 
Fundamental research would now be directly oriented towards demonstrating new ideas empirically on real-world datasets, further accelerating in the application of machine learning.</p><p>While very appealing, this line of thought does raise a key question: why is machine learning research better suited for experimentation-based methods rather than theory-based methods?</p><h1><strong>Complexity Theory</strong></h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vuz8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vuz8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 424w, https://substackcdn.com/image/fetch/$s_!Vuz8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 848w, https://substackcdn.com/image/fetch/$s_!Vuz8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 1272w, https://substackcdn.com/image/fetch/$s_!Vuz8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vuz8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png" width="1400" height="787" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Vuz8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 424w, https://substackcdn.com/image/fetch/$s_!Vuz8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 848w, 
https://substackcdn.com/image/fetch/$s_!Vuz8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 1272w, https://substackcdn.com/image/fetch/$s_!Vuz8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Before we dive into why top-down theories of machine learning are so difficult to construct through deductive logic, let&#8217;s take a brief detour into complexity theory. According to <a href="https://en.wikipedia.org/wiki/Complexity">Wikipedia</a>, complexity is defined as follows,</p><blockquote><p><em>Complexity characterizes the behavior of a system or model whose components interact in multiple ways and follow local rules, meaning there is no reasonable higher instruction to define the various possible interactions.</em></p></blockquote><p>Essentially, complex systems are built upon agents acting under rather simple rules. These agents are typically constrained by distance, available information, or other limiting factors. As a simplified example, think about the interactions between people in an economy. Each person&#8217;s actions are constrained by their geographic setting, their limited knowledge of the world around them, and their available resources. If we consider an economy with no availability to credit, each agent&#8217;s actions essentially consist of buying or selling goods &amp; services within this framework of constraints. While the action space (buy &amp; sell) is rather small and the constraints placed upon each individual agent (knowledge, geographic distance, available resources) limit the scope of their actions significantly, the interactions between the agents and their decisions produce massively complex economies. 
To make it more concrete, while describing the economic decisions available to a merchant or farmer in the Roman Empire might be rather straightforward, describing the economic machine of Rome, which essentially amounts to interactions between many merchants and farmers, is an extremely difficult task.</p><p>This property of complex systems, in which highly complex behaviors develop from the interactions between constrained agents operating from a small set of possible actions, is known as <a href="https://en.wikipedia.org/wiki/Emergence">emergence</a>. This behavior makes describing complex systems in terms of top-down theory extremely difficult, if not impossible. These systems must be defined in terms of the interactions between their constituent parts &#8212; only then does the global behavior become clear. This is a key component for the &#8220;intelligent&#8221; behavior of many systems we observe in nature and is essential to understanding the emergent properties of neural networks.</p><h1><strong>Biological &amp; Artificial Neural Networks as Complex Systems</strong></h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K_ty!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K_ty!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K_ty!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg" width="1280" height="720" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!K_ty!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The brain, the most powerful thinking machine known, consists of about 86 billion neurons. Each individual neuron behaves rather simply &#8212; it receives input from its environment in the form of pressure, stretch, chemical transmitters, and changes of the electric potential across the cell membrane. This input then determines whether the neuron &#8220;turns on&#8221; or not. That is, the voltage of the cell membrane rapidly rises and falls, creating an electrical spike in response to the input. The key piece of the brain, and the reason why neurons are not simple voltage spiking machines, is that each neuron is connected to thousands of other neurons through synapses. In this way, the electrical spikes that occur in one neuron propagate to thousands of others, either inhibiting or facilitating spikes in those neurons. 
In turn, those neurons&#8217; signals propagate to other neurons, creating a cascade of neural activation.</p><p>These cascades of neural activation create complex behavior, such as your ability to read this article while simultaneously being conscious of yourself, your thoughts, and your emotions, despite starting from a rather simple process &#8212; that of the activation of a single neuron. This is best captured by Bassett and Gazzaniga in their 2011 paper <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170818/#:~:text=Complexity%20and%20multiscale%20organization&amp;text=The%20brain%20is%20a%20complex,and%20biological%20basis%20of%20cognition.">Understanding Complexity in the Human Brain</a>:</p><blockquote><p><em>Perhaps most simply, emergence &#8212; of consciousness or otherwise &#8212; in the human brain can be thought of as characterizing the interaction between two broad levels: the mind and the physical brain. To visualize this dichotomy, imagine that you are walking with Leibniz through a mill. Consider that you can blow the mill up in size such that all components are magnified and you can walk among them. All that you find are mechanical components that push against each other but there is little if any trace of the function of the whole mill represented at this level. This analogy points to an important disconnect in the mind&#8211;brain interface: although the material components of the physical brain might be highly decomposable, mental properties seem to be fundamentally indivisible.</em></p></blockquote><p>When we zoom into a single neuron, the functionality appears fairly straightforward, but the emergent properties of the mind are completely obscured. It is the interactions between massive numbers of neurons that drive the highly complex behavior exhibited by humans and other animals with large numbers of interacting neurons and neural connections. In this way, top-down theories of intelligence and brain function have been thwarted &#8212; developing a compact theory of brain function is akin to developing a compact theory describing the interactions of hundreds of millions of people in the US economy.</p><p>Similarly, artificial neural networks are composed of artificial neurons, loosely based on their biological equivalents. Connections between neurons in the network are established, similar to synapses in the human brain, allowing the artificial neurons to transmit signals to one another. Deep neural networks, the most successful example of machine learning in the field today, stacks multiple layers of neurons between the input and output layers. These additional layers allow for massive numbers of neurons and connections (the language model GPT-3 has about 175 billion parameters, roughly corresponding to the number of neurons and connections available in the model). These huge networks, as in the case of biological neural networks, exhibit intelligent behavior despite relatively simple components. This is captured by Testolin, Piccolini, and Suweis in their 2018 paper <a href="https://arxiv.org/abs/1809.10941">Deep Learning Systems as Complex Networks</a>,</p><blockquote><p><em>&#8230;in deep learning even knowing perfectly how a single neuron (node) of the network works does not allow to understand how learning occurs, why these systems work so efficiently in many different tasks, and how they avoid getting trapped in configurations that deteriorate computational performance. 
In these models, interactions play a crucial role during the learning process, therefore a step forward toward a more comprehensive understanding of deep learning systems is their study also in terms of their emerging topological properties.</em></p></blockquote><p>The neurons themselves are governed by quite simple laws, but the interactions between those neurons produce incredibly complex behavior, such as automatic speech-to-text programs, self-driving cars, and facial recognition software. Given what we know about complex systems and the property of emergence, it seems reasonable that deep neural networks would be difficult to describe without accounting for the interactions between the billions of neural connections that make up the network. The quest for top-down theories produced from deductive logical reasoning may be fruitless in this complexity-laden case.</p><h1><strong>Experimental Research as a Solution to Understanding Complexity</strong></h1><p>Now, returning to our original topic, we can begin to see how experimental methods address the fundamental issues with understanding complex systems. Deep neural networks and other machine learning methods based on large-scale interactions between simple components do not lend themselves well to top-down theoretical understanding. However, we can tease out the emergent behavior of these systems by using real-world datasets, choosing a particular behavior we would like to examine, and constructing experiments to understand the emergent behavior of the network in question.</p><p>For example, we may construct experiments to see how the number of iterations it takes for the weights of a convolutional neural network to converge changes as the quality of the image dataset increases or decreases. While this experimental result would not explain fundamentally why the network&#8217;s convergence rate behaves the way it does, it provides us insight into the emergent behavior of the complex system &#8212; namely, the convolutional neural network itself. With enough experimental results of this nature, we may be able to piece together insights to begin understanding the behavior of these networks and how they respond to varying stimuli. I believe this type of understanding as put forth by Dr. Goldstein, while lacking from the perspective of theoretical explanations and justifications, will facilitate the way forward for significant improvements in machine learning, both in academia and industry.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Musings by Chris Hayduk is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>