<h1><a href="https://www.chrishayduk.com/p/reading-recap-q1-2026">Reading Recap: Q1 2026</a></h1>
<p><em>Brief reviews of the books and papers I read from January to March 2026</em></p>
<p>Chris Hayduk · <a href="https://www.chrishayduk.com">Musings by Chris Hayduk</a> · Sun, 19 Apr 2026</p>
<figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/f66de1e8-703f-4367-b54e-cdd7bbffb9b2_1536x1024.png" alt="Header image"></figure>
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The idea for this series started with <a href="https://x.com/ChrisHayduk/status/2038059730476704025?s=20">a tweet I made</a> a few weeks ago.</p><p>Then, a couple of weeks later, <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Anna G&#225;t&quot;,&quot;id&quot;:5533222,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!O5od!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F174b3cbe-5f37-4524-8d92-847b10416022_399x399.jpeg&quot;,&quot;uuid&quot;:&quot;3af6e801-8e1f-4fba-8263-bcb16527242b&quot;}" data-component-name="MentionToDOM"></span> posted a fantastic essay titled <a href="https://american-innocence.com/p/the-sovereign-reader">The Sovereign Reader</a> , which contained this wonderful quote:</p><blockquote><p>The truth is the more unique and personalized the books you read, the more original a thinker you will become. This, in short, is how you become you. I am a strong post-Hegelian believer in the personal duty of coming into our full being throughout our lives. Other than finding a fitting occupation and worthy life companions, cultivating your own mind is the prerequisite for building an existence for yourself that is truly yours.</p></blockquote><p>This quote connected with the ideas I&#8217;d been toying with in my earlier tweet &#8212; many of the most interesting people in the world have likely cultivated themselves and their minds in this sense, and thus, peering into their bookshelves would help us understand the inner self that they cultivated. And I would kill to see what they read, when they read it, and what they thought of it!</p><p>So I thought, &#8220;why not be the change you want to see in the world?&#8221; and decided to share what I&#8217;m reading much more often (and hopefully this inspires some people far more interesting than me to also share what they&#8217;re reading!)</p><p>The result is this article, the first in a quarterly update series in which I list, in chronological order, all the books I&#8217;ve completed this quarter. I&#8217;ll also list some of the notable papers I read in the quarter, though this list won&#8217;t be exhaustive, since my skim-to-deep-read ratio is much higher for papers than for books! Some of these papers and books will have brief reactions and/or reviews accompanying them if they especially stood out to me.</p><p>I hope you enjoy!</p><div><hr></div><h2>Books</h2><p><strong>#1. Claude Hopkins, </strong><em><strong>Scientific Advertising</strong></em></p><p>A pretty useful book capturing the earliest forms of approaches now taken for granted in the advertising industry &#8212; namely, taking a scientific approach to advertising through A/B testing aggressively and tracking conversions tied to each ad type, rather than trying to &#8220;plan&#8221; ad creative without customer feedback.</p><p>Interesting to read if you are in/adjacent to the ad space. Also contains useful nuggets on how to think about advertising. The one concept that really stuck with me is that advertising is <em>not </em>for building brand awareness, getting your name out there, or any other ancillary goal. Instead, it is just sales at scale. 
As a result, when writing your ads, you should imagine that you are standing in front of <em>one</em> of your target customers and trying to sell him or her your product. What would you say to this one person? How long would your pitch be? What would you do to legitimize yourself, your company, and your product to this one person? Think through these questions and write your ad based on your answers.</p>
<p><strong>#2. Alexandre Dumas, <em>The Count of Monte Cristo</em></strong></p>
<p>I don’t really know where to begin with this book because I loved it so much. Read it if you haven’t. Re-read it if you have. It hooks you from the first page and doesn’t let go for its full 1,300+ pages.</p>
<p>If there is one theme I can give you that stuck with me through the book, it is an interesting parallel to <em>Crime and Punishment</em>. The protagonists of both books, Edmond Dantès here and Raskolnikov in <em>Crime and Punishment</em>, harbor ideas of grandeur, of being a superior man who does not need to obey the laws and customs of inferior men. But when they come into contact with human connection, those ideas fall apart. Dantès wavers in his plan to kill Albert de Morcerf after talking to Mercédès. He also resolves to save Valentine de Villefort after Maximilien Morrel reveals that he is in love with her. These contacts with love and social connection cause him to soften his self-appointed mission of Providence and remove some of the suffering he had planned for his enemies. Similarly, when Raskolnikov is alone, he frequently dwells on his superiority to others and hates those around him. But when he comes into contact with others, especially those who inspire tenderness in him, like Sonya Marmeladov or her dying father, he behaves with striking altruism, abandoning his professed ideal that he is above the common man and may act however he pleases so long as it suits his aims.</p>
<p>To connect both to Nietzsche (who was explicitly influenced by <em>Crime and Punishment</em> and likely influenced by <em>The Count of Monte Cristo</em>) and his conception of the Übermensch, both authors seem to assert that removing oneself from the rest of humanity is necessary to become the Übermensch.</p>
<p>For Dostoevsky, this task is impossible and results in the destruction of the individual who attempts it. We find salvation only by connecting with humanity, not by setting ourselves apart. By contrast, for Dumas, this represents a genuine tension — Dantès <em>is</em> able to set himself apart and <em>does</em> act as a superior man, or Übermensch. And this does bring him fulfillment in some ways (namely, avenging his wrongful imprisonment and his father’s death). However, this choice <em>also</em> precludes his happiness in other ways: while acting as the Count, he must make choices that hurt those closest to him.</p>
<p><strong>#3. Dan Wang, <em>Breakneck: China’s Quest to Engineer the Future</em></strong></p>
<p>I found Wang’s now-famous framing of the US as “the lawyerly society” and China as “the engineering state” very illuminating — it is a neat little framework that helps you make sense of the many strengths and weaknesses of each country. However, I thought the most useful part of the book was not its thesis or the framework it introduced, but rather the stories Wang tells throughout.
He lived in China from 2017 to 2023, experiencing firsthand the country’s rapid economic growth and the way it continually reshaped society. But he also experienced the early authoritarian turns instituted by Xi Jinping, as well as the COVID-19 pandemic and the dramatic lockdown of Shanghai in 2022. Getting an insider’s view of Chinese society in this period was clarifying in a way that the many economic and historical analyses of China are not.</p>
<p>And, of course, <a href="https://danwang.co/">Dan Wang’s annual letters</a> are self-recommending for more of this type of writing.</p>
<p><strong>#4. Eric Berger, <em>Liftoff: Elon Musk and the Desperate Early Days That Launched SpaceX</em></strong></p>
<p>At this point, I’ve read a few books on Elon and listened to many podcasts about him or including him, so there wasn’t much in the way of surprises in <em>Liftoff</em>. What this book does well, however, is impart the <em>feeling</em> of the life-or-death struggle that followed Elon around in those first few years at SpaceX. The company existed on the razor’s edge for essentially all of its early life, and we get to see the intensity with which Elon drove his employees (and himself) to ensure the company survived.</p>
<p><strong>#5. Ali Aminian, <em>Generative AI System Design Interview</em></strong></p>
<p>Fairly useful for interview prep; it pairs well with the other books in the series (<em>Machine Learning System Design Interview</em> and <em>System Design Interview</em> Volumes I &amp; II).</p>
<p><strong>#6. Jacques Hadamard, <em>The Mathematician’s Mind: The Psychology of Invention in the Mathematical Field</em></strong></p>
<p>In this book, Jacques Hadamard asks, “How do the world’s greatest mathematicians and physicists actually make their field-defining discoveries? What is going on inside their heads when they solve these supremely difficult problems?”</p>
<p>Being an elite mathematician himself, Hadamard was able to interview some of the greatest minds of his era to learn the answer. The book compiles vignettes from Poincaré to Einstein, probing the question from different angles, and essentially lands on the following answer — the rational, conscious mind is useful for getting us started on a problem and for verifying an answer once we have it, but the actual work of producing a novel insight is done by the unconscious mind. The discovery process he deduced from his observations, and that he asserts all great mathematical discoveries abide by, is as follows:</p>
<ol><li><p><strong>Preparation</strong> (primarily conscious) — the conscious mind focuses on a problem for an extended period of time, collecting relevant information and trying out several avenues of solution.</p></li><li><p><strong>Incubation</strong> (primarily unconscious) — the unconscious mind, directed in its goals by the focus of the conscious mind in the Preparation stage, sets to work searching for high-level solutions. This is where the bulk of problem-solving and discovery is actually done. The unconscious mind is better than the conscious mind at viewing the problem as a “whole” and at uncovering unexpected insights and connections.
It evaluates candidate solutions based on aesthetic criteria.</p></li><li><p><strong>Illumination</strong> (primarily unconscious) — an idea generated by the unconscious mind that satisfies these aesthetic criteria springs forth into the conscious mind.</p></li><li><p><strong>Verification</strong> (primarily conscious) — the conscious mind sets to work translating the unconscious’s idea into formal mathematical language and verifies that it is logically correct.</p></li></ol>
<p>This book inspired my recent essay, <a href="https://www.chrishayduk.com/p/the-unreasonable-effectiveness-of">The Unreasonable Effectiveness of LLMs in Mathematics</a>.</p>
<p><strong>#7. Stendhal, <em>The Red and The Black</em></strong></p>
<p>After reading <em>Crime and Punishment</em> last year, I’ve become slightly obsessed with the crisis that Napoleon presented to literature and philosophy after his meteoric rise and fall in the late 18th and early 19th centuries. I’ll start by giving a brief sketch of how I see the situation and the intellectual crisis that followed.</p>
<p>For the 300 or so years prior to Napoleon, philosophy had been building toward a notion of equality among people. The Enlightenment supercharged this, popularizing ideas about egalitarianism, self-determination, and natural rights. The American Revolution then put these abstract ideas into practice, and was quickly followed by the French Revolution, which sought to put an even more radical version of Enlightenment ideals into practice.</p>
<p>The Enlightenment’s political philosophy and these two revolutions were rebelling against the prevailing traditional doctrines of the time. Power and the right to rule were seen as hereditary — God ordained a specific line to rule, and this right was passed down from father to eldest son in an unbroken chain. Moreover, the structure of society itself was seen as a crucial part of any nation. The peasants were meant to be peasants, the rulers were meant to rule. To go against this was to go against the natural order ordained by God.</p>
<p>These two groups were locked in an existential struggle in the late 1700s when Napoleon erupted onto the scene, forever changing the trajectory of the world. Through unparalleled competence, energy, and charisma, he became emperor of France, and then nearly the emperor of Europe. He reshaped society at the grand scale of historic battles, as well as at the granular scale of laws, regulations, and standards. By the time of his final defeat, there was not one piece of European life that did not have Napoleon’s fingerprints on it.</p>
<p>His almost-mythic life refuted both groups’ ideals.</p>
<p>The traditionalists asserted that humans <em>can</em> influence history, but that this is solely the province of hereditary kings (i.e., greatness comes from family lineage). The new ideas of the French Revolution held that humans cannot individually steer history and that all humans are equal in this impotence — no one is inherently greater or more capable than anyone else. Napoleon refutes both — he <em>is</em> history, and his will controls the direction of events. But he is <em>not</em> a hereditary monarch; he is an upstart rising on his own merits. At once, he says to the monarchs, “Some men are destined to rule, but you are not those men. You don’t deserve your lot,” and to the egalitarians, “You are not equal to me.
None could accomplish what I have done.”</p>
<p>However, he also synthesized both positions. The egalitarians are right that monarchs are not predestined to rule — Napoleon shows that there are others with greater merit than the hereditary kings of Europe. He also validates the traditionalists’ view that people are not equal — he is clearly more fit to rule than those who preceded him in the French Revolution.</p>
<p>This is what caused the crisis in the literature and philosophy of the 19th century — whether you’re an egalitarian or a traditionalist, how do you explain the problem of Napoleon? And what does the fact that something like him occurred imply about the world?</p>
<p>Now, with all that background context out of the way: Stendhal tackles the problem of Napoleon from a sociological angle. He observes that Napoleon arose in a time of profound change, with France submerged in chaos due to the Revolution, and asks, “What happens when a Napoleonic figure arises in a society that is totally resistant to change?”</p>
<p>His answer is the character Julien Sorel, an extremely intelligent, Napoleon-obsessed youth from the lower class of the fictional town of Verrières, living during the Bourbon Restoration. Like his idol, Sorel possesses a formidable mind and an intense ambition to rise from his lowly starting place in society. But unlike his idol, Sorel’s France does not reward competence and ambition. Instead, it places a premium on playing social roles and playing them well. As a result, Sorel’s ambition, which in an earlier age would have been channeled into great acts of heroism on the battlefield (or, in a future age, into founding a great company), is instead funneled into playing the roles that society values in order to advance.</p>
<p>Thus, we see Sorel become a priest despite not believing in God, start affairs with various upper-class women without loving them, and tutor Latin despite not enjoying teaching. He does these things because they are the legible paths by which a member of the lower class can move toward the upper class (but never fully join it).</p>
<p>Sorel is acutely aware of the incongruity between Napoleon’s life and his own — he frequently ruminates that his own quest for power, which involves lying and assuming social roles, is pathetic compared to Napoleon’s world-changing deeds. This incongruity forms the heart of the book and demonstrates how necessary favorable cultural and societal circumstances are for a Napoleonic figure to be fully actualized.</p>
<p><strong>#8. Fyodor Dostoevsky, <em>Notes from Underground</em></strong></p>
<p>Dostoevsky remains a writer whose prose is fairly unenjoyable for me to read, but whose themes and ideas sit with me long after I have finished the book. <em>Notes from Underground</em> will likely require several re-reads for me to fully absorb, but on my first pass, I came away with the following major takeaway: Dostoevsky strongly opposes the systematization of humanity.</p>
<p>In the first part of the book, the Underground Man spends much of his time asserting that the utilitarian ethic — namely, converting the good to a function that must be maximized — does not lead to fulfillment. And, even more strikingly, he says that elevating this conception of life to the highest good would necessitate the loss of free will. If we are optimizing a value function, there is a set of actions that maximizes it.
If we could wave a magic wand and have this set of actions revealed to us, then it would be congruous with our stated highest principle to follow these actions to the letter, and incongruous to assert any form of free will. Hence, the highest form of “good” in the utilitarian perspective necessitates a loss of free will, a reduction of humans to automatons. The Underground Man states this and then observes that, if you were to reveal this value function to people, they would act against it just to spite it and assert their free will. And if you forced humans to live this way and renounce their free will, it would be spiritual torture. Hence, the rational, utility-maximizing approach is anti-human.</p>
<p>In the second part of the book, the Underground Man seems to espouse a philosophy that, on its surface, is diametrically opposed to utility maximization. Instead, he seems to live his life through the most exaggerated form of performative Romanticism. He constantly makes irrational decisions based on fleeting passions that overtake him. However, he is very self-conscious about these actions — he does not act immediately upon feeling the emotion. Instead, he thinks that he <em>should</em> take action x in response to feeling y and ruminates on the fact that he <em>didn’t</em> take action x in the moment (sometimes for months at a time). It is only after berating himself that he takes his “passionate” action in response to his emotions. In this way, the Underground Man forces himself to exist within a system that constrains his behavior — it is different and less legible than the utilitarian system he railed against in the first part of the book, but it nonetheless constrains his freedom of will substantially, and leads to a similar spiritual sickness.</p>
<p>In both cases — through the Underground Man’s words in part 1 and through his actions in part 2 — we see the failure of forcing humanity into a behavioral box. As far as I can tell, Dostoevsky aimed to show that by constraining human behavior in this way, we harm ourselves and become something like the Underground Man.</p>
<p><strong>#9. Sally Smith Hughes, <em>Genentech: The Beginnings of Biotech</em></strong></p>
<p>A really fun history of the first major biotech company and the recombinant DNA technology that enabled it. It’s interesting to see the obstacles Herb Boyer and Bob Swanson had to overcome to make Genentech a reality — from raising capital from highly skeptical investors to convincing scientists that joining a company could be both scientifically meaningful and monetarily rewarding, the pair had to navigate from scratch many issues whose solutions startups now take for granted. The contrast is stark with today’s AI boom, where investors will hand billions to newly founded AI labs without the faintest hint of a business plan or even a differentiating idea.</p>
<hr>
<h2>Papers</h2>
<p><em>For the papers I don’t have much to say about beyond the contribution itself, I’ll just give the challenge and the solution. That is, I’ll state the main challenge the paper addresses, followed by a brief description of the solution it proposes.</em></p>
<p><strong>#1. Chai et al., <em>MultiPath: Multiple Probabilistic Anchor Trajectory Hypotheses for Behavior Prediction</em> (2019)</strong></p>
<p><strong>Challenge:</strong> Predicting the future motion of agents in a scene is hard because the set of potential outcomes is highly uncertain and multi-modal. Any approach that predicts only one future trajectory has inherent problems. For example, how many times would we need to sample from such a model to be confident that a pedestrian won’t jaywalk?</p>
<p><strong>Solution:</strong> The paper introduces MultiPath, an approach that uses a fixed set of anchor trajectories (namely, the modes extracted from the dataset). The model then learns to output a probability distribution over these anchors, as well as an offset Gaussian per anchor that captures the deviation from the anchor (via the mean) and the position’s uncertainty (via the covariance). This setup lets us model two kinds of uncertainty (a minimal code sketch of the resulting output head follows the Waymax summary below):</p>
<ol><li><p><em>Intent uncertainty</em>, captured by the distribution over anchor trajectories</p></li><li><p><em>Control uncertainty</em>, captured by the probabilistic distribution over the agent’s state at each time step, conditioned on selecting a particular anchor</p></li></ol>
<p><strong>#2. Nayakanti et al., <em>Wayformer: Motion Forecasting via Simple &amp; Efficient Attention Networks</em> (2022)</strong></p>
<p><strong>Challenge:</strong> Self-driving cars receive inputs of many different modalities (velocities, positions, images, LiDAR, road graphs, traffic light states, etc.) and types (static vs. dynamic). Many solutions rely on combining results from modality-specific models, leading to highly complex and brittle systems.</p>
<p><strong>Solution:</strong> The paper introduces Wayformer, a family of simple, homogeneous attention-based architectures for motion forecasting. Wayformer offers a compact model comprising an attention-based scene encoder and a decoder. Ablations showed that early fusion of modalities performed best in this architecture, and the model achieved SOTA on the Waymo Open Dataset.</p>
<p><strong>#3. Gulino et al., <em>Waymax: An Accelerated, Data-Driven Simulator for Large-Scale Autonomous Driving Research</em> (2023)</strong></p>
<p><strong>Challenge:</strong> Simulation is crucial for generating larger training datasets and for evaluating autonomous-vehicle planning software. However, existing simulators are typically very expensive to run, may not reliably match real-world data when they generate scenes, and may not exhibit realistic agent behavior.</p>
<p><strong>Solution:</strong> Waymax addresses these problems by:</p>
<ol><li><p>Improving speed by writing the simulator in JAX, which allows Waymax to run on hardware accelerators (GPUs/TPUs)</p></li><li><p>Initializing simulations from real-world data in the Waymo Open Dataset, which grounds simulations in real trajectories</p></li><li><p>Using agent models to control the dynamic objects in the scene</p></li></ol>
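<p>As promised above, here is a minimal PyTorch sketch of a MultiPath-style output head. This is my own illustration of the idea rather than the paper’s code: the feature shapes, names, and anchor handling are assumptions, and the real model conditions on a much richer scene encoding.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class MultiPathHead(nn.Module):
    """Sketch of a MultiPath-style output head (shapes and names assumed).

    Given pooled scene features, predicts:
      * a categorical distribution over K fixed anchor trajectories
        (intent uncertainty), and
      * per-anchor, per-timestep 2D Gaussian offsets from the anchor
        (control uncertainty).
    """

    def __init__(self, d_model: int, num_anchors: int, horizon: int):
        super().__init__()
        self.K, self.T = num_anchors, horizon
        self.anchor_logits = nn.Linear(d_model, num_anchors)
        # Per anchor and timestep: 2 offset means (x, y) plus 3 covariance
        # parameters (log sigma_x, log sigma_y, correlation) of a 2D Gaussian.
        self.gauss_params = nn.Linear(d_model, num_anchors * horizon * 5)

    def forward(self, scene_emb: torch.Tensor):
        # scene_emb: (batch, d_model) pooled agent/scene encoding.
        probs = torch.softmax(self.anchor_logits(scene_emb), dim=-1)  # (B, K)
        p = self.gauss_params(scene_emb).view(-1, self.K, self.T, 5)
        mu = p[..., 0:2]             # mean offset from each anchor waypoint
        sigma = p[..., 2:4].exp()    # positive standard deviations
        rho = torch.tanh(p[..., 4])  # correlation in (-1, 1)
        return probs, mu, sigma, rho

# The predicted future is then a Gaussian mixture over anchors a^k:
#   p(s) = sum_k probs_k * prod_t N(s_t ; a^k_t + mu^k_t, Sigma^k_t)
head = MultiPathHead(d_model=128, num_anchors=16, horizon=80)
probs, mu, sigma, rho = head(torch.randn(4, 128))
print(probs.shape, mu.shape)  # (4, 16) and (4, 16, 80, 2)
</code></pre>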
<p><strong>#4. Jumper et al., <em>Highly accurate protein structure prediction with AlphaFold</em> (2021)</strong></p>
<p>This was not my first time through this seminal paper (or the second, or the third…), but it <em>was</em> the first time I went through it with the depth required to <a href="https://github.com/ChrisHayduk/minAlphaFold2/tree/main">reimplement AlphaFold2 from scratch</a> (note: still very much a work in progress!).</p>
<p>The AlphaFold2 paper itself is a relatively easy read, providing a brief overview of how the model works, how it fits within the frameworks of previous approaches, and the unprecedented performance it achieves. But it leaves a sizeable gap between a high-level understanding of what the model is doing and a <em>true</em> understanding of its inner workings.</p>
<p>For that, we need to dive into the 62-page supplementary paper (a full five times longer than the main paper!).</p>
<p>This is where the ingenuity of the AlphaFold team, as well as their clarity of insight, really shines. Each component of the model, labeled as an algorithm in the paper, is explained using a combination of text, diagrams, and complete pseudocode. In addition, the paper clearly outlines how the algorithms are stitched together in higher-level functions, such as the AlphaFold2 training loop or inference pass. And if that’s not enough, it even goes into great detail on the training setup &amp; parameters, includes ablation studies to determine which parts of the network are truly necessary, and visualizes the attention matrices to probe what the model has actually “learned”.</p>
<p>To me, the clarity of the supplementary paper felt as if it were screaming out that it <em>needed</em> a code implementation that was just as clear: one not intended for production use, without dozens of Dockerfiles, deployment scripts, and ancillary supporting files, but instead emphasizing clarity, pedagogy, and alignment with the paper’s algorithm breakdown.</p>
<p>This, along with Andrej Karpathy’s minGPT, was the inspiration for my <a href="https://github.com/ChrisHayduk/minAlphaFold2/tree/main">minAlphaFold2</a> project. The goal is to align the implementation exactly with the supplementary paper’s structure and pseudocode, using only PyTorch primitives, with an emphasis on simplicity, clarity, and explanatory comments. Once the code portion of the project is complete, stay on the lookout for an “Annotated AlphaFold2”, in the same vein as Harvard’s <a href="https://nlp.seas.harvard.edu/annotated-transformer/">Annotated Transformer</a> project.</p>
<p>AlphaFold2 (along with GPT-3) represents one of the two largest breakthroughs in the history of artificial intelligence. And yet the pedagogical material available for AF2 is essentially non-existent compared to what is available for GPT-3-style models. I hope to do a small part in changing this.</p>
<p><strong>#5. Zhu et al., <em>Scaling Latent Reasoning via Looped Language Models</em> (2025)</strong></p>
<p>So my brief speculation tied to this paper caused quite a stir on ML Twitter:</p>
<figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/abe70d9a-6c56-4dad-b9c5-d0470d7f988a_1206x1642.png" alt="Screenshot of my tweet speculating about looped architectures in frontier models"></figure>
<p>This was not meant to be taken as fact (or to blow up the way it did)! I was mostly just speculating based on the patterns I saw in the benchmark scores for Claude Mythos and some inferences about Anthropic’s compute budget. But here I’ll give a brief explanation of <em>why</em> it seems plausible to me that this architectural choice has made its way into the frontier models.</p>
<p>First, I’d like to point out (thank you to <a href="https://x.com/ChrisHayduk/status/2042952450097778958?s=20">Kalomaze on Twitter</a> for setting this straight) that I <em>don’t</em> think the frontier models are using the main contribution of this paper (namely, letting the model choose how many times it loops via a prediction head that estimates the probability it has already arrived at the correct token). My claim is now a bit weaker — that the models <em>may</em> be trained using weight tying. That is, the model loops a deterministic number of times, feeding the last layer’s output embedding back into the input layer. This essentially increases the model’s depth without increasing its total parameter count. As such, I would expect the training process to look more like recycling in AlphaFold2 (see the supplementary paper linked above) rather than looping in the ByteDance paper.</p>
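<p>To make the weight-tying idea concrete, here is a minimal PyTorch sketch. It is my own illustration, not any lab’s actual architecture; the layer counts and loop count are arbitrary assumptions.</p>
<pre><code class="language-python">import torch
import torch.nn as nn

class LoopedTransformer(nn.Module):
    """Weight-tied ("looped") transformer sketch.

    The same n_layers-deep block stack is applied n_loops times, so the
    effective depth is n_layers * n_loops while the parameter count stays
    that of n_layers. Causal masking is omitted for brevity.
    """

    def __init__(self, vocab_size: int, d_model: int = 256, n_layers: int = 2,
                 n_heads: int = 4, n_loops: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.stack = nn.TransformerEncoder(block, num_layers=n_layers)
        self.n_loops = n_loops
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.embed(tokens)
        for i in range(self.n_loops):
            # Feed the stack's output back in as its next input.
            # An AlphaFold2-style "recycling" variant would insert
            # h = h.detach() here on the final iteration only, so gradients
            # flow through just one pass; plain looping backpropagates
            # through all n_loops passes.
            h = self.stack(h)
        return self.lm_head(h)

model = LoopedTransformer(vocab_size=32_000)
logits = model(torch.randint(0, 32_000, (1, 16)))  # (1, 16, 32000)
</code></pre>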
<p>Anyway, with that context out of the way, I’ll give a brief overview of <em>why</em> I think looping seems plausible for frontier models. First, the benchmark scores.</p>
<p>Since looping increases model depth without increasing parameter count, what you end up with is a decoupling of the model’s knowledge (i.e., what it has memorized) from the model’s reasoning (i.e., its ability to manipulate that knowledge).</p>
<p>To explain this phenomenon, I’ll include a screenshot from the ByteDance paper along with a brief explanation of the image from my piece <a href="https://www.chrishayduk.com/p/the-unreasonable-effectiveness-of">The Unreasonable Effectiveness of LLMs in Mathematics</a>:</p>
<figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/c9a5cc39-917c-42de-a08a-23e717f17b21_2170x1094.png" alt="Left: bits of memorized knowledge vs. parameter count for looped and non-looped models. Right: reasoning-benchmark performance for looped vs. non-looped variants."></figure>
<blockquote><p>On the left-hand side of the image [above], the authors compared the bits of knowledge memorized by different trained models (y-axis) to the number of parameters of those models (x-axis). For each model size, they trained a looped and a non-looped variant, represented by large and small circles, respectively. You may not have noticed that there are two circle sizes on the graph, because they overlap almost exactly at every model size. Thus, adding looping does <em>not</em> allow the model to learn more information.</p><p>Now, if we turn our attention to the right-hand side of the image [above], we can see the performance of a baseline transformer and a looped variant at different sizes on a reasoning-heavy benchmark. The numbers jump off the page — at each fixed model size, the looped model <em>vastly</em> outperforms the non-looped model. More astoundingly, a two-layer looped model <em>substantially</em> outperforms the largest non-looped model (the 12-layer variant in the first row).</p></blockquote>
<p>I think we see this dynamic playing out in the benchmark results for Claude Mythos, captured in the Twitter screenshot I shared above. For benchmarks that rely more heavily on stored internal knowledge (e.g., GPQA and MMLU, which evaluate models on multiple-choice questions spanning a wide range of subjects), the improvement over Claude Opus 4.6 is fairly modest. We do see a larger improvement on Humanity’s Last Exam without tools, though those questions are designed to be harder than GPQA and MMLU, so some of the gains there can be attributed to improved reasoning.
(I also want to note here that, of course, Mythos is likely a larger model than Opus 4.6, so it will also have more knowledge memorized — the debate is over how much compute/capacity was devoted to reasoning vs. improved knowledge.)</p>
<p>The benchmark that really jumped off the screen for me was GraphWalks BFS. This benchmark feeds the LLM randomly generated sets of nodes and edges as input. It then asks the LLM to perform various searches on this graph (e.g., find node X’s parent, or start a breadth-first search from node Y and return the list of nodes reached after 2 iterations). These tasks, along with the fact that the nodes &amp; edges don’t represent anything semantically meaningful, mean that of all the benchmarks listed for Mythos, this one skews the <em>most</em> toward reasoning and away from memorized world knowledge. And, as we can see in the image, it shows by far the biggest jump from Opus 4.6 to Mythos.</p>
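<p>To give a feel for the task, here is a toy Python reconstruction of a GraphWalks-style query, based purely on the description above. The node naming, graph sizes, and prompt wording are my own assumptions, not the benchmark’s actual format.</p>
<pre><code class="language-python">import random
from collections import deque

def random_graph(n_nodes: int = 12, n_edges: int = 20, seed: int = 0):
    """Random directed graph over opaque node IDs (no semantic meaning)."""
    rng = random.Random(seed)
    nodes = [f"n{i:02d}" for i in range(n_nodes)]
    edges = {(rng.choice(nodes), rng.choice(nodes)) for _ in range(n_edges)}
    return nodes, sorted(edges)

def bfs_frontier(edges, start: str, depth: int) -> set[str]:
    """Nodes reachable from `start` in at most `depth` BFS steps."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if d == depth:
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return seen - {start}

nodes, edges = random_graph()
# The LLM sees the raw edge list in its prompt and must answer queries like:
#   "Starting a BFS from n03, which nodes do you reach within 2 iterations?"
# The grader simply checks the answer against the ground truth:
print(bfs_frontier(edges, "n03", depth=2))
</code></pre>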
<p>To me, this suggests that Mythos represents a relatively modest parameter scale-up and a much more aggressive depth scale-up, which would align with Mythos being a looped transformer.</p>
<p>Now for my second reason — compute constraints.</p>
<p>At this point, it’s no secret that the major labs are severely compute-constrained. TSMC and NVIDIA have their capacity booked out years in advance, and the explosion in demand coming from coding models has only made this problem worse in recent months. As such, any innovation that can reduce the number of GPUs required to serve each batch of requests is of great importance to the labs — if a batch can be squeezed onto fewer GPUs at an equivalent level of performance, the lab can serve more simultaneous batches and thus drive more revenue at a fixed level of compute.</p>
<p>Since a looped transformer delivers more reasoning depth at a fixed parameter count, what you get is a smarter model that fits in a smaller amount of GPU RAM — exactly the critical constraint facing the model providers.</p>
<p>Again, this is all speculation! I may be entirely off base! But these are the two major reasons that led me to strongly suspect that some form of looping is being used in Claude Mythos.</p>
<hr>
<p>If you’ve made it this far, thank you for reading! This piece ended up being much longer than I had initially anticipated. As Joan Didion famously said, “I don’t know what I think until I write it down”, and writing this article definitely revealed things I didn’t realize I thought about these books and papers.</p>
<p>If there’s one thing I can leave you with, let it be this — please share what you’re reading!</p>
<h1><a href="https://www.chrishayduk.com/p/the-unreasonable-effectiveness-of">The Unreasonable Effectiveness of LLMs in Mathematics</a></h1>
<p><em>A journey into the mathematician’s unconscious</em></p>
<p>Chris Hayduk · <a href="https://www.chrishayduk.com">Musings by Chris Hayduk</a> · Mon, 13 Apr 2026</p>
srcset="https://substackcdn.com/image/fetch/$s_!Nw8q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8586c61b-e4c1-424f-b385-3ec012f9fad3_1536x1024.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Nw8q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8586c61b-e4c1-424f-b385-3ec012f9fad3_1536x1024.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Nw8q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8586c61b-e4c1-424f-b385-3ec012f9fad3_1536x1024.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Nw8q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8586c61b-e4c1-424f-b385-3ec012f9fad3_1536x1024.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image from @alexwei_&#8217;s <a href="https://x.com/alexwei_/status/1946477742855532918">announcement post</a> on X</figcaption></figure></div><p>In July 2024, DeepMind unveiled AlphaProof &#8212; an AlphaZero-inspired agent that constructs mathematical arguments in Lean, a programming language for proofs. It broke new ground in mathematical performance, achieving a silver medal in the 2024 International Math Olympiad.</p><p>One year later, in July 2025, OpenAI announced that they had achieved a gold medal in the 2025 International Math Olympiad using a raw LLM &#8212; no reinforcement learning in Lean space, no translation between natural language and formal proof languages. In the span of a few weeks, this same model would go on to add a gold medal at the International Olympiad in Informatics and a 2nd place finish at the AtCoder World Tour Finals to its achievements.</p><p>How is it possible that a general LLM, one that operates in natural language and would be just as comfortable answering questions about lasagna recipes in ChatGPT as it is scoring a gold medal in the IMO, could defeat a model that was custom-made to solve math problems by thinking directly in proof space? 
And is the current LLM architectural paradigm enough for us to solve mathematics?</p>
<h2>A Brief Explanation of AlphaProof</h2>
<figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/1e50e52d-6ce8-4da2-985b-885e64fb9437_1773x1102.png" alt="Fig. 1: AlphaProof core reasoning components."><figcaption>The core reasoning components of AlphaProof. Figure 1 in the AlphaProof paper</figcaption></figure>
srcset="https://substackcdn.com/image/fetch/$s_!eBNs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e50e52d-6ce8-4da2-985b-885e64fb9437_1773x1102.png 424w, https://substackcdn.com/image/fetch/$s_!eBNs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e50e52d-6ce8-4da2-985b-885e64fb9437_1773x1102.png 848w, https://substackcdn.com/image/fetch/$s_!eBNs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e50e52d-6ce8-4da2-985b-885e64fb9437_1773x1102.png 1272w, https://substackcdn.com/image/fetch/$s_!eBNs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e50e52d-6ce8-4da2-985b-885e64fb9437_1773x1102.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The core reasoning components of AlphaProof. Figure 1 in the AlphaProof paper</figcaption></figure></div><p>AlphaProof is an LLM- and reinforcement-learning-based approach to mathematical proof generation published by DeepMind in November 2025 (with its initial announcement in July 2024).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> The model was inspired by AlphaZero, the successor model to AlphaGo &amp; AlphaGo Zero, which taught itself to play chess, shogi, and Go purely through reinforcement learning (RL) from self-play</p><p>The AlphaProof research team began by translating mathematical problems from natural language into Lean, a formal proof language that allows users to build mathematical arguments through explicit axioms, theorems, and deductive steps. Proofs in Lean are built up one step at a time by applying actions (called tactics) to change the current proof state. Lean guarantees that each step must be logically rigorous and consistent &#8212; if not, the proof won&#8217;t compile. 
<p>To generate a sufficiently large dataset to learn from, a Gemini-based LLM (known as the formalizer) was trained to translate natural-language mathematical statements into Lean (see the figure below for an example).</p>
srcset="https://substackcdn.com/image/fetch/$s_!UCtO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc41e829e-79bf-46f7-9c85-95fc4ac7ea93_4087x1707.jpeg 424w, https://substackcdn.com/image/fetch/$s_!UCtO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc41e829e-79bf-46f7-9c85-95fc4ac7ea93_4087x1707.jpeg 848w, https://substackcdn.com/image/fetch/$s_!UCtO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc41e829e-79bf-46f7-9c85-95fc4ac7ea93_4087x1707.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!UCtO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc41e829e-79bf-46f7-9c85-95fc4ac7ea93_4087x1707.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The formalization system translates natural language math statements to valid Lean code. Extended Data Fig. 2 in the AlphaProof paper</figcaption></figure></div><p>With this setup in place, constructing math becomes a game-like RL environment, where the state is the current proof status, the set of actions is the set of possible Lean tactics, and the reward for each action is -1 (encouraging shorter proofs, since we aim to maximize the cumulative reward).</p><p>AlphaProof gets its tactics from the combination of a prover agent and a search algorithm inspired by AlphaZero. The prover agent is a 3-billion-parameter encoder-decoder transformer model that suggests tactics to apply next (given the current prompt state) and estimates their expected cumulative return (that is, what I would expect my total reward to be starting from the proof state that this action will take me to and continuing until I complete the problem). The tree search algorithm explores sequences of actions suggested by the prover agent and evaluates their results.</p><p>The prover agent learns by training on the Lean problems generated by the formalizer. 
<p>The prover agent learns by training on the Lean problems generated by the formalizer. It, along with the tree search algorithm, generates attempted proofs and receives a learning signal depending on whether a valid proof is found or the agent times out during the search.</p><p>The prover agent + tree search approach allows us to scale along two axes: training time for the prover agent and test-time computation for the tree search algorithm. This dual scaling enables strong performance on held-out IMO problems, as shown in the table below. When we increase tree search time from 2 TPU minutes per problem to 12 TPU hours per problem, validation accuracy jumps from 33.2% to 43.7%.</p>
<figure><figcaption class="image-caption">AlphaProof performance on the held-out IMO validation set. Extracted from Table 1 in the AlphaProof paper</figcaption></figure><p>However, from the above, we can see that the paper uses another scaling axis &#8212; namely, test-time RL (TTRL).</p><p>The way this works is that, for challenging problems, a variant generator creates hundreds of thousands of distinct yet similar formal problems for the prover network to continue training on. The prover then learns from these similar examples, updating its weights as it earns rewards for completing proofs. After TTRL has been executed, the prover agent &#8220;knows&#8221; substantially more about the problem area adjacent to the problem we actually care about, thereby improving its accuracy.</p>
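<p>Schematically, TTRL looks something like the sketch below (reusing the <code>rollout</code> stand-in from earlier). As before, every interface is a hypothetical illustration of the idea, not the paper&#8217;s implementation:</p><pre><code># Hypothetical sketch of test-time RL (TTRL): before attempting a hard
# target problem, the prover trains on auto-generated variants of it.

def test_time_rl(target, variant_generator, prover, lean, num_variants=100_000):
    for variant in variant_generator.make(target, num_variants):
        reward, solved = rollout(variant.initial_state, prover, lean)
        # The prover's weights adapt toward this problem's neighborhood.
        prover.update(variant, reward)
    # Only now, after problem-specific adaptation, attempt the real target.
    return rollout(target.initial_state, prover, lean)</code></pre>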
<p>Thus, to achieve the highest levels of performance, the AlphaProof system needed to overfit on new data that was extremely similar to the questions asked. It did <em>not</em> succeed out of the box at the IMO, even after training on ~80 million formal problems. In particular, on the IMO holdout set (as seen in the table above), AlphaProof required a 4-order-of-magnitude increase in compute budget to go from a 33.2% to a 58.3% success rate. The final 4.4 percentage points of performance (from 53.9% to 58.3%) required an additional order of magnitude in compute (from 3,000 TPU minutes per problem to 30,000 TPU minutes per problem). The paper explicitly states that &#8220;each of these solutions required 2&#8211;3 days of (test-time RL) TTRL, demonstrating substantial problem-specific adaptation at inference.&#8221;</p><p><strong>The main takeaway of all this is the following:</strong> AlphaProof used a highly math-specific RL environment, along with supporting models (e.g., a formalization system and a variant generator) and a formal proof language, to achieve its groundbreaking theorem-proving results. It did indeed set a new state of the art on mathematical benchmarks such as the IMO; however, despite the scaffold already being highly specific to the problem, it was still insufficient, with the model requiring multiple TPU days of test-time RL and hundreds of TPU days of tree-search time per problem to achieve optimal performance. Thus, although highly performant, AlphaProof was <em>not</em> a general-purpose theorem-proving system &#8212; it still required substantial overfitting and custom software to be useful.</p><h2>GPT-5.x and <em>The Mathematician&#8217;s Mind</em></h2><p>While AlphaProof performed well with its highly custom scaffold and its approach of overfitting to problems, OpenAI&#8217;s IMO Gold model blew it out of the water without all of these math-specific bells and whistles.</p><p>How is this possible? How can a general-purpose system perform so much better than an application-specific system? And why have all the large labs moved away from these application-specific systems in the intervening ~2 years since AlphaProof&#8217;s release?</p><p>The answer lies in the subconscious mind.</p><p>Jacques Hadamard, one of the great mathematicians of the 20th century, published <em>The Mathematician&#8217;s Mind: The Psychology of Invention in the Mathematical Field</em> in 1945. Building on Henri Poincar&#233;&#8217;s 1908 lecture titled L&#8217;invention math&#233;matique (Mathematical Invention), he interviewed several of the greatest living mathematicians &amp; scientists of the time (including George Polya, Claude L&#233;vi-Strauss, and Albert Einstein) and assessed how their process of mathematical discovery <em>felt</em> from a phenomenological perspective.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> He wanted to answer the following question: What is occurring in the minds of our greatest mathematicians when they make new discoveries?</p><p>The answer Hadamard found, synthesizing results from these interviews, was that mathematical discoveries arise from the interplay between conscious and unconscious processes. And, in particular, the idea for the proof tends to start in the subconscious.</p><blockquote><p>&#8230;let us remember that every mental work and especially the work of discovery implies the cooperation of the unconscious, be it the superficial or (fairly often) the more or less remote one; that, inside of that unconscious (resulting from a preliminary conscious work), there is that starting of ideas which Poincar&#233; has compared to a projection of atoms and which can be more or less scattered; that concrete representations are generally used by the mind for the maintenance and synthesis of combinations. This carries, in the first place, the consequence that, <strong>strictly speaking, there is hardly any completely logical discovery. Some intervention of intuition issuing from the unconscious is necessary at least to initiate the logical work. </strong>[<em>emphasis mine</em>]</p></blockquote><p>Hadamard asserts that <em>no</em> mathematical discovery is purely logical. The unconscious mind, in all cases that he examined, played a crucial role in the development of rigorous mathematical arguments. This role, and the handoffs between the subconscious and conscious minds, were distilled by Hadamard into the following framework for mathematical discovery:</p><ol><li><p><strong>Preparation</strong> (primarily conscious) &#8212; the conscious mind focuses on a problem for an extended period of time, collecting relevant information and trying out several avenues of solution.</p></li><li><p><strong>Incubation</strong> (primarily unconscious) &#8212; the unconscious mind, directed in its goals by the focus of the conscious mind in the Preparation stage, sets to work searching for high-level solutions. This is where the bulk of problem-solving and discovery is actually done. The unconscious mind is better at viewing the problem as a &#8220;whole&#8221; and at uncovering unexpected insights and connections than the conscious mind. 
The unconscious mind evaluates proposed solutions based on aesthetic criteria.</p></li><li><p><strong>Illumination</strong> (primarily unconscious) &#8212; an idea generated by the unconscious mind that satisfies these aesthetic criteria springs forth into the conscious mind.</p></li><li><p><strong>Verification</strong> (primarily conscious) &#8212; the conscious mind sets to work translating the unconscious mind&#8217;s idea into formal mathematical language and verifies that it is logically correct.</p></li></ol><p>Thus, we can see that the unconscious mind is actually responsible for <em>generating</em> the proof structure. The conscious mind merely sets the scene during the Preparation stage and checks the suggested proof structure during the Verification stage. It does <em>not</em> produce the critical insights into the proof&#8217;s structure that actually lead to the discovery itself. As a result, we can say that the discovery process is decidedly not rigorous.</p><p>AlphaProof&#8217;s downfall was that it was designed to act like the rigorous conscious mind &#8212; every step must be rigorous and logically consistent with previous steps, and the proof is built up step by step. In this manner, AlphaProof acts at a <em>local</em> level, with the prover trying to find the best next incremental step in a proof, whereas the mathematician&#8217;s unconscious mind works at a <em>global</em> level by identifying a full proof sketch all at once. Only then does the conscious mind fill in the rigorous details.</p><p>Based on this insight, we see that an AlphaProof-like system is most closely aligned with the Preparation and Verification stages of Hadamard&#8217;s framework. It completely omits the Incubation and Illumination phases, where the actual work of discovery occurs.</p><p>OpenAI&#8217;s gold-medal-winning model, as a standard LLM deployed for mathematical use cases, represents a new paradigm in mathematical reasoning that relaxes AlphaProof&#8217;s constraint that every thought be rigorous. It enables messier, higher-level reasoning in language space rather than in the Lean-verified proof space. The system can think through and mentally pressure-test high-level approaches &amp; proof sketches, rather than being forced to focus exclusively on granular proof steps, as in AlphaProof.</p><p>To demonstrate this, I provided GLM 5.1 with question C4 (a medium-difficulty combinatorics problem) from the 2024 International Math Olympiad. By using an open-source model, we can see the full reasoning trace and how its thought process leads to its final answer.</p><p>In the excerpt of the model&#8217;s reasoning trace provided below, you can see how messy and non-rigorous the thinking truly is. 
The model is jumping around in conceptual space, backtracking frequently, and probing different high-level directions.</p>
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Reasoning excerpt from GLM 5.1&#8217;s attempt to solve question C4 from IMO 2024.</figcaption></figure></div><p>We can see from the above example that this brings us closer in line with the Incubation and Illumination phases of Hadamard&#8217;s process &#8212; the LLM is able to search for high-level solutions and make jumps that are <em>not</em> logically rigorous and are <em>not </em>incremental (in the sense that they do not need to be at the scale of a single tactic in Lean). The model is free to jump around in natural language between several, potentially unrelated concepts. It is only forced to converge on a rigorous, verified solution at the end of its answer once one of the high-level solutions satisfies its aesthetic sense.</p><p>This relaxation of constraints on the thinking process also explains how the system can generalize across domains &#8212; the model does not need all the math-specific scaffolding and so can learn a more general process of discovery. This process, as outlined by Hadamard, is not limited to math (as evidenced by Einstein's inclusion in the group). In fact, it seems to me that discovery in <em>any</em> domain that can culminate in a Verification step will follow this overarching process. And we do see this in the success that OpenAI&#8217;s model enjoyed at the IOI and AtCoder competitions shortly following its gold medal at the IMO.</p><p>So, is math (and all other easily verified domains) solved with this paradigm? </p><p>Not quite. While OpenAI&#8217;s gold medal LLM represents a meaningful improvement over AlphaProof in bringing AI proof systems into alignment with how mathematicians actually think, there is still a glaring gap: the model is required to reason in language space and produce one token at a time. Although this allows for broader, less rigorous thought than AlphaProof, it is still far more constrained than Hadamard&#8217;s description of the unconscious. Specifically, Hadamard asserts that, in addition to being non-rigorous, unconscious thought is often not even interpretable. He says that at this stage, all mathematicians think without language or precise symbols, and many do not even use clear images. 
The ideas themselves are vague, amorphous, and global. Hadamard says:</p><blockquote><p>Practically all of [the mathematicians interviewed]&#8230; avoid not only the use of mental words but also, just as I do, the mental use of algebraic or any other precise signs; also as in my case, they use vague images.</p></blockquote><p>This differs substantially from LLMs, in which each thought has a fixed amount of computation applied to it, after which that thought must be made concrete in the form of a token. As a result, current systems are limited in how closely they can adhere to Hadamard&#8217;s discovery framework. This fundamentally limits how creative they can be &#8212; the more we force thoughts to be interpretable and legible, the less unexpected and amorphous they will be, and thus the less they will approximate the unconscious mind.</p><h2>The Reasoning Paradigm of the Future</h2><p>As Hadamard astutely observed, true discovery occurs at the subconscious level rather than through rational, conscious processes. When moving from AlphaProof to OpenAI&#8217;s IMO Gold model, we took one step towards the unconscious &#8212; from the uber-rational realm of Lean and step-by-step proof construction into the messy, unrigorous world of natural-language thought.</p><p>Although this thought is messy, it is still articulable and thus operates at a level above what we call the subconscious. Reasoning in natural language allows for more flexibility than reasoning in Lean, but we know that mathematicians don&#8217;t reason in language <em>at all</em> during the Incubation and Illumination steps.</p><p>If the bottleneck is the token itself &#8212; the forced collapse of each thought into a discrete, legible symbol &#8212; then the natural question is whether we can let models reason <em>before</em> that collapse happens. What would it look like for an LLM to think in a medium richer than language?</p><p><strong>For LLMs, the level below token-based thought is embedding-based thought.</strong> A model's actual computation lives in embedding space before it's projected down into the vocabulary &#8212; just as the mathematician's thoughts live in the unconscious during Incubation before being projected into the conscious mind during Illumination. Each token is a lossy compression of a much higher-dimensional internal state, as the short sketch below illustrates.</p><p>Recall Hadamard's observation that mathematicians reason in vague images rather than precise signs. An embedding is, in a loose sense, exactly that: a vague, high-dimensional representation that has not yet been forced into a precise sign.</p>
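<p>Here is a tiny PyTorch illustration of that collapse (my own, purely to make the point; the dimensions are arbitrary). The hidden state carries thousands of dimensions of information, and emitting a token discards nearly all of it:</p><pre><code># Illustrative only: how a high-dimensional "thought" collapses to a token.
import torch

d_model, vocab_size = 4096, 128_000
hidden = torch.randn(d_model)             # internal state: 4,096 floats
W_out = torch.randn(vocab_size, d_model)  # output projection ("unembedding")

logits = W_out @ hidden                   # score every vocabulary entry
token_id = int(logits.argmax())           # collapse: one discrete symbol survives
# The next step of "thought" sees only this single integer (via its
# embedding); the rest of the 4,096-dimensional state is thrown away.</code></pre>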
<p>Thus, rather than forcing the model to output its reasoning in token space, we can allow it to reason in this much higher-dimensional embedding space found within the model's internals. This allows for much more abstract and &#8220;unconscious&#8221; thought than standard token-based reasoning does, aligning more closely with Hadamard&#8217;s concept of unconscious thought.</p><p>ByteDance implemented this idea in their November 2025 paper titled &#8220;Scaling Latent Reasoning via Looped Language Models,&#8221; which developed a model family called Ouro.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> This model family is what is known as a &#8220;looped language model&#8221; &#8212; that is, rather than outputting a token once the model reaches the last layer in its computation graph, it loops back on itself, feeding the embedding from layer N back in as new input to layer 1 (see the leftmost diagram in the image below).</p>
<figure><figcaption class="image-caption">Performance is less &#8220;spiky&#8221; for the looped language model. Figure 1 in the Looped Language Models paper</figcaption></figure><p>As a result, the model can learn to reason deeply about tokens <em>without first converting them to language space</em>. Moreover, the ByteDance team integrated this reasoning process directly into pretraining and implemented a mechanism for the model to decide when to stop looping during inference, based on a predicted exit probability. This allows the model to more closely integrate its reasoning capabilities with its world knowledge (since the looping is trained during the pretraining stage), as well as to learn to modulate its reasoning depth based on the difficulty of the next token.</p><p>And, most importantly, this brings the model&#8217;s reasoning process much closer to Hadamard&#8217;s core Incubation and Illumination steps.</p>
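<p>A minimal sketch of the mechanism is below, under simplifying assumptions of my own: the module names, the two-layer shared stack, and the thresholded exit rule are all illustrative, not the Ouro architecture:</p><pre><code># Hypothetical sketch of a looped transformer step with an adaptive exit.
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, max_loops: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)
        self.exit_head = nn.Linear(d_model, 1)  # predicts P(stop looping)
        self.max_loops = max_loops

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model), the latent "thought" being refined.
        for _ in range(self.max_loops):
            h = self.shared(h)               # the last layer's output re-enters layer 1
            p_exit = torch.sigmoid(self.exit_head(h[:, -1]))
            if bool((p_exit > 0.5).all()):   # confident enough to stop looping
                break                        # easy tokens get fewer loops
        return h  # only now is h projected into vocabulary space</code></pre>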
<p>With this setup and its proximity to Hadamard&#8217;s Discovery process, a very general-purpose approach to reasoning in verifiable domains, we would expect the model to perform better across a number of benchmarks. And that is <em>precisely</em> what we see. Not only does the model outperform much larger models, but its performance is also less &#8220;spiky&#8221; &#8212; that is, it performs consistently well across many benchmarks rather than showing isolated &#8220;spikes&#8221; in performance on certain benchmarks (see the central and rightmost figures in the image above).</p><p>We can see further evidence that this approach improves reasoning in the image below. On its left-hand side, the authors compare the bits of knowledge memorized by different trained models (y-axis) against the number of parameters of those models (x-axis). For each model size, they trained a looped and a non-looped variant, represented by large and small circles, respectively. It is easy to miss that there are two circle sizes on the graph, because they overlap almost exactly at every model size. Thus, adding looping does <em>not</em> allow the model to memorize more information.</p><p>If we turn our attention to the right-hand side of the image, we can see the performance of a baseline transformer and a looped variant at different sizes on a reasoning-heavy benchmark. The numbers jump off the page &#8212; at each fixed model size, the looped model <em>vastly</em> outperforms the non-looped model. More astoundingly, a two-layer looped model <em>substantially</em> outperforms the largest non-looped model (the 12-layer variant in the first row).</p>
<figure><figcaption class="image-caption">Looping the language models doesn&#8217;t increase memorization capacity, but it does improve the ability to manipulate the memorized knowledge. 
Figure 6 in the Looped Language Models paper.</figcaption></figure><p>Hence, from the above, we can see that applying an embedding-space loop to language models, thus bringing them closer to Hadamard&#8217;s Discovery process, substantially improves reasoning ability. <em>And</em> this reasoning ability seems to generalize better across benchmarks than standard RL on top of language models. As a result, I expect <em>all</em> frontier LLMs to adopt this paradigm if they haven&#8217;t already.</p><p>Stepping back a bit, there is something quietly counterintuitive about the trajectory of the last two years. AlphaProof tried to make machines do math by forcing them to be maximally rigorous, with every proof built up from one logically rigorous step to the next. It was, in many ways, the purest expression of what we used to think mathematical reasoning <em>was</em>. And yet it hit a wall.</p><p>The lesson of OpenAI&#8217;s IMO Gold model, and now of looped language models, is that progress has come not from tightening the constraints on machine reasoning but from loosening them. First, we let models think messily in natural language. Now we are letting them think in a medium that isn&#8217;t even language. Each step has moved further from the clean, legible, step-by-step ideal. And each step has improved the models.</p><p>But loosening the constraints on Incubation only gets us half of Hadamard&#8217;s cycle. I&#8217;ve spent most of this piece arguing that LLMs are getting better at Incubation and Illumination, and that looped models push them further in that direction. But this leaves out the fact that current LLMs are weak at Verification. They hallucinate proofs, confidently assert false lemmas, and have no reliable internal signal for when their reasoning has gone off the rails. AlphaProof&#8217;s step-by-step rigor was, for all its limitations, a genuine Verification engine &#8212; every tactic was checked by Lean before the proof could proceed. Standard LLMs lack any such mechanism. The IMO Gold result is remarkable precisely because the model managed to produce verifiable outputs despite lacking a built-in verifier.</p><p>Thus, the real frontier is synthesis. AlphaProof gave us a formalizer and a verifier but no unconscious. Looped language models are giving us an unconscious but no verifier. Hadamard&#8217;s framework is designed to be a loop &#8212; when Verification fails, the mathematician returns to Preparation with new information about why it failed, which reshapes the next round of Incubation. The system that finally closes this loop will be the first to actually instantiate Hadamard&#8217;s full cycle rather than just its generative half.</p>
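<p>Concretely, the closed loop might look something like the sketch below. Every interface (the latent model&#8217;s <code>incubate</code> step, the formalizer, the Lean checker) is a hypothetical stand-in; to my knowledge, no system today implements this end to end:</p><pre><code># Hypothetical sketch of Hadamard's full cycle as a machine loop.

def discovery_loop(problem, latent_model, formalizer, lean, max_rounds=10):
    context = [problem]                          # Preparation: set the stage
    for _ in range(max_rounds):
        sketch = latent_model.incubate(context)  # Incubation + Illumination
        formal_proof = formalizer.to_lean(sketch)   # make the idea legible
        ok, error = lean.check(formal_proof)        # Verification
        if ok:
            return formal_proof
        context.append(error)  # a failed check reshapes the next incubation
    return None                # cycle exhausted without a verified proof</code></pre>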
<p>This, I think, is the actual research agenda for the next few years, and it has two halves that have to advance together. On one side, we need to keep deepening the unconscious by pushing latent-space reasoning further, scaling up looped architectures, and figuring out how to train them with the kinds of reinforcement signals that sharpen aesthetic judgment. On the other, we need to build formalizers and verifiers for a broader range of fields, so that the unconscious can receive concrete feedback from a Verification step. Neither half is sufficient on its own: the unconscious without a verifier can drift into hallucinations, while the verifier without an unconscious is too rigid to make true intellectual leaps. A synthesis of both is required.</p><p>What this looks like concretely depends on the field. For mathematics, both halves are nearly in place. The unconscious is arriving via looped language models, and the verifier already exists in the form of Lean. All we need to do is use AlphaProof&#8217;s formalizer to convert the looped language model&#8217;s output to Lean, and we&#8217;re good to go.</p><p>The more difficult question is what we do in fields that don&#8217;t already have a Lean. The unconscious half seems to generalize quite well, judging by the less spiky benchmark results from the Looped Language Models paper. But the verifier half does not generalize; it has to be built field by field. And what counts as a verifier in one domain looks nothing like what counts as a verifier in another.</p><p>In physics, building the verifier means investing in experiment automation: robotic labs that can run, measure, and report on experiments proposed by a latent-reasoning model, closing the loop between hypothesis and data without a human in the middle. In biology, the same logic points to high-throughput assay platforms that can test a model&#8217;s proposed interventions at scale, with the same feedback loop into the unconscious. In economics and the social sciences, the verifier half has to be weaker: large-scale simulation environments, prediction markets, or structured forecasting tournaments that can at least provide a noisy verification signal where a clean one is impossible. But even a noisy verifier may be enough to close the loop, so long as the unconscious half is there to propose hypotheses worth testing in the first place.</p><p>Hadamard told us in 1945 that the real work of discovery happens in a place that is neither rigorous nor articulable, and that the conscious mind&#8217;s job is to set the stage and then check the work afterward. Looped language models are finally giving us the unconscious that Hadamard envisioned. The next paradigm wires the verifier into the output of the looped language models, closing the loop on Hadamard&#8217;s Discovery process. The fields where we can build both halves are the fields where Hadamard&#8217;s full cycle will finally run on a machine, and where we should expect AI progress to accelerate most dramatically in the coming years.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p><a href="https://www.nature.com/articles/s41586-025-09833-y">Olympiad-level formal mathematical reasoning with reinforcement learning</a></p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Hadamard, J. (2020). <em>The Mathematician&#8217;s Mind: The Psychology of Invention in the Mathematical Field</em>. 
Princeton University Press.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://arxiv.org/abs/2510.25741">Scaling Latent Reasoning via Looped Language Models</a></p></div></div>]]></content:encoded></item><item><title><![CDATA[2025 Year in Books]]></title><description><![CDATA[A new tradition]]></description><link>https://www.chrishayduk.com/p/2025-year-in-books</link><guid isPermaLink="false">https://www.chrishayduk.com/p/2025-year-in-books</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Fri, 16 Jan 2026 21:35:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uWH1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb342a63-b6ec-4dbf-8bf7-1205baff6c9e_5504x3072.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[
srcset="https://substackcdn.com/image/fetch/$s_!uWH1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb342a63-b6ec-4dbf-8bf7-1205baff6c9e_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uWH1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb342a63-b6ec-4dbf-8bf7-1205baff6c9e_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uWH1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb342a63-b6ec-4dbf-8bf7-1205baff6c9e_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uWH1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdb342a63-b6ec-4dbf-8bf7-1205baff6c9e_5504x3072.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I really enjoyed the format of <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Corin Wagen&quot;,&quot;id&quot;:9321224,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7368e707-6fa8-4446-8376-c151fc2966b3_925x925.jpeg&quot;,&quot;uuid&quot;:&quot;20fd1a07-872d-4b13-a004-3063da3632d7&quot;}" data-component-name="MentionToDOM"></span>&#8217;s book review lists, so I&#8217;m going to make it a yearly tradition to summarize my reading for the year in a similar manner. I&#8217;ll list the completed books in chronological order below. For books that were notable (whether notably bad, notably good, or containing some notable idea), I&#8217;ll include a short blurb with my thoughts and my major takeaways.</p><p><strong>#1. 
Sean McMeekin, </strong><em><strong>Stalin&#8217;s War: A New History of World War II</strong></em></p><p>A history of World War II that starts by asking the following questions:</p><ol><li><p>Which leaders (if any) were in power before, during, and after the war, and were thus positioned to really influence the course of events (the build-up, the actual fighting, and the treaty negotiations afterward)?</p></li><li><p>Which leaders (if any) benefited the most from the war?</p></li><li><p>Which leaders (if any) had a vested interest in the European powers going to war?</p></li><li><p>Which leaders (if any) successfully bent the leaders of the other powers to their will?</p></li></ol><p>McMeekin asserts that, when we stop to consider these questions, there is only one leader who satisfies all four: Joseph Stalin.</p><p>Starting from this point, the book outlines a contrarian history of the war, centering on Stalin's role in instigating the war between Germany and the Western Powers, attempting to contain Germany through resource constraints, and, finally, manipulating the Western Powers to achieve his war aims as cheaply as possible.</p><p>I&#8217;m not enough of an expert on WWII history to refute or support McMeekin&#8217;s claims, but it certainly made for one of the most interesting history reads I&#8217;ve ever had.</p><p><strong>#2. Nicolas Cole, </strong><em><strong>The Art and Business of Online Writing: How to Beat the Game of Capturing and Keeping Attention</strong></em></p><p><strong>#3. Amaury De Riencourt, </strong><em><strong>The Coming Caesars</strong></em></p><p>Published in 1957, this book outlines the progression of American government from its original, firmly aristocratic character (we didn&#8217;t even vote for president in the beginning of the Republic!) to its then-new mass democratic flavor. De Riencourt draws parallels between this progression and the progression of the Greco-Roman world leading up to the end of the Roman Republic and the beginning of the Roman Empire. He uses this parallel to analyze the trajectory of the US and predict the arrival of an eventual American Caesar. </p><p>Unfortunately, this book has become extremely relevant in recent years.</p><p><strong>#4. Niall Ferguson, </strong><em><strong>The Pity of War: Explaining World War I</strong></em></p><p>Another contrarian historical perspective, this time blaming the United Kingdom for the outbreak of World War I. It argues that had the UK decisively demonstrated its willingness to enter the war before its outbreak, or had instead committed to staying out of it, the total devastation for everyone would have been significantly reduced.</p><p>Given the death and destruction wrought by World War I, as well as the knock-on consequences of the war (the Spanish Flu pandemic, the Great Depression, &amp; finally World War II), it&#8217;s hard to argue that the UK&#8217;s decision to support France was optimal in the long run. All of the European powers would likely have been better off if the UK had left France to fend for itself and France had lost to Germany more swiftly. It hardly seems like the counterfactual could be any worse than what Europe endured from 1914 to 1945. However, I&#8217;m not totally convinced that the war was England&#8217;s <em>fault</em> in the sense that Ferguson asserts. As with McMeekin&#8217;s book, I&#8217;ll need to do more reading on the WWI period to really evaluate these claims.</p><p><strong>#5. 
Samo Burja, </strong><em><strong>Great Founder Theory</strong></em></p><p>Some books aim to inform, diving into great detail on a specific concept or event, analyzing all of its minutiae from every angle. Other books aim to enlighten, expounding a big idea that attempts to fundamentally shift your worldview.</p><p>This book is one that aims to enlighten.</p><p>Burja&#8217;s big idea is a sort of re-skinning of the Great Man of History theory &#8212; except, this time, it incorporates institutional design as its centerpiece. </p><p>His theory basically goes as follows: we can measure an individual's &#8220;greatness&#8221; by the effect they have on shaping the world. Any shaping that is limited to the individual&#8217;s lifetime is inherently highly constrained, and thus such an individual cannot be considered a Great Founder. (Note that, in Burja&#8217;s view, this means that Napoleon is not a Great Founder for his military campaigns, but he <em>is </em>a Great Founder for things like standardizing units of measurement, creating conditions that led to growing nationalism throughout Europe, and reorganizing the French military structure to support the first foray into total war). In order to ensure that the individual continues shaping the world long after their death, they must create some mechanism by which to transfer their worldview down through the generations. These are our institutions, and their founders are those who have made the world we live in today.</p><p>The remainder of the book examines the implications of this theory from various angles.</p><p><strong>#6. Amaury De Riencourt, </strong><em><strong>The American Empire</strong></em></p><p><strong>#7. Peter Attia, </strong><em><strong>Outlive: The Science and Art of Longevity</strong></em></p><p><strong>#8. Alonso Cueto, </strong><em><strong>La Hora Azul</strong></em></p><p><strong>#9. Chris Miller, </strong><em><strong>Chip War: The Fight for the World&#8217;s Most Critical Technology</strong></em></p><p>I was late in reading this one, but it remains just as relevant (if not more so) today.</p><p>I wish I had read it earlier because it quite clearly lays out the case for investing in Intel as the fab of the future in the US.</p><p><strong>#10. William L. Hamilton, </strong><em><strong>Graph Representation Learning</strong></em></p><p><strong>#11. Patrick McGee, </strong><em><strong>Apple in China: The Capture of the World&#8217;s Greatest Company</strong></em></p><p>One of the best business books I&#8217;ve ever read. It treats the history of Apple as a history of its manufacturing division, centering on its ability to churn out massive volumes of some of the most advanced devices in the world. By centering the manufacturing story, China is inevitably centered as well &#8212; at first as an apprentice, later as a partner, and finally as an uneasy frenemy of sorts.</p><p>This book will teach you so much about Apple, about manufacturing, and about China. I cannot recommend it enough.</p><p><strong>#12. Jeff Pepper, </strong><em><strong>Mulan, Woman Warrior: An Easy-to-Read Story in Simplified Chinese and Pinyin, 240 Word Vocabulary</strong></em></p><p>I started learning a bit of Mandarin in 2025, and so far it&#8217;s been great fun! I typically like to learn languages through reading (I&#8217;ve used this technique before in <a href="https://www.chrishayduk.com/p/one-year-of-spanish-immersion-using-refold-dbd5d55206fe">Spanish</a> and, to a lesser extent, in French). 
So, to get started on my journey, I looked for some simple yet enjoyable books I could work my way through. </p><p>This led me to discover <a href="https://imagin8press.com/">Jeff Pepper</a>, who has simplified many culturally relevant Chinese texts to make them accessible to beginner and intermediate audiences. These include <em>The Journey to the West</em>, various Chinese folktales, a biography of Confucius, and more. <em>Mulan, Woman Warrior </em>is one of the easier of his published books (and tells the same story as my favorite childhood movie!), and so I found it extremely enjoyable to read. I&#8217;m very much looking forward to working through <em>The Journey to the West</em> this year.</p><p><strong>#13. Dan Brown, </strong><em><strong>The Secret of Secrets</strong></em></p><p>Gemini defines &#8220;narrative thrust&#8221; as follows:</p><blockquote><p>Narrative thrust is the compelling force that drives a story forward, creating momentum that pulls the reader from one scene to the next, keeping them engaged and eager to discover what happens next</p></blockquote><p>Narrative thrust covers a multitude of sins. Dan Brown, for all his faults, knows how to tell a story with thrust. <em>The Secret of Secrets </em>is no different.</p><p><strong>#14. Fyodor Dostoevsky (translated by Michael R. Katz), </strong><em><strong>Crime and Punishment</strong></em></p><p>This is a book that I very much did not enjoy reading, but cannot stop thinking about now that I have read it. </p><p>For those who have read Dostoevsky, I&#8217;m sure the reasons why I didn&#8217;t enjoy the experience of reading the book are obvious &#8212; the stilted prose, the confined narrative arc, and the oppressive, suffocating atmosphere that permeates every scene. But for those same readers, the reasons I can&#8217;t stop thinking about it must be equally obvious &#8212; the psychological drama, the probing questions into the nature of morality, and the battle between logical rationality and subconscious thought.</p><p>Two themes really struck me throughout this book. </p><p>The first is Raskolnikov&#8217;s ability to reason himself into positions that his subconscious refuses to accept. To him and other characters in the book, his reasoning is flawless when he expounds on his life philosophy (which, to summarize, is that he is a Napoleonic figure, an example of a higher man, and, as such, is unbound by the constraints of standard morality). In particular, when he is alone and able to ruminate, he frequently finds himself reinforcing this idea in his head. However, whenever he is confronted with real human connection throughout the story, his Ubermensch idealization breaks down and gives way to a more Christian morality. In addition, there are moments, even when Raskolnikov is alone and avoiding human connection, when thoughts bubble up from his subconscious, questioning his crime. This dichotomy between what he wants to be (the Ubermensch) and what he subconsciously knows he is (the Christian) tears him apart internally throughout the novel.</p><p>The second theme concerns the book's ending. After admitting to his crimes, Raskolnikov is imprisoned in Siberia. At the beginning of his imprisonment, he continues his oscillation between feeling himself an Ubermensch and hating himself for it. Only after a year of imprisonment, during which Sonya Marmeladova, the daughter of a man Raskolnikov helped earlier in the novel, fell ill and could no longer visit him, did Raskolnikov repent and release himself from his delusions. 
In a sense, the imprisonment in Siberia acted as a death &amp; resurrection for him, after which he shed the sins of his past life and was resurrected as a moral, human agent rather than an aspiring Ubermensch. </p><p>Both themes capture the centrality of human connection &amp; emotion in our cognition &#8212; Raskolnikov can&#8217;t maintain the behavior implied by his reasoning when he comes into contact with others, and he only fully gives up his delusions when he experiences the pain of briefly losing his last human connection, Sonya, during her illness. This tension between self-overcoming and human connection is worth keeping in mind, particularly for those of us who&nbsp;<em>do&nbsp;</em>aspire to more and, in doing so, shed a bit of our natural humanity.</p><p>There is another theme I&#8217;d like to explore, but I&#8217;ll refrain from going on too much of a tangent here. I&#8217;ll just say that there are also really interesting parallels between the second theme I mentioned, the life of Edmond Dantes from <em>The Count of Monte Cristo</em>, and the life of Napoleon. I hope to explore these parallels in-depth in a future article.</p><p><strong>#15. Tae Kim, </strong><em><strong>The Nvidia Way: Jensen Huang and the Making of a Tech Giant</strong></em></p><p>A really fun history of today&#8217;s most important company. Lots of great insights into how Jensen works.</p><p><strong>#16. Harry M. Schey, </strong><em><strong>Div, Grad, Curl, and All That: An Informal Text on Vector Calculus</strong></em></p><p>This book builds up the machinery of multivariable calculus through the lens of electromagnetism and physics applications. I&#8217;ve been trying to go back and review some math (e.g., calculus, linear algebra, etc.) from a more physics- and geometry-forward perspective so that I can improve my intuitions for those topics. <em>Div, Grad, Curl, and All That</em> was great for that purpose.</p><p><strong>#17. Michio Kaku, </strong><em><strong>Quantum Supremacy</strong></em></p><p>Yes, I&#8217;m aware that many of Kaku&#8217;s books devolve into quackery. Yes, I&#8217;m aware that <a href="https://scottaaronson.blog/?p=7321">Scott Aaronson</a> called this &#8220;the worst book about quantum computing&#8230; that I&#8217;ve ever encountered&#8221;. And yes, it does live up to that illustrious billing.</p><p>And yet, I still find myself feeling invigorated to learn more and explore new areas every time I read one of Kaku&#8217;s books. I&#8217;m not sure what this says about me, but here we are.</p><p><strong>#18. Richard P. Feynman, </strong><em><strong>Surely You&#8217;re Joking, Mr. Feynman! Adventures of a Curious Character</strong></em></p><p>Surprisingly, Feynman&#8217;s autobiography focuses quite little on his actual work in physics. Instead, we get to see the creative and whimsical mind of one of the titans of 20th-century physics at play. Expect lots of stories of mischief and side quests, like lockpicking at Los Alamos or learning to play the drums to join a band during Carnival in Brazil.</p><p><strong>#19. Dwarkesh Patel, </strong><em><strong>The Scaling Era: An Oral History of AI, 2019&#8211;2025</strong></em></p><p>This book cuts up many of Dwarkesh&#8217;s interviews and groups them together in chapters organized by theme (e.g., &#8220;Chapter 1: Scaling&#8221;, &#8220;Chapter 2: Evals&#8221;, etc.). 
</p><p>Even if you&#8217;ve listened to all of these podcasts previously (as I have), the book still provides a lot of value through its theme-based organization and the extensive footnotes/marginalia that Dwarkesh has included to clarify topics and provide additional detail.</p><p>Overall, a nice book to quickly assess where we are and where we&#8217;re going in AI.</p><p><strong>#20. Tyler Cowen and Daniel Gross, </strong><em><strong>Talent: How to Identify Energizers, Creatives, and Winners Around the World</strong></em></p><p>Finding above-average (or even great) employees is not too difficult &#8212; we broadly know how to conduct interviews to find people who are competent in their role and conscientious in their execution. </p><p>But how do you find people who are truly exceptional?</p><p>In <em>Talent</em>, Cowen and Gross aim to answer this question by covering interesting, non-standard interview techniques to help quickly assess whether someone is an outlier among outliers. Very useful if you are a venture capitalist, startup founder, or anyone else aiming to find people who are truly at the tails of the distribution.</p><div><hr></div><p>Overall, this was a bit of a lighter year of reading than I usually have. In 2026, I&#8217;m aiming to increase both the volume and the complexity of the reading I&#8217;m doing &#8212; hopefully, there will be many more math, physics, and deep learning books in next year&#8217;s list than were in this year&#8217;s. </p><p>In addition, I really like learning history (both political and business) through biographies, and I got away from that a bit in 2025. I&#8217;m going to work on incorporating many more of those in 2026. </p><p>Lastly, my reading over the last few years has slanted <em>heavily </em>towards non-fiction, but, like Raskolnikov, I&#8217;ve been missing the human connection and insights that great fiction can bring. My goal for 2026 is to explore many more of the great books, particularly the seminal novels of the 19th century.</p>]]></content:encoded></item><item><title><![CDATA[A Tale of Two Futures]]></title><description><![CDATA[America is building Skynet. 
China is building The Jetsons.]]></description><link>https://www.chrishayduk.com/p/a-tale-of-two-futures</link><guid isPermaLink="false">https://www.chrishayduk.com/p/a-tale-of-two-futures</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Sun, 11 Jan 2026 23:07:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TYv0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The US and China are engaged in a two-country race towards an AI-powered future. Both countries have directed funding, policy initiatives, and talent towards the AI sector at levels matched only by the internet buildout of the 1990s or railroad construction of the 1880s. But the two countries are building towards diametrically opposed futures. </p><p>One country has taken the view that it is on the path to artificial superintelligence (ASI) &#8212; the point at which AI will be more effective than the most capable humans at every task. The view of its prominent AI labs is that once ASI is achieved, there will be a runaway intelligence explosion, with the AI rapidly improving itself and reaching unthinkable levels of intelligence. This genius AI will then be able to solve our most pressing problems in mathematics, physics, philosophy, and more with minimal difficulty.</p><p>The other country has taken the view that it is not peak intelligence that matters, but rather the distribution of intelligence. It aims to develop intelligent (though not superintelligent) AI models that are fast and cheap enough to be embedded in machines across the economy. It wants household robots in every home, talking cars, and refrigerators that can do the grocery shopping.</p><p>One future is top-down, the other is bottom-up. One is centralized, the other is decentralized. One results in gains accruing to a select few corporations, the other results in gains throughout the entire economy. 
One is Skynet, the other is The Jetsons.</p><p>The great irony of the current AI race is that the centralized Skynet future belongs to the democratic United States, and the decentralized Jetsons future belongs to authoritarian China.</p><h2>The Skynet Future</h2><p>Skynet is the fictional AI developed by Cyberdyne Systems in the original Terminator movie. It&#8217;s a large, powerful AI system that, once deployed, rapidly increases its intelligence to the point of becoming self-aware. Once self-aware, it becomes self-interested and decides that the best way to preserve its existence is to eliminate <em>all </em>humans. Skynet then appropriates the nuclear codes of the United States and launches the country&#8217;s nuclear weapons in an attempt to eliminate the human race from the face of the Earth. Small groups of humans survive the nuclear fallout, living in a post-apocalyptic world and fighting the Skynet-directed robots that are attempting to exterminate humanity once and for all.</p><p>This vision is very apocalyptic, but I think it captures the sentiment of the US AI scene better than any other popular depiction of AI. From the initial ambition of Skynet running the United States, to its intelligence explosion, to its final destruction of humanity, these scenarios are not considered outlandish in Silicon Valley. In fact, they may be the norm among AI researchers and AI lab CEOs. 
And these views have substantial implications for how AI research is playing out in the United States.</p><p>The CEOs of AI research labs are explicitly building towards this form of superintelligence. Crucially, these labs view pushing the intelligence frontier of their models as the core goal of their research. They look to benchmarks that measure intelligence on extremely difficult tasks (such as FrontierMath, GDPVal, and ARC-AGI-2) as the core metrics that they&#8217;re optimizing against. Their goal is to produce a &#8220;country of geniuses in a datacenter&#8221;, as Dario Amodei put it in his article &#8220;Machines of Loving Grace&#8221;. Amodei believes that, once achieved, artificial superintelligence could compress 50-100 years of biological research into 5-10 years. </p><p>Moreover, many of the prominent figures in AI view the path to superintelligence as a race. They believe that as we move closer to superintelligence, we will be able to achieve an automated AI researcher that can analyze its own codebase and improve it rapidly. These algorithmic gains from AI producing its own code improvements will result in an intelligence explosion, such that the first lab to produce an automated AI researcher will immediately gain an insurmountable lead in intelligence over the other labs. As such, not only do the AI lab leaders <em>believe</em> in this Skynet scenario, but they also view it as a race that must be won at all costs. In their minds, the very existence of their companies depends upon reaching superintelligent AI first. To understand the pace of AI investment and the share of that investment allocated to training ever-larger and more capable models (rather than more cost-efficient or broadly distributed models), you need to internalize that the prominent figures in AI strongly believe this to be the true state of the world. </p><p>To race towards superintelligence, massive increases in the two core inputs to AI training are needed: data and compute. As a result, the AI labs have invested hundreds of billions of dollars into massive compute scale-outs, data acquisitions, and RL environment development. New data centers are coming online that consume gigawatts of electricity to train larger models with higher parameter counts. In addition, these new models are fed data and trained in RL environments produced by hired PhDs and leading experts across math, computer science, finance, and more. 
</p><p><em>[Chart: the cost of building a frontier AI data center, 2022&#8211;2027]</em></p><p>To afford the energy, compute, and data to produce these models that push benchmark metrics on FrontierMath and GDPVal, more and more centralization is encouraged in AI research. As shown in the chart above, the cost of building a frontier data center has been increasing exponentially, from around $7 billion in 2022 (when ChatGPT was first released) to a projected $106 billion in 2027. Hence, if there is a fixed amount of private &amp; public funding available to AI companies, it is in investors&#8217; interest to allocate that funding to a small number of companies so that they can afford the requisite frontier data centers to train these models. 
With funding too widely distributed, no single player would be able to produce a model that outperformed the state-of-the-art on these frontier benchmarks.</p><p><em>[Chart: training compute cost of frontier models, 2012&#8211;2025]</em></p><p>We can observe this trend more clearly in the training compute cost of frontier models from 2012 through 2025. Model training costs rapidly increased from roughly $3 million in 2022 to over $300 million in 2025. In addition, the trend line shows this cost increasing at a rate of 0.5 orders of magnitude per year, indicating that the cost of a single training run next year (2027) will increase to about $3 billion. Projected forward to the end of the decade (2030), a training run would cost roughly $100 billion (with the data centers powering such a run likely costing in excess of $1 trillion).</p><p>Given all of this, we can think of the American AI ecosystem as a bet on increasing centralization and increasing scale. It is a bet on a benevolent Skynet future &#8212; leveraging unprecedented resources to build a single, massive AI model capable of solving the most difficult problems in science, technology, politics, and philosophy. 
<h2>The Jetsons Future</h2><p>The Jetsons is an animated sitcom from the 1960s that depicts a future defined not by a single technological breakthrough but by the accumulation of countless small conveniences &#8212; flying cars, household robots, and apartments that predict each family&#8217;s needs. The Jetsons&#8217; future is not one of transcendence but of leisure &#8212; technology has not produced a godlike intelligence but has instead seeped into every object, automating away the drudgery of daily life. The show&#8217;s vision is one of abundance through distribution: no single machine is particularly impressive, but the sheer proliferation of helpful machines has transformed the nature of work and home life entirely.</p><p>The Jetsons aired the same year as the Cuban Missile Crisis. Sixty years later, it is China under the CCP, not the United States, that is building toward its vision.</p><p>Chinese AI labs burst onto the scene in early 2025 with the release of DeepSeek-R1. Unlike the models of its American counterparts, DeepSeek&#8217;s model was notable <em>not</em> for its raw performance &#8212; it was strong, but it lagged behind the frontier. The truly impressive aspect of DeepSeek-R1 was that it performed similarly to frontier models at a fraction of the cost, in terms of both serving cost and training cost. For example, as I detailed in another article (<a href="https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world">Open Source LLMs Are Eating the World</a>), DeepSeek-R1 was able to perform nearly as well as OpenAI o1 (the frontier model at the time) on the MMLU Pro benchmark, while costing only $6.75 to run the full benchmark suite compared to o1&#8217;s $75. This represented an 11x drop in the cost to serve the model at roughly equivalent performance levels.</p><p>The success of DeepSeek-R1 has sparked a wave of innovation in open-source Chinese AI. Various companies have entered the fray, including Alibaba with its Qwen series, Z.ai with its GLM series, and Moonshot AI with its Kimi series. Each of these three core competitors, along with DeepSeek, has steadily pushed the cost of economically useful intelligence towards zero. 
<p><em>[Chart: catch-up time for open-source models to match closed-source performance at successive MMLU-Pro thresholds. Produced by GPT 5.2 Pro using MMLU Pro benchmark results]</em></p><p>In addition, the speed of innovation has compressed the time delta in model performance between closed-source AI labs and their Chinese open-source equivalents. The chart above shows the catch-up time required for open-source AI models to match the performance of closed-source models at different performance thresholds on the MMLU-Pro benchmark. We can see that earlier performance levels, such as the 60% threshold, required roughly two-thirds of a year before open-source AI could match closed-source AI. However, recent performance thresholds have been achieved in far less time, between a quarter and a third of a year. Chinese AI labs have ramped up their investments in data centers and energy. They are now able to purchase NVIDIA H200 chips, and the Chinese chip ecosystem is maturing more quickly than expected. As a result, we should expect this time gap to continue to compress rather than expand. </p><p>The implications here are genuinely massive. This chart shows that open-source AI models from China can match the performance of our leading closed-source models in less than 6 months, often at an order of magnitude lower cost. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wq9P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wq9P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg" width="1172" height="940" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:940,&quot;width&quot;:1172,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!Wq9P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In parallel with these advancements in open-source AI, China has been experiencing two massive buildouts over the last five years. The first has been a huge increase in energy generation, specifically in solar and nuclear energy. These rapid increases in energy generation, which frequently exceed the rest of the world combined, for example, in installed solar capacity, are enabling manufacturing at scales never before seen. In particular, the improvements to solar and batteries that are occurring and will continue to occur due to technological and cost improvements driven by this increased manufacturing will allow energy not just to be more plentiful overall, but also to be more local to specific needs. What this means practically is that energy will become local and mobile, allowing for various devices to become much more energy-intensive than they have been previously through the use of improved batteries and local solar panels. 
This will allow devices to include the substantial onboard compute required to run the top open-source AI models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MY6h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MY6h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 424w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 848w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1272w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MY6h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png" width="1024" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Q&amp;A: The global 'trade war' over China's booming EV industry - Carbon Brief&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Q&amp;A: The global 'trade war' over China's booming EV industry - Carbon Brief" title="Q&amp;A: The global 'trade war' over China's booming EV industry - Carbon Brief" srcset="https://substackcdn.com/image/fetch/$s_!MY6h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 424w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 848w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1272w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1456w" sizes="100vw" 
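<p>A crude way to sanity-check the &#8220;local and mobile energy&#8221; claim is a back-of-the-envelope power budget. Every number below is an assumption chosen for illustration (accelerator draw, panel rating, battery size), not a measurement; the point is the shape of the calculation, not the verdict:</p><pre><code># Sketch: can a local solar + battery setup sustain onboard inference?
# All figures are illustrative assumptions, not measured values.
inference_draw_w = 150.0   # assumed steady draw of an onboard accelerator
battery_wh = 2000.0        # assumed onboard battery capacity (2 kWh)
panel_w = 400.0            # assumed solar panel rating
sun_hours_per_day = 4.5    # assumed equivalent full-sun hours per day

harvest_wh = panel_w * sun_hours_per_day   # daily solar harvest
demand_wh = inference_draw_w * 24          # daily 24/7 inference demand
runtime_h = battery_wh / inference_draw_w  # battery-only runtime

print(f"Daily harvest: {harvest_wh:.0f} Wh vs demand: {demand_wh:.0f} Wh")
print(f"Battery alone sustains inference for {runtime_h:.1f} hours")
print("Self-sufficient!" if harvest_wh >= demand_wh
      else "Needs duty-cycling or grid top-up under these assumptions")
</code></pre>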
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The second key build-out has been in advanced manufacturing. China was <em>already</em> the world's manufacturing center before this push into advanced manufacturing. But now it has moved up the value chain and very quickly has gone from a laggard in key industries to dominating them. For instance, five years ago, China was largely irrelevant in the electric car market. Now its electric car companies are leading the world and outselling Tesla. Several leading companies are now making strong pushes into humanoid robots and are setting the pace in that category. </p><p>This confluence of advanced manufacturing in robotics &amp; battery-powered vehicles, increased energy generation (specifically solar), and open-source AI that is energy- and compute-efficient will allow China to develop a truly intelligence-powered economy. The major factors will be in place to have human-level AI embedded into a large share of both consumer and industrial products. </p><p>With this approach, China aims to enable a new level of general abundance with household robots, self-driving electric cars, self-directed delivery drones, and household appliances that can make decisions for themselves, such as a refrigerator that can detect when you&#8217;re running low on specific supplies and order them agentically. </p><p>And crucially, this future does <em>not</em> depend on reaching superintelligence. As I&#8217;ve detailed in my other article, <a href="https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world">Open Source LLMs Are Eating the World</a>, many economically relevant tasks operate in a task-saturation regime. That is, once the models exceed some threshold level of performance, future increases in model scale and training compute do not make meaningful differences in task-level performance. Moreover, models today are <em>already</em> capable of performing many economically viable tasks, such as coding complex apps, serving as customer support agents, and more. 
<p>Making this intelligence cheap and abundant through energy- and compute-efficient open-source AI will unlock massive economic value across the spectrum. There isn&#8217;t much doubt about this; the Jetsons future is clearly within reach. However, there <em>is</em> a question mark over whether we will reach the benevolent Skynet future.</p><h2>Consequences of the Divide</h2><p>US labs are betting on transcendent intelligence. Chinese labs are betting on abundant intelligence. Who wins depends on which future actually arrives.</p><p>Four scenarios are possible. Only one favors the American approach.</p><h3>The Scenario Matrix</h3><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!b0jh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F765ab45d-a2f6-41c7-945d-93828f81c4c1_1820x686.png" alt=""></figure></div>
<p>From the above matrix, we see that the American bet requires threading a needle: superintelligence must be achievable, it must trigger a runaway intelligence explosion, <em>and</em> that explosion must translate into massive economic returns unconstrained by physical bottlenecks.</p><p>Remove any link in that chain, and the calculus shifts.</p><p><strong>If superintelligence arrives but can&#8217;t escape physical constraints:</strong> The leading lab pulls ahead on benchmarks, but drug discovery still bottlenecks at FDA trials. Robotics still bottlenecks at manufacturing. Most economically useful tasks don&#8217;t require superintelligence anyway. Open-source competitors deliver similar real-world value at a fraction of the cost.</p><p><strong>If superintelligence arrives but no takeoff occurs:</strong> Energy infrastructure takes years to build. Training runs take months. Even with a superintelligent AI optimizing your codebase, the results manifest slowly enough for competitors to close the gap. No insurmountable lead materializes.</p><p><strong>If superintelligence never arrives:</strong> Intelligence gains follow a sigmoid curve&#8212;rapid improvement, then diminishing returns. At that plateau, the race shifts from &#8220;who&#8217;s smartest&#8221; to &#8220;who&#8217;s cheapest and most distributed.&#8221; China wins that race.</p><h3>The Asymmetric Bet</h3><p>The Jetsons future requires no miracles. 
Cheap, capable AI embedded in robots, vehicles, and appliances delivers value whether or not superintelligence is possible. China&#8217;s bet pays off in three of four scenarios.</p><p>The Skynet future requires everything to go right. Superintelligence must be reachable, takeoff must occur, and physical constraints must not bind. America&#8217;s bet pays off in one scenario.</p><h3>The Implication</h3><p>The US AI lab approach is decisively dominant in only one of the scenarios enumerated above. In every other scenario, Chinese open-source AI keeps pace with the closed-source frontier and, in doing so, all but guarantees the Jetsons future that China is building towards. The American benevolent Skynet future is far from guaranteed.</p><p>In sum, to ensure that the United States broadly benefits from the AI revolution it started, we need to take a page from the Chinese AI playbook. We must ensure that, even if superintelligence is out of reach, we have cheap, abundant intelligence suffusing the economy. We must ensure that our portable energy infrastructure (i.e., solar panels and batteries), our robotics manufacturing capabilities, and our open-source AI efforts are sufficient to power a truly intelligent economy. Failure to do so may cede technological leadership in the 21st century to the CCP and China.</p>]]></content:encoded></item><item><title><![CDATA[Open Source LLMs Are Eating the World]]></title><description><![CDATA[We are benchmarking LLMs incorrectly to predict economic utility]]></description><link>https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world</link><guid isPermaLink="false">https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Fri, 09 Jan 2026 20:15:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_XZh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57385725-8fc4-46fa-a04e-e3e08d16b487_1596x1136.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The default way we evaluate large language models is fundamentally misaligned with how they create economic value. We track frontier capabilities across broad benchmarks (such as ARC-AGI-2, FrontierMath, and SWE-Bench Verified) and implicitly assume that whoever leads on these metrics captures the most value.</p><p>However, this view assumes that what matters is the model&#8217;s maximum intelligence across a broad range of tasks &#8212; the &#8220;PhD intelligence for all&#8221; mantra repeated by the large labs.</p>
<p>For practical company building, this framing is wrong, and understanding why reveals a structural advantage for open-source models that holds regardless of when (or whether) we achieve AGI.</p><h2>I. Introduction: The Benchmarking Problem</h2><p>The standard narrative goes something like this: model value scales with general performance. A model that scores higher on a diverse battery of benchmarks is more valuable than one that scores lower, and the companies training the most capable models will capture the lion&#8217;s share of economic returns.</p><p>But this misses how value actually gets created in practice. Companies don&#8217;t build products that require uniformly excellent performance across all possible tasks. They build for specific use cases: contract analysis, customer support, code generation, and medical documentation. Revenue comes from solving customer problems, and customer problems are specific. The &#8220;average&#8221; benchmark performance that frontier labs optimize for doesn&#8217;t map to any real product; it&#8217;s an abstraction that obscures the actual economics.</p><p>For any given application, what matters is whether your model is good enough at <em>this particular thing</em>, not whether it can solve PhD-level mathematics problems or write publishable research. A legal tech startup needs strong performance on contract reasoning and citation accuracy. A customer support platform needs reliability in intent classification and tone. Neither benefits from improvements to the model&#8217;s ability to prove novel theorems.</p><p>Moreover, the relationship between capability and value is S-curved. Early capability improvements unlock entirely new use cases: a model that goes from 40% to 70% accuracy on a task might cross the threshold from &#8220;useless&#8221; to &#8220;useful with human oversight.&#8221; But a model that goes from 92% to 96% often delivers no additional value, because the human workflow was already designed around spot-checking outputs, and the bottleneck has shifted elsewhere: to latency, cost, integration complexity, or user experience.</p><p>This is the crux of the argument: once a model clears the capability threshold for a given task, further intelligence improvements face rapidly diminishing returns. The contract analysis tool that&#8217;s &#8220;good enough&#8221; for lawyers to trust with first-pass review doesn&#8217;t become twice as valuable when the underlying model gets twice as capable. It just becomes overprovisioned.</p><h2>II. The Task Saturation Phenomenon</h2><p>For any specific task, <strong>the marginal value of model capability saturates at some threshold.</strong> Beyond that point, users cannot meaningfully distinguish between a model of size X and a model of size n&#215;X for any n &gt; 1.</p>
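<p>One way to see why the 40%&#8594;70% jump matters while the 92%&#8594;96% jump often doesn&#8217;t is to model task value as a logistic (S-shaped) function of accuracy. This is a minimal sketch; the threshold and steepness parameters are illustrative assumptions, not fitted to any real product data:</p><pre><code># Sketch: S-curved value of capability. Parameters are illustrative.
import math

def task_value(accuracy, threshold=0.6, steepness=15.0):
    """Logistic value curve: near-zero below the usefulness threshold,
    saturating once the task is reliably solved."""
    return 1.0 / (1.0 + math.exp(-steepness * (accuracy - threshold)))

for lo, hi in [(0.40, 0.70), (0.92, 0.96)]:
    gain = task_value(hi) - task_value(lo)
    print(f"{lo:.0%} -> {hi:.0%}: marginal value gained = {gain:.3f}")
</code></pre><p>Under these assumptions the first jump crosses the threshold and captures most of the curve (~0.77 of the normalized value), while the second adds almost nothing (~0.004): the saturation regime in miniature.</p>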
<p>Consider what&#8217;s happened to standard benchmarks over the past few years. The chart below tracks top scores on benchmarks like ARC, MMLU, Winograd, HellaSwag, GSM8K, and TruthfulQA against their human baselines:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!7ZLe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea090f68-5a7b-4c59-a68c-2f74fffa13dc_1256x780.png" alt=""></figure></div>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The pattern is consistent: rapid improvement followed by convergence toward (and often slightly beyond) human-level performance. Once a benchmark is effectively &#8220;solved,&#8221; additional capability improvements deliver zero marginal value for tasks that the benchmark measures. A model scoring 95% on MMLU isn&#8217;t twice as useful for MMLU-adjacent tasks as one scoring 90%. For most practical purposes, they&#8217;re equivalent.</p><h2>III. Reframing the Analysis: Cost at Fixed Performance</h2><p>If capability saturates for specific tasks, then the relevant question isn&#8217;t &#8220;which model is most capable?&#8221; but rather &#8220;which model solves my task at the lowest cost?&#8221;</p><p>Once we fix a performance threshold (the point at which a task is effectively solved) we can track how the cost to achieve that threshold evolves over time. 
The a16z team did exactly <a href="https://a16z.com/llmflation-llm-inference-cost/">this analysis for MMLU scores</a>:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_XZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57385725-8fc4-46fa-a04e-e3e08d16b487_1596x1136.png" alt=""></figure></div>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The trend line shows roughly a 10&#215; cost reduction every year for a fixed capability level. But the more important pattern is <em>which models</em> sit on that cost frontier over time. Early in a capability tier&#8217;s lifecycle, closed-source models from frontier labs define the frontier. But within months, open-source alternatives emerge at dramatically lower price points.</p><p>Look at the progression for MMLU &gt; 83: GPT-4 at $45 per million tokens, then GPT-4o at ~$10, then Claude 3.5 Sonnet at ~$10, and finally Llama 3.1 70B pushing costs down toward $0.50. The same pattern plays out for every capability threshold: closed-source models solve the task first, and then open-source models quickly make it cheaper. </p><p>Thus, if we imagine a fixed benchmark score as a proxy for the threshold at which a task is &#8220;solved&#8221;, we see that closed source models have historically had a payoff horizon of roughly one year before open source models made </p><h2>IV. Case Study: MMLU Pro Replication Speed</h2><p>MMLU Pro extends the original MMLU benchmark by increasing the number of multiple-choice options from 4 to 10, introducing misleading distractors, and emphasizing reasoning-heavy questions. 
<h2>IV. Case Study: MMLU Pro Replication Speed</h2><p>MMLU Pro extends the original MMLU benchmark by increasing the number of multiple-choice options from 4 to 10, introducing misleading distractors, and emphasizing reasoning-heavy questions. It&#8217;s a harder benchmark, which lets us separate the performance levels of recently released models.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!mz2q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd676c533-3d7d-4f19-8f4b-a3e29c484839_1460x1534.png" alt=""><figcaption class="image-caption">Benchmark results available <a href="https://www.vals.ai/benchmarks/mmlu_pro">here</a></figcaption></figure></div>
class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Benchmark results available <a href="https://www.vals.ai/benchmarks/mmlu_pro">here</a></figcaption></figure></div><p>Consider the 83% performance threshold. That is, models that answered at least 83% of questions correctly:</p><ul><li><p><strong>OpenAI o1</strong> was the first model to reach this level and did so upon its release on December 5, 2024. Its API pricing was $15 per million input tokens and $60 per million output tokens. <strong>The total cost to run the benchmark was $75.</strong></p></li><li><p><strong>DeepSeek R1</strong> was the first open source model to reach this level when it launched on January 20, 2025, priced at roughly $1.485 per million input tokens and $5.94 per million output tokens. <strong>The total cost to run the benchmark was $6.75</strong></p></li></ul><p>That&#8217;s an order of magnitude cost reduction in one month for equivalent task performance. If we want to be generous and use the release date of o1-preview, this still results in a time horizon of only 4 months before DeepSeek matched its performance with an open source model costing an order of magnitude less.</p><p>To drive the point home further still, DeepSeek V3.2 came out on December 1, 2025, and again achieved the 83% performance threshold, but this time at a&nbsp;<strong>two orders of magnitude reduction in cost</strong>&nbsp;when compared with OpenAI o1. Specifically, <strong>the total cost to run the benchmark was only $2.24.</strong></p><p>Thus, for a fixed level of performance, we see the price drop from $75 to $6.75 to $2.24 over the course of a single year. <strong>As a result, I argue that&nbsp;</strong><em><strong>any</strong></em><strong>&nbsp;task solved by a closed-source model will see enterprise buyers transition to cheaper open-source models within 6 months to one year.</strong></p><p>And there&#8217;s reason to expect this pace to accelerate. As Huawei and SMIC close the gap with NVIDIA and TSMC, and now that NVIDIA potentially regains the ability to sell H200 chips in China, the Chinese open-source labs will have access to better hardware while maintaining their cost structure advantages. 
<h2>V. The AGI-Agnostic Conclusion</h2><p>What I think makes this view most compelling is that it doesn&#8217;t depend on AGI being decades away.</p><p>The conventional case for open source often rests on the assumption that we&#8217;re approaching a capability plateau: base model improvements will slow, shifting competition to fine-tuning, cost, and vertical specialization. This assumes that the vision of the future espoused by the US AI labs, predicated on artificial superintelligence (ASI) and runaway intelligence explosions, is wrong, while China&#8217;s view of commoditized intelligence is correct. That may well be true, but it&#8217;s a bet on a particular trajectory of AI progress.</p><p>The task saturation argument is stronger because it&#8217;s agnostic to the AGI timeline. When you&#8217;re building a company, you&#8217;re typically building for a specific use case. That means you&#8217;re operating in the saturation regime, not the model scale-up regime. Even if frontier models continue improving rapidly and the AI-2027 timeline plays out, the task your company is built around has a capability threshold beyond which additional model intelligence doesn&#8217;t matter.</p><p>And once you&#8217;re in the saturation regime, the only dimension of competition that matters is cost. Open source wins on cost, systematically and structurally, because open-source economics allow for lower margins and broader distribution.</p><p>The practical takeaway for company builders is this: bias toward open source, and do so for cost reasons rather than capability bets.</p><p>If you&#8217;re building an AI-native product, ask yourself: what capability threshold does my use case actually require? Chances are, that threshold is either already achieved by current open-source models or will be within 6-12 months of a closed-source model first reaching it. Build your infrastructure and workflows around the assumption that you&#8217;ll be running on open-source models, even if you start with closed-source APIs for speed to market.</p><p>The benchmark that matters isn&#8217;t &#8220;which model is smartest.&#8221; It&#8217;s &#8220;which model solves my task cheaply enough.&#8221; And open source is destined to win that competition systematically through relentless cost deflation.</p>
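<p>Operationally, &#8220;which model solves my task cheaply enough&#8221; reduces to a simple selection rule. Here is a minimal sketch of that rule; the candidate names, accuracies, and prices are placeholder values you would replace with your own task evals:</p><pre><code># Sketch: pick the cheapest model that clears the task threshold.
# Candidate scores/prices below are placeholders, not real eval results.
candidates = [
    {"name": "closed-frontier", "task_accuracy": 0.95, "usd_per_mtok": 60.0},
    {"name": "open-large",      "task_accuracy": 0.91, "usd_per_mtok": 3.0},
    {"name": "open-small",      "task_accuracy": 0.84, "usd_per_mtok": 0.4},
]

def cheapest_sufficient(models, threshold):
    """Cheapest model meeting the task's accuracy threshold, else None."""
    ok = [m for m in models if m["task_accuracy"] >= threshold]
    return min(ok, key=lambda m: m["usd_per_mtok"]) if ok else None

print(cheapest_sufficient(candidates, threshold=0.90))  # -> open-large
</code></pre><p>The design point is that the threshold, not the leaderboard rank, is the decision variable: raising a candidate&#8217;s accuracy above the threshold changes nothing once a cheaper sufficient model exists.</p>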
]]></content:encoded></item><item><title><![CDATA[Knowledge Is Power Law Distributed]]></title><description><![CDATA[Being a generalist is easier than you think]]></description><link>https://www.chrishayduk.com/p/knowledge-is-power-law-distributed</link><guid isPermaLink="false">https://www.chrishayduk.com/p/knowledge-is-power-law-distributed</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Tue, 06 Jan 2026 02:40:50 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yyFn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa7c998-fa2a-4625-9144-3f0d13781e6d_5504x3072.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!yyFn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa7c998-fa2a-4625-9144-3f0d13781e6d_5504x3072.jpeg" alt=""><figcaption class="image-caption">Pictured above: British polymath Thomas Young. &#8220;The man who knew everything&#8221;</figcaption></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/faa7c998-fa2a-4625-9144-3f0d13781e6d_5504x3072.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8005586,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/183600317?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa7c998-fa2a-4625-9144-3f0d13781e6d_5504x3072.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yyFn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa7c998-fa2a-4625-9144-3f0d13781e6d_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!yyFn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa7c998-fa2a-4625-9144-3f0d13781e6d_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!yyFn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa7c998-fa2a-4625-9144-3f0d13781e6d_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!yyFn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa7c998-fa2a-4625-9144-3f0d13781e6d_5504x3072.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Pictured above: British polymath Thomas Young. &#8220;The man who knew everything&#8221;</figcaption></figure></div><p>There is a common meme called &#8220;the last person who knew all of X&#8221;. 
Some variants include:</p><ul><li><p>&#8220;The last person who knew everything&#8221; &#8212; often referring to Thomas Young (1773-1829)</p></li><li><p>&#8220;The last person who knew all of math&#8221; &#8212; often referring to David Hilbert (1862-1943)</p></li><li><p>&#8220;The last person to know all of physics&#8221; &#8212; often referring to Enrico Fermi (1901-1954)</p></li></ul><p>The implication of these statements is that it is now impossible (or at least unheard of) to know everything in these domains. The dates above suggest that the age of the generalist ended somewhere between 100 and 200 years ago, depending on how broadly we define the field of study. We have thus had roughly a century of specialization, with the idea that &#8220;becoming a true generalist is impossible&#8221; largely accepted as fact, whether explicitly or implicitly, by academics everywhere.</p><p>The march from high school diploma to undergraduate degree to PhD results in ever-greater specialization and an ever-narrowing aperture through which to view knowledge. The benefit commonly attributed to this narrowing is that it helps us reach the frontier of knowledge and thus make novel contributions more quickly. Its proponents say that no one can know everything anymore, so it is a waste of time to branch out beyond your chosen field of study.</p><p>However, my view is that we are artificially constraining our best and brightest by continuing to espouse the narrative that it&#8217;s impossible to be an effective generalist. Useful knowledge is <em>not</em> broadly distributed &#8212; it is concentrated in a handful of topics contained in only a handful of subjects. You can sense this intuitively if you walk through the aisles of your nearby Barnes &amp; Noble: tens of thousands of books line the shelves, and the selection turns over every few weeks or months. But how many of those titles will be remembered a year from now? Ten years from now? A hundred? It is a safe bet that little of the knowledge or insight in any given new release will prove foundational. By contrast, the knowledge contained in a small core of useful books can be reapplied across different sets of facts in disparate fields to solve problems more effectively and creatively than many specialists can.</p><p><strong>Specifically, I claim that this useful core consists of roughly 300 books of the 170 million unique titles in existence (roughly 0.00018% of all books), and that this core will afford you ~99% of the value of all knowledge ever produced</strong>. 
That is, if you are a ~98-99th percentile 18-year-old high school graduate (roughly corresponding to ~1500+ on the SAT), then you should be able to get to the forefront of knowledge in the core academic subjects (or have the tools necessary to rapidly reach the forefront of non-core subjects) in about 300 books.</p><p>In other words, knowledge is highly power law distributed.</p>
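<p>The arithmetic behind this claim is easy to check under a concrete power-law assumption. If the value of the r-th most valuable book falls off as 1/r&#178; (the exponent 2 is an illustrative assumption, not an estimate), then the top 300 of 170 million titles really do capture ~99% of total value:</p><pre><code># Sketch: share of total value captured by the top-k books under a
# Zipf-like power law value(r) ~ 1/r**s. Exponent s=2 is illustrative.
import math

N_BOOKS = 170_000_000   # unique titles in existence (from the essay)
TOP_K = 300             # size of the claimed useful core

top = sum(1 / r**2 for r in range(1, TOP_K + 1))
# For s=2, the full sum over all 170M titles is ~zeta(2) = pi^2/6;
# the tail beyond rank 300 contributes only ~1/300 of a unit.
total = math.pi**2 / 6
print(f"Top {TOP_K} of {N_BOOKS:,} books capture {top / total:.2%} of value")
# -> roughly 99.8% under these assumptions
</code></pre><p>A gentler exponent would shrink the core&#8217;s share, so the 300-book figure is best read as a claim that knowledge decays at least this steeply.</p>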
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!L5mm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F941e5611-6565-4ce2-bdd3-b9c9b7b48b0c_3178x1690.png" alt=""><figcaption class="image-caption">The value of books represented as a power law distribution.</figcaption></figure></div><p>The core subjects that must be learned to realize this outsized payoff are:</p><ul><li><p>Physics</p></li><li><p>Mathematics</p></li><li><p>Computer Science</p></li><li><p>History</p></li><li><p>Philosophy</p></li><li><p>Literature (with the canon changing depending on your cultural context)</p></li></ul><p>It is quite possible, within the constraints of our 300-book total, to reach the level of knowledge of a 2nd- or 3rd-year PhD student at an elite institution in each of these subjects. This is possible for two reasons:</p><ol><li><p>Each of these fields has a small core of topics that are explored from many different angles to form the field&#8217;s many subtopics. For example, symmetries &amp; action principles in physics, linearity &amp; topology in mathematics, or computability &amp; complexity in computer science.</p></li><li><p>Each of the core topics in these fields provides insight into multiple other fields on the list. Learning mathematics aids both physics and computer science (and, to an extent, philosophy). Learning philosophy aids both history and literature. And so on.</p></li></ol><p>Moreover, once you have reached this level of knowledge across these core subjects, you will be able to dive into any subfield of these fields, or any field not on the list, and rapidly reach the frontier of research. This is possible because other fields use the core insights of these fields as their fundamental scaffolding. Chemistry, for instance, is ultimately the study of valence electrons governed by the laws of quantum mechanics; biology is the complex chemical machinery of life; economics is the application of multivariable calculus, game theory, and psychology; and political science is the ongoing, real-world stress test of history and moral philosophy. 
When you own the &#8220;root nodes&#8221; of the knowledge graph, the specialized nodes are often just specific parameters applied to general frameworks you have already mastered. Consequently, the transition from this generalist foundation to a specialist frontier is not a climb up a new mountain, but a lateral step onto a bridge you have already built, allowing you to perceive structural similarities between disciplines that the siloed expert remains blind to.</p><p>Once you accept this framework, you see that learning becomes much more about <em>curation</em> than about exhaustively studying all that has been published. It is the <em>via negativa</em> &#8212; choosing not to read the vast majority of writing so that you can instead learn the vast majority of knowledge. Thus, the barrier to becoming a modern Thomas Young is no longer cognitive capacity, nor is it the &#8220;impossibility&#8221; of the expanding universe of knowledge. The barrier is the discipline to ignore the noise of the 99.9999% in favor of the signal of the 0.0001%, and the patience to master those root nodes before attempting to climb the branches. The library of the generalist is small, but it is heavy.</p><p>End note: As an example of what this curation looks like, here is a condensed roadmap I am using to go from no knowledge of physics to a graduate-school level <a href="https://docs.google.com/spreadsheets/d/1vFhJny3V2W8snY5Sex2emuQ1cRPeTiYmqK6pTwD2GMk/edit?usp=sharing">[link]</a>.</p>]]></content:encoded></item><item><title><![CDATA[It's Time For Google to Acquire Intel]]></title><description><![CDATA[Breaking the Nvidia-TSMC monopoly]]></description><link>https://www.chrishayduk.com/p/its-time-for-google-to-acquire-intel</link><guid isPermaLink="false">https://www.chrishayduk.com/p/its-time-for-google-to-acquire-intel</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Thu, 25 Sep 2025 13:02:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!POjm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!POjm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png" width="1024" height="1024" alt="Generated Image September 24, 2025 - 11:01PM.png"></figure></div>
src="https://substackcdn.com/image/fetch/$s_!POjm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png" width="1024" height="1024" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Generated Image September 24, 2025 - 11:01PM.png&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Generated Image September 24, 2025 - 11:01PM.png" title="Generated Image September 24, 2025 - 11:01PM.png" srcset="https://substackcdn.com/image/fetch/$s_!POjm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!POjm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!POjm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!POjm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Nvidia made headlines this week when it announced it would invest up to $100&#8239;billion into OpenAI and help deploy at least 10&#8239;GW of AI infrastructure. 
The move, frequently memed as an &#8220;infinite money glitch&#8221; with capital and revenue cycling between Nvidia and OpenAI (see the image below), effectively ensures that a substantial fraction of Nvidia&#8217;s GPUs will land in OpenAI&#8209;aligned datacenters (via leasing or outright purchases).</p><p>This comes on the heels of OpenAI&#8217;s &gt;$300&#8239;billion &#8220;Stargate&#8221; build&#8209;out with Oracle, which targets ~4.5&#8239;GW of capacity, further tightening the market for top&#8209;end accelerators.</p><p>And that&#8217;s before accounting for OpenAI&#8217;s ongoing expansion on Microsoft Azure, where the relationship now runs under a right&#8209;of&#8209;first&#8209;refusal model for new capacity rather than blanket exclusivity, still conferring practical priority on Azure deployments while allowing OpenAI to add capacity with other partners.</p><p><strong>Netting this out:</strong> through the end of the decade, OpenAI has assembled an envelope of roughly 10&#8211;15&#8239;GW of Nvidia&#8209;powered capacity across Oracle, Microsoft, and other partners. These footprints overlap, so think of the total as a shared umbrella rather than as purely additive numbers. For context, independent analyses estimate that ~10&#8239;GW of additional AI data&#8209;center power could be needed globally in 2025 alone; in other words, OpenAI&#8217;s program is on the scale of a full year of incremental worldwide AI build&#8209;out.</p>
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!kCHD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe695baec-b55a-495b-a17a-a1a31bc5ec51_1004x632.jpeg" width="1004" height="632" alt="Logos of OpenAI, NVIDIA, and Oracle arranged in a triangular pattern with arrows labeled &quot;$100 billion&quot; connecting them. Text at the top reads &quot;THE INFINITE MONEY GLITCH&quot; in red."><figcaption class="image-caption">Courtesy of SemiAnalysis&#8217;s <a href="https://x.com/dylan522p/status/1970346183827783756">Dylan Patel</a></figcaption></figure></div><p>The above data implies Nvidia GPU availability will
tighten <em>substantially</em> for other frontier&#8209;model players&#8212;Anthropic (primarily on AWS), xAI, Meta, and Google DeepMind&#8212;raising effective prices and lead times and forcing harder choices about model cadence, context windows, and training tokens.</p><p>Google has been trying to break out of this Nvidia&#8209;dominated mold for years through the development of its own AI&#8209;specialized TPUs for training and inference. But these in&#8209;house&#8209;designed chips still pass through chokepoints that Nvidia heavily influences, especially TSMC wafers and advanced packaging. By the end of 2025, analysts expect Nvidia to account for ~20%+ of TSMC revenue (second only to Apple), and the CoWoS&#8209;class packaging and HBM ecosystems remain binding constraints even as capacity expands. TSMC&#8217;s allocation is fundamentally contractual, driven by prepays and take&#8209;or&#8209;pay deals, and it will be reluctant to shift meaningful share away from Nvidia while demand remains red&#8209;hot.</p><p>To escape the straitjacket created by the Nvidia&#8209;OpenAI alignment, Google should buy Intel (or a substantial portion of it), fund the High&#8209;NA EUV ramp, and prepare to manufacture TPUs on Intel fabs as that capacity comes online. That gives Google end&#8209;to&#8209;end control of its AI training infrastructure&#8212;chip architectures, training software, chip manufacturing, and data center buildout&#8212;and a guaranteed runway independent of Nvidia&#8217;s queue.</p><p>Recent events make this even more urgent. Nvidia just disclosed a $5&#8239;billion Intel investment at $23.28/share (roughly 5% of Intel&#8217;s outstanding shares), alongside a product pact in which Intel will build x86 SoCs integrating Nvidia RTX GPU chiplets for PCs and collaborate on custom data&#8209;center CPUs&#8212;clear evidence that Intel&#8217;s roadmap can be steered by anchor customers. Intel is also now soliciting an Apple investment, according to Bloomberg and Reuters reporting.</p><p>Given the rapidly changing dynamics around Intel, Google must act quickly and decisively. For example, a $25&#8239;billion purchase at $35/share would buy on the order of ~714&#8239;million shares, implying a ~16&#8211;17% stake in Intel based on ~4.37&#8239;billion shares outstanding (see the short sketch below), placing Google ahead of both the U.S. government (~10%) and Nvidia (~4&#8211;5%) as the largest shareholder. That level of ownership could anchor governance and direct capex toward TPU&#8209;critical fabs and packaging lines.</p><p>In practice, this looks like the following:</p><ol><li><p><strong>A minority stake + board influence</strong> sufficient to align Intel Foundry&#8217;s roadmap to TPU requirements.</p></li><li><p><strong>A TPU-only supply compact</strong>: multi-year, take-or-pay wafer and advanced-packaging commitments, with right-of-first-allocation during shortages and pricing bands tied to verifiable tool/packaging milestones.</p></li><li><p><strong>Parallel open&#8209;market TPU SKUs</strong> to keep utilization high and de&#8209;risk capex&#8212;turning Google&#8217;s silicon into a software&#8209;first, capacity&#8209;priced product.</p></li></ol><p>#3 is the longest shot, but perhaps the most enticing benefit of the investment. This would open up a second profit engine to fuel Google&#8217;s growth over the next decade, especially as its Search business comes under threat from AI-search competitors (such as OpenAI&#8217;s search-enabled offerings).</p>
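<p>As a sanity check on the stake arithmetic above, here is a short sketch using the same rough figures quoted in this post (price, purchase size, and share count are approximations, not live market data):</p>
<pre><code># Back-of-the-envelope stake math from the figures above.
purchase_usd = 25e9          # proposed $25B purchase
price_per_share = 35.0       # assumed $35/share price
shares_outstanding = 4.37e9  # approximate Intel share count

shares_bought = purchase_usd / price_per_share
stake = shares_bought / shares_outstanding

print(f"Shares bought: ~{shares_bought / 1e6:.0f} million")  # ~714 million
print(f"Implied stake: ~{stake:.1%}")                        # ~16.3%
# Caveat: this assumes buying existing shares; if Intel issued new
# shares instead, the stake would dilute to roughly 14%.
</code></pre>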
<p>In fact, Nvidia&#8217;s data&#8209;center business is now running at an annualized ~$160&#8239;billion revenue pace, which is comparable to Google&#8217;s Search cash cow. Thus, the addition of a TPU revenue line would provide substantial growth opportunities and a potential hedge against Google&#8217;s eroding search moat.</p><p>If this plan works, Google gets scheduling certainty, lower $/token, a faster model cadence independent of Nvidia&#8217;s allocation calendar, and another revenue stream that could eventually reach the scale of Google Search. If it stumbles, the downside is capped: Google is left with a financial stake that should still appreciate if Intel&#8217;s foundry business inflects. Either way, for $25&#8239;billion, Google can buy its way out of the Nvidia-TSMC duopoly and into the driver&#8217;s seat of AI compute.</p>]]></content:encoded></item><item><title><![CDATA[The Strategic Implications of GPT-5 for OpenAI]]></title><description><![CDATA[OpenAI shifts away from the enterprise and toward the consumer]]></description><link>https://www.chrishayduk.com/p/the-strategic-implications-of-gpt</link><guid isPermaLink="false">https://www.chrishayduk.com/p/the-strategic-implications-of-gpt</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Fri, 08 Aug 2025 15:39:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_Syy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e6943e8-d8d3-4e51-b3f2-e941d5d890db_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_Syy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e6943e8-d8d3-4e51-b3f2-e941d5d890db_1536x1024.png" width="1456" height="971" alt=""><figcaption class="image-caption">Image courtesy of GPT-5</figcaption></figure></div>
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image courtesy of GPT-5</figcaption></figure></div><p>After years of anticipation and hype, GPT-5 is finally out. And the results are decidedly mixed. GPT-5 is undoubtedly a great model &#8212; it is #1 across the board on LMArena, sets new highs in SWE-Bench and a host of other coding tasks, and performs great across a range of math benchmarks. However, the expectations for GPT-5 were that it would blow the competition out of the water. Instead, it has made incremental improvements across all of these benchmarks, and is highly likely to be passed over whenever Google releases its next Gemini model in short order (or just when Gemini 2.5 Deep Think gets benchmarked!) </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FLx3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4abf679a-4d93-45c9-85a5-6435c84b0f6f_750x469.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FLx3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4abf679a-4d93-45c9-85a5-6435c84b0f6f_750x469.png 424w, https://substackcdn.com/image/fetch/$s_!FLx3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4abf679a-4d93-45c9-85a5-6435c84b0f6f_750x469.png 848w, https://substackcdn.com/image/fetch/$s_!FLx3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4abf679a-4d93-45c9-85a5-6435c84b0f6f_750x469.png 1272w, https://substackcdn.com/image/fetch/$s_!FLx3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4abf679a-4d93-45c9-85a5-6435c84b0f6f_750x469.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FLx3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4abf679a-4d93-45c9-85a5-6435c84b0f6f_750x469.png" width="750" height="469" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4abf679a-4d93-45c9-85a5-6435c84b0f6f_750x469.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:469,&quot;width&quot;:750,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;OpenAI botches the charts in GPT-5 introduction &#8211; FlowingData&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="OpenAI botches the charts in GPT-5 introduction &#8211; FlowingData" title="OpenAI botches the charts in GPT-5 introduction &#8211; FlowingData" srcset="https://substackcdn.com/image/fetch/$s_!FLx3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4abf679a-4d93-45c9-85a5-6435c84b0f6f_750x469.png 424w, 
<p>The largest gains in GPT-5 came on a less performance-oriented axis: for this release, OpenAI appears to have heavily prioritized reducing hallucinations and sycophancy in model output.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!dBF7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84b3dbd-444c-4eba-9bf5-0b69ffeb50e2_2048x1089.png" width="1456" height="774" alt="GPT-5 Benchmark Scores | ml-news &#8211; Weights &amp; Biases"></figure></div>
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So you may not notice large performance differences between GPT-5 and the leading models from other labs (or even when compared to OpenAI&#8217;s o3 model), but you likely will notice that the model is much less likely to make things up and say things that are flat out wrong just to produce an answer.</p><p>In addition, you will likely notice in the ChatGPT model picker that all of the previous models are gone: now there&#8217;s only GPT-5. This is another one of GPT-5&#8217;s main contributions &#8212; it greatly simplifies the model selection process. GPT-5 is more of a system than a model, dynamically routing requests to faster LLMs (analogous to GPT 4o) or slower, thinking LLMs (analogous to o3) depending on the complexity of the request.</p><p>(The two above points are important; we&#8217;ll come back to those later).</p><p>Likely in response to some widespread dismay at the performance benchmarks, Sam Altman tweeted the following after the GPT-5 announcement:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b1QM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b1QM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 424w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 848w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b1QM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png" width="1456" height="1456" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:464257,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/170448919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b1QM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 424w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 848w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>&#8220;we can release much, much smarter models&#8221;</strong></p><p>It seems Altman is asserting that OpenAI deliberately <em>chose</em> to release a model below the company&#8217;s capabilities, barely edging out its competitors (and likely not even edging out Google&#8217;s leading model) on most performance metrics. 
Instead, they deliberately <em>chose</em> to focus on reducing hallucinations and streamlining model selection as the main contributions of GPT-5. Why would OpenAI do this? Why not continue setting the benchmark for LLM performance, as they&#8217;ve done since ye olde days of GPT-2?</p><p>Because the strategic focus of the company has clearly shifted.</p><h1>Market Dynamics Affecting OpenAI</h1><p>ChatGPT is a consumer application with <a href="https://www.techloy.com/chatgpt-is-on-track-to-reach-700-million-weekly-active-users/">700 million weekly active users</a>. And it is absolutely trouncing the competition in consumer adoption. The ChatGPT app in the Apple App Store has 3.3 million reviews, compared to just 377,000 for the Gemini app and 23,000 for the Claude app. This suggests that ChatGPT has a mobile install base roughly 10x the size of Google Gemini&#8217;s and more than 100x the size of Anthropic&#8217;s Claude. Moreover, in June 2025, openai.com had 1.12 billion visits, while gemini.google.com had 265 million and Claude had 113 million &#8212; again a commanding (4&#8211;10x) lead for OpenAI over its competitors in the consumer chat space (source: <a href="https://www.semrush.com/">Semrush</a>).</p><p>By contrast, according to <a href="https://menlovc.com/perspective/2025-mid-year-llm-market-update/">a Menlo Ventures report</a> from July 2025, Anthropic is actually the leading provider for enterprise LLM API usage, with 32% market share vs. OpenAI&#8217;s 25% in mid-2025. Google is also growing and not far behind OpenAI, at 20% market share, up from 12% in 2024.
OpenAI&#8217;s enterprise position has also been trending sharply negative, cratering from 50% market share in 2023 to 25% in mid-2025.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ruow!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp" width="1456" height="751" alt="Enterprise LLM API market share by usage"></figure></div>
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So we see the market dynamics pressing on OpenAI as a company &#8212; absolutely dominant positioning on the consumer side of the market, with a weak (and steadily weakening) position on the enterprise side of the market. This leaves OpenAI with a choice &#8212; double down on its success on the consumer side of the market, or attempt to win in the highly competitive enterprise space. This choice mainly comes down to where the company has a moat that can result in durable profit margins.</p><h1>How Market Dynamics Affect the Models</h1><p>First, let&#8217;s explore the dynamics of the consumer market. Consumers, by and large, make buying decisions not based on performance or objective metrics, but instead based on &#8220;vibes&#8221;. You can improve the vibes for a consumer by improving brand positioning (i.e., make the consumer feel a certain emotion from using your product) or by improving the user experience (UX) and user interface (UI) of your product (i.e., make the product more enjoyable for the user).</p><p>OpenAI&#8217;s lead in consumer usage stems primarily from precisely these areas, with its extremely strong branding and UI/UX improvements in its chat interface versus the competition. OpenAI had a strong, multi-year lead due to its first-mover advantage, providing its &#8220;ChatGPT&#8221; brand with significant mindshare in the consumer base. In addition, since the release of the original ChatGPT, OpenAI has focused strongly on the web and mobile chat experience. With features like memory and ChatGPT Projects, OpenAI has introduced a high level of personalization for users of the app, thereby creating a high switching cost moat &#8212; if you switch to Claude or Gemini, you can&#8217;t take ChatGPT&#8217;s memories or projects with you. This instantly makes the competing consumer apps less appealing to users in the same way that users of Spotify are reluctant to shift over to Apple Music once they have built up a library of playlists that they enjoy.</p><p>Hence, to improve consumer market share, OpenAI will need to continually pull the two levers of brand positioning and UI/UX improvements. 
<p>So, from the above, we can see that the only levers LLM providers can pull to improve a company&#8217;s ROI calculation are:</p><ol><li><p>Decrease the cost per million tokens for the API</p></li><li><p>Decrease the number of tokens needed to solve the task</p></li><li><p>Increase LLM performance on the task</p></li></ol><p>Given the pricing pressure from open-source contributions (e.g., DeepSeek, Kimi, and Qwen), closed-source model providers will never be able to compete on #1. #2 also runs counter to the current scaling of AI models &#8212; to increase test-time compute (and thus make the LLM useful for more difficult tasks), we by definition have to increase the number of tokens used. Hence, LLM providers competing in the enterprise have started to converge on #3 &#8212; improving LLM performance on the given task.</p><p>Now, <strong>there are two additional levers that an LLM provider can pull to improve the performance of the model on a specific task:</strong></p><ol><li><p><strong>Make the model smarter overall</strong></p></li><li><p><strong>Customize the model for that task</strong></p></li></ol><p>Broadly, Google has taken the first approach, with Gemini models consistently leading the pack in intelligence (particularly the new Gemini 2.5 Pro Deep Think model). OpenAI would struggle mightily to compete along this dimension because Google has such massive advantages in scale &#8212; it has access to ridiculous amounts of compute and has indexed virtually all of the world&#8217;s data. A lead in algorithms is not a durable moat, given how quickly inventions diffuse through Silicon Valley; and since model performance is a function of algorithms, data, and compute, Google will maintain a decisive lead here.</p><p>Meanwhile, Anthropic has taken the second approach, specializing its models for code using targeted reinforcement learning and building the Claude Code agentic harness. This is the lowest-hanging fruit for specialized models, given that code is the domain in which today&#8217;s LLMs perform best.
Since Anthropic already has a large lead here, this leaves OpenAI with two choices: find a less obvious niche for which it can start customizing its models, or compete directly with Anthropic in the coding space, where its competitor already has a large advantage.</p><p>From the above analysis, we can see that OpenAI has a large lead in the consumer market with durable moats, and that in order to strengthen those moats, OpenAI needs to improve the UI/UX of its models by making them simpler and more enjoyable to use. By contrast, to compete in the enterprise market, OpenAI would need to either produce the smartest model (where it is at a disadvantage compared to Google) or customize its models for targeted use cases (where it is at a disadvantage compared to Anthropic in the most obvious market of coding agents).</p><h1>Conclusion - GPT-5 as AI for the Common Man</h1><p>Now, let&#8217;s wrap up this argument.</p><p>We have seen that OpenAI has a large and commanding lead in the consumer market, with a low and shrinking share of the enterprise market. We have shown that it has solid, defensible moats in the consumer market, and that it is at a strong technical disadvantage in the enterprise market. We have also established that prioritizing consumers means improving model UI/UX, while prioritizing enterprise means improving model performance and specialization. Lastly, from the opening paragraphs, we established that OpenAI deliberately did not make the highest-performing model possible.</p><p>Instead, it prioritized reducing hallucinations and streamlining the model selection process in ChatGPT. Both of these changes significantly improve the consumer experience: confabulations can destroy consumer trust and erode brand advantages, while the old model picker with nearly 10 different models intimidated new users and imposed high cognitive load.</p><p>As such, the logical conclusion is that <strong>OpenAI has chosen to prioritize consumers over enterprise, and GPT-5 is the result of this choice</strong>.</p><p>Hence, over the coming years, don&#8217;t expect OpenAI to consistently lead in model performance as it has over the past 3 years. Instead, look for continuing improvements in the ChatGPT usage experience. If you want the best models overall or the best coding models, you&#8217;ll probably need to look to Google and Anthropic, respectively.</p>
]]></content:encoded></item><item><title><![CDATA[Managing Civilizational Tail Risks]]></title><description><![CDATA[401k for Kids and Communist Revolutions]]></description><link>https://www.chrishayduk.com/p/managing-civilizational-tail-risks</link><guid isPermaLink="false">https://www.chrishayduk.com/p/managing-civilizational-tail-risks</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 16 Jul 2025 14:37:43 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wBXO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cb6006-b47c-4b44-822f-53dc65eb383a_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/23cb6006-b47c-4b44-822f-53dc65eb383a_1024x1024.png" width="1024" height="1024" alt=""></figure>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23cb6006-b47c-4b44-822f-53dc65eb383a_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1918864,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/168124612?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cb6006-b47c-4b44-822f-53dc65eb383a_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wBXO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cb6006-b47c-4b44-822f-53dc65eb383a_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!wBXO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cb6006-b47c-4b44-822f-53dc65eb383a_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!wBXO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cb6006-b47c-4b44-822f-53dc65eb383a_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!wBXO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23cb6006-b47c-4b44-822f-53dc65eb383a_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While listening to <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Dwarkesh 
Patel&quot;,&quot;id&quot;:4281466,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb715ffd1-f7d7-4755-af88-c48efe647f5b_400x400.jpeg&quot;,&quot;uuid&quot;:&quot;3d2e7af2-161d-4604-a589-8f14dc9d7459&quot;}" data-component-name="MentionToDOM"></span>&#8217;s <a href="https://www.dwarkesh.com/p/stephen-kotkin?r=day7j&amp;utm_campaign=post&amp;utm_medium=web">excellent podcast</a> with <a href="https://en.wikipedia.org/wiki/Stephen_Kotkin">Stephen Kotkin</a> (go listen to it if you haven&#8217;t yet. I&#8217;ll wait.), Kotkin made a very interesting point on the dynamics that led to the Communist revolutions in Russia and China. I&#8217;ll provide an excerpt below (emphasis mine):</p><blockquote><p>[In Russia and China], you have this peasant land hunger. <strong>The peasants are often without their own holdings</strong>. They work on someone else's property, or their holdings are so small that if there's a little bit of bad weather let alone a massive drought, they're on the verge of starvation. <strong>Subsistence level agriculture is not politically stable&#8230; </strong></p><p>You need to deal with the peasant land hunger so that it becomes a stabilizing political force. You have the peasants get the land and then they have a piece of the status quo and want to retain the system, versus <strong>the peasants not having the land and they want to overthrow the system to get the land&#8230;</strong></p><p>So <strong>the peasants had their own revolution in 1917 and 1918</strong>, which  was not about the socialist parties. <strong>It's not about the Bolsheviks, it's not about Lenin, it's about the peasants seizing the land.</strong> But that creates an intense radicalism that becomes the platform for the socialists in the cities to gain and hold power in the system. You don't have that in the German case.</p></blockquote><p>Kotkin&#8217;s claim here is that a large, landless peasant class provides a powder keg, and the Communist movements in Russia and China were smart enough and opportunistic enough to capitalize on this latent instability in the system. By contrast, in Germany, the peasants were predominantly landowners, which provided the stability to weather the Communist movements. Meanwhile, in England, peasants as a class were starting to give way to the urban factory worker, who was able to progressively win voting rights and the right to organize into labor unions. </p><p>This result with a Red Russia &amp; China, a fascist Germany, and a capitalist UK seems obvious in retrospect, but for an observer in pre-WWI Europe, this would actually be an extremely surprising result. It was often assumed by Communist intellectuals (including the leaders of the Communist movement) that the global Communist Revolution would <em>begin</em> in England and Germany, which were perceived as the ripest environments for labor to overthrow the capitalist class.</p><p>So <em>why</em> did these dynamics play out so counterintuitively? To understand this, let&#8217;s analyze the sociological model implicit in Kotkin&#8217;s explanation for the causes of the Russian and Chinese revolutions. 
Broadly, I think Kotkin&#8217;s framing of the issue rests on separating the populace into two groups:</p><ol><li><p><strong>The Dampeners</strong> represent the social classes who have a stake in the prevailing order&#8212;they have this stake because they stand to <em>gain</em> from political and economic stability and to <em>lose</em> from political and economic instability. In the tsarist Russian context, this would include the tsar and his bureaucrats, as well as the large landowners.</p></li><li><p><strong>The Powder Kegs</strong> represent the social classes who feel they do <em>not</em> have a stake in the prevailing order&#8212;that is, they feel they do <em>not</em> stand to <em>gain</em> from political and economic stability and, similarly, do <em>not</em> stand to <em>lose</em> from political and economic instability. In the tsarist Russian context, as we&#8217;ve discussed, this group is most easily seen in the landless peasant class.</p></li></ol><p>The Dampeners, then, can be thought of as any groups that are currently benefiting and, importantly, <em>feel</em> as though they are benefiting from the current economic, political, and social structure. They represent the vanguard of the current society, defending its interests since the society&#8217;s interests are <em>their</em> interests as well. Their natural inclination is to <em>dampen</em> any dramatic oscillations in their society in order to defend their current interests and their perceived upside. The Powder Kegs, on the other hand, can be thought of as disenfranchised groups that do not have a stake in the nation&#8217;s future. With no downside in the face of societal collapse and no upside if society remains stable, this class is incentivized to flip the game board over and start again. In many cases, however, they will remain latent without an outside spark to <em>light</em> the powder keg.</p><p>In this framing, we can think of the Powder Kegs as generators of civilizational tail risks.</p>
<h2>An Explanation of Tail Risks &amp; Non-Ergodicity</h2><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/26eb4da9-8689-48b9-95af-b2aedfbdd04a_640x358.webp" width="640" height="358" alt="Tail Risk - What It Is and How To Hedge Against It"></figure>
<p>In the finance world, tail risks are low-probability events that cause immense losses (i.e., the left side of the normal distribution in the above image). Tail risks in finance and many other domains also tend to have the following characteristics that make them especially pernicious:</p><ol><li><p>They tend to follow a fat-tailed distribution&#8212;that is, low-probability events occur <em>more often</em> than a standard normal distribution fitted to the data (e.g., 1-day stock returns) would predict</p></li><li><p>The impact of a tail event is typically <em>much</em> higher than the impact of a more typical data point. For example, a single day when the stock market crashes can wipe out years&#8217; worth of gains.</p></li></ol><p>Given the above, in fat-tailed domains with high-impact left-tail events, we need to be especially careful about defending against (<em>ahem</em>, dampening the effects of) tail risks. And society is <em>the </em>fat-tailed domain. As difficult as forecasting in finance is, forecasting future political and social developments using history might as well be astrology. It&#8217;s highly qualitative, subject to opinion, and, as a result, suffers from human status quo biases (or &#8220;nothing ever happens,&#8221; as the Twitterverse says). Hence, we almost certainly underestimate low-probability events in human society, which makes the distribution fat-tailed. In addition, we know that the risk of tail events in human society is immense&#8212;just look at the Bronze Age Collapse or Europe after the fall of the Roman Empire.
Thus, the distribution of impactful societal events is fat-tailed <em>with</em> extremely high-impact tail events.</p><p>You may now be thinking, &#8220;Okay, these tail events are definitely a concern. But they are, by definition, low probability, so do we really need to worry about them?&#8221;</p><p>Enter non-ergodicity.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/c6d00797-ae7d-4c2a-99a3-dcded15859b4_602x509.png" width="602" height="509" alt="A simple explanation of Ergodicity in finance: Part I | by Manu H | Medium"><figcaption>This image captures the notion of spatial (ensemble) averages vs temporal (time) averages well.</figcaption></figure>
<p>To discuss non-ergodicity, we&#8217;ll start by intuitively defining ergodicity. Let&#8217;s imagine a volume, such as a box, and some time-dependent process, like the motion of particles in the box. Now, suppose we take two distinct measurements of the box:</p><ol><li><p>We fix a particular point in time and compute the average position of all particles in the box.</p></li><li><p>We fix our attention on <em>one particle</em> and track its trajectory over time.
We then compute the average of this one particle&#8217;s position from the data we record as we follow it around the box.</p></li></ol><p>In the first measure, we compute a&nbsp;<strong>spatial average&#8212;</strong>we look across the box, identify the positions of all particles within it, and then compute the average position of these particles.</p><p>In the second measure, we compute a&nbsp;<strong>temporal average</strong>&#8212;we focus on&nbsp;<em>a single particle over time</em>, identify all positions it visits, and then average&nbsp;these positions.</p><p>Mathematically, we can identify the spatial average as the standard expected value of a continuous random variable, which will be familiar if you&#8217;ve taken an intro course in statistics or probability:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n&amp;\\mathbb{E}[\\vec{x}] = \\int \\vec{x} \\cdot f(\\vec{x}) \\, d\\vec{x}\\\\\n&amp;\\text{where:}\\\\\n&amp;\\vec{x} \\text{ is a vector representing possible 3D coordinates}\\\\\n&amp;f \\text{ is the probability density function of particle positions}\n\\end{align*}&quot;,&quot;id&quot;:&quot;TKXSKGCEXP&quot;}" data-component-name="LatexBlockToDOM"></div><p>Meanwhile, the temporal average is given by the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n&amp;\\lim_{T \\to \\infty} \\frac{1}{T} \\int_0^T g(t) \\, dt\\\\\n&amp;\\text{where:}\\\\\n&amp;g \\text{ is the function describing the motion of a single particle}\\\\\n&amp;T \\text{ is the maximum time along the path}\n\\end{align*}&quot;,&quot;id&quot;:&quot;SXLEZTQTXF&quot;}" data-component-name="LatexBlockToDOM"></div><p>If these two notions of an average are equal for a given random process, then we say that the process is <strong>ergodic</strong>. Continuing our physical intuition, this equality says that, if we follow the motion of one particle in our box, it will visit all of the locations in the box in the <em>exact</em> proportions in which we would expect to find all of the particles at a <em>fixed</em> time. That is, we can use spatial and temporal averages interchangeably because <em>they are the same for ergodic processes</em>. The particle motion example was chosen deliberately, as Brownian motion (i.e., the random motion of particles in a volume) is one of the canonical physical processes exhibiting ergodic behavior.</p><p>Non-ergodic processes, then, are random processes in which the spatial average and the temporal average are <em>not</em> equal. The behavior of one data point over time does <em>not</em> match the average behavior across <em>all</em> data points at one instant in time.</p><p>Why do we care to make this distinction? Because whenever people talk about averages, they almost always default to the spatial average and then go on to use it to make predictions about particular data points. For example, people will often say, &#8220;The average yearly return of the S&amp;P 500 is 10%, and therefore I can model my portfolio as gaining 10% per year.&#8221; I hope the issue with this thinking is now clear: <strong>our hypothetical stock allocator has computed a spatial average and then applied it to one data point over time without checking whether the process is ergodic!</strong> And, spoiler alert, stock returns are a canonical example of a non-ergodic process, so this reasoning leads to dangerously faulty conclusions!</p>
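<p>To see the gap numerically, here is a minimal sketch (Python/NumPy; the 10% mean and 20% volatility are illustrative assumptions, not fitted market data) comparing the spatial average across many portfolios with the outcome of a typical single path:</p><pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n_paths, n_years = 100_000, 30

# i.i.d. yearly returns with a 10% arithmetic mean and 20% volatility
returns = rng.normal(0.10, 0.20, size=(n_paths, n_years))
final_wealth = np.cumprod(1 + returns, axis=1)[:, -1]

print(f"spatial (ensemble) mean: {final_wealth.mean():.1f}x")   # ~17.4x, i.e. 1.10**30
print(f"median single path:      {np.median(final_wealth):.1f}x")  # noticeably lower, ~10x
</code></pre><p>The ensemble mean compounds at the full 10% per year, but the path a single investor actually lives through compounds at roughly the 10% mean minus a volatility drag, which is exactly the spatial/temporal gap at issue.</p>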
<p>Let&#8217;s now explore exactly <em>why</em> the combination of underestimating tail risks and confusing ergodic with non-ergodic processes is so dangerous, using a concrete example.</p><h2>Tail Risks, Non-Ergodicity, &amp; Society</h2><p>We&#8217;ll start by building a simple model of the real GDP growth of a standard developed economy that has a latent probability of collapse. We can model this as a discrete-time process, where each year is treated as a single period. In any given year, we will estimate a 0.1% chance of ruin (i.e., the hypothetical country fails and its real GDP goes to 0). This can represent governmental collapse, nuclear war, incurable diseases that wipe out the population, etc. In the vast majority of years (with a probability of 99.9%), we&#8217;ll forecast that the country&#8217;s real GDP growth is normally distributed with a mean gain of 2.5% and a standard deviation of 0.5 percentage points.</p><p>This setup combines the main concepts of tail risks and non-ergodicity that we mentioned previously:</p><ol><li><p><strong>Tail risks &#8212; </strong>we have a 0.1% chance of ruin in any given year. So our risk is extremely rare but also very impactful, just as we&#8217;ve defined tail risks.</p></li><li><p><strong>Non-ergodicity &#8212; </strong>when a country fails (i.e., its GDP goes to 0), it doesn&#8217;t get to start over. Hence, if we follow a failed country over time, it never explores the full state space, so its temporal average will differ from the spatial average.</p></li></ol><p>Combined, these two ingredients produce <em>very</em> shocking results, even though at first glance it looks like we only have a 0.1% chance of catastrophe.</p><p>We&#8217;ll now run this simulation for 500 years and observe the effects.
In this simulation, every country will start with a GDP of $1.</p>
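<p>For readers who want to reproduce the experiment, here is a minimal sketch of the simulation as described above (Python/NumPy; the path count and seed are my own choices, so exact figures will differ somewhat from the plots in this post):</p><pre><code class="language-python">import numpy as np

rng = np.random.default_rng(42)
n_countries, n_years = 10_000, 500
p_ruin = 0.001             # 0.1% chance of total collapse in any given year
mu, sigma = 0.025, 0.005   # 2.5% mean real growth, 0.5 pp standard deviation

gdp = np.ones(n_countries)                # every country starts at $1
alive = np.ones(n_countries, dtype=bool)  # collapsed countries never restart
for _ in range(n_years):
    alive &= rng.random(n_countries) >= p_ruin
    gdp = np.where(alive, gdp * (1 + rng.normal(mu, sigma, n_countries)), 0.0)

print(f"share collapsed: {1 - alive.mean():.1%}")   # close to 1 - 0.999**500 = 39.4%
print(f"mean final GDP (spatial average): ${gdp.mean():,.2f}")  # dominated by lucky survivors
print(f"mean per-year growth factor: {(gdp ** (1 / n_years)).mean():.3f}")  # below 1: an average yearly loss
# Rerunning with p_ruin = 0.0001 drops the collapse share to roughly 5%.
</code></pre>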
tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I want to start by directing your attention to the left side of the graph. You&#8217;ll see that <strong>a full</strong> <strong>39.7% of countries in this simulation collapsed, with their GDPs going to 0. </strong>Our initial intuition that we had a 0.1% chance of catastrophe was <em>incredibly </em>wrong&#8212;we have a 0.1% chance of catastrophe in any given year, but the chance that society fails in <em>any</em> year increases as time advances! Rather than saying &#8220;we have a 0.1% chance of catastrophe&#8221;, which sounds relatively benign and pushes people towards complacency, it would be far more accurate to say &#8220;we have a 40% chance of catastrophe in the next 500 years&#8221;, which sounds much more dire.</p><p>Now, take a look at the legend in the top-right corner of the graph. Here, you can consider the mean to be the spatial average we&#8217;ve been discussing&#8212;hence, in the vast majority of conversations where this data is getting discussed, someone would likely say &#8220;the average country over this 500-year sample saw its GDP grow from $1 to $12,032.61, a gain of 1,203,160%&#8221;. Hence, the reported &#8220;average&#8221;, the spatial average, actually doesn&#8217;t capture the experience of living in these conditions at all! You would think, based on seeing an average gain of over 1 million percent, that all or most countries developing under these conditions would be on the march to utopia. However, these countries only have barely better than a coin flip&#8217;s chance of not collapsing into complete chaos in this 500-year sample!</p><p>To better understand the dynamics of this process, we need to instead look at the temporal average. We can do this by computing the geometric mean of returns for each individual path and then averaging this set of geometric returns. This yields an average yearly <em>loss </em>of 39.5%, with an expected ending value of $0, aligning much more closely with the collapse dynamics we see in the histogram. 
<p>This brings us to an even broader insight, which we can see in the plot below.</p><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/605a0ac2-3e2b-4c6c-a604-2d5d8e19a279_1000x600.png" width="1000" height="600" alt=""></figure>
pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As time moves forward in our sample, <strong>more and more of the countries are collapsing</strong>. Taking this to the limit, eventually, all societies in this simulation will collapse. Even more generally, when the probability of catastrophic ruin is greater than 0, <strong>any society</strong> <strong>will collapse given enough time.</strong></p><p>To tie things back into the beginning of our discussion, what Kotkin has identified in his analysis of the latent instability in the 20th-century Russian and Chinese societies is that <strong>the presence of disenfranchised peasant groups increased the yearly chance of ruin for these societies, and thus led to their demises. </strong>Combining with our analyses above, we see that when one or more Powder Keg groups exist in a society (making the probability of ruin non-zero), <strong>the long-run survival probability of that civilization is exactly 0. </strong>The more such Powder Kegs exist, and the more groups that appeal to these groups, the greater the probability of collapse in a given year.</p><p>On the flip side, if we re-run the simulation with an order of magnitude lower probability of ruin (i.e., a 0.01% risk of collapse per year), we see a <em>huge </em>reduction in risk in the 500-year window. Only 4.7% of societies in this scenario collapse, reducing our risk of ruin by 35 percentage points.</p><p>Given these analyses and the conclusions we&#8217;ve drawn, it would seem that identifying potential Powder Kegs and implementing policies to let off some steam are some of <em>the </em>most important actions a government can undertake. If we can identify these groups and pacify them, we can substantially improve the survival probability of our society over large timescales and avoid outright collapse. The late Russian tsars and Chinese emperors failed to identify and pacify their Powder Keg group. 
The Communist parties in those countries, however, did not overlook the peasant Powder Keg, and were thus able to leverage it to turn each country&#8217;s risk of ruin into reality.</p><p>With this lesson in mind, let&#8217;s look at a Powder Keg group that is roughly analogous to the landless peasants of tsarist Russia and examine some ways we can defuse the impending explosion.</p><h2>A 21st Century Powder Keg &amp; Its Potential Solutions</h2><p>To identify our Powder Keg, let&#8217;s return to our definition of the group: we need to find a social class that doesn&#8217;t have a stake in the prevailing political order. That is, <a href="https://www.theguardian.com/society/ng-interactive/2025/jul/13/first-time-us-homebuyers-low">we</a> <a href="https://news.gallup.com/poll/266807/percentage-americans-owns-stock.aspx">need</a> <a href="https://planning.org/foresight/trend/9310309/#:~:text=Millennials%20and%20Gen%20Z%20combined,Moyo%20Studio%2FGetty%20Images.">to</a> <a href="https://smartasset.com/financial-advisor/wealth-by-generation">find</a> <a href="https://www.axios.com/2024/06/22/gen-z-homeownership-rent-prices">a</a> <a href="https://investors.redfin.com/news-events/press-releases/detail/1032/redfin-reports-gen-zs-homeownership-rate-stagnated-in">group</a> that doesn&#8217;t stand to gain from the country&#8217;s success since it doesn&#8217;t own a stake in the country, and also has the <a href="https://www.reddit.com/r/FirstTimeHomeBuyer/comments/17j9n44/is_gen_z_screwed/">subjective</a> <a href="https://www.reddit.com/r/GenZ/comments/1aqegwp/i_shocked_my_dad_yesterday_when_i_told_him_most/">feeling</a> that it will <em>never </em>achieve such a stake in the current system. Hmm&#8230; where can we find such a group?</p><p>If you clicked on any of the myriad links in the previous paragraph, I think it&#8217;s clear that Gen Z fits the bill&#8212;they are currently dealing with extremely high housing costs that have put home ownership out of reach for many members of that generation (at least in the cities with the jobs they&#8217;d want to work and the places they&#8217;d want to live). Most importantly, this generation <em>feels</em> like things will never get better. There is a very strong, pervasive sense that society is fundamentally set up so that they won&#8217;t succeed. We can see this dynamic starting to play out politically: in the last presidential election, <a href="https://now.tufts.edu/2024/11/12/young-voters-shifted-toward-trump-still-favored-harris-overall">Gen Z shifted strongly toward Trump</a>, who openly espoused the need to tear down the current economic and social order to &#8220;Make America Great Again&#8221;. We are also seeing it in New York City, where Gen Z voters are turning out in strong numbers <a href="https://www.amny.com/news/gen-z-voters-mamdani-primary-victory-2025/">to vote for Zohran Mamdani</a>, the once&#8209;underdog&#8209;turned&#8209;favorite mayoral candidate who holds some <a href="https://thehill.com/homenews/house/5377426-mike-lawler-zohran-mamdani-capitalism/">fairly openly anti-capitalist views</a>.</p><p>Whether or not the issues Gen Z faces truly require a radical overthrow of the prevailing social order, much of Gen Z certainly <em>feels</em> that this is the case. And that is enough for a radical movement to leverage this group as the bedrock for a revolution, just as the Communists did in 20th-century Russia and China.
This means that, to protect our civilization from collapse, we have to resolve the issues they&#8217;re facing and give them (and future generations) a stake in the current social order.</p><p>The way I see it, having a stake in American society as it&#8217;s currently constructed essentially amounts to two things:</p><ol><li><p>Owning a home</p></li><li><p>Owning a piece of the stock market</p></li></ol><p>Owning a home is the traditional marker of success in the United States, and so the inability to buy causes immense stress for younger generations. It makes them feel shut out from the typical path to prosperity that their parents and grandparents followed. And, as we&#8217;ve discussed, simply <em>feeling</em> that there is no hope is enough to create a latent layer of unrest, regardless of its truth. By providing a pathway to home ownership, we could reduce the anxiety and anger directed at the US social order. In addition, home ownership typically connects people to a community more than renting does, as they can put down more permanent roots, which could also help assuage the social isolation felt by many in the younger generations.</p><p>Similarly, the more modern view of success amounts to having significant capital in the US stock market. Gen Z generally resents US companies and CEOs, largely out of a feeling of being &#8220;robbed&#8221; or taken advantage of by these companies. This feeling is likely amplified by their lack of ownership in the stock market&#8212;if they owned a slice of all of these corporations, it would be much harder to feel robbed, since Gen Z would, in turn, be getting richer from their success.</p><p>So we now have a two-pronged approach&#8212;give young people a vested interest in US corporations and a vested interest in the housing market. Both of these assets will reduce the desire for societal disruption and instead incentivize stability and growth, since younger people would now be directly harmed by disruption and directly benefit from growth.</p><p>What would policies targeting these two areas look like?</p><h3>Housing Policy</h3><p>To defuse the Gen Z Powder Keg through housing, we must confront the sacred cows of American urban planning and economic policy head-on, adopting measures that prioritize radical supply-side interventions over the palliative demand-side subsidies that have only inflated bubbles and entrenched resentment. The goal here is not mere affordability tweaks but a structural overhaul that floods the market with housing units, crashing prices to levels where ownership becomes a default expectation rather than a distant dream, thereby converting potential revolutionaries into vested stakeholders who crave stability to protect their newfound equity.</p><p><strong>First, deregulation stands as the linchpin</strong>: easing zoning laws, building codes, and permitting processes to unleash denser, faster development. In cities like San Francisco or New York, where Byzantine zoning ordinances have artificially constricted supply, we could emulate the Houston model on steroids. Imagine upzoning entire neighborhoods to allow mid-rise apartments and mixed-use developments without the endless environmental impact studies or community vetoes that currently serve as de facto barriers erected by incumbent homeowners (our modern Dampeners) to preserve their scarcity-driven windfalls.
By slashing approval timelines from years to months, we'd not only rapidly multiply the housing stock but also incentivize innovation in modular construction and prefabrication, driving down costs through economies of scale. This isn't about anarchic sprawl; it's about recognizing that the current regulatory thicket is a form of crony capitalism, protecting legacy interests at the expense of intergenerational equity. The contrarian insight? Such deregulation would disproportionately benefit the high-density urban cores where Gen Z wants to live, fostering vibrant, walkable communities that counteract social atomization and turn isolated renters into networked homeowners with skin in the civic game.</p><p>Complementing this, <strong>implement a land value tax (LVT) while eliminating traditional property taxes.</strong> Drawing from Georgist economics, an LVT taxes the unimproved value of land itself, not the structures or improvements upon it. This shifts the fiscal burden onto speculative land hoarders and absentee owners, who currently profit from scarcity without contributing to productivity, and rewards those who build densely and efficiently. In practice, this would discourage vacant lots in prime areas (a plague in places like Los Angeles) and encourage vertical development, as the tax incentivizes maximizing output per square foot. Eliminating property taxes, which penalize improvements, removes the disincentive to renovate or expand, further accelerating supply. The net effect? Housing prices fall as land speculation collapses, making entry-level ownership feasible even for baristas or entry-level coders. Critically, this policy is revenue-neutral or even revenue-positive for governments, funding infrastructure without the distortionary effects of income or sales taxes.</p><p>Next, <strong>pursue a targeted reduction in union power within the construction sector to slash labor input costs.</strong> Unions, while historically vital for worker protections, have in many blue states metastasized into rent-seeking guilds that inflate wages far above market rates through prevailing wage laws and project labor agreements, adding 20-30% premiums to public projects alone. By reforming or outright repealing these mandates (perhaps via federal preemption for builds affecting interstate commerce), we could introduce competitive bidding from non-union labor, including skilled immigrants under expanded visa programs. This isn't anti-labor animus; it's a recognition that artificially high costs perpetuate the very inequality unions purport to fight, locking younger workers out of affordable homes while padding the pensions of Boomer incumbents. The result? Faster builds at lower prices, with spillover effects into related industries. Tie this to broader labor market fluidity (e.g., right-to-work expansions), and we create a virtuous cycle where construction booms absorb underemployed Gen Z workers, giving them immediate income upside alongside future ownership stakes.</p><p>Finally, <strong>ramp up domestic energy production to crater energy input costs, which permeate every stage of housing development from raw materials extraction to HVAC installation</strong>. Fracking, nuclear deregulation (streamlining NRC approvals), and offshore drilling expansions could flood the market with cheap BTUs, reversing the green energy mandates that have jacked up costs via intermittent renewables and grid unreliability.
By prioritizing baseload abundance, we not only halve construction energy bills but also enable all-electric homes that are cheaper to operate, further lowering the ownership threshold. In our tail risk framework, this policy acts as a multiplier: cheaper energy stabilizes the broader economy, reducing the probability of ruin from exogenous shocks like oil crises, while directly powering the housing supply surge that integrates Gen Z as Dampeners.</p><p>Implemented holistically, these policies could halve median home prices in high-demand areas within a decade, per supply elasticity models, transforming Gen Z's latent rage into conservative impulses. After all, nothing quells revolutionary fervor like a mortgage and rising property values.</p><h3>Stock Market Policy</h3><p>Shifting to the equity side, the objective is to forge a direct umbilical cord between young Americans and the productive engine of capitalism: the stock market. Rather than the anemic 401(k) matches or tax-advantaged IRAs that presuppose stable employment and disposable income, we need audacious, universal mechanisms that inject ownership stakes from cradle to career, preempting the feeling of disenfranchisement that fuels Powder Kegs.</p><p>The recently enacted MAGA (Money Accounts for Growth and Advancement) accounts (rebranded as Trump Accounts in the final text of the One Big Beautiful Bill) are a strong starting point. These accounts automatically seed every eligible newborn with a $1,000 federal contribution (under a pilot for births from 2024 to 2028), invested in diversified funds tracking U.S. equity indices like the S&amp;P 500, with tax-advantaged growth and additional contributions capped at $5,000 annually from taxable entities or unlimited from tax-exempt organizations. Access is tiered&#8212;no withdrawals until age 18, then partial withdrawals for qualified purposes like education, entrepreneurship, or homebuying, expanding at 25 and becoming fully unrestricted at 30&#8212;with favorable long-term capital gains taxation on qualified distributions. These accounts instill from infancy a visceral sense of ownership in corporate America, equating personal gain to national prosperity. To secure broader political alignment and ensure the program's longevity beyond electoral cycles, however, a further rebranding is essential, shedding the partisan connotations of "MAGA" or "Trump Accounts" in favor of a neutral, inclusive name like "American Future Funds" or "National Prosperity Accounts." Such a move transcends left-right divides, framing the initiative as a non-ideological investment in collective stability; by inviting buy-in from all societal factions, we mitigate the risk of future repeals or sabotage, transforming it into an enduring institution that unites rather than polarizes, much as Social Security evolved from its contentious origins into a bipartisan bedrock.</p><p>However, these accounts don&#8217;t go far enough: the $1,000 seed is paltry against compounding&#8217;s long horizons (on its own, it barely reaches $3,400 by age 18 at historical 7% real returns), eligibility skews toward newborns and young children (leaving Gen Z and millennials largely sidelined without retroactive catch-ups), contribution limits throttle broader participation, and the delayed access risks alienating a generation already primed for immediate disruption over deferred gratification.</p>
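<p>A quick back-of-the-envelope check of that compounding claim (a sketch assuming a constant 7% real return and end-of-year contributions; purely illustrative):</p><pre><code class="language-python">seed, r, horizon = 1_000, 0.07, 18

# The seed alone by age 18:
seed_only = seed * (1 + r) ** horizon  # ~ $3,380

# If the $5,000 annual contribution cap were actually hit every year:
contribs = sum(5_000 * (1 + r) ** (horizon - y) for y in range(1, horizon + 1))
print(round(seed_only), round(seed_only + contribs))  # ~ 3380, ~ 173000
</code></pre>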
<p>To truly dampen tail risks, we must expand them aggressively: scale the initial deposit to $5,000, extend automatic enrollment with catch-up lumps ($10,000+) for 18-35-year-olds funded by reallocating corporate tax windfalls from buybacks to "citizen shares," eliminate age-based restrictions in favor of penalty-enforced long-term holds, and broaden investment mandates to include total-market ETFs for true diversification. This supercharged version subverts the Marxist narrative of "exploitation by capitalists" by making everyone a micro-capitalist, diluting class warfare into shared prosperity. In tail risk terms, widespread equity ownership dampens systemic shocks; if a critical mass holds stakes, political pressures shift from "eat the rich" to "protect the market," reducing the annual ruin odds by fostering a Dampener supermajority.</p><p>To amplify this foundation even further, redirect any universal basic income (UBI) pilots or proposals straight into mandatory stock purchases for recipients under an expanded "Universal Basic Equity" (UBE) framework, rather than cash disbursements that dissipate into consumption traps. Imagine channeling equivalent funds (say, $1,000 per month) into the augmented MAGA/Trump Accounts or similar diversified portfolios, perhaps via Vanguard-esque vehicles with algorithmic rebalancing. This forces skin in the game; recipients become vested in corporate efficiency and innovation, as their &#8220;income&#8221; derives from dividends and appreciation rather than zero-sum transfers.</p><p>Collectively, these stock policies embed a profound incentive alignment: young people gain from US economic ascendance, viewing disruptions&#8212;like tariffs or regulations&#8212;as direct threats to their wealth. This vested interest in continued success transmutes Powder Kegs into Dampeners, slashing civilizational ruin probabilities and extending our society's survival horizon indefinitely.</p><h2>Conclusion</h2><p>In revisiting Kotkin's insights on the peasant-driven upheavals that toppled empires in Russia and China, we've unearthed a timeless sociological truth: societies teeter on the edge of ruin when disenfranchised masses, our Powder Kegs, accumulate without recourse, their latent volatility amplified by opportunistic sparks. Through the lens of tail risks and non-ergodicity, we've modeled this as an inexorable march toward collapse in any system where the annual probability of catastrophe exceeds zero, where spatial averages seduce us into complacency while temporal realities reveal the fragility of civilizational paths. Gen Z, as the 21st-century analog to those landless peasants, embodies this threat not through malice but through a profound sense of exclusion from the American dream's twin pillars: homeownership and equity participation.</p><p>The policies outlined&#8212;radical housing deregulation paired with Georgist taxation, union reforms, and energy abundance to crash barriers to entry; alongside expansions of rebranded, universal stock accounts and equity-directed UBI to democratize capital&#8212;are strategic dampeners designed to integrate the alienated into the status quo's vanguard. By granting tangible stakes in stability, we slash that ruin probability, extending our society's viable horizon from inevitable doom to indefinite prosperity. 
This represents a first attempt at pragmatic ergodic engineering: in a fat-tailed world, the true revolutionaries are those who preempt the keg's ignition, ensuring that history's surprises favor continuity over cataclysm. Ignore this at our peril; embrace it, and we might just forge history&#8217;s first immortal civilization.</p>]]></content:encoded></item><item><title><![CDATA[On Implicit and Explicit Learning]]></title><description><![CDATA[Some insights from my time learning languages]]></description><link>https://www.chrishayduk.com/p/on-implicit-and-explicit-learning</link><guid isPermaLink="false">https://www.chrishayduk.com/p/on-implicit-and-explicit-learning</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 09 Jul 2025 23:32:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FnuP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa745d1ec-6c4b-48de-bd1f-305c43f16313_1358x1920.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve been an off-and-on language learner for the last decade, and, after a brief hiatus over the last couple of years, I have returned to study Spanish and Mandarin in earnest.</p><p>Returning to the language learning grind has me reflecting on different modes of learning, their interplay, and how people often get them wrong. To provide a jumping-off point, I&#8217;m going to define two broad modes of learning:</p><ol><li><p><strong>Explicit Learning</strong>, which we can consider to be any learning that requires the teaching of rules or procedures. This type of learning is often accompanied by a textual explanation of the procedure and some examples to practice applying that procedure. Most subjects in school are taught this way: e.g., learning a new approach in math or working through a grammar workbook in Spanish class.</p></li><li><p><strong>Implicit Learning,</strong> which we can consider to be any learning that occurs as a mostly unconscious process. A helpful rule of thumb is that any learned material that requires a response in seconds falls under the purview of implicit learning - at that time scale, there isn&#8217;t enough time to work through a procedure (and hence to use the tools of explicit learning).</p></li></ol>
<p>Now, anyone who has learned a language in school (or read my not-so-subtle leading example in the Explicit Learning section) knows that explicit learning is the <em>modus operandi</em> of foreign language instruction in the United States. Virtually the whole process occurs in a grammar textbook where the student is shown some concept, such as the conjugation of verbs ending in -ar in the present tense in Spanish. The student then reads through the &#8220;logic&#8221; of how this grammar concept works and completes a set of examples in order to &#8220;apply&#8221; the procedure.</p><p>I hope the scare quotes I used above made it abundantly clear how ridiculous I think it is to treat language learning as logical or procedural. In fact, there is <em>no</em> logic to language grammar since it did not grow out of a logical process! Language grew organically alongside humans, and so it has all the same messy, non-logical characteristics that we see in all things biological and anthropological. This is why every grammar &#8220;rule&#8221; that is taught has dozens, if not hundreds, of exceptions! We&#8217;re applying a method of learning suitable for math to language learning, a field that categorically does <em>not</em> follow a logical structure with teachable procedures!</p><p>(Trigger warning for anyone traumatized by this teaching method: I am about to show an example of one of these accursed grammar worksheets.)</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!FnuP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa745d1ec-6c4b-48de-bd1f-305c43f16313_1358x1920.jpeg" alt="19 Spanish Verb Worksheets - Free PDF at worksheeto.com"><figcaption class="image-caption">Don&#8217;t say I didn&#8217;t warn you.</figcaption></figure></div><p>Now you might be thinking: &#8220;Okay, so easy fix &#8212; we just move all of language learning and any other similar fields to implicit learning. Explicit learning remains reserved for the logical subjects of math and physics. 
QED.&#8221; In this paradigm, <em>all</em> of the non-rigorous subjects would eschew explicit learning, and <em>all</em> of the rigorous subjects would focus <em>only</em> on explicit learning.</p><p>However, despite this aligning more with the natural contours of each subject, we would be leaving potential learning speed on the table by sticking only to implicit or explicit learning. I will give two examples to illustrate the point:</p><ul><li><p><strong>Language Learning (Implicit Learning Paradigm) &#8212; </strong>Suppose you are learning a foreign language and, like the good language learning student we&#8217;ve outlined thus far, you&#8217;re doing it fully through immersion (that is, implicit learning). You avoid all grammar explanations and worksheets, and instead spend all of your time listening to podcasts, reading books, and watching YouTube videos in your target language. If you come across a particular grammar construction that you don&#8217;t understand, you just try your best to understand it in context and keep immersing. With this setup, it might take <em>dozens of hours</em> of immersion to understand the grammar concept, particularly for rare constructions or variants of the same construction that look unrelated on the surface. Meanwhile, reading about the grammar topic and doing some exercises might take under an hour and prime you for your immersion.</p></li><li><p><strong>Mathematics (Explicit Learning Paradigm) &#8212; </strong>Now suppose that you&#8217;re a math student and, like the good math student we&#8217;ve outlined thus far, you&#8217;re learning entirely by working through derivations and proofs (that is, explicit learning). You avoid any form of memorization or intuition building, and instead spend all of your time focusing on logically deriving all of your course content. Now, as the course moves on, it will assume knowledge that you learned earlier in the semester. However, since you didn&#8217;t bother to do any memorizing or intuition-building, you don&#8217;t remember those earlier concepts with any sort of automaticity! You need to rederive all the knowledge from scratch, and, after the course material builds up beyond a certain point, the cognitive load of this derivation becomes too much, and you start to fall behind.</p></li></ul><p>As the two above examples show, even in subjects that are skewed towards one of the learning paradigms, the other still has an important role to play.</p><p>In implicit-dominant subjects, we can use explicit learning to identify or study patterns that we would like to notice <em>during</em> the implicit skill development. This would be like a language learner studying a grammar concept (explicit learning) to improve their understanding during immersion (implicit learning), or a basketball player breaking down defensive coverages in film (explicit learning) so he can make better reads during scrimmages and games (implicit learning).</p><p>In an explicit-dominant subject, we can use implicit learning to memorize and internalize concepts and procedures without worrying about their logical derivation. 
This would be like a mathematics student memorizing common integrals (implicit learning) to make it easier to understand the logical derivations of concepts that use integration, such as probability distributions (explicit learning).</p><p>Thus, broken down this way, learning any subject can be reduced to a three-part problem:</p><ol><li><p>Where can you use explicit learning techniques in this subject?</p></li><li><p>Where can you use implicit learning techniques in this subject?</p></li><li><p>What should the balance and interplay of these learning paradigms be in this subject?</p></li></ol><p>If you nail these three questions for your particular subject, then you will have achieved optimal learning speed and efficiency.</p>]]></content:encoded></item><item><title><![CDATA[Gemini 2.5 Pro: How Data + Compute Moats Beat Algorithmic Tweaks]]></title><description><![CDATA[Gemini 2.5 Pro and Google's path to AI supremacy]]></description><link>https://www.chrishayduk.com/p/google-takes-the-lead-in-the-ai-race</link><guid isPermaLink="false">https://www.chrishayduk.com/p/google-takes-the-lead-in-the-ai-race</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Mon, 14 Apr 2025 21:08:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6Mpj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The race towards Artificial General Intelligence (AGI) and state-of-the-art AI models is often framed around breakthrough algorithms and novel architectures. However, a deeper analysis reveals that the true drivers of durable leadership lie elsewhere. While algorithmic innovation is crucial, the path to AI supremacy is increasingly paved with massive datasets and unparalleled computational power. 
When viewed through this lens, Google DeepMind emerges not just as a competitor, but as the likely frontrunner.</p><h3>The Trifecta of AI Progress: Algorithms, Compute, and Data</h3><p>Training large-scale AI models hinges on three interdependent pillars:</p><ol><li><p><strong>Algorithms:</strong> These are the recipes, the architectures (like Transformers, Mixture-of-Experts), and the training methodologies (loss functions, optimization techniques) that dictate how effectively models learn patterns and relationships from data. Efficient algorithms extract more "knowledge" per unit of data and compute.</p></li><li><p><strong>Compute:</strong> This represents the raw processing power, typically measured in FLOPs (floating-point operations), required to execute the vast number of calculations involved in training deep neural networks. It's the energy input transforming potential into a trained artifact (a rough sizing sketch follows below).</p></li><li><p><strong>Data:</strong> This is the raw material &#8211; the text, images, code, audio, video, and other modalities &#8211; from which the model learns the structure of the world, language, and reasoning. The quality, quantity, and diversity of data fundamentally shape the model's capabilities.</p></li></ol><p>These factors exhibit strong interplay. An algorithmic leap, like the transition from RNNs/LSTMs to Transformers for sequence modeling, unlocked the potential to effectively utilize vastly larger datasets and compute budgets.</p>
<p>Before Transformers, training on web-scale text data with massive parameter counts often hit diminishing returns due to limitations in handling long-range dependencies and parallelization. The Transformer architecture, with its self-attention mechanism, was significantly more scalable, allowing marginal increases in data and compute to translate into tangible performance gains once more. The performance wasn't just better; the <em>scaling properties</em> improved.</p><h3>The Illusion of Algorithmic Moats</h3><p>Recent history is replete with examples emphasizing algorithmic prowess. The excitement around models like DeepSeek-R1, achieving remarkable performance with comparatively modest training resources, underscores the power of efficient architectures (like Mixture-of-Experts) and optimized training strategies. It proves that clever algorithms <em>can</em> significantly improve the compute/data-to-performance ratio.</p><p>However, as I argued previously in <em><a href="https://www.chrishayduk.com/p/on-algorithmic-moats-and-the-path">On Algorithmic Moats and the Path to AGI</a></em>, algorithms alone do not constitute a sustainable competitive advantage in the current AI landscape. Why?</p><ol><li><p><strong>Talent Mobility:</strong> The AI research community is fluid. Top researchers frequently move between major labs like Google DeepMind, OpenAI, Anthropic, and Meta, carrying conceptual knowledge and insights about successful (and unsuccessful) architectural experiments and training techniques. While NDAs exist, the fundamental <em>ideas</em> diffuse rapidly.</p></li><li><p><strong>Open Source and Publication:</strong> Key players like Meta (LLaMA series) and innovative teams like DeepSeek often open-source their models and research. Academic institutions and arXiv ensure rapid dissemination of novel techniques. This accelerates the entire field but levels the playing field algorithmically. A breakthrough published today can be replicated and built upon by competitors within months, if not weeks.</p></li></ol><p>Therefore, relying solely on being the <em>first</em> to discover the next architectural tweak is a fragile strategy. Being a fast follower, capable of rapidly implementing and scaling proven algorithmic advances discovered elsewhere, might be just as effective, <em>provided</em> you possess advantages in the other two factors.</p><h3>The Real Moats: Data and Compute Scale</h3><p>If algorithms are becoming increasingly commoditized, what provides a durable edge? The answer lies in the factors that are far harder to replicate: <strong>data and compute.</strong></p><p><strong>Why Scale Matters:</strong> The principle of scaling laws in deep learning empirically demonstrates that model performance often improves predictably, following a power law, as model size, dataset size, and training compute increase. While we've seen impressive results from smaller, efficient models, we are likely still far from the point of diminishing returns for many complex reasoning and multimodal tasks. Reaching the next plateau of AI capability will almost certainly require scaling data and compute far beyond current levels.</p>
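<p>To make the power-law claim concrete, here is a minimal sketch. The functional form, loss falling as a power of compute toward an irreducible floor, is standard in the scaling-laws literature; the specific constants below are invented for illustration, not fitted values from any paper:</p><pre><code># Illustrative power-law scaling of loss with training compute:
#   L(C) = L_inf + a * C**(-alpha)
# The functional form is standard; these constants are made up.

L_INF = 1.7   # assumed irreducible loss
A = 12.6      # assumed scale coefficient
ALPHA = 0.05  # assumed compute exponent

def loss(compute_flops):
    return L_INF + A * compute_flops ** -ALPHA

for c in (1e22, 1e24, 1e26):
    print(f"C = {c:.0e} FLOPs -> loss ~ {loss(c):.2f}")
</code></pre><p>Each 100x jump in compute keeps buying a real, predictable improvement, just a shrinking one, which is why the advantage accrues to whoever can afford the next order of magnitude.</p>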
<p><strong>Why They Are Moats:</strong></p><ol><li><p><strong>Non-Portability:</strong> Unlike algorithmic knowledge, engineers cannot easily take petabytes of proprietary, curated internal data or access to tens of thousands of specialized accelerators (like TPUs or GPUs) with them when they change jobs.</p></li><li><p><strong>High Barrier to Entry:</strong> Building world-class compute infrastructure (data centers, custom silicon, high-speed interconnects) and accumulating diverse, high-quality datasets at the scale required represents billions of dollars in capital expenditure and years, often decades, of cumulative effort and investment. This is not something startups or even well-funded competitors can easily replicate overnight.</p></li><li><p><strong>Synergistic Flywheels:</strong> Access to vast compute allows for more ambitious experiments and training larger models. These improved models, when deployed, can generate new, valuable interaction data, which feeds back into further model improvements, creating a virtuous cycle that is difficult for competitors with lesser resources to match.</p></li></ol><p><strong>Gemini 2.5 Pro: A Glimpse of the Advantage</strong></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!kvQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg" alt="r/singularity - Gemini 2.5 Pro benchmarks released"></figure></div><p>Gemini 2.5 Pro Experimental, recently released by Google, offers a glimpse into how these interacting factors of <strong>data</strong> and <strong>compute</strong> will lead to a durable advantage in Google&#8217;s AI model performance. 
Despite OpenAI and DeepSeek releasing highly performant thinking models months in advance of Google (representing a large lead in algorithmic innovations), Gemini 2.5 Pro has managed to score #1 across the board in Chatbot Arena and across a wide range of benchmarks.</p><p>While Google describes Gemini 2.5 Pro partly through algorithmic concepts like "thinking models," the sheer breadth and depth of its capabilities, validated by both benchmarks and human preference, strongly suggest that these algorithms are being scaled and refined using computational resources and data diversity that few, if any, competitors can match. The "significantly enhanced base model" (as described by Google) is almost certainly a product of larger parameter counts trained for longer durations on more diverse data, enabled by Google's vertical integration of hardware (TPUs) and software within their hyper-scale data centers.</p><h3>Google's Unassailable Advantage</h3><p>This brings us to Google. When assessing data and compute advantages, Google stands in a league of its own.</p><p><strong>1. Data Dominance:</strong></p><ul><li><p><strong>Breadth and Modality:</strong> Google possesses arguably the most diverse and extensive collection of multimodal data on the planet. Consider the sources:</p><ul><li><p><strong>Google Search:</strong> Billions of daily queries provide unparalleled insight into human intent, language variation, and real-time information needs (text, images, implicit semantics).</p></li><li><p><strong>YouTube:</strong> The world's largest video platform offers vast amounts of video, audio, transcripts, comments, and multilingual content &#8211; crucial for multimodal understanding.</p></li><li><p><strong>Android:</strong> Interaction data from billions of devices provides insights into user behavior, application usage, and sensor inputs (potentially anonymized and aggregated).</p></li><li><p><strong>Google Maps:</strong> Geospatial data, satellite imagery, Street View imagery, reviews, and real-time traffic information.</p></li><li><p><strong>Gmail, Docs, Workspace:</strong> While respecting user privacy is paramount, Google potentially has access (for internal R&amp;D, aggregated/anonymized analysis, or opt-in features) to colossal amounts of text, code, and collaborative data reflecting professional and personal communication patterns.</p></li><li><p><strong>Google Books:</strong> A massive corpus of digitized text spanning centuries.</p></li><li><p><strong>Chrome:</strong> Web interaction data (aggregated and anonymized) reflecting how users navigate and consume information online.</p></li></ul></li><li><p><strong>Scale and Freshness:</strong> The sheer volume is staggering, but equally important is the constant influx of <em>new</em> data, keeping datasets fresh and reflecting current events, language evolution, and emerging trends. This continuous stream is vital for maintaining model relevance and accuracy.</p></li></ul><p><strong>2. Compute Superiority:</strong></p><ul><li><p><strong>Custom Silicon (TPUs):</strong> Google made a strategic bet on custom AI accelerators years ago with its Tensor Processing Units (TPUs). Now in their 7th generation, TPUs are designed specifically for large-scale ML training and inference, offering potentially significant advantages in performance-per-watt and performance-per-dollar <em>for Google's specific workloads and scale</em> compared to general-purpose GPUs. 
This vertical integration allows hardware and software co-design for optimal efficiency.</p></li><li><p><strong>Infrastructure Mastery:</strong> Google operates some of the world's most sophisticated and efficient data centers. Decades of experience in distributed systems (MapReduce, Borg/Kubernetes, Spanner) translate into an unparalleled ability to orchestrate and execute massively parallel training jobs reliably and efficiently across thousands of accelerators. This isn't just about owning chips; it's about the networking fabric, power delivery, cooling, and system software that make large-scale training feasible.</p></li><li><p><strong>Capital Investment:</strong> Google has the financial resources to sustain and expand this infrastructure lead, continuously investing billions in data centers and next-generation TPUs.</p></li></ul><h3>Conclusion: The Inevitable Frontrunner?</h3><p>While the AI race is far from over, and competitors like OpenAI and Anthropic continue to innovate, the fundamental dynamics favor players with entrenched advantages in data and compute. Algorithmic breakthroughs will continue to happen across the ecosystem, but they diffuse quickly. The ability to <em>scale</em> these algorithms using proprietary data and custom-built, hyper-scale infrastructure is the real differentiator.</p><p>Google's unparalleled data ecosystem, harvested across its diverse product portfolio, combined with its long-term investment in custom TPUs and mastery of planetary-scale computing, creates a formidable moat. Gemini 2.5 Pro is likely just an early indicator of what this integrated advantage can produce. As the demands for data and compute continue to escalate on the path to more capable AI, Google's lead in these foundational resources positions it strongly to outpace the competition and ultimately define the next era of artificial intelligence.</p>
]]></content:encoded></item><item><title><![CDATA[The Foundation Model Trap]]></title><description><![CDATA[Why AI Model Companies Are More Like Airlines than Like Cereal Companies]]></description><link>https://www.chrishayduk.com/p/the-foundation-model-trap</link><guid isPermaLink="false">https://www.chrishayduk.com/p/the-foundation-model-trap</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 05 Mar 2025 20:14:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JK_8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Harvey Sawikin recently wrote a great article analyzing the AI industry through a very Munger-like lens: will AI turn out more like the cereal industry (where there are many competitors with very healthy profit margins) or more like the airline industry (where competition compresses profit margins to near zero)?</p><p>This idea has major implications for the AI labs training foundation models today, such as OpenAI, Anthropic, and xAI. In this article I'll attempt to flesh out my understanding of this cereal vs. airline distinction and discuss why the airline scenario is more likely for the foundation model providers.</p><p>Before we dive in, you can find Harvey&#8217;s article below. 
I highly recommend giving it a read before continuing here.</p><p><a href="https://harveysawikin.substack.com/p/ai-companies-cereals-or-airlines">AI Companies: Cereals or Airlines? (Harvey&#8217;s Substack)</a></p><p>Okay, so let&#8217;s start with differentiating <em>why</em> cereals allow for competition with healthy profit margins, whereas airlines are a rough business for all involved.</p><p>Cereals have different flavors, so consumer preferences for a certain flavor can cause some degree of demand inelasticity. From a firm perspective, rather than chasing the flavor and profit margins of another firm's cereal, the more profitable long-term strategy is to specialize in a different flavor and reap your own healthy profit margins.</p><p>By contrast, the main service that airlines provide is transporting you from Point A to Point B. There isn't really an "experience" to speak of that differentiates airlines from one another (particularly for non-business class flyers), so the calculus for a consumer then comes down to only two factors: speed and cost. Speed can be achieved through two means: faster planes (which hasn't happened in decades) and more direct flights. Airlines are incentivized to provide direct flights between major cities/transit hubs because, if they did not, then any travelers going between major hubs (say, NYC and London) would choose the other airlines which did have those direct flights. 
Thus, we can assume that most major airlines will have direct flights between most major transit hubs/cities within a certain distance of each other. Hence, for any two airlines that have a direct flight between a fixed pair of cities, the <em>only</em> way to compete is on price, since this will be the <em>only</em> criterion differentiating the airlines for consumers. Thus, a small difference in price will lead to nearly all consumers choosing the cheaper option. This inherently <em>must</em> drive down profit margins as airlines seek to charge the lowest possible price while still maintaining profitability.</p><p>Translating this argument to AI, we see two potential paths forward:</p><ol><li><p><strong>Cereal Mode:</strong> We know that the data input to a model during its training process basically determines its behavior on the other end - what it acts like, what tasks it's good at, etc. Access to different types of data may thus give rise to different "flavors" of AI models, providing varying skill profiles and personalities. In this scenario, we could imagine that OpenAI provides the best chat experience (due to its large dataset of user chats), while Grok might provide the best news aggregation and summarization (due to its up-to-the-second Twitter data). This may provide enough distinction to allow each AI company to charge healthy profit margins on their respective foundation models.</p></li><li><p><strong>Airline Mode:</strong> In this case, maybe the data on the margins provided by chat interactions, Twitter, etc. doesn't move the needle much in terms of model behavior and capabilities. Perhaps the web-scale pretraining data drowns out the idiosyncrasies across each AI lab's datasets, leaving each lab's state-of-the-art AI models performing roughly identically. In this case, the only way to compete would be on API pricing, with consumers rapidly moving to the cheapest option available that can perform the given task.</p></li></ol><p>Based on trends of the last few months, I think Airline Mode is looking more and more likely. The <a href="https://lmarena.ai/leaderboard">Chatbot Arena leaderboard</a> shows that all of the leading models from the main labs perform roughly similarly to each other (Grok 3 and GPT-4.5 are even currently within 1 Elo point of each other as of this writing!). And DeepSeek was able to reproduce OpenAI o1 in the span of a couple months (R1's Elo is actually 11 points higher than o1's). We're seeing <em>more</em> convergence between models over the last couple of years, not less.</p>
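<p>To calibrate what those Elo gaps mean, recall the standard Elo expected-score formula (the general formula, not Chatbot Arena's exact fitting procedure), which converts a rating difference into a head-to-head preference probability:</p><pre><code># Expected score for a rating gap d under the standard Elo model:
#   E = 1 / (1 + 10 ** (-d / 400))

def elo_expected_score(rating_diff):
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

for d in (1, 11, 100):
    print(f"{d:>3}-point gap -> {elo_expected_score(d):.1%} expected score")
</code></pre><p>A 1-point gap is a 50.1% coin flip, and even an 11-point gap is only 51.6%, which is exactly the kind of convergence the Airline Mode story predicts.</p>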
]]></content:encoded></item><item><title><![CDATA[Understanding DeepSeek Part II: DeepSeek-V2]]></title><description><![CDATA[Compressing the key-value matrix]]></description><link>https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseek</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseek</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 05 Mar 2025 13:31:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bv9y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Summary</h1><p>DeepSeek-V2, released in June 2024, built on the success of DeepSeek's previous papers to set a new standard for training and inference efficiency. The key changes that set DeepSeek-V2 apart from prior open-source models occur in two core components of the transformer architecture: the attention block and the feed-forward network (see the image below).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!bv9y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png" width="1043" height="890" alt=""></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1043,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179138,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/158419950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bv9y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 424w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 848w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 1272w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The two key changes can be summarized as follows:</p><p>1.  <strong>Feed-Forward Network Optimization:</strong> DeepSeekMoE architecture</p><blockquote><p>Mixture of experts (MoE) layers are a drop-in replacement for the feed-forward layer in the standard transformer architecture. 
Prior to DeepSeekMoE, most MoE architectures functioned by replacing the feed-forward layer with several large, parallel feed-forward layers. Each input token would then "choose" 1 or 2 of these parallel feed-forward layers, also known as "experts", for its own computation. This architecture had one key problem - namely, each expert needed to learn large amounts of redundant information, since processing any token on any topic requires understanding of grammar, semantics, etc. DeepSeek solved this redundancy problem, thereby greatly increasing the learning efficiency of the MoE architecture, through three key innovations: more numerous, finer-grained experts; separating experts into shared and routed experts; and load balancing tokens across experts and devices. For more details on these innovations, see the previous blog post in the series (and the brief sketch after this summary).</p></blockquote><p>2. <strong>Attention Layer Optimization:</strong> Multi-head Latent Attention (MLA) </p><blockquote><p>Multi-head attention, described in detail in my other post, utilizes three matrices to produce new representations of the input tokens: the Query, Key, and Value matrices. Each of these matrices has dimension n x d, where n is the maximum length of the sequence and d is the dimension of the vector representing each token in the sequence. Standard transformers cache the Key and Value matrices for every layer fully in-memory at inference time, improving speed but resulting in large memory overhead. DeepSeek-V2's solution is to compress the key and value vectors at each layer into a single small latent vector per token. At inference time, only these latent vectors need to be cached, substantially reducing memory requirements.</p></blockquote><p>Since we already described the DeepSeekMoE architecture in detail in the previous blog post of this series, this post will focus primarily on multi-head latent attention. We'll start by describing the problem it aims to solve, then move on to the intuition behind MLA's solution, and finally dive into the concrete math describing the method. We'll end this post by discussing the effects that the combination of DeepSeekMoE and multi-head latent attention has on training and inference efficiency. Let's dive in!</p>
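<p>Before the deep dive, here is a minimal sketch of the DeepSeekMoE routing idea summarized above. Everything below - the dimensions, expert counts, and the simple softmax router - is an illustrative assumption for this post rather than DeepSeek's actual configuration, and the load-balancing machinery is omitted entirely:</p><pre><code>import numpy as np

# Sketch of the DeepSeekMoE idea: a few always-on "shared" experts plus many
# small "routed" experts, of which each token activates only a top-k subset.
# All sizes below are illustrative, not DeepSeek's real hyperparameters.
d_model, d_expert = 64, 32
n_shared, n_routed, top_k = 2, 16, 4

rng = np.random.default_rng(0)
shared = [rng.normal(size=(d_model, d_expert)) for _ in range(n_shared)]
routed = [rng.normal(size=(d_model, d_expert)) for _ in range(n_routed)]
proj   = [rng.normal(size=(d_expert, d_model)) for _ in range(n_shared + n_routed)]
router = rng.normal(size=(d_model, n_routed))  # scores a token against routed experts

def expert(x, W_in, W_out):
    return np.maximum(x @ W_in, 0.0) @ W_out   # tiny two-layer FFN

def moe_layer(x):
    # Shared experts see every token, absorbing common knowledge (grammar, etc.)...
    y = sum(expert(x, W, proj[i]) for i, W in enumerate(shared))
    # ...while the router sends each token only to its top-k routed experts.
    scores = x @ router
    top = np.argsort(scores)[-top_k:]
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over the chosen few
    for g, j in zip(gates, top):
        y = y + g * expert(x, routed[j], proj[n_shared + j])
    return y

token = rng.normal(size=d_model)
print(moe_layer(token).shape)  # (64,) - a drop-in replacement for a dense FFN
</code></pre>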
<div><hr></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding DeepSeek&#8221; Series:</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseekmoe">Understanding DeepSeek Part I: DeepSeekMoE</a></p></li><li><p>[This article] Understanding DeepSeek Part II: DeepSeek-V2</p></li><li><p>[Upcoming] Understanding DeepSeek Part III: DeepSeekMath</p></li><li><p>[Upcoming] Understanding DeepSeek Part IV: DeepSeek-Prover-V1.5</p></li><li><p>[Upcoming] Understanding DeepSeek Part V: DeepSeek-V3</p></li><li><p>[Upcoming] Understanding DeepSeek Part VI: DeepSeek-R1</p></li><li><p>[Upcoming] Understanding DeepSeek Part VII: Implications for the AI Industry and the World</p></li></ol><div><hr></div><h1>The Memory Efficiency Problem with Standard Multi-Head Attention</h1><p>Standard multi-head attention, at its core, solves the problem of deciding how to update our understanding of one concept, given a set of other, potentially related concepts. In the case of language modeling, we want to update our understanding of a particular token using the other tokens present in the sequence. To accomplish this, at each attention layer in a transformer, the model learns to parametrize three key matrices: the Query, Key, and Value matrices. These three matrices work together to identify the most relevant portions of the sequence for each token, and then to update each token's representation based on the relevant portions that were found. I won't cover the full details of how this is done here, but you can reference <a href="https://www.chrishayduk.com/p/a-primer-on-multi-head-causal-self">my other blog post</a> for more information.</p><p>Now, when language models are producing output at inference time, we essentially need to place the transformer in a while loop. Until the transformer outputs an &#8220;End of Sequence&#8221; token, we&#8217;ll feed the input sequence into the transformer to produce the next token. Then, appending that next token to the input sequence, we&#8217;ll feed the newly elongated input sequence back into the transformer, repeating the process.</p><p>The key insight that enables caching here is the following: since modern LLMs are causal, meaning future tokens cannot influence previous tokens, adding a new token to the end of the input sequence does not change the representation of any of the previous tokens. Hence, we do not need to recompute the hidden representations for the previous tokens, since these will be identical.</p><p>The only token for which we need to compute a new representation is the next token in the sequence (that is, the one token that doesn&#8217;t exist yet)! Another core insight coming from this observation is that we only need the key and value vectors for each previous token to compute the new token&#8217;s representation. Since the previous tokens&#8217; representations do not change, we don&#8217;t need to use the other tokens as &#8220;queries&#8221; to update their representations. However, we do need their key and value vectors so that we can &#8220;query&#8221; these vectors with the new token's query vector.</p><p>The above observations give us a road map for caching values in the transformer in order to limit the number of computations we perform and speed up inference. In particular, we must cache the Key and Value matrices at each hidden layer so that we can use these to compute the hidden representation for the new token, as in the sketch below.</p>
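<p>Here is a schematic of that decode loop in Python. The <code>model</code> interface is hypothetical - a stand-in that accepts the newest tokens plus the cached keys/values and returns next-token logits along with the updated cache. Real inference APIs differ in their details, but the loop has this shape:</p><pre><code>def generate(model, prompt_ids, eos_id, max_new_tokens=256):
    # First pass: run the whole prompt once, building the per-layer KV cache.
    ids = list(prompt_ids)
    logits, kv_cache = model(ids, kv_cache=None)
    for _ in range(max_new_tokens):
        next_id = int(logits.argmax())        # greedy decoding for simplicity
        if next_id == eos_id:
            break
        ids.append(next_id)
        # Causality means earlier tokens' keys and values never change, so we
        # feed ONLY the new token and reuse the cache for everything else.
        logits, kv_cache = model([next_id], kv_cache=kv_cache)
    return ids
</code></pre>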
<p>Now let&#8217;s compute the memory requirements to store these cached values for Llama 3.3 70B, a state-of-the-art open source model at the time of writing. (In practice, Llama 3.3 uses Grouped-Query Attention, which reduces caching requirements. For the sake of simplicity, we'll assume it uses standard attention here.)</p><p>Llama 3.3 has 80 attention layers. Each key and value vector in these attention layers has a dimension of 8192. And Llama 3.3 has a maximum context length of 128,000 tokens.</p><p>If Llama 3.3 is used in the default floating point 16 (FP16) mode, then each stored number will take up 2 bytes (16 bits). Hence, a single vector consisting of 8192 floating-point numbers will take up 16,384 bytes, or equivalently 16.384 kilobytes. For each cached token in our input, we need to store both a key vector *and* a value vector at each layer. Hence, at every layer, a cached token will require two vectors, totaling 32.768 KB in memory. Since there are 80 such layers, the cost to cache one token is thus 80 * 32.768 KB = 2621.44 KB (equivalently, 2.62 MB).</p><p>Now, suppose our input is 10,000 tokens long and we are producing the next token in the sequence. To cache the necessary data for the previous tokens, we need 10,000 * 2.62 MB = 26,200 MB (equivalently, 26.2 GB).</p><p>If our input uses the full Llama 3.3 context length of 128,000 tokens, the required space is 128,000 * 2.62 MB = 335,360 MB (equivalently, 335.36 GB).</p>
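<p>If you'd like to check these numbers yourself, the whole computation fits in a few lines of Python (the small discrepancy with the 335.36 GB figure above comes from rounding to 2.62 MB per token):</p><pre><code>layers     = 80       # Llama 3.3 70B attention layers
d_model    = 8192     # dimension of each key/value vector (assuming standard MHA)
bytes_fp16 = 2        # bytes per FP16 number

per_token = layers * 2 * d_model * bytes_fp16   # key + value at every layer
print(per_token / 1e3, "KB per token")          # 2621.44 KB, i.e. ~2.62 MB

for n_tokens in (10_000, 128_000):
    print(n_tokens, "tokens:", n_tokens * per_token / 1e9, "GB")
    # 10,000 tokens  -> ~26.2 GB
    # 128,000 tokens -> ~335.5 GB
</code></pre>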
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71b33121-208e-406d-82a4-faf03be131b4_1530x677.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186879,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/158419950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BuXX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 424w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 848w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 1272w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In order to overcome these memory efficiency issues, DeepSeek created the Multi-head Latent Attention layer. 
This layer modifies standard multi-head attention (depicted on the left side of the above image) by compressing each token's key and value vectors into a <em>single latent vector</em>. In practice, this looks like the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n    \\mathbf{c}^{KV}_t &amp;= W^{DKV} \\mathbf{h}_t, \\\\[8pt]\n    \\mathbf{k}^{C}_t &amp;= W^{UK} \\mathbf{c}^{KV}_t, \\\\[8pt]\n    \\mathbf{v}^{C}_t &amp;= W^{UV} \\mathbf{c}^{KV}_t, \\\\[8pt]\n    \\text{where} \\quad \n    \\mathbf{c}^{KV}_t &amp;\\text{ is the compressed latent vector for keys and values,}\\\\[5pt]\n\td_c \\; (&amp;\\ll d_h n_h) \\text{ denotes the KV compression dimension,} \\\\[5pt]\n    W^{DKV} &amp;\\in \\mathbb{R}^{d_c \\times d} \\text{ is the down-projection matrix} \\\\[5pt]\n    W^{UK}, &amp;W^{UV} \\in \\mathbb{R}^{d_h n_h \\times d_c} \\text{ are the up-projection matrices for the keys and values, respectively} \\\\[5pt]\n\\end{align}&quot;,&quot;id&quot;:&quot;BBBVCXUDSM&quot;}" data-component-name="LatexBlockToDOM"></div><p>That is, our model now must learn three additional matrices per layer - one down-projection matrix and two up-projection matrices. By learning these three matrices, we no longer need to store the entire Key and Value matrices when caching previously-computed tokens. Instead, we can store one compressed latent vector per token at each layer, where that latent vector contains <em>all</em> of the information needed to reproduce the token's full key and value vectors.</p><p>Thus, if we have L layers, we only need to store d_c * L values per cached token (d_c numbers in each latent vector and L latent vectors, one per layer), rather than the 2 * d * L values required by standard attention.</p><p>Let's take the example of Llama 3.3 that we illustrated above to see how much this gains us - previously, caching the full Key and Value matrices for the full 128,000 token context length of Llama 3.3 required 335.36 GB. Now, instead of caching the full matrices, let's imagine we've augmented Llama 3.3 to use MLA. DeepSeek sets the dimension of the latent vector to four times the <em>per-head</em> dimension (d_c = 4 * d_h). Llama 3.3 has 64 attention heads, so its per-head dimension is 8192 / 64 = 128, giving a latent dimension of 512. Hence, each latent vector takes up 512 * 2 bytes = 1.024 KB, and caching one latent vector at each of Llama 3.3's 80 layers costs 80 * 1.024 KB = 81.92 KB per token. Across the full 128,000 token context, that is 128,000 * 81.92 KB = 10.49 GB.</p><p>This is a <em>substantial</em> reduction from the initial requirement of 335.36 GB for standard attention - a cache 32 times smaller - demonstrating the efficiency gains that can be driven using this approach.</p>
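<p>Here is a minimal numpy sketch of that compression, using the hypothetical Llama-3.3-with-MLA dimensions from the example above (random weights stand in for the learned projections):</p><pre><code>import numpy as np

# Low-rank KV compression as in MLA: cache only the small latent c_t;
# re-derive the full keys and values from it when needed.
d, n_h, d_h = 8192, 64, 128
d_c = 4 * d_h                # 512: the only thing cached per token per layer

rng = np.random.default_rng(0)
W_DKV = rng.normal(size=(d_c, d))          # down-projection
W_UK  = rng.normal(size=(n_h * d_h, d_c))  # up-projection for keys
W_UV  = rng.normal(size=(n_h * d_h, d_c))  # up-projection for values

h_t = rng.normal(size=d)     # hidden state of one token
c_t = W_DKV @ h_t            # compressed latent: 512 numbers
k_t = W_UK @ c_t             # full keys, reconstructed on the fly
v_t = W_UV @ c_t             # full values, reconstructed on the fly

print(c_t.size, "cached numbers vs", k_t.size + v_t.size, "for standard attention")
# 512 vs 16384 per token per layer - the 32x reduction computed above
</code></pre>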
<h1>Training and Inference Efficiency</h1><p>DeepSeek-V2 introduces significant efficiency improvements in both training and inference compared to its predecessor, DeepSeek 67B, primarily through innovations in its architecture, especially Multi-head Latent Attention. By compressing each token's keys and values into a single latent vector, MLA dramatically reduces memory consumption during inference. The reduction of the KV cache by approximately 93.3% translates directly into substantial gains in maximum generation throughput, allowing DeepSeek-V2 to achieve throughput up to 5.76 times greater than DeepSeek 67B. These optimizations enable DeepSeek-V2 to handle much longer contexts (up to 128K tokens) efficiently, positioning it as one of the most practical choices among large-scale language models for real-world applications where long-context inference is critical.</p><p>Additionally, the integration of DeepSeekMoE into the feed-forward network layers synergizes well with MLA, enabling significant computational savings without sacrificing model performance. By activating only a fraction (21B) of its total parameters (236B), DeepSeek-V2 demonstrates economical training, saving 42.5% of training costs compared to its dense predecessor, DeepSeek 67B. Thus, these two changes together play a critical role not only in inference-time efficiency but also in making the pretraining phase more cost-effective.</p><h1>Results and Key Takeaways</h1><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ThVp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png" width="1441" height="794" alt=""></figure></div>
<p>The innovative Multi-head Latent Attention layer significantly enhances the practical deployability of DeepSeek-V2. Compared to traditional Multi-Head Attention, MLA achieves superior performance while simultaneously overcoming the KV cache bottleneck. With its novel low-rank joint compression strategy, MLA significantly reduces inference memory overhead, making DeepSeek-V2 particularly suited for high-throughput, real-time applications requiring extensive context management.</p><p>Empirical evaluations on various benchmarks illustrate the clear strengths of DeepSeek-V2, even when compared against other leading open-source models of the time. Notably, DeepSeek-V2 consistently achieved top-tier performance on benchmarks such as MMLU, math reasoning tasks, and coding challenges, highlighting the architectural advantages introduced by MLA. Moreover, these enhancements enabled DeepSeek-V2 to be trained and served at a fraction of the cost of comparably performing dense models (see the above image).</p><p>All in all, Multi-head Latent Attention represented another significant milestone for DeepSeek on the path towards the highly optimized training and inference that culminated in DeepSeek-V3 and DeepSeek-R1. 
A later post in this series will dive into the innovations introduced for DeepSeek-V3, building upon the foundations laid here and forming the base model used to train DeepSeek's state-of-the-art reasoning model.</p><div><hr></div>]]></content:encoded></item><item><title><![CDATA[A Primer on Multi-Head Causal Self-Attention]]></title><description><![CDATA[The neural network layer that kicked off the LLM craze]]></description><link>https://www.chrishayduk.com/p/a-primer-on-multi-head-causal-self</link><guid isPermaLink="false">https://www.chrishayduk.com/p/a-primer-on-multi-head-causal-self</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Sat, 01 Feb 2025 00:32:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d3efecf9-3e49-410d-b69e-5cf3ceecc999_1026x1148.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Lately, I've been writing quite a few series that center around the transformer architecture. For many of those blog posts, I struggle to decide whether to include the background information necessary to understand attention (greatly increasing the length of the post) or to assume the reader already knows this information (limiting the reach of my audience). This post is intended as a compromise between the two positions, allowing me to link it as background reading in any future blog post that requires knowledge of the nuts and bolts of the attention architecture.</p><p>This will be a "living" blog post, in that it will be edited and expanded upon as my own understanding of the architecture grows and deepens. If I make any radically large changes, I will re-email the post to subscribers for their review. 
Otherwise, feel free to check back periodically to see how the article has changed!</p><h1>The Basic Terminology of Multi-Head Causal Self-Attention</h1><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!HgXR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png" width="635" height="347" alt=""></figure></div>
<p>The standard attention block used in first-generation LLMs like GPT-2 and GPT-3 is <strong>multi-head causal self-attention</strong>.</p><p>The goal of this variant of attention, like any attention variant, is to learn how to update a vector using other context vectors in order to accomplish some goal. In the case of language modeling, our vectors represent tokens, which you can think of as roughly analogous to words. The goal of these vector updates is to accurately predict the next word in the sentence. It is called <strong>causal</strong> because this type of attention ensures that each word can only update itself using previous words in the sentence - that is, it can't look ahead and update itself using words that haven't been written yet! It is called <strong>self-attention</strong> because the things that each word is paying attention to are the other words in the sentence. There is no outside data or context involved here. And finally, it is termed <strong>multi-head</strong> because, at each attention layer, we have multiple attention operations occurring in parallel. These parallel attention operators are referred to as "heads". The causal constraint, in particular, has a simple mechanical picture, shown in the sketch below.</p>
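<p>A minimal numpy illustration of the causal constraint: each position may attend to itself and to earlier positions only, which is enforced by masking out the upper triangle of the score matrix before softmax. The uniform scores here are a toy stand-in for real query-key scores:</p><pre><code>import numpy as np

n = 4                                        # toy sequence length
scores = np.zeros((n, n))                    # stand-in for query-key scores
future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
scores[future] = -np.inf                     # forbid attending to later tokens

weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)        # row-wise softmax
print(np.round(weights, 2))
# Row i spreads attention evenly over tokens 0..i and gives exactly 0 to the future.
</code></pre>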
<p>To produce its output, each attention head takes as input a sequence of tokens represented as vectors. Within each head, these vectors are passed through three learned linear projections in parallel, mapping each token's vector to three new vectors: the query, key, and value vectors. These query, key, and value vectors are then used to update the vector representations of the words in our sentence, improving the model's understanding of the concepts contained in the sentence.</p><p>Let's take a look at how this is done in practice.</p><h1>The Mathematics of Causal Self-Attention</h1><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!IwgS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png" width="925" height="739" alt=""></figure></div>
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As mentioned, the attention block takes as input a sequence of tokens represented as vectors. Suppose the input sequence is given by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;X = \\{x_1, x_2, \\dots, x_n\\},&quot;,&quot;id&quot;:&quot;TROFIYZJWD&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each x_i is the vector representation (embedding) of a token.</p><p>Each input vector x_i is simultaneously projected into three different spaces using learned linear transformations. That is, for every token x_i, we compute:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q_i = x_i W^Q,\\quad k_i = x_i W^K,\\quad v_i = x_i W^V,&quot;,&quot;id&quot;:&quot;RTYBEAFKPI&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><ul><li><p>W^Q, W^K, and W^V are the weight matrices for the query, key, and value projections, respectively.</p></li><li><p>The sets of all query, key, and value vectors are often denoted as Q,  K, and V.</p></li></ul><p>For a given token x_i, we will compute a similarity score with every token x_j such that x_j comes before it in the sentence (or is the token itself). This is done by taking the dot product of the query vector for x_i with the key vector for k_j. This result is then divided by the square root of the key vector's dimension. Mathematically, this is given by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\alpha_{ij} = \\frac{q_i \\cdot k_j}{\\sqrt{d_k}}&quot;,&quot;id&quot;:&quot;GXXLDMYNFI&quot;}" data-component-name="LatexBlockToDOM"></div><p>This value is precisely the unnormalized measure of how much token i should attend to token j. In linear algebra, the dot product of two vectors is just a scaled version of the cosine of the angle between them. An angle of 0 degrees gives a cosine value of 1, while an angle of 180 degrees gives a cosine value of -1. 
Hence, the closer the two vectors get to pointing in the same direction, the closer the cosine of their angle gets to 1, and the closer they get to pointing in opposite directions, the closer the cosine gets to -1 (with the dot product scaling accordingly). Intuitively, then, cosine (and by extension the dot product) has very desirable properties to use as a similarity function in the attention mechanism.</p><p>Armed with these similarity scores, we now have a way of measuring how "similar" two tokens in our sequence are. However, in order to use them to produce new vector embeddings, we're going to want to rescale them. The dot product between the query and key vectors could end up being quite large, and using this value directly can cause large changes in the scale of the vector representation for a given token. Moreover, knowing the score of a particular token pair (let's say between tokens x_i and x_j) tells us nothing about how important that pair is - importance is always relative, and what if the score for x_i and x_k is bigger?</p><p>Given the above discussion, we know we need to introduce some function that will re-scale our scores in such a way that we do not radically change the magnitude of the token's vector representation and that we can quickly determine how "important" each token pair is. A convenient differentiable function that does just this is softmax. The equation to produce the softmax output for token x_i is given below:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{\\alpha}_{ij} = \\frac{\\exp(\\alpha_{ij})}{\\sum_{j'=1}^i \\exp(\\alpha_{ij'})}&quot;,&quot;id&quot;:&quot;KBGRNDGLBV&quot;}" data-component-name="LatexBlockToDOM"></div><p>The softmax function will take our scores for all possible pairs made with x_i (i.e. all key vectors we multiplied by x_i's query vector) and squash them into the range of 0 to 1. Moreover, it will ensure that these values sum to 1. Hence, we can view these outputs, referred to as attention weights, as probabilities or percentages. It can be useful to think of the attention weight for the pair x_i and x_j as the percent of x_i's attention that should be paid to x_j.</p><p>Once we have these attention weights, we can use them to produce a new vector representation for token x_i. We do this by taking a weighted sum of the value vectors for each token, where the weight is the attention weight. As mentioned above, we can think of this attention weight as the percent of attention that x_i pays to each vector that precedes it in the sequence. Mathematically, this looks like:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_i = \\sum_{j=1}^i \\tilde{\\alpha}_{ij} v_j&quot;,&quot;id&quot;:&quot;WPLELAQFMS&quot;}" data-component-name="LatexBlockToDOM"></div><p>This weighted sum integrates information from the tokens that x_i &#8220;attends&#8221; to, based on the learned attention weights.</p><p>In multi-head attention, these operations happen in parallel multiple times over. That is, we will produce multiple instances of the query, key, and value vectors for each token in the sequence. We will then use those unique instances to produce distinct updated vector representations for each token. 
If we have k heads, then we will produce k updated vectors for token i:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_{i1}, \\tilde{x}_{i2}, \\ldots \\tilde{x}_{ik}&quot;,&quot;id&quot;:&quot;PRSVCGFTHD&quot;}" data-component-name="LatexBlockToDOM"></div><p>These vectors get concatenated together, forming a single vector to represent the updated token i:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_i = [\\tilde{x}_{i1}; \\tilde{x}_{i2}; \\ldots; \\tilde{x}_{ik}]&quot;,&quot;id&quot;:&quot;HWIMGBKDRV&quot;}" data-component-name="LatexBlockToDOM"></div><p>The intuition behind using multiple heads to create the final updated representation for token i is that each head can learn to capture different aspects of language. One might learn grammatical structure, while another might learn vocabulary related to the legal profession. By splitting responsibilities between the attention heads, each can learn unique, non-redundant information.</p><p>This vector will then be passed through a linear projection layer, producing the final output of the multi-head attention layer.</p><p>Let's walk through a toy example now to make things concrete.</p><h1>A Toy Example: "the dog barks"</h1><p>Suppose our sentence is "the dog barks", and our tokenizer splits it into three tokens: "the", "dog", and "barks". Initially, these tokens are embedded into vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{\\text{the}},\\quad x_{\\text{dog}},\\quad x_{\\text{barks}}.&quot;,&quot;id&quot;:&quot;VPTYOASSTS&quot;}" data-component-name="LatexBlockToDOM"></div><p>When entering the first attention block, each of these vectors is projected into three new vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\n\\text{For \&quot;the\&quot;:} &amp;\\quad q_{\\text{the}} = x_{\\text{the}} W^Q,\\quad k_{\\text{the}} = x_{\\text{the}} W^K,\\quad v_{\\text{the}} = x_{\\text{the}} W^V, \\\\\n\n\\text{For \&quot;dog\&quot;:} &amp;\\quad q_{\\text{dog}} = x_{\\text{dog}} W^Q,\\quad k_{\\text{dog}} = x_{\\text{dog}} W^K,\\quad v_{\\text{dog}} = x_{\\text{dog}} W^V, \\\\\n\n\\text{For \&quot;barks\&quot;:} &amp;\\quad q_{\\text{barks}} = x_{\\text{barks}} W^Q,\\quad k_{\\text{barks}} = x_{\\text{barks}} W^K,\\quad v_{\\text{barks}} = x_{\\text{barks}} W^V.\n\n\\end{aligned}&quot;,&quot;id&quot;:&quot;OBBBHLQCJW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Thus, we transform 3 input vectors into 9 new vectors (3 each for queries, keys, and values).</p><h2>Updating the "dog" Token</h2><p>Let&#8217;s focus on updating the token "dog". In our example, "dog" corresponds to the second token x_2. To update its representation, we use its query vector and compute dot-product scores with the key vectors of "the" and "dog" (i.e., the tokens that precede or are the token itself):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\n\\alpha_{\\text{dog},\\text{the}} &amp;= \\frac{q_{\\text{dog}} \\cdot k_{\\text{the}}}{\\sqrt{d_k}}, \\\\\n\n\\alpha_{\\text{dog},\\text{dog}} &amp;= \\frac{q_{\\text{dog}} \\cdot k_{\\text{dog}}}{\\sqrt{d_k}},\n\n\\end{aligned}&quot;,&quot;id&quot;:&quot;OIAYCICNJN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where d_k is the dimensionality of the key vectors. 
The division by the square root of d_k is used to normalize the scores.</p><p>These scores are then passed through a softmax function to obtain attention weights (or probabilities) that indicate how much attention "dog" should pay to itself and to "the":</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\n\\tilde{\\alpha}_{\\text{dog},\\text{the}} &amp;= \\frac{\\exp\\left(\\alpha_{\\text{dog},\\text{the}}\\right)}{\\exp\\left(\\alpha_{\\text{dog},\\text{the}}\\right) + \\exp\\left(\\alpha_{\\text{dog},\\text{dog}}\\right)}, \\\\\n\n\\tilde{\\alpha}_{\\text{dog},\\text{dog}} &amp;= \\frac{\\exp\\left(\\alpha_{\\text{dog},\\text{dog}}\\right)}{\\exp\\left(\\alpha_{\\text{dog},\\text{the}}\\right) + \\exp\\left(\\alpha_{\\text{dog},\\text{dog}}\\right)}.\n\n\\end{aligned}&quot;,&quot;id&quot;:&quot;YRFTFEVZYQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Softmax ensures that these attention weights are between 0 and 1, and that they all sum to 1. Hence, they are valid probabilities and, to make things easier, you can think of them as what percent of its attention the word "dog" should pay to the word "the" or to itself.</p><p>With the attention probabilities computed, we update the original vector representation of "dog" by taking a weighted sum of the corresponding value vectors. In this case, we combine the value vector of "the" and the value vector of "dog":</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_{\\text{dog}} = \\tilde{\\alpha}_{\\text{dog},\\text{the}} \\, v_{\\text{the}} + \\tilde{\\alpha}_{\\text{dog},\\text{dog}} \\, v_{\\text{dog}}&quot;,&quot;id&quot;:&quot;EVNVACWIHW&quot;}" data-component-name="LatexBlockToDOM"></div><p>This new vector is an updated representation that incorporates contextual information from the preceding token "the" as well as from "dog" itself.</p><h2>Recap</h2><p>To recap, these are the major steps for updating the "dog" vector in our example using causal self-attention:</p><p>1. <strong>Input Embedding:</strong> </p><p>   Each token is embedded into a vector x_i.</p><p>2. <strong>Linear Projections:</strong>  </p><p>   Each x_i is projected into query, key, and value vectors:  </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;   q_i = x_i W^Q,\\quad k_i = x_i W^K,\\quad v_i = x_i W^V.&quot;,&quot;id&quot;:&quot;NCHGMXGHNI&quot;}" data-component-name="LatexBlockToDOM"></div><p>3. <strong>Score Calculation (for causal attention):</strong></p><p>   For token "dog" (second token), calculate:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;   \\alpha_{\\text{dog}, j} = \\frac{q_{\\text{dog}} \\cdot k_j}{\\sqrt{d_k}}, \\quad \\text{for } j \\in \\{\\text{\&quot;the\&quot;}, \\text{\&quot;dog\&quot;}\\}&quot;,&quot;id&quot;:&quot;NXASKXVPVR&quot;}" data-component-name="LatexBlockToDOM"></div><p>4. <strong>Softmax to Obtain Weights:</strong></p><p>   Convert scores to probabilities:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{\\alpha}_{\\text{dog}, j} = \\frac{\\exp\\left(\\alpha_{\\text{dog}, j}\\right)}{\\sum_{j'} \\exp\\left(\\alpha_{\\text{dog}, j'}\\right)}&quot;,&quot;id&quot;:&quot;BPJVISXYAX&quot;}" data-component-name="LatexBlockToDOM"></div>
<p>5. <strong>Contextual Update:</strong>  </p><p>   Update "dog" by a weighted sum of the value vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_{\\text{dog}} = \\sum_{j \\in \\{\\text{\&quot;the\&quot;}, \\text{\&quot;dog\&quot;}\\}} \\tilde{\\alpha}_{\\text{dog}, j} \\, v_j&quot;,&quot;id&quot;:&quot;JLFZHRNNAS&quot;}" data-component-name="LatexBlockToDOM"></div><p>In the full multi-head attention mechanism, this process is performed in parallel over multiple "heads" (with different learned projections), and the results are concatenated and transformed further to form the final output of the attention block.</p><h1>Key Takeaways</h1><p>Let's now summarize the key points of what we've learned:</p><ol><li><p>Attention is a neural network mechanism used to update vectors using the context from other vectors</p></li><li><p>Input vectors to an attention layer are replaced by 3 intermediate vectors: the query, key, and value vectors</p></li><li><p>The query and key vectors work together to produce similarity scores between pairs of vectors. If we are updating token i and want to know how much token j should influence our input, we multiply the query vector of token i by the key vector of token j.</p></li><li><p>The scores produced by the query and key vectors can be turned into probabilities using the softmax function. These probabilities measure how much token i should consider the tokens that came before it in the sequence when updating its vector representation</p></li><li><p>The new vector representation for token i is produced by multiplying the softmax probabilities by the value vectors for each corresponding token. These weighted vectors are then summed together</p></li><li><p>The above process occurs independently across several parallel attention operations, called heads. At the end of the attention block, the new vector representations for token i coming from each head are concatenated together and passed through a linear projection layer.</p></li><li><p>The above process (steps 1-6) is performed in parallel for the full sequence of input tokens.</p></li></ol><p>If you keep these 7 key points in mind while reading arXiv papers (or my future blog posts!), you'll have a strong understanding of what multi-head causal self-attention is doing, where it faces limitations, and whether or not a given architectural change actually addresses those limitations.</p>
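<p>To tie the 7 points together, here is a self-contained numpy sketch of a single causal self-attention head running over the toy sentence from earlier. The embeddings and weight matrices are random stand-ins for learned parameters, and the dimensions are illustrative:</p><pre><code>import numpy as np

# One head of causal self-attention over "the dog barks".
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
X = rng.normal(size=(3, d_model))           # step 1: embeddings for "the", "dog", "barks"

W_Q = rng.normal(size=(d_model, d_k))       # step 2: learned projection matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T / np.sqrt(d_k)             # step 3: scaled dot-product scores
scores[np.triu_indices(3, k=1)] = -np.inf   # causal: no attending to future tokens

weights = np.exp(scores)                    # step 4: row-wise softmax
weights /= weights.sum(axis=1, keepdims=True)

X_new = weights @ V                         # step 5: weighted sum of value vectors
print(X_new.shape)                          # (3, 4): one updated vector per token

# Steps 6-7: a real block repeats this with several heads in parallel,
# concatenates the per-head outputs, and applies a final linear projection.
</code></pre>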
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding DeepSeek Part I: DeepSeekMoE]]></title><description><![CDATA[Mixture of experts models with a twist]]></description><link>https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseekmoe</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseekmoe</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Thu, 30 Jan 2025 03:28:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RrxT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RrxT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RrxT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RrxT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg" width="826" height="465" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:465,&quot;width&quot;:826,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;What is DeepSeek: China's open-source AI research lab which rivals OpenAI |  World News - Business Standard&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="What is DeepSeek: China's open-source AI research lab which rivals 
OpenAI |  World News - Business Standard" title="What is DeepSeek: China's open-source AI research lab which rivals OpenAI |  World News - Business Standard" srcset="https://substackcdn.com/image/fetch/$s_!RrxT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Series Introduction</h1><p>Recently, the announcement of DeepSeek-R1 shook the AI world, as an open source project managed to match the performance of OpenAI's state-of-the-art API, o1, within months of its release. The market reacted vehemently to this news, with Nvidia's stock dropping 18% in a single day. AI researchers, engineers, and commentators alike took to Twitter/X to share their thoughts on DeepSeek-R1's implications for the AI industry and the United States, with many asserting that the age of American AI had come and gone in a flash, with China now firmly taking the lead.</p><p>But were these takes correct?</p><p>In order to dissect the true implications for the world going forward, we first need to understand DeepSeek-R1 on a fundamental level - what it is, what it does, how it works, and what key innovations it introduced. 
This blog post series will aim to arm you with that knowledge.</p><p>To do this effectively, we are going to start at the beginning of DeepSeek's major papers and work our way forward in time, tracing out the researchers' reasoning and how they arrived at the final design for DeepSeek-R1. This final design included two key components: </p><ol><li><p>An efficient mixture of experts language model base </p></li><li><p>Reinforcement learning-tuned chain of thought capabilities</p></li></ol><p>In this blog series, we will explore two separate but related series of papers in order to deeply understand the two key components of DeepSeek-R1. First, we will trace the evolution of the mixture of experts architecture from DeepSeekMoE to DeepSeek-V3, their newest state-of-the-art language model. We will then turn our attention to reinforcement learning-tuned chain of thought, beginning with the seminal DeepSeekMath paper and working our way forward to the current AI darling - DeepSeek-R1.</p><p>With this strong foundational knowledge of the theoretical underpinnings of DeepSeek-R1, we will be able to separate the hype from the reality. In light of what we've learned from these paper deep dives, this blog series will conclude with an analysis of the implications of DeepSeek-R1 from several perspectives:</p><ol><li><p>Technological progress</p></li><li><p>AI market dynamics</p></li><li><p>Geopolitical risks</p></li></ol><p>By the end of this series, you will have a clear, evidence-based understanding of DeepSeek-R1&#8212;what makes it powerful, where it stands relative to its competitors, and what its long-term impact might be. As the AI landscape continues to shift at an unprecedented pace, cutting through speculation and focusing on the fundamentals will be key to making sense of the road ahead. Let&#8217;s dive in.</p><div><hr></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding DeepSeek&#8221; Series:</p><ol><li><p>[This article] Understanding DeepSeek Part I: DeepSeekMoE</p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseek">Understanding DeepSeek Part II: DeepSeek-V2</a></p></li><li><p>[Upcoming] Understanding DeepSeek Part III: DeepSeekMath</p></li><li><p>[Upcoming] Understanding DeepSeek Part IV: DeepSeek-Prover-V1.5</p></li><li><p>[Upcoming] Understanding DeepSeek Part V: DeepSeek-V3</p></li><li><p>[Upcoming] Understanding DeepSeek Part VI: DeepSeek-R1</p></li><li><p>[Upcoming] Understanding DeepSeek Part VII: Implications for the AI Industry and the World</p></li></ol><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe to make sure you don&#8217;t miss any new posts in this series!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h1>Paper Summary</h1><p>Mixture-of-experts (MoE) models are an extension of the standard transformer architecture in which a collection of expert modules (typically feed-forward networks) each learn to specialize in different aspects of the data. For a given token or input, only a subset of these specialized experts is activated, allowing the model to dynamically focus its computation on the most relevant components. This selective activation enables MoE models to achieve a high effective capacity&#8212;since many different specialists are available&#8212;while maintaining computational efficiency, because only a limited number of experts actually process each input. As a result, MoE approaches excel at capturing diverse patterns, efficiently scaling model size, and flexibly adapting to a wide variety of tasks.</p><p>Standard mixture-of-experts models, used prior to DeepSeekMoE, typically rely on selecting the top <em>K</em> experts (often 1 or 2) out of <em>N</em> possible experts for each token in a sequence. While this approach does reduce computational load&#8212;since only a small fraction of experts are activated&#8212;it also forces those few activated experts to capture <em>all</em> aspects of the token, including common linguistic structure that is often duplicated across experts. Consequently, an enormous portion of each expert&#8217;s capacity is spent memorizing redundant information, leaving less room for true specialization.</p><p>DeepSeekMoE improves upon the standard MoE architecture, solving this redundancy problem by:</p><p>1. <strong>Using a larger number of smaller experts (Fine-Grained Expert Segmentation)</strong></p><p>Instead of a few large experts, DeepSeek splits capacity into many more experts, each of which is smaller in dimensionality. The model then increases the number of selected experts by the same factor, creating a dramatically larger space of potential expert combinations. Despite this combinatorial explosion, the overall parameter count and per-token activated parameters remain <em>exactly the same</em> as in a conventional MoE setup&#8212;meaning we gain richer representational capacity without paying extra in total parameter count or computational cost.</p><p>2. <strong>Separating Experts into Shared and Routing Experts</strong></p><p>DeepSeek also partitions its experts into two sets. The shared experts, which are <em>always activated</em> for every token, learn the broad &#8220;common knowledge&#8221; required by all inputs (e.g., syntax, high-level semantics). The routing experts, by contrast, are only activated if they are relevant to a specific token, allowing them to focus on niche or domain-specific information. This further decreases redundancy and promotes parameter efficiency: shared experts handle language &#8220;fundamentals,&#8221; while routing experts handle specialization.</p><p>3. <strong>Load Balancing Through Additional Loss Terms</strong></p><p>Finally, DeepSeek addresses load balancing in two senses. 
It enforces a roughly equal usage of each active routing expert across tokens&#8212;ensuring no single expert is under- or over-utilized&#8212;and distributes the experts themselves across multiple GPUs to avoid hardware bottlenecks. Both of these aims are achieved by incorporating new balancing terms into the training objective.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7dNn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7dNn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 424w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 848w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1272w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7dNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png" width="1086" height="664" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:664,&quot;width&quot;:1086,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:145250,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7dNn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 424w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 848w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1272w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Taken together, these modifications produce a model that is both parameter-efficient and highly flexible. By boosting expert variety, removing needless duplication, and balancing the workload across experts and devices, DeepSeekMoE provides a substantially more effective way to leverage MoE architectures&#8212;achieving greater specialization and capacity without increasing the overall parameter footprint.</p><p>Let's now dive deeper into these three optimizations and see how they alter the standard MoE transformer architecture.</p><h1>Standard Mixture of Experts Models</h1><p>In the standard MoE architecture, expert layers will typically replace the feed-forward layer that occurs after self-attention. Experts can be thought of as a set of <em>N</em> feed-forward layers that are structurally identical to the original feed-forward layer. Only a subset of these <em>N</em> possible feed-forward networks will be activated for any individual token, with many prior MoE architectures selecting 1 or 2 of these <em>N</em> possible networks for a given token. </p><p>Whether or not a network is activated is determined by taking the dot product of the output of the attention layer for that token (i.e., the hidden vector for token i) with the centroid of the current expert. We then take the softmax of these values to force them into the range of 0 to 1. You can think of this like an attention score computed over the experts instead of the tokens - we want to see which expert aligns most closely with the current token under consideration. These scores are computed for each expert, and then the experts are ranked according to this score. The top <em>K</em> (usually 1 or 2) experts are selected based on this ranking, and the token embeddings are then passed to those feed-forward expert networks. </p><p>The output of these experts is added together alongside the initial hidden state for the token (i.e., the token vector prior to the application of the experts). 
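This produces the final output for the given layer.</p><p>As a concrete illustration of this routing step, here is a rough NumPy sketch. Each "expert" is collapsed to a single weight matrix, and the sizes, seed, and initializations are illustrative assumptions rather than any paper's actual configuration:</p><pre><code class="language-python"># A rough sketch of standard top-K expert routing for a single token.
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 4, 2

h = rng.normal(size=d)                       # hidden vector for token i
centroids = rng.normal(size=(n_experts, d))  # one learned centroid per expert
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy expert "FFNs"

# Affinity of the token to each expert: dot product, then softmax.
scores = centroids @ h
probs = np.exp(scores) / np.exp(scores).sum()

# Rank experts by affinity and keep only the top K.
selected = np.argsort(probs)[-top_k:]

# Output: the selected experts' weighted contributions plus the
# original hidden state (the residual).
out = h + sum(probs[i] * (h @ experts[i]) for i in selected)
print(selected, out.shape)
</code></pre>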
<p>The major obstacle with this approach is the following: since most prior MoE models only selected the top 1 or 2 experts for each token, the selected expert(s) must capture <em>everything</em> about a given token, including redundant information such as language structure. This wastes a large amount of the model's capacity to learn useful information, forcing the weights of each expert to memorize redundant information that is already captured by the other experts.</p><h1>Fine-Grained Expert Segmentation</h1><p>One of DeepSeek's solutions to the redundancy problem is to <strong>make experts smaller but more numerous</strong>. That is, the DeepSeekMoE approach reduces the dimensionality of each individual expert's feed-forward network (and therefore its computational cost and representational capacity) by a factor of 1/m compared to the network's standard feed-forward layer. Correspondingly, it increases the number of total experts by a factor of m <em>and</em> the number of selected experts by the same factor of m. This results in the same number of parameters for the model on net, but allows for substantially more variety when selecting the experts to use for a specific token.</p><p>We can see this increased variety when examining the combinatorics of the expert space. Suppose our standard feed-forward network has hidden dimension 4096, and our standard mixture of experts model uses 8 of these experts in total, with 2 selected for any given token. This results in the following number of possible expert combinations for each token in the standard mixture of experts model:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;{8 \\choose 2} = 28 \\; \\text{possible expert combinations}&quot;,&quot;id&quot;:&quot;SBDXFQQTLB&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Now, using the DeepSeekMoE architecture, suppose we have m = 8. That is, we are going to increase our number of experts by a factor of 8 (and reduce the hidden dimension by a factor of 1/8). This gives us a hidden dimension of 512 per expert, with 64 total experts and 16 experts selected for any given token. This results in the following number of possible expert combinations for each token in the DeepSeekMoE version of the model:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;{64 \\choose 16} \\approx 489,000,000,000,000 \\; \\text{possible expert combinations}&quot;,&quot;id&quot;:&quot;HALZCSCSEQ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>That is, we go from 28 possible expert combinations to nearly 489 trillion possible expert combinations! This allows for <em>significantly</em> more specialization across experts and much more variety in knowledge application on a token-by-token basis. Astonishingly, even with this huge increase in variety, the number of activated parameters stays exactly the same! 
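</p><p>These counts are easy to verify. A quick Python check (using the same illustrative numbers) confirms both the combinatorics above and the parameter arithmetic shown next:</p><pre><code class="language-python"># Sanity check of the expert-combination counts and capacity arithmetic.
from math import comb

print(comb(8, 2))          # 28 combinations in the standard MoE setup
print(comb(64, 16))        # ~4.89e14: the "nearly 489 trillion" above
print(8 * 4096, 64 * 512)  # total capacity: 32768 in both setups
print(2 * 4096, 16 * 512)  # activated per token: 8192 in both setups
</code></pre><p>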
The total parameter count in each model (treating each expert's hidden dimension as a stand-in for its parameter count) is given by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\n\\text{Original MoE model parameters} &amp;= 8 \\text{ experts} * 4096 \\text{ parameters per expert} = 32,768\\\\\n\n\\text{DeepSeekMoE model parameters} &amp;= 64 \\text{ experts} * 512 \\text{ parameters per expert} = 32,768\n\n\\end{align*}&quot;,&quot;id&quot;:&quot;YLEXVROBCV&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Similarly, the number of parameters activated for any given token is exactly the same:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\n\\text{Original MoE activated parameters} &amp;= 2 \\text{ activated experts} * 4096 \\text{ parameters per expert}\\\\ &amp;= 8192\\\\\n\n\\text{DeepSeekMoE activated parameters} &amp;= 16 \\text{ activated experts} * 512 \\text{ parameters per expert}\\\\ &amp;= 8192\n\n\\end{align*}&quot;,&quot;id&quot;:&quot;DWYNCNGFFC&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Hence, we get basically a free lunch here - significantly higher representational capacity in our model with the same number of parameters used!</p><h1>Shared Experts</h1><p>Another approach DeepSeek took to avoid capturing redundancy in its experts is to segment the expert population into two groups: shared experts and routing experts.</p><p>Shared experts are <strong>always activated</strong>, regardless of the input token. This incentivizes these expert modules to capture common knowledge relevant to all queries (e.g., language semantics). By contrast, routing experts are only activated if the token is relevant to the expert, as described in the "Standard Mixture of Experts Models" section. </p><p>That is, the initial <em>mN</em> experts are split into two groups: <em>K_s</em> shared experts and <em>K_r = mN - K_s</em> routing experts. <em>All</em> of the <em>K_s</em> shared experts are activated for all tokens, while a subset of the <em>K_r</em> are selected for each token. 
Mathematically, this looks like the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n    \\mathbf{h}^l_t &amp;= \\sum_{i=1}^{K_s} \\text{FFN}_i(\\mathbf{u}^l_t) \n    + \\sum_{i=K_s+1}^{mN} \\left( g_{i,t} \\cdot \\text{FFN}_i(\\mathbf{u}_t^l) \\right) \n    + \\mathbf{u}_t^l, \\\\[8pt]\n    \\text{where} \\quad \n    \\mathbf{h}_t^l &amp;\\text{ is the hidden vector output for the } t\\text{-th token at the } l\\text{-th layer,} \\\\[5pt]\n    \\text{FFN}_i &amp;\\text{ is the feed-forward network representing the } i\\text{-th expert,} \\\\[5pt]\n    K_s &amp;\\text{ is the number of shared experts,} \\\\[5pt]\n    mN &amp;\\text{ is the total number of experts,} \\\\[5pt]\n    \\mathbf{u}_t^l &amp;\\text{ is the output of the attention mechanism for token } t \\text{ at layer } l, \\\\[5pt]\n    g_{i,t} &amp;= \\begin{cases} s_{i,t}, &amp; s_{i,t} \\in \\text{Top}_k\\left( \\{s_{j,t} \\mid 1 \\leq j \\leq mN\\}, mK \\right) \\\\[5pt] 0, &amp; \\text{otherwise} \\end{cases} \\\\[8pt]\ns_{i,t} &amp;= \\text{Softmax}_i \\left( \\mathbf{u}_t^l{}^\\top \\mathbf{e}_i^l \\right),\n\\end{align}&quot;,&quot;id&quot;:&quot;NOEKVJEXTI&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Hence, we can see that the hidden vector output of token t at layer l <em>always</em> uses all of the shared experts (denoted by the first summation in the equation) and <em>always</em> includes the residual (denoted by the last term). The middle term, representing the routing experts, includes a gating factor that controls which experts are turned on for any specific token. In particular, the gating factor is the output of a softmax if the expert ranks in the top <em>mK</em> experts. Otherwise, it is 0. As a result, not only do we eliminate most of the possible experts (thereby greatly reducing the number of active parameters), we also weight the final output based on how <em>close</em> each chosen routing expert is to the token. In other words, the more a chosen routing expert "knows" about a topic, the more heavily we weight its opinion.</p><p>This setup allows the routing experts to ignore the redundant information captured by the shared experts and instead focus on learning concepts and information that are relevant to their area of specialization. This promotes parameter efficiency in the model, as each marginal parameter added to the routing experts will be encouraged through the learning process to acquire information that is distinct from the existing parameters.</p><h1>Load Balancing</h1><p>Now that we have a better-designed MoE network with fine-grained experts and expert sharing, there still remains one major challenge to ensure the parameters are used maximally - we need to load balance requests across the available experts. Essentially, our goal is to ensure that, across the tokens in a batch, each routing expert is selected and utilized roughly equally. This makes certain that, when we activate routing expert parameters to process a particular token, all of the activated parameters are contributing meaningfully to the output. As a result, we maximize the utilization of the MoE architecture.</p><p>In addition to load balancing across experts, we would like to load balance across devices. Experts are typically stored on many separate GPUs, since these models are too large to fit in the memory of a single GPU. 
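</p><p>Before we turn to the loss terms, it may help to see the shared-plus-routed computation from the equation above in code. The sketch below is a deliberately small, illustrative rendering: the expert counts, sizes, and the choice to softmax over the routed experts only are my simplifications, not DeepSeek's actual configuration (the equation above indexes the softmax over all mN experts):</p><pre><code class="language-python"># A compact sketch of the shared-plus-routed computation: every shared
# expert always fires, routed experts are gated by top-k softmax scores,
# and the residual u is added back. All values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n_shared, n_routed, top_k = 8, 1, 6, 2

u = rng.normal(size=d)  # attention output for token t at layer l
shared = [rng.normal(size=(d, d)) for _ in range(n_shared)]
routed = [rng.normal(size=(d, d)) for _ in range(n_routed)]
centroids = rng.normal(size=(n_routed, d))

# Gating scores s_{i,t}: softmax of centroid-token affinities
# (restricted here to the routed experts for simplicity).
s = centroids @ u
s = np.exp(s) / np.exp(s).sum()

# g_{i,t}: keep the softmax score for the top-k experts, zero elsewhere.
g = np.zeros(n_routed)
top = np.argsort(s)[-top_k:]
g[top] = s[top]

# h_t = shared experts + gated routed experts + residual.
h = sum(u @ W for W in shared) \
    + sum(g[i] * (u @ routed[i]) for i in range(n_routed)) \
    + u
print(h.shape)
</code></pre><p>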
Given this memory constraint, we would like the chosen experts for a token to be evenly spread across devices, thus preventing overloading of any single GPU.</p><p>DeepSeekMoE achieves both of these goals by introducing two new balancing terms into the loss function.</p><h1>Results and Key Takeaways</h1><p>With the above optimizations, DeepSeek was able to mitigate many of the most challenging problems facing MoE models. Together, fine-grained segmentation, shared experts, and load balancing work to maximize the amount of unique, useful information stored in a given set of parameters. As a result, DeepSeekMoE is able to outperform competing models while using <em>fewer</em> active parameters. Below, we can see that DeepSeekMoE outperformed LLaMA2 7B (a dense model that does <em>not</em> use any experts) across a number of benchmarks with fewer than half of the active parameters. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZFgF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZFgF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 424w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 848w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1272w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png" width="991" height="918" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:991,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209291,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZFgF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZFgF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 848w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1272w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When compared to another mixture of experts model, GShard, we see that DeepSeekMoE again outperforms it with the same total parameters and only half of the activated parameters.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RZJF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RZJF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 424w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 848w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1272w, 
https://substackcdn.com/image/fetch/$s_!RZJF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RZJF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png" width="984" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6101720-8db6-495a-99ad-d32ce5906a86_984x547.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:984,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69102,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RZJF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 424w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 848w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1272w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In sum, DeepSeek's optimizations for the MoE architecture served to substantially expand the 
possibilities for local and edge inference. Since only a small percentage of the model's total parameters are active for any given token, during inference, the model's performance requirements are much closer to those of a small, weak model. However, its output quality matches that of a large, well-trained dense LLM. This innovation was critical for laying the groundwork towards DeepSeek-R1, ensuring that state-of-the-art base LLM performance would be possible for smaller models.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe to make sure you don&#8217;t miss any new posts in this series!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold]]></title><description><![CDATA[How ESMFold and ESM3 replace explicit MSAs with encoder-only transformers]]></description><link>https://www.chrishayduk.com/p/understanding-protein-language-models-40b</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-protein-language-models-40b</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 22 Jan 2025 15:53:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!e9L5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e9L5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e9L5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 424w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 848w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1272w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!e9L5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png" width="1277" height="330" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:1277,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Meta's ESMfold: the rival of AlpahFold2 | by Salvatore Raieli | Medium&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Meta's ESMfold: the rival of AlpahFold2 | by Salvatore Raieli | Medium" title="Meta's ESMfold: the rival of AlpahFold2 | by Salvatore Raieli | Medium" srcset="https://substackcdn.com/image/fetch/$s_!e9L5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 424w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 848w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1272w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding Protein Language Models&#8221; 
Series:</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a></p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-e1a">Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching</a></p></li><li><p>[This article] Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold</p></li></ol><div><hr></div><p><strong>Overview of the Main Ideas</strong></p><ol><li><p><strong>AlphaFold2&#8217;s MSA:</strong> AlphaFold2 identifies evolutionarily related proteins to the target sequence and builds a multiple sequence alignment (MSA). In the Evoformer block, row-wise (within-sequence) and column-wise (across-sequences) attention on this MSA yields information about co-evolving residues. This MSA-based representation is then integrated into a pair representation matrix, ultimately helping AlphaFold2 predict the 3D structure.</p></li><li><p><strong>ESMFold&#8217;s Language Model Encoding:</strong> In ESMFold, the MSA step is replaced by a large protein language model (ESM-2) trained via a Masked Language Modeling (MLM) objective. As in standard large language models for text, the hidden layers of the encoder learn semantic and syntactic regularities&#8212;in this case, biochemical and structural patterns. The result is that ESMFold can leverage these learned encodings to identify motifs and co-evolving positions without explicitly performing genetic database searches or building large MSAs.</p></li><li><p><strong>Conceptual Motif Lookup:</strong> We can interpret ESM-2&#8217;s embeddings as performing a &#8220;continuous fuzzy lookup&#8221; within an implicit database of protein motifs. Because the language model was pretrained on massive amounts of protein data, it has effectively learned how residues co-occur&#8212;and thus co-evolve&#8212;within protein families. This internal representation replaces the explicit MSA step.</p></li></ol><p>Below, we will dive into how this replacement works in more detail, starting with a short recap of AlphaFold2&#8217;s MSA-based pipeline and then exploring how ESMFold (and ESM-2 as its core) sidesteps explicit alignment by using learned representations.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>1. Revisiting AlphaFold2&#8217;s MSA-Based Approach</h2><h3>1.1 Gathering Evolutionary Information</h3><p>AlphaFold2 conducts genetic searches against databases such as MGnify, UniRef90, Uniclust30, and BFD to identify sequences that share evolutionary relationships with the target sequence. 
From these hits, it constructs an MSA:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MSA} \\;=\\; \\begin{pmatrix} s_{1,1} &amp; s_{1,2} &amp; \\dots &amp; s_{1,L} \\\\ s_{2,1} &amp; s_{2,2} &amp; \\dots &amp; s_{2,L} \\\\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\ s_{S,1} &amp; s_{S,2} &amp; \\dots &amp; s_{S,L} \\end{pmatrix},&quot;,&quot;id&quot;:&quot;NLFVOYWKIH&quot;}" data-component-name="LatexBlockToDOM"></div><p>where L is the length of the target sequence, and S is the number of evolutionarily related sequences found. Here, s_{k,i} denotes the i-th residue of the k-th sequence in the alignment. Under the hypothesis that residues co-evolve, the MSA provides an external source of statistical correlations about which residues likely pair or contact each other in 3D space.</p><h3>1.2 Evoformer and Pair Representation</h3><p>In the AlphaFold2 pipeline:</p><ol><li><p><strong>MSA Representation</strong> <strong>M</strong>: A 3D tensor M,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{M} \\in \\mathbb{R}^{S \\times L \\times c}&quot;,&quot;id&quot;:&quot;TFKYKLSLZH&quot;}" data-component-name="LatexBlockToDOM"></div><p>where c is the dimensionality of each residue embedding.</p></li><li><p><strong>Pair Representation</strong> <strong>P</strong>: A 2D grid P,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{P} \\in \\mathbb{R}^{L \\times L \\times c_z}&quot;,&quot;id&quot;:&quot;HOLIQNNHIB&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each P_{i,j} is a learned embedding representing the pairwise relationship between residue i and residue j in the target sequence.</p></li></ol><p>Inside the Evoformer block, row-wise and column-wise attention update the MSA representation:</p><ul><li><p><strong>Row-wise (Within-sequence) Attention</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{AttnRow}(\\mathbf{M})_{k, i} = \\sum_{m=1}^{L} \\alpha_{i,m} \\, \\bigl(W^V \\mathbf{M}_{k,m}\\bigr)&quot;,&quot;id&quot;:&quot;FLMQJXBFVQ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where &#945;_{i,m} are attention weights.</p></li><li><p><strong>Column-wise (Across-sequence) Attention</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{AttnCol}(\\mathbf{M})_{k, i} = \\sum_{n=1}^{S} \\beta_{k,n} \\, \\bigl(W^V \\mathbf{M}_{n,i}\\bigr)&quot;,&quot;id&quot;:&quot;BCDSEKOEAZ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where &#946;_{k,n} are attention weights.</p></li></ul><p>After these attention layers (plus MSA transitions via a 2-layer MLP), AlphaFold2 computes an <strong>Outer Product Mean</strong> that integrates MSA embeddings into the pair representation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{OPM}_{i,j} \\;=\\; \\Bigl(\\frac{1}{S}\\sum_{k=1}^S \\mathbf{u}_{k,i} \\otimes \\mathbf{u}_{k,j}\\Bigr) \\,W_{\\mathrm{proj}}&quot;,&quot;id&quot;:&quot;SMHJJISUOS&quot;}" data-component-name="LatexBlockToDOM"></div><p>where u_{k,i} is the final MSA embedding vector for residue i in sequence k. This OPM_{i,j} is then added (or concatenated and projected) into P_{i,j}, effectively injecting co-evolutionary signals gleaned from the MSA into the residue-pair representation.</p>
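<p>Because the Outer Product Mean is the step that actually carries co-evolutionary signal from the MSA into the pair representation, a toy sketch may help. Everything below (the sizes, the random inputs, the single projection matrix) is an illustrative stand-in, not AlphaFold2's real dimensions or code:</p><pre><code class="language-python"># A toy NumPy sketch of the Outer Product Mean: average the outer
# products of per-sequence residue embeddings u_{k,i} and u_{k,j},
# flatten, and project into the pair-representation channel.
import numpy as np

rng = np.random.default_rng(0)
S, L, c, c_z = 5, 7, 4, 3          # sequences, length, MSA / pair channels

U = rng.normal(size=(S, L, c))     # final MSA embeddings u_{k,i}
W_proj = rng.normal(size=(c * c, c_z))

# OPM[i, j] = mean over sequences k of the outer product u_{k,i} (x) u_{k,j}.
outer_mean = np.einsum('kic,kjd->ijcd', U, U) / S

# Flatten the (c x c) outer product and apply the learned projection.
OPM = outer_mean.reshape(L, L, c * c) @ W_proj
print(OPM.shape)                   # (7, 7, 3): one c_z-vector per residue pair
</code></pre><div><hr></div><h2>2. 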
ESMFold: Replacing MSA with Language Modeling</h2><h3>2.1 The Core Mechanism: Encoder-Only Transformer</h3><p>ESMFold (and its backbone ESM-2 model) is built around a large encoder-only transformer. It is trained with the <strong>masked language modeling</strong> objective, meaning it tries to reconstruct masked or hidden residues from context. This training strategy, originally popularized by BERT in natural language processing, has an important effect: it forces the model to encode in its weights the relevant &#8220;contexts&#8221; that predict each amino acid.</p><p>Mathematically, if x=(x_1,x_2,&#8230;,x_L) is the protein sequence and x_k is replaced by a special [MASK] token with some probability, the MLM training objective is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\mathrm{MLM}} = -\\sum_{k=1}^{L} \\log p_\\theta(x_k \\mid x_1, \\ldots, x_{k-1}, \\text{[MASK]}, x_{k+1}, \\ldots, x_L)&quot;,&quot;id&quot;:&quot;UFYLUMHRMO&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where p_&#952; is parameterized by the encoder transformer. Over billions of observed residues, the model internalizes the patterns of co-occurrence across diverse protein sequences.</p><h3>2.2 Implicit Motif Lookup</h3><p>Where AlphaFold2 uses explicit lookups in an MSA database (plus explicit attention across sequences), ESM-2&#8217;s learned embeddings do something analogous &#8220;in one shot.&#8221; After pretraining, the internal representation of each residue h_i (the hidden state at position i) captures average contexts encountered during training. In effect, for any position i,</p><ol><li><p>h_i has high similarity to h_j if residues x_i and x_j frequently appear in similar sequence contexts in the training set.</p></li><li><p>By extension, if an entire sequence <strong>x</strong> has patterns analogous to known motifs (e.g., an ATP-binding site pattern, a signal peptide motif, or secondary-structure fragments), then the embeddings reflect these patterns&#8212;allowing ESMFold to &#8220;retrieve&#8221; them without an explicit MSA.</p></li></ol><p>You can view this as a &#8220;continuous fuzzy matching&#8221; process, wherein the [KEY], [QUERY], and [VALUE] matrices of the transformer contain compressed representations of how residues co-occur. Rather than computing the dynamic-programming-based edit distances (or alignment) across a large external database, the model&#8217;s attention modules effectively do an alignment on-the-fly in a continuous, high-dimensional space.</p><h3>2.3 Integration into Folding</h3><p>ESMFold then appends a structure-prediction head on top of these ESM-2 embeddings, akin to how AlphaFold2 appends its structure module after the Evoformer. Even though ESMFold no longer has an explicit pair representation from an MSA, it still must estimate which residues interact or contact each other. In current ESMFold architectures:</p><ul><li><p>The final hidden states from the ESM-2 encoder are projected into a lower-dimensional representation that acts like a &#8220;pair embedding&#8221; for each (i,j).</p></li><li><p>A geometry module or a series of feed-forward layers further refines these embeddings to produce coordinates or distance/contact maps.</p></li></ul><p>In practice, ESMFold&#8217;s results are often on par with AlphaFold2 for many proteins, especially those with strong evolutionary constraints. 
For proteins with scant evolutionary data, ESMFold can sometimes do <em>better</em> than AlphaFold2, because it does not rely so heavily on a large MSA. On the flip side, certain proteins with well-studied deep MSAs can benefit from the explicit signals that AlphaFold2&#8217;s large MSA provides.</p><div><hr></div><h2>3. Mathematical Rationales for Replacing MSA</h2><h3>3.1 Complexity and Speed</h3><p>One major advantage of dropping MSAs is computational efficiency. MSA searches can be prohibitively expensive for large proteins or large sets of queries, requiring queries against massive databases (MGnify, UniRef, etc.) and heuristics to align thousands of sequences. In ESMFold:</p><ul><li><p><strong>No MSA Search:</strong> The model simply takes the query sequence and feeds it through the encoder in a single forward pass.</p></li><li><p><strong>Linear vs. Quadratic Complexity:</strong> A single Transformer forward pass for a sequence of length L has complexity O(L^2 d) where d is the dimension of embeddings, whereas building an MSA might involve searching and aligning thousands of sequences, each of length up to L.</p></li></ul><h3>3.2 Continuous Fuzzy Matching Perspective</h3><p>If we interpret the MSA as a form of nearest-neighbor search (looking for &#8220;neighboring&#8221; sequences in a large database), then the language model is effectively a learned data structure that has:</p><ul><li><p><strong>Compressed</strong> the manifold of known protein sequences into &#952; (the weights).</p></li><li><p><strong>Learned</strong> an attention-based mechanism to query that internal manifold for relevant contexts.</p></li></ul><p>In typical fuzzy string matching, one might compute edit distances between the query and every entry in the database. In the ESM-2 architecture, the attention mechanism:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Attn}(\\mathbf{Q},\\mathbf{K},\\mathbf{V}) = \\mathrm{softmax}\\Bigl(\\frac{\\mathbf{Q} \\mathbf{K}^T}{\\sqrt{d_k}}\\Bigr) \\mathbf{V}&quot;,&quot;id&quot;:&quot;JAWRVEHEKZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>acts as a trainable similarity function to identify relevant contexts. The intangible advantage is that these contexts may mix and match partial motifs from multiple &#8220;virtual neighbors,&#8221; creating a new representation not limited to the top few explicit matches in a database.</p><h3>3.3 Co-evolutionary Signals Without Explicit Alignments</h3><p>A major reason MSA is so powerful is that it captures <em>co-evolving residues</em>&#8212;positions that change in correlated ways across evolutionary history. In a typical MSA-based approach, if residue i mutates from A to G, residue j might consistently switch from T to S. Over many sequences, one infers that i and j likely contact or interact structurally.</p><p>By training on a massive corpus, the language model sees countless such correlations in raw sequence form. The emergent embeddings reflect these patterns. Hence, the final hidden state h_i is (indirectly) sensitive to all correlated positions that have ever appeared near that residue in training. So even though ESMFold does not align the query sequence to a database, it has internalized an approximate version of that same statistical correlation from its pretrained weights.</p><div><hr></div><h2>4. Example: From MSA to Language Model&#8212;A Toy Mathematical Sketch</h2><p>Suppose we consider a short hypothetical protein sequence <strong>x</strong>=(M,K,L,L,P,V,L). 
In an MSA-based approach, you might gather 10,000 sequences from a database, building:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{pmatrix} M &amp; K &amp; L &amp; L &amp; P &amp; V &amp; L \\\\ M &amp; K &amp; L &amp; L &amp; T &amp; V &amp; L \\\\ M &amp; R &amp; L &amp; L &amp; P &amp; A &amp; L \\\\ \\vdots &amp; &amp; &amp; &amp; &amp; &amp; \\\\ \\text{(10,000 rows)} \\end{pmatrix}.&quot;,&quot;id&quot;:&quot;CPYPYAYXBH&quot;}" data-component-name="LatexBlockToDOM"></div><p>You then compute attention across these sequences (column-wise) and across residues (row-wise), deriving correlation maps.</p><p>In the ESM-2 approach, no explicit MSA is constructed. Instead, during training, the model saw thousands (or millions) of sequences resembling (M,K,L,L,P,V,L) or partial subsequences thereof. The MLM objective forced the model to fill in [MASK] tokens in contexts like _&#8201;K&#8201;L&#8201;_&#8201;P&#8201;V&#8201;_. Over many instances, it learned which next residues are probable. As a result, once we feed (M,K,L,L,P,V,L) into ESM-2, the hidden states reflect a &#8220;compressed MSA,&#8221; effectively picking up correlations that used to require explicit cross-sequence operations.</p><div><hr></div><h2>5. Implications and Future Directions</h2><ol><li><p><strong>Efficiency Gains:</strong> ESMFold runs <em>significantly faster</em> than AlphaFold2 when no large MSA is available, since it avoids the alignment process. For proteome-scale structure predictions, this is a game-changer.</p></li><li><p><strong>Handling Novel Proteins:</strong> If a target protein has few homologs in public databases, MSA-based models struggle. ESMFold is robust in these &#8220;low-homology&#8221; cases since it learned general protein grammar from the entire training corpus.</p></li><li><p><strong>Limited Interpretability:</strong> One downside is that MSA-based approaches produce an explicit record of hits and alignments, which can be biologically interpretable (e.g., which species and families contributed the signals). ESMFold&#8217;s learned embedding, while powerful, can be less transparent.</p></li><li><p><strong>Hybrid Approaches:</strong> Some emerging methods combine pre-trained embeddings with an MSA for the best of both worlds&#8212;particularly for proteins where deep MSAs exist.</p></li><li><p><strong>Scaling Laws and Emergent Behavior:</strong> As ESM models grow (ESM-2, ESM-3, etc.), they exhibit emergent behaviors akin to large language models in NLP. This suggests we may see further improvements in structure prediction, function annotation, and protein design.</p></li></ol><div><hr></div><h2>6. Conclusion</h2><p>AlphaFold2&#8217;s success showed how vital MSAs are in revealing <em>co-evolutionary signals</em>, which guide 3D structure inference. ESMFold&#8217;s fundamental insight is that you can <em>pre-learn</em> these signals at massive scale by treating protein sequences as &#8220;language.&#8221; Then, instead of collecting an MSA at inference time, the model effectively &#8220;queries&#8221; its internal knowledge of sequence co-occurrences, learned through the MLM objective.</p><p>In both approaches, the central idea is to approximate how residues covary. In AlphaFold2, that covariance emerges explicitly from a large MSA. In ESMFold, it is embedded implicitly in a high-dimensional transformer space. 
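</p><p>To see what &#8220;explicit covariance from an MSA&#8221; means in the simplest possible terms, here is a toy calculation over the miniature alignment from Section 4. The alignment and the plain mutual-information estimate are illustrative only; real pipelines use far deeper alignments and more careful coupling estimators.</p><pre><code>import numpy as np
from collections import Counter

# Toy alignment: rows are homologous sequences, columns are positions
msa = ["MKLLPVL",
       "MKLLTVL",
       "MRLLPAL",
       "MRLLTAL"]

def column_coupling(msa, i, j):
   # Mutual information between columns i and j: large when the two
   # positions change together across sequences, near zero otherwise
   n = len(msa)
   pairs = Counter((s[i], s[j]) for s in msa)
   ai = Counter(s[i] for s in msa)
   aj = Counter(s[j] for s in msa)
   return sum((c / n) * np.log((c / n) / ((ai[a] / n) * (aj[b] / n)))
              for (a, b), c in pairs.items())

print(column_coupling(msa, 1, 5))  # K/R at position 1 tracks V/A at position 5</code></pre><p>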
The advantage of the language-model approach is that it (1) eliminates the bottleneck of database searching/alignment, and (2) leverages far more global knowledge than just sequences that happen to align with the target protein.</p><p>Mathematically, we can view these approaches as two different ways to compute a &#8220;similarity function&#8221; over the manifold of protein sequences:</p><ul><li><p><strong>AlphaFold2 + MSA:</strong> An explicit alignment-based approach that organizes relevant sequences so the model can learn correlations.</p></li><li><p><strong>ESMFold + Transformer:</strong> A large-scale learned approach that stores correlation statistics in the weights, retrieving them through self-attention rather than explicit alignment.</p></li></ul><p>As these language models grow and become more accurate, their potential to replace, or at least augment, MSA-based pipelines will only increase, promising ever-faster and more versatile protein structure prediction.</p><p>In summary, ESMFold&#8217;s fundamental contribution is demonstrating how one can use a large, pretrained protein-language transformer to replicate (and in some cases surpass) the evolutionary context that an MSA provides. It is a step toward an era where generative models of protein sequence space might supersede explicit database lookups, enabling faster, more flexible, and equally accurate structure predictions&#8212;even for proteins with scarce evolutionary data.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching]]></title><description><![CDATA[How transformers learn from their input data]]></description><link>https://www.chrishayduk.com/p/understanding-protein-language-models-e1a</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-protein-language-models-e1a</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 15 Jan 2025 19:21:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sMfG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sMfG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sMfG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 424w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 848w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sMfG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png" width="1456" height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;transformer - Why GPT uses decoder only architecture, when they can use  full encoder decoder architecture? 
- Artificial Intelligence Stack Exchange&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="transformer - Why GPT uses decoder only architecture, when they can use  full encoder decoder architecture? - Artificial Intelligence Stack Exchange" title="transformer - Why GPT uses decoder only architecture, when they can use  full encoder decoder architecture? - Artificial Intelligence Stack Exchange" srcset="https://substackcdn.com/image/fetch/$s_!sMfG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 424w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 848w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding Protein Language Models&#8221; Series:</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a></p></li><li><p>[This article] Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching</p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-40b">Understanding Protein Language Models Part III: Structure Prediction 
without Multiple Sequence Alignment in ESMFold</a></p></li></ol><div><hr></div><p>In the quest to understand modern protein language models like ESM2 and ESM3, we often focus on their impressive empirical results while treating their internal mechanisms as a black box. This post attempts to build intuition about how encoder-only transformers work by drawing an analogy to a simpler, well-understood algorithm: fuzzy string matching. I argue that encoder-only transformers can be viewed as performing a kind of continuous fuzzy lookup against a compressed form of their training data, encoded in their weights and latent space representations.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.chrishayduk.com/subscribe?"><span>Subscribe now</span></a></p><h2>The Central Analogy</h2><p>When working with large text corpora, we often need to find similar strings or patterns. The traditional approach employs fuzzy string matching: maintaining a database of all strings and computing edit distances to find matches. An alternative approach, which I argue is conceptually similar but mathematically more sophisticated, uses an encoder-only transformer to compress the patterns in the corpus into model weights, then uses attention mechanisms to find similarities.</p><p>Both approaches fundamentally solve the same problem - finding contextually appropriate matches - but do so in radically different ways. Understanding this connection helps demystify how encoder-only transformers work and suggests ways to improve them.</p><h2>How Encoder-Only Transformers Process Input</h2><p>To understand the analogy, we first need to build detailed intuition about how encoder-only transformers work. Unlike the full encoder-decoder architecture used in translation, encoder-only transformers take a sequence of tokens and return a sequence of the same length, where each output token is a refined representation incorporating contextual information.</p><p>The process begins in the embedding layer, where discrete tokens are converted into continuous vectors. Each input token is first converted to a one-hot encoding - a vector of zeros with a single one indicating the token's identity. This sparse vector is then multiplied by an embedding matrix to produce a dense vector representation. Mathematically, for a token x_i:</p><pre><code>one_hot = [0, 0, ..., 1, ..., 0] # 1 at position x_i 

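# multiplying by W_emb just selects row x_i of the embedding matrix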
embedding = one_hot @ W_emb # Matrix multiplication with embedding matrix</code></pre><p>To this embedding, we add a positional encoding vector that encodes information about where the token appears in the sequence. The original transformer paper used sinusoidal positional encodings:</p><pre><code>def get_positional_encoding(seq_len, d_model): 
   import numpy as np  # imported here so the snippet runs standalone

   position = np.arange(seq_len)[:, np.newaxis] 
   div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model)) 
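   # div_term holds geometrically spaced frequencies, from 1 down to 1/10000 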
   pos_enc = np.zeros((seq_len, d_model)) 
   pos_enc[:, 0::2] = np.sin(position * div_term) 
   pos_enc[:, 1::2] = np.cos(position * div_term) 
   return pos_enc</code></pre><p>These positional encodings have an elegant property: the relative position of two tokens can be computed through linear combinations of their encodings.</p><p>The heart of the transformer architecture lies in its self-attention mechanism. For each position in the sequence, the model generates three vectors through learned linear transformations:</p><pre><code>Q = H @ W_Q # Query vectors 
K = H @ W_K # Key vectors 
V = H @ W_V # Value vectors</code></pre><p>where H is the matrix of hidden states. The attention scores are then computed as:</p><pre><code>attention_scores = softmax(Q @ K.T / sqrt(d_k)) 
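# softmax normalizes row-wise, so each position's attention weights sum to 1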
output = attention_scores @ V</code></pre><p>This can be written more formally as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V &quot;,&quot;id&quot;:&quot;AGZNKXYYMF&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>This attention mechanism allows each position to gather information from all other positions, with the weights determined by learned compatibility scores. The scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large in magnitude, which would push the softmax into regions of extremely small gradients.</p><p>The transformer employs multiple attention heads in parallel, each with its own set of query, key, and value projections:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MultiHead}(H) = \\text{Concat}(\\text{head}_1, \\ldots, \\text{head}_h)W^O&quot;,&quot;id&quot;:&quot;ZRQNCSGKPZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each head is computed as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{head}_i = \\text{Attention}(HW^Q_i, HW^K_i, HW^V_i)&quot;,&quot;id&quot;:&quot;ICYATLGANP&quot;}" data-component-name="LatexBlockToDOM"></div><p>After attention, each position's representation goes through a two-layer feed-forward network:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{FFN}(x) = \\text{max}(0, xW_1 + b_1)W_2 + b_2&quot;,&quot;id&quot;:&quot;WFWABMVVDQ&quot;}" data-component-name="LatexBlockToDOM"></div><h2>Traditional Fuzzy String Matching</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xxZW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xxZW!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 424w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 848w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1272w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xxZW!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif" width="547" height="335" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:335,&quot;width&quot;:547,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Fuzzy String Matching with `stringdist` &#8211; Just R Things&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Fuzzy String Matching with `stringdist` &#8211; Just R Things" title="Fuzzy String Matching with `stringdist` &#8211; Just R Things" srcset="https://substackcdn.com/image/fetch/$s_!xxZW!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 424w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 848w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1272w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To appreciate the analogy, we need to understand how traditional fuzzy string matching works. Given an input string and a database of reference strings, fuzzy matching computes the edit distance between the input and each reference. 
The edit distance represents the minimum number of operations (insertions, deletions, or substitutions) needed to transform one string into another.</p><p>The core of fuzzy string matching is the computation of edit distance. For strings s and t, the dynamic programming recurrence is:</p><pre><code>def edit_distance(s, t): 
   import numpy as np  # imported here so the snippet runs standalone

   m, n = len(s), len(t) 
   dp = np.zeros((m+1, n+1)) 

   # Initialize base cases 
   for i in range(m+1): 
      dp[i,0] = i 
   for j in range(n+1): 
      dp[0,j] = j 

   # Fill dp table 
   for i in range(1, m+1): 
      for j in range(1, n+1): 
         if s[i-1] == t[j-1]: 
            dp[i,j] = dp[i-1,j-1] 
         else: 
            dp[i,j] = 1 + min( 
               dp[i-1,j], # deletion 
               dp[i,j-1], # insertion 
               dp[i-1,j-1] # substitution 
            ) 
   
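   # dp[m,n] now holds the edit distance between the full strings 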
   return dp[m,n]</code></pre><p>Mathematically, the recurrence relation is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;D[i,j] = \\min \\begin{cases} D[i-1,j] + 1 &amp; \\text{deletion} \\\\ D[i,j-1] + 1 &amp; \\text{insertion} \\\\ D[i-1,j-1] + \\mathbb{1}_{s[i] \\neq t[j]} &amp; \\text{substitution} \\end{cases}&quot;,&quot;id&quot;:&quot;NQWAVECNJM&quot;}" data-component-name="LatexBlockToDOM"></div><p>The computation proceeds through dynamic programming, building a matrix where each cell represents the minimum number of operations needed to match a prefix of the input string to a prefix of the reference string. The final cell gives the total edit distance between the strings. By computing this distance for each reference string and sorting the results, we can find the closest matches in the database.</p><p>While conceptually simple and mathematically elegant, this approach becomes computationally expensive for large databases, requiring time proportional to both the string lengths and the size of the database. Various optimizations exist, such as trie structures and early pruning, but the fundamental challenge of scaling remains.</p><h2>The Transformer as Compressed Fuzzy Matching</h2><p>Here we arrive at the core insight: the encoder-only transformer effectively compresses the pattern-matching capabilities of fuzzy string matching into its weights. During training, the model learns to encode the essential patterns and relationships present in the training data into its parameters.</p><p>The embedding matrix learns to map tokens to a continuous space where similar tokens are close together. The attention weights learn which patterns of tokens commonly co-occur, while the feed-forward layers learn to combine these patterns into higher-level features. Each successive layer captures progressively more abstract relationships. This process is analogous to building an optimized index of the training data, but instead of storing exact strings, we store distributed representations of patterns and relationships.</p><p>The connection between transformers and fuzzy matching becomes clearer when we compare their similarity computations. In fuzzy matching, the similarity between strings s and t is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{similarity}(s,t) = -\\min_{\\text{operations}} \\sum_i \\text{cost}(\\text{op}_i)&quot;,&quot;id&quot;:&quot;ZDFABHWQEO&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>In transformer attention, the similarity between vectors q and k is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{similarity}(q,k) = \\frac{q \\cdot k^T}{\\sqrt{d_k}} &quot;,&quot;id&quot;:&quot;HFJIELSMFL&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>We can view the transformer's learned weights as parameterizing a continuous relaxation of edit distance. The attention mechanism implements this relaxed distance metric:</p><pre><code>def attention_similarity(query, key, value): 
   import numpy as np
   from scipy.special import softmax  # any row-wise softmax implementation works here

   # query shape: [seq_len, d_k] 
   # key shape: [seq_len, d_k] 
   # value shape: [seq_len, d_v] 

   scores = query @ key.T / np.sqrt(query.shape[-1])  # scaled dot-product scores 
   attention_weights = softmax(scores, axis=-1) # [seq_len, seq_len], each row sums to 1 
   return attention_weights @ value # [seq_len, d_v]</code></pre><p>Several observations support this compression view. The systematic improvement in model performance with increasing size suggests that larger models can store more detailed patterns from the training data. Analysis of attention patterns reveals that different heads learn interpretable relationships that match linguistic or domain structure. The organization of the embedding space shows meaningful clustering of similar tokens and preservation of analogical relationships.</p><p>When we run a sequence through the transformer, the process mirrors fuzzy matching but operates in a continuous space. The initial embedding maps tokens to vectors, analogous to preparing strings for comparison. The self-attention mechanism computes similarity scores between positions, playing a role similar to edit distance calculation, but using a learned, context-dependent metric. Multiple layers progressively refine these representations, like iteratively improving string alignment.</p><p>We can formalize this connection mathematically. In fuzzy matching, similarity is measured as the negative minimum cost of operations needed to transform one string into another. In transformer attention, similarity is measured through scaled dot products between query and key vectors. The transformer effectively learns a continuous approximation of edit distance that can capture more nuanced relationships.</p><h2>Concrete Example: Protein Sequences</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-iKO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-iKO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 424w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 848w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1272w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-iKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png" width="600" height="536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:536,&quot;width&quot;:600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74056,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-iKO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 424w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 848w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1272w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Consider a concrete example from protein sequence analysis. In traditional fuzzy matching, we might have a query sequence "MKLLPVL" and search a database containing sequences like "MKLLTVL" (one substitution) or "MLKPVL" (two operations). Each comparison requires explicit computation of edit distances. This type of genetic database search is used by MSA-based models such as AlphaFold2 and AlphaFold3. 
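</p><p>As a small illustration, here is what that lookup might look like using the edit_distance function defined earlier; the &#8220;database&#8221; below is a made-up handful of sequences standing in for a real reference set:</p><pre><code># Hypothetical mini-database; a real search would scan millions of sequences
database = ["MKLLTVL", "MLKPVL", "MKLLPVA", "GHTWSQA"]
query = "MKLLPVL"

# Rank every reference by edit distance to the query (lower = closer)
for ref in sorted(database, key=lambda ref: edit_distance(query, ref)):
   print(ref, edit_distance(query, ref))</code></pre><p>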
The mechanics of this search are described in the previous post in the series, <a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a>.</p><p>The transformer approach is markedly different. After embedding the sequence into continuous vectors, self-attention finds similar patterns that have been compressed into the model's weights during training. The output reflects patterns observed during training, but crucially, the model can combine these patterns in novel ways. The transformer effectively "remembers" which amino acid substitutions are biochemically plausible in each context, without storing explicit sequences.</p><p>In the next post in this series, we will flesh out this section further and see how protein language models allow us to replace the MSA step in AlphaFold, creating faster structure prediction models that generalize better to proteins that have few available related sequences.</p><h2>Conclusion</h2><p>Viewing encoder-only transformers as performing compressed fuzzy matching provides powerful intuition about their operation. Rather than seeing them as black boxes, we can understand them as learning to compress and query a vast database of patterns from their training data. This perspective suggests that improvements in transformer architecture may come from better compression techniques for storing training patterns, more efficient similarity computations, and explicit incorporation of string matching algorithms.</p><p>Future research might investigate how much pattern information is stored in different parts of the model, how different architectures affect compression quality, and whether we can design better compression mechanisms inspired by string algorithms. We might also explore the theoretical limits of this compression approach and its implications for model scaling.</p><p>The success of this architecture in domains like protein sequence modeling suggests that the ability to learn and compress domain-specific similarity metrics is a powerful paradigm. As we continue to develop these models, maintaining this conceptual connection to classical algorithms may help guide the way to more efficient and effective architectures.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[OpenAI o3 and the Rise of the Intelligence Allocator]]></title><description><![CDATA[The implications of rapidly increasing inference costs]]></description><link>https://www.chrishayduk.com/p/openai-o3-and-the-rise-of-the-intelligence</link><guid isPermaLink="false">https://www.chrishayduk.com/p/openai-o3-and-the-rise-of-the-intelligence</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Fri, 20 Dec 2024 19:19:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2iKn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2iKn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2iKn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2iKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg" width="1024" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176607,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!2iKn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OpenAI's announcement of their o3 series of models represents a pivotal moment in AI development - but not for the reasons many might expect. While the headline achievement of 87% on ARC AGI is impressive, the more transformative aspect lies in the economics of the model's deployment.</p><p>Let's start with the raw numbers: A single inference task on o3 at its highest compute setting costs over $1,000 (see the figure below). This isn't $1,000 per evaluation set or per session - this is per individual task. To put this in perspective, that's roughly equivalent to 5-10 hours of skilled human labor cost, dedicated to solving a single problem. The model offers lower compute settings, but with corresponding decreases in capability. This creates a direct tradeoff between cost and intelligence that we haven't had to grapple with before.</p><p>This cost structure represents a sharp departure from the trend we've observed over the past two years. During that period, the cost of running general-purpose language models has approached zero, even as their capabilities have steadily improved. GPT-3.5 became GPT-4, yet inference costs remained relatively stable. 
GPT-4 then became GPT-4 Turbo and GPT-4o, maintaining intelligence while rapidly decreasing inference costs. This led to a proliferation of AI applications - we could afford to experiment freely, integrating AI into virtually every workflow to see what stuck.</p><p>The o3 series shatters this paradigm. When each inference costs more than a decent laptop, you can't simply "throw AI at the problem" anymore. Every use of high-compute o3 needs to be justified by the value it creates. This introduces what we might call the "inference allocation problem" - how do we determine which tasks are worth deploying our most powerful (and expensive) models on?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GwAu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GwAu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GwAu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg" width="1194" height="670" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1194,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45229,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GwAu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!GwAu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.chrishayduk.com/subscribe?"><span>Subscribe now</span></a></p><p>Consider a software development team using o3 for code analysis. Running the model at high compute to analyze a critical security vulnerability in a payment system might be easily justifiable. But what about using it to optimize a non-critical internal tool? Or to review routine pull requests? The team now needs to develop frameworks for making these decisions systematically.</p><p>This fundamentally transforms AI deployment into a capital allocation problem. Just as investment managers spread limited capital across opportunities to maximize returns, organizations must now optimize their allocation of inference compute to maximize value creation.</p><p>Consider a hypothetical AI budget of $1 million per month. Currently, this might support tens or hundreds of millions of GPT-4o inferences spread across hundreds of different use cases. With o3, the same budget only covers about 1,000 high-compute inferences. This scarcity forces us to think like capital allocators: Which thousand problems, if solved with our highest level of artificial intelligence, will generate the most value?</p><p>Beyond simply identifying high-value problems, intelligence allocators will need to understand the relationship between compute investment and value creation. 
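</p><p>A toy sketch makes this trade-off concrete. Every number below is invented purely for illustration; the point is the shape of the decision, not the specific values.</p><pre><code># Hypothetical cost and value-capture figures for three compute tiers
tiers = {
   "low":    {"cost": 10.0,   "captured": 0.50},
   "medium": {"cost": 100.0,  "captured": 0.80},
   "high":   {"cost": 1000.0, "captured": 1.00},
}

def best_tier(task_value):
   # Allocate the tier with the highest expected net value for this task
   return max(tiers, key=lambda t: task_value * tiers[t]["captured"] - tiers[t]["cost"])

print(best_tier(100.0))     # low-stakes task: cheap compute wins
print(best_tier(50_000.0))  # high-stakes task: the premium tier pays for itself</code></pre><p>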
Sometimes, a medium-compute inference at $100 might capture 80% of the potential value at 1/10th the cost. In other cases, the step-change in capability from high-compute might be worth the premium. Like any good investment decision, it requires understanding both the cost of capital and the expected returns.</p><p>In another parallel to traditional capital allocation, just as investors develop frameworks for evaluating investments across different sectors and risk levels, organizations will need frameworks for evaluating AI compute allocation across different use cases. These frameworks will need to consider factors like:</p><ul><li><p>The value delta between using high-compute versus lower-compute models</p></li><li><p>The cost of being wrong or suboptimal</p></li><li><p>The potential for value capture from improved accuracy</p></li><li><p>The frequency with which the task needs to be performed</p></li></ul><p>We might even see the emergence of "AI portfolio theory" - methods for optimizing the allocation of compute resources across different types of tasks to maximize expected return while managing risk. Some organizations might adopt a "barbell strategy" - using basic models for routine tasks while reserving expensive high-compute inferences for their most critical problems.</p><p>This shift in focus for AI engineers means that success looks more like developing the frameworks and metrics needed to make intelligence allocation decisions effectively, rather than focusing purely on technical implementations. The best AI engineers will be those who can think like capital allocators, understanding both the technical capabilities and the business value of different compute investments.</p><p>In this light, o3 represents the beginning of an era where artificial intelligence must be treated as a scarce resource requiring careful allocation. The organizations that thrive will be those that develop robust frameworks for deploying this resource where it can generate the highest returns.</p><p>The future of AI might look less like unlimited abundance and more like traditional capital markets, where success comes from making smart allocation decisions with limited resources. As models continue to become more powerful and computationally intensive, these allocation skills will only become more crucial.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[On Algorithmic Moats and the Path to AGI]]></title><description><![CDATA[Google's path to winning the AI race]]></description><link>https://www.chrishayduk.com/p/on-algorithmic-moats-and-the-path</link><guid isPermaLink="false">https://www.chrishayduk.com/p/on-algorithmic-moats-and-the-path</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Thu, 19 Dec 2024 21:59:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pLfc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pLfc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pLfc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pLfc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pLfc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pLfc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pLfc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg" width="1024" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:248010,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!pLfc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!pLfc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!pLfc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!pLfc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>The past few weeks have provided a remarkable natural experiment in AI development dynamics. OpenAI releases what appears to be a breakthrough technology, and Google promptly demonstrates superior capabilities:</p><ul><li><p>OpenAI's Sora demonstrated remarkable text-to-video generation, only to be superseded by Google's Veo 2 with notably higher quality output</p></li><li><p>OpenAI's o1 introduced novel "thinking" capabilities, followed within weeks by Google's Gemini 2.0 Flash Thinking, implementing similar functionality</p></li><li><p>Gemini 2.0 has now surpassed both GPT-4 and Claude Sonnet across a broad range of benchmarks</p></li></ul><p>This pattern reveals something fundamental about the nature of competitive advantage in artificial intelligence. 
To understand why this Google dominance was inevitable, we need to examine a broader principle: the myth of algorithmic moats.</p><h1>Algorithmic Moats</h1><p>It has frequently been said that part of Silicon Valley's success is the lack of non-compete clauses for employees. This allowed trade secrets to proliferate rapidly in the Bay Area, creating more efficient competition dynamics and allowing many engineers to learn from each other, rather than restricting learning and competitive advantages to a single firm.</p><p>However, I rarely see this same line of argument applied to business moats. If it holds, then it implies that algorithms alone <em>cannot</em> provide a durable moat for a business. Employees can easily leave one company and take its hard-won knowledge to a competitor, allowing the competitor to catch up.</p><p>Consider a thought experiment: you discover a revolutionary new algorithm. How long can you maintain that advantage? In a world of mobile talent and reverse engineering, the half-life of algorithmic secrets approaches zero as their value approaches infinity.</p><p>This creates what we might call the algorithm diffusion principle: any sufficiently valuable algorithm will spread through the industry at a rate proportional to its perceived importance. California's prohibition on non-compete clauses accelerates this process, creating an upper bound on how long any single player can maintain algorithmic superiority.</p><p>Hence, algorithms only provide moats insofar as they facilitate the construction of another type of moat. When we talk about algorithmic moats, we're really discussing two separate concepts: the technical implementation details that can be replicated, and the emergent properties that arise from being first to market with those implementations.</p><p>Consider Google's own history with PageRank. While revolutionary for its time, the core insight &#8211; that incoming links could be weighted by the importance of their source &#8211; was relatively straightforward to replicate once published. (A toy sketch at the end of this section shows just how little code the core idea requires.) What made Google dominant wasn't PageRank itself, but rather the virtuous cycle it enabled: better search results &#8594; more users &#8594; more data &#8594; even better search results. The algorithm was merely the catalyst for building a data moat.</p><p>This pattern repeats across the technology landscape. Spotify's recommendation algorithms, while sophisticated, aren't what prevent users from switching to Apple Music or YouTube Music. Instead, it's the years of accumulated listening history, carefully curated playlists, and social sharing features that create switching costs. The algorithms enable these benefits, but they aren't the moat themselves.</p>
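<p>To make the replicability point concrete, here is a minimal Python sketch of the PageRank idea referenced above. This is toy illustration code, not Google's production algorithm: the damping factor and iteration count are conventional textbook defaults, the three-page "web" is invented for the example, and dangling pages are ignored for simplicity.</p><pre><code>def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        # ...and passes the rest of its score along its outgoing links,
        # so a link from an important page is worth more.
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Toy web: "a" links to "b" and "c", "b" endorses "c", "c" links back to "a"
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
print(pagerank(web))  # "c", endorsed by both other pages, scores highest
</code></pre><p>The sketch fits in twenty lines; the durable asset was never the code, but the query-and-click flywheel it set in motion.</p>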
<h1>Moats on the Path to AGI</h1><p>The implications of the lack of direct algorithmic moats become clear when we consider AGI development as a function of three primary variables:</p><ol><li><p>Algorithmic innovation (A)</p></li><li><p>Computational resources (C)</p></li><li><p>Training data quality and quantity (D)</p></li></ol><p>We might express AGI capability as: AGI_capability = A * f(C,D), where f(C,D) represents the effective utilization of compute and data. The algorithm diffusion principle suggests that A will quickly equilibrate across major players. Therefore, the decisive factor becomes f(C,D).</p><p>This is where Google's position becomes overwhelming. Consider their structural advantages:</p><p><strong>Data Supremacy:</strong></p><ul><li><p>Google Search: The world's most comprehensive map of human knowledge and intent</p></li><li><p>YouTube: The largest repository of human audio-visual communication</p></li><li><p>Google Books/Scholar: A near-complete corpus of formal human knowledge</p></li><li><p>Android/Gmail: Vast behavioral and communication datasets</p></li></ul><p><strong>Compute Dominance:</strong></p><ul><li><p>Custom TPU architecture optimized for AI workloads</p></li><li><p>Vertical integration from silicon to software</p></li><li><p>World-class data center infrastructure</p></li><li><p>Decades of distributed systems optimization</p></li></ul><p>These advantages compound non-linearly. Having twice the data and twice the compute doesn't yield four times the capability &#8211; it might yield eight times or more due to emergent properties in large-scale systems. (The toy model at the end of this post makes this arithmetic explicit.)</p><p>The recent pattern of Google rapidly matching and exceeding OpenAI's innovations perfectly illustrates this dynamic. When OpenAI develops a new technique, Google can quickly replicate it (algorithm diffusion) and then apply it with vastly superior resources, achieving better results almost immediately.</p><p>This creates what game theorists would call a dominant strategy for Google: wait for algorithmic innovations, replicate them with superior resources, and achieve better results than the original inventors. The math becomes almost deterministic.</p><p>One might object that breakthrough algorithms could create discontinuous advantages that trump resource differences. However, the observed scaling laws in neural networks suggest otherwise. The smooth power-law relationships we've seen indicate that resource advantages compound predictably rather than being disrupted by algorithmic breakthroughs.</p><p>In retrospect, the tech industry's focus on OpenAI and other startups represents a failure to reason from first principles. In a world where algorithmic innovations cannot be contained, the player with overwhelming advantages in compute and data will inevitably emerge victorious. Google's position isn't just strong &#8211; it's strategically dominant in a game-theoretic sense.</p><p>The universal rule of algorithmic diffusion suggests a surprising corollary: the most effective strategy for other players might not be to compete directly with Google, but rather to focus on specialized domains where Google's general-purpose advantages are less relevant. This could, ironically, lead to a more specialized and diverse AI ecosystem than many currently predict.</p>
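<p>To make the compounding argument explicit, here is a toy numerical version of the capability equation above. The functional form f(C,D) = C**beta * D**gamma and the exponent values are assumptions chosen to reproduce the "eight times" figure from earlier, not empirical scaling estimates; the point is only that once the algorithmic multiplier A equilibrates, the resource terms dominate.</p><pre><code>def capability(A, C, D, beta=1.5, gamma=1.5):
    """Toy model: AGI_capability = A * f(C,D), with f(C,D) = C**beta * D**gamma.

    beta + gamma greater than 1 encodes super-linear compounding; the values
    here are illustrative assumptions, not measured scaling exponents.
    """
    return A * (C ** beta) * (D ** gamma)

A = 1.0  # algorithm diffusion: both players converge to the same multiplier
challenger = capability(A, C=1.0, D=1.0)
incumbent = capability(A, C=2.0, D=2.0)  # twice the compute, twice the data

print(incumbent / challenger)  # 8.0: doubling both inputs compounds to 8x, not 4x
</code></pre><p>Under these (admittedly stylized) assumptions, a challenger's temporary edge in A would itself need to approach an order of magnitude to offset a sustained factor-of-two deficit in both compute and data.</p>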
]]></content:encoded></item><item><title><![CDATA[Rejuvenating the Political System]]></title><description><![CDATA[How do we get the US government unstuck?]]></description><link>https://www.chrishayduk.com/p/rejuvinating-the-political-system</link><guid isPermaLink="false">https://www.chrishayduk.com/p/rejuvinating-the-political-system</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Mon, 22 Jul 2024 23:39:38 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!y82n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac188ae5-df90-466a-928e-0d204fbfb2e1_1024x698.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At this point, there&#8217;s fairly broad bipartisan agreement that the United States government has stagnated. Our politicians are too old, our government institutions less effective, and our spending less efficient than in the halcyon days of Big Government, the 1940&#8211;1970 period of US history. But why has the government ossified?
To answer that question, let&#8217;s start by looking at one system that has managed to continually renew itself and improve since its widespread appearance roughly 200 years ago &#8212; the free market economy.</p><h2>Renewal and Markets</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!y82n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fac188ae5-df90-466a-928e-0d204fbfb2e1_1024x698.png" width="1024" height="698" alt="Capitalism is the greatest anti-poverty program the world has ever known" title="Capitalism is the greatest anti-poverty program the world has ever known"></figure></div>
<p>For most of recorded human history, progress was too slow to measure along nearly all relevant axes: life expectancy, infant mortality, GDP per capita, etc. Progress along each of these vectors was measured on the order of centuries or millennia, and any hard-won progress was extremely fragile, with setbacks such as the fall of the Western Roman Empire taking centuries to recover from in regions such as Britain. However, beginning in the late 1700s, a confluence of technological, sociological, and ideological innovations led to capital-P &#8220;Progress&#8221; &#8212; the sort of rapid, hockey-stick growth that we&#8217;ve come to expect as a natural part of our world. One of these key innovations, perhaps the most important of them all, was the introduction of capitalism and the free market economy.</p><p>Capitalism and free markets together unleashed several key forces that ignited global growth: competition, distributed information processing, and incentive alignment. I&#8217;ll briefly outline the importance of each below:</p><ul><li><p><strong>Competition. </strong>Competition is the driving force behind innovation and efficiency in capitalist economies. When companies vie for market share, they are compelled to improve their products, lower prices, and increase efficiency to attract consumers. This constant pressure to outperform rivals fosters an environment where only the most efficient and innovative businesses thrive.
Competition not only benefits consumers with better and more affordable products but also pushes companies to continually innovate, ensuring that progress never stagnates.</p></li><li><p><strong>Distributed Information Processing.</strong> Information about consumer preferences, resource availability, and technological possibilities is widely distributed among individuals and businesses. This decentralized information processing allows markets to adapt quickly to change. Entrepreneurs and businesses can respond to signals from the market, such as shifts in consumer demand or new technological advancements, enabling a dynamic and responsive economic system.</p></li><li><p><strong>Incentive Alignment. </strong>Capitalism aligns incentives by tying the success of individuals to the success of the companies they own or work for. When management and workers have a stake in the ownership of their companies, their incentives align closely with the economic growth and dynamism of the broader economy. Managers and employees who hold equity in their firms are directly rewarded for the company&#8217;s performance, encouraging them to work harder, innovate, and make decisions that enhance the company&#8217;s value. This ownership structure creates a powerful alignment of interests that drives productivity and fosters a culture of continuous improvement.</p></li></ul><p>These three factors have endowed capitalist free market economies with a remarkable capacity to enrich society through a built-in mechanism for rejuvenation. While individual companies may not endure indefinitely, the market system itself thrives and evolves, welcoming new entrants and jettisoning outdated competitors. </p><p>The cycle of rejuvenation in free markets follows these steps:</p><ol><li><p><strong>Innovation Emergence</strong>: A new technology, methodology, or market becomes available, often driven by advancements in science and technology or shifts in consumer behavior.</p></li><li><p><strong>Entrepreneurial Entry</strong>: Entrepreneurs and new companies seize the opportunity to exploit this new angle, creating products or services that meet emerging demands.</p></li><li><p><strong>Incumbent Inertia</strong>: Established companies, often characterized by bureaucratic inertia and resistance to change, struggle to adapt to the new paradigm.</p></li><li><p><strong>Competitive Displacement</strong>: The new entrants, leveraging their innovative approaches and exploiting the incumbents' weaknesses, begin to outcompete and displace these established firms.</p></li><li><p><strong>Decline of the Old Guard</strong>: Incumbent companies shrink or exit the market, unable to maintain their previous dominance.</p></li><li><p><strong>Cycle Continuation</strong>: The high-growth new companies eventually become the new incumbents, setting the stage for the next wave of innovation and disruption.</p></li></ol><p>This cycle ensures that, while no single company survives for eternity, <em>the system itself</em> survives and grows more efficient and prosperous. And, crucially, <strong>virtually everything about the companies within the system can change</strong>. They can adopt different governance and incentive structures, leverage different technologies, target different market segments, and more. While the system itself remains in place, its constituent parts can change to the point of being unrecognizable. 
This allows our economy to respond dynamically to rapidly changing technological and environmental factors and to leverage these changes for continuous improvement &#8212; that is, in the parlance of Nassim Nicholas Taleb, to be <a href="https://fs.blog/antifragile-a-definition/">Antifragile</a>.</p><h2>The Ossified Nation-State</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!_bGK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6160c1eb-9598-49f6-a3ea-ec0bd80bca26_1136x852.jpeg" width="1136" height="852" alt="Congress Today: Oldest in American History - Business Insider" title="Congress Today: Oldest in American History - Business Insider"></figure></div>
<p>In stark contrast to the above, <strong>there is no equivalent process for rejuvenation and change in nation-states</strong>. Governments are often enshrined in constitutions that either remain largely unchanged (as in the United States) or undergo only superficial modifications while retaining the same overarching structure (as seen in many Latin American countries). The only mechanism for large-scale governmental change appears to be a total collapse of the existing order, which can come through: </p><ul><li><p><strong>Violent Revolution</strong>: Internal uprisings that overthrow the existing order.</p></li><li><p><strong>External Conquest</strong>: Invasions and conquests that impose new governmental structures.</p></li><li><p><strong>Post-War Reorganization</strong>: Major wars resulting in governments' dissolution and reformation, such as Germany after World War I.</p></li></ul><p>Historically, the structure of the democratic republic was able to partially fill the role of the free market economy in the context of nation-states. Democratic republics such as the United States allowed voters to periodically replace presidents, cabinets, and policy agendas through regularly scheduled elections, avoiding the violent regime changes mentioned above. The combination of these regularly scheduled elections with de facto term limits on the executive branch created a system where new leaders, new ideas, and new policy structures were continually brought in to renew the government. However, several contemporary factors are undermining the democratic republic as a rejuvenating system:</p><ol><li><p><strong>Aging Politicians</strong>: Politicians are getting older and staying in office longer. Increased lifespans, coupled with the absence of term limits for Congress, contribute to this trend. 
Additionally, mass media and a larger, less informed voting populace make name recognition a crucial determinant of electoral success, favoring long-term incumbents.</p></li><li><p><strong>Influence of the Supreme Court</strong>: The Supreme Court plays a more significant role than in the past, as partisan interest groups frequently reach gridlock. As more and more decisions are elevated to the Supreme Court, the nation&#8217;s decision-making power becomes increasingly centralized in a group of justices who receive lifetime appointments, limiting the capacity for new views to be brought in.</p></li><li><p><strong>Persistent Bureaucracy</strong>: The bureaucratic apparatus surrounding the government has grown in size and does not change with each administration, so the personnel and ideologies making up the government remain more constant from administration to administration.</p></li></ol><p>All of these factors together make the US government increasingly brittle &#8212; it is less able to change in response to rapidly changing environments, and so is more likely to break. Think of the difference between a wood plank and water. Water changes in response to even tiny perturbations, but the benefit is that it will not "break" from a large impact. By contrast, a wood plank will resist most changes until one change large enough to overcome its resistance snaps the plank in half. Similarly, the increased rigidity of the US government has made it harder to bend, and therefore more likely to break in response to extreme environmental, economic, or sociological stress.</p><h2>How do we move forward?</h2><p>To improve the situation, we need to ensure much higher turnover in the government. This includes personnel, policies, legislation, and ideologies. We need to ensure that the government can respond rapidly to changing conditions, which can only occur by allowing the governance structure in place to shift without the strong <em>status quo</em> bias we see today. There are a few factors that I think need to change to allow this to happen:</p><ol><li><p><strong>Radically reducing the size and scope of government agencies with unelected officials. </strong>Bureaucratic agencies since WW2 have been able to run roughshod over the country, interpreting and enforcing laws as they see fit while remaining largely impervious to public scrutiny. The opacity of these organizations and the inability of voters to remove their officials through elections have allowed them to ossify, making the government apparatus significantly more stagnant from administration to administration. The <a href="https://www.scotusblog.com/2024/06/supreme-court-strikes-down-chevron-curtailing-power-of-federal-agencies/">Supreme Court&#8217;s ruling in <em>Loper Bright Enterprises v. Raimondo</em> and <em>Relentless, Inc. v. Department of Commerce</em></a> is a step in the right direction here, limiting the ability of agencies to interpret the law with little oversight (which can lead to ambiguous, shape-shifting regulatory implementations that place huge strains on businesses).
</p></li><li><p><strong>Setting expiry dates on all government legislation. </strong>Laws existing in perpetuity create strong inertia in the government &#8212; laws that already exist will likely continue to exist because keeping them requires no action, whereas repealing them requires action. This results in ever-growing regulations, tax codes, and legal codes, which sap dynamism from the economy and provide significant market opportunities to purely extractive companies and institutions. Both individuals and corporations would benefit from simpler tax codes and reduced regulatory burdens.</p></li><li><p><strong>Imposing term limits on Congress and the Supreme Court. </strong>Term limits would ensure a regular influx of new ideas and reduce the likelihood of long-term incumbency leading to stagnation. In particular, poor appointments to the Supreme Court are disastrous, as lifetime appointments can last for several decades before the seat is turned over to a new justice. By making appointments temporary, we can limit the negative impact of bad appointments while also ensuring that new generations of justices regularly filter into the court.</p></li><li><p><strong>Implementing cognitive testing for all elected officials. </strong>This one is self-explanatory. We should not have to repeat the embarrassments of this national election cycle, where each party has competed to prove that the other&#8217;s candidate is the more geriatric and clueless. We as a nation should have the confidence (and frankly, the dignity) to be able to assume that our leaders are cognitively capable of handling the job.</p></li></ol><p>The goal of the above measures is to create a system that fosters innovation and dynamism in government by allowing it to <em>change</em>. Each measure aims to remove bloat from the current system, increase competition for positions among elected officials, and reduce the status quo bias that is so prevalent in politics today. There is certainly much more to be done above and beyond these suggestions, but they offer a starting point for a government that isn&#8217;t stuck doing a cheap parody of FDR&#8217;s administration.</p>]]></content:encoded></item></channel></rss>