<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Musings by Chris Hayduk: The Sciences]]></title><description><![CDATA[My science-related posts, typically covering technical AI deep dives, AI policy, tech business strategy, and biology]]></description><link>https://www.chrishayduk.com/s/the-sciences</link><image><url>https://substackcdn.com/image/fetch/$s_!UbBK!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F27ef8414-9281-48f9-b72c-6c8dc93c7664_799x799.png</url><title>Musings by Chris Hayduk: The Sciences</title><link>https://www.chrishayduk.com/s/the-sciences</link></image><generator>Substack</generator><lastBuildDate>Thu, 09 Apr 2026 18:55:09 GMT</lastBuildDate><atom:link href="https://www.chrishayduk.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Chris Hayduk]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[chrishayduk@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[chrishayduk@substack.com]]></itunes:email><itunes:name><![CDATA[Chris Hayduk]]></itunes:name></itunes:owner><itunes:author><![CDATA[Chris Hayduk]]></itunes:author><googleplay:owner><![CDATA[chrishayduk@substack.com]]></googleplay:owner><googleplay:email><![CDATA[chrishayduk@substack.com]]></googleplay:email><googleplay:author><![CDATA[Chris Hayduk]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[A Tale of Two Futures]]></title><description><![CDATA[America is building Skynet. 
China is building The Jetsons.]]></description><link>https://www.chrishayduk.com/p/a-tale-of-two-futures</link><guid isPermaLink="false">https://www.chrishayduk.com/p/a-tale-of-two-futures</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Sun, 11 Jan 2026 23:07:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!TYv0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TYv0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TYv0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TYv0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!TYv0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TYv0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TYv0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:10015133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TYv0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TYv0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!TYv0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TYv0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ac0eb39-fc7b-4b08-9d86-b29cd9fa21ef_5504x3072.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The US and China are engaged in a two-country race towards an AI-powered future. Both countries have directed funding, policy initiatives, and talent towards the AI sector at levels matched only by the internet buildout of the 1990s or railroad construction of the 1880s. But both of these countries are building towards diametrically opposed futures. </p><p>One country has taken the view that it is on the path to artificial superintelligence (ASI) &#8212; the point at which AI will be more effective than the most capable humans at every task. The view of their prominent AI labs is that once ASI is achieved, there will be a runaway intelligence explosion, with the AI rapidly improving itself and reaching unthinkable levels of intelligence. This genius AI will then be able to solve our most pressing problems in mathematics, physics, philosophy, and more with minimal difficulty.</p><p>The other country has taken the view that it is not peak intelligence that matters, but rather the distribution of intelligence. It aims to develop intelligent AI models (though not superintelligent) that are fast and cheap enough to be embedded in machines across the economy. It wants household robots in every home, talking cars, and refrigerators that can do the grocery shopping.</p><p>One future is top-down, the other is bottom-up. One is centralized, the other is decentralized. One results in gains accruing to the select few corporations, the other results in gains throughout the entire economy. 
One is Skynet, the other is The Jetsons.</p><p>The great irony of the current AI race situation is that the centralized SkyNet future belongs to the democratic United States, and the decentralized Jetsons future belongs to authoritarian China.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Musings by Chris Hayduk is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>The Skynet Future</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9mVt!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9mVt!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9mVt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8615005,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!9mVt!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!9mVt!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6c6b0ca-a9ec-4961-9d18-5ffc475e7c45_5504x3072.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Skynet is the fictional AI developed by Cyberdyne Systems in the original Terminator movie. It&#8217;s a large, powerful AI system that, once deployed, rapidly increases its intelligence to the point of becoming self-aware. When it becomes self-aware, it becomes self-interested and decides that the best way to preserve its existence is to eliminate <em>all </em>humans. Skynet then appropriates the nuclear codes of the United States and launches these nuclear weapons in an attempt to eliminate the human race from the face of the Earth. Small groups of humans survive the nuclear fallout, and they live in a post-apocalyptic world fighting robots directed by Skynet that are attempting to exterminate humanity once and for all.</p><p>This vision is very apocalyptic, but I think it captures the sentiment of the US AI scene better than any other popular depiction of AI. From the initial ambitions of Skynet running the United States to its intelligence explosion to its final destruction of humanity, these views are not outlandish in Silicon Valley. In fact, they may be the norm among AI researchers and AI lab CEOs. 
And these views have substantial implications for how AI research is playing out in the United States.</p><p>The CEOs of AI research labs are explicitly building towards this form of superintelligence. Crucially, these labs view pushing the intelligence frontier of these models as the core goal of their research. They explicitly look to benchmarks that measure intelligence on extremely difficult tasks (such as FrontierMath, GDPVal, and ARC-AGI-2) as the core metrics that they&#8217;re optimizing against. Their goal is to produce a &#8220;country of geniuses in a datacenter&#8221;, as Dario Amodei put it in his article &#8220;Machines of Loving Grace&#8221;.  Amodei believes that, once achieved, artificial superintelligence could compress 50-100 years of biological research into 5-10 years. </p><p>Moreover, many of the prominent figures in AI view the path to superintelligence as a race. They believe that as we move closer to superintelligence, we will be able to achieve an automated AI researcher that can analyze its own codebase and improve it rapidly. These algorithmic gains from AI producing its own code improvements will result in an intelligence explosion, such that the first lab to produce an automated AI researcher will immediately gain an insurmountable lead in intelligence over the other labs. As such, not only do the AI lab leaders <em>believe</em> this Skynet scenario, but they also view it as a race that must be won at all costs. The very existence of their companies in their minds depends upon reaching the superintelligent AI first. To understand the pace of AI investment and the amount of that investment that becomes allocated to training ever-larger and more capable models (rather than more cost-efficient or broadly distributed models), you need to internalize that the prominent figures in AI strongly believe this to be the true state of the world. </p><p>To race towards superintelligence, massive increases in the two core inputs to AI training are needed: data and compute. As a result, the AI labs have invested hundreds of billions of dollars into massive compute scale-outs, data acquisitions, and RL environment development. New data centers are coming online that consume gigawatts of electricity to train larger models with higher parameter counts. In addition, these new models are fed data and trained in RL environments produced by hired PhDs and leading experts across math, computer science, finance, and more. 
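<p>To make the structure of that bet concrete, here is a deliberately crude toy model (a minimal Python sketch with made-up parameters, not any lab&#8217;s actual forecast): if an automated researcher&#8217;s rate of self-improvement scales with its current capability, a small head start compounds into a runaway lead; if returns saturate, the lead stops widening.</p><pre><code>
# A deliberately crude toy model of the takeoff argument described above: an
# automated researcher whose research speed scales with its own capability.
# All parameters are arbitrary illustrations, not forecasts.

def simulate(capability, months, rate=0.02, ceiling=None):
    """Monthly capability gain proportional to capability squared (self-
    improvement); an optional ceiling adds diminishing returns near the top."""
    c = capability
    for _ in range(months):
        gain = rate * c * c
        if ceiling is not None:
            gain *= max(0.0, 1.0 - c / ceiling)
        c += gain
    return c

leader, follower = 1.10, 1.00   # the leading lab starts 10% ahead

# Runaway regime (no ceiling): the relative lead compounds beyond the initial 1.10.
print(simulate(leader, 40) / simulate(follower, 40))

# Saturating regime: both labs approach the same ceiling, so the lead stops
# widening and slowly closes back toward 1.0 instead.
print(simulate(leader, 40, ceiling=2.0) / simulate(follower, 40, ceiling=2.0))
</code></pre><p>Under the takeoff assumption, the first ratio keeps increasing as the horizon extends, which is exactly why the labs treat the race as winner-take-all; under saturation, it does not.</p>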
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!75Fj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!75Fj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!75Fj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:119304,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!75Fj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!75Fj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c886bd8-a15e-41f2-8503-097d3a5cefef_1920x1080.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To afford the energy, compute, and data to produce these models that push benchmark metrics on FrontierMath and GDPVal, more and more centralization is encouraged in AI research. As shown in the chart above, the cost of building a frontier data center has been increasing exponentially, from around $7 billion in 2022 (when ChatGPT was first released) to a projected $106 billion in 2027. Hence, if there is a fixed amount of private &amp; public funding available to AI companies, it is in investors&#8217; interest to allocate that funding to a small number of companies so that they can afford the requisite frontier data centers to train these models. 
With funding too widely distributed, no single player would be able to produce a model that outperformed the state-of-the-art on these frontier benchmarks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vyXb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vyXb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vyXb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:242767,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vyXb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!vyXb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff976423b-3d9f-43f1-9b0d-fdf825a5359b_1920x1080.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can observe this trend more clearly in the training compute cost of frontier models from 2012 through 2025. We can see that model training costs rapidly increased from roughly $3 million in 2022 to over $300 million in 2025. In addition, the trend line shows this cost increasing at a rate of 0.5 orders of magnitude per year, indicating that the cost of a single training run next year (2027) will increase to about $3 billion. Projected forward to the end of the decade (2030), a training run would cost roughly $100 billion (with the data centers powering such a run likely costing in excess of $1 trillion).</p><p>Given all of this, we can think of the American AI ecosystem as a bet on increasing centralization and increasing scale. It is a bet on a benevolent Skynet future &#8212; leveraging unprecedented resources to build a single, massive AI model capable of solving the most difficult problems in science, technology, politics, and philosophy. 
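<p>As a rough sanity check on the cost trajectory described above, here is a minimal Python sketch of the extrapolation. It simply compounds the roughly $300 million 2025 figure at the 0.5 orders-of-magnitude-per-year trend discussed above; the numbers are illustrative, not a forecast of any particular lab&#8217;s spending.</p><pre><code>
# Rough extrapolation of frontier training-run costs, assuming the
# ~0.5 order-of-magnitude-per-year trend line discussed above continues.
BASE_YEAR = 2025
BASE_COST_USD = 300e6     # roughly $300 million per frontier training run in 2025
OOM_PER_YEAR = 0.5        # trend-line growth rate (orders of magnitude per year)

def projected_cost(year):
    """Project the cost of a single frontier training run for a given year."""
    return BASE_COST_USD * 10 ** (OOM_PER_YEAR * (year - BASE_YEAR))

for year in (2026, 2027, 2030):
    print(f"{year}: ~${projected_cost(year) / 1e9:.1f}B per training run")

# Expected output (roughly):
# 2026: ~$0.9B per training run
# 2027: ~$3.0B per training run
# 2030: ~$94.9B per training run
</code></pre><p>Compounding at half an order of magnitude per year means the entry ticket for a frontier training run roughly triples annually, which is the mechanism behind the centralization pressure described above.</p>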
</p><h2>The Jetsons Future</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JCQV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JCQV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JCQV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8652965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JCQV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JCQV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93b633a2-3b1f-4c7c-9b0e-6ff0d60a2ae0_5504x3072.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The Jetsons is an animated sitcom from the 1960s that depicts a future defined not by a single technological breakthrough but by the accumulation of countless small conveniences &#8212; flying cars, household robots, and apartments that predict each family&#8217;s needs. The Jetsons&#8217; future is not one of transcendence but of leisure &#8212; technology has not produced a godlike intelligence but has instead seeped into every object, automating away the drudgery of daily life. The show&#8217;s vision is one of abundance through distribution: no single machine is particularly impressive, but the sheer proliferation of helpful machines has transformed the nature of work and home life entirely.</p><p>The Jetsons aired the same year as the Cuban Missile Crisis. Sixty years later, it is the CCP &amp; China, not the United States, that is building toward its vision.</p><p>Chinese AI labs burst onto the scene in early 2025 with the release of DeepSeek-R1. Unlike its American counterparts, the notable aspect of DeepSeek&#8217;s model was <em>not</em> its raw performance &#8212; it was strong, but it lagged behind the frontier. The truly impressive aspect of DeepSeek-R1 was that it performed similarly to frontier models at a fraction of the cost, both in terms of serving cost and training cost. For example, as I detailed in another article (<a href="https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world">Open Source LLMs Are Eating the World</a>), DeepSeek-R1 was able to perform nearly as well as OpenAI o1 (the frontier model at the time) on the MMLU Pro benchmark, while only costing $6.75 to run the full benchmark suite compared to o1&#8217;s $75. This represented an 11x drop in the cost to serve the model at roughly equivalent performance levels.</p><p>The success of DeepSeek-R1 has sparked a wave of innovation in open-source Chinese AI. Various companies have entered the fray, including Alibaba with its Qwen series, Z.ai with its GLM series, and Moonshot AI with its Kimi series. Each of these three core competitors, along with DeepSeek, has steadily pushed the cost of economically useful intelligence towards zero. 
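<p>The cost comparison is simple arithmetic, but it is worth making explicit. A minimal sketch using the MMLU Pro figures cited above (the dollar amounts are the benchmark-suite costs from the linked article; everything else is illustrative):</p><pre><code>
# Cost-efficiency comparison for running a full benchmark suite, using the
# MMLU Pro serving costs cited above (illustrative sketch, not a pricing tool).
def cost_ratio(closed_cost_usd, open_cost_usd):
    """How many times cheaper the open model is for the same evaluation."""
    return closed_cost_usd / open_cost_usd

o1_cost = 75.00           # OpenAI o1, full MMLU Pro run
deepseek_r1_cost = 6.75   # DeepSeek-R1, full MMLU Pro run

print(f"DeepSeek-R1 is ~{cost_ratio(o1_cost, deepseek_r1_cost):.1f}x cheaper to serve")
# -> DeepSeek-R1 is ~11.1x cheaper to serve
</code></pre>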
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oZCI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oZCI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 424w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 848w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 1272w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oZCI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png" width="990" height="509" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37587488-3401-4e68-b293-1088e5357ca0_990x509.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:509,&quot;width&quot;:990,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54656,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/184223287?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oZCI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 424w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 848w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 1272w, https://substackcdn.com/image/fetch/$s_!oZCI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37587488-3401-4e68-b293-1088e5357ca0_990x509.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container 
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Produced by GPT 5.2 Pro using MMLU Pro benchmark results</figcaption></figure></div><p>In addition, the speed of innovation has compressed the time it takes Chinese open-source models to match the performance of their closed-source counterparts. The chart above shows the catch-up time required for open-source AI models to match the performance of closed-source models at different performance thresholds on the MMLU-Pro benchmark. Earlier performance levels, such as the 60% threshold, required roughly two-thirds of a year before open-source AI could match closed-source AI. More recent performance thresholds have been reached in far less time, between a quarter and a third of a year. Chinese AI labs have ramped up their investments in data centers and energy, they can now purchase NVIDIA H200 chips, and the Chinese chip ecosystem is maturing more quickly than expected. As a result, we should expect this gap to continue to compress rather than expand. </p><p>The implications here are genuinely massive. This chart shows that open-source AI models from China can match the performance of our leading closed-source models in less than six months, often at an order of magnitude lower cost. 
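<p>For readers who want to reproduce this kind of analysis, the catch-up metric in the chart can be computed from dated benchmark results roughly as follows. This is a minimal Python sketch; the dates and scores below are placeholders chosen only to mirror the lags discussed above, not the actual data behind the chart.</p><pre><code>
from datetime import date

# For each MMLU-Pro threshold, find when the best closed model first crossed it
# and when the best open model did, then take the difference.
# Placeholder data: (first date a model at or above this score was available, score).
closed_models = [(date(2023, 3, 1), 0.60), (date(2024, 5, 1), 0.75), (date(2024, 12, 1), 0.83)]
open_models = [(date(2023, 11, 1), 0.60), (date(2024, 9, 1), 0.75), (date(2025, 3, 1), 0.83)]

def first_crossing(results, threshold):
    """Earliest date at which any listed model met or exceeded the threshold."""
    dates = [d for d, score in results if score >= threshold]
    return min(dates) if dates else None

for threshold in (0.60, 0.75, 0.83):
    closed_date = first_crossing(closed_models, threshold)
    open_date = first_crossing(open_models, threshold)
    if closed_date and open_date:
        lag_years = (open_date - closed_date).days / 365.25
        print(f"{threshold:.0%} threshold: open source lagged by ~{lag_years:.2f} years")

# With the placeholder data above, the lags come out to ~0.67, ~0.34 and ~0.25
# years, echoing the two-thirds-of-a-year-to-one-quarter compression in the text.
</code></pre>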
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Wq9P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Wq9P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg" width="1172" height="940" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:940,&quot;width&quot;:1172,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!Wq9P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Wq9P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f6107d3-497b-434c-a47f-92354ccbddd2_1172x940.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" 
stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In parallel with these advances in open-source AI, China has been undergoing two massive buildouts over the last five years. The first is a huge increase in energy generation, specifically in solar and nuclear power; the annual additions of installed solar capacity in China, for example, frequently exceed those of the rest of the world combined. This abundance of cheap power is enabling manufacturing at a scale never seen before, and the resulting technological and cost improvements in solar panels and batteries will make energy not only more plentiful overall, but also more local and mobile. In practical terms, devices will be able to carry far more onboard power than they could previously, thanks to improved batteries and local solar panels. 
This will allow devices to include the substantial onboard compute required to run the top open-source AI models.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MY6h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MY6h!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 424w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 848w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1272w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MY6h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png" width="1024" height="622" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:622,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Q&amp;A: The global 'trade war' over China's booming EV industry - Carbon Brief&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Q&amp;A: The global 'trade war' over China's booming EV industry - Carbon Brief" title="Q&amp;A: The global 'trade war' over China's booming EV industry - Carbon Brief" srcset="https://substackcdn.com/image/fetch/$s_!MY6h!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 424w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 848w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1272w, https://substackcdn.com/image/fetch/$s_!MY6h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8a39805-ed7b-4ba4-b413-be04b84cb56b_1024x622.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The second key build-out has been in advanced manufacturing. China was <em>already</em> the world's manufacturing center before this push into advanced manufacturing. But now it has moved up the value chain and very quickly has gone from a laggard in key industries to dominating them. For instance, five years ago, China was largely irrelevant in the electric car market. Now its electric car companies are leading the world and outselling Tesla. Several leading companies are now making strong pushes into humanoid robots and are setting the pace in that category. </p><p>This confluence of advanced manufacturing in robotics &amp; battery-powered vehicles, increased energy generation (specifically solar), and open-source AI that is energy- and compute-efficient will allow China to develop a truly intelligence-powered economy. The major factors will be in place to have human-level AI embedded into a large share of both consumer and industrial products. </p><p>With this approach, China aims to enable a new level of general abundance with household robots, self-driving electric cars, self-directed delivery drones, and household appliances that can make decisions for themselves, such as a refrigerator that can detect when you&#8217;re running low on specific supplies and order them agentically. </p><p>And crucially, this future does <em>not</em> depend on reaching superintelligence. As I&#8217;ve detailed in my other article, <a href="https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world">Open Source LLMs Are Eating the World</a>, many economically relevant tasks operate in a task-saturation regime. That is, once the models exceed some threshold level of performance, future increases in model scale and training compute do not make meaningful differences in task-level performance. Moreover, models today are <em>already</em> capable of performing many economically viable tasks, such as coding complex apps, serving as customer support agents, and more. 
Hence, this broad deployment of cheap AI in physical goods will deliver returns quite quickly.</p><p>Making this intelligence cheap and abundant through energy- and compute-efficient open source AI will unlock massive economic value across the spectrum. There isn&#8217;t much doubt about this. The Jetsons future is clearly within reach. However, there <em>is</em> a question mark over whether we will reach the benevolent Skynet future.</p><h2>Consequences of the Divide</h2><p>US labs are betting on transcendent intelligence. Chinese labs are betting on abundant intelligence. Who wins depends on which future actually arrives.</p><p>Four scenarios are possible. Only one favors the American approach.</p><h3>The Scenario Matrix</h3><figure><img src="https://substackcdn.com/image/fetch/$s_!b0jh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F765ab45d-a2f6-41c7-945d-93828f81c4c1_1820x686.png" width="1456" height="549" alt=""></figure><p>From the above matrix, we see that the American bet requires threading a needle: superintelligence must be achievable, it must trigger a runaway intelligence explosion, <em>and</em> that explosion must translate into massive economic returns unconstrained by physical bottlenecks.</p><p>Remove any link in that chain, and the calculus shifts.</p><p><strong>If superintelligence arrives but can&#8217;t escape physical constraints:</strong> The leading lab pulls ahead on benchmarks, but drug discovery still bottlenecks at FDA trials. Robotics still bottlenecks at manufacturing. Most economically useful tasks don&#8217;t require superintelligence anyway. Open-source competitors deliver similar real-world value at a fraction of the cost.</p><p><strong>If superintelligence arrives but no takeoff occurs:</strong> Energy infrastructure takes years to build. Training runs take months. Even with a superintelligent AI optimizing your codebase, the results manifest slowly enough for competitors to close the gap. No insurmountable lead materializes.</p><p><strong>If superintelligence never arrives:</strong> Intelligence gains follow a sigmoid curve&#8212;rapid improvement, then diminishing returns. At that plateau, the race shifts from &#8220;who&#8217;s smartest&#8221; to &#8220;who&#8217;s cheapest and most distributed.&#8221; China wins that race.</p><h3>The Asymmetric Bet</h3><p>The Jetsons future requires no miracles. 
Cheap, capable AI embedded in robots, vehicles, and appliances delivers value whether or not superintelligence is possible. China&#8217;s bet pays off in three of four scenarios.</p><p>The Skynet future requires everything to go right. Superintelligence must be reachable, takeoff must occur, and physical constraints must not bind. America&#8217;s bet pays off in one scenario.</p><h3>The Implication</h3><p>We see from the scenarios enumerated above that the US AI lab approach is decisively dominant in only one of them. In all other scenarios, Chinese open-source AI is able to keep pace with the closed-source frontier, and, in doing so, it is guaranteed to bring about the Jetsons future that China is building towards. The American benevolent Skynet future is far from guaranteed. </p><p>In sum, to ensure that the United States broadly benefits from the AI revolution it has itself started, we need to take a page from the Chinese AI playbook. We must ensure that even if superintelligence is out of reach, we will have cheap, abundant intelligence suffused throughout the economy. We must ensure that our portable energy infrastructure (i.e., solar panels and batteries), our robotics manufacturing capabilities, and our open source AI efforts are sufficient to power a truly intelligent economy. The failure to do so may cede technological leadership in the 21st century to the CCP &amp; China. </p>]]></content:encoded></item><item><title><![CDATA[Open Source LLMs Are Eating the World]]></title><description><![CDATA[We are benchmarking LLMs incorrectly to predict economic utility]]></description><link>https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world</link><guid isPermaLink="false">https://www.chrishayduk.com/p/open-source-llms-are-eating-the-world</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Fri, 09 Jan 2026 20:15:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_XZh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57385725-8fc4-46fa-a04e-e3e08d16b487_1596x1136.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The default way we evaluate large language models is fundamentally misaligned with how they create economic value. We track frontier capabilities across broad benchmarks (such as ARC-AGI-2, FrontierMath, and SWE-Bench Verified) and implicitly assume that whoever leads on these metrics captures the most value. </p><p>However, this view assumes that what matters is the model&#8217;s maximum intelligence across a broad range of tasks &#8212; the &#8220;PhD intelligence for all&#8221; mantra repeated by the large labs. 
</p><p>For practical company building, this framing is wrong, and understanding why reveals a structural advantage for open source models that holds regardless of when (or whether) we achieve AGI.</p><h2>I. Introduction: The Benchmarking Problem</h2><p>The standard narrative goes something like this: general model value scales with general performance. A model that scores higher on a diverse battery of benchmarks is more valuable than one that scores lower, and the companies training the most capable models will capture the lion&#8217;s share of economic returns.</p><p>But this misses how value actually gets created in practice. Companies don&#8217;t build products that require uniformly excellent performance across all possible tasks. They build for specific use cases: contract analysis, customer support, code generation, medical documentation. Revenue comes from solving customer problems, and customer problems are specific. The &#8220;average&#8221; benchmark performance that frontier labs optimize for doesn&#8217;t map to any real product; it&#8217;s just an abstraction that obscures the actual economics.</p><p>For any given application, what matters is whether your model is good enough at <em>this particular thing</em>, not whether it can solve PhD-level mathematics problems or write publishable research. A legal tech startup needs strong performance on contract reasoning and citation accuracy. A customer support platform needs reliability on intent classification and tone. Neither benefits from improvements to the model&#8217;s ability to prove novel theorems.</p><p>Moreover, the relationship between capability and value is S-curved. Early capability improvements unlock entirely new use cases: a model that goes from 40% to 70% accuracy on a task might cross the threshold from &#8220;useless&#8221; to &#8220;useful with human oversight.&#8221; But a model that goes from 92% to 96% often delivers no additional value, because the human workflow was already designed around spot-checking outputs and the bottleneck has shifted elsewhere: to latency, cost, integration complexity, or user experience.</p><p>This is the crux of the argument: once a model clears the capability threshold for a given task, further intelligence improvements face rapidly diminishing returns. The contract analysis tool that&#8217;s &#8220;good enough&#8221; for lawyers to trust with first-pass review doesn&#8217;t become twice as valuable when the underlying model gets twice as capable. It just becomes overprovisioned.</p><h2>II. The Task Saturation Phenomenon</h2><p>For any specific task, <strong>the marginal value of model capability saturates at some threshold.</strong> Beyond a certain point, users cannot meaningfully distinguish between a model of size X and a model of size n&#215;X for any n &gt; 1.</p>
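<p>To make that saturation claim concrete, here is a minimal, purely illustrative sketch. The logistic value curve, the 70% threshold, and the steepness constant are assumptions chosen only to mirror the 40&#8594;70 and 92&#8594;96 examples above; they are not measurements.</p><pre><code>import math

# Illustrative only: fraction of a task's maximum economic value unlocked
# at a given accuracy, modeled as a logistic curve around a task threshold.
def task_value(accuracy, threshold=0.70, steepness=30):
    return 1 / (1 + math.exp(-steepness * (accuracy - threshold)))

for lo, hi in [(0.40, 0.70), (0.92, 0.96)]:
    gain = task_value(hi) - task_value(lo)
    print(f"{lo:.0%} to {hi:.0%}: marginal value unlocked = {gain:.2f}")

# Approximate output:
# 40% to 70%: marginal value unlocked = 0.50   (crosses the usefulness threshold)
# 92% to 96%: marginal value unlocked = 0.00   (the task is already saturated)
</code></pre>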
<p>Consider what&#8217;s happened to standard benchmarks over the past few years. The chart below tracks top scores on benchmarks like ARC, MMLU, Winograd, HellaSwag, GSM8K, and TruthfulQA against their human baselines:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!7ZLe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fea090f68-5a7b-4c59-a68c-2f74fffa13dc_1256x780.png" width="1256" height="780" alt=""></figure><p>The pattern is consistent: rapid improvement followed by convergence toward (and often slightly beyond) human-level performance. Once a benchmark is effectively &#8220;solved,&#8221; additional capability improvements deliver zero marginal value for the tasks that benchmark measures. A model scoring 95% on MMLU isn&#8217;t twice as useful for MMLU-adjacent tasks as one scoring 90%. For most practical purposes, they&#8217;re equivalent.</p><h2>III. Reframing the Analysis: Cost at Fixed Performance</h2><p>If capability saturates for specific tasks, then the relevant question isn&#8217;t &#8220;which model is most capable?&#8221; but rather &#8220;which model solves my task at the lowest cost?&#8221;</p><p>Once we fix a performance threshold (the point at which a task is effectively solved), we can track how the cost to achieve that threshold evolves over time. 
The a16z team did exactly <a href="https://a16z.com/llmflation-llm-inference-cost/">this analysis for MMLU scores</a>:</p><figure><img src="https://substackcdn.com/image/fetch/$s_!_XZh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57385725-8fc4-46fa-a04e-e3e08d16b487_1596x1136.png" width="1456" height="1036" alt=""></figure><p>The trend line shows roughly a 10&#215; cost reduction every year for a fixed capability level. But the more important pattern is <em>which models</em> sit on that cost frontier over time. Early in a capability tier&#8217;s lifecycle, closed-source models from frontier labs define the frontier. But within months, open-source alternatives emerge at dramatically lower price points.</p><p>Look at the progression for MMLU &gt; 83: GPT-4 at $45 per million tokens, then GPT-4o at ~$10, then Claude 3.5 Sonnet at ~$10, and finally Llama 3.1 70B pushing costs down toward $0.50. The same pattern plays out for every capability threshold: closed source models solve the task first, and then open source models quickly make it cheaper. </p>
<p>Thus, if we imagine a fixed benchmark score as a proxy for the threshold at which a task is &#8220;solved&#8221;, we see that closed source models have historically had a payoff horizon of roughly one year before open source models made the same capability available at a fraction of the cost.</p><h2>IV. Case Study: MMLU Pro Replication Speed</h2><p>MMLU Pro extends the original MMLU benchmark by increasing multiple choice options from 4 to 10, introducing misleading distractors, and emphasizing reasoning-heavy questions. It&#8217;s a harder benchmark, which allows us to separate out the performance levels of recently released models.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!mz2q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd676c533-3d7d-4f19-8f4b-a3e29c484839_1460x1534.png" width="1456" height="1530" alt=""><figcaption class="image-caption">Benchmark results available <a href="https://www.vals.ai/benchmarks/mmlu_pro">here</a></figcaption></figure><p>Consider the 83% performance threshold. That is, models which answered at least 83% of questions correctly:</p><ul><li><p><strong>OpenAI o1</strong> was the first model to reach this level and did so upon its release on December 5, 2024. Its API pricing was $15 per million input tokens and $60 per million output tokens. <strong>The total cost to run the benchmark was $75.</strong></p></li><li><p><strong>DeepSeek R1</strong> was the first open source model to reach this level when it launched on January 20, 2025, priced at roughly $1.485 per million input tokens and $5.94 per million output tokens. <strong>The total cost to run the benchmark was $6.75.</strong></p></li></ul><p>That&#8217;s an order of magnitude cost reduction in under two months for equivalent task performance. If we want to be generous and use the release date of o1-preview, this still results in a time horizon of only 4 months before DeepSeek matched its performance with an open source model costing an order of magnitude less.</p><p>To drive the point home further still, DeepSeek V3.2 came out on December 1, 2025 and again achieved the 83% performance threshold, but this time at a <strong>cost reduction of more than 30&#215;, approaching two orders of magnitude,</strong> when compared with OpenAI o1. Specifically, <strong>the total cost to run the benchmark was only $2.24.</strong></p><p>Thus, for a fixed level of performance, we see the price drop from $75 to $6.75 to $2.24 over the course of a single year. <strong>As a result, I argue that </strong><em><strong>any</strong></em><strong> task solved by a closed source model will see enterprise buyers transition to cheaper open source models within 6 months to one year.</strong></p>
<p>And there&#8217;s reason to expect this pace to accelerate. As Huawei and SMIC close the gap with NVIDIA and TSMC, and now that NVIDIA potentially regains the ability to sell H200 chips in China, the Chinese open-source labs will have access to better hardware while maintaining their cost structure advantages. We may be looking at only a couple of months between closed-source frontier releases and open-source replication with substantial cost reduction.</p><h2>V. The AGI-Agnostic Conclusion</h2><p>What I think makes this view most compelling is that it doesn&#8217;t depend on AGI being decades away.</p><p>The conventional case for open source often rests on an assumption that we&#8217;re approaching a capability plateau. That is, that base model improvements will slow down, shifting competition to fine-tuning, cost, and vertical specialization. This assumes that the vision of the future espoused by the US AI labs, predicated on artificial superintelligence (ASI) and runaway intelligence explosions, is wrong, while China&#8217;s view of commoditized intelligence is correct. That may well be true, but it&#8217;s a bet on a particular trajectory of AI progress.</p><p>The task saturation argument is stronger because it&#8217;s agnostic to the AGI timeline. When you&#8217;re building a company, you&#8217;re typically building for a specific use case. That means you&#8217;re operating in the saturation regime, not the model scale-up regime. Even if frontier models continue improving rapidly and the AI-2027 timeline plays out, the task your company is built around has a capability threshold beyond which additional model intelligence doesn&#8217;t matter.</p><p>And once you&#8217;re in the saturation regime, the only dimension of competition that matters is cost. Open source wins on cost, systematically and structurally, because open-source economics allow for lower margins and broader distribution.</p><p>The practical takeaway for company builders is this: bias toward open source, and do so for cost reasons rather than capability bets.</p><p>If you&#8217;re building an AI-native product, ask yourself: what capability threshold does my use case actually require? Chances are, that threshold is either already achieved by current open-source models or will be within 6-12 months of a closed-source model first reaching it. Build your infrastructure and workflows around the assumption that you&#8217;ll be running on open-source models, even if you start with closed-source APIs for speed to market.</p>
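<p>As a purely hypothetical illustration of that takeaway, the selection logic is just &#8220;cheapest model that clears the task threshold.&#8221; Every entry in the list below is made up for the sake of the example; plug in your own evaluation scores and prices.</p><pre><code># Hypothetical candidates: (name, accuracy on YOUR task eval, $ per million tokens).
# None of these numbers are real measurements; they only illustrate the selection rule.
candidates = [
    ("closed-frontier-model", 0.97, 15.00),
    ("open-large-model", 0.94, 1.20),
    ("open-small-model", 0.88, 0.20),
]

def pick_model(candidates, threshold):
    """Return the cheapest model whose task accuracy clears the threshold."""
    good_enough = [c for c in candidates if c[1] &gt;= threshold]
    return min(good_enough, key=lambda c: c[2]) if good_enough else None

print(pick_model(candidates, threshold=0.90))   # ('open-large-model', 0.94, 1.2)
print(pick_model(candidates, threshold=0.85))   # ('open-small-model', 0.88, 0.2)
</code></pre>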
<p>The benchmark that matters isn&#8217;t &#8220;which model is smartest.&#8221; It&#8217;s &#8220;which model solves my task cheaply enough.&#8221; And open source is destined to systematically win that competition through relentless cost deflation.</p>]]></content:encoded></item><item><title><![CDATA[It's Time For Google to Acquire Intel]]></title><description><![CDATA[Breaking the Nvidia-TSMC monopoly]]></description><link>https://www.chrishayduk.com/p/its-time-for-google-to-acquire-intel</link><guid isPermaLink="false">https://www.chrishayduk.com/p/its-time-for-google-to-acquire-intel</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Thu, 25 Sep 2025 13:02:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!POjm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!POjm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6f0c6e63-7ae3-404b-9c1a-fbbf8778b9d7_1024x1024.png" width="1024" height="1024" alt="Generated Image September 24, 2025 - 11:01PM.png"></figure><p>Nvidia made headlines this week when it announced it would invest up to $100&#8239;billion into OpenAI and help deploy at least 10&#8239;GW of AI infrastructure. The move, frequently memed as an &#8220;infinite money glitch,&#8221; with capital and revenue cycling between Nvidia and OpenAI (see the image below), effectively ensures a substantial fraction of Nvidia&#8217;s GPUs will land in OpenAI&#8209;aligned datacenters (via leasing or outright purchases). </p><p>This comes on the heels of OpenAI&#8217;s <strong>&gt;</strong>$300&#8239;billion &#8220;Stargate&#8221; build&#8209;out with Oracle, which targets <strong>~</strong>4.5&#8239;GW of capacity, further tightening the market for top&#8209;end accelerators.</p><p>And that&#8217;s before accounting for OpenAI&#8217;s ongoing expansion on Microsoft Azure, where the relationship now runs under a right&#8209;of&#8209;first&#8209;refusal model for new capacity rather than blanket exclusivity, still conferring practical priority on Azure deployments while allowing OpenAI to add capacity with other partners. 
</p><p><strong>Netting this out:</strong> through the end of the decade, OpenAI has assembled an envelope of roughly 10&#8211;15&#8239;GW of Nvidia&#8209;powered capacity across Oracle, Microsoft, and other partners, with overlap between these footprints; think of this as a shared umbrella rather than purely additive numbers. For context, independent analyses estimate <strong>~</strong>10&#8239;GW of additional AI data&#8209;center power could be needed globally in 2025 alone; in other words, OpenAI&#8217;s program is on the scale of a full year of incremental world AI build&#8209;out.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!kCHD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe695baec-b55a-495b-a17a-a1a31bc5ec51_1004x632.jpeg" width="1004" height="632" alt="Logos of OpenAI, NVIDIA, and Oracle arranged in a triangular pattern with arrows labeled &quot;$100 billion&quot; connecting them. Text at the top reads &quot;THE INFINITE MONEY GLITCH&quot; in red."><figcaption class="image-caption">Courtesy of SemiAnalysis&#8217;s <a href="https://x.com/dylan522p/status/1970346183827783756">Dylan Patel</a></figcaption></figure><p>The above data implies Nvidia GPU availability will tighten <em>substantially</em> for other frontier&#8209;model players&#8212;Anthropic (primarily on AWS), xAI, Meta, and Google DeepMind&#8212;raising effective prices and lead times and forcing harder choices about model cadence, context windows, and training tokens.</p><p>Google has been trying to break out of this Nvidia-dominated mold for years through the development of its own AI&#8209;specialized TPUs for training and inference. But these in-house designed chips still pass through chokepoints that Nvidia heavily influences, especially TSMC wafers and advanced packaging. By the end of 2025, analysts expect Nvidia to be <strong>~</strong>20%+ of TSMC revenue (second only to Apple), and the CoWoS&#8209;class packaging and HBM ecosystems remain binding constraints even as capacity expands. 
TSMC&#8217;s allocation is fundamentally contractual, driven by prepays and take&#8209;or&#8209;pay deals, and it will be reluctant to shift meaningful share away from Nvidia while demand remains red&#8209;hot.</p><p>To escape the straitjacket created by the Nvidia&#8209;OpenAI alignment, Google should buy Intel (or a substantial portion of it), fund the High&#8209;NA EUV ramp, and prepare to manufacture TPUs on Intel fabs as that capacity comes online. That gives Google end&#8209;to&#8209;end control of its AI training infrastructure&#8212;chip architectures, training software, chip manufacturing, and data center buildout&#8212;and a guaranteed runway independent of Nvidia&#8217;s queue. </p><p>Recent events make this even more urgent. Nvidia just disclosed a $5&#8239;billion Intel investment at $23.28/share (roughly 5% of Intel&#8217;s outstanding shares), alongside a product pact in which Intel will build x86 SoCs integrating Nvidia RTX GPU chiplets for PCs and collaborate on custom data&#8209;center CPUs&#8212;clear evidence that Intel&#8217;s roadmap can be steered by anchor customers. Intel is also now soliciting an Apple investment, according to Bloomberg/Reuters reporting.</p><p>Given the rapidly changing dynamics around Intel, Google must act quickly and decisively. For example, a $25&#8239;billion purchase at $35/share would buy on the order of ~714&#8239;million shares, implying ~16%&#8211;17% of Intel based on ~4.37&#8239;billion shares outstanding&#8212;placing Google ahead of both the U.S. government (~10%) and Nvidia (~4&#8211;5%) as the largest shareholder. That level of ownership could anchor governance and direct capex toward TPU&#8209;critical fabs and packaging lines.</p>
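<p>The share math is straightforward to verify. A minimal sketch, using only the per-share prices and the ~4.37&#8239;billion share count quoted above (dilution from any newly issued shares is ignored for simplicity):</p><pre><code># Back-of-the-envelope ownership math using the figures quoted above.
shares_outstanding = 4.37e9   # ~4.37 billion Intel shares

def stake(dollars, price_per_share):
    shares = dollars / price_per_share
    return shares, shares / shares_outstanding

google_shares, google_pct = stake(25e9, 35.00)    # proposed Google purchase
nvidia_shares, nvidia_pct = stake(5e9, 23.28)     # Nvidia's disclosed investment

print(f"Google: {google_shares / 1e6:.0f}M shares, {google_pct:.1%} of Intel")
print(f"Nvidia: {nvidia_shares / 1e6:.0f}M shares, {nvidia_pct:.1%} of Intel")

# Approximate output:
# Google: 714M shares, 16.3% of Intel
# Nvidia: 215M shares, 4.9% of Intel
</code></pre>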
<p>In practice, this looks like the following:</p><ol><li><p><strong>A minority stake + board influence</strong> sufficient to align Intel Foundry&#8217;s roadmap to TPU requirements</p></li><li><p><strong>A TPU-only supply compact</strong>: multi-year, take-or-pay wafer and advanced packaging commitments, with right-of-first-allocation during shortages and pricing bands tied to verifiable tool/packaging milestones. </p></li><li><p><strong>Parallel open&#8209;market TPU SKUs</strong> to keep utilization high and de&#8209;risk capex&#8212;turning Google&#8217;s silicon into a software&#8209;first, capacity&#8209;priced product.</p></li></ol><p>#3 is the longest shot, but perhaps the most enticing benefit of the investment. This would open up a second profit engine to fuel Google&#8217;s growth over the next decade, especially as its Search business comes under threat from AI-search competitors (such as OpenAI&#8217;s search-enabled offerings). In fact, Nvidia&#8217;s data&#8209;center business is now running at an annualized ~$160&#8239;billion revenue pace, which is comparable to Google&#8217;s Search cash cow. Thus, the addition of the TPU revenue line provides substantial growth opportunities and a potential hedge against Google&#8217;s eroding search moat.</p><p>If this plan works, Google gets scheduling certainty, lower $/token, a faster model cadence independent of Nvidia&#8217;s allocation calendar, and another revenue stream that could potentially reach the level of Google Search. If it stumbles, the downside is capped at a financial position that should still appreciate if Intel&#8217;s foundry inflects. Either way, for $25 billion, Google can buy its way out of the Nvidia-TSMC duopoly and into the driver&#8217;s seat of AI compute.</p>]]></content:encoded></item><item><title><![CDATA[The Strategic Implications of GPT-5 for OpenAI]]></title><description><![CDATA[OpenAI shifts away from the enterprise and toward the consumer]]></description><link>https://www.chrishayduk.com/p/the-strategic-implications-of-gpt</link><guid isPermaLink="false">https://www.chrishayduk.com/p/the-strategic-implications-of-gpt</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Fri, 08 Aug 2025 15:39:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!_Syy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e6943e8-d8d3-4e51-b3f2-e941d5d890db_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<figure><img src="https://substackcdn.com/image/fetch/$s_!_Syy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e6943e8-d8d3-4e51-b3f2-e941d5d890db_1536x1024.png" width="1456" height="971" alt=""><figcaption class="image-caption">Image courtesy of GPT-5</figcaption></figure><p>After years of anticipation and hype, GPT-5 is finally out. And the results are decidedly mixed. GPT-5 is undoubtedly a great model &#8212; it is #1 across the board on LMArena, sets new highs in SWE-Bench and a host of other coding tasks, and performs great across a range of math benchmarks. 
However, the expectations for GPT-5 were that it would blow the competition out of the water. Instead, it has made incremental improvements across all of these benchmarks, and is highly likely to be surpassed whenever Google releases its next Gemini model in short order (or just when Gemini 2.5 Deep Think gets benchmarked!). </p><figure><img src="https://substackcdn.com/image/fetch/$s_!FLx3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4abf679a-4d93-45c9-85a5-6435c84b0f6f_750x469.png" width="750" height="469" alt="OpenAI botches the charts in GPT-5 introduction &#8211; FlowingData"><figcaption class="image-caption">Note that Claude Opus 4.1 scored a 74.5% on SWE-bench just days earlier, so GPT-5 performance is virtually the same (also what in the world is going on with charts at OpenAI???)</figcaption></figure><p>The largest gains in GPT-5 came in a less performance-focused area: it seems that, for this release, OpenAI highly prioritized reducing hallucinations and sycophancy in model output.</p><figure><img src="https://substackcdn.com/image/fetch/$s_!dBF7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84b3dbd-444c-4eba-9bf5-0b69ffeb50e2_2048x1089.png" width="1456" height="774" alt="GPT-5 Benchmark Scores | ml-news &#8211; Weights &amp; Biases"></figure><p>So you may not notice large performance differences between GPT-5 and the leading models from other labs (or even when compared to OpenAI&#8217;s o3 model), 
but you likely will notice that the model is much less likely to make things up and say things that are flat out wrong just to produce an answer.</p><p>In addition, you will likely notice in the ChatGPT model picker that all of the previous models are gone: now there&#8217;s only GPT-5. This is another one of GPT-5&#8217;s main contributions &#8212; it greatly simplifies the model selection process. GPT-5 is more of a system than a model, dynamically routing requests to faster LLMs (analogous to GPT 4o) or slower, thinking LLMs (analogous to o3) depending on the complexity of the request.</p><p>(The two above points are important; we&#8217;ll come back to those later).</p><p>Likely in response to some widespread dismay at the performance benchmarks, Sam Altman tweeted the following after the GPT-5 announcement:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!b1QM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!b1QM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 424w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 848w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!b1QM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:464257,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/170448919?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!b1QM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 424w, 
https://substackcdn.com/image/fetch/$s_!b1QM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 848w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1272w, https://substackcdn.com/image/fetch/$s_!b1QM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F05ed997d-58c9-4adb-bb36-36bdc952c7a2_2160x2160.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>&#8220;we can release much, much smarter models&#8221;</strong></p><p>It seems Altman is asserting that OpenAI deliberately <em>chose</em> to release a model below the company&#8217;s capabilities, barely edging out its competitors (and likely not even edging out Google&#8217;s leading model) on most performance metrics. Instead, they deliberately <em>chose </em>to focus on reducing hallucinations and streamlining model selection as the main contributions of GPT-5. Why would OpenAI do this? Why not continue setting the benchmark for LLM model performance, as they&#8217;ve done since the ye olde days of GPT-2?</p><p>Because the strategic focus of the company has clearly shifted.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.chrishayduk.com/subscribe?"><span>Subscribe now</span></a></p><h1>Market Dynamics Affecting OpenAI</h1><p>ChatGPT is a consumer application with <a href="https://www.techloy.com/chatgpt-is-on-track-to-reach-700-million-weekly-active-users/">700 million weekly active users</a>. And it is absolutely trouncing the competition in consumer adoption. 
The ChatGPT app in the Apple App Store has 3.3 million reviews, compared to just 377,000 for the Gemini app and 23,000 for the Claude app. This suggests that ChatGPT has a mobile install base roughly 10x the size of Google Gemini and well over 100x the size of Anthropic&#8217;s Claude. Moreover, in June 2025, openai.com had 1.12 billion visits, while gemini.google.com had 265 million and Claude had 113 million &#8212; roughly a 4x lead over Gemini and a 10x lead over Claude in web traffic, again pointing to OpenAI&#8217;s dominant position in the consumer chat space (source: <a href="https://www.semrush.com/">https://www.semrush.com/</a>).</p><p>By contrast, according to <a href="https://menlovc.com/perspective/2025-mid-year-llm-market-update/">a Menlo Ventures report</a> from July 2025, Anthropic is actually the market leader in enterprise LLM API usage, with 32% market share vs. OpenAI&#8217;s 25% in mid-2025. Google is also growing and not far behind OpenAI, at 20% market share, up from 12% in 2024. OpenAI&#8217;s enterprise position has also been eroding sharply, cratering from 50% market share in 2023 down to 25% by mid-2025.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ruow!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ruow!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 424w, https://substackcdn.com/image/fetch/$s_!ruow!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 848w, https://substackcdn.com/image/fetch/$s_!ruow!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 1272w, https://substackcdn.com/image/fetch/$s_!ruow!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ruow!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp" width="1456" height="751" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:751,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Enterprise LLM API market share by usage&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Enterprise LLM API market share by usage" title="Enterprise LLM API market share
by usage" srcset="https://substackcdn.com/image/fetch/$s_!ruow!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 424w, https://substackcdn.com/image/fetch/$s_!ruow!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 848w, https://substackcdn.com/image/fetch/$s_!ruow!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 1272w, https://substackcdn.com/image/fetch/$s_!ruow!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39abed8f-1c2d-409f-ba6e-d5933542aea7_2560x1320.webp 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>So we see the market dynamics pressing on OpenAI as a company &#8212; absolutely dominant positioning on the consumer side of the market, with a weak (and steadily weakening) position on the enterprise side of the market. This leaves OpenAI with a choice &#8212; double down on its success on the consumer side of the market, or attempt to win in the highly competitive enterprise space. This choice mainly comes down to where the company has a moat that can result in durable profit margins.</p><h1>How Market Dynamics Affect the Models</h1><p>First, let&#8217;s explore the dynamics of the consumer market. Consumers, by and large, make buying decisions not based on performance or objective metrics, but instead based on &#8220;vibes&#8221;. 
You can improve the vibes for a consumer by improving brand positioning (i.e., make the consumer feel a certain emotion from using your product) or by improving the user experience (UX) and user interface (UI) of your product (i.e., make the product more enjoyable for the user).</p><p>OpenAI&#8217;s lead in consumer usage stems primarily from precisely these areas, with its extremely strong branding and UI/UX improvements in its chat interface versus the competition. OpenAI had a strong, multi-year lead due to its first mover advantage, providing its &#8220;ChatGPT&#8221; brand with significant mindshare in the consumer base. In addition, since the release of the original ChatGPT, OpenAI has focused strongly on the web and mobile chat experience. With features like memory and ChatGPT Projects, OpenAI has introduced a high level of personalization for users of the app, thereby creating a high switching cost moat &#8212; if you switch to Claude or Gemini, you can&#8217;t take ChatGPT&#8217;s memories or projects with you. This instantly makes the competing consumer apps less appealing to users in the same way that users of Spotify are reluctant to shift over to Apple Music once they have built up a library of playlists that they enjoy.</p><p>Hence, to improve consumer market share, OpenAI will need to continually pull the two levers of brand positioning and UI/UX improvements. Models can&#8217;t really improve brand positioning much, as that is more a function of marketing, so the main pressure on the model side of the equation will come from the UI/UX push. This pressure results in making models that are <strong>simpler and more enjoyable to use.</strong></p><p>Now we&#8217;ll shift our view to the enterprise market. Businesses, unlike consumers, strictly focus on return on investment when allocating capital expenditures. These ROI calculations will essentially have four inputs when it comes to LLMs: </p><ol><li><p>The API cost per million tokens</p></li><li><p>The number of tokens needed to solve a task</p></li><li><p>The value of that task</p></li><li><p>The performance of the LLM on that task</p></li></ol><p>We can then model the ROI of using an LLM as follows:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\\text{value generated} &amp;= (\\text{value of task}) \\times (\\text{performance of LLM on that task}) \\\\\n\\text{cost} &amp;= (\\text{API cost per million tokens}) \\times (\\text{number of tokens needed to solve the task}) \\\\\n\\text{net profit} &amp;= \\text{value generated} - \\text{cost} \\\\\n\\text{ROI} &amp;= \\frac{\\text{net profit}}{\\text{cost}}\n\\end{aligned}&quot;,&quot;id&quot;:&quot;AVQSRYKDYM&quot;}" data-component-name="LatexBlockToDOM"></div><p>So, from the above, we can see that the only levers that LLM providers can pull to improve the ROI calculations for a company are:</p><ol><li><p>Decrease the cost per million tokens for the API</p></li><li><p>Decrease the number of tokens needed to solve the task</p></li><li><p>Increase LLM performance on the task</p></li></ol><p>Given the pressure from open source contributions (e.g., DeepSeek, Kimi, and Qwen), closed source model providers will never be able to compete on #1. #2 also runs counter to the current scaling of AI models &#8212; to increase test-time compute (and thus make the LLM useful for more difficult tasks), we by definition have to increase the number of tokens used. 
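</p><p>To make the four-input ROI model above concrete, here is a minimal numerical sketch in Python; every figure in it is hypothetical and chosen purely for illustration, not real API pricing or benchmark data:</p><pre><code># Hypothetical, illustrative numbers only -- not real pricing or benchmark results.
def llm_roi(task_value_usd, task_success_rate, price_per_million_tokens_usd, tokens_needed):
    """ROI for one task, following the four-input model described above."""
    value_generated = task_value_usd * task_success_rate            # value of task x LLM performance
    cost = price_per_million_tokens_usd * (tokens_needed / 1_000_000)
    net_profit = value_generated - cost
    return net_profit, net_profit / cost

# Example: a task worth $5, solved correctly 70% of the time,
# at $10 per million tokens, using 20,000 tokens per attempt.
profit, roi = llm_roi(5.00, 0.70, 10.00, 20_000)
print(f"net profit per task: ${profit:.2f}, ROI: {roi:.1f}")
</code></pre><p>Cheaper tokens, fewer tokens, or a higher success rate each raise that ratio, which is exactly the set of levers listed above.</p><p>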
Hence, LLM providers competing in the enterprise have started to converge on #3 &#8212; improving LLM performance on the given task.</p><p>Now, <strong>there are two additional levers that an LLM provider can pull to improve the performance of the model on a specific task:</strong></p><ol><li><p><strong>Make the model smarter overall</strong></p></li><li><p><strong>Customize the model for that task</strong></p></li></ol><p>Broadly, Google has taken the first approach, with Gemini models consistently leading the pack in intelligence (particularly the new Gemini 2.5 Pro Deep Think model). OpenAI would struggle mightily to compete along this dimension because Google has such massive advantages in terms of scale &#8212; it has access to ridiculous amounts of compute and has indexed virtually all of the world&#8217;s data. Having a lead in algorithms is not a durable moat due to the speed of diffusion of inventions in Silicon Valley, and since model performance is a function of algorithms, data, and compute, Google will maintain a decisive lead here.</p><p>Meanwhile, Anthropic has taken the second approach, specializing its models for code using targeted reinforcement learning and building the Claude Code agentic harness. This is the lowest-hanging fruit for specialized models, given that this is the domain in which today&#8217;s LLMs perform best. Since Anthropic already has a large lead here, OpenAI is left with two choices &#8212; find a less obvious niche for which it can customize its models, or compete directly with Anthropic in the coding space, where Anthropic already holds a large advantage.</p><p>From the above analysis, we can see that OpenAI has a large lead in the consumer market with durable moats, and that to strengthen those moats, OpenAI would need to improve the UI/UX of its models by making them simpler and more enjoyable to use. By contrast, to compete in the enterprise market, OpenAI would need to either produce the smartest model (where it is at a disadvantage compared to Google) or start customizing its models for targeted use cases (where it is at a disadvantage compared to Anthropic in the most obvious market of coding agents). </p><h1>Conclusion - GPT-5 as AI for the Common Man</h1><p>Now let&#8217;s wrap up this argument.</p><p>We have already seen that OpenAI has a large and commanding lead in the consumer market, with a low and shrinking market share in the enterprise market. We have now shown that it has solid, defensible moats in the consumer market and that it is at a strong technical disadvantage in the enterprise market. We have also established that prioritizing consumers means improving model UI/UX, while prioritizing enterprise means improving model performance and specialization. Lastly, from the opening paragraphs, we have established that OpenAI deliberately did not make the highest-performing model possible. </p><p>Instead, they prioritized reducing hallucinations and streamlining the model selection process in ChatGPT.
Both of these changes significantly improve the consumer experience, as confabulations can destroy consumer trust and erode brand advantages, while the old model picker with nearly 10 different models intimidated new users and caused high cognitive load when using the app.</p><p>As such, the logical conclusion is that <strong>OpenAI has chosen to prioritize consumers</strong> <strong>over enterprise, and GPT-5 is the result of this</strong>.</p><p>Hence, over the coming years, don&#8217;t expect OpenAI to consistently lead in model performance as they have over the past 3 years. Instead, look for continuing improvements in the usage experience of ChatGPT. If you want to find the best models overall or the best coding models, you&#8217;ll probably need to look to Google and Anthropic, respectively.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Musings by Chris Hayduk is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Gemini 2.5 Pro: How Data + Compute Moats Beat Algorithmic Tweaks]]></title><description><![CDATA[Gemini 2.5 Pro and Google's path to AI supremacy]]></description><link>https://www.chrishayduk.com/p/google-takes-the-lead-in-the-ai-race</link><guid isPermaLink="false">https://www.chrishayduk.com/p/google-takes-the-lead-in-the-ai-race</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Mon, 14 Apr 2025 21:08:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!6Mpj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The race towards Artificial General Intelligence (AGI) and state-of-the-art AI models is often framed around breakthrough algorithms and novel architectures. However, a deeper analysis reveals that the true drivers of durable leadership lie elsewhere. While algorithmic innovation is crucial, the path to AI supremacy is increasingly paved with massive datasets and unparalleled computational power. 
When viewed through this lens, Google DeepMind emerges not just as a competitor, but as the likely frontrunner.</p><h3>The Trifecta of AI Progress: Algorithms, Compute, and Data</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6Mpj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6Mpj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6Mpj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2138408,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/161334401?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6Mpj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!6Mpj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F67bced44-f871-4628-8285-a1ce227b72cc_1536x1024.png 1456w" sizes="100vw" 
fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Training large-scale AI models hinges on three interdependent pillars:</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><ol><li><p><strong>Algorithms:</strong> These are the recipes, the architectures (like Transformers, Mixture-of-Experts), and the training methodologies (loss functions, optimization techniques) that dictate how effectively models learn patterns and relationships from data. Efficient algorithms extract more "knowledge" per unit of data and compute.</p></li><li><p><strong>Compute:</strong> This represents the raw processing power, typically measured in FLOPs (Floating Point Operations Per Second), required to execute the vast number of calculations involved in training deep neural networks. It's the energy input transforming potential into a trained artifact.</p></li><li><p><strong>Data:</strong> This is the raw material &#8211; the text, images, code, audio, video, and other modalities &#8211; from which the model learns the structure of the world, language, and reasoning. The quality, quantity, and diversity of data fundamentally shape the model's capabilities.</p></li></ol><p>These factors exhibit strong interplay. An algorithmic leap, like the transition from RNNs/LSTMs to Transformers for sequence modeling, unlocked the potential to effectively utilize vastly larger datasets and compute budgets. 
Before Transformers, training on web-scale text data with massive parameter counts often hit diminishing returns due to limitations in handling long-range dependencies and parallelization. The Transformer architecture, with its self-attention mechanism, was significantly more scalable, allowing marginal increases in data and compute to translate into tangible performance gains once more. The performance wasn't just better; the <em>scaling properties</em> improved.</p><h3>The Illusion of Algorithmic Moats</h3><p>Recent history is replete with examples emphasizing algorithmic prowess. The excitement around models like DeepSeek-R1, achieving remarkable performance with comparatively modest training resources, underscores the power of efficient architectures (like Mixture-of-Experts) and optimized training strategies. It proves that clever algorithms <em>can</em> significantly improve the compute/data-to-performance ratio.</p><p>However, as I argued previously in <em><a href="https://www.chrishayduk.com/p/on-algorithmic-moats-and-the-path">On Algorithmic Moats and the Path to AGI</a></em>, algorithms alone do not constitute a sustainable competitive advantage in the current AI landscape. Why?</p><ol><li><p><strong>Talent Mobility:</strong> The AI research community is fluid. Top researchers frequently move between major labs like Google DeepMind, OpenAI, Anthropic, and Meta, carrying conceptual knowledge and insights about successful (and unsuccessful) architectural experiments and training techniques. While NDAs exist, the fundamental <em>ideas</em> diffuse rapidly.</p></li><li><p><strong>Open Source and Publication:</strong> Key players like Meta (LLaMA series) and innovative teams like DeepSeek often open-source their models and research. Academic institutions and arXiv ensure rapid dissemination of novel techniques. This accelerates the entire field but levels the playing field algorithmically. A breakthrough published today can be replicated and built upon by competitors within months, if not weeks.</p></li></ol><p>Therefore, relying solely on being the <em>first</em> to discover the next architectural tweak is a fragile strategy. Being a fast-follower, capable of rapidly implementing and scaling proven algorithmic advances discovered elsewhere, might be just as effective, <em>provided</em> you possess advantages in the other two factors.</p><h3>The Real Moats: Data and Compute Scale</h3><p>If algorithms are becoming increasingly commoditized, what provides a durable edge? The answer lies in the factors that are far harder to replicate: <strong>data and compute.</strong></p><p><strong>Why Scale Matters:</strong> The principle of scaling laws in deep learning empirically demonstrates that model performance often improves predictably, following a power law, as model size, dataset size, and training compute increase. While we've seen impressive results from smaller, efficient models, we are likely still far from the point of diminishing returns for many complex reasoning and multimodal tasks. 
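</p><p>One common way to express this power-law behavior is the parametric loss form popularized by the Chinchilla scaling-law work, in which predicted loss looks like E + A/N^&#945; + B/D^&#946; for a model with N parameters trained on D tokens. The sketch below uses constants of roughly that magnitude purely for illustration; they are not fitted values for any real model family:</p><pre><code># Illustrative power-law scaling curve (Chinchilla-style parametric form).
# The constants below are rough, made-up values for illustration only.
def scaling_law_loss(n_params, n_tokens, e=1.7, a=400.0, b=410.0, alpha=0.34, beta=0.28):
    """Predicted loss = E + A / N**alpha + B / D**beta (hypothetical constants)."""
    return e + a / n_params**alpha + b / n_tokens**beta

for n, d in [(1e9, 2e10), (1e10, 2e11), (1e11, 2e12)]:
    print(f"N={n:.0e}, D={d:.0e}: predicted loss {scaling_law_loss(n, d):.2f}")
# Loss keeps falling as parameters and training tokens scale up together,
# which is the empirical pattern the scaling-law literature describes.
</code></pre><p>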
Reaching the next plateau of AI capability will almost certainly require scaling data and compute far beyond current levels.</p><p><strong>Why They Are Moats:</strong></p><ol><li><p><strong>Non-Portability:</strong> Unlike algorithmic knowledge, engineers cannot easily take petabytes of proprietary, curated internal data or access to tens of thousands of specialized accelerators (like TPUs or GPUs) with them when they change jobs.</p></li><li><p><strong>High Barrier to Entry:</strong> Building world-class compute infrastructure (data centers, custom silicon, high-speed interconnects) and accumulating diverse, high-quality datasets at the scale required represents billions of dollars in capital expenditure and years, often decades, of cumulative effort and investment. This is not something startups or even well-funded competitors can easily replicate overnight.</p></li><li><p><strong>Synergistic Flywheels:</strong> Access to vast compute allows for more ambitious experiments and training larger models. These improved models, when deployed, can generate new, valuable interaction data, which feeds back into further model improvements, creating a virtuous cycle that is difficult for competitors with lesser resources to match.</p></li></ol><p><strong>Gemini 2.5 Pro: A Glimpse of the Advantage</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kvQw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kvQw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kvQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg" width="1456" height="1343" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1343,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;r/singularity - Gemini 2.5 Pro benchmarks 
released&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="r/singularity - Gemini 2.5 Pro benchmarks released" title="r/singularity - Gemini 2.5 Pro benchmarks released" srcset="https://substackcdn.com/image/fetch/$s_!kvQw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kvQw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbc4a2e5a-b1ad-497c-8f16-5702bb3c315b_1466x1352.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Gemini 2.5 Pro Experimental, recently release by Google, offers a glimpse into how these interacting factors of <strong>data </strong>and <strong>compute</strong> will lead to a durable advantage in Google&#8217;s AI model performance. 
Despite OpenAI and DeepSeek releasing highly performant thinking models months in advance of Google (representing a large lead in algorithmic innovations), Gemini 2.5 Pro has managed to score #1 across the board in Chatbot Arena and across a wide range of benchmarks.</p><p>While Google describes Gemini 2.5 Pro partly through algorithmic concepts like "thinking models," the sheer breadth and depth of its capabilities, validated by both benchmarks and human preference, strongly suggest that these algorithms are being scaled and refined using computational resources and data diversity that few, if any, competitors can match. The "significantly enhanced base model" (as described by Google) is almost certainly a product of larger parameter counts trained for longer durations on more diverse data, enabled by Google's vertical integration of hardware (TPUs) and software within their hyper-scale data centers.</p><h3>Google's Unassailable Advantage</h3><p>This brings us to Google. When assessing data and compute advantages, Google stands in a league of its own.</p><p><strong>1. Data Dominance:</strong></p><ul><li><p><strong>Breadth and Modality:</strong> Google possesses arguably the most diverse and extensive collection of multimodal data on the planet. Consider the sources:</p><ul><li><p><strong>Google Search:</strong> Billions of daily queries provide unparalleled insight into human intent, language variation, and real-time information needs (text, images, implicit semantics).</p></li><li><p><strong>YouTube:</strong> The world's largest video platform offers vast amounts of video, audio, transcripts, comments, and multilingual content &#8211; crucial for multimodal understanding.</p></li><li><p><strong>Android:</strong> Interaction data from billions of devices provides insights into user behavior, application usage, and sensor inputs (potentially anonymized and aggregated).</p></li><li><p><strong>Google Maps:</strong> Geospatial data, satellite imagery, Street View imagery, reviews, and real-time traffic information.</p></li><li><p><strong>Gmail, Docs, Workspace:</strong> While respecting user privacy is paramount, Google potentially has access (for internal R&amp;D, aggregated/anonymized analysis, or opt-in features) to colossal amounts of text, code, and collaborative data reflecting professional and personal communication patterns.</p></li><li><p><strong>Google Books:</strong> A massive corpus of digitized text spanning centuries.</p></li><li><p><strong>Chrome:</strong> Web interaction data (aggregated and anonymized) reflecting how users navigate and consume information online.</p></li></ul></li><li><p><strong>Scale and Freshness:</strong> The sheer volume is staggering, but equally important is the constant influx of <em>new</em> data, keeping datasets fresh and reflecting current events, language evolution, and emerging trends. This continuous stream is vital for maintaining model relevance and accuracy.</p></li></ul><p><strong>2. Compute Superiority:</strong></p><ul><li><p><strong>Custom Silicon (TPUs):</strong> Google made a strategic bet on custom AI accelerators years ago with its Tensor Processing Units (TPUs). Now in their 7th generation, TPUs are designed specifically for large-scale ML training and inference, offering potentially significant advantages in performance-per-watt and performance-per-dollar <em>for Google's specific workloads and scale</em> compared to general-purpose GPUs. 
This vertical integration allows hardware and software co-design for optimal efficiency.</p></li><li><p><strong>Infrastructure Mastery:</strong> Google operates some of the world's most sophisticated and efficient data centers. Decades of experience in distributed systems (MapReduce, Borg/Kubernetes, Spanner) translate into an unparalleled ability to orchestrate and execute massively parallel training jobs reliably and efficiently across thousands of accelerators. This isn't just about owning chips; it's about the networking fabric, power delivery, cooling, and system software that make large-scale training feasible.</p></li><li><p><strong>Capital Investment:</strong> Google has the financial resources to sustain and expand this infrastructure lead, continuously investing billions in data centers and next-generation TPUs.</p></li></ul><h3>Conclusion: The Inevitable Frontrunner?</h3><p>While the AI race is far from over, and competitors like OpenAI and Anthropic continue to innovate, the fundamental dynamics favor players with entrenched advantages in data and compute. Algorithmic breakthroughs will continue to happen across the ecosystem, but they diffuse quickly. The ability to <em>scale</em> these algorithms using proprietary data and custom-built, hyper-scale infrastructure is the real differentiator.</p><p>Google's unparalleled data ecosystem, harvested across its diverse product portfolio, combined with its long-term investment in custom TPUs and mastery of planetary-scale computing, creates a formidable moat. Gemini 2.5 Pro is likely just an early indicator of what this integrated advantage can produce. As the demands for data and compute continue to escalate on the path to more capable AI, Google's lead in these foundational resources positions it strongly to outpace the competition and ultimately define the next era of artificial intelligence.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Foundation Model Trap]]></title><description><![CDATA[Why AI Model Companies Are More Like Airlines than Like Cereal Companies]]></description><link>https://www.chrishayduk.com/p/the-foundation-model-trap</link><guid isPermaLink="false">https://www.chrishayduk.com/p/the-foundation-model-trap</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 05 Mar 2025 20:14:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!JK_8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JK_8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JK_8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JK_8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg" width="1024" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:214162,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/158463759?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JK_8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!JK_8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F240b429a-11df-478c-be6d-88ea2e904cc3_1024x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Harvey Sawikin recently wrote a great article analyzing the AI industry through a very Munger-like lens: will AI turn out more like the cereal industry (where there are many competitors with very healthy profit margins) or more like the airline industry (where competition compresses profit margins to near 0). </p><p>This idea has major implications for the major AI labs training foundation models today, such as OpenAI, Anthropic, and xAI. In this article I'll attempt to flesh out my understanding of this cereal vs. airline distinction and discuss why the airline scenario is more likely for the foundation model providers.</p><p>Before we dive in, you can find Harvey&#8217;s article below. 
I highly recommend giving it a read before continuing here.</p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:156375904,&quot;url&quot;:&quot;https://harveysawikin.substack.com/p/ai-companies-cereals-or-airlines&quot;,&quot;publication_id&quot;:2339794,&quot;publication_name&quot;:&quot;Harvey&#8217;s Substack&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbce7176-1f83-43cd-8bbf-67a1bbfb334f_429x429.png&quot;,&quot;title&quot;:&quot;AI Companies: Cereals or Airlines?&quot;,&quot;truncated_body_text&quot;:&quot;In the post The Munger Games, inspired by my first-ever attendance at the Berkshire annual meeting and purchase and reading of Poor Charlie&#8217;s Almanack, I promised more commentary on Charlie Munger&#8217;s book once I&#8217;d reflected on it. One insight that stuck with me has come to the fore lately as I&#8217;ve tried to get my head around AI &#8211; an effort that isn&#8217;t theo&#8230;&quot;,&quot;date&quot;:&quot;2025-02-03T13:46:19.862Z&quot;,&quot;like_count&quot;:6,&quot;comment_count&quot;:3,&quot;bylines&quot;:[{&quot;id&quot;:32105441,&quot;name&quot;:&quot;Harvey Sawikin&quot;,&quot;handle&quot;:&quot;harveysawikin&quot;,&quot;previous_name&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65baf26e-5c5c-4a55-9377-22234d39af84_429x600.jpeg&quot;,&quot;bio&quot;:&quot;Fund manager, ex-lawyer, ex-novelist, art collector, writing about investing, marriage, culture, and random topics.&quot;,&quot;profile_set_up_at&quot;:&quot;2023-02-04T11:25:24.829Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:2360843,&quot;user_id&quot;:32105441,&quot;publication_id&quot;:2339794,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:false,&quot;publication&quot;:{&quot;id&quot;:2339794,&quot;name&quot;:&quot;Harvey&#8217;s Substack&quot;,&quot;subdomain&quot;:&quot;harveysawikin&quot;,&quot;custom_domain&quot;:null,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Welcome to my Substack, where I will write about investing (especially value investing), marriage and family, culture, and random topics. All subscription money will be donated to The Human Fund (sorry, that's for Seinfeld fans). 
Really, to a good cause.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bbce7176-1f83-43cd-8bbf-67a1bbfb334f_429x429.png&quot;,&quot;author_id&quot;:32105441,&quot;theme_var_background_pop&quot;:&quot;#D10000&quot;,&quot;created_at&quot;:&quot;2024-02-10T21:01:10.289Z&quot;,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Harvey Sawikin&quot;,&quot;founding_plan_name&quot;:null,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;enabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;is_personal_mode&quot;:false}}],&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:null}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:false,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://harveysawikin.substack.com/p/ai-companies-cereals-or-airlines?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!UkOl!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbbce7176-1f83-43cd-8bbf-67a1bbfb334f_429x429.png"><span class="embedded-post-publication-name">Harvey&#8217;s Substack</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">AI Companies: Cereals or Airlines?</div></div><div class="embedded-post-body">In the post The Munger Games, inspired by my first-ever attendance at the Berkshire annual meeting and purchase and reading of Poor Charlie&#8217;s Almanack, I promised more commentary on Charlie Munger&#8217;s book once I&#8217;d reflected on it. One insight that stuck with me has come to the fore lately as I&#8217;ve tried to get my head around AI &#8211; an effort that isn&#8217;t theo&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">a year ago &#183; 6 likes &#183; 3 comments &#183; Harvey Sawikin</div></a></div><p>Okay, so let&#8217;s start with differentiating <em>why</em> cereals allow for competition with healthy profit margins, whereas airlines are a rough business for all involved.</p><p>Cereals have different flavors, so consumer preferences for a certain flavor can cause some degree of demand inelasticity. From a firm perspective, rather than chasing the flavor and profit margins of another firm's cereal, the more profitable long-term strategy is to specialize in a different flavor and reap your own healthy profit margins.</p><p>By contrast, the main service that airlines provide is transporting you from Point A to Point B. There isn't really an "experience" to speak of that differentiates airlines from one another (particularly for non-business class flyers), so the calculus for a consumer then comes down to only two factors: speed and cost. Speed can be achieved through two means: faster planes (which hasn't happened in decades) and more direct flights. Airlines are incentivized to provide direct flights between major cities/transit hubs because, if they did not, then any travelers going between major hubs (say, NYC and London) would choose the other airlines which did have those direct flights. 
Thus, we can assume that most major airlines will have direct flights between most major transit hubs/cities within a certain distance of each other. Hence, for any two airlines that have a direct flight between a fixed pair of cities, the <em>only</em> way to compete is on price, since this will be the <em>only</em> criterion differentiating the airlines for consumers. As a result, a small difference in price will lead to nearly all consumers choosing the cheaper option. This inherently <em>must</em> drive down profit margins as airlines seek to charge the lowest possible price while still maintaining profitability.</p><p>Translating this argument to AI, we see two potential paths forward:</p><ol><li><p><strong>Cereal Mode:</strong> We know that the data input to a model during its training process basically determines its behavior on the other end - what it acts like, what tasks it's good at, etc. Access to different types of data may thus give rise to different "flavors" of AI models, providing varying skill profiles and personalities. In this scenario, we could imagine that OpenAI provides the best chat experience (due to its large dataset of user chats), while Grok might provide the best news aggregation and summarization (due to its up-to-the-second Twitter data). This may provide enough distinction to allow each AI company to charge healthy profit margins on their respective foundation models.</p></li><li><p><strong>Airline Mode:</strong> In this case, maybe the data on the margins provided by chat interactions, Twitter, etc. doesn't move the needle much in terms of model behavior and capabilities. Perhaps the web-scale pretraining data drowns out the idiosyncrasies across each AI lab's datasets, leaving each lab's state-of-the-art AI models performing roughly identically. Here, the only way to compete would be on API pricing, with consumers rapidly moving to the cheapest option available that can perform the given task.</p></li></ol><p>Based on trends of the last few months, I think Airline Mode is looking more and more likely. The <a href="https://lmarena.ai/leaderboard">Chatbot Arena leaderboard</a> shows that all of the leading models from the main labs perform roughly similarly to each other (Grok 3 and GPT-4.5 are even currently within 1 Elo point of each other as of this writing!). And DeepSeek was able to reproduce OpenAI o1 in the span of a couple months (R1's Elo is actually 11 points higher than o1's). We're seeing <em>more</em> convergence between models over the last couple of years, not less. </p><p>Given that, unless a lab gets to AGI and the idea of recursive self-improvement leading to a permanent advantage turns out to be true, I don't see how foundation model training can provide durable, healthy profit margins without a significant change in business model for these companies.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding DeepSeek Part II: DeepSeek-V2]]></title><description><![CDATA[Compressing the key-value matrix]]></description><link>https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseek</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseek</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 05 Mar 2025 13:31:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bv9y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Summary</h1><p>DeepSeek-V2, released in June 2024, built off the success of DeepSeek's previous papers to set a new standard for training and inference efficiency. The core changes made to DeepSeek-V2 that set it apart from prior open source models occur in two core components of the transformer architecture: the attention block and the feed-forward network (see below image).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bv9y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bv9y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 424w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 848w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 1272w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bv9y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png" width="1043" height="890" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:890,&quot;width&quot;:1043,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:179138,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/158419950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!bv9y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 424w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 848w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 1272w, https://substackcdn.com/image/fetch/$s_!bv9y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe7327a5e-c128-42a0-8f23-04b24d8b8fc1_1043x890.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The two key changes can be summarized as follows:</p><p>1.  <strong>Feed-Forward Network Optimization:</strong> DeepSeekMoE architecture</p><blockquote><p>Mixture of experts (MoE) layers are a drop-in replacement for the feed-forward layer in the standard transformer architecture. 
Prior to DeepSeekMoE, most MoE architectures functioned by splitting the feed-forward layer into several large feed-forward layers. Each input token would then "choose" 1 or 2 of these parallel feed-forward layers, also known as "experts", for its own computation. This architecture had one key problem - namely, each expert needed to learn large amounts of redundant information, since processing any token on any topic requires understanding of grammar, semantics, etc. DeepSeek solved this redundancy problem, thereby greatly increasing the learning efficiency of the MoE architecture, through three key innovations. These included: more numerous, finer-grained experts; separating experts into shared and routed experts; and load balancing tokens across experts and devices. For more details on these innovations, see the previous blog post in the series.</p></blockquote><p>2. <strong>Attention Layer Optimization:</strong> Multi-head Latent Attention (MLA) </p><blockquote><p>Multi-head attention, described in detail in my other post, utilizes three matrices to produce new representations of the input tokens: the Query, Key, and Value matrices. Each of these matrices has dimension n x d, where n is the maximum length of the sequence and d is the dimension of the vector representing each token in the sequence. Standard transformers cache the Key and Value matrices for every layer fully in-memory at inference time, improving speed but resulting in large memory overhead. DeepSeek-V2's solution is to compress the Key and Value matrices at each layer into a single latent vector. At inference time, only this vector needs to be cached, substantially reducing memory requirements.</p></blockquote><p>Since we already described the DeepSeekMoE architecture in detail in the previous blog post of this series, this post will focus primarily on multi-head latent attention. We'll start by describing the problem it aims to solve, then move on to the intuition behind MLA's solution, and finally dive into the concrete math describing the method. We'll then end this post by discussing the effects that the combination of DeepSeekMoE and multi-head latent attention has on training and inference efficiency. Let's dive in!</p><div><hr></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding DeepSeek&#8221; Series:</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseekmoe">Understanding DeepSeek Part I: DeepSeekMoE</a></p></li><li><p>[This article] Understanding DeepSeek Part II: DeepSeek-V2</p></li><li><p>[Upcoming] Understanding DeepSeek Part III: DeepSeekMath</p></li><li><p>[Upcoming] Understanding DeepSeek Part IV: DeepSeek-Prover-V1.5</p></li><li><p>[Upcoming] Understanding DeepSeek Part V: DeepSeek-V3</p></li><li><p>[Upcoming] Understanding DeepSeek Part VI: DeepSeek-R1</p></li><li><p>[Upcoming] Understanding DeepSeek Part VII: Implications for the AI Industry and the World</p></li></ol><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe to make sure you don&#8217;t miss any new posts in this series!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h1>The Memory Efficiency Problem with Standard Multi-Head Attention</h1><p>Standard multi-head attention, at its core, solves the problem of deciding how to update our understanding of one concept, given a set of other, potentially-related concepts. In the case of language modeling, we want to update our understanding of a particular token using the understanding of the other tokens present in the sequence. To accomplish this, at each attention layer in a transformer, the model learns to parametrize three key matrices: the Query, Key, and Value matrices. These three matrices work together to identify the most relevant portions of the sequence for each token, and then to update each token's representation based on the relevant portions that were found. I won't cover the full details of how this is done here, but you can reference <a href="https://www.chrishayduk.com/p/a-primer-on-multi-head-causal-self">my other blog post</a> for more information.</p><p>Now, when language models are producing output at inference, we essentially need to place the transformer in a while loop. Until the transformer outputs an &#8220;End of Sequence&#8221; token, we&#8217;ll feed the input sequence into the transformer to produce the next token; then, appending that next token to the input sequence, we&#8217;ll feed the newly-elongated sequence back into the transformer and repeat the process. </p><p>The key insight that enables caching here is the following: since modern LLMs are causal, meaning future tokens cannot influence previous tokens, adding a new token to the end of the input sequence does not change the representation of any of the previous tokens. Hence, we do not need to recompute the hidden representations for the previous tokens, since these will be identical. </p><p>The only token for which we need to compute a new representation is the next token in the sequence (that is, the one token that doesn&#8217;t exist yet)! Another core insight coming from this observation is that we only need the key and value vectors for each previous token to compute the new token&#8217;s representation. Since the previous tokens&#8217; representations do not change, we don&#8217;t need to use the other tokens as &#8220;queries&#8221; to update their representations. However, we do need their key and value vectors so that we can &#8220;query&#8221; these vectors with the new token's query vector. </p><p>The above observations then give us a road map for caching values in the transformer in order to limit the number of computations we perform and speed up inference time. In particular, we must cache the Key and Value matrices at each hidden layer so that we can use these to compute the hidden representation for the new token. </p><p>Now let&#8217;s compute the memory requirements to store these cached values for Llama 3.3 70B, a state-of-the-art open source model at the time of writing. (In practice, Llama 3.3 uses Grouped-Query Attention, which actually reduces caching requirements. For the sake of simplicity, we'll assume it uses standard attention here.)</p>
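<p>If you want to check the arithmetic that follows, or rerun it for a different model, here is a small back-of-the-envelope Python script that carries out the same calculation we are about to walk through by hand. The dimensions are the Llama 3.3 70B figures used below; this is just a sanity-check sketch, not production serving code.</p><pre><code>
# Back-of-the-envelope KV cache size for a model using standard multi-head attention.
# Figures correspond to the Llama 3.3 70B example in the text
# (assuming full multi-head attention rather than grouped-query attention).

n_layers = 80            # attention layers
d_model = 8192           # dimension of each key/value vector
bytes_per_value = 2      # FP16: 2 bytes per stored number
context_length = 128_000

# Each cached token stores one key vector and one value vector at every layer.
bytes_per_token = n_layers * 2 * d_model * bytes_per_value
print(f"Per token: {bytes_per_token / 1e6:.2f} MB")      # ~2.62 MB

for n_tokens in (10_000, context_length):
    total_bytes = n_tokens * bytes_per_token
    print(f"{n_tokens} tokens: {total_bytes / 1e9:.1f} GB")
# ~26.2 GB for 10,000 tokens and ~335.5 GB for the full context
# (the small difference from the hand calculation below is just rounding).
</code></pre>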
<p>Llama 3.3 has 80 attention layers. Each key and value vector in these attention layers has a dimension of 8192. And Llama 3.3 has a maximum context length of 128,000 tokens. </p><p>If Llama 3.3 is used in the default floating point 16 (FP16) mode, then each stored number will take up 2 bytes (16 bits). Hence, a single vector consisting of 8192 floating point numbers will take up 16,384 bytes, or equivalently 16.384 kilobytes. For each cached token in our input, we need to store both a key vector <em>and</em> a value vector at each layer. Hence, at every layer, a cached token will require two vectors, totaling 32.768 KB in memory. Since there are 80 such layers, the cost to cache one token is thus 80 * 32.768 KB = 2621.44 KB (equivalently, 2.62 MB).</p><p>Now suppose our input is 10,000 tokens long and we are producing the next token in the sequence. To cache the necessary data for the previous tokens, we need 10,000 * 2.62 MB = 26,200 MB (equivalently, 26.2 GB). </p><p>If our input uses the full Llama 3.3 context length of 128,000 tokens, the required space is 128,000 * 2.62 MB = 335,360 MB (equivalently, 335.36 GB). </p><p>As can be seen from the above example, memory requirements for the cache expand quickly as the input length increases. This makes it incredibly difficult to serve models with long context windows. In order to solve this problem with the standard transformer architecture, DeepSeek introduced Multi-head Latent Attention (MLA). </p><h1>Multi-head Latent Attention (MLA)</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BuXX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BuXX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 424w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 848w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 1272w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BuXX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png" width="1456" height="644" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71b33121-208e-406d-82a4-faf03be131b4_1530x677.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:186879,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/158419950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BuXX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 424w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 848w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 1272w, https://substackcdn.com/image/fetch/$s_!BuXX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71b33121-208e-406d-82a4-faf03be131b4_1530x677.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In order to overcome these memory efficiency issues, DeepSeek created the Multi-head Latent Attention layer. 
This layer modifies standard multi-head attention (depicted on the left side of the above image) by compressing the keys and values for each token into a <em>single latent vector.</em> In practice, this looks like the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n    \\mathbf{c}^{KV}_t &amp;= W^{DKV} \\mathbf{h}_t, \\\\[8pt]\n    \\mathbf{k}^{C}_t &amp;= W^{UK} \\mathbf{c}^{KV}_t, \\\\[8pt]\n    \\mathbf{v}^{C}_t &amp;= W^{UV} \\mathbf{c}^{KV}_t, \\\\[8pt]\n    \\text{where} \\quad \n    \\mathbf{c}^{KV}_t &amp;\\text{ is the compressed latent vector for keys and values,}\\\\[5pt]\n\td_c \\; (&amp;<< d_h n_h) \\text{ denotes the KV compression dimension,} \\\\[5pt]\n    W^{DKV} &amp;\\in \\mathbb{R}^{d_c \\times d} \\text{ is the down-projection matrix} \\\\[5pt]\n    W^{UK}, &amp;W^{UV} \\in \\mathbb{R}^{d_h n_h \\times d_c} \\text{ are the up-projection matrices for the keys and values, respectively} \\\\[5pt]\n\\end{align}&quot;,&quot;id&quot;:&quot;BBBVCXUDSM&quot;}" data-component-name="LatexBlockToDOM"></div><p>That is, our model now must learn three new matrices per layer in place of the usual key and value projections - one down-projection matrix and two up-projection matrices. By learning these three matrices, we no longer need to store the entire Key and Value matrices when caching previously-computed tokens. Instead, we can store one compressed latent vector per token at each layer, where that latent vector contains <em>all</em> of the information needed to reconstruct the token's keys and values. </p><p>Thus, if we have L layers, we now need to store only d_c * L values per cached token (d_c numbers in each latent vector and L latent vectors per token, one for each layer). </p><p>Let's take the example of Llama 3.3 that we illustrated above to see how much this gains us - previously, caching the full Key and Value matrices for the full 128,000 token context length of Llama 3.3 required 335.36 GB. Now, instead of caching the full matrices, let's imagine we've augmented Llama 3.3 to use MLA. DeepSeek sets the dimension of the latent vector to four times the per-head dimension. Llama 3.3 has 64 attention heads, so each head has dimension 8192 / 64 = 128, giving a latent dimension of 4 * 128 = 512. In FP16, each latent vector then takes up 1,024 bytes (roughly 1 KB). Caching one latent vector at each of Llama 3.3's 80 layers costs 80 * 1.024 KB = 81.92 KB (0.08192 MB) per token, and covering the full 128,000-token context requires 128,000 * 0.08192 MB = 10,485.76 MB (roughly 10.5 GB).</p><p>This is a <em>substantial</em> reduction from the initial requirement of 335.36 GB for standard attention - roughly a 32x saving - demonstrating the efficiency gains that can be driven using this approach.</p>
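<p>To make the compression step concrete, here is a minimal numpy sketch of the idea, using the hypothetical Llama-3.3-with-MLA dimensions from the example above. The weight values are random placeholders purely for illustration - this is a sketch of the shapes involved, not DeepSeek's actual implementation (which, among other things, also adds a separate decoupled positional-encoding component to the keys that is omitted here).</p><pre><code>
import numpy as np

# Illustrative dimensions for the hypothetical "Llama 3.3 with MLA" example above.
d_model = 8192                 # hidden dimension of each token representation
n_heads = 64                   # attention heads
d_head = d_model // n_heads    # 128 per head
d_c = 4 * d_head               # 512, the KV compression dimension

rng = np.random.default_rng(0)
W_DKV = rng.standard_normal((d_c, d_model)) * 0.01           # down-projection
W_UK = rng.standard_normal((n_heads * d_head, d_c)) * 0.01   # up-projection for keys
W_UV = rng.standard_normal((n_heads * d_head, d_c)) * 0.01   # up-projection for values

h_t = rng.standard_normal(d_model)  # hidden state of one token at one layer

# Compression: this latent vector is the only thing cached for this token at this layer.
c_t = W_DKV @ h_t                   # shape (512,)

# At attention time, the keys and values are reconstructed from the latent vector.
k_t = W_UK @ c_t                    # shape (8192,)
v_t = W_UV @ c_t                    # shape (8192,)

# Per-token, per-layer cache footprint in FP16 bytes: standard KV cache vs. MLA latent cache.
standard_bytes = 2 * d_model * 2    # a key vector plus a value vector
mla_bytes = d_c * 2                 # just the latent vector
print(standard_bytes, mla_bytes)    # 32768 vs 1024, a 32x reduction per token per layer
</code></pre><p>Multiplying that per-token, per-layer saving across 80 layers and 128,000 cached tokens is exactly what takes the cache from the hundreds of gigabytes down to the roughly 10 GB figure above.</p>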
<h1>Training and Inference Efficiency</h1><p>DeepSeek-V2 introduces significant efficiency improvements in both training and inference compared to its predecessor, DeepSeek 67B, primarily through innovations in its architecture&#8212;especially the Multi-head Latent Attention (MLA). By compressing the Key and Value matrices into a single latent vector, MLA dramatically reduces memory consumption during inference. The reduction of the KV cache by approximately 93.3% translates directly into substantial gains in maximum generation throughput, allowing DeepSeek-V2 to achieve throughput levels up to 5.76 times greater than those observed in DeepSeek 67B. These optimizations enable DeepSeek-V2 to handle much longer contexts (up to 128K tokens) efficiently, positioning it as one of the most practical choices among large-scale language models for real-world applications where large-context inference is critical.</p><p>Additionally, the integration of DeepSeekMoE into the Feed-Forward Network layers synergizes well with MLA, enabling significant computational savings without sacrificing model performance. By activating only a fraction (21B) of its total parameters (236B), DeepSeek-V2 demonstrates economical training, saving 42.5% of training costs compared with its dense predecessor, DeepSeek 67B. Thus, the combination of DeepSeekMoE and MLA plays a critical role not only in inference-time efficiency but also in making the pretraining phase more cost-effective.</p><h1>Results and Key Takeaways</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ThVp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ThVp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 424w, https://substackcdn.com/image/fetch/$s_!ThVp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 848w, https://substackcdn.com/image/fetch/$s_!ThVp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 1272w, https://substackcdn.com/image/fetch/$s_!ThVp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ThVp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png" width="1441" height="794" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:794,&quot;width&quot;:1441,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:193908,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.chrishayduk.com/i/158419950?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ThVp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 424w, 
https://substackcdn.com/image/fetch/$s_!ThVp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 848w, https://substackcdn.com/image/fetch/$s_!ThVp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 1272w, https://substackcdn.com/image/fetch/$s_!ThVp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92d1cf29-86c1-4262-8f90-bcf584dc3273_1441x794.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The innovative Multi-head Latent Attention layer significantly enhances the practical deployability of DeepSeek-V2. Compared to traditional Multi-Head Attention, MLA achieves superior inference performance while simultaneously overcoming the KV cache bottleneck. With its novel low-rank joint compression strategy, MLA significantly reduces inference memory overhead, making DeepSeek-V2 particularly suited for high-throughput, real-time applications requiring extensive context management.</p><p>Empirical evaluations on various benchmarks illustrate the clear strengths of DeepSeek-V2, even when compared against other leading open-source models of the time. Notably, DeepSeek-V2 consistently achieved top-tier performance on benchmarks such as MMLU, math reasoning tasks, and coding challenges, highlighting the architectural advantages introduced by MLA. Moreover, these enhancements enabled DeepSeek-V2 to be trained and served at a fraction of the cost of comparably performing dense models (see above image).</p><p>All in all, Multi-head Latent Attention represented another significant milestone for DeepSeek on the path towards highly-optimized training and inference that marked their revolution with DeepSeek-R1 and DeepSeek-V3. 
The next blog post in this series will dive into the new innovations introduced for DeepSeek-V3, building upon the foundations laid here and forming the base model used to train DeepSeek's state-of-the-art reasoning model.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[A Primer on Multi-Head Causal Self-Attention]]></title><description><![CDATA[The neural network layer that kicked off the LLM craze]]></description><link>https://www.chrishayduk.com/p/a-primer-on-multi-head-causal-self</link><guid isPermaLink="false">https://www.chrishayduk.com/p/a-primer-on-multi-head-causal-self</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Sat, 01 Feb 2025 00:32:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d3efecf9-3e49-410d-b69e-5cf3ceecc999_1026x1148.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Lately, I've been writing quite a few series that center around the transformer architecture. For many of those blog posts, I struggle to decide whether I should include the background information necessary to understand attention (greatly increasing the length of the blog post) or I should assume the reader already knows this information (limiting the reach of my audience). Thus, this post is intended to be a compromise between the two positions, allowing me to link this post as background reading in any future blog post that requires knowledge of the nuts and bolts of the attention architecture.</p><p>This will be a "living" blog post, in that it will be edited and expanded upon as my own understanding of the architecture grows and deepens. If there are any radically large changes that I make, I will re-email the post out to subscribers for their review. 
Otherwise, feel free to check back periodically to see how the article has changed!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Subscribe to get notified for updates to this primer.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>The Basic Terminology of Multi-Head Causal Self-Attention</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!HgXR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!HgXR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 424w, https://substackcdn.com/image/fetch/$s_!HgXR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 848w, https://substackcdn.com/image/fetch/$s_!HgXR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 1272w, https://substackcdn.com/image/fetch/$s_!HgXR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!HgXR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png" width="635" height="347" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:347,&quot;width&quot;:635,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40706,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!HgXR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 424w, 
https://substackcdn.com/image/fetch/$s_!HgXR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 848w, https://substackcdn.com/image/fetch/$s_!HgXR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 1272w, https://substackcdn.com/image/fetch/$s_!HgXR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F594a8d67-c26a-433c-91d0-2448c6d3de74_635x347.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The standard attention block used in first-generation LLMs like GPT-2 and GPT-3 is <strong>multi-head causal self-attention</strong>. </p><p>The goal of this variant of attention, like any attention variant, is to learn how to update a vector using other context vectors in order to accomplish some goal. In the case of language modeling, our vectors represent tokens, which you can think of as roughly analogous to words. The goal of these vector updates is to accurately predict the next word in the sentence. It is called <strong>causal</strong> because this type of attention ensures that each word can only update itself using previous words in the sentence - that is, it can't look ahead and update itself using words that haven't been written yet! It is called <strong>self-attention</strong> because the things that each word is paying attention to are the other words in the sentence. There is no outside data or context involved here. And finally, it is termed <strong>multi-head</strong> because, at each attention layer, we have multiple attention operations occurring in parallel. These parallel attention operations are referred to as "heads".</p><p>To produce the results of attention, each attention head takes as input a sequence of tokens represented as vectors. These vectors are passed through three learned linear projections per head in parallel, projecting each token's vector into three new vectors. 
These new vectors are commonly referred to as the query, key, and value vectors. These query, key, and value vectors are then used to update the vector representations of the words in our sentence, improving the model's understanding of the concepts contained in the sentence.</p><p>Let's take a look at how this is done in practice.</p><h1>The Mathematics of Causal Self-Attention</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IwgS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IwgS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 424w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 848w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 1272w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IwgS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png" width="925" height="739" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ff99c21-1666-4108-869a-102bb1bec947_925x739.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:739,&quot;width&quot;:925,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82151,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IwgS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 424w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 848w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 1272w, https://substackcdn.com/image/fetch/$s_!IwgS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff99c21-1666-4108-869a-102bb1bec947_925x739.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As mentioned, the attention block block takes as input a sequence of tokens represented as vectors. Suppose the input sequence is given by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;X = \\{x_1, x_2, \\dots, x_n\\},&quot;,&quot;id&quot;:&quot;TROFIYZJWD&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each x_i is the vector representation (embedding) of a token.</p><p>Each input vector x_i is simultaneously projected into three different spaces using learned linear transformations. That is, for every token x_i, we compute:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;q_i = x_i W^Q,\\quad k_i = x_i W^K,\\quad v_i = x_i W^V,&quot;,&quot;id&quot;:&quot;RTYBEAFKPI&quot;}" data-component-name="LatexBlockToDOM"></div><p>where:</p><ul><li><p>W^Q, W^K, and W^V are the weight matrices for the query, key, and value projections, respectively.</p></li><li><p>The sets of all query, key, and value vectors are often denoted as Q,  K, and V.</p></li></ul><p>For a given token x_i, we will compute a similarity score with every token x_j such that x_j comes before it in the sentence (or is the token itself). This is done by taking the dot product of the query vector for x_i with the key vector for k_j. This result is then divided by the square root of the key vector's dimension. Mathematically, this is given by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\alpha_{ij} = \\frac{q_i \\cdot k_j}{\\sqrt{d_k}}&quot;,&quot;id&quot;:&quot;GXXLDMYNFI&quot;}" data-component-name="LatexBlockToDOM"></div><p>This value is precisely the unnormalized measure of how much token i should attend to token j. In linear algebra, the dot product of two vectors is just a scaled version of the cosine of the angle between them. An angle of 0 degrees gives a cosine value of 1, while an angle of 180 degrees gives a cosine value of -1. 
Hence, the closer the two vectors are to pointing in the same direction, the closer their cosine gets to 1 (and the larger their dot product becomes), and the closer they are to pointing in opposite directions, the closer their cosine gets to -1. Intuitively then, cosine (and by extension the dot product) has very desirable properties to use as a similarity function in the attention mechanism.</p><p>Armed with these similarity scores, we now have a way of measuring how "similar" two tokens in our sequence are. However, in order to use them to produce new vector embeddings, we're going to want to re-scale them. The dot product between the query and key vectors could end up being quite large, and using this value directly can cause large changes in the scale of the vector representation for a given token. Moreover, knowing the score of a particular token pair (let's say between tokens x_i and x_j) tells us nothing about how important that pair is - importance is always relative, and what if the score for x_i and x_k is bigger?</p><p>Given the above discussion, we know we need to introduce some function that will re-scale our scores in such a way that we do not radically change the magnitude of the token's vector representation and that we can quickly determine how "important" each token pair is. A convenient differentiable function that does just this is softmax. The equation to produce the softmax output for token x_i is given below:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{\\alpha}_{ij} = \\frac{\\exp(\\alpha_{ij})}{\\sum_{j=1}^i \\exp(\\alpha_{ij})}&quot;,&quot;id&quot;:&quot;KBGRNDGLBV&quot;}" data-component-name="LatexBlockToDOM"></div><p>The softmax function will take our scores for all possible pairs made with x_i (i.e., all key vectors we multiplied by x_i's query vector) and squash them into the range of 0 to 1. Moreover, it will ensure that these values sum to 1. Hence, we can view these outputs, referred to as attention weights, as probabilities or percentages. It can be useful to think of the attention weight for the pair x_i and x_j as the percent of x_i's attention that should be paid to x_j. </p><p>Once we have these attention weights, we can use them to produce a new vector representation for token x_i. We do this by taking a weighted sum of the value vectors for each token, where the weight is the attention weight. As mentioned above, we can think of this attention weight as the percent of attention that x_i pays to each vector that precedes it in the sequence. Mathematically, this looks like:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_i = \\sum_{j=1}^i \\tilde{\\alpha}_{ij} v_j&quot;,&quot;id&quot;:&quot;WPLELAQFMS&quot;}" data-component-name="LatexBlockToDOM"></div><p>This weighted sum integrates information from the tokens that x_i &#8220;attends&#8221; to, based on the learned attention weights.</p>
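<p>Before moving on to multiple heads, here is a minimal numpy sketch that puts the single-head pieces together: the scaled dot-product scores, the causal mask, the softmax, and the weighted sum of value vectors. The weights and 4-dimensional embeddings are random stand-ins purely for illustration; real models use much larger, learned matrices.</p><pre><code>
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Single-head causal self-attention over a sequence of token vectors X (n rows, d columns)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # alpha_ij for every pair of tokens (i, j)
    n = X.shape[0]
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal mask: token i only attends to itself and earlier tokens
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                           # weighted sums of the value vectors

# Tiny example: 3 tokens ("the", "dog", "barks") with made-up 4-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
W_Q, W_K, W_V = [rng.standard_normal((4, 4)) for _ in range(3)]
updated = causal_self_attention(X, W_Q, W_K, W_V)
print(updated.shape)   # (3, 4): one updated vector per token
</code></pre>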
<p>In multi-head attention, these operations happen in parallel multiple times over. That is, we will produce multiple instances of the query, key, and value vectors for each token in the sequence. We will then use those unique instances to produce distinct updated vector representations for each token. If we have k heads, then we will produce k updated vectors for token i:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_{i1}, \\tilde{x}_{i2}, \\ldots \\tilde{x}_{ik}&quot;,&quot;id&quot;:&quot;PRSVCGFTHD&quot;}" data-component-name="LatexBlockToDOM"></div><p>These vectors get concatenated together, forming a single vector to represent the updated token i:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_i = [\\tilde{x}_{i1}; \\tilde{x}_{i2}; \\ldots; \\tilde{x}_{ik}]&quot;,&quot;id&quot;:&quot;HWIMGBKDRV&quot;}" data-component-name="LatexBlockToDOM"></div><p>The intuition behind using multiple heads to create the final updated representation for token i is that each head can learn to capture different aspects of language. One head might learn grammatical structure, while another head might learn vocabulary related to the legal profession. By splitting responsibilities between the attention heads, each can learn unique, non-redundant information.</p><p>This vector will then be passed through a linear projection layer, producing the final output of the multi-head attention layer.</p><p>Let's walk through a toy example now to make things concrete.</p><h1>A Toy Example: "the dog barks"</h1><p>Suppose our sentence is "the dog barks", and our tokenizer splits it into three tokens: "the", "dog", and "barks". Initially, these tokens are embedded into vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;x_{\\text{the}},\\quad x_{\\text{dog}},\\quad x_{\\text{barks}}.&quot;,&quot;id&quot;:&quot;VPTYOASSTS&quot;}" data-component-name="LatexBlockToDOM"></div><p>When entering the first attention block, each of these vectors is projected into three new vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\n\\text{For \&quot;the\&quot;:} &amp;\\quad q_{\\text{the}} = x_{\\text{the}} W^Q,\\quad k_{\\text{the}} = x_{\\text{the}} W^K,\\quad v_{\\text{the}} = x_{\\text{the}} W^V, \\\\\n\n\\text{For \&quot;dog\&quot;:} &amp;\\quad q_{\\text{dog}} = x_{\\text{dog}} W^Q,\\quad k_{\\text{dog}} = x_{\\text{dog}} W^K,\\quad v_{\\text{dog}} = x_{\\text{dog}} W^V, \\\\\n\n\\text{For \&quot;barks\&quot;:} &amp;\\quad q_{\\text{barks}} = x_{\\text{barks}} W^Q,\\quad k_{\\text{barks}} = x_{\\text{barks}} W^K,\\quad v_{\\text{barks}} = x_{\\text{barks}} W^V.\n\n\\end{aligned}&quot;,&quot;id&quot;:&quot;OBBBHLQCJW&quot;}" data-component-name="LatexBlockToDOM"></div><p>Thus, we transform 3 input vectors into 9 new vectors (3 each for queries, keys, and values).</p><h2>Updating the "dog" Token</h2><p>Let&#8217;s focus on updating the token "dog". In our example, "dog" corresponds to the second token x_2. To update its representation, we use its query vector and compute dot-product scores with the key vectors of "the" and "dog" (i.e., the tokens that precede it, plus the token itself):</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\n\\alpha_{\\text{dog},\\text{the}} &amp;= \\frac{q_{\\text{dog}} \\cdot k_{\\text{the}}}{\\sqrt{d_k}}, \\\\\n\n\\alpha_{\\text{dog},\\text{dog}} &amp;= \\frac{q_{\\text{dog}} \\cdot k_{\\text{dog}}}{\\sqrt{d_k}},\n\n\\end{aligned}&quot;,&quot;id&quot;:&quot;OIAYCICNJN&quot;}" data-component-name="LatexBlockToDOM"></div><p>where d_k is the dimensionality of the key vectors. 
The division by the square root of d_k is used to normalize the scores.</p><p>These scores are then passed through a softmax function to obtain attention weights (or probabilities) that indicate how much attention "dog" should pay to itself and to "the":</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{aligned}\n\n\\tilde{\\alpha}_{\\text{dog},\\text{the}} &amp;= \\frac{\\exp\\left(\\alpha_{\\text{dog},\\text{the}}\\right)}{\\exp\\left(\\alpha_{\\text{dog},\\text{the}}\\right) + \\exp\\left(\\alpha_{\\text{dog},\\text{dog}}\\right)}, \\\\\n\n\\tilde{\\alpha}_{\\text{dog},\\text{dog}} &amp;= \\frac{\\exp\\left(\\alpha_{\\text{dog},\\text{dog}}\\right)}{\\exp\\left(\\alpha_{\\text{dog},\\text{the}}\\right) + \\exp\\left(\\alpha_{\\text{dog},\\text{dog}}\\right)}.\n\n\\end{aligned}&quot;,&quot;id&quot;:&quot;YRFTFEVZYQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>Softmax ensures that these attention weights are between 0 and 1, and that they all sum to 1. Hence, they are valid probabilities and, to make things easier, you can think of them as the percent of its attention that the word "dog" should pay to the word "the" or to itself.</p><p>With the attention probabilities computed, we update the original vector representation of "dog" by taking a weighted sum of the corresponding value vectors. In this case, we combine the value vector of "the" and the value vector of "dog":</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_{\\text{dog}} = \\tilde{\\alpha}_{\\text{dog},\\text{the}} \\, v_{\\text{the}} + \\tilde{\\alpha}_{\\text{dog},\\text{dog}} \\, v_{\\text{dog}}&quot;,&quot;id&quot;:&quot;EVNVACWIHW&quot;}" data-component-name="LatexBlockToDOM"></div><p>This new vector is an updated representation that incorporates contextual information from the preceding token "the" as well as from "dog" itself.</p><h2>Recap</h2><p>To recap, these are the major steps for updating the "dog" vector in our example using causal self-attention:</p><p>1. <strong>Input Embedding:</strong> </p><p>   Each token is embedded into a vector x_i.</p><p>2. <strong>Linear Projections:</strong>  </p><p>   Each x_i is projected into query, key, and value vectors:  </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;   q_i = x_i W^Q,\\quad k_i = x_i W^K,\\quad v_i = x_i W^V.&quot;,&quot;id&quot;:&quot;NCHGMXGHNI&quot;}" data-component-name="LatexBlockToDOM"></div><p>3. <strong>Score Calculation (for causal attention):</strong></p><p>   For token "dog" (second token), calculate:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;   \\alpha_{\\text{dog}, j} = \\frac{q_{\\text{dog}} \\cdot k_j}{\\sqrt{d_k}}, \\quad \\text{for } j \\in \\{\\text{\&quot;the\&quot;}, \\text{\&quot;dog\&quot;}\\}&quot;,&quot;id&quot;:&quot;NXASKXVPVR&quot;}" data-component-name="LatexBlockToDOM"></div><p>4. <strong>Softmax to Obtain Weights:</strong></p><p>   Convert scores to probabilities:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{\\alpha}_{\\text{dog}, j} = \\frac{\\exp\\left(\\alpha_{\\text{dog}, j}\\right)}{\\sum_{j'} \\exp\\left(\\alpha_{\\text{dog}, j'}\\right)}&quot;,&quot;id&quot;:&quot;BPJVISXYAX&quot;}" data-component-name="LatexBlockToDOM"></div><p>5. 
<strong>Contextual Update:</strong>  </p><p>   Update "dog" by a weighted sum of the value vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\tilde{x}_{\\text{dog}} = \\sum_{j \\in \\{\\text{\&quot;the\&quot;}, \\text{\&quot;dog\&quot;}\\}} \\tilde{\\alpha}_{\\text{dog}, j} \\, v_j&quot;,&quot;id&quot;:&quot;JLFZHRNNAS&quot;}" data-component-name="LatexBlockToDOM"></div><p>In the full multi-head attention mechanism, this process is performed in parallel over multiple "heads" (with different learned projections), and the results are concatenated and transformed further to form the final output of the attention block.</p><h1>Key Takeaways</h1><p>Let's now summarize the key points of what we've learned:</p><ol><li><p>Attention is a neural network mechanism used to update vectors using the context from other vectors</p></li><li><p>Input vectors to an attention layer are replaced by 3 intermediate vectors: the query, key, and value vectors</p></li><li><p>The query and key vectors work together to produce similarity scores between pairs of vectors. If we are updating token i and want to know how much token j should influence our input, we multiply the query vector of token i by the key vector of token j.</p></li><li><p>The scores produced by the query and key vectors can be turned into probabilities using the softmax function. These probabilities are used to measure how much token i should consider the tokens that came before it in the sequence when updating its vector representation</p></li><li><p>The new vector representation for token i is produced by multiplying the softmax probabilities by the value vectors for each corresponding token. These are then summed together</p></li><li><p>The above process occurs independently across several parallel attention operations, called heads. At the end of the attention block, the new vector representations for token i coming from each head are concatenated together and passed through a linear projection layer.</p></li><li><p>The above process (steps 1-6) is performed in parallel for the full sequence of input tokens.</p></li></ol><p>If you keep these 7 key points in mind while reading Arxiv papers (or my future blog posts!), you'll have a strong understanding of what multi-head causal self-attention is doing, where it faces limitations, and whether or not a given architectural change actually addresses those limitations.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding DeepSeek Part I: DeepSeekMoE]]></title><description><![CDATA[Mixture of experts models with a twist]]></description><link>https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseekmoe</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseekmoe</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Thu, 30 Jan 2025 03:28:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RrxT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RrxT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RrxT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RrxT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg" width="826" height="465" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:465,&quot;width&quot;:826,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;What is DeepSeek: China's open-source AI research lab which rivals OpenAI |  World News - Business Standard&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="What is DeepSeek: China's open-source AI research lab which rivals 
OpenAI |  World News - Business Standard" title="What is DeepSeek: China's open-source AI research lab which rivals OpenAI |  World News - Business Standard" srcset="https://substackcdn.com/image/fetch/$s_!RrxT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 424w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 848w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!RrxT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80afde5c-c788-4a82-b1d9-4779b07b4a11_826x465.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><h1>Series Introduction</h1><p>Recently, the announcement of DeepSeek-R1 shook the AI world, as an open source project managed to match the performance of OpenAI's state-of-the-art model, o1, within months of its release. The market reacted vehemently to this news, with Nvidia's stock dropping 18% in a single day. AI researchers, engineers, and commentators alike took to Twitter/X to share their thoughts on DeepSeek-R1's implications for the AI industry and the United States, with many asserting that the age of American AI had come and gone in a flash, with China now firmly taking the lead.</p><p>But were these takes correct?</p><p>In order to dissect the true implications for the world going forward, we first need to understand DeepSeek-R1 on a fundamental level - what is it, what does it do, how does it work, and what are the key innovations that it introduced. 
This blog post series will aim to arm you with that knowledge.</p><p>To do this effectively, we are going to start at the beginning of DeepSeek's major papers and work our way forward in time, tracing out the researchers' reasoning and how they arrived at the final design for DeepSeek-R1. This final design included two key components: </p><ol><li><p>An efficient mixture of experts language model base </p></li><li><p>Reinforcement learning-tuned chain of thought capabilities</p></li></ol><p>In this blog series, we will explore two separate but related series of papers in order to deeply understand the two key components of DeepSeek-R1. First, we will trace the evolution of the mixture of experts architecture from DeepSeek-MOE to DeepSeek-V3, their newest state-of-the-art language model. We will then turn our attention to reinforcement learning-tuned chain of thought, beginning with the seminal DeepSeekMath paper and working our way forward to the current AI darling - DeepSeek-R1.</p><p>With this strong foundational knowledge of the theoretical underpinnings of DeepSeek-R1, we will be able to separate the hype from the noise. In light of what we've learned from these paper deep dives, this blog series will conclude with an analysis of the implications of DeepSeek-R1 from several perspectives:</p><ol><li><p>Technological progress</p></li><li><p>AI market dynamics</p></li><li><p>Geopolitical risks</p></li></ol><p>By the end of this series, you will have a clear, evidence-based understanding of DeepSeek-R1&#8212;what makes it powerful, where it stands relative to its competitors, and what its long-term impact might be. As the AI landscape continues to shift at an unprecedented pace, cutting through speculation and focusing on the fundamentals will be key to making sense of the road ahead. Let&#8217;s dive in.</p><div><hr></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding DeepSeek&#8221; Series:</p><ol><li><p>[This article] Understanding DeepSeek Part I: DeepSeekMoE</p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-deepseek-part-i-deepseek">Understanding DeepSeek Part II: DeepSeek-V2</a></p></li><li><p>[Upcoming] Understanding DeepSeek Part III: DeepSeekMath</p></li><li><p>[Upcoming] Understanding DeepSeek Part IV: DeepSeek-Prover-V1.5</p></li><li><p>[Upcoming] Understanding DeepSeek Part V: DeepSeek-V3</p></li><li><p>[Upcoming] Understanding DeepSeek Part VI: DeepSeek-R1</p></li><li><p>[Upcoming] Understanding DeepSeek Part VII: Implications for the AI Industry and the World</p></li></ol><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe to make sure you don&#8217;t miss any new posts in this series!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h1>Paper Summary</h1><p>Mixture-of-experts (MoE) models are an extension of the standard transformer architecture in which a collection of expert modules (typically feed-forward networks) each learn to specialize in different aspects of the data. For a given token or input, only a subset of these specialized experts is activated, allowing the model to dynamically focus its computation on the most relevant components. This selective activation enables MoE models to achieve a high effective capacity&#8212;since many different specialists are available&#8212;while maintaining computational efficiency, because only a limited number of experts actually process each input. As a result, MoE approaches excel at capturing diverse patterns, efficiently scaling model size, and flexibly adapting to a wide variety of tasks.</p><p>Standard mixture-of-experts models, used prior to DeepSeekMoE, typically rely on selecting the top <em>K</em> experts (often 1 or 2) out of <em>N</em> possible experts for each token in a sequence. While this approach does reduce computational load&#8212;since only a small fraction of experts are activated&#8212;it also forces those few activated experts to capture <em>all</em> aspects of the token, including common linguistic structure that is often duplicated across experts. Consequently, an enormous portion of each expert&#8217;s capacity is spent memorizing redundant information, leaving less room for true specialization.</p><p>DeepSeekMoE improves upon the standard MoE architecture, solving this redundancy problem by:</p><p>1. <strong>Using a larger number of smaller experts (Fine-Grained Expert Segmentation)</strong></p><p>Instead of a few large experts, DeepSeek splits capacity into many more experts, each of which is smaller in dimensionality. The model then increases the number of selected experts by the same factor, creating a dramatically larger space of potential expert combinations. Despite this combinatorial explosion, the overall parameter count and per-token activated parameters remain <em>exactly the same</em> as in a conventional MoE setup&#8212;meaning we gain richer representational capacity without paying extra in total parameter count or computational cost.</p><p>2. <strong>Separating Experts into Shared and Routing Experts</strong></p><p>DeepSeek also partitions its experts into two sets. The shared experts, which are <em>always activated</em> for every token, learn the broad &#8220;common knowledge&#8221; required by all inputs (e.g., syntax, high-level semantics). The routing experts, by contrast, are only activated if they are relevant to a specific token, allowing them to focus on niche or domain-specific information. This further decreases redundancy and promotes parameter efficiency: shared experts handle language &#8220;fundamentals,&#8221; while routing experts handle specialization.</p><p>3. <strong>Load Balancing Through Additional Loss Terms</strong></p><p>Finally, DeepSeek addresses load balancing in two senses. 
It enforces a roughly equal usage of each active routing expert across tokens&#8212;ensuring no single expert is under- or over-utilized&#8212;and distributes the experts themselves across multiple GPUs to avoid hardware bottlenecks. Both of these aims are achieved by incorporating new balancing terms into the training objective.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7dNn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7dNn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 424w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 848w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1272w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7dNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png" width="1086" height="664" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:664,&quot;width&quot;:1086,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:145250,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7dNn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 424w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 848w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1272w, https://substackcdn.com/image/fetch/$s_!7dNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc305bcca-7705-40e3-8ddf-2a278c9e73a0_1086x664.png 1456w" sizes="100vw" loading="lazy"></picture><div 
class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Taken together, these modifications produce a model that is both parameter-efficient and highly flexible. By boosting expert variety, removing needless duplication, and balancing the workload across experts and devices, DeepSeekMoE provides a substantially more effective way to leverage MoE architectures&#8212;achieving greater specialization and capacity without increasing the overall parameter footprint.</p><p>Let's dive in deeper to these three optimizations now and see how they alter the standard MoE transformer architecture.</p><h1>Standard Mixture of Experts Models</h1><p>In standard MoE architecture, expert layers will typically replace the feed-forward layer that occurs after self-attention. Experts can be thought of as a set of <em>N</em> feed-forward layers that are structurally identical to the original feed-forward layer. Only a subset of these <em>N</em> possible feed-forward networks will be activated for any individual token, with many prior MoE architectures selecting 1 or 2 of these <em>N</em> possible networks for a given token. </p><p>Whether or not a network is activated is determined by taking the dot product of the output of the attention layer for that token (i.e. the hidden vector for token i) with the centroid of the current expert. We then take the softmax of this value to force it into the range of 0 to 1. You can think of this like an attention score computed over the experts instead of the tokens - we want to see which expert aligns most closely with the current token under consideration. These scores are computed for each expert, and then the experts are ranked according to this score. The top <em>K</em> (usually 1 or 2) experts are selected based on this ranking, and the token embeddings are then passed to those feed-forward expert networks. </p><p>The output of these experts are added together alongside the initial hidden state for the token (i.e. the token vector prior to the application of the experts). 
<p>The major obstacle with this approach is the following: since most prior MoE models only selected the top 1 or 2 experts for each token, the selected expert(s) must capture <em>everything</em> about a given token, including redundant information such as language structure. This wastes a large amount of the model's capacity to learn useful information, forcing the weights of each expert to memorize redundant information that is already captured by the other experts.</p><h1>Fine-Grained Expert Segmentation</h1><p>One of DeepSeek's solutions to the redundancy problem is to <strong>make experts smaller but more numerous</strong>. That is, the DeepSeekMoE approach reduces the dimensionality of each individual expert's feed-forward network (and therefore its computational cost and representational capacity) by a factor of 1/m compared to the standard feed-forward layer it replaces. Correspondingly, it increases the number of total experts by a factor of m <em>and</em> the number of selected experts by the same factor of m. This results in the same number of parameters for the model on net, but allows for substantially more variety when selecting the experts to use for a specific token.</p><p>We can see this increased variety when examining the combinatorics of the expert space. Suppose our standard feed-forward network has hidden dimension 4096, and our standard mixture of experts model uses 8 of these experts in total, with 2 selected for any given token. This results in the following number of possible expert combinations for each token in the standard mixture of experts model:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;{8 \\choose 2} = 28 \\; \\text{possible expert combinations}&quot;,&quot;id&quot;:&quot;SBDXFQQTLB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now, using the DeepSeekMoE architecture, suppose we have m = 8. That is, we are going to increase our number of experts by a factor of 8 (and reduce the hidden dimension by a factor of 1/8). This gives us a hidden dimension of 512 per expert, with 64 total experts and 16 experts selected for any given token. This results in the following number of possible expert combinations for each token in the DeepSeekMoE version of the model:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;{64 \\choose 16} \\approx 489,000,000,000,000 \\; \\text{possible expert combinations}&quot;,&quot;id&quot;:&quot;HALZCSCSEQ&quot;}" data-component-name="LatexBlockToDOM"></div><p>That is, we go from 28 possible expert combinations to nearly 489 trillion possible expert combinations! This allows for <em>significantly</em> more specialization across experts and much more variety in knowledge application on a token-by-token basis. Astonishingly, even with this huge increase in variety, the number of parameters stays exactly the same! 
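</p><p>As a quick, standalone sanity check on those combination counts (added here for illustration, not taken from the paper), Python's math.comb reproduces both numbers directly:</p><pre><code>import math

# Standard MoE: choose 2 of 8 experts for each token.
print(math.comb(8, 2))     # 28

# Fine-grained DeepSeekMoE-style setup: choose 16 of 64 smaller experts for each token.
print(math.comb(64, 16))   # 488526937079580, i.e. roughly 489 trillion
</code></pre><p>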
The number of total parameters in each model is given by:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\n\\text{Original MoE model parameters} &amp;= 8 \\text{ experts} * 4096 \\text{ parameters per expert} = 32,768\\\\\n\n\\text{DeepSeekMoE model parameters} &amp;= 64 \\text{ experts} * 512 \\text{ parameters per expert} = 32,768\n\n\\end{align*}&quot;,&quot;id&quot;:&quot;YLEXVROBCV&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Similarly, the number of parameters activated for any given token is exactly the same:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align*}\n\n\\text{Original MoE activated parameters} &amp;= 2 \\text{ activated experts} * 4096 \\text{ parameters per expert}\\\\ &amp;= 8192\\\\\n\n\\text{DeepSeekMoE activated parameters} &amp;= 16 \\text{ activated experts} * 512 \\text{ parameters per expert}\\\\ &amp;= 8192\n\n\\end{align*}&quot;,&quot;id&quot;:&quot;DWYNCNGFFC&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Hence, we get basically a free lunch here - significantly higher representational capacity in our model with the same number of parameters used!</p><h1>Shared Experts</h1><p>Another approach DeepSeek took to avoid capturing redundancy in its experts is to segment the expert population into two groups: shared experts and routing experts.</p><p>Shared experts are <strong>always activated</strong>, regardless of the input token. This incentivizes these expert modules to capture common knowledge relevant to all queries (e.g. language semantics). By contrast, routing experts are only activated if the token is relevant to the expert, as described in the "Standard Mixture of Expert Models" section. </p><p>That is, the initial <em>mN</em> experts are split into two groups: <em>K_s</em> shared experts and <em>K_r = mN - K_s</em> routing experts. <em>All</em> of the <em>K_s</em> shared experts are activated for all tokens, while a subset of the <em>K_r</em> are selected for each token. 
Mathematically, this looks like the following:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{align}\n    \\mathbf{h}^l_t &amp;= \\sum_{i=1}^{K_s} \\text{FFN}_i(\\mathbf{u}^l_t) \n    + \\sum_{i=K_s+1}^{mN} \\left( g_{i,t} \\cdot \\text{FFN}_i(\\mathbf{u}_t^l) \\right) \n    + \\mathbf{u}_t^l, \\\\[8pt]\n    \\text{where} \\quad \n    \\mathbf{h}_t^l &amp;\\text{ is the hidden vector output for the } t\\text{-th token at the } l\\text{-th layer,} \\\\[5pt]\n    \\text{FFN}_i &amp;\\text{ is the feed-forward network representing the } i\\text{-th expert,} \\\\[5pt]\n    K_s &amp;\\text{ is the number of shared experts,} \\\\[5pt]\n    mN &amp;\\text{ is the total number of experts,} \\\\[5pt]\n    \\mathbf{u}_t^l &amp;\\text{ is the output of the attention mechanism for token } t \\text{ at layer } l, \\\\[5pt]\n    g_{i,t} &amp;= \\begin{cases} s_{i,t}, &amp; s_{i,t} \\in \\text{Top}_k\\left( \\{s_{j,t} \\mid 1 \\leq j \\leq mN\\}, mK \\right) \\\\[5pt] 0, &amp; \\text{otherwise} \\end{cases} \\\\[8pt]\ns_{i,t} &amp;= \\text{Softmax}_i \\left( \\mathbf{u}_t^l{}^\\top \\mathbf{e}_i^l \\right),\n\\end{align}&quot;,&quot;id&quot;:&quot;NOEKVJEXTI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Hence, we can see that the hidden vector output of token t at layer l <em>always</em> uses all of the shared experts (denoted by the first summation in the equation) and <em>always</em> includes the residual (denoted by the last term). The middle term, representing the routing experts, includes a gating factor that controls which experts are turned on for any specific token. In particular, the gating factor equals the expert's softmax score if that expert ranks among the top <em>mK</em> experts. Otherwise, it is 0. As a result, not only do we eliminate most of the possible experts (thereby greatly reducing the number of active parameters), we also weight the final output based on how <em>close</em> each chosen routing expert is to the token. In other words, the more a chosen routing expert "knows" about a topic, the more heavily we weight its opinion.</p><p>This setup allows the routing experts to ignore the redundant information captured by the shared experts and instead focus on learning concepts and information that are relevant to their areas of specialization. This promotes parameter efficiency in the model, as each marginal parameter added to the routing experts will be encouraged through the learning process to acquire information that is distinct from the existing parameters.</p><h1>Load Balancing</h1><p>Now that we have a better-designed MoE network with fine-grained experts and expert sharing, there still remains one major challenge to ensure the parameters are used maximally - we need to load balance requests across the available experts. Essentially, our goal is to ensure that each routing expert is chosen (as one of the <em>mK</em> active experts) roughly equally often across tokens, so that no expert is systematically over- or under-utilized. This makes certain that, when we activate routing expert parameters to process a particular token, all of the activated parameters are contributing meaningfully to the output. As a result, we maximize the utilization of the MoE architecture.</p><p>In addition to load balancing across experts, we would like to load balance across devices. Experts are typically stored on many separate GPUs, since these models are too large to fit in the memory of a single GPU. 
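</p><p>To tie the equation above to this balancing goal, here is a minimal NumPy sketch of the shared-plus-routed computation, followed by a quick measurement of how evenly the routed experts get used over a toy batch. The sizes, the random placeholder weights, and the simple ReLU experts are assumptions made for illustration, not DeepSeek's actual implementation, and the usage fractions at the end are only a stand-in for the quantity that the balancing terms in the loss push toward uniform.</p><pre><code>import numpy as np

rng = np.random.default_rng(0)

d_model, d_expert = 16, 8              # toy sizes: model width and reduced expert width
n_shared, n_routed, top_k = 2, 14, 6   # K_s shared experts, routed experts, mK selected

def make_expert():
    # A small ReLU feed-forward expert with random placeholder weights.
    w_in = rng.normal(size=(d_model, d_expert))
    w_out = rng.normal(size=(d_expert, d_model))
    return lambda u: np.maximum(u @ w_in, 0.0) @ w_out

shared_experts = [make_expert() for _ in range(n_shared)]
routed_experts = [make_expert() for _ in range(n_routed)]
centroids = rng.normal(size=(n_routed, d_model))   # e_i: one centroid per routed expert

def deepseek_moe_layer(u):
    # h = sum of shared FFNs + sum over top-K routed experts of s_i * FFN_i + residual.
    out = sum(f(u) for f in shared_experts)        # shared experts: always active
    scores = centroids @ u
    s = np.exp(scores - scores.max())
    s = s / s.sum()                                # softmax scores s_i over routed experts
    top = np.argsort(s)[-top_k:]                   # the top-K routed experts for this token
    for i in top:
        out = out + s[i] * routed_experts[i](u)    # gate g_i = s_i for the selected experts
    return out + u, top                            # updated hidden state plus chosen experts

# Route a toy batch of tokens and measure how evenly the routed experts get used.
tokens = rng.normal(size=(256, d_model))
usage = np.zeros(n_routed)
for u in tokens:
    _, top = deepseek_moe_layer(u)
    usage[top] += 1
print((usage / usage.sum()).round(3))   # balancing terms push these fractions toward uniform
</code></pre><p>The sketch runs everything on one device; in a real deployment the routed experts are sharded across many GPUs, which is why balance across devices matters as well.</p><p>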
Given this fact, we would like the chosen experts for a token to be evenly spread across devices, thus preventing overloading of any single GPU.</p><p>These two goals are achieved by DeepSeekMoE through introducing two new terms to the loss function.</p><h1>Results and Key Takeaways</h1><p>With the above optimizations, DeepSeek was able to mitigate many of the most challenging problems facing MoE models. Together, fine-grained segmentation, shared experts, and load balancing work to maximize the amount of unique, useful information stored in a given set of parameters. As a result, DeepSeekMoE is able to outperform models with <em>fewer</em> active parameters. Below, we can see that DeepSeekMoE outperformed LLaMA2 7B (a dense model that does <em>not</em> use any experts) across a number of benchmarks with fewer than half of the active parameters. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZFgF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZFgF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 424w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 848w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1272w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png" width="991" height="918" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:918,&quot;width&quot;:991,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:209291,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZFgF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 424w, 
https://substackcdn.com/image/fetch/$s_!ZFgF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 848w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1272w, https://substackcdn.com/image/fetch/$s_!ZFgF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7ed31d09-a4d3-4b60-aa09-ba6ce8ae5a68_991x918.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When compared to another mixture of experts model, GShard, we see that DeepSeekMoE again outperforms it with the same total parameters and only half of the activated parameters.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RZJF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RZJF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 424w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 848w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1272w, 
https://substackcdn.com/image/fetch/$s_!RZJF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RZJF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png" width="984" height="547" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b6101720-8db6-495a-99ad-d32ce5906a86_984x547.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:547,&quot;width&quot;:984,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69102,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!RZJF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 424w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 848w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1272w, https://substackcdn.com/image/fetch/$s_!RZJF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb6101720-8db6-495a-99ad-d32ce5906a86_984x547.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In sum, DeepSeek's optimizations for the MoE architecture served to substantially expand the 
possibilities for local and edge inference. Since only a small percentage of the model's total parameters are active for any given token, during inference the model's performance requirements are much closer to that of a small, weak model. However, its output quality matches that of a large, well-trained dense LLM. This innovation was critical for laying the groundwork towards DeepSeek-R1, ensuring that state-of-the-art base LLM performance would be possible for smaller models.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe to make sure you don&#8217;t miss any new posts in this series!</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold]]></title><description><![CDATA[How ESMFold and ESM3 replace explicit MSAs with encoder-only transformers]]></description><link>https://www.chrishayduk.com/p/understanding-protein-language-models-40b</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-protein-language-models-40b</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 22 Jan 2025 15:53:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!e9L5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!e9L5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!e9L5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 424w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 848w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1272w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!e9L5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png" width="1277" height="330" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:330,&quot;width&quot;:1277,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Meta's ESMfold: the rival of AlpahFold2 | by Salvatore Raieli | Medium&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Meta's ESMfold: the rival of AlpahFold2 | by Salvatore Raieli | Medium" title="Meta's ESMfold: the rival of AlpahFold2 | by Salvatore Raieli | Medium" srcset="https://substackcdn.com/image/fetch/$s_!e9L5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 424w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 848w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1272w, https://substackcdn.com/image/fetch/$s_!e9L5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd14a546-d65b-4198-add1-c7e976554eb9_1277x330.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding Protein Language Models&#8221; 
Series:</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a></p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-e1a">Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching</a></p></li><li><p>[This article] Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold</p></li></ol><div><hr></div><p><strong>Overview of the Main Ideas</strong></p><ol><li><p><strong>AlphaFold2&#8217;s MSA:</strong> AlphaFold2 identifies evolutionarily related proteins to the target sequence and builds a multiple sequence alignment (MSA). In the Evoformer block, row-wise (within-sequence) and column-wise (across-sequences) attention on this MSA yields information about co-evolving residues. This MSA-based representation is then integrated into a pair representation matrix, ultimately helping AlphaFold2 predict the 3D structure.</p></li><li><p><strong>ESMFold&#8217;s Language Model Encoding:</strong> In ESMFold, the MSA step is replaced by a large protein language model (ESM-2) trained via a Masked Language Modeling (MLM) objective. As in standard large language models for text, the hidden layers of the encoder learn semantic and syntactic regularities&#8212;in this case, biochemical and structural patterns. The result is that ESMFold can leverage these learned encodings to identify motifs and co-evolving positions without explicitly performing genetic database searches or building large MSAs.</p></li><li><p><strong>Conceptual Motif Lookup:</strong> We can interpret ESM-2&#8217;s embeddings as performing a &#8220;continuous fuzzy lookup&#8221; within an implicit database of protein motifs. Because the language model was pretrained on massive amounts of protein data, it has effectively learned how residues co-occur&#8212;and thus co-evolve&#8212;within protein families. This internal representation replaces the explicit MSA step.</p></li></ol><p>Below, we will dive into how this replacement works in more detail, starting with a short recap of AlphaFold2&#8217;s MSA-based pipeline and then exploring how ESMFold (and ESM-2 as its core) sidesteps explicit alignment by using learned representations.</p><div><hr></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><div><hr></div><h2>1. Revisiting AlphaFold2&#8217;s MSA-Based Approach</h2><h3>1.1 Gathering Evolutionary Information</h3><p>AlphaFold2 conducts genetic searches against databases such as MGnify, UniRef90, Uniclust30, and BFD to identify sequences that share evolutionary relationships with the target sequence. 
From these hits, it constructs an MSA:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MSA} \\;=\\; \\begin{pmatrix} s_{1,1} &amp; s_{1,2} &amp; \\dots &amp; s_{1,L} \\\\ s_{2,1} &amp; s_{2,2} &amp; \\dots &amp; s_{2,L} \\\\ \\vdots &amp; \\vdots &amp; \\ddots &amp; \\vdots \\\\ s_{S,1} &amp; s_{S,2} &amp; \\dots &amp; s_{S,L} \\end{pmatrix},&quot;,&quot;id&quot;:&quot;NLFVOYWKIH&quot;}" data-component-name="LatexBlockToDOM"></div><p>where L is the length of the target sequence, and S is the number of evolutionarily related sequences found. Here, s_{k,i} denotes the i-th residue of the k-th sequence in the alignment. By hypothesizing that residues co-evolve, the MSA is an external source of statistical correlations about which residues likely pair or contact each other in 3D space.</p><h3>1.2 Evoformer and Pair Representation</h3><p>In the AlphaFold2 pipeline:</p><ol><li><p><strong>MSA Representation</strong> <strong>M</strong>: A 3D tensor M,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{M} \\in \\mathbf{R}^{S \\times L \\times c}&quot;,&quot;id&quot;:&quot;TFKYKLSLZH&quot;}" data-component-name="LatexBlockToDOM"></div><p>where c is the dimensionality of each residue embedding.</p></li><li><p><strong>Pair Representation</strong> <strong>P</strong>: A 2D grid P,</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{P} = \\mathbb{R}^{L \\times L \\times c_z}&quot;,&quot;id&quot;:&quot;HOLIQNNHIB&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each P_{i,j} is a learned embedding representing the pairwise relationship between residue i and residue j in the target sequence.</p></li></ol><p>Inside the Evoformer block, row-wise and column-wise attention update the MSA representation:</p><ul><li><p><strong>Row-wise (Within-sequence) Attention</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{AttnRow}(\\mathbf{M})_{k, i} = \\sum_{m=1}^{L} \\alpha_{i,m} \\, \\bigl(W^V \\mathbf{M}_{k,m}\\bigr)&quot;,&quot;id&quot;:&quot;FLMQJXBFVQ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where &#945;_{i,m} are attention weights.</p></li><li><p><strong>Column-wise (Across-sequence) Attention</strong></p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{AttnCol}(\\mathbf{M})_{k, i} = \\sum_{n=1}^{S} \\beta_{k,n} \\, \\bigl(W^V \\mathbf{M}_{n,i}\\bigr)&quot;,&quot;id&quot;:&quot;BCDSEKOEAZ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where &#946;_{k,n} are attention weights.</p></li></ul><p>After these attention layers (plus MSA transitions via 2-layer MLP), AlphaFold2 computes an <strong>Outer Product Mean</strong> that integrates MSA embeddings into the pair representation:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathbf{OPM}_{i,j} \\;=\\; \\Bigl(\\frac{1}{S}\\sum_{k=1}^S \\mathbf{u}_{k,i} \\otimes \\mathbf{u}_{k,j}\\Bigr) \\,W_{\\mathrm{proj}}&quot;,&quot;id&quot;:&quot;SMHJJISUOS&quot;}" data-component-name="LatexBlockToDOM"></div><p>where u_{k,i} is the final MSA embedding vector for residue i in sequence k. This OPM_{i,j} is then added (or concatenated and projected) into P_{i,j}, effectively injecting co-evolutionary signals gleaned from the MSA into the residue-pair representation.</p><div><hr></div><h2>2. 
ESMFold: Replacing MSA with Language Modeling</h2><h3>2.1 The Core Mechanism: Encoder-Only Transformer</h3><p>ESMFold (and its backbone ESM-2 model) is built around a large encoder-only transformer. It is trained with the <strong>masked language modeling</strong> objective, meaning it tries to reconstruct masked or hidden residues from context. This training strategy, originally popularized by BERT in natural language processing, has an important effect: it forces the model to encode in its weights the relevant &#8220;contexts&#8221; that predict each amino acid.</p><p>Mathematically, if x=(x_1,x_2,&#8230;,x_L) is the protein sequence and x_k is replaced by a special [MASK] token with some probability, the MLM training objective is</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\mathcal{L}_{\\mathrm{MLM}} = -\\sum_{k=1}^{L} \\log p_\\theta(x_k \\mid x_1, \\ldots, x_{k-1}, \\text{[MASK]}, x_{k+1}, \\ldots, x_L)&quot;,&quot;id&quot;:&quot;UFYLUMHRMO&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>where p_&#952; is parameterized by the encoder transformer. Over billions of observed residues, the model internalizes the patterns of co-occurrence across diverse protein sequences.</p><h3>2.2 Implicit Motif Lookup</h3><p>Where AlphaFold2 uses explicit lookups in an MSA database (plus explicit attention across sequences), ESM-2&#8217;s learned embeddings do something analogous &#8220;in one shot.&#8221; After pretraining, the internal representation of each residue h_i (the hidden state at position i) captures average contexts encountered during training. In effect, for any position i,</p><ol><li><p>h_i has high similarity to h_j if residues x_i and x_j frequently appear in similar sequence contexts in the training set.</p></li><li><p>By extension, if an entire sequence <strong>x</strong> has patterns analogous to known motifs (e.g., an ATP-binding site pattern, a signal peptide motif, or secondary-structure fragments), then the embeddings reflect these patterns&#8212;allowing ESMFold to &#8220;retrieve&#8221; them without an explicit MSA.</p></li></ol><p>You can view this as a &#8220;continuous fuzzy matching&#8221; process, wherein the [KEY], [QUERY], and [VALUE] matrices of the transformer contain compressed representations of how residues co-occur. Rather than computing the dynamic-programming-based edit distances (or alignment) across a large external database, the model&#8217;s attention modules effectively do an alignment on-the-fly in a continuous, high-dimensional space.</p><h3>2.3 Integration into Folding</h3><p>ESMFold then appends a structure-prediction head on top of these ESM-2 embeddings, akin to how AlphaFold2 appends its structure module after the Evoformer. Even though ESMFold no longer has an explicit pair representation from an MSA, it still must estimate which residues interact or contact each other. In current ESMFold architectures:</p><ul><li><p>The final hidden states from the ESM-2 encoder are projected into a lower-dimensional representation that acts like a &#8220;pair embedding&#8221; for each (i,j).</p></li><li><p>A geometry module or a series of feed-forward layers further refines these embeddings to produce coordinates or distance/contact maps.</p></li></ul><p>In practice, ESMFold&#8217;s results are often on par with AlphaFold2 for many proteins, especially those with strong evolutionary constraints. 
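</p><p>To make the pair-embedding step above concrete, here is a minimal NumPy sketch of how per-residue hidden states from a language model might be projected and combined into a pairwise representation. This is an illustrative toy under assumed shapes and random weights, not ESMFold&#8217;s actual code.</p><pre><code>import numpy as np

# Assumed toy dimensions: L residues, d-dim hidden states, d_z-dim pair embedding
L, d, d_z = 7, 16, 8
rng = np.random.default_rng(0)

H = rng.normal(size=(L, d))          # per-residue hidden states from the encoder
W_left = rng.normal(size=(d, d_z))   # hypothetical learned projection for residue i
W_right = rng.normal(size=(d, d_z))  # hypothetical learned projection for residue j

# One vector per residue pair (i, j), built by broadcasting the two projections
left = H @ W_left                    # (L, d_z)
right = H @ W_right                  # (L, d_z)
pair = left[:, None, :] + right[None, :, :]   # (L, L, d_z)

# A downstream head could map each pair vector to, e.g., a contact logit
w_head = rng.normal(size=(d_z,))
contact_logits = pair @ w_head       # (L, L)
print(contact_logits.shape)          # (7, 7)</code></pre><p>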
For proteins with scant evolutionary data, ESMFold can sometimes do <em>better</em> than AlphaFold2, because it does not rely so heavily on a large MSA. On the flip side, certain proteins with well-studied deep MSAs can benefit from the explicit signals that AlphaFold2&#8217;s large MSA provides.</p><div><hr></div><h2>3. Mathematical Rationales for Replacing MSA</h2><h3>3.1 Complexity and Speed</h3><p>One major advantage of dropping MSAs is computational efficiency. MSA searches can be prohibitively expensive for large proteins or large sets of queries, requiring queries against massive databases (MGnify, UniRef, etc.) and heuristics to align thousands of sequences. In ESMFold:</p><ul><li><p><strong>No MSA Search:</strong> The model simply takes the query sequence and feeds it through the encoder in a single forward pass.</p></li><li><p><strong>Linear vs. Quadratic Complexity:</strong> A single Transformer forward pass for a sequence of length L has complexity O(L^2 d) where d is the dimension of embeddings, whereas building an MSA might involve searching and aligning thousands of sequences, each of length up to L.</p></li></ul><h3>3.2 Continuous Fuzzy Matching Perspective</h3><p>If we interpret the MSA as a form of nearest-neighbor search (looking for &#8220;neighboring&#8221; sequences in a large database), then the language model is effectively a learned data structure that has:</p><ul><li><p><strong>Compressed</strong> the manifold of known protein sequences into &#952; (the weights).</p></li><li><p><strong>Learned</strong> an attention-based mechanism to query that internal manifold for relevant contexts.</p></li></ul><p>In typical fuzzy string matching, one might compute edit distances between the query and every entry in the database. In the ESM-2 architecture, the attention mechanism:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Attn}(\\mathbf{Q},\\mathbf{K},\\mathbf{V}) = \\mathrm{softmax}\\Bigl(\\frac{\\mathbf{Q} \\mathbf{K}^T}{\\sqrt{d_k}}\\Bigr) \\mathbf{V}&quot;,&quot;id&quot;:&quot;JAWRVEHEKZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>acts as a trainable similarity function to identify relevant contexts. The intangible advantage is that these contexts may mix and match partial motifs from multiple &#8220;virtual neighbors,&#8221; creating a new representation not limited to the top few explicit matches in a database.</p><h3>3.3 Co-evolutionary Signals Without Explicit Alignments</h3><p>A major reason MSA is so powerful is that it captures <em>co-evolving residues</em>&#8212;positions that change in correlated ways across evolutionary history. In a typical MSA-based approach, if residue i mutates from A to G, residue j might consistently switch from T to S. Over many sequences, one infers that i and j likely contact or interact structurally.</p><p>By training on a massive corpus, the language model sees countless such correlations in raw sequence form. The emergent embeddings reflect these patterns. Hence, the final hidden state h_i is (indirectly) sensitive to all correlated positions that have ever appeared near that residue in training. So even though ESMFold does not align the query sequence to a database, it has internalized an approximate version of that same statistical correlation from its pretrained weights.</p><div><hr></div><h2>4. Example: From MSA to Language Model&#8212;A Toy Mathematical Sketch</h2><p>Suppose we consider a short hypothetical protein sequence <strong>x</strong>=(M,K,L,L,P,V,L). 
In an MSA-based approach, you might gather 10,000 sequences from a database, building:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\begin{pmatrix} M &amp; K &amp; L &amp; L &amp; P &amp; V &amp; L \\\\ M &amp; K &amp; L &amp; L &amp; T &amp; V &amp; L \\\\ M &amp; R &amp; L &amp; L &amp; P &amp; A &amp; L \\\\ \\vdots &amp; &amp; &amp; &amp; &amp; &amp; \\\\ \\text{(10,000 rows)} \\end{pmatrix}.&quot;,&quot;id&quot;:&quot;CPYPYAYXBH&quot;}" data-component-name="LatexBlockToDOM"></div><p>You then compute attention across these sequences (column-wise) and across residues (row-wise), deriving correlation maps.</p><p>In the ESM-2 approach, no explicit MSA is constructed. Instead, during training the model saw thousands (or millions) of sequences resembling (M,K,L,L,P,V,L) or partial subsequences thereof. The MLM objective forced the model to fill in [MASK] tokens in contexts like _&#8201;K&#8201;L&#8201;_&#8201;P&#8201;V&#8201;_. Over many instances, it learned which next residues are probable. As a result, once we feed (M,K,L,L,P,V,L) into ESM-2, the hidden states reflect a &#8220;compressed MSA,&#8221; effectively picking up correlations that used to require explicit cross-sequence operations.</p><div><hr></div><h2>5. Implications and Future Directions</h2><ol><li><p><strong>Efficiency Gains:</strong> ESMFold runs <em>significantly faster</em> than AlphaFold2 when no large MSA is available, since it avoids the alignment process. For proteome-scale structure predictions, this is a game-changer.</p></li><li><p><strong>Handling Novel Proteins:</strong> If a target protein has few homologs in public databases, MSA-based models struggle. ESMFold is robust in these &#8220;low-homology&#8221; cases since it learned general protein grammar from the entire training corpus.</p></li><li><p><strong>Limited Interpretability:</strong> One downside is that MSA-based approaches produce an explicit record of hits and alignments, which can be biologically interpretable (e.g., which species and families contributed the signals). ESMFold&#8217;s learned embedding, while powerful, can be less transparent.</p></li><li><p><strong>Hybrid Approaches:</strong> Some emerging methods combine pre-trained embeddings with an MSA for the best of both worlds&#8212;particularly for proteins where deep MSAs exist.</p></li><li><p><strong>Scaling Laws and Emergent Behavior:</strong> As ESM models grow (ESM-2, ESM-3, etc.), they exhibit emergent behaviors akin to large language models in NLP. This suggests we may see further improvements in structure prediction, function annotation, and protein design.</p></li></ol><div><hr></div><h2>6. Conclusion</h2><p>AlphaFold2&#8217;s success showed how vital MSAs are in revealing <em>co-evolutionary signals</em>, which guide 3D structure inference. ESMFold&#8217;s fundamental insight is that you can <em>pre-learn</em> these signals at massive scale by treating protein sequences as &#8220;language.&#8221; Then, instead of collecting an MSA at inference time, the model effectively &#8220;queries&#8221; its internal knowledge of sequence co-occurrences, learned through the MLM objective.</p><p>In both approaches, the central idea is to approximate how residues covary. In AlphaFold2, that covariance emerges explicitly from a large MSA. In ESMFold, it is embedded implicitly in a high-dimensional transformer space. 
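</p><p>As a toy numerical illustration of those two views (every value below is made up), the explicit route estimates covariation by counting over aligned columns, while the implicit route scores a residue pair with a dot product between learned embeddings:</p><pre><code>import numpy as np

# Explicit view: two columns of a tiny hypothetical MSA (6 aligned sequences)
col_i = np.array(list("AAGGAG"))
col_j = np.array(list("TTSSTS"))
co_vary = np.mean((col_i == "A") == (col_j == "T"))
print(co_vary)  # 1.0 -> the columns change together, hinting at a structural contact

# Implicit view: the same kind of signal read off learned per-residue embeddings
rng = np.random.default_rng(0)
h_i, h_j = rng.normal(size=16), rng.normal(size=16)   # assumed hidden states
score = float(h_i @ h_j / np.sqrt(h_i.size))          # attention-style similarity
print(round(score, 3))</code></pre><p>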
The advantage of the language-model approach is that it (1) eliminates the bottleneck of database searching/alignment, and (2) leverages far more global knowledge than just sequences that happen to align with the target protein.</p><p>Mathematically, we can view these approaches as two different ways to compute a &#8220;similarity function&#8221; over the manifold of protein sequences:</p><ul><li><p><strong>AlphaFold2 + MSA:</strong> An explicit alignment-based approach that organizes relevant sequences so the model can learn correlations.</p></li><li><p><strong>ESMFold + Transformer:</strong> A large-scale learned approach that stores correlation statistics in the weights, retrieving them through self-attention rather than explicit alignment.</p></li></ul><p>As these language models grow and become more accurate, their potential to replace, or at least augment, MSA-based pipelines will only increase&#8212;promising ever-faster and more versatile protein structure prediction.</p><p>In summary, ESMFold&#8217;s fundamental contribution is demonstrating how one can use a large, pretrained protein-language transformer to replicate (and in some cases surpass) the evolutionary context that an MSA provides. It is a step toward an era where generative models of protein sequence space might supersede explicit database lookups, enabling faster, more flexible, and equally accurate structure predictions&#8212;even for proteins with scarce evolutionary data.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching]]></title><description><![CDATA[How transformers learn from their input data]]></description><link>https://www.chrishayduk.com/p/understanding-protein-language-models-e1a</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-protein-language-models-e1a</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Wed, 15 Jan 2025 19:21:17 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!sMfG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sMfG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sMfG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 424w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 848w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sMfG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png" width="1456" height="795" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:795,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;transformer - Why GPT uses decoder only architecture, when they can use  full encoder decoder architecture? 
- Artificial Intelligence Stack Exchange&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="transformer - Why GPT uses decoder only architecture, when they can use  full encoder decoder architecture? - Artificial Intelligence Stack Exchange" title="transformer - Why GPT uses decoder only architecture, when they can use  full encoder decoder architecture? - Artificial Intelligence Stack Exchange" srcset="https://substackcdn.com/image/fetch/$s_!sMfG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 424w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 848w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1272w, https://substackcdn.com/image/fetch/$s_!sMfG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F97ec1cc1-2590-4cb8-947a-7e65144465a4_2026x1106.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Note: </strong>This post is part of the &#8220;Understanding Protein Language Models&#8221; Series:</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a></p></li><li><p>[This article] Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching</p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-40b">Understanding Protein Language Models Part III: Structure Prediction 
without Multiple Sequence Alignment in ESMFold</a></p></li></ol><div><hr></div><p>In the quest to understand modern protein language models like ESM2 and ESM3, we often focus on their impressive empirical results while treating their internal mechanisms as a black box. This post attempts to build intuition about how encoder-only transformers work by drawing an analogy to a simpler, well-understood algorithm: fuzzy string matching. I argue that encoder-only transformers can be viewed as performing a kind of continuous fuzzy lookup against a compressed form of their training data, encoded in their weights and latent space representations.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.chrishayduk.com/subscribe?"><span>Subscribe now</span></a></p><h2>The Central Analogy</h2><p>When working with large text corpora, we often need to find similar strings or patterns. The traditional approach employs fuzzy string matching: maintaining a database of all strings and computing edit distances to find matches. An alternative approach, which I argue is conceptually similar but mathematically more sophisticated, uses an encoder-only transformer to compress the patterns in the corpus into model weights, then uses attention mechanisms to find similarities.</p><p>Both approaches fundamentally solve the same problem - finding contextually appropriate matches - but do so in radically different ways. Understanding this connection helps demystify how encoder-only transformers work and suggests ways to improve them.</p><h2>How Encoder-Only Transformers Process Input</h2><p>To understand the analogy, we first need to build detailed intuition about how encoder-only transformers work. Unlike the full encoder-decoder architecture used in translation, encoder-only transformers take a sequence of tokens and return a sequence of the same length, where each output token is a refined representation incorporating contextual information.</p><p>The process begins in the embedding layer, where discrete tokens are converted into continuous vectors. Each input token is first converted to a one-hot encoding - a vector of zeros with a single one indicating the token's identity. This sparse vector is then multiplied by an embedding matrix to produce a dense vector representation. Mathematically, for a token x_i:</p><pre><code>one_hot = [0, 0, ..., 1, ..., 0] # 1 at position x_i 
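# (schematic pseudocode: W_emb is the learned embedding matrix with shape [vocab_size, d_model])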

embedding = one_hot @ W_emb # Matrix multiplication with embedding matrix</code></pre><p>To this embedding, we add a positional encoding vector that encodes information about where the token appears in the sequence. The original transformer paper used sinusoidal positional encodings:</p><pre><code>def get_positional_encoding(seq_len, d_model): 
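   # assumes numpy imported as np; d_model is assumed even so the sin/cos slices align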
   position = np.arange(seq_len)[:, np.newaxis] 
   div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model)) 
   pos_enc = np.zeros((seq_len, d_model)) 
   pos_enc[:, 0::2] = np.sin(position * div_term) 
   pos_enc[:, 1::2] = np.cos(position * div_term) 
   return pos_enc</code></pre><p>These positional encodings have an elegant property: the relative position of two tokens can be computed through linear combinations of their encodings.</p><p>The heart of the transformer architecture lies in its self-attention mechanism. For each position in the sequence, the model generates three vectors through learned linear transformations:</p><pre><code>Q = H @ W_Q # Query vectors 
K = H @ W_K # Key vectors 
V = H @ W_V # Value vectors</code></pre><p>where H is the matrix of hidden states. The attention scores are then computed as:</p><pre><code>attention_scores = softmax(Q @ K.T / sqrt(d_k)) 
output = attention_scores @ V</code></pre><p>This can be written more formally as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{Attention}(Q, K, V) = \\text{softmax}\\left(\\frac{QK^T}{\\sqrt{d_k}}\\right)V &quot;,&quot;id&quot;:&quot;AGZNKXYYMF&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>This attention mechanism allows each position to gather information from all other positions, with the weights determined by learned compatibility scores. The scaling factor $\sqrt{d_k}$ prevents the dot products from growing too large in magnitude, which would push the softmax into regions of extremely small gradients.</p><p>The transformer employs multiple attention heads in parallel, each with its own set of query, key, and value projections:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{MultiHead}(H) = \\text{Concat}(\\text{head}_1, \\ldots, \\text{head}_h)W^O&quot;,&quot;id&quot;:&quot;ZRQNCSGKPZ&quot;}" data-component-name="LatexBlockToDOM"></div><p>where each head is computed as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{head}_i = \\text{Attention}(HW^Q_i, HW^K_i, HW^V_i)&quot;,&quot;id&quot;:&quot;ICYATLGANP&quot;}" data-component-name="LatexBlockToDOM"></div><p>After attention, each position's representation goes through a two-layer feed-forward network:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{FFN}(x) = \\text{max}(0, xW_1 + b_1)W_2 + b_2&quot;,&quot;id&quot;:&quot;WFWABMVVDQ&quot;}" data-component-name="LatexBlockToDOM"></div><h2>Traditional Fuzzy String Matching</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xxZW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xxZW!,w_424,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 424w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_848,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 848w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1272,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1272w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1456,c_limit,f_webp,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xxZW!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif" width="547" height="335" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:335,&quot;width&quot;:547,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Fuzzy String Matching with `stringdist` &#8211; Just R Things&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Fuzzy String Matching with `stringdist` &#8211; Just R Things" title="Fuzzy String Matching with `stringdist` &#8211; Just R Things" srcset="https://substackcdn.com/image/fetch/$s_!xxZW!,w_424,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 424w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_848,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 848w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1272,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1272w, https://substackcdn.com/image/fetch/$s_!xxZW!,w_1456,c_limit,f_auto,q_auto:good,fl_lossy/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F57dde700-3ba6-49b6-bfcd-6f591b344001_547x335.gif 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To appreciate the analogy, we need to understand how traditional fuzzy string matching works. Given an input string and a database of reference strings, fuzzy matching computes the edit distance between the input and each reference. 
The edit distance represents the minimum number of operations (insertions, deletions, or substitutions) needed to transform one string into another.</p><p>The core of fuzzy string matching is the computation of edit distance. For strings s and t, the dynamic programming recurrence is:</p><pre><code>import numpy as np

def edit_distance(s, t): 
   m, n = len(s), len(t) 
   dp = np.zeros((m+1, n+1)) 

   # Initialize base cases 
   for i in range(m+1): 
      dp[i,0] = i 
   for j in range(n+1): 
      dp[0,j] = j 

   # Fill dp table 
   for i in range(1, m+1): 
      for j in range(1, n+1): 
         if s[i-1] == t[j-1]: 
            dp[i,j] = dp[i-1,j-1] 
         else: 
            dp[i,j] = 1 + min( 
               dp[i-1,j], # deletion 
               dp[i,j-1], # insertion 
               dp[i-1,j-1] # substitution 
            ) 
   
   return dp[m,n]</code></pre><p>Mathematically, the recurrence relation is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;D[i,j] = \\min \\begin{cases} D[i-1,j] + 1 &amp; \\text{deletion} \\\\ D[i,j-1] + 1 &amp; \\text{insertion} \\\\ D[i-1,j-1] + \\mathbb{1}_{s[i] \\neq t[j]} &amp; \\text{substitution} \\end{cases}&quot;,&quot;id&quot;:&quot;NQWAVECNJM&quot;}" data-component-name="LatexBlockToDOM"></div><p>The computation proceeds through dynamic programming, building a matrix where each cell represents the minimum number of operations needed to match a prefix of the input string to a prefix of the reference string. The final cell gives the total edit distance between the strings. By computing this distance for each reference string and sorting the results, we can find the closest matches in the database.</p><p>While conceptually simple and mathematically elegant, this approach becomes computationally expensive for large databases, requiring time proportional to both the string lengths and the size of the database. Various optimizations exist, such as trie structures and early pruning, but the fundamental challenge of scaling remains.</p><h2>The Transformer as Compressed Fuzzy Matching</h2><p>Here we arrive at the core insight: the encoder-only transformer effectively compresses the pattern-matching capabilities of fuzzy string matching into its weights. During training, the model learns to encode the essential patterns and relationships present in the training data into its parameters.</p><p>The embedding matrix learns to map tokens to a continuous space where similar tokens are close together. The attention weights learn which patterns of tokens commonly co-occur, while the feed-forward layers learn to combine these patterns into higher-level features. Each successive layer captures progressively more abstract relationships. This process is analogous to building an optimized index of the training data, but instead of storing exact strings, we store distributed representations of patterns and relationships.</p><p>The connection between transformers and fuzzy matching becomes clearer when we compare their similarity computations. In fuzzy matching, the similarity between strings s and t is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{similarity}(s,t) = -\\min_{\\text{operations}} \\sum_i \\text{cost}(\\text{op}_i)&quot;,&quot;id&quot;:&quot;ZDFABHWQEO&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>In transformer attention, the similarity between vectors q and k is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{similarity}(q,k) = \\frac{q \\cdot k^T}{\\sqrt{d_k}} &quot;,&quot;id&quot;:&quot;HFJIELSMFL&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>We can view the transformer's learned weights as parameterizing a continuous relaxation of edit distance. The attention mechanism implements this relaxed distance metric:</p><pre><code>def attention_similarity(query, key, value): 
   # query shape: [seq_len, d_k] 
   # key shape: [seq_len, d_k] 
   # value shape: [seq_len, d_v] 
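   # assumes numpy as np and a row-wise softmax, e.g. scipy.special.softmax(x, axis=-1)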

   scores = query @ key.T / np.sqrt(query.shape[-1]) 
   attention_weights = softmax(scores) # [seq_len, seq_len] 
   return attention_weights @ value # [seq_len, d_v]</code></pre><p>Several observations support this compression view. The systematic improvement in model performance with increasing size suggests that larger models can store more detailed patterns from the training data. Analysis of attention patterns reveals that different heads learn interpretable relationships that match linguistic or domain structure. The organization of the embedding space shows meaningful clustering of similar tokens and preservation of analogical relationships.</p><p>When we run a sequence through the transformer, the process mirrors fuzzy matching but operates in a continuous space. The initial embedding maps tokens to vectors, analogous to preparing strings for comparison. The self-attention mechanism computes similarity scores between positions, playing a role similar to edit distance calculation, but using a learned, context-dependent metric. Multiple layers progressively refine these representations, like iteratively improving string alignment.</p><p>We can formalize this connection mathematically. In fuzzy matching, similarity is measured as the negative minimum cost of operations needed to transform one string into another. In transformer attention, similarity is measured through scaled dot products between query and key vectors. The transformer effectively learns a continuous approximation of edit distance that can capture more nuanced relationships.</p><h2>Concrete Example: Protein Sequences</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-iKO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-iKO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 424w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 848w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1272w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-iKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png" width="600" height="536" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:536,&quot;width&quot;:600,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74056,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-iKO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 424w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 848w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1272w, https://substackcdn.com/image/fetch/$s_!-iKO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc494fb8-639a-4a44-8cd0-ade19516ec06_600x536.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Consider a concrete example from protein sequence analysis. In traditional fuzzy matching, we might have a query sequence "MKLLPVL" and search a database containing sequences like "MKLLTVL" (one substitution) or "MLKPVL" (two operations). Each comparison requires explicit computation of edit distances. This type of genetic database search is used by MSA-based models such as AlphaFold2 and AlphaFold3. 
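</p><p>As a quick sanity check of the explicit approach, we can run the edit_distance function defined earlier in this post on those toy sequences:</p><pre><code>query = "MKLLPVL"
database = ["MKLLTVL", "MLKPVL"]

# Rank the database entries by edit distance to the query
for seq in sorted(database, key=lambda t: edit_distance(query, t)):
    print(seq, edit_distance(query, seq))
# MKLLTVL 1.0  (one substitution)
# MLKPVL 2.0   (two operations)</code></pre><p>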
The mechanics of this genetic database search are described in the previous post in the series, <a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a>.</p><p>The transformer approach is markedly different. After embedding the sequence into continuous vectors, self-attention finds similar patterns that have been compressed into the model's weights during training. The output reflects patterns observed during training, but crucially, the model can combine these patterns in novel ways. The transformer effectively "remembers" which amino acid substitutions are biochemically plausible in each context, without storing explicit sequences.</p><p>In the next post in this series, we will flesh out this section more and learn how protein language models allow us to replace the MSA step in AlphaFold, creating faster structure prediction models that generalize better to proteins that have few available related sequences.</p><h2>Conclusion</h2><p>Viewing encoder-only transformers as performing compressed fuzzy matching provides powerful intuition about their operation. Rather than seeing them as black boxes, we can understand them as learning to compress and query a vast database of patterns from their training data. This perspective suggests that improvements in transformer architecture may come from better compression techniques for storing training patterns, more efficient similarity computations, and explicit incorporation of string matching algorithms.</p><p>Future research might investigate how much pattern information is stored in different parts of the model, how different architectures affect compression quality, and whether we can design better compression mechanisms inspired by string algorithms. We might also explore the theoretical limits of this compression approach and its implications for model scaling.</p><p>The success of this architecture in domains like protein sequence modeling suggests that the ability to learn and compress domain-specific similarity metrics is a powerful paradigm. As we continue to develop these models, maintaining this conceptual connection to classical algorithms may help guide the way to more efficient and effective architectures.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[OpenAI o3 and the Rise of the Intelligence Allocator]]></title><description><![CDATA[The implications of rapidly increasing inference costs]]></description><link>https://www.chrishayduk.com/p/openai-o3-and-the-rise-of-the-intelligence</link><guid isPermaLink="false">https://www.chrishayduk.com/p/openai-o3-and-the-rise-of-the-intelligence</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Fri, 20 Dec 2024 19:19:58 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2iKn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2iKn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2iKn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2iKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg" width="1024" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:176607,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!2iKn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!2iKn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3188a284-6a08-423d-8f8c-964db47f75e6_1024x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>OpenAI's announcement of their o3 series of models represents a pivotal moment in AI development - but not for the reasons many might expect. While the headline achievement of 87% on ARC AGI is impressive, the more transformative aspect lies in the economics of the model's deployment.</p><p>Let's start with the raw numbers: A single inference task on o3 at its highest compute setting costs over $1,000 (see the below figure). This isn't $1,000 per evaluation set or per session - this is per individual task. To put this in perspective, that's roughly equivalent to 5-10 hours of skilled human labor cost, dedicated to solving a single problem. The model offers lower compute settings, but with corresponding decreases in capability. This creates a direct tradeoff between cost and intelligence that we haven't had to grapple with before.</p><p>This cost structure represents a sharp departure from the trend we've observed over the past two years. During that period, the cost of running general-purpose language models has approached zero, even as their capabilities have steadily improved. GPT-3.5 became GPT-4, yet inference costs remained relatively stable. 
GPT-4 then became GPT-4 Turbo and GPT-4o, maintaining intelligence while rapdily decreasing inference costs. This led to a proliferation of AI applications - we could afford to experiment freely, integrating AI into virtually every workflow to see what stuck.</p><p>The o3 series shatters this paradigm. When each inference costs more than a decent laptop, you can't simply "throw AI at the problem" anymore. Every use of high-compute o3 needs to be justified by the value it creates. This introduces what we might call the "inference allocation problem" - how do we determine which tasks are worth deploying our most powerful (and expensive) models on?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GwAu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GwAu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 848w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GwAu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg" width="1194" height="670" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:670,&quot;width&quot;:1194,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45229,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GwAu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 424w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!GwAu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!GwAu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd1de8e6-34c6-43e0-bb18-4bbfb4b49874_1194x670.jpeg 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.chrishayduk.com/subscribe?"><span>Subscribe now</span></a></p><p>Consider a software development team using o3 for code analysis. Running the model at high compute to analyze a critical security vulnerability in a payment system might be easily justifiable. But what about using it to optimize a non-critical internal tool? Or to review routine pull requests? The team now needs to develop frameworks for making these decisions systematically.</p><p>This fundamentally transforms AI deployment into a capital allocation problem. Just as investment managers spread limited capital across opportunities to maximize returns, organizations must now optimize their allocation of inference compute to maximize value creation.</p><p>Consider a hypothetical AI budget of $1 million per month. Currently, this might support tens or hundreds millions of GPT-4o inferences spread across hundreds of different use cases. With o3, the same budget only covers about 1,000 high-compute inferences. This scarcity forces us to think like capital allocators: Which thousand problems, if solved with our highest level of artificial intelligence, will generate the most value?</p><p>Beyond simply identifying high-value problems, intelligence allocators will need to understand the relationship between compute investment and value creation. 
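</p><p>A toy back-of-the-envelope sketch of that relationship (every number below is hypothetical and chosen purely for illustration):</p><pre><code># Hypothetical compute tiers for a single task
tiers = {
    "low":    {"cost": 10,   "expected_value": 200},
    "medium": {"cost": 100,  "expected_value": 800},
    "high":   {"cost": 1000, "expected_value": 1000},
}

for name, t in tiers.items():
    value_per_dollar = t["expected_value"] / t["cost"]
    print(name, t["cost"], t["expected_value"], round(value_per_dollar, 1))
# Under these made-up numbers, "medium" maximizes value per dollar,
# while "high" maximizes absolute value; the right choice depends on the budget constraint.</code></pre><p>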
<p>In another parallel to traditional capital allocation, just as investors develop frameworks for evaluating investments across different sectors and risk levels, organizations will need frameworks for evaluating AI compute allocation across different use cases. These frameworks will need to consider factors like:</p><ul><li><p>The value delta between using high-compute versus lower-compute models</p></li><li><p>The cost of being wrong or suboptimal</p></li><li><p>The potential for value capture from improved accuracy</p></li><li><p>The frequency with which the task needs to be performed</p></li></ul><p>We might even see the emergence of "AI portfolio theory" - methods for optimizing the allocation of compute resources across different types of tasks to maximize expected return while managing risk. Some organizations might adopt a "barbell strategy" - using basic models for routine tasks while reserving expensive high-compute inferences for their most critical problems.</p>
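<p>As a toy illustration of what such an "AI portfolio" decision could look like, the sketch below greedily spends a fixed monthly inference budget on the task types with the highest expected value per dollar, which naturally produces a barbell-like split between cheap routine work and expensive critical work. Every task name and number is a made-up placeholder, not a real price or benchmark.</p><pre><code># Toy "AI portfolio" allocator: spend a fixed budget on the task types with
# the highest expected value per dollar. All figures are illustrative.
tasks = [
    # (name, cost per inference ($), expected value per inference ($), monthly volume)
    ("payment-system security review", 1_000, 50_000, 40),
    ("internal tool optimization",      1_000,  1_500, 200),
    ("routine PR review",                   1,      5, 50_000),
]

def allocate(budget, tasks):
    plan = []
    # Fund the highest expected-value-per-dollar work first
    for name, cost, value, volume in sorted(tasks, key=lambda t: t[2] / t[1], reverse=True):
        runs = min(volume, int(budget // cost))
        if runs:
            plan.append((name, runs, runs * cost))
            budget -= runs * cost
    return plan, budget

plan, leftover = allocate(1_000_000, tasks)
for name, runs, spend in plan:
    print(f"{name}: {runs} runs, ${spend:,}")
print(f"unallocated: ${leftover:,}")
</code></pre>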
<p>For AI engineers, this shift means that success will depend less on pure technical implementation and more on developing the frameworks and metrics needed to make intelligence allocation decisions effectively. The best AI engineers will be those who can think like capital allocators, understanding both the technical capabilities and the business value of different compute investments.</p><p>In this light, o3 represents the beginning of an era where artificial intelligence must be treated as a scarce resource requiring careful allocation. The organizations that thrive will be those that develop robust frameworks for deploying this resource where it can generate the highest returns.</p><p>The future of AI might look less like unlimited abundance and more like traditional capital markets, where success comes from making smart allocation decisions with limited resources. As models continue to become more powerful and computationally intensive, these allocation skills will only become more crucial.</p>]]></content:encoded></item><item><title><![CDATA[On Algorithmic Moats and the Path to AGI]]></title><description><![CDATA[Google's path to winning the AI race]]></description><link>https://www.chrishayduk.com/p/on-algorithmic-moats-and-the-path</link><guid isPermaLink="false">https://www.chrishayduk.com/p/on-algorithmic-moats-and-the-path</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Thu, 19 Dec 2024 21:59:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pLfc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!pLfc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F065387d0-4f46-4d37-903f-6704a2a21ffa_1024x768.jpeg" width="1024" height="768" alt=""></figure></div><p>The past few weeks have provided a remarkable natural experiment in AI development dynamics. OpenAI releases what appears to be a breakthrough technology, and Google promptly demonstrates superior capabilities:</p><ul><li><p>OpenAI's Sora demonstrated remarkable text-to-video generation, only to be superseded by Google's Veo 2 with notably higher quality output</p></li><li><p>OpenAI's o1 introduced novel "thinking" capabilities, followed within weeks by Google's Gemini 2.0 Flash Thinking implementing similar functionality</p></li><li><p>Gemini 2.0 has now surpassed both GPT-4 and Claude Sonnet across a broad range of benchmarks</p></li></ul><p>This pattern reveals something fundamental about the nature of competitive advantage in artificial intelligence.
To understand why this Google dominance was inevitable, we need to examine a broader principle: the myth of algorithmic moats.</p><h1>Algorithmic Moats</h1><p>It has frequently been said that part of Silicon Valley's success is the lack of non-compete clauses for employees. This allowed trade secrets to proliferate rapidly in the Bay Area, creating more efficient competition dynamics and allowing many engineers to learn from each other, rather than restricting learnings and competitive advantages to a single firm.</p><p>However, I rarely see this same line of argument applied to business moats. If this holds true, then it implies that algorithms alone <em>cannot</em> provide a durable moat to a business. Employees can easily leave one company and take all of its hard-won knowledge to a competitor, allowing the competitor to catch up.</p><p>Consider a thought experiment: You discover a revolutionary new algorithm. How long can you maintain that advantage? In a world of mobile talent and reverse engineering, the half-life of algorithmic secrets approaches zero as their value approaches infinity.</p><p>This creates what we might call the algorithm diffusion principle: Any sufficiently valuable algorithm will spread through the industry at a rate proportional to its perceived importance. Silicon Valley's prohibition on non-compete clauses accelerates this process, creating an upper bound on how long any single player can maintain algorithmic superiority.</p><p>Hence, algorithms only provide moats insofar as they facilitate the construction of another type of moat. When we talk about algorithmic moats, we're really discussing two separate concepts: the technical implementation details that can be replicated, and the emergent properties that arise from being first to market with those implementations.</p><p>Consider Google's own history with PageRank. While revolutionary for its time, the core insight &#8211; that incoming links could be weighted by the importance of their source &#8211; was relatively straightforward to replicate once published. What made Google dominant wasn't PageRank itself, but rather the virtuous cycle it enabled: better search results &#8594; more users &#8594; more data &#8594; even better search results. The algorithm was merely the catalyst for building a data moat.</p><p>This pattern repeats across the technology landscape. Spotify's recommendation algorithms, while sophisticated, aren't what prevent users from switching to Apple Music or YouTube Music. Instead, it's the years of accumulated listening history, carefully curated playlists, and social sharing features that create switching costs.
The algorithms enable these benefits, but they aren't the moat themselves.</p><h1>Moats on the Path to AGI</h1><p>The implications of the lack of direct algorithmic moats become clear when we consider AGI development as a function of three primary variables:</p><ol><li><p>Algorithmic innovation (A)</p></li><li><p>Computational resources (C)</p></li><li><p>Training data quality and quantity (D)</p></li></ol><p>We might express AGI capability as: AGI_capability = A * f(C,D)</p><p>Where f(C,D) represents the effective utilization of compute and data. The algorithm diffusion principle suggests that A will quickly equilibrate across major players. Therefore, the decisive factor becomes f(C,D).</p><p>This is where Google's position becomes overwhelming. Consider their structural advantages:</p><p>Data Supremacy:</p><ul><li><p>Google Search: The world's most comprehensive map of human knowledge and intent</p></li><li><p>YouTube: The largest repository of human audio-visual communication</p></li><li><p>Google Books/Scholar: A near-complete corpus of formal human knowledge</p></li><li><p>Android/Gmail: Vast behavioral and communication datasets</p></li></ul><p>Compute Dominance:</p><ul><li><p>Custom TPU architecture optimized for AI workloads</p></li><li><p>Vertical integration from silicon to software</p></li><li><p>World-class data center infrastructure</p></li><li><p>Decades of distributed systems optimization</p></li></ul><p>These advantages compound non-linearly. Having twice the data and twice the compute doesn't yield four times the capability &#8211; it might yield eight times or more due to emergent properties in large-scale systems.</p><p>The recent pattern of Google rapidly matching and exceeding OpenAI's innovations perfectly illustrates this dynamic. When OpenAI develops a new technique, Google can quickly replicate it (algorithm diffusion) and then apply it with vastly superior resources, achieving better results almost immediately.</p><p>This creates what game theorists would call a dominant strategy for Google: Wait for algorithmic innovations, replicate them with superior resources, and achieve better results than the original inventors. The math becomes almost deterministic.</p><p>One might object that breakthrough algorithms could create discontinuous advantages that trump resource differences. However, the observed scaling laws in neural networks suggest otherwise. The smooth power-law relationships we've seen indicate that resource advantages compound predictably rather than being disrupted by algorithmic breakthroughs.</p><p>In retrospect, the tech industry's focus on OpenAI and other startups represents a failure to reason from first principles. In a world where algorithmic innovations cannot be contained, the player with overwhelming advantages in compute and data will inevitably emerge victorious. Google's position isn't just strong &#8211; it's strategically dominant in a game-theoretic sense.</p><p>The universal rule of algorithmic diffusion suggests a surprising corollary: The most effective strategy for other players might not be to compete directly with Google, but rather to focus on specialized domains where Google's general-purpose advantages are less relevant. 
This could, ironically, lead to a more specialized and diverse AI ecosystem than many currently predict.</p>]]></content:encoded></item><item><title><![CDATA[ESM3 and the Future of Protein Language Models]]></title><description><![CDATA[Pure sequence learning is out, multiscale data is in]]></description><link>https://www.chrishayduk.com/p/esm3-and-the-future-of-protein-language</link><guid isPermaLink="false">https://www.chrishayduk.com/p/esm3-and-the-future-of-protein-language</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Tue, 25 Jun 2024 14:16:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!RYPe!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd672334e-257b-435e-93f6-3faa34cad41c_1280x680.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!RYPe!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd672334e-257b-435e-93f6-3faa34cad41c_1280x680.png" width="1280" height="680" alt="EvolutionaryScale Debuts With ESM3 Generative AI Model for Protein Design | NVIDIA Blog"></figure></div><p>EvolutionaryScale, a team spun out of Meta&#8217;s AI research department, <a href="https://www.evolutionaryscale.ai/blog/esm3-release">today released ESM3</a>, the sequel to the hugely popular ESM2 protein language model that was released in 2022.
On the heels of this model release announcement, EvolutionaryScale also announced that <a href="https://techcrunch.com/2024/06/25/evolutionaryscale-backed-by-amazon-and-nvidia-raises-142m-for-protein-generating-ai/">they had raised a staggering $142 million in their seed round</a>. But what&#8217;s so different about ESM3 from previous iterations of protein language models? Does it warrant this level of hype?</p><p>To understand ESM3 and the surrounding hype, we&#8217;ll start with an overview of how protein language modeling has traditionally been done and use that as a springboard to see why ESM3 may represent a large step forward.</p><div><hr></div><p><strong>Note</strong>: If you&#8217;d like to learn more about protein language models before diving in here, check out my &#8220;Understanding Protein Language Models&#8221; series!</p><ol><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models">Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</a></p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-e1a">Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching</a></p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-40b">Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold</a></p></li></ol><div><hr></div><h1>Protein Language Modeling Overview</h1><p>The goal of protein language modeling is to <em>understand </em>the space of all possible proteins through training on sequence data. To that end, protein language models leverage the underlying architecture that powers all of the advances in natural language text &#8212; that is, the transformer. </p><h2>Tokenization</h2><p>Just as with text language models, protein language models operate on &#8220;tokens&#8221;, or discrete chunks that input sequences have been divided into. The space of all possible tokens is referred to as the vocabulary of the model. In natural language text this is a bit more difficult because different tokenization schemes can profoundly affect performance &#8212; if we chunk text at the level of letters, our vocabulary is likely far too small and our model will need much more data to learn. It will also be limited in the length of sequences it can read since each character will use up 1 token of its context window (which can be thought of as the model&#8217;s working memory). But if we chunk by word, our token vocabulary will now be tens of thousands of words long and the model will again struggle to learn since it may only see rare words once or twice. For protein language models, on the other hand, we have a natural tokenization point at the level of amino acids! Amino acids are the organic compounds that act as the building blocks for proteins in organisms (in fact, DNA&#8217;s primary role is to provide instructions to each organism&#8217;s cells about protein production &#8212; what proteins should be produced, how many, and at what time).
There are 22 distinct amino acids that are found somewhere in the genetic code of life (though only 20 are found in the human body), and these form the basis for any protein language model&#8217;s vocabulary.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!MrvP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcef0134d-2482-473b-b5c7-e18eee282f87_1220x344.png" width="1220" height="344" alt=""><figcaption class="image-caption">Tokenization in natural language text models</figcaption></figure></div>
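<p>As a concrete (and purely illustrative) sketch of what amino-acid-level tokenization looks like, the snippet below builds a toy vocabulary from the 20 standard amino acid letters plus a few special tokens and maps a sequence to token ids. Real models like ESM2 and ESM3 define their own special tokens and vocabularies; this is just the general idea.</p><pre><code># Toy amino-acid tokenizer: one token per residue (illustrative only).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"           # the 20 standard residues
SPECIALS = ["[PAD]", "[MASK]", "[CLS]", "[EOS]"]

vocab = {tok: i for i, tok in enumerate(SPECIALS)}
vocab.update({aa: i + len(SPECIALS) for i, aa in enumerate(AMINO_ACIDS)})

def encode(sequence):
    """Map a protein sequence to token ids, one id per amino acid."""
    return [vocab["[CLS]"]] + [vocab[aa] for aa in sequence] + [vocab["[EOS]"]]

print(encode("MKTAYIAKQR"))   # [2, 14, 12, 20, 4, 23, 11, 4, 12, 17, 18, 3]
</code></pre>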
<h2>Training Setup &amp; Loss Function</h2><p>Once we have a tokenization scheme, we need to train the model to perform our desired task. While text language models have focused squarely on generation, protein language models have instead focused on producing useful encodings of the target protein. To that end, they have typically used a bidirectional transformer with masked language modeling (MLM) loss rather than the unidirectional transformers with autoregressive loss.
I&#8217;ll define these terms below:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!1TQp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f85a05f-f27a-45a1-a6ae-40978aa740d2_1210x795.png" width="1210" height="795" alt=""><figcaption class="image-caption">Bidirectional transformer with masked language modeling loss</figcaption></figure></div>
<ul><li><p><strong>Bidirectional </strong>- the transformer model can read a sequence both forwards and backward</p></li><li><p><strong>Masked language modeling loss </strong>- we pick random tokens in the input sequence and hide them (or mask them) from the model.
We then score the transformer model based on whether or not it can guess this hidden token given the rest of the input sequence (see the image above)</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!vR2B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08729f45-ace9-419d-80dd-4520c878cfac_2300x1164.png" width="1456" height="737" alt=""><figcaption class="image-caption">Unidirectional transformer with autoregressive loss</figcaption></figure></div></li>
<li><p><strong>Unidirectional </strong>- the transformer can only read a sequence forwards. That is, future tokens cannot influence past tokens in an input sequence</p></li><li><p><strong>Autoregressive loss </strong>- we hide the last token in a sequence and score the model based on its ability to predict that token using all of the tokens that came before it</p></li></ul><p>Protein language models, by using the bidirectional model with MLM loss setup, are able to reference both <em>previous </em>and <em>future </em>amino acids when generating a representation for a given amino acid in a sequence. This allows the model to learn with less training data since it is an easier task than the unidirectional autoregressive case. In addition, it allows the model to attend to amino acids that may be far apart in the amino acid sequence but actually very close together in the resulting 3D protein structure.</p>
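<p>Here is a minimal, self-contained sketch of how a masked-language-modeling training example can be built from a protein sequence; the 15% masking rate is an assumption for illustration rather than ESM2&#8217;s exact recipe.</p><pre><code>import random

# Illustrative construction of a masked-language-modeling example for a
# protein sequence. The model is scored only at the masked positions and
# can use context from BOTH directions to fill them in.
MASK_RATE = 0.15            # assumed for illustration; real recipes vary
MASK_TOKEN = "[MASK]"

def make_mlm_example(sequence, seed=0):
    random.seed(seed)
    tokens = list(sequence)              # one token per amino acid
    labels = [None] * len(tokens)        # None = position ignored by the loss
    for i in range(len(tokens)):
        if random.random() > MASK_RATE:  # leave most positions unchanged
            continue
        labels[i] = tokens[i]            # the model must recover this residue
        tokens[i] = MASK_TOKEN
    return tokens, labels

tokens, labels = make_mlm_example("MKTAYIAKQR", seed=1)
print(tokens)   # the original sequence with a few residues replaced by [MASK]
print(labels)   # the identity of the hidden residue at each masked position
</code></pre>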
<h2>ESM2 Results</h2><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!55Qs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2472e998-43bb-4472-ae20-f27a3f03f3af_1050x443.jpeg" width="1050" height="443" alt="Protein Structure Prediction : A Primer (Part 5) | by Siddhant Rai | Medium"></figure></div>
<p>Given the above tokenization scheme and training setup, ESM2 (and other protein language models like it) was able to produce some pretty impressive results. The base model, which was only trained to identify missing amino acids in a protein sequence, could be finetuned to perform tasks like protein function prediction, protein-protein interaction prediction, or protein structure prediction (as shown in the image above). It was able to perform these predictions very quickly when compared to its contemporary competitors such as AlphaFold2 due to the efficiency of the language modeling approach.
However, by the same token, its accuracy in structure prediction was generally worse than AlphaFold&#8217;s.</p><h1>Limitations of Pure Sequence Data</h1><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!tIWf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc9ff10b-0bd3-4d8a-984e-a1facabedef9_1500x1125.jpeg" width="1456" height="1092" alt=""></figure></div>
<p>While ESM2 and other previous protein language models showed impressive results across several tasks, they have been fundamentally limited by their reliance on pure sequence data. This approach, while valuable, fails to capture the full complexity of biological systems and the hierarchical nature of protein interactions.</p><p>The key issue lies in the nature of DNA and protein data compared to natural language text. In language models trained on text, we observe a natural multiscale learning process. Text data contains paired instruction and answer data at multiple levels of abstraction, allowing models to learn tasks ranging from simple summarization to complex synthesis of multiple texts. For example, there may be a task-answer pair that says "Please summarize this passage of text", and then another that says "Please synthesize these summaries of different pieces of text into a single narrative". We move from one task (compressing text on a single topic into a summary) into a higher-level version of that same task (compressing multiple texts on multiple topics into a single summary). In other words, we're ascending a ladder of complexity in the types of instructions the LLM is learning to perform. <strong>This all happens naturally as part of the LLM training process because text is both the instruction and the answer. Or, to put it in the parlance of von Neumann computing, text is the program and it is the data (just as bits are both the program and the data in the von Neumann computer architecture).</strong></p><p>Now let us consider a DNA language model training on DNA sequence. We can think of DNA sequences as instructions that detail how to produce a protein (just as we looked at text instructions above). The protein can then itself be viewed as a higher-level instruction (namely, how should this molecule interact with other molecules in the body). These interactions can be seen as yet higher-level instructions for how to compose reaction pathways in the body. And so on up the chain. </p><p>The core idea here is that if we train on <em>only </em>DNA sequences, we see the instructions at an early stage without ever viewing the <em>solution</em> (the protein sequence &amp; structure).
Moreover, once we "complete" this instruction (DNA -&gt; protein), to do anything useful we need to continue climbing the hierarchy (i.e. now our protein shape becomes the instruction and we use it to identify complexes &amp; interactions). But our DNA language model never sees how to do this from its DNA sequence data and thus can never learn how to ascend the hierarchy. <strong>We don't get this natural multiscale learning the way we do in text because DNA is not universal. The modality changes as we move up the ladder of complexity, whereas text is always text, regardless of the complexity or level of granularity of the instruction-answer pair.</strong> </p><p><strong>Hence, to make biological foundation models that replicate the success of LLMs in text, we need a way to encode all of the various modalities and learn them jointly in a single model. We also need to accelerate data collection efforts across these modalities rather than focusing purely on the lowest levels (i.e. DNA &amp; protein sequencing).</strong></p><h1>ESM3 &amp; Multiscale Data</h1><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!YkS1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa481af76-20ca-4955-8d6c-81e474146d9e_1912x1671.png" width="1456" height="1272" alt=""></figure></div>
<p>ESM3 addresses this limitation by incorporating a form of multiscale data into its training process. Instead of focusing solely on amino acid sequence data as ESM2 did, it integrates:</p><ol><li><p><strong>Atomic coordinates:</strong> Providing information about protein structure</p></li><li><p><strong>Sequence data:</strong> Offering the fundamental building blocks of proteins</p></li><li><p><strong>Function data:</strong> Giving context to the protein&#8217;s role in biological systems</p></li></ol><p>All three of these modalities are tokenized and learned jointly by the model (see image above). That is, unlike ESM2, which only made predictions for a protein&#8217;s amino acid sequence, ESM3 learns to simultaneously make predictions about a given protein&#8217;s amino acid sequence, 3D structure, and high-level functional details. 
This multiscale approach allows ESM3 to jointly learn about proteins at multiple levels of abstraction:</p><ul><li><p><strong>Low-level</strong>: Understanding what sequence codes for a particular protein</p></li><li><p><strong>Mid-level:</strong> Comprehending the protein's shape after folding</p></li><li><p><strong>High-level:</strong> Grasping the protein's function(s) in nature</p></li></ul><p>By learning to reason about proteins across these multiple scales, ESM3 will likely achieve significant performance improvements, particularly in generative tasks that require integration of knowledge from all three scales. The EvolutionaryScale team validated this with a case study in which they designed a new fluorescent protein that had never been seen before in nature (image below). </p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!nwdO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3581f4-71bb-4de3-983b-35c7834bcaa7_1080x1080.png" alt="" /></figure></div><p>They did this by specifying high-level functional details, important protein structure requirements for fluorescence, and known amino acid sequences that code for those structural snippets. Given this conditioning data, the model was able to generate the remainder of the protein through reasoning about the constraints across all three scales of complexity: sequence, structure, and function.</p>
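<p>To make the idea of joint, multi-track learning concrete, here is a minimal PyTorch sketch of one way a single transformer could consume per-residue sequence, structure, and function tokens and predict masked positions on any track. The vocabulary sizes, dimensions, and masking convention are illustrative assumptions on my part, not EvolutionaryScale&#8217;s actual architecture or tokenizers.</p><pre><code># Illustrative sketch (not the ESM3 implementation): fuse per-residue sequence,
# structure, and function tokens so one transformer can fill in masked positions
# on any track. All vocab sizes and dimensions below are made up for the example.
import torch
import torch.nn as nn

class MultiTrackProteinModel(nn.Module):
    def __init__(self, d_model=256, seq_vocab=33, struct_vocab=4099, func_vocab=260):
        super().__init__()
        # One embedding table per modality; embeddings are summed per residue position.
        self.seq_emb = nn.Embedding(seq_vocab, d_model)
        self.struct_emb = nn.Embedding(struct_vocab, d_model)
        self.func_emb = nn.Embedding(func_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        # One output head per modality, so any track can be predicted from the shared trunk.
        self.seq_head = nn.Linear(d_model, seq_vocab)
        self.struct_head = nn.Linear(d_model, struct_vocab)
        self.func_head = nn.Linear(d_model, func_vocab)

    def forward(self, seq_tokens, struct_tokens, func_tokens):
        x = self.seq_emb(seq_tokens) + self.struct_emb(struct_tokens) + self.func_emb(func_tokens)
        h = self.trunk(x)
        return self.seq_head(h), self.struct_head(h), self.func_head(h)

# Toy usage: mask (token id 0, by convention here) a stretch of structure tokens and
# ask the model to predict them from the sequence and function context.
model = MultiTrackProteinModel()
seq = torch.randint(1, 33, (1, 120))       # amino acid tokens
struct = torch.randint(1, 4099, (1, 120))  # discretized local-structure tokens
func = torch.randint(1, 260, (1, 120))     # function annotation tokens
struct[:, 40:60] = 0                       # "mask" part of the structure track
seq_logits, struct_logits, func_logits = model(seq, struct, func)
print(struct_logits.shape)                 # torch.Size([1, 120, 4099])
</code></pre>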
<h1>Conclusion</h1><p>Overall, ESM3 warrants the hype and signals a potential paradigm shift in the field of protein language modeling. It represents a first step in moving from an era focused on scaling up amino acid sequence data alone towards an era focused on integrating diverse, multiscale data sources. This approach aligns more closely with the success of large language models in natural language processing, where the universality of text allows for seamless learning across various levels of complexity. By incorporating multiple modalities of biological data, ESM3 and its successors will be much better positioned to replicate this success in the biological domain.</p><p>Moving forward, this shift implies a need for accelerated data collection efforts across various biological modalities, rather than focusing solely on protein sequencing and structure determination. 
For ESM4 to truly serve as a foundation model for biology the way GPT-4 has served as a foundation model for text, we will need to go beyond sequence, structure, and function to include reaction pathways, cellular expression levels, and more.</p>]]></content:encoded></item><item><title><![CDATA[A Perspective on the Limitations of Language Modeling]]></title><description><![CDATA[Probing the upper limits of compute required for AGI]]></description><link>https://www.chrishayduk.com/p/a-perspective-on-the-limitations</link><guid isPermaLink="false">https://www.chrishayduk.com/p/a-perspective-on-the-limitations</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Sat, 22 Jun 2024 18:29:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!A-mW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!A-mW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png" alt="Taking brain simulation to the next level &#8211; the multi-scale approach" /></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Taking brain simulation to the next level &#8211; the multi-scale approach&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Taking brain simulation to the next level &#8211; the multi-scale approach" title="Taking brain simulation to the next level &#8211; the multi-scale approach" srcset="https://substackcdn.com/image/fetch/$s_!A-mW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!A-mW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!A-mW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!A-mW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74d12efe-e510-4212-a1e4-744e0b04e609_1920x1080.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Scenario #1: </strong>Imagine you want to model a sequence of coin flips. Your goal is to accurately predict the result of the next coin flip given the history of all previous flips in this sequence. 
<p><strong>Scenario #2: </strong>Now instead imagine that we have enough compute to model each coin toss analytically. We can simulate the impact of air resistance with CFD, the angle of the coin as it leaves the tosser&#8217;s thumb, etc. Given enough compute and sufficient accuracy of the measurements for the initial conditions, we should be able to predict the result of the coin toss with near 100% accuracy using the standard classical mechanics that have allowed us to put satellites into orbit and men on the moon.</p><p>These two scenarios deal with the same underlying process (predicting a coin flip) but address that process from two totally different perspectives. Scenario #1 approaches the prediction task from an <strong>external perspective.</strong> It attempts to model the process without actually understanding any of the internal mechanisms that generate it. In the case of the coin flip, no matter how well we model from this external perspective, we can <em>never</em> achieve better than 50% accuracy on a large enough test set. The fundamental limit is not in the data or in the model but in the <em>perspective from which we are modeling.</em></p><p>By contrast, in Scenario #2, we approach the prediction task using an <strong>internal perspective. </strong>We analyze the causal factors that contribute to the outcome of each coin flip and model those factors, allowing us to make predictions based on the mechanics underlying the flip rather than simply using the sequence itself. Here, given enough compute and sufficiently sensitive measurement devices, we can far exceed the 50% accuracy ceiling that limited the external perspective.</p><p>LLMs model human thought using the external perspective described above and, as such, will have a large amount of error that could be avoided by modeling thought from an internal perspective.</p><p>In order to move to modeling thought from an internal perspective, there are two promising avenues:</p><ol><li><p>Combining symbolic AI with large language models</p></li><li><p>Simulation of the human connectome</p></li></ol><p><a href="https://en.wikipedia.org/wiki/Symbolic_artificial_intelligence">Symbolic AI</a> comprises formal logic, proof verification systems, knowledge graphs, and more. These approaches attempt to model the mechanisms of human thought directly, focusing in particular on producing logically correct deductions (in the parlance of Kahneman&#8217;s <em>Thinking Fast and Slow</em>, these attempt to model System 2 thinking). 
Advances in combining these approaches with large language models will probably come from algorithmic and engineering work rather than scaling up compute. Thus, if this approach works, we should expect to see artificial general intelligence (AGI) without much additional scaling cost. Since this article is most interested in exploring the upper end of the compute that might be required for AGI, we will ignore this case and focus on modeling the human connectome.</p><h1>Simulating the Brain</h1><p>In the case that merging symbolic AI with LLMs does not work, our clearest avenue towards AGI would be mapping and simulating the human connectome. A simulation of the actual underlying hardware of the brain &#8212; the billions of neurons and trillions of synapses that comprise it as well as all of their interactions &#8212; should produce thought through a faithful reconstruction of its underlying mechanics in the same way that computer simulations can map the trajectory of a real rocket. And if we provide it with a map of the human connectome of someone like Albert Einstein and feed it data on our accumulated store of knowledge, it <em>should </em>be a general intelligence that can solve difficult, out-of-distribution problems. </p><p>The above paragraph rests on a number of critical assumptions (namely, that we <em>can</em> map the full human connectome at a high level of detail <em>and</em> that we can develop well-formulated models for human neurons and synapses). For the sake of this exercise, we will ignore the significant work that remains to be done in those domains and instead answer the question &#8212; <em>if</em> we already had a complete map of the human mind, how much compute would we need to run the simulation and generate thought? We&#8217;ll answer this question at a rough order of magnitude level, but it should give us a picture of when the amount of compute needed to run this simulation will become available to large corporations and research labs.</p><p>To accurately simulate the human connectome, we need to consider several factors:</p><ol><li><p><strong>Neuronal Complexity</strong>: Each neuron is a complex computational unit with intricate dynamics. Simulating a single neuron with high fidelity requires significant computational power.</p></li><li><p><strong>Synaptic Plasticity</strong>: The strength and nature of connections between neurons are constantly changing. Modeling this plasticity adds another layer of complexity to the simulation.</p></li><li><p><strong>Temporal Resolution</strong>: Neural processes occur on multiple timescales, from milliseconds to hours. A comprehensive simulation must account for these various temporal dynamics.</p></li><li><p><strong>Spatial Resolution</strong>: The spatial arrangement of neurons and their connections is crucial for understanding brain function. High-resolution mapping of the connectome is essential for accurate simulation.</p></li></ol><p>Given that we would like a high-fidelity simulation of the brain that can produce emergent, intelligent thought, we will tend towards higher complexity along all four of these factors. 
Our back-of-the-envelope calculations will assume a highly-complex neuronal model, long- and short-term synaptic plasticity, a temporal resolution of 1 millisecond, and full-connectome modeling, including all 100 billion neurons and 600 trillion synapses.</p><h2>Neurons</h2><p>Let's focus on a single-neuron model as an example, using the Hodgkin-Huxley (HH) model, which is one of the more computationally intensive but biologically realistic models:</p><ol><li><p>Membrane Potential Calculation: The HH model uses a differential equation: C(dV/dt) = -g_Na(V-E_Na) - g_K(V-E_K) - g_L(V-E_L) + I_ext. Solving this numerically (e.g., using the Euler method) requires:</p><ul><li><p>3 subtractions</p></li><li><p>3 multiplications</p></li><li><p>3 additions</p></li><li><p>1 division (for dt) </p></li><li><p><strong>Total:</strong> ~10 floating point operations per timestep</p></li></ul></li><li><p>Ion Channel Dynamics: For the gated sodium and potassium channels: dm/dt = &#945;_m(1-m) - &#946;_m*m (similarly for h and n). Each gate variable (m, h, n) requires:</p><ul><li><p>2 exponential calculations (~10 floating point operations each)</p></li><li><p>4 multiplications</p></li><li><p>2 additions/subtractions </p></li><li><p><strong>Total:</strong> ~30 floating point operations per gate, ~90 floating point operations for all three</p></li></ul></li><li><p>Conductance Calculations: g_Na = g_Na_max * m^3 * h and g_K = g_K_max * n^4. Requires:</p><ul><li><p>5 multiplications</p></li><li><p>2 exponentiations </p></li><li><p><strong>Total:</strong> ~20 floating point operations</p></li></ul></li><li><p>Current Calculations: I_Na = g_Na * (V - E_Na), etc. Requires:</p><ul><li><p>3 subtractions</p></li><li><p>3 multiplications </p></li><li><p><strong>Total:</strong> ~6 floating point operations</p></li></ul></li></ol><p>Summing these up, we get approximately 126 floating point operations per timestep for a single-compartment HH model. However, this is a significant underestimate for a realistic neuron simulation:</p><ol start="5"><li><p>Multiple Compartments: Real neurons aren't single compartments. A moderately detailed model might have 10-100 compartments, each requiring its own HH-like calculations. <strong>Total:</strong> 126 * 10 to 126 * 100 = 1,260 to 12,600 floating point operations</p></li><li><p>Synaptic Integration: A typical neuron might have 1,000-10,000 synapses. For each active synapse:</p><ul><li><p>Calculate postsynaptic current (~5 floating point operations)</p></li><li><p>Update synaptic state (~5 floating point operations)</p></li></ul><p>If 10% of synapses are active in a timestep: <strong>Total:</strong> (100 to 1,000 active synapses) * 10 floating point operations = 1,000 to 10,000 floating point operations</p></li><li><p>Intracellular Signaling: Calcium dynamics and second messenger systems might add another 100-500 floating point operations, depending on the level of detail.</p></li></ol><p>Adding these up, we get a range of about 2,360 to 23,100 floating point operations per neuron per timestep.</p>
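<p>For concreteness, here is a hedged sketch of what a single forward-Euler timestep of the single-compartment HH model looks like in code. The rate functions and constants are the standard textbook values (voltages in mV, time in ms, conductances in mS/cm&#178;); the point is only to show where the handful of additions, multiplications, divisions, and exponentials per step come from, not to be an optimized simulator.</p><pre><code># One forward-Euler step of the single-compartment Hodgkin-Huxley model.
import math

C, g_Na_max, g_K_max, g_L = 1.0, 120.0, 36.0, 0.3   # membrane capacitance and max conductances
E_Na, E_K, E_L = 50.0, -77.0, -54.4                  # reversal potentials (mV)

def rates(V):
    # Voltage-dependent opening/closing rates for the m, h, n gate variables.
    a_m = 0.1 * (V + 40.0) / (1.0 - math.exp(-(V + 40.0) / 10.0))
    b_m = 4.0 * math.exp(-(V + 65.0) / 18.0)
    a_h = 0.07 * math.exp(-(V + 65.0) / 20.0)
    b_h = 1.0 / (1.0 + math.exp(-(V + 35.0) / 10.0))
    a_n = 0.01 * (V + 55.0) / (1.0 - math.exp(-(V + 55.0) / 10.0))
    b_n = 0.125 * math.exp(-(V + 65.0) / 80.0)
    return a_m, b_m, a_h, b_h, a_n, b_n

def hh_step(V, m, h, n, I_ext, dt=0.01):
    a_m, b_m, a_h, b_h, a_n, b_n = rates(V)
    m += dt * (a_m * (1.0 - m) - b_m * m)             # gate updates
    h += dt * (a_h * (1.0 - h) - b_h * h)
    n += dt * (a_n * (1.0 - n) - b_n * n)
    g_Na, g_K = g_Na_max * m**3 * h, g_K_max * n**4   # conductances
    dV = (-g_Na * (V - E_Na) - g_K * (V - E_K) - g_L * (V - E_L) + I_ext) / C
    return V + dt * dV, m, h, n

V, m, h, n = -65.0, 0.05, 0.6, 0.32                   # typical resting-state values
for _ in range(2_000):                                # 20 ms of simulated time
    V, m, h, n = hh_step(V, m, h, n, I_ext=10.0)
print(f"membrane potential after 20 ms: {V:.1f} mV")
</code></pre>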
<h2>Synapses</h2><p>We&#8217;ll now estimate the number of floating point operations needed to simulate a single synapse, accounting for both transmission and plasticity.</p><ol><li><p>Basic Synaptic Transmission: I_syn = g_syn * s * (V_post - E_rev), with ds/dt = &#945; * T * (1 - s) - &#946; * s</p><ul><li><p>2 multiplications</p></li><li><p>2 subtractions</p></li><li><p>1 addition</p></li><li><p>1 division (for dt) </p></li><li><p><strong>Total:</strong> ~6 floating point operations</p></li></ul></li><li><p>Short-Term Plasticity (STP): </p><ol><li><p>Facilitation: dF/dt = (1 - F)/&#964;F + f * &#948;(t - t_spike)</p></li><li><p>Depression: dD/dt = (1 - D)/&#964;D - d * D * &#948;(t - t_spike)</p></li><li><p>Synaptic efficacy: A = A0 * F * D</p><ul><li><p>4 subtractions</p></li><li><p>3 divisions</p></li><li><p>3 multiplications</p></li><li><p>2 additions </p></li><li><p><strong>Total:</strong> ~12 floating point operations</p></li></ul></li></ol></li><li><p>Long-Term Plasticity (LTP/LTD): </p><ol><li><p>NMDA receptor activation: I_NMDA = g_NMDA * s_NMDA * B(V) * (V_post - E_NMDA), with B(V) = 1 / (1 + exp(-0.062 * V_post) * [Mg2+] / 3.57)</p><ul><li><p>4 multiplications</p></li><li><p>2 subtractions</p></li><li><p>1 division</p></li><li><p>1 exponentiation </p></li><li><p><strong>Total:</strong> ~18 floating point operations</p></li></ul></li><li><p>Calcium dynamics: d[Ca2+]/dt = -[Ca2+]/&#964;Ca + &#947; * I_NMDA + baseline</p><ul><li><p>1 division</p></li><li><p>2 multiplications</p></li><li><p>1 addition</p></li><li><p>1 subtraction</p></li><li><p><strong>Total:</strong> ~5 floating point operations</p></li></ul></li><li><p>CaMKII activation: dCaMKII/dt = k1 * [Ca2+]^n * (1 - CaMKII) - k2 * CaMKII</p><ul><li><p>2 multiplications</p></li><li><p>1 subtraction</p></li><li><p>1 exponentiation</p></li><li><p>1 division </p></li><li><p><strong>Total:</strong> ~15 floating point operations</p></li></ul></li><li><p>Weight update rule (based on CaMKII): dw/dt = &#951; * (CaMKII - &#952;p)+ - &#951; * (CaMKII - &#952;d)-, where ()+ and ()- denote rectification</p><ul><li><p>2 subtractions</p></li><li><p>2 comparisons</p></li><li><p>2 multiplications </p></li><li><p><strong>Total:</strong> ~8 floating point operations</p></li></ul></li></ol></li><li><p>Homeostatic Plasticity: w = w * (1 + &#951;_homeo * (target_activity - actual_activity))</p><ul><li><p>2 subtractions</p></li><li><p>2 multiplications</p></li><li><p>1 addition </p></li><li><p><strong>Total:</strong> ~5 floating point operations</p></li></ul></li><li><p>Structural Plasticity (simplified): P_form = sigmoid(local_activity - threshold), P_elim = 1 - P_form</p><ul><li><p>1 subtraction</p></li><li><p>1 exponentiation (for sigmoid)</p></li><li><p>1 division (for sigmoid)</p></li><li><p>1 subtraction (for P_elim) </p></li><li><p><strong>Total:</strong> ~14 floating point operations</p></li></ul></li><li><p>Neuromodulation (simplified, e.g., dopaminergic influence on plasticity): plasticity_factor = baseline + k * [dopamine]</p><ul><li><p>1 multiplication</p></li><li><p>1 addition </p></li><li><p><strong>Total:</strong> ~2 floating point operations</p></li></ul></li></ol><p>Summing these components: 6 + 12 + 18 + 5 + 15 + 8 + 5 + 14 + 2 = 85 floating point operations per synapse per timestep.</p>
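<p>As a small illustration of how one of these per-synapse updates turns into code, the sketch below steps the short-term plasticity (facilitation and depression) equations above with forward Euler at 1 ms resolution. The parameter values are illustrative stand-ins, not fitted to data.</p><pre><code># Forward-Euler sketch of the STP equations: dF/dt = (1-F)/tau_F + f*delta(t - t_spike),
# dD/dt = (1-D)/tau_D - d*D*delta(t - t_spike), with efficacy A = A0 * F * D.
dt, tau_F, tau_D = 0.001, 0.2, 0.5     # seconds
f, d, A0 = 0.2, 0.3, 1.0               # facilitation/depression increments, baseline efficacy
F, D = 1.0, 1.0
efficacy = A0

spike_times_ms = {5, 12, 14, 40}       # presynaptic spike times (ms), arbitrary
for t_ms in range(100):
    F += dt * (1.0 - F) / tau_F        # relax toward baseline between spikes
    D += dt * (1.0 - D) / tau_D
    if t_ms in spike_times_ms:         # delta-function terms applied at spike times
        F += f
        D -= d * D
    efficacy = A0 * F * D
print(f"efficacy after 100 ms: {efficacy:.3f}")
</code></pre>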
<h2>Full Estimate</h2><p>Assuming we need to simulate each neuron and synapse at a resolution of 1 millisecond, that each neuron update requires roughly 20,000 floating point operations (rounding the per-neuron range above toward its upper end), and that each synapse update requires 85 floating point operations, we can make a rough calculation:</p><p>100 billion neurons * 20,000 floating point operations/neuron * 1000 timesteps/second = 2 * 10^18 FLOPS</p><p>600 trillion synapses * 85 floating point operations/synapse * 1000 timesteps/second = 5.1 * 10^19 FLOPS</p><p><strong>Total: Approximately 5.3 * 10^19 FLOPS or 53 exaFLOPS</strong></p><p>As a result, simulating the human connectome for approximately 1 week would require more floating point operations than the entire training budget for GPT-4. Keep in mind that this simulation is the analog of LLM inference, not training: imagine if serving 1 instance of GPT-4 for 1 week took the entire training compute budget of OpenAI. This is a monumental amount of compute and beyond the economic feasibility of any current company (even if the human connectome had been fully mapped, which it is currently far from being).</p><p>From another perspective, we can look at this compute requirement in terms of the number of H100s that it would require. FP64 is typically used for scientific simulation work, and the NVIDIA H100 can perform 34 teraFLOPS = 3.4 * 10^13 FLOPS in FP64 mode. Hence, our estimate for the human brain would require approximately 1,550,000 H100s for simulation. This is 2 orders of magnitude larger than the largest deployed H100 clusters today. There is not much data on the growth rate of FP64 FLOPS, but <a href="https://epochai.org/blog/trends-in-machine-learning-hardware">FP32 FLOPS are doubling every 2.3 years</a>. Since an FP64 multiplier unit takes roughly five times the area of an FP32 multiplier, we can estimate that it will require 5 times the growth in transistors in order to double the FP64 performance when compared to doubling FP32 performance. Hence, our doubling time for FP64 performance should be roughly 7.6 years, which would imply that FP64 performance will increase by one order of magnitude (i.e. 10x the performance of the H100) in about 25 years. </p>
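<p>Before extrapolating further, the figures so far can be checked with a few lines of arithmetic. All of the inputs below are the rough assumptions from this section, not measured values.</p><pre><code># Back-of-the-envelope check of the totals above.
neurons = 100e9            # neurons in the human brain
synapses = 600e12          # synapses
flops_per_neuron = 20_000  # per 1 ms timestep (rounded from the 2,360-23,100 range)
flops_per_synapse = 85     # per 1 ms timestep
timesteps_per_second = 1_000

total_flops = (neurons * flops_per_neuron + synapses * flops_per_synapse) * timesteps_per_second
print(f"{total_flops:.2e} FLOPS")                       # 5.30e+19, i.e. ~53 exaFLOPS

h100_fp64_flops = 34e12                                 # H100 FP64 throughput
print(f"{total_flops / h100_fp64_flops:,.0f} H100s")    # ~1.56 million H100s

one_week = total_flops * 7 * 24 * 3600
print(f"{one_week:.2e} FLOP for one simulated week")    # ~3.2e+25 FLOP
</code></pre>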
<p><strong>If the above trends and assumptions hold (a very tenuous assumption), it will be 50 years before FP64 performance has increased to the point where companies can simulate the human brain for an outlay roughly equivalent to today&#8217;s largest AI training clusters.</strong> If algorithmic advances or further research in computational neuroscience supports a transition to FP32 from FP64, this 50-year timeline could be compressed to only 15 years.</p><p><strong>Hence, if simulating the brain is our only viable path to AGI, we can expect the required compute to be available for the largest corporations in 2040 at the earliest and 2075 at the latest given current trends.</strong></p>]]></content:encoded></item><item><title><![CDATA[A Case Study in Finetuning Open Source LLMs: Training LLaMA 2 for the Text-to-SQL Task]]></title><description><![CDATA[Introduction]]></description><link>https://www.chrishayduk.com/p/a-case-study-in-finetuning-open-source</link><guid isPermaLink="false">https://www.chrishayduk.com/p/a-case-study-in-finetuning-open-source</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Tue, 04 Jun 2024 13:49:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2VUq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F031c512c-67c3-4ce8-aea2-a9d9ae8ce9a0_1078x1316.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Introduction</h1><p>Below is a write-up of a consulting project I did for a client in late 2023 (with all client names removed). In the report, I detail the models, datasets, and approaches needed to create a state-of-the-art text-to-SQL model that outperforms GPT-4. If you don&#8217;t have time to read the full report, the three main takeaways are the following:</p><ol><li><p><strong>The dataset is everything</strong>. By far my highest ROI activity came from inspecting, correcting, and augmenting my dataset. This included: spotchecking and fixing errors in the table schemas in the training data, identifying additional non-SQL datasets to include in training that helped performance, implementing curriculum learning to improve convergence, and more.</p></li><li><p><strong>Don&#8217;t just validate next token prediction, test real task performance.</strong> You want to make sure your validation set is as close to the real thing as possible. To that end, I made two modifications:</p><ol><li><p>I biased the validation set towards harder SQL queries rather than having the same distribution as the training set</p></li><li><p>I created real SQL tables using GPT-4 to match the schemas that are in the data. I then executed the SQL statements produced by the model against these tables and compared the result of this execution to the result of executing the ground-truth SQL. This gave me a metric that tracked the performance of the model in a <strong>real-world setting</strong>, rather than just looking at how accurately it can guess the next token (a minimal sketch of this execution-match check appears just after this list)</p></li></ol></li><li><p><strong>Experimentation is key</strong>. All of these insights took many training runs to arrive at (likely about 50 in total). Thus, make sure to use a parameter efficient finetuning method such as QLoRA so you don&#8217;t bankrupt yourself on all of the experimentation training runs. Also take meticulous notes - sometimes the insights coalesce over the course of many trials. For tracking purposes, I used Weights &amp; Biases and kept all of my notes in a running Google Doc.</p></li></ol>
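<p>As a hedged illustration of the execution-based metric in takeaway #2, the sketch below builds a throwaway SQLite table, runs both the predicted and the ground-truth SQL against it, and compares the result sets. The schema, rows, and queries are toy stand-ins, not the project&#8217;s data.</p><pre><code># Execution-match evaluation: run predicted SQL and gold SQL against the same
# sample table and compare the returned rows, rather than comparing tokens.
import sqlite3

def execution_match(create_sql, insert_sql, predicted_sql, gold_sql):
    conn = sqlite3.connect(":memory:")
    conn.executescript(create_sql + insert_sql)
    try:
        predicted = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False                      # invalid SQL counts as a miss
    gold = set(conn.execute(gold_sql).fetchall())
    return predicted == gold

schema = "CREATE TABLE employees (name TEXT, dept TEXT, salary REAL);"
rows = "INSERT INTO employees VALUES ('Ann','eng',150), ('Bo','sales',90);"
print(execution_match(schema, rows,
                      "SELECT name FROM employees WHERE salary &gt; 100;",
                      "SELECT name FROM employees WHERE salary &gt; 100.0;"))  # True
</code></pre>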
<p>These three insights together allowed me to develop a model based on Code LLaMA that outperformed GPT-4 by 30 percentage points and cost 99% less to run at inference.</p><p>Now, if you&#8217;d like all the details, the full consulting report is below.</p><h1>Overview</h1><p>Currently, usage of many database &amp; data warehouse management tools requires SQL knowledge in order to write queries. As a result, the number of seats per software license for these tools is constrained by the number of employees within a given organization who have a strong grasp of SQL, which caps the potential revenue per client. The recent emergence of large language models (LLMs) can help to alleviate this problem, as they demonstrate near human-level ability to translate from natural language to code. However, usage of common LLMs such as ChatGPT can present several challenges, including:</p><ol><li><p><strong>Cost: </strong>OpenAI charges per 1000 tokens of input and output (where 1 token roughly corresponds to 1 word). As a result, costs can skyrocket as the user base increases. For example, 100,000 users making on average 10 document summarization requests per day will cost about $5,800 per day in API fees alone. This can significantly reduce profit margins on a product. In addition, sudden unforeseen spikes in API usage can result in large losses without appropriate API throttling checks in place.</p></li><li><p><strong>Security: </strong>When using the OpenAI API endpoint, requests to ChatGPT are sent to an external endpoint. This increases the probability of data leakage and the associated negative downstream effects for a company, such as fines, legal fees, and loss of goodwill. 
Even if using an OpenAI endpoint deployed within a customer&#8217;s Azure environment, there will still be security concerns over using an external LLM on valuable, confidential data.</p></li><li><p><strong>Output Control: </strong>When using OpenAI&#8217;s API endpoints, you are at the mercy of any changes they decide to make to their models. GPT-3.5 and GPT-4 are constantly updated with new training data pouring in from the millions of user chats. While this is intended to improve the service, performance can end up degrading for your specific task &amp; prompt combination (see Figure 1 below). This can result in sudden, random drops in performance for your tool.&nbsp;</p></li></ol><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!1_gM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe710d181-f997-42d5-9cf3-79b19fc61aec_626x291.png" alt="" /><figcaption class="image-caption">Figure 1. Chen, Lingjiao, Matei Zaharia, and James Zou. "How is ChatGPT's behavior changing over time?" arXiv preprint arXiv:2307.09009 (2023).</figcaption></figure></div>
<p>Open source LLMs, however, have the potential to address each of these key drawbacks of closed source LLMs through providing: (1) smaller models that reduce the cost required to serve the LLM, (2) ability to deploy the model in any environment, whether in a secure cloud VPC or even on a local machine, and (3) complete control over all model parameters, ensuring consistent output quality. In order to leverage these open source model advantages, we have developed a state-of-the-art text-to-SQL model based on Meta&#8217;s Code LLaMA model. Our new model, dubbed LLaMA2-SQL, significantly outperforms GPT-4 on the text-to-SQL task, while being lightweight enough to run on a CPU.&nbsp;</p><h1>Methodology</h1><h2>Base Model Choice</h2><p>We began the project with three main desired characteristics for our base model:</p><ol><li><p>Strong general reasoning capabilities</p></li><li><p>Commercially-permissive licensing</p></li><li><p>Large open source community ecosystem</p></li></ol><p>Of models released during the main phase of the project (March 2023-December 2023), Meta&#8217;s LLaMA 2 and Code LLaMA models were the highest performing LLMs to satisfy these three criteria. LLaMA 2 is a general purpose LLM which was pretrained on 2 trillion tokens. 
Code LLaMA extends LLaMA 2 by further pretraining the model on an additional 520 billion tokens of code, significantly improving performance on coding tasks (see Figure 2).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!2VUq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F031c512c-67c3-4ce8-aea2-a9d9ae8ce9a0_1078x1316.png" alt="" /><figcaption class="image-caption">Figure 2. Llama and Code Llama performance compared to ChatGPT and several other open source models. HumanEval and MBPP both consist of tasks that the model must solve using Python code. Multilingual HumanEval extends HumanEval to include coding challenges in C#, Go, Java, JavaScript, Kotlin, Perl, PHP, Ruby, Scala, Swift, and TypeScript.</figcaption></figure></div>
restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2.&nbsp; Llama and Code Llama performance compared to ChatGPT and several other open source models. HumanEval and MBPP both consist of tasks that the model must solve using Python code. Multilingual HumanEval extends HumanEval to include coding challenges in C#, Go, Java, JavaScript, Kotlin, Perl, PHP, Ruby, Scala, Swift, and TypeScript.</figcaption></figure></div><p>Given Code LLaMA&#8217;s strong general reasoning &amp; coding performance, we hypothesized that it would provide a good starting point from which to finetune a SQL-specific model. Specifically, we selected Code LLaMA - Instruct 34B to leverage its instruction-following capability when creating LLaMA2-SQL.</p><h2>Dataset Curation &amp; Augmentation</h2><p>Creation of the dataset was <strong>the most significant piece of the project</strong> and resulted in <strong>the largest gains in performance</strong>. To begin, we combined several datasets, including <a href="https://huggingface.co/datasets/wikisql">WikiSQL</a>, <a href="https://huggingface.co/datasets/spider">Spider</a>, and <a href="https://huggingface.co/datasets/iamtarun/code_instructions_120k_alpaca">Code Instructions Alpaca 120K</a>. These datasets combined to form the <a href="http://chrishayduk/Llama-2-SQL-and-Code-Dataset">LLaMA2-SQL AI dataset</a>. This amalgamation was not just a mere aggregation of data; it was a strategic blend designed to encompass a broad spectrum of SQL queries and structures.</p><p>In the pursuit of refining the dataset and enhancing the model's performance, several key strategies were employed. We adopted the instruct dataset format, which is known for its efficacy in guiding models towards more accurate and context-aware outputs. This format was instrumental in aligning the model's responses with the intricate demands of SQL query generation. Furthermore, we introduced a mix of general coding problems alongside SQL generation tasks. General coding questions tended to be longer and require more reasoning steps than the typical SQL queries found in WikiSQL and Spider datasets. 
As a result, this mixture improved the model&#8217;s reasoning capability on more complicated user questions, despite lowering the overall proportion of SQL queries in the training set.</p><p>One of the more technical challenges we addressed was a major table schema issue, where all columns in the training &amp; validation sets were indiscriminately coded as VARCHAR. By resolving this, we ensured that the model could recognize and handle a variety of data types, thereby increasing the accuracy and reliability of its SQL output. Additionally, we eliminated examples from the dataset where the response was less than 10 characters. This exclusion was based on the rationale that shorter responses often lack the complexity and detail required for effective training in SQL generation.</p><p>To further refine the dataset, we sorted examples in order of instruction length. This sorting approach (known as <a href="https://en.wikipedia.org/wiki/Curriculum_learning">curriculum learning</a>) allowed for a gradual and systematic exposure of the model to increasingly complex queries, thereby enhancing its learning curve. We also utilized an embedding model to identify and remove similar data points. This step was crucial in ensuring that the dataset was not only diverse but also free of redundant or overly repetitive examples.</p><p>Another innovative step was the randomization of the SQL schema order within the dataset. This randomization was a strategic move to prevent the model from developing biases or shortcuts based on schema ordering. Lastly, we intentionally biased the validation set to include mostly difficult SQL problems. This bias ensured that the model was rigorously tested against complex and challenging queries, which is essential for real-world applications where query complexity can vary greatly.</p><h2>Model Training</h2><p>Model training, particularly the fine-tuning of large language models, typically demands substantial GPU memory, posing significant challenges in terms of cost and efficiency. Even the smallest LLaMA model, with 7 billion parameters, requires approximately 140 GB of GPU RAM, while the 70B LLaMA model demands a staggering 1400 GB. This high resource requirement makes conventional fine-tuning methods both expensive and time-consuming. To address these challenges, we employed Parameter Efficient Fine-Tuning (PEFT), a novel approach that significantly reduces GPU RAM requirements for fine-tuning open source models. This method is a part of the <a href="https://github.com/huggingface">HuggingFace</a> library ecosystem and offers various out-of-the-box implementations.</p><p>Among the PEFT methods, we chose to use Quantized Low-Rank Adaptation (QLoRA). QLoRA builds upon the core concept of PEFT and introduces three critical enhancements (see Figure 3). Firstly, it quantizes the base model since the base model parameters are frozen and do not require maintenance in a high-precision format. This quantization effectively reduces memory usage. Secondly, it offloads some of the optimizer's values to the CPU memory when the GPU memory is insufficient, bringing them back to the GPU as needed. Lastly, it uses an interesting trick from linear algebra to reduce the number of trainable parameters. The trick works as follows: Assume the weight matrix W has dimension (d x d). 
If we set the dimensions of matrix A to be (d x r) and matrix B to be (r x d), then the multiplied matrix AB has the dimensions (d x d), the same as our original weight matrix W, but with potentially far fewer parameters! W has d*d = d^2 parameters, whereas A and B together have (d*r) + (r*d) = 2dr parameters. When r is less than d/2, this results in a reduction in trainable parameters compared to the original model. Low values of r can make this reduction dramatic. The combination of these three enhancements makes QLoRA incredibly memory efficient for fine-tuning while still maintaining strong performance.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!KM50!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5699aab2-3cc7-4c63-83f8-116dc599e23c_512x269.png" alt="" /><figcaption class="image-caption">Figure 3. Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).</figcaption></figure></div>
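<p>A quick numerical check of that parameter math, using a hidden size in the same ballpark as a large transformer layer (the exact values below are illustrative, not the project&#8217;s configuration):</p><pre><code># Parameter-count check for the low-rank trick above.
d, r = 8192, 16                  # hidden size d, adapter rank r (illustrative values)
full = d * d                     # trainable parameters if we updated W directly
lora = 2 * d * r                 # parameters in A (d x r) plus B (r x d)
print(full, lora, f"{full / lora:.0f}x fewer trainable parameters")
# 67108864 262144 256x fewer trainable parameters
</code></pre>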
<div class="captioned-image-container"><figure><figcaption class="image-caption">Figure 3.&nbsp; Hu, Edward J., et al. "Lora: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).</figcaption></figure></div><p>To facilitate PEFT fine-tuning, we utilized several libraries. <a href="https://github.com/OpenAccess-AI-Collective/axolotl">Axolotl</a> was instrumental in orchestrating the training process using common PEFT methods. Its user-friendly setup, requiring only a simple YAML file of training parameters, streamlined our workflow, and its support for distributed training was crucial for managing the computational demands of our project.</p><p>Another key component in our training arsenal was <a href="https://github.com/Dao-AILab/flash-attention">Flash Attention</a>. This library implements an exceedingly efficient attention mechanism, allowing for faster training and lower memory usage, which translates to cost savings. This efficiency was vital in our pursuit of a balance between performance and cost.</p><p>Google Colab played a pivotal role in our training process. By renting A100 instances at an affordable rate, we could train models up to 34 billion parameters using a combination of QLoRA and Flash Attention. We particularly recommend Colab Pro+ for its background execution capabilities and priority access to A100 instances.</p><p>The synergy of QLoRA and Flash Attention proved to be incredibly powerful. In our project, we successfully fine-tuned the Code LLaMA 34B model on 67,000 training examples over five epochs in under 22 hours. This efficiency resulted in a cost-effective training process, costing approximately $28 in Google compute credits. The outcome was a fully fine-tuned model that surpassed GPT-4 in our specific task, a testament to the effectiveness of our chosen training methods and tools.</p>
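<p>For readers who want to see what this looks like in code, below is a minimal QLoRA setup using the HuggingFace transformers, peft, and bitsandbytes libraries. This is an illustrative sketch rather than our actual Axolotl configuration, and the hyperparameters shown are assumptions, not the values we trained with.</p><pre><code># Minimal QLoRA sketch: load the frozen base model in 4-bit and attach small
# trainable low-rank adapters. Hyperparameters here are illustrative only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the frozen base weights
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-34b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()           # only the adapter weights are trainable</code></pre>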
<h1>Model Results</h1><h2>Accuracy</h2><p>LLaMA2-SQL was evaluated against its two main competing models: GPT-3.5 and GPT-4. All three models were evaluated on a dataset consisting of a natural language instruction as input with a desired SQL statement as output. Some examples included the table schema as input, while others excluded the schema to test the model&#8217;s ability to infer schema information from text alone. Each text and SQL example was accompanied by a sample SQL table. An example data point is displayed below in Figure 4 (the contained text is fairly small, so I recommend zooming in to see the data more clearly).</p>
<div class="captioned-image-container"><figure><figcaption class="image-caption">Figure 4. Example data point from the evaluation set.</figcaption></figure></div><p>To produce our accuracy numbers, we executed the ground truth SQL statements against this sample table and compared the results to the tables produced by executing the SQL output of each of the LLMs tested in our benchmark suite. The accuracy thus represents the percent of evaluation examples where the model produced executable SQL that arrived at the correct answer. By examining the output accuracy rather than simply checking if the model&#8217;s SQL statement matches the desired SQL statement, we are able to account for examples where the model uses a different approach to arrive at the correct answer. As a result, our accuracy more closely reflects the accuracy that the model would achieve in the real world.</p>
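<p>The scoring logic itself is simple. Here is a minimal sketch using an in-memory SQLite database; the table, queries, and helper name are hypothetical, but the comparison mirrors the execution-based matching described above.</p><pre><code># Execution-based scoring sketch: run the gold query and the model's query against
# the same sample table and count a hit only when the result sets match.
import sqlite3

def execution_match(setup_sql, gold_sql, predicted_sql):
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)             # create and populate the sample table
    gold = conn.execute(gold_sql).fetchall()
    try:
        pred = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False                          # non-executable SQL counts as a miss
    # Compare as sorted multisets so equivalent queries with different row order still match
    return sorted(map(tuple, gold)) == sorted(map(tuple, pred))

setup = ("CREATE TABLE employees (name TEXT, salary INT);"
         "INSERT INTO employees VALUES ('Ann', 90), ('Bo', 70);")
print(execution_match(setup,
                      "SELECT name FROM employees WHERE salary = 90",
                      "SELECT name FROM employees ORDER BY salary DESC LIMIT 1"))  # True</code></pre><p>Note that the second query takes a different route to the same answer, which is exactly the kind of case that string matching against the reference SQL would incorrectly penalize.</p>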
<p>Using this evaluation setup, we see that LLaMA2-SQL significantly outperforms both GPT-4 and GPT-3.5, beating them by about 30 and 37 percentage points on the evaluation set, respectively (see Figure 5 below). It achieves an accuracy of 68.82%, which includes examples where no schema is provided at all. When provided table schemas along with every user question, LLaMA2-SQL&#8217;s accuracy exceeds 85%. As a result, we are able to provide a better user experience for the text-to-SQL task than is offered by closed source models, despite their increased model and training data size.</p>
<div class="captioned-image-container"><figure><figcaption class="image-caption">Figure 5. Model results on evaluation set.</figcaption></figure></div><h2>Cost &amp; Compute Efficiency</h2><p>In addition to beating GPT-3.5 and GPT-4 handily in accuracy, LLaMA2-SQL has 1/10th as many parameters as GPT-3.5 and 1/100th as many as GPT-4. Moreover, we have optimized LLaMA2-SQL to further reduce the model&#8217;s size and enable it to run on CPUs. This translates to massive reductions in the compute needed to serve LLaMA2-SQL to customers, resulting in large cost savings. The LLaMA2-SQL model can either be served on a CPU machine, with response times of about 30-60 seconds, or on a cheap GPU machine, with response times of about 5 seconds. The deployment hardware can be determined by the desired usage &#8211; if response times do not need to be near instant to produce a strong user experience, then the cheaper CPU environment can be used. Estimated costs and theoretical maximum requests served for a single representative instance hosting the LLaMA2-SQL model are as follows:</p><ul><li><p><strong>GPU:</strong></p><ul><li><p><strong>AWS EC2 Instance Type:</strong> g4dn.xlarge</p></li><li><p><strong>Daily Cost:</strong> $12.60</p></li><li><p><strong>Maximum Requests per Day:</strong> About 29,000</p></li></ul></li><li><p><strong>CPU:</strong></p><ul><li><p><strong>AWS EC2 Instance Type:</strong> t4g.2xlarge</p></li><li><p><strong>Daily Cost:</strong> $6.47</p></li><li><p><strong>Maximum Requests per Day:</strong> About 1,900</p></li></ul></li></ul><p>LLaMA2-SQL&#8217;s compute efficiency allows thousands of daily user requests to be served at the cost of only a few dollars per day. By comparison, 29,000 daily requests to the GPT-4 API endpoint would cost roughly $1500 per day. As a result, when deploying our model on a GPU instance,<strong> we achieve a 99% reduction in cost while also improving performance by 30 percentage points when compared to GPT-4</strong>.</p>
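<p>As a quick back-of-the-envelope check on that comparison, using the figures quoted above:</p><pre><code># Rough cost comparison at 29,000 requests per day, using the numbers cited above.
llama_gpu_daily = 12.60                   # g4dn.xlarge daily cost
gpt4_daily = 1500.0                       # rough GPT-4 API cost for the same volume
print(llama_gpu_daily / 29_000)           # about $0.0004 per request
print(1 - llama_gpu_daily / gpt4_daily)   # about 0.99, i.e. a roughly 99% cost reduction</code></pre>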
<h1>Conclusion &amp; Next Steps</h1><p>This project&#8217;s key objective was to validate the potential of a custom, low cost text-to-SQL model, with the goal of increasing Client&#8217;s market penetration by making the tool accessible to non-technical users. LLaMA2-SQL achieved this objective, setting a new state-of-the-art in text-to-SQL generation, both from a performance and cost standpoint. Future work building on LLaMA2-SQL can take a number of directions, including:</p><ol><li><p>Retraining LLaMA2-SQL using newly-released open source models that are more powerful than Code LLaMA</p></li><li><p>Improving LLaMA2-SQL&#8217;s training dataset to more closely match the requests that will be seen in a production environment from Client&#8217;s users</p></li><li><p>Optimizing &amp; automating cloud infrastructure using Terraform to support rapid deployment of LLaMA2-SQL on AWS, Azure, and GCP</p></li><li><p>Developing UI/UX for LLaMA2-SQL to support its integration into the tool</p></li></ol>]]></content:encoded></item><item><title><![CDATA[Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2]]></title><description><![CDATA[The fundamental concepts behind ESM2, ESM3, and AlphaFold]]></description><link>https://www.chrishayduk.com/p/understanding-protein-language-models</link><guid isPermaLink="false">https://www.chrishayduk.com/p/understanding-protein-language-models</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Mon, 03 Jun 2024 15:06:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZuG4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6478b229-c998-4c3d-a9f8-c115af1ba1bd_1300x509.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[
<p><strong>Note: </strong>This post is part of the &#8220;Understanding Protein Language Models&#8221; Series:</p><ol><li><p>[This article] Understanding Protein Language Models Part I: Multiple Sequence Alignment in AlphaFold2</p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-e1a">Understanding Protein Language Models Part II: Encoder-only Transformers as Continuous Fuzzy String Matching</a></p></li><li><p><a href="https://www.chrishayduk.com/p/understanding-protein-language-models-40b">Understanding Protein Language Models Part III: Structure Prediction without Multiple Sequence Alignment in ESMFold</a></p></li></ol><div><hr></div><p>Over the past couple of months, I&#8217;ve been on a journey to understand protein language models - what <em>exactly </em>are they learning and how do they work? I began that journey by trying to understand the inner workings of AlphaFold2, given that it represented the first leap forward for AI in biology.</p><p>In particular, protein language models sought to replace the multiple sequence alignment (MSA) component of AlphaFold2 in order to achieve similar performance at much lower computational costs. So, in order to better understand what protein language models are doing, it is first important to understand what they replaced. To that end, in these notes I dive into the role of multiple sequence alignments in AlphaFold2 and how they drive the model&#8217;s ability to infer contacts between residues in the protein. I hope you find these notes useful!</p><h2>What is MSA?</h2>
<p>Multiple sequence alignment (MSA) refers to the process or the result of sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences is assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor.
From the resulting MSA, sequence homology can be inferred and phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.</p><p>Visual depictions of the alignment (as in the toy example below) illustrate mutation events such as point mutations (single amino acid or nucleotide changes) that appear as differing characters in a single alignment column, and insertion or deletion mutations (indels or gaps) that appear as hyphens in one or more of the sequences in the alignment. Multiple sequence alignment is often used to assess sequence conservation of protein domains, tertiary and secondary structures, and even individual amino acids or nucleotides.</p>
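<p>To make the idea of alignment columns concrete, here is a toy alignment with made-up sequences. Each column lines up homologous positions, hyphens mark gaps, and a column containing more than one letter reflects a mutation at that position.</p><pre><code># Toy multiple sequence alignment (made-up sequences, for illustration only).
msa = [
    "MKT-LLVAA",   # query sequence
    "MKTALLVAA",   # has an insertion relative to the query (column 4)
    "MRT-LLVAG",   # point mutations relative to the query (columns 2 and 9)
    "MKT-LLVAA",
]
for col in range(len(msa[0])):
    letters = {seq[col] for seq in msa}
    status = "conserved" if len(letters) == 1 else "variable"
    print(col + 1, "".join(sorted(letters)), status)</code></pre>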
<p>Thus, by including an MSA as input, AlphaFold2 is able to infer information about the target sequence by assessing its shared evolutionary history with a number of other sequences. This is a powerful concept - it gives the model a strong &#8220;starting point&#8221; to make predictions for a new sequence.</p><p>The MSA databases used by AlphaFold2 to identify evolutionarily similar sequences to the target sequence are:</p><ul><li><p>MGnify</p></li><li><p>UniRef90</p></li><li><p>Uniclust30</p></li><li><p>BFD</p></li></ul><h2>AlphaFold2 Architecture</h2><div class="captioned-image-container"><figure><figcaption class="image-caption">Full architecture of AlphaFold2</figcaption></figure></div><p>Now that we know what MSA is and why it is used, let&#8217;s sketch out the high-level architectural details of AlphaFold2. In the above image, we can see the various inputs &amp; components that comprise the model. AlphaFold2 takes in three key inputs:</p><ol><li><p>The input sequence itself</p></li><li><p>An MSA using the input sequence as its starting point</p></li><li><p>Template structures related to the input sequence</p></li></ol><p>These three inputs are then distilled into two by using the template structures and input sequence to initialize a pair representation matrix. The pair representation matrix can be thought of as scores for &#8220;similarity&#8221; or &#8220;interaction&#8221; between each pair of amino acids i and j in the input sequence.</p><p>By contrast, the MSA representation can be thought of as storing a vector representation of each amino acid for each protein in the alignment. If we imagine the matrix as a 2D grid, each row represents a protein and each column represents a position in the aligned amino acid sequence (e.g. amino acid #5 in the sequence). In each cell of this matrix, we can imagine a vector that represents the specified amino acid. In reality, this is a tensor of shape (number of sequences, number of residues, channels).</p>
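<p>A rough sketch of the shapes involved may help here; the sizes below are made up for illustration rather than taken from the AlphaFold2 implementation.</p><pre><code># Shapes of the two working representations (illustrative sizes only):
# s = number of sequences in the MSA, r = number of residues, c = channels.
import numpy as np

s, r, c_msa, c_pair = 128, 300, 256, 128
msa_rep = np.zeros((s, r, c_msa))     # one embedding vector per (sequence, residue) cell
pair_rep = np.zeros((r, r, c_pair))   # one embedding vector per (residue i, residue j) pair

query_rep = msa_rep[0]                # the first row corresponds to the input sequence itself
print(msa_rep.shape, pair_rep.shape, query_rep.shape)</code></pre>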
<p>These two inputs then flow through the Evoformer block, which generates improved representations of the MSA and pair representation matrices for structure prediction. The journey for the full MSA matrix ends here, as we extract the representation for our input sequence from the first row of the MSA matrix and send it forward to the structure module.</p><p>Given that the processing of the MSA matrix takes place in the Evoformer block, we&#8217;ll dive a bit deeper there.</p><h3>Evoformer Block</h3>
<div class="captioned-image-container"><figure><figcaption class="image-caption">Evoformer architecture in AlphaFold2</figcaption></figure></div><p>The Evoformer block begins with components for processing the MSA representation:</p><ol><li><p>Row-wise gated self-attention</p></li><li><p>Column-wise gated self-attention</p></li><li><p>Transition</p></li></ol><p>Following these three blocks, the MSA representation matrix is integrated into the pair representation matrix through the outer product mean block and resulting sum.</p><p>We&#8217;ll now dive deep into these three core components of the Evoformer block (alongside the outer product mean integration) to better understand how the MSA matrix is being updated.</p><h2>MSA Row-wise Gated Self-Attention</h2><p>Row-wise attention builds attention weights for residue pairs within the same sequence and integrates information from the pair representation as an additional bias term.
The updated MSA representation matrix thus ensures that each sequence has a <em>contextual representation</em> for its residues - that is, for sequence k, the embedding of the residue at index i takes into account information from the residues at indices 1, &#8230;, i-1, i+1, &#8230;, r.</p>
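<p>For readers who want to see the mechanics, here is a simplified, single-head numpy sketch of row-wise gated attention with random weights and made-up sizes. In the real model there are multiple heads and the bias is a learned linear projection of the pair representation, so treat this as an illustration of the data flow rather than the actual implementation.</p><pre><code># Simplified single-head sketch of MSA row-wise gated self-attention (illustrative only).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

s, r, c = 8, 10, 16                           # sequences, residues, channels
msa = np.random.randn(s, r, c)                # MSA representation
pair_bias = np.random.randn(r, r)             # bias derived from the pair representation

Wq, Wk, Wv, Wg = [np.random.randn(c, c) for _ in range(4)]
q, k, v = msa @ Wq, msa @ Wk, msa @ Wv
logits = q @ k.transpose(0, 2, 1) / np.sqrt(c) + pair_bias  # same bias added in every row
attn = softmax(logits, axis=-1)               # weights over residues j within each sequence
gate = 1 / (1 + np.exp(-(msa @ Wg)))          # sigmoid gate, per cell and channel
row_update = gate * (attn @ v)                # updated MSA representation, shape (s, r, c)</code></pre>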
<div class="captioned-image-container"><figure><figcaption class="image-caption">Architecture of row-wise gated self-attention</figcaption></figure></div><h2>MSA Column-wise Gated Self-Attention</h2><p>Column-wise attention lets the elements that belong to the same target residue exchange information <em>across</em> sequences in the MSA. The updated MSA representation matrix thus ensures that each residue has a <em>cross-sequence representation</em> - that is, the embedding for residue i in sequence k also takes into account information from residue i in sequences 1, &#8230;, k-1, k+1, &#8230;, s.</p>
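<p>Continuing the sketch above, the column-wise version is the same operation applied along the other axis of the MSA, with no pair bias term:</p><pre><code># Column-wise counterpart of the row-wise sketch above (reuses msa, Wq, Wk, Wv, Wg, softmax, c).
cols = msa.transpose(1, 0, 2)                        # (residues, sequences, channels)
q, k, v = cols @ Wq, cols @ Wk, cols @ Wv
attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(c), axis=-1)   # weights over sequences
gate = 1 / (1 + np.exp(-(cols @ Wg)))
col_update = (gate * (attn @ v)).transpose(1, 0, 2)  # back to (sequences, residues, channels)</code></pre>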
<div class="captioned-image-container"><figure><figcaption class="image-caption">Architecture of column-wise gated self-attention</figcaption></figure></div><h2>MSA Transition</h2><p>After row-wise and column-wise attention, the MSA stack contains a 2-layer MLP as the transition layer.
The intermediate layer of this MLP expands the original number of channels by a factor of 4.</p><div class="captioned-image-container"><figure><figcaption class="image-caption">Architecture of MSA transition</figcaption></figure></div>
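<p>A tiny sketch of that transition layer, with an illustrative channel count:</p><pre><code># Position-wise 2-layer MLP with a 4x channel expansion (illustrative sketch; the real
# block also applies layer normalization and uses learned weights).
import numpy as np

c = 16                                        # channels per MSA cell (made-up value)
W1, W2 = np.random.randn(c, 4 * c), np.random.randn(4 * c, c)

def transition(x):
    return np.maximum(x @ W1, 0.0) @ W2       # expand to 4c, ReLU, project back to c

cell = np.random.randn(c)
updated_cell = cell + transition(cell)        # applied with a residual connection</code></pre>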
update for the pair representation. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h9zR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h9zR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 424w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 848w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 1272w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h9zR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png" width="1258" height="335" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:335,&quot;width&quot;:1258,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:73370,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h9zR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 424w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 848w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 1272w, https://substackcdn.com/image/fetch/$s_!h9zR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff34c331b-57d9-427a-81b5-354640f9c46b_1258x335.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" 
stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Integration of MSA representation with pair representation</figcaption></figure></div><p>In particular, this step grabs two vectors of representations for residues i and j, where the vectors span the representations of all sequences included in the MSA. The outer product step creates a matrix of all dot product combinations. In the below image, you can think of u_1 as "sequence 1, residue i", u_2 as "sequence 2, residue i", and so on. Similarly, you can think of v_1 as "sequence 1, residue j", v_2 as "sequence 2, residue j", and so on. Since these dot products gives us a measure of the <strong>similarity</strong> between the representation of "sequence k, residue i" and "sequence m, residue j", we can think of it as a matrix that captures the pairwise similarities between all residues at positions i and j in the MSA.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WKYv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WKYv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 424w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 848w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 1272w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WKYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png" width="335" height="133" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/09e53740-1785-4625-85b7-1448bff4b29f_335x133.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:133,&quot;width&quot;:335,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:7687,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WKYv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 424w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 848w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 1272w, https://substackcdn.com/image/fetch/$s_!WKYv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F09e53740-1785-4625-85b7-1448bff4b29f_335x133.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>This matrix ends up being of shape (s, c, c), where s is the number of sequences and each c dimension denotes the number of features for residue i and j's representations, respectively. AlphaFold then takes a mean over the s dimension of the matrix. What this means intuitively is that we average the pairwise similarity of residue i and residue j across <em>all possible pairs</em> of sequences in the MSA matrix. This collapses matrix from shape (s, c, c) to shape (c, c).</p><p>The last step projects the features to from (c, c) to c_z. This allows them to be added to each entry in the pairwise representation.</p><h2>Conclusion - What is the MSA Representation Doing in AlphaFold?</h2><p>So, putting this all together - the MSA steps compute a representation that optimally captures similarity of residues, both:</p><p>1. <strong>Within sequences</strong> by using row-wise attention to attend across amino acids inside a given sequence</p><p>2. <strong>Across sequences</strong> by using column-wise attention to attend across sequences for a given amino acid index</p><p>This representation is then used to generate a measure of similarity between all possible residue pairs in the MSA representation. We then update the pair representation of the target sequence by adding in these values. In essence, we use the MSA to "find out" which residues are similar to which other residues, and then add this information to the pair representation so that the structure module can guess at which residues are in contact with one another (based on the fact that they co-evolve and are therefore similar in the MSA representation). 
This allows for highly accurate structure prediction, incorporating information from the evolutionary tree to infer the optimal folded structure of a given input protein.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Musings by Chris Hayduk is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Reverse Goal Planning]]></title><description><![CDATA[Planning Backwards to Create a Roadmap for Success]]></description><link>https://www.chrishayduk.com/p/small-goals-to-accomplish-big-dreams</link><guid isPermaLink="false">https://www.chrishayduk.com/p/small-goals-to-accomplish-big-dreams</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Tue, 11 Oct 2022 15:56:14 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uP91!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!uP91!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!uP91!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uP91!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uP91!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uP91!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!uP91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg" width="1400" height="931" 
data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/e54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:931,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!uP91!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 424w, https://substackcdn.com/image/fetch/$s_!uP91!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 848w, https://substackcdn.com/image/fetch/$s_!uP91!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!uP91!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fe54d5c65-3655-4866-8748-0c0025e85b33_1400x931.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@glenncarstenspeters?utm_source=medium&amp;utm_medium=referral">Glenn Carstens-Peters</a> on <a href="https://unsplash.com/?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure></div><p>Think of the biggest goal you have for your life. It probably feels out of reach and intimidating, something abstract and totally unattainable. 
When you begin to think of how to accomplish your goal, you might feel lost, like you&#8217;ve been asked to find your way to a remote location without a map. This feeling of disorientation &#8212; the lack of an idea, any idea, to get from where you are to where you want to be &#8212; can cause you to feel hopeless and paralyze you in the face of real, profound change in your life. But it doesn&#8217;t need to be this way.</p><p>The anxiety surrounding our largest goals &#8212; whether they be career-, family-, or spiritually-focused &#8212; centers around the fact that given any starting point in life, there are an infinite number of paths forward. The more difficult, concrete, and far off into the future the goal is, the fewer of these branching paths will reach the destination you desire. Identifying the correct path, or even seeing that one exists, can feel nearly impossible.</p><p>This is because most people try to plan forward to their goals, instead of backward from them.</p><p>Imagine an event in your life, something big that you have accomplished. Think back to when you were in the process of completing that goal &#8212; the uncertainty you may have felt. Now look backward from today, and feel how certain the path forward looks in hindsight. When looking backward at the path, the route seems obvious and safe. When looking forward, the path seems obscure and treacherous. The objective, then, is to imagine yourself having already completed the goal and then imagine the steps it took you to get there.</p><p>In this way, we can work iteratively backward from your goal to the present day, littering the path forward with checkpoints and benchmarks that provide clear guideposts along your path.</p><p>For example, let&#8217;s say your goal is to learn Spanish to fluency. This is an extremely large, abstract goal that involves multiple years of effort. Looking forward from today to your goal, reaching the point of fluency can feel impossible with no clear way to achieve it. Let&#8217;s take the opposite approach, and imagine that you already speak Spanish. What would someone who learned Spanish to fluency have already done? They&#8217;ve probably read something like Don Quijote.</p><p>We now have our first checkpoint along the path to fluency&#8202;&#8212;&#8202;reading Don Quijote. This checkpoint now becomes our new endpoint in this iterative process. So now we must ask ourselves, what would someone who has read Don Quijote have already accomplished? That&#8217;s a hard book, so they probably started somewhere easier. Maybe they read through the Harry Potter series to grow their vocabulary and grammar knowledge. There are seven Harry Potter books, giving us a further seven checkpoints along the journey.</p><p>Now, imagine you&#8217;re someone who has already read the first Harry Potter book in Spanish. What would you have already accomplished? You probably need to know a few thousand words and some grammar to read that book, so you would have likely memorized the 3000 most common words in Spanish and worked through a grammar book.</p><p>This becomes our new objective, and we&#8217;ve now reached one that can be acted upon today as a small, daily task. We can start by buying a grammar workbook and finding a list of vocabulary to memorize, and set a goal to work through a small portion of them every day. 
Once we finish the vocabulary list and grammar book, we can continue by working towards the other objectives we&#8217;ve outlined during this exercise, following the journey we laid out.</p><p>In this iterative process, working backward from our true goal, we can develop a plan of checkpoints that carve out the path forward for us. By looking backward into an imagined past, we can determine our very real future, making the vague and abstract, clear and concrete.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Musings by Chris Hayduk! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Role of Experimentation in Fundamental Machine Learning Research]]></title><description><![CDATA[I recently watched an excellent talk from Dr.]]></description><link>https://www.chrishayduk.com/p/the-role-of-experimentation-in-fundamental</link><guid isPermaLink="false">https://www.chrishayduk.com/p/the-role-of-experimentation-in-fundamental</guid><dc:creator><![CDATA[Chris Hayduk]]></dc:creator><pubDate>Tue, 11 Oct 2022 15:52:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!0MV8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0MV8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0MV8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!0MV8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg" width="1400" height="878" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:878,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0MV8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 424w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 848w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!0MV8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F48ee08cc-3245-4d27-9435-f640dadf4d14_1400x878.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@jkoblitz?utm_source=medium&amp;utm_medium=referral">Julia Koblitz</a> on <a 
href="https://unsplash.com/?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure></div><p>I recently watched <a href="https://players.brightcove.net/679256133001/NkgrDczuol_default/index.html?videoId=6291482418001">an excellent talk from Dr. Tom Goldstein</a> given to the National Science Foundation in which he discussed the current limitations of machine learning (ML) research and a path forward to correct those issues. The fundamental thrust of his argument &#8212; that ML research needs to focus more on experimentation and less on theory &#8212; addresses many of the shortcomings in machine learning research and taps on several interesting ideas in theories of the mind, complex systems, and the development of true artificial intelligence (AI).</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.chrishayduk.com/subscribe?"><span>Subscribe now</span></a></p><h1><strong>Taking Lessons from Science</strong></h1><p>In the current fundamental ML research paradigm, experiments tend to be informed by theory. In particular, many researchers attempt to advance machine learning through a math-style research process, in which new theorems are deduced logically from existing theorems, lemmas, and corollaries in the machine learning corpus of knowledge. Experimental studies then attempt to validate these theories, potentially using a toy dataset to demonstrate the theory&#8217;s predictions. In this paradigm, it is unacceptable to publish experimental results that are not supported by theory. In Dr. Goldstein&#8217;s talk, he gives examples of two papers he worked on which produced surprising and counterintuitive results but were based upon empirical experimentation rather than rigorous proof. As a result, both papers struggled to be accepted at reputable conferences. However, theoretical results which are contradicted by experimental evidence tend to still be published despite the apparent inconsistency.</p><p>By contrast, the experiment-based approach used in science inverts the hierarchy in machine learning. Theory becomes subservient to experiment &#8212; the goal of theory switches to explaining what we observe in the real world. Theory is useless if it does not align with already-existing experimental results, and previously accepted theories are tossed out if new experimental results refute them. Porting this paradigm to machine learning research would result in a landscape where most progress comes from attempting new ideas on real-world datasets. Theories would then retrospectively attempt to tie together experimental results from trying new network architectures, hyperparameters, and preprocessing techniques, developing an explanation for the results that we had already verified empirically. This would not only produce theories that are more consistent with how machine learning operates in the real world, but it would also unshackle applied machine learning progress from the constraints of existing theory. 
Fundamental research would now be directly oriented towards demonstrating new ideas empirically on real-world datasets, further accelerating in the application of machine learning.</p><p>While very appealing, this line of thought does raise a key question: why is machine learning research better suited for experimentation-based methods rather than theory-based methods?</p><h1><strong>Complexity Theory</strong></h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Vuz8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Vuz8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 424w, https://substackcdn.com/image/fetch/$s_!Vuz8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 848w, https://substackcdn.com/image/fetch/$s_!Vuz8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 1272w, https://substackcdn.com/image/fetch/$s_!Vuz8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Vuz8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png" width="1400" height="787" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:787,&quot;width&quot;:1400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Vuz8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 424w, https://substackcdn.com/image/fetch/$s_!Vuz8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 848w, 
https://substackcdn.com/image/fetch/$s_!Vuz8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 1272w, https://substackcdn.com/image/fetch/$s_!Vuz8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F531ad1d0-6e61-4f6f-ac71-74fb462406e4_1400x787.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Before we dive into why top-down theories of machine learning are so difficult to construct through deductive logic, let&#8217;s take a brief detour into complexity theory. According to <a href="https://en.wikipedia.org/wiki/Complexity">Wikipedia</a>, complexity is defined as follows,</p><blockquote><p><em>Complexity characterizes the behavior of a system or model whose components interact in multiple ways and follow local rules, meaning there is no reasonable higher instruction to define the various possible interactions.</em></p></blockquote><p>Essentially, complex systems are built upon agents acting under rather simple rules. These agents are typically constrained by distance, available information, or other limiting factors. As a simplified example, think about the interactions between people in an economy. Each person&#8217;s actions are constrained by their geographic setting, their limited knowledge of the world around them, and their available resources. If we consider an economy with no availability to credit, each agent&#8217;s actions essentially consist of buying or selling goods &amp; services within this framework of constraints. While the action space (buy &amp; sell) is rather small and the constraints placed upon each individual agent (knowledge, geographic distance, available resources) limit the scope of their actions significantly, the interactions between the agents and their decisions produce massively complex economies. 
To make it more concrete, while describing the economic decisions available to a merchant or farmer in the Roman Empire might be rather straightforward, describing the economic machine of Rome, which essentially amounts to interactions between many merchants and farmers, is an extremely difficult task.</p><p>This property of complex systems, in which highly complex behaviors develop from the interactions between constrained agents operating from a small set of possible actions, is known as <a href="https://en.wikipedia.org/wiki/Emergence">emergence</a>. This behavior makes describing complex systems in terms of top-down theory extremely difficult, if not impossible. These systems must be defined in terms of the interactions between their constituent parts &#8212; only then does the global behavior become clear. This is a key component for the &#8220;intelligent&#8221; behavior of many systems we observe in nature and is essential to understanding the emergent properties of neural networks.</p><h1><strong>Biological &amp; Artificial Neural Networks as Complex Systems</strong></h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!K_ty!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!K_ty!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!K_ty!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg" width="1280" height="720" data-attrs="{&quot;src&quot;:&quot;https://bucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com/public/images/13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!K_ty!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!K_ty!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F13b3d29c-1e77-402b-886c-53cbc48950e3_1280x720.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The brain, the most powerful thinking machine known, consists of about 86 billion neurons. Each individual neuron behaves rather simply &#8212; it receives input from its environment in the form of pressure, stretch, chemical transmitters, and changes of the electric potential across the cell membrane. This input then determines whether the neuron &#8220;turns on&#8221; or not. That is, the voltage of the cell membrane rapidly rises and falls, creating an electrical spike in response to the input. The key piece of the brain, and the reason why neurons are not simple voltage spiking machines, is that each neuron is connected to thousands of other neurons through synapses. In this way, the electrical spikes that occur in one neuron propagate to thousands of others, either inhibiting or facilitating spikes in those neurons. 
In turn, those neurons&#8217; signals propagate to other neurons, creating a cascade of neural activation.</p><p>These cascades of neural activation create complex behavior, such as your ability to read this article while simultaneously being conscious of yourself, your thoughts, and your emotions, despite starting from a rather simple process &#8212; that of the activation of a single neuron. This is best captured by Bassett and Gazzaniga in their 2011 paper <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3170818/#:~:text=Complexity%20and%20multiscale%20organization&amp;text=The%20brain%20is%20a%20complex,and%20biological%20basis%20of%20cognition.">Understanding Complexity in the Human Brain</a>:</p><blockquote><p><em>Perhaps most simply, emergence &#8212; of consciousness or otherwise &#8212; in the human brain can be thought of as characterizing the interaction between two broad levels: the mind and the physical brain. To visualize this dichotomy, imagine that you are walking with Leibniz through a mill. Consider that you can blow the mill up in size such that all components are magnified and you can walk among them. All that you find are mechanical components that push against each other but there is little if any trace of the function of the whole mill represented at this level. This analogy points to an important disconnect in the mind&#8211;brain interface: although the material components of the physical brain might be highly decomposable, mental properties seem to be fundamentally indivisible.</em></p></blockquote><p>When we zoom into a single neuron, the functionality appears fairly straightforward, but the emergent properties of the mind are completely obscured. It is the interactions between massive numbers of neurons that drive the highly complex behavior exhibited by humans and other animals with large numbers of interacting neurons and neural connections. In this way, top-down theories of intelligence and brain function have been thwarted &#8212; developing a compact theory of brain function is akin to developing a compact theory describing the interactions of hundreds of millions of people in the US economy.</p><p>Similarly, artificial neural networks are composed of artificial neurons, loosely based on their biological equivalents. Connections between neurons in the network are established, similar to synapses in the human brain, allowing the artificial neurons to transmit signals to one another. Deep neural networks, the most successful example of machine learning in the field today, stacks multiple layers of neurons between the input and output layers. These additional layers allow for massive numbers of neurons and connections (the language model GPT-3 has about 175 billion parameters, roughly corresponding to the number of neurons and connections available in the model). These huge networks, as in the case of biological neural networks, exhibit intelligent behavior despite relatively simple components. This is captured by Testolin, Piccolini, and Suweis in their 2018 paper <a href="https://arxiv.org/abs/1809.10941">Deep Learning Systems as Complex Networks</a>,</p><blockquote><p><em>&#8230;in deep learning even knowing perfectly how a single neuron (node) of the network works does not allow to understand how learning occurs, why these systems work so efficiently in many different tasks, and how they avoid getting trapped in configurations that deteriorate computational performance. 
In these models, interactions play a crucial role during the learning process, therefore a step forward toward a more comprehensive understanding of deep learning systems is their study also in terms of their emerging topological properties.</em></p></blockquote><p>The neurons themselves are governed by quite simple laws, but the interactions between those neurons produce incredibly complex behavior, such as automatic speech-to-text programs, self-driving cars, and facial recognition software. Given what we know about complex systems and the property of emergence, it seems reasonable that deep neural networks would be difficult to describe without accounting for the interactions between the billions of neural connections that make up the network. The quest for top-down theories produced from deductive logical reasoning may be fruitless in this complexity-laden case.</p><h1><strong>Experimental Research as a Solution to Understanding Complexity</strong></h1><p>Now, returning to our original topic, we can begin to see how experimental methods address the fundamental issues with understanding complex systems. Deep neural networks and other machine learning methods based on large-scale interactions between simple components do not lend themselves well to top-down theoretical understanding. However, we can tease out the emergent behavior of these systems by using real-world datasets, choosing a particular behavior we would like to examine, and constructing experiments to understand the emergent behavior of the network in question.</p><p>For example, we may construct experiments to see how the number of iterations it takes for the weights of a convolutional neural network to converge changes as the quality of the image dataset increases or decreases. While this experimental result would not explain fundamentally why the network&#8217;s convergence rate behaves the way it does, it provides us insight into the emergent behavior of the complex system &#8212; namely, the convolutional neural network itself. With enough experimental results of this nature, we may be able to piece together insights to begin understanding the behavior of these networks and how they respond to varying stimuli. I believe this type of understanding as put forth by Dr. Goldstein, while lacking from the perspective of theoretical explanations and justifications, will facilitate the way forward for significant improvements in machine learning, both in academia and industry.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.chrishayduk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Musings by Chris Hayduk is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item></channel></rss>