A Perspective on the Limitations of Language Modeling
Probing the upper limits of compute required for AGI
Scenario #1: Imagine we want to model a sequence of coin flips. Our goal is to accurately predict the result of the next coin flip given the history of all previous flips in the sequence. Suppose in this scenario that we have no idea that coin flips are independent and that each flip is drawn from a Bernoulli distribution with probability 0.5. So, without any strong or informed prior on what this process looks like, we spend billions of dollars of compute to train up a massive autoregressive sequence model to predict the next flip in each sequence. This model will approach, but never exceed, 50% accuracy in token prediction on the test set given the nature of the underlying distribution.
Scenario #2: Now instead imagine that we have enough compute to model each coin toss analytically. We can simulate the impact of air resistance with CFD, the angle of the coin as it leaves the tosser’s thumb, etc. Given enough compute and sufficient accuracy of the measurements for the initial conditions, we should be able to predict the result of the coin toss with near 100% accuracy using the standard classical mechanics that have allowed us to put satellites into orbit and men on the moon.
These two scenarios deal with the same underlying process (predicting a coin flip) but address that process from two totally different perspectives. Scenario #1 approaches the prediction task from an external perspective: it attempts to model the outputs of the process without understanding any of the internal mechanics that generate them. In the case of the coin flip, no matter how well we model from this external perspective, we can never achieve better than 50% accuracy on a large enough test set. The fundamental limit is not in the data or in the model but in the perspective from which we are modeling.
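A toy experiment makes this concrete. The sketch below is a minimal stand-in for the massive autoregressive model described above (a simple frequency-based next-flip predictor of my own construction, not anything from the scenarios themselves): it conditions on the previous k flips of fair, independent coins, and no matter how long the context, test accuracy hovers around 50%.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

def flips(n):
    """n independent, fair coin flips as a string of 'H'/'T'."""
    return "".join(random.choice("HT") for _ in range(n))

def train_ngram(seq, k):
    """Count which outcome follows each length-k context in the training sequence."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        counts[seq[i - k:i]][seq[i]] += 1
    return counts

def accuracy(counts, seq, k):
    """Predict the majority continuation for each context (default 'H' if unseen)."""
    correct = 0
    for i in range(k, len(seq)):
        ctx = seq[i - k:i]
        pred = counts[ctx].most_common(1)[0][0] if counts[ctx] else "H"
        correct += pred == seq[i]
    return correct / (len(seq) - k)

train, test = flips(200_000), flips(50_000)
for k in (1, 4, 8):
    print(f"context length {k}: test accuracy ~ {accuracy(train_ngram(train, k), test, k):.3f}")
# Every context length prints ~0.500: the flips are independent, so no amount of
# history (or compute) lets an external, sequence-only model beat chance.
```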
By contrast, in Scenario #2, we approach the prediction task using an internal perspective. We analyze the causal factors that contribute to the outcome of each coin flip and model those factors, allowing us to make predictions based on the mechanics underlying the flip rather than simply using the sequence itself. Here, given enough compute and sufficiently sensitive measurement devices, we can far exceed the 50% accuracy ceiling that limited the external perspective.
LLMs model human thought using the external perspective described above and, as such, will have a large amount of error that could be avoided by modeling thought from an internal perspective.
In order to move to modeling thought from an internal perspective, there are two promising avenues:
Combining symbolic AI with large language models
Simulation of the human connectome
Symbolic AI comprises formal logic, proof verification systems, knowledge graphs, and more. These approaches attempt to model the mechanisms of human thought directly, focusing in particular on producing logically correct deductions (in the parlance of Kahneman’s Thinking, Fast and Slow, these attempt to model System 2 thinking). Advances in combining these approaches with large language models will probably come from algorithmic and engineering work rather than from scaling up compute. Thus, if this approach works, we should expect to see artificial general intelligence (AGI) without much of a cost in increased scaling. Since this article is most interested in probing the upper limit of compute required for AGI, we will ignore this case and focus on modeling the human connectome.
Simulating the Brain
In the case that merging symbolic AI with LLMs does not work, our clearest avenue towards AGI would be mapping and simulating the human connectome. A simulation of the actual underlying hardware of the brain — the billions of neurons and trillions of synapses that comprise it, as well as all of their interactions — should produce thought through a faithful reconstruction of its mechanics, in the same way that computer simulations can map the trajectory of a real rocket. And if we base that simulation on the connectome of someone like Albert Einstein and feed it our accumulated store of knowledge, the result should be a general intelligence that can solve difficult, out-of-distribution problems.
The above paragraph rests on a number of critical assumptions (namely, that we can map the full human connectome at a high level of detail and that we can develop well-formulated models for human neurons and synapses). For the sake of this exercise, we will ignore the significant work that remains to be done in those domains and instead answer the question — if we already had a complete map of the human mind, how much compute would we need to run the simulation and generate thought? We’ll answer this question at a rough order of magnitude level, but it should give us a picture of when the amount of compute needed to run this simulation will become available to large corporations and research labs.
To accurately simulate the human connectome, we need to consider several factors:
Neuronal Complexity: Each neuron is a complex computational unit with intricate dynamics. Simulating a single neuron with high fidelity requires significant computational power.
Synaptic Plasticity: The strength and nature of connections between neurons are constantly changing. Modeling this plasticity adds another layer of complexity to the simulation.
Temporal Resolution: Neural processes occur on multiple timescales, from milliseconds to hours. A comprehensive simulation must account for these various temporal dynamics.
Spatial Resolution: The spatial arrangement of neurons and their connections is crucial for understanding brain function. High-resolution mapping of the connectome is essential for accurate simulation.
Given that we would like a high-fidelity simulation of the brain that can produce emergent, intelligent thought, we will tend towards higher complexity along all four of these factors. Our back-of-the-envelope calculations will assume a highly complex neuronal model, long- and short-term synaptic plasticity, a temporal resolution of 1 millisecond, and full-connectome modeling, including all 100 billion neurons and 600 trillion synapses.
Neurons
Let's focus on a single-neuron model as an example, using the Hodgkin-Huxley (HH) model, which is one of the more computationally intensive but biologically realistic models:
Membrane Potential Calculation: The HH model uses the differential equation C(dV/dt) = -g_Na(V - E_Na) - g_K(V - E_K) - g_L(V - E_L) + I_ext. Solving this numerically (e.g., using the Euler method) requires:
3 subtractions
3 multiplications
3 additions
1 division (for dt)
Total: ~10 floating point operations per timestep
Ion Channel Dynamics: The sodium and potassium channels are governed by the gating variables m, h, and n, each following an equation of the form dm/dt = α_m(1 - m) - β_m*m (and similarly for h and n). Each gate variable requires:
2 exponential calculations (~10 floating point operations each)
4 multiplications
2 additions/subtractions
Total: ~30 floating point operations per gate, ~90 floating point operations for all three
Conductance Calculations: g_Na = g_Na_max * m^3 * h and g_K = g_K_max * n^4. These require:
5 multiplications
2 exponentiations
Total: ~20 floating point operations
Current Calculations: I_Na = g_Na * (V - E_Na), etc. Requires:
3 subtractions
3 multiplications
Total: ~6 floating point operations
Summing these up, we get approximately 126 floating point operations per timestep for a single-compartment HH model. However, this is a significant underestimate for a realistic neuron simulation:
Multiple Compartments: Real neurons aren't single compartments. A moderately detailed model might have 10-100 compartments, each requiring its own HH-like calculations. Total: 126 * 10 to 126 * 100 = 1,260 to 12,600 floating point operations
Synaptic Integration: A typical neuron might have 1,000-10,000 synapses. For each active synapse:
Calculate postsynaptic current (~5 floating point operations)
Update synaptic state (~5 floating point operations)
If 10% of synapses are active in a given timestep: Total: (100 to 1,000 active synapses) * 10 floating point operations = 1,000 to 10,000 floating point operations
Intracellular Signaling: Calcium dynamics and second messenger systems might add another 100-500 floating point operations, depending on the level of detail.
Adding these up, we get a range of about 2,360 to 23,100 floating point operations per neuron per timestep.
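To make the per-timestep arithmetic concrete, here is a minimal single-compartment Hodgkin-Huxley forward-Euler update in Python, using the standard squid-axon parameters in the modern -65 mV resting convention. It is a sketch of the operations being counted above, not a production simulator: a realistic neuron would repeat this per compartment and add the synaptic and intracellular terms that push the total into the 2,360 to 23,100 range.

```python
import math

# Standard Hodgkin-Huxley parameters (squid axon, modern -65 mV resting convention).
C_m, g_Na, g_K, g_L = 1.0, 120.0, 36.0, 0.3    # µF/cm^2, mS/cm^2
E_Na, E_K, E_L = 50.0, -77.0, -54.387          # mV

# Gate rate constants (1/ms); each evaluation costs an exponential plus a few
# arithmetic ops, which is where the ~30 FLOPs-per-gate figure above comes from.
def alpha_m(V): return 0.1 * (V + 40.0) / (1.0 - math.exp(-(V + 40.0) / 10.0))
def beta_m(V):  return 4.0 * math.exp(-(V + 65.0) / 18.0)
def alpha_h(V): return 0.07 * math.exp(-(V + 65.0) / 20.0)
def beta_h(V):  return 1.0 / (1.0 + math.exp(-(V + 35.0) / 10.0))
def alpha_n(V): return 0.01 * (V + 55.0) / (1.0 - math.exp(-(V + 55.0) / 10.0))
def beta_n(V):  return 0.125 * math.exp(-(V + 65.0) / 80.0)

def hh_step(V, m, h, n, I_ext, dt):
    """One forward-Euler update of a single HH compartment.

    This is roughly the ~126 floating point operations tallied above; a realistic
    neuron repeats it per compartment and adds synaptic and intracellular terms.
    """
    g_na, g_k = g_Na * m**3 * h, g_K * n**4                          # conductances
    I_ion = g_na * (V - E_Na) + g_k * (V - E_K) + g_L * (V - E_L)    # ionic currents
    V_new = V + dt * (I_ext - I_ion) / C_m                           # membrane potential
    m_new = m + dt * (alpha_m(V) * (1.0 - m) - beta_m(V) * m)        # gate updates
    h_new = h + dt * (alpha_h(V) * (1.0 - h) - beta_h(V) * h)
    n_new = n + dt * (alpha_n(V) * (1.0 - n) - beta_n(V) * n)
    return V_new, m_new, h_new, n_new

# Note: forward Euler on HH typically needs a sub-millisecond dt for stability,
# which only strengthens the compute estimates that follow.
V, m, h, n = -65.0, 0.053, 0.596, 0.317        # approximate resting-state values
for _ in range(5000):                           # 50 ms of simulated time, dt = 0.01 ms
    V, m, h, n = hh_step(V, m, h, n, I_ext=10.0, dt=0.01)
print(f"membrane potential after 50 ms of constant drive: {V:.1f} mV")
```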
Synapses
We’ll now estimate the number of floating point operations needed to simulate a single synapse, accounting for both transmission and plasticity.
Basic Synaptic Transmission: I_syn = g_syn * s * (V_post - E_rev), with the gating variable ds/dt = α * T * (1 - s) - β * s. This requires:
2 multiplications
2 subtractions
1 addition
1 division (for dt)
Total: ~6 floating point operations
Short-Term Plasticity (STP):
Facilitation: dF/dt = (1 - F)/τF + f * δ(t - t_spike)
Depression: dD/dt = (1 - D)/τD - d * D * δ(t - t_spike)
Synaptic efficacy: A = A0 * F * D
4 subtractions
3 divisions
3 multiplications
2 additions
Total: ~12 floating point operations
Long-Term Plasticity (LTP/LTD):
NMDA receptor activation: I_NMDA = g_NMDA * s_NMDA * B(V) * (V_post - E_NMDA), where B(V) = 1 / (1 + exp(-0.062 * V_post) * [Mg2+] / 3.57)
4 multiplications
2 subtractions
1 division
1 exponentiation
Total: ~18 floating point operations
Calcium dynamics: d[Ca2+]/dt = -[Ca2+]/τCa + γ * I_NMDA + baseline
1 division
2 multiplications
1 addition
1 subtraction
Total: ~5 floating point operations
CaMKII activation: dCaMKII/dt = k1 * [Ca2+]^n * (1 - CaMKII) - k2 * CaMKII
2 multiplications
1 subtraction
1 exponentiation
1 division
Total: ~15 floating point operations
Weight update rule (based on CaMKII): dw/dt = η * (CaMKII - θp)+ - η * (CaMKII - θd)-, where ()+ and ()- denote positive and negative rectification
2 subtractions
2 comparisons
2 multiplications
Total: ~8 floating point operations
Homeostatic Plasticity: w = w * (1 + η_homeo * (target_activity - actual_activity))
1 subtraction
2 multiplications
1 addition
Total: ~5 floating point operations
Structural Plasticity (simplified): P_form = sigmoid(local_activity - threshold), P_elim = 1 - P_form
1 subtraction
1 exponentiation (for sigmoid)
1 division (for sigmoid)
1 subtraction (for P_elim)
Total: ~14 floating point operations
Neuromodulation (simplified, e.g., dopaminergic influence on plasticity): plasticity_factor = baseline + k * [dopamine]
1 multiplication
1 addition
Total: ~2 floating point operations
Summing these components: 6 + 12 + 18 + 5 + 15 + 8 + 5 + 14 + 2 = 85 floating point operations per synapse per timestep.
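To ground this tally, here is a minimal sketch of one synapse updated once per 1 ms timestep, covering transmission, short-term facilitation/depression, NMDA-gated calcium entry, a CaMKII-style weight update, homeostatic scaling, and a one-line dopaminergic factor. Structural plasticity is omitted, the calcium sign convention is simplified, and every parameter value is an illustrative placeholder rather than a fitted biophysical constant.

```python
import math

# Illustrative placeholder parameters (not fitted biophysical constants).
P = dict(g_syn=1.0, E_rev=0.0, alpha=0.5, beta=0.1,            # transmission (per ms)
         tau_F=100.0, f=0.2, tau_D=200.0, d=0.1,               # short-term plasticity (ms)
         E_NMDA=0.0, Mg=1.0, tau_Ca=50.0, gamma=0.01,          # NMDA / calcium proxy
         k1=5.0, k2=0.1, n_hill=4, eta=0.01, theta_p=0.5, theta_d=0.2,
         eta_homeo=0.001, target=0.1, k_da=0.5, baseline=1.0)

def synapse_step(state, V_post, T, spike, dopamine, dt=1.0):
    """One 1 ms update of a simplified synapse: transmission, STP, and an LTP/LTD proxy."""
    s, F, D, Ca, CaMKII, w = (state[k] for k in ("s", "F", "D", "Ca", "CaMKII", "w"))

    # Short-term facilitation/depression; spikes nudge F up and D down (A0 = 1).
    F += dt * (1 - F) / P["tau_F"] + (P["f"] if spike else 0.0)
    D += dt * (1 - D) / P["tau_D"] - (P["d"] * D if spike else 0.0)
    A = F * D                                        # synaptic efficacy from the text

    # Basic transmission, scaled by the short-term efficacy for illustration.
    I_syn = P["g_syn"] * A * s * (V_post - P["E_rev"])
    s += dt * (P["alpha"] * T * (1 - s) - P["beta"] * s)

    # NMDA voltage dependence and a calcium proxy (sign convention simplified,
    # so the magnitude of the NMDA current drives calcium entry).
    B = 1.0 / (1.0 + math.exp(-0.062 * V_post) * P["Mg"] / 3.57)
    I_NMDA = P["g_syn"] * s * B * (V_post - P["E_NMDA"])
    Ca = max(Ca + dt * (-Ca / P["tau_Ca"] + P["gamma"] * abs(I_NMDA)), 0.0)

    # CaMKII activation and the rectified weight-update rule from the text.
    CaMKII += dt * (P["k1"] * Ca ** P["n_hill"] * (1 - CaMKII) - P["k2"] * CaMKII)
    dw = P["eta"] * max(CaMKII - P["theta_p"], 0.0) - P["eta"] * max(P["theta_d"] - CaMKII, 0.0)

    # Dopaminergic plasticity factor and homeostatic scaling (s as a crude activity proxy).
    plasticity_factor = P["baseline"] + P["k_da"] * dopamine
    w = (w + plasticity_factor * dw) * (1 + P["eta_homeo"] * (P["target"] - s))

    return dict(s=s, F=F, D=D, Ca=Ca, CaMKII=CaMKII, w=w), I_syn

state = dict(s=0.0, F=1.0, D=1.0, Ca=0.0, CaMKII=0.0, w=0.5)
for t in range(100):                                  # 100 ms at 1 ms resolution
    presyn_spike = (t % 20 == 0)                      # a presynaptic spike every 20 ms
    state, I = synapse_step(state, V_post=-65.0, T=1.0 if presyn_spike else 0.0,
                            spike=presyn_spike, dopamine=0.2)
print(f"synaptic weight after 100 ms: {state['w']:.4f}")
```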
Full Estimate
Assuming we simulate each neuron and synapse at a temporal resolution of 1 millisecond, that each neuron update requires roughly 20,000 floating point operations per timestep (rounding up the range estimated above), and that each synapse update requires 85 floating point operations per timestep, we can make a rough calculation:
100 billion neurons * 20,000 floating point operations/neuron * 1000 timesteps/second = 2 * 10^18 FLOPS
600 trillion synapses * 85 floating point operations/synapse * 1000 timesteps/second = 5.1 * 10^19 FLOPS
Total: Approximately 5.3 * 10^19 FLOPS or 53 exaFLOPS
As a result, simulating the human connectome for approximately one week would require more floating point operations than the entire GPT-4 training run. Keep in mind that simulation here plays the role that inference plays for an LLM: imagine if serving a single instance of GPT-4 for one week consumed OpenAI's entire training compute budget. This is a monumental amount of compute and beyond the economic feasibility of any current company (even if the human connectome had been fully mapped, which it is currently far from being).
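The headline numbers are easy to check with a few lines of arithmetic; note that the GPT-4 training figure used for comparison (on the order of 2 * 10^25 floating point operations) is a commonly cited outside estimate, not a disclosed number.

```python
NEURONS  = 100e9          # assumed neuron count
SYNAPSES = 600e12         # assumed synapse count
FLOP_PER_NEURON_STEP  = 20_000   # rounded up from the 2,360-23,100 range above
FLOP_PER_SYNAPSE_STEP = 85
STEPS_PER_SECOND      = 1_000    # 1 ms temporal resolution

neuron_rate  = NEURONS  * FLOP_PER_NEURON_STEP  * STEPS_PER_SECOND   # 2.0e18 FLOPS
synapse_rate = SYNAPSES * FLOP_PER_SYNAPSE_STEP * STEPS_PER_SECOND   # 5.1e19 FLOPS
total_rate   = neuron_rate + synapse_rate                            # ~5.3e19 FLOPS

week_of_ops = total_rate * 7 * 24 * 3600          # total operations for one week
GPT4_TRAINING_FLOP = 2e25                         # rough outside estimate, for scale only
print(f"sustained rate: {total_rate:.2e} FLOPS (~{total_rate / 1e18:.0f} exaFLOPS)")
print(f"one week of simulation: {week_of_ops:.2e} floating point operations")
print(f"ratio to a ~2e25-FLOP GPT-4 training run: {week_of_ops / GPT4_TRAINING_FLOP:.1f}x")
```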
From another perspective, we can look at this compute requirement in terms of the number of H100s it would take. FP64 is typically used for scientific simulation work, and the NVIDIA H100 can perform 34 teraFLOPS = 3.4 * 10^13 FLOPS in FP64 mode. Hence, our estimate for the human brain would require approximately 1,550,000 H100s to run, which is 2 orders of magnitude more than the largest deployed H100 clusters today. There is not much data on the growth rate of FP64 FLOPS, but FP32 FLOPS are doubling every 2.3 years. Since an FP64 multiplier unit takes roughly five times the area of an FP32 multiplier, doubling FP64 performance requires roughly 2 * 5 = 10x the transistor budget that an FP32 doubling requires. At the FP32-implied pace of doubling every 2.3 years, that works out to an FP64 doubling time of roughly 2.3 * log2(10) ≈ 7.6 years, which would imply that FP64 performance will increase by one order of magnitude (i.e., 10x the performance of the H100) in about 25 years.
If the above trends and assumptions hold (a very big if), it will be roughly 50 years before FP64 performance has increased to the point where a company spending about as much as today's largest AI training clusters cost could simulate the human brain. If algorithmic advances or further research in computational neuroscience supports a transition from FP64 to FP32, this 50-year timeline could be compressed to roughly 15 years.
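The hardware timeline follows from the same kind of back-of-the-envelope arithmetic; the 2.3-year FP32 doubling trend and the 5x FP64 area penalty are the assumptions doing all of the work here.

```python
import math

TOTAL_RATE        = 5.3e19   # sustained simulation rate from above, FLOPS
H100_FP64         = 34e12    # H100 FP64 (non-tensor) throughput, FLOPS
FP32_DOUBLING_YRS = 2.3      # assumed FP32 performance doubling time
FP64_AREA_PENALTY = 5        # assumed FP64-vs-FP32 multiplier area ratio

print(f"H100s required: {TOTAL_RATE / H100_FP64:,.0f}")               # ~1.56 million

# Doubling FP64 throughput is taken to need ~2 * 5 = 10x the transistor budget of
# an FP32 doubling, so at the FP32-implied pace of 2x every 2.3 years:
fp64_doubling = FP32_DOUBLING_YRS * math.log2(2 * FP64_AREA_PENALTY)
print(f"FP64 doubling time: {fp64_doubling:.1f} years")                # ~7.6 years
print(f"years per 10x in FP64: {fp64_doubling * math.log2(10):.0f}")   # ~25 years

# Today's largest clusters are ~2 orders of magnitude short of ~1.56M GPUs:
print(f"FP64 path to close a 100x gap: ~{fp64_doubling * math.log2(100):.0f} years")
print(f"FP32 path to close a 100x gap: ~{FP32_DOUBLING_YRS * math.log2(100):.0f} years")
```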
Hence, if simulating the brain is our only viable path to AGI, we can expect the required compute to be available for the largest corporations in 2040 at the earliest and 2075 at the latest given current trends.
Regarding Scenario 1 in the internal/external distinction, wouldn't an external approach ultimately be able to extract internal features (unless we don't allow it to see anything that reflects the starting conditions - without which we could not arrive at the analytical solution using the 'internal' approach either)?
Given identical input - like the first few frames of a video of the coin being flipped - I think a sequence model could definitely achieve more than 50% accuracy, since the causal factors can be reconstructed from the provided data. Maybe I misunderstood the terminology here?
In that sense, I'm not sure I understand why fundamentally this would be a barrier to recreating something like human thought (from text for example). Although there is definitely added 'complexity'/training needed for the model to reconstruct those internal factors.
Also, I'm curious about your thoughts on the storage requirements of such a model. I have no experience with how this is usually done for truly large models, but from my initial understanding, memory speed can be a large contributing factor. At even 1 byte per synapse, that's 600 TB of data. This seems non-trivial to get running even on H100s, and I would imagine that even if you do, the effective FLOPS might differ from the peak values.
Interesting read! Honestly would be really cool to even see a simulated insect brain - I think we could get a lot of interesting insights on the scaling/viability of such an approach.