Regarding Scenario 1 in the internal/external distinction, wouldn't an external approach ultimately be able to extract the internal features (unless we prevent it from seeing anything that reflects the starting conditions - but without those, we couldn't arrive at the analytical solution using the 'internal' approach either)?
Given identical input - like the first few frames of a video of the coin being flipped - I think a sequence model could definitely achieve more than 50% accuracy; the causal factors can be reconstructed from the provided data. Maybe I misunderstood the terminology here?
In that sense, I'm not sure I understand why this would be a fundamental barrier to recreating something like human thought (from text, for example), although there is definitely added 'complexity'/training needed for the model to reconstruct those internal factors.
Also, I'm curious about your thoughts on the computational demands of such a model, particularly its storage size. I have no experience with how this is usually done for truly large models, but from my initial understanding, memory speed can be a large contributing factor. At even 1 byte per synapse, that's 600 TB of data. This seems non-trivial to get running even on H100s, and even if you manage it, I'd imagine the effective FLOPS might look quite different.
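For concreteness, here's the back-of-envelope I had in mind (the ~6×10^14 synapse count is just a ballpark assumption that makes the 600 TB figure work out):

```python
# Back-of-envelope: memory needed to store one human brain's synapses.
synapse_count = 6e14        # assumed ballpark: ~600 trillion synapses
bytes_per_synapse = 1       # very optimistic; a real model would need more state
total_bytes = synapse_count * bytes_per_synapse
print(f"{total_bytes / 1e12:.0f} TB")   # -> 600 TB
```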
Interesting read! Honestly, it would be really cool to even see a simulated insect brain - I think we could get a lot of interesting insights into the scaling/viability of such an approach.
Thanks for your comment, Alex! You bring up an interesting point regarding the input of the sequence model. The way I had framed it in my mind was the following:
- The coin flip sequence model and text-based LLMs are both looking at the products of the process. They cannot see the internal processes that produced those products (the underlying physics in the case of the coin flip model and brain activity in the case of the LLM)
- The physics-based coin flip model and the brain simulation model look beneath the products to the actual origin of the process and model that directly
I think in your scenario of the coin flip model getting to see the video of the coin, it would almost definitely achieve higher than 50% accuracy given enough data. But, crucially, you are now allowing it to look at the underlying process that produced the coin flip and "derive" the laws of physics for itself in some sense. An equivalent brain-based model would probably be something like training it on fMRI data from a person writing text or speaking (or something higher-resolution that comes out in the future) + the text/voice data that person produced. This way, the model would also have a chance to derive what parts of the brain are important for producing the text without explicitly being given the rules governing the brain.
Given all the above, I do think LLMs are closer to the coin flip sequence model, which maxes out at 50% accuracy. It would be interesting to see whether, given a large enough dataset of paired brain scans & text, we could achieve better accuracy - although that seems like quite a difficult dataset to build.
On the question of size, you're absolutely right that a to-scale model of the human brain would be massive - probably on the scale of petabytes. I think we're actually much closer on the memory side than on the compute side, though. For example, Meta has built a cluster of 24,576 GPUs (https://engineering.fb.com/2024/03/12/data-center-engineering/building-metas-genai-infrastructure/). If these were all H100s and if the VRAM on all of them were fully accessible (tenuous assumptions, but probably roughly true within a year or so), that would be roughly 2 petabytes of VRAM in this cluster. Given the compute timelines estimated at the end of the article (2040-2075), we should have clusters with at least one order of magnitude more VRAM by that time, and potentially two - so we'd plausibly be looking at tens to hundreds of petabytes of VRAM per cluster.
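For what it's worth, here's the rough arithmetic behind that 2 PB figure (assuming 80 GB of HBM per H100 and, optimistically, that all of it is usable for model state):

```python
# Back-of-envelope for aggregate VRAM across the cluster.
num_gpus = 24_576           # size of Meta's cluster from the linked post
vram_per_gpu_gb = 80        # assumed: H100 with 80 GB of HBM, fully usable
total_vram_pb = num_gpus * vram_per_gpu_gb / 1e6
print(f"{total_vram_pb:.2f} PB")   # -> ~1.97 PB
```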
Thanks for the reply! I guess the way I saw it personally was less an internal/external split and more a spectrum of transparency into the internal structure. For example, with the coin toss, the analytical classical-mechanics approach is itself an abstraction/simplification of a potentially more accurate approach that includes quantum and relativistic effects. Given enough samples, these features should be theoretically reproducible - although in this case the amount of data needed to extract these kinds of features with either approach would be extreme.
LLMs are also an interesting case because there is no single specific structure to reconstruct: the training data originates from many different 'brains'/other models, so there is no low-level fine structure to extract (or at least not feasibly). The lowest-level extracted features no longer directly model anything real, but rather something like 'composite-hyperreal' simulacra. I still feel like this kind of low-level structure should be mostly extractable from enough text data - but that might just be intuition getting in the way of seeing the problem clearly.
Some of these GPU clusters are getting really big - I really underestimated the scale when I saw the headline before! They must have a lot of confidence in the scaling capability of future models to make that kind of investment (or maybe not much confidence in finding a more efficient approach :D ). Either way, it's exciting to see where this goes within the next 50 or so years.
Thanks for another great comment! That's a good point about the models themselves being abstractions of reality - in some sense you never get a "true" internal perspective of the underlying process. As you mentioned, there's a spectrum where we can use increasingly complex models that abstract away fewer and fewer details, but we can't ever reach a "true" model in some sense (unless of course the universe is doing computation underneath all of this, but that's a whole other can of worms). I really like your framing of this setup as a spectrum - the Newtonian vs quantum mechanics point is great.
With that said, I agree partially that given enough data these abstracted models should be reproducible, but with two caveats:
1. Deep learning models are incredibly flexible, so they may approximate the underlying physical reality in-distribution but completely fail to generalize out of distribution (since their flexibility allows them to take on essentially arbitrary values out of distribution). On the other hand, if we have a physical law that describes a system well (such as Newtonian mechanics), we know we will arrive at the correct answer even if those particular input parameters have never been observed before. In this sense, biophysics/computational neuroscience may model out-of-distribution better than a deep learning model trained on brain data some decades down the road, or at least provide some physics-based constraints to the model itself (see https://maziarraissi.github.io/PINNs/ and the rough sketch after this list).
2. For a deep learning model to reproduce the models of underlying reality, it needs to be fed the right data. Going back to the coin flip example, a model trained on raw sequences of heads & tails (like HTTTHHT) using autoregressive token prediction will never do better than 50% accuracy, like we discussed (the toy simulation further below illustrates this). However, your suggestion of using coin flip videos has a pretty solid chance of recovering some physics knowledge if we have enough of those videos. By the same idea, I think LLMs trained on text won't be able to recover the underlying structure producing the text (namely, the brain activity that gives rise to thought) since they currently aren't being fed any data that reflects that underlying structure. But if brain activity data were fed in jointly with text, they may in theory be able to derive what sort of brain activity is important for generating the thoughts that led to that text.
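To make caveat 1 a bit more concrete, here's a minimal, purely illustrative sketch of the PINN-style idea: a network fit to a handful of data points plus a physics residual on a toy ODE (du/dt = -k·u). The toy system, constants, and names are my own assumptions, not anything from the PINNs repo.

```python
# Sketch of a physics-informed loss: data fit + ODE residual for du/dt = -k*u.
import torch
import torch.nn as nn

k = 1.5                                          # assumed decay constant
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

# A few "observed" points (in-distribution data on t in [0, 1]).
t_data = torch.linspace(0.0, 1.0, 8).unsqueeze(1)
u_data = torch.exp(-k * t_data)                  # ground truth u(t) = exp(-k t)

# Collocation points where only the physics is enforced, including t > 1
# (outside the data range) - this is what constrains out-of-distribution behavior.
t_phys = torch.linspace(0.0, 3.0, 64).unsqueeze(1).requires_grad_(True)

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(3000):
    opt.zero_grad()
    data_loss = ((net(t_data) - u_data) ** 2).mean()
    u = net(t_phys)
    du_dt = torch.autograd.grad(u.sum(), t_phys, create_graph=True)[0]
    phys_loss = ((du_dt + k * u) ** 2).mean()    # residual of du/dt + k*u = 0
    (data_loss + phys_loss).backward()
    opt.step()
```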
My point #2 is a little less defensible, since my claim that text doesn't reflect the underlying brain activity well enough to derive those features might not be true, but my intuition is that we're probably missing something fundamental from the brain's perspective.
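As a quick sanity check on that 50% ceiling, here's a tiny, purely illustrative simulation: a predictor that only sees past heads/tails tokens of a fair coin (here it just bets on the historically more frequent side) ends up at chance no matter how much data it sees.

```python
# Toy demonstration: predicting fair, independent coin flips from past flips.
import random

random.seed(0)
flips = [random.choice("HT") for _ in range(100_000)]

counts = {"H": 0, "T": 0}
correct = 0
for flip in flips:
    prediction = "H" if counts["H"] >= counts["T"] else "T"  # best guess from history
    correct += (prediction == flip)
    counts[flip] += 1

print(f"accuracy: {correct / len(flips):.3f}")   # hovers around 0.500
```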
Your point on LLMs not having a specific structure to reconstruct is another can of worms :). This opens up a bit of a debate: are thoughts/abstract structures/the whole edifice of human cognition just reflections of our brain's physical state? I think that debate broadly leaves us with two options:
1. If thought is physical, there is an underlying structure for the LLM to model (namely, the physics of the brain). Even the hierarchies of abstraction that we construct as humans should correspond to different physical states of the brain if this idea holds.
2. If thought isn't physical, then I agree that there's probably no specific underlying structure that we can access to model.
I tend to lean towards option 1, mostly because I struggle to conceive of an explanation for what thought is if it isn't the physical state of the brain & nervous system.
Thanks again for the great comments!