Prediction is not Generation
October 13, 2025
[a long overdue response to Aidan :)]
In ML, generation and prediction are practically synonymous. Your model learns an appropriate, performant compression of your dataset, and somehow that artifact generates "completions" (broadly construed) with high accuracy.
It's tempting to then make the leap to: man, if I just managed to tokenize the entire past and future of the universe and trained a transformer (with infinite compute) to predict the next universe state every Planck time¹ from the universe history up to that point, then it'd be guaranteed to faithfully represent the true laws of physics somewhere in its weights!
I claim this is unclear! Even if the laws of physics were describable by some finite-state automaton, the optimal predictive representation of a process need not correspond to the optimal generative representation!
Here's a toy case. Consider the space of all stationary Markov processes generating the symbol set
The key point to remember here is that we're using the entropy of the stationary state distribution as our measure of "optimality": lower entropy means a simpler model, and a simpler model is "more optimal." It stands to reason that if generation and prediction are "the same," then it should be impossible to construct a generative process with lower entropy than
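The concrete process is elided above, so as a stand-in, here's how the "entropy of the stationary state distribution" is computed in general for a stationary Markov process. The transition matrix below is my own illustrative choice, not the one from this post:

```python
import numpy as np

def stationary_distribution(T: np.ndarray) -> np.ndarray:
    """Left eigenvector of T for eigenvalue 1, normalized to sum to 1.
    Rows index states; T[i, j] = P(next state j | current state i)."""
    eigvals, eigvecs = np.linalg.eig(T.T)
    idx = np.argmin(np.abs(eigvals - 1.0))
    pi = np.real(eigvecs[:, idx])
    return pi / pi.sum()

def entropy_bits(p: np.ndarray) -> float:
    """Shannon entropy H(p) = -sum_i p_i log2 p_i, skipping zero entries."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Illustrative 3-state chain (doubly stochastic, so pi is uniform).
T = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
pi = stationary_distribution(T)
H = entropy_bits(pi)  # uniform over 3 states -> log2(3) ≈ 1.585 bits
```

Under this measure, a machine whose stationary distribution concentrates on fewer states scores as "simpler" than one that spreads probability uniformly.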
Well. Consider
You can check for yourself that this process outputs
and its entropy
Have we been hoodwinked? Maybe one should never trust a sentence beginning with "Clearly, ..." in a mathematical text. Maybe there's a secret predictive policy that magically has lower entropy for
I argue against this. In particular, there is a specific interpretation of "prediction" at work here that I claim is both natural and correct.
Consider an infinite string of tokens
Observe that the introduction of causality meaningfully differentiates purely generative policies from predictive ones! We constructed a lower-entropy generative process precisely by relaxing the requirement that our belief states be causally determined by the token history. There's a sense in which this is the fundamental difference between prediction and generation. It remains to be seen how widely this holds, but the two concepts are canonically differentiated.
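The "belief states determined by the token history" can be made concrete with a Bayesian filter over a hidden Markov model: a predictive policy's state must be a deterministic function of the observed past, updated symbol by symbol. A minimal sketch, with hypothetical matrices of my own choosing rather than this post's example:

```python
import numpy as np

# Hypothetical 2-state, 2-symbol hidden Markov model:
# T[i, j] = P(next hidden state j | hidden state i)
# E[i, s] = P(emit symbol s | hidden state i)
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
E = np.array([[0.7, 0.3],
              [0.1, 0.9]])

def update_belief(belief: np.ndarray, symbol: int) -> np.ndarray:
    """Causal belief update: condition on the observed symbol, then
    propagate through the transition matrix. The result depends only
    on the token history, never on the hidden state itself."""
    posterior = belief * E[:, symbol]   # P(state | past) * P(symbol | state)
    posterior /= posterior.sum()        # normalize to P(state | past, symbol)
    return posterior @ T                # push forward to the next time step

belief = np.array([0.5, 0.5])          # uninformative prior over hidden states
for symbol in [0, 1, 1, 0]:            # some observed token history
    belief = update_belief(belief, symbol)
```

A predictive policy is constrained to occupy (distributions over) states reachable this way from the history; a purely generative machine faces no such constraint, which is what opens the door to lower state entropy.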
Examples taken from James et al.
Ignoring that the universe doesn't have a sense of absolute time.
In general, belief states are not by default interpretable.
The ranges in which the Lohr model has lower entropy than the predictive model.