Scaling Laws for Transfer Learning
October 11, 2025
Chinchilla scaling posits that

$$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

where \(N\) is the number of model parameters, \(D\) is the number of training tokens, \(E\) is the irreducible loss of the data distribution, and \(A, B, \alpha, \beta\) are fitted constants.
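As a sketch, the parametric form above pins down both predicted loss and the compute-optimal parameter/token split in closed form. The constants below are the ones reported in the Chinchilla paper (E≈1.69, A≈406.4, B≈410.7, α≈0.34, β≈0.28), and I'm using the standard compute approximation C ≈ 6ND:

```python
# Chinchilla parametric loss with the constants fitted in Hoffmann et al.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def chinchilla_loss(N, D):
    """Predicted pretraining loss for N parameters and D training tokens."""
    return E + A / N**alpha + B / D**beta

def compute_optimal(C):
    """Minimize L(N, D) subject to C = 6*N*D (closed form).

    Substituting D = C/(6N) and setting dL/dN = 0 gives
    N* = (alpha*A / (beta*B))^(1/(alpha+beta)) * (C/6)^(beta/(alpha+beta)).
    """
    a = beta / (alpha + beta)
    G = (alpha * A / (beta * B)) ** (1 / (alpha + beta))
    N_opt = G * (C / 6) ** a
    D_opt = C / (6 * N_opt)
    return N_opt, D_opt
```

Since α ≈ β, the exponent β/(α+β) ≈ 0.45, which is where "scale data and model size similarly" comes from.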
Of course, this analysis is limited:
- the fitted parameters do not hold across architectural shifts, e.g. dense vs. MoE (h/t Kushal for sending me this paper)
- "scale data and model size similarly" is derived from the regime where compute
- data repetition may or may not degrade performance in the long-term: it seems like 4x is the limit for traditional autoregressive transformers, but 100x can be useful for diffusion LMs
- \(L\) is in-distribution test loss, and loss is often not predictive of downstream task performance, although loss-to-loss predictions across different training distributions are predictable
- the original Chinchilla paper likely did their regressions wrong
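On the regression point: the basic fitting move is that a power law is a line in log space, so exponents fall out of ordinary least squares. Here is a toy one-variable version (real fits, as in Chinchilla, jointly fit E, A, B, α, β with a robust loss such as Huber, and getting that procedure wrong is exactly the alleged error):

```python
# Toy sketch: recover a power-law exponent from loss measurements.
# L - E = A * N^(-alpha) becomes a line in log-log space, so OLS on
# (log N, log(L - E)) recovers alpha as minus the slope.
import math

def fit_exponent(Ns, losses, E):
    xs = [math.log(n) for n in Ns]
    ys = [math.log(l - E) for l in losses]
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) \
        / sum((x - xbar) ** 2 for x in xs)
    return -slope

# Synthetic measurements generated with alpha = 0.34.
Ns = [1e7, 1e8, 1e9, 1e10]
losses = [1.69 + 406.4 / n**0.34 for n in Ns]
print(fit_exponent(Ns, losses, E=1.69))  # recovers ~0.34
```

Note that the fit is only as good as the assumed E; mis-estimating the irreducible loss bends the log-log line and biases the exponent.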
Still, it is remarkable that test loss is so clearly a function of parameters and dataset size, with few assumptions about the distributional structure or architectural specifics (this holds across varying depth/width ratios, for instance). I find two questions natural:
- does scaling hold outside of the cross-entropy pretraining regime?
- can we derive scaling relationships for downstream task performance? in particular, how predictable is transfer learning?
In 2021, OpenAI studied the question of "how much more quickly does a pretrained LM achieve low loss on a finetuning dataset than an LM initialized from scratch on the dataset?" (Note that pretraining and finetuning in this case are "the same operation"; the objective is still cross-entropy.) They find that the "effective data transferred"1 follows a power law in finetuning dataset size and model size, \(D_T = k \cdot D_F^{\alpha} \cdot N^{\beta}\).
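A sketch of that effective-data-transferred law, \(D_T = k \cdot D_F^{\alpha} \cdot N^{\beta}\). The exponents below (α≈0.18, β≈0.38) are roughly what the paper reports for text-to-python transfer; the multiplier k here is illustrative, not their fitted value:

```python
# Hedged sketch of Hernandez et al.'s effective data transferred.
def effective_data_transferred(D_F, N, k=1.0e4, alpha=0.18, beta=0.38):
    """Extra finetuning tokens a from-scratch model would need to match the
    pretrained model's loss, given D_F finetuning tokens and N parameters.
    k is an illustrative constant, not the paper's fit."""
    return k * D_F**alpha * N**beta
```

The interesting qualitative reading: because β > α, scaling the model buys you more transfer than scaling the finetuning set does.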
Brandfonbrener et al. do somewhat better. They derive scaling relationships for train-train, train-test, and test-test loss transfer for models trained on different datasets, which can be expressed as a shifted power law,

$$L_1 = K \cdot (L_0 - E_0)^{\kappa} + E_1$$

where you have models whose losses \(L_0, L_1\) are measured on datasets 0 and 1, \(E_0, E_1\) are the respective irreducible losses, and \(K, \kappa\) are fitted constants.
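The shifted power law is simple enough to state in a few lines. All the constants below are made-up fit parameters for illustration, not values from the paper:

```python
# Illustrative loss-to-loss map in the Brandfonbrener et al. shifted
# power-law form: L1 = K * (L0 - E0)^kappa + E1.
def loss_to_loss(L0, K=0.9, kappa=1.1, E0=1.7, E1=1.5):
    """Predict loss on dataset 1 from loss on dataset 0 (toy constants)."""
    return K * (L0 - E0) ** kappa + E1
```

The useful property: fit (K, κ, E0, E1) once from small models, then any scaling law for \(L_0\) immediately gives you one for \(L_1\), even across training distributions.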
There's a meta-analysis out this year claiming that scaling laws are unreliable for downstream task performance prediction. Seems correct: metrics are noisy and don't have nice algorithmic properties like cross-entropy loss might. Perhaps most intriguing is their observation that irregular scaling is (1) common and (2) can occur for cross-entropy on normal tasks and normal LM datasets. This paper claims that larger models & models trained for longer have better downstream task performance even when holding loss constant, which is an argument for certain training setups & architectures having better inductive biases?
Honestly, I am kind of sad that the extant literature here seems to be tainted by publication bias? I wouldn't really trust these papers (or the ten others I read while writing this), and I want to run the experiments myself. The Tinker API seems good for doing this quickly. I'll go do that.
(Titular question pending.)
In essence, the number of tokens that you "save" seeing in finetuning by pretraining. ↩
Notably