Scaling Laws for Transfer Learning

October 11, 2025

Chinchilla scaling posits that

L(N, D) = L_∞ + A/N^α + B/D^β,

where N is the number of parameters in your language model, D is the number of training tokens seen, A, B, α, β, and L_∞ are fitted constants, and L(N, D) is the test loss.
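As a sanity check, the parametric form is a one-liner. This sketch uses the constants fitted in the Chinchilla paper (Hoffmann et al., 2022, Approach 3); treat the exact values as illustrative rather than authoritative.

```python
# Chinchilla parametric loss: L(N, D) = L_inf + A/N^alpha + B/D^beta.
# Constants are the published Chinchilla fits (approximate).
A, B = 406.4, 410.7
ALPHA, BETA = 0.34, 0.28
L_INF = 1.69  # irreducible loss

def chinchilla_loss(N: float, D: float) -> float:
    """Predicted test loss for N parameters trained on D tokens."""
    return L_INF + A / N**ALPHA + B / D**BETA
```

Loss falls monotonically in both N and D and asymptotes to L_∞ as both go to infinity.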

Of course, this analysis is limited:

Still, it is remarkable that test loss is so clearly a function of parameter count and dataset size, with few assumptions made about the distributional structure or architectural specifics (this holds across varying depth/width ratios, for instance). I find two questions natural:

In 2021, OpenAI studied the question of "how much more quickly does a pretrained LM achieve low loss on a finetuning dataset than an LM trained from scratch on that dataset?" (Note that pretraining and finetuning in this case are "the same operation": the objective is still cross-entropy.) They find that the "effective data transferred"1 D_T is described by D_T = k · (D_F)^α · N^β, where D_F is the size of the finetuning dataset (in tokens) and N is the number of non-embedding parameters of the model.2 This is great! It is strong evidence of the generality of the abstractions the model learns in pretraining (especially given the independence of β from the source distribution). However, it doesn't explicitly tell us about downstream task performance on an external metric.
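The transfer law is also a one-liner. The constants below are placeholders I made up for illustration, not the paper's fitted values (which depend on the source/target distribution pair):

```python
# Effective data transferred (Hernandez et al., 2021):
#   D_T = k * D_F**alpha * N**beta
# k, alpha, beta below are HYPOTHETICAL placeholders, not fitted values.
K_FIT, ALPHA_FIT, BETA_FIT = 1.9e3, 0.18, 0.38

def effective_data_transferred(D_F: float, N: float) -> float:
    """Tokens of finetuning data 'saved' by pretraining, per the fitted law."""
    return K_FIT * D_F**ALPHA_FIT * N**BETA_FIT
```

Note that under this form, transfer grows with model size at fixed finetuning-set size, which is the paper's headline qualitative finding.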

Brandfonbrener et al. do somewhat better. They derive scaling relationships for train-train, train-test, and test-test loss transfer for models trained on different datasets, which can be expressed as

L_i(f_j^{N,D}) ≈ K · (L_k(f_l^{N,D}) − E_{k|l})^κ + E_{i|j},

where you have models f_j, f_l trained on distributions j, l, evaluated on distributions i, k, and you're fitting the constants K, κ.3 As an example, the train-train case is where (i, j) = (0, 0) and (k, l) = (1, 1). Models are paired by (N, D) for coherence. Notably, these laws hold across diverse datasets, but hold well only in low-loss regimes and when the E_{m|n} terms can be estimated accurately. Still no breaking of the pretraining regime, and no explicit predictions for downstream metric performance!
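Once the E terms are subtracted, the law is linear in log space, so fitting K and κ is just a log-log regression. A minimal sketch on synthetic paired losses (the E values and ground-truth constants here are made up):

```python
import numpy as np

# Fit the loss-to-loss law L_i = K * (L_k - E_{k|l})**kappa + E_{i|j}
# on synthetic paired losses. E terms are assumed known/estimated.
E_kl, E_ij = 1.7, 2.0          # hypothetical irreducible losses
K_true, kappa_true = 1.1, 0.9  # ground truth used to generate the data

L_k = np.linspace(2.0, 4.0, 20)                 # source losses
L_i = K_true * (L_k - E_kl)**kappa_true + E_ij  # paired target losses

# Linear in log space: log(L_i - E_ij) = log K + kappa * log(L_k - E_kl).
kappa_hat, logK_hat = np.polyfit(np.log(L_k - E_kl), np.log(L_i - E_ij), 1)
```

On noiseless synthetic data the regression recovers K and κ exactly; in practice the quality of the fit hinges on how well the E_{m|n} terms are estimated, which is exactly the caveat above.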

There's a meta-analysis out this year claiming that scaling laws are unreliable for downstream task performance prediction. Seems correct. Metrics are noisy and don't have the nice analytic properties that cross-entropy loss does. Perhaps most intriguing is their observation that irregular scaling is (1) common and (2) can occur even for cross-entropy on ordinary tasks and ordinary LM datasets. This paper claims that larger models, and models trained for longer, have better downstream task performance even when holding loss constant. Which is an argument for certain training setups and architectures having better inductive biases?

Honestly, I am kind of sad that the extant literature here seems tainted by publication bias. I wouldn't really trust these papers (or the ten others I read while writing this), and I want to run the experiments myself. The Tinker API seems good for doing this quickly. I'll go do that.

(Titular question pending.)

1. In essence, the number of tokens that you "save" seeing in finetuning by pretraining.

2. Notably, β depends only on the architecture and TARGET distribution (not the SOURCE), while α is a rough proxy for distributional proximity that can be easily estimated.

3. E_{m|n} is the irreducible loss of a model trained with infinite compute on distribution n and evaluated on distribution m.