Notes

Mid-October Links

i am tired so here is a linkpost. i'll try not to do more than two of these a month

  1. Misha Gromov mathematizes biology (in an IHES lecture series). See also his manuscript on ergosystems.
  2. Gauge/string dualities as special cases of Schur-Weyl duality.
  3. Tao on when eigenvalues are stable under (small) perturbations, what gauges are, and orders of infinity.
  4. Constructed languages are processed by the same brain mechanisms as natural languages.
  5. Negarestani on the alien will, toy aesthetics, and toy philosophy. He also has a complexity sciences reading list which is surprisingly reasonable.
  6. A 3000-page algebraic topology reference with pictures.
  7. Tsvi on gemini modeling and counting down vs. counting up coherence.
  8. Schulman on when low-rank LoRA underperforms vs. matches fullFT. In particular, rank-1 LoRA is sufficient for RL tasks.
  9. Associative memory in an optical spin glass made of rubidium Bose-Einstein condensates. Ganguli is a co-author.
  10. Lada Nuzhna asks: where are all the trillion-dollar biotechs?
  11. Francis Bach with some more "classical" settings for scaling laws. See accompanying blog posts.
  12. Andrew Critch has an interesting blog post on Newcombian implications for self-trust. Christiano also has a blog post on integrity for consequentialists.
  13. Homotopy is not concrete.
  14. Nick Bostrom profile in the New Yorker.
  15. Dean Ball on what it's like to work in the White House. He ~wrote the AI Action Plan.
  16. Vitalik on memory access actually taking $O(N^{1/3})$ time, low-risk DeFi as an Ethereum business model, copyleft vs. permissive licenses, and musings on ideologies. His posts are great. Highly recommend.
  17. Ben Kuhn on how taste can be the leading contributor to impact. See also Chris Olah's exercises for developing research taste.
  18. Exposition of homotopy type theory with ∞-topos semantics by Emily Riehl. I really like these! They're cogent and clear.
  19. The Ohio State University is hosting an International Conference on Ancient Magic this weekend.
  20. Aging as a loss of goal-directedness, from the Levin lab.
  21. Eliot's The Hollow Men and Blake's The Proverbs of Hell.
  22. An optimistic case for protein foundation model companies.
  23. Grokking as a first order phase transition in neural networks. Good example of mean field theory as a thermodynamic theory of learning.
  24. If you like the items on this list, or especially if you wish the items on this list were better, email me!

MDL meets SLT

[paper highlight: Compressibility Measures Complexity: Minimum Description Length Meets Singular Learning Theory]

two major contributions of the paper:

  • theoretically links minimum description length to singular learning theory, in that they prove that for all ground-truth distributions $q$ and $n$ i.i.d. samples drawn from $q$, there exists a two-part code with asymptotic redundancy $R_n = \lambda \log n - (m-1)\log\log n + O_p(1)$, where $\lambda$ is the LLC
  • experimental results showing how the LLC varies with model quantization (where quantization is roughly a stand-in for compression and the LLC measures complexity, so one can study empirical correlations between the two)

what is a two-part code? admittedly I'm still slightly bamboozled by the MDL formalism they choose, so this will be a mix of hand-wavy intuition and opaque jargon.

Let $q^{(n)} \in \Delta(\mathcal{X}^n)$ be a data-generating distribution over $n$-sequences drawn from the sample space (token vocabulary) $\mathcal{X}$. Any distribution $p^{(n)}$ over $\mathcal{X}^n$ induces a code for any sample $x^{(n)} \in \mathcal{X}^n$, where a code is just an associated bitstring for the sample. The bitstring has length $-\log p^{(n)}(x^{(n)})$ (the Shannon codelength), and the minimum description length principle is essentially that good encodings should minimize the average codelength of samples. Given i.i.d. sampling, the long-run optimal encoding distribution is the ground-truth distribution $q^{(n)}$, and $\mathrm{KL}(q \,\|\, p)$ has a clean interpretation in this context: the expected excess length per symbol incurred by encoding with $p$ instead of $q$.
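As a sanity check on that last interpretation, here is a tiny numerical example (the distributions are made up; it just verifies $\mathbb{E}_q[-\log_2 p(x)] = H(q) + \mathrm{KL}(q\|p)$):

```python
import numpy as np

# Toy check: encoding i.i.d. symbols from q with a code built from p costs
# H(q) + KL(q || p) bits per symbol on average, so KL(q || p) is exactly the
# expected excess length. The distributions below are made up for illustration.
q = np.array([0.7, 0.2, 0.1])  # ground-truth distribution over a 3-symbol vocabulary
p = np.array([0.5, 0.3, 0.2])  # encoding distribution ("model")

H_q = -np.sum(q * np.log2(q))       # optimal average bits/symbol
len_p = -np.sum(q * np.log2(p))     # average bits/symbol when coding with p
kl_qp = np.sum(q * np.log2(q / p))  # expected excess bits/symbol

print(f"H(q)            = {H_q:.4f}")
print(f"E_q[-log2 p(x)] = {len_p:.4f}")
print(f"KL(q || p)      = {kl_qp:.4f}")
print(f"excess          = {len_p - H_q:.4f}  # equals KL(q || p)")
```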

a two-part code is an encoding with two parts: one specifying the encoding distribution ("model") with respect to a model class, and the other specifying the message ("sample") given the model. intuition for this setup: imagine a sender and receiver who both know they will communicate in some language with a given grammatical structure, but the structure underdetermines the full language and vocabulary; they do, however, share a dictionary mapping bitstrings to complete languages, which they can coordinate on before communicating. (there are much better ways of explaining this).

anyway, you want some way of measuring the performance of your encoding in the two-part setting. there's a quantity called redundancy that measures performance with respect to the underlying data distribution, roughly given in the average case by $R_n = \operatorname{len}([[p]]) + \mathrm{KL}(q \,\|\, p)$, where $[[p]]$ is your bitstring encoding of your model w.r.t. your model class. a natural way of optimizing this is choosing a $p$ which accurately models $q$ and eating the specification cost. However! you might have a model class $\mathcal{M}$ uniquely unsuited to encoding $q$, in which case your optimization problem is more interesting.
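Here's a minimal sketch of the generic two-part recipe on a toy Bernoulli model class (this is not the paper's construction, just the flavor: part one names a model from a finite grid with a uniform code, part two encodes the sample with that model's Shannon code):

```python
import numpy as np

# Two-part code sketch. Assumptions: Bernoulli data, a uniform grid of 19
# candidate models, uniform code over the grid. Not the paper's construction.
rng = np.random.default_rng(0)
q_true, n = 0.3, 1000
x = rng.random(n) < q_true          # n i.i.d. Bernoulli(q_true) samples

grid = np.linspace(0.05, 0.95, 19)  # finite model class
part1 = np.log2(len(grid))          # bits to name a model from the grid

def part2(p, x):
    """Bits to encode the sample x under Bernoulli(p)."""
    k = x.sum()
    return -(k * np.log2(p) + (len(x) - k) * np.log2(1 - p))

total = np.array([part1 + part2(p, x) for p in grid])
ideal = part2(q_true, x)            # codelength if we could code with q itself

print(f"best model in grid      : p = {grid[total.argmin()]:.2f}")
print(f"two-part codelength     : {total.min():.1f} bits")
print(f"redundancy vs. coding q : {total.min() - ideal:.1f} bits")
```

The tension is exactly the trade between the two parts: a finer grid models $q$ better but costs more bits to name a model.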

restating the central theoretical result: for any realizable1 data-generating distribution $q \in \mathcal{M}$ and dataset $x^{(n)}$ sampled i.i.d. from $q$, there exists a two-part code whose asymptotic redundancy is $R_n = \lambda \log n - (m-1)\log\log n + O_p(1)$, where $\lambda$ is the LLC of $q$ and $m$ is the "multiplicity."1

it is late and my brain is not quite working, but i don't see optimality guarantees for this result? the construction is of the flavor "choose codes such that

$$\operatorname{len}([[p]]) = \log \frac{\operatorname{Vol}(W)}{V^n_{p}(\epsilon)},$$

and then this has the $R_n$ given above." (where the $p$'s are the models at the centers of $\epsilon$-balls covering the regions of parameter space with sufficiently small KL divergence to the truth; discretization is needed to reduce the model class to a set small enough to fully specify with codes). like, this is implicitly sane because the KL $\epsilon$-ball partition assigns each code a probability proportional to its ball's share of the volume of the entire space, but IDK why this is the optimal encoding or agreement algorithm? mumbles in Jeffreys prior
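For what it's worth, here's my guess at where the shape of $R_n$ comes from; this is my own reconstruction from the standard SLT volume-scaling asymptotic, not the paper's argument. If the volume of the KL $\epsilon$-ball around the truth scales as $V^n_p(\epsilon) \asymp c\,\epsilon^{\lambda}(-\log\epsilon)^{m-1}$ as $\epsilon \to 0$, then taking $\epsilon \asymp 1/n$,

$$\operatorname{len}([[p]]) = \log\frac{\operatorname{Vol}(W)}{V^n_p(\epsilon)} \approx \log\frac{\operatorname{Vol}(W)}{c\, n^{-\lambda}(\log n)^{m-1}} = \lambda\log n - (m-1)\log\log n + O(1),$$

and with $\epsilon \asymp 1/n$ the second part of the code contributes only $O_p(1)$ excess, which recovers the leading terms of $R_n$. (A sketch under these assumptions, not the paper's actual proof.)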

a maybe helpful image:

[figure: pythia-llc]

I suppose this is why the empirical results are needed. But the empirical results are like "linear relationships between LLC estimates and critical compression thresholds for models up to 7B parameters," where the critical compression threshold $n_q$ is literally "how many times do I need to quantize the model before the difference in loss exceeds some threshold." which is cool! but a bit confusing

[figure: pythia-llc]

still don't quite understand the theory behind LLC estimation. MDL and SLT connections are cool though; it would be nice to get some naturality results because the experimental results are not that convincing by themselves (many patterns replicate this, LLC estimation is an art not a science, and quantizing models arbitrarily and doing inference on them seems like it naturally leads to buggy implementations)

1

see technical conditions in the paper


Prediction is not Generation

[a long overdue response to Aidan :)]

In ML, generation and prediction are practically synonymous. Your model learns an appropriate, performant compression of your dataset, and somehow the resulting artifact generates "completions" (broadly construed) with high accuracy.

It's tempting to then make the leap to: man, if I just managed to tokenize the entire past and future of the universe and train a transformer (with infinite compute) to predict the next universe state every Planck time1 from the universe history up until that point, then it'll be guaranteed to faithfully represent the true laws of physics somewhere in its weights!

I claim this is unclear! Even if the laws of physics were describable by some finite state automaton, the optimal predictive representation of a process does not necessarily have to correspond to the optimal generative representation!

Here's a toy case. Consider the space of all stationary Markov processes generating the symbol set $\{0,1\}$. Clearly the best way to predict a process like this (given Markovity) is to assign some probability $p$ to a $1$ being generated after a $0$, and some probability $q$ to a $0$ being generated after a $1$. There are two "belief states" of this policy (let's call them $A$ and $B$, each corresponding to the "belief" that a $0$ or a $1$ will be generated, respectively2) that the reasoner will occupy with probabilities

$P(A) = \frac{1-q}{2-p-q}$ and $P(B) = \frac{1-p}{2-p-q}$, respectively. The entropy of this two-state system is just the entropy of the stationary distribution (given above), which turns out to be

$$C_\mu = -P(A)\log_2 P(A) - P(B)\log_2 P(B) = \frac{q-1}{2-p-q}\log_2\!\left(\frac{1-q}{2-p-q}\right) + \frac{p-1}{2-p-q}\log_2\!\left(\frac{1-p}{2-p-q}\right).$$

The key point to remember here is that we're using the entropy of the stationary state distribution as a measure of "optimality," in the sense that lower entropy means higher simplicity and is therefore "more optimal." It stands to reason that if generation and prediction are "the same," then it should be impossible to construct a generative process with entropy lower than $C_\mu$ for any values of $p, q$. Right?

Well. Consider $p = q = 0.4$, and consider the generating process below.

[figure: the Lohr generating process]

You can check for yourself that this process outputs a $1$ after a $0$ with probability $p$, and a $0$ after a $1$ with probability $q$, for $0 \le p = q \le 1/2$. This process has a stationary distribution

$$\pi = \left[\tfrac{1}{2} - p,\; 2p,\; \tfrac{1}{2} - p\right],$$

and its entropy $H[\pi]$ for $p = q = 0.4$ is approximately $0.922$, less than $C_\mu = 1$.
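A quick numerical check, just plugging the formulas above into NumPy:

```python
import numpy as np

# Compare the predictive and generative entropies at p = q = 0.4,
# using the formulas from the text.
p = q = 0.4

# Predictive side: the two belief states A, B.
P_A = (1 - q) / (2 - p - q)
P_B = (1 - p) / (2 - p - q)
C_mu = -(P_A * np.log2(P_A) + P_B * np.log2(P_B))

# Generative side: stationary distribution of the three-state process.
pi = np.array([0.5 - p, 2 * p, 0.5 - p])
H_pi = -np.sum(pi * np.log2(pi))

print(f"C_mu  = {C_mu:.3f} bits")   # 1.000
print(f"H[pi] = {H_pi:.3f} bits")   # 0.922, strictly smaller
```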

Have we been hoodwinked? Maybe one should never trust a sentence beginning with "Clearly, ..." in a mathematical text. Maybe there's a secret predictive policy that magically has lower entropy for $p \in [0.38, 0.5]$3 that we're just missing.

I argue against this. There is a particular interpretation of "prediction" at work here that I claim is simultaneously natural and correct.

Consider an infinite string of tokens $\ldots, X_{-2}, X_{-1}, X_0, X_1, X_2, \ldots$. The Markov property states that $P(X_0 \mid X_{<0}) = P(X_0 \mid X_{-1})$: that my causal state is fully determined by timestep $t-1$, and thus the last token output contains all the information I could use to predict the next token output. As an optimal predictor, I want to adhere to the optimal causal policy, which is the minimal-entropy policy over belief states that can be causally differentiated. In this case, it is the two-state policy $\mu$ with entropy $C_\mu$ above.

Observe that the introduction of causality meaningfully differentiates purely generative policies from predictive ones! We have constructed a lower-entropy generative process by relaxing the requirement that we only use belief states that are causally differentiated by the token history. There's a sense in which this is the fundamental difference between prediction and generation. It remains to be seen how widely this holds, but the two concepts are canonically differentiated.

Examples taken from James et al.

1

Ignoring that the universe doesn't have a sense of absolute time.

2

In general, belief states are not by default interpretable.

3

The ranges in which the Lohr model has lower entropy than the predictive model.


Miscellaneous Poetry Drafts

I.

a blade of grass hides
minuscule migratory men
Lilliputian fiends

II.

mighty merry rascals
fickle, high off foglefreude
die English skippers

III.

I once beheld Seneca's estate,
Credulously inspecting his wicker tomb—
Yet Margate Mennons and liced defendants
Both swore by its awkward loom.

My heart gasped and lips shuddered
When, to my utmost surprise
The elderly Roman statesman lay
As mummified nitride.

Tick-tock, goes the clock
Garrulous gyrations too
Gizzardly Gentry, nice surprise
Confiding in a martyr's womb.

Foiled! the cuckoo's dead—
Not I, not I, not I!
Lambastation! Aberration!
To defy Nero's evil eye.

IV.

weed stands stout
thistles, burr
gadolinium kraut
hummus and herb

olden spires bristle, copper-waxed
kelpish tides awash blades of glass
Betty stout, orange'd brass
vermillion mounts, dugong grass

betwixed, witched, yonder
your shivers roll down my spine
I gave you a bouquet of thorned tulips
at sunrise, on Cocoa Beach

the rocket's red glare, the bombs
bursting over a lackadaisical mare
I wish Martians dreamt of the stars
alas


Scaling Laws for Transfer Learning

Chinchilla scaling posits that

$$L(N,D) = L_\infty + A N^{-\alpha} + B D^{-\beta},$$

where $N$ is the number of parameters in your language model, $D$ is the number of training tokens seen, $A, B, \alpha, \beta, L_\infty$ are constants, and $L(N,D)$ is the test loss.
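As a toy illustration of the functional form (the constants below are made up to be in a plausible ballpark; they are not the Chinchilla fit):

```python
# Chinchilla-form loss with made-up constants (NOT the paper's fitted values).
L_inf, A, B, alpha, beta = 1.7, 400.0, 400.0, 0.34, 0.28

def loss(N, D):
    """Test loss as a function of parameter count N and training tokens D."""
    return L_inf + A * N ** (-alpha) + B * D ** (-beta)

N, D = 1e9, 2e10          # e.g. a 1B-parameter model on 20B tokens
print(f"L(N, D)  = {loss(N, D):.3f}")
print(f"L(2N, D) = {loss(2 * N, D):.3f}  # doubling parameters")
print(f"L(N, 2D) = {loss(N, 2 * D):.3f}  # doubling data")
```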

Of course, this analysis is limited.

Still, it is remarkable that test loss is so clearly a function of parameter count and dataset size, with few assumptions made about the distributional structure or architectural specifics (this holds across varying depth/width ratios, for instance). I find two questions natural:

  • does scaling hold outside of the cross-entropy pretraining regime?
  • can we derive scaling relationships for downstream task performance? in particular, how predictable is transfer learning?

In 2021, OpenAI studied the question of "how much more quickly does a pretrained LM achieve low loss on a finetuning dataset than an LM initialized from scratch on that dataset?" (Note that pretraining and finetuning in this case are "the same operation"; the objective is still cross-entropy.) They find that the "effective data transferred"1 $D_T$ is described by $D_T = k (D_F)^{\alpha} (N)^{\beta}$, where $D_F$ is the size of the finetuning dataset (in tokens) and $N$ is the number of non-embedding parameters of the model.2 This is great! Strong evidence of the generality of the abstractions the model learns in pretraining (especially given the independence of $\beta$ from the source distribution). However, it doesn't explicitly tell us about downstream task performance given an external metric.
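To make the law concrete, here is how you would use it once fitted; the constants $k, \alpha, \beta$ below are placeholders I invented, not the paper's values:

```python
# Effective data transferred, D_T = k * D_F**alpha * N**beta.
# k, alpha, beta are placeholder values, NOT the fitted constants from the paper.
k, alpha, beta = 1e3, 0.2, 0.4

D_F = 1e8   # 100M finetuning tokens
N = 1e9     # 1B non-embedding parameters

D_T = k * D_F ** alpha * N ** beta   # finetuning tokens "saved" by pretraining
print(f"effective data transferred : {D_T:.3e} tokens")
print(f"effective finetuning data  : {D_F + D_T:.3e} tokens")
```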

Brandfonbrener et al. do somewhat better. They derive scaling relationships for train-train, train-test, and test-test loss transfer for models trained on different datasets, which can be expressed as

$$L_i\!\left(\hat{f}_j^{\,N,D}\right) \approx K\left(L_k\!\left(\hat{f}_l^{\,N,D}\right) - E_{k|l}\right)^{\kappa} + E_{i|j},$$

where you have models $\hat{f}_j, \hat{f}_l$ trained on distributions $j, l$ and evaluated on distributions $i, k$, and you're fitting the constants $K, \kappa$.3 As an example, the case of train-train would be where $(i,j) = (0,0)$ and $(k,l) = (1,1)$. We pair models by $(N,D)$ for coherence. Notably, these laws hold across diverse datasets, but only hold well in low-loss regimes and when the $E_{m|n}$ terms can be well estimated. Still no breaking out of the pretraining regime, and no explicit predictions for downstream metric performance!
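A sketch of what fitting $K, \kappa$ looks like in practice, on synthetic paired losses (the generating constants and noise are invented, and the irreducible-loss terms $E$ are assumed known for simplicity):

```python
import numpy as np
from scipy.optimize import curve_fit

# Fit the shifted power law L_i = K * (L_k - E_kl)**kappa + E_ij on synthetic
# paired losses. All constants and the noise level below are made up; in the
# real setting the (L_k, L_i) pairs come from models matched on (N, D).
rng = np.random.default_rng(0)
E_kl, E_ij = 1.8, 2.1                 # irreducible losses, assumed known here
K_true, kappa_true = 0.9, 1.3

L_k = np.linspace(1.9, 3.5, 40)       # losses of the f_l models on distribution k
L_i = K_true * (L_k - E_kl) ** kappa_true + E_ij
L_i = L_i + rng.normal(scale=0.01, size=L_i.shape)   # small observation noise

def shifted_power_law(L_k, K, kappa):
    return K * (L_k - E_kl) ** kappa + E_ij

(K_hat, kappa_hat), _ = curve_fit(shifted_power_law, L_k, L_i, p0=[1.0, 1.0])
print(f"K_hat = {K_hat:.3f} (true {K_true}), kappa_hat = {kappa_hat:.3f} (true {kappa_true})")
```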

There's a meta-analysis out this year that claims that scaling laws are unreliable for downstream task performance prediction. Seems correct. Metrics are noisy and don't have nice algorithmic properties like cross-entropy loss might. Perhaps intriguing is their observation that irregular scaling is (1) common and (2) can occur for cross-entropy on normal tasks and normal LM datasets. This paper claims that larger models & models trained for longer have better downstream task performance even when holding loss constant. Which is an argument for certain training setups & architectures having better inductive biases?

Honestly, I am kind of sad that the extant literature here seems to be tainted by publication bias? I wouldn't really trust these papers (or the ten others I read writing this), and I want to run the experiments myself. The Tinker API seems good for doing this quickly. I'll go do that.

(Titular question pending.)

1

In essence, the number of tokens that you "save" seeing in finetuning by pretraining.

2

Notably $\beta$ only depends on architecture and the TARGET distribution (not the SOURCE), while $\alpha$ is a rough "distributional proximity" proxy that can be easily estimated.

3

$E_{m|n}$ is the irreducible loss of a model trained with infinite compute on distribution $n$ and evaluated on distribution $m$.