Notes

Criticality in Value Formation

November 22, 2025

underspecified thesis: qualitative differences in phenomenal effects are primarily determined by the conditions under which nucleation occurs; the environmental conditions of phase transitions are the primary determinants of the long-run behavior.

examples: prion diseases, ritonavir. cases where a structure exhibits polymorphism & the particular polymorph propagated is sensitive to initial conditions
counterexamples: error-correcting codes (robust to perturbation), some chaotic systems (no 'qualitatively different' basins in double pendulum behavior), mutational reproductive success (more fit mutations will propagate more widely, this is not generally determined by the time at which the mutation appears in the population)

is this true for value formation? some cases:

broadly, "developmental interpretability," insofar as one is interested in characterizing the stage-wise development of a neural network's policy. the SLT thesis as pursued by Timaeus (see Influence Dynamics and Stagewise Data Attribution, Embryology of a Language Model, Modes of Sequence Models and Learning Coefficients) falls in this category, as does characterizing the inductive bias of SGD, expanding the SLT story to encompass RL, attempts to link algorithmic information theory to modern training dynamics, the "neural nets as QFTs" perspective (see Grokking as a First Order Phase Transition in Two Layer Networks).
- pros: empirical work on actual neural nets we can train and try to interpret!
- cons: much work involves toy models and doesn't address the "what are values" question; there's a streetlighting effect where we find structure that we look for & ignore the parts of the network which look "random" from this perspective
- meta-con of all? the theoretical interp work being somewhat predicated on the thesis that the algorithmic structure of the learned policy is determined by phase transitions in some thermodynamic-ish measurables of the network
  - comp-mech/Simplex not like this
- success of these agendas should be evidence in favor of the thesis
sharp-left turn discourse
- existence of sharp-left turns implies criticality in value formation (not polymorphism)
- my summary of the original argument:
  - "being generally capable" is instrumentally useful in a way that "being aligned" is not (this is also my understanding of the "corrigibility is anti-natural" argument), so there exists a strong attractor towards capability improvement that does not exist for alignment. alignment & capabilities are not aligned in the limit, thus capabilities generalize farther & faster than alignment, so your alignment breaks
- i don't quite understand the arguments or counterarguments or really the arguments for why corrigibility is anti-natural?
- one way I want to concretize this is saying something about the stability of a logical inductor's value of statements which refer to itself (goals are 'just' beliefs about future actions, values are 'just' persistent goals)
  - LIs have Introspection (4.11) and Self-Trust (4.12) which makes their behavior nice in the limit
  - plausibly you'd want to study beliefs in a game-like setting, either with information revelation over agent preferences or environment state, and see what happens?
humans
- trauma / philosophy / psychosis / abnormal psychological effects can induce extreme value shifts. this does not seem to be accompanied by an overall increase in individual performance
- humans raised in a slightly abnormal environment are pretty normal. humans raised outside of society are not very normal.
- the 'philosopher AI concern' comes from a belief that at some point the agents will be able to arbitrarily reflect & decide what their values should be. i feel like consequentialist agents at time $t_{0}$ are incentivized not to let this happen at time $t_{1} > t_{0} .$
- in particular humans cannot arbitrarily intervene on their values very well

Hobbling-Induced Innovation

November 02, 2025

Rather famously, Tesla refuses to use LIDAR and Autopilot only takes 2D observational video data as input. Autopilot is the only production-ready, end-to-end self-driving model. Waymo currently relies on a modular architecture using LIDAR, but is pivoting to end-to-end as well. Tesla seems to have made the correct long-term technical bet (end-to-end models for self-driving), but at the cost of a prima facie nonsensical constraint (strictly less sensory input).
AlphaGo Zero was the first of its kind to be trained only on self-play, without reliance on human data. It went on to beat AlphaGo Lee (the version that defeated Lee Sedol) 100-0, and the rest is history.
At Softmax, we made the Cogs face in their chosen direction before taking a step. This made the agents harder to train and led to less consistent behavioral patterns. However, we made progress on our goal-conditioning agenda.
Apple refused to support Flash on iOS, publicly committing to a solely HTML5-based stack in 2010. Adobe stopped developing Flash for mobile in 2011 and eventually deprecated Flash entirely in 2020. Apple lost market share in the short term but clearly won (Flash was not a good product).
Rust's borrow checker forbids shared mutable aliasing. As a result, whole classes of memory-safety errors (use-after-free, data races) are ruled out at compile time in safe Rust, a drastic reduction compared to C/C++.

All five of these share the property of "removing functionality to hopefully raise the long-term ceiling of performance." It is unclear if all of these modifications did raise the ceiling! Hindsight informs us that supervised learning on human data for two-player, zero-sum, perfect-information games is indeed a crutch. But it seems to be relatively straightforward to integrate LIDAR or radar data into an E2E self-driving model training stack, and both grant visibility in environments where video-only data is differentially disadvantaged.

Picking at the Tesla case more: it is true that LIDAR sensor per-unit prices were at ~$8,000 in 2019. Integrating that would kill any chance at making an affordable FSD consumer vehicle. Today, Luminar has brought this down to $500 in the USA and Chinese manufacturer Hesai sells sensors for $200 a pop. Prices will continue to fall, LIDAR will no longer be price-prohibitive, and Rivian plans to take advantage of the full sensor array when developing its FSD model. What gives?

Google X has the mindset that one must kill good things to make way for truly great ones. "Necessity is the mother of invention." Making a 10x breakthrough is only 2x harder. And for sure, constraining the problem to only its essential inputs can result in more scalable and successful solutions (SpaceX's Raptor 3 is no exception). But was it fundamentally necessary for Tesla to ban LIDAR?

Argument for: LIDAR was prohibitively expensive, and Tesla would have failed to get the necessary distribution for data collection by using LIDAR. Counter: fair, but this doesn't address why there's a lack of radar (very useful in low-visibility scenarios, cheap, would have improved safety).

Argument for: Elon-culture is a package deal, Elon-culture was the determinative factor in the development of Autopilot, Elon-culture takes the hardcore minimalism and runs with it. Counter: I can believe this (Casey Handmer says this), but it still seems so obviously optimal that once the 0-1 is achieved you optimize for having a good product. Human eyes are not optimized for terrestrial vision, there's no point sticking to the human form factor!

Moving away from Tesla: I think we can construct a typology of reasons why one would intentionally hobble their development (via restriction) for the sake of innovation. First, because it bakes in a fundamental limitation (AlphaGo is like this, Tesla's original argument can be argued to be like this). Second, because restriction allows for better design (as in the case of Rust and Apple's refusal to use Flash), and better design creates a healthier ecosystem (this seems to be mostly applicable to platform-based products). Third, because adopting the stance of doing a Hard thing is useful, and artificially increasing the Hardness of the task has better consequences (I think of Elon like this, within limits: push up to the boundaries set by physics and no farther).

It takes skill to understand the directions in which one can make a problem harder productively. Facebook actually failed miserably at pivoting to HTML5 at the same time as Apple. Tesla's removal of radar ruffled feathers in the engineering team. Survivorship bias rules all, and given PMF it's probably easier to make development too hard rather than too easy (following customer incentive-gradients sets a floor & strong signal).

It's probably good to implement a kind of regularization in research-heavy, 0-1 product development: strip out all the assumptions, solve the core task, add additional configurations on top of a good foundation. I don't think it's necessary to continue hobbling oneself when it's proven unnecessary. That is masochism, and your competitors will beat you.

Idiolects?

November 01, 2025

French fluency is neither necessary nor sufficient for understanding EGA.
There's a certain sense in which understanding a particular French "dialect" (the collection of words + localized grammar + shared mental context required to make sense of EGA, the one which forms the basis for modern French algebraic geometry (?)) is a sufficient condition for understanding EGA.
There's also a sense in which understanding this French algebro-geometric dialect is an almost necessary condition for understanding EGA past a certain point (happy to consider disputations, and perhaps the understanding one receives from the necessity condition is less directed at the concepts which the literature built off of but rather the peculiarities of Grothendieck et al.'s mental states & historical context).
Packaging "shared mental context" with a "dialect" and subsequently claiming that understanding the "dialect" is necessary and sufficient for understanding the embedded concepts is begging the question.
It seems like there is this restricted language associated with a set of concepts, the concepts themselves are understood in the context of the restricted language, the concepts are mostly divorced from the embedded grammar of the parent language, and we don't have a very good way of drawing a boundary around this "restricted language."
In a general sense, this kind of "conceptual binding" is not rigid. Strong Sapir-Whorf is incorrect, the Ghanaian can learn English, I can just read Hartshorne or solely Anglophonic literature to learn algebraic geometry.
However, canonical boundaries make sense even when the boundaries are leaky. A species is not completely closed under reproduction, however it makes sense to think of species as effectually reproductively closed. A cell wall separates a cell from its environment, even if osmosis or active transport allows for various molecules to be transported in and out.
One might expect this binding to be "stronger" when the inferential distance between the typical concepts of some reference class of language-speaker and the concepts discussed in the "dialect" is larger.
A general description of a language used by a group of communicators is the tuple (alphabet, shared conception of grammatical rules, shared semantic conception of language atoms & combinator outputs).
Outside of purely formal settings, the shared conceptions of grammar & semantics will be leaky. How much can be purely recovered from shared words?
However, there are natural attractors in this space. Ex. traditional dialects, modern languages. Shared conception diffs between language-speakers are significantly smaller than shared conception diffs between two different language speakers (this is by default unresolvable unless there's some shared conception of translation, at which point they're sort of speaking the same conceptual language?)
When talking about algebraic geometry, it feels like an English geometer and a French geometer are speaking more similar languages than a French geometer and a French cafe owner.
I want to say: "an idiolect is a natural attractor in the space of languages for a group of communicators discussing a certain set of concepts, the idioms of the idiolect are identified with the concepts discussed, and the idiolect is quasi-closed under idiomatic composition."
Identifying shared languages as emergent coordination structures between a group of communicators feels satisfying.
However, returning to the case of algebraic geometry, it feels like I can "grok" the definitions of the structures described without understanding the embedded French grammar in EGA. Maybe the correct decomposition of a shared language is (shared idiomatic conception) + (translation interface), and we should just care about the "pre-idiolect."
This is just a world model? Describable without reference to other communicators? Loses some aspect of "coordination"?
Maybe the pre-idiolect is s.t. n communicators can communicate idioms & their compositions with minimal description of a translation interface.
The idiom <-> concept correspondence feels correct. Like, on some level, one of the primary purposes of a grammatical structure is to take the concepts which are primarily bound to words & make sense of their composition, and lexicogenesis is a large part of language-making. But it feels like restricting to wordly atoms is too constraining and there are structural atoms that carry semantic meaning, and idiom can encompass these.
How do you reify concept-space enough to chunk it into non-overlapping parts?
I am trying to point at a superstructure and say "here is a superstructure." I am trying to identify the superstructure by a closure criterion, and I am trying to understand what the closure criterion is. Something language-like should be identifiable this way? And the appropriate notion of closure will then let us chunk correctly?
Maybe superstructures are not generally identifiable via closure?
The load-bearing constraint for considering species as superorganisms is a closure property. They're not particularly well-describable by Dennett's intentional stance.
I want to say "idiolect:species :: communicator:member-organism :: idiom:gene."
I don't want to identify lexemes as the atoms of a language-like-structure. Chomsky et al.'s new mathematical merge formalism is cool but construed, and I have not seen a clean way to differentiate meaningful lexeme composition from non-meaningful lexeme composition.
"Shared understanding" feels better? The point of a language is to be a mechanism by which communicators communicate, and it so happens that languages are characterizable by some general formal properties.

Astronomical Waste Given Acausal Considerations

October 27, 2025

[speculative draft]

Bostrom's original astronomical waste argument is as follows:

Consider all stars in the Virgo supercluster.
Now consider the total number of digital humans simulatable with the energy stored in these stars, given that the energy is harvested with technologies currently assessed to be feasible.
This provides a lower-bound on the potential value lost per-unit-time, assuming an ethical stance at least somewhat similar to an aggregative total utilitarian.
This is a lot of value per-unit-time.
Correspondingly, existential risk poses a threat so large it dominates all other considerations, as it eliminates the possibility of human colonization.

This model is static. In particular, it does not consider dynamism in the size of the actualizable universe. By restricting to the local supercluster, one ignores the potential resources of the stars beyond, including those turned inaccessible by cosmological expansion. For the purposes of establishing a conservative bound on the potential value left on the table, these nitpicks are minor. However, when assessing the tradeoffs between safety-focused and progress-focused policies under various ethical viewpoints, they matter.

Obviously, the natural extension is to introduce models of spacefaring civilizational expansion and develop more quantitative estimates of "median"-case spatial diffusion under reasonable hypotheses. This analysis would be informative and useful. I will not be pursuing it further in this post.

Rather, I am interested in a more esoteric setting. Acausal bargaining strategies give agents the ability to influence universes beyond their traditionally considered scope (e.g. the lightcone) by independently considering and instantiating the values of agents who would, given their value instantiation in our world, take actions in partial accordance with our values in theirs. A "coalition" then forms between agents who engage in this reciprocal trade.

What properties might this coalition have?

It is composed of agents who care about their values being instantiated across the space of universes the agents in the coalition occupy. These agents then likely have values which are universal.
It is composed of agents who are sufficiently cognitively advanced to reason in similar thought-patterns to the ones described. Given that human understanding is at a nascent stage, this puts a rough lower bound on capability.
The coalition is necessarily composed of agents which are willing to trade with us.
[other properties that can likely be gleaned by a mind with more intelligence than mine]

These are local properties, in that they are agent properties which then place some constraints on the environments the agents find themselves in. They are not global properties (constraints on the laws of physics of the relevant universes). Our reasoning about the space of acausally-influenceable universes via acausal bargaining is thus necessarily agent-centric.

Under these considerations, the actualizable universe of a member of a given coalition is the union of the causally-actualizable universes of the members of the coalition. Astronomical waste concerns would then occur whenever considering actions or inactions leading to a decrease in the size of the actualizable universe.

What actions or inactions would affect the size of the actualizable universe? It's difficult to come up with a natural conception of global temporality in this setting: local constraints on agent environments tell us little about what stage of the cosmological lifetime the agent exists in. Interpreting temporality within a member's lightcone is easier: waiting to implement acausal bargaining strategies leads to a loss of value-bargaining-power, given that the size of the member's causally-actualizable universe decreases in accordance with classical astronomical waste arguments.

It is tempting to say that this loss is massive, much more massive than the temporal loss associated with one's own causally-actualizable universe. I refrain from claiming this strongly because I do not have a good understanding of what bargaining strategies within an acausal coalition look like. Finnveden's Asymmetric ECL and Treutlein's Modeling evidential cooperation in large worlds are good places to look to start thinking about this. (Dai makes the argument that it seems like our universe is pretty small, so it stands to reason there's much more to be gained via acausal trade.)

[even more speculative]

It is also possible that a multiplicity of coalitions is induced by agents throughout the multiverse having conflicting values. Given the aggregative tendency to ensure value-lock-in for successor agents, it is potentially the case that coalitional lock-in at the civilizational level occurs shortly after knowledge of basic acausal bargaining strategies. It is not insane to assume heterogeneity in the size of the coalitional actualizable universes, implying that large sources of astronomical waste may come from choosing incorrectly, or from joining smaller coalitions.

[I note that it is possible the notion of an "acausal coalition" is flawed and in fact acausal trades are not closed in this manner---A can trade with B and B can trade with C while C might not be able to trade with A.]

[TODO: introduce bargaining models, quantify classical astronomical waste in the spatial setting, quantify universe "smallness" under variety of cosmological models]

Perfect SMLD is TC $^{0}$

October 16, 2025

This paper, Perfect diffusion is TC $^{0}$ -- Bad diffusion is Turing-complete, has been stuck in my head. Here's an exposition of $1 / 2$ of the result.

(I claim no novelty in either exposition or content for what is below)

what is SMLD?

Given a distribution $ρ_{\mathrm{data}}$ over $\mathbb{R}^{d},$ SMLD aims to learn a score model $f_{θ}$ which approximates $\nabla \log ρ_{\mathrm{data}}$ well. In particular, writing $ρ_{0} = ρ_{\mathrm{data}}$ and $ρ_{t}$ for the distribution after noising for time $t,$ we aim for $f_{θ} (x, t)$ to approximate $\nabla_{x} \log ρ_{t} (x) .$ Why are we introducing time as a parameter? The point of diffusion models is to learn to reverse a noising process and thus to sample the underlying distribution well. SMLD does this by training a model to learn the score function in the reverse Langevin process for a given noise schedule. In more detail:

Take some noise schedule $β (t) : [0, \infty) \to [0, \infty)$ such that $\int_{0}^{\infty} β (t) \, d t = \infty$ (this is to ensure the distribution gets fully noised).
Evolve my data sample $x_{t} \sim ρ_{t}$ according to $d x_{t} = - \frac{1}{2} β (t) x_{t} d t + \sqrt{β (t)} d W_{t},$ which is just (time-rescaled) Langevin dynamics targeting a standard Gaussian (and $d W_{t}$ is the Brownian motion increment). This has a corresponding evolution over the distributions given by $\partial_{t} ρ_{t} = \frac{1}{2} β (t) (\nabla \cdot (x ρ_{t}) + Δ ρ_{t})$ (straightforward from Fokker-Planck).
Fix a time $T > 0.$ The reverse time-evolution process can be exactly characterized by $d {\hat{x}}_{t} = \frac{1}{2} β (T - t) {\hat{x}}_{t} d t + β (T - t) \nabla_{{\hat{x}}_{t}} \log ρ_{T - t} ({\hat{x}}_{t}) d t + \sqrt{β (T - t)} d W_{t},$ and $\nabla_{{\hat{x}}_{t}} \log ρ_{T - t} ({\hat{x}}_{t})$ is exactly what we're trying to approximate with $f_{θ} (\hat{x}, T - t) .$
So, if I sample ${\hat{x}}_{0} \sim \mathcal{N} (0, I_{d})$ (pure noise!), then by solving the reversed SDE above with $f_{θ}$ substituted for the score function, I can approximately sample from my underlying data distribution.
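
The recipe above can be sanity-checked numerically in a case where the score is known in closed form. What follows is a minimal sketch, not anything from the paper: it assumes a 1D Gaussian data distribution $N(m, s^{2})$ and a constant schedule $β (t) = β,$ in which case $ρ_{t} = N (m e^{- β t / 2}, s^{2} e^{- β t} + 1 - e^{- β t}),$ and it solves the reversed SDE with plain Euler-Maruyama. The function names (exact_score, reverse_sample, sample_many) and the default parameters are my own choices for illustration.

```python
import math
import random
import statistics


def exact_score(x, tau, mean0, std0, beta):
    """Score of rho_tau, assuming rho_0 = N(mean0, std0^2) and constant beta."""
    decay = math.exp(-beta * tau)                 # e^{-beta * tau}
    mean_tau = mean0 * math.sqrt(decay)           # mean shrinks like e^{-beta * tau / 2}
    var_tau = std0 ** 2 * decay + (1.0 - decay)   # data variance decays, noise variance grows to 1
    return -(x - mean_tau) / var_tau


def reverse_sample(rng, mean0, std0, beta=1.0, T=8.0, steps=400):
    """One Euler-Maruyama trajectory of the reversed SDE, started from N(0, 1)."""
    dt = T / steps
    x = rng.gauss(0.0, 1.0)                       # pure noise at reverse time 0
    for k in range(steps):
        tau = T - k * dt                          # forward time corresponding to this step
        drift = 0.5 * beta * x + beta * exact_score(x, tau, mean0, std0, beta)
        x += drift * dt + math.sqrt(beta * dt) * rng.gauss(0.0, 1.0)
    return x


def sample_many(n, mean0=2.0, std0=0.5, seed=0):
    """Draw n approximate samples from N(mean0, std0^2) via the reversed SDE."""
    rng = random.Random(seed)
    return [reverse_sample(rng, mean0, std0) for _ in range(n)]


if __name__ == "__main__":
    samples = sample_many(1000)
    print("mean:", statistics.fmean(samples), "std:", statistics.stdev(samples))
```

With a perfect score the only errors left are the finite $T$ (the start is $N (0, 1)$ rather than exactly $ρ_{T}$) and the step size, so the empirical mean and standard deviation should land close to $m$ and $s.$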

TC $^{0}$ circuit families are (essentially) neural networks

Circuit complexity classes describe "families of boolean circuits." A "family of boolean circuits" is just a sequence of boolean circuits, one for each input size $n,$ all satisfying some properties; a boolean circuit is (abstractly) some DAG of "gates" (boolean functions) that takes a series of Boolean inputs to a Boolean output. A TC $^{0}$ family in particular satisfies:

polynomial width: the number of gates at each depth is upper-bounded by a polynomial in the input size $n,$
constant depth: the maximum number of gates on any path from input to output is upper-bounded by some constant,
unbounded fan-in: each gate can receive inputs from arbitrarily many other gates,
each gate is either AND, OR, NOT, or MAJ (for "majority"; returns True when half or more of its arguments are True and False otherwise).

You can use MAJ gates to simulate threshold gates. As such, this last bullet point is equivalent to:

each gate is a threshold gate: $step (\sum_{i} w_{i} x_{i} + t),$ where $w_{i}, t$ are real numbers and $step$ returns $1$ if the value is greater than $1 / 2$ and $0$ otherwise.

This is a feed-forward neural network, where each gate is an activation function acting on Boolean inputs. TC $^{0}$ is the class of neural networks with width $< p (n)$ and depth $\leq D .$ Circuit complexity classes in general are somewhat pathological, and can solve sets of problems that one might not expect to be easily grouped together. But this is quite interestingly natural for ML purposes.
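
The easy direction of the gate equivalence can be checked by brute force. This is a small sketch under the convention used above (the gate fires when $\sum_{i} w_{i} x_{i} + t > 1 / 2$): it writes AND, OR, NOT and MAJ as single threshold gates and verifies them on all small inputs. The particular weights and biases are my own choice (other choices work), and the harder direction, simulating arbitrary threshold gates with MAJ, is not shown.

```python
from itertools import product


def threshold_gate(weights, bias, inputs):
    """step(sum_i w_i * x_i + t): returns 1 if the value exceeds 1/2, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0.5 else 0


def and_gate(inputs):
    n = len(inputs)
    return threshold_gate([1] * n, -(n - 1), inputs)    # exceeds 1/2 only when all n inputs are 1


def or_gate(inputs):
    return threshold_gate([1] * len(inputs), 0, inputs)  # exceeds 1/2 when at least one input is 1


def not_gate(x):
    return threshold_gate([-1], 1, [x])                  # 1 - x > 1/2 iff x == 0


def maj_gate(inputs):
    n = len(inputs)
    return threshold_gate([2] * n, 1 - n, inputs)        # 2*sum - n + 1 > 1/2 iff sum >= n/2


def check_all(max_n=5):
    """Exhaustively compare each gate against its Boolean definition."""
    for n in range(1, max_n + 1):
        for bits in product([0, 1], repeat=n):
            assert and_gate(bits) == int(all(bits))
            assert or_gate(bits) == int(any(bits))
            assert maj_gate(bits) == int(2 * sum(bits) >= n)
    assert not_gate(0) == 1 and not_gate(1) == 0
    return True
```

Each gate here is literally one neuron with a hard step activation, which is the sense in which a constant-depth, polynomial-width circuit of them reads as a feed-forward network.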

proof of the result

Theorem. Take a TC $^{0}$ family of score-networks $f_{θ, 0}, f_{θ, 1}, \dots$ such that for each $n$ and each $x_{1}, \dots, x_{n}$ the function $f_{θ, n} (x, t | x_{1}, \dots, x_{n})$ exactly computes the score function of some initial distribution $ρ_{0, n}$ with bounded first moment. If this family solves a prefix language modeling problem in the SMLD infinite time limit with a constant probability bound, then the problem is in TC $^{0} .$

What is a "prefix language modeling problem"? Next-token prediction: given $n$ previous tokens in an alphabet, predict token $n + 1.$ Such a problem is solved within a circuit complexity class if there is a family of circuits $C_{n},$ one for each input size $n,$ that satisfies the constraints of the class and solves the problem.

The proof relies on a result from the literature given in O(d/T) Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions which states that there exists some universal constant $c > 0$ such that $\mathrm{TV} (p_{X}, p_{Y}) \leq c \frac{d \log^{3} T}{T} + c ϵ_{\mathrm{score}} \sqrt{\log T} .$ Here $\mathrm{TV}$ is the total variation distance, $p_{X}$ is the data distribution, and $p_{Y}$ is the distribution generated by DDPM denoising (essentially just a discretization of SMLD). One can use this to upper bound the total variation distance between the DDPM sampler and our denoising process by some constant $ϵ',$ and setting our constant probability bound of finding a solution to $ϵ$ we find that $ϵ' = ϵ / 2$ is enough for our denoising process to solve the problem. This is then derandomized by a construction given in Threshold circuits of bounded depth, which reconstructs a TC $^{0}$ class.

One may wonder where we require "perfect" score matching! Well, the $c ϵ_{\mathrm{score}} \sqrt{\log T}$ term increases with $T,$ so for this proof to work completely cleanly one requires $ϵ_{\mathrm{score}}$ to be set to $0.$ Practical diffusion networks are not like this; there will always be some error.
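
To see the two terms of the bound pull in opposite directions, here is a tiny numeric sketch. The universal constant $c$ is not pinned down here, so $c = 1$ below is purely an assumption for illustration, as are the function name tv_bound and the sample values of $d,$ $T$ and $ϵ_{\mathrm{score}}.$

```python
import math


def tv_bound(d, T, eps_score, c=1.0):
    """Right-hand side of the TV bound; c = 1.0 is an illustrative assumption."""
    discretization = c * d * math.log(T) ** 3 / T       # shrinks to 0 as T grows
    score_error = c * eps_score * math.sqrt(math.log(T))  # grows (slowly) with T
    return discretization + score_error


if __name__ == "__main__":
    for T in (1e4, 1e6, 1e8, 1e12):
        print(T, tv_bound(10, T, 0.0), tv_bound(10, T, 0.01))
```

With $ϵ_{\mathrm{score}} = 0$ the bound keeps falling as $T$ grows, so any target $ϵ / 2$ is eventually met; with any fixed $ϵ_{\mathrm{score}} > 0$ it bottoms out and then creeps back up, which is why the infinite-time argument needs the perfect-score assumption.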
