thoughts on VeLO
February 03, 2026
[1] hypernetwork architecture. there's a tensor-level LSTM that takes in aggregate loss statistics and outputs a weighting vector $c_{hyper}$ used to interpolate between a bank of pre-trained per-parameter MLPs; the interpolated MLP is then fed the per-parameter statistics and outputs a direction $d$ and a log-magnitude $m$, giving the update $\Delta p = 10^{-3} \cdot d \cdot \exp(10^{-3} \cdot m) \cdot ||p||_2$ (the tensor-level LSTM also produces a learning-rate control $c_{lr}$). not immediately obvious to me where these pre-trained MLPs come from
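a minimal numpy sketch of that per-parameter update rule as I read it (function names and the toy MLP outputs are mine, not the paper's):

```python
import numpy as np

def velo_update(p, d, m, alpha=1e-3):
    """Per-parameter step as described above (my reading, not official code):
    d is the MLP's direction output, m its log-magnitude output, and the
    whole step is scaled by the weight tensor's L2 norm."""
    return alpha * d * np.exp(alpha * m) * np.linalg.norm(p)

# toy tensor plus made-up MLP outputs
p = np.full(4, 0.5)                    # ||p||_2 = 1.0
d = np.array([1.0, -1.0, 0.5, 0.0])    # hypothetical direction output
m = np.zeros(4)                        # hypothetical log-magnitude output
delta = velo_update(p, d, m)           # -> 1e-3 * d here, since exp(0) = 1
```

the $\exp(10^{-3} m)$ factor means the MLP controls step magnitude multiplicatively, which keeps updates well-scaled across tensors of very different norms.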
[2] meta-training distribution is surprisingly diverse. image classification / image generation / language modeling, across MLP / CNN / vision transformer / autoencoder / VAE / transformer architectures, with varied initializations. + task augmentation! (weight reparameterization / gradient renormalization / etc.). ~4k TPU-months of compute. these tasks were all relatively small models: VeLO generalized poorly to a 100M-param LLM (compared to a tuned Adafactor baseline), but "within distribution" VeLO seems to outperform tuned Adam / Shampoo / other classical optimizers.
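a toy sketch of the weight-reparameterization flavor of task augmentation (constants and API are mine, purely illustrative): each tensor is rescaled so the optimizer sees differently-scaled parameters while the function computed stays the same.

```python
import numpy as np

def reparameterize(params, rng, low=0.1, high=10.0):
    """Toy weight reparameterization: replace each tensor w with w' = w / s
    for a fixed random scale s, and have the forward pass use s * w'.
    The network computes the same function, but the optimizer now faces
    parameters (and gradients) at a different scale."""
    scales = [rng.uniform(low, high) for _ in params]
    reparam = [w / s for w, s in zip(params, scales)]
    return reparam, scales

def forward(reparam, scales, x):
    # single linear layer, enough to show the invariance
    (w,), (s,) = reparam, scales
    return x @ (s * w)

rng = np.random.default_rng(0)
w = np.array([[1.0, 2.0], [3.0, 4.0]])
reparam, scales = reparameterize([w], rng)
x = np.array([[1.0, -1.0]])
y = forward(reparam, scales, x)   # identical to x @ w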
[3] ES as a gradient-estimation method. the meta-objective was to minimize end-of-training loss (vs. average per-step loss), and gradients were estimated with the classical antithetic estimator: perturb the parameters with sampled Gaussian noise and compute $$\frac{1}{2m\sigma} \sum_{i=1}^m \left(L(\theta + \sigma \epsilon_i) - L(\theta - \sigma \epsilon_i)\right)\epsilon_i,$$ where $\epsilon_i \sim \mathcal{N}(0, I)$. I wonder how much there is to be gained from iterating on this approach? the ES literature hasn't quite stagnated
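the antithetic estimator above is easy to sanity-check numerically; a self-contained sketch (hyperparameters are mine):

```python
import numpy as np

def es_grad(loss, theta, sigma=0.01, m=20000, seed=0):
    """Antithetic evolution-strategies gradient estimate of `loss` at
    `theta`, matching the estimator above:
    (1 / (2*m*sigma)) * sum_i (L(theta + s*eps_i) - L(theta - s*eps_i)) * eps_i."""
    rng = np.random.default_rng(seed)
    grad = np.zeros_like(theta)
    for _ in range(m):
        eps = rng.standard_normal(theta.shape)
        grad += (loss(theta + sigma * eps) - loss(theta - sigma * eps)) * eps
    return grad / (2 * m * sigma)

# sanity check on a quadratic, where the true gradient is 2*theta
theta = np.array([0.5, -1.0, 2.0])
g = es_grad(lambda t: float(t @ t), theta)   # approx [1.0, -2.0, 4.0]
```

note the antithetic (paired +/-) perturbations cancel the zeroth-order term of $L$ exactly, which is a big variance win over the one-sided estimator; variance still scales with dimension, which is why it took thousands of TPU-months here.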
[4] Adam is really quite good? VeLO outperforms tuned Adam by (eyeballing) ~2-5x in training speedup, with an intense resource input, and even then isn't Pareto-dominant vs. tuned Adam OOD. (VeLO is also much more expensive per step, & couldn't generalize to RL or GNNs (as far as they tested))
[5] convinced me that there truly is a lot of performance one could eke out of meta-training a learned optimizer, provided one doesn't care about resource constraints. many cool directions to extend this in