Comments (18)
- bee_rider
  > Directions we think are wide open
  > Second-order optimizers and natural gradient methods

  Do second-order optimizers help improve data efficiency? I assumed they’d help you get to the same minimum faster (but this is way outside my wheelhouse).
- STARGA
  The data-limited regime is where most of the interesting engineering happens. When you have infinite data, you can paper over bad architecture choices with more tokens. When data is fixed, every design decision — tokenizer vocabulary, attention pattern, positional encoding, regularization — has measurable impact on sample efficiency.

  The ensemble approach is worth examining closely. In low-data regimes, model diversity matters more than individual model quality. If your 8 models converge to similar representations (which happens with identical architectures and similar init), the ensemble gain is minimal. The interesting question is whether architectural diversity (different attention patterns, different FFN ratios) gives better ensemble coverage than just different random seeds.

  The aggressive regularization finding aligns with what we see in other domains. When your dataset is small, the model's capacity-to-data ratio is the dominant variable. Dropout, weight decay, and data augmentation are doing more work than the optimizer or learning rate schedule.
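  One quick way to sanity-check the seeds-vs-architecture question is to measure pairwise prediction disagreement between ensemble members: if the members mostly agree, the ensemble has little headroom. A minimal numpy sketch (the toy logits here just stand in for real model outputs; all names and data are illustrative):

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  def pairwise_disagreement(logits_per_model):
      """Average fraction of examples where two models' argmax
      predictions differ, over all model pairs."""
      preds = [np.argmax(l, axis=-1) for l in logits_per_model]
      n = len(preds)
      rates = [np.mean(preds[i] != preds[j])
               for i in range(n) for j in range(i + 1, n)]
      return float(np.mean(rates))

  # Toy stand-in for 8 models' logits on a shared eval set:
  # "similar" models are small perturbations of one base model
  # (same architecture, similar init); "diverse" are independent.
  base = rng.normal(size=(100, 10))
  similar = [base + 0.1 * rng.normal(size=base.shape) for _ in range(8)]
  diverse = [rng.normal(size=base.shape) for _ in range(8)]

  print(pairwise_disagreement(similar))  # low: little ensemble headroom
  print(pairwise_disagreement(diverse))  # high: more ensemble coverage
  ```

  Running the same measurement on seed-only vs. architecture-varied ensembles would directly answer whether the extra diversity buys anything.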
- linolevan
  There was a very interesting paper out of Stanford last September about pretraining under the unlimited-compute, limited-data paradigm [0]. Pretty much exactly the same thing, but with ~200M training tokens instead.

  [0] https://www.alphaxiv.org/abs/2509.14786
- kseniamorph
  Curious about the baseline choice. modded-nanogpt was optimized for wall-clock speed, not data efficiency, so it seems like an unusual reference point for this kind of benchmark. Why not vanilla NanoGPT?
- archermarks
  Very cool idea. Interested to see how this progresses. One question: how worried are you about over-training on this particular dataset, i.e. memorizing rather than generalizing? Obviously you hold out a validation set, but since you're meta-optimizing the model itself by its performance on that validation set, you're still at risk of overfitting.
- lzaborowski
  I like the idea of flipping the constraint. Most ML benchmarks assume unlimited data and limited compute, so people optimize for speed.

  If high-quality training data becomes the real bottleneck, then the interesting question is how much signal you can extract from the same dataset when compute is cheap.
- refulgentis
  This looks awesome! I’m curious about the ensemble: does it mean “train 8 different models and pick the best one”? That’s what my mind jumps to, but that also seems wrong, because then you could just keep increasing the number of models you train to get a win.
- navvyeanand
  Amazing job!
- suddenlybananas
  Reminds me a fair bit of the BabyLM challenge. It would be good to give them a shout-out and explain how this challenge differs.
- riajain2525
  Super cool!