Comments (19)
- amluto: This design implicitly does something similar to what I sometimes think conventional transformers should try: allowing later layers to query the KV data from earlier layers. As far as I can tell, with a conventional transformer, if a higher (and presumably higher-level-thinking) layer wants to take input about earlier tokens from something lower down, it needs to get it from the lower layers' output and “remember” it by itself instead of just reading it directly. But suppose an extra attention head were added that queried the KV data from lower layers. At the very least, I imagine this might cleanly solve the STRAWBERRY problem: whatever layer has figured out that the prompt wants to count instances of R could attend to lower layers that actually perceive those Rs.
- marojejian: Sounds like a further improvement in the spirit of HRM & TRM models. Decent comment via X: https://x.com/r0ck3t23/status/2002383378566303745 I continue to be fascinated by these architectures that (1) build recurrence / inference scaling into transformers more natively, and (2) don't use full recurrent gradient traces, and succeed not just despite that, but because of it.
- Moosdijk: Interesting. Instead of running the model once (flash) or multiple times (thinking/pro) in its entirety, this approach seems to apply the same principle within one run, looping back internally. Instead of big models that “brute force” the right answer by knowing a lot of possible outcomes, this model seems to come to results with less knowledge but more wisdom. Kind of like having a database of most possible frames in a video game and blending between them instead of rendering the scene.
- mysterEFrank: I'm surprised more attention isn't paid to this research direction, and that nobody has tried to generalize it, for example by combining the recurrence concept with next-token prediction. That said, despite the considerable gains, this seems to be just some hyperparameter tweaking rather than a foundational improvement.
- numbers_guy: I'm confused about ARC-AGI. I thought the point of it was that you train a foundational model, then test it against ARC-AGI to figure out how well it reasons. Here and in some of the other reasoning papers, they are training on ARC-AGI. How much sense does that make in practice?
- E-Reverance: It should be noted that these are NOT the official scores on the private evaluation set.
- mlpro: Lol. Trying to copy the Universal Weight Subspace paper's naming to get famous.
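
For readers who want to see amluto's suggestion in concrete terms, here is a minimal sketch of a higher layer's query reading the K/V entries cached by a lower layer. Everything in it is an assumption made for illustration: the `cross_layer_attend` helper, the `kv_cache` layout, and the hand-built two-dimensional keys are hypothetical and are not taken from the paper or from any library. It only shows that such a head could put its attention weight on the positions where a lower layer "saw" an R; it does not count them.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


def cross_layer_attend(query, kv_cache, source_layer, position):
    """Hypothetical extra head: a query formed at a higher layer reads the
    K/V entries cached by `source_layer`, restricted causally to positions
    up to and including `position`."""
    keys, values = kv_cache[source_layer]
    return attend(query, keys[: position + 1], values[: position + 1])


# Toy setup (assumed): layer 0 sees one character per position and encodes
# "is this an r?" directly in its keys.
word = "strawberry"
layer0_keys = [[1.0, 0.0] if ch == "r" else [0.0, 1.0] for ch in word]
layer0_values = [[1.0] if ch == "r" else [0.0] for ch in word]
kv_cache = {0: (layer0_keys, layer0_values)}

# A higher layer that has decided "look for r" emits a query aligned with
# the r-key direction and reads layer 0 directly.
query = [10.0, 0.0]
out, weights = cross_layer_attend(query, kv_cache, source_layer=0,
                                  position=len(word) - 1)

# Share of attention weight that lands on the r positions.
r_mass = sum(w for ch, w in zip(word, weights) if ch == "r")
```

In a conventional stack, the same query would only see keys and values computed from the previous layer's output at each position, so the "is this an r?" signal would have to survive every intermediate layer to be readable here.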