
Comments (62)

  • stephantul
The fact that this was on the set of training problems with a custom harness basically makes the headline a lie. What if you give Opus the same harness? Do people even care about meaningful comparisons any more, or is it all just "numbers go up"?
  • lairv
Note that this uses a harness, so it doesn't qualify for the official ARC-AGI-3 leaderboard. According to the authors, the harness isn't ARC-AGI specific though: https://x.com/agenticasdk/status/2037335806264971461
  • gslin
    https://en.wikipedia.org/wiki/Goodhart's_law
    > Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.
  • mohsen1
    Uses the public dataset for evaluation, which is not what it's meant for. Writes a super-specific prompt [1] and claims eye-catching results. This is the state of "AI" these days, I guess...
    [1] https://github.com/symbolica-ai/ARC-AGI-3-Agents/blob/symbol...
  • modeless
    On the public set of 25 problems. These are intended for development and testing, not evaluation. There are 110 private problems for actual evaluation purposes, and the ARC-AGI-3 paper says "the public set is materially easier than the private set".
  • padolsey
    Knowing the nature of a test ahead of time and building out your capabilities and tooling before entering the exam hall, when your peers don't have that advantage, makes you a cheater.
  • andy12_
    Apparently the score would be a little higher if scores weren't penalized for being worse than the human baseline while not being rewarded for being better than it (which seems like an arbitrary decision; the human baseline is not optimal).
  • bytesandbits
    We constantly underestimate the power of inference scaffolding. I have seen it in all domains: coding, ASR, ARC-AGI benchmarks, you name it. Scaffolding can do a lot! And post-training too. I am confident our current pre-trained models can beat 80% on this benchmark with the right post-training and scaffolding. That being said, I don't think ARC-AGI proves much. It is not a useful task in the wild; it is just a game, and a strange and confusing one. For me this is a pointless pseudo-academic exercise: good to have, but it by no means measures intelligence, and even less the utility of a model.
  • esafak
    Anybody used this Agentica of theirs?
  • AbanoubRodolf
    [dead]