
Comments (186)

  • Tiberium
    https://x.com/scaling01 has called out a lot of issues with ARC-AGI-3, some of them (directly copied from tweets, with minimal editing):
    - Human baseline is "defined as the second-best first-run human by action count". Your "regular people" are people who signed up for puzzle solving, and you don't compare the score against a human average but against the second-best human solution.
    - The scoring doesn't tell you how many levels the models completed, but how efficiently they completed them compared to humans. It uses squared efficiency, meaning if a human took 10 steps to solve a level and the model took 100 steps, the model gets a score of 1% ((10/100)^2).
    - 100% just means that all levels are solvable. The 1% number uses completely different and extremely skewed scoring based on the 2nd-best human score on each level individually. They said that the typical level is solvable by 6 out of 10 people who took the test, so let's assume the median human solves about 60% of puzzles (I know, not quite right). If the median human takes 1.5x more steps than your 2nd-fastest solver, then the median score is 0.6 * (1/1.5)^2 = 26.7%. Now take the bottom-10% guy, who maybe solves 30% of levels but takes 3x more steps: he would get a score of about 3%. (A worked sketch of this arithmetic follows this comment.)
    - The scoring is designed so that even if AI performs on a human level it will score below 100%.
    - No harness at all and a very simplistic prompt.
    - Models can't use more than 5x the steps that a human used.
    - Notice how they also gave higher weight to later levels? The benchmark was designed to detect the continual learning breakthrough. When it happens in a year or so they will say "LOOK OUR BENCHMARK SHOWED THAT. WE WERE THE ONLY ONES".
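    [A minimal Python sketch of the squared-efficiency arithmetic described above. The function name, the cap at 1.0, and the solve-rate weighting are illustrative assumptions, not the official ARC-AGI-3 formula:]

      # Hypothetical reconstruction of the per-level score described above.
      # human_actions is the action count of the 2nd-best first-run human.
      def level_score(human_actions, model_actions):
          # Efficiency relative to the reference human, squared, capped at 1.0.
          return min(human_actions / model_actions, 1.0) ** 2

      print(level_score(10, 100))        # 0.0100 -> the 1% example
      print(0.6 * level_score(10, 15))   # 0.2667 -> median-human estimate (26.7%)
      print(0.3 * level_score(10, 30))   # 0.0333 -> bottom-decile estimate (~3%)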
  • Real_Egor
    I'll probably be the skeptic here, but:
    - Take a person who grew up playing video games. They'll pass these tests 100% without even breaking a sweat.
    - BUT, put a grandmother who has never used a computer in front of this game, and she'll most likely fail completely. Just like an LLM.
    As soon as models are "natively" trained on a massive dataset of these types of games, they'll easily adapt and start crushing these challenges. This is not AGI at all.
  • BeetleB
    > As long as there is a gap between AI and human learning, we do not have AGI.
    Back in the 90's, Scientific American had an article on AI - I believe this was around the time Deep Blue beat Kasparov at chess. One AI researcher's quote stood out to me: "It's silly to say airplanes don't fly because they don't flap their wings the way birds do."
    He was saying this with regard to the Turing test, but I think the sentiment is equally valid here. Just because a human can do X and the LLM can't doesn't negate the LLM's "intelligence", any more than an LLM doing a task better than a human negates the human's intelligence.
  • jwpapi
    This is a very good estimation of AGI. We give humans and AI the same input and measure the results. Kudos to ARC for creating these games.
    I really wonder why so many people fight against this. We know that AI is useful, we know that AI is good at research, but we want to know whether it is what we vaguely define as intelligence.
    I've read the "airplanes don't use wings" and "submarines don't swim" comparisons. Yes, but that is not the question. I suggest everyone coming up with these comparisons check their biases, because this is about Artificial General Intelligence. General is the keyword here; that is what ARC is trying to measure. Whether it's useful or not isn't the point. Whether AI after testing is useful or not isn't the point either.
    This has been the best test so far.
    I also recommend asking AI specialized questions deep in your own job, where you know the answer, and seeing how often the solution is wrong. I would guess it's more likely that we perceive knowledge as intelligence than that intelligence is missing. Probably common amongst humans as well.
  • typs
    My takeaway from playing a number of levels is that I am definitely not AGI
  • lukev
    I'm not sure how this relates to AGI.
    This measures the ability of an LLM to succeed in a certain class of games. Sure, that could be a valuable metric of how powerful (or even how generally powerful) an LLM is. Humans may or may not be good at the same class of games. And we know there exists a class of games (including most human games like checkers/chess/go) where computers (not LLMs!) already vastly outpace humans.
    So the argument for whether an LLM is "AGI" should not be whether it does well on any given class of games, but whether that class of games is representative of "AGI" (however you define that). It seems unlikely that this set of games is a definition meaningful for any practical, philosophical, or business application.
  • culi
    The thing I most appreciate about the ARC-AGI leaderboards is how the graph also takes into account cost per task. All of the recent major advancements in benchmarks seem a little less impressive when also taking into account the massive rise in cost they're paired with. The fact is we can always get a little bit better output if we're willing to use more electricity
  • strongpigeon
    This is a good and clever benchmark and a worthy successor to the previous two. That being said, I find the "no tools" approach a bit odd. They're basically saying that it's OK to have tools as long as they're hidden behind the API layer. Isn't that an odd line to draw?
    It feels like the rule should be "no ARC-AGI-3-specific tools", not "no tools that aren't built in"...
  • Zedseayou
    I was a human tester (I think) for this set of games. I did 25 games in the 90 minutes allotted. IIRC the instructions did mention minimizing action count, but the incentives/setup ($5 per game solved) pushed for solve speed over action count. I do recall trying not to just randomly move around while thinking, but that was not the primary goal, so I would expect the baseline human solutions to have more actions than might otherwise be needed.
  • Stevvo
    Maybe I'm just not intelligent, but I gave it a couple of minutes and couldn't figure out WTF the game wants from you or how to win it.
  • largbae
    I feel like we've got tunnel vision. Things you can do on a computer are a tiny subset of what a human can do. If the AI had to control a body to sit on a couch and play this game on a laptop, that would be a step in the right direction.
  • cedws
    It's like playing The Witness. Somebody should set LLMs loose on that.
  • ranyume
    This is an interesting update, and a big challenge for companies and labs. The new measurement tools are indeed what I'd like out of future agents, and agents that solve the games will need to use different subsystems to do so. This is basically optimization for achieving goals (as opposed to prompt engineering / magic spells to make the LLM do what it's told), which IMO is the future we should aspire to build.
  • NiloCK
    I hope at least some of these are direct Chip's Challenge ports. Waiting for some old muscle memory to kick in here.
  • levmiseri
    For a loosely similar 'benchmark', I recently tried to test major LLMs on my coding game (models write code controlling their units in a 1v1 RTS) - https://yare.io/ai-arena
  • baron816
    Looks like I’m generally unintelligent
  • andai
    In the year 2032: ARC-AGI-13: Almost definitely AGI this time!
  • convexly
    My issue with AGI benchmarks is you can never tell if you're measuring actual capability or just how much the training data overlapped with the test.
  • spprashant
    I played the demo, but it definitely took me a minute to grok the rules. I don't know if this is how we want to measure AGI.
    In general, I believe we should probably stop this pursuit of human-equivalent intelligence that encourages people to think of these models as human replacements. LLMs are clearly good at a lot of things; let's focus on how we can augment and empower the existing workforce.
  • abraxas
    Even if tomorrow's models get good enough to complete these games, we won't be able to proclaim AGI. In the realm of silly computer games alone, I'm going on record saying that there are plenty of 8-bit games that AIs will trip on even when this benchmark is crushed. 2D platformers like Manic Miner or Mario need skills that none of these games appear to capture.
  • EternalFury
    The real question is: Can it be generated using programs? If it can be, then LLMs will eventually monkey type these programs.
  • jesse_dot_id
    At this point, I'm pretty sure we'll just know when it happens.
  • WarmWash
    Captchas are about to get wild. Maybe the internet will briefly go back to a place mainly populated with outliers.
  • semiinfinitely
    i feel bad that we make the LLMs play this
  • OsrsNeedsf2P
    Some of these tasks are crazy. Even I can't beat them: https://arcprize.org/tasks/ar25
  • chaise
    The official ARC-AGI-3 leaderboard for current LLMs: https://arcprize.org/leaderboard (you should select the 3rd leaderboard). CRAZY 0.1% on average lmao
  • Geee
    Would be fun to play but the controls are janky.
  • k2xl
    I submitted the puzzle game Pathology (https://thinky.gg) for ARC Prize 3. Sad that I didn't hear back from the committee.
    It's a simple game with simple rules, yet at a certain level automated solvers have an incredibly difficult time compared to humans. Solutions are easy to validate but hard to find.
  • jmkni
    ok clearly I'm a robot because I can't figure out wtf to do
  • 6thbit
    It's not clear to me what the difference from v2 is?
  • dinkblam
    what is the evidence that being able to play games equates to AGI?
  • CamperBob2
    Without reading the .pdf, I tried the first game it gave me, at https://arcprize.org/tasks/ls20, and I couldn't begin to guess what I was supposed to do. Not sure what this benchmark is supposed to prove.
    Edit: Having messed around with it now (and read the .pdf), it seems like they've left behind their original principle of making tests that are easy for humans and hard for machines. I'm still not convinced that a model that's good at these sorts of puzzles is necessarily better at reasoning in the real world, but I am open to being convinced otherwise.
  • nubg
    Any benchmarks?
  • saberience
    So this is another ARC-"AGI" benchmark designed around visual puzzles for LLMs that are trained to be great at text. What is the point? Yes, we get that LLMs are really bad when you give them contrived visual puzzles or pseudo-games to solve. Well, great; we already knew this.
    The "hype" around the ARC-AGI benchmarks makes me laugh, especially the idea that we would have AGI once ARC-AGI-1 was solved... then we got 2, and now we're on 3. Shall we start saying that these benchmarks have nothing to do with AGI yet? Are we going to get an ARC-AGI-10 where LLMs try to beat Myst or Riven? Will we have AGI then?
    This isn't the right tool for measuring "AGI", and honestly I'm not sure what it's measuring, except the foundation labs benchmaxxing on it.
  • tasuki
    So ARC-AGI was released in 2019. That's been solved; then there was ARC-AGI-2, and now there's ARC-AGI-3. What is even the point? Will ARC-AGI-26 hit the front page of Hacker News in 2057?