
Comments (105)

  • lukeinator42
    The internal dialogue breakdowns from Claude 3.5 Sonnet when the robot battery was dying are wild (pages 11-13): https://arxiv.org/pdf/2510.21860
  • ummonk
    I wonder whether that LLM has actually lost its mind, so to speak, or was just attempting to emulate humans who lose their minds? Or to put it another way: if the writings of humans who have lost their minds (and dialogue of characters who have lost their minds) were entirely missing from the LLM's training set, would the LLM still output text like this?
  • koeng
    95% for humans. Who failed to get the butter?
  • ghostly_s
    Putting aside success at the task, can someone explain why this emerging class of autonomous helper-bots is so damn slow? I remember Google unveiled their experiments in this recently, and even the sped-up demo reels were excruciating to sit through. We generally think of computers as able to think much faster than us, even if they are making wrong decisions quickly, so what's the source of latency in these systems?
  • Reason077
    The most surprising thing is that 5% of humans apparently failed this task! Where are they finding these test subjects?!
  • fentonc
    I built a whimsical LLM-driven robot to provide running commentary for my yard: https://www.chrisfenton.com/meet-grasso-the-yard-robot/
  • ge96
    Funny, I was looking at the chart like "what model is Human?"
  • Animats
    Using an LLM for robot actuator control seems like pounding a screw. Wrong tool for the job. Someday, and given the billions being thrown at the problem, not too far out, someone will figure out what the right tool is.
  • amelius
    > The results confirm our findings from our previous paper Blueprint-Bench: LLMs lack spatial intelligence.
    But I suppose that if you can train an LLM to play chess, you can also train it to have spatial awareness.
  • WilsonSquared
    Guess it has no purpose then
  • Finnucane
    I have a cat that will never fail to find the butter. Will it bring you the butter? Ha ha, of course not.
  • zzzeek
    Will no one claim the Rick and Morty reference? I've seen that show like, once, and somehow I know this?
  • DubiousPusher
    I guess I'm very confused as to why just throwing an LLM at a problem like this is interesting. I can see how the LLM is great at decomposing user requests into commands. I had great success with this on a personal assistant project I helped prototype. The LLM did a great job of understanding user intent and even extracting parameters regarding the requested task. But it seems pretty obvious to me that after decomposition and parameterization, coordination of a complex task would much better be handled by a classical AI algorithm like a planner (see the sketch after this comment). After all, even humans don't put into words every individual action that makes up a complex task. We do this more while first learning a task, but if we had to do it for everything, we'd go insane.
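A minimal Python sketch of the split described above: the LLM call is stubbed out by a hypothetical `llm_decompose` function, and coordination is handed to a toy STRIPS-style forward-search planner. The operators, predicates, and state encoding are all illustrative assumptions, not anything from the paper.

```python
# Sketch: LLM decomposes the request into a symbolic goal; a classical
# planner (breadth-first search over STRIPS-style operators) coordinates.
from collections import deque

def llm_decompose(request: str) -> frozenset:
    # Stand-in for an LLM call that maps free text to a symbolic goal.
    # A real system would prompt a model and parse structured output.
    if "butter" in request.lower():
        return frozenset({("at", "butter", "table")})
    raise ValueError("unrecognized request")

# Each operator: (name, preconditions, add-list, delete-list).
OPERATORS = [
    ("goto_fridge", frozenset(), {("robot_at", "fridge")}, {("robot_at", "table")}),
    ("goto_table", frozenset(), {("robot_at", "table")}, {("robot_at", "fridge")}),
    ("pick_butter",
     frozenset({("robot_at", "fridge"), ("at", "butter", "fridge")}),
     {("holding", "butter")}, {("at", "butter", "fridge")}),
    ("place_butter",
     frozenset({("robot_at", "table"), ("holding", "butter")}),
     {("at", "butter", "table")}, {("holding", "butter")}),
]

def plan(state: frozenset, goal: frozenset) -> list:
    # Breadth-first search over world states; fine at toy scale. A real
    # system would use a proper planner (e.g. PDDL + Fast Downward).
    queue, seen = deque([(state, [])]), {state}
    while queue:
        current, steps = queue.popleft()
        if goal <= current:
            return steps
        for name, pre, add, delete in OPERATORS:
            if pre <= current:
                nxt = frozenset((current - frozenset(delete)) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    raise RuntimeError("no plan found")

start = frozenset({("robot_at", "table"), ("at", "butter", "fridge")})
print(plan(start, llm_decompose("Could you pass the butter?")))
# -> ['goto_fridge', 'pick_butter', 'goto_table', 'place_butter']
```

The point of the split: the LLM only has to get the goal right once, and the planner's action sequence is then correct by construction with respect to the operator model.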
  • bhewes
    Someone actually paid for this?
  • yieldcrv
    95% pass rate for humans. Waiting for the huggingface LoRA.
  • sam_goody
    The error messages were truly epic; got quite a chuckle. But boy am I glad that this is just in the play stage. If someone was in a self-driving car that had 19% battery left and it started making comments like those, they would definitely not be amused.
  • hidelooktropic
    How can I get early access to this "Human" model on the benchmarks? /s
  • fsckboy
    > Our LLM-controlled office robot can't pass butter
    Was the script of Last Tango in Paris part of the training data? Maybe it's just scared...
  • throwawayffffas
    It feels misguided to me. I think the real value of LLMs for robotics is in human-language parsing: turning "pass the butter" into a list of tasks the rest of the system is trained to perform (locate an object, pick up an object, locate a target area, drop off the object). A sketch of that parsing step follows below.
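A rough sketch of that parsing step, with a canned stand-in for the model's JSON output (a real system would call a model with a schema-constrained prompt or a structured-output API). The skill names are hypothetical; the key idea is that everything the model emits is validated against a closed set of primitives before the robot acts on it.

```python
# Sketch: the LLM only translates free text into a closed set of primitive
# skills; everything it emits is validated before execution.
import json

PRIMITIVES = {"locate", "pick_up", "move_to", "drop_off"}

def parse_request(request: str) -> list[dict]:
    # Hardcoded stand-in for the JSON an LLM would return for `request`;
    # a real system would send the request to a model here.
    raw = json.dumps([
        {"skill": "locate", "arg": "butter"},
        {"skill": "pick_up", "arg": "butter"},
        {"skill": "locate", "arg": "table"},
        {"skill": "drop_off", "arg": "table"},
    ])
    tasks = json.loads(raw)
    for task in tasks:
        if task["skill"] not in PRIMITIVES:
            raise ValueError(f"model emitted unknown skill: {task['skill']}")
    return tasks

for step in parse_request("pass the butter"):
    print(step["skill"], step["arg"])
```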