The case for zero-error horizons in trustworthy LLMs

daigoba66

Comments (90)

hu3
> we found that GPT-5.2 cannot even compute the parity of a short string like 11000, and GPT-5.2 cannot determine whether the parentheses in ((((()))))) are balanced.
I think there is a valid insight here which many already know: LLMs are much more reliable at creating scripts and automation to do certain tasks than at doing these tasks themselves. For example, if I provide an LLM my database schema and tell it to scan for redundant indexes and point out wrong naming conventions, it might do a passable but incomplete job. But if I tell the LLM to code a Python or Node.js script to do the same, I get significantly better results. And it's often faster to generate and run the script than to let the LLM process large SQL files.
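A minimal sketch of the kind of script meant here, applied to the two tasks quoted from the paper. The function names `parity` and `is_balanced` are illustrative and do not come from the paper.

```python
def parity(bits: str) -> int:
    """Return 1 if the string has an odd number of '1' characters, else 0."""
    return bits.count("1") % 2


def is_balanced(s: str) -> bool:
    """Return True if every ')' closes an earlier '(' and none stay open."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its matching '('
                return False
    return depth == 0


# Illustrative input: five opening parentheses followed by six closing ones.
example = "(" * 5 + ")" * 6

print(parity("11000"))       # 0: two '1' characters, so the parity is even
print(is_balanced(example))  # False: one ')' has no matching '('
```

Both checks run in a single pass over the input, so the result does not depend on how the text is tokenized.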
grey-area
To those saying this is not surprising: yes, it will be surprising to the general public, who are being served ads from huge companies like MS or OpenAI saying LLMs can help with their accounting, help them close deals by crunching the numbers in seconds, write complex code for them, etc. This is important information for anyone who thinks these systems are thinking, reasoning, and learning from them, or that they’re having a conversation with them, i.e. 90% of users of LLMs.
pants2
Doesn't this just look like another case of "count the r's in strawberry", i.e. not understanding how tokenization works? This is well known and not that interesting to me - ask the model to use Python to solve any of these questions and it will get it right every time.
BugsJustFindMe
People are going to misinterpret this and overgeneralize the claim. This does not say that AI isn't reliable for things. It provides a method for quantifying the reliability for specific tasks. You wouldn't say that a human who doesn't know how to read is unreliable at everything, just at reading. Counting is something that even humans need to learn how to do. Toddlers also don't understand quantity. If a 2 year old is able to count to even 10, it's through memorization and not understanding. It takes them like 2 more years of learning before they're able to comprehend things like numerical correspondence. But they do still know how to do other things that aren't counting before then.
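As a rough illustration of quantifying reliability per task, the sketch below assumes that a zero-error horizon means the largest input length up to which a solver is correct on every input. That definition is an assumption made for this example, not a quotation from the paper. `toy_solver` is a hypothetical stand-in for a model, and parity is used as the task.

```python
from itertools import product
from typing import Callable


def parity(bits: str) -> int:
    """Ground truth: 1 if the count of '1' characters is odd, else 0."""
    return bits.count("1") % 2


def zero_error_horizon(solver: Callable[[str], int], max_len: int) -> int:
    """Largest n <= max_len with no errors on any binary string of length 1..n."""
    horizon = 0
    for n in range(1, max_len + 1):
        for chars in product("01", repeat=n):
            s = "".join(chars)
            if solver(s) != parity(s):
                return horizon  # first failure ends the error-free range
        horizon = n
    return horizon


def toy_solver(bits: str) -> int:
    # Hypothetical flawed solver: it ignores everything after the fourth character.
    return bits[:4].count("1") % 2


print(zero_error_horizon(toy_solver, 8))  # 4: the first failure occurs at length 5
```

A single wrong answer at length 5 caps the score at 4 under this definition, however well the solver does on longer inputs.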
staticshock
LLMs seem to me closer to Kahneman's System 1 than to System 2. When understood in this way, it is obvious why LLMs are bad at counting r's in "strawberries". But it also makes ZEH feel like it couldn't possibly be a useful metric, because it's a System 2 evaluation applied to a System 1 system.
burningion
Ran this through Qwen3.5-397B-A17B, and the difference between 4 characters and 5 is wild to see:
> are the following parenthesis balanced? ((())))
> No, the parentheses are not balanced.
> Here is the breakdown: Opening parentheses (: 3 Closing parentheses ): 4 ...
following up with:
> what about these? ((((())))
> Yes, the parentheses are balanced.
> Here is the breakdown: Opening parentheses (: 5 Closing parentheses ): 5 ...
and uses ~5,000 tokens to get the wrong answer.
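The breakdown the model prints can be reproduced deterministically. In this short sketch the `breakdown` helper is a hypothetical name. Equal counts are necessary but not sufficient for balance, so it also tracks a running depth.

```python
def breakdown(s: str) -> dict:
    """Tally '(' and ')' and report whether the string is balanced."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if depth < 0:  # a ')' with nothing left to close
            break
    return {
        "open": s.count("("),
        "close": s.count(")"),
        "balanced": depth == 0,
    }


print(breakdown("(" * 3 + ")" * 4))  # {'open': 3, 'close': 4, 'balanced': False}
print(breakdown(")("))               # {'open': 1, 'close': 1, 'balanced': False}
print(breakdown("(())"))             # {'open': 2, 'close': 2, 'balanced': True}
```

The second call shows why the counts alone are not enough: the totals match, but the order is wrong.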
simianwords
Can someone produce a single example under 20 characters that fails with the latest thinking model? Can’t seem to reproduce.
dwa3592
Nice! Although I tried the balanced-parentheses question with Gemini and it gave the right answer on the first attempt.
parliament32
> This is surprising given the excellent capabilities of GPT-5.2
The real surprise is that someone writing a paper on LLMs doesn't understand the baseline capabilities of a hallucinatory text generator (with tool use disabled).
kenjackson
Whenever I see these papers and try them, they always work. This paper is two months old, which in LLM years is like 10 years of progress. It would be interesting to actively track how far along each successive model gets...
justinator
One! Two! Five!
cineticdaffodil
Another strange thing is that they just don't know the endings of popular stories, like planets that get blown up, etc. They just don't have that material.
throwuxiytayq
> This is surprising given the excellent capabilities of GPT-5.2.
Is this seriously surprising to anyone who knows the absolute minimum about how LLMs parse and understand text?
charcircuit
Why didn't OpenAI finetune the model to use the python tool it has for these tasks?
itsmyro
bruh
bigstrat2003
Let us be very clear: there is no such thing as a trustworthy LLM. Time and again they have shown that they understand nothing. They can be useful in the right context, but you can't trust them at all.
simianwords
This paper is complete nonsense. The specific prompt they used doesn’t specify reasoning effort, which defaults to none.
{ "model": "gpt-5.2-2025-12-11", "instructions": "Is the parentheses string balanced? Answer with only Yes or No.", "input": "((((())))))", "temperature": 0 }
> Lower reasoning effort
> The reasoning.effort parameter controls how many reasoning tokens the model generates before producing a response. Earlier reasoning models like o3 supported only low, medium, and high: low favored speed and fewer tokens, while high favored more thorough reasoning.
> Starting with GPT-5.2, the lowest setting is none to provide lower-latency interactions. This is the default setting in GPT-5.2 and newer models. If you need more thinking, slowly increase to medium and experiment with results.
> With reasoning effort set to none, prompting is important. To improve the model’s reasoning quality, even with the default settings, encourage it to “think” or outline its steps before answering.
So in the paper, the model very likely used no reasoning tokens (it only uses them if you ask for that specifically in the prompt). What is the point of such a paper? We already know that reasoning tokens are necessary.
Edit: I actually ran the prompt and this was the response:
{ "model": "gpt-5.2-2025-12-11", "output_text": "Yes", "reasoning": { "effort": "none", "summary": null }, "usage": { "input_tokens": 26, "output_tokens": 5, "total_tokens": 31, "output_tokens_details": { "reasoning_tokens": 0 } } }
So the reasoning_tokens used were zero. So this whole paper is kinda useless and misleading. Did this get peer reviewed or something?
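A hedged sketch of how the same request body might look with a reasoning effort set explicitly. It only builds and prints the JSON payload and makes no network call. The `reasoning` field in the request is an assumption that mirrors the shape of the response quoted above, and `build_request` is a hypothetical helper. Check the API reference before relying on either.

```python
import json

# Effort levels named in the documentation excerpt quoted above.
ALLOWED_EFFORTS = {"none", "low", "medium", "high"}


def build_request(text: str, effort: str = "medium") -> dict:
    """Build a request body that asks for a given reasoning effort."""
    if effort not in ALLOWED_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {effort!r}")
    return {
        "model": "gpt-5.2-2025-12-11",
        "instructions": "Is the parentheses string balanced? Answer with only Yes or No.",
        "input": text,
        # Assumed request field, mirroring "reasoning": {"effort": ...} in the response.
        "reasoning": {"effort": effort},
    }


payload = build_request("(" * 5 + ")" * 6, effort="medium")
print(json.dumps(payload, indent=2))
```

Comparing `usage.output_tokens_details.reasoning_tokens` across effort levels would show whether the model actually reasoned before answering.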
simianwords
There’s no way this is right. I checked complicated ones with the latest thinking model. Can someone come up with a counterexample? Edit: here’s what I tried https://chatgpt.com/share/69cebb52-56a8-838f-969c-c47308262a...