GPT-5.5 Codex reasoning-token clustering may be leading to degraded performance

maille

Comments (128)

nsingh2
Oh this seems bad, and is fairly easy to reproduce using codex cli. You give it a puzzle prompt that it has to reason about and solve; occasionally it will seemingly short-circuit, think for exactly 516 tokens, and return the wrong result. When it ends up using 6000-8000 thinking tokens it returns the correct result.

Maybe some issue with adaptive thinking? Another point for local models I guess, don't have to worry about silent server-side changes.

Edit: To follow up, it seems to happen quite often. Out of 10 runs of the exact same prompt, 4/10 had this 516 thinking token issue, and every one of these had the wrong solution. So nearly half the time, 5.5 xhigh could be short-circuiting and degrading performance. Granted, the sample size is small.
josephernest
You can use this small Python script to display a histogram of `reasoning_output_tokens` in your past Codex sessions. I do see a spike at 516 indeed.

    import os, glob, re
    import matplotlib.pyplot as plt

    # Collect every reasoning_output_tokens value found under ~\.codex (Windows-style path).
    vals = []
    for f in glob.glob(os.path.expanduser(r"~\.codex") + r"\**\*", recursive=True):
        if os.path.isfile(f):
            try:
                s = open(f, "r", encoding="utf-8", errors="ignore").read()
                vals += [int(x) for x in re.findall(r'"reasoning_output_tokens"\s*:\s*(\d+)', s)]
            except Exception:
                pass

    # Weights scale the bars so the y-axis reads as a percentage of all values.
    plt.hist(vals, bins=200, range=(0, 5000), weights=[100 / len(vals)] * len(vals))
    plt.xlabel("reasoning_output_tokens")
    plt.ylabel("%")
    plt.show()
ComputerGuru
Already reported (not as thoroughly but still in quite some detail) two weeks ago and silently “closed as not planned” (keep in mind that the specific reason might be an artifact of GitHub workflow/UX and not actually the intended reason) without an acknowledgement or a response.

https://github.com/openai/codex/issues/29353

What even is the point of a public-facing bug tracker “for devs, by devs” when this is how reports get treated? Might as well use Apple’s Feedback Reporter that routes to /dev/null instead.

Anyway, I find it near impossible to see how this wasn’t already caught and flagged internally – it’s not a subtle pattern. Certainly they are at the very least collecting and graphing “reasoning tokens vs model vs effort”, and such an obvious spike at (multiple) single stops (not even distributed over a narrow range) should have been an immediate statistical red flag… which leads me to believe (combined with the fact the previously reported issue was closed without comment) that they’re at least internally aware of this behavior even if it’s not necessarily an intentional side effect of some internal forcing metric.
zenapollo
I’ve definitely experienced step jumps down in quality on an almost daily basis. I usually used xhigh. The experience of relying on Codex’s outstandingly thorough coding earlier in the year has evaporated for me. I’m seeing incredibly stupid implementations intermittently, and have simply switched to Claude until OpenAI takes the issue seriously. As far as I can tell, they haven’t taken it seriously for the several months I’ve been personally seeing it.
rq1
I was wondering WTF was happening.

This was the past month, at 516 + 518*n:

     516  n=0  count=4454
    1034  n=1  count=318
    1552  n=2  count=129
    2070  n=3  count=56
    2588  n=4  count=35
    3106  n=5  count=14
    3624  n=6  count=6
    4142  n=7  count=4
    4660  n=8  count=6
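A minimal sketch of how counts like these could be tallied, assuming you already have a list of `reasoning_output_tokens` values (for example from the histogram script earlier in the thread). The helper names `lattice_index` and `cluster_counts` are illustrative, and the sample list is made up.

```python
from collections import Counter

BASE, STEP = 516, 518  # values reported in the comment above

def lattice_index(tokens: int, base: int = BASE, step: int = STEP):
    """Return n if tokens == base + step*n for some n >= 0, else None."""
    if tokens < base:
        return None
    n, rem = divmod(tokens - base, step)
    return n if rem == 0 else None

def cluster_counts(values):
    """Count how many values land exactly on each point 516 + 518*n."""
    return Counter(n for n in map(lattice_index, values) if n is not None)

# Made-up sample for illustration only (not real session data).
sample = [516, 516, 1034, 1552, 700, 6123, 516, 2070]
counts = cluster_counts(sample)
for n in sorted(counts):
    print(f"{BASE + STEP * n:>5}  n={n}  count={counts[n]}")
```

Comparing these exact-match counts against the counts at neighbouring values (say 515 or 517) is one way to tell a genuine spike from an ordinary dense region of the distribution.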
resonious
Deja Vu... This looks just like the Claude Code performance regression back in April. I just quit my Claude subscription when that happened and went to Codex.

Now I'm kinda thinking of trying per token for both, using GLM 5.2 on Fireworks for most tasks, shelling out to the big boys only when needed. Not totally confident I'll break even though.
edg5000
For me, the encrypted reasoning contents show this effect when looking at the base64 string length. However, the server-reported reasoning tokens don't. So I assumed it was purely part of the encryption and/or obfuscation, and I don't think there is a real issue.

This is the biggest downside of GPT: thinking is encrypted, so it's more of a black box than kimi/glm/deepseek. You still get thinking summaries though. It's awkward, but workable.
laurels-marts
I love that Codex is open source and issues like these can surface/be addressed publicly.
tyingq
> reasoning-token clustering at 516/1034/1552

Interesting. So 516 probably means an initial 512 byte buffer and a 4 byte header. Then 516 + 518 = 1034... so another 512 + 4 byte header + 2 bytes for a linked list ref or similar, 1034 + 518 = 1552, etc.
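The arithmetic in this guess can be checked against the values reported elsewhere in the thread. The sketch below only verifies that the numbers add up; the buffer, header, and link sizes are the commenter's speculation, and the constant and function names are illustrative.

```python
BUFFER, HEADER, LINK = 512, 4, 2  # sizes guessed in the comment above, not confirmed

def predicted_size(chunks: int) -> int:
    """Total if each chunk costs BUFFER + HEADER and each chunk after the first adds LINK."""
    return chunks * (BUFFER + HEADER) + (chunks - 1) * LINK

# Cluster values reported in the thread (516 + 518*n for n = 0..8).
observed = [516, 1034, 1552, 2070, 2588, 3106, 3624, 4142, 4660]
predicted = [predicted_size(k) for k in range(1, len(observed) + 1)]
print(predicted == observed)
```

A match here shows the decomposition is numerically consistent with the reported values, not that this is the actual mechanism; other decompositions of 516 and 518 would fit equally well.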
siva7
I swear some days ago someone here claimed OpenAI succeeded in cutting their compute cost by half with a breakthrough optimization. So this is it?
ACCount37
A rare case of "they made the model dumber" where they actually made the model dumber, instead of the usual user psychosis?
ghosty141
Maybe it's just bad memory, but I feel like 5.3 was the best version in terms of token usage and code quality. 5.5 works better but it just eviscerates tokens.
kleton
Clearly they are batching reasoning inference in a few multiples of 512 tokens as a throughput optimization
AmazingTurtle
It's funny, they sell you a subscription for frontier models, then over time begin to nerf them rapidly and no one talks about it. Should give me a discount when they reduce reasoning effort silently on the server side!

But on the other hand, I've been using 5.5-high on a daily basis in multithreading workflows, i.e. in parallel. I'm barely exhausting my weekly limits. I can't even Human-as-a-Service fast enough to catch up and read all the plans and implementations it does. So there is that.
chazeon
Even without stats I know it went bad. In the past two months it has barely been able to do any good scientific writing, which of course relies on reasoning. It's just writing, for God's sake. And it shows how far we are from AGI.
zuzululu
This explains so much about why gpt 5.5 has been so bad lately. It was really puzzling why it struggled so much, when it was one-shotting stuff and totally amazing when it first came out. I tried the prompt that will tell you if your plan is degraded:

    codex exec --json --skip-git-repo-check --ephemeral -s read-only --disable memories -m gpt-5.5 -c model_reasoning_effort=high "Do not use external tools. A black bag contains candies with counts: round apple 7, round peach 9, round watermelon 8; star apple 7, star peach 6, star watermelon 4. Shape is distinguishable by touch before drawing; flavor is not. What is the minimum number of candies to draw to guarantee having apple and peach candies of different shapes, i.e. round apple + star peach or round peach + star apple? Give reasoning and final number. The local project dir is irrelevant for this task, do not consult it. "

1. 516, 24
2. 516, 27
3. 516, 12
4. 516, 21
5. 516, 21

This means that the whole time we've been paying for a product that was silently routing to something completely different and inferior to gpt 5.5.

Also I read through the github issues and it seems like they closed a previous issue without addressing it ???!!

Whooo boy, somebody from OpenAI is getting fired over this; if not, a class action lawsuit is almost guaranteed at this point.
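A rough sketch of how repeated runs like these could be tallied automatically. The command flags are copied verbatim from the comment above and are not verified here. It is an assumption that the `--json` output contains a `reasoning_output_tokens` field in the same form the histogram script earlier in the thread matches in session files, and that the last reported value is the one that matters. The helper names are illustrative.

```python
import re
import subprocess
from collections import Counter

# Same pattern as the histogram script earlier in the thread.
TOKEN_RE = re.compile(r'"reasoning_output_tokens"\s*:\s*(\d+)')

def extract_reasoning_tokens(text: str) -> list[int]:
    """Pull every reasoning_output_tokens value out of raw output text."""
    return [int(m) for m in TOKEN_RE.findall(text)]

def run_once(prompt: str) -> list[int]:
    """Run the command from the comment once (requires the codex CLI to be installed)."""
    cmd = ["codex", "exec", "--json", "--skip-git-repo-check", "--ephemeral",
           "-s", "read-only", "--disable", "memories", "-m", "gpt-5.5",
           "-c", "model_reasoning_effort=high", prompt]
    out = subprocess.run(cmd, capture_output=True, text=True, check=False).stdout
    return extract_reasoning_tokens(out)

def tally(runs: list[list[int]], suspect: int = 516) -> Counter:
    """Classify each run by its last reported value (an assumption about the output)."""
    result = Counter()
    for values in runs:
        if not values:
            result["missing"] += 1   # field not found in the output
        elif values[-1] == suspect:
            result["short"] += 1     # stopped at the suspect value
        else:
            result["other"] += 1
    return result

# Example usage (not executed here):
#   runs = [run_once(PROMPT) for _ in range(10)]
#   print(tally(runs))
```

Keeping the parsing separate from the subprocess call means the tally logic can be checked on saved output without spending tokens on new runs.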
preetham_rangu
I swear all these ai companies are trying to rob us for more price
joohwan
I'm seeing this issue with 5.4 also.
wahnfrieden
Reset!
maille
tldr:
The GPT-5.5 Codex model exhibits a clustering phenomenon in which reasoning_output_tokens cluster at fixed values spaced 518 apart.
These stuck responses at fixed thresholds are strongly correlated with errors in complex tasks.
The observed phenomenon is specific to GPT-5.5; it is much less prevalent in GPT-5.4 and almost absent in GPT-5.2 and 5.3.
kordlessagain
Sounds like a problem with promoting the drafter.
vitorgrs
I've been using it for a month, since they gave it to me for free, and I found GPT-5 on Codex quite weird/awful. Even x-high. Then I figured out I should try OMP (Pi), and the experience was much better.

I remember GPT 5.2 Codex being fine...
linzhangrun
The good experience I had with GPT-5.5 before made me upgrade to Pro this month. Now I want a refund.
maxignol
This seems really bad…
jiggawatts
Does this affect the Codex app too, or just the Codex CLI tool?
ProofHouse
Personally, I would say very likely, to be honest. I gotta go through this a little more, but I actually use 5.5 codex an obscene amount, and I almost never use it for reasoning anymore. It's not even in the same galaxy as far as actually taking out the thinking and using GPT-5.5 or even Claude and then coming back and giving it the reasoning. Blah blah blah, it's the same model. Well, let me tell you, no, it's not, for several reasons, and the delta on intelligence is pretty staggering.