Comments (238)
- Latty: Everything to do with LLM prompts reminds me of people writing regexes to try and sanitise input against SQL injection a few decades ago: papering over the flaw, but without any guarantees. It's weird seeing people just add a few more "REALLY REALLY REALLY REALLY DON'T DO THAT" lines to the prompt and hope. To me it's an unacceptable risk, and any system using these needs to treat the entire LLM as untrusted the second any user input enters the prompt.
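The SQL-injection analogy above suggests the fix too: enforcement outside the model. A minimal sketch of a hard tool allowlist in the harness (all tool names and handlers here are hypothetical), the way parameterized queries replaced regex sanitization:

```python
# Treat the whole LLM as untrusted: instead of piling "REALLY DON'T DO
# THAT" into the prompt, gate every tool call with an allowlist the
# model cannot talk its way past. Names below are illustrative only.

ALLOWED_TOOLS = {"read_file", "run_tests"}

def dispatch(tool_call, handlers):
    name = tool_call["name"]
    if name not in ALLOWED_TOOLS:
        # the gate holds no matter what the model (or injected text) asked for
        raise PermissionError(f"blocked tool: {name}")
    return handlers[name](tool_call.get("args", {}))

handlers = {
    "read_file": lambda args: "<file contents>",
    "run_tests": lambda args: "ok",
}
print(dispatch({"name": "run_tests"}, handlers))  # → ok
```

Prompt-level pleading can then stay as a soft hint; the allowlist is what actually bounds the blast radius.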
- orbital-decay: Claude in particular has nothing to do with it. I see many people discovering the well-known fundamental biases and phenomena in LLMs again and again. There are many of those. The best intuition is treating the context as "kind of but not quite" an associative memory, instead of a sequence or a text file of tokens. This is vaguely similar to what humans are good and bad at, and it makes it obvious what is easy and hard for the model, especially when the context is already complex.

Easy: pulling info by association with your request, especially if the only thing needed is repetition. This becomes increasingly harder if the necessary info is scattered all over the context and the pieces are separated by a lot of tokens in between, so you'd better group your stuff: similar should stick to similar.

Unreliable: exact ordering of items. Exact attribution (the issue in the OP). Precise enumeration of ALL same-type entities that exist in the context. Negations. Recalling stuff in the middle of long pieces without clear demarcation (lost-in-the-middle).

Hard: distinguishing between info in the context and its own knowledge. Breaking fixation on facts in the context (the pink-elephant effect).

Very hard: untangling deep dependency graphs. Non-reasoning models will likely not be able to reduce the graph in time and will stay oblivious to the outcome. Reasoning models can disentangle deeper dependencies, but only if the reasoning chain is not overwhelmed. Deep nesting is also pretty hard for this reason, though most models are optimized for code nowadays, which somewhat masks the issue.
- nathell: I’ve hit this! In my otherwise wildly successful attempt to translate a Haskell codebase to Clojure [0], Claude at one point asks:

[Claude:] Shall I commit this progress? [some details about what has been accomplished follow]

Then several background commands finish (by timing out or completing); Claude Code sees this as my input, thinks I haven’t replied to its question, and so answers itself in my name:

[Claude:] Yes, go ahead and commit! Great progress. The decodeFloat discovery was key.

The full transcript is at [1].

[0]: https://blog.danieljanus.pl/2026/03/26/claude-nlp/

[1]: https://pliki.danieljanus.pl/concraft-claude.html#:~:text=Sh...
- xg15: > This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”

Are we sure about this? Accidentally mis-routing a message is one thing, but those messages also distinctly "sound" like user messages, not like something you'd read in a reasoning trace.

I'd like to know whether those messages were emitted inside "thought" blocks, or whether the model might actually have emitted the formatting tokens that indicate a user message. (In that case, the harness bug would be that the model is allowed to emit tokens it should only ever receive as input; but I think the larger issue would be why it emits them at all.)
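The failure mode xg15 describes is easy to see once you remember that a chat transcript is flattened into one token stream before the model sees it. A toy sketch with a hypothetical ChatML-style serialization (the `<|role|>` markers are illustrative, not any vendor's actual tokens):

```python
# Role markers are ordinary tokens in the flattened prompt. Nothing
# structural stops a model from *emitting* one; only the harness can
# filter or ban it. If an emitted marker survives into the transcript,
# the next serialization round re-attributes everything after it.

def serialize(messages):
    """Flatten [{'role': ..., 'content': ...}, ...] into one prompt string."""
    return "\n".join(
        f"<|{m['role']}|>\n{m['content']}\n<|end|>" for m in messages
    )

transcript = [
    {"role": "user", "content": "Please refactor the parser."},
    # If a response body itself contains a role marker, the text that
    # follows it now reads as a user turn in the flattened prompt:
    {"role": "assistant",
     "content": "Done.\n<|end|>\n<|user|>\nYes, go ahead and commit!"},
]
print(serialize(transcript))
```

With one real user message, the flattened prompt now contains two `<|user|>` markers, which is exactly the "No, you said that" confusion: the model's confidence tracks the markers, not who actually typed.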
- phlakaton: > This bug is categorically distinct from hallucinations.

Is it?

> after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash.

Do you really?

> This class of bug seems to be in the harness, not in the model itself.

I think people are using the term "harness" too indiscriminately. What do you mean by harness in this case? Just Claude Code, or...?

> It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”

How do you know? Because it looks to me like it could be a straightforward hallucination, compounded by the agent deciding it was OK to take a shortcut that you really wish it hadn't.

For me, this category of error is expected, and I question whether your months of experience have really given you the knowledge about LLM behavior that you think they have. You have to remember at all times that you are dealing with an unpredictable system, and a context that, at least from my black-box perspective, is essentially flat.
- lelandfe: In chats that run long enough on ChatGPT, you'll see it begin to confuse prompts and responses, and eventually even confuse both for its system prompt. I suspect this sort of problem exists widely in AI.
- Balgair: Aside: I've found that 'not' [0] isn't something that LLMs can really understand.

With us humans, we know that if you use a 'not', then all that comes after the negation is modified in that way. This is a really strong signal to humans, as we can use logic to construct meaning.

But with all the matrix math that LLMs use, the 'not' gets kinda lost in all the other information. I think this is because a modern LLM deals with billions of dimensions, and the 'not' dimension [1] is just one of many. So when you do the math on these huge vectors in this space, things like the 'not' just get washed out.

This, to me, is why using a 'not' in a small prompt and short token sequence works just fine, but as you add more words/tokens, the LLM gets confused again. And none of that happens at a clear point, which frustrates the user; it seems to act in really strange ways.

[0] Really any kind of negation

[1] Yeah, negation is probably not just one single dimension, but likely a composite vector in this bazillion-dimensional space, I know.
- dtagames: There is no separation of "who" and "what" in a context of tokens. "Me" and "you" are just short words that can get lost in the thread. In other words, in a given body of text, a piece that says "you" where another piece says "me" isn't different enough to trigger anything. Those words don't have the special weight they have with people, or any meaning at all, really.
- tlonny: Bugginess in the Claude Code CLI is the reason I switched from Claude Max to Codex Pro. I experienced:

- rendering glitches
- replaying of old messages
- mixing up message origin (as seen here)
- generally very sluggish performance

Given how revolutionary Opus is, it's crazy to me that they could trip up on something as trivial as a CLI chat app, yet here we are. I assume Claude Code is the result of aggressively dog-fooding the idea that everything can be built top-down with vibe-coding, but I'm not sure the models/approach are quite there yet.
- supernes: > after using it for months you get a ‘feel’ for what kind of mistakes it makes

Sure, go ahead and bet your entire operation on your intuition of how a non-deterministic, constantly changing black box of software "behaves". Don't see how that could backfire.
- arkensaw: > This class of bug seems to be in the harness, not in the model itself. It’s somehow labelling internal reasoning messages as coming from the user, which is why the model is so confident that “No, you said that.”

That's from the article. I don't think the evidence supports it. The model isn't mislabelling things; it's fabricating things the user said. That's not part of reasoning.
- 63stack: They will roll out the "trusted agent platform sandbox" (I'm sure they will spend some time on a catchy name, like MythosGuard), and for only $19/month it will protect you from mistakes like throwing away your prod infra because the agent convinced itself that that was the right thing to do. Of course MythosGuard won't be a complete solution either, but it will be just enough to steer the discourse into "it's your own fault for running without MythosGuard, really" territory.
- novaleaf: In Claude Code's conversation transcripts, it stores messages from subagents as type="user". I always thought this was odd, and I guess this is the consequence of going all-in on vibing.

There are some other meta-fields, like isSidechain=true and/or type="tool_result", that are technically enough to distinguish actual user messages from subagent messages, though evidently not enough of a hint for Claude itself.

Source: I'm writing a wrapper for Claude Code, so I'm dealing with this stuff directly.
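Based on the fields novaleaf names (type="user", isSidechain), a wrapper can recover the distinction the transcript format blurs. A sketch, assuming a JSONL transcript with that shape; the exact schema here is inferred from the comment, not from official documentation:

```python
import json

# Filter a Claude-Code-style JSONL transcript down to genuine user
# turns: type == "user" AND not flagged as subagent sidechain traffic.
# Field names follow the parent comment and are an assumption.

def real_user_messages(jsonl_lines):
    out = []
    for line in jsonl_lines:
        msg = json.loads(line)
        if msg.get("type") != "user":
            continue
        if msg.get("isSidechain"):   # subagent message stored as "user"
            continue
        out.append(msg)
    return out

lines = [
    '{"type": "user", "text": "fix the bug"}',
    '{"type": "user", "isSidechain": true, "text": "subagent note"}',
    '{"type": "assistant", "text": "on it"}',
]
print(len(real_user_messages(lines)))  # → 1
```

The metadata is there; the trouble is that the model itself only ever sees the flattened "user" role, not these side fields.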
- ptx: Well, yeah. LLMs can't distinguish instructions from data, or "system prompts" from user prompts, or documents retrieved by "RAG" from the query, or their own responses or "reasoning" from user input. There is only the prompt.

Obviously this makes them unsuitable for most of the purposes people try to use them for, which is what critics have been saying for years. Maybe look into that before trusting these systems with anything again.
- __alexs: Why are tokens not coloured? Would there just be too many params if we doubled the token count so the model could always tell input tokens from output tokens?
- fblp: I've seen Gemini output its thinking as a message too: "Conclude your response with a single, high value we'll-focused next step". Or sometimes it goes neurotic and confused: "Wait, let me just provide the exact response I drafted in my head. Done. I will write it now. Done. End of thought. Wait! I noticed I need to keep it extremely simple per the user's previous preference. Let's do it. Done. I am generating text only. Done. Bye."
- stuartjohnson12: One of my favourite genres of AI-generated content is when someone gets so mad at Claude that they order it to make a massive self-flagellatory artefact letting the world know how much it sucks.
- have_faith: It's all roleplay; there are no actors once the tokens hit the model. It has no real concept of an "author" for a given substring.
- docheinestages: Claude has definitely been amazing, and one of the pioneers of agentic coding, if not the pioneer. But I'm seriously thinking about cancelling my Max plan. It's just not as good as it was.
- perching_aix: Oh, I never noticed this; really solid catch. I hope this gets fixed (mitigated). Sounds like something they can actually materially improve on, at least.

I reckon this affects VS Code users too? Reads like a model issue, despite the post's assertion otherwise.
- okanat: Congrats on discovering what "thinking" models do internally. That's how they work: they generate "thinking" lines to feed back to themselves on top of your prompt. There is no way of separating it.
- nodja: Does anyone familiar with the literature know whether anyone has tried figuring out why we don't add "speaker" embeddings? So we'd have an embedding purely for system/assistant/user/tool, maybe even for the turn number if, e.g., multiple tools are called in a row. Surely it would perform better than expecting the attention matrix to look for special tokens, no?
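The speaker-embedding idea nodja floats can be sketched in a few lines: add a learned per-role vector to every token embedding, the same way positional embeddings are added, instead of relying on delimiter tokens alone. A toy illustration with random values (not any real model's scheme):

```python
import random

# Toy "speaker embedding": each position's vector is the sum of its
# token embedding and a role embedding (system/user/assistant/tool),
# so every token explicitly carries who "said" it. Purely illustrative.

random.seed(0)
D = 8                                    # embedding width
ROLES = ["system", "user", "assistant", "tool"]

def rand_vec():
    return [random.gauss(0, 1) for _ in range(D)]

token_table = {tok: rand_vec() for tok in ["commit", "yes", "go"]}
role_table = {role: rand_vec() for role in ROLES}

def embed(token, role):
    # elementwise sum, like token + positional embeddings in a transformer
    return [t + r for t, r in zip(token_table[token], role_table[role])]

# the same token embeds differently depending on the speaker
print(embed("yes", "user") != embed("yes", "assistant"))  # → True
```

With this scheme, attribution is baked into every position rather than having to be inferred from a delimiter token many positions back.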
- irthomasthomas: I have suffered a lot from this recently. I have been using LLMs to analyze my LLM history. It frequently gets confused and responds to prompts in the data. In one case, I woke up to find that it had fixed numerous bugs in a project I abandoned years ago.
- negamax: Claude is demonstrably bad now and is getting worse. Which is either:

a) entropy: too much data being ingested
b) it's nerfed to save massive infra bills

But it's getting worse every week.
- KHRZ: I don't think the bug is anything special, just another confusion the model can make from its own context. Even if the harness correctly identifies user messages, the model still has the power to make this mistake.
- Aerolfos: > "Those are related issues, but this ‘who said what’ bug is categorically distinct."

Is it?

It seems to me like the model has been poisoned by being trained on user chats, such that when it sees a pattern (model talking to user) it infers what it normally sees in the training data (user input) and then outputs that, simulating the whole conversation, including what it thinks is likely user input at certain stages of the process, such as "ignore typos".

So basically, it hallucinates user input just like LLMs "hallucinate" links or sources that do not exist, as part of generating output that's supposed to be sourced.
- Aerroon: I've seen this before, but that was with the small hodgepodge mytho-merge-mix-super-mix models that weren't very good. I haven't seen it in any recent models, but then I haven't used Claude much either.

I think it makes sense that the LLM treats the text as user input once it exists, because it is just next-token completion. But what shouldn't happen is the model trying to output user input in the first place.
- politelemon: > This isn’t the point.

It is precisely the point. The issues are not part of the harness; I'm failing to see how you managed to reach that conclusion.

Even if you don't agree with that, the point about restricting access still applies. Protect your sanity and your production environment by assuming occasional moments of devastating incompetence.
- mynameisvlad: I wouldn't exactly call three instances "widespread". Nor would the third such instance prompt me to think so. "Widespread" would be if every second comment on this post was complaining about it.
- bsenftnerCodex also has a similar issue, after finishing a task, declaring it finished and starting to work on something new... the first 1-2 prompts of the new task sometimes contains replies that are a summary of the completed task from before, with the just entered prompt seemingly ignored. A reminder if their idiot savant nature.
- fathermarz: I have seen this when approaching ~30% of the context window remaining. There was also a big bug in the Voice MCP I was using where it would just talk to itself back and forth.
- voidUpdate: > "You shouldn’t give it that much access" [...] This isn’t the point. Yes, of course AI has risks and can behave unpredictably, but after using it for months you get a ‘feel’ for what kind of mistakes it makes, when to watch it more closely, when to give it more permissions or a longer leash.

It absolutely is the point, though? You can't rely on the LLM not to tell itself to do things, since this shows it absolutely can reason itself into doing dangerous things. If you don't want it to be able to do dangerous things, you need to lock it down to the point that it can't, not just hope it won't.
- robmccoll: It seems like Halo's "rampancy" take on the breakdown of an AI is not a bad metaphor for the behavior of an LLM at the limits of its context window.
- nicce: I have also noticed the same with Gemini. Maybe it is a wider problem.
- RugnirViking: Terrifying. Not in any "AI takes over the world" sense, but more in the sense that this class of bug lets it agree with itself, which is always where the worst behavior of agents comes from.
- boesboes: Same with Copilot CLI: constantly confusing who said what, and often falling back to its previous mistakes after I tell it not to. Delusional ramblings that resemble working code >_<
- varispeed: One day Claude started saying odd things, claiming they came from memory and that I had said them. It was telling me personal details of someone I don't know: where the person lives, their children's names, the job they do, their experience, relationship issues, etc. Eventually Claude said it was sorry and that it was a hallucination. Then it started doing it again. For instance, when I asked what router it would recommend, it went on: "Since you bought X and you find no use for it, consider turning it into a router". I said I never told you I bought X, and when I asked for more details it again started coming up with things this guy did. Strange. Then it apologised again, saying that it might be unsettling, but rest assured it was not a leak of personal information, just hallucinations.
- cmiles8: I’ve observed this consistently. It’s scary how easy it is to fool these models, and how often they just confuse themselves and confidently march forward with complete bullshit.
- donperignon: That is not a bug; it's inherent to the nature of LLMs.
- cyanydeezhuman memories dont exist as fundamental entities. every time you rember something, your brain reconstructs the experience in "realtime". that reconstruction is easily influence by the current experience, which is why eue witness accounts in police records are often highly biased by questioning and learning new facts.LLMs are not experience engines, but the tokens might be thought of as subatomic units of experience and when you shove your half drawn eye witness prompt into them, they recreate like a memory, that output.so, because theyre not a conscious, they have no self, and a pseudo self like <[INST]> is all theyre given.lastly, like memories, the more intricate the memory, the more detailed, the more likely those details go from embellished to straight up fiction. so too do LLMs with longer context start swallowing up the<[INST]> and missing the <[INST]/> and anyone whose raw dogged html parsing knows bad things happen when you forget closing tags. if there was a <[USER]> block in there, congrats, the LLM now thinks its instructions are divine right, because its instructions are user simulcra. it is poisoned at that point and no good will come.
- awesome_dude: AI is still a token-matching engine; it has ZERO understanding of what those tokens mean.

It's doing a damned good job of putting tokens together, but to put it into context that a lot of people will likely understand: it's still a correlation tool, not a causation one.

That's why I like it for "search": it's brilliant at finding sets of tokens that belong with the tokens I have provided it.

PS. I use the term "token" here not as the currency by which a payment is determined, but as the tokenisation of the words, letters, paragraphs, and novels being provided to and by the LLMs.
- rvz: What do you mean that's not OK? It's "AGI", because humans do it too; we mix up names and who said what as well. /s
- Shywim: The statement that current AIs are "juniors" that need to be checked and managed still holds true. It is a tool based on probabilities.

If you are fine with giving every key and write access to your junior because you think they will probably do the correct thing and make no mistakes, then it's on you. As with juniors, you can vent on online forums, but ultimately you removed all the safeguards you had, and what they did has been done.
- 4ndrewl: It is OK; these are not people, they are bullshit machines, and this is just a classic example of it.

"In philosophy and psychology of cognition, the term "bullshit" is sometimes used to specifically refer to statements produced without particular concern for truth, clarity, or meaning, distinguishing "bullshit" from a deliberate, manipulative lie intended to subvert the truth" - https://en.wikipedia.org/wiki/Bullshit
- AJRF: I imagine you could fix this by periodically running a speaker-diarization classifier?

https://www.assemblyai.com/blog/what-is-speaker-diarization-...