Comments (205)
- logicprog: I really enjoyed this article. I think the author is precisely right and I've been saying this for a long time. There's a ton of extremely interesting low-hanging fruit that can vastly improve the effectiveness of even currently existing models hiding in how we design our agent harnesses; enough to — at least until we hit diminishing returns — make as much or more of a difference than training new models!

  I think one of the things this confirms, for me at least, is that it's better to think of "the AI" as not just the LLM itself, but the whole cybernetic system of feedback loops joining the LLM and its harness. Because if the harness, when improved, can make as much if not more of a difference as improvements to the model itself, then the two have to be considered equally important. Not to mention that models are specifically reinforcement-learned to use harnesses, and harnesses are adapted to the needs of models in general or of specific models, so they necessarily develop together in a feedback loop. And then in practice, as they operate, it is a deeply intertwined feedback loop where the entity that actually performs the useful work, and which you interact with, is really the complete system of the two together.

  I think thinking like this could not only unlock quantitative performance improvements like the ones discussed in this blog post, but also help us conceive of the generative AI project as actually a project of neurosymbolic AI, even if the most capital-intensive and novel aspect is a neural network; once we begin to think like that, it unlocks a lot of new options and more holistic thinking, and might increase research in the harness area.
- woah: Seems like a very cool technique, but also very oversold. He's seeing a 5% improvement on a find-and-replace benchmark of his own devising and saying stuff like this in the blog post:

  > Here is why that is backwards. I just showed that a different edit format improves their own models by 5 to 14 points while cutting output tokens by ~20%. That’s not a threat. It’s free R&D.

  He makes it sound like he got a 5-14% boost on a top-level benchmark, not a 5% improvement on a narrow find-and-replace metric. Anecdotally, I don't usually have a lot of issues with editing in Claude Code or Cursor, and if there is an issue the model corrects it.

  Assuming that it costs double the tokens when the model has to correct itself, and that find-and-replace errors are as prominent in actual day-to-day use as in his benchmark, we're talking about a 5% efficiency gain in editing token use (not reasoning or tool use). Given that editing must be less than 1/3 of total token use (I assume much less?), we're talking an overall efficiency gain of less than 1%.

  This seems like a promising technique, but maybe not a high priority among efficiency gains for these tools. The messianic tone, like assuming that Google cut off his access to suppress his genius editing technique rather than just because he was hammering their API, also leaves a bad taste, along with the rampant and blatant ChatGPTisms in the blog post.
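  A back-of-envelope version of that estimate (a sketch using the commenter's assumed numbers above, not measured data):

  ```python
  # All inputs are the commenter's assumptions, not measurements.
  edit_failure_rate = 0.05       # ~5% of edits fail and need a retry
  retry_cost_multiplier = 2.0    # a failed edit roughly doubles its token cost
  edit_share_of_tokens = 1 / 3   # generous upper bound on editing's share of all tokens

  # Extra editing tokens caused by retries, as a fraction of editing tokens
  editing_overhead = edit_failure_rate * (retry_cost_multiplier - 1)   # 0.05

  # Upper bound on the overall saving if retries were eliminated entirely
  overall_saving = editing_overhead * edit_share_of_tokens
  print(f"overall token saving <= {overall_saving:.1%}")   # ~1.7%; under 1% if editing's share is much smaller
  ```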
- keeda: This makes sense to me because I've been having very accurate results with models from even 2+ years ago... but I had to "hold them right." Even when reasoning models and coding agents were just a gleam in Altman's and Amodei's eyes, I could tell a lot of the unrealized gains lay in building the right tools, harnesses and guardrails to manage the context and guide the model. (Relevant subthread as example: https://news.ycombinator.com/item?id=44171519)

  But this article hints at deeper wins to be had. Consider that these models are operating on source code, which is a verbose, noisy, textual serialization of the intended syntax / semantic trees. TFA improves accuracy by retro-fitting some structure onto the text. But what if models could operate directly on these underlying structures themselves?

  As a data point, there are projects like OpenRewrite, which encode a ton of information, from formatting to types with globally resolved dependencies for each symbol, in what they call a "Lossless Semantic Tree", so that there is ~0 ambiguity about the code. When I worked with OpenRewrite (in the era before LLMs, how quaint!) compared to other tools, it produced the best results for code transformations, with the highest fidelity to the surrounding code.

  Now imagine if the agent has access to such detailed information. It would not have to waste tokens figuring out incidental things like formatting. Although I haven't tested it out myself, I believe Moderne (the maintainers of OpenRewrite) when they say that agents armed with LST-based tools make extremely accurate changes.

  This is essentially the same reason why the answer to "Which is better, Vim or Emacs?" is "IntelliJ."

  Now consider that these models are STILL operating on text as an input and output mode! What if they were multi-modally trained on source code and docs and their syntax / semantic trees? I don't even know what this would look like, but I'd bet this would produce the most accurate coding models ever -- probably neurosymbolic in the truest sense.
- chrisweekly: Great post. A few choice quotes:

  > Often the model isn’t flaky at understanding the task. It’s flaky at expressing itself. You’re blaming the pilot for the landing gear.

  > The model is the moat. The harness is the bridge. Burning bridges just means fewer people bother to cross. Treating harnesses as solved, or even inconsequential, is very short-sighted.

  > The gap between “cool demo” and “reliable tool” isn’t model magic. It’s careful, rather boring, empirical engineering at the tool boundary.
- matheist: > Codex uses apply_patch: It takes a string as input, which is essentially an OpenAI-flavored diff, and instead of relying on a structured schema, the harness just expects this blob to follow a strict set of rules. Since OpenAI folks are without a doubt smart, I’m sure the token selection process is biased to fit this structure at the LLM gateway for the Codex variants of GPT, similar to how other constraints like JSON schemas or required tool calls work.

  Codex does in fact use a schema for constrained sampling; it's here: https://github.com/openai/codex/blob/main/codex-rs/core/src/...

  It still has to produce an exact match, or at least I didn't read the code closely enough to see whether any fuzzy matching is used.

  Note that the two Codex models were the only ones doing worse with the author's proposed format. The author found them doing better with replace than with apply_patch, but since the author appears to be unaware that they use a schema for constrained sampling, I think a more realistic benchmark should enable constrained sampling for the apply_patch test.
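  For context, a minimal sketch of schema-constrained tool arguments via the OpenAI function-calling API; this illustrates the general mechanism only and is not the actual Codex apply_patch schema (the tool name and fields here are placeholders):

  ```python
  from openai import OpenAI

  client = OpenAI()

  # Illustrative only: with "strict": True the API constrains token sampling so the
  # tool arguments always validate against this JSON schema. The real Codex
  # apply_patch schema lives in codex-rs and is more involved than this.
  apply_patch_tool = {
      "type": "function",
      "function": {
          "name": "apply_patch",
          "description": "Apply a patch to the workspace.",
          "strict": True,
          "parameters": {
              "type": "object",
              "properties": {"input": {"type": "string", "description": "The patch body."}},
              "required": ["input"],
              "additionalProperties": False,
          },
      },
  }

  resp = client.chat.completions.create(
      model="gpt-4o",  # any model that supports structured outputs
      messages=[{"role": "user", "content": "Rename foo to bar in main.py"}],
      tools=[apply_patch_tool],
  )
  print(resp.choices[0].message.tool_calls)
  ```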
- socketcluster: Seeing all these 'coding' benchmarks reminds me that people still don't understand what coding means in practice. People still think one-phase puzzle-solving is coding. Real coding almost always has multiple phases which build on top of one another. There is an architectural component which is missed here - and the sheer number of phases/layers is actually where most of the complexity comes from.
- clx75: During my first LLM experiments in Emacs using gptel, I also found that the LLM has considerable difficulties changing source code files with the Unix patch tool.

  As Emacs has a built-in tree-sitter package, I implemented this same idea. I created gptel tools like tree_sitter_list_nodes, tree_sitter_get_nodes, tree_sitter_update_nodes, tree_sitter_insert_before_node and tree_sitter_insert_after_node. The "list" tool returns a list of AST nodes with first line number, first line content and node hash. The LLM can then use "get" to collect interesting nodes in their entirety and "update" to update a list of nodes identified by hash with new content (var/function bodies).

  Worked like a charm.
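  A rough Python sketch of the same list/update-by-hash idea; the original tools are Emacs Lisp over tree-sitter, whereas this sketch uses Python's ast module and illustrative function names:

  ```python
  import ast
  import hashlib

  def list_nodes(source: str):
      """Rough analogue of tree_sitter_list_nodes: return (hash, first line number,
      first line content) for each top-level def/class, so later edits can address
      a node by hash instead of quoting its old body verbatim."""
      lines = source.splitlines()
      out = []
      for node in ast.parse(source).body:
          if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
              segment = ast.get_source_segment(source, node)
              digest = hashlib.sha1(segment.encode()).hexdigest()[:8]
              out.append((digest, node.lineno, lines[node.lineno - 1]))
      return out

  def update_node(source: str, node_hash: str, new_body: str) -> str:
      """Rough analogue of tree_sitter_update_nodes: replace the node whose current
      content hashes to node_hash; a stale hash means the file changed underneath."""
      lines = source.splitlines()
      for node in ast.parse(source).body:
          if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
              continue
          segment = ast.get_source_segment(source, node)
          if hashlib.sha1(segment.encode()).hexdigest()[:8] == node_hash:
              # lineno is 1-based and end_lineno is inclusive, so slice around the node
              return "\n".join(lines[: node.lineno - 1] + new_body.splitlines() + lines[node.end_lineno:]) + "\n"
      raise ValueError(f"no node with hash {node_hash!r}")
  ```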
- XCSme: Google banning you for benchmarking is crazy, are you sure that's the cause? How would they even know you are benchmarking?
- joshuamoyers: I think this is the right take. I'm usually aligned with most of what Anthropic is doing, but cutting off OAuth login from open harnesses was a bad move. My guess is there is some serious worry/overlap with the Cursors of the world - e.g. folks who will be competitors in the future, who are taking advantage of cheaper Opus rates / loss-leader pricing from them while simultaneously building a competitive model (Composer).

  Also, nice clever optimization here. Lots of low-hanging fruit in harness land.
- jahala: I implemented this hash (read and edit) approach in tilth if you want to test it out.

  https://github.com/jahala/tilth

  It's on npm and cargo:

  - cargo install tilth
  - npx tilth

  then tilth install claude-code/windsurf/cursor --edit (the --edit flag is needed)

  I made "tilth" a few days ago, since I'm consistently trying to get the LLMs to use tools more efficiently and spend fewer tokens doing it -- original tilth post from Monday: https://news.ycombinator.com/item?id=46952321
- mehdibl: You can improve the success rate a lot by providing HELM and clear instructions in the tool description.

  Over a year ago I had a lot of issues, and the description and example were the difference between a 30-50% failure rate and 1%!

  So I'm a bit surprised about the point. Maybe I'm missing it.
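  A sketch of what that can look like in practice: a tool spec whose description spells out the rules and carries a worked example (the tool name, rules, and schema here are illustrative, not the commenter's actual setup):

  ```python
  # Illustrative tool spec: the description itself carries the rules and a worked
  # example, which is often what moves edit-tool failure rates from double digits
  # down to near zero.
  EDIT_TOOL = {
      "name": "edit_file",
      "description": (
          "Replace an exact, unique snippet in a file.\n"
          "Rules:\n"
          "- old_text must match the file byte-for-byte, including indentation.\n"
          "- old_text must occur exactly once; include surrounding lines if needed.\n"
          "Example:\n"
          '  {"path": "src/app.py", "old_text": "    return 1\\n", "new_text": "    return 2\\n"}'
      ),
      "input_schema": {
          "type": "object",
          "properties": {
              "path": {"type": "string"},
              "old_text": {"type": "string"},
              "new_text": {"type": "string"},
          },
          "required": ["path", "old_text", "new_text"],
      },
  }
  ```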
- woeirua: The harness matters far more than most people think. See this post about the CORE benchmark, where Opus’ score almost doubled when they switched to Claude Code from their own harness: https://x.com/sayashk/status/1996334941832089732
- tosh: Shows how much room for improvement there is on the harness level.

  Agents waste a lot of tokens on editing, sandboxes, passing info back and forth from tool calls and subagents.

  Love the pragmatic mix of content-based addressing + line numbers. Beautiful.
- kachapopopowMy personal notes (not the author): have been way faster performance wise which is honestly the biggest improvement over correctless. I've posted https://github.com/can1357/oh-my-pi before, but didn't seem to gain traction. It's a great little agent.
- rao-v: I’d really like to see this optimized for the 50-120B parameter open source models that are locally viable (gpt-oss-120b, qwen3-80b-3a etc.).

  For them I think it would be optimal to provide a tag per function and trust the LLM to rewrite the function. As the article notes, full reproduction is generally more reliable than editing for short code.

  I suspect the token and attention overhead from a per-line hash limits this approach for smaller models.
- indubioprorubik: My guess has always been that if you took the source of the training data (meaning the authors of the "best" answers and solutions on Stack Overflow or GitHub) and reformatted the question to sound like it was created by those experts, the generated code would try to hug those sources of truth while being created.

  So the challenge is actually to find a map from "problem" to "author", then from "author" to "related code", and from there to a solution.
- ianbutler: It’s funny to see where we are on model improvements.

  Back when I was maintaining a coding harness around the time of Claude 3.5, we tried hash prefixes, we tried line number prefixes, we tried a lot of different approaches to making the model better at selecting edit blocks, and ultimately, at least back then, fuzzy string matching won out.
- Bolwin: You forgot to mention your tool does worse for 8/16 LLMs compared to replace?

  Problem is, replace has been around for so long that most LLMs are tuned for it now.
- animan: What was the point of Claude Code or Gemini banning the OP? Why would they care about how IDEs use the underlying API?
- znnajdla: My experience as well. People worry our profession is being reduced to "prompt engineer", but actually I get the feeling that programming will soon be mainly about designing and building harnesses for specific tasks.
- tgtweak: When you're in the business of selling tokens, you look at technology that reduces that as a threat. If they were selling services that USE tokens, then reducing them would be welcome... so they'll likely steal this and incorporate it into their proprietary CLIs like Claude Code...
- fcanesin: The harness is the model's "body", its weights the cognition. As in nature, they develop together, and the iteration of natural selection works on both.

  If smaller labs (Zai, Moonshot, DeepSeek, Mistral...) get together and embrace a harness (opencode, for example) as a consortium, then just by the power of "evolution across different environments" they might hit the jackpot earlier than the bigger labs.
- christophilus: Has any harness matched the effectiveness of Claude Code yet? I haven't experimented much recently, but every time I have in the past, I wasn't able to get any other tool to approach how effective I am in CC.

  I'd love to use a different harness -- ideally an OSS one -- and hook it up to whichever LLM provides the best bang for the buck rather than being tied to Claude.
- parhamn: On first principles it would seem that the "harness" is a myth. Surely a model like Opus 4.6/Codex 5.3, which can reason about complex functions and data flows across many files, wouldn't trip up over the top-level function signatures it needs to call?

  I see a lot of evidence to the contrary, though. Does anyone know what the underlying issue here is?
- jbetala7: I switched from a basic prompt wrapper to structured tool use with Claude Code and the quality of output jumped overnight. Same model, completely different results.
- uriegas: I do agree with his identification of the problem: sometimes agents fail because of the tools around them and not because of the model's reasoning. However, for the failing tests I think he is not distinguishing between a test that failed due to a harness failure and one that failed due to a reasoning failure. It would be nice if someone analyzed that from the data set.
- aszen: So the new implementation always operates at the line level, replacing one or more lines. That's not ideal for some refactorings, like renames, where search-and-replace is faster.

  Edit: Checking oh-my-pi, the model has access to str replace too, so this is just an additional edit tool.
- benreesman: The logical end state of this line of reasoning is a collective action problem that dooms the frontier lab establishment. You can't devote model capacity to having an attention transformer match nested delimiters or cope with bash and be maximally capable; you can't mix authentication, authorization, control plane, and data plane into an ill-specified soup and be secure enough for anything that isn't a pilot or a toy, ever.

  If you run this out, you realize that the Worse is Better paradox has inverted, it's an arbitrage, and the race is on.
- pcwelder: Great work, but concurrency is lost.

  With search-replace you could work on separate parts of a file independently with the LLM. Not to mention that with each edit all lines below are shifted, so you now need to provide the LLM with the whole content again.

  Have you tested follow-up edits on the same files?
- jcims: I ran into this from the other direction. I built a small SRE agent for my cloud infra and just kind of walked into hand-rolling some of the tools rather than using what exists today. I provided an edit_file tool that felt like it was of reasonable capability, but in practice the agent was regularly 'trying' to do a one-line change and submitting PRs that hallucinated 3/4 of the file.

  Seeing how bad the results are when you're casually approaching something makes it very evident that this is a topic that can be optimized.
- giancarlostoro: One of the first things I add to my Claude instructions file is to stop using grep, it's awfully slow; just use ripgrep instead, you can just type the word you're looking for from the project root and find it all in one shot. Claude likes to go folder by folder with grep and it drives me crazy.

  "You're absolutely right!"

  At this point I'd take a contract with Anthropic to have Claude Code pick better tooling.
- rafaelmn: I wonder if we'll get to "VI for LLMs" - if the model was trained on using that kind of text navigation and you show context around the cursor as it navigates.

  It would also be worth having special tokens for this kind of navigation.
- the_harpia_io: the harness bottleneck is real - I've been working on ai code security stuff and the biggest issue isn't model capability, it's that most tools treat the output as gospel. they'll take a suggested fix and apply it without checking if it even compiles, let alone if it introduces new vulns. I've seen fixes that patch one CVE but break auth logic entirely.

  the edit tool point hits though. when you give the model a better interface to express changes (structured diffs vs free-form patches), error rates drop. but nobody talks about this because benchmarks measure "did it solve the problem" not "how many attempts" or "what's the blast radius when it fails". idk maybe I'm just jaded from debugging too many of these.
- softwaredoug: Underrated: how much of the productive use of LLMs at tasks like coding over the last year has come from improving harnesses, not just models.
- nekitamo: Getting banned from Gemini while attempting to improve Gemini is the most Googley thing ever :D Imagine letting your automated "trust and safety" systems run amok so that they ban the top 0.01% of your users with no recourse. Google really knows how to score an own-goal.
- visarga: Yeah, I invented a similar method for information-extraction attribution around 2022. I would place custom markers in a document, unique within the document, so the extraction model could reference them together with the answer and the answer could be located.
- 0xbadcafebee: Putting it out there: if any frontier model provider starts allowing any agent to use their $20/month plan, we will all switch to you. We don't want to be forced into one harness, we want OAuth, and we want respectable limits without excessive budgets.
- notsylver: I feel like Cursor's solution is still the best answer. Let the model suggest edits in whatever format it prefers, using as few "extra" tokens as possible, and have a small model figure it out. I don't use Cursor anymore, but when I did it was impressive how consistently it worked; I think there was a single time it failed. 70b might be overkill though...
- znnajdla: Yep, this has been my experience with browser agents as well. One little change in the harness/agentic loop and the model suddenly becomes a whole lot smarter at navigating the web. I was also able to build a better browser agent than `claude --chrome` in just a few afternoons by tweaking the harness.
- 0xdeafbeef: Filed an issue for Codex: https://github.com/openai/codex/issues/11601
- babkayaga: Still weird to me that most people are not just giving the LLM access to an editor, instead forcing it to write shell scripts to edit files. Shrug.
- energy123: I feel the baseline comparison should be relative to the intuitive and simple "line-numbers only" schema.

  It's less token heavy than the proposed hash approach, and I don't think frontier LLMs hallucinate line numbers if each line in the context is prefixed with them.
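  A minimal sketch of that baseline, assuming a numbered read view plus a line-range edit (helper names are illustrative):

  ```python
  def numbered_view(source: str) -> str:
      """What the model reads: every line prefixed with its 1-based line number."""
      return "\n".join(f"{i}| {line}" for i, line in enumerate(source.splitlines(), start=1))

  def apply_line_edit(source: str, start: int, end: int, replacement: str) -> str:
      """Replace lines start..end (1-based, inclusive) with the replacement text."""
      lines = source.splitlines()
      if not (1 <= start <= end <= len(lines)):
          raise ValueError("line range out of bounds; re-read the file first")
      return "\n".join(lines[: start - 1] + replacement.splitlines() + lines[end:]) + "\n"
  ```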
- jwpapi: Great article, and tbh I thought it would've been implemented that way already. It makes sense to hash and mainly save context; I don't expect them to care about token usage.

  How about Kimi, though? How can I play with it?
- MetaWhirledPeas: > Treating harnesses as solved, or even inconsequential, is very short-sighted

  Is it possible that burning extra tokens is the point, since they get paid more?
- jwpapi: Arguably I would think that the last year was mainly harness improvement instead of model improvement, but I could be wrong; it just feels like that to me.
- the_harpia_io: honestly the harness thing is way more important than people realize - I've been working on code security tools and the gap between what a model generates raw vs with better structure is massive, way bigger than model versions mattering. like the security bugs I see in AI code, half of them are just because the prompt didn't include enough context or the edit format was wonky

  the benchmark overselling isn't the point though - it's that we're barely using these things right. most people still chat with them like it's 2023. what happens when you combine this with actual review flows, not just 'beat swe-bench'

  idk I think everyone's too focused on the model when tooling matters more, since that's something you can actually control
- a11r: This is very nicely done. We have seen the same issue at a higher level: getting separators right when generating multiple files in a single inference call.
- andai: > Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.

  The VC economics are creating a reality distortion field where Anthropic is incentivized to burn more tokens so they can rent more GPUs so they can get more investment, and where I am incentivized to pipe the LLM inputs into `claude -p` and blast 50KB of useless proompt onto it so they don't ban me from their 95%-discounted API endpoint.
- evolly: My experience exactly! I’ve recently become so tired of the Claude harness that I switched to OpenCode (which is extremely good compared to Claude). However, OpenCode is also tedious to change, and it inherits all the “good stuff,” like treating agents as Markdown files and all the dancing around with hooks/plugins/skills scattered all over the place. Getting stuck again and again, I’ve ultimately come to the conclusion that this must be solved by writing my own damn coding agent, with extensibility that’s acceptable for real-world engineering.
- scotty79: The harness is where open source should shine. It doesn't require millions of dollars of compute, but the search space is vast and explorable with limited budgets.
- falkenstein: really enjoyed reading this, although I'm a dumb farmer and it took me a while to understand lol
- avereveard: I use small models; I like to give them a TOC more than lines. I wonder how it would stack up against the hash-line approach.

  read_toc tool (excerpt):

      ...
      {
        "name": "mcp",
        "qualified_name": "mcp",
        "type": "constant",
        "docstring": null,
        "content_point": "src\\mcps\\code_help\\server.py::17::18::python::mcp",
        "is_nested": false
      },
      {
        "name": "handler",
        "qualified_name": "handler",
        "type": "constant",
        "docstring": null,
        "content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
        "is_nested": false
      },
      ...

  update_content tool:

      {
        "content": "...",
        "content_point": "src\\mcps\\code_help\\server.py::18::19::python::handler",
        "project_root": ...
      }
- azinman2: Why not just use line numbers?
- deaux: Great article; recommend reading all of it.

  > Why bother, you ask? Opus may be a great model, but Claude Code to this day leaks raw JSONL from sub-agent outputs, wasting hundreds of thousands of tokens. I get to say, “fuck it, subagents output structured data now”.

  This is why I find the banning of Claude subscriptions in other harnesses so heinous. The harness they're forcing onto everyone has tons of big issues, including wasting massive numbers of tokens. Very much in line with intentionally refusing to adhere to standards in the most IE6 way possible.
- badhorseman: I feel a lot of confusion about which coding harness is best and which options to use. Tbh I have mostly used standard aider, and I don't know what the consensus is on this tool.

  I feel I want to write my own, and maybe in the future a lot of developers will have custom, highly personalized harnesses, as each user of these models wants to use these things in a way that's unique to their brain, much like how Emacs is so great for the customization, but one person's Emacs config is often not what another wants, or they only want a subset and then write their own features.

  As an aside, what is the feeling on all the various AI coding tools? Does aider suck, are aider-ce/cecli better, or are the bespoke tools for each model like Claude Code and such better?
- __mharrison__: Is there a skill file I can use for these edits?
- logicallee: I agree with this article completely; nice to see it presented quantitatively.

  > re "only" the harness changed

  In our experience, AIs are like amnesiacs who can barely remember what they did three minutes ago (their last autonomous actions might still be in their context if you're lucky), with no chance of remembering what they did three days ago. As such, the "harness" determines their entire memory and is the single most important determinant of their outcome.

  The best harness is a single self-contained, well-commented, obvious, and tiny code file, followed by a plain explanation of what it does and what it's supposed to do, the change request, how you want it done (you have to say it with so much force and confidence that the AI is afraid of getting yelled at if it does anything else), and a large amount of text devoted to asking the AI not to break what is already working. Followed by a request to write a test that passes. Followed by asking for its judgment about whether or not it broke what was already working. All in one tiny, crisp prompt.

  With such a harness, it's able to not break the code one time in twenty. If you use reverse psychology and ask it to do the opposite of what you want, it rises to fifty-fifty odds you'll get what you're trying to do. Don't believe me? You can watch the livestream (see my previous comments).

  Baby steps toward Utopia.
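  A minimal Python sketch of a prompt assembled along those lines (purely illustrative; it follows the structure described above, not any particular tool):

  ```python
  def build_prompt(code: str, explanation: str, change_request: str, how: str) -> str:
      """Assemble the whole harness as one small prompt, in the order described above."""
      return "\n\n".join([
          "Here is the entire file. It is small, self-contained, and currently working:",
          "<file>\n" + code + "\n</file>",
          "What it does and what it is supposed to do: " + explanation,
          "Change request: " + change_request,
          "How to do it (do exactly this and nothing else): " + how,
          "Do NOT break anything that already works. Do not rename, reorder, or "
          "'clean up' anything unrelated to the change request.",
          "Then write a test that passes for the new behavior.",
          "Finally, give your own judgment: did this change break anything that was "
          "already working?",
      ])
  ```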