SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via CI

<- Back

SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via CI

mpweiher

Comments (37)

agent5ravi
The resolve rate numbers are interesting but I keep coming back to the regression question. In my experience doing code review on a real codebase, the hard part of maintenance is not fixing the thing that broke. It is understanding whether your fix preserves the invariants the original author had in mind but did not write down.A benchmark that checks CI pass/fail captures the first part. It cannot capture the second. An agent that makes CI green by weakening an assertion or bypassing a check will score well here but create a time bomb.The monorepo point from yuyuqueen hits this. When the agent can see the full dependency graph, it is less likely to fix something locally while breaking a downstream assumption. The biggest maintenance failures I have seen are not wrong logic. They are fixes that are locally correct but violate an unwritten contract between components.
mentalgear
Claude wins by a large margin* Claude Opus 4.6 : 0.71* Claude Opus 4.5 : 0.51* KIMI-K2.5 : 0.37* GLM-5 : 0.36* GPT-5.2 : 0.23Note: later GPT versions seem to be only available within openAi's proprietary codex cli, so can't be tested - and if tested via the codex cli "harness" it wouldn't be a pure model-to-model comparison any more.---Of course, the interesting follow-up question is: How well perform these models with added agent tooling ("harness") ?Maybe someone has tokens to burn and can run a matrix of agent tools over the top models and provide the results?
gizmodo59
Unfortunately the paper doesn’t include gpt 5.3 which was released around the same time as opus 4.6 and also gpt 5.4 few days back. Both are available via apihttps://developers.openai.com/api/docs/models/gpt-5.3-codexIMHO The harness must be used when running these experiments. The model vendors know best on giving the best harness with gpt 5.4 and codex or Claude code with opus 4.6 which makes a big difference if you are running any kind of agentic coding tasks.I see both Claude and gpt to be neck and neck in coding. Every other model+harness is definitely 3-6 months behind. Right now codex seems to be the best in terms of solving complex bugs, long running tasks, much higher limits and even speed while Claude seems to do well in front end and their cli ux seems nice! Codex app is very good though (wish it wasn’t electron as a memory hog but it’s good)
50lo
It’d be interesting to see this compared against a human baseline — e.g., a competent engineer with a fixed time budget on the same tasks.
yuyuqueen
The regression rates match what I saw early on with Claude Code on my monorepo. The fix was structural, not model-level: keeping everything in a single tree (packages, tests, docs, CI config) so the agent sees downstream effects of any change. When context is split across repos, agents cheerfully break imports because they literally can't see what depends on what.Something hard to capture in benchmarks: project-level conventions. A well-maintained CLAUDE.md at the repo root — describing architecture, naming patterns, test conventions — gives the agent context it internalizes before touching code. My regression rate dropped noticeably once I started maintaining that kind of project metadata. Model choice is only half the equation — the other half is how well you've structured the information environment the agent works in.
baalimago
Replace "Agent" with "Employee" and apply the same algorithm. Evaluate employee efficiency. Profit?
KronisLV
> The benchmark comprises 100 tasks, each corresponding on average to an evolution history spanning 233 days and 71 consecutive commits in a real-world code repository.This seems like a really cool thing to benchmark! Technically it'd be possible to take GitHub repos that the AI orgs probably already have, cross-reference the code against the issues and regressions, and train/validate on that.The dataset would need to be way bigger to get close to the likes of SWE-bench: https://www.swebench.com/original.html"Vibe coded stuff gets hard to maintain and will end up buggy." Yeah, so make models that deal with that better, optimize for maintainability and consistency.Cool to see Claude doing decently though!
rurban
The zero regression rate graph at the end is exactly my experience. Only Opus is useful right now, the rest are juniors.
smy20011
It interesting to see that the eval set becoming more and more expensive. Previously we just need to evaluate one test set, right now we need to create a lot of diffs and run a lot of tests.
jbergqvist
Would have loved to see a more detailed breakdown of performance by task type. The commit metadata is right there, seems straightforward to tag commits as feature vs refactor vs bug fix vs API change and report per-category numbers.
woadwarrior01
Interesting benchmark.I can't help but notice that they're benchmarking Opus 4.6 (Anthropic's latest and greatest model) against GPT-5.2 (which is three generations behind OpenAI's latest coding models: GPT-5.2-Codex, GPT-5.3-Codex and the latest GPT-5.4).
challengerVIE
To me using agents daily, the long term vision with maintainability in mind really makes the difference between us humans and agents, I like the idea. However evaluating long term maintainability over an average of just 500 loc changes does not sound like long term maintainability being measured here
PunchyHamster
I'm sure with benchmarks like these future LLMs will be optimized to hide regressions by "fixing" test framework too
qsera
>Alibaba Group
verdverm
Really long-term task benchmark showing significant improvements in very recent models, while also showing really bad regression rates across the board.
raphaelmolly8
[dead]
entrustai
[dead]
devcraft_ai
[dead]
coder_decoder
[flagged]