Comments (123)
- bloppe: Generating big chunks of code is rarely what I want from an agent. They really shine for stuff like combing through logs or scanning dozens of source files to explain a test failure. Which benchmark covers that? I want the debugging benchmark that tests mastery of build systems, CLIs, etc.
- mmaunder: I’d encourage devs to use MiniMax, Kimi, etc. for real-world tasks that require intelligence. The downsides emerge pretty fast: much higher reasoning-token use, slower outputs, and palpable degradation. Sadly, you do get what you pay for right now. That doesn’t prevent you from saving a ton, though, through smart model routing, sensible reasoning budgets, and careful use of max output tokens. And optimize your apps and prompts to reduce output tokens.
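A minimal sketch of that routing idea: cheap, well-specified tasks go to a small model with a tight output cap, and everything else escalates. The model names, token caps, and complexity heuristic here are all illustrative assumptions, not a real API:

```python
def estimate_complexity(prompt: str) -> float:
    """Crude proxy: longer prompts with code fences tend to need more reasoning."""
    score = len(prompt) / 2000
    if "```" in prompt:
        score += 0.5
    return score

def route(prompt: str) -> dict:
    """Pick a model tier and cap output tokens to control spend."""
    if estimate_complexity(prompt) < 0.5:
        return {"model": "small-local-14b", "max_tokens": 512, "reasoning_budget": "low"}
    return {"model": "frontier-api-model", "max_tokens": 2048, "reasoning_budget": "high"}

print(route("Rename this variable across the file."))
# {'model': 'small-local-14b', 'max_tokens': 512, 'reasoning_budget': 'low'}
```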
- selcuka: It's a race to the bottom. DeepSeek beats all the others single-shot, and its API cost is ~50% lower than even the local-electricity cost:
> DeepSeek V3.2 Reasoning: 86.2%, ~$0.002, API, single-shot
> ATLAS V3 (pass@1-v(k=3)): 74.6%, ~$0.004, local electricity only, best-of-3 + repair pipeline
- memothon: I'm always skeptical, because you can make a model pass the benchmarks, and then you use it and it isn't practically useful, unlike an extremely general model. Cool work though; really excited about the potential of slimming down models.
- b3ing: Will open-source or local LLMs kill the big AI providers eventually? If so, when? I can see it maybe for basic chat; not sure about coding and images yet.
- emp17344: Yet more evidence that the harness matters more than the model.
- electroglyph: What's with the weird "Geometric Lens routing"? Sounds like a made-up GPT-ism.
- tgiba: Despite the skepticism, I love to see experiments like this. If we were all able to run an open-source model locally on mid-to-high-end machines, I'd be very happy.
- bdbdbdb: This is the kind of innovation I love to see. The big AI companies' days are numbered if we can get the same quality in-house.
- Temporary_31337: The headline is pretty stupid: it compares a model to a GPU that models run on. Somewhere in that data centre, some part of Sonnet's inference runs on a $900 GPU, or maybe an even cheaper Google tensor chip.
- riidom: Not a word about the tok/sec, unfortunately.
- 15minutemail: 74% on LCB from a single 5060 Ti. I've been paying Anthropic per task, and this guy is running it on electricity money. 20 minutes per task is rough for anything interactive, though.
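For scale, a back-of-envelope sketch of that "electricity money", assuming a ~180 W draw for the 5060 Ti and ~$0.15/kWh; both figures are assumptions, and the thread's ~$0.004 number presumably reflects cheaper power or a lower average draw:

```python
watts = 180                        # assumed full-load draw for a 5060 Ti
hours_per_task = 20 / 60           # 20 minutes per task, per the comment
price_per_kwh = 0.15               # assumed electricity rate
cost = watts / 1000 * hours_per_task * price_per_kwh
print(f"~${cost:.4f} per task")    # ~$0.0090 per task
```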
- 0xbadcafebee: This is specifically an experiment using ablation and multiple passes to improve the end result. Other techniques have been found to do this (like multiple passes through the same layers). But this technique, for this one specific model, seems to be more performant while also taking much longer and requiring more complexity. It's unlikely most people would use it, but it's interesting.
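A minimal sketch of what a best-of-k + repair loop like the pass@1-v(k=3) setup quoted upthread might look like; `generate`, `run_tests`, and `repair` are hypothetical placeholders for the model call, the test harness, and a repair prompt, not any real API:

```python
from typing import Callable, Optional

def best_of_k_with_repair(
    task: str,
    generate: Callable[[str], str],
    run_tests: Callable[[str], bool],
    repair: Callable[[str, str], str],
    k: int = 3,
) -> Optional[str]:
    for _ in range(k):
        candidate = generate(task)
        if run_tests(candidate):
            return candidate                  # first passing sample wins
        candidate = repair(task, candidate)   # one repair attempt per sample
        if run_tests(candidate):
            return candidate
    return None                               # all k samples (and repairs) failed
```

This is where the extra latency and complexity come from: in the worst case it pays for 2k generations plus 2k test runs per task.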
- sznio: On that topic, has anyone here got a decent local coding-AI setup for a 12GB VRAM system? I have a Radeon 6700 XT and would like to run autocomplete on it. I can fit some models in VRAM and they run quickly, but they're just a tad too dumb. I have 64GB of system RAM, so I can run larger models, and they're at least coherent, but really slow compared to running from VRAM.
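One common middle ground is partial layer offload: keep as many layers as fit in VRAM and leave the rest in system RAM. A minimal sketch using llama-cpp-python, where the model file and layer count are assumptions to tune for a 12GB card (ROCm builds cover Radeon GPUs):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen2.5-coder-14b-q4_k_m.gguf",  # hypothetical local file
    n_gpu_layers=30,  # offload as many layers as fit in VRAM; the rest run from RAM
    n_ctx=8192,       # the KV cache also consumes VRAM, so budget for it
)
out = llm("Complete this function:\ndef fib(n):", max_tokens=128)
print(out["choices"][0]["text"])
```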
- negativegate: Am I still SOL on AMD (9070 XT) when it comes to this stuff?
- limoce: The title should be "Adaptive Test-time Learning and Autonomous Specialization".
- superkuh: If anyone else was hoping this uses Q8 internally, so that converted to Q4 it could fit in 12GB VRAM: unfortunately it's already at Q4_K_M (~9GB), and the 16GB requirement comes from other parts, not the 14B@8bit + KV cache/etc. you might guess.
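The rough arithmetic behind that, assuming ~4.85 bits/weight for Q4_K_M on a 14B model; the KV-cache dimensions (layer count, heads, head size, context) are illustrative assumptions, not the model's published config:

```python
params = 14e9
bits_per_weight = 4.85                      # approximate average for Q4_K_M
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"weights: ~{weights_gb:.1f} GB")     # ~8.5 GB, matching the ~9 GB figure

# fp16 KV cache, assuming 48 layers, 8 KV heads, head_dim 128, 8k context
kv_gb = 2 * 48 * 8 * 128 * 8192 * 2 / 1e9   # K and V, 2 bytes per element
print(f"kv cache: ~{kv_gb:.1f} GB")         # ~1.6 GB on top of the weights
```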
- Razengan: Claude Code has been bleh, or meh at best, in my experience. There are so many posts on HN fawning over it lately that it could only be a guerrilla marketing campaign.