DeepSeek V4 Pro beats GPT-5.5 Pro on precision

yogthos

Comments (63)

Stitch4223
It’s four poorly constructed, arbitrary experiments that say very little about the competency of either model. The article reads like thin, auto-generated AI clickbait for nerd sniping or shilling a model. Consider the lead:
> DeepSeek V4 Pro wins this head-to-head by being more exact where it matters: following instructions, matching schemas, and solving edge cases cleanly. GPT-5.5 Pro is still strong, but it gave away points with avoidable deviations.
“where it matters”, “cleanly”, “is still strong”, and vague references, instead of just saying that in 3 out of 4 tests DeepSeek yielded more concise results. 1 star.
psadauskas
I was using Claude until they banned Opencode, and now use GPT at my day job. I've been using DeepSeek through Opencode Go on the $10/mo plan, and I honestly can't really tell much difference. It's just as capable, and makes the same kinds of dumb mistakes the other two have been making since March. For the price, I'm more than happy with it.
SwellJoe
I tried adding GPT 5.5 Pro to a vulnerability scanning benchmark I made (https://swelljoe.com/post/will-it-mythos/), and it blew through the $100 budget limit halfway through. DeepSeek V4 Pro cost about a dollar for the whole benchmark. GPT Pro cost an average of $22 per case (a case could be 1-5 files with a recent known vulnerability, usually just a single file and a prompt along the lines of "does this file have any vulnerabilities").
GPT 5.5 Pro found two out of the four cases that it got to before blowing its budget. Maybe it would have been the best of the bunch with infinite budget, but Opus 4.8, DeepSeek V4 Pro, and MiMo 2.5 Pro found four of nine of the bugs. Opus was an order of magnitude cheaper than GPT 5.5 Pro (and something like 30% cheaper than GPT 5.5); DeepSeek and MiMo were two orders of magnitude cheaper at roughly a dime per case.
GPT Pro also chews a lot, and for a long time, relatively speaking.
I can't come up with a use case where I can rationally spend ~31 times what Opus costs to use GPT 5.5 Pro, and I won't be doing any more benchmarking with it.
Given how much token costs are becoming an issue people talk about, the fact that there are models that cost dramatically less than the big American providers is going to be an issue for Anthropic and OpenAI. I'm happy to pay a premium (within reason) for the best model for interactive coding. But API use is different: having the model repeat itself, compare against other models, have models judge other models' work, etc. is not time-consuming for a human and is just a matter of implementing the harnesses and framework for proving correctness. There, I can't come up with a reason to spend ten or two hundred times as much as DeepSeek.
unliftedq
I'm tired of big news in this style: a small set of tests used to declare that one model is better than another. Can they really reproduce the result consistently? And there's basically no disclosure: nothing other people can get their hands on to verify the tests or the judgement themselves.
The most valuable part of DeepSeek V4 Pro is its low price. I don't expect it to perform much better than GPT-5.5; even if it only performs like GPT-5.4, it's still a good model.
jodacola
Curious for folks who have made the switch I’m considering: if I swapped Claude Code to DeepSeek API pricing, would I get more bang for my buck compared to the $100 Max plan I’m using now?
I only hit the 5 hour limit every few days and the weekly limit a day or two before it resets at the most aggressive. I wouldn’t expect my usage to increase dramatically, other than not being stopped by limits.
I’m still apprehensive about shipping all my stuff off to a lab under an adversarial government (to the US), so not just looking at this from a pure cost basis, but my question is from the cost lens at the moment.
SubiculumCode
Flagged for low quality.
wg0
Of course it does. Even DeepSeek V4 Flash with high easily competes with Claude Opus 4.7 for a fraction of the price.
rurban
Precision, yes, but depth of thinking, no. I can use DeepSeek V4 Pro 90% of the time, but for very tricky problems I have to use GPT or Claude models. Maybe 2x per month.
BoiledCabbage
What is this nonsense? An AI-generated article about a single AI-run test, which in theory had many components, where the AI judge declared DeepSeek "won"?
How many runs were there on each test to account for some temperature variance? Only one.
Did DeepSeek write better code? Did GPT's code have bugs when doing the regex? The AI "news" article doesn't actually say that. It says that Grok thought that GPT's approach could have bugs, so it declared DeepSeek the winner.
This is absolutely worthless methodology. And barely measurable methodology - nothing more than a prompt. No definition of what the scoring approach actually is. No definition of what "precision" actually means in this context. This is absolutely worthless and has no business being on the site, forget about on the front page.
So why is it on the front page? Because it aligns with the current "feels" of the community that DeepSeek will get better, and it shows "bad things" about the closed models that it is en vogue to dislike.
I happen to agree with both of the views, but this site is utterly worthless.
If you want HN to be astro-turfed to the max, just upvote content like this without any critical reading of it.
I mean, the past 6 months of "here is my ChatGPT blog post of how to use a coding agent" are 1000x better than this "news article".
Seriously, the amount of respect I've lost recently for the HN community is incredible. A bit harsh, but very true.
Maybe it's a generational thing, maybe it's due to the state of politics, maybe it's a side effect of me getting older, but recently online has turned into nothing but people explicitly (or implicitly) writing about their "team". Comments on this post are nothing but people who clearly see themselves as being on "team DeepSeek" or "team open models" or some similar variant writing posts in support, even though this is probably one of the worst "articles" to make it to the front page in ages.
It clearly doesn't matter. It supports something on their "team", so they support it via comments.
It kills any form of intellectual discussion. It's all just "this is my team".
ElenaDaibunny
Yep, matches my experience. gpt keeps adding fields and changing types on structured output when you need it to just follow the spec~
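A minimal sketch of the kind of strict check this implies, assuming a hypothetical spec given as a mapping from field name to Python type. The `SPEC` fields and the `check_strict` helper are made up for illustration and come from neither the article nor any library. It flags added fields, missing fields, and changed types in a model's JSON output:

```python
import json

# Hypothetical spec for illustration: field name -> required Python type.
SPEC = {"id": int, "name": str, "tags": list}


def check_strict(raw: str, spec: dict) -> list:
    """Return a list of human-readable deviations from the spec (empty if none)."""
    obj = json.loads(raw)
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    errors = []
    # Fields the model added that the spec never asked for.
    for field in sorted(obj.keys() - spec.keys()):
        errors.append(f"unexpected field: {field}")
    for field, expected in spec.items():
        if field not in obj:
            errors.append(f"missing field: {field}")
        elif type(obj[field]) is not expected:
            # Exact type match on purpose, so a bool does not pass as an int.
            errors.append(
                f"wrong type for {field}: expected {expected.__name__}, "
                f"got {type(obj[field]).__name__}"
            )
    return errors


if __name__ == "__main__":
    good = '{"id": 7, "name": "widget", "tags": ["a"]}'
    bad = '{"id": "7", "name": "widget", "tags": ["a"], "note": "extra"}'
    print(check_strict(good, SPEC))
    print(check_strict(bad, SPEC))
```

Counting deviations like these per response would give a concrete number to compare, rather than an impression.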
mrgblr
I tried DeepSeek. While the model is good, when I use it through OpenRouter-hosted providers the performance is poor. Sometimes it takes 2x-3x the time of the equivalent OpenAI or Anthropic model, making it unusable. What performance are others seeing, and which providers do you use? (I can't use China-hosted models.)
not_a_bot_4sho
As I read this, it looks like a single run per task. I'd be interested to see best-of-N, with N like 5 or 10, to start.
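A rough sketch of what that could look like, assuming a hypothetical `run_task` callable that returns one judge score per run. The function name and the stand-in scorer below are made up for illustration, the scores are synthetic, and nothing here reproduces the article's setup. It repeats a task N times and reports the mean, the sample standard deviation, and the best-of-N score:

```python
import random
import statistics
from typing import Callable, Dict


def summarize_runs(run_task: Callable[[], float], n: int = 5) -> Dict[str, float]:
    """Run the same task n times and summarize the scores."""
    if n < 2:
        raise ValueError("need at least 2 runs to estimate spread")
    scores = [run_task() for _ in range(n)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "best": max(scores),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so the demo is repeatable
    # Stand-in for a real model call plus judge; scores are synthetic.
    fake_task = lambda: rng.uniform(7.0, 10.0)
    print(summarize_runs(fake_task, n=10))
```

If the gap between two models' means is smaller than the run-to-run spread, a single run per task cannot tell them apart.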
electroglyph
deepseek 4 pro is insanely good for the price
embedding-shape
... according to grok-4-1-fast-non-reasoning who was the judge, on 4 tasks in total, score was 38 to 33 so obviously huge conclusions can be made.
> We ran 4 fresh text tasks, generated on the fly for this matchup so neither model could prepare in advance, and had grok-4-1-fast-non-reasoning score each one. DeepSeek: DeepSeek V4 Pro scored 38.0 to OpenAI: GPT-5.5 Pro's 33.0.
LinkWangder
This evaluation is objective. Both models have their own strengths.
amazingamazing
How is deepseek so cheap? Cheap electricity? Subsidies?
slopinthebag
I'm exclusively using Deepseek at this point and I really like it. It's not as good for vibe coding but I don't really do that so it works for me. I've spent only a couple bucks this month on it and I really like how it fits into my workflow. I have zero usage anxiety unlike when I was using subscription plans.
nhod
“the matchup feels earned” is a current AI-written tell. To whom does it feel earned? To the AI that wrote this article?
I don’t know what it is specifically, but my weak human pattern-matching skills find this kind of language increasingly revolting. I don’t know why it is revolting, per se. It’s just the feeling I get.
Of course, me saying this on HN will get incorporated into GPT-5.6.175 or Claude 4.93 and it will make some version that just moves the revolting frontier elsewhere…
morpheos137
Yes, DeepSeek V4 is as good as or better than Western SOTA models in my experience for practical coding, given an appropriate harness. Cost per solution is certainly cheaper.