GLM 5.2 beats Claude in our benchmarks

jms703

Comments (180)

pimeys
I have taken another look at these open models after the fiasco of Fable and GPT 5.6 this weekend and... GLM-5.2 truly is a good workhorse model for daily programming. I consider myself a heavy user of LLMs and a seasoned developer. A typical session for me with GPT is usually over a hundred dollars... This weekend I programmed a Matrix bot with encryption and a Rust agent with some tools. Because I need one and OpenClaw just felt... not what I wanted. Two days later and 20 dollars poorer I have what I need: a multimodal agent written in Rust that has access to my homelab. Nothing felt off with GLM. It did what I wanted, was fast, had a decent, not very annoying personality and was much cheaper than Opus or GPT. I used it unquantized through Fireworks, but there are multiple other providers too.
SwellJoe
I added GLM 5.2 to my security bug hunting benchmark when it came out, and found it to be a good performer, but not the best open model. The benchmark tests whether models can find bugs Mythos found. The best open models in the initial benchmark were DeepSeek V4 Pro and MiMo 2.5 Pro. But it turned out MiMo got lucky: it has performed worse on almost every test I've done since, while DeepSeek has consistently been among the best performers and its extreme caching performance makes it cheaper than just about anything, including much smaller models. https://swelljoe.com/post/will-it-mythos/ Also of note, I found giving models access to the open source semgrep as a tool makes some perform worse and none perform better, though it's plausible there's a way to wire it up in a harness that presents useful information to the model without the model having to know how to use it. My theory is that semgrep isn't heavily represented in the training data, so you're asking the model to do two things at once: figure out how to use semgrep and find security bugs, and both tasks suffer for the lack of focus. Most small models, and some big models, can't do that well. Edit: But, also, more testing is ongoing. I suspect GLM 5.2 will also be a consistently strong performer. It seems to excel at most things I've tested on it.
bArray
Apparently GLM 5.2 is 753B parameters [1]. What kind of hardware are people using to run this locally? [1] https://huggingface.co/zai-org/GLM-5.2
himata4113
These numbers seem pretty low compared to what I was able to achieve, specifically around the Windows kernel, win32k<->win32u to be exact. It honestly wouldn't surprise me anymore if China started surpassing the models the US makes public, at least in specific categories such as cyber. GLM 5.2 is already capable enough to assist in self-training, which is similar to what we saw happen with frontier models, and they appear to be getting there at a significantly lower cost than OpenAI/Anthropic.
WithinReason
> [...] beating Claude Code (32%) at roughly $0.17 per vulnerability found

Claude Code is an agent harness, not an LLM. Claude is a brand (or group of LLMs), not an LLM.
solenoid0937
GLM export controls incoming? I predict Commerce will force OpenRouter and HuggingFace to take some open models down within the next few months. Not that it would make any sense.
croemer
They should also at least run Opus through the same Pydantic harness they used for GLM. As is, it's apples vs pears. Where's the cost per vulnerability for all the models other than GLM? Also, without code this isn't very trustworthy. Could all be made up as well.
jackdawed
I use GLM 5.2 via Neuralwatt and it's gotten so cheap I wouldn't mind cancelling my personal Claude subscription if work gave me one. I've spent 374M tokens this month and it only cost me $18 on energy-based pricing.
danslo
It reads like an ad. Secondly, these are "just" IDORs, arguably the easiest class of vulnerabilities. Thirdly, it compares to GPT 5.5 and Opus 4.8. No, we don't have Mythos at home.
_s_a_m_
I tried GLM many times and it is bad. I have no clue what these people are talking about.
g42gregory
If only the "cybersecurity" crowd were focused on patching the vulnerabilities. Instead of shilling for the LLM providers.
theteapot
> Constant: the IDOR dataset (the same real, open-source applications we've used in prior research) ...

What were they? Also, wouldn't one expect a more recently released coding agent (with a more recent knowledge cutoff) to perform better because it has access to more knowledge about vulns in these OSS projects, and possibly even has knowledge of your own "prior research"?
admax88qqq
> beats Claude in our Cyber Benchmarks

Beats which model in Claude? Whenever a "benchmark" doesn't put precise model numbers in its headline I am immediately skeptical. Either they don't know the difference (bad) or they are benchmarking against weaker models (misleading, also bad). It's like when studies say "AI is bad at X" and they used GPT-3.5 in current year.
slashdave
Advertisement
CurbStomper
people still using Claude?
veselin
Here, it appears they compare a single prompt, "find IDOR", against a multi-agent system. However, one can also start far more sophisticated skills that spin up subagents and mostly do the same in Claude Code, Codex, OpenCode, Pi, etc. Which I guess makes what semgrep sells obsolete. Unless they have built a Pareto-optimal point in terms of capabilities and token usage, maybe?
lowbloodsugar
Felt like I was reading advertising for their harness.
Art9681
This is because of the safeguards and not the model capabilities. If these folks signed up for the proper cyber service offered by Anthropic where refusals are removed then the open weight model wouldn't look as capable.
kordlessagain
You can launch GLM-5.2 in Opencode using Nemesis8: https://github.com/DeepBlueDynamics/nemesis8#nemesis-8 After installing, do a `n8 build` to build the image, then `n8 --danger --provider opencode interactive` to launch it in a container. Sign up for GLM-5.2 here: https://z.ai
dools
I think Opus 4.8 is deliberately nobbled. Kimi k2.6 with Kimi code beats Opus models at finding vulnerabilities, even though it produces some false positives. When I give the same issues to Opus and ask it to verify, most of the time it concurs it’s a real issue, even though it failed to find the issue itself.
laybak
How representative are Semgrep's benchmarks? Everyone seems to have their own benchmark these days (guess it's good "content marketing"). I'm honestly losing track.
dist-epoch
Anthropic is saying other models were good at detecting vulnerabilities, where Mythos excelled was in creating functional exploits for them. This article only talks about detecting vulnerabilities, so it's unclear if it's a true Mythos equivalent.
cmrdporcupine
I like GLM 5.2... ish. It's ok. I'd be mostly fine switching to it. I just can't find a cost effective way to do that. z.AI's coding plan is both overpriced and unreliable. ollama's is also overpriced. Paying by the token for it on openrouter etc is more expensive than just having a Codex or Claude coding plan. If you have to pay by the token, it's clearly cheaper. It's not competitive with a coding plan though.
utunga
Just popping in to say that no you can't use the word "tokenomics" to mean that. Argh.
lenerdenator
The incentive to develop Claude further is to make money. The incentive to develop these Chinese models further is to trash the business case of most American AI labs.
yieldcrv
Who is your favorite hosted GLM 5.2 provider? I'm looking for fastest tokens/sec and best cost. Additionally, a reliable API, because z.ai can be finicky. Also, not for Enterprise use, but I like non-US providers: I don't care if the party happens to be the one reading my information and stealing my trade secrets, if they won't respond to a US subpoena.
csjh
I found it to spiral into complete nonsense a few times when I tested it out, but it's possible that was a bug in the provider
TacticalCoder
How to reconcile that with the recent, highly upvoted article titled "The gap between open weights LLMs and closed source LLMs"? What explains it? Is TFA lying? Is the most upvoted comment here lying?
rode1974
Hopefully I get a MacBook Pro soon enough to run some small or medium sized LLMs.
BikiniPrince
This is a joke right? I wouldn't install this in a sandbox.