Comments (172)
- CDieumegard: Interesting timing — GLM-4.7 was already impressive for local use on 24GB+ setups. Curious to see when the distilled/quantized versions of GLM-5 drop. The gap between what you can run via API vs. locally keeps shrinking. I've been tracking which models actually run well at each RAM tier, and the Chinese models (Qwen, DeepSeek, GLM) are dominating the local inference space right now.
- simonw: Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07... Solid bird, not a great bicycle frame.
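For anyone wanting to reproduce the pelican test, here is a minimal sketch of calling GLM-5 through OpenRouter's OpenAI-compatible chat completions endpoint. The model slug `z-ai/glm-5` is an assumption; check OpenRouter's model list for the real identifier.

```python
# Minimal sketch: pelican-on-a-bicycle prompt via OpenRouter's
# OpenAI-compatible chat completions API.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "z-ai/glm-5",  # assumed slug; verify on openrouter.ai/models
        "messages": [
            {"role": "user",
             "content": "Generate an SVG of a pelican riding a bicycle."},
        ],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```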
- Aurornis: The benchmarks are impressive, but it's comparing to last-generation models (Opus 4.5 and GPT-5.2). The competitor models are new, but they would easily have had enough time to re-run the benchmarks and update the press release by now. Although it doesn't really matter much: all of the open-weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.
- pcwelder: It's live on OpenRouter now. In my personal benchmark it's bad. So far that benchmark has been a really good indicator of instruction following and agentic behaviour in general. For those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format: I ask it to do coding tasks using chat.md [1] + MCPs. So far it's just not able to follow it at all. [1] https://github.com/rusiaaman/chat.md
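To make the benchmark idea concrete, here is a hypothetical sketch of the kind of check it implies: verifying that a model's output uses one exact custom tool-call wrapper instead of falling back to its native syntax. The `<tool_call>` delimiter below is invented for illustration and is not chat.md's actual format.

```python
# Hypothetical format-adherence check: did the model wrap its tool calls
# in the custom delimiter it was instructed to use?
import re

TOOL_CALL = re.compile(r"<tool_call>\s*\{.*?\}\s*</tool_call>", re.DOTALL)

def follows_format(model_output: str) -> bool:
    """True iff the output contains at least one properly wrapped call."""
    return bool(TOOL_CALL.search(model_output))

def score(outputs: list[str]) -> float:
    """Fraction of task outputs that adhered to the custom format."""
    return sum(map(follows_format, outputs)) / len(outputs)
```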
- Havoc: Been playing with it in opencode for a bit and pretty impressed so far. Certainly more of an incremental improvement than a big-bang change, but it does seem a good bit better than 4.7, which in turn was a modest but real improvement over 4.6. It certainly seems to remember things better and is more stable on long-running tasks.
- justinparus: Been using GLM-4.7 for a couple of weeks now. Anecdotally, it's comparable to Sonnet, but requires a bit more instruction and clarity to get things right. For bigger, complex changes I still use Anthropic's family, but for very concise and well-defined smaller tasks the price of GLM-4.7 is hard to beat.
- kristianp: So that was Pony Alpha (1). Now what's Aurora Alpha? (1) https://openrouter.ai/openrouter/pony-alpha
- cherryteastain: What is truly amazing here is that they trained this entirely on Huawei Ascend chips, per reporting [1]. Hence we can conclude the Chinese semiconductor-to-model tech stack is only 3 months behind the US, considering Opus 4.5 released in November (excluding the lithography equipment here, as SMIC still uses older ASML DUV machines). This is huge, especially since just a few months ago it was reported that DeepSeek were not using Huawei chips due to technical issues [2]. US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallows imports of Nvidia chips as a direct result of past sanctions [3], at a time when the Trump admin is trying to do whatever it can to reduce the US trade imbalance with China. [1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r... [2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch... [3] https://www.reuters.com/world/china/chinas-customs-agents-to...
- esafak: I got fed up with GLM-4.7 after using it for a few weeks; it was slow through z.ai and not as good as the benchmarks led me to believe (especially with regard to instruction following), but I'm willing to give it another try.
- woeirua: It might be impressive on benchmarks, but there's just no way for them to break through the noise from the frontier models. At these prices they're just hemorrhaging money. I can't see a path forward for the smaller companies in this space.
- 2001zhaozhao: GLM-4.7-Flash was the first local coding model that I felt was intelligent enough to be useful. It feels something like Claude 4.5 Haiku, at a parameter size where other coding models are still getting into loops and making bewilderingly stupid tool calls. It also has very clear reasoning traces that feel like Claude's, which makes it possible to inspect its reasoning and figure out why it made certain decisions. So far I haven't managed to get comparably good results out of any other local model, including Devstral 2 Small and the more recent Qwen-Coder-Next.
- beAroundHere: I'd say they're super confident about the GLM-5 release, since they're directly comparing it with Opus 4.5 and don't mention Sonnet 4.5 at all. I'm still waiting to see if they'll launch a GLM-5 Air series, which would run on consumer hardware.
- goldenarm: If you're tired of cross-referencing the cherry-picked benchmarks, here's the geometric mean of SWE-bench Verified & HLE-tools:
  Claude Opus 4.6: 65.5%
  GLM-5: 62.6%
  GPT-5.2: 60.3%
  Gemini 3 Pro: 59.1%
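For anyone checking the arithmetic, the aggregation above is just the square root of the product of the two benchmark scores. A minimal sketch; the per-benchmark inputs below are placeholders, not the models' actual results:

```python
# Geometric mean of two benchmark scores, as used in the table above.
from math import sqrt

def geo_mean(swe_bench_verified: float, hle_tools: float) -> float:
    return sqrt(swe_bench_verified * hle_tools)

# Placeholder example: 80% on SWE-bench Verified, 49% on HLE-tools.
print(f"{geo_mean(0.80, 0.49):.1%}")  # -> 62.6%
```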
- pu_pe: Really impressive benchmarks. It was commonly stated that open-source models were lagging 6 months behind the state of the art, but they are likely even closer now.
- mnicky: What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1]. We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors. I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging, releasing models with artificially capped reasoning to stay at a similar level to their competitors? How do you read this? [1] https://imgur.com/a/EwW9H6q
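As a sketch of what "scales with the log of tokens" means operationally: fit eval score against ln(reasoning tokens) and read off the slope. The data points below are made up to illustrate the shape, not taken from [1].

```python
# Fit score ~ a + b * ln(tokens) to hypothetical eval points.
import numpy as np

tokens = np.array([1_000, 4_000, 16_000, 64_000])  # reasoning budget
scores = np.array([0.42, 0.51, 0.60, 0.69])        # hypothetical evals

b, a = np.polyfit(np.log(tokens), scores, 1)       # slope, intercept
print(f"score ~ {a:.2f} + {b:.3f} * ln(tokens)")
```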
- jnd0: Probably related: https://news.ycombinator.com/item?id=46974853
- algorithm314: Here is the pricing per M tokens: https://docs.z.ai/guides/overview/pricing Why is GLM-5 more expensive than GLM-4.7, even when using sparse attention? There is also a GLM-5-code model.
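Since the thread keeps comparing per-M-token prices, a quick sketch of the cost arithmetic; the prices below are placeholders, not z.ai's actual rates:

```python
# Cost of one request given per-million-token input/output prices (USD).
def request_cost(in_tokens: int, out_tokens: int,
                 price_in: float, price_out: float) -> float:
    return in_tokens / 1e6 * price_in + out_tokens / 1e6 * price_out

# Placeholder prices: $1.00/M input, $3.20/M output,
# on a 50K-token prompt with a 4K-token completion.
print(f"${request_cost(50_000, 4_000, 1.00, 3.20):.4f}")  # -> $0.0628
```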
- nullbyte: GLM-5 beats Kimi on SWE-bench and Terminal-Bench. If it's anywhere near Kimi in price, this looks great. Edit: input tokens are twice as expensive. That might be a deal breaker.
- ExpertAdvisor01: They increased their prices substantially.
- mohas: I kinda feel this benchmarking thing with Chinese models is like university Olympiads: they study specifically for those, but when the time comes for real-world work they seriously lag behind.
- unltdpower: I predict a new speculative market will emerge where adherents buy and sell misween coded companies. Betting on whether they can actually perform their sold behaviors. Passing around code repositories for years without ever trying to run them, factory sealed.
- tgtweak: Why are we not comparing to Opus 4.6 and GPT-5.3-Codex? Honestly, these companies are so hard to take seriously with these release details. If it's an open-source model and you're only comparing open source, cool. If you're not top in your segment, maybe show how your token cost and output speed more than make up for that. Purposely showing prior-gen models in your release comparison immediately discredits you in my eyes.
- meffmadd: It will be tough to run on our 4x H200 node… I wish they had stayed around the 350B range. MLA will reduce KV cache usage, but I don't think the reduction will be significant enough.
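A back-of-envelope KV cache estimate shows why MLA matters at this scale. All configuration numbers below are assumptions for illustration; GLM-5's actual architecture isn't given in the thread.

```python
# KV cache size in GiB. width_per_token is what gets cached per token per
# layer: standard attention stores K and V (2 * num_kv_heads * head_dim);
# MLA stores roughly one compressed latent vector instead.
def kv_cache_gib(layers: int, width_per_token: int, seq_len: int,
                 batch: int, bytes_per: int = 2) -> float:
    return layers * width_per_token * seq_len * batch * bytes_per / 2**30

# Hypothetical 92-layer model, 128K context, batch 8, fp16 cache:
print(kv_cache_gib(92, 2 * 8 * 128, 131_072, 8))  # GQA-style: ~368 GiB
print(kv_cache_gib(92, 576, 131_072, 8))          # MLA latent: ~104 GiB
```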
- karolist: The number of times in the past year that a competitor's benchmarks said something was close to Claude and it actually was remotely close in practice: 0.
- eugene3306: Why don't they publish ARC-AGI results? Too expensive?
- seydor: I wish China would start copying Demis's biotech models soon as well.
- woah: Is this a lot cheaper to run (on their service or rented GPUs) than Claude or ChatGPT?
- surrTurr: We're seeing so many LLM releases that they can't even keep their benchmark comparisons updated.
- dana321: Just tried it; it's practically the same as GLM-4.7. It isn't as "wide" as Claude or Codex, so even on a simple prompt it misses out on one important detail: instead of investigating fully before starting a project, it ploughs ahead with the next best thing it thinks you asked for.
- petetnt: Whoa, I think GPT-5.3-Codex was a disappointment, but GLM-5 is definitely the future!