Comments (172)
- CDieumegard: Interesting timing — GLM-4.7 was already impressive for local use on 24GB+ setups. Curious to see when the distilled/quantized versions of GLM-5 drop. The gap between what you can run via API vs. locally keeps shrinking. I've been tracking which models actually run well at each RAM tier, and the Chinese models (Qwen, DeepSeek, GLM) are dominating the local inference space right now.
- simonw: Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07... Solid bird, not a great bicycle frame.
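For anyone wanting to reproduce the pelican test, here is a minimal sketch of calling GLM-5 through OpenRouter's OpenAI-compatible chat completions endpoint. The model slug `z-ai/glm-5` is an assumption; check OpenRouter's model list for the real identifier.

```python
# Minimal sketch: pelican-on-a-bicycle prompt via OpenRouter's
# OpenAI-compatible chat completions API.
import os
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
    json={
        "model": "z-ai/glm-5",  # assumed slug; verify on openrouter.ai/models
        "messages": [
            {"role": "user",
             "content": "Generate an SVG of a pelican riding a bicycle."},
        ],
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```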
- Aurornis: The benchmarks are impressive, but it's comparing to last-generation models (Opus 4.5 and GPT-5.2). The competitor models are new, but they would easily have had enough time to re-run the benchmarks and update the press release by now. Although it doesn't really matter much: all of the open-weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.
- pcwelder: It's live on OpenRouter now. In my personal benchmark it's bad. So far that benchmark has been a really good indicator of instruction following and agentic behaviour in general. For those who are curious, the benchmark is just the model's ability to follow a custom tool-calling format: I ask it to do coding tasks using chat.md [1] + MCPs. So far it's just not able to follow it at all. [1] https://github.com/rusiaaman/chat.md
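To make the benchmark idea concrete, here is a hypothetical sketch of the kind of check it implies: verifying that a model's output uses one exact custom tool-call wrapper instead of falling back to its native syntax. The `<tool_call>` delimiter below is invented for illustration and is not chat.md's actual format.

```python
# Hypothetical format-adherence check: did the model wrap its tool calls
# in the custom delimiter it was instructed to use?
import re

TOOL_CALL = re.compile(r"<tool_call>\s*\{.*?\}\s*</tool_call>", re.DOTALL)

def follows_format(model_output: str) -> bool:
    """True iff the output contains at least one properly wrapped call."""
    return bool(TOOL_CALL.search(model_output))

def score(outputs: list[str]) -> float:
    """Fraction of task outputs that adhered to the custom format."""
    return sum(map(follows_format, outputs)) / len(outputs)
```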
- Havoc: Been playing with it in opencode for a bit and pretty impressed so far. Certainly more of an incremental improvement than a big-bang change, but it does seem a good bit better than 4.7, which in turn was a modest but real improvement over 4.6. It certainly seems to remember things better and is more stable on long-running tasks.
- justinparus: Been using GLM-4.7 for a couple of weeks now. Anecdotally, it's comparable to Sonnet, but requires a bit more instruction and clarity to get things right. For bigger, complex changes I still use Anthropic's family, but for very concise and well-defined smaller tasks the price of GLM-4.7 is hard to beat.
- kristianp: So that was Pony Alpha (1). Now what's Aurora Alpha? (1) https://openrouter.ai/openrouter/pony-alpha
- cherryteastain: What is truly amazing here is that they trained this entirely on Huawei Ascend chips, per reporting [1]. Hence we can conclude the Chinese semiconductor-to-model tech stack is only 3 months behind the US, considering Opus 4.5 released in November (excluding the lithography equipment here, as SMIC still uses older ASML DUV machines). This is huge, especially since just a few months ago it was reported that DeepSeek were not using Huawei chips due to technical issues [2]. US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallows imports of Nvidia chips as a direct result of past sanctions [3], at a time when the Trump admin is trying to do whatever it can to reduce the US trade imbalance with China. [1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r... [2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch... [3] https://www.reuters.com/world/china/chinas-customs-agents-to...
- esafak: I got fed up with GLM-4.7 after using it for a few weeks; it was slow through z.ai and not as good as the benchmarks led me to believe (especially with regard to instruction following), but I'm willing to give it another try.
- woeirua: It might be impressive on benchmarks, but there's just no way for them to break through the noise from the frontier models. At these prices they're just hemorrhaging money. I can't see a path forward for the smaller companies in this space.
- 2001zhaozhao: GLM-4.7-Flash was the first local coding model that I felt was intelligent enough to be useful. It feels something like Claude 4.5 Haiku, at a parameter size where other coding models are still getting into loops and making bewilderingly stupid tool calls. It also has very clear reasoning traces that feel like Claude's, which makes it possible to inspect its reasoning and figure out why it made certain decisions. So far I haven't managed to get comparably good results out of any other local model, including Devstral 2 Small and the more recent Qwen-Coder-Next.
- beAroundHere: I'd say they're super confident about the GLM-5 release, since they're directly comparing it with Opus 4.5 and don't mention Sonnet 4.5 at all. I'm still waiting to see if they'll launch a GLM-5 Air series, which would run on consumer hardware.
- goldenarm: If you're tired of cross-referencing the cherry-picked benchmarks, here's the geometric mean of SWE-bench Verified & HLE-tools:
  Claude Opus 4.6: 65.5%
  GLM-5: 62.6%
  GPT-5.2: 60.3%
  Gemini 3 Pro: 59.1%
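For anyone checking the arithmetic, the aggregation above is just the square root of the product of the two benchmark scores. A minimal sketch; the per-benchmark inputs below are placeholders, not the models' actual results:

```python
# Geometric mean of two benchmark scores, as used in the table above.
from math import sqrt

def geo_mean(swe_bench_verified: float, hle_tools: float) -> float:
    return sqrt(swe_bench_verified * hle_tools)

# Placeholder example: 80% on SWE-bench Verified, 49% on HLE-tools.
print(f"{geo_mean(0.80, 0.49):.1%}")  # -> 62.6%
```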
- pu_pe: Really impressive benchmarks. It was commonly stated that open-source models were lagging 6 months behind the state of the art, but they are likely even closer now.
- mnicky: What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1]. We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors. I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging, releasing models with artificially capped reasoning to stay at a similar level to their competitors? How do you read this? [1] https://imgur.com/a/EwW9H6q
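As a sketch of what "scales with the log of tokens" means operationally: fit eval score against ln(reasoning tokens) and read off the slope. The data points below are made up to illustrate the shape, not taken from [1].

```python
# Fit score ~ a + b * ln(tokens) to hypothetical eval points.
import numpy as np

tokens = np.array([1_000, 4_000, 16_000, 64_000])  # reasoning budget
scores = np.array([0.42, 0.51, 0.60, 0.69])        # hypothetical evals

b, a = np.polyfit(np.log(tokens), scores, 1)       # slope, intercept
print(f"score ~ {a:.2f} + {b:.3f} * ln(tokens)")
```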
- jnd0: Probably related: https://news.ycombinator.com/item?id=46974853
- algorithm314: Here is the pricing per M tokens: https://docs.z.ai/guides/overview/pricing Why is GLM-5 more expensive than GLM-4.7, even when using sparse attention? There is also a GLM-5-code model.
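Since the thread keeps comparing per-M-token prices, a quick sketch of the cost arithmetic; the prices below are placeholders, not z.ai's actual rates:

```python
# Cost of one request given per-million-token input/output prices (USD).
def request_cost(in_tokens: int, out_tokens: int,
                 price_in: float, price_out: float) -> float:
    return in_tokens / 1e6 * price_in + out_tokens / 1e6 * price_out

# Placeholder prices: $1.00/M input, $3.20/M output,
# on a 50K-token prompt with a 4K-token completion.
print(f"${request_cost(50_000, 4_000, 1.00, 3.20):.4f}")  # -> $0.0628
```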
- nullbyte: GLM-5 beats Kimi on SWE-bench and Terminal-Bench. If it's anywhere near Kimi in price, this looks great. Edit: input tokens are twice as expensive. That might be a deal breaker.
- ExpertAdvisor01: They increased their prices substantially.
- mohas: I kinda feel this benchmarking thing with Chinese models is like university Olympiads: they study specifically for those, but when the time comes for real-world work they seriously lag behind.
- unltdpower: I predict a new speculative market will emerge where adherents buy and sell misween coded companies. Betting on whether they can actually perform their sold behaviors. Passing around code repositories for years without ever trying to run them, factory sealed.
- tgtweak: Why are we not comparing to Opus 4.6 and GPT-5.3-Codex? Honestly, these companies are so hard to take seriously with these release details. If it's an open-source model and you're only comparing open source, cool. If you're not top in your segment, maybe show how your token cost and output speed more than make up for that. Purposely showing prior-gen models in your release comparison immediately discredits you in my eyes.
- meffmadd: It will be tough to run on our 4x H200 node… I wish they had stayed around the 350B range. MLA will reduce KV cache usage, but I don't think the reduction will be significant enough.
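A back-of-envelope KV cache estimate shows why MLA matters at this scale. All configuration numbers below are assumptions for illustration; GLM-5's actual architecture isn't given in the thread.

```python
# KV cache size in GiB. width_per_token is what gets cached per token per
# layer: standard attention stores K and V (2 * num_kv_heads * head_dim);
# MLA stores roughly one compressed latent vector instead.
def kv_cache_gib(layers: int, width_per_token: int, seq_len: int,
                 batch: int, bytes_per: int = 2) -> float:
    return layers * width_per_token * seq_len * batch * bytes_per / 2**30

# Hypothetical 92-layer model, 128K context, batch 8, fp16 cache:
print(kv_cache_gib(92, 2 * 8 * 128, 131_072, 8))  # GQA-style: ~368 GiB
print(kv_cache_gib(92, 576, 131_072, 8))          # MLA latent: ~104 GiB
```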
- karolist: The number of times in the past year that a competitor's benchmarks said something was close to Claude and it actually was remotely close in practice: 0.
- eugene3306: Why don't they publish ARC-AGI results? Too expensive?
- seydor: I wish China would start copying Demis's biotech models soon as well.
- woah: Is this a lot cheaper to run (on their service or rented GPUs) than Claude or ChatGPT?
- surrTurr: We're seeing so many LLM releases that they can't even keep their benchmark comparisons updated.
- dana321: Just tried it; it's practically the same as GLM-4.7. It isn't as "wide" as Claude or Codex, so even on a simple prompt it misses out on one important detail: instead of investigating fully before starting a project, it ploughs ahead with the next best thing it thinks you asked for.
- petetnt: Whoa, I think GPT-5.3-Codex was a disappointment, but GLM-5 is definitely the future!