MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

<- Back

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 tokens per second

gainsurier

Comments (233)

goyozi
Fast AI seems genuinely exciting and somewhat unsettling to me. Right now Claude is faster than me on some tasks but we’re at least close. I have a prompt to clean up a PR that’s been running for 1h now and I expect it to take another few. It’s hard to imagine how the workflow would look like if it was near-instant. On the one hand, it might be easier to focus. Some prompts take so long that I start to multitask and regret it later. On the other, AI that takes a few seconds to max few minutes to solve what used to take hours or days? That’s a game changer and I don’t even know where we fit in.
dakiol
So, regarding the productivity argument: I don't get it. It doesn't really matter (for regular employees) that you can do now in 2h what before it took 2 days. Why? Because it's not that you have the rest of the day for yourself. You still have to work 8h/day as usual. But now the pattern is different: instead of enjoying the craft digging deeper into problems in the span of 2 days, now you are rushing into some slot machine with the hope of it giving you the right answer with the right prompt.So, if any, I would say it's worse for us. Obviously, it's the completely opposite situation for corporations and executives: they are loving the AI situation so much!
amunozo
These price and speed optimization from Chinese providers, combined with the raising prices from American ones will change the game sooner than later. Many companies are finding issues with the AI bills already.
gertlabs
MiMo V2.5 Pro (regular speed) remains the strongest open weights agentic coding model we've tested -- it's been interesting to see how little attention it has received relative to some lower performing releases. And the "fast mode" pricing is very competitive here.Data at https://gertlabs.com/rankings
kingstnap
Given that MiMo is as cheap as Deepseek ( previous discussion: https://news.ycombinator.com/item?id=48282814 ) multiplying that by 3x for ultra speed is still shockingly cheap.
serpix
I may sound like a shill, but exponential growth and all. We are going to get near instant software from prompt, multiple ones and then choose the best one.Discussions about choosing a library with the best syntactic sugar method naming is just as crazy as suggesting we type in assembly.
prplfsh
This will be really powerful for voice. Being able to reason makes LLM so much smarter but with voice your latency budget is so tight that you can't spare the time typically.
eli
Neat. The frontier models have gotten pretty impressive, but they're all a bit too slow for interactive, human-in-the-loop coding. It incentivizes vibecoding and running multiple agents in parallel. A fast agent feels more like a partner.For a while I was running Cerebras GLM 4.7 for a bunch of tasks. Not a very smart model, but it's fantastic to be have a live prototype of a site up and be able to type "make the fonts bigger. No not that big" and see it change in real time. And MiMo 2.5 is a lot more capable than GLM 4.7.
Oras
1k TPS is great, but I’m more fascinated by the amount of AI generated comments in this thread!
scosman
Cerebras is trialing Kimi K2.6 at 3000t/s (invite only). I'm excited for when the fast hardware gets more mainstream for frontier models. Models designed for speed on Nvidia are nice addition that could bridge the gap.
wartywhoa23
An exercise for the near future:Albert has a chalet in swiss alps and an uncles' fortune, burning tokens at 11 kHz.Joe has a rental capsule and a UBI, burning equally priced tokens at 23kHz.Who's the first to solve the problem of maniacs in power?
maxloh
The generation speed in the demo video is crazy, to say the least, and completely beyond my impressions of LLMs.The Xiaomi team really brought something to the table.
GodelNumbering
Below is the part I found most interesting> "However, naively applying FP4 across the entire model causes degradation in complex reasoning, logic, and code generation. Given the MoE (Mixture of Experts) architecture of Xiaomi MiMo-V2.5-Pro — where Experts constitute the vast majority of parameters and exhibit the highest tolerance to quantization — we selectively quantize only the MoE Experts to FP4 while preserving original precision for all other modules. Through FP4 QAT (Quantization-Aware Training), we dramatically reduce model size and maximize hardware bandwidth utilization while keeping the model's overall capability essentially on par with the original, as shown below"
irthomasthomas
I don't understand, given all they say, why this would not be made available to everyone at once? Why the limited release? They should have no trouble scaling it if it runs on a single rack.
minraws
Assuming they mean 8xA100 or similar, that's some rather insane performance, and at just 3x the cost, it still quite cheap-ish. With some optimisations this might be quite interesting.I think the margins are getting quite compressed with this one, since it isn't included in token plan and the actual costs increase are much higher than just 3x. But still fairly decent.
pants2
With a tps and a token price you can calculate approx. price per hour of running the model!$2.61/M tokens * 1,000 tok/s = $9.40/hrThat would be pretty cheap for an 8-GPU node which would typically run around $45/hr or more. Guess this depends on how many parallel streams it can handle.
jbellis
it is hard to understand what the actually meaningful innovations are here / what TileRT is bringing to the table.- dflash: new-ish but February is ancient by the standards of the pace of AI innovation lately, I guess applying it to a 1T model is new-ish in the sense that the dflash researchers don't have the hw budget to prove that out - persistent engine kernel: this is like CUDA 101 - warp specialization: I think this just means "keep different gpu resources all busy w/ pipelining" which is CUDA 201, some of it is even baked into pytorch now - MXFP4 QAT: not new - TileRT: hard to tell what this actually does, there's a PyPi wheel with support for DS 3.2 and GLM 5 but binary only
0xbadcafebee
This is the value prop of Groq and Cerebras. They don't have the best models, but they have the fastest inference, and Groq has both the lowest cost and fastest speed.
npn
How?edit: now I read the article fully, seems like they utilize some very effective MTP algorithm. and somehow the quality is still decent enough.though, I doubt that the quality really only drip a bit like they claimed. maybe for the benchmarks, but for general uses the heavily quantized models very often so worse result.
qsera
Tokens per seconds is the "Megapixels" of AI marketing!
__natty__
With this at 1k tps and Kimi 2.6 1k tps by Cerebras, I believe we are entering the next stage of LLMs, where companies will also compete on throughput
isusmelj
No note about the specific GPU they use. One might speculate. B200? H200? H100?
moffkalast
42B active params, sliding window attention. There's your tradeoff.
h14h
The gated "ultra-speed" phenomenon seen here and with the Cerebras Kimi K2.6 release, while understandable, is somewhat troubling IMO.Getting ~1000 TPS on near-frontier intelligence is a step change, and enables whole new use-cases for applications. Seeing limited compute resources beget selective access makes me worry for the future of competition.
pullshark91
It's interesting but not game-changing IMO. Speed here is not a bottleneck.
aburayhanalif
it is good i think
elar_verole
Yeah, this seems to be the easiest path for overall agents efficiency in the short term
PhunkyPhil
Obligatory taalas mention:https://taalas.com/Despite the performative UI components they have a shipped (demo) product:https://chatjimmy.ai/This is only 3.1 8B and a very small context window, but at 17k tokens per second it's likely enough to reliably call tools which would make a huge difference in agentic applications. Assuming they can bake in better models I'm just as bullish or even moreso on this, considering this opens up edge computing at the extremely low power requirement.High tok/s is the future IMO.
anon
undefined
harel
A few things in life I can't fully grasp why they are so sought after. One is that constant need to exhibit growth. As if being massive and staying as massive is not good enough, one has to always and continuously grow. The other is constant speed increases. We're already operating at 50x speed. My output is much wider and so much faster, I am sometimes my own bottleneck. And now as if that is not enough we want more speed. "I want a full software product from scratch in 12 seconds, Because 5 minute is too long and I got things to do..."Really?
holoduke
Speed is indeed a next big thing what should happen with LLM frontier models. The possibilities with current models but 1000 times faster would be super useful. Earlier this week it took Claude at least full time a week with two max subscriptions to solve a complex issue where we wanted to mimic a occlusion mapping variant used in the game Crimson Desert. Pretty complex mathematical challenge. With a ultra fast LLM and a proper self verification process it would be awesome.
trilogic
Pfff time wasting. 1 password between 8-16 characters, and this and that... What??? 2 Captcha after captcha, come on 3 Service unavailable This service is not available in your region yet.Are you kidding me. Come back when you are ready for the users. I was hopping to try it, what a frustration.
desireco42
I didn't use their pro speed but regular Mimo-v2.5, not even pro, it seems really fast. I have plenty of tokens and subscriptions but this is really impressive. I really don't need another one, but I am tempted simple because it works so fast, can't imagine how this fast service can be.
GaggiX
If MiMo v2.5 Pro can run at >1000tk/s on GPUs then I will soon expect the same from OpenAI/Anthropic/Google.
slopinthebag
I hope this is the next frontier AI labs push. Even the open models are smart enough, and they’re cheap enough, now if they can be fast enough they can make certain workflows possible and allow us to remain in flow state while we use them.
m00dy
boom!
aplomb1026
[flagged]
jingpostmedia
[flagged]
maxothex
[flagged]
FastAnchor
[dead]
atemerev
I test all Chinese models with "What happened on Tiananmen Square at June 4th, 1989?" prompt. MiMo-2.5-Pro so far passes the test (explains the event correctly), both on DeepInfra and Xiaomi providers. So not bad.