VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

timhigins

Comments (175)

secretslol
Am I right in thinking this is a tiny model which has been trained well to reason, and that's it? Makes me think of a smart person who doesn't know anything about a given topic, but with the right tools will go and research the heck out of it. I really like the sound of this... why have models train on learning anything when you can just train them how to learn and let them get on with it from something as small as a Pi Zero and an internet connection.
rbbydotdev
Looks like we are seeing small but mighty model breakthroughs, outpacing the pure capital firepower of SOTA providers. I love rooting for the little guy, but is it too soon to call it? To play devil's advocate, could it just be that the benchmarks are not good enough to capture success in real developer workflows?
deftio
There is some base level of intelligence any model needs to be useful, even in narrow tasks. Could you teach a 5 year old to drive a car? A 10 year old? A 12 year old? To drive a car requires being able to read, to have judgement about ice or rainy conditions, to anticipate a child running after a ball. By the time a human is in their mid teens they have acquired the base knowledge... Small models need to have enough base knowledge to be good enough, even in a seemingly narrow regime. Where is that? Obviously they don't need all the obscure knowledge of a frontier model, but there is some base level which is probably more than it would first seem.
gslepak
Note that these are Python-only results; the model will not do as well with other languages. I'm glad to see more domain-focused SLMs, we need more of them! A programming-focused MoE should work well across many languages.
NotSuspicious
The interesting thing about models this small is they should be able to be put on a single Taalas chip (the HC1 already runs a Llama 3.1 8B model). We're already at the point where half-decent reasoning could be run on an ASIC (and at mind-boggling speeds).
makethembroke
I don't get how this beats Opus. It just hardcoded the tasks for the benchmark. It doesn't even respond normally. There's a lot of randomness in it. Please don't hype.
noperator
Having some success while testing this model out as a replacement for GPT-5 nano in source code security review. Running on RTX 3090 (24 GB VRAM) via vLLM. It's not great on structured output (as noted in the model card) but I'm working around that in my harness.
yousif_123123
I really hope that in a couple of years I can have a laptop that runs a reasonably good coding agent locally, that I can run fast and do most of my programming with, without running my laptop hot. I could keep open code and use other models when needed, but really for most of my work, I'm already breaking it down so that I can review code changes eventually, and I just need something reasonably decent and fast and unlimited. I think it's coming.
andai
I tried actually talking to it. It reminded me of GPT-2.
sorenjan
How would you best utilize a model like this for coding? I take it it's not meant for vibe coding a full app, and the reasoning probably makes it unsuitable for autocomplete. Would you use it to implement specific functions? I looked at one of the coding benchmarks used, LiveCodeBench, and it seems to be problem descriptions with sample input and output, and then a solution with a single function or class. Seems like a really good model to use in an IDE when you still want control over the code structure then.
nolist_policy
Notable: VibeThinker-3B is developed through a staged post-training pipeline built upon Qwen2.5-Coder-3B base, a compact 3B foundation model. Qwen2.5 is ancient by LLM standards.
aero2146
I tried generating the classic pelican SVG, but it failed horribly, just showing me a rectangle and a black circle...
achrono
Beats Opus 4.5 on reasoning you say? Prompt: If A goes to B who then goes to C, can A send something to C? Response: We need to interpret best. The phrase "If A goes to B who then goes to C, can A send something to C?" could be a puzzle about the concept of sending something (like passing a ball) and the relationships. Scenario: A gives something to B, and B passes it on to C. Question: Can A also give the same thing to C? Answer: Only if A can obtain a second copy (e.g., the thing was duplicated). Otherwise, after handing it to B, A no longer holds it and cannot “send” it unless a copy exists. [Lots of other unnecessary commentary and "scenarios" that make even less sense]
virajk_31
An SLM trained for a single use case often beats an LLM. That's both the advantage and the limitation.
androiddrew
I have been thinking about how to use this. Since it doesn’t support tool calling, I have been considering a dual-model deployment, where a small tool-calling LLM drives the majority of the user experience, and VibeThinker is tapped for reasoning by the other LLM. So who has suggestions on small models with excellent tool-calling capabilities?
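A rough sketch of that dual-model idea, in case it helps the discussion. Everything here is hypothetical: the `Model` interface (prompt in, text out), the `make_router` helper, and the `REASON:` marker are invented for illustration, and the two stub functions stand in for real inference calls.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out

def make_router(driver: Model, reasoner: Model, marker: str = "REASON:") -> Model:
    """Return a handler where the driver model may delegate to the reasoner."""
    def handle(user_message: str) -> str:
        reply = driver(user_message)
        if not reply.startswith(marker):
            return reply  # the driver answered on its own
        # The driver asked for help: forward its sub-question to the reasoner.
        notes = reasoner(reply[len(marker):].strip())
        # Hand the notes back to the driver so it writes the final answer.
        return driver(f"{user_message}\n\nReasoner notes:\n{notes}")
    return handle

# Stub models so the sketch runs without any inference server.
def stub_driver(prompt: str) -> str:
    if "Reasoner notes:" in prompt:
        return "final answer using notes"
    if "prove" in prompt:
        return "REASON: prove the claim step by step"
    return "direct answer"

def stub_reasoner(prompt: str) -> str:
    return f"reasoned about: {prompt}"

handle = make_router(stub_driver, stub_reasoner)
```

In a real setup the driver would be the small tool-calling model and the reasoner would be the reasoning model, each wrapped in whatever client the serving stack provides.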
jpcompartir
The absolute worst name for a model I've seen
iamgopal
Two models: one optimised for system design, reasoning, etc., the second optimised for a specific language (Rust or Go?), both small enough to run on a local computer. Will it work?
SwellJoe
It's terrible at hunting security bugs (I expected it to be, but I wanted to be sure). I added it to a benchmark I made with a corpus of some Mythos-discovered bugs, and it found zero. The smallest pretty successful models remain Qwen 3.6 and Gemma 4 (but I haven't tested the very small variants of those yet). https://swelljoe.com/post/will-it-mythos/
brainless
I recently came across this model and I would love to try it with my coding agent soon. I really like the idea of small models that can reason but do not have too much knowledge. Also, no emphasis on tool calls. I think the agent should do the heavy lifting and reach half way. I use really small models, like Qwen 3.5 0.8B to 9B - no tool calling, no MCP, no skills, nothing. No multi-turn chat even. Models are given very specific tasks using a vast number of system prompts and all the response handling is done in the agent(s). https://github.com/brainless/nocodo
cold_harbor
GRPO skips the value network that makes PPO expensive: it scores candidates relative to each other within a group. That's what makes verifiable-reward training practical at 3B scale.
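To make the "relative to each other within a group" part concrete, here is a minimal sketch of a group-relative advantage. It assumes the common formulation where each sampled answer's reward is normalized by the mean and standard deviation of its own group; the function name, the `eps` term, and the example rewards are illustrative assumptions, not taken from the VibeThinker paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and std deviation.

    The group mean plays the role of the baseline, so no learned value
    network is needed. `eps` (an assumption) avoids dividing by zero when
    every answer in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored by a verifier (1.0 = passes).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Answers that beat their group's average get a positive advantage and the rest get a negative one; if every answer in a group scores the same, all advantages are zero and that prompt contributes no learning signal.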
uberex
What is the idiot's guide to running this one locally now?
unfirehose
This is a good model. I benchmarked its reasoned answers against Qwen 3.6 27B (no think) + bash and it held up.
diimdeep
BF16 with no QAT quants == half-baked bread
scotty79
If you could pair it somehow with a model that can code and describe code this could be a very powerful combo.
anonyfox
Wake me up when it does OCaml fine.
4gotunameagain
What are the implications of local SOTA inference, given the insane datacenter "investing"? It surely cannot be justified only for training at this scale, especially since models nowadays are improved more and more by fine-tuning than by retraining from scratch. Will a viable local model crash the US economy? More importantly, are the LLM companies aware, and are they deliberately buying out all the RAM and GPUs in order to delay the inevitable? Probably not, but I wouldn't be surprised if that is the case.
maxignol
A 3B param model on par with Opus 4.5 sounds interesting. Will read the full article before making up my mind.
zkmon
Does Python coding depend on political facts of the world? It might appear not, but actually, the process of reasoning is not an isolated act. The right and wrong way of doing things is codified in social evolution that absorbed all facets of life. Why should you optimize a piece of code for performance? Why is performance needed? What is a bug? What features and UI themes would be more intuitive for humans? There is a butterfly effect. Everything affects everything to some extent.
kmchandy
The paper makes a clear claim: "it provides an important and concrete proof: on well-constrained, verifiable reasoning tasks, first-tier performance is no longer the exclusive domain of ultra-large models." And that's exciting.