Gemma 4 12B: A unified, encoder-free multimodal model

rvz

Comments (185)

senko
I ran the Q4 quant (used with llama.cpp) through my "minesweeper" vibe-coding benchmark: https://senko.net/vibecode-bench/2026/minesweeper-gamma-4-12... The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually: it would add an extra closing bracket or paren a few times, and wanted to separate function definitions with a comma. Not sure what that was about, but otherwise the output ran just fine. So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...) I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12G of VRAM and got 5 t/s for output. Not suitable for interactive coding, but a fairly capable model. To me, it's fascinating how much progress we got in a little over a year. GPT-4.1 was considered an extremely capable coding model. Now we've got something with 12B params performing roughly the same (in this specific benchmark, disclaimers, etc). List of the various models I tested: https://senko.net/vibecode-bench/
minimaxir
The big story here is the encoder-free part, which I still don't fully understand.
> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.
That's technically encoding, just without using a dedicated model for it like SigLIP? The Developer's Guide elaborates: it's still a 35M layer, and I am curious whether it is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.
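To make the quoted description concrete, here is a minimal sketch of what "a single matrix multiplication, positional embedding and normalizations" could look like. It is not Gemma's actual code. The patch size, the use of RMS normalization, the order of the steps, and every name in it (`patchify`, `rms_norm`, `embed_image`) are assumptions for illustration only.

```python
import math
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])  # append the C channel values
            patches.append(vec)
    return patches


def rms_norm(vec, eps=1e-6):
    """Scale a vector to roughly unit root-mean-square (assumed norm type)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def embed_image(image, weight, pos_emb, patch):
    """Turn an image into one d_model-sized token per patch."""
    d_model = len(weight[0])
    tokens = []
    for idx, vec in enumerate(patchify(image, patch)):
        vec = rms_norm(vec)  # normalization before the projection (assumed)
        # The single matrix multiplication: (patch_dim,) x (patch_dim, d_model)
        projected = [
            sum(v * row[j] for v, row in zip(vec, weight)) for j in range(d_model)
        ]
        # Add a per-position embedding so the backbone knows where the patch was
        with_pos = [p + e for p, e in zip(projected, pos_emb[idx])]
        tokens.append(rms_norm(with_pos))  # normalization after (assumed)
    return tokens


# Toy sizes, chosen only so the example runs instantly
random.seed(0)
PATCH, CHANNELS, D_MODEL = 2, 3, 8
image = [[[random.random() for _ in range(CHANNELS)] for _ in range(4)] for _ in range(4)]
patch_dim = PATCH * PATCH * CHANNELS
weight = [[random.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(patch_dim)]
pos_emb = [[random.gauss(0, 0.1) for _ in range(D_MODEL)] for _ in range(4)]

tokens = embed_image(image, weight, pos_emb, PATCH)
print(len(tokens), len(tokens[0]))  # 4 8
```

In this reading, each image patch becomes one token through a plain linear projection, and everything after that is left to the LLM backbone. It still maps pixels into the embedding space, which is the sense in which it is "technically encoding", but no separate pretrained vision model is involved.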
asim
We are now entering the closed loop game. Google doesn't need anyone else to accelerate their models. This is their bread and butter. I'm both shocked and also not surprised that they continue to develop such efficiencies. Honestly it's like silicon and CPU architecture advancement. We kept shrinking it and shrinking it and it kept getting more and more powerful, and here we are with AI and it's only going to be 100x more efficient with time. Maybe there's some point of decay but essentially the next 30 years will be more advanced than the last 30 and we're going to be living in some sort of futurist Blade Runner scenario where gene editing is repairing ageing cells and organs and curing all sorts of cancers that haven't even appeared yet. Beyond our lifetimes people will live to 125 quite steadily and with great mobility, and then obviously people will look at how we get to living 1000 years, which, if anyone is religious, they know Noah and others lived to in a totally different era. Anyway I'm going off on some tangent, but look back 30 years. Now look forward 30 years. It's going to be insane. May God protect us.
adt
https://lifearchitect.ai/models-table/
ethanpil
What's Google's business case for releasing open models? Don't get me wrong, I am grateful and appreciative of these releases. I'm trying to understand how it fits into their bigger picture as a for-profit company. Are they not helping competitors build on the novel technology they have developed? Is it simply goodwill and/or marketing? Or am I missing something strategic?
petercooper
Its image processing is terrible. I ran several tests comparing it against Qwen 3.5 0.8b (yes, 7% the size) and Qwen beat it every time, with Gemma often getting things entirely wrong. I even gave it a plain image saying "This is a test" and it thought for 6 minutes trying to analyze it and failed. Qwen 3.5 0.8b confidently got it in under a second. It may be that the Q6 quant I got is borked (or my LM Studio is), but either way, the 0.8b's performance is mind-boggling in comparison.
ComputerGuru
Quite aside from the architectural changes, I suppose this is the answer to why Google had such a glaring hole in the (pretrained) Gemma4 model lineup between the Gemma4 4b and Gemma4 26b models! A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.
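A back-of-the-envelope check of the 16GB figure discussed in this thread. This is only a rough sketch: it counts weight storage alone, takes "12B" as exactly 12 billion parameters (an assumption), and uses nominal bit widths. Real quantized files carry some overhead, and the KV cache and activations need memory on top of this.

```python
def weight_gb(params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_weight / 8 / 1e9


PARAMS = 12e9  # assumption: "12B" taken as exactly 12 billion parameters

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_gb(PARAMS, bits):.1f} GB")
# 16-bit: 24.0 GB
# 8-bit: 12.0 GB
# 4-bit: 6.0 GB
```

Under these assumptions the unquantized 16-bit weights alone exceed 16GB, so fitting in that budget implies roughly 8-bit or lower, and a 4-bit quant leaves room for context on a 12GB card.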
scirob
Quickly deployed it to check some benchmarks relevant for the German language. These are results for CohereLabs/include-base-44, German only. Gemma 4 12B: 61.9%. Full list: Gemma 4 26B (A4B MoE) 0.647; Qwen 3 14B 0.626; Gemma 4 12B 0.619; Ministral 14B 2512 0.604; Gemma 3 12B 0.547. The Qwen 3 14B vs Gemma 4 12B difference is within random variance; in some repeat runs they actually got the exact same score. The next step up, Gemma 4 31B, gets 0.676 on this. Or, with reasoning enabled, Qwen 3 14B (reasoning) gets 0.676. I'll run some cheat-proof benchmarks tomorrow to see if Qwen is still on top.
djyde
What are the use cases for these small models? Is there anyone using models of this scale in their daily life who could share their experience?
christina97
It seems worse in all aspects than the 26B A4B? I would have thought dense models still beat MoE on many benchmarks? Is the entire point of this model then that it runs if you don’t have enough GPU memory to load the 26B? That one runs faster anyway due to lower active params.
dwa3592
This is a pretty good update. The demo video is a bit funny though: the tester asks it to turn the release into bullet points. Okay, the model obliges. Then the tester says to draft an email with this content. BAM! The LLM turns the content from bullets into passages even though it was not asked to, undoing the last good thing it did. I am not sure if it's email etiquette to not put bullets in an email.
SubiculumCode
"Laptop ready: Small enough to run locally with just 16GB of VRAM or unified memory." I wish. I just have 12.
nickandbro
Wow, Google is becoming the new pre-Llama 4 Meta when it comes to releasing open-weight models.
thomasjb
Unfortunately there's no gguf quants of the assistant model yet: https://huggingface.co/models?other=base_model:quantized:goo...
julianlam
Last time I tried Gemma 4 (26B-A4B) its memory usage would balloon and consume all of my swap until my machine died. Qwen 3.6 on the other hand barely uses any memory at all for its KV cache.
lxgr
Am I missing something or are the Ollama versions of this (https://ollama.com/library/gemma4/tags) text-only for now?
__natty__
It’s fascinating to see how quickly small language models have grown in capability recently while staying small enough for consumers to run on their own machines.
anonova
Do Gemma 4 models compete with Gemini 3.1 Flash-Lite? I would assume even the smallest Gemini model would outperform even Gemma 4 31B, but I can't really get a sense of performance or output quality difference.
Zambyte
Is this Mac only? Or is that an Ollama issue that it only supports this release of models on Mac? It seems like every tag with the MLX badge is only supported on Mac[0], and that includes all of the tags in this release. [0] https://ollama.com/library/gemma4/tags Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.
RandyOrion
A small dense multimodal model with audio support, interesting. Wait, *Excluding Chinese language. This is ... curious. P.S. Where is gemma 4 124b?
zkmon
It's quite interesting to see the quants pour into the HF page. I keep refreshing it and see many new quants every few mins.
randomNumber7
> Novel unified architecture: No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.
I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog).
semiinfinitely
Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away
Havoc
Quite a niche release. The MoE outperforms it on score and will likely be faster thanks to fewer active weights. So this really only makes sense for specific RAM-constrained applications that can’t fit a quantized MoE.
spott
Is there a paper on this? I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off. I wonder how hard it would be to add it back on.
comma_at
Are there Qwen or MiniMax or other open-weight models with the same hardware requirements that outperform this?
zuminator
How does it compare with e4b, aside from being larger?
BiraIgnacio
Using an embedder instead of an encoder is quite clever. Not sure who came up with that first, but it's a cool idea.
SuperV1234
How does this compare to frontier models?
zkmon
I'm waiting for an FP8 quant, preferably from Google.
claysmithr
I don’t see the download in LM Studio.
powera
I'm seeing very low quality results on LM Studio with this model. Worse than Gemma 3 12B. It is getting questions like "David has 18 apples and Ivan has 7 apples. How many apples do they have together?" wrong half the time, while Gemma 3 12B could very consistently answer that. Other smoke tests (like Chinese translation, and the infamous "Rs in Strawberry" test) also show poor results. I don't know if it is a quantization/release issue, if the parameters needed for accurate responses have changed (i.e. it needs "thinking" tokens to handle its base error rate), or if the model has been so focused on audio/video that the text processing is bad.
mlmonkey
Is there some place where we can try it before downloading the gigabytes of weights?
jdelman
I can’t help but wonder if this is the basis of the model they’ve helped tune for Apple.
kordlessagain
Cool!
digdugdirk
I do enjoy the immediate out of touch signaling with the "runs on your 16gb vram laptop" line. Because everyone has a laptop with 16gb vram, or can just pop out and buy a new one, right?