Comments (40)
- Philpax: I regret that the projection models ended up separate, and I too would have preferred for them to be in a single file. I'm not entirely sure why that ended up happening, but it very much runs counter to the single-file ethos I had in mind when I designed GGUF. Hoping that someone will shepherd the cause of merging the two; I think I'm too out of the loop to do it this time around :-)
- uyzstvqs: GGML & GGUF have been extremely important to the open-source ML/AI space. Projects like llama.cpp, whisper.cpp, and stable-diffusion.cpp tend to just work, across a whole bunch of different platforms and hardware backends.
- Sharlin: > The really neat thing about GGUF is that it's just one file. Compare this to a typical safetensors repo on huggingface, where there's a pile of necessary JSON files scattered around [...]
  Funny, to me AI models have "always" been single files, as that's what has been the norm in the local image gen scene. Safetensors files allow stuffing all kinds of metadata inside them too; no GGUF needed for that. Though given that the text encoders of modern models are multi-gigabyte language models themselves, nobody includes redundant copies of those in every checkpoint.
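For reference, the safetensors format does let you embed arbitrary string metadata in the file header, which is what the comment above alludes to. A minimal sketch using the safetensors Python package (the metadata keys here are made-up examples, not any standard):

```python
# Minimal sketch: embedding and reading custom metadata in a .safetensors file.
# Assumes the `safetensors` and `torch` packages; the metadata keys below are
# illustrative only.
import torch
from safetensors.torch import save_file
from safetensors import safe_open

tensors = {"linear.weight": torch.zeros(4, 4)}

# Metadata must be a flat dict of str -> str; anything structured has to be
# serialized to a string first.
save_file(
    tensors,
    "checkpoint.safetensors",
    metadata={"format": "pt", "base_model": "example/sd-like-model"},
)

# Reading the metadata back without loading any tensor data:
with safe_open("checkpoint.safetensors", framework="pt") as f:
    print(f.metadata())  # {'format': 'pt', 'base_model': 'example/sd-like-model'}
```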
- theapadayo: IMO the biggest thing still missing is an actual way to define the model architecture, beyond having it hard-coded into the current build. It doesn't need to reach 1:1 performance parity with the fully supported models. Having proper, vendor-validated support on day 1 is the difference between people thinking a model is amazing vs. horrible. See the recent Gemma vs Qwen releases.
  Not sure what the solution is, other than writing a DSL to describe the model graphs, which you then embed in the GGUF. The other fallback is to just read the PyTorch modules from the official model releases and convert them to GGML ops somehow.
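Purely as a hypothetical illustration of the DSL idea in the comment above (nothing like this exists in GGUF today), an architecture description embedded as metadata might look like a declarative op list that the runtime lowers to GGML ops:

```python
# Hypothetical sketch only: a declarative graph description that a converter
# could embed in GGUF metadata (e.g. as JSON under a made-up key such as
# "general.graph") and a runtime could lower to GGML ops. GGUF defines no
# such mechanism today.
import json

block = [
    {"op": "rms_norm",  "in": ["x"],                              "out": "h", "eps": 1e-5},
    {"op": "mat_mul",   "in": ["blk.{i}.attn_q.weight", "h"],     "out": "q"},
    {"op": "mat_mul",   "in": ["blk.{i}.attn_k.weight", "h"],     "out": "k"},
    {"op": "mat_mul",   "in": ["blk.{i}.attn_v.weight", "h"],     "out": "v"},
    {"op": "rope",      "in": ["q"], "out": "q", "theta": 10000.0},
    {"op": "rope",      "in": ["k"], "out": "k", "theta": 10000.0},
    {"op": "attention", "in": ["q", "k", "v"],                    "out": "a"},
    {"op": "mat_mul",   "in": ["blk.{i}.attn_output.weight", "a"], "out": "o"},
    {"op": "add",       "in": ["x", "o"],                         "out": "x"},
]

graph = {"version": 1, "repeat": "n_layers", "block": block}
print(json.dumps(graph, indent=2))
```

The hard part, as the comment implies, is that such a description still has to cover every attention variant, cache layout, and fused kernel a new architecture needs, which is why vendor-validated C++ tends to get written anyway.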
- amelius: > <|turn>user Hi there!<turn|><|turn>model Hi there, how can I help you today <turn|>
  Good lord, they managed to invent a format that is even less readable than XML.
- badsectoracula: > not to be confused with the somewhat baffling llama_chat_apply_template exposed in the libllama API, which hardcodes a handful of chat formats directly in C++
  As someone who is tinkering with a desktop-based inference app in FLTK[0], i wish this used the actual Jinja2 template parser llama.cpp uses (or there was another C function that did that, since AFAICT for "proper" parsing you need to be able to pass a bunch of data to the template so it knows if you, e.g., do tool calling). Currently i'm using this ad-hocky function, but i guess i'll either write a Jinja2 interpreter or copy/paste the one from llama.cpp's code (depending on how i feel at the time :-P).
  But yeah, GGUF's "all-in-one" approach is very convenient. And i agree that it feels odd to have the projection models as separate files - i remember when i first downloaded a vision-capable model, i just grabbed whatever GGUF looked appropriate, then llama.cpp told me it couldn't load the model and it took me a bit to realize that i had to download an extra file. Literally my thought once i did was "wasn't GGUF supposed to contain everything?" :-P
  [0] https://i.imgur.com/GiTBE1j.png
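For context on the Jinja2 point: GGUF files typically carry the model's chat template as a Jinja2 string under the tokenizer.chat_template metadata key, and rendering it yourself needs exactly the kind of extra variables the comment mentions. A rough Python sketch; the template string below is a simplified stand-in, and real templates shipped in GGUF metadata are longer and may reference further variables (tool definitions, system prompts, etc.):

```python
# Rough sketch: rendering a GGUF-style chat template with Jinja2 directly,
# instead of relying on hard-coded chat formats. The template is a simplified
# stand-in, not one taken from a real model.
from jinja2 import Environment

chat_template = (
    "{% for m in messages %}"
    "<|turn|>{{ m['role'] }}\n{{ m['content'] }}<|end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|turn|>assistant\n{% endif %}"
)

env = Environment(keep_trailing_newline=True)
prompt = env.from_string(chat_template).render(
    messages=[
        {"role": "user", "content": "Hi there!"},
        {"role": "assistant", "content": "Hi, how can I help you today?"},
        {"role": "user", "content": "What is GGUF?"},
    ],
    add_generation_prompt=True,  # ask the model to continue as the assistant
    bos_token="",                # some templates expect these to be defined
    eos_token="",
)
print(prompt)
```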
- ge96: Nice, I recently pulled down TheBloke's Mistral 7B to try out; I have a 4070.
- monocasa: I mean, one of the big issues I've had is that it doesn't really store the compute graph. It only stores a string naming the foundational architecture, along with parameter metadata that lets you rebuild the compute graph.
  That means that every new foundational model architecture requires new code in whatever is consuming the GGUF to support that model.
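To make that concrete: what a GGUF file records is essentially a key-value header (e.g. general.architecture plus hyperparameter keys) and a list of tensors, not an executable graph. A rough sketch of dumping those fields with the gguf Python package that ships alongside llama.cpp (field-access details may differ between package versions):

```python
# Rough sketch: inspecting a GGUF header with the `gguf` Python package.
# The consuming runtime still needs code that knows what the architecture
# string (e.g. "llama", "gemma") means; only metadata and tensors are in the file.
import sys
from gguf import GGUFReader

reader = GGUFReader(sys.argv[1])  # path to a .gguf file

# Key-value metadata: architecture name, hyperparameters, tokenizer config, ...
for name, field in reader.fields.items():
    print(name)  # e.g. general.architecture, llama.block_count, ...

# Per-tensor metadata: names, shapes, quantization types.
for tensor in reader.tensors:
    print(tensor.name, tensor.shape, tensor.tensor_type)
```

Decoding an individual field's value takes a little more work and has changed between versions of the package, so it's omitted here.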
- kenreidwilson>Published May 18, 2026hmmm...
- halyconWays: Fun lore: GGUFs were once called GGJTs, until I caught "JT" (Justine Tunney) stealing the memory-map code from a user (slaren) who did 99% of the work in a draft PR, lying about it, and misrepresenting or not understanding how memory mapping worked. She wanted her initials in the file format for bragging rights, because it was claimed to cause a 90% memory reduction (actually it was just lazy loading into memory). Gerganov was quite angry when he found out what happened. Jart (JT) was then banned from the llama.cpp repo, but managed to get back in a year or so later.