Comments (36)
- mingodad: I'm still a bit confused, because it says "All uploads use Unsloth Dynamic 2.0", but then the available options for 4 bits include: IQ4_XS 5.17 GB, Q4_K_S 5.39 GB, IQ4_NL 5.37 GB, Q4_0 5.38 GB, Q4_1 5.84 GB, Q4_K_M 5.68 GB, UD-Q4_K_XL 5.97 GB, with no explanation of what they are or what tradeoffs they have, while the tutorial explicitly uses Q4_K_XL with llama.cpp. I'm using a Mac mini M4 with 16 GB, and so far my preferred model is Qwen3-4B-Instruct-2507-Q4_K_M, although it's a bit chatty; my tests with Qwen3.5-4B-UD-Q4_K_XL show it's a lot more chatty. I'm basically using it in chat mode for basic man-page-style questions. I understand that each user has their own specific needs, but it would be nice to have a place listing typical model/hardware combinations with their common config parameters and memory usage. Even on the specialized Reddit channels it's a bit of a nightmare: lots of talk, but no concrete, clear config/usage examples. I've been following this topic heavily for the last 3 months, and I see more confusion than clarification. Right now I'm getting good cost/benefit results with the Qwen CLI with the coder model in the cloud, and I'm watching constantly for when a local model on affordable hardware with environmentally friendly energy consumption arrives.
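  A back-of-the-envelope check helps when picking a quant for a given machine. This is a rough sketch, not a definitive sizing tool: the model shape below (36 layers, 8 KV heads, head_dim 128) and the 1.5 GB runtime overhead are illustrative assumptions, loosely modeled on a 4B-class model.

```python
# Back-of-the-envelope sizing for a GGUF quant: assume (roughly) that
# resident memory ~= quant file size + KV cache + fixed runtime overhead.
# The model shape and 1.5 GB overhead are illustrative assumptions.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    """fp16 K and V caches across all layers, in GB."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

def fits(file_gb: float, kv_gb: float, ram_gb: float,
         overhead_gb: float = 1.5) -> bool:
    return file_gb + kv_gb + overhead_gb <= ram_gb

# Hypothetical 4B-class shape: 36 layers, 8 KV heads, head_dim 128, 8k context
kv = kv_cache_gb(36, 8, 128, 8192)
print(f"KV cache: {kv:.2f} GB")   # ~1.21 GB
print(fits(5.68, kv, 16))         # Q4_K_M-sized file on a 16 GB machine -> True
```

  On these assumptions, any of the listed 4-bit quants fits comfortably in 16 GB at an 8k context; the differences between them are about quantization quality per byte, not whether they load.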
- antirez: My private benchmarks, using DeepSeek replies to coding problems as a baseline, with Claude Opus as judge. When reading these percentages, consider that the no-think setup is much faster and may be more practical for most situations.
  1. DeepSeek API -- 100%
  2. qwen3.5:35b-a3b-q8_0 (thinking) -- 92.5%
  3. qwen3.5:35b-a3b-q4_K_M (thinking) -- 90.0%
  4. qwen3.5:35b-a3b-q8_0 (no-think) -- 81.3%
  5. qwen3.5:27b-q8_0 (thinking) -- 75.3%
  I expected the 27B dense model to score higher. Disclaimer: those numbers are from one-shot reply evaluations; the model was not put in a context where it could reiterate as an agent.
- moqizhengz: Running the 3.5 9B on my ASUS 5070 Ti 16 GB with LM Studio gives a stable ~100 tok/s. This outperforms the majority of online LLM services, and the actual quality of the output matches the benchmark. This model is really something; it's the first time I've ever had a usable model on consumer-grade hardware.
- Curiositry: Qwen3.5 9B seems to be fairly competent at OCR and text-formatting cleanup running in llama.cpp on CPU, albeit slowly. However, I have compiled it umpteen ways and still haven't gotten GPU offloading working properly (which I had with Ollama) on an old 1650 Ti with 4 GB VRAM: it tries to allocate too much memory.
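  For what it's worth, when full offload over-allocates on a small card, llama.cpp supports partial offload: `-ngl` sets how many layers go to the GPU, so you can lower it until the model fits. A hedged sketch, assuming a CUDA build of llama.cpp; the model filename is illustrative, not a real download:

```shell
# Build llama.cpp with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Offload only some layers to the 4 GB card; lower -ngl until it fits.
# -c 4096 keeps the KV cache modest. The model path is illustrative.
./build/bin/llama-cli -m qwen3.5-9b-q4_k_m.gguf -ngl 12 -c 4096 \
    -p "Clean up this OCR output: ..."
```

  Watching the VRAM figures llama.cpp prints at startup while adjusting `-ngl` is usually enough to find the largest value that doesn't over-allocate.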
- b89kim: I've been benchmarking GGUF quants for Python tasks on a few hardware configs:
  - 4090: 27b-q4_k_m
  - A100: 27b-q6_k
  - 3x A100: 122b-a10b-q6_k_L
  Using the Qwen team's "thinking" presets, I found that non-agentic coding performance doesn't feel like a significant leap over unquantized GPT-OSS-120B. It shows some hallucination and repetition on MuJoCo code with the default presence penalty. 27b-q4_k_m on a 4090 generates 30-35 tok/s at good quality.
- vvram: What hardware configurations/systems would be recommended as optimal?
- Twirrim: I've been finding it very practical to run the 35B-A3B model on an 8 GB RTX 3050; it's pretty responsive and does a good job on the coding tasks I've thrown at it. I need to grab the freshly updated models; the older one seems to occasionally get stuck in a loop with tool use, which they suggest they've fixed.
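  One trick that reportedly makes MoE models like the 35B-A3B viable on small GPUs with llama.cpp is keeping the expert weights in system RAM while offloading everything else: only a few experts are active per token, so the GPU holds the hot path. A sketch, assuming a recent llama.cpp build with tensor overrides (`-ot` / `--override-tensor`); the tensor-name regex and model filename are assumptions:

```shell
# Keep MoE expert weights on CPU (-ot matches tensor names by regex)
# and offload the remaining layers to the 8 GB GPU with a high -ngl.
# The model path and regex are illustrative assumptions.
./build/bin/llama-server -m qwen3.5-35b-a3b-q4_k_m.gguf \
    -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 8192
```

  The tradeoff is that expert matmuls run on the CPU, but with only ~3B active parameters per token this can still be quite responsive.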
- KronisLV: I had an annoying issue in a setup with two Nvidia L4 cards where trying to run the MoE versions to get decent performance just didn't work with Ollama; it seems the same as these:
  https://github.com/ollama/ollama/issues/14419
  https://github.com/ollama/ollama/issues/14503
  So for now I'm back to Qwen 3 30B A3B, which is kind of a bummer, because the previous model is pretty fast but kinda dumb, even for simple tasks like on-prem code review!
- sieste: > you can use 'true' and 'false' interchangeably.
  Made me laugh, especially in the context of LLMs.