Comments (36)
- mingodad: I'm still a bit confused, because it says "All uploads use Unsloth Dynamic 2.0", but then the available options for 4 bits include: IQ4_XS 5.17 GB, Q4_K_S 5.39 GB, IQ4_NL 5.37 GB, Q4_0 5.38 GB, Q4_1 5.84 GB, Q4_K_M 5.68 GB, UD-Q4_K_XL 5.97 GB, with no explanation of what they are or what tradeoffs they have, while the tutorial explicitly uses Q4_K_XL with llama.cpp. I'm using a Mac mini M4 with 16 GB, and so far my preferred model is Qwen3-4B-Instruct-2507-Q4_K_M, although it's a bit chatty; my tests with Qwen3.5-4B-UD-Q4_K_XL show it's a lot more chatty. I'm basically using it in chat mode for basic man-page-style questions. I understand that each user has their own specific needs, but it would be nice to have a place listing typical model/hardware combinations with their common config parameters and memory usage. Even on the specialized Reddit channels it's a bit of a nightmare: lots of talk, but no concrete, clear config/usage examples. I've been following this topic heavily for the last 3 months, and I see more confusion than clarification. Right now I'm getting good cost/benefit results with the Qwen CLI with the coder model in the cloud, and I'm watching constantly for when a local model on affordable hardware with environmentally friendly energy consumption arrives.
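  A back-of-the-envelope check helps when picking a quant for a given machine. This is a rough sketch, not a definitive sizing tool: the model shape below (36 layers, 8 KV heads, head_dim 128) and the 1.5 GB runtime overhead are illustrative assumptions, loosely modeled on a 4B-class model.

```python
# Back-of-the-envelope sizing for a GGUF quant: assume (roughly) that
# resident memory ~= quant file size + KV cache + fixed runtime overhead.
# The model shape and 1.5 GB overhead are illustrative assumptions.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    """fp16 K and V caches across all layers, in GB."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem / 1e9

def fits(file_gb: float, kv_gb: float, ram_gb: float,
         overhead_gb: float = 1.5) -> bool:
    return file_gb + kv_gb + overhead_gb <= ram_gb

# Hypothetical 4B-class shape: 36 layers, 8 KV heads, head_dim 128, 8k context
kv = kv_cache_gb(36, 8, 128, 8192)
print(f"KV cache: {kv:.2f} GB")   # ~1.21 GB
print(fits(5.68, kv, 16))         # Q4_K_M-sized file on a 16 GB machine -> True
```

  On these assumptions, any of the listed 4-bit quants fits comfortably in 16 GB at an 8k context; the differences between them are about quantization quality per byte, not whether they load.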
- antirez: My private benchmarks, using DeepSeek replies to coding problems as a baseline, with Claude Opus as judge. When reading these percentages, consider that the no-think setup is much faster and may be more practical for most situations.
  1. DeepSeek API -- 100%
  2. qwen3.5:35b-a3b-q8_0 (thinking) -- 92.5%
  3. qwen3.5:35b-a3b-q4_K_M (thinking) -- 90.0%
  4. qwen3.5:35b-a3b-q8_0 (no-think) -- 81.3%
  5. qwen3.5:27b-q8_0 (thinking) -- 75.3%
  I expected the 27B dense model to score higher. Disclaimer: those numbers are from one-shot reply evaluations; the model was not put in a context where it could reiterate as an agent.
- moqizhengz: Running the 3.5 9B on my ASUS 5070 Ti 16 GB with LM Studio gives a stable ~100 tok/s. This outperforms the majority of online LLM services, and the actual quality of the output matches the benchmark. This model is really something; it's the first time I've ever had a usable model on consumer-grade hardware.
- Curiositry: Qwen3.5 9B seems to be fairly competent at OCR and text-formatting cleanup running in llama.cpp on CPU, albeit slowly. However, I have compiled it umpteen ways and still haven't gotten GPU offloading working properly (which I had with Ollama) on an old 1650 Ti with 4 GB VRAM: it tries to allocate too much memory.
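  For what it's worth, when full offload over-allocates on a small card, llama.cpp supports partial offload: `-ngl` sets how many layers go to the GPU, so you can lower it until the model fits. A hedged sketch, assuming a CUDA build of llama.cpp; the model filename is illustrative, not a real download:

```shell
# Build llama.cpp with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# Offload only some layers to the 4 GB card; lower -ngl until it fits.
# -c 4096 keeps the KV cache modest. The model path is illustrative.
./build/bin/llama-cli -m qwen3.5-9b-q4_k_m.gguf -ngl 12 -c 4096 \
    -p "Clean up this OCR output: ..."
```

  Watching the VRAM figures llama.cpp prints at startup while adjusting `-ngl` is usually enough to find the largest value that doesn't over-allocate.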
- b89kim: I've been benchmarking GGUF quants for Python tasks on a few hardware configs:
  - 4090: 27b-q4_k_m
  - A100: 27b-q6_k
  - 3x A100: 122b-a10b-q6_k_L
  Using the Qwen team's "thinking" presets, I found that non-agentic coding performance doesn't feel like a significant leap over unquantized GPT-OSS-120B. It shows some hallucination and repetition on MuJoCo code with the default presence penalty. 27b-q4_k_m on a 4090 generates 30-35 tok/s at good quality.
- vvram: What hardware configurations/systems would be recommended as optimal?
- Twirrim: I've been finding it very practical to run the 35B-A3B model on an 8 GB RTX 3050; it's pretty responsive and does a good job on the coding tasks I've thrown at it. I need to grab the freshly updated models; the older one seems to occasionally get stuck in a loop with tool use, which they suggest they've fixed.
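  One trick that reportedly makes MoE models like the 35B-A3B viable on small GPUs with llama.cpp is keeping the expert weights in system RAM while offloading everything else: only a few experts are active per token, so the GPU holds the hot path. A sketch, assuming a recent llama.cpp build with tensor overrides (`-ot` / `--override-tensor`); the tensor-name regex and model filename are assumptions:

```shell
# Keep MoE expert weights on CPU (-ot matches tensor names by regex)
# and offload the remaining layers to the 8 GB GPU with a high -ngl.
# The model path and regex are illustrative assumptions.
./build/bin/llama-server -m qwen3.5-35b-a3b-q4_k_m.gguf \
    -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 8192
```

  The tradeoff is that expert matmuls run on the CPU, but with only ~3B active parameters per token this can still be quite responsive.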
- KronisLV: I had an annoying issue in a setup with two Nvidia L4 cards where trying to run the MoE versions to get decent performance just didn't work with Ollama; it seems the same as these:
  https://github.com/ollama/ollama/issues/14419
  https://github.com/ollama/ollama/issues/14503
  So for now I'm back to Qwen 3 30B A3B, which is kind of a bummer, because the previous model is pretty fast but kinda dumb, even for simple tasks like on-prem code review!
- sieste: > you can use 'true' and 'false' interchangeably.
  Made me laugh, especially in the context of LLMs.