Qwen 3.6 27B is the sweet spot for local development

<- Back

Qwen 3.6 27B is the sweet spot for local development

stared

Comments (414)

iagooar
I love my MacBook Pro M5 128GB RAM and I love qwen3.6.BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.Thank me later.
bensyverson
The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...
onion2k
None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.
doodlesdev
I feel like I'm going insane seeing people buy these 128gb MBP for thousands of dollars to run models that are objectively much worse than SOTA and spending so much more. The amount spent on a 128gb M5 MAX can buy you a damned new car here. What the hell am I missing? Are developers in other countries living in such different worlds?(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)
mips_avatar
I think the sweet spot right now is 2x 3090s and a pcie 4 motherboard with 64-128 gb of ddr4 ram, you can build this right now for $3k and it runs qwen 27b/35b stupid fast at int4.
XCSme
Considering the cloud version, all three models compared in the article (Qwen 3.6 35BA3b, 3.6 27B and DeepSeek V4 Flash), have very similar performance[0], BUT on cloud, for some reason DeepSeek V4 Flash is 10-20x cheaper than the Qwen models.If Qwen models are so much easier to run, why are the providers charging more than V4 Flash?[0]: https://aibenchy.com/compare/qwen-qwen3-6-35b-a3b-medium/qwe... <-- compare how the three models draw hamsters svgs, lol
beastman82
FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.QAT, MTP, 128k context.I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.
cpburns2009
Before you run and go purchase a unified memory computer (e.g., DGX Spark, Mac, Ryzen AI Max 395 / Strix Halo), be aware dense models generally run slow on these machines. Dedicated GPUs run dense models significantly better. Look for benchmarks for your prospective machine. If you really want one of these, you'll be better off running Qwen 3.6 35B or another sparse MoE model.
SamInTheShell
This is probably the first small model I got through some simple web game tests with. It tends to opt to overwrite an entire file instead of doing edits... which editing is where most of these small models fall apart along with getting stuck in repeating loops. Only 24k tokens in so far, it did some decent newbie work.
ctkhn
I have been running qwen 3.6 35b a3b with opencode on my macbook pro 16" with m3 max and 64gb ram, and it's been great for local planning and coding. To be honest I have been on and off wishing I had future proofed with the 128gb after seeing how powerful 64gb is. On the other hand, I also haven't run up against a wall with a model that is just slightly larger than qwen.
zx76
I see a lot of people writing about how expensive the hardware to run these local models is - but see no mentions of the Intel Arc Pro B50/B60/B70 which seem like decent value if you're not interested in Apple kit (as much as anything can be decent value in the current status quo).I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off.I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably.I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to support these products.
0x0000000
> ... on my Macbook Max M5 128 GBLocal development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?
starefossen
We have have had the same experience (qwen3.6 rocks) when we are evaluating local models for our developers in the Norwegian Government https://github.com/navikt/mlx-workspace
cloudengineer94
I'm using Qwen and Gemma 4 locally and it's pretty good stuff, not frontier level but gets the job done.
mark_l_watson
I can come close to agreeing because queen-3.6-27b is my second favorite for local coding. I am using gemma4:26b-a4b-it-qat-48k (the "-48k" is from my modifying a model run with Ollama to always use a 48K context size). On a 32G Mac I use gemma4:26b-a4b-it-qat-48k and OpenCode and on my 16G MacBook Air I use gemma4:12b-it-qat-16k ("-16k" is my resizing context size) and little-coder. I break up projects into small libraries because local coding works better for me using small code bases.I find that for local coding, I need to spend a lot of time building concise SKILLs for specific things I work on and try to only enable one or two skills per coding session.To the author of the linked article nice job, and if you feel like adding to it, please add details on your setup.
simplyluke
The open source models have gotten heavily conflated with local development. While that is cool and I'm excited about the future of local LLMs, it is not necessary to play around with these models. Without shilling for companies I don't have a relationship with, there are a number of companies who will give you an API just like Anthropic/OpenAI and you pay per token, albeit much cheaper than the frontier labs.I've been using the full GLM 5.2 model this way (through opencode) at work for the past week. It's quite impressive.
ljosifov
Running 27B dense model on M5 128GB is ok, but one can do better.On M5 128GB one can make use of the ram and use sparse MoE. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably improve 2x the token/sec speed, given DS4F 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen.27B Of the Qwen fit even on a cheaper 24GB card, e.g. amd 7900xtx (<$1K?) or slightly dearer nvidia 3090 (with cuda). With ~900 GB/s bandwidth they will likely be ~50% faster than the M5 with 600 GB/s.
pkroll
Since no one else posted it... I have open-webui pointed at a linux box with 128 gig of ram and an RTX Pro 6000, and after a couple of runs on trivia, had it do one of Open WebUI's conversation starters: "Show me a code snippet of a website's sticky header in CSS and JavaScript."72.06 t/s. That's the full Qwen 3.6 27B model BF16, using MTP, running on Ollama. Yes I know I should bite the bullet and get vllm running on that box.That was, also, at a 570 watt limit: I normally run a little less, but when I first tried this I actually forgot I had set the limit to 300 (it's a hot day, I figured why fight the A/C?), and at 300 watts the same question came back at 69.38 t/s. (The extra power matters more for compute bound things, the difference in generating LTX2.3 videos is considerably higher... but still not linear.)
RedCinnabar
Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point on using these toy models.
rhgraysonii
I have been having pretty good success with Qwen 3.5 9B for "nontrivial but not challenging work all things considered" -- it runs great on my 24gb unified memory m4 pro MacBook Pro. What do the baseline specs look like Mac-wise for getting this model to run? Am I looking at a 96gb? 128? 256?
christoff12
I just burned 20 minutes because I wanted to play hex minesweeper: https://hexabomb.pgpln.appSource: https://chatgpt.com/share/6a42dd8a-4e28-83e8-9ef7-6ba56d665c...
jjcm
I'd also look at the qwopus distil if you're using qwen 3.6 27b. It's a nice refinement of the current 27b with slightly better stats.Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong
hoppp
Its feasible but that laptop is not available for most devs.I do have access for a 64 gb ram mac mini but most people don't.
recursivedoubts
I would like to offer someone the next openclaw: a GUI for the mac that allows people to manage and install local models with a single click, provides GUI tools for tweaking important aspects of them, and also provides a good command line interface to those models.
marcuskaz
When is Amazon Bedrock going to get these newer models?Offloading compute to them is much easier, except its still a limited set of open models. Most companies are already running in AWS, so it's an easy add, models run in a trusted location, just another line item on the Amazon bill. You don't have to talk anyone into signing up with a new vendor. Plus you don't have to worry about local hardware at all.
kpw94
> What it does:>> --jinja for tool calling supportPretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year
Otternonsenz
Is there any hope for people that cant even run 27B parameters, Qwen3.6 or otherwise? Are there any quantized models that do well with tool calling at smaller parameter sizes?I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.
IronWolve
I think things are moving fast, tested that new vibethink-3B, works on many small tasks/fast, and playing with ornith-35B with a draft vibethinker-3b as a draft gave me some good speed/results.Was just trying to see how small I could go and get acceptable results, but yeah, larger Qwen 3.6 with MTP is going to be better. Cant wait to see how AI model (unsloth/local-llm/heretic/reaper/etc communities) are tweaking/engineering quality down into smaller models. Lots of new things coming out.
zedascouves
Just tried on some arduino code. after 10 minutes i got a list of improvements to my code.I ran those throu opus saking if it was good advice and was not impressed:I read the actual qr_scanner.ino. Short answer: partially, but I'd push back on most of it. That review reads like generic ESP boilerplate advice written against an imagined version of your code — several of its "fixes" are already in your file, and its headline "critical" claim misreads what the code does. Going point by point:...
jboss10
I don't understand the talk about how expensive the hardware is. These models can run on very old or old and low end. I've been running Qwen3.6-35B Q4 on an old 1080 GPU(8GB vram) with 32GB sys RAM. I have a i7-12700.It does about 30 tok/s which is enough for me. It's about half what the online models do, but it's enough.I've heard their 9B models are also good, but they aren't much faster if you have the ram and a nice cpu.These qwen3.6 models are the first ones I find can do much. GPT OSS was good, and Gemma4 is better. Gemma knows more facts, but qwen3.6 is smarter.
MangoCoffee
Running LLMs locally for development doesn’t make sense to me. The hardware gets outdated in just a few years. Even hyperscalers replace their GPUs faster than they can buy them, plus the cost of running it locally, isn’t cheap. the cost saving just ain't there.
hollowturtle
> Real workOk that's the part I'm interested in, don't care about minesweeper clones....> Make a landing page selling candles for women that are into wellbeing and SPA.can't be serious...
blopker
I've been working with local models for the past year. There's so many possibilities, but I don't think coding is one. Coding requires so many layers beyond inference; I spent so much time trying to replicate what Claude Code does end to end locally. Understanding all the layers and keeping up with the advancements feels like a slog. Even this article messes up and misunderstands what some of the settings are doing. Qwen in particular seems to work at first, then often gets stuck in thought loops when used for actual work.However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.
diseasedyak
I have 24GB of VRAM (via a RTX 4090) and run Qwen3.6-35b:iq4, so it's importance-aware quantization and isn't nearly as dumb as it sounds like, fitting the 35b into 18 GB so you have some left over. So far I've had no issues, other than it taking a while for things like image gen, which I found out if you're gonna do with any alacrity, just have a cloud model do it.For anything else local, including writing some automation scripts and such, it works great.
seemaze
I was interested to see that Qwen3.5-122B-A10B narrowly beat Qwen3.6-27B on Donato Capitella's SWEBench-verified-mini run with a similar 128GB UMA architecture.https://pi-local-coding-bench.dev
blueside
i have been trying several open source models for the last few years. running qwen 3.6 27b on my 4090 is the first local llm i have used that made me start to second question if anthropic and openai are actually worth the (already) insane valuations.don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.
HotGarbage
And AI companies will continue to buy up all the silicon to make this prohibitively expensive to run at home.
cdnsteve
Checkout details on what this runs on for local AI here: https://tokenstead.ai/models/qwen3-6-27b
zerolines
Yup, been rocking theQwen3.6-35B-A3B-MTP-GGUF locally with 88tk/s it's amazing.
aand16
I've come from the future to say Qwen 3.7 27B is just around the corner and slaps!
v3ss0n
3.5 122B is much better. 27 B is bad at Long context and Svelte
markdog12
I've tested it extensively for actual local development for my job, and hard disagree here. It's a waste of time to use it. Wish it were not true.
narrator
In hindsight, the Mac 512gb for about $10k was a total steal given that to run GLM 5.2 you need a 4x H100 to get the necessary amount of VRAM. Yeah the h100 is 2 to 8 times faster, but it's $20k a month to rent a 4xH100 VPS.
dom96
What do folks use to keep on top of new model releases that are appropriate to their system? i.e. the models that will actually work on the MacBook Pro with 48GB of RAM or whatever their specs are.I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.
alansaber
Is qwen finetuned/RL'd on any agent harness? Or does it just work well enough off the bat with opencode?
felooboolooomba
What's the minimum requirement for a Nvidia card to run it? For let's say 10 t/s.
mbgerring
Something I find really confusing from this post is the MLX versions of the model running much slower. As I understand it, these model versions are meant to take advantage of Apple Silicon and MacOS APIs, and should produce better/faster results. Any insight into what’s happening here?
blobbers
How does llama.cpp use the GPU efficiently as opposed to MLX?Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.If I can generate voice at the same time as video, that would be useful.
devin
If I have 10k to spend, what should I buy for the best local model experience?
SkitterKherpi
27-30B in general seems to be the level where you actually start having decent models. I just wish consumer hardware hadn't stagnated so much that we can't easily go higher than that, and that even running those requires a $5k machine now.
drillsteps5
I honestly don't get the hostility against local models in this thread (and in some other threads recently).I haven't seen anyone make an argument they are as good as SotA (OpenAI, Anthropic). It's just they are approaching state where they are "as good" for some _limited_ set of use cases. Which will allow us to resolve 2 primary issues with these SotA models: privacy and vendor lock-in. Plus, they're very useful for education purposes, you get to explore what things looks like under the hood, play with various models, tools, maybe put something simple together yourself.You get Macbook - great. You got gaming rig with a decent GPU - great (set it up as a dedicated server that you connect to through simple REST).What exactly is wrong with any of that?
prasanthabr
Has anyone considered a home server? Assuming mobility is not important if we pick components to match a similar hardware would it be more value for money?
anonym29
Strix Halo user here. While Qwen 3.6 27B exhibits remarkable intelligence density, I will still take unsloth's dynamic IQ2_XXS of Minimax M2.7 over Q8_0 Qwen 3.6 27B any day of the week, and this isn't just because of generation speed either. I wrote my own custom harness, and I get hallucinated tool call parameters and bizarre invocations with Q3.6 27B even at Q8_0, but no issues with the IQ2_XXS of M2.7.
cat_plus_plus
Gemma4 31B with MTP enabled is faster and I feel a bit stronger at coding. Either one can run in 32GB VRAM or unified RAM with some tuning (3 bit weights, 8 bit kv cache)
verdverm
Qwen's new AgentWorld model is good too: https://huggingface.co/Qwen/Qwen-AgentWorld-35B-A3BI'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark
ascii0eks84
Very capable lora adapters are surfacing but it seems they are very niche.
mikert89
none of these local models are good for development, complete waste of time. nobody has $100k+ hardware sitting around at home to actually run a good model
dmezzetti
Local models are great for a lot of things past just software development. We need to move towards solving other real world problems vs just building software. I've been focused on that with TxtAI (https://github.com/neuml/txtai) for 6 years now.
rusk
Spent a week trying to get sensible results out of llama 3.3 At one point it even simulated doing the work, log output and everything and when I challenged it about the missing artefacts it actually started questioning my intelligence. Seems appropriate for a Zuck enterprise.Qwen on the other hand got straight to work with astonishing competency on the same system.From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.
anon
undefined
ShizuhaLabs
[flagged]
Getchowned
[dead]
suthakamal
[flagged]
CurbStomper
[dead]
217
This is kind of like saying grass is green to be honest
mannyv
FYI token speed is somewhat irrelevant for agentic development. You let it run, then you come back. The whole point is that it's asynchronous. If it takes 4 hours, 8 hours, 16 hours...who cares?