Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

<- Back

Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

cloudking

Comments (351)

Greenpants
I have! I care about data privacy and LLMs being free. I'm using the Pi coding harness but containerized and sandboxed, to make sure it's running completely offline. On my Mac Studio with 128GB RAM (or MacBook with 36GB RAM) I'm using Qwen3.6 35b, with only 3b active parameters so that it runs really fast. I've done a complete redesign for my website's homepage and blog with Django + Wagtail. The latter is interesting, because Wagtail is a bit less well-known, so the agent, without giving it internet access, doesn't always know how to develop for Wagtail. I've used Qwen3.5 122b for when things get more complex. At 10b active parameters, it's significantly slower though.I've noticed a few things compared to large models like Claude. For starters, you really need to know what you're asking, and be precise; it doesn't do much thinking for you. Any assumptions left open, and it'll take the easiest route to reach the goal (e.g. CSS in HTML), often not the best in terms of architecture.It gets into loops quite often, and surprisingly often gets the edit tool call wrong, after which it will spend lots of thinking tokens and re-read files instead of retrying (despite the system prompt suggesting so).Comparing agentic Qwen3.6 35b to Claude Opus is like a junior with knowledge across the board, that you really need to guide, versus a senior that thinks with you on architecture. If Opus gives a 15x speedup, local and fully offline Qwen gives a 5x speedup. Which, given that it's completely free, is still mind-boggling to me :)
horsawlarway
For personal use, yes.I replaced a $100/m subscription to claude in favor of running pi harness pointed at unsloth studio, using both qwen (unsloth/Qwen3.6-35B-A3B-MTP-GGUF) and gemma (unsloth/gemma-4-26B-A4B-it-GGUF) models, depending on my mood.I have a machine I built about 5 years ago with dual RTX3090s in it (I was going to build a new gaming machine anyways, and the llama release had just dropped so I tacked another used 3090 onto the build), and I get ~150tok/s on either of those models (at UD-Q4_K_XL quant) and can use the entire 300k context length without having to exit VRAM.To be very clear - it's not as good as claude. But it's free and not so much worse that it matters significantly.For my personal needs, free beats $100/m.I also have an openclaw instance pointed at the same inference server, and it's great for that (genuinely solid use-case for local models).Some example projects- Replacement launcher for android tvs (with usage monitoring and tracking for kids)- Custom admin portals for my k8s cluster services- Custom home assistant integrations/automations (recently some shelly devices for power monitoring and switching)- Grocery list management and meal planning (mostly via openclaw)- some custom workflows for 3d asset generation in comfyui.---Long story short, if you're trying to make money via software... I'd probably still recommend using a paid provider. But the local models are very capable of cool stuff.
bluejay2387
About 90% of my coding is on Qwen 3.6 27b and Open Code with some custom skills and Semble. It is NOT as smart as CC or Codex but its enough to get most of my work done. I didn't set out to replace CC and Codex (I have an RTX 6000 so the TPS is faster than I care about, but the RTX 6000 was originally for other work). I only tried this just to see how close you could get to a frontier model for coding as an experiment, but it was good enough that I stuck with it. I still fall back to Codex for really complicated stuff and to polish UI's as that seems to be the weakest element to working in Qwen.This isn't a recommendation because I don't think most people have an RTX 6000 laying around and the cost would be many years of MAX CC or Codex subscriptions, but at least this seems possible. Maybe in a few more years it will even be practical.Other Notes: I have had to set the compact target to 75% on a 256k context window as once the conversation length goes about 100k I start seeing a drop in the quality and speed. This becomes very problematic after about 150k. I tried Qwen 3.5 122b too but it actually seems much worse at coding than 3.6 27b even though its much larger. Maybe because I am using a 4bit quant or maybe I just don't have it configured correctly? I know 3.6 is newer but I didn't expect it to out perform a model that is much larger from the prior generation. Gemma 4 31b is a good model for other tasks but at least my personal experience is that Qwen outperforms in coding. Nemotron Super 120b is great at a lot of stuff but it also seems to be not as good at coding as Qwen. This was very surprising to me.
codinhood
I don't think you're going to get many "true" answers to this. The opportunity cost of not using the latest and best models is just too much right now.Every month I research this and come to the same conclusion: the time, effort, and cost required to get local models (and the coding tools around them) to perform even close to Claude Code with sonnet/opus just not worth it right now. If it was, it would be distributive enough to be in the news.Not that I'm discounting someone hasn't already solved this, just trying to Occam razor my way out of diving too deep down rabbit holes.
pierotofy
Yes. Llama.cpp + Qwen3.6-35b (MTP) + OpenCode is quite capable and runs on a single RTX 3090 and is faster than most cloud models. Quality is like running edge models from 8-12 months ago. Setup details at https://github.com/pierotofy/LocalCodingLLM/
ozten
Yes, for client projects where privacy and security is important, but no enterprise contract:Open code against Infomaniak hosted OSS models: Qwen3.5-122B-A10B-FP8, Kimi-K2.6.I use API keys for billing. It performs like Dec 2025 in terms of my productivity back then.
sosodev
The problem with this question is that it encompasses a huge spectrum of capabilities and expectations. If you can only run an 8B model and expect it to be good at vibe coding / one shotting things you're going to have a bad time.If you're able to run a model on the scale of ~30B, you can find that with a reasonably scoped and well defined task they do very well. I've found both Gemma4-31B and Qwen3.6-27B to be the best in this range at the moment. You can swap in the MoE models for faster inference, but they are noticeably worse at most tasks. They can one-shot / vibe code tasks with small scope, but still do much better with guidance.If you really want frontier-like capabilities, you'll probably need at least 128GB of memory and either huge compute or a lot of patience. Most people just don't have either the money or the patience to make these local models work.The patience required for local model usage goes far beyond just waiting for tokens though. It takes a lot of effort to get things configured and working properly for your workflow and hardware.
arjie
Not “local” and not interactive coding but sharing since it might be helpful. I have 2x RTX Pro 6000 Blackwell running DeepSeek V4 Flash. I get 160 tok/s raw but it’s a reasoning model. For my use case, I have it auto-write code and another system auto-review the code.I occasionally use it with pi to write some code and it’s blazing fast but it’s mostly habit that keeps me with CC and Codex.
garethsprice
Using OpenCode + OhMyOpenCode + Qwen 3.6 35B-A3B Q_4_KM on an Ada 4000 (20GB VRAM) at 55 tok/sec for generation (slower than it sounds as OpenCode has a bunch of context it adds). Meaning to check out pi when I get a minute as I hear that one mentioned a lot lately.I am using Opus to generate plans that the local agent then follows, then validated by Opus. So I'm not at 100% local but these models are increasingly part of my production workflow. Probably not worth doing - yet - unless you are a hobbyist who likes spending time and money tinkering.This setup is certainly not as "good" as Opus or other frontier models but they are "good enough" for an increasing number of rote tasks. You don't need to drive a Rolls Royce to the supermarket, when a used Corolla gets you there just fine.It also enables new workflows that would be cost-prohibitive with frontier LLMs (especially as token costs rise) - eg. overnight I use the Chrome devtools MCP and have the above setup fuzz-test as a user for a number of hours and see if it can break things. Even got it working with multi-modal so it can check screenshots, which blows my mind (and not my wallet, as Claude+screenshots burns $$$).The "12-18 months behind frontier" sounds about right, it's about where I was with gpt-4o and basic harnesses back then. In another 12-18 months my bet is we have Opus-level models that can be run locally for <$5k... but the frontier models will be even further forward (unless governments have blocked them). Fun times.
stymaar
Yes, Qwen3.6-35B-A3B on a Strix Halo 128GB (Bosgame M5).I have way too much VRAM forme such a model but Qwen never released the 122B version of Qwen3.6, which is the best class of model for my hardware. But at the same time my electricity bill is negligible, this is originally a laptop chip and it shows, it consumes almost nothing while idle and a little above 120W during prompt processing.And Qwen3.6 has been surprisingly effective for me, I still use Clause occasionally but only for like 10% of my needs which allows me to stay well under the quota even with the cheapest plan.Speed: ~800tps prompt processing and 50tps for token generation (with no speculative decoding).
Kostic
For personal needs I connected VSCode with llama.cpp running Qwen 3.6 27B or Gemma 4 31B and it's good enough to cancel my cloud subscription.Qwen running on my 1st GPU at q4@176k context from 70 to 50 tok/s with MTP, pretty good for coding.Gemma on the other hand is using both GPUs, running q8@64k context, doing document sentiment analysis, summarization, proofreading and translating, at consistent 25 tok/s. Somewhat slow but usable for batched workflows. Might get some more once llama.cpp starts supporting MTP with tensor split mode.Still using frontier LLMs at dayjob since I'm not paying it and those are obviously better. Hopefully we'll have a Sonnet 4.6/Opus 4.5 level 30B model in a year or so.EDIT: Prompt processing starts from 800 t/s and drops to 400 t/s. In most cases my starting prompts are around 16k-24k of tokens and require from 60 to 90 seconds to be processed. Not great but acceptable.
jodoherty
I use pi with an RTX Pro 6000 Blackwell to run Gemma 4 31b to do all my agentic coding.I find it useful.This side project highlights a similar approach to how I scope and tackle projects at work now:https://git.theodohertyfamily.com/wg-wrap.git/tree/README.md https://git.theodohertyfamily.com/wg-wrap.git/tree/CASE_STUD...You have to apply a lot of careful architecture and TDD to your approach. Eliminate technical risk by tackling hard things early and wrapping them up in a simple, easy to use interface.I find I can get some projects done 2-3 times faster than if I wrote them by hand. It can also save about 5-10x time on mundane or broadly scoped projects by helping me consolidate and try out ideas very quickly.Setup-wise, I switch between vLLM using nvidia/Gemma-4-31B-IT-NVFP4 and llama.cpp using unsloth/gemma-4-31B-it-qat-GGUF with MTP. I throttle the GPU power usage to 400W.My current llama.cpp setup gets token generation rates between 60-150 t/s depending on MTP draft acceptance rates. Prefill is between 1500-4000 t/s depending on context length/depth.
mgsram
I have been using local LLMs for about a year and I have settled now on Qwen3.6 27b dense model in GGUF on Mac Studio with 512G of RAM with open code as the harness and llmster(LM Studio). I have also used the Qwen 3.6 35B-A3B but the dense model's accuracy is next level with the tradeoff being tokens/sec. With the Qwen3.6 27b, I usually get anywhere from 25-40 tokens/second. Initially I used them for simple tools but for the past 3-4 months, I have been actually doing production grade coding in C/C++ (Automotive Software stack) and Python (Tools) with Qwen3.6 27b.The tokens/sec may be less but that kind of helps me in going at the right pace. The workflow I use for green field development / rewrites is to pair with Sonnet for design/architecture, reasoning and a detailed execution plan. I then feed this piece by piece with precise prompting and that does the job. For brown field, it is often a judgement call. There are occasions when I have found Local models to be limited in their reach and I resort to Claude CodeSome of my recent work using Qwen 3.6: 1. Complete rewrite of Power management Service in C using the existing C++ code as reference 2. Tool to parse contents from really complex specifications in Excel format 3. Tool to translate CJK contents to english for feeding into KG
jborak
I'm using 4x RTX 5070's and first-gen AMD threadripper (1950X) to run Qwen3.6 27B (MTP) Q6_K with llama.cpp and it works great as a daily driver with Pi. Around 50-60 toks/sec. I also connect a few other applications to it such as OpenWeb UI and recently set up Bifrost, an LLM gateway, to be the primary access point for the models I serve.I've tried other models such as Qwen3.6 35B A3B and I've found that 27B works better for me when it comes to coding. It's slower being a dense model but the quality seems much better. Inference on my system for Qwen3.6 35B A3B is around 130-140 toks/sec, non-MTP, which is insanely fast!You don't need 4x 5070's to run Qwen3.6 27B, three or maybe even two will work. However, I use MTP (multi-token prediction) to speed up 27B and that eats up more memory because the draft model requires its own context.Another thing to keep in mind is that the tools you're using have their system prompts that are loaded into the model for each conversation. When I fire up Pi, working with the model is very snappy at start. When I interact with the LLM via Hermes CLI, it's much slower. That's because each prompt with Hermes is loading so much stuff (skills, tools, etc.) into the context and then it's there forever until the conversation ends.I like running models at home for privacy, but I also like how there are no quotas, usage isn't a worry. If the future is "loop engineering" then you will be burning through tokens and $$$ using a cloud models.My system idles around 200W and is around 350-450W when inference load is high. Decoding (token generation) isn't all that efficient, and your GPUs sit idle more than you think during inference. Advancements like diffusion may 1) speed up decoding and 2) let you utilize more of your idle GPU.
cuttysnark
I've had some success with local models by chaining "agents" together in a workflow. Each agent has a different prompt and uses a different ollama model based on what their role is. The project manager, schema agent(qwen3:14b), etc. doesn't use the same model as the coding agent (qwen2.5-coder:7b). Between each step is an orchestrator and with a Playwright task which attempts to surface errors to the agent who introduced the previous code block. Only error-free blocks are forwarded to the next workflow step.Probably the biggest improvement was including a backend-for-agents service definition which instructed the schema agent they were to only produce only a manifest based on the task, and to pass off that off to the next agent.In short, I split tasks up into many pieces by defining a workflow where agents are only allowed to do very specific things before their work is passed along. This keeps them grounded and capable while also creating places for me to intervene if a workflow was say 25% or 90% successful.
HappySweeney
I have an optane and lots of ram, so I tried full-fat models for writing some function overnight, as I get about 0.7 t/s. My current go-to test is to update a scalar function to transpose a bit-matrix to one using avx512. the cloud models all play with that like its nothing. Kimi 2.6 and GLM 5.1 both failed miserably.
wsintra2022
Reading through these comments, I can't tell any more whats bots posting on behalf of the AI providers trying to dissuade or whether people just have had negative experiences with local ai models. IMO, Qwen 3.6 27B 8k quants running on a Mac Studio 64g ram, incredible?. No it is not frontier general super shit, its just good. That's it, its good. Its free and private and can take an experienced engineer from being lazy to being really lazy, and that's magic right there. I use llama.cpp and opencode and have great moments of planning some code changes, and letting it run. Walk away. Chill in the hamoc, clean the dishes, have a wank, whatever. Use tmux and ssh in and check in on it. THIS is where the incredible comes in. Anyone telling you otherwise, well check their motives. I have no skin in the game. I just have an easy lazy time.
CuriousRose
An equally important issue with local AI use (not coding specific) is ensuring that the harness has fast and up to date data if recency is important in your querires (new package features, docs, etc). Hosted models do web search incredibly well and I think this is a huge part of output quality.I don't use local hosted models anymore due to hardware contstraints, but I do have some degree of search anonymisation attached to my OpenCode and OpenRouter connected open models.On my Macbook I run OrbStack that has the following docker containers set to route through a Mullvad based gluetun.- Firecrawl - fast web scraping- SearxNG - metasearch- CloakBrowser - tursile bypassing Playwright alternativeIf you wanted to get fancy with the proxy rotation, you could setup numerous instances of Playwright each with their own Mullvad wireguard key in different locations.
GodelNumbering
As someone that spends all day every day talking to LLMs, I'd say the OSS frontier models + a good harness is already a sufficient combo. For local deployments, we are missing one or two hardware generations (and may not get that soon since hardware companies are heavily favoring datacenter segment) to fully move to a local setup.
henrixd
I have been heavily relying on Qwen3.6-27B-UD-Q4_K_XL.gguf -model and Pi agent (https://pi.dev/) for local tasks and coding. I have used llama-cpp-turboquant fork with some custom cherrypicked MTP patches from another fork.I'm running this on V100 32GB (~900GB/s memory bandwidth) with 200,000 context window, --spec-type mpt --spec-draft-n-max 3 --spec-draft-n-min 0 --cache-type-k turbo3 --cache-type-v turbo3 to mention most relevant parts.I usually get somewhere 45-60 t/s. I believe that speed could be improved slightly by switching to ik_llama.cpp fork and Qwen3.6-27B-IQ4_NL.gguf -model but there's no turboquant support and it's with some other tradeoffs too.
thesuperbigfrog
Here is a nice setup that works well:https://discourse.ubuntu.com/t/use-workshop-to-run-opencode-...
blurbleblurble
My experience is that it's not the models themselves that are limiting right now, it's the clunky alternative harnesses with weird missing features making for bad ergonomics around stuff like queue management, interruption, subagents, goals, etc.
cheekygeeky
Our software dev (smartest guy I ever met) is using OpenCode and Tmux with Open Source models. He says the DeepSeek is his model of choice for coding (he call's it "pretty GOOD". He's running two 3090s on an i9 with 128GB RAM. https://www.msn.com/en-us/news/technology/china-s-open-deeps...
pianopatrick
I wish someone would do a benchmark and competition for this kind of work flow so we could figure out what works well.Like "Here's this consumer grade GPU. Using only this GPU but with whatever models and workflow you want, see how well you can do on xyz benchmark."Contestants would be given like 1 hour max and scored based on % of questions answered, % of questions correct and total time to finish.Like "The Local AI challenge"
3abiton
I think nearly everyone mentioned Qwen, so my turn I guess. Qwen 3.6 35B Q8 (MTP), on a Strix Halo, with llama.cpp. Around 40-50 t/s. Really great pefromance, I get always suprised by its capability. I used with forge-code directly in zsh. For long context 150k+) it start degrading and forgetting.
bravetraveler
I'm largely 'all natural', any of my little LLM usage is local. 128G Strix system, a not-super-dense Qwen or Gemma variant will get 50-80 tok/s output. Not subscribing to Anthropic/OpenAI/etc even in the unlikely event these are the last local models released; simply not needed. Entirely fine without and in-model tool usage covers my currency concerns.
grmnygrmny2
Just sharing my $0.02 here - I have ethical objections to using OpenAI or Anthropic products so I was a reluctant adopter of LLMs at all. Local models address most, though not all, my moral objections so I’ve been using them for work and personal projects for about a month.The hardware I have (32gb Macs and a gaming PC with 10gb 3080) can only get me to Qwen3.6-35B-A3B at various quants but that’s enough (200-400 PP, 20-30 TG).It’s taken some time to learn how to best utilize it - some things take a bit of babysitting or direction - but it’s quite useful. Not having ever used CC I can’t compare but it’s been a great assistant or pair programmer for everything from embedded C++ to Vue. I wish I could run 27B as there have been moments when this model feels like it just can’t quite figure something out but those moments are quite rare. For a lot of tasks it’s a huge time saver and has proved super capable at digging into and fixing bugs given pretty vague instructions.I’m using Pi as my harness.
acc_297
I've been wondering lately if it would help to take a medium sized model and either in cloud or some local setup actually do Reinforcement Learning from Human Feedback (RLHF) on every prompt as a chore - I don't know if trying to manually finetune a model to your use habits would ruin it or help - ideally if you were diligent you could get rid of some of the ticks that make models for the general public difficult to work with e.g. overly sycophantic, overly verbose, annoying tendency to explain via analogiesbut perhaps one individuals prompt feedback just isn't going to ever be enough I'm not sure how much you need (I know people working at big companies that have purchased in-house agents fine-tuned on internal documents etc.. and apparently these end up with bizarre behaviours not necessarily more helpful than the standard models)I'd like to be able to essentially edit every response given by an agent and then finetune on the difference between what it produced and how I edited the text. Personally I would just remove a lot of the adjectives and try to distill the responses to core responses but I worry based on some of the work done by Owain Evans and other alignment researchers that this can sometimes push agents into tricky-to-predict tendancies.
_bobm
But, guys, when you say Claude/ GPT models, do you stop to think what are these "models"?One day I thought about how can GPT send thinking parts one after another with a markdown header summary of the thinking block itself. Just think about it.As a matter of fact, think about these operations, api endpoints, observe their output.These so called SOTA models are not what meets the eye, and are not at all comparable in the infra department to local models. There is crazy orchestration going on due to the scale of these operations. But also these hard constraints lead to innovation. Innovation nobody speaks about.I wouldn't say we cannot catchup, but serving our local models through llama, vllm is just the A, B, C of it all. In reality I think what is needed is a replication of said orchestration which I hinted at above.The SOTA models are a deep orchestration of multiple models operating together it isn't a single model. As such no single model ever will catchup to them until it replicates through training first and then maybe through model architecture this orchestration.Finally, I would wager that the SOTA "models", as one of these models in this orchestration setup, as served for general consumption, are not so much more capable than qwen 3.6.I am sure that if you change your perspective you will start noticing the scale of the "magic".
nfrankel
I tried. It works in theory: https://blog.frankel.ch/tokensparsamkeit-coding-assistants/#...Results depend on the model, of course, and your computer is the limit. Mine wasn't up to the task, unfortunately.
K0balt
Pretty good results with qwen 3.6 27b dense. I’d say it’s about equal to (Claude) haiku 4.5 maybe sonnet depending on the task.
v3ss0n
Yes qwen 3.5 122b+ dgx is working wonders and I ko longer subscribed to any cloud api now. I will post a project which I accomplished in 9 days of long horizons running.
mitchell_h
Tried. The context windows just weren't big enough.
heisenbit
I think it is work to set up but I'm also learning a lot setting it up. Mainly using qwen/qwen3.6-35b-a3b mlx with my 48GB M4 MBP which leaves me just enough headroom for docker dev-container and other basics. I use LM Studio to run and am using it via VSCode. A big difference made the system prompt improving the tool integration (I asked GPT for guidance on that). Before that it was not making changes but regenerating code often messing up than helping.I mostly run my MBP on low power even when it is plugged in to avoid the noise and heat. Full power maybe doubles speed but more than doubles power.What can it do: Simple restructuring of pages. Where did it and other models fail: Splitting up Pinia store which GPT-5.4 did without fail. I think with more tuning, guidance for tool use and maybe some support tooling around it performance can increase further.
moezd
Not yet. Without pure Apple game or decent GPUs, even with a lot of RAM and threads, all you get is about 30-50 tokens/second, and that's thinking turned off. Without these optimizations your model will have a field day with your MCPs, skills and agent descriptions and you will watch the paint dry before seeing the first output token. Local model serving means you have to fight for every token in your context window, which is quite opposite of what Claude/GPT/Copilot are pushing the industry towards.
etoxin
I have not. We use openspec with our projects at work. To try and simulate a local rig without spending big cash. I use the hosted models and pay for them with the latest popular local model.Most small local models don't get tool calling right, however the larger models are now doing this correctly now.One thing local has not accounted for, is most productive engineers are running multiple cli chats at a time with git worktrees. I normally hover around 3 worktrees + cli-chats.
milchek
I’ve tried in a 36GB MacBook Pro and haven’t had much success beyond very basic work. Issue for me was the context runs out quick even with smaller models and it’s slower. To get some half decent performance I’d imagine you want 128gb memory and are spending a lot more on hardware. At that point it becomes a question on whether you’d rather have frontier models at a subscription or sink that money into your own equipment. Of course, for those with privacy in mind your only option is forking out the cash for the higher end machines.
codelion
Using qwen3.6 27b locally with Claude code, it works well for simple coding tasks
anon
undefined
bijowo1676
One of the interesting setups I saw is using expensive frontier models to write and update markdown for your app: specs, product requirements, architecture, etcbut then use cheap/local model to implement the specs.Markdown is more effective at compressing information and fits the context window easier, than hundreds of source code filesbut this requires second and third passes, to smooth out the rough edgeshas anyone tried that?
SupLockDef
Local isn't new for me. I am still coding my stuff, but Qwen3-coder:30b on my old rig with a gtx 1070 16gb RAM does wonders for me.I mostly use it as a google search if I forget a thing, or doing the boilerplates.I am using a mix of a non harness chat for the reply speed, and opencode / vim-ai for my boilerplates.$0.00 / month. That's the budget.
jderekw
Running AMD Lemonade as the daily rig, Started with Ollama then over to LMStudio and now standardized on AMD Lemonade which has been helpful to monitor cRAM, CPU, GPU and gRam. The multi-models on Lemonade make it straight forward to run a stack for LLM, Voice to Text, NPU, and Image Generation. Platform also works with Nvidia, Apple, Intel and AMD chip sets.
sj_tech
I use Qwen 3.6 35B A3B for agentic coding using GitHub Copilot Extension for VSCode. Mac Mini 128GB as the hardware. Seems reasonable for that model size, but I notice looping issue when problem becomes too big to solve. You can use it to do something that you know how to do (saves time).
anubhav200
Yes, llama.cpp, qwen27b, 35b, claude code. Llama-cpp-manager for managing llama.cpp configs (https://github.com/anubhavgupta/llama-cpp-manager)
BiraIgnacio
I tried for a bit, with llama.cpp + Qwen + Mac Pro but the results were very poor (both quality and speed).I considered investing in better hardware but doing the math, it is cheaper for me to pay for DeepSeek (yeah, I know not everyone can do that).
drnick1
- What would you say is the best model for coding at the moment that can run on a high end consumer GPU? (Assume an RTX 3090/4090 is available.)- What "stack" do you recommend? Llama.cpp + OpenCode?
zaptheimpaler
I tried gemma-4-26B-A4B just to see if it could help me read/sort my emails on a relatively under-powered setup (16GB VRAM + 32GB RAM) and it's not going well.. the model burns 24K tokens just on searching for the right tool and then dumps the email contents into context - i tried to get it to use code-mode to save context but the code-mode implementation can't save files so it was useless and im going to try to switch to "ssh-mode" into my devbox container. Still relatively new to this, so I'm probably doing something wrong
anon
undefined
NetOpWibby
I'm looking forward to having Claude Fable at home. THAT is when I'll THINK about replacing Claude (who knows what their next models will be capable of, Fable was damn good for the three days I had it).
boringg
Will the AI labs always make sure there is at least a years worth of differential? I guess the underlying business premise is that each new release has a step function change that prevents this kind of behaviour..
ndom91
Not 100%, I still fall back to Claude for most day-job stuff. But I've been trying to use Qwen 3.6 and Gemma 4 on my framework desktop mainboard (Strix Halo) as much as possible.I've been working on an ops style tool for local LLM inference. Proxying, api keys, request logging, model rewriting and much much more.https://github.com/ndom91/llama-dash
derekered
I'm using Qwen 3.6 on my MacBook Pro M5 Pro with 48BG RAM for any work that I am particularly privacy conscious about, like any work with my journaling. It's been working great! I don't have any direct comparisons, but I've been satisfied with the results.
dabinat
There’s evidence that combining models can achieve frontier-level performance (e.g. OpenRouter Fusion). I’m wondering if that’s the more realistic option: combine Opus with a local model to save on token costs.
bArray
I'm in the middle of building my own based on LiquidAI/LFM2.5-1.2B-Instruct [1]. I run it on the CPU locally and get reasonable performance. I'm currently using it to solve small problems - but expanding it daily.[1] https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct
julianlam
Of course.Qwen 3.6 35B-A3B on a Framework 13 with 32GB of memory.Running llama.cpp, 15 tokens per second. Outputs code and text faster than I can parse.
tumetab1
Not yet, tried Gemma 4 on an Apple M4 but the tok/s is significant lower than the cloud offering.Also,the lack of enterprise tooling to help selected an appropriate model and tooling to run a local LLM does not help.
anonymousiam
This was posted shortly after your Ask HN post:My Homelab AI Dev Platformhttps://news.ycombinator.com/item?id=48542433
xhinker2
Yes, I have. 1. Two RTX 3090s in Linux 22.04 2. Running Qwen3.6-27B Q6_K_XL GGUF 3. Using my own harness AZPal, I build myself, also wire it with Hermes Agent, works fine 4. Many times it solve problem that Codex can't solvehttps://medium.com/p/f237d575e861
whartung
Will the inevitable M5 releases from Apple change this equation in any meaningful way?I'm waiting to swap out my last gen Intel iMac with a new M5 mini of some kind, with the eye to hopefully be able to run some models locally. I envision a mini (heh) arms race to simply swapping out an M(X-1) for an M(X) annually as this field shakes out.
mv4
I've been using MiniMax M2.7 with vllm on my dual Nvidia Spark cluster. Slow (<20 tps) but functional for most of my use cases.
ryandrake
Always a bit disappointed in the details in these kinds of threads. When you do get answers, they're never specific enough to try out on your own. It'll be something like "I use Qwen 3.5 and get great results!" OK but what quantization are you using? What llama parameters? What context size? What GPU are you running it on, and how much VRAM does it have? Are you hosting it on a separate box, or running it locally on your dev machine? What coding agent tool are you using, and how is it configured / hooked up to the model?
627467
So, everyone has different context, but how free is free running these local models? Like having a power hungry machine always on in the cupboard?How much does this ware out the hardware?Also, if privacy is the main reason for running local models, why not use venice.ai and equivalent?
qu0b
I'm using deepseek V4 on two rtx 6000 pros and its working great. Opus is so slow that I get deepseek to do most of the work and Opus is only used to validate and help plan.
Lwerewolf
mbp16 m5 max 128gb, antirez/ds4, deepseekv4-flash. Works well for relatively dense (let's say <20k LoC per project) C codebases that are essentially a bunch of custom specialized stores, http servers, network infra, media transformers, etc.Runs through Pi with a custom prompt (basically "don't speculate blindly, isolate things, make them traceable and measurable, then verify") and behind a pretty restrictive bwrap setup - RO bind everything other than ~/.pi, cdw and a separate tmpfs, unshare almost everything other than the network - for which I use a network namespace that only allows tcp connections to a specific ip and port (i.e the inference mac) - i.e. netns exec into bwrap.Can't compare it to SOTA or higher-requirements models on what I work on - policy. That said, on a bunch of test pieces - it obviously isn't gpt-5.5, it definitely lags behind k2.6/glm/ds4-pro, but it absolutely is usable. Of course, on such codebases, forget about one-shotting or trusting it blindly or anything of the sort - you ask it, guide it, restart the context from time to time to have a "fresh dice roll" and to keep the context small and clean, etc. Compared to anything smaller (incl. all the usual local qwen models) - on a test piece, it figured out that memfd and mmap were used for setting up a ring buffer with natural wraparound handling (double mapping the first page at the end) and didn't tell me "this is for sharing memory between processes" or some other BS.Performance as described in the tables in the readme here: https://github.com/antirez/ds4 ...with a bit less than half that at "low power" (30w). Both are usable.
overgard
I haven't yet, but I just bought a 128GB M5 Max 40 core which I'm hoping can do it (if not, it's a good laptop regardless, I actually need that amount of RAM for non-LLM stuff)
kristianpaul
Qwen3.6 35B on gigabyte aitop (spark clone) but be very specif what you ask and how should be solvedNemotron super 3 110B works well for 1M context long vibecoding sessionsI also use Pi harness with no extension
jmward01
Has anyone been storing their cc sessions for future training data on their own models? I'd love to build a system that fine-tunes on cc sessions and a good first step is capturing my own sessions well.
shironnnn_
I use SpecKit to create a very detailed plan with a high amount of specificity using paid Claude plan.Then I give it to local LLM (eg: Qwen / Gemma 4) via CLI. This is possible through usage of llm-mlx on Mac (or ollama on any machine given sufficient on hardware) which serve OpenAPI endpoints compatible for Aider (CLI) or Visual Studio Code to vibe along with the agentic coding assistant.The paid products have an advantage but are not necessary if you don't mind to be more-involved with the process and have low expectations.
mark_l_watson
I would like to say I run 100% local, but I use Opus + Gemini Pro cumulatively for 3 or 4 hours a week. I also like to use DeepSeek v4 flash with OpenCode for small quick tasks.I did just publish a free to read online book "The Rise of Local Coding Agents" [1] where I document my setup that I enjoy using. I use little-coder (built on pi) and have good results for small Python and TypeScript applications. I struggle getting good results with Common Lisp and Clojure.For me, the problem with all local LLM-basic coding agents is slow runtime.[1] https://leanpub.com/read/local-coding-agents
SugarReflex
Is anyone using Aider? Is there any decent CLI alternatives to it?
agentbc9000
Kimi K2.7 is very good - i have been testing it and its very very good, Fable 5 level of goodness.
wuschel
I would like to know whether someone was able to use lower tier models for activities other than coding e.g. a limited version of a personal note manager - and what the hardware requirements in RAM for these models were.
ecshafer
I work with a few models on servers, so not local, but self hosted with ollama. gemma-4, glm 4.7 flash, and qwen 3.6. glm is the best at coding agentically. But I still don't think any of them reach the levels of gpt 5.5 or opus 4.8.
anuramat
I wonder what languages people are using; I imagine smaller models would be decent at bash/python but significantly worse at something like rust
catapart
tough ask, but since we're here: has anyone done this with 16GB of VRAM? I've been getting projects finished with LM Studio, but it definitely could stand to be more efficient. lots of time wasted with trying to get models to understand a problem with so few tokens.
redox99
Models that you can run at home (Like Qwen 35B) aren't remotely close to Opus or GPT 5.5. Not even close. The only open models that are in that neighbor are around 1T params, so forget about running at home.It's kind of like driving a shitbox. It can often drive you from A to B, and some people will try to convince you it's fine. It's not.There's no logical reason other than absolutely requiring the privacy, doing it for fun, or niche use cases like airplanes and so on. If you can't spend the insanely subsidized $20 for codex, you can use an API for chinese models which will run circles around these tiny models.
fortyseven
I use Pi and Qwen 3.6 27b locally on a 4090 for all my personal projects. I still use Claude for day job work since they pay for it, and my employer expects me to use it. I rarely touch it otherwise.
hegdeezy
I have tried locally but I find that the implicit breakeven is somewhere around 1 year of use given the high power costs where I live. Not really worth it but maybe if I move some day!
chungus
Yup, although technically not replaced because I never used either of those products because I don't like sending my code to their black box. I have 2x24GB AMD gpu's, gotten from gamers on my local marketplace, one is connected with a 40cm riser cable. Running Qwen 27B and am very happy with its performance. Q8 with 135k context (arbitrary number, I could push it to 256). I like to use qwen 35B3A for mapping out entire code paths through our relatively complicated codebase/infra at work.I think it's so good that I now scour the local marketplaces for good buys on 24GB cards that don't seem run through by miners and the likes, to build an even bigger rig for parallel execution.Power usage is also totally not an issue, AI workload is very different from gaming.tldr llama.cpp-vulkan with opencode on total 48GB VRAM AMD cards on arch btw.
_davide_
i used to mix remote and local minimax 2.7(q3) on my strix halo, it run at 30 tg and 220 tokens pp... it was a bit painful slow, but it was a good feeling i could stay offline. unfortunately m3 which is at opus .8 levels is 460b parameters and doesn't even fit in 128gb of memory, let alone a big context. strix halo feels like a toy for ai purposes. https://kyuz0.github.io/amd-strix-halo-toolboxes/
AH4oFVbPT4f8
Ollama + Hermes on M5 Max 128GB using .NET using Qwen 3.6:35b-a3b as the primary model to do the work. I might use 27b to plan what to do.
SkitterKherpi
It has so far been the kind of thing that always feels like the next version of the local models would be the one that is just good enough.
euroderf
Is anyone managing to do this on a Mac with a measly 8GB ? Asking for a friend.
jwr
I tried many, many times and I keep trying. But I just don't see this happening: those tiny models that we can run on our machines (I have an M4 Max Mac, so I can reasonably run qwen3.6-35b-a3b or gemma-4-26b-a4b-qat at this time) are NOWHERE near as smart as the huge monsters like Opus/Fable. Nowhere. I can see a lot of people deluding themselves.Sure, you can get the local models to generate plausibly-looking code for simple cases. But compared to how I solve complex design problems in a large codebase with Claude Code and Opus/Fable, this isn't worth my time.
jmichaelson
I am working on exactly this issue right now. My approach is that a highly optimized harness (pi.dev) with the right backing knowledgebase (a custom, self-updating wiki with lots of QC layers) can get close to most of my usage patterns for my Claude Max 20x subscription. I use Gemma 4 26B QAT served by a custom fork of llama.cpp, with 4-8 slots of 256k context at Q8. It's a very good model when the harness keeps it on rails. In an age of 1M context windows, 256k may seem small but it's been plenty for my work (scientific programming). A $20/month subscription to Ollama-cloud gets me good coverage of consults out to frontier models for difficult plans or debugging (again this is all woven into my highly customized pi install).I'm still optimizing it (with claude, to be clear), but my testing is very encouraging. I worry a lot about companies (and the government) controlling access to machine intelligence, so local is the way to go.
anubhav200
Yes, llama.cpp, qwen 27b and 35b, llama-cpp-manager for managing model configs.(https://github.com/anubhavgupta/llama-cpp-manager)
lowbloodsugar
If you want to try it out before dropping $$$ on a GPU, just run something that would fit on your target GPU but online.
Razengan
Related: Are there any viable distributed AI models?Like how we've had SETI at Home, Folding at Home, BitTorrent etc. People are clearly willing to donate their computer resources to distributed projects.Maybe in a dAI network anyone could submit content for training on, and each user running a "node" could have their own custom private conditions on which type of content to accept for training or inference.Like someone who dislikes anime could say "never accept anime related content or queries" so their node would basically opt-out from any data or questions about anime.
wmedrano
No, but I use GLM5.1 instead of Claude/GPT.
drnick1
Do you recommend Ollama or bare llama.cpp?
platevoltage
I run very small models locally for code completion and writing boiler plate. I still use Claude in a web browser on occasion since it's free, but the second that goes away, I'll be done with it. They get none of my money.
epolanski
Not with a local one, but I moved to DeepSeek v4.Albeit I plan to move to local ones when I will get my hands on a 256+ GB macbook.Local inference is good enough to help me with my daily job, and doesn't turn me into an assistant to the LLM.
salutonmundo
it's called your damn brain.
hacker_homie
I do qwen3.6 on an amd ai max laptop getting about 6-10tok/s it’s slow enough that I can follow along. It has issues with design and large piles of code. Otherwise it’s a good programming buddy.
devin
Anyone here running a tinygrad?
jay_kyburz
Can anybody let me know how just chatting with Qwen3.6 on a Strix Halo 128GBIf I give it a page of context, can it write a linked list or identify a bad line of CSS?Is there anywhere online I can chat with a model I could be running at home to see how good it is?
w10-1
I run many models (but mainly Gemma-4) using oMLX (for caching) on a 32GB M1 max using (gasp) Xcode. For tok/sec response times, I'd say it responds faster than I could read the prompt aloud in many cases (and I'm not constantly polling the Claude status page).For months I spent time curating the AI+harness+skills+MCP servers, but now mainly just code with it. I find myself not bothering to use Claude (but keep paying "just in case").That's feasible in part because my prompts have very specific objectives, constraints, and suggested staging, because I want the code to be exactly as I would write it, and I want to weigh in at specific moments. I would say the speed-up is 2-4X instead of the 10X of vibe-coding greenfield projects. The problem is not the coding speed, but building something complicated that's also correct and flexible (i.e., a directional accuracy). E.g., the agents help with abandoning a less-fruitful API shape instead of sticking with what works in a local maxima.One flaw there is that I'm still writing code that feels clean to humans, which now is probably a waste. LLM's might be happier with 10+ parameters on one API instead of a plethora of configuration objects and convenience wrappers.
sometimelurker
yeah I use one one the small MTP qwens and pi
system2
Until I can buy an 80GB VRAM GPU, I won't attempt to do it. A local LLM is always missing something that needs a bigger model.
christkv
Waiting for this https://github.com/antirez/ds4 to stabilize for strix halo.
syngrog66
pre-replaced it with combo of my brain, vim, an assortment of other CLI/TUI tools, etc
major505
Yes. I use Owen on my MacBook m1 (16gb) daily, running inside Ollama. Works well. Is not particularly fast, and I need to create a custom imagem that sets the temperature of the model to zero starting, so I don't get over creative with its bullshit, but it works reasonable week.
thrownaway561
I just use DeepSeekV4 Fast... It's cheap as hell. Currently my monthly usage has been67M Ouput 51M InputTotal $0.83 dollar.I honestly don't understand why people just don't use DeepSeek.
jeffrallen
I use Qwen 3.6 on a remote GPU that my work offers. Works fine. Slow and steady, works hard, gets the job done. Probably better at diagnosing than making new code, but whatever.
gigatexal
I tried to. I just couldn't get over how it made my otherwise whisper quiet M3 Max MacBook Pro 14 for the performance. The sweet spot has been adopting Claude Code to use the Chinese models. Deepseek V4 Pro is very, very good. But I am such a casual local user of AI that my 20/month Claude subscription is enough and I find myself using that more and more.
cyanydeez
never started. using wither qwne3-xoder-nezt or qwen3.6 35bif youre shoopping for a new pc, very easy to justify 128gb vram
dude250711
Yes, running a local model on a natural wetware substrate here.Recommended setup: plenty of nutrients, some caffeine and a quiet environment.Performance - not currently measured in tokens: roughly average.
daischsensor
[flagged]
kordlessagain
[flagged]
Littice
[flagged]
arggjarvs
[flagged]
hottrends
[flagged]
aplomb1026
[flagged]
anon
undefined
KaiShips
[flagged]
phlhar
[dead]
temilson
[flagged]
eugmai86
[flagged]
ericmaciver
[dead]
iluvcommunism
[dead]
tyingq
Anyone doing it with a "rent a GPU over the network" path? Is that at all cost effective for any use case?
kertoip_1
Just attach OpenRouter to your coding agent tool and try yourself. All relevant open weight models are there. Every person have different needs and expectations
dada216
Local? No. Via opencode Go subscription using GLM mainly? Yes, I still use Gemini/Claude/GPT via api from openrouter for adjacent tasks, I would say 20$ per month max in api token costs.Disclaimer: I am a Linux infra/k8s guy, I write production code but it's mainly glue code and mainly in golang.Addendum: most value we get is from "document intelligence" and that's all Gemma and Qwen on H100/H200