Comments (149)
- miguelgrinberg
> One thing I’ve noticed is that different people get wildly different results with LLMs, so I suspect there’s some element of how you’re talking to them that affects the results.
It's always easier to blame the prompt and convince yourself that you have some sort of talent in how you talk to LLMs that others don't.
In my experience the differences are mostly in how the code produced by the LLM is reviewed. Developers who have experience reviewing code are more likely to find problems immediately and complain that they aren't getting great results without a lot of hand-holding. And those who rarely or never review code from other developers will invariably miss stuff and rate the output they get higher.
- danbruc
I randomly clicked and scrolled through the source code of Stavrobot ("The largest thing I’ve built lately is an alternative to OpenClaw that focuses on security.") [1], and that is not great code. I have not used any AI to write code yet but have considered trying it out - is this the kind of code I should expect? Or, the other way around, does anyone have an example of some non-trivial code - in size and complexity - written by an AI without babysitting, where the code is really good?
[1] https://github.com/skorokithakis/stavrobot
- akhrail1996Genuine question: what's the evidence that the architect → developer → reviewer pipeline actually produces better results than just... talking to one strong model in one session?The author uses different models for each role, which I get. But I run production agents on Opus daily and in my experience, if you give it good context and clear direction in a single conversation, the output is already solid. The ceremony of splitting into "architect" and "developer" feels like it gives you a sense of control and legibility, but I'm not convinced it catches errors that a single model wouldn't catch on its own with a good prompt.
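The architect → developer → reviewer split being questioned here is easy to state concretely. A minimal sketch, with a stub standing in for any real LLM API (all names are hypothetical) - the structural claim is just that each role starts from a clean context holding only the previous role's artifact, not the whole conversation:

```python
def call_model(role: str, task: str) -> str:
    # Stand-in for a real LLM call (e.g. an API client).
    # Returns a placeholder so the control flow can be demonstrated.
    return f"[{role}] output for: {task}"

def pipeline(task: str) -> str:
    # Each stage sees only the previous stage's artifact, not the
    # full conversation history -- the claimed benefit of the split.
    spec = call_model("architect", task)
    code = call_model("developer", spec)
    review = call_model("reviewer", code)
    return review
```

Whether the forced context isolation catches errors a single long session wouldn't is exactly the empirical question the parent is asking; the sketch only shows what the ceremony buys structurally.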
- peterweisz
Great article. I'd recommend making guardrails and benchmarking an integral part of prompt engineering. Think of it as a kind of system prompt for your Opus 4.6 architect: LangChain, RAG, LLM-as-a-judge, MCP. When I think about benchmarks, I always ask it to research external databases or other resources as a referencing guardrail.
- cpt_sobel
In the plethora of articles that explain the process of building projects with LLMs, one thing I've never understood is why the authors seem to write their prompts as if talking to a human who cares how good their grammar or syntax is, e.g.:
> I'd like to add email support to this bot. Let's think through how we would do this.
And I'm not even talking about the usage of "please" or "thanks" (which this particular author doesn't seem to be doing). Is there any evidence that suggests the models do a better job if I write my prompt like this instead of "wanna add email support, think how to do this"? In my personal experience (mostly with Junie) I haven't seen any advantage to being "polite", for lack of a better word, and I feel like I'm saving seconds and tokens :)
- lbreakjai
It's interesting to see some patterns starting to emerge. Over time, I ended up with a similar workflow. Instead of using plan files within the repository, I'm using Notion as the memory and source of truth.
My "thinker" agent will ask questions, explore, and refine. It will write a feature page in Notion and split the implementation into tasks on a kanban board for an "executor" to pick up, implement, and pass to a QA agent, which will either flag it or move it to human review.
I really love it. All of our other documentation lives in Notion, so I can easily reference and link business requirements. I also find it much easier to make sense of the steps by checking the tickets on the board rather than in a file.
Reviewing is simpler too. I can pick a ticket in the human-review column, read the requirements again, check the QA comments, and then look at the code. I had a lot of fun playing with it yesterday, and I shared it here: https://github.com/marcosloic/notion-agent-hive
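The thinker → executor → QA → human-review flow described here is essentially a small state machine over board columns. A hypothetical sketch (the real version would go through the Notion API; column names here are made up):

```python
# Column order a ticket moves through; QA can bounce it back.
COLUMNS = ["todo", "in_progress", "qa", "human_review", "done"]

def advance(ticket: dict, qa_passed: bool = True) -> dict:
    """Move a ticket one column forward; a failed QA check sends it back."""
    if ticket["status"] == "qa" and not qa_passed:
        ticket["status"] = "todo"  # QA flagged it: back to the executor
        return ticket
    i = COLUMNS.index(ticket["status"])
    ticket["status"] = COLUMNS[min(i + 1, len(COLUMNS) - 1)]
    return ticket
```

The appeal of putting this in a board rather than a file is that the state transitions are visible and auditable outside any one agent's context window.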
- takwatanabe
We build and run a multi-agent system. Today Cursor won: for a log analysis task, Cursor took 5 minutes, our pipeline 30 minutes.
There's still a case for it:
1. Isolated contexts per role (CS vs. engineering) - agents don't bleed into each other
2. Hard permission boundaries per agent
3. Local models (Qwen) for cheap routine tasks
Multi-agent loses at debugging, but the structure has value.
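Point 2, hard permission boundaries per agent, can be as simple as a deny-by-default capability table checked before any tool call. A minimal illustration (agent and action names here are hypothetical, not from the parent's system):

```python
# Each agent gets an explicit allowlist of actions; anything absent is denied.
PERMISSIONS = {
    "cs_agent": {"read_logs", "reply_to_customer"},
    "eng_agent": {"read_logs", "run_tests", "edit_code"},
}

def allowed(agent: str, action: str) -> bool:
    # Deny by default: unknown agents get no permissions at all.
    return action in PERMISSIONS.get(agent, set())
```

The point is that the boundary is enforced by the harness, not by prompting, so a CS-role agent can't edit code no matter what its context says.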
- christofosho
I like reading these types of breakdowns. Really gives you ideas and insight into how others are approaching development with agents. I'm surprised the author hasn't broken down the developer agent persona into smaller subagents. There is a lot of context used when your agent needs to write in a larger breadth of code areas (i.e. database queries, tests, business logic, infrastructure, the general code skeleton). I've also read [1] that having a researcher and then a planner helps with context management in the pre-dev stage as well. I like his use of multiple reviewers, and am similarly surprised that they aren't refined into specialized roles.
I'll admit to being a "one prompt to rule them all" developer, and will not let a chat go longer than the first input I give. If mistakes are made, I fix the system prompt or the input prompt and try again. And I make sure the work is broken down as much as possible. That means taking the time to do some discovery before I hit send.
Is anyone else using many smaller specific agents? What types of patterns are you employing? TIA
1. https://github.com/humanlayer/advanced-context-engineering-f...
- silisili
I'm not sure the notion I keep seeing of "it's ok, we still architect, it just writes the code" (paraphrased) sits well with me.
I've not tested it with architecting a full system, but assuming it isn't good at it today... it's only a matter of time. Then what is our use?
- thenthenthenHaha love the Sleight of hand irregular wall clock idea. I once had a wall clock where the hand showing the seconds would sometimes jump backwards, it was extremely unsettling somehow because it was random. It really did make me question my sanity.
- oytis
I find the same problem applies to coding too. Even with everyone acting in good faith and reviewing everything themselves before pushing, you essentially have two reviewers instead of a writer and a reviewer, and there is no etiquette yet mandating how thoroughly the "author" should review their PR. It doesn't help that the amount of code to review gets larger (why would you go into agentic coding otherwise?)
- jumploops
This is similar to how I use LLMs (architect/plan -> implement -> debug/review), but after getting bitten a few times, I have a few extra things in my process.
The main difference between my workflow and the author's is that I have the LLM "write" the design/plan/open questions/debug notes/etc. into markdown files for almost every step. This is mostly helpful because it "anchors" decisions into timestamped files, rather than loose back-and-forth specs in the context window.
Before the current round of models, I would religiously clear context and rely on these files for truth, but even with the newest models/agentic harnesses, I find it helps avoid regressions as the software evolves over time.
A minor difference between myself and the author is that I don't rely on specific sub-agents (beyond what the agentic harness has built in for e.g. file exploration). I say it's minor because in practice the actual calls to the LLMs undoubtedly look quite similar (clean context window, different task/model, etc.).
One tip, if you have access: do the initial design/architecture with GPT-5.x Pro, and then take the output "spec" from that chat/iteration to kick off a codex/claude code session. This can also be helpful for hard-to-reason-about bugs, though I've only done that a handful of times so far (i.e. a funky dynamic SVG-based animation snafu).
- plastic041
I wanted to know how to make software with LLMs "without losing the benefit of knowing how the entire system works" while staying "intimately familiar with each project’s architecture and inner workings" yet having "never even read most of their code". (Because obviously, you can't.) But OP didn't explain that.
You tell an LLM to create something, and then use another LLM to review it. That might make the result safer, but it doesn't mean that YOU understand the architecture. No one does.
- kleiba
I write very little code these days, so I've been following the AI development story mostly from the backseat. One aspect I haven't fully grasped is what the practical differences are between CLI (terminal-based) agents and ones fully integrated into an IDE.
Could someone chime in and give their opinion on the pros and cons of either approach?
- codeflo
I know the argument I'm going to make is not original, but with every passing week it becomes more obvious that if the productivity claims were even half true, those "1000x" LLM shamans would have toppled the economy by now. Where are the slop-coded billion-dollar IPOs? We should have one every other week.
- xhale
Hi, does anyone have a simple example/scaffold for how to set up agents/skills like this? I’ve looked at the stavrobot repo and only saw an AGENTS.md. Where do these skills live, then?
(I have seen obra/superpowers mentioned in the comments, but that’s already too complex and has a UI focus)
- prpl
I am enjoying the RePPIT framework from Mihail Eric. I think it’s a better formalization of developing without resorting to personas.
- sdevonoes
Agent bots are the new “TODO” list apps. Seems cool and all, but I wish I could see someone writing useful software with LLMs, at least once.
So much power in our hands, and soon another Facebook will appear built entirely by LLMs. What a fucking waste of time and money.
It’s getting tiring.
- zapkyeskrill
What's the point of writing this? In a few weeks a new model will come out and make your current work pattern obsolete (a process described in the post itself)
- neonstatic
> Before that, code would quickly devolve into unmaintainability after two or three days of programming, but now I’ve been working on a few projects for weeks non-stop, growing to tens of thousands of useful lines of code, with each change being as reliable as the first one.
I'm glad it works for the author, I just don't believe that "each change being as reliable as the first one" is true.
> I no longer need to know how to write code correctly at all, but it’s now massively more important to understand how to architect a system correctly, and how to make the right choices to make something usable.
I agree that knowing the syntax is less important now, but I don't see how the latter claim has changed with the advent of LLMs at all.
> On projects where I have no understanding of the underlying technology (e.g. mobile apps), the code still quickly becomes a mess of bad choices. However, on projects where I know the technologies used well (e.g. backend apps, though not necessarily in Python), this hasn’t happened yet, even at tens of thousands of SLoC. Most of that must be because the models are getting better, but I think that a lot of it is also because I’ve improved my way of working with the models.
I think the author is contradicting himself here. Programs written by an LLM in a domain he is not knowledgeable about are a mess; programs written by an LLM in a domain he is knowledgeable about are not a mess. Yet he claims the latter is mostly because LLMs are so good?
My take after spending ~2 weeks working with Claude full time writing Rust:
- Very good for language-level concepts: syntax, how features work, how features compose, what the limitations are, correcting my wrong usage of all of the above, and educating me on these things
- Very good as an assistant to talk things through, point out gaps in the design, suggest different ways to architect a solution, suggest libraries, etc.
- Good at generating code that looks great at first glance, but has many unexplained assumptions and gaps
- Despite lacking access to the compiler (Opus 4.6 via the web), most of the time the code compiles, or there are trivially fixable issues before it does
- Has a hard-to-explain fixation on doing things a certain way, e.g. it always wants to panic on errors (panic!, unreachable!, .expect, etc.) or to do type erasure with Box<dyn Any>, as if that were the most idiomatic and desirable way of doing things
- I ended up getting some stuff done, but it was very frustrating and intellectually draining
- The only way I see to get things done to a good standard is to continuously push the model to go deeper and deeper on very specific things. "Get X done" and variations of that idea will inevitably lead to stuff that looks nice but doesn't work.
So... IMO it is a new generation of compiler + codegen tool that understands human language. It's pretty great, and at the same time it tires me in ways I find hard to explain. If professional programming going forward means just talking to a model all day every day, I would probably look for other career options.
- imiricAh, another one of these. I'm eager to learn how a "social climber" talks to a chatbot. I'm sure it's full of novel insight, unlike thousands of other articles like this one.
- indigodaddyThis was on the front page and then got completely buried for some reason. Super weird.
- ForgotMyUUID
TL;DR: Don't, please :)