
Comments (145)

  • pornel
    Their default solution is to keep digging. It has a compounding effect of generating more and more code. If they implement something with a not-so-great approach, they'll keep adding workarounds or redundant code every time they run into limitations later. If you tell them the code is slow, they'll try to add optimized fast paths (more code), specialized routines (more code), custom data structures (even more code). And then add fractally more code to patch up all the problems that code has created. If you complain it's buggy, you can have 10 bespoke tests for every bug. Plus a new mocking framework created every time the last one turns out to be unfit for purpose. If you ask to unify the duplication, it'll say "No problem, here's a brand new metamock abstract adapter framework that has a superset of all feature sets, plus two new metamock drivers for the older and the newer code! Let me know if you want me to write tests for the new adapters."
  • grey-area
    I find they work best as autocomplete - the chunks of code are small and can be carefully reviewed at the point of writing, and Claude normally gets it right (though sometimes horribly wrong), which is easier to catch in autocomplete. That way they mostly work as designed and the burden on humans is completely manageable, plus you end up with a good understanding of the code generated. Having the AI produce the majority of the code (in chats or with agents) takes lots of time to plan and babysit, and is harder to review, maintain and diagnose; it doesn't seem like much of a performance boost, unless you're producing code that is already in the training data and just want to ignore the licensing of the original code.
  • D-Machine
    This article is great. And the blog-article headline is interesting, but wrong. LLMs don't, as a rule, write plausible code either. They just write code that is (semantically) similar to code (clusters) seen in their training data, and which haven't been fenced off by RLHF / RLVR. This isn't that hard to remember, and is a correct enough simplification of what generative LLMs actually do, without resorting to simplistic or incorrect metaphors.
  • flerchin
    Yes, plausible text prediction is exactly what it is. However, I wonder if the author included benchmarking in their prompt. It's not exactly fair to keep requirements hidden.
  • seanmcdirmid
    I'm using an LLM to write queries ATM. I have it write lots of tests, do some differential testing to get the code and the tests correct, and then have it optimize the query so that it can run on our backend (and optimization isn't really optional since we are processing a lot of rows in big tables). Without the tests this wouldn't work at all, and not just tests - we need pretty good coverage, since if some edge case isn't covered, it will likely wash out during optimization (if the code is ever correct about it in the first place). I've had to add edge cases manually in the past, although my workflow has gotten better about this over time. I don't use a planner though; I have my own workflow set up to do this (since it requires context-isolated agents to fix tests and fix code during differential testing). If the planner somehow added broad test coverage and a performance feedback loop (or even just very aggressive well-known optimizations), it might work.
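    The differential-testing step described above can be sketched in a few lines: run a trusted reference implementation and the optimized candidate over the same inputs and flag any divergence. The function names and the toy query below are purely illustrative, not the commenter's actual setup:

```python
# Minimal sketch of differential testing: compare a slow but trusted
# reference implementation against an "optimized" candidate over a
# shared set of inputs. All names here are hypothetical.

def reference_sum_positive(rows):
    # Trusted baseline: straightforward and easy to verify by eye.
    return sum(r for r in rows if r > 0)

def optimized_sum_positive(rows):
    # Candidate under test (e.g. the LLM-optimized version).
    total = 0
    for r in rows:
        if r > 0:
            total += r
    return total

def differential_test(cases):
    # Return every input where the two implementations disagree.
    failures = []
    for rows in cases:
        expected = reference_sum_positive(rows)
        got = optimized_sum_positive(rows)
        if expected != got:
            failures.append((rows, expected, got))
    return failures

# Edge cases matter: empty input, zeros, all-negative, mixed signs.
cases = [[], [1, -2, 3], [0, 0], [-5], list(range(-10, 10))]
assert differential_test(cases) == []
```

    The edge-case list is where coverage gaps hide, which matches the commenter's point: an uncovered edge case can silently "wash out" during optimization.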
  • einrealist
    > SQLite is not primarily fast because it is written in C. Well.. that too, but it is fast because 26 years of profiling have identified which tradeoffs matter. Someone (with deep pockets to bear the token costs) should let Claude run for 26 months to have it optimize its Rust code base iteratively towards equal benchmarks. Would be an interesting experiment. The article points out the general issue when discussing LLMs: audience and subject matter. We mostly discuss interactions and results anecdotally. We really need much more data, more projects that succeed with LLMs or fail with them - or that linger in a state of ignorance, sunk-cost fallacy and suppressed resignation. I expect the latter will remain the standard case that we do not hear about - the part of the iceberg that is underwater, mostly existing within the corporate world or in private GitHubs, a case that is true with LLMs and without them. In my experience, 'Senior Software Engineer' has NO general meaning. It's a title to be awarded for each participation in a project/product over and over again. The same goes for the claim: "Me, Senior SWE, treat LLMs as Junior SWEs, and I am 10x more productive." Imagine me facepalming every time.
  • gormen
    Excellent article. But to be fair, many of these effects disappear when the model is given strict invariants, constraints, and built-in checks that are applied not only at the beginning but at every stage of generation.
  • 88j88
    100%. You think you are smarter than the LLM and know exactly what you want, but this is not the case. Give the LLM some leeway to come up with a solution based on what you are looking to achieve - give requirements, but don't ask it to produce the solution that you would have, because then the response is forced and it is lower quality.
  • comex
    Based on a search, the SQLite reimplementation in question is Frankensqlite, featured on Hacker News a few days ago (but flagged): https://news.ycombinator.com/item?id=47176209
  • jqpabc123
    LLMs have no idea what "correct" means. Anything they happen to get "correct" is the result of probability applied to their large training database. Being wrong will always be not only possible but also likely any time you ask for something that is not well represented in its training data. The user has no way to know if this is the case, so they are basically flying blind and hoping for the best. Relying on an LLM for anything "serious" is a liability issue waiting to happen.
  • lukeify
    Most humans also write plausible code.
  • sim04ful
    I've noticed a key quality signal with LLM coding is an LOC growth rate that tapers off or even turns negative.
  • helsinki
    That's why I added an invariant tool to my Go agent framework, fugue-labs/gollem:https://github.com/fugue-labs/gollem/blob/main/ext/codetool/...
  • mmaunder
    But my AI didn't do what your AI did. Cherry-picked AI fail for upvotes, which you'll get plenty of here and on Reddit from those too lazy to go and take a look for themselves. Using Codex or Claude to write and optimize high-performance code is a game changer. Try optimizing CUDA using nsys, for example. It'll blow your lazy little brain.
  • FrankWilhoit
    Enterprise customers don't buy correct code, they buy plausible code.
  • raw_anon_1111
    The difference for me recently: write a lambda that takes an S3 PUT event and inserts the rows of a comma-separated file into a Postgres database. The naive implementation - download the file from S3 and do a bulk insert - is what Claude did at first, and it would have taken 20 minutes. I had to tell it to use the aws_s3 Postgres extension, which loads a file from S3 directly into a table. That took 20 seconds. I treat coding agents like junior developers.
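    The direct-load approach the commenter describes can be sketched roughly as below, assuming the RDS `aws_s3` extension (which provides `aws_s3.table_import_from_s3`) is installed on the database. The table name, region, and handler wiring are hypothetical, and a real handler would execute the SQL over an actual database connection:

```python
# Sketch of a Lambda handler that has Postgres pull a CSV straight
# from S3 via the RDS aws_s3 extension, instead of downloading the
# file and bulk-inserting row by row. Names here are illustrative.

def build_import_sql(bucket: str, key: str, region: str, table: str) -> str:
    # aws_s3.table_import_from_s3 makes Postgres fetch and load the
    # file itself, avoiding the round trip through the Lambda.
    return (
        "SELECT aws_s3.table_import_from_s3("
        f"'{table}', '', '(format csv)', "
        f"aws_commons.create_s3_uri('{bucket}', '{key}', '{region}'))"
    )

def handler(event, context):
    # Standard S3 PUT event shape: bucket and key of the new object.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    sql = build_import_sql(bucket, key, "us-east-1", "my_table")
    # A real handler would run `sql` over a psycopg connection here.
    return sql
```

    The win is that the data never passes through the Lambda's memory; Postgres streams it from S3 directly, which is where the 20-minutes-to-20-seconds difference comes from.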
  • marginalia_nu
    I tried to make Claude Code (Sonnet 4.6) write a program that draws a fleur-de-lis. No exaggeration, it floundered for an hour before the output started to look right. It's really not good at tasks it has not seen before.
  • codethief
    > Your LLM Doesn't Write Correct Code. It Writes Plausible Code.I don't always write correct code, either. My code sure as hell is plausible but it might still contain subtle bugs every now and then.In other words: 100% correctness was never the bar LLMs need to pass. They just need to come close enough.
  • ontouchstart
    I made a comment in another thread about my acceptance criteria: https://news.ycombinator.com/item?id=47280645 It is more about LLMs helping me understand the problem than giving me over-engineered cookie-cutter solutions.
  • nprateem
    In the last month I've done 4 months of work. My output is what a team of 4 would have produced pre-AI (5 with scrum master).Just like you can't develop musical taste without writing and listening to a lot of music, you can't teach your gut how to architect good code without putting in the effort.Want to learn how to 10x your coding? Read design patterns, read and write a lot of code by hand, review PRs, hit stumbling blocks and learn.I noticed the other day how I review AI code in literally seconds. You just develop a knack for filtering out the noise and zooming in on the complex parts.There are no shortcuts to developing skill and taste.
  • riffraff
    To be fair, people do too.
  • gzread
    Early LLMs would do better at a task if you prefixed the task with "You are an expert [task doer]"
  • graphememes
    Bad input > bad output. Idk what to say; just because it's Rust doesn't mean it's performant, or that you asked for it to be performant. Yes, LLMs can produce bad code; they can also produce good code, just like people.
  • skybrian
    You can ask an LLM to write benchmarks and to make the code faster. It will find and fix simple performance issues - the low-hanging fruit. If you want it to do better, you can give it better tools and more guidance.It's probably a good idea to improve your test suite first, to preserve correctness.
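    The loop skybrian describes - benchmark first, gate on correctness, then accept the faster version - can be sketched with nothing more than the standard library. The two `unique` functions below are stand-in examples, not anything from the article:

```python
# Sketch of a "benchmark, then optimize" loop: verify the candidate
# matches the baseline before comparing speed. Both functions here
# are illustrative stand-ins.
import timeit

def slow_unique(xs):
    # O(n^2): membership test scans the output list every time.
    out = []
    for x in xs:
        if x not in out:
            out.append(x)
    return out

def fast_unique(xs):
    # O(n): a set makes the membership test constant time.
    seen = set()
    out = []
    for x in xs:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

data = list(range(500)) * 2

# Correctness gate first - mirrors the advice to improve the test
# suite before chasing speed.
assert slow_unique(data) == fast_unique(data)

t_slow = timeit.timeit(lambda: slow_unique(data), number=20)
t_fast = timeit.timeit(lambda: fast_unique(data), number=20)
print(f"slow={t_slow:.4f}s fast={t_fast:.4f}s")
```

    Handing an LLM a harness like this gives it a concrete feedback signal to optimize against, rather than asking it to make code "faster" in the abstract.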
  • bamboozled
    I'm sure this is because they are pattern matching masters, if you program them to find something, they are good at that. But you have to know what you're looking for.
  • cat_plus_plus
    That's very impressive. Your LLM actually wrote correct code for a full relational database on the first try - sure, it takes 2.5 seconds to insert 100 rows, but it stores them correctly and select is pretty fast. How many humans could do this without a week of debugging? I would suggest you install some profiling tools and ask it to find and address hotspots. How long, and how many people, did it take SQLite to get to where it is?
  • STARGA
    [dead]
  • jeff_antseed
    [dead]
  • thisguySPED
    [flagged]
  • thisguySPED
    [flagged]
  • user3939382
    I have great techniques for fixing this issue, but I'm not sure how explaining them would behoove me.
  • serious_angel
    Holy gracious sakes... Of course... Thank you... thank you... dear katanaquant, from the depths... of my heart... There's still belief in accountability... in fun... in value... in effort... in purpose... in human... in art...
    Related:
    - <http://archive.today/2026.03.07-020941/https://lr0.org/blog/...> (I'm not consulting an LLM...)
    - <https://web.archive.org/web/20241021113145/https://slopwatch...>