When I reject AI code even if it works

vnbrs

Comments (99)

Aurornis
Even using Fable (while it was briefly available), having it refine a plan, and directing it to make only small incremental changes, I still found reasons to reject its first pass at a lot of work. There were a lot of “You’re right to push back” responses, and a lot of incidents where it would create some giant, complex set of abstractions to accomplish something that I could do much more elegantly and maintainably.
It’s really eye-opening to work with these tools on a codebase you know deeply, because these problems are everywhere.
However, if I opened an unfamiliar project in another language and wanted to add a little feature with no intention of maintaining it, I’d happily accept the changes and loop until it worked well enough for my temporary needs.
The scary middle is when you’re dealing with coworkers who don’t care about anything other than closing tickets and collecting credit. With enough of a token budget you can now wrap loops around an LLM and have it try things until the program appears to work, ask it to do a code review, and then submit the PR without having understood what it was doing. There are a lot of workplaces where there isn’t a good mechanism to push back on this, and the tech debt just keeps growing.
ecshafer
If we rephrased this to "When I reject my coworker's code even if it works" and gave the same reasons, there would be zero dissent. There is this weird idea that seems to come up with AI that any solution must be good and adequate. Software engineering is all about rejecting code that works in favor of the right code that works.
jdw64
Coding with AI eventually comes down to two paths, I've realized. One is using AI exclusively for everything. The other is not using it at all. There is almost no middle ground. The reason is that as the complexity and depth of the problem increase, the code AI generates increasingly follows enterprise-level patterns. The deeper the meaning of what I input, the more AI tends to produce code that goes beyond my own area of expertise. For example, a human expert's code is very powerful and deep within their own domain, but when you look at the entire codebase, it's often shallow and uneven outside that domain. But the moment you write code with AI, once you go deep in one part, AI tries to standardize the rest accordingly. This means the entire codebase converges toward enterprise-level standard code, which essentially reflects the average patterns of senior programmers who built large-scale systems.
The problem is this. Human cognitive resources are finite, so we inevitably become shallow outside our own expertise. There is no programmer who can do everything well. And as systems grow in scale, they become more modularized and fragmented, making it impossible to understand the whole system. So what should we do about this? That's always the question.
In the end, do I choose not to use AI, finish the project with uneven code outside my domain, and deliver it? Or do I use AI and deliver a program that is uniform and consistent, but not in my own style? I still don't know. I haven't found the answer yet.
summerlight
My personal rule of thumb: I am usually okay with agents driving e2e implementations if it won't make life noticeably worse when the result doesn't work. Some analytical code? Perfectly fine. Hobby projects? Fine, though I prefer doing the fun part myself. Refactoring production code that generates 10x more revenue than my salary? You'd better at least understand what it does.
whilenot-dev
Titles like these always make me point out the obvious: a working state is the absolute minimum requirement for any code to be merged, isn't it? ...imagine merging something even though you know it's not working.
Besides, this post has nothing specific to code produced by an LLM, and placing AI in the stated reasons feels completely arbitrary, or is rather a fallacy of our times:
- I reject [AI] code when I can’t explain the approach in my own words.
- I reject [AI] code when the diff is bigger than the problem.
- I reject [AI] code when it introduces abstractions before proving they’re needed.
- I reject [AI] code when it works locally but makes the system harder to reason about.
- I reject [AI] code when I’m trusting the output more than my understanding.
SunboX
I understand the reasons, but I don't agree. I have over 20 years of experience in software development and still develop software daily. Nowadays it's nearly 100% AI-written. It looks good and works. Sure, you have to guide the AI. But this can be done with custom skills, agent files, code quality guards, test cases, and so on. Maybe the code doesn't end up looking the way I would have written it, and maybe something is implemented in too complex a way. But that's true for large developer teams too. In the end it's way faster and it works. I think everyone who does not adapt to this new workflow will soon be left behind in professional development.
osigurdson
It's hard to find a middle ground between fully understanding everything in a PR and a vibe-coding type of approach. Can you understand "just a little bit" of a PR and merge it into a code base you really care about? Is it maybe fine to "mostly understand it", on the other hand? It's definitely a tough call, and it's impossible to argue that no trade-off is being made.
LLMs are perfect for quick prototypes, speed runs, learning, etc., but if the code really matters it's still not clear-cut. I think the definition of what "really matters" is very project dependent, of course. As an extreme example, you would want to understand every line of the code for the control system that runs an MRI machine or a jet engine, since bugs might mean life or death. Depositing money into the wrong account might not kill anyone but could lead to severe economic losses. But, then again, even problems in far less consequential software may be drastically sub-economic (i.e. saving $1000 on the implementation might cost $10000 if customers aren't happy and fail to renew). Pick your scenario, I guess.
The problem is, this isn't going to change regardless of how well a new model scores on a benchmark. It seems AGI is actually needed.
edanm
Not that I disagree with anything here, but...
I wish it were clearer in these kinds of posts how "I use AI code I don't understand" is so different from "I use libraries written by other people I don't understand", or "I work in a large codebase which was 99% written by other people, and I haven't seen all of it", or even "I use software written by other people I don't understand".
krupan
And again this makes me wonder, is AI really helping if this much review and rework is needed for all the code it writes?
wwind123
I use three AIs (Claude, GPT and Gemini) to review each other's design plans and implementation on the same code base. Each often catches problems the others miss.
I try to make sure the architecture docs of the code base are refreshed regularly based on recent changes, so it's easier for humans and AI agents to make sense of the code.
I also regularly stop all other development and just focus on auditing the code base with these AIs to make sure it is secure, robust, clean, well structured, and well tested -- some refactoring is needed most of the time, and it's well worth it.
With this approach, nowadays I often merge code from AI without completely understanding what it's doing, but the code seems to have been working so far. :)
datadrivenangel
"The reality is that code that runs and makes the CI green can still be a bad solution, and engineering has always been about implementing adequate, scalable, and extensible solutions."
Adequate often means done and cheap.
moezd
If it's code that you can tolerate being somewhat messy and suboptimal, you can run agents e2e. If it's a critical piece of code that has become part of your identity, better do the PR work and scrutinize it well. LLMs are still next-token predictors, no matter how many harnesses, hooks, skills, and tools are attached to them. LLMs only know that these are callable; interpreting the state and mitigating problems are still best effort.
philbo
Yesterday I started working on an agent harness that tries to address some of the issues here.
What I'm hoping to build ultimately is something that works more like a pair-programming partner than existing harnesses do. I want the user to be an engaged part of the development process all the way through; I don't want the agent disappearing to work on its own. I even want to make it possible for users to swap into the driver role and have the LLM automatically assume the role of navigator when that happens.
There's more info in the readme (actually the readme is all that exists so far, I wanted to get the idea straight in my head first):
https://gitlab.com/philbooth/opair
Even if nobody else uses it, I hope it will be a useful tool for myself and help me find a way to work with LLMs that doesn't harm my mental models, which is what I feel current harnesses do.
julianlam
I think a particular failing with developers embracing AI is fighting the sunk cost fallacy. While you might not have spent as much time putting together a non-working solution, you still did spend time working with the agent to slap together a non-working solution.
Being able to step back and say "this was a failure and we need to discard the day's work and start over" is still hard with LLMs.
piterrro
I feel the same way: reading the entire output of an AI-built feature overloads me cognitively as well, and I can only do so many throughout the day.
What I found myself doing is operating in two modes:
1. For projects that require my attention, I plan and instruct the LLM; when needed, I draft some code and ask the agent to make it better or finish the mundane part (I write code and leave gaps with comments asking the agent to finish).
2. Full auto mode, where I use spec-driven development and TDD. I only ask for changes based on an existing PRD, which the agent also has to update. Here I do not look at the code at all.
Seems to be working just fine.
AmareshHebbar
If I can't explain the code without rereading the diff, I probably shouldn't merge it.
danfritz
This resonates a lot with me. I often use AI for the plan and let it propose multiple possible implementations, and I often have to point out the glaringly easier, more logical solution.
When implementing, it's often a lot of misses with a few golden hits. The other day it used flex for a table layout while our app uses tables everywhere, sigh.
Another typical one is that it tends to prefer frontend aggregation and looping over data instead of letting the database and backend deal with it.
Using a mix of Claude, Cursor Composer, and Codex.
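To make the aggregation point concrete, here is a minimal sketch of the two approaches using Python's built-in `sqlite3` module. The `orders` table, its columns, and both function names are made up for illustration; they are not from the commenter's app.

```python
import sqlite3

# Hypothetical in-memory table, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ann", 10), ("bob", 5), ("ann", 7)],
)


def totals_in_client(conn):
    """Fetch every row and sum in application code (the pattern complained about)."""
    totals = {}
    for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
        totals[customer] = totals.get(customer, 0) + amount
    return totals


def totals_in_database(conn):
    """Let the database aggregate and return one row per customer."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    )
    return dict(rows)
```

Both functions return the same totals, but the second transfers one row per customer instead of one row per order, which is presumably the difference that matters once the table gets large.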
eranation
LLMs diverge, not converge. They slightly increase entropy if not controlled. You can have DRY skills and use AI to organize AI (in loops(tm) like Boris does), but eventually, if you don’t understand the code, you are taking yourself out of the loop. And it’s not just job security that’s on the line; it’s the increasing cost of having AI babysit AI. If you or your “loops” (or paperclip, Hermes, gastown, or next-in-class agents of agents that run your entire company) let slop-debt gradually sneak in, the cost to fix it later will become prohibitive. (You can always just rewrite it, but as the race for “feature complete” and “zero backlog” continues, rewriting an ever-growing set of new daily table stakes will become an economic moat.)
TLDR: Keeping your codebase human-readable and reason-about-able doesn’t just help humans stay relevant. It will also lower the cost for LLMs to maintain it.
rvz
> Before coding agents, when given a task, I would explore the codebase, think of different solutions, experiment, and only then implement. That could take days of consolidating all that context. When I finally submitted that PR, confidence was higher, and explaining each of my changes to my coworkers was easier.
Now we are getting to the point where we are speed-running the deskilling of engineers into comprehension debt, and they themselves are rapidly losing confidence in reviewing code they did not write.
I think this blog post [0] is the best example of what could go entirely wrong, and it gets even worse when you do not know the technology.
If you cannot explain a change even when "the CI is green" or "all tests passing", I will immediately reject it.
Maybe great for vibe coding prototypes, but it all changes when that code is deployed onto mission critical systems. Just ask Amazon with Kiro. [1]
[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...
[1] https://www.reuters.com/business/retail-consumer/amazons-clo...
_wire_
"Even if it works?"
How do you verify that it works?
panchtatvam
You must accept AI code only if you deem yourself dumber than AI.
cadamsdotcom
If you reject AI code that works then your mindset is still too hands-on. Put another way - you still have some loops to work on taking yourself out of. The agent should’ve delivered code that was acceptable as a first pass.
Agents respond really well to feedback! They have no ego and they’ll happily improve code if told where and how. But you need to provide the tools that provide that feedback without your involvement - otherwise you can’t scale.
All the linting and autoformatting you can put in is a good start. Next, create custom scripts that check for every single dumb AI-ism you can think of, tell the agent about them, tell it to use them to check its work, and put them in hooks so the harness refuses to let the agent stop until all your linters show no errors.
Then, keep iterating basically forever. Any dumb AI-ism you see, make a linter for it, give it to the agent, and enforce it using the harness.
I’ve spent months doing this. When I review a PR - which was built by the agent with TDD so it definitely works - I’m no longer asking if it did dumb stuff, or confirming that it conformed to the architecture, or checking whether it duplicated code or missed opportunities for reuse. That’s all linted for. I don’t worry about duplication or outdated docstrings/comments because the self-review caught all that. I mostly read it to look for opportunities to make the feature even better & more useful.
If this makes no sense or you disagree it’s possible, my contact details are on my profile and I’ll be happy to give a demo.
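For anyone wondering what such a custom check might look like, here is a minimal sketch in Python. The three rules, the rule names, and the functions `find_ai_isms` and `exit_code_for` are illustrative assumptions, not the commenter's actual linters; a real setup would grow its own rule list over time.

```python
import re

# Hypothetical rules: each pairs a rule name with a regex for one "AI-ism".
RULES = [
    # Comments that narrate the edit instead of describing the code.
    ("change-narrating comment",
     re.compile(r"#\s*(Updated|Changed|Added|Removed)\b", re.IGNORECASE)),
    # Placeholders left behind instead of an implementation.
    ("placeholder left in code",
     re.compile(r"#\s*TODO: implement", re.IGNORECASE)),
    # Exceptions silently swallowed on a single line.
    ("swallowed exception",
     re.compile(r"^\s*except\s*(Exception)?\s*:\s*pass\s*$")),
]


def find_ai_isms(source: str) -> list[tuple[int, str]]:
    """Return (line_number, rule_name) for every line that matches a rule."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in RULES:
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


def exit_code_for(source: str) -> int:
    """Return 1 if anything was flagged, else 0, so a hook can block on it."""
    return 1 if find_ai_isms(source) else 0
```

The non-zero exit code is the part that matters for enforcement: assuming the harness in use can run a command before letting the agent finish and treats a failing command as "keep going", a script built around `exit_code_for` is what would keep the agent working until the findings are gone.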