The short leash AI coding method for beating Fable

Riseed

Comments (194)

sothatsit
This “short leash” seems like more of a crutch to me, and a sign of not giving the AI enough detail on the problem to begin with, or not reviewing and iterating on its output. Hand-holding great models like Fable through implementation is a waste of time, and a waste of Fable. You can have increasingly nuanced discussions with stronger models, and they write a lot better code than they used to. The process of discussing designs and their implementations, questioning things that look weird to you, and actually reading the AI’s responses also helps to find better solutions. For example, one time I wanted to write a greedy solver for a problem, and in my discussion with Opus on the idea it suggested using an existing MILP library to solve the problem exactly. I’d never even heard of MILP, but my final implementation ended up being better and simpler than what I’d have done alone.
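To illustrate why an exact solver can beat a greedy one, here is a minimal sketch. The comment does not say what the actual problem was, so the 0/1 knapsack instance, the item data, and the function names below are all hypothetical. Exhaustive search stands in for the MILP library here only to keep the example dependency-free; it is not MILP itself.

```python
from itertools import combinations

# Hypothetical items: (name, weight, value). The capacity is also made up.
ITEMS = [("a", 10, 60), ("b", 20, 100), ("c", 30, 120)]
CAPACITY = 50


def greedy_value(items, capacity):
    """Take items by best value-per-weight ratio while they still fit."""
    total_weight = 0
    total_value = 0
    by_ratio = sorted(items, key=lambda item: item[2] / item[1], reverse=True)
    for _name, weight, value in by_ratio:
        if total_weight + weight <= capacity:
            total_weight += weight
            total_value += value
    return total_value


def exact_value(items, capacity):
    """Try every subset; a stand-in for the answer an exact solver returns."""
    best = 0
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            if sum(weight for _, weight, _ in subset) <= capacity:
                best = max(best, sum(value for _, _, value in subset))
    return best


print(greedy_value(ITEMS, CAPACITY))  # greedy picks "a" and "b"
print(exact_value(ITEMS, CAPACITY))   # exact search picks "b" and "c"
```

On this instance the greedy pass locks in the item with the best ratio and then cannot fit the heaviest item, while the exact search finds the better pair. Exhaustive search only scales to a handful of items, which is where a real MILP library would take over.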
chewbacha
I did this for two weeks on a side project and still ended up in a situation where I did not have a mental model of the codebase. There’s no way to build that model without building it yourself. I’m more convinced than ever of this.
ramon156
One problem I have with "how to do X with AI" is that every situation is different. For example, I'm bumping Symfony projects from 3.1 to 8.1. There's a clear path here:
- Follow the written-up migration guides PER major version.
- Test all routes, authorised, etc. You can even hand-curate these tests. Some might return 200, some might return 302.
- Maybe optionally start with writing a safety net so you do not need to do these tests manually, e.g. a PHPStan baseline.
You're done when the routes are functionally working e2e as intended. You could even use snapshot testing here. I do not need to look at the AI here. I can review the code at the end, but I do not need to manually approve stuff here, hence safety features are off.
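The hand-curated route check described above could be sketched roughly like this. The route paths, the expected status codes, and the `check_routes` and `fake_fetch` names are assumptions made for illustration; a real setup would pass in a function that performs the HTTP request against the migrated application.

```python
from typing import Callable, Dict, List, Tuple

# Hand-curated expectations: path -> expected HTTP status (hypothetical routes).
EXPECTED_STATUS: Dict[str, int] = {
    "/": 200,
    "/login": 200,
    "/admin": 302,  # assumed: anonymous users get redirected
}


def check_routes(
    fetch_status: Callable[[str], int],
    expected: Dict[str, int],
) -> List[Tuple[str, int, int]]:
    """Return (path, expected, actual) for every route whose status differs."""
    mismatches = []
    for path, want in sorted(expected.items()):
        got = fetch_status(path)
        if got != want:
            mismatches.append((path, want, got))
    return mismatches


def fake_fetch(path: str) -> int:
    """Stand-in for a real HTTP client so the sketch runs without a server."""
    return {"/": 200, "/login": 200, "/admin": 500}.get(path, 404)


for path, want, got in check_routes(fake_fetch, EXPECTED_STATUS):
    print(f"{path}: expected {want}, got {got}")
```

Running the same table after each major-version bump gives a simple pass or fail signal, so the agent's work can be judged by the end result rather than step by step.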
jonplackett
I thought this was how everyone who can actually code uses AI for anything that’s actually important. Am I wrong? Are you guys just YOLOing everything these days?
fny
AI is a junior to mid-level engineer. If you treat it as such, you get the best of both vibe coding and rigorous engineering without all this paranoia. Since the very beginning I've run Claude from an isolated VM on yolo mode. This is just like giving an engineer their own laptop. Claude works on a feature up to a PR-worthy point. I review the diff, just like I would with another engineer, and massage it to get it in the right shape and move on. Inexperienced engineers make the same mistakes described. I've even seen rm -rf, albeit not from root! I would have lost my mind micromanaging someone with all permissions denied.
moezd
LLMs are still next-token predictors. Just because you can give one vaguer instructions and it still finds the right steps to follow doesn't mean it's intelligent. It means you're speaking the same language as the harness they trained your model on. And that has a limit. If you are stuck at PoC level or simple apps, you have no idea how limited the current models still are. There you really need to break tasks down, not just trust a token predictor to list steps that sound good. There has to be a human in the loop somewhere, because by the time you start skipping permissions, best case you hit the jackpot; more likely you get a suboptimal solution and wasted tokens; and, what's genuinely still terrifying, the model ignores instructions and does some stupid nonsense, ruining your day. It really is as sharp as a CNC machine. It's not not useful, but it could be dangerous, so maybe don't try to carve wood with a monster machine, or park your Ferrari in that cramped neighbourhood if you don't know how to parallel park.
andai
I call this method semiauto. The main benefit is keeping your mental model synchronized. The process becomes real-time instead of asynchronous, and active instead of passive. And you don't have to spend extra time catching up on the code later. You can also use much smaller, faster, cheaper models, because the scope always stays bite-sized.
ed_mercer
I feel like OP is still in the year 2025.
> The AI will have gone off the rails multiple times and you will only notice it later when you actually try to use the software.
Except that said AI can now use your software itself and find and fix bugs itself, not to mention drive new features.
> Your agent might go “off the rails” and start doing something you don’t want it to do
This happens, but far less often than it used to, and the case for fully autonomous agents is getting stronger, not weaker.
> It is humanly impossible to build your own understanding of a codebase
This again feels outdated. I think we're moving towards humans no longer needing to understand a codebase, and letting AI drive it.
entropyneur
This reminds me of the workflow I had a year ago. Miss Aider so much. Are there any good open source agents right now? Might be a good time to try one soon as Fable switches to token-based billing, which Code is designed to maximize.
fathermarz
I’m not sure I understand. Babysitting models is not a multiplier IMO. If you have done 1000s of turns your harness should get sharper and less likely to go off the rails. Also I find that on greenfield, babysitting is a must, but once you have established your house style of patterns, abstractions, and baselines, you can let any of them roam free 'cause they will look for examples before going forward. I agree with the sentiment though that if you let a swarm design and code your whole codebase, you will be lost in how it fits together. More feature bloat than code bloat though, from my experience.
sscaryterry
There really wasn't much substance to this article.
afro88
Maybe I'm too optimistic, but given appropriate skills and references (not just for writing but also reviewing) and intelligent use of subagents for isolated reviews and checks, you can lengthen the leash a bit. But you still need to properly review plans and PRs to keep a good mental model of the codebase. This effectively limits the number of tasks being done in parallel to maybe 2-3. Though you'll be mentally exhausted and probably start to make mistakes or take shortcuts in reviews yourself.
steezeburger
I find it hard to stay engaged doing this. I do get good results, but it's just hard to not get distracted when it's doing the work.
zmmmmm
To me a lot of the anti-short-leash sentiment is reflective of the low accountability SWEs have always had for their output. Software devs seem to strongly reject the concept that it isn't OK to ship defective products and fix them later. It will be interesting to see if it persists as incidents start to occur due to fully automated code.
nateburke
Seems like a common-sense approach. I appreciate the emphasis on understanding: humans will eventually be held accountable, and blaming Claude for an outage is not going to get Claude fired.
jwpapi
To me it’s even simpler than that. You use it for exploration and review. Not for writing code.
jdthedisciple
aka common sense for any seasoned software engineer
giancarlostoro
Here I thought this was about Fable the video game, then I remembered Anthropic's model got named Fable. It's going to be painful to google one of my favorite game series, just like googling "Rust server" does not give you Rust programming results, but Rust the video game results. I wish Google had fixed this problem long ago; it seems like something trivial for them to fix.
visarga
I thought it was going to be an even shorter leash: code autocomplete with smaller local models. That raises the level of interactivity and leads to better code knowledge.
bonsai_spool
I'm curious whether Opus 4.8 or similar can attain Mythos level through good system prompting and steering. You would expect this to work if it's true that the strength of Mythos is its unwillingness to quit before it gets a desired outcome.
heohk
They can generate stuff outside their training by consuming and regurgitating documentation. Thunkign
codyswann
Nothing I haven’t read 1,000 times before.
swader999
Fable already feels like it has a very good harness.
rybosworld
I'm convinced that even if/when ASI is achieved we will still have mediocre engineers writing blog posts about how they have uncovered the secrets to using these tools "effectively".
WhitneyLand
This post seems like some decent advice mixed in with a lot of overconfidence and unverifiable claims.
“expert developers whose skills have reached the point where they outclass any and all “frontier AI models” in their area of expertise”
Are any developers saying they outclass any and all frontier models? I’d say at best it’s mixed at this point. The best developers still do certain things better, but not even close to all things.
“The problem is that even code written and/or reviewed by Fable 5, will stink”
I’m skeptical. Example prompt and output please.
YuechenLi
I mean, the key is to stop trying to one-shot everything. The main problem I found with LLM code is that they always try to take the shortest possible path to the solution, so a lot of the time Codex would write code that meets the requirements of the prompt but misses something that causes it to not work in the non-ideal scenario. The solution for that is pretty easy too, it's just iteration: you describe the exact problem you have with the code and why it is not running correctly, and ask them to provide a narrow fix that addresses the bug. It's not that complicated.
hungryhobbit
I <3 how everyone and their brother feels qualified to write advice to hundreds? thousands? of other developers about AI ... based on a couple months of experience as a personal user. I mean, it's like writing a book about how to use React or Django or some other major software ... after you used it for one project for a month! Authors: I know this is the Internet, and I know bloggers blog about whatever pops into their head ... but if you are going to act like an authority, how about you learn more than the average reader before you start telling them authoritatively what to do?
agrippanux
I have found a different model should be used to do the review - like if Claude did the code, Codex should review. Models reviewing their own code is a recipe for disaster.
kissgyorgy
This is probably slower than writing the code yourself. Doesn't make sense to me. Using an agent without YOLO mode is not worth it. The way I'd rather do it is to tightly control the output with skills you write yourself, prompts, plans, etc., and get the closest possible outcome to what you would write yourself.
codemog
Why not just write the interfaces yourself and let the AI do the implementation at that point?
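One way to read this suggestion, as a minimal sketch: the human writes the interface and a small contract check, and the implementation is the part that gets delegated. The `KeyValueStore` interface, the `InMemoryStore` class, and `check_contract` are hypothetical names chosen only for illustration.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Interface written by hand; the name and methods are hypothetical."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class InMemoryStore:
    """One possible implementation, i.e. the part that would be delegated."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


def check_contract(store: KeyValueStore) -> bool:
    """Contract check owned by the interface author, not the implementer."""
    store.put("k", "v")
    return store.get("k") == "v" and store.get("missing") is None


print(check_contract(InMemoryStore()))
```

Because the interface and the contract check stay in human hands, any generated implementation can be swapped in and judged against the same check.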
8note
... fable on the restart seems to be more like opus and very turn limited? if you want to beat it, give it more turns before it has to "wrap up a session"
CamperBob2
FTA: Contrary to marketing statements made by certain CEOs, these models are not able to think beyond their training data.
The sheer cognitive dissonance needed to say something like that at a time when AI is delivering novel math proofs is... well, not actually impressive. Mostly, it's just sad. Some part of him must know such a statement is not true, or more properly, that it's meaningless. But he says it anyway, because he thinks it makes an impression of insight and erudition on the listener.
If you think what it does is brilliant, you're not ready (to use AI.)
At some point in one's journey to engineering enlightenment, one recognizes how rarely "brilliance" is actually called for, and indeed how counterproductive such self-judged "brilliance" often turns out to be in the long run. Clearly the author is still striving to reach this particular stage.
programmarchy
Good luck with that. I used to be an OCD freak about code before LLMs, but AI coding has largely freed me of that limitation. I've become very comfortable giving AI a long leash, but only after being meticulous about curating the context. These days I spend most of the day in discussions and planning, producing documentation, agonizing over architectural decisions, edge cases, and naming conventions. Once that's all settled I'll hand off implementation work to run overnight. In the morning, I'll review and fix, but I'm usually pleasantly surprised with the results. One pitfall is long leash without a curated context, which is more like "slot machine" coding. Usually not effective, and may have addictive effects since it does occasionally work. To spice things up lately, I've been encouraging the model to produce its own "capstone" -- a feature it decides to build on its own, however it wishes, with the tools at its disposal. So far it's been conservative, creating useful tools for development rather than customer-facing features, but I'm curious to dial up the temperature to see what it might come up with.
avereveard
Seems hella inefficient. A better method starts with realizing that everything every program does is data transformation and/or movement.
Then you ask the LLM to subdivide the data into a tree along the domain model, classifying streaming vs. storing nodes.
Then for each node you discuss the best data structure with the AI.
Then you ask for an interface that fully encapsulates the structure, where every mutation only goes from a valid state to a valid state and nothing else is allowed to touch the state.
And that's mostly it: just connect all the interfaces until the input goes to a monitor, to storage, to an API, or wherever the destination is.
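The "valid state to valid state" interface from the steps above might look something like the following sketch. The order domain, the `OrderState` values, and the transition table are invented for illustration and are not taken from the comment.

```python
from enum import Enum


class OrderState(Enum):
    NEW = "new"
    PAID = "paid"
    SHIPPED = "shipped"


# Allowed transitions (hypothetical domain rules): state -> reachable states.
_ALLOWED = {
    OrderState.NEW: {OrderState.PAID},
    OrderState.PAID: {OrderState.SHIPPED},
    OrderState.SHIPPED: set(),
}


class Order:
    """Encapsulates its state; callers can only mutate it via pay() and ship()."""

    def __init__(self) -> None:
        self._state = OrderState.NEW

    @property
    def state(self) -> OrderState:
        # Read-only view: there is no setter, so outside code cannot assign it.
        return self._state

    def _move(self, target: OrderState) -> None:
        # Reject any transition that is not listed in the table.
        if target not in _ALLOWED[self._state]:
            raise ValueError(
                f"cannot go from {self._state.value} to {target.value}"
            )
        self._state = target

    def pay(self) -> None:
        self._move(OrderState.PAID)

    def ship(self) -> None:
        self._move(OrderState.SHIPPED)


order = Order()
order.pay()
order.ship()
print(order.state.value)
```

Keeping the transition table in one place means an invalid mutation fails loudly at the interface instead of leaving the data in a broken state for the next node to consume.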