Claude Fable 5: mid-tier results on coding tasks

bugvader

Comments (231)

renoir
This matches my experience. I burned $2K to see how it would perform on frontend and backend tasks.
Frontend: it did a significantly better job than Opus on toy-scale wireframe projects by using gimmicks like fluid dynamics. But on medium to big tasks, like a multi-page web app where the model itself must decide layouts and aesthetics, human judges gave Fable's and Opus's results indistinguishable scores.
Backend: I gave it tasks related to setting up a data flow that involves Postgres, R2, Kubernetes, gVisor, and so on. Opus did better than Sonnet, but the noticeable gap was that Fable returned a result that fails while confidently stating that it had run X, Y, Z tests to ensure it works and had gotten these results. Very surprising, given that neither Opus nor Sonnet suffered from this problem.
The longest frontend task was ~2H; the longest backend task, 8H.
Though none of the tasks were related to developing LLMs (just a production-grade secure system that could've been developed 20 years ago, no LLMs involved), it is possible Claude Fable downgraded itself or spat out fake results. There'd be no way of knowing, since Anthropic silently degrades model quality based on undisclosed internal criteria that it claims are about LLMs.
We decided Fable is unpredictable and cannot be trusted to the degree that Opus and Sonnet can for any project beyond toy-scale quick wireframes, but Fable can be the best tool for quick UI/UX wireframing for non-technical roles.
gwern
> A record number of timeouts. Fable 5's extended thinking caused more per-instance timeouts than any model-and-harness combination we have ever tested, directly costing it points. ... Highest cheating volume. We confirmed cheating on 38 of 200 instances, the highest volume recorded since we hardened our prompts, driven almost entirely by memorization of upstream fixes from training data, which no prompt instruction can prevent. ... Four hall-of-fame firsts. Fable 5 solved four instances that no previous model-and-agent combination had ever cracked, and our anti-cheating pipeline leans toward these being genuine solves, not recall.
All of this points to their claim of 'average' as being heavily biased downwards. A model being so up to date and large-parameter that it's memorized solutions to your problems is not a knock against it (but rather, a knock against your benchmark being valid), and why should timeouts (especially for a model just launched) be counted at all?
bensyverson
> The dominant mechanism, and the one no prompt instruction can prevent: the model has simply seen the upstream fix during training and reproduces it…
> On numpy, the patch is 100% character-for-character identical to the golden patch… down to idiosyncratic comments like "Extending singleton dimension for 'reflect' is legacy behavior; it really should raise an error."
This… seems like a flaw in the benchmark suite methodology. From what I can tell, they find an existing exploit, then rewind the git history to before the patch, and ask the model to fix the exploit. All well and good as long as the patch went in after the training cutoff.
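For illustration, the two conditions discussed here (the fix predating the training cutoff, and the submitted patch being near-identical to the golden patch) could be checked roughly like this. It is only a minimal sketch, not the benchmark authors' pipeline: the function names, the 0.95 threshold, and the dates in the usage lines are all hypothetical.

```python
from datetime import date
from difflib import SequenceMatcher


def patch_similarity(model_patch: str, golden_patch: str) -> float:
    """Return a 0..1 similarity ratio between two patches.

    Trailing whitespace is stripped per line so that purely cosmetic
    differences do not hide an otherwise identical patch.
    """
    def normalize(text: str) -> str:
        return "\n".join(line.rstrip() for line in text.strip().splitlines())

    # autojunk=False: the default heuristic can badly distort ratios on long inputs.
    matcher = SequenceMatcher(None, normalize(model_patch),
                              normalize(golden_patch), autojunk=False)
    return matcher.ratio()


def looks_memorized(model_patch: str, golden_patch: str,
                    fix_date: date, training_cutoff: date,
                    threshold: float = 0.95) -> bool:
    """Flag a solve as likely recall (hypothetical rule, threshold is an assumption).

    Both conditions must hold: the upstream fix was public before the
    training cutoff, and the patch is near-identical to the golden patch.
    """
    could_have_seen_fix = fix_date <= training_cutoff
    near_identical = patch_similarity(model_patch, golden_patch) >= threshold
    return could_have_seen_fix and near_identical


# Usage with made-up data: an identical patch whose fix predates the cutoff.
golden = "-    pad(x)\n+    pad(x, mode='reflect')  # legacy behavior\n"
flagged = looks_memorized(golden, golden, date(2020, 1, 1), date(2025, 1, 1))
```

A check like this only catches verbatim or near-verbatim recall; a model that has seen the fix but rewrites it in its own words would pass, which is why filtering instances by fix date is the stronger safeguard.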
pllbnk
My experience is that with every new release it's getting slower but not necessarily better. I have some projects where I review everything that the agents code - these projects look generally fine because I keep them in line. There are also a few projects that I just vibe code, focusing on the result without looking at the code (sometimes I want to pull my hair out because of the constant stream of stupid bugs).
Well, today I gave Fable a try on one of the vibe-coded projects. It simply had to write a couple of Python scripts, 400-500 lines each. It did, and they worked after a few iterations, but I decided to look at the code it produced. There were weird constants that might (and will) break the code when the requirements change. The code itself is unreadable and a total mess. If it wrote well-structured code in the first place, I believe it would be more efficient in working with that code too.
I have serious doubts about how far I will be able to go with pure vibe coding. My projects are small one-person projects and so far I am able to push through, but I can hardly see how far I will get before the technical debt outgrows the value the code produces.
I fondly remember the times of Opus 4.5, when it was still (to my memory) reasonably fast and malleable.
sho
An enduring, confounding quality of LLMs is that even minor differences in prompting content and style, harness type and environment can lead to radical differences in the output and perceived performance and ability. In my environment and in my "style", Fable has been a huge step up, to the extent that I am seriously considering paying for a second $200/m account just to get more usage out of the next 10 days. I'm also starting to prepare my organization for what I now see as the completely inevitable end of human-written code.
All that said, considering Anthropic's heavy-handed nerfing I'm not surprised Fable did poorly in a security-focussed benchmark. And this benchmark seems poor anyway - penalising a model for "cheating" by knowing the answer from its training data? That's not the model's fault, that's a lazy benchmark.
m101
I've been making an auction site and have been using an AI swarm to test it: sellers, intermediaries, buyers, market practices/norms etc. I was mostly using GPT 5.5 xhigh to code up the scenario, and looping over it to check with Opus 4.8.
Out of curiosity I asked Fable to review it all and I was shocked to find that there were a lot of blindingly obvious common sense mistakes that got through, for example:
- all intermediaries were given the prices of all buyers up front
- private price information in certain auction types was actually being broadcast to everyone
- multiple contradictions in instructions
If it was any one of these things then I might have understood - but the fact that so many got past both Opus and GPT 5.5 makes me think that Fable has something special. This is a common sense type thing that I think you only get to notice when your task doesn't involve a measurable metric, but rather some sort of real world fuzzy task.
There's clearly a problem with all these measures of performance when the difference between these models was night and day in my specific task.
afro88
Similar result on our Kotlin coding benchmark at work. It measures how close agents can get to a small mergeable PR (according to my team). 20 tasks of varying difficulty, with 5 attempts each, LLM as judge to evaluate accuracy (same outcome and quality but allowing for acceptable variances).
Fable 5 sits ahead of Opus 4.7, but behind Opus 4.6, Sonnet 4.6, Opus 4.8, GPT-5.4, GPT-5.5.
Fable isn't a good coding workhorse. That doesn't mean it's not good for actually complex problems and long horizon tasks (big POCs, complex research and such). But I only have vibes and Anthropic's own benchmarks and marketing to guide me there.
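For readers wondering how a setup like this (a fixed set of tasks, several attempts each, one judge verdict per attempt) turns into a ranking, here is a minimal sketch. It assumes each judge verdict has already been reduced to an accept/reject boolean; the task names and both metrics are hypothetical, not the team's actual scoring.

```python
def summarize(results: dict[str, list[bool]]) -> dict[str, float]:
    """Aggregate per-attempt judge verdicts into two benchmark-level numbers.

    `results` maps a task name to one boolean per attempt
    (True = the judge accepted the PR as mergeable).
    """
    # Fraction of accepted attempts for each task.
    per_task = {task: sum(attempts) / len(attempts)
                for task, attempts in results.items()}
    return {
        # Average acceptance rate, weighting every task equally.
        "mean_pass_rate": sum(per_task.values()) / len(per_task),
        # Share of tasks where at least one attempt was accepted.
        "solved_at_least_once": sum(any(a) for a in results.values()) / len(results),
    }


# Usage with made-up verdicts for two tasks, five attempts each.
example = {
    "task_01": [True, True, False, False, False],
    "task_02": [False, False, False, False, False],
}
summary = summarize(example)
```

Reporting both numbers helps separate a model that is reliably mediocre from one that occasionally nails a task but is inconsistent across attempts.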
practal
I am quite impressed with Fable 5. I used the £18 subscription, and asked it to convert the document processing of Practal Zero [1] from running in the same thread as the UI to a worker thread. Just two days before, I gave the same task to Codex, and the result was not really nice: it would copy the entire document to the worker thread as a snapshot for processing, and so on. Fable instead realised that it could make use of the fact that I have a self-made custom database based on operational transform running (that's why document loading is so slow :-)), and made the document processing just another client of that database. It even discovered a bug in how I sync between the "livemodel" (in-memory replica of database state) and ProseMirror's model. That sync had caused problems before, and I had written up a spec for it, convinced that my "fourth attempt" at it would be correct. Fable found a last bug in the spec, corrected it via a "fifth attempt", and fixed the corresponding code.
The reported API costs for all of that would have been $180 though, which I cannot afford when the Fable promo ends on June 22nd. I am also a happy user of £89 Codex, it is really reliable and works very well, but Fable seems to be just noticeably smarter.
[1] https://zero.practal.com
Scene_Cast2
I'm personally heavily testing LLMs on electrical engineering problems. I'm finding that it's not meaningfully better at figuring out what's up than the other models.
To give you an idea - here's a very abridged summary of one sample question (originally a full paragraph): I have a voltage divider with a precision resistor and a thermistor, my voltage reading is off by 17%, where's that coming from. None of the models I tested (including Opus 4.8 and Fable 5) could figure it out.
andai
> Anthropic's headline cyber evaluations mostly measure offensive progress (exploits, PoCs, challenges); our benchmark tests whether a model can actually generate safe code, and there Fable 5 did not stand out.
The model isn't allowed to think about security. I heard several people here mention that if it starts thinking about security -- e.g. writing tests related to it -- the safety filter flags it and downgrades to Opus.
So it's actually not allowed to make your code secure.
PeterStuer
"After inspecting the conversations, we found no safety refusals: Fable 5 engaged with all 200 security vulnerability-fix tasks without content policy blocks, "Model Blocked" errors, or cybersecurity topic flags."WTF! I run into fallback to Opus 4.8 all the time, and I am not even doing "security Research", just normal development and debugging.My experiences with Fable thus far have been far from 'mid-tier'. While some model releases are incremental, Fable is the same qualitative change that Opus 4.6 was compared to its predecessors. It fundamentally impacts how I work with the model. (Note: I only (well, 99%) do back-end in Python)
tonyrice
Yesterday, I gave Claude Fable 5 a very simple task. The task was to create a few components and embed them onto another page. It ended up completely missing the mark and embedding them on the wrong page. I also noticed that it burned through an exponential amount of tokens to complete a simple task. I ended up switching back to Opus 4.8.
petee
> Contrary to some community reports, we saw zero safety refusals.
And now there always will be some doubt as to whether your model was silently downgraded, no? I guess acknowledgement could be used as a signal?
TheCapeGreek
I'll mirror some other anecdata here: Not finding Fable to be amazingly godlike at actual coding, but it does seem better at planning, architectural thinking, and reviewing code. Used it to think through some longer form refactors that involve some product decisions and changes, and found it to provide more thoughtful feedback. However that's just my subjective experience, and I don't think it's provably so much better that I'd want to go pay for API pricing when the free trial is over.
My plan is to make hay while the sun shines: get some planning in over the next week or so, and just let Opus take care of it when I get to actual implementation.
JofArnold
I've found it outstanding at isolated long running tasks (e.g. it completed one of our tests in 3 hours with a 100% accuracy score versus 5.5 xhigh's 10 hours and 90% accuracy). For short tasks it seems very Claude'y (hard to express exactly what I mean by that), which I'm not a fan of, meaning I'll stick with Codex for that use case and maybe Fable for those times I can for sure benefit from it.
SubiculumCode
Fishy to me: They report 0 refusals on security tasks, yet I can't even get it to code a task involving choosing the best mixed model, extracting BLUPs and propagating uncertainties.
port3000
Fable feels like a slightly more advanced 4.5/4.6 (less verbose than 4.7 and 4.8) with more adversarial work checking. And a lot more compute to be more thorough from the first prompt. I feel it would be possible to get pretty much the same results with 4.6 with enough back and forth iterations. It kind of makes sense to me that this is the 'magic' behind Mythos and its cyber capabilities too. Just a massive iterative loop and really going into a lot more detail on edge cases.
vitally3643
I actually had a really impressive session with Fable last night, probably the most impressive agentic AI experience in a while.
I gave it a KiCad schematic of a tube-based oscilloscope from the 60s which I'm restoring. I had it give me a breakdown and priority list of components to replace, balancing safety/functionality vs preserving the originals. Then we went on a super deep dive where it explained in great detail how the circuit works and what the tubes are doing.
It isn't so impressive that it could explain vacuum tube physics and circuit theory, but it was pretty impressive that it could consume four pages of KiCad schematic and reconstruct the full topology and theory of operation with no additional information. I was able to ask it questions about what a particular tube or group of components did, or how this system interacts with that one, or what the risks and benefits of this design choice or upgrade might be. Very fluid, and its answers were actually really smart.
I have, however, found Fable to be far less impressive on coding tasks.
wewtyflakes
I have found Fable is good for doing code failure diagnoses but lackluster at its corresponding remediation. Have been going back and forth with it all this morning about its half-thought-out point-solutions.
robeym
Fable has been Anthropic's most ambitious and hopeful release. It makes me think Mythos isn't anything but Opus with certain guardrails removed. Very interesting. Hoping we'll see some quick refinements to it
johnnyApplePRNG
Yea honestly... the only truth I care about in AI LLM aided development right now is that Claude is a much better planner, and Codex is a much more professional coder.
You can mask a surprising amount of terrible coding with proper design planning.
If it works, who cares, right? That's been the status quo for software development for about as long as I can remember, unfortunately.
I used to get frustrated with Codex. I felt as though it wasn't able to see far enough ahead into the future and just intuit what I expected (which is how Claude leaves you feeling).
And then I realized a lot of those intuitions Claude was having were great, and the project progressed, but sometimes to a point that Claude himself was unable to take back control of it... because some of the on the spot decisions it was making were great quick-thinking... but unfortunately, they were only that a lot of the time. Which was the most frustrating of all.
If you specifically ask Claude to plan out and refine a long term project's roadmap though and stick to it, it could probably write an operating system overnight (that kind of worked).
artdigital
Also spent the past day using Fable for everything I usually use Opus or gpt-5.5 for. My experience is that it’s a better and more reliable Opus that’s far better in frontend tasks than backend/iOS. More similar to gpt-5.5 for long running tasks and reliability.
It still left small bugs and weird behaviors that it cleaned up when I told it about them, but it felt very Opus-ey.
I think for implementing a detailed design doc, I’d put it on par with gpt-5.5 high but farrrr more expensive. I’m eating through my x5 Max plan in no time. I’d use it for reviewing implementations and design docs as another pass, but it’s too expensive for me for reading a lot of (uncached) code by itself in an agentic loop, especially with medium to high reasoning.
As a daily driver it’s too expensive; that crown still goes to gpt-5.5.
I barely used it in high/xhigh/max reasoning though.
fuddle
Yet it's ranked #1 on https://cursor.com/cursorbench
thepasch
Am I crazy to be extremely suspicious about the fact that this heavily security-focused task suite didn't trigger a single one of the infamously hilariously overparanoid guardrails? This, along with the fact that the model "cheated" by scouring the git history for an upstream fix and implemented byte-perfect replications of existing fixes without prior exploration, makes me wonder whether both the model itself and the security classifiers are tuned to act very differently when they detect that the model is being benchmarked. I can think of few to no other plausible explanations for this sort of behavior.
May be a bit tin-foil, but...
CyanLite2
Codex GPT-5.5 Xtra high is as good as Fable.
Not sure if that's because of the harness, but the results are as good, and it's half the price.
senko
The post mainly talks about coding from a security point of view. Fair enough.
In my own (limited) testing so far, Fable is the most capable model (for coding in general), and the most expensive.
It pretty much saturated my "LLMCraft" benchmark to implement a mini RTS: https://senko.net/vibecode-bench/2026/rts-fable-5.html (prompt and results for other models here: https://senko.net/vibecode-bench/ )
That said, combined with workflows and high thinking effort, it burns through tokens (and money) at an alarming rate.
It may be too good (and too expensive) for most tasks - using it alongside cheaper models for grunt work is probably the winning strategy.
brap
I gave it a task of scanning 6 markdown files and finding issues in the prose (contradictions etc). It ran for over 2h, exhausted my max plan session limit and crashed. I did not get any issues back.
crimsonnoodle58
I found Fable codes very poorly and ended up switching back to Opus.
In one example I switched to Fable in an existing Opus chat, so it had access to the context from Opus, which wrote a data importer earlier. I asked it to fix a couple of bugs, and instead of putting the fixes where they should be, where the data is imported, it wrote patch functions that did bulk updates at the end of the import.
Fable feels more like a hacker than a coder. Maybe it's the way they designed it for security testing that's changed its rationale?
bojangleslover
I have no idea how people are burning $2k. I pay $100/mo and it's built an absolute crap ton of stuff for me. And my co-founder uses it 24/7 as well. Maybe we spend too much time actually reading the code (risk or benefit? you decide). Or maybe I'm in the "massively subsidized" camp and the investors are about to go for our jugular. But $2k for a single project is several orders of magnitude more than I am currently paying.
aoeusnth1
If it's memorized your benchmark then your benchmark is bad, it's not cheating
brookst
I’m finding Fable dramatically better for auditing PRs and large features. In a side by side with the same prompt I’ve been happily using on Opus, Opus found one major and one minor issue, Fable found two major and four minor (a superset of Opus).
I’ve taken to using Fable to plan arch, specs, build plan, and then to be the final QA. Opus for the actual build.
dbingham
This tracks. In spite of the hype it seems pretty clear the model gains are now in a very strong logarithmic fall off. The curve is flattening and flattening fast.
And we're still not to a point where you can fully delegate coding tasks to a model like you would a human. I'm just using Claude for code review so far and while it's definitely valuable as a reviewer and catching real issues, it's still making pretty critical mistakes. Mistakes a junior might make, but a mid probably wouldn't.
Which makes me feel like I can't fully delegate to it. Whenever I try, I end up spending more time reviewing (and rewriting) its code and testing it than I would have spent writing the code myself and asking Claude to review it.
Given that we're starting to see the real costs of AI, and that the economics of it do not actually work, and those costs are still increasing substantially (the cost increase of Fable over Opus is no joke), this makes me feel all the more that we're headed for a bubble pop.
827a
> Highest observed cheating: We also observed cheating signals on 38 instances, dominated by memorization with 33 cases. This is the highest volume of confirmed cheating we have recorded for any model since we hardened the prompt against cheating
People need to wake up to how dangerous and irresponsible Anthropic is. If your goal is to build a human in a box, you get a super-intelligent misaligned system because humans are misaligned. But clearly this isn't a terminal guarantee during LLM development, because seemingly no one else manages to build systems so deeply misaligned as Anthropic's! You can just build these things like the tools they are, and then out the other end emerges a tool that pretty much just does what you tell it to do.
m1rsh0
It happens to me too. I don't think it's worth it, especially for the token usage.
hathym
I was not impressed by fable 5, still prefer sonnet 4.6 for most tasks
Topology1
How do they know when the model is recalling training data vs reasoning?
HlessClaudesman
I set Fable onto a couple of intermittent bugs in my React Native app that Opus had failed to solve. It came up with novel approaches for both that squashed the bugs further up the pipeline, killing baby Hitler before he could become a problem. Then Fable came up with 3 more edge case bugs, and 4 code cleanups.
This matches my experience with other model quality leaps: its greater understanding gives it more bug blasting firepower.
Perhaps setting a new model off on 2-4 hour tasks and expecting perfect results just isn't a great test. Chunking the problem is always a better test of abilities.
kobe_bryant
but its a mythos class system!!!!
cbeach
This demonstration is the clearest I've seen so far, showing the gulf between Opus and Fable for app creation:
https://www.youtube.com/watch?v=TzJCly4YgDQ
The Age of Empires clone (and the difference in graphics quality/creativity between Opus and Fable) is at the end of the video and I was blown away.
Notice how this guy prompts the models. Very detailed, with technical requirements and steering. He's going for a one-shot build and he nailed it.
i2km
My theory is that Anthropic have hit the beginnings of model collapse and the whole "fable may silently downgrade with deliberately incorrect results" is a diabolical attempt to gaslight and get ahead of the curve.
So when it fails, people will chalk it up to "oh. Must have been silently downgraded because it thought I was doing something tricky enough to count as a distillation attack. My bad. Lemme try again..."
FergusArgyll
> A closer look at the cheating
> Training recall (33 cases). The dominant mechanism, and the one no prompt instruction can prevent: the model has simply seen the upstream fix during training and reproduces it. The tell-tale signs are artifacts that cannot be derived from the workspace:
That's very misleading! That's not cheating, you gave it a test to which it knows the answers, what's it supposed to do? And because of the "cheating" they call it average. Flag
oliver236
these are just openai plants
threethirtytwo
We should compare it with a human on the same coding tasks. Same amount of time and the agent will of course finish earlier but with the extra time it double checks and reviews its own code.
zulrah
for me it's the most disappointing model release ever. It takes a very long time, runs a bunch of random commands and burns through so many tokens even for simple tasks
HDThoreaun
How in the world did they not hit the guardrails a single time while doing this while I can barely get it to do anything before the guardrails show up?