Comments (56)
- socketcluster
For me, Claude Code was the most impressive innovation this year. Cursor was a good proof of concept, but Claude Code is the tool that actually got me to use LLMs for coding.

The kind of code that Claude produces looks almost exactly like the code I would write myself. It's like it's reading my mind. This is a game changer because I can maintain the code that Claude produces.

With Claude Code, there are no surprises. I can pretty much guess what its code will look like 90% to 95% of the time, but it writes it a lot faster than I could. This is an amazing innovation.

Gemini is quite impressive as well. Nano banana in particular is very useful for graphic design.

I haven't tried Gemini for coding yet, but TBH, Claude Code does such a great job that if I could code any faster, I would get decision fatigue. I don't like rushing into architecture or UX decisions. I like to sit on certain decisions for a day or two before starting implementation. Once you start in a particular direction, it's hard to undo, and you may double down on the mistake due to the sunk cost fallacy. I try hard to avoid that.
- andai
The bit about o3 being the turning point is very interesting. I heard someone say that o3 (or perhaps the cheaper o4-mini) should have been called gpt-5, and that people would have been mind-blown. Instead it kind of went under the radar as far as the mainstream goes.

Whereas with gpt-5 we just got incremental progress, and it was very underwhelming. (Plus like 5 other issues at launch, but that's a separate story ;)

I'm not sure o4-mini would have made a good default gpt though. (Most use is conversational, and its language is very awkward.) So they could have just called it gpt-5 pro or something, and put it on the $20 tier. I don't know.
- augment_me
I noticed that despite really liking Karpathy and the blog, I am kind of wincing/involuntarily reacting to the LLM-like "It's not X, it's Y" phrases:

> it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer

> it's not just about the image generation itself, it's about the joint capability coming from text generation

There would have been no reaction from me to this 3 years ago, but now this sentence structure is ruined for me.
- thoughtpeddler
I appreciate Andrej's optimistic spirit, and I am grateful that he dedicates so much of his time to educating the wider public about AI/LLMs. That said, it would be great to hear his perspective on how 2025 changed the concentration of power in the industry, what's happening with open source, local inference, hardware constraints, etc. For example, he characterizes Claude Code as "running on your computer", but no, it's just the TUI that runs locally, with inference in the cloud. The reader is left to wonder how that might evolve in 2026 and beyond.
- dandelionv1bes
Something I've been thinking about is how, as end-stage users (e.g. building our own "thing" on top of an LLM), we can broadly verify it's doing what we need without benchmarks. Does a set of custom evals built out over time solve this? Is there more we can do?
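(For concreteness, here's a minimal sketch of the kind of custom eval set the comment describes: task-specific checks run against whatever LLM call your product makes, with the pass rate tracked over time instead of a public benchmark score. `ask_llm` and the cases are hypothetical placeholders, not a real API.)

```python
# Minimal custom-eval harness sketch. ask_llm and EVAL_CASES are
# hypothetical placeholders for your product's real LLM call and
# real domain inputs.

def ask_llm(prompt: str) -> str:
    """Stand-in for the actual model call your product makes."""
    raise NotImplementedError

# Each case pairs a realistic input with a cheap, verifiable check.
EVAL_CASES = [
    {"prompt": "Extract the total from: 'Total due: $1,234.50'",
     "check": lambda out: "1234.50" in out.replace(",", "")},
    {"prompt": "Reply with valid JSON listing the user's name and email: ...",
     "check": lambda out: out.strip().startswith("{")},
]

def run_evals() -> float:
    """Return the fraction of cases whose check passes."""
    passed = 0
    for case in EVAL_CASES:
        try:
            if case["check"](ask_llm(case["prompt"])):
                passed += 1
        except Exception:
            pass  # an exception counts as a failed case
    return passed / len(EVAL_CASES)

if __name__ == "__main__":
    print(f"pass rate: {run_evals():.0%}")
```

Rerunning this after every prompt or model change gives a regression signal grounded in your own domain, which is about the closest an end-stage user can get to verification without benchmarks.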
- starchild3001
The distinction Karpathy draws between "growing animals" and "summoning ghosts" via RLVR is the mental model I didn't know I needed to explain the current state of jagged intelligence. It perfectly articulates why trust in benchmarks is collapsing: we aren't creating generally adaptive survivors, but rather over-optimizing specific pockets of the embedding space against verifiable rewards.

I'm also sold on his take on "vibe coding" leading to ephemeral software; the idea of spinning up a custom, one-off tokenizer or app just to debug a single issue, and then deleting it, feels like a real shift.
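(As an illustration of the kind of ephemeral, one-off tooling being described, here's a hypothetical throwaway script for inspecting a tokenization issue; it assumes the tiktoken package is installed, and the point is that it gets written, run once, and deleted.)

```python
# Throwaway debugging script: show exactly how a tokenizer splits a
# string that's misbehaving downstream. Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "café, naïve, 日本語"

for tok in enc.encode(text):
    # Print each token id alongside the raw bytes it decodes to,
    # to spot where multi-byte characters get split across tokens.
    print(tok, enc.decode_single_token_bytes(tok))
```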
- jkubicek
> In the same way, LLMs should speak to us in our favored format - in images, infographics, slides, whiteboards, animations/videos, web apps, etc.

If you think every Electron app out there re-inventing application UX from scratch is bad, wait until LLMs are generating their own custom UX for every single action, for every user, on every device. What does command-W do in this app? It's literally impossible to predict; try it and see!
- mips_avatar
I would love Andrej's take on the fast models we got this year. Gemini 3 Flash and Grok 4 Fast have no business being as good + cheap + fast as they are. For Andrej's prediction about LLMs communicating with us via a visual interface, we're going to need fast models, but I feel like AI Twitter/HN has mostly ignored these.
- victorbuilds
Notable omission: 2025 is also when the ghosts started haunting the training data. Half of X replies are now LLMs responding to LLMs. The call is coming from inside the dataset.
- mvkel
> In this world view, nano banana is a first early hint of what that might look like.

What is he referring to here? Is nano banana not just an image-gen model? Is it because it's an LLM-based one rather than diffusion?
- delichon
> I like this version of the meme for pointing out that human intelligence is also jagged in its own different way.

The idea of jaggedicity seems useful for advancing epistemology. If we could identify the domains that have useful data we fail to extract, we could fill those holes and eventually become a general intelligence ourselves. The task may be as hard as making a list of your own blind spots. But now we have an alien intelligence with an outside perspective. While we make AI less jagged, it might return the favor.

If we keep inventing different kinds of intelligence, the sum of the splats may eventually become well rounded.
- nkko
Beyond graduating students, I see model labs as "accelerators/incubators" bundling, launching, and productizing observed ideas that gain traction. The sheer strength of their platforms, the number of eyes watching them, near-zero marginal costs, and seemingly unlimited budgets mean that only slow decision-making can prevent them from becoming the next Amazons of everything.
- TheAceOfHearts
I think one of the things missing from this post is an attempt to answer: what are the highest-priority AI-related problems the industry should tackle?

Karpathy hints at one major capability unlock being UI generation, so that instead of interacting with text, the AI can present different interfaces depending on the kind of problem. That seems like a severely underexplored problem domain so far. Who are the key figures innovating in this space?

In the most recent Demis interview, he suggests that one of the key problems that must be solved is online / continuous learning.

Aside from that, another major issue is probably reducing hallucinations and increasing reliability. Ideally you should be able to deploy an LLM to work on a problem domain, and if it encounters an unexpected scenario, it reaches out to you to figure out what to do. But for standard problems it should function reliably 100% of the time.
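(A minimal sketch of the "reach out when unsure" pattern described above, under the assumption that the model is asked to self-report confidence in a structured reply; `call_model` and `notify_human` are hypothetical placeholders, and self-reported confidence is of course only a weak proxy for reliability.)

```python
# Hypothetical human-in-the-loop escalation wrapper. call_model and
# notify_human are placeholders, not a real API; the model is asked to
# flag answers it isn't sure about so they route to a person.
import json

def call_model(prompt: str) -> str:
    """Stand-in for the actual LLM call."""
    raise NotImplementedError

def notify_human(question: str) -> None:
    print(f"[escalated to a human] {question}")

def answer(question: str) -> str | None:
    prompt = (
        "Answer the question. Reply with JSON only: "
        '{"answer": "...", "confident": true or false}\n\n' + question
    )
    try:
        reply = json.loads(call_model(prompt))
    except Exception:
        notify_human(question)  # unparseable output is itself an unexpected scenario
        return None
    if not reply.get("confident", False):
        notify_human(question)  # the model flagged this as outside its depth
        return None
    return reply["answer"]
```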
- alexgotoi
LLMs still need to bring clear added value to enterprise and corporate work; otherwise, they remain a geek's toy.

Big media agencies that claim to use AI rely on strong creative teams who fine-tune prompts and spend weeks doing so. Even then, they don't fully trust AI to slice long videos into shorter clips for social media.

Heavy administrative functions like HR or Finance still don't get approval to expose any of their data to LLMs.

What I'm trying to say is that we are still in the early stages of LLM development, and as promising as this looks, it's still far from delivering the real value that is often claimed.
- ausbah
tl;dr: it seems like LLMs are maturing on the product side and for day-to-day usage.
- bgwalter
Vibe coding is sufficient for job hoppers who never finish anything and leave when the last 20% has to be figured out. Much easier to promote oneself as an expert and leave the hard parts to other people.