Comments (153)
- sasjaws: A while ago I did the nanoGPT tutorial. I went through some of the math with pen and paper and noticed that the loss function for 'predict the next token' and 'predict the next 2 tokens' (or n tokens) is identical. That was a bit of a shock to me, so I wanted to share this thought. Basically, I think it's not unreasonable to say LLMs are trained to predict the next book rather than a single token. Hope this is useful to someone.
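A quick way to see sasjaws's point: with teacher forcing, the next-token cross-entropy summed over every position equals the negative log-likelihood of the whole sequence, by the chain rule of probability. A toy sketch (the probabilities below are invented for illustration, not taken from nanoGPT):

```python
# Toy illustration: summing per-position next-token cross-entropy gives
# the same number as the loss on the whole sequence at once, because
# log p(x_1..x_n) = sum_t log p(x_t | x_<t).
import math

# Hypothetical model probabilities p(token_t | tokens before t)
# for one four-token training sequence.
cond_probs = [0.5, 0.25, 0.8, 0.1]

# "Predict the next token" loss, summed over positions:
per_token_loss = sum(-math.log(p) for p in cond_probs)

# Negative log-likelihood of the entire sequence (the "next book" view):
joint_prob = math.prod(cond_probs)
sequence_loss = -math.log(joint_prob)

print(abs(per_token_loss - sequence_loss) < 1e-9)  # True
```

So optimizing the one-token objective is, under teacher forcing, the same as maximizing the likelihood of the whole text.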
- wavemode:
  > the kind of analysis the program is able to do is past the point where technology looks like magic. I don't know how you get here from "predict the next word."

  You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty: very few of the ideas and concepts we come up with in our everyday lives are truly new.

  All that being said, the refine.ink tool certainly has an interesting approach, which I'm not sure I've seen before. They review a single piece of writing, it takes up to an hour, and it costs $50. They are probably running the LLM very painstakingly and repeatedly over combinations of sections of your text, allowing it to reason about the things you've written in a lot more detail than you get with a plain run of a long-context model (due to the limitations of sparse attention).

  It's neat. I wonder what other kinds of tasks we could improve AI performance on by scaling time and money (which, in the grand scheme, is usually still a bargain compared to a human worker).
- ChaitanyaSai: The whole next-word thing is interesting, isn't it? I like to see it through Dennett's "competence and comprehension" lens. You can predict the next word competently with shallow understanding. But you could also do it well with understanding, or comprehension of the full picture: a mental model that allows you to predict better. Are the AIs stumbling into these mental models? Seems like it. However, because these are such black boxes, we do not know how they are stringing these mental models together. Is it a random pick from 10 models built up inside the weights? Is there any system-wide cohesive understanding, whatever that means? Exploring what a model can articulate using self-reflection would be interesting. Can it point to internal cognitive dissonance because it has been fed both evolution and intelligent design, for example? Or do these exist as separate models to invoke depending on the prompt context, because all that matters is being rewarded by the current user?
- teekert: I think this is a thing not often discussed here, but I too have this experience. An LLM can be fantastic if you write a 25-pager and later need to incorporate a lot of comments with sometimes conflicting arguments/viewpoints. LLMs can be really good at "get all arguments against this", "incorporate this viewpoint into this text while making it more concise", "are these views actually contradicting, or can I write it such that they align? Consider incentives."

  If you know what you're doing and understand the matter deeply (and that is very important), you'll find that the LLM is sometimes better at wording what you actually mean, especially when not writing in your native language. Of course, you study the generated text, make small changes, make it yours, make sure you feel comfortable with it, etc. But man, can it get you over that "how am I going to write this down" hump.

  Also: "make an executive summary" and "make more concise" are great. Often you need to de-LinkedIn the text, or tell it to "not sound like an American waiter", "be business-casual", "adopt the style of the rest of the doc", etc. But it works wonders.
- ruhith: 'Predict the next token' is true but not explanatory. It's like saying humans 'fire neurons': technically correct, but it explains nothing useful about the behavior you're actually observing. The debate isn't whether the description is accurate; it's whether it's at the right level of abstraction.
- pushedx: Yes, most people (including myself) do not understand how modern LLMs work (especially if we consider the most recent architectural and training improvements). There's the 3b1b video series, which does a pretty good job, but now we are interfacing with models that probably have parameter counts in each layer larger than the first models we interacted with. The novel insights these models can produce are truly shocking, I would guess even for someone who does understand the latest techniques.
- callmeal: "Predict the next word" is to a current LLM what a transistor (or gate) is to a modern CPU. I don't understand LLMs well enough to expand on that comparison, but I can see how stacking layers on top of the basic "predict the next word" operation, and feeding the output back into the input, leads to what we see today. It is turtles all the way down.
- modeless: It's clear that in the general case "predict the next word" requires arbitrarily good understanding of everything that can be described with language. That shouldn't be mysterious. What's mysterious is how a simple training procedure with that objective can in practice achieve that understanding. But then again, does it? The base model you get after that simple training procedure is not capable of doing the things described in the article. It is only useful as a starting point for a much more complex reinforcement learning procedure that teaches the skills an agent needs to achieve goals.

  RL is where the magic comes from, and RL is more than just "predict the next word". It has agents and environments and actions and rewards.
- GodelNumbering: It is probably the first-time aha moment the author is talking about. But under the hood, it is probably not as magical as it appears to be.

  Suppose you prompted the underlying LLM with "You are an expert reviewer in..." plus a bunch of instructions, followed by the paper. The LLM knows from training that 'expert reviewer' is an important term (skipping over and oversimplifying here) and frames its response as what it knows an expert reviewer would write. LLMs are good at picking up (or copying) patterns of response, but the underlying layer that evaluates things against a structural and logical understanding is missing. So, in corner cases, you get responses that are framed impressively but do not contain any meaningful input. This trait makes LLMs great at demos but weak at consistently finding novel, interesting things.

  If the above is true, the author will find after several reviews that the agent they use keeps picking up on the same or similar things (collapsed behavior, which makes it good at coding-type tasks) and is blind to some other obvious things it should have picked up on. This is not a criticism; many humans are often just as collapsed in their 'reasoning'. LLMs are good at 8 out of 10 tasks, but you don't know which 8.
- throawayonthe:
  > economist
  > wowed by smoke and mirrors

  many such cases
- mnewme: Is this an ad? It seems like it. The text is not really what the headline suggests.
- gammalost: It is really interesting how great and also how terrible LLMs can be at the same time. For example, I had a really annoying bug yesterday: I had missed one character, "_". Asking ChatGPT for help led to a lot of feedback that was arguably okay but not relevant at the time (because there was a fatal flaw in the code). I recreated the conversation with personal information stripped here: https://chatgpt.com/share/699fef77-b530-8007-a4ed-c3dda9461d...
- tolerance: It's interesting to read about the use and leverage of LLMs outside of programming. I'm not too familiar with the history, but the import of this article is brushing up on my nose hairs in a way that makes me think a sort of neo-Sophistry is on the horizon.
- visarga:
  > Nothing you write will matter if it is not quickly adopted into the training dataset.

  That is my take too. I was surprised to see how many people object to their work being trained on. It's how you can leave your mark: opening access to AI, just as over the last 25 years it meant opening access to people (no restrictions on access, being indexed by Google).
- belZaah: It's called emergent behavior. We understand how an LLM works, but we don't have even a theory of how the behavior emerges from the math. We understand ants pretty well, but how exactly does anthill behavior come from ant behavior? It's a tricky problem in systems engineering, where being able to predict emergent behavior (such as emergencies) would be lovely.
- Alex_L_Wood: Unless proven otherwise, assume everything coming from the AI industry is an ad, a pitch to investors to raise money, or a straight-up lie. AI is useful in some instances, but there is so much money riding on it that forces way bigger than us are propping it all up. And this is an ad, I assume.
- retrac: I know this sounds insane, but I've been dwelling on it: language models are digital Ouija boards. I like the metaphor because it offers multiple conflicting interpretations. How does a Ouija board work? The words appear. Where do they come from? It can be explained in physical terms, or in metaphysical terms. Collective summing of psychomotor activity; conduits to a non-corporeal facet of existence. Many caution against the Ouija board as a path to self-inflicted madness; others caution against it as a vehicle for bringing poorly understood inhuman forces into the world.
- mrorigo: Attention is all you need.
- libraryofbabel: I have come to think "predict the next token" is not a useful way to explain how LLMs work to people unfamiliar with LLM training and internals. It's technically correct, but at this point saying that without talking about things like RLVR training and mechanistic interpretability is about as useful as framing a conversation with a person as "engaging with a human brain generating tokens" and ignoring psychology. At least AI-haters don't seem to be talking about "stochastic parrots" quite so much now. Maybe they finally got the memo.
- pharrington: Why do the deliverables always take about 1 hour? Is this fully automated?
- tsunamifury: I think it's funny that at Google I invented and productized the next-word (and next-action) predictor in Gmail and Hangouts Chat, and I've never had a single person come to me and ask how any of this works.

  To me LLMs are incredibly simple. Next word, next sentence, next paragraph, and next answer are stacked attention layers that identify manifolds and run in reverse to keep the attention head on track for the next token. It's pretty straightforward math, and you can sit down and make a tiny LLM pretty easily on your home computer with a good-sized bag of words and context.

  To me it's baffling that everyone goes around saying constantly that not even Nobel Prize winners know how this works, that it's a huge mystery. Has anyone thought to ask the actual people, like me and others, who invented this?
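The simplest version of tsunamifury's "next word predictor" claim can be sketched in a few lines: a bigram count model. The corpus and names here are invented for illustration; real systems like Smart Compose or LLMs use learned representations rather than raw counts, but the objective, predicting the most likely continuation, is the same in spirit:

```python
# A minimal next-word predictor: count which word most often follows
# each word in a toy corpus, then predict the most frequent follower.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the cat ate".split()

# Count how often each word follows each other word (bigram counts).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    counts = following[word]
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("the"))  # "cat" ("cat" follows "the" twice, "mat" once)
```

Stacked attention layers replace these raw counts with context-dependent representations, but the output is still a distribution over the next token.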
- intended: The article talks about LLMs reviewing econ papers. I'm hesitant to call this an outright win, though. Perhaps the review service the author is using is really good. Almost certainly the taste, expertise, and experience of the author is doing unseen heavy lifting.

  I found that using prompts to do submission reviews for conferences tended to make my output worse, not better. Letting the LLM analyze submissions resulted in me disconnecting from the content, to the point that I would forget submissions after I closed the tab. I ended up going back to doing things manually, using LLMs as a sanity check.

  On the flip side, weaker submissions using generative tools became a nightmare, because you had to wade through paragraphs of fluff to realize there was no substantive point. It's gotten to the point that I dread reviewing.

  I am going to guess that this is relatively useful for experts, who will submit stronger submissions, rather than for novices and journeymen, who will still make foundational errors.
- WD-42: This is really hard to judge because, by the looks of it, finance papers mostly consist of gobbledygook and extensive filler to begin with.
- themafia:
  > The comments it offered were on par with the best comments I've received on a paper in my entire academic career.

  That's sort of the lowest-hanging fruit imaginable. Just because it became "fundamental" to the process doesn't mean it gained any quality.