
Comments (228)

  • voxleone
    I'd say with confidence: we're living in the early days. AI has made jaw-dropping progress in two major domains: language and vision. With large language models (LLMs) like GPT-4 and Claude, and vision models like CLIP and DALL·E, we've seen machines that can generate poetry, write code, describe photos, and even hold eerily humanlike conversations. But as impressive as this is, it's easy to lose sight of the bigger picture: we've only scratched the surface of what artificial intelligence could be, because we've only scaled two modalities: text and images. That's like saying we've modeled human intelligence by mastering reading and eyesight while ignoring touch, taste, smell, motion, memory, emotion, and everything else that makes our cognition rich, embodied, and contextual. Human intelligence is multimodal. We make sense of the world through: touch (the texture of a surface, the feedback of pressure, the warmth of skin); smell and taste (deeply tied to memory, danger, pleasure, and even creativity); proprioception (the sense of where your body is in space, how you move and balance); and emotional and internal states (hunger, pain, comfort, fear, motivation). None of these are captured by current LLMs or vision transformers. Not even close. And yet our cognitive lives depend on them. Language and vision are just the beginning, the parts we were able to digitize first, not necessarily the most central to intelligence. The real frontier of AI lies in the messy, rich, sensory world where people live. We'll need new hardware (sensors), new data representations (beyond tokens), and new ways to train models that grow understanding from experience, not just patterns.
  • tippytippytango
    Sometimes we get confused by the difference between technological and scientific progress. When science makes progress, it unlocks new S-curves that advance at an incredible pace until you get into the diminishing-returns region. People complain of slowing progress, but it was always slow; you just didn't notice that nothing new was happening during the exponential take-off of the S-curve, just furious optimization.
  • EternalFury
    What John Carmack is exploring is pretty revealing. Train models to play 2D video games to a superhuman level, then ask them to play a level they have not seen before, or another 2D video game they have not seen before. The transfer function is negative. So, by my definition, no intelligence has been developed, only expertise in a narrow set of tasks. It's apparently much easier to scare the masses with visions of ASI than to build a general intelligence that can pick up a new 2D video game faster than a human being.
  • jschveibinz
    I will respectfully disagree. All "new" ideas come from old ideas. AI is a tool to access old ideas with speed and with new perspectives that haven't been available until now. Innovation is in the cracks: recognition of holes, intersections, tangents, etc. in old ideas. It has been said that innovation is done on the shoulders of giants. So can AI be an express elevator up to an army of giants' shoulders? It all depends on how you use the tools.
  • strangescript
    If you work with model architecture and read papers, how could you not know there is a flood of new ideas? Only a few yield interesting results, though. I kind of wonder if libraries like PyTorch have hurt experimental development. So many basic concepts no one thinks about anymore because they just use the out-of-the-box solutions. And maybe those solutions are great and those parts are "solved", but I am not sure. How many models are using someone else's tokenizer, or someone else's strapped-on vision model, just to check a box in the model card?
  • kogus
    To be fair, if you imagine a system that successfully reproduced human intelligence, then 'changing datasets' would probably be a fair summary of what it would take to have different models. After all, our own memories, training, education, background, etc. are a very large component of our own problem-solving abilities.
  • LarsDu88
    If datasets are what we are talking about, I'd like to bring attention to the biological datasets out there that have yet to be fully harnessed. The ability to collect gene expression data at a tissue-specific level has only been invented and automated in the last 4-5 years (see 10X Genomics Xenium, MERFISH). We've only recently figured out how to collect this data at the scale of millions of cells. A breakthrough on this front may be the next big area of advancement.
  • cadamsdotcom
    What about actively obtained data, models seeking data rather than being fed? Human babies put things in their mouths, they try to stand and fall over. They "do stuff" to learn what works. Right now we're just telling models what works. What about simulation: models can make 3D objects, so why not give them a physics simulator? We have amazing high-fidelity (and low-cost!) game engines that would be a great building block. What about rumination: behind every Cursor rule, for example, is a whole story of why a user added it. Why not take the rule, ask a reasoning model to hypothesize about why that rule was created, and add that rumination (along with the rule) to the training data (see the sketch below)? Providing opportunities to reflect on the choices made by their users might deepen any insights, squeezing more juice out of the data.
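    A minimal sketch of that rumination idea, assuming only some generic chat-completion callable; the function names, prompt, and data shape here are illustrative, not any particular product's API:

      # Hypothetical: augment each user-written rule with a model-generated
      # "rumination" about why the rule likely exists, then keep both as a
      # training example.
      from dataclasses import dataclass

      @dataclass
      class TrainingExample:
          rule: str
          rumination: str

      def ruminate(rule: str, ask_model) -> TrainingExample:
          # ask_model: any callable mapping a prompt string to a completion string
          prompt = (
              "A user added this rule to their coding assistant:\n"
              f"  {rule}\n"
              "Hypothesize, step by step, what problem or preference "
              "likely motivated this rule."
          )
          return TrainingExample(rule=rule, rumination=ask_model(prompt))

      # Usage with a stubbed model, just to show the shape of the output:
      fake_model = lambda p: "The user was probably bitten by inconsistent formatting before."
      print(ruminate("Always run the formatter before committing.", fake_model))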
  • ctoth
    Reinforcement learning from self-play/AlphaWhatever? Nah must just be datasets. :)
  • NetRunnerSu
    Because an externally injected loss function will hollow out the model's brain. Models need to decide for themselves what they should learn. Eventually, after entering the open world, reinforcement learning and genetic algorithms are still the only perpetual training solution. https://github.com/dmf-archive/PILF
  • piinbinary
    AI training is currently a process of making the AI remember the dataset. It doesn't involve the AI thinking about the dataset and drawing (and remembering) conclusions.It can probably remember more facts about a topic than a PhD in that topic, but the PhD will be better at thinking about that topic.
  • Daisywh
    If we're serious about data being more important than models, then where are the ISO-style standards for dataset quality? We have so many model metrics, but almost nothing standardized for data integrity or reproducibility.
  • somebodythere
    I don't know if it matters. Even if the best we can do is get really good at interpolating between solutions to cognitive tasks on the data manifold, the only economically useful human labor left asymptotes toward frontier work; work that only a single-digit percentage of people can actually perform.
  • Leon_25
    At Axon, we see the same pattern: data quality and diversity make a bigger difference than architecture tweaks. Whether it's AI for logistics or enterprise automation, real progress comes when we unlock new, structured datasets, not when we chase “smarter” models on stale inputs.
  • seydor
    There are new ideas: people are finding new ways to build vision models, which are then applied to language models and vice versa (like diffusion). The original idea of connectionism is that neural networks can represent any function, which is the fundamental mathematical fact. So we should be optimistic: neural nets will be able to do anything. Which neural nets? So far people have stumbled on a few productive architectures, but it appears to be more alchemy than science. There is no reason why we should think there won't be both new ideas and new data. Biology did it; humans will do it too.
    > we're engaged in a decentralized globalized exercise of Science, where findings are shared openly
    Maybe the findings are shared, if they make the Company look good. But the methods are not, anymore.
  • Kapura
    Here's an idea: make the AIs consistent at doing things computers are good at. Here's an anecdote from a friend who's living in Japan:
    > i used chatgpt for the first time today and have some lite rage if you wanna hear it. tldr it wasnt correct. i thought of one simple task that it should be good at and it couldnt do that.
    > (The kangxi radicals are neatly in order in unicode so you can just ++ thru em. The cjks are not. I couldnt see any clear mapping so i asked gpt to do it. Big mess i had to untangle manually anyway it woulda been faster to look them up by hand (theres 214))
    > The big kicker was like, it gave me 213. And i was like, "why is one missing?" Then i put it back in and said count how many numbers are here and it said 214, and there just werent. Like come on you SHOULD be able to count.
    If you can make the language models actually interface with what we've been able to do with computers for decades, I imagine many paths open up.
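    For context on the "just ++ thru em" remark: the Kangxi Radicals block runs from U+2F00 through U+2FD5, 214 code points in radical order, so enumerating them is a trivial loop (the hard, non-contiguous part is mapping each radical to its CJK unified ideograph, which is what the friend actually wanted from the model). A small Python check, for illustration:

      import unicodedata

      # The 214 Kangxi radicals sit contiguously at U+2F00..U+2FD5.
      radicals = [chr(cp) for cp in range(0x2F00, 0x2FD5 + 1)]

      print(len(radicals))                   # 214
      print(unicodedata.name(radicals[0]))   # KANGXI RADICAL ONE
      print(unicodedata.name(radicals[-1]))  # KANGXI RADICAL FLUTE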
  • AbstractH24
    Imagine if the original Moore's law had tracked how often CPUs doubled the number of semiconductors while still functioning properly only 50% of the time. I don't think it would have had the same impact.
  • mikewarot
    Hardware isn't even close to being out of steam. There are some breathtakingly obvious premature optimizations that we can undo to get at least a 99% power reduction for the same amount of compute. For example, FPGAs use a lot of area and power routing signals across the chip. Those long lines have a large capacitance, and thus cause a large amount of dynamic power loss. So does moving parameters around to/from RAM instead of just loading up a vast array of LUTs with the values once.
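    For the capacitance point, the textbook first-order model of dynamic (switching) power makes the scaling explicit; this is standard CMOS background, not anything specific to the comment's proposal:

      P_{\text{dyn}} = \alpha \, C \, V_{dd}^{2} \, f

    where \alpha is the switching activity factor, C the switched capacitance, V_{dd} the supply voltage, and f the clock frequency. Long routing lines raise C, and dynamic power scales linearly with it.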
  • tim333
    An interesting step forward that we seem close to, although it's an old idea, is recursive self-improvement: get the AI to make a modified version of itself to try to think better.
  • sakex
    There are new things being tested and yielding results monthly in modelling. We've deviated quite a bit from the original multi-head attention.
  • lossolo
    I wrote about it around a year ago here:
    "There weren't really any advancements from around 2018. The majority of the 'advancements' were in the amount of parameters, training data, and its applications. What was the GPT-3 to ChatGPT transition? It involved fine-tuning, using specifically crafted training data. What changed from GPT-3 to GPT-4? It was the increase in the number of parameters, improved training data, and the addition of another modality. From GPT-4 to GPT-4o? There was more optimization and the introduction of a new modality. The only thing left that could further improve models is to add one more modality, which could be video or other sensory inputs, along with some optimization and more parameters. We are approaching diminishing returns." [1]
    10 months ago, around the o1 release:
    "It's because there is nothing novel here from an architectural point of view. Again, the secret sauce is only in the training data. o1 seems like a variant of RLRF (https://arxiv.org/abs/2403.14238). Soon you will see similar models from competitors." [2]
    Winter is coming.
    1. https://news.ycombinator.com/item?id=40624112
    2. https://news.ycombinator.com/item?id=41526039
  • tantalor
    > If data is the only thing that matters, why are 95% of people working on new methods?Because new methods unlock access to new datasets.Edit: Oh I see this was a rhetorical question answered in the next paragraph. D'oh
  • b0a04gl
    If datasets are the new codebases, then the real IP may be dataset version control: how you fork, diff, merge, and audit datasets like code. Every team says 'we trained on 10B tokens', but what if we could answer 'which 5M tokens made reasoning better' and 'which 100k made it worse'? Then we could start applying targeted leverage (see the sketch below).
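    A minimal sketch of what "diff datasets like code" could mean, using a content hash per example so two dataset versions can be compared roughly the way git compares trees of blobs; the scheme and names are illustrative, not an existing tool:

      import hashlib
      import json

      def manifest(examples):
          # Map sha256(canonical JSON) -> example, giving each sample a stable identity.
          return {
              hashlib.sha256(json.dumps(ex, sort_keys=True).encode()).hexdigest(): ex
              for ex in examples
          }

      def diff(old, new):
          # Return (added, removed) example hashes between two dataset versions.
          return set(new) - set(old), set(old) - set(new)

      v1 = manifest([{"text": "the sky is blue"}, {"text": "2 + 2 = 4"}])
      v2 = manifest([{"text": "2 + 2 = 4"}, {"text": "water boils at 100 C"}])
      added, removed = diff(v1, v2)
      print(len(added), "added,", len(removed), "removed")  # 1 added, 1 removed

    Attribution (which examples helped or hurt a capability) is the harder part; stable per-example identities like these are just the prerequisite for asking that question at all.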
  • lsy
    This seems simplistic; tech and infrastructure play a huge part here. A short and incomplete list of things that contributed:
    - Moore's law petering out, steering hardware advancements towards parallelism
    - Fast-enough internet creating a shift to processing and storage in large server farms, enabling both high-cost training and remote storage of large models
    - Social media + search both enlisting consumers as data producers and necessitating the creation of armies of MTurkers for content moderation + evaluation, later becoming available for tagging and RLHF
    - A long-term shift to a text-oriented society, beginning with print capitalism and continuing through the rise of "knowledge work" to the migration of daily tasks (work, bill paying, shopping) online, that allows a program that only produces text to appear capable of doing many of the things a person does
    We may have previously had the technical ideas in the 1990s, but we certainly didn't have the ripened infrastructure to put them into practice. If we had had the dataset to create an LLM in the 90s, it still would have been astronomically cost-prohibitive to train, both in CPU and human labor, and it wouldn't have had as much of an effect on society because you wouldn't be able to hook it up to commerce or day-to-day activities (far fewer texts, emails, ecommerce).
  • rar00
    Disagree; there are a few organisations exploring novel paths. It's just that throwing new data at an "old" algorithm is much easier and has been a winning strategy. And, also, there's no incentive for a private org to advertise a new idea that seems to be working (mine's a notable exception :D).
  • TimByte
    What happens when we really run out of fresh, high-quality data? YouTube and robotics make sense as next frontiers, but they come with serious scaling, labeling, and privacy headaches.
  • blobbers
    Why is DeepSeek specifically called out?
  • krunck
    Until these "AI" systems become always-on, always-thinking, always-processing, progress is stuck. The current push-button AI, meaning it only processes when we prompt it, is not how the kind of AI that everyone is dreaming of needs to function.
  • nyrulez
    Things haven't changed much in terms of truly new ideas since electricity was invented. Everything else is just applications on top of that. Make the electrons flow in a different way and you get a different outcome.
  • russellbeattie
    Paradigm shifts are often just a conglomeration of previous ideas with one little tweak that suddenly propels a technology ahead 10x, which opens up a whole new era.
    The iPhone is a perfect example. There were smartphones with cameras and web browsers before. But when the iPhone launched, it added a capacitive touch screen that was so responsive there was no need for a keyboard. The importance of that one technical innovation can't be overstated. Then the "new new thing" is followed by a period of years where the innovation is refined, distributed, applied to different contexts, and incrementally improved.
    The iPhone launched in 2007 is not really that much different from the one you have in your pocket today. The last 20 years have been about improvements. The web browser before that is also pretty much the same as the one you use today.
    We've seen the same pattern happen with LLMs. The author of the article points out that many of AI's breakthroughs have been around since the 1990s. Sure! And the Internet was created in the 1970s and mobile phones were invented in the 1980s. That doesn't mean the web and smartphones weren't monumental technological events. And it doesn't mean LLMs and AI innovation is somehow not proceeding apace.
    It's just how this stuff works.
  • SamaraMichi
    This brings us to the problem AI companies are facing: the lack of data. They have already hoovered up as much as they can from the internet and desperately need more data. Which makes it blatantly obvious why we're beginning to see products marketed under the guise of assistants/tools to aid you, whose actual purpose is to gather real-world image and audio data; think Meta glasses and what Ive and Altman are cooking up with their partnership.
  • ks2048
    The latest LLMs are simply multiplying and adding various numbers together... Babylonians were doing that 4000 years ago.
  • anon291
    I mean, there are no new ideas for SaaS, just new applications, and that worked out pretty well.
  • Night_Thastus
    Man I can't wait for this '''''AI''''' stuff to blow over. The back and forth gets a bit exhausting.
  • luppy47474
    Hmmm
  • alganet
    Dataset? That's so 2000s. Each crawl on the internet is actually a discrete chunk of a more abstractly defined, constant influx of information streams. Let's call them rivers (it's a big stream). These rivers can dry up, present seasonal shifts, be poisoned, be barraged. It will never "get there" and gather enough data to "be done".
    --
    Regarding "new ideas in AI", I think there could be. But this whole thing is not about AI anymore.