Microgpt

<- Back

Microgpt

tambourine_man

Comments (83)

hackersk
What I find most valuable about this kind of project is how it forces you to understand the entire pipeline end-to-end. When you use PyTorch or JAX, there are dozens of abstractions hiding the actual mechanics. But when you strip it down to ~200 lines, every matrix multiplication and gradient computation has to be intentional.I tried something similar last year with a much simpler model (not GPT-scale) and the biggest "aha" moment was understanding how the attention mechanism is really just a soft dictionary lookup. The math makes so much more sense when you implement it yourself vs reading papers.Karpathy has a unique talent for making complex topics feel approachable without dumbing them down. Between this, nanoGPT, and the Zero to Hero series, he has probably done more for ML education than most university programs.
subset
I had good fun transliterating it to Rust as a learning experience (https://github.com/stochastical/microgpt-rs). The trickiest part was working out how to represent the autograd graph data structure with Rust types. I'm finalising some small tweaks to make it run in the browser via WebAssmebly and then compile it up for my blog :) Andrej's code is really quite poetic, I love how much it packs into such a concise program
0xbadcafebee
Since this post is about art, I'll embed here my favorite LLM art: the IOCCC 2024 prize winner in bot talk, from Adrian Cable (https://www.ioccc.org/2024/cable1/index.html), minus the stdlib headers: #define a(_)typedef _##t #define _(_)_##printf #define x f(i, #define N f(k, #define u _Pragma("omp parallel for")f(h, #define f(u,n)for(I u=0;u<(n);u++) #define g(u,s)x s%11%5)N s/6&33)k[u[i]]=(t){(C*)A,A+s*D/4},A+=1088*s; a(int8_)C;a(in)I;a(floa)F;a(struc){C*c;F*f;}t;enum{Z=32,W=64,E=2*W,D=Z*E,H=86*E,V='}\0'};C*P[V],X[H],Y[D],y[H];a(F _)[V];I*_=U" 炾ોİ䃃璱ᝓ၎瓓甧染ɐఛ瓁",U,s,p,f,R,z,$,B[D],open();F*A,*G[2],*T,w,b,c;a()Q[D];_t r,L,J,O[Z],l,a,K,v,k;Q m,e[4],d[3],n;I j(I e,F*o,I p,F*v,t*X){w=1e-5;x c=e^V?D:0)w+=r[i]*r[i]/D;x c)o[i]=r[i]/sqrt(w)*i[A+e*D];N $){x W)l[k]=w=fmax(fabs(o[i])/~-E,i?w:0);x W)y[i+k*W]=*o++/w;}u p)x $){I _=0,t=h*$+i;N W)_+=X->c[t*W+k]*y[i*W+k];v[h]= _*X->f[t]*l[i]+!!i*v[h];}x D-c)i[r]+=v[i];}I main(){A=mmap(0,8e9,1,2,f=open(M,f),0);x 2)~f?i[G]=malloc(3e9):exit( puts(M" not found"));x V)i[P]=(C*)A+4,A+=(I)*A;g(&m,V)g(&n,V)g(e,D)g(d,H)for(C*o;;s>=D?$=s=0:p<U||_()("%s",$[P]))if(! (*_?$=*++_:0)){if($<3&&p>=U)for(_()("\n\n> "),0<scanf("%[^\n]%*c",Y)?U=*B=1:exit(0),p=_(s)(o=X,"[INST] %s%s [/INST]",s? "":"<<SYS>>\n"S"\n<</SYS>>\n\n",Y);z=p-=z;U++[o+=z,B]=f)for(f=0;!f;z-=!f)for(f=V;--f&&f[P][z]|memcmp(f[P],o,z););p<U? $=B[p++]:fflush(0);x D)R=$*D+i,r[i]=m->c[R]*m->f[R/W];R=s++;N Z){f=k*D*D,$=W;x 3)j(k,L,D,i?G[~-i]+f+R*D:v,e[i]+k);N 2)x D)b=sin(w=R/exp(i%E/14.)),c=1[w=cos(w),T=i+++(k?v:*G+f+R*D)],T[1]=b**T+c*w,*T=w**T-c*b;u Z){F*T=O[h],w=0;I A=h*E;x s){N E)i[k[L+A]=0,T]+=k[v+A]*k[i*D+*G+A+f]/11;w+=T[i]=exp(T[i]);}x s)N E)k[L+A]+=(T[i]/=k?1:w)*k[i*D+G[1]+A+f];}j(V,L ,D,J,e[3]+k);x 2)j(k+Z,L,H,i?K:a,d[i]+k);x H)a[i]*=K[i]/(exp(-a[i])+1);j(V,a,D,L,d[$=H/$,2]+k);}w=j($=W,r,V,k,n);x V)w=k[i]>w?k[$=i]:w;}}
verma7
I wrote a C++ translation of it: https://github.com/verma7/microgpt/blob/main/microgpt.cc2x the number of lines of code (~400L), 10x the speedThe hard part was figuring out how to represent the Value class in C++ (ended up using shared_ptrs).
red_hare
This is beautiful and highly readable but, still, I yearn for a detailed line-by-line explainer like the backbone.js source: https://backbonejs.org/docs/backbone.html
MattyRad
Hoenikker had been experimenting with melting and re-freezing ice-nine in the kitchen of his Cape Cod home.Beautiful, perhaps like ice-nine is beautiful.
freakynit
Is there something similar for diffusion models? By the way, this is incredibly useful for learning in depth the core of LLM's.
kelvinjps10
Why there is multiple comments talking about 1000 c lines, bots?
fulafel
This could make an interesting language shootout benchmark.
colonCapitalDee
Beautiful work
anon
undefined
ViktorRay
Which license is being used for this?
coolThingsFirst
Incredibly fascinating. One thing is that it seems still very conceptual. What id be curious about how good of a micro llm we can train say with 12 hours of training on macbook.
ThrowawayTestr
This is like those websites that implement an entire retro console in the browser.
rramadass
C++ version - https://github.com/Charbel199/microgpt.cpp?tab=readme-ov-fil...Rust version - https://github.com/mplekh/rust-microgpt
dhruv3006
Karapthy with another gem !
lynxbot2026
[flagged]
Paddyz
[flagged]
tithos
What is the prime use case
profsummergig
If anyone knows of a way to use this code on a consumer grade laptop to train on a small corpus (in less than a week), and then demonstrate inference (hallucinations are okay), please share how.
jimbokun
It’s pretty staggering that a core algorithm simple enough to be expressed in 1000 lines of Python can apparently be scaled up to achieve AGI.Yes with some extra tricks and tweaks. But the core ideas are all here.