
Comments (55)

  • sebastianmestre
    Cool article, it got me to play around with Markov models, too! I first did a Markov model over plain characters.

    > Itheve whe oiv v f vidleared ods alat akn atr. s m w bl po ar 20

    Using pairs of consecutive characters (order-2 Markov model) helps, but not much:

    > I hateregratics.pyth fwd-i-sed wor is wors.py < smach. I worgene arkov ment by compt the fecompultiny of 5, ithe dons

    Triplets (order 3) are a bit better:

    > I Fed tooks of the say, I just train. All can beconsist answer efferessiblementate
    > how examples, on 13 Debian is the more M-x: Execute testeration

    LLMs usually do some sort of tokenization step prior to learning parameters. So I decided to try out order-1 Markov models over text tokenized with byte pair encoding (BPE). Trained on TFA I got this:

    > I Fed by the used few 200,000 words. All comments were executabove. This value large portive comment then onstring takended to enciece of base for the see marked fewer words in the...

    Then I bumped up the order to 2:

    > I Fed 24 Years of My Blog Posts to a Markov Model
    > By Susam Pal on 13 Dec 2025
    >
    > Yesterday I shared a little program calle...

    It just reproduced the entire article verbatim. This makes sense as BPE removes any pair of repeated tokens, making order-2 Markov transitions fully deterministic.

    I've heard that in NLP applications it's very common to run BPE only up to a certain number of different tokens, so I tried that out next. Before limiting, BPE was generating 894 tokens. Even adding a slight limit (800) stops it from being deterministic:

    > I Fed 24 years of My Blog Postly coherent. We need to be careful about not increasing the order too much. In fact, if we increase the order of the model to 5, the generated text becomes very dry and factual

    It's hard to judge how coherent the text is vs the author's trigram approach because the text I'm using to initialize my model has incoherent phrases in it anyways.

    Anyways, Markov models are a lot of fun!
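    A minimal sketch of the order-k character model described above, assuming a local corpus.txt (my illustration, not the commenter's actual code); the BPE experiments keep the same kind of transition table but feed it a token stream instead of characters:

        import random
        from collections import defaultdict

        def train(text, k):
            # map each k-character context to every character that followed it
            model = defaultdict(list)
            for i in range(len(text) - k):
                model[text[i:i + k]].append(text[i + k])
            return model

        def generate(model, k, length=300):
            context = random.choice(list(model.keys()))
            out = list(context)
            for _ in range(length):
                followers = model.get("".join(out[-k:]))
                if not followers:
                    break
                out.append(random.choice(followers))
            return "".join(out)

        model = train(open("corpus.txt").read(), k=3)  # k = 1, 2, 3 are the orders tried above
        print(generate(model, k=3))

    Duplicates in the follower lists make random.choice sample in proportion to observed frequency, so no explicit probabilities are needed.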
  • vunderba
    I did something similar many years ago. I fed about half a million words (two decades of mostly fantasy and science fiction writing) into a Markov model that could generate text using a “gram slider” ranging from 2-grams to 5-grams.

    I used it as a kind of “dream well” whenever I wanted to draw some muse from the same deep spring. It felt like a spiritual successor to what I used to do as a kid: flipping to a random page in an old 1950s Funk & Wagnalls dictionary and using whatever I found there as a writing seed.
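    One way such a “gram slider” could be wired up (a sketch under my own assumptions, with writing.txt standing in for the corpus; the commenter's tool may well differ): index the corpus at every order from 2 to 5 up front, then pick the order with a single parameter at generation time.

        import random
        from collections import defaultdict

        ORDERS = range(2, 6)  # the slider positions: 2-grams through 5-grams

        def index(words):
            # one transition table per order
            tables = {n: defaultdict(list) for n in ORDERS}
            for n in ORDERS:
                for i in range(len(words) - n + 1):
                    *ctx, nxt = words[i:i + n]
                    tables[n][tuple(ctx)].append(nxt)
            return tables

        def generate(tables, slider, count=80):
            table = tables[slider]
            out = list(random.choice(list(table.keys())))
            for _ in range(count):
                followers = table.get(tuple(out[-(slider - 1):]))
                if not followers:
                    break
                out.append(random.choice(followers))
            return " ".join(out)

        tables = index(open("writing.txt").read().split())
        print(generate(tables, slider=3))  # move the slider anywhere from 2 to 5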
  • OuterVale
    Really fascinating how you can get such intriguing output from such a simple system. Prompted me to give it a whirl with the content on my own site: https://vale.rocks/micros/20251214-0503
  • lacunary
    I recall a Markov chain bot on IRC in the mid-2000s. I didn't see anything better until GPT came along!
  • hilti
    First of all: thank you for giving. Giving 24 years of your experience, thoughts, and lifetime to us. This is special in these times of wondering, baiting, and consuming only.
  • Aperocky
    Here's a quick custom Markov page you can have fun with (all client-side): https://aperocky.com/markov/

    npm package of the Markov model, if you just want to play with it on localhost or somewhere else: https://github.com/Aperocky/weighted-markov-generator
  • monoidl
    I think this is more accurately described as a trigram model than a general Markov model; if it naturally expanded to 4-grams when they were available, and so on, the text would look more coherent.

    IIRC there was some research a couple of years back on "infini-gram", a very large n-gram model that allegedly got performance close to LLMs in some domains.
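    A rough sketch of that "expand to longer n-grams when available" idea as I read it (not a reference implementation): index every context length up to 5 and, at generation time, prefer the longest context seen in training, backing off to shorter ones otherwise.

        import random
        from collections import defaultdict

        MAX_N = 5  # allow contexts up to 4 words, i.e. 5-grams

        def train(words):
            table = defaultdict(list)
            for n in range(2, MAX_N + 1):
                for i in range(len(words) - n + 1):
                    *ctx, nxt = words[i:i + n]
                    table[tuple(ctx)].append(nxt)
            return table

        def next_word(table, history):
            # try the longest context first, then back off toward bigrams
            for size in range(MAX_N - 1, 0, -1):
                followers = table.get(tuple(history[-size:]))
                if followers:
                    return random.choice(followers)
            return None

        def generate(table, seed, count=60):
            out = list(seed)
            while len(out) < count:
                word = next_word(table, out)
                if word is None:
                    break
                out.append(word)
            return " ".join(out)

        words = open("corpus.txt").read().split()
        print(generate(train(words), seed=words[:2]))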
  • manthangupta109
    Damn interesting!
  • hexnuts
    I just realized one of the things that people might start doing is making a gamma model of their personality. It won't even approach who they were as a person, but it will give their descendants (or bored researchers) a 60% approximation of who they were and their views. (60% is pulled from nowhere to justify my gamma designation, since there isn't a good scale for personality-mirror quality for LLMs as far as I'm aware.)
  • ikhatri
    When I was in college my friends and I did something similar with all of Donald Trump’s tweets as a funny hackathon project for PennApps. The site isn’t up anymore (RIP free heroku hosting) but the code is still up on GitHub: https://github.com/ikhatri/trumpitter
  • swyx
    now i wonder if you can compare vs feeding into a GPT style transformer of a similar Order of Magnitude in param count..
  • anthk
    MegaHAL/Hailo (cpanm -n hailo for Perl users) can still be fun too.

    Usage:

        hailo -t corpus.txt -b brain.brn

    where "corpus.txt" should be a file with one sentence per line. Easy to do under sed/awk/perl (a Python equivalent is sketched below).

        hailo -b brain.brn

    This spawns the chatbot with your trained brain. By default Hailo chooses the easy engine; if you want something more "realistic", pick the advanced one mentioned at 'perldoc hailo' with the -e flag.
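    For the one-sentence-per-line corpus, here is a Python equivalent of the sed/awk/perl step the comment suggests (raw_corpus.txt is a placeholder and the splitter is deliberately naive):

        import re

        text = open("raw_corpus.txt").read()
        with open("corpus.txt", "w") as out:
            # break after ., ! or ? followed by whitespace; good enough for a chatbot corpus
            for sentence in re.split(r"(?<=[.!?])\s+", text):
                sentence = " ".join(sentence.split())  # collapse newlines and extra spaces
                if sentence:
                    out.write(sentence + "\n")

    The resulting corpus.txt can then be fed to hailo -t corpus.txt -b brain.brn as described above.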
  • atum47
    I usually have these hypothetical technical discussions with ChatGPT (I can share them if you like), me asking it things like: aren't LLMs just huge Markov chains?! And now I see your project... Funny.