Comments (31)
- kouteiheika: If you want to prove a new alternative to attention (i.e. show that it works and/or is faster in a real-world scenario) without breaking the bank, then one of the best ways to do that would probably be to retrain an already existing model, just with swapped attention modules. Once you have such a model, you can do apples-to-apples benchmarks.

  This has been done successfully in the past: https://huggingface.co/featherless-ai/QRWKV-72B

  Note that this is a 72B model which would be very expensive to train from scratch, but here they did the conversion for less than $2000.
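  A minimal sketch of what such a module swap could look like in PyTorch, assuming a model whose transformer blocks expose an `attn` attribute; the `LinearAttention` class here is a generic kernel-based stand-in for whatever attention alternative is being evaluated, not the mechanism used in the QRWKV conversion:

  ```python
  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class LinearAttention(nn.Module):
      """Kernel-based linear attention: phi(Q) (phi(K)^T V) instead of
      softmax(QK^T) V. Non-causal and simplified for brevity."""
      def __init__(self, dim: int, n_heads: int):
          super().__init__()
          self.n_heads, self.head_dim = n_heads, dim // n_heads
          self.qkv = nn.Linear(dim, 3 * dim, bias=False)
          self.out = nn.Linear(dim, dim, bias=False)

      def forward(self, x: torch.Tensor) -> torch.Tensor:
          b, t, d = x.shape
          q, k, v = self.qkv(x).chunk(3, dim=-1)
          # reshape to (batch, heads, time, head_dim)
          q, k, v = (z.reshape(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                     for z in (q, k, v))
          q, k = F.elu(q) + 1, F.elu(k) + 1            # positive feature map
          kv = torch.einsum("bhtd,bhte->bhde", k, v)   # O(t d^2) instead of O(t^2 d)
          z = 1.0 / (torch.einsum("bhtd,bhd->bht", q, k.sum(dim=2)) + 1e-6)
          y = torch.einsum("bhtd,bhde,bht->bhte", q, kv, z)
          return self.out(y.transpose(1, 2).reshape(b, t, d))

  def swap_attention(model: nn.Module, dim: int, n_heads: int) -> nn.Module:
      """Replace each block's attention with the alternative, keeping all other
      pre-trained weights. The `blocks`/`attn` names vary between model families."""
      for block in model.blocks:
          block.attn = LinearAttention(dim, n_heads)
      return model
  ```

  After the swap, the usual recipe is a comparatively cheap distillation or fine-tuning pass so the new modules learn to mimic the original model, rather than training everything from scratch.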
- ashirviskas: I wonder what would happen if we just crammed more into the "tokens"? I am running an experiment replacing discrete tokens with embeddings + a small byte encoder/decoder. That way you can use the embedding space much more efficiently and have it carry much more nuance.

  Experiments I want to build on top of it:

  1. Adding LSP context to the embeddings, so that the model could _see_ the syntax better, closer to how we use IDEs, and would not need to read/grep 25k lines just to find where something is used.
  2. Experiments with different "compression" ratios. Each embedding could encode a different number of bytes, so we would not rely on a huge static token dictionary.

  I'm aware that papers exist which explore these ideas, but so far no popular/good open source models employ them. Unless someone can prove me wrong.
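  A minimal sketch of the byte encoder/decoder idea, assuming a fixed patch size (the experiment above would vary it); the sizes and module names are illustrative, not taken from the actual experiment:

  ```python
  import torch
  import torch.nn as nn

  PATCH = 8      # bytes folded into one embedding; the "compression" ratio
  D_MODEL = 512  # embedding width consumed by the main sequence model

  class ByteEncoder(nn.Module):
      """Map each group of PATCH raw bytes to one continuous embedding,
      replacing the lookup into a large static token dictionary."""
      def __init__(self):
          super().__init__()
          self.byte_emb = nn.Embedding(256, 64)        # one entry per byte value
          self.proj = nn.Linear(PATCH * 64, D_MODEL)   # fuse the patch into one vector

      def forward(self, byte_ids: torch.Tensor) -> torch.Tensor:
          # byte_ids: (batch, n_bytes) integers in [0, 255], n_bytes divisible by PATCH
          b, n = byte_ids.shape
          x = self.byte_emb(byte_ids).reshape(b, n // PATCH, PATCH * 64)
          return self.proj(x)                          # (batch, n_bytes / PATCH, D_MODEL)

  class ByteDecoder(nn.Module):
      """Map each embedding back to PATCH per-byte distributions."""
      def __init__(self):
          super().__init__()
          self.proj = nn.Linear(D_MODEL, PATCH * 256)

      def forward(self, h: torch.Tensor) -> torch.Tensor:
          b, t, _ = h.shape
          return self.proj(h).reshape(b, t, PATCH, 256)  # logits per byte slot

  # Round-trip shapes on a toy input.
  raw = torch.randint(0, 256, (1, 16))                 # 16 bytes -> 2 embeddings
  logits = ByteDecoder()(ByteEncoder()(raw))           # (1, 2, 8, 256)
  ```

  Extra per-byte channels (e.g. LSP-derived type or reference information) could be concatenated onto the `byte_emb` output before `proj`, which is one way to give the model the syntax context mentioned in point 1.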
- lostmsu: Comparison with vanilla of the same size/FLOPs budget?
- keyle: Does this make any sense, to anyone?
- geoffbp: I dug into this a bit (with AI, ofc) and it spat this out. I found it an easy way to visualise and start to understand:

  > Standard AI models (like GPT-4) treat data using Global Geometry. They imagine every word as a point floating in a massive, flat, high-dimensional room. To see how two words relate, they draw a straight line between them.

  > Local Topology changes the "room" into a landscape (a manifold). Instead of a flat void, the data exists on a curved surface that has hills, valleys, and paths.
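  A tiny sketch of that distinction on made-up data: the straight-line (Euclidean) distance ignores the shape of the surface, while a geodesic distance along a nearest-neighbour graph follows it. The toy arc and the library calls are illustrative, not anything from the article:

  ```python
  import numpy as np
  from scipy.sparse.csgraph import shortest_path
  from sklearn.neighbors import kneighbors_graph

  # Toy "manifold": 100 points along a curved arc embedded in flat 2-D space.
  theta = np.linspace(0, np.pi, 100)
  points = np.stack([np.cos(theta), np.sin(theta)], axis=1)
  a, b = points[0], points[-1]             # the two ends of the arc

  # "Global geometry": straight line through the ambient room.
  euclidean = np.linalg.norm(a - b)        # ~2.0, cuts straight across

  # "Local topology": distance accumulated by walking along the surface,
  # approximated by shortest paths over a k-nearest-neighbour graph.
  graph = kneighbors_graph(points, n_neighbors=2, mode="distance")
  geodesic = shortest_path(graph, directed=False, indices=[0])[0, -1]  # ~3.14

  print(f"straight line: {euclidean:.2f}, along the manifold: {geodesic:.2f}")
  ```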