Comments (36)
- amitport: This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.
- benob: This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.
- zeeshana07x: The gap between how this is described in the paper vs. the blog post is pretty wide. Would be nice to see more accessible writing from research teams; not everyone reading is an ML engineer.
- bluequbit: I did not understand what PolarQuant is. Is it something like pattern-based compression, where the algorithm finds repeating patterns and creates an index of those common symbols or numbers?
- moktonar: Aren't polar coordinates still n-1 angles plus 1 radius for an n-dim vector? If so, I understand that the angles can be quantized better, but when the radius r is big the error is large for highly quantized angles, right? What am I missing?
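A quick numerical sketch of the concern in the comment above. With plain uniform angle quantization (an illustrative stand-in, not necessarily what PolarQuant actually does), the Euclidean reconstruction error from a fixed angular step does grow linearly with the radius r, since the chord length for an angular error Δ is 2r·sin(Δ/2):

```python
import math

def quantize_angle(theta, bits):
    """Uniformly quantize an angle in [0, 2*pi) to 2**bits levels."""
    step = 2 * math.pi / (2 ** bits)
    return round(theta / step) * step

def reconstruction_error(r, theta, bits):
    """Euclidean error after quantizing only the angle (radius kept exact)."""
    q = quantize_angle(theta, bits)
    x, y = r * math.cos(theta), r * math.sin(theta)
    xq, yq = r * math.cos(q), r * math.sin(q)
    return math.hypot(x - xq, y - yq)

# The same angular error yields a proportionally larger Euclidean error
# at larger radius:
for r in (1.0, 10.0, 100.0):
    print(r, reconstruction_error(r, 0.3, bits=4))
```

So the commenter's intuition is right for a naive scheme; schemes in this family typically have to spend bits on (or otherwise exploit the concentration of) the radius to keep the error bounded.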
- maurelius2: I'm somewhat at a loss here beyond the fundamentals. Can someone tell me how the compression impacts performance?
- lucrbvi: Sounds like Multi-Head Latent Attention (MLA) from DeepSeek.
- mskkm: Pied Piper vibes. As far as I can tell, this algorithm is hardly compatible with modern GPU architectures. My guess is that's why the paper reports accuracy-vs-space but conveniently avoids reporting inference wall-clock time. The baseline numbers also look seriously underreported. "Several orders of magnitude" speedups for vector search? Really? Has anyone actually reproduced these results?