Comments (36)
- amitport: This is a great development for KV cache compression. I did notice a missing citation in the related works regarding the core mathematical mechanism, though. The foundational technique of applying a geometric rotation prior to extreme quantization, specifically for managing the high-dimensional geometry and enabling proper bias correction, was introduced in our NeurIPS 2021 paper, "DRIVE" (https://proceedings.neurips.cc/paper/2021/hash/0397758f8990c...). We used this exact rotational approach and a similar bias correction mechanism to achieve optimal distributed mean estimation. I also presented this work and subsequent papers in a private invited talk at Google shortly after publication. Given the strong theoretical overlap with the mechanisms in TurboQuant and PolarQuant, I hope to see this prior art acknowledged in the upcoming camera-ready versions.
- benob: This is the worst lay-people explanation of an AI component I have seen in a long time. It doesn't even seem AI generated.
- zeeshana07x: The gap between how this is described in the paper vs. the blog post is pretty wide. Would be nice to see more accessible writing from research teams; not everyone reading is an ML engineer.
- bluequbit: I did not understand what PolarQuant is. Is it something like pattern-based compression, where the algorithm finds repeating patterns and creates an index of those common symbols or numbers?
- moktonar: Aren't polar coordinates still n-1 angles plus 1 radius for an n-dim vector? If so, I understand that the angles can be quantized better, but when the radius r is big the error is large for highly quantized angles, right? What am I missing?
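A quick numerical sketch of the concern in the comment above. With plain uniform angle quantization (an illustrative stand-in, not necessarily what PolarQuant actually does), the Euclidean reconstruction error from a fixed angular step does grow linearly with the radius r, since the chord length for an angular error Δ is 2r·sin(Δ/2):

```python
import math

def quantize_angle(theta, bits):
    """Uniformly quantize an angle in [0, 2*pi) to 2**bits levels."""
    step = 2 * math.pi / (2 ** bits)
    return round(theta / step) * step

def reconstruction_error(r, theta, bits):
    """Euclidean error after quantizing only the angle (radius kept exact)."""
    q = quantize_angle(theta, bits)
    x, y = r * math.cos(theta), r * math.sin(theta)
    xq, yq = r * math.cos(q), r * math.sin(q)
    return math.hypot(x - xq, y - yq)

# The same angular error yields a proportionally larger Euclidean error
# at larger radius:
for r in (1.0, 10.0, 100.0):
    print(r, reconstruction_error(r, 0.3, bits=4))
```

So the commenter's intuition is right for a naive scheme; schemes in this family typically have to spend bits on (or otherwise exploit the concentration of) the radius to keep the error bounded.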
- maurelius2: I'm somewhat at a loss here beyond the fundamentals. Can someone tell me how the compression impacts performance?
- lucrbvi: Sounds like Multi-Head Latent Attention (MLA) from DeepSeek.
- mskkm: Pied Piper vibes. As far as I can tell, this algorithm is hardly compatible with modern GPU architectures. My guess is that's why the paper reports accuracy-vs-space but conveniently avoids reporting inference wall-clock time. The baseline numbers also look seriously underreported. "Several orders of magnitude" speedups for vector search? Really? Has anyone actually reproduced these results?