
Comments (60)

  • danjl
    In my experience, trying to switch VFX companies from CPU-based rendering to GPU-based rendering 10+ years ago, a 2-5x performance improvement wasn't enough. We even provided a compatible renderer that accepted Renderman files and generated matching images. Given the rate of improvement of standard hardware (CPUs in our case, and GPU-based inference in yours), a 2-5x improvement will only last a few years, and the effort to get there is large (even larger in your case). Plus, I doubt you'll be able to get your HW everywhere (i.e. mobile) where inference is important, which means they'll need to support both their existing SW stack and your new one. The other issue is entirely non-technical, and may be an even bigger blocker -- switching the infrastructure of a major LLM provider to a new upstart is just plain risky. If you do a fantastic job, though, you should get acquihired, probably with a small individual bonus, not enough to pay off your investors.
  • cs702
    Watching the video demo was key for me. I highly recommend everyone else here watches it. [a]

    From a software development standpoint, usability looks great, requiring only one import, import deepsilicon as ds, and then, later on, a single line of Python, model = ds.convert(model), which takes care of converting all possible layers (e.g., nn.Linear layers) in the model to use ternary values. Very nice!

    The question for which I don't have a good answer is whether the improvement in real-world performance, using your hardware, will be sufficient to entice developers to leave the comfortable garden of CUDA and Nvidia, given that the latter is continually improving the performance of its hardware.

    I, for one, hope you guys are hugely successful.

    ---

    [a] At the moment, the YouTube video demo has some cropping issues, but that can be easily fixed.
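    A minimal sketch of the workflow described above, assuming only the two calls shown in the demo (import deepsilicon as ds and ds.convert(model)); the toy model and the rest of the scaffolding here are illustrative, not from the demo:

        import torch
        import torch.nn as nn
        import deepsilicon as ds  # assumed available, per the demo

        # Any PyTorch model containing supported layers (e.g., nn.Linear)
        model = nn.Sequential(
            nn.Linear(768, 3072),
            nn.GELU(),
            nn.Linear(3072, 768),
        )

        # The single line from the demo: swap supported layers for ternary equivalents
        model = ds.convert(model)

        with torch.no_grad():
            out = model(torch.randn(1, 768))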
  • 0xDA7A
    I think the part I find most interesting about this is the potential power implications. Ternary models may perform better in terms of RAM and that's great, but if you manage to build a multiplication-free accelerator in silicon, you can start thinking about running things like vision models in < 0.1 W of power.

    This could have insane implications for edge capabilities: robots with massively better swarm dynamics, smart glasses with super-low-latency speech to text, etc.

    I think the biggest technical hurdle would be simulating the non-linear layers in an efficient way, but you can also solve that, since you already re-train your models and could use custom activation functions that better approximate a HW-efficient non-linear layer.
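    A rough NumPy sketch of why ternary weights make a multiplication-free datapath plausible: with weights in {-1, 0, +1}, a dot product reduces to masked adds and subtracts. Shapes and data are illustrative only:

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.standard_normal(64).astype(np.float32)          # activations
        w = rng.integers(-1, 2, size=(16, 64)).astype(np.int8)  # ternary weights

        # Reference result using ordinary multiplies
        y_ref = w.astype(np.float32) @ x

        # Multiplication-free: add x where w == +1, subtract it where w == -1
        y_mulfree = (np.where(w == 1, x, 0.0).sum(axis=1)
                     - np.where(w == -1, x, 0.0).sum(axis=1))

        assert np.allclose(y_ref, y_mulfree, atol=1e-4)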
  • Taniwha
    Yeah, I've been thinking about this problem for a while at the level of actually making gates. It essentially breaks down to a couple of popcounts and a subtract, and it's eminently pipelineable.
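    A tiny Python sketch of that popcount formulation, under the assumption that ternary weights are stored as two bitmasks (+1 lanes and -1 lanes) and activations are packed {0, 1} bits; the encoding is an assumption, not something from the post:

        def ternary_dot(x_bits: int, plus_mask: int, minus_mask: int) -> int:
            # two popcounts and a subtract (Python 3.10+ for int.bit_count)
            return (x_bits & plus_mask).bit_count() - (x_bits & minus_mask).bit_count()

        # weights [+1, 0, -1, +1] (LSB first) against activations [1, 1, 1, 0]
        plus, minus = 0b1001, 0b0100
        x = 0b0111
        assert ternary_dot(x, plus, minus) == 0  # +1*1 + 0*1 + (-1)*1 + 1*0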
  • stephen_cagle
    Is one expectation of moving from a 2^16-state parameter to a tri-state one that the tri-state parameter will only need to learn the subset of those 2^16 states that were actually significant? I.e., we can prune the "extra" bits of the 2^16 that did not really affect the result?
  • nicoty
    Could the compression efficiency you're seeing somehow be related to 3 being the closest natural number to the number e, which also happens to be the optimal radix choice (https://en.wikipedia.org/wiki/Optimal_radix_choice) for storage efficiency?
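    For what it's worth, the radix-economy argument behind that link can be checked in a couple of lines: the per-symbol cost of base b scales as b / ln(b), which is minimized at b = e, so base 3 edges out base 2 slightly (the snippet below just evaluates that formula):

        import math

        for b in (2, 3, 4):
            print(b, b / math.log(b))  # 2 -> 2.885..., 3 -> 2.731..., 4 -> 2.885...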
  • maratc
    Is there a possibility that this could run on specialized hardware which is neither a CPU nor a GPU, e.g. NextSilicon Maverick chips?
  • jacobgorm
    I was part of a startup called Grazper that did the same thing for CNNs in 2016, using FPGAs. I left to found my own thing after realizing that new, better architectures, SqueezeNet followed by MobileNets, could run even faster than our ternary nets on off-the-shelf hardware. I’d worry that a similar development might happen in the LLM space.
  • bjornsing
    Have you tried implementing your ternary transformers on AVX(-512)? I think it fits relatively well with the hardware philosophy, and being able to run inference without a GPU would be a big plus.
  • marmaduke
    What kind of code did you try on the CPU for, say, a ternary gemm? I imagine ternary values map nicely to vectorized mask instructions, and much of the tiling etc. from a usual gemm carries over.
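    A NumPy sketch of the mask idea, roughly what an AVX-512 mask-register version would do lane by lane; this is purely illustrative and not the CPU kernel the team actually tried:

        import numpy as np

        def ternary_gemm(X, W):
            # X: (m, k) float activations, W: (k, n) weights in {-1, 0, +1}
            m, _ = X.shape
            _, n = W.shape
            Y = np.zeros((m, n), dtype=X.dtype)
            for j in range(n):
                pos = W[:, j] == 1    # lanes to add      (think: mask register)
                neg = W[:, j] == -1   # lanes to subtract (think: mask register)
                Y[:, j] = X[:, pos].sum(axis=1) - X[:, neg].sum(axis=1)
            return Y

        X = np.random.randn(4, 64).astype(np.float32)
        W = np.random.randint(-1, 2, size=(64, 8)).astype(np.int8)
        assert np.allclose(ternary_gemm(X, W), X @ W.astype(np.float32), atol=1e-4)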
  • sidcool
    Congrats on launching. This is inspiring.
  • 99112000
    An area worth exploring is IP cameras, imho:

    1. They are everywhere and aren't going anywhere.
    2. The network infrastructure to ingest and analyze video footage from thousands of cameras is very demanding.
    3. Low power and low latency scream ASIC to me.
  • mikewarot
    Since you're flexible on the silicon side, perhaps consider designing things so that the ternary weights are loaded from an external configuration ROM into a shift-register chain, instead of being fixed. This would allow updating the weights without having to go through the whole production chain again.
  • lappa
    Great project, looking forward to seeing more as this develops.

    Also FYI, your mail server seems to be down.
  • luke-stanley
    The most popular interfaces (human, API, and network) I can imagine are: ChatGPT, the OpenAI-compatible HTTP API, the HuggingFace Transformers API and models, Llama.cpp / Ollama / Llamafile, PyTorch; and USB-C, USB-A, RJ45, HDMI/video(?). If you could run a frontier model, or a comparable one, behind a ChatGPT-style clone like Open UI, over a USB or LAN interface, working on private data quickly, securely, and competitively with a used 3090, it would be super badass. It should be easy to plug in and use for chat, for API calls, for fine-tuning, or via raw primitives through PyTorch or a very similar compatible API. I've thought about this a bit. There's more I could say but I've got to sleep soon... Good luck, it's an awesome opportunity.
  • henning
    I applaud the chutzpah of building a company that develops both the hardware and the software for it. If you execute well, you could build yourself a moat that is very difficult for would-be competitors to breach.
  • anirudhrahul
    Can this run Crysis?
  • transfire
    Combine it with TOC, and then you’d really be off to the races!

    https://intapi.sciendo.com/pdf/10.2478/ijanmc-2022-0036#:~:t...
  • nostrebored
    What do you think about the tension between inference accuracy and the types of edge applications used today?

    For instance, if you wanted to train a multimodal transformer to do inference on CCTV footage, I think this will have a big advantage over Jetson. And I think there are a lot of potentially novel use cases for a technology like that (e.g., if I'm looking for a suspect wearing a red hoodie, I'm not training a new classifier to identify all possible candidates).

    But for sectors like automotive and defense, is the accuracy loss from quantization tolerable? If you're investing so much money in putting together a model, even considering procuring custom hardware and software, is the loss in precision worth it?
  • hy3na
    Ternary transformers have existed for a long time before you guys (TerDiT, vision ones, etc.). Competing in the edge inference space is likely going to require a lot of capex and opex, plus breaking into markets like defense that're hard asf without connections and a strong team. Neither of you guys are chip architects either, and taping out silicon requires a lot of foresight into changing market demands. Good luck, hopefully it works out.
  • felarof
    Very interesting!
  • _zoltan_
    you might want to redo the video as it's cropped too much, and maybe it's only me but it's _really_ annoying to watch like this.