Comments (99)
- sonzohan: I also recently decided to buy a datacenter GPU and slap it into a system. Some notes from my experience that the author doesn't mention in their article:
  - Decommissioned NVIDIA V100s and AMD MI50s are fairly cheap for local experimentation: $200 for 16GB and $400-500 for 32GB. They are also very old. There's an enthusiast community keeping these two cards alive and working with current platforms and models.
  - Nitpick, but the V100 doesn't support bfloat16. The performance hit is not a big deal if you're fiddling with local models, but the card is on its way out in terms of hardware features.
  - The MI50 does support bf16, but the current edition of AMD ROCm no longer supports the card. Vulkan support is good and the MI50 works with most major platforms (llama.cpp, vllm, etc.), but it's not without some pain points like manual recompilation. Fortunately the open source community has already paid most of your way.
  - The cooling requirements for these cards cannot be overstated. A consumer-grade GPU may throttle in a small case without additional fans, but given the same treatment a datacenter GPU will overheat even while idling. At minimum you will need a bunch of decent 120mm fans to prevent this, or an investment in some water cooling.

  I ultimately went with an AMD MI100 32GB ($950). I'm an AMD fan, current ROCm editions support it, and it was low-fuss to get things working. I'm debating getting a second so I can try out bigger models like qwen3-coder-next.
- Teknomadix: The Tesla V100 SXM2 16GB is NOT DGX class as the author writes. It's HGX class. The V100 uses the SXM2 form factor; SXM4 is the A100's form factor, and that card comes with a max of 80GB of on-board memory. Typically those are installed as 8×A100 80GB SXM4 on an HGX riser, and what that gives you is NVSwitch fabric and 640GB of pooled HBM2e (on-package stacked memory with ~2 TB/s of memory bandwidth). 2U standard rack footprint too.
- mickeyp: Impressive work. But the problem is not the 30 tok/s, which is fine for agentic coding and chat. It's prefill; slow prefill kills agentic workloads dead. If you have 100,000 tokens at ~150 tok/s per the OP, you're looking at:

      You have: 100000 / (150/s)
      You want: hms
              11 min + 6.6666667 sec

  Which is quite a wait indeed.
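mickeyp's estimate can be reproduced with a few lines of Python. This is only a minimal sketch: `prefill_seconds` and `as_min_sec` are hypothetical helper names, and the 100,000-token prompt and ~150 tok/s prefill rate are the figures quoted in the comment, not independent measurements.

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_per_s: float) -> float:
    """Time to process the whole prompt before the first output token."""
    if prefill_tok_per_s <= 0:
        raise ValueError("prefill rate must be positive")
    return prompt_tokens / prefill_tok_per_s


def as_min_sec(seconds: float) -> tuple[int, float]:
    """Split a duration in seconds into (whole minutes, leftover seconds)."""
    minutes = int(seconds // 60)
    return minutes, seconds - 60 * minutes


if __name__ == "__main__":
    # Figures taken from the comment: 100,000 prompt tokens at ~150 tok/s.
    total = prefill_seconds(100_000, 150.0)
    minutes, secs = as_min_sec(total)
    print(f"{minutes} min + {secs:.1f} sec")  # 11 min + 6.7 sec
```

The wait scales linearly with prompt length, so halving the context or doubling the prefill rate halves it.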
- jonhohle: I was just looking into this and was worried about the fan setup. Interesting that he was able to solve it with good results. In case anyone is interested, I’m using PCIe passthrough on a FreeBSD host to a Linux guest with an older Pascal card. It’s worked great and I’ve been thinking about putting a nicer card in there. The SXM route seems great, but I’ve been burned (almost literally because of the heat) by DC components before.
- bob1029:
  > And yes, if you want the absolute best, Opus 4.8 exists. It also costs more per 20 minutes of heavy use than I paid for this entire GPU and adapter setup combined. But the gap is shockingly small.

  I don't think this is a fair characterization of the situation. I use frontier models via API pre-paid tokens every single day, and I can barely rack up $100 per month. The fact that we figured out how to burn double this in 20 minutes is impressive, but I don't think it reflects the reality that many are experiencing right now. There are some exceptionally gluttonous approaches to harnessing LLMs that I think are serving as convenient straw men in these discussions.

  Paying for the API will almost always be more economical than self-hosting equivalent infrastructure. I am not against self-hosting, but the article suggests a primarily economic motivation for this effort. If you are consuming fewer than 10^9 tokens per month, I really don't think it's worth your time to try to compete with the hyperscalers. Most of the money is to be found in the integration of this technology with existing businesses.
- mondainx: Great write-up. I've often considered these DC cards for a project and now you've convinced me to pick one up; you describe the price of the unit against what one spends on tokens and that does it for me.
- segmondy: The most interesting part, and perhaps the most useful for most readers, would be how they control the fan. If you are thinking of doing this, you really want to get those fans under control; they are loud. For anyone thinking of these, V100s idle super high: 25-35 W with nothing loaded and easily 50 W when a model is loaded.
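To put segmondy's idle figures in perspective, the sketch below converts a constant power draw into energy per year. It assumes the card stays powered on around the clock, which is an assumption and not something the comment states; `annual_kwh` is a hypothetical helper name.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: the machine is never powered off


def annual_kwh(watts: float, hours: float = HOURS_PER_YEAR) -> float:
    """Energy in kWh consumed by a constant draw of `watts` over `hours`."""
    return watts * hours / 1000.0


if __name__ == "__main__":
    # Figures quoted in the comment: 25-35 W unloaded, ~50 W with a model loaded.
    for label, watts in [("idle low", 25), ("idle high", 35), ("model loaded", 50)]:
        print(f"{label}: {annual_kwh(watts):.1f} kWh/year")
```

Under that always-on assumption, 25-35 W works out to roughly 219-307 kWh per year, and 50 W to 438 kWh per year, before the card does any actual work.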
- matja: The AMD MI250X GPUs are also interesting: 128GB of HBM2E at 3TB/s, and sometimes you see them second-hand for under $1k. The catch, obviously, is that it needs an OAM socket. Never seen an easy way to hook them up to a regular mainboard.
- omarqureshi: Could probably avoid the crazy fan with a waterblock. I've seen a whole kit (V100 + PCIe adapter + block) for £235. Yes, you'll have to pay for pump, radiators and radiator fans, but that should really quieten it down.
- abejfehr: Based on the title I was really hoping to see how this was used for gaming, but they just ran an LLM on it.
- lucamark: Congrats! Most people won’t want to debug drivers, kernels, ACPI, adapters, and fan headers. But for those who do, the capability-per-pound is absurd.
- 00dazzle: That's the same price per VRAM GB as an Arc Pro B70.
- ewy1: despite gaming being used in the title, it is not mentioned in the article, but i'm curious how this performs. i've run some multi-vendor frankenstein setups before and sometimes it even works, so i'm curious to hear your experience with it.
- whoamii: The real question: did your local LLM write this post?
- jmyeet: Some context:
  - In 2017, the V100 was a ~$10,000 GPU. I believe there was a PCIe version, but this one is probably so cheap because SXM2 is harder to use;
  - A 5090 has 1800GB/s of internal memory bandwidth (compared to 900GB/s in the 9-year-old GPU). Of course a 5090 is substantially more expensive;
  - A 5090 has ~21k CUDA cores vs ~5k;
  - The current $10k NVIDIA GPU is the RTX 6000 Pro w/ 96GB of VRAM. It has slightly more CUDA cores but is otherwise pretty much just a 5090. This is unsurprising. NVIDIA uses VRAM for market segmentation.

  Consider this: in 5-10 years, the hardware bought with the trillions spent on AI data centers will most likely be sold for scrap too. That's how short the runway is for OpenAI and Anthropic to recover that investment.

  Anyway, I'm kind of impressed the author managed to get this all to work. I don't think it even would've occurred to me that someone had made an SXM2 adapter, particularly because it's not even used anymore. Like props to whoever did that.
- KnuthIsGod: AI-written posts will kill HN.
- pogue: But could you game with the GPU? Or is that purely a drivers issue?
- viseyth: Volta (and Pascal, which I'm using) should still be supported with driver 580 as long as you don't use the open modules, and you can use up to CUDA 12.9 and cuDNN 9.10.2. No need to limit yourself to an old kernel.
- gtirloni:
  > The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.

  sigh
- wg0: Wait a few years and everyone will be able to put one in at half the price.
- axpy906: Wow. V100. That brings back memories. Way to go.
- recursivegirth:
  > The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.

  Had to stop there. Annoying. I can't stand AI use for writing. It makes any otherwise great article feel so disingenuous.
- casey2: Some resell group is going to have to make this easier. The sheer number of these cards otherwise heading towards the landfill is staggering. That is, if Big Tech doesn't destroy them to prevent model weights from leaking.
- hypfer: [dead]
- lelanthran:
  > The compute is still real. The VRAM is still real. And the memory bandwidth is where it gets genuinely surprising.

  Because humans write exactly like this /s
- knollimar: A little bit of local copium but neat read. Isn't a Raspberry Pi with 16GB of RAM $300 now?