4× RTX PRO 6000 Blackwell on Water, and the One Card That Wouldn't Behave



This rig exists to train models, not serve them. Four RTX PRO 6000 Blackwell cards in one chassis at 600 W each is 2.4 kW of heat to evict, and training runs are hours-to-days long with every card pinned at full TDP. Air coolers can do it for an inference burst; they cannot do it for a multi-day training job: the fans get loud, the cards stack their exhaust into each other, and the first one to thermal-throttle stalls the whole synchronous step. So we converted all four to waterblocks. Most of the build went fine. One card didn’t, and the reason was sitting on the workbench.

This post is the short version: what we did, what broke, how we found it, and where we landed.

The rig

4× RTX PRO 6000 Blackwell Workstation (GB202, 96 GB GDDR7, 600 W)

Threadripper Pro 7995WX on WRX90

4× Bykski waterblocks (full-cover, GPU + VRM + memory front-side)

Custom loop: single distro/reservoir, two pumps, distilled water, two Alphacool NexXxoS XT45 Full Copper 1260 mm Super Nova radiators (9× 140 mm fans each), four GPUs plumbed in parallel

2× 1500 W PSUs (3 kW total budget) to feed the ~2.4 kW sustained draw; AC circuit got upgraded mid-build after an earlier all-cards-down event under load
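The numbers in the list above can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not a measurement: the GPU count, TDP, and PSU ratings come from the spec list, while the 4 L/min loop flow rate is an assumed illustrative value and the function names are made up for this example. It also ignores CPU, pump, and fan draw, which come out of the same PSU budget.

```python
# Back-of-the-envelope power and coolant numbers for the rig described above.
# The flow rate used in the example call is an ASSUMPTION, not a measured value.

GPU_COUNT = 4
GPU_TDP_W = 600.0       # per-card TDP, from the spec list
PSU_COUNT = 2
PSU_RATED_W = 1500.0    # per-PSU rating, from the spec list

WATER_CP_J_PER_KG_K = 4186.0  # specific heat of water
WATER_KG_PER_L = 1.0          # close enough near room temperature


def gpu_heat_w(count=GPU_COUNT, tdp_w=GPU_TDP_W):
    """Total GPU heat the loop has to remove at full TDP."""
    return count * tdp_w


def psu_headroom_w(load_w, psu_count=PSU_COUNT, psu_rated_w=PSU_RATED_W):
    """Rated PSU capacity left over after the given load (GPU-only here)."""
    return psu_count * psu_rated_w - load_w


def coolant_rise_k(heat_w, flow_l_per_min):
    """Steady-state coolant temperature rise: dT = P / (m_dot * c_p)."""
    mass_flow_kg_s = flow_l_per_min * WATER_KG_PER_L / 60.0
    return heat_w / (mass_flow_kg_s * WATER_CP_J_PER_KG_K)


heat = gpu_heat_w()
print(f"GPU heat load:        {heat:.0f} W")
print(f"PSU headroom (GPUs):  {psu_headroom_w(heat):.0f} W")
# 4.0 L/min is a hypothetical total loop flow, chosen only for illustration.
print(f"Coolant rise at 4 L/min: {coolant_rise_k(heat, 4.0):.1f} K")
```

With the assumed 4 L/min total flow, the coolant picks up roughly 8.6 K crossing the GPUs at full load. Because the four blocks are plumbed in parallel, each one sees a quarter of the flow and a quarter of the heat, so the per-block rise works out the same.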

The waterblocks themselves are straightforward: pull the stock cooler, clean the die, fresh paste on the GPU, thermal pads on memory and VRMs, torque the block down in a star pattern. The catch on these cards is the backplate — the memory packages on the back also need cooling, which means either pads against the case panel or small finned heatsinks glued on with thermal adhesive. I went with HOAOH 2.0 W/m·K tape on most spots and GENNEL G109 thermal adhesive where I needed something that wouldn’t migrate.

The card that wouldn’t behave

Three cards came up clean. The fourth — GPU 1 on this rig — would idle fine, then fall off the bus under load. The dmesg signature was always the same:

NVRM: Xid (PCI:0000:02:00): 79, pid='', GPU has fallen off the bus.
NVRM: Xid (PCI:0000:02:00): 154, Node Reboot Required

Xid 79 by itself is a generic “GPU stopped responding” error; it can be driver, PCIe link, power, or the card. The companion 154 plus the PCIe AER logs showed a DPC containment event: the root port killed the link because the card stopped acknowledging transactions. That narrowed it to the card or its power delivery, not software.
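When the same signature keeps recurring, it helps to pull the Xid events out of the kernel log mechanically and see which PCI address they cluster on. The following is a minimal sketch that assumes the log lines look like the two quoted above; the regular expression, the `XidEvent` type, and the function names are illustrative and not part of any NVIDIA tool.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches lines shaped like the two dmesg lines quoted in this post.
# Other driver versions may format Xid lines differently (assumption).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<bdf>[0-9a-fA-F:.]+)\): (?P<code>\d+),\s*(?P<rest>.*)"
)


class XidEvent(NamedTuple):
    pci_address: str
    code: int
    message: str


def parse_xid(line: str) -> Optional[XidEvent]:
    """Return the Xid event found in a log line, or None if there is none."""
    m = XID_RE.search(line)
    if m is None:
        return None
    return XidEvent(m["bdf"], int(m["code"]), m["rest"].strip())


def cards_off_bus(lines: Iterable[str]) -> List[str]:
    """PCI addresses that logged Xid 79 (GPU has fallen off the bus)."""
    events = (parse_xid(line) for line in lines)
    return sorted({e.pci_address for e in events if e and e.code == 79})


SAMPLE = [
    "NVRM: Xid (PCI:0000:02:00): 79, pid='', GPU has fallen off the bus.",
    "NVRM: Xid (PCI:0000:02:00): 154, Node Reboot Required",
]

for event in filter(None, map(parse_xid, SAMPLE)):
    print(event.pci_address, event.code, event.message)
print("fell off the bus:", cards_off_bus(SAMPLE))
```

In practice the input would be the saved output of `dmesg` rather than the hard-coded sample. If every Xid 79 lands on one address across driver and CUDA changes, that points at the card or slot rather than the software stack.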

The painful part is that everything else looked normal. The card enumerated. It loaded the driver. It ran short workloads. It only failed after the VRMs had been driving real current for a while.

The temptation here is to chase software: try a different driver, a different vLLM build, swap CUDA versions, blame torch.compile. I tried some of that. None of it changed anything. The next step was to stop guessing and look at the card.

Pulling the block

This is the back side of the GPU with the block off. The big metal lid in the middle is the GB202 IHS. The black ring around it is the VRM — each of those small black squares marked 85N is a power inductor (a choke). They sit between the VRM MOSFETs and the GPU core, smoothing the switched current that feeds the die.

A 600 W card has a lot of these chokes for a reason. They share the load. Lose one and the rest pick up its share, but the regulator’s feedback loop gets unhappy and the current waveform gets noisy.

If you look at the upper-right cluster of chokes, one pad is empty. There are two bare solder lands with nothing on them.

The two shiny rectangles are the landing pads. The component that should be bridging them is gone.

It was on the bench.

The part

About 3 mm on a side, marked 85N, identical to the 23 still on the board. At some point during the waterblock conversion, most likely while peeling the stock thermal pad off the VRM area, the choke came off with the pad and ended up on the mat. It’s small enough that it didn’t get noticed during reassembly.

Now the failure mode makes sense. Idle and light loads: the remaining chokes carry the current without complaint. Sustained inference at 600 W: ripple climbs, one of the GPU’s internal rails dips out of spec, and the card aborts the link rather than corrupt data. Hence Xid 79 only under real load, and only on this one card.
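To put a rough number on "the rest pick up its share", here is a small sketch of the load-sharing arithmetic. It leans on two loud assumptions: that all 24 chokes (the 23 on the board plus the one on the bench) sit on the same rail and share current evenly, and that the rail carries a hypothetical 500 A at full load. Neither is a measured figure from this card, and a real multiphase controller will not rebalance this neatly.

```python
def per_choke_current_a(total_current_a: float, chokes: int) -> float:
    """Average current per choke, assuming perfectly even sharing."""
    if chokes <= 0:
        raise ValueError("need at least one choke")
    return total_current_a / chokes


def extra_load_fraction(populated: int, missing: int = 1) -> float:
    """Fractional rise in per-choke current when `missing` chokes are gone."""
    remaining = populated - missing
    if remaining <= 0:
        raise ValueError("no chokes left to carry the load")
    return populated / remaining - 1.0


TOTAL_CHOKES = 24          # 23 still on the board + 1 on the bench
ASSUMED_RAIL_A = 500.0     # hypothetical full-load rail current, not measured

before = per_choke_current_a(ASSUMED_RAIL_A, TOTAL_CHOKES)
after = per_choke_current_a(ASSUMED_RAIL_A, TOTAL_CHOKES - 1)
print(f"per choke, all fitted:  {before:.2f} A")
print(f"per choke, one missing: {after:.2f} A")
print(f"extra load per choke:   {extra_load_fraction(TOTAL_CHOKES):.1%}")
```

Under these assumptions each surviving choke carries only about 4% more average current, which fits the observation that the card was fine at idle and on short workloads. The sketch covers only the average current, not the added ripple from a missing phase, which is the effect the post blames for the rail dipping out of spec under sustained load.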

Putting it back

Resoldering a power inductor onto a multi-layer GPU PCB is not glamorous work but it isn’t exotic either. I did it with a $40 SmartFix soldering kit from Amazon, not a $500 rework station. Flux the pads, tin them lightly, place the part, reflow with the surrounding area shielded. The pads on these inductors are big and flat, which actually makes them easier to land than the fine-pitch stuff next to them. Visual check under magnification, continuity check across the part, reinstall the waterblock with fresh paste and pads, back in the loop.

If you’re hesitating because you don’t own pro...

