Was my $48K GPU server worth it?

Was my $48K GPU server worth it? – Rosmine ML Blog

In 2024 I quit my FAANG job to become an independent researcher. To do this I needed GPUs, so I built "grumbl", a 6x 6000 Ada GPU server.

This blog describes the build, some of the issues I faced, and answers the question "was it worth it to build the server myself, or should I have rented cloud GPUs?"

(It’s called "grumbl" because apparently I cannot spell "GPUs")

GPUs as an investment

This rig cost me $48K. That sounds expensive, but it’s way less expensive than quitting my job. Because of the loss of income, if more powerful GPUs could help my work succeed just 2 months earlier than it would with a smaller machine, then buying a more powerful server would be worth it. So I decided to buy the most powerful server that I could run in my apartment.

Choosing the GPUs

I found Tim Dettmers’ guide to choosing a GPU helpful. From that I narrowed it down to A100s, H100s or RTX 6000 Ada. A100s don’t support FP8 and have slower inference performance than the newer GPUs, and I’m going to be doing a lot of inference (RL), so I narrowed it down to 6000 Ada vs H100. Looking at the price/throughput ratios of 6000 Ada vs H100 vs A100, I went with the 6000 Ada GPUs.

Power Constraints

I live in an apartment and don’t have the option to upgrade my electrical circuits to support standard datacenter servers. 6 GPUs require too much power for a single apartment circuit to handle, so I had to get 2 power supplies and plug them into 2 outlets on separate circuits.
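To make the power constraint concrete, here is a rough budget sketch. Every number in it is an assumption for illustration rather than a measurement from this build: 300 W is NVIDIA's listed maximum board power for the RTX 6000 Ada, the 400 W for the rest of the system is a hypothetical placeholder, and the circuit is assumed to be a 15 A, 120 V branch circuit derated to 80% for continuous load.

```python
import math

# All figures are illustrative assumptions, not measurements from the post.
GPU_BOARD_POWER_W = 300   # listed max board power for the RTX 6000 Ada
NUM_GPUS = 6
REST_OF_SYSTEM_W = 400    # hypothetical: CPU, fans, drives, PSU losses
CIRCUIT_VOLTS = 120       # assumed North American branch circuit
CIRCUIT_AMPS = 15
CONTINUOUS_DERATE = 0.8   # common rule of thumb for continuous loads


def circuit_budget_w(volts: float, amps: float, derate: float) -> float:
    """Continuous power one circuit can supply, in watts."""
    return volts * amps * derate


def circuits_needed(load_w: float, budget_w: float) -> int:
    """Smallest number of identical circuits whose combined budget covers the load."""
    return math.ceil(load_w / budget_w)


peak_load_w = NUM_GPUS * GPU_BOARD_POWER_W + REST_OF_SYSTEM_W
budget_w = circuit_budget_w(CIRCUIT_VOLTS, CIRCUIT_AMPS, CONTINUOUS_DERATE)
print(f"peak load {peak_load_w} W, one circuit supplies about {budget_w:.0f} W")
print(f"circuits needed: {circuits_needed(peak_load_w, budget_w)}")
```

Under these assumptions the GPUs alone exceed what one circuit can supply continuously, which is consistent with splitting the load across two power supplies on separate circuits.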

If you google "plugging a PC into multiple outlets", you get lots of warnings that if you even consider such a setup you will instantly burst into flames. So I hired a professional PC builder to make sure it was safe. This is more expensive than doing everything myself, but it’s less expensive than doing something wrong and burning down my apartment.

Ironically, after designing the entire build around apartment power constraints, I ended up moving grumbl to my parents’ basement, where I could upgrade the circuits anyway.

Building my own GPU server vs. using a Cloud Provider

Is it better to buy my own GPUs or should I have rented from a cloud provider? I decided to measure this by calculating how much I used the GPUs, and comparing that to how much it would’ve cost to rent equivalent compute in the cloud.

In 2024 I calculated that, at the then-current GPU rental rates, it would take about a year at roughly 85% or higher utilization for the server to pay for itself compared with renting. That should be easy to do, but for a full analysis, I also need to account for electricity and the fact that as more powerful GPUs become available, the cost to rent equivalent compute will decrease.
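The break-even reasoning can be sketched in a few lines. This is a simplified model, not the calculation actually used for the post: it assumes a constant rental rate and constant utilization, and the function names and the `electricity_usd_per_day` parameter are hypothetical. The post does not state the rental rate, so the sketch back-calculates the rate implied by its round numbers ($48K, 6 GPUs, 85% utilization, one year) while ignoring electricity.

```python
def break_even_days(purchase_usd, num_gpus, rate_usd_per_gpu_hour,
                    utilization, electricity_usd_per_day=0.0):
    """Days until avoided rental fees equal the purchase price."""
    rental_saved_per_day = num_gpus * 24 * utilization * rate_usd_per_gpu_hour
    net_saved_per_day = rental_saved_per_day - electricity_usd_per_day
    if net_saved_per_day <= 0:
        raise ValueError("owning never breaks even at these rates")
    return purchase_usd / net_saved_per_day


def implied_rate(purchase_usd, num_gpus, utilization, days):
    """Rental rate per GPU-hour at which break-even lands on `days`
    (electricity ignored)."""
    return purchase_usd / (num_gpus * 24 * utilization * days)


# Back-calculated from the post's round numbers; not a quoted market price.
rate = implied_rate(48_000, num_gpus=6, utilization=0.85, days=365)
print(f"implied rental rate: ${rate:.2f} per GPU-hour")

# Lower utilization pushes break-even later at the same rate.
print(f"at 60% utilization: {break_even_days(48_000, 6, rate, 0.60):.0f} days")
```

A falling rental price could be modeled by making the rate a function of time, which would move the break-even date later.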

To be thorough, I wrote a script that would log the usage of each GPU every minute. I also logged the power usage in watts so I could calculate how much I spent on electricity.
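The original script isn't shown here, so the following is only a minimal sketch of what such a logger might look like. It assumes `nvidia-smi` is on the PATH and polls its `--query-gpu` interface. The file name, function names, and the electricity price are hypothetical. Note that `power.draw` reports GPU board power only, so it understates what the whole server pulls from the wall.

```python
import csv
import subprocess
import time
from datetime import datetime, timezone

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,power.draw",
    "--format=csv,noheader,nounits",
]


def parse_smi(output: str):
    """Turn nvidia-smi CSV text into (gpu_index, utilization_pct, power_w) tuples."""
    rows = []
    for line in output.strip().splitlines():
        index, util, power = (field.strip() for field in line.split(","))
        rows.append((int(index), float(util), float(power)))
    return rows


def sample():
    """Query every GPU once. Requires an NVIDIA driver on this machine."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_smi(result.stdout)


def log_forever(path="gpu_usage.csv", interval_s=60):
    """Append one row per GPU to `path` every `interval_s` seconds."""
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while True:
            now = datetime.now(timezone.utc).isoformat()
            for index, util, power in sample():
                writer.writerow([now, index, util, power])
            f.flush()
            time.sleep(interval_s)


def energy_cost_usd(power_samples_w, interval_s, usd_per_kwh):
    """Approximate cost by treating each sample as constant over its interval."""
    kwh = sum(power_samples_w) * interval_s / 3600 / 1000
    return kwh * usd_per_kwh
```

Calling `log_forever()` starts the loop. Summing the logged watts with `energy_cost_usd` gives a rough electricity bill once you plug in your own price per kWh.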

In this analysis, I only compared against on-demand pricing. There are also payment plans where you reserve the instance for 6-12 months, but those didn’t seem worth it to me, since they were only a little cheaper than buying the server itself, and this way I got to keep the GPUs.

Using grumbl without a monitor is wasting its potential, since it has ports for up to 24 monitors. I could make my own mini Vegas Sphere.

GPU usage over time graph

To measure GPU usage, for each GPU I counted the number of hours each day where I used that GPU at least once. This seemed a fair comparison against rental since I wouldn’t stop and restart a cloud server if it was only going to be idle for less than an hour.

This comparison is generous to cloud renting, because it assumes I could stop and start each GPU independently. Much of my idle time came when I was running multiple experiments in parallel and one finished or failed while the others kept going. I wouldn’t have stopped the server in that situation if I were renting.

Note: This is meant to be a measure of how much I use the GPUs, not training efficiency, so a GPU with 10% utilization would still count as active for the hour. (My code would be equally inefficient running in the cloud.)
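The counting rule described above can be sketched as follows. This is an illustration, not the code behind the graph. It assumes samples arrive as `(timestamp, gpu_index, utilization_pct)` tuples, that "used" means any sample with nonzero utilization, and that hours are bucketed by clock hour. The function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def active_hours_per_day(samples):
    """Count, per (date, gpu), the clock hours with at least one nonzero sample.

    samples: iterable of (timestamp: datetime, gpu_index: int, utilization_pct: float)
    """
    active = defaultdict(set)
    for ts, gpu, util in samples:
        if util > 0:  # any activity marks the whole hour as used
            active[(ts.date(), gpu)].add(ts.hour)
    return {key: len(hours) for key, hours in active.items()}


def utilization(active_hours, num_gpus, num_days):
    """Fraction of all possible GPU-hours that were counted as active."""
    return sum(active_hours.values()) / (num_gpus * 24 * num_days)


samples = [
    (datetime(2025, 6, 1, 10, 5), 0, 10.0),   # 10% still counts as active
    (datetime(2025, 6, 1, 10, 30), 0, 0.0),   # same hour, already counted
    (datetime(2025, 6, 1, 11, 15), 0, 50.0),
    (datetime(2025, 6, 1, 10, 5), 1, 0.0),    # GPU 1 idle all day
]
hours = active_hours_per_day(samples)
print(hours)
print(utilization(hours, num_gpus=2, num_days=1))
```

Because a single nonzero sample marks the whole hour as used, this metric mirrors hourly rental billing rather than how efficiently each GPU was kept busy.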

Here is the graph of use over time:

You can see 3 separate times the server was down for maintenance. This is quite stressful because you don’t know if the server isn’t booting because a single PCIe riser failed, or because something went catastrophically wrong and fried all the GPUs.

In June 2025 you can see a clear increase in usage. Before that I was doing smaller experiments where dev time was comparable to experiment time, so there was more downtime between experiments while I was implementing. After June 2025, I had a project that required more compute, so most GPUs were continuously running experiments, with only 1-2 GPUs set aside for dev.

From the graph, the total average use was 76%. If you calculate since 1/1/25, utilization is 85%. I have to admit, I’m a little disappointed in that. I’m running experiments...
