Avoiding wasteful electricity use while self-hosting LLMs

Fixing a stuck Ollama runner and building a GPU watchdog

If you self-host LLMs, at some point you’ll likely experience a stuck GPU consuming significant electricity for long periods of time. This is my summary of discovering the problem and then implementing automation that catches it. I want automation to discover these issues and tell me how to correct them. I don’t want to be the first line of defense against unnecessary electricity consumption.

My Self-Hosted Setup

I run a large language model on hardware at home — an Ollama server on a mini PC (AMD Ryzen AI MAX+ 395 with an integrated GPU). Self-hosting gives me privacy, no subscription, and a model that works without an internet connection. The tradeoff is that I’m my own ops team. When something goes wrong, there’s no provider watching for outages.

This weekend, something went wrong.

I noticed the cooling fan on my AI server running at full speed without letting up. I hadn’t been using it, and my family’s use of it is pretty limited, so there were likely no active requests on the system. Either the server was compromised and mining cryptocurrency, or a process was hung and burning power for nothing. The system draws up to 85W, and leaving it pinned at full power indefinitely would show up on my electricity bill.

I needed to do two things. First, determine whether the machine was doing real work or was stuck. Second, build a solution that catches this class of problem automatically, because waiting for me to notice a long-running fan has an unreliable mean time to resolution.

Diagnosis. I checked what the model was doing with ollama ps and found a model stuck in a Stopping... state — it was supposed to unload and free the GPU, but the underlying process never exited. I confirmed the conditions by reading the GPU utilization gauge from the kernel (/sys/class/drm/card1/device/gpu_busy_percent): ~89% busy, ~85W. It had been in this state for roughly 20 hours with zero inference requests. The normal shutdown command (ollama stop) had no effect. A service restart (sudo systemctl restart ollama) cleared it. The root cause is a known Ollama bug where the GPU stays at full utilization with no work to do.
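The two reads used in this diagnosis can be scripted. What follows is a minimal Python sketch, not the script from this setup. The sysfs path is the one quoted above, and the card index (card1) differs between machines. The function names are hypothetical, and the parsing assumes that ollama ps prints one header row followed by one row per loaded model, with "Stopping" appearing somewhere in the row of a model that is unloading.

```python
import subprocess
from pathlib import Path

# Path quoted in the post; the card index (card1) is machine-specific.
GPU_BUSY_PATH = Path("/sys/class/drm/card1/device/gpu_busy_percent")


def read_gpu_busy_percent(path=GPU_BUSY_PATH):
    """Return the GPU utilization gauge (0-100) exposed by the kernel driver."""
    return int(Path(path).read_text().strip())


def model_rows(ps_output):
    """Return the non-empty rows of `ollama ps` output, minus the header row."""
    rows = [line for line in ps_output.splitlines() if line.strip()]
    return rows[1:]


def stopping_rows(ps_output):
    """Return rows for models that report a Stopping... state."""
    return [row for row in model_rows(ps_output) if "Stopping" in row]


def diagnose():
    """Run both reads on the live machine: (loaded rows, stuck rows, GPU busy %)."""
    ps = subprocess.run(
        ["ollama", "ps"], capture_output=True, text=True, check=True
    )
    return model_rows(ps.stdout), stopping_rows(ps.stdout), read_gpu_busy_percent()
```

Calling diagnose() on the server returns the loaded-model rows, the subset that look stuck, and the current GPU busy percentage. A non-empty second value together with a high third value matches the condition described above.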

Watchdog. At this point, I had confirmed that Ollama was hung and needed to be reset. But this problem has happened before, and it would likely happen again. I needed to stop relying on my ability to detect unexpected zephyrs emanating from my server. I created a watchdog that alerts me only when this specific failure occurs. Its logic needs to be narrow to avoid false positives. The basics of the system are as follows:

A cron job samples the system every 5 minutes with two reads: is a model process loaded, and how busy is the GPU?

A sample only counts as unhealthy when both conditions are true: a model is loaded and the GPU is at or above 70% utilization. High GPU usage during inference is normal; this targets high usage while idle.

The unhealthy state must hold continuously for 15 minutes before the watchdog alerts. That’s long enough to rule out any legitimate request.

Any healthy reading resets the counter, so the watchdog only fires on a sustained stuck condition, not a transient spike.

When it fires, it sends a push notification to my phone via ntfy.sh with the diagnosis and the one-line fix command.

After sending, it suppresses further alerts for one hour. I want one notification, not a stream of them.
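The rules above amount to a small state machine. Here is one way to sketch the decision logic in Python, under stated assumptions: the thresholds (70%, 15 minutes, one hour) come from the list above, while the State class, the step function, and the choice to pass the clock in as a plain number are hypothetical and exist only to make the logic easy to test. The real watchdog keeps its state in timestamp files; this sketch leaves persistence to the caller.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BUSY_THRESHOLD = 70         # percent; at or above counts as busy
HOLD_SECONDS = 15 * 60      # unhealthy streak required before alerting
COOLDOWN_SECONDS = 60 * 60  # quiet period after a sent alert


@dataclass
class State:
    unhealthy_since: Optional[float] = None  # start of the current unhealthy streak
    last_alert: Optional[float] = None       # time of the last alert that was sent


def is_unhealthy(model_loaded: bool, gpu_busy: int) -> bool:
    """Unhealthy only when a model is loaded AND the GPU is busy."""
    return model_loaded and gpu_busy >= BUSY_THRESHOLD


def step(state: State, now: float, model_loaded: bool, gpu_busy: int) -> Tuple[State, bool]:
    """Process one sample; return the new state and whether to alert now."""
    if not is_unhealthy(model_loaded, gpu_busy):
        # Any healthy reading resets the streak.
        return State(None, state.last_alert), False
    since = state.unhealthy_since if state.unhealthy_since is not None else now
    held = now - since >= HOLD_SECONDS
    cooling = state.last_alert is not None and now - state.last_alert < COOLDOWN_SECONDS
    return State(since, state.last_alert), held and not cooling
```

The caller is expected to update last_alert only after a notification has actually been delivered. With samples every 5 minutes, a streak that starts at the first sample triggers an alert on the fourth.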

There are two design choices worth noting:

The watchdog alerts but never restarts the model automatically. A person can tell a stuck runner from a legitimate long job in seconds. I didn’t want automation killing real work.

If the alert fails to send, the watchdog doesn’t mark the event as handled, so the next run retries instead of going silent.
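The second design choice, recording an alert as handled only after it was delivered, can be sketched as follows. This is an illustration rather than the actual watchdog code. It relies on ntfy.sh accepting an HTTP POST to a topic URL with the message as the request body and an optional Title header; the topic name, the function names, and the stamp file are hypothetical.

```python
import time
import urllib.request
from pathlib import Path

# Hypothetical topic; anyone who knows a public ntfy.sh topic name can read it.
NTFY_URL = "https://ntfy.sh/my-ollama-watchdog"


def send_ntfy(message, url=NTFY_URL, title="Ollama runner stuck"):
    """POST the message to ntfy.sh; return True only on a 2xx response."""
    req = urllib.request.Request(
        url,
        data=message.encode("utf-8"),
        headers={"Title": title},
        method="POST",
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def alert_and_record(message, stamp_path, send=send_ntfy, now=None):
    """Send the alert; write the 'last alert' timestamp only if it was delivered."""
    if not send(message):
        return False  # stamp untouched, so the next cron run tries again
    Path(stamp_path).write_text(str(time.time() if now is None else now))
    return True
```

Because the timestamp file is written after a successful send and never before, a network failure leaves no trace of a handled event, and the cooldown cannot start on an alert nobody received.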

An aside for my friends in telco: I originally planned to send alerts as SMS through AT&T’s email-to-SMS gateway, but that service was shut down in mid-2025. I switched to ntfy.sh push notifications, which turned out to be simpler and require no stored credentials. The telco industry whiffed badly. Push notifications should be carrier infrastructure, but instead it’s an OTP service. The industry could figure out how to route calls and SMS across different networks, but it couldn’t figure out how to route push notifications across them? A pox on all your houses. IMS should have been more than SIP routing!

Watchdog Results

The stuck Ollama runner was discovered, evaluated, and reset. Until last week, I ran the risk that a hung process would run silently for 20+ hours and raise my electricity bill. The system now self-reports after about 15 minutes of sustained high usage. When a model pins the GPU again, I’ll get a phone notification with clear guidance on how to fix it, whether I’m at the machine or not. The watchdog is free to run and is very simple: a cron job and two timestamp files that survive reboots.
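To put rough numbers on the waste, energy is power multiplied by time. Using the figures from this post as an upper bound (85W is the peak draw, and normal idle draw is not subtracted), the 20-hour incident used about 1.7 kWh, and a full 30-day month at that level would be about 61.2 kWh. The short Python sketch below reproduces that arithmetic. The function names are hypothetical, and no electricity price is assumed; pass in your own rate.

```python
def energy_kwh(power_watts, hours):
    """Energy used by a constant load, in kilowatt-hours."""
    return power_watts * hours / 1000.0


def cost(power_watts, hours, price_per_kwh):
    """Cost of that energy at a flat price per kWh (supply your own rate)."""
    return energy_kwh(power_watts, hours) * price_per_kwh


incident = energy_kwh(85, 20)      # the roughly 20-hour stuck period
month = energy_kwh(85, 24 * 30)    # pinned for an entire 30-day month
best_case = energy_kwh(85, 0.25)   # restarted right as the 15-minute alert arrives
```

The best case assumes someone acts on the notification immediately; the watchdog only alerts, so the real saving depends on how quickly the restart happens.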

Cron Setup:

The cron setup lives in /etc/cron.d/ollama-watchdog (a system cron...
