A disappearing Service Processor | Oxide Computer Company
11 Dec 2025
Laura Abbott, Engineer
One of the considerations in designing our Oxide rack is asking which parts we expect to be accessible and by what means. The Oxide rack is designed to live in a data center and to be accessed exclusively over the network. The only reason an engineer should ever need to physically visit a rack is to replace a failing part, such as a disk. Our Service Processor (SP) is accessible via the management network.
During some of our first attempts at putting our next-generation Cosmo sled into an Oxide rack, we would see the Service Processor drop off the network. This is a tricky situation to debug, as without network access we have limited insight into the state of the SP itself. Debugging started from the state of the rest of the system (the original Hubris bug may contain spoilers for this blog post!):
* The AMD host CPU was still alive, meaning the full system itself still had power
* The SP itself was not broadcasting over the management network that it was alive
* There were no increases in network data counters coming from the SP
* The fans were spinning at a constant elevated rate. The Service Processor is responsible for fan control, so this was an indication the fan controller may have fallen back to emergency full power mode.
* This was not reproducible on a sled outside a rack
The Service Processor runs our custom operating system, Hubris. Each portion of the system (networking, thermal control, update, etc.) is written as a separate task. Hubris is not a true Real Time Operating System with deadline guarantees, but it does have the notion of task priorities. One of our working theories was that we had a software bug that was causing task starvation. If the networking task was unable to run because some other task was eating up all the CPU time, it would not be able to respond over the network. A likely culprit would be a task that had gotten into an infinite crash loop, with all CPU time being spent restarting the task. We lengthened the task restart delay to catch this case. We also wanted to be able to observe whether the SP was still making progress even without network access, and so switched our chassis LED from "always on" to blinking.
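The starvation theory can be illustrated with a toy model. The sketch below is not Hubris's scheduler: the function name, the tick-based timing, and the `restart_cost` and `restart_delay` parameters are all assumptions made for illustration. It models a crash-looping task that always wins the CPU when it is eligible to run, and counts how many ticks are left over for a lower-priority task such as networking.

```rust
/// Toy model of fixed-priority starvation (hypothetical, not Hubris code).
///
/// A high-priority task crash-loops: each restart-and-crash cycle consumes
/// `restart_cost` CPU ticks, after which the task is held off for
/// `restart_delay` ticks before it may be restarted again.
/// Returns how many of `total_ticks` the lower-priority task got to run.
fn low_priority_ticks(total_ticks: u32, restart_cost: u32, restart_delay: u32) -> u32 {
    assert!(restart_cost > 0, "a restart must consume at least one tick");

    let mut busy_left = restart_cost; // ticks left in the current crash cycle
    let mut delay_left = 0u32; // ticks until the crashed task is eligible again
    let mut low_ran = 0u32;

    for _ in 0..total_ticks {
        if delay_left == 0 {
            // The crash-looping task is eligible, so it wins the CPU.
            busy_left -= 1;
            if busy_left == 0 {
                // It crashed again: hold it off for the restart delay.
                delay_left = restart_delay;
                busy_left = restart_cost;
            }
        } else {
            // The crashed task is waiting out its delay; lower-priority work runs.
            low_ran += 1;
            delay_left -= 1;
        }
    }
    low_ran
}
```

In this model, `low_priority_ticks(1000, 1, 0)` returns 0 (the lower-priority task never runs), while `low_priority_ticks(1000, 1, 9)` returns 900. That is the reasoning behind lengthening the restart delay: if a crash loop were the cause, the delay would give the starved task room to run again.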
We were fortunate to be able to reproduce the issue with these debug changes, but the results were still confusing: in some cases we would see the LED stuck on, and in other cases the LED was stuck off. The task responsible for LED blinking was near the top of the priority list, which limited the number of places we could have a stuck task.
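As a small aside on reading a heartbeat LED: a blinker that stops running leaves the LED in whichever state it last wrote. The sketch below is a hypothetical model (the function name and tick-based timing are assumptions, not Hubris code) of an LED that starts on and toggles every `period` ticks until the blinker stops at `stall_tick`.

```rust
/// Toy heartbeat model (hypothetical, not Hubris code).
///
/// The LED starts on and is toggled at ticks `period`, `2 * period`, and so on.
/// Returns the LED state (true = on) frozen in place once the blinker stops
/// running at `stall_tick`.
fn frozen_led_state(period: u32, stall_tick: u32) -> bool {
    assert!(period > 0, "blink period must be non-zero");
    // Number of toggles that completed at or before the stall.
    let toggles = stall_tick / period;
    // An even number of toggles leaves the LED in its initial "on" state.
    toggles % 2 == 0
}
```

With a period of 500 ticks, a stall at tick 700 freezes the LED off and a stall at tick 1200 freezes it on. Under this model, a stall at an arbitrary moment can show either state, so a stuck-on or stuck-off LED says the blinker stopped but not when or why.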
One of the many advantages of writing Hubris in Rust is eliminating bug classes such as buffer overflows. A category of issues Hubris is still particularly prone to is stack overflows. This is because Hubris requires manual sizing of stacks for tasks, and calculating maximum stack size has proven tricky. Our ability to detect undersized stacks has improved with the addition of the emit-stack-sizes feature, but we can still hit some edge cases. When a stack overflow occurs, the task safely restarts. A stack overflow in the kernel could potentially produce similar behavior: a system that looks like it isn’t making progress. Unfortunately for us, the stack margins on the kernel were relatively large (512 bytes!), so this was an unlikely case.
At this point, we really needed to get more debugging information out of the system. For manufacturing purposes, we have SWD debug headers. These are not expected to be used on a production system and especially not a system in a running rack. We had to do some creative cable pulling to get them attached with the assistance of coworkers in the Oxide office.
Fortunately, our cable attachment paid dividends: we reproduced the issue with the probe attached! This was not immediately fruitful: the debug probe was unable to actually halt the CPU via debug halt, which limited our ability to extract diagnostic information. Our Service Processor uses an STM32H7, which is built around a Cortex-M7 core, and the number of ways to put the system in such a state is limited.
This put our focus on identifying what parts of the system could cause such behavior. A major change from our first generation Gimlet system was the addition of an FPGA to control more parts of our system, such as host flash. This FPGA is connected using a simple, old-school parallel bus, like the sort you might use for RAM, and accessed via the STM32H7 Flexible Memory Controller. As stated in the manual (RM0433, Section 22.1):
Its main purposes are:
* to translate AXI transactions into the appropriate external device protocol
* to meet the access time requirements of the external memory devices
One way a CPU can potentially get stuck is if it never receives a bus acknowledgement from an external device. A bug in the FPGA timing, for example, could result in the CPU hanging forever when attempting to read a register. To...