xAI and 10X Training
Written by Tom Costello
Published / Last Updated May 29, 2026
Category Blog
Ceramic has built the fastest training stack. Vendors (AWS, CoreWeave, AMD, and Lambda) have tested this, and we have demonstrated more than 80% MFU on B200s. That is actually higher than the GEMM performance these chips can be expected to reach for matrices of the size found in LLMs. We have extracted all the possible performance because we train at the line speed of matrix multiplication, as seen in the table below, linked from Lambda’s blog at https://lambda.ai/blog/ceramic-lambda-training-performance-nvidia-hgx-b200:
Performance Metrics: 8B Model Training on 8 NVIDIA Blackwell GPUs
Recent claims say that SpaceX is writing a new training stack in C to get a 10X performance increase, and, as the experts, we thought it would be polite to give some advice. Elon posted,
“SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible.
The potential speed improvement vs JAX for large training runs is over an order of magnitude.”
What follows is not good advice for the average researcher - this makes your stack harder to modify, and difficult to change for experiments - but it also makes it really fast.
The first thing to realize is that autograd is not your friend. Symbolic differentiation is one of the great CS achievements. It was the original LISP demo*, but it gets in the way of performance. All the tricky parts of the backward pass are done in specialized kernels (flash attention, RMSNorm), so autograd is just used for trivial things, like addition to the residual stream or taking the derivative of a GEMM (it is two GEMMs). Just stop using it. Suddenly, you have straight-line code. This is worth 10%.
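To make the "it is two GEMMs" remark concrete, here is a minimal sketch of a hand-written backward pass for a single linear layer. It uses plain Python lists so that it runs without dependencies; the helper names (`matmul`, `transpose`, `linear_backward`) and the toy shapes are assumptions made for illustration, not code from any real training stack.

```python
# Hand-written backward pass for Y = X @ W, with no autograd involved.
# A real stack would call a GEMM kernel instead of this list-based matmul.

def matmul(a, b):
    """Plain (m x k) @ (k x n) matrix multiply on lists of lists."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def linear_forward(x, w):
    return matmul(x, w)                      # the single forward GEMM

def linear_backward(x, w, grad_y):
    """The two GEMMs that make up the derivative of one GEMM."""
    grad_x = matmul(grad_y, transpose(w))    # dL/dX = dL/dY @ W^T
    grad_w = matmul(transpose(x), grad_y)    # dL/dW = X^T @ dL/dY
    return grad_x, grad_w

x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]       # 2 x 3 activations
w = [[0.5, -1.0], [2.0, 0.0], [1.0, 3.0]]    # 3 x 2 weights
grad_y = [[1.0, 1.0], [1.0, 1.0]]            # gradient of a toy loss: sum of all outputs

grad_x, grad_w = linear_backward(x, w, grad_y)
print(grad_x)
print(grad_w)
```

Because the backward pass is an ordinary function, the two GEMMs can be called separately and in whatever order the schedule needs.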
The second thing is to fuse ops. Fuse them all. A single layer is just a few GEMMs and a few other ops like rotary embeddings and norms (and attention). Fuse these. Tri Dao’s code is your friend here. Another 10%.
Don’t use frameworks. Once you have a framework, you lose track of where things are kept. Instead, allocate fixed contiguous buffers for grads and parameters. Decide on the memory layout so that all networking is zero-copy. This is easy, so long as you have gotten rid of autograd and frameworks.
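The buffer idea can be sketched in a few lines. The example below plans one flat layout, allocates parameters and gradients once, and hands out zero-copy views. The tensor names and shapes in `SHAPES` are hypothetical, and Python's `array` and `memoryview` stand in for what would be device memory in a real stack.

```python
from array import array

# Hypothetical tensor shapes, chosen only for illustration.
SHAPES = {"wq": (4, 4), "wk": (4, 4), "w_up": (4, 16), "w_down": (16, 4)}

def plan_layout(shapes):
    """Assign each tensor a fixed (offset, size) inside one flat buffer."""
    layout, offset = {}, 0
    for name, (rows, cols) in shapes.items():
        layout[name] = (offset, rows * cols)
        offset += rows * cols
    return layout, offset

layout, total = plan_layout(SHAPES)

# One allocation each for parameters and gradients, made once at startup.
params = array("f", [0.0]) * total
grads = array("f", [0.0]) * total

def view(buf, name):
    """Zero-copy window onto one tensor; writes land in the flat buffer."""
    off, size = layout[name]
    return memoryview(buf)[off:off + size]

gq = view(grads, "wq")
gq[0] = 1.5                          # stands in for a kernel writing a gradient in place

# The whole gradient buffer is one contiguous byte region, so it could be
# handed to the network without a gather or staging copy.
wire = memoryview(grads).cast("B")
print(total, layout["w_up"], grads[0], len(wire))
```

Since every tensor has a fixed offset, the code that launches a transfer always knows where the bytes are.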
Don’t use streams. They make your code easier to write, but they slow everything down. Networking will still overlap with computation; you don’t need separate streams to run tasks in parallel. Explicitly work out how long each compute and network step takes, and issue the networking call at the right time. This saves another 10%.
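As a rough illustration of "work out how long each step takes", the toy model below computes how long the compute path would stall on a single transfer, depending on where in the straight-line code the transfer is launched. The function name, the step durations, and the assumption that the launch itself is free are all hypothetical; real numbers would come from profiling.

```python
def exposed_wait_us(compute_us, issue_before, needed_before, transfer_us):
    """Time the compute path stalls on one asynchronous transfer.

    compute_us     durations of the compute steps, in program order
    issue_before   index of the step the transfer is launched in front of
    needed_before  index of the step that consumes the transferred data
    transfer_us    how long the transfer takes on the wire
    """
    clock, done_at, stall = 0.0, None, 0.0
    for step, duration in enumerate(compute_us):
        if step == issue_before:
            done_at = clock + transfer_us      # launch returns at once (assumed)
        if step == needed_before and done_at is not None and done_at > clock:
            stall = done_at - clock            # compute has to sit and wait
            clock = done_at
        clock += duration
    return stall

# Made-up measurements, in microseconds.
steps = [100.0, 100.0, 100.0, 100.0]
early = exposed_wait_us(steps, issue_before=1, needed_before=3, transfer_us=150.0)
late = exposed_wait_us(steps, issue_before=3, needed_before=3, transfer_us=150.0)
print(early, late)   # 0.0 150.0
```

In this model, launching the transfer two steps ahead hides it completely, while launching it right before it is needed exposes the full 150 microseconds.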
Don’t use modules: break up each piece rather than write nice code. The linear module is made up of one GEMM in the forward pass and two in the backward pass. But we need to split the backward pass, because one of these GEMMs (the one that computes the gradient of the parameters) is not on the critical path. We postpone it until we have some networking to do. In general, we explicitly delay tasks that are not on the critical path until we have something to overlap them with. 5% more.
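One way to picture the split backward pass is the sketch below, which assumes a pipeline stage made of stacked linear layers. Input gradients are computed immediately, the parameter-gradient GEMMs are queued, and the queue is drained once a send has been started. The `send_start` and `send_wait` log entries are stand-ins for a real non-blocking send, and every name here is hypothetical.

```python
from collections import deque

def matmul(a, b):
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def stage_backward(weights, saved_inputs, grad_out, log):
    """Backward through one pipeline stage of stacked linear layers."""
    deferred = deque()
    grad = grad_out
    for i in reversed(range(len(weights))):
        # Off the critical path: remember the parameter-gradient GEMM for later.
        deferred.append((i, saved_inputs[i], grad))
        # Critical path: the input gradient feeds the layer below.
        grad = matmul(grad, transpose(weights[i]))
        log.append(f"dgrad{i}")

    log.append("send_start")          # stand-in for a non-blocking send of `grad`
    weight_grads = {}
    while deferred:                   # fill the network time with postponed GEMMs
        i, x, gy = deferred.popleft()
        weight_grads[i] = matmul(transpose(x), gy)
        log.append(f"wgrad{i}")
    log.append("send_wait")           # stand-in for waiting on the send
    return grad, weight_grads

weights = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
x0 = [[1.0, 1.0]]
x1 = matmul(x0, weights[0])           # input saved for the second layer
log = []
grad_in, wgrads = stage_backward(weights, [x0, x1], [[1.0, 1.0]], log)
print(log)
```

The log shows both input-gradient GEMMs running first, then the send starting, then the deferred parameter-gradient GEMMs filling the time until the send completes.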
These tricks will get a dense model to 95% of GEMM performance. The idea is to do less, not more. But sometimes doing less means being explicit about what you are doing, which means writing a few more lines of code.
Don’t use tensor parallelism if you don’t need it. If you think you need it, stop and realize that you can almost certainly do without it. You might think that you need it to save space, but a little thought will show that you can move parameters rather than activations over the “tensor parallel” group. This has the property that the networking is not on the critical path, so we have removed all the slowdown from TP networking (another 10%).
You probably want long context, so you might be thinking about context parallelism. This involves networking on the critical path, unless, of course, you use our idea of time-based context parallelism, which leverages the pipeline parallelism to do context parallelism over time. We split the context into microbatches that are fed into the pipeline (we do need to reverse the order on the backward pass, but we can do that because we got rid of autograd and frameworks). This removes the cost of context parallelism completely. There is a lesson here: the way to remove the cost of networking is to not send the data.
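The ordering described above can be sketched as a small schedule generator. It assumes causal attention, so each chunk sees the keys and values of every earlier chunk, which is one plausible reason the backward pass has to visit chunks in reverse. The function names and the chunk size are hypothetical, and this is only an illustration of the ordering, not the actual implementation.

```python
def split_context(seq_len, chunk_len):
    """Cut one long sequence into consecutive [start, end) token ranges."""
    return [(s, min(s + chunk_len, seq_len)) for s in range(0, seq_len, chunk_len)]

def pipeline_order(seq_len, chunk_len):
    """Forward and backward visiting order for the chunks of one sequence."""
    chunks = split_context(seq_len, chunk_len)
    # Forward: chunks enter the pipeline in order, like ordinary microbatches.
    # Under causal attention, chunk i reads keys/values for tokens [0, end_i).
    forward = [{"chunk": i, "tokens": c, "kv_visible": (0, c[1])}
               for i, c in enumerate(chunks)]
    # Backward: later chunks go first, so the gradients they contribute to
    # earlier keys/values are accumulated before those chunks are processed.
    backward = [step["chunk"] for step in reversed(forward)]
    return forward, backward

forward, backward = pipeline_order(seq_len=10, chunk_len=4)
for step in forward:
    print(step)
print(backward)   # [2, 1, 0]
```

No chunk is sent to another device for attention in this sketch; the chunks simply pass through the same pipeline stages one after another.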
MoE models make some things a little trickier. There are two big changes. The first is that we now have expert parallelism, which uses 8 times more networking than tensor parallelism. That is a lot. The second is that only 1 in 32 tokens now goes to each expert. Our (or rather Jensen’s) GEMMs don’t work well...