Profiling in PyTorch (Part 2): From nn.Linear to a Fused MLP

Published June 11, 2026

Authors: Aritra Roy Gosthipaty (ariG23498), Rémi Ouazan Reboul (ror), Sergio Paniego (sergiopaniego), Pedro Cuenca (pcuenq), Sayak Paul (sayakpaul)

In the first part of this series, "Profiling in PyTorch", we used torch.add(torch.matmul(x, w), b) to learn how to read PyTorch profiler traces. We also discussed several other topics that came up along the way: the CPU dispatch chain, launch overhead, the difference between an overhead-bound and a compute-bound regime, and some internals of torch.compile.

In the second iteration (this blog post), we climb one rung up the ladder. We replace the hand-written matmul-add pair with an nn.Linear (with bias=True), a building block of almost every deep learning model. We then stack three of them (a choice specific to our example), with an activation in between, to form a Multilayer Perceptron (MLP) block.

The scripts for this blog post live here: 02_linear.py, 03_simple_mlp.py, and 03_kernels_mlp.py. Like before, it helps to open them in a separate tab and walk through the code as you read. We use an NVIDIA A100-SXM4-80GB GPU to run the scripts. It is really easy to set up a GPU on the Hugging Face infrastructure and experiment with the scripts using Dev Mode with Spaces. One could also run the scripts with the Hugging Face Jobs pipeline.

Before we begin, a quick recap of two ideas we will lean on repeatedly:

A GPU kernel is a program that runs in parallel on many threads of the GPU.

The CPU schedules and launches these kernels. Most of the PyTorch overhead you see in a profiler trace is this scheduling work.

From matmul-add to Linear

nn.Linear is a module wrapper around the same matrix multiplication and addition we already profiled in Part 1. The only difference is that it owns its weight and bias as parameters and exposes a forward method that PyTorch users have grown familiar with.

    # bias=True would truly emulate the multiplication and addition
    # operations we have seen in part 1 of the series
    linear_layer = nn.Linear(in_dim, out_dim, bias=True)
    y = linear_layer(x)

The operation at hand can be written as:

y = x @ w.T + b

Where x is the input, w is the weight and b is the bias. Let's run 02_linear.py and check the profile.
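As a quick sanity check of the formula above, here is a minimal sketch that compares the module's output with the hand-written expression. It runs on the CPU, and the shapes simply mirror the command-line arguments used in this post; the variable names are ours and are not taken from 02_linear.py.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Shapes mirror the CLI arguments used in this post (an assumption about the script).
batch, in_dim, out_dim = 1024, 32, 64

linear_layer = nn.Linear(in_dim, out_dim, bias=True)
x = torch.randn(batch, in_dim)

with torch.no_grad():
    y_module = linear_layer(x)

    # nn.Linear stores its weight as (out_dim, in_dim), hence the transpose.
    w, b = linear_layer.weight, linear_layer.bias
    y_manual = x @ w.T + b

print(tuple(w.shape), tuple(y_module.shape))
print(torch.allclose(y_module, y_manual, atol=1e-5))
```

The weight is stored with shape (out_dim, in_dim), which is why a transpose shows up in the formula and, as we will see, in the trace.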

    uv run 02_linear.py --batch 1024 --in_dim 32 --out_dim 64
    uvx trace-util traces -b traces

trace-util is a utility that syncs your traces to a Hugging Face bucket and then prints the Perfetto URLs in your terminal.

Figure 1: Profiler trace of nn.Linear

Figure 1 shows the profiler trace of a forward call of the linear layer. We trace the forward call of the linear layer with a similar schedule setup as the previous traces, with wait=1, warmup=1 and active=3. This is why we see three Profile Steps in the CPU and GPU lanes.
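The schedule described above can be sketched with torch.profiler as follows. This is a minimal illustration and not the contents of 02_linear.py: the callback, the `traces_ready` list, and the CPU fallback are our own assumptions, added so that the snippet also runs on a machine without a GPU.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile, schedule

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

linear_layer = nn.Linear(32, 64, bias=True).to(device)
x = torch.randn(1024, 32, device=device)

# Record GPU activity only when a GPU is present (assumption for portability).
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

traces_ready = []  # hypothetical bookkeeping: one entry per finished active window


def on_trace_ready(prof):
    # A real script would typically export the trace here,
    # for example with prof.export_chrome_trace(path).
    traces_ready.append(True)


# 1 wait step + 1 warmup step + 3 active steps = 5 steps in total.
with profile(
    activities=activities,
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=on_trace_ready,
) as prof:
    for _ in range(5):
        with torch.no_grad():
            linear_layer(x)
        prof.step()  # signal the end of one step to the profiler

print(len(traces_ready))
```

Only the three active steps are recorded; the wait step is skipped entirely and the warmup step lets the profiler settle before recording starts.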

What is the transpose doing?

Figure 2: The transpose CPU row

If we zoom into the profiler trace, as we do in Figure 2, we notice an aten::t (transpose) op before the aten::addmm (multiplication and addition) op. We can already figure out that nn.Linear transposes the weight parameter and then multiplies it with the input. This is the reason we see an aten::t op.

An important thing to notice is that aten::t does not copy or reorganize data: it only rewrites tensor metadata (shape and stride) on the CPU to represent the transposed matrix. It does not launch a kernel on the GPU. One can verify this in two ways: by looking at the GPU lane in the trace, or by checking the CUDA time of the aten::t row in the profiler table.
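The metadata-only nature of the transpose can also be checked directly in Python, without a profiler. The sketch below uses a CPU tensor with the same shape as our weight; the same view semantics apply to CUDA tensors.

```python
import torch

# Same shape as the weight of nn.Linear(32, 64): (out_dim, in_dim).
w = torch.randn(64, 32)
w_t = w.t()

# Shape and stride are swapped ...
print(tuple(w.shape), w.stride())      # (64, 32) (32, 1)
print(tuple(w_t.shape), w_t.stride())  # (32, 64) (1, 32)

# ... but both tensors point at the same memory, so nothing was copied.
same_storage = w_t.data_ptr() == w.data_ptr()
print(same_storage)  # True

# Writing through the original is visible through the transposed view.
w[0, 1] = 42.0
print(w_t[1, 0].item())  # 42.0
```

Because the view is not contiguous, a kernel that consumes it has to honor the strides, which GEMM libraries handle through their transpose flags.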

Why are there no separate mul and add kernels?

Figure 3: No aten::add in the profile of a linear layer

There is no aten::add (the bias addition) in the dispatch chain of the linear layer, as seen in Figure 3. This is because the bias addition has been folded into the matrix multiplication kernel, using what is called an epilogue.

An epilogue is a small computation that a GEMM (GEneral Matrix Multiply) kernel does at the very end, just before it writes its result back to HBM (High Bandwidth Memory, the GPU's main memory). Adding a bias, applying an activation, or scaling by a constant are all classic epilogues. The point of an epilogue is to avoid loading or writing to HBM a second time, since memory traffic makes an operation expensive.
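To get a feel for the savings, here is a back-of-the-envelope sketch of the output-side HBM traffic with and without an epilogue, for the shapes used in this post and float32 data. It is a deliberately simplified model and an assumption on our part: it ignores the reads of x and the weight (identical in both cases) as well as any caching effects.

```python
# Simplified traffic model (assumption): count only bytes moved for the
# output matrix and the bias, in float32.
batch, out_dim = 1024, 64
bytes_per_element = 4

output_bytes = batch * out_dim * bytes_per_element  # 262144
bias_bytes = out_dim * bytes_per_element            # 256

# Unfused: the matmul kernel writes the result, then the add kernel reads it
# back, reads the bias, and writes the final result.
unfused_bytes = output_bytes + (output_bytes + bias_bytes + output_bytes)

# Fused (epilogue): the bias is added while the result is still on chip,
# so the output is written to HBM exactly once.
fused_bytes = output_bytes + bias_bytes

print(unfused_bytes)                # 786688
print(fused_bytes)                  # 262400
print(unfused_bytes - fused_bytes)  # 524288
```

In this model the epilogue removes one full read and one full write of the output matrix, on top of saving a second kernel launch.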

nn.Linear calls torch.nn.functional.linear, which, in turn, calls aten::linear. aten::linear looks at the inputs, notices that a bias was passed, and dispatches aten::addmm(bias, x, weight.T) instead of doing a matmul and an add separately. The weight.T argument is the view produced by the aten::t op we saw earlier. addmm computes:

out = x @ weight.T + bias
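The dispatch described above can be checked with a small sketch. It profiles a single F.linear call on the CPU, collects the recorded op names, and compares the result with an explicit torch.addmm call. The CPU-only setup is our assumption to keep the snippet portable; the exact set of helper ops in the list can vary between PyTorch versions and devices.

```python
import torch
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile

torch.manual_seed(0)
x = torch.randn(1024, 32)
weight = torch.randn(64, 32)  # (out_dim, in_dim), as stored by nn.Linear
bias = torch.randn(64)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        out_linear = F.linear(x, weight, bias)

# Names of every op recorded during the call.
op_names = {event.key for event in prof.key_averages()}
print(sorted(op_names))

# torch.addmm(input, mat1, mat2) computes input + mat1 @ mat2.
out_addmm = torch.addmm(bias, x, weight.t())
matches = torch.allclose(out_linear, out_addmm, atol=1e-5)
print(matches)
```

The printed list should contain aten::linear, aten::t, and aten::addmm, mirroring the dispatch chain in the trace.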

The cuBLAS GEMM kernel that runs on the GPU has a bias-add variant built in, and that's the kernel aten::addmm picks. The add never appears as a separate kernel because it is part of the matmul kernel's writeback, which is exactly what an...
