Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler
Published May 29, 2026
Aritra Roy Gosthipaty (ariG23498)
Sayak Paul (sayakpaul)
Sergio Paniego (sergiopaniego)
Rémi Ouazan Reboul (ror)
Pedro Cuenca (pcuenq)
What you cannot profile, you cannot optimize.
Whether you are trying to squeeze more tokens per second out of a Large Language Model (LLM), shave milliseconds off inference, or just understand why your training loop runs slower than the spec sheet promises, the path eventually runs through profiling.
The catch is that profiling has a steep on-ramp. The traces are dense walls of colored rectangles. The events carry intimidating names. Most tutorials assume you can already read them. So even when we know we should be profiling, opening a trace can feel like a chore best left for later (or for someone else). This post, and the series it kicks off, is our attempt to lower that on-ramp.
This is the opening post of Profiling in PyTorch, a series where we slowly build the skill of reading profiler traces and use it to drive optimization. The plan:
Part 1 (this post): start with the simplest possible operation, a matrix multiplication followed by a bias add, and learn how to read what the profiler hands back.
Part 2: scale up to nn.Linear and a small MLP, use the traces to motivate optimizations, and peek at the kernels underneath.
Part 3: put it all together on Large Language Models with transformers.
We document the journey from a beginner's point of view. No prerequisites apart from basic PyTorch. Treat this as a leisurely read with some "Aha!" moments. The structure of the post is intentionally question-led: we open a trace, ask "wait, why is that happening?", and chase the answer until something clicks. By the end you should know:
how to set up torch.profiler and what it actually hands back,
how to read the profiler table and the trace (CPU lane, GPU lane, and the suspicious gaps in between),
the chain of events from a Python call all the way down to a CUDA kernel,
what changes (and, more interestingly, what does not change) when you slap torch.compile on top.
Before we begin, two definitions that will make everything below read better:
A GPU kernel is a program that runs in parallel on many threads of the GPU.
The CPU schedules and launches these kernels.
You don't usually have to write GPU kernels yourself; when you use a PyTorch operation, it is automatically translated to one or more kernels that do the job on GPU.
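To make the split between "the CPU launches" and "the GPU runs" a little more tangible, here is a minimal sketch that is not part of the post's script. It times one matrix multiplication twice: once stopping the clock as soon as the Python call returns, and once after waiting for the GPU with torch.cuda.synchronize(). The helper name time_matmul and the 1024x1024 size are our own assumptions. On a machine without a GPU the sketch falls back to the CPU, where there is nothing to wait for and the two numbers should look alike.

```python
import time

import torch


def time_matmul(device: str, synchronize: bool) -> float:
    """Hypothetical helper: seconds spent on one matmul call on `device`."""
    x = torch.randn(1024, 1024, device=device)  # assumed size, for illustration
    w = torch.randn(1024, 1024, device=device)

    torch.matmul(x, w)  # warm-up call, not timed
    if device == "cuda":
        torch.cuda.synchronize()  # make sure the warm-up kernel has finished

    start = time.perf_counter()
    torch.matmul(x, w)  # on GPU, this returns once the kernel is launched
    if synchronize and device == "cuda":
        torch.cuda.synchronize()  # block the CPU until the GPU is done
    return time.perf_counter() - start


device = "cuda" if torch.cuda.is_available() else "cpu"
launch_only = time_matmul(device, synchronize=False)
launch_and_wait = time_matmul(device, synchronize=True)
print(f"[{device}] call returned after   : {launch_only * 1e6:.1f} us")
print(f"[{device}] call + wait for result: {launch_and_wait * 1e6:.1f} us")
```

On a GPU, the first number mostly reflects the CPU-side cost of scheduling and launching the kernel, while the second also includes the time the kernel takes to run. Wall-clock timing like this is coarse, which is part of why we reach for a profiler.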
With those two ideas in your back pocket, let's start asking questions.
Here is the entire script that we use for the post: 01_matmul_add.py. We recommend opening this script in a separate tab and walking through the code step by step. We use an NVIDIA A100-SXM4-80GB GPU to run the scripts.
The matrix multiplication and addition operation
As correctly quipped by Dr. Sara Hooker, just as we are primarily made up of water, Deep Neural Networks are primarily made up of matrix multiplies. As fundamental as they are, it would be a shame to start our profiling journey with anything else.
def fn(x, w, b):
    return torch.add(torch.matmul(x, w), b)
The matrix addition, combined with the matrix multiplication, mimics how weights and biases interact in a neuron. This addition (pun intended) also sets the stage for compilation later in the post.
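As a quick sanity check of the "weights and biases" reading, the sketch below calls fn on small random tensors and compares the result with torch.nn.functional.linear, which computes the same affine map. The sizes (8, 16, 4) are hypothetical and chosen only to keep the example small; the post's script may use different shapes. It falls back to the CPU when no GPU is available.

```python
import torch


def fn(x, w, b):
    return torch.add(torch.matmul(x, w), b)


device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(0)

# Hypothetical sizes, for illustration only.
batch, in_features, out_features = 8, 16, 4
x = torch.randn(batch, in_features, device=device)         # inputs
w = torch.randn(in_features, out_features, device=device)  # weights
b = torch.randn(out_features, device=device)               # bias, broadcast over the batch

out = fn(x, w, b)

# F.linear expects the weight as (out_features, in_features), hence the transpose.
reference = torch.nn.functional.linear(x, w.T, b)

print(out.shape)
print(torch.allclose(out, reference, atol=1e-4, rtol=1e-4))
```

The bias has one entry per output feature, so torch.add broadcasts it across every row of the matmul result.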
To profile, we will be using the torch.profiler module. The steps involved are:
Have the code to profile ready (here def fn, which wraps the matrix multiplication and matrix addition)
Annotate the algorithm. This is completely optional, but we recommend it. The record_function context manager labels our function as matmul_add, which makes it easy to find in the traces (as we note later)
def step():
    with torch.profiler.record_function("matmul_add"):
        return fn(x, w, b)
Wrap the code with the torch.profiler.profile context manager
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,   # the CPU activities
        torch.profiler.ProfilerActivity.CUDA,  # the GPU activities
    ],
) as prof:
    # it is recommended to run the step multiple times to warm up the GPU
    for _ in range(5):
        step()
        prof.step()
Export the profile
# the profiler table (table() returns a string, so print it)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
# the profiler trace
prof.export_chrome_trace(trace_path)
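Put together, the steps above can be sketched as one self-contained script. This is our own minimal assembly, not a copy of 01_matmul_add.py: the tensor sizes, the temporary trace location, and the CPU fallback (including switching the sort key to cpu_time_total) are assumptions added so the sketch also runs on a machine without a GPU.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, record_function


def fn(x, w, b):
    return torch.add(torch.matmul(x, w), b)


device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical sizes; the post's script may use different ones.
x = torch.randn(512, 512, device=device)
w = torch.randn(512, 512, device=device)
b = torch.randn(512, device=device)


def step():
    # Label this region so it is easy to spot in the table and the trace.
    with record_function("matmul_add"):
        return fn(x, w, b)


# Only ask for GPU activity when a GPU is actually present.
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities) as prof:
    for _ in range(5):
        step()
        prof.step()

# The profiler table: a per-event statistical summary, returned as a string.
sort_key = "cuda_time_total" if device == "cuda" else "cpu_time_total"
table = prof.key_averages().table(sort_by=sort_key, row_limit=15)
print(table)

# The profiler trace: a JSON file written to an assumed temporary location.
trace_path = os.path.join(tempfile.mkdtemp(), "matmul_add_trace.json")
prof.export_chrome_trace(trace_path)
print(f"trace written to {trace_path}")
```

The exported JSON file can be opened in a trace viewer such as Perfetto or chrome://tracing.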
The profiler exports two distinct artifacts:
The profiler table: Provides a statistical summary of the run. It answers "What is taking the most time?", which makes it useful for finding hotspots. A hotspot is an event that takes the most time, is a bottleneck in the pipeline, or is triggered many times.
The profiler trace: Provides the temporal execution view. Answers "When and Why an operation happened", depicting the...