PyTorch Custom Operation - Lei Mao's Log Book
PyTorch Custom Operation<br>05-10-2026
Introduction<br>Custom operations are common in PyTorch models. A PyTorch custom operation can be a custom function or a custom class, implemented in C++ and CUDA, and it can be used from both Python and C++ inference programs.
In this blog post, I would like to share how to implement PyTorch custom operations in C++ and CUDA, and how to use them in PyTorch models and AOTInductor compiled inference programs, using a simple identity convolution example.
PyTorch Custom Function<br>PyTorch custom functions can be implemented in C++ and CUDA, and their backend-specific implementations are registered using the TORCH_LIBRARY_IMPL macro. Both CPU and CUDA implementations can be provided, and the PyTorch dispatcher selects the correct one based on the device of the input tensors.
custom_ops.cpp<br>// ---------------------------------------------------------------------------<br>// CPU implementation: plain element-wise copy via clone().<br>// ---------------------------------------------------------------------------<br>torch::Tensor identity_conv_cpu_impl(const torch::Tensor& input)<br>{<br>    TORCH_CHECK(!input.is_cuda(),<br>                "identity_conv_cpu_impl: input must be a CPU tensor");<br>    return input.clone();<br>}
// ---------------------------------------------------------------------------<br>// Host-side dispatcher.<br>// ---------------------------------------------------------------------------<br>torch::Tensor identity_conv_cuda_impl(const torch::Tensor& input)<br>{<br>    TORCH_CHECK(input.is_cuda(),<br>                "identity_conv_cuda_impl: input must be a CUDA tensor");
    // Output has the same shape, dtype, and strides as input.<br>    auto output = torch::empty_like(input);<br>    const int64_t numel = input.numel();
    if (numel == 0)<br>    {<br>        return output;<br>    }
    // Upload shape and strides to the device so the kernel can read them.<br>    const int ndim = input.dim();<br>    const auto opts =<br>        torch::TensorOptions().dtype(torch::kInt64).device(input.device());<br>    const auto shape_dev = torch::tensor(<br>        std::vector<int64_t>(input.sizes().begin(), input.sizes().end()), opts);<br>    const auto strides_dev = torch::tensor(<br>        std::vector<int64_t>(input.strides().begin(), input.strides().end()),<br>        opts);
    constexpr int kThreads = 256;<br>    const int blocks = static_cast<int>((numel + kThreads - 1) / kThreads);
    AT_DISPATCH_FLOATING_TYPES_AND2(<br>        at::ScalarType::Half, at::ScalarType::BFloat16, input.scalar_type(),<br>        "identity_conv_cuda_impl",<br>        [&]()<br>        {<br>            identity_kernel<scalar_t><<<blocks, kThreads>>>(<br>                input.data_ptr<scalar_t>(), output.data_ptr<scalar_t>(),<br>                shape_dev.data_ptr<int64_t>(), strides_dev.data_ptr<int64_t>(),<br>                ndim, numel);<br>        });
    C10_CUDA_KERNEL_LAUNCH_CHECK();<br>    return output;<br>}
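The host code above uploads the shape and strides so that the kernel can address non-contiguous tensors, and it sizes the grid with a ceiling division. The kernel body itself is not shown in this section, so the following is only a rough CPU-side sketch of the indexing such a kernel might perform. The names `linear_to_offset`, `ceil_div`, and `strided_identity_copy` are hypothetical and are not part of the original code. The sketch also assumes, as the comment in the host code states, that the output shares the strides of the input.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: convert a logical linear index (row-major order over
// `shape`) into a storage offset by using the per-dimension strides.
inline int64_t linear_to_offset(int64_t linear_idx,
                                const std::vector<int64_t>& shape,
                                const std::vector<int64_t>& strides)
{
    assert(shape.size() == strides.size());
    int64_t offset = 0;
    // Peel off one coordinate at a time, from the innermost dimension outward.
    for (int d = static_cast<int>(shape.size()) - 1; d >= 0; --d)
    {
        const int64_t coord = linear_idx % shape[d];
        linear_idx /= shape[d];
        offset += coord * strides[d];
    }
    return offset;
}

// Ceiling division, the same arithmetic as (numel + kThreads - 1) / kThreads.
inline int64_t ceil_div(int64_t n, int64_t d)
{
    return (n + d - 1) / d;
}

// CPU stand-in for an element-wise identity copy. Each loop iteration plays
// the role of one GPU thread. Input and output are assumed to share strides.
template <typename T>
void strided_identity_copy(const T* input, T* output,
                           const std::vector<int64_t>& shape,
                           const std::vector<int64_t>& strides)
{
    int64_t numel = 1;
    for (const int64_t s : shape)
    {
        numel *= s;
    }
    for (int64_t i = 0; i < numel; ++i)
    {
        const int64_t offset = linear_to_offset(i, shape, strides);
        output[offset] = input[offset];
    }
}
```

For a 2 x 3 tensor stored with strides (1, 2), for example, logical index 4 corresponds to coordinate (1, 1) and therefore to storage offset 1 * 1 + 1 * 2 = 3. With 1000 elements and 256 threads per block, `ceil_div` gives 4 blocks.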
custom_op_registration.cpp<br>// CUDA kernel implementation for my_ops::identity_conv_op.<br>TORCH_LIBRARY_IMPL(my_ops, CUDA, m)<br>{<br>    m.impl("identity_conv_op", identity_conv_cuda_impl);<br>}
// CPU fallback.<br>TORCH_LIBRARY_IMPL(my_ops, CPU, m)<br>{<br>    m.impl("identity_conv_op", identity_conv_cpu_impl);<br>}
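To build intuition for device-based dispatch, the idea can be modeled as a lookup table keyed by operator name and device. The following is a toy model only and does not reflect how the PyTorch dispatcher is actually implemented. `ToyDevice`, `ToyTensor`, `ToyKernel`, and `ToyDispatcher` are all hypothetical names introduced for illustration.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for a device tag and a tensor.
enum class ToyDevice { CPU, CUDA };

struct ToyTensor
{
    ToyDevice device;
    std::vector<float> data;
};

using ToyKernel = std::function<ToyTensor(const ToyTensor&)>;

// Toy dispatcher: one kernel per (operator name, device) pair.
class ToyDispatcher
{
public:
    // Loosely analogous to m.impl(...) inside a per-backend registration block.
    void impl(const std::string& op_name, ToyDevice key, ToyKernel kernel)
    {
        table_[{op_name, key}] = std::move(kernel);
    }

    // Select the kernel that matches the device of the input tensor.
    ToyTensor call(const std::string& op_name, const ToyTensor& input) const
    {
        const auto it = table_.find({op_name, input.device});
        if (it == table_.end())
        {
            throw std::runtime_error("no kernel registered for " + op_name);
        }
        return it->second(input);
    }

private:
    std::map<std::pair<std::string, ToyDevice>, ToyKernel> table_;
};
```

In this model, registering one kernel under `ToyDevice::CPU` and another under `ToyDevice::CUDA` for the same operator name means that the caller never chooses between them explicitly. The choice follows from the tensor that is passed in.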
PyTorch Custom Class<br>PyTorch custom functions are stateless and cannot hold any parameters. To implement a custom class that holds parameters and exposes a forward() method callable from Python, we can define a C++ class that derives from torch::CustomClassHolder and register it with the TORCH_LIBRARY macro.
custom_class.cpp<br>// ---------------------------------------------------------------------------<br>// IdentityConvClass<br>//<br>// A custom class registered with torch.classes so that it can be embedded<br>// in a torch.nn.Module, exported with torch.export, and compiled with<br>// AOTInductor.<br>//<br>// The forward() method delegates to the CUDA identity kernel. The<br>// `channels_` field is preserved for semantic completeness and is serialised<br>// via def_pickle so that the class survives export/import round-trips.<br>// ---------------------------------------------------------------------------<br>struct IdentityConvClass : torch::CustomClassHolder<br>{<br>    int64_t channels_;
    explicit IdentityConvClass(int64_t channels) : channels_(channels) {}
    torch::Tensor forward(const torch::Tensor& x)<br>    {<br>        return x.is_cuda() ? identity_conv_cuda_impl(x)<br>                           : identity_conv_cpu_impl(x);<br>    }
    int64_t get_channels() const { return channels_; }<br>};
custom_class_registration.cpp<br>// ---------------------------------------------------------------------------<br>// Operator / class registration<br>//<br>// This file has no pybind11 dependency and is compiled into<br>// libidentity_conv_ops.so, which can be dlopen'd by a pure C++ binary<br>// without needing libpython.<br>// ---------------------------------------------------------------------------<br>TORCH_LIBRARY(my_ops, m)<br>// Register IdentityConvClass so Python can instantiate it as<br>//...