A visual introduction to kernel functions

Beautiful Kernel Functions and How to See Them

Kelvin Paschal

Machine Learning

May 30, 2026

Let us assume you have a machine that gives an arbitrary amount of gold whenever you insert cheese. You don’t know how much cheese you’d need to insert to get a specific amount of gold. The mapping is also not linear, i.e., the amount of gold $G$ is not directly or inversely proportional to the amount of cheese $C$, so a bigger amount of cheese doesn’t necessarily mean a larger portion of gold. Your goal is to figure out how to get the largest portion of gold from this machine, assuming you have a finite amount of cheese.

We’re assuming the process is not random; that is, there is a fixed mapping from the amount of cheese $(C)$ to the amount of gold $(G)$. Let us call this mapping $f$. You want to uncover it so that you can predict the amount $G$ you’d get whenever you insert some $C$.

You want to understand the relationship: $G = f(C)$

One way to uncover this relationship is by inserting different amounts of cheese and observing the amount of gold you get each time. This is called a data collection/generation process. With this data, you can build a model to help you predict $G$ for every $C$ you insert. But why is this called a model?

A model is an approximation of something else. We don’t know the internal workings of the machine, and we can’t observe all possible outputs from it since we do not have infinite cheese. We’re building an approximation of the cheese-gold mapping, based on the limited number of inputs and outputs observed. This is essentially what machine learning modeling is: an attempt to correctly approximate the process that generates some type of data, based on the historical observations we’ve collected about this process.

For the purpose of this post, there is a specific type of machine learning method I will talk about, and it’s called a Gaussian process (GP). To explain GPs, I’ll continue our previous analogy.

Say you’ve only observed one or two data points. There are still infinitely many guesses you can make about this cheese-gold mapping. Of course, this space of guesses narrows as you collect more observations from the machine. A GP works by considering an infinite number of guesses, or functions, for the true process you want to approximate. As you accumulate more observations, it reshapes these functions to match the data, and hence the true process (just like the way you change your mind after getting new information). A GP is simply a distribution over functions (or guesses). Because we have an infinite number of guesses, our best single guess (or best model) is the mean of all plausible guesses. We can use the variation/spread between those guesses to calculate an uncertainty. If the uncertainty is large, the guesses differ significantly from one another, and our mean guess could well be wrong. If the uncertainty is small, the guesses are not too dissimilar, and we can trust the mean more.
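To make the mean guess and its uncertainty concrete, here is a minimal sketch in NumPy. It is only an illustration under stated assumptions: the cheese and gold numbers are made up, the GP is assumed to have zero mean, and a squared-exponential (RBF) kernel stands in for whatever kernel you would actually choose (kernels are introduced a little further down). The helper names `rbf_kernel` and `gp_posterior` are hypothetical.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two 1D arrays of inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_ts = rbf_kernel(x_train, x_test)
    k_ss = rbf_kernel(x_test, x_test)
    # Cholesky factorization is a stable way to solve with k_tt.
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_ts.T @ alpha                      # the "mean guess"
    v = np.linalg.solve(chol, k_ts)
    cov = k_ss - v.T @ v                       # spread of the guesses
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std

# Made-up observations: cheese inserted (C) and gold received (G).
cheese = np.array([1.0, 2.0, 4.0])
gold = np.array([0.5, 1.5, 0.2])

# Ask about an amount we have seen (2.0), one between observations (3.0),
# and one far from anything we have tried (8.0).
query = np.array([2.0, 3.0, 8.0])
mean, std = gp_posterior(cheese, gold, query)
print(mean, std)
```

In this sketch the uncertainty is close to zero at an amount of cheese already tried, larger between observations, and largest far away from all of them, which matches the intuition that we should trust the mean less where we have no data.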

$$\mathcal{GP}(m(x), k(x, x'))$$
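One way to unpack this notation, using the standard definition of a GP rather than anything specific to this post: for any finite set of inputs $x_1, \dots, x_n$, the function values follow a multivariate Gaussian whose mean vector comes from $m$ and whose covariance matrix comes from $k$.

$$\begin{bmatrix} f(x_1) \\ \vdots \\ f(x_n) \end{bmatrix} \sim \mathcal{N}(\mathbf{m}, K), \quad \mathbf{m}_i = m(x_i), \quad K_{ij} = k(x_i, x_j)$$

This is what makes an infinite collection of guesses workable in practice: we only ever evaluate it at finitely many inputs.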

A GP is characterized by its mean and covariance. The kernel function is what helps us calculate the covariance or uncertainty. It tells us how strongly two points should be correlated. I’ve been working with GPs for the last few years, and I’ve come to love how flexible they are. They are non-parametric models, so they do not assume a fixed or finite set of parameters for the function shape. You can tune how a GP models a dataset by changing its kernel function. If you look up the definition of a kernel function, you’d get something like this:

A kernel function is a mathematical tool used in machine learning, particularly in algorithms like Support Vector Machines (SVMs), to transform data into a higher-dimensional space without explicitly calculating the coordinates in that space. This allows for the analysis of complex, nonlinear relationships in the data while maintaining computational efficiency.

In the context of GPs, a kernel or covariance function, $k(x, x') = \mathrm{Cov}(f(x), f(x'))$, encodes which function values should vary together. It is used as a measure of similarity between inputs.
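As a small illustration of the kernel as a similarity measure, the sketch below evaluates a squared-exponential (RBF) kernel on three inputs. The choice of kernel, the length scale, and the helper name `rbf_kernel` are assumptions made for this example.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel: similarity decays with squared distance."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

# Two inputs that are close together and one that is far away.
x = np.array([0.0, 0.1, 3.0])

# cov[i, j] is the covariance the GP prior assigns to f(x[i]) and f(x[j]).
cov = rbf_kernel(x, x)
print(np.round(cov, 3))
```

The entry for the two nearby inputs is close to 1, so their function values are expected to move together, while the entry for the distant pair is close to 0, so knowing one value says little about the other.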

If you know the ‘shape’ or pattern of the given dataset, you can use the right kernel function when training a GP model. This is where domain knowledge of a given dataset can be useful. One fun thing I love about kernels is that you can add or multiply them to form composites. This means you’re able to bias the model to even more complex data representations.
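Here is a rough sketch of what adding and multiplying kernels looks like in code. The two base kernels (a linear one and a periodic one), their parameters, and the function names are assumptions chosen for illustration. The point is only that the sum and the elementwise product of two valid covariance matrices are again valid covariance matrices.

```python
import numpy as np

def linear_kernel(a, b):
    """Linear kernel: covariance grows with the product of the inputs."""
    return a[:, None] * b[None, :]

def periodic_kernel(a, b, period=1.0, length_scale=1.0):
    """Periodic kernel: inputs a whole number of periods apart are similar."""
    dist = np.abs(a[:, None] - b[None, :])
    return np.exp(-2.0 * np.sin(np.pi * dist / period) ** 2 / length_scale**2)

x = np.linspace(0.0, 3.0, 20)

# Sum: a linear trend plus a repeating pattern.
k_sum = linear_kernel(x, x) + periodic_kernel(x, x)

# Product: a repeating pattern whose amplitude grows with x.
k_prod = linear_kernel(x, x) * periodic_kernel(x, x)

# Both composites stay symmetric with no meaningfully negative eigenvalues.
print(np.linalg.eigvalsh(k_sum).min(), np.linalg.eigvalsh(k_prod).min())
```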

Now that we’ve built an intuition for machine learning and GPs, I will use the rest of this post to go over different kernel representations and their visualizations. I provide figures to show a 1D sample from the GP prior when using a specific kernel, and I show covariance heatmaps where the kernel compares two inputs.
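For readers who want to reproduce this kind of figure, the sketch below draws 1D samples from a zero-mean GP prior and builds the covariance matrix that a heatmap would display. It is not the code behind the figures in this post: the RBF kernel, the input grid, the jitter value, and the random seed are all assumptions.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1D arrays of inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * sq_dist / length_scale**2)

rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 50)

# Covariance between every pair of inputs; this is the heatmap data.
cov = rbf_kernel(x, x)

# A small jitter on the diagonal keeps the Cholesky factorization stable.
chol = np.linalg.cholesky(cov + 1e-6 * np.eye(len(x)))

# If z ~ N(0, I), then chol @ z ~ N(0, cov). Each column is one prior draw.
samples = chol @ rng.standard_normal((len(x), 3))
print(samples.shape)
```

Plotting each column of `samples` against `x` gives a 1D sample from the prior, and displaying `cov` as an image gives the covariance heatmap. Swapping `rbf_kernel` for another kernel changes the shape of both.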

Linear Kernel Function . This kernel...
