Reflecting to optimise

Reflecting to optimise | Magnus Ross

2026/06/26

This is nothing to be proud of, but I have never really studied optimisation in depth. Oh sure, I know my Adam from my AdaGrad and I even used L-BFGS one time, but when people start talking about dual spaces and convergence for L∞L^\inftyL∞ continuous functions, I tend to glaze over a bit. For some reason I think in my head a lot of it seemed a bit old fashioned, whatever that means; why do I need to learn about optimising constrained convex functions in the gradient-descent-for-everything-wait-actually-just-use-a-pre-trained-model era? Isn’t that what they used to, like, make the cheapest possible diet? Boring!

Well, today on the blog, I’d like to talk a bit about my optimisation blind spot in the context of an interesting problem I have been working on related to protein binder design. The first way you think to do something is rarely the best; here we’re going to discuss a concrete example of that. If you, like me, are not an optimisation expert, then I hope you’ll learn something useful from this post. If you are, then feel free to have a good old laugh at my ignorance!

Setup

Ok so what’s the setup? Let’s say we have a categorical probability distribution with kkk categories where the probability of each category is given by the vector x∈Rk\mathbf{x}\in\mathbb{R}^kx∈Rk. There are some constraints on x\mathbf{x}x for it to be valid:

the probabilities must be normalised: ∑i=1kxi=1\sum^{k}_{i=1} x_i = 1∑i=1kxi=1,

and, the probabilities must be greater than 0: xi≥0 ∀i∈{1,⋯ ,k}x_i \geq 0 \ \ \forall i \in \{1, \cdots, k\}xi≥0 ∀i∈{1,⋯,k}.

We have a non-convex function fff which takes in our probability vector, and spits out a real number. We want to find the x\mathbf{x}x that minimises this function x∗=argminx∈Δf(x)\mathbf{x}^{*} = \text{argmin}_{\mathbf{x}\in\Delta}f(\mathbf{x})x∗=argminx∈Δf(x). In our setup, we will assume we can compute the gradient ∇xf\nabla_\mathbf{x} f∇xf, but evaluation of the gradient, and indeed the function itself, is computationally expensive.

Protein related aside

This is a simplified version of the problem of hallucination for de-novo binder design where x\mathbf{x}x represents a distribution over k=20k=20k=20 amino acids, and fff is a folding model, like Alphafold which takes a sequence of LLL of these amino acid distributions, called a position specific scoring matrix (PSSM), as input. We want to find the sequence that gives the best fold (according to some metric like ipSAE). In this case the input to fff is now a matrix X∈Rk×L\mathbf{X} \in \mathbb{R}^{k \times L}X∈Rk×L where each column is a probability vector.

A first attempt

When seeing a problem like this, where we need to optimise something with constraints, my first instinct is to try and re-parameterise the problem so that I don’t have to worry about them, and we can just use all the “normal” methods. We basically want to rewrite x\mathbf{x}x as a function of some other parameters with no constraints, then we can just optimise those. In this case, we can write

xi=softmax(ℓ)i=eℓi∑j=1keℓj<br>\mathbf{x}_i = \text{softmax}(\mathbf{\ell})_i = \frac{e^{\ell_i}}{\sum^{k}_{j=1} e^{\ell_j}}<br>xi=softmax(ℓ)i=∑j=1keℓjeℓi<br>where we call ℓ∈Rk\ell\in \mathbb{R}^{k}ℓ∈Rk the logits. Note that for any ℓ\mathbf{\ell}ℓ, the output x\mathbf{x}x is guaranteed to be a valid probability vector obeying the constraints we set out earlier. Perfect! Problem solved! Now we can just find ℓ∗=argminℓ∈Rkf(softmax(ℓ))\mathbf{\ell}^{*} = \text{argmin}_{\mathbf{\ell}\in \mathbb{R}^{k}}f(\text{softmax}(\ell))ℓ∗=argminℓ∈Rkf(softmax(ℓ)) by taking gradients w.r.t. ℓ\ellℓ and running some gradient descent method, and get x∗=softmax(ℓ∗)\mathbf{x}^* = \text{softmax}(\ell^*)x∗=softmax(ℓ∗). This all makes sense, and would have been case closed for me in the past, but it turns out there are other ways, and those ways are interesting!

The simplex

Let’s take a step back and think a bit more about the structure of this problem. It turns out there is a special name for the space of x\mathbf{x}x’s that obey the constraints set out at the start: the probability simplex Δk−1\Delta^{k-1}Δk−1, so we can write x∈Δk−1\mathbf{x} \in \Delta^{k-1}x∈Δk−1. The probability simplex is the k−1k-1k−1 dimensional region of space where both non-negativity and normalisation are satisfied. Let’s take the concrete example of k=3k=3k=3 with categories A,B,CA,B,CA,B,C. The 2-simplex is a triangular region oriented perpendicular to the (1,1,1)(1, 1, 1)(1,1,1) direction, it looks something like this:

2-simplex in 3D and 2D.

On the left we show the region in the full k=3k=3k=3 dimensions, and on the right, we show the 2D “top down” view. The vertices of the simplex are the points where we have one category with certainty. On the faces, one of the categories doesn’t contribute (e.g. on the A/C face, B...

Reflecting to optimise

Related Articles

US Government directive to suspend access to Fable 5 and Mythos 5

Is AI ruining our skills? Early results are in – and they're not good

The Anatomy of an AI-Native Org

Apertus – Open Foundation Model for Sovereign AI

How to Earn a Billion Dollars