The Quadratic Sandwich

The quadratic sandwich

2026-04-08

The quadratic sandwich

If you have ever tried to minimize a function with gradient descent, you probably noticed that some functions are a joy to optimize and others are a nightmare. The difference often boils down to two properties: strong convexity and L-smoothness . These two concepts define a “sandwich” of quadratic bounds around your function that tells you exactly how well-behaved it is. If the sandwich is tight, life is good. If one slice of bread is missing, things get ugly fast.

In this post we’ll build up both concepts from scratch, see how they combine into the quadratic sandwich, understand what happens at the level of the Hessian’s eigenvalues, and pick up a neat trick to verify L-smoothness without ever computing an eigenvalue.

Strong convexity — the function can’t be too flat

A differentiable function \(f:\mathbb{R}^n\to\mathbb{R}\) is \(\mu\)-strongly convex (with \(\mu > 0\)) if for all \(x, y\)

\[f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2\]

If this looks familiar, it’s because the first two terms on the right are the first-order Taylor expansion of \(f\) at \(x\). For a plain convex function, the Taylor expansion is already a global underestimator (that’s the subgradient inequality). But strong convexity asks for more: the function must stay above the tangent plus a quadratic gap. The parameter \(\mu\) controls how aggressive this gap is — the bigger \(\mu\), the more the function curves upward and away from its linear approximation.

The intuition is that a strongly convex function has a guaranteed minimum curvature of \(\mu\) in every direction. It can’t flatten out, it can’t plateau, it can’t have a degenerate valley where one direction is basically flat. There is always a force pulling you toward the minimum, and that force grows linearly with the distance from the minimizer.

L-smoothness — the function can’t be too steep

A differentiable function \(f\) is \(L\)-smooth if its gradient is Lipschitz continuous:

\[\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\| \quad \forall \; x, y\]

Read this carefully: the change in the gradient between any two points is always dominated by a rescaled version of the change in the input. No matter how far apart \(x\) and \(y\) are, the gradient difference \(\|\nabla f(x) - \nabla f(y)\|\) can never outpace \(L\) times the input difference \(\|x - y\|\). The constant \(L\) acts as a leash on the gradient: it can move, but it can’t jerk. No abrupt turns, no sudden spikes in curvature.

Now here’s the equivalent characterization that will matter for the sandwich, sometimes called the descent lemma : if \(f\) is convex and \(L\)-smooth, then for all \(x, y\)

\[f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2\]

This is not obvious from the Lipschitz condition alone — it requires a short derivation that we work out in the Appendix. The key idea is to integrate the gradient along the segment from \(x\) to \(y\) and use Cauchy-Schwarz together with the Lipschitz bound to control the error.

Look at the structure: same shape as the strong convexity condition, but with the inequality flipped and \(\mu\) replaced by \(L\). The function now stays below its tangent plus a quadratic term. In other words, the function can bend, but not more than a quadratic with curvature \(L\).

The parameter \(L\) caps the maximum curvature in any direction. If strong convexity sets a floor on curvature, L-smoothness sets a ceiling.

The quadratic sandwich

Now we put both slices of bread together. If \(f\) is \(\mu\)-strongly convex and \(L\)-smooth, then both inequalities hold simultaneously and for all \(x, y\) we get

\[\begin{cases}<br>f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2 \leq f(y) \\[6pt]<br>f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2<br>\end{cases}\]

The function is trapped between two parabolas centered at any point \(x\): a tighter one from below (curvature \(\mu\)) and a wider one from above (curvature \(L\)). This is the quadratic sandwich.

The proof of the sandwich is worked out in the Appendix.

The condition number

The ratio

\[\kappa = \frac{L}{\mu}\]

is called the condition number and it measures how thick the sandwich is. By construction, since \(L \geq \mu\) (the maximum curvature can’t be smaller than the minimum curvature), the condition number is always bounded below by 1: \(\kappa \geq 1\). A small \(\kappa\) (close to 1) means the function is “almost quadratic” — both bounds are tight, and the function is easy to optimize. A large \(\kappa\) means the curvature varies wildly across directions, and this is where gradient descent starts to suffer. Think about it in two scenarios:

Some directions lack curvature (\(\mu\) is tiny): a small \(\mu\) doesn’t necessarily mean the function is flat — you could still have a nonzero slope. But it means you have no “acceleration” to exploit: the...

The Quadratic Sandwich

Related Articles

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self- Play

Old Reddit Is Down

The ultimate female fantasy – A feminist critique of Beauty and the Beast