When does learning from data work (math starting from basic probability)

VC Dimension and the Fundamental Theorem of Statistical Learning — from Scratch | Prateek Chandra Jha

← Back to home

VC Dimension and the Fundamental Theorem of Statistical Learning — from Scratch

May 2026

This post answers a single question: when does learning from data actually work?

You train a model on samples, it performs well on those samples, and you hope it performs well on new data. When is that hope justified? The answer turns out to be a clean equivalence: a hypothesis class is learnable if and only if it has finite VC dimension. This is the Fundamental Theorem of Statistical Learning.

We'll build every piece from scratch: Markov's inequality, Hoeffding's lemma, concentration inequalities, the union bound, VC dimension, the Sauer–Shelah lemma, symmetrization, and the generalization bounds that tie it all together. Nothing is assumed beyond basic probability (expectations, independence, indicator random variables). We prove both directions of the equivalence: the upper bound (finite VC dimension implies learnability) and the lower bound (via information theory: total variation, KL divergence, Pinsker's inequality, and Le Cam's method).

This is Part 1 of two. Part 2 continues the story with Rademacher complexity — a probabilistic refinement of VC dimension that gives tighter, data-dependent bounds — and shows how all the pieces connect in one beautiful chain.

1. The Setup: What Does “Learning” Mean?

We work in the simplest interesting setting: binary classification.

Definition (The Learning Problem)

Instance space $\mathcal{X}$: the set of all possible inputs (images, documents, feature vectors, etc.).

Label space $\mathcal{Y} = \{0, 1\}$: binary labels.

Distribution $\mathcal{D}$: an unknown probability distribution over $\mathcal{X} \times \mathcal{Y}$. Nature draws $(x, y) \sim \mathcal{D}$.

Hypothesis class $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$: a fixed set of classifiers $h : \mathcal{X} \to \{0,1\}$ that the learner is allowed to pick from.

Training sample $S = ((x_1, y_1), \ldots, (x_m, y_m))$: drawn i.i.d. from $\mathcal{D}$.

Definition (True Risk and Empirical Risk) The true risk (or population loss) of a hypothesis $h$ is $$ L_{\mathcal{D}}(h) = \Pr_{(x,y)\sim\mathcal{D}}[h(x) \neq y]. $$ The empirical risk (or training error) on a sample $S$ of size $m$ is $$ L_S(h) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}[h(x_i) \neq y_i]. $$

Example (Spam Classification) Suppose $\mathcal{X}$ = all possible emails and $\mathcal{Y} = \{0, 1\}$ where 1 = spam. Nature has some distribution $\mathcal{D}$ over (email, label) pairs — this captures both which emails show up and how they're labeled. We don't know $\mathcal{D}$; we only see a training set of $m$ labeled emails.

The true risk $L_{\mathcal{D}}(h)$ is the probability that classifier $h$ misclassifies a random future email. We never know this exactly — it involves all emails that could ever arrive.

The empirical risk $L_S(h)$ is the fraction of training emails that $h$ gets wrong. This we can compute.

The fundamental question: when does small $L_S(h)$ guarantee small $L_{\mathcal{D}}(h)$?

The learner sees only $S$ and wants to find $h \in \mathcal{H}$ with small $L_{\mathcal{D}}(h)$. The most natural strategy is Empirical Risk Minimization (ERM) : pick any $h \in \mathcal{H}$ that minimizes $L_S(h)$. In plain English: choose the hypothesis that makes the fewest mistakes on the training data.

But ERM has an obvious failure mode: overfitting. If $\mathcal{H}$ is expressive enough, some $h \in \mathcal{H}$ will have $L_S(h) = 0$ purely by memorizing the training data, while having terrible true risk. So we need to understand when ERM is trustworthy.

This leads to two questions that the rest of the post will answer:

When is learning possible at all? Under what conditions on $\mathcal{H}$ can any algorithm generalize?

When does ERM specifically work? When does the training error track the true error simultaneously for every hypothesis?

These questions are formalized by two definitions: PAC learnability (question 1) and uniform convergence (question 2). Let's build up to each one.

What Should “Learnable” Mean?

We want a definition that captures: "given enough data, the learner will probably do well." There are three things to get right:

How well? The learner's error should be within $\epsilon$ of the best possible error in $\mathcal{H}$. We use $\epsilon$ ("accuracy parameter") because we can't demand perfection — with finite data, there's always some approximation error.

How probably? Since $S$ is random, the learner might get unlucky (e.g., all training emails are spam). We allow a failure probability $\delta$ ("confidence parameter").

Against what? The guarantee must hold for every distribution $\mathcal{D}$. We don't know how nature generates data, so we can't assume anything about it.

Definition (PAC...

When does learning from data work (math starting from basic probability)

Related Articles

Amazon, Facebook, FBI have access to a private intelligence-sharing network

SpaceX not the behemoth everyone thought

The Mirror Is Part of the Machine

Elevated error rates on requests to multiple models

Donald Trump and sons to be 'forever' exempt from tax audits