The effective sample size – Alex Smola
Years ago I wrote about correcting covariate shift by reweighting your data. Your features come from the wrong distribution \(q\), you care about a target \(p\), so you weight every observation by \(\beta_i = p(x_i)/q(x_i)\) and your estimates are unbiased again. I ended that post by admitting the weights “can be quite a bit off,” and waved at fixing it another day.
Here is the more basic question I skipped. Even when the weights are exactly right, what do they cost you? Reweighting buys you less bias. It charges you variance, and the currency of that charge has a name.
Why correct weights still hurt
The intuition is in the picture below. Data from \(N(0,1)\), target \(N(\mu,1)\), so the correct weight is \(w(x) = e^{\mu x - \mu^2/2}\). It is wildly uneven: the rare points sitting where \(p\) has its mass get enormous weight, and everything else gets almost none. At a shift of only \(\mu = 2\), the heaviest 1% of observations carry 37% of the total weight, and the effective fraction of usable data has already fallen below 2%.
When a few points dominate the estimate, the rest of the data is effectively ignored, and any error in that handful, an outlier, a mislabeled point, a slightly wrong weight, runs straight into your answer with nothing to average it out. You did not get \(n\) observations. You got a few, dressed up as many. So you would like to know: how many did you actually get?
Counting what you actually got
Normalize the weights so that \(\|\alpha\|_1 = \sum_i \alpha_i = 1\), and define
\[ n_{\mathrm{eff}} = \frac{1}{\|\alpha\|_2^2} = \frac{1}{\sum_i \alpha_i^2}. \]
Equal weights \(\alpha_i = 1/n\) give \(\sum_i \alpha_i^2 = 1/n\), hence \(n_{\mathrm{eff}} = n\): nothing wasted. All the weight on one point gives \(n_{\mathrm{eff}} = 1\): a sample of size one. Everything else lands in between. This is Kish’s effective sample size. The only thing left to explain is why that function is the right one. Here are two derivations, from opposite ends of the field, that both produce the same outcome.
Variance of a sum of Normal random variables
Let the \(x_i\) be iid \(N(0,1)\) and form the weighted average \(\bar x = \sum_i \alpha_i x_i\). By independence,
\[ \mathrm{Var}[\bar x] = \sum_i \alpha_i^2 \,\mathrm{Var}[x_i] = \|\alpha\|_2^2. \]
A plain average of \(m\) iid unit-variance variables has variance \(1/m\). Set \(1/m = \|\alpha\|_2^2\) and you get \(m = \|\alpha\|_2^{-2} = n_{\mathrm{eff}}\). The weighted average is exactly as noisy as an unweighted average over \(n_{\mathrm{eff}}\) fresh draws. That is the quickest route to the definition: one line of variance algebra.
Hoeffding’s inequality
Variance is an average-case statement. The same quantity controls the worst case. Let the \(x_i\) be iid in \([0,1]\) with mean \(\mathbb{E}[x_i]\), and take \(\bar x = \sum_i \alpha_i x_i\) again. Each term lives in an interval of width \(\alpha_i\), so Hoeffding’s inequality, the Chernoff bounding argument for bounded variables, gives
\[ \Pr\big(|\bar x - \mathbb{E}[\bar x]| \ge t\big) \le 2\exp\!\left(-\frac{2t^2}{\sum_i \alpha_i^2}\right) = 2\exp\!\left(-2\,n_{\mathrm{eff}}\,t^2\right). \]
The textbook bound for an equal-weight average is \(2\exp(-2 n t^2)\). The two are identical except that \(n\) has become \(n_{\mathrm{eff}}\). Whether you measure spread by a variance or by a tail probability, the concentration of a reweighted sum is set not by how many points you have but by how many you effectively have.
Replay buffer
The effective sample size is the knob you want in off-policy reinforcement learning. A replay buffer is data collected under earlier policies, but you want to improve the policy you are running now. The correction is the same one as covariate shift: weight each stored transition by the ratio \(\pi/b\) of the current policy \(\pi\) to the behaviour policy \(b\) that generated it. As the current policy pulls away from the buffer, those weights concentrate, and the effective sample size of the buffer, measured against the policy you actually care about, collapses along a curve like the one above.
In this case \(n_{\mathrm{eff}}\) is not a number you read off after the fact. It becomes a diagnostic control signal: how much real information the buffer still holds, when the data has gone too stale to reuse, and how large an update the current batch can support. Calibrating the algorithm to its own effective sample size is exactly what P3O does, and what we implement in FeynRL. That is the next post.
There are many more applications of the effective sample size. For instance, in Sequential Monte Carlo, aka the Particle Filter, this is used as a diagnostic to decide when it’s time to resample the current distribution to obtain a more evenly weighted set of particles. But that’s a story for another day.