A cute little trick to running classic IIR filters on the GPU – Maister's Graphics Adventures
Skip to content
About This Site<br>This is a place where I post random topics I find interesting in low-level graphics and engine programming.
ARNTZEN SOFTWARE AS
My contracting business
Archives<br>Search
Search for:
In my previous behemoth post I mentioned the struggle of implementing a comb + notch filter that looked passable for analog composite video. The blurring effect of adding the notch with FIR filtering was far too great, and there’s no way TVs at the time did it that way.
Nagging issues like these haunt me until I solve it, and I started experimenting with IIR filtering since it’s one of those topics I never really explored on GPUs, due to it being very GPU unfriendly. A notch IIR biquad filter is extremely effective on paper, and there has to be a way to do it well in parallel.
A biquad IIR
This is the building block of a ton of audio filters out there, since they can easily model classic analog filters and are easy to tweak parametrically. In the Z plane they have two zeroes and two poles:
// Z is a complex number representing where frequency response hits 0.<br>// P is a complex number representing a pole<br>// where the frequency response hits infinity (or pure resonance).<br>// Frequency response at a given frequency is measured by setting<br>// z = exp(2 * j * w)
// I probably screwed this formula up somehow,<br>// don't read too much into it.<br>H(z) = (z - Z) * (z - conj(Z)) / ((z - P) * (z - conj(P));<br>Having conjugated poles and zeroes ensures the filter coefficients are real, so the actual filter looks more like this in code:
out[x] =<br>in[x] * b0 + in[x - 1] * b1 + in[x - 2] * b2 + // FIR portion<br>out[x - 1] * a1 + out[x - 2] * a2 // IIR portion<br>The Z plane is often a little abstract, but in this case, we will design the notch directly in the Z domain. For example, we place zeroes where we want to remove frequency response, e.g. at the chroma subcarrier, then place poles close to it at the same frequency. The intuition is that close to the carrier, the response will fall to 0, and further away the pole and zero mostly negate each other, leaving a unity frequency response. The response can be made as sharp as desired, but sharper notches lead to more ringing.
E.g. for a pole set at 0.9 times the zero. Quite sharp. Getting this kind of response out of a normal convolution requires a comical number of taps. The response at DC here isn’t exactly 0 dB, but the FIR portion can be rescaled to ensure unity gain.
struct<br>float b0, b1, b2, a1, a2; // a0 is implicitly 1.0.<br>} push = {};
// Compute a biquad IIR notch filter.<br>double w = 2.0 * muglm::pi() / 27.0;<br>constexpr double fir_q = 0.99;<br>constexpr double iir_q = 0.85;
// Don't do a complete notch since that leads to really bad ringing.<br>// We just need a fairly narrow band-stop though ...
if (options.system == System::NTSC)<br>w *= 315.0 / 88.0;<br>else<br>w *= 4.43361875;
// Preconvolves two zeroes at exp(i * w) and exp(-i * w).<br>double b0 = 1.0;<br>double b1 = -2.0 * fir_q * cos(w);<br>double b2 = fir_q * fir_q;
// Preconvolves two poles at exp(i * w * Q) and exp(-i * w * Q).<br>// Flip sign here since we're designing the filter as B(z) / A(z) and<br>// when evaluating the filter we flip the signs.<br>double a1 = 2.0 * iir_q * cos(w);<br>double a2 = -iir_q * iir_q;
// Normalize the FIR gain to obtain a 0 dB gain at DC.<br>double fir_gain = b0 + b1 + b2;<br>double target_fir_gain = 1.0 - a1 - a2;<br>double fir_scale = target_fir_gain / fir_gain;
b0 *= fir_scale;<br>b1 *= fir_scale;<br>b2 *= fir_scale;
push.b0 = float(b0);<br>push.b1 = float(b1);<br>push.b2 = float(b2);<br>push.a1 = float(a1);<br>push.a2 = float(a2);
cmd.push_constants(&push, 0, sizeof(push));<br>The trick – a spicy parallel scan
While IIRs have infinite impulse response, we’re only really interested in the output of the active signal area, which is ~1440 pixels. A 1440-tap convolution is fine if we want to use FFT convolution, but that’s quite overkill for what’s literally a 5-tap filter. Surely there are ways to do IIRs effectively without falling back to serial algorithms.
Many parallel GPU algorithms boil down to scan operations in some way, or have scans as a primary ingredient of their madness. I found an older paper from 1990 (Prefix Sums and Their Applications, Guy E. Blelloch, section 1.4) which goes through this algorithm in a way that was easy enough to digest. It’s not the state of the art I think, but it gets the job more than adequately for my needs. For the IIR filter, the FIR portion of it is handled trivially with a filter kernel. We only need to concern ourselves with the feedback portion of the filter.
Usually we think of scans as just simple binary operators.
for i in range(len(data)):<br>data[i] += data[i - 1]<br>This is common enough that we have special subgroup operations for them, like subgroupInclusiveAdd. This is not enough to implement IIRs since it cannot capture the "decay" of an impulse response. The two intuitions...