The Null in Your Not IN

radimm1 pts0 comments

The NULL in your NOT IN | boringSQL

Table of Contents

Sample schema

The query that returns nothing

IN is an OR, NOT IN is an AND

Three-valued logic does the rest

NULLs on the left side too

The seam, in the source

A grammar asymmetry: list vs. subquery

Why the planner won't save you

The fix is landing in PostgreSQL 19

Decision matrix

What to do instead

A NOT IN query can return the wrong answer without telling you. It is valid SQL, it runs without an error, and it hands back a perfectly well-formed result set that happens to be empty when it should not be. No warning, no hint, nothing in the logs: just zero rows where you expected hundreds, and a database that considers it correct.

Almost always the cause is a single NULL sitting somewhere you forgot to look, combined with two keywords you have typed a thousand times: NOT IN. None of it is a Postgres bug. This is exactly what the SQL standard mandates, implemented faithfully. That is precisely what makes it so easy to walk into, and why the planner could not safely optimize around it for about twenty-five years. It comes down to one if statement in the parser.

Sample schema

Nothing elaborate. A table of products, one of which has no category assigned yet, and a table of archived categories that happens to contain a NULL:

CREATE TABLE products (id int, category_id int);<br>INSERT INTO products VALUES (1, 10), (2, 20), (3, NULL), (4, 10);

CREATE TABLE archived (category_id int);<br>INSERT INTO archived VALUES (20), (NULL);<br>The NULL in archived is not contrived. The moment a column is nullable (and most are, by default), a NULL can find its way into any subquery you point a NOT IN at. That is the whole point: this is not an exotic data condition, it is the ordinary one.

The query that returns nothing

Here is the request you have written a hundred times: give me the products whose category is not archived.

SELECT id, category_id FROM products<br>WHERE category_id NOT IN (SELECT category_id FROM archived);<br>You expect products 1 and 4 (category 10, which is not in the archived set). What comes back is:

id | category_id<br>----+-------------<br>(0 rows)<br>Every row gone. Not a subset, not an off-by-one: all of them. Drop the NULL from archived and the same query behaves:

SELECT id, category_id FROM products<br>WHERE category_id NOT IN (SELECT category_id FROM archived<br>WHERE category_id IS NOT NULL); id | category_id<br>----+-------------<br>1 | 10<br>4 | 10<br>(2 rows)<br>To understand why a single NULL empties the entire result, we have to stop thinking of NOT IN as a single thing and watch the parser take it apart.

IN is an OR, NOT IN is an AND

IN is not a primitive operator. It is shorthand that the parser rewrites into a chain of equality comparisons joined by OR:

x IN (a, b, c)<br>-- becomes<br>x = a OR x = b OR x = c<br>NOT IN is the logical negation of that, and by De Morgan's law negating an OR of equalities gives you an AND of inequalities:

x NOT IN (a, b, c)<br>-- becomes<br>x <> a AND x <> b AND x <> c<br>This is not an analogy. It is literally the expression Postgres builds, and you can read it straight off an EXPLAIN. The literal-list forms collapse into array operators whose names give the whole game away:

EXPLAIN (COSTS OFF) SELECT * FROM products WHERE category_id IN (1, 2, 3);<br>-- Filter: (category_id = ANY ('{1,2,3}'::integer[]))

EXPLAIN (COSTS OFF) SELECT * FROM products WHERE category_id NOT IN (1, 2, 3);<br>-- Filter: (category_id <> ALL ('{1,2,3}'::integer[]))<br>IN is = ANY: equal to any element, an OR. NOT IN is <> ALL: different from all elements, an AND.

The actual node types matter here, because they are what you end up staring at when you dump a parse tree or read a normalized pg_stat_statements entry. A literal list compiles to a single ScalarArrayOpExpr: the scalar on the left, the array on the right, and a useOr flag that is the entire difference between = ANY and <> ALL. The subquery forms are a different node altogether, a SubLink. Recognising those two names on sight tells you immediately which path the planner is on.

If "IN and = ANY are the same operator" is news: they compile to the same parse node and the same plan, with the spellings diverging only in plan-cache churn and selectivity estimates. The NOT IN case in front of you here is the one corner where the choice is not cosmetic but a matter of correctness.

Three-valued logic does the rest

SQL does not have two truth values, it has three: true , false , and unknown . Any comparison against NULL yields unknown, because NULL means "no value here" and you cannot ask whether an absent value is different from 20:

-- not false. unknown (displayed as a blank)<br>SELECT 10 <> NULL;<br>Now walk the NOT IN expansion for product 1 (category 10) against the archived set of 20 and NULL:

20 AND 10 <> NULL, which evaluate to true and unknown; the NULL poisons its branch, and AND carries the unknown to the root, so the row is dropped"><br>20 AND 10 <> NULL, which evaluate to true and unknown; the NULL poisons its branch,...

null category_id archived products from select

Related Articles