Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries - Apache DataFusion Blog
Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries
Posted on: Wed 10 September 2025 by Adrian Garcia Badaracco (Pydantic), Andrew Lamb (InfluxData)
Contents<br>Motivation and Results
Background: TopK and Dynamic Filters
Worked Example
TopK + Dynamic Filters
Hash Join + Dynamic Filters
Dynamic Filter Extensibility: Custom ExecutionPlan Operators<br>Design of Scan Operator Integration
Future Work
Acknowledgements
About the Authors
About DataFusion
Footnotes
Appendix<br>Queries and Data<br>Figure 1: ClickBench Q23
This blog post introduces the query engine optimization techniques called TopK<br>and dynamic filters. We describe the motivating use case, how these<br>optimizations work, and how we implemented them with the Apache DataFusion<br>community to improve performance by an order of magnitude for some query<br>patterns.
Motivation and Results¶
The main commercial product at Pydantic, Logfire, is an observability<br>platform built on DataFusion. One of the most common workflows / queries is<br>"show me the last K traces" which translates to a query similar to:
SELECT * FROM records ORDER BY start_timestamp DESC LIMIT 1000;
We noticed this was pretty slow, even though DataFusion has long had the<br>classic TopK optimization (described below). After implementing the dynamic<br>filter techniques described in this blog, we saw performance improve by over 10x<br>for this query pattern, and are applying the optimization to other queries and<br>operators as well.
Let's look at some preliminary numbers, using ClickBench, which has<br>the same pattern as our motivating example:
SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;
Figure 1 : Execution times for ClickBench Q23 with and without dynamic<br>filters (DF)1, and late materialization<br>(LM)2 for different partitions / core usage.<br>Dynamic filters alone (yellow) and late materialization alone (red) show a large<br>improvement over the baseline (blue). When both optimizations are enabled (green)<br>performance improves by up to 22x. See the appendix for more measurement details.
Background: TopK and Dynamic Filters¶
To explain how dynamic filters improve query performance, we first need to<br>explain the so-called "TopK" optimization. To do so, we will use a simplified<br>version of ClickBench Q23:
SELECT *<br>FROM hits<br>ORDER BY "EventTime"<br>LIMIT 10
A straightforward, though slow, plan to answer this query is shown in Figure 2.
Figure 2 : Simple Query Plan for ClickBench Q23. Data flows in plans from the<br>scan at the bottom to the limit at the top. This plan reads all 100M rows of the<br>hits table, sorts them by EventTime, and then discards everything except the top 10 rows.
This naive plan requires substantial effort as all columns from all rows are<br>decoded and sorted, even though only 10 are returned.
High-performance query engines typically avoid the expensive full sort with a<br>specialized operator that tracks the current top rows using a heap, rather<br>than sorting all the data. For example, this operator<br>is called TopK in DataFusion, SortWithLimit in Snowflake, and topn in<br>DuckDB. The plan for Q23 using this specialized operator is shown in Figure 3.
Figure 3 : Query plan for Q23 in DataFusion using the TopK operator. This<br>plan still reads all 100M rows of the hits table, but instead of first sorting<br>them all by EventTime, the TopK operator keeps track of the current top 10<br>rows using a min/max heap. Credit to Visualgo for the<br>heap icon
Figure 3 is better, but it still reads and decodes all 100M rows of the hits table,<br>which is often unnecessary once we have found the top 10 rows. For example,<br>while running the query, if the current top 10 rows all have EventTime in<br>2025, then any subsequent rows with EventTime in 2024 or earlier can be<br>skipped entirely without reading or decoding them. This technique is especially<br>effective at skipping entire files or row groups if the top 10 values are in the<br>first few files read, which is very common when the<br>data insert order is approximately the same as the timestamp order.
Leveraging this insight is the key idea behind dynamic filters, which introduce<br>a runtime mechanism for the TopK operator to provide the current top values to<br>the scan operator, allowing it to skip unnecessary rows, entire files, or portions<br>of files. The plan for Q23 with dynamic filters is shown in Figure 4.
Figure 4 : Query plan for Q23 in DataFusion with specialized TopK operator<br>and dynamic filters. The TopK operator provides the minimum EventTime of the<br>current top 10 rows to the scan operator, allowing it to skip rows with<br>EventTime later than that value. The scan operator uses this dynamic filter<br>to skip unnecessary files and rows, reducing the amount of data that needs to<br>be read and processed.
Worked Example¶
To make dynamic filters more concrete, here is a fully worked example....