Dynamic Filters: 25x Faster Queries by Passing Info Between Operators

Dynamic Filters: Passing Information Between Operators During Execution for 25x Faster Queries - Apache DataFusion Blog

Posted on: Wed 10 September 2025 by Adrian Garcia Badaracco (Pydantic), Andrew Lamb (InfluxData)

Contents

Motivation and Results

Background: TopK and Dynamic Filters

Worked Example

TopK + Dynamic Filters

Hash Join + Dynamic Filters

Dynamic Filter Extensibility: Custom ExecutionPlan Operators

Design of Scan Operator Integration

Future Work

Acknowledgements

About the Authors

About DataFusion

Footnotes

Appendix

Queries and Data

Figure 1: ClickBench Q23

This blog post introduces the query engine optimization techniques called TopK and dynamic filters. We describe the motivating use case, how these optimizations work, and how we implemented them with the Apache DataFusion community to improve performance by an order of magnitude for some query patterns.

Motivation and Results

The main commercial product at Pydantic, Logfire, is an observability platform built on DataFusion. One of the most common workflows is "show me the last K traces", which translates to a query similar to:

SELECT * FROM records ORDER BY start_timestamp DESC LIMIT 1000;

We noticed this was pretty slow, even though DataFusion has long had the classic TopK optimization (described below). After implementing the dynamic filter techniques described in this blog, we saw performance improve by over 10x for this query pattern, and are applying the optimization to other queries and operators as well.

Let's look at some preliminary numbers, using ClickBench, which has the same pattern as our motivating example:

SELECT * FROM hits WHERE "URL" LIKE '%google%' ORDER BY "EventTime" LIMIT 10;

Figure 1: Execution times for ClickBench Q23 with and without dynamic filters (DF) [1] and late materialization (LM) [2] for different partitions / core usage. Dynamic filters alone (yellow) and late materialization alone (red) show a large improvement over the baseline (blue). When both optimizations are enabled (green) performance improves by up to 22x. See the appendix for more measurement details.

Background: TopK and Dynamic Filters

To explain how dynamic filters improve query performance, we first need to explain the so-called "TopK" optimization. To do so, we will use a simplified version of ClickBench Q23:

SELECT * FROM hits ORDER BY "EventTime" LIMIT 10

A straightforward, though slow, plan to answer this query is shown in Figure 2.

Figure 2: Simple query plan for ClickBench Q23. Data flows in plans from the scan at the bottom to the limit at the top. This plan reads all 100M rows of the hits table, sorts them by EventTime, and then discards everything except the top 10 rows.

This naive plan requires substantial effort as all columns from all rows are decoded and sorted, even though only 10 are returned.

High-performance query engines typically avoid the expensive full sort with a specialized operator that tracks the current top rows using a heap, rather than sorting all the data. For example, this operator is called TopK in DataFusion, SortWithLimit in Snowflake, and topn in DuckDB. The plan for Q23 using this specialized operator is shown in Figure 3.

Figure 3: Query plan for Q23 in DataFusion using the TopK operator. This plan still reads all 100M rows of the hits table, but instead of first sorting them all by EventTime, the TopK operator keeps track of the current top 10 rows using a min/max heap. Credit to Visualgo for the heap icon.
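The heap-based approach can be illustrated with a minimal Python sketch. This is not DataFusion's TopK implementation; the function name `top_k_smallest` and the use of plain integers in place of EventTime values are assumptions made for illustration. It keeps only the k smallest values seen so far, matching the ascending `ORDER BY ... LIMIT k` shape of the simplified query.

```python
import heapq
from typing import Iterable, List


def top_k_smallest(values: Iterable[int], k: int) -> List[int]:
    """Return the k smallest values in ascending order using a bounded heap.

    Hypothetical helper for illustration only.
    """
    if k <= 0:
        return []
    # heapq is a min-heap, so we store negated values. The heap root is then
    # the largest value currently kept, which is the first one to evict.
    heap: List[int] = []
    for value in values:
        if len(heap) < k:
            heapq.heappush(heap, -value)
        elif value < -heap[0]:
            # The new value beats the worst kept value, so replace it.
            heapq.heapreplace(heap, -value)
    return sorted(-v for v in heap)
```

The heap never holds more than k entries, so memory stays bounded by k rather than by the input size, and each input value costs at most one O(log k) heap operation instead of taking part in a full sort.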

Figure 3 is better, but it still reads and decodes all 100M rows of the hits table, which is often unnecessary once we have found the top 10 rows. For example, because the query sorts by EventTime in ascending order, if the current top 10 rows all have EventTime in 2024 or earlier while the query is running, then any subsequent rows with EventTime in 2025 or later can be skipped entirely without reading or decoding them. This technique is especially effective at skipping entire files or row groups if the top 10 values are in the first few files read, which is very common when the data insert order is approximately the same as the timestamp order.

Leveraging this insight is the key idea behind dynamic filters, which introduce a runtime mechanism for the TopK operator to provide the current top values to the scan operator, allowing it to skip unnecessary rows, entire files, or portions of files. The plan for Q23 with dynamic filters is shown in Figure 4.

Figure 4: Query plan for Q23 in DataFusion with specialized TopK operator and dynamic filters. The TopK operator provides the current threshold (for this ascending sort, the largest EventTime among the current top 10 rows) to the scan operator, allowing it to skip rows with EventTime later than that value. The scan operator uses this dynamic filter to skip unnecessary files and rows, reducing the amount of data that needs to be read and processed.
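The interaction between the two operators can be sketched in Python as follows. This is a simplified model rather than DataFusion's API: `FileChunk`, `DynamicFilter`, and `scan_with_top_k` are hypothetical names, each chunk stands in for a file or row group whose minimum value is known from metadata, and the sort is ascending as in the simplified query.

```python
import heapq
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FileChunk:
    """Hypothetical stand-in for a non-empty file or row group."""
    rows: List[int]  # EventTime values stored in this chunk

    @property
    def min_value(self) -> int:
        # A real scan would read this from file metadata, not from the rows.
        return min(self.rows)


class DynamicFilter:
    """Shared threshold that TopK updates and the scan consults."""

    def __init__(self) -> None:
        self.threshold: Optional[int] = None  # None means "no filter yet"

    def may_match(self, value: int) -> bool:
        return self.threshold is None or value < self.threshold


def scan_with_top_k(chunks: List[FileChunk], k: int) -> Tuple[List[int], int]:
    """Return the k smallest values and the number of chunks actually read."""
    if k <= 0:
        return [], 0
    dyn = DynamicFilter()
    heap: List[int] = []  # negated values, so the root is the largest kept
    chunks_read = 0
    for chunk in chunks:
        # Scan side: skip the whole chunk if even its smallest value fails.
        if not dyn.may_match(chunk.min_value):
            continue
        chunks_read += 1
        for value in chunk.rows:
            if not dyn.may_match(value):
                continue  # row-level skip
            if len(heap) < k:
                heapq.heappush(heap, -value)
            else:
                heapq.heapreplace(heap, -value)
            # TopK side: once the heap is full, publish the new threshold.
            if len(heap) == k:
                dyn.threshold = -heap[0]
    return sorted(-v for v in heap), chunks_read
```

In this sketch the threshold only tightens as better rows arrive, so a chunk rejected at one point would also be rejected later. The operators run in a single loop here for brevity; an engine that runs them concurrently would need the shared filter to be safe to update while scans are in progress.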

Worked Example

To make dynamic filters more concrete, here is a fully worked example....
