The Modern Data Stack Is Dead. Who Killed It?

MaxLenormand1 pts0 comments


TL;DR

The modern data stack (ETL + warehouses + orchestration tools) is breaking down.
AI can now write transformations and queries.
The new bottleneck is running that code safely, at scale.
Fused is a serverless execution layer for data and AI workflows.

What is Fused?

Fused is a serverless execution layer for data and AI workflows. With Fused, teams can write simple Python functions and run them at scale, directly on their data, without having to manage any infrastructure.

We have a murder on our hands.

The victim? The modern data stack.

For years, data teams have relied on a patchwork of tools: ETL pipelines, warehouses, orchestration layers, and dashboards. Getting anything done often meant stitching together 5-6 different systems just to move and transform data.

Something changed.

The modern data stack wasn’t killed by one company or project. It was replaced by a shift across the entire ecosystem.

We've rounded up the suspects who may have been at the scene of the crime, and we're going to question them one by one.

Suspect A: AI Code Editors

The most obvious suspect at this crime scene is AI.

Claude Code, OpenAI Codex, and Cursor are everywhere now. While AI hasn't fully replaced software engineers (yet), it has radically transformed how data teams write and ship code.

As an example, our Developer Advocate, Max, never took the time to learn SQL, but AI models now let him write increasingly complex queries without needing to memorize every piece of SQL syntax.

Our own Fused docs have shifted from being written purely for human developers to being written for human and AI developers, with things like simple llms.txt files that let AI models easily parse documentation and make sense of it.

It would, however, be all too simple to pin this crime on the AI scapegoat. The investigation is worth continuing a little further.

Suspect B: S3 + Parquet

Our next suspect is actually two working in tandem.

File formats like Parquet allow large, partitioned files that store massive amounts of data directly on cloud storage — no complex database systems required. Parquet files are fast to read and write, and they're portable: openable anywhere, downloadable to disk, and compatible with everything from pandas in Python to DuckDB natively. They really shine when paired with cloud storage, since most modern libraries can read and write to the cloud directly:

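import pandas as pd

# Read from cloud (s3:// paths require the s3fs package to be installed)
df = pd.read_parquet("s3://bucket/file.parquet")

# Transform (placeholder step: replace with your own logic)
df = df.dropna()

# Save
df.to_parquet("s3://bucket/processed_file.parquet")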

Data management doesn't simply go away — it still requires clever sorting and partitioning so files are efficient to find and read. But all of that is now possible directly in Parquet files stored on a cloud bucket.
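To make the sorting and partitioning point concrete, here is a minimal sketch in plain Python. It assumes a Hive-style `key=value` directory layout, which is a common convention but not necessarily the one used for the datasets in this article. The function names, the `part-0.parquet` file name, and the sample records are all hypothetical, and no Parquet files are actually written; the sketch only plans where rows would go and which paths a filtered read would need to open.

```python
from collections import defaultdict


def partition_key(record, columns):
    """Build a Hive-style prefix such as 'region=eu/year=2024'."""
    return "/".join(f"{col}={record[col]}" for col in columns)


def plan_partitions(records, partition_cols, sort_col):
    """Group records by partition prefix and sort the rows inside each group."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_key(record, partition_cols)].append(record)
    # Sorting inside a partition keeps similar values next to each other,
    # which makes it easier for a reader to skip data it does not need.
    return {
        f"{prefix}/part-0.parquet": sorted(rows, key=lambda r: r[sort_col])
        for prefix, rows in groups.items()
    }


def prune(paths, **filters):
    """Keep only the paths whose partition segments match every filter."""
    wanted = {f"{col}={value}" for col, value in filters.items()}
    return [p for p in paths if wanted <= set(p.split("/"))]


# Hypothetical sample data
records = [
    {"region": "eu", "year": 2024, "id": 7},
    {"region": "us", "year": 2024, "id": 3},
    {"region": "eu", "year": 2024, "id": 2},
    {"region": "eu", "year": 2023, "id": 9},
]

layout = plan_partitions(records, ["region", "year"], sort_col="id")
eu_2024 = prune(sorted(layout), region="eu", year=2024)
print(eu_2024)  # only one of the three planned files needs to be opened
```

With pandas, passing `partition_cols=[...]` to `DataFrame.to_parquet` produces a directory layout of this general shape, so the pruning logic above is roughly what a reader gets to exploit.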

Example: the USDA's Crop Data Layer dataset converted to H3 hexagons for partitioning and stored as Parquet files directly on S3. Each file covers a specific area of interest, and each Parquet file is split into row groups so that a query reads only the row group it needs.

This leads us naturally to the next suspect: query engines that make it trivially easy to read from a whole directory of Parquet files.

Suspect C: DuckDB

DuckDB is a top-tier suspect — close to the victim, practically family. A relative of the modern data stack that may have committed the crime on its own kin.

DuckDB is an open-source, in-process analytical database management system. Beyond the technical jargon: it lets you write SQL without needing a dedicated server to manage, and it runs incredibly fast on columnar data. It also queries a directory of Parquet files hosted on an S3 bucket directly, treating them like a database.

-- Searching for a specific ID across 50GB of parquet files on S3 is a single query now
SELECT *
FROM read_parquet('s3://overturemaps-us-west-2/registry/*.parquet')
WHERE id = 'fea28f69-7afa-460c-b270-61ef74cd340c';

This lets you take a simple query, run it from a laptop, and scale it to huge amounts of data on S3. Engines like DuckDB are optimized for analytics and benefit from Parquet's partitioning and per-row-group statistics, which let them skip entire files and row groups that cannot match a query.
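The skipping of row groups can be illustrated with a toy model. Parquet stores minimum and maximum values per column for each row group in its metadata, and a reader can compare a filter against those statistics before touching the data. The sketch below simulates that idea in plain Python; the `RowGroup` class and both functions are hypothetical stand-ins, not DuckDB's or Parquet's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_id: int   # smallest id in this row group (the "statistics")
    max_id: int   # largest id in this row group
    rows: list    # the data itself, only read when the statistics allow a match


def build_row_groups(sorted_ids, group_size):
    """Split already-sorted ids into fixed-size row groups with min/max stats."""
    groups = []
    for start in range(0, len(sorted_ids), group_size):
        chunk = sorted_ids[start:start + group_size]
        groups.append(RowGroup(min_id=chunk[0], max_id=chunk[-1], rows=chunk))
    return groups


def lookup(row_groups, target):
    """Return (found, number of row groups actually scanned)."""
    scanned = 0
    for group in row_groups:
        if target < group.min_id or target > group.max_id:
            continue  # statistics prove the target cannot be in this group
        scanned += 1
        if target in group.rows:
            return True, scanned
    return False, scanned


groups = build_row_groups(list(range(100)), group_size=10)
found, scanned = lookup(groups, 42)
print(found, scanned)  # True 1: only one of the ten row groups was scanned
```

The model only works this well because the ids are sorted before being split into groups. With unsorted data every group's min/max range would overlap the target, and nothing could be skipped, which is why sorting still matters when writing the files.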

Large queries that would blow up in a regular pandas run are now possible in-memory with minimal setup:

SELECT
    user_id,
    count(*) AS event_count,
    approx_quantile(duration_ms, 0.95) AS p95
FROM read_parquet('s3://bucket/events/*.parquet')
GROUP BY user_id;

DuckDB is our main example here, but the pattern is bigger. Other engines like Polars reflect the same shift: open formats on object storage plus a query runtime that can move to the data. At the time of writing, the two libraries are neck and neck, with DuckDB gaining steam over Polars.

DuckDB makes even more sense in a world where LLMs have gotten incredible at writing even complex SQL queries, lowering the bar to entry dramatically. That...
