A Quick Primer on DuckDB, S3, and Plotly Studio

Chris Parmer

This stack is the easiest and fastest way to analyze and visualize large datasets.

I'm going to keep this post short, because the main point I want to share is that it's easy, capable, and you should try it out.

This post explores the NOAA weather station network data, which is hosted in Parquet files on their public S3 bucket. Each year is about 10 million rows of data and they host data back to 1750 (!) - so 100-200+ GB. What's great about this stack is that you don't need to download this data locally or set up any data warehouse to query it.

The data lives elsewhere, and the DuckDB process that runs in Plotly Studio on your machine does a bunch of clever things (partition pruning, predicate pushdown, streaming) to compute the results efficiently without fetching the entire dataset. And there's no server or database in between your computer and S3, so the architecture is simple and simple is good.
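If you're curious what that looks like outside of Plotly Studio, here is a minimal sketch in plain Python. It is not the code Plotly Studio generates. The column names (`ID`, `DATA_VALUE`), the `us-east-1` region, and the helper names `build_query` and `run_query` are assumptions for illustration, so check them against the actual files before relying on them.

```python
# A sketch under stated assumptions, not Plotly Studio's generated code.
# Assumed (not from the post): column names ID and DATA_VALUE, the us-east-1
# region, and the helper names build_query / run_query.
BASE = "s3://noaa-ghcn-pds/parquet/by_year"


def build_query(year: int, element: str) -> str:
    """Build SQL that touches only one YEAR/ELEMENT folder of the dataset."""
    if not element.isalnum():
        raise ValueError(f"unexpected element code: {element!r}")
    path = f"{BASE}/YEAR={int(year)}/ELEMENT={element}/*.parquet"
    return f"""
        SELECT ID, AVG(DATA_VALUE) AS avg_value, COUNT(*) AS n_obs
        FROM read_parquet('{path}')
        GROUP BY ID
        ORDER BY n_obs DESC
        LIMIT 10
    """


def run_query(sql: str):
    """Run the query with DuckDB (needs `pip install duckdb` and network access)."""
    import duckdb  # imported lazily so build_query works without DuckDB installed

    con = duckdb.connect()           # in-process, in-memory database
    con.execute("INSTALL httpfs")    # extension that lets DuckDB read from S3
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")
    return con.execute(sql).fetchall()


sql = build_query(2024, "TMAX")
print(sql)
# rows = run_query(sql)  # uncomment to actually hit S3
```

Because the path names a single `YEAR=.../ELEMENT=.../` folder and the query selects only two columns, DuckDB only needs to read a small slice of the bucket rather than the whole dataset.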

But you don't need to know any of that. The point is that it just works. And it works on cheap hardware, too.

This analysis was run on my travel M1 MacBook Air with 16GB of RAM, most of which is used by other applications.

One last thing - you'll notice in these examples that the data isn't in just a single file on S3: it's split apart into a bunch of files in folders like `/YEAR={year}/ELEMENT={element}/*.parquet` (so the folder name is literally `YEAR=2025/`). This is called "Hive partitioning" and if you are the one organizing your datasets, you should organize them this way. It's a convention that DuckDB officially supports when it makes queries (so it can join and aggregate data across files). The data that is most likely to be queried together should be in a single file, as that will be most efficient. And if your data is changing, you can "partition" it (i.e. put it in these folders) so that adding new data just means dropping a new file into a new folder, rather than editing an existing file.
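To show why the folder names matter, here is a small standard-library sketch of the idea behind partition pruning: the `key=value` folder names alone are enough to decide which files a query needs. The file names and the helpers `parse_partitions` and `prune` are hypothetical, and this illustrates the convention rather than how DuckDB implements it internally.

```python
from pathlib import PurePosixPath

# Hypothetical file listing that follows the YEAR=/ELEMENT= folder convention.
FILES = [
    "by_year/YEAR=2023/ELEMENT=TMAX/part-0.parquet",
    "by_year/YEAR=2024/ELEMENT=TMAX/part-0.parquet",
    "by_year/YEAR=2024/ELEMENT=PRCP/part-0.parquet",
]


def parse_partitions(path: str) -> dict[str, str]:
    """Extract the key=value folder names from a Hive-partitioned path."""
    partitions = {}
    for segment in PurePosixPath(path).parts[:-1]:  # skip the file name itself
        key, sep, value = segment.partition("=")
        if sep:  # only folders of the form key=value are partitions
            partitions[key] = value
    return partitions


def prune(paths, **wanted):
    """Keep only the files whose partition folders match every wanted value."""
    return [
        p for p in paths
        if all(parse_partitions(p).get(k) == str(v) for k, v in wanted.items())
    ]


print(parse_partitions(FILES[1]))
print(prune(FILES, YEAR=2024, ELEMENT="TMAX"))
```

A query that filters on `YEAR` and `ELEMENT` can skip every non-matching file without opening it, which is also why adding a new year is just a matter of adding a new folder.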

Now, some graphs.

Map of 130,000 weather stations in NOAA's dataset

Zooming in on the USA

One of my favorite uses of radial charts - showing temperature bands for two cities throughout the year.

To explore on your own, open up Plotly Studio and enter this prompt:

Query s3://noaa-ghcn-pds/parquet/by_year/YEAR={year}/ELEMENT={element}/*.parquet and s3://noaa-ghcn-pds/ghcnd-stations.txt with DuckDB. Then, make a map of all weather stations with hover data about the station.
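A note on that second file: `ghcnd-stations.txt` is a fixed-width text file rather than Parquet or CSV. Plotly Studio works out the parsing for you, but for reference here is a minimal sketch of reading one row by hand. The column positions are an assumption based on NOAA's GHCN-Daily readme, and the sample row and the `Station` / `parse_station` names are made up for illustration.

```python
from typing import NamedTuple


class Station(NamedTuple):
    id: str
    latitude: float
    longitude: float
    elevation: float
    state: str
    name: str


def parse_station(line: str) -> Station:
    """Parse one fixed-width row (assumed layout: NOAA GHCN-Daily readme)."""
    return Station(
        id=line[0:11].strip(),          # columns 1-11
        latitude=float(line[12:20]),    # columns 13-20
        longitude=float(line[21:30]),   # columns 22-30
        elevation=float(line[31:37]),   # columns 32-37
        state=line[38:40].strip(),      # columns 39-40
        name=line[41:71].strip(),       # columns 42-71
    )


# A made-up row built with the same column widths (not a real station).
sample = (
    f"{'XX000000001':<11} {40.7128:8.4f} {-74.006:9.4f} "
    f"{10.0:6.1f} {'NY':2} {'EXAMPLE STATION':<30}"
)
station = parse_station(sample)
print(station)
```

The latitude and longitude fields are what end up as points on the map, and the remaining fields make natural hover data.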

Ready, set

go!

OK, now turn it into a Dash app. I want the map on the left (full viewport height) and then a stack of time series charts on the right with appropriate rolling aggregations for each metric on the left of the app (also viewport). Clicking on stations on the left will update a dropdown of stations selected (multi dropdown) which will then update the charts on the right.

Dash app showing a map on the left and time series on the right. Click on a datapoint and the app will use DuckDB to fetch the weather metrics for that station from S3.

Every step in Plotly Studio auto-generates the code to query and visualize the data. It's not a black box; you can toggle the view and see the exact queries:

Toggle the preview into the code view and see exactly what each step did.

And the analytics approach derived from that code is presented in each step's Methodology:

The (auto-generated) Methodology explains the analytics behind each step.

It's fluid and creative.

Let's make radial charts that go around an entire year with some percentile bands comparing yearly temp in two different cities. To start, we'll do SF & NYC for last year (median, min/max per week) - going around a circle representing a year with nice labels for each month of the year.
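The analytics behind a chart like this are simple. As a rough sketch (not the code Plotly Studio generated), the weekly bands can be computed by bucketing daily readings into 52 weeks and mapping each week to an angle around the circle. The function name `weekly_bands`, the synthetic temperatures, and the choice to fold the last day or two of the year into the final week are all assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import median


def weekly_bands(daily):
    """Turn (date, temperature) pairs into one min/median/max row per week."""
    by_week = defaultdict(list)
    for day, temp in daily:
        # Weeks are numbered 0-51; day 365/366 is folded into the last week.
        week = min((day.timetuple().tm_yday - 1) // 7, 51)
        by_week[week].append(temp)
    return [
        {
            "week": week,
            "theta": week * 360 / 52,  # angle in degrees around the year
            "min": min(temps),
            "median": median(temps),
            "max": max(temps),
        }
        for week, temps in sorted(by_week.items())
    ]


# Synthetic data for illustration only: a repeating 0..6 pattern, not real weather.
start = date(2024, 1, 1)
daily = [(start + timedelta(days=i), float(i % 7)) for i in range(366)]
bands = weekly_bands(daily)
print(bands[0])
```

Each row's `theta` and its `min`, `median`, and `max` values can then be handed to a polar trace, such as Plotly's `Scatterpolar`, to draw the bands.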

It's never been easier to visualize data exactly how you want to.

Or as I like to call it, Dusk mode.

And best of all, sharing any of these apps, charts, or tables is simple because it's Plotly after all and you can just click "Publish" on any step - graphs, tables, apps - et voilà, a link you can share with anyone. It's also simple because the data is on S3, so you don't need to download or reupload files and you won't hit any kind of "maximum file size allowed" quotas. The networking is simple too, because there's nothing in between your code and the data. So it's simple x 3.

It doesn't get much simpler than this.
