A Quick Primer on DuckDB, S3, and Plotly Studio · Chris Parmer
This stack is the easiest and fastest way to analyze and visualize large<br>datasets.
I'm going to keep this post short, because the main point I want to share is<br>that it's easy, capable, and you should try it out.
This post explores the NOAA weather station network data that is<br>hosted on their public S3 account in Parquet files. Each year is about 10<br>million rows of data and they host data back to 1750 (!) - so 100-200+ GB.<br>What's great about this stack is that you don't need to download this data<br>locally or set up a data warehouse to query it.
The data lives elsewhere, and the DuckDB process that runs inside Plotly<br>Studio on your machine does a bunch of clever things (partitioning, predicate<br>pushdown, streaming) to capture the results efficiently without fetching the<br>entire dataset. And there's no server or database in between your computer<br>and S3, so the architecture is simple and simple is good.
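For a sense of what a direct query can look like outside of Plotly Studio, here is a minimal Python sketch. The bucket path and the `YEAR` / `ELEMENT` partition keys come from this post. The `TMAX` element code, the `us-east-1` region setting, and the helper names `build_query` and `run_query` are assumptions made for illustration, and this is not necessarily the code Plotly Studio generates.

```python
# Sketch: query NOAA's public Parquet files on S3 directly with DuckDB.
# Running the query assumes `pip install duckdb` and network access.

BASE = "s3://noaa-ghcn-pds/parquet/by_year"  # path from this post


def build_query(year: int, element: str) -> str:
    """Return SQL that counts the rows for one year and one element."""
    if not element.isalnum():
        raise ValueError(f"unexpected element code: {element!r}")
    return f"""
        SELECT COUNT(*) AS n_rows
        FROM read_parquet(
            '{BASE}/YEAR=*/ELEMENT=*/*.parquet',
            hive_partitioning = true
        )
        -- Filters on partition columns let DuckDB skip non-matching folders.
        WHERE "YEAR" = {int(year)} AND "ELEMENT" = '{element}'
    """


def run_query(sql: str):
    """Execute the SQL against S3 from an in-memory DuckDB connection."""
    import duckdb  # imported lazily so build_query works without the package

    con = duckdb.connect()  # no database file, no server
    con.execute("INSTALL httpfs")  # extension that adds s3:// support
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")  # assumed bucket region
    return con.execute(sql).fetchall()
```

Calling `run_query(build_query(2024, "TMAX"))` should return a single row with the count, and only the files in the matching folders need to be read to produce it.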
But you don't need to know any of that. The point is that it just works. And<br>it works on cheap hardware, too.
This analysis was run on my travel M1 MacBook Air with 16GB of RAM, most<br>of which is used by other applications.
One last thing - you'll notice in these examples that the data isn't in<br>just a single file on S3: it's split apart into a bunch of files in<br>folders like `/YEAR={year}/ELEMENT={element}/*.parquet` (so the<br>folder name is literally `YEAR=2025/`). This is called "Hive<br>partitioning" and if you are the one organizing your datasets, you<br>should organize them this way. It's a convention that DuckDB officially<br>supports when it makes queries (so it can join and aggregate data across<br>files). The data that is most likely to be queried together should be in<br>a single file as that will be most efficient. And if your data is<br>changing, you can "partition" it (i.e. put it in these folders) in a way so<br>that adding new data just means dropping a new file into a new folder,<br>rather than editing an existing file.
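To make the folder convention concrete, here is a small Python sketch that builds a Hive-style folder path and reads the key/value pairs back out of it. The helper names `partition_dir` and `parse_partitions` and the bucket `s3://my-bucket/weather` are hypothetical. DuckDB does the parsing for you when it reads partitioned data, so this only illustrates the naming scheme.

```python
def partition_dir(root: str, **keys) -> str:
    """Build a Hive-style folder path such as root/YEAR=2025/ELEMENT=TMAX."""
    parts = [f"{name}={value}" for name, value in keys.items()]
    return "/".join([root.rstrip("/"), *parts])


def parse_partitions(path: str) -> dict:
    """Recover the key/value pairs that a Hive-style path encodes."""
    pairs = {}
    for part in path.split("/"):
        if "=" in part:
            name, _, value = part.partition("=")
            pairs[name] = value
    return pairs


# Adding next year's data means writing a file into a brand-new folder;
# no existing file has to be touched.
new_folder = partition_dir("s3://my-bucket/weather", YEAR=2026, ELEMENT="TMAX")
# new_folder == "s3://my-bucket/weather/YEAR=2026/ELEMENT=TMAX"
```

The values come back as strings (`"2026"`), because a folder name carries no type information.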
Now, some graphs.
Map of 130,000 weather stations in NOAA's dataset
Zooming in on the USA
One of my favorite uses of radial charts - showing temperature bands for<br>two cities throughout the year.
To explore on your own, open up<br>Plotly Studio and enter this prompt:
Query<br>s3://noaa-ghcn-pds/parquet/by_year/YEAR={year}/ELEMENT={element}/*.parquet<br>and s3://noaa-ghcn-pds/ghcnd-stations.txt with DuckDB. Then, make a map<br>of all weather stations with hover data about the station.
Ready, set
go!
OK, now turn it into a Dash app. I want the map on the left (full<br>viewport height) and then a stack of time series charts on the right<br>with appropriate rolling aggregations for each metric on the left of the<br>app (also viewport). Clicking on stations on the left will update a<br>dropdown of stations selected (multi dropdown) which will then update<br>the charts on the right.
Dash app showing map on the left and time series on the right. Click on<br>a datapoint and the app will use DuckDB to fetch the weather metrics<br>for that station from S3.
Every step in Plotly Studio auto-generates the code to query and visualize<br>the data. It's not a black box. You can toggle the view and see the exact<br>queries:
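A per-station lookup like the one behind that click might look roughly like the sketch below. It is an illustration under stated assumptions, not the code Plotly Studio produced: the column names `ID`, `DATE`, and `DATA_VALUE`, the `us-east-1` region, and the helper names `station_query` and `fetch_station` are all assumed. The year is written into the path to narrow the set of files, and the station and element are passed as query parameters.

```python
BASE = "s3://noaa-ghcn-pds/parquet/by_year"  # path from this post


def station_query(station_id: str, year: int, element: str):
    """Return (sql, params) for one station's readings in one year."""
    sql = f"""
        SELECT "DATE", DATA_VALUE          -- assumed column names
        FROM read_parquet(
            '{BASE}/YEAR={int(year)}/ELEMENT=*/*.parquet',
            hive_partitioning = true
        )
        WHERE "ELEMENT" = ? AND ID = ?     -- ID is an assumed column name
        ORDER BY "DATE"
    """
    return sql, [element, station_id]


def fetch_station(station_id: str, year: int, element: str):
    """Run the query from an in-memory DuckDB connection and return rows."""
    import duckdb  # lazy import; requires `pip install duckdb`

    sql, params = station_query(station_id, year, element)
    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")
    con.execute("SET s3_region = 'us-east-1'")  # assumed bucket region
    return con.execute(sql, params).fetchall()
```

In a Dash callback, the station ID would come from the clicked map point, and the returned rows would feed the time series charts.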
Toggle the preview into the code view and see exactly what each step<br>did.
The analytics approach derived from that code is presented in each<br>step's Methodology:
The (auto-generated) Methodology explains the analytics behind each<br>step.
It's fluid and creative.
Let's make a radial charts that go around an entire year with some<br>percentile bands comparing yearly temp in two different cities. To<br>start, we'll do SF & NYC for last year (median, min/max per week -<br>going around a circle representing a year with nice labels for each<br>month of the year.
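Under the hood, a prompt like this boils down to a weekly aggregation plus a mapping from week to angle. The Python sketch below shows one way to do that with the standard library. The function names, the 7-day bucketing from January 1, and the sample temperatures are assumptions for illustration, not what Plotly Studio generates.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def weekly_bands(readings):
    """Group (date, temp) pairs into 7-day buckets and summarize each one."""
    by_week = defaultdict(list)
    for day, temp in readings:
        week = (day.timetuple().tm_yday - 1) // 7  # 0-based week of the year
        by_week[week].append(temp)
    return {
        week: {"min": min(v), "median": median(v), "max": max(v)}
        for week, v in sorted(by_week.items())
    }


def week_angle(week: int) -> float:
    """Map a 0-based week index to degrees around the circle (0 = Jan 1)."""
    return week * 7 / 365 * 360.0


sample = [(date(2024, 1, d), float(d)) for d in range(1, 9)]  # made-up temps
bands = weekly_bands(sample)
# bands[0] == {"min": 1.0, "median": 4.0, "max": 7.0}
```

Each band then becomes a radius and `week_angle(week)` the angle, which is the shape of data a polar trace such as Plotly's `Scatterpolar` (`r`, `theta`) expects.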
It's never been easier to visualize data exactly how you want to.
Or as I like to call it, Dusk mode.
And best of all, sharing any of these apps, charts, or tables is simple<br>because it's Plotly, after all, and<br>you can just click "Publish" on any step<br>- graphs, tables, apps - et voilà, a link you can share with anyone. It's<br>also simple because the data is on S3, so you don't need to download or<br>reupload files and you don't hit any kind of "maximum file size allowed" quota.<br>The networking is simple too, because there's nothing in between your code<br>and the data. So it's simple x 3.
It doesn't get much simpler than this.