"what if you don't have the dataset?"


“What if you don't have the dataset?” · Chris Parmer

My partner's family is from San Diego and so we frequently<br>drive up and down the coast from San Francisco. It takes about 8-10 hours,<br>but we've grown to favor splitting the drive in half on one leg of the trip<br>and staying overnight in the little towns stuck in time along the 1,<br>like Pismo Beach and Cayucos.

During one of these drives last year, I was in the middle of an<br>intensely creative and industrious period building out the next version of<br>Plotly Studio - an LLM-powered analytics<br>and visualization app - and our roadtrip turned into a<br>rubber duck discussion<br>about its underpinnings and possibilities.

Talking out loud - especially to those who aren't so close to the problem -<br>almost always brings up new ideas. In this case, it was my partner<br>who brought up an idea - and stubbornly held on to it - that fooled me and later<br>surprised me.

What if you don't have a dataset?<br>Why can't the app just find the data for you?

The first step in doing any data analytics or visualization is, of course,<br>uploading or connecting to your dataset.

The idea that we could just skip this step seemed ridiculous to me at that point.<br>The web is mostly full of unstructured data<br>(documents, text) and in my experience the big open data<br>providers like Kaggle<br>are awash with fabricated datasets (you can tell because all of the data is<br>uniformly distributed).

Reliable websites like Wikipedia don't have much in the way<br>of structured datasets and scientific journals are often paywalled or don't include<br>data outside of small tables embedded in PDFs.

So I shrugged off the idea.

But then, weeks later, I started just asking open-ended questions while QA'ing my prototypes<br>without providing any dataset of my own.

And I found that Plotly Studio - through its LLM provider - had a curious<br>and specific knowledge of primary data sources on the web.

Data sources with obscure URLs serving file formats of yesteryear.<br>A data source to seemingly help you answer any question that you might have<br>about the world.

Here are some of my favorite examples of data<br>that LLMs surfaced for me over the last couple of months, and that have really come alive in my own<br>personal life.

Water Temperatures

I do a fair amount of open water swimming off the coast of San Francisco<br>and was surprised to find plentiful water temperature data courtesy of<br>NOAA's buoys.

This data is available through these (previously undiscoverable, at least to me!)<br>URLs that serve opaque data structures.<br>Like all of the examples here, these URLs were not found through web search -<br>they were just in the LLM's world knowledge. Yes, that's right - the LLMs<br>just know about these URLs that look like this:<br>https://www.ndbc.noaa.gov/data/realtime2/{station_id}.txt and know<br>that the station I'm interested in is probably 46026.

A graph of ocean temp (the line) below the daily air temperature bands<br>that I made in Plotly Studio. Plotly Studio<br>fetched the air temp data from the Open-Meteo API and the water temp data<br>from NOAA buoys off the coast of SF.<br>Water was 49°F this week from a storm that caused an upwelling of the cold, deep ocean water.<br>This coincided with a week of warm air temperatures, showcasing one of the widest air-water<br>temperature differences of the year!

The code that Plotly Studio generated via an LLM to fetch the data and make the graph.<br>That URL ("https://www.ndbc.noaa.gov/data/realtime2/{station_id}.txt") and<br>station number (46026) were not provided by me nor discovered in web search - they were<br>remarkably just part of the LLM's world knowledge. As was the knowledge about the data structure<br>and how to parse it.
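For a sense of what "knowing how to parse it" involves, here is a minimal sketch of my own (not the code Plotly Studio generated). It assumes the file is whitespace-delimited, that the first `#` line names the columns and the second gives units, that water temperature lives in a `WTMP` column in degrees Celsius, and that missing readings are written as `MM`. The function names are hypothetical.

```python
from urllib.request import urlopen

# URL pattern as quoted above; 46026 is the station mentioned in the text.
NDBC_URL = "https://www.ndbc.noaa.gov/data/realtime2/{station_id}.txt"


def parse_ndbc(text):
    """Return [(timestamp, water_temp_c)] from NDBC realtime text.

    Assumptions: first '#' line = column names, second '#' line = units,
    'MM' = missing value, WTMP = water temperature in degrees Celsius.
    """
    lines = text.strip().splitlines()
    # Column names, e.g. "#YY MM DD hh mm ... WTMP ..."
    columns = lines[0].lstrip("#").split()
    rows = []
    for line in lines[1:]:
        if line.startswith("#"):
            continue  # skip the units line
        values = dict(zip(columns, line.split()))
        if values.get("WTMP", "MM") == "MM":
            continue  # no water temperature reported for this reading
        timestamp = "{YY}-{MM}-{DD}T{hh}:{mm}Z".format_map(values)
        rows.append((timestamp, float(values["WTMP"])))
    return rows


def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


def fetch_water_temps(station_id="46026"):
    """Download and parse one station's file (requires network access)."""
    url = NDBC_URL.format(station_id=station_id)
    with urlopen(url, timeout=30) as resp:
        return parse_ndbc(resp.read().decode("utf-8", errors="replace"))
```

Calling `fetch_water_temps()` should return the station's recent readings, newest first if the file is ordered that way; `c_to_f` converts them to the Fahrenheit values I quote in this post.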

311 Civic Data

311 data - the city complaint hotline - is a treasure trove of data and<br>is remarkably accessible and well known by LLMs.

One of my favorite queries is to look up recent graffiti complaints in the city<br>as a little underground art tour (one citizen's graffiti complaint is another citizen's masterpiece!).

Prompt: "connect to SF 311 data and show me new graffiti complaints over the last week on a map".<br>This data was courtesy of Socrata's API endpoint:<br>https://data.sfgov.org/resource/vw6y-z8j6.json<br>Another remarkable example of the obscure world knowledge in LLMs -<br>`vw6y-z8j6.json` is not a common URL pathname!
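To make that endpoint a little less opaque, here is a rough sketch of how a "graffiti in the last week" request could be built with Socrata's SoQL query parameters (`$select`, `$where`, `$order`, `$limit`). The column names (`requested_datetime`, `service_name`, `lat`, `long`) and the `'%Graffiti%'` match are my assumptions about this dataset's schema, and `build_graffiti_query` is a hypothetical helper, not what Plotly Studio generated.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode

# Endpoint as quoted above.
SF_311_ENDPOINT = "https://data.sfgov.org/resource/vw6y-z8j6.json"


def build_graffiti_query(now=None, days=7, limit=1000):
    """Build a SoQL query URL for recent graffiti-related 311 requests."""
    now = now or datetime.now()
    since = (now - timedelta(days=days)).strftime("%Y-%m-%dT%H:%M:%S")
    params = {
        # Assumed column names; check the dataset's schema before relying on them.
        "$select": "requested_datetime, service_name, lat, long",
        "$where": (
            f"requested_datetime >= '{since}' "
            "AND service_name like '%Graffiti%'"
        ),
        "$order": "requested_datetime DESC",
        "$limit": limit,
    }
    # urlencode percent-encodes '$', spaces, and quotes for us.
    return SF_311_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_graffiti_query())
```

Fetching the printed URL should return a JSON array of matching rows, which is all a map needs once each row carries a latitude and longitude.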

311 data is available in most major cities. In preparing for a talk I gave in Boston, I plotted<br>the trajectory of the snow storm of the season by tracking 311 complaints about snow.

Capturing the eye of the storm rolling in through Boston at 2:30AM by visualizing<br>cumulative snow-related 311 complaints in Boston. This was Plotly Studio in its early Beta UI -<br>oh how much cleaner it looks today!
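The "cumulative complaints" view is simple to compute once you have the complaint timestamps. A minimal sketch, assuming the timestamps arrive as ISO 8601 strings (the helper name is hypothetical, and this is not the code Plotly Studio generated):

```python
from collections import Counter
from datetime import datetime
from itertools import accumulate


def cumulative_by_hour(timestamps):
    """Return [(hour_start, running_total)] for ISO 8601 timestamp strings."""
    # Bucket each complaint into the hour it was filed.
    per_hour = Counter(
        datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        for ts in timestamps
    )
    ordered_hours = sorted(per_hour)
    # Running total of complaints up to and including each hour.
    running_totals = accumulate(per_hour[hour] for hour in ordered_hours)
    return list(zip(ordered_hours, running_totals))
```

Plotting the running total against the hour gives a curve that steepens when complaints start pouring in, which is how the storm's arrival shows up.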

At a recent SF meetup, we wondered how likely our cars parked on Valencia St<br>would be to get a parking ticket:

Map of parking tickets in San Francisco.<br>Made with the prompt: "connect to SF open data and show me data about parking tickets on Valencia street - how likely, when, and a map"

When are...
