The Hard Part of Alternative Data Isn’t Getting It. It’s Knowing What It Means.
Christopher
Christopher
Jun 11, 2026
This is the third post in a series on building Metis, a trading engine I’m developing to answer one question: can understanding how energy flows let us see what moves natural gas prices before the market does?

The first two posts covered a SIMD performance investigation: a silent compiler fallback and a Windows power management cap that together produced a 92% performance gap on identical hardware. This post is different. It’s about data, and specifically about a problem that almost nobody in quant circles writes about directly.
Read any data thread on r/quant and you will hit one of two dead ends. Someone complains about a $50k vendor feed that arrived full of NaNs, or someone drops a "just use Databento" and walks away. Both assume the hard part is the plumbing: get the data in cleanly and cheaply, and the rest follows.

I don’t think that’s true. Data access is largely a solved problem. The hard problem is temporal semantics: what does this data actually measure, when does it measure it, and how does its measurement frequency relate to the phenomenon you’re trying to model? Get that wrong and clean data produces wrong answers very confidently.

Building Metis forced me to think about this explicitly, because the signals I’m tracking don’t share a clock.
My Sources Don’t Share a Clock
Metis ingests eleven data sources. Here’s what the temporal structure of several of them actually looks like:

LMP electricity prices from CAISO arrive sub-hourly: real-time grid stress, priced every few minutes. EIA natural gas storage updates weekly, on Thursdays. The US Drought Monitor is also weekly, but each map assesses conditions as of Tuesday morning and is released that Thursday, scoring drought in categorical severity classes across the country. AIS maritime vessel data streams continuously but gets aggregated to daily port snapshots. Congressional bills have multiple economically meaningful dates per event (introduction, committee action, floor vote, presidential signature), and the market continuously reprices the probability of enactment at every step, meaning a single 'policy date' timestamp is a fiction. FRED macroeconomic series vary by series: some monthly, some quarterly, some revised backward.

The standard playbook is to resample everything to a shared daily frequency, align the timestamps, and feed it into an ML pipeline. It produces beautiful, clean tensors. It also creates a massive point-in-time (PIT) correctness problem. If you are not strictly respecting the exact publication timestamp, especially with backward-revised data like FRED, your daily resample is quietly injecting look-ahead bias directly into your model. Clean data will confidently give you a backtest that is entirely fictional.

Furthermore, the resample encodes a hidden assumption: that a weekly drought assessment stamped Tuesday carries the same kind of information as an LMP reading from that same Tuesday. They are not comparable. Forcing them into a shared timestamp doesn’t fix the alignment problem; it just hides it.

That changes the question from “how do I align these” to “what does each source actually measure, and at what granularity does that measurement become meaningful?”
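To make the point-in-time problem above concrete, here is a minimal sketch of an "as of" lookup. It assumes each observation is stored with both the period it describes and the moment it became public. The `Vintage` record, the `as_of` helper, and every number in it are made up for illustration; this is not Metis’s actual schema, just one way the idea could look.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Vintage:
    """One published value for one reference period (hypothetical schema)."""
    period: str             # what the number describes, e.g. "2026-03"
    published_at: datetime  # when this particular value became public
    value: float


def as_of(vintages: list[Vintage], period: str, now: datetime) -> Optional[float]:
    """Return the latest value for `period` that was already public at `now`.

    Returns None if nothing had been published yet, even when the
    reference period itself is already over.
    """
    known = [v for v in vintages if v.period == period and v.published_at <= now]
    if not known:
        return None
    return max(known, key=lambda v: v.published_at).value


# Made-up numbers: a monthly series with a first release and a later revision.
history = [
    Vintage("2026-03", datetime(2026, 4, 15, 8, 30), 101.0),  # first release
    Vintage("2026-03", datetime(2026, 5, 15, 8, 30), 97.5),   # backward revision
]

# March is over, but nothing has been published yet.
print(as_of(history, "2026-03", datetime(2026, 4, 1)))
# After the first release, before the revision.
print(as_of(history, "2026-03", datetime(2026, 4, 20)))
# After the revision.
print(as_of(history, "2026-03", datetime(2026, 6, 1)))
```

A table that keeps only the latest value per period would return 97.5 for all three dates, and that is exactly the look-ahead: a backtest running on April 20 would be trading on a number that did not exist until mid-May.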
Legislative Time Is Not Market Time
Congressional bills are where this gets most interesting, and most underappreciated.

The obvious approach is to timestamp a bill by its signing date and call it a signal. But a bill’s signing date is just one of several economically meaningful moments in its lifecycle, and often not the most important one. Introduction tells you something about legislative intent and political pressure. Committee action tells you something about viability. Floor passage tells you the policy is real and coming. Presidential signature makes it law. And then there’s a gap, sometimes months and sometimes years, between law on paper and economic reality.

Trump’s tariff announcements are the clearest example of why this matters. The signing dates were the least interesting timestamps. The market started repricing supply chains before the ink dried, and the more consequential signal, countries remaking trade agreements without the US, played out over months after the legal fact. China and the EU didn’t wait for the next round of negotiations; they started building alternative relationships that will persist regardless of what US policy does next.

For LNG specifically this creates a structural question: when a US export policy changes, which timestamp do you use? The announcement, the legal enactment, the first cargo diversion, or the point where European terminals had locked in alternative supply contracts? These are not the same event. They’re four different signals with four different lags, and treating any one of them as the canonical “policy date”...