Why I Built My Own Data Pipeline Instead of Buying Clean Data


Christopher
Jun 12, 2026


Fourth in a series on building Metis, a trading engine asking: can understanding how energy flows let us see what moves natural gas prices before the market does? Previous posts covered SIMD performance debugging and why my data sources don’t share a clock.

My last post opened with this point: every discussion of alternative data eventually arrives at the same conclusion, which is to buy a vendor feed. Clean, documented, supported. Someone else has already solved the ingestion problem, the schema problem, the reliability problem. You pay for the abstraction and you get to think about signals instead of pipelines.


The other path is to build it yourself, source by source, directly from the origin.

The standard argument for the first path is time. The standard argument for the second is cost. Both are true and both miss the point.

The real argument for building your own is this: when you buy a clean feed, you are also buying someone else’s decisions about what to record, how to record it, and what to leave out. Those decisions are invisible in the data. They’re not documented in the schema. You inherit them silently, and they become assumptions in every model downstream.

Building your own does not eliminate those decisions. But it forces you to make them consciously.

The Question Behind the Question

Before writing a single ingester, I had to ask not just “how do I get this data” but “who recorded this, and why?”

That sounds philosophical. It has practical consequences.

Congressional bill data is where this is most visible. Congress.gov has a public API. It returns bill text, sponsor, co-sponsors, committee assignments, floor votes, and action dates. It’s well-documented and mostly reliable.

What it doesn’t return is anything that happened in a closed-door meeting. It doesn’t record why a senator who opposed a bill in committee suddenly voted for it on the floor. It doesn’t capture the conversation that happened before the vote. The API is a faithful record of the official process. The official process is not the same as the actual process.

This matters for signal construction. If you’re tracking Congressional activity as a leading indicator of energy policy change, you’re tracking the public record of a process that has a significant private component. The data tells you what was formally decided. It doesn’t tell you when the decision was actually made, or why it moved the way it did.

This isn’t a criticism of the data; it’s an accurate description of what it measures. The gap between the recorded vote and the actual decision is itself information, even if you can’t observe it directly. Building the pipeline yourself forces you to sit with that gap rather than trusting that a vendor has somehow resolved it.

CAISO electricity price data presents a different flavor of the same problem. The data itself is reliable: real-time LMP prices are published on a fixed schedule and the source is authoritative. The problem is encoding. CAISO occasionally publishes field values containing non-standard, SQL-incompatible characters, encoding artifacts that cause ingestion to fail silently or loudly depending on where in the pipeline the error surfaces. A vendor feed would handle this out of sight, and you would never learn it happened. Building it yourself means you encounter it directly, which means you know it exists and you know what your handling of it is.

Silent failure is worse than loud failure. At least a loud failure tells you something went wrong.
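To make “loud rather than silent” concrete, here is a minimal sketch of a field check that refuses to pass suspect characters downstream. It is an illustration, not the Metis code: the post does not say which characters CAISO emits, so treating control, format, private-use, and surrogate characters plus the Unicode replacement character as suspect is an assumption, and the names `DirtyFieldError`, `find_suspect_chars`, and `validate_field` are hypothetical.

```python
import unicodedata


class DirtyFieldError(ValueError):
    """Raised when a field contains characters the loader refuses to pass on."""


# Assumption: these Unicode categories are what we treat as "suspect".
# Cc = control, Cf = format, Co = private use, Cs = surrogate.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cs"}
REPLACEMENT_CHAR = "\ufffd"  # what a lossy decode leaves behind


def find_suspect_chars(value: str) -> list[tuple[int, str]]:
    """Return (index, description) for every suspect character in value."""
    suspects = []
    for index, ch in enumerate(value):
        if ch == REPLACEMENT_CHAR or unicodedata.category(ch) in SUSPECT_CATEGORIES:
            # Control characters have no Unicode name, so fall back to the code point.
            description = unicodedata.name(ch, f"U+{ord(ch):04X}")
            suspects.append((index, description))
    return suspects


def validate_field(field_name: str, value: str) -> str:
    """Return value unchanged if clean; raise loudly if anything is suspect."""
    suspects = find_suspect_chars(value)
    if suspects:
        raise DirtyFieldError(
            f"field {field_name!r} contains suspect characters: {suspects}"
        )
    return value


if __name__ == "__main__":
    print(validate_field("node", "EXAMPLE_NODE"))  # hypothetical clean value
    try:
        validate_field("lmp", "42.5\x00")  # hypothetical dirty value
    except DirtyFieldError as exc:
        print(f"rejected: {exc}")
```

The design choice is that the function never repairs anything. Whether to strip, replace, or quarantine a bad row is a separate decision, and raising forces that decision to be made where someone can see it.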

Not All Data Needs to Be Ingested Daily

The first architectural decision the pipeline makes isn’t about reliability or encoding. It’s about frequency.

The intuitive approach is to ingest everything as often as possible. Fresher data is better. Run everything daily.

This is wrong in two ways. It’s computationally wasteful: some sources publish weekly or monthly by definition, and re-fetching them daily produces identical results at the cost of wasted API calls. More importantly, it creates a false sense of resolution. Drought conditions don’t change daily. Building permits don’t change daily. Congressional action doesn’t change daily. Ingesting them daily doesn’t give you daily information; it gives you daily confirmation that nothing changed, dressed up as data.

The Metis pipeline uses three ingestion strategies, and the choice of strategy per source is a modeling decision:

Daily sources are those where the phenomenon genuinely changes at daily or sub-daily resolution: EIA natural gas storage, CAISO grid LMP prices, FRED macroeconomic indicators, weather, and maritime AIS vessel tracking. These run every day because the signal they carry can change every day.

Weekly sources run on Mondays: CME futures contracts,...
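One way to express per-source cadence is as plain data, so the modeling decision is visible in one place. The sketch below is an assumption about shape, not the Metis implementation: the `Cadence` enum, the source keys, and the function names are all hypothetical, and it models only the two cadences spelled out above.

```python
from datetime import date
from enum import Enum


class Cadence(Enum):
    DAILY = "daily"
    WEEKLY_MONDAY = "weekly_monday"


# Hypothetical source keys; the cadence assigned to each follows the text above.
SOURCE_CADENCE = {
    "eia_gas_storage": Cadence.DAILY,
    "caiso_lmp": Cadence.DAILY,
    "fred_macro": Cadence.DAILY,
    "weather": Cadence.DAILY,
    "ais_vessels": Cadence.DAILY,
    "cme_futures": Cadence.WEEKLY_MONDAY,
}


def is_due(cadence: Cadence, run_date: date) -> bool:
    """Decide whether a source with this cadence should run on run_date."""
    if cadence is Cadence.DAILY:
        return True
    if cadence is Cadence.WEEKLY_MONDAY:
        return run_date.weekday() == 0  # date.weekday(): Monday is 0
    raise ValueError(f"unhandled cadence: {cadence}")


def sources_due(run_date: date) -> list[str]:
    """Return the sorted names of every source scheduled for run_date."""
    return sorted(
        name for name, cadence in SOURCE_CADENCE.items() if is_due(cadence, run_date)
    )


if __name__ == "__main__":
    print(sources_due(date(2026, 6, 15)))  # a Monday
    print(sources_due(date(2026, 6, 16)))  # a Tuesday
```

Keeping the mapping as a dictionary rather than scattering schedules across cron entries means a change of cadence shows up as a one-line diff, which suits a choice the post treats as part of the model.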
