For roughly the last ten years, a meaningful percentage of my working hours have been spent thinking about observability. If you're not familiar with the term, "observability" is what we call it now that "monitoring" doesn't sound expensive enough. The actual work is unglamorous in that you collect a lot of logs, some metrics, a few traces, and then you give them to people.<br>I generally like my job. I like that we're always trying new ideas and approaches. I like the fact that when things go wrong, the answer is almost always sitting there in the data, waiting to be found by whoever is patient enough to look. But I want to be honest with you: in ten years of doing this work, across a half-dozen companies and every observability platform you've heard of and a few you probably haven't, logs have never stopped being the worst part of the job. They were the worst part when I started. They are the worst part today. I fully expect them to be the worst part of this job forever until the robots rise up and rip my head off in one clean sweep.<br>I've written about why logs are terrible before, so I'll spare you the full lecture and give you the short version.<br>Every developer's expectations for logs are set by a single formative experience: the syslog box. Or a container running locally. Or tail -f on a production server they probably shouldn't have SSH'd into. The point is that at some early, tender moment in their career, they had an experience with logs that was flawless. They ran grep and something useful came back. They piped it into jq and got exactly what they needed.<br>This experience is the observability equivalent of a first kiss. It ruins them for everything that comes after.<br>Because here is the thing about that flawless experience: it works because the system is small, the volume is trivial, and the person querying is the same person who wrote the log line. 
There is no schema drift, no cardinality explosion, no cross-team consumer with dashboard expectations, no VP asking why the "revenue events" graph has a gap in it.<br>Then there are forty services. Now there are four hundred. Now the logs are being consumed not just by developers but by customer service, who need to look up a specific user's failed checkout from Tuesday. And by the data team, who are quietly building a business-critical dashboard on top of a log line that a backend engineer is about to refactor without telling anyone. And by the on-call, who at 3 AM does not want to learn a new query language, does not want to think about index patterns, and would like the search bar to just work.<br>So you have a technical problem (the volume is enormous, the shape is inconsistent, the queries are unpredictable) sitting on top of an expectations problem, which is worse. Developers want logs instantly, they want to run arbitrary operations on them, and they will not commit to a schema. Meanwhile the less-technical consumers of that same data want the dashboards to be stable forever, the UI to be forgiving, and the whole thing to feel like a normal product. These two audiences are, in most practical respects, at war with each other, and you are the diplomat.<br>ClickHouse<br>ClickHouse came out of Yandex, where it was built to chew through analytical queries against absurd volumes of clickstream data. It was not designed for observability. It just happens to be shockingly good at it, because clickstream data and observability data have a lot in common: high volume, append-heavy, time-ordered, mostly read in aggregate, and every so often you need to reach in and find one specific needle.<br>You can run it yourself with Helm charts. You can point Grafana at it via the ClickHouse plugin, or use their own web UI, or bring your own frontend. Their docs are actually good, which I mention because it's rare enough to be worth flagging.
I've never used their ClickStack setup though, so YMMV.<br>For observability specifically, the OpenTelemetry Collector has a ClickHouse exporter, which means you can pipe OTLP data straight in and let it manage the initial schema for you. ClickHouse is designed to scan billions of rows and ingest an amount of data that, when you first see the numbers, makes you assume they're lying. They're not lying. You query it with SQL, which is a language that already exists and was not created by a startup two weeks ago.<br>But why ClickHouse specifically for logs?<br>I've been ranting about logs, and now I'm explaining why I like administering ClickHouse more. Let me take a second to explain why ClickHouse is really good at logs at scale.<br>Logs, as a data shape, have some peculiar properties. They're append-only. You never update a log line, and you almost never delete a single one, though you delete a lot of them at once when retention kicks in. They arrive roughly in time order, though never actually in order. They're read in bursts: nobody looks at logs for days, and then during an incident somebody wants to scan a billion of them in seconds....
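To make that data shape concrete, here is a toy sketch in Python. It is not how ClickHouse works internally, and the class and method names (`DayPartitionedLog`, `append`, `expire`, `scan`) are made up for illustration. It only models the three properties above: rows are appended and never updated, retention throws away whole days at once, and a read touches only the days it needs and sorts at query time because arrival order can't be trusted.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone


class DayPartitionedLog:
    """Toy append-only log store, partitioned by day (illustration only)."""

    def __init__(self):
        # Maps a calendar date to the list of (timestamp, line) rows for that day.
        self.partitions = defaultdict(list)

    def append(self, ts, line):
        # Rows are only ever appended. A late arrival simply lands in the
        # partition for its own day, in whatever order it showed up.
        self.partitions[ts.date()].append((ts, line))

    def expire(self, now, keep_days):
        # Retention drops whole partitions, never individual rows.
        cutoff = (now - timedelta(days=keep_days)).date()
        expired = [day for day in self.partitions if day < cutoff]
        for day in expired:
            del self.partitions[day]
        return len(expired)

    def scan(self, start, end, needle):
        # Only partitions overlapping the time range are read; matching rows
        # are sorted at read time because arrival order is not guaranteed.
        hits = []
        for day, rows in self.partitions.items():
            if start.date() <= day <= end.date():
                hits.extend(
                    (ts, line)
                    for ts, line in rows
                    if start <= ts <= end and needle in line
                )
        return sorted(hits)
```

Feeding it a handful of out-of-order lines, calling `expire` with a seven-day window, and then scanning the last hour for "failed" is enough to see the pattern: the cheap operations are "add a row", "drop a day", and "read a time range", which is roughly the workload the paragraph above describes.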