The Observability Stack That Makes Datadog Look Absurd at Scale

ClickStack: The Observability Stack That Makes Datadog Look Absurd at Scale | by Mohd Amaan | Jun, 2026 | Medium

Mohd Amaan

10 min read · 5 days ago

Lessons from rolling out ClickStack across 15 services — the architecture that makes it cheap, the deployment patterns that actually scale, and the honest tradeoffs nobody warns you about.

The Observability Bill That Started It

Most engineering teams hit the same wall, usually around the time someone in Finance prints out the Datadog quote and asks why it costs more than the team building the product. The conversation goes the same way every time.

“Why is it so expensive?” Because we’re ingesting a lot of telemetry. “Can we ingest less?” We could, but then we lose visibility. “Can we sample harder?” We already sample. “Is there an alternative?” Yes. “How much engineering work?” Pause.

That pause is where most teams give up and pay whatever the bill says. We didn’t. We evaluated the open-source alternatives before committing, and the answer turned out to be ClickStack: an open-source observability stack from ClickHouse that bundles HyperDX (the UI), ClickHouse (the storage and query engine), and a tuned OpenTelemetry Collector into a single deployable system. Launched in May 2025 after ClickHouse acquired HyperDX, it’s now in production at organizations ingesting billions of high-cardinality events per day.

I rolled this out across 15 internal services over the past few months. This post is the deep-dive I wish someone had handed me on day one: the architecture, the deployment lessons, and the comparison against Datadog (and the open-source alternatives) that I actually trust.

ClickHouse / HyperDX

What ClickStack Actually Is

ClickStack is three components glued together with strong defaults:

OpenTelemetry Collector: a pre-configured OTel Collector that ingests logs, metrics, and traces over OTLP and writes them to ClickHouse using batched, optimized inserts.

ClickHouse: the high-performance columnar analytical database that stores everything. It is the same database powering analytics workloads at Uber, Cloudflare, eBay, and many other companies that have hit “billions of events per day.”

HyperDX UI: a developer-focused observability interface that runs SQL and Lucene-style queries against ClickHouse, with built-in dashboards, alerting, trace exploration, and session replay.

The pitch is straightforward: OpenTelemetry-native ingestion, ClickHouse-level query performance, and an engineer-friendly UI, all open source, deployable anywhere, with cost structures an order of magnitude cheaper than proprietary SaaS at scale.

But what makes it interesting isn’t any single component. It’s the architectural decision sitting underneath the whole thing: treating observability as an analytical workload, not a logging workload.

ClickStack architecture overview
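To make "batched inserts" concrete, here is a minimal Python sketch of the idea: rows accumulate in memory and are handed to the storage layer in a few large writes, not one write per event. This is not the Collector's implementation (its batching is configured, not hand-written), and `BatchBuffer`, `sink`, `max_rows`, and `demo_batch_sizes` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


class BatchBuffer:
    """Collects rows and passes them to `sink` in batches of at most `max_rows`."""

    def __init__(self, sink: Callable[[List[Row]], None], max_rows: int = 1000) -> None:
        if max_rows < 1:
            raise ValueError("max_rows must be at least 1")
        self._sink = sink          # stand-in for "one INSERT into the database"
        self._max_rows = max_rows
        self._rows: List[Row] = []

    def add(self, row: Row) -> None:
        """Buffer one row; flush automatically once the batch is full."""
        self._rows.append(row)
        if len(self._rows) >= self._max_rows:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as a single batch (no-op when empty)."""
        if self._rows:
            batch, self._rows = self._rows, []
            self._sink(batch)


def demo_batch_sizes(total_rows: int, max_rows: int) -> List[int]:
    """Push `total_rows` fake events through a buffer and record each batch size."""
    sizes: List[int] = []
    buf = BatchBuffer(lambda batch: sizes.append(len(batch)), max_rows=max_rows)
    for i in range(total_rows):
        buf.add({"service": "checkout", "seq": i})
    buf.flush()  # drain the final partial batch
    return sizes
```

Calling `demo_batch_sizes(2500, 1000)` produces three writes instead of 2,500. A production batcher would normally also flush on a timer so that a quiet service does not hold events indefinitely; that is left out here to keep the sketch short.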

The Architecture Bet: “Wide Events” on a Columnar Engine

Traditional observability platforms grew up around three separate signal types: logs (text), metrics (numbers), and traces (call graphs), each with its own storage backend, query language, and retention model. You’d run Elasticsearch for logs, Prometheus for metrics, Jaeger for traces, and stitch them together in your head during incidents.

ClickStack rejects that split. Everything is a wide event: a context-rich row in a ClickHouse table with whatever fields the application sends. A log line is a wide event. A trace span is a wide event with extra fields. A metric data point is a wide event with a numeric value. The schema is flexible because ClickHouse natively supports JSON columns, dynamic typing, and schema-on-read.

Three architectural properties of ClickHouse make this work in ways that surprise people:

1. Columnar storage with absurd compression

ClickHouse stores data column-by-column, like a data warehouse. For observability data, which is overwhelmingly repetitive (same hostnames, same service names, same log levels, same trace IDs), compression ratios of 10–20x are routine. We’re storing 2 weeks of high-cardinality production telemetry in roughly the same disk footprint that a comparable Elasticsearch deployment would have needed for one week of the same data, and at meaningfully lower cost than what SaaS providers charge for equivalent retention.

2. Parallel, vectorized query execution

ClickHouse uses every CPU core on every node, executing queries in vectorized batches. A WHERE service='checkout' AND status_code >= 500 GROUP BY user_id query across 30 days of trace data, the kind of query that takes 8 seconds in Elasticsearch, returns in under a second on a moderately sized cluster.

3. Native JSON without schema migrations

Observability schemas evolve constantly. New...
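As a rough illustration of why a columnar layout suits the filter-and-group query from point 2, the Python sketch below stores a handful of wide events column by column and answers the query by reading only three of the columns. It is a toy model of the access pattern, not how ClickHouse executes queries. The column names `service`, `status_code`, and `user_id` come from the query above, while the `body` column, the sample values, and the `errors_by_user` function are made up for the example.

```python
from collections import Counter
from typing import Any, Dict, List

# Hypothetical wide events stored column by column: one list per field,
# where index i across all lists makes up event i.
columns: Dict[str, List[Any]] = {
    "service":     ["checkout", "checkout", "search", "checkout", "checkout"],
    "status_code": [200, 502, 500, 503, 500],
    "user_id":     ["u1", "u2", "u3", "u2", "u4"],
    "body":        ["ok", "upstream timeout", "index down", "upstream timeout", "db error"],
}


def errors_by_user(cols: Dict[str, List[Any]], service: str, min_status: int = 500) -> Counter:
    """Count matching events per user_id.

    Mirrors: WHERE service = <service> AND status_code >= <min_status> GROUP BY user_id.
    Only three columns are touched; the wide `body` column is never read.
    """
    counts: Counter = Counter()
    for svc, status, user in zip(cols["service"], cols["status_code"], cols["user_id"]):
        if svc == service and status >= min_status:
            counts[user] += 1
    return counts


if __name__ == "__main__":
    print(errors_by_user(columns, "checkout"))
```

In a row-oriented store the same query would have to load every field of every event, including large text bodies, just to inspect two of them. A columnar engine skips the untouched columns entirely, and that saving grows as events get wider.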
