The Mirror Is Part of the Machine


The Mirror Is Part of the Machine | Yusuf Aytas

The worst telemetry problems I have seen did not start with waste. They started with an incident. We could not see enough, and the missing field became the villain of the postmortem. So we added it. Then we added the neighbouring fields too, because nobody wanted the mirror to go dark the next time. The decision was reasonable at the time. That is what makes the mess harder to catch.

After enough incidents, you start carrying every scar. Debug logs from old failures, labels added for one investigation, dashboards built during a rollout, alerts created after one bad night. Nobody thinks they are creating an observability problem. In fact, we feel like we are just polishing the mirror for the next oncall engineer.

Then the mirror gets heavy bit by bit. The bill rises, queries slow down, dashboards contradict each other, and security finds customer identifiers reflected in places they should never have reached. Eventually we decide enough is enough. We clean up a few fields, reduce retention, delete stale dashboards, and sample more aggressively. That creates a cycle of sinning and repenting.


Telemetry is Treated Like Exhaust

Telemetry often begins as exhaust. Logs, metrics, traces, profiles, events, and audit records come out the side of the application.

A database schema, an API contract, a queue, a cache, or a new external dependency will usually get reviewed, and security-sensitive product data gets some kind of review too. Telemetry changes often slide through as implementation detail.

In production, telemetry consumes CPU, memory, network, and disk. It draws engineering attention, security review time, and budget. It can get heavy enough to interfere with the workload it is supposed to reflect.

An application container gets sized for business logic, then an agent, sidecar, collector, logger, or profiler joins the party. Maybe the overhead is tiny per pod but it can get enormous across a fleet. Exhaust vents away and disappears. A mirror stays inside and shapes what you see.
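To put rough numbers on "tiny per pod, enormous across a fleet", here is a minimal sketch. The pod count and the per-pod agent overhead are made-up figures for illustration, not measurements from any real system, and `fleet_overhead` is a hypothetical helper.

```python
def fleet_overhead(pods: int, cpu_millicores_per_pod: float, memory_mib_per_pod: float):
    """Return (total CPU cores, total memory in GiB) consumed by telemetry agents."""
    total_cores = pods * cpu_millicores_per_pod / 1000  # 1000 millicores = 1 core
    total_gib = pods * memory_mib_per_pod / 1024        # 1024 MiB = 1 GiB
    return total_cores, total_gib


# Hypothetical fleet: 20,000 pods, each with an agent using 50 millicores and 128 MiB.
cores, gib = fleet_overhead(pods=20_000, cpu_millicores_per_pod=50, memory_mib_per_pod=128)
print(f"{cores:.0f} cores, {gib:.0f} GiB reserved for watching the workload")
```

Under these assumed numbers, an overhead nobody would notice on one pod adds up to a thousand cores and a couple of terabytes of memory across the fleet.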

The Bill is An Architecture Review

A telemetry bill tells you what your architecture swept under the rug. It exposes the mess that looked harmless while it was spread across services: too many clever components, too many retries, health checks, labels, and debug logs quietly multiplying in the background. Nobody feels the damage at the point of creation because the feedback arrives far downstream.

A developer adds a field today, the reviewer sees useful context, the platform team sees an ingestion spike later, finance sees the invoice after that, and security finds the accidental data exposure during a review months later. By then, maybe the developer has moved teams. Perhaps service ownership has changed.

That is how a local decision becomes a global cost. The person creating the work is not the person paying the queueing cost, and the person who understands the risk may not have the authority to block it. The invoice is the only mirror that never lies.

Cardinality Is Where Context Becomes Cost

The most common technical explanation of telemetry cost is cardinality. In a time series database, a metric becomes the combination of its name and its labels. Every unique label set creates a distinct time series.

This is fine when the labels are bounded: service, env, region, status_code, route_template, team, zone. These labels describe stable operational dimensions. They let you group, filter, alert, and compare without creating an unbounded mess.

Then a team adds user_id to a metric to track down a hot partition issue. Three weeks later, the platform team sees active series explode, the team's manager is told they got a huge bill, and security realizes customer identifiers are now part of metric storage.

The storage layer sees the cross-product of every label's possible values, so a single unbounded label multiplies everything that was already there. That is where engineering intent, database physics, vendor pricing, and ownership gaps collide.
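The cross-product is easy to underestimate, so here is a small sketch of the arithmetic. The label names come from the text above, but the value counts are invented for illustration, and `max_series` is a hypothetical helper. It computes a worst-case upper bound: a real time series database only stores the label combinations that actually occur, which is usually fewer.

```python
from math import prod


def max_series(distinct_values_per_label: dict[str, int]) -> int:
    """Worst-case series count for one metric: the product of each label's distinct values."""
    return prod(distinct_values_per_label.values())


# Hypothetical counts for bounded, operational labels.
bounded = {"service": 40, "env": 3, "region": 5, "status_code": 8, "route_template": 25}

# The same metric after someone adds user_id for a hot partition investigation.
with_user_id = {**bounded, "user_id": 100_000}

bounded_series = max_series(bounded)
exploded_series = max_series(with_user_id)

print(f"bounded labels: up to {bounded_series:,} series")
print(f"with user_id:   up to {exploded_series:,} series")
```

With these assumed counts the bounded labels top out at 120,000 series, and adding `user_id` raises the ceiling by a factor of 100,000. Even if only a fraction of those combinations ever appears, the growth is driven by the one label nobody can bound.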

Logs Are Anxiety With Timestamps

Metrics usually explode through cardinality. Logs usually explode through fear. That fear is rational. Missing one log line during an incident can waste hours. So teams learn the easy lesson and add more logs.

Logging without a question leaves us with log dumps. A good log explains a state transition, a boundary crossing, a decision, a rejection, a fallback, or a failure. Compliance makes this worse when retention and indexing get treated as the same decision. Maybe you need to keep audit records for ninety days. That does not mean every debug line belongs in a hot searchable index for ninety days.
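One way to keep retention and indexing as separate decisions is to make the policy explicit per record type. The sketch below is illustrative only: the tier names, the `kind` and `level` fields, and every number except the ninety-day audit retention are assumptions, not a description of any particular logging product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePolicy:
    tier: str            # "hot_index" is searchable and expensive; "cold_archive" is cheap and slow
    retention_days: int  # how long the record is kept in that tier


def policy_for(record: dict) -> StoragePolicy:
    """Pick where a log record lives and for how long (hypothetical rules)."""
    if record.get("kind") == "audit":
        # Compliance needs the record kept, not kept searchable.
        return StoragePolicy(tier="cold_archive", retention_days=90)
    if record.get("level") == "DEBUG":
        # Debug lines are useful during an investigation, then mostly noise.
        return StoragePolicy(tier="hot_index", retention_days=3)
    # Everything else: searchable for a couple of weeks (assumed default).
    return StoragePolicy(tier="hot_index", retention_days=14)


audit_policy = policy_for({"kind": "audit", "level": "INFO"})
debug_policy = policy_for({"kind": "app", "level": "DEBUG"})
print(audit_policy, debug_policy)
```

The point of the split is that the ninety-day requirement attaches to the audit record and to cheap storage, while the expensive index gets a much shorter window.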

The Real Unit of Telemetry is a Decision

Most telemetry cost discussions start with volume: gigabytes per day, spans per second, active series, indexed logs, retention days, cardinality, query load. I'm all for tracking those numbers, but they are not the real unit.

The real unit of telemetry is a decision. What decision does this signal support? Does...
