VictoriaTraces: Tracing, Observability, and OpenTelemetry
On our project, we’ve gradually grown to the point where it’s time to have proper tracing: to build real observability, not just monitoring.
A long time ago I did something similar with Jaeger, which is a monster, and that write-up has been sitting in my drafts since 2019 or 2020. Since our entire stack right now is from VictoriaMetrics (VictoriaMetrics itself for metrics and VictoriaLogs for logs), we’ll use the VictoriaMetrics team’s solution for traces too: VictoriaTraces.
On top of that, VictoriaTraces is much lighter both in resources and in setup. It’s probably the same story as Loki vs VictoriaLogs: compared to Jaeger, VictoriaTraces has a much simpler setup and needs much less CPU/RAM.
This post was planned as the first in a series on traces – so the first half here will be more theoretical, on Observability and OpenTelemetry. And in the second half, we’ll spin up VictoriaTraces in Kubernetes.
In the previous post, "OpenTelemetry: OTel Collectors in Kubernetes and integration with VictoriaMetrics stack", I described a pure OpenTelemetry stack for metrics and logs, and in this post I’ll be referring back to it.
And in the next one, the third, we’ll look at how to create traces from Python.
Monitoring vs Observability
"Monitoring is a passive action. Observability is an active practice" – from the excellent book Learning OpenTelemetry: Setting Up and Operating a Modern Observability System.
The first thing I want to cover separately is the difference between "monitoring" and "observability".
These are often confused or used interchangeably – but while they’re related, they’re really about different concepts.
So, Monitoring is when we know in advance what can break and we set up checks specifically for that: "CPU above 90% => alert. Disk more than 85% full => alert. 5xx errors on the ALB => alert."
In other words, we’re answering questions we’ve already formulated in advance. It’s essentially a dashboard-driven approach: we look at known metrics and react to known problems.
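To make the "questions formulated in advance" idea concrete, here is a minimal sketch of such static checks in Python. The rule names, metric keys and the `evaluate()` helper are made up for illustration, and treating any 5xx count above zero as an alert is my assumption; in a real stack this logic would live in alerting rules, not in application code.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    """One pre-defined check: alert when `metric` rises above `threshold`."""
    name: str
    metric: str
    threshold: float


# The three examples from the text. Metric keys are hypothetical, and the
# zero threshold for 5xx errors is an assumption for this sketch.
RULES = [
    Rule("HighCPU", "cpu_percent", 90.0),
    Rule("DiskAlmostFull", "disk_used_percent", 85.0),
    Rule("ALB5xx", "alb_5xx_count", 0.0),
]


def evaluate(sample: Dict[str, float]) -> List[str]:
    """Return the names of all rules that fire for one metrics sample."""
    return [
        rule.name
        for rule in RULES
        if sample.get(rule.metric, 0.0) > rule.threshold
    ]


if __name__ == "__main__":
    print(evaluate({"cpu_percent": 95.0, "disk_used_percent": 50.0}))
```

Every rule here answers exactly one question we already knew to ask. If the problem isn’t covered by a rule, nothing fires.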
Observability is when the system lets us answer questions we didn’t formulate in advance: something "weird" is happening, and we can "dig out" the cause – even if we’ve never run into it before.
The key word here is explorability: the ability to investigate the connections and causes of problems.
For example, Backend API latency is up. Monitoring will just say "latency is high" (alert fired), but observability lets us drill down: walk through the whole chain and find the root cause. Is the latency spike on a specific endpoint? A specific tenant? A specific Kubernetes Pod? Maybe one upstream is slow? In other words, we go from the symptom to the cause using data that already exists in the system.
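That kind of drill-down can be sketched as "slice the same requests by different dimensions and see where the latency differs". The snippet below is only an illustration on synthetic data: the endpoints, tenants, pod names and latencies are invented, and `mean_latency_by()`, `spread()` and `most_suspicious()` are hypothetical helpers, not part of any tool.

```python
from collections import defaultdict
from itertools import product
from statistics import mean

# Synthetic request records: every endpoint/tenant/pod combination once.
# By construction, only requests served by "backend-2" are slow.
REQUESTS = [
    {
        "endpoint": endpoint,
        "tenant": tenant,
        "pod": pod,
        "latency_ms": 900 if pod == "backend-2" else 100,
    }
    for endpoint, tenant, pod in product(
        ("/api/orders", "/api/users"),
        ("acme", "globex"),
        ("backend-1", "backend-2"),
    )
]


def mean_latency_by(records, dimension):
    """Group records by one dimension and return the mean latency per value."""
    groups = defaultdict(list)
    for record in records:
        groups[record[dimension]].append(record["latency_ms"])
    return {value: mean(latencies) for value, latencies in groups.items()}


def spread(records, dimension):
    """Difference between the slowest and the fastest group in a dimension."""
    means = mean_latency_by(records, dimension).values()
    return max(means) - min(means)


def most_suspicious(records, dimensions):
    """Pick the dimension whose groups differ the most in mean latency."""
    return max(dimensions, key=lambda dim: spread(records, dim))


if __name__ == "__main__":
    for dim in ("endpoint", "tenant", "pod"):
        print(dim, mean_latency_by(REQUESTS, dim))
    print("look at:", most_suspicious(REQUESTS, ("endpoint", "tenant", "pod")))
```

Sliced by endpoint or by tenant, all groups look the same; sliced by pod, one group stands out. Nobody had to define an alert for "backend-2 is slow" in advance, the data was simply there to be explored.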
That’s actually why people talk about the "three pillars of observability": Metrics, Logs, Traces. Traces (distributed tracing) are usually what distinguishes "just monitoring" from observability in practice, because traces are what let us investigate an unknown problem: see the path of a request through services and find a bottleneck we didn’t anticipate.
That said, observability isn’t about some "magical dynamic alerts": we still keep regular pre-defined alerts in the system like "if 5xx is above 1%, send a message to Slack".
What changes is what happens after the alert fires. We don’t just see "this domain is returning errors" and go grep logs in VictoriaLogs by hand: we can walk the full path from the Slack alert, through the AWS ALB and the Kubernetes Pod, down to the component inside that Pod, and ultimately to the specific method() in the code that’s returning errors and to the user whose requests are making that method generate errors.
So alerting is still the "monitoring" part: observability starts the moment the alert fires and you need to understand why.
Observability isn’t about detecting problems, it’s about investigating them.
What is: Tracing
Tracing (or distributed tracing) is a way to follow the path of a single request through the entire system: from the moment it hits the ALB – through a Kubernetes Pod – to the database, an external API or an LLM call, and back.
Going forward we’ll be talking about VictoriaTraces, which is built on VictoriaLogs – because the tracing concept itself is the same as for logs: a service records every "blip" – every call, every action, every request to external systems. The difference from "just logging" is that traces have an ID that ties all related calls into a tree, which lets us build the full path of a request.
One such path is called a trace. A trace consists of spans, where each span is one operation: a specific HTTP request, an SQL query, a call to another service, a queue processing step. Spans are linked into a tree via trace_id (shared across the whole trace) and parent_id (who called this span).
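Here is a small sketch of how a flat list of spans turns into that tree. The `Span` class, the `span_id` field, the span names and the durations are all invented for illustration; only `trace_id` and `parent_id` come from the description above, and this is not how VictoriaTraces stores or renders spans internally.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span of one request
    span_id: str              # this span's own ID (assumed field)
    parent_id: Optional[str]  # ID of the calling span; None for the root
    name: str
    duration_ms: int


# Spans arrive as a flat list, possibly mixed with other traces.
SPANS = [
    Span("t1", "a", None, "GET /api/orders", 480),
    Span("t1", "b", "a", "auth: verify token", 30),
    Span("t1", "c", "a", "SELECT * FROM orders", 400),
    Span("t2", "x", None, "GET /health", 2),
    Span("t1", "d", "a", "POST payments-api", 40),
    Span("t1", "e", "b", "redis: GET session", 5),
]


def render_trace(spans: List[Span], trace_id: str) -> List[str]:
    """Rebuild the span tree of one trace and return it as indented lines."""
    # Index spans of this trace by their parent, keeping arrival order.
    children = defaultdict(list)
    for span in spans:
        if span.trace_id == trace_id:
            children[span.parent_id].append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in children.get(parent_id, []):
            lines.append(f"{'  ' * depth}{span.name} ({span.duration_ms} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)  # start from the root span, which has no parent
    return lines


if __name__ == "__main__":
    print("\n".join(render_trace(SPANS, "t1")))
```

With the sample data, the slow SQL query is immediately visible as the child span that takes most of the root span’s 480 ms, which is exactly the kind of question a flat log stream makes hard to answer.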
It looks roughly like...