A Software Engineer's Guide to Observability: Part 2 - APM & Distributed Tracing
Blueground Engineering's observability guide to APM: why tracing matters, auto-instrumentation strategies, custom span best practices, and AI-enhanced debugging workflows
In Part 1, we covered logging as your forensics tool for understanding what happened after you've identified a problem. Logs are great at telling you what occurred, but they leave you guessing about the how fast and the where.
Picture this: A user complains that checkout is slow. You check the logs. You see the request came in at 14:23:01.234 and the response went out at 14:23:09.891. Cool. That's about 8.7 seconds. But why? Was it the payment gateway? A database query? A retry loop somewhere? All of the above?
Logs alone won't tell you. That's the gap that APM (Application Performance Monitoring) and distributed tracing fill.
Why APM
APM is a category of tooling that continuously monitors the performance and behavior of your application in production, tracking things like request latency, error rates, throughput, and the time spent in individual code paths.
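To make those metrics concrete, here is a minimal sketch of how latency percentiles, error rate, and throughput could be derived from raw request records. The record shape (`durationMs`, `status`) and the `percentile` and `summarize` helpers are assumptions made up for illustration; an APM backend computes these for you.

```javascript
// Hypothetical request records; an APM agent would collect these automatically.
const requests = [
  { durationMs: 120, status: 200 },
  { durationMs: 95, status: 200 },
  { durationMs: 2300, status: 200 },
  { durationMs: 180, status: 500 },
  { durationMs: 140, status: 200 },
];

// Nearest-rank percentile over an array of numbers.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize a window of requests observed over `windowSeconds`.
function summarize(records, windowSeconds) {
  const durations = records.map((r) => r.durationMs);
  const errors = records.filter((r) => r.status >= 500).length;
  return {
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
    errorRate: errors / records.length,
    throughputPerSec: records.length / windowSeconds,
  };
}

const summary = summarize(requests, 10);
console.log(summary);
```

Note how the median looks healthy while the p95 exposes the one slow request. That is why APM dashboards tend to lead with percentiles rather than averages.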
Here's a concrete scenario. You're running a property rental platform. Monday morning, your alerting fires: Core Web Vitals are degrading. LCP is creeping past 3 seconds on your search pages, and your conversion funnel is starting to show it. You check your logs: requests are coming in, responses are going out, no obvious errors. Everything looks fine on the surface.
You open your APM dashboard. Immediately, you can see that search requests have been trending up in latency for the past two hours. The flame graph for a representative request tells the story at a glance: the application server is fast, but there's a 2-second gap sitting inside a call to the pricing service. Drill into that span and you find a database query doing a full table scan on a 400k-row table, because an index was never added when the feature launched six months ago.
Without APM, you would have been digging through logs for hours trying to correlate timestamps across services. With it, you had a root cause in minutes.
How APM and distributed tracing work
At its core, APM instruments your application to record spans: discrete units of work with a start time, duration, and attached metadata. These spans are shipped to a central collector (Datadog, Honeycomb, Jaeger, etc.) where they can be queried, visualized, and alerted on.
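As a rough mental model, a span is little more than a named record with IDs, a start time, a duration, and tags. The sketch below is not any vendor's API; `startSpan`, `finishSpan`, and every field name are hypothetical and only illustrate the shape of the data.

```javascript
const { randomBytes } = require('node:crypto');

// Hypothetical helper: create a span record. A real tracer does this for you.
function startSpan(name, { traceId, parentId = null, tags = {} } = {}) {
  return {
    traceId: traceId ?? randomBytes(8).toString('hex'), // shared by every span in a trace
    spanId: randomBytes(8).toString('hex'),             // unique to this unit of work
    parentId,                                           // links a child span to its caller
    name,
    tags: { ...tags },                                  // attached metadata
    startMs: Date.now(),
    durationMs: null,                                   // filled in when the span finishes
  };
}

// Hypothetical helper: close the span by recording how long it took.
function finishSpan(span) {
  span.durationMs = Date.now() - span.startMs;
  return span;
}

// A request span with one child span for a downstream query.
const root = startSpan('GET /search', { tags: { 'http.method': 'GET' } });
const child = startSpan('pricing.query', { traceId: root.traceId, parentId: root.spanId });
finishSpan(child);
finishSpan(root);
console.log(root, child);
```

The parent/child links are what let a collector render a flame graph: each span knows which trace it belongs to and which span started it.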
Distributed tracing is what happens when a single user request touches multiple services. Each service emits its own spans, but they're stitched together into a single trace using a shared trace_id that propagates through HTTP headers, message queue payloads, and async jobs. The result is an end-to-end timeline of everything that happened across every service for a single request.
If APM is the monitoring layer (dashboards, latency percentiles, error rates, throughput), then distributed tracing is the investigation layer: a detailed map of what actually executed when a specific request came in. Together, they give you both the signal (something's wrong) and the explanation (here's exactly where and why).
Figure: All spans share trace_id a7f3c291 and are stitched into one unified trace in the APM collector.
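The stitching can be sketched in a few lines. This is an in-process simulation under stated assumptions: the header names `x-trace-id` and `x-parent-span-id`, the `inject` and `extract` helpers, and the two "services" are all made up for illustration. Real tracers use their own header formats (the W3C Trace Context `traceparent` header is one standard) and make the call over the network.

```javascript
const { randomBytes } = require('node:crypto');
const newId = () => randomBytes(8).toString('hex');

// Stand-in for the APM collector: every service ships its spans here.
const collector = [];

// Caller side: copy the current trace context into the outgoing headers.
function inject(span, headers = {}) {
  return { ...headers, 'x-trace-id': span.traceId, 'x-parent-span-id': span.spanId };
}

// Callee side: continue the caller's trace, or start a new one if no context arrived.
function extract(headers, name) {
  return {
    traceId: headers['x-trace-id'] ?? newId(),
    parentId: headers['x-parent-span-id'] ?? null,
    spanId: newId(),
    name,
  };
}

// Hypothetical "pricing service": handles the downstream call.
function pricingService(headers) {
  collector.push(extract(headers, 'pricing.calculate'));
}

// Hypothetical "search service": handles the user request and calls pricing.
function searchService(incomingHeaders) {
  const span = extract(incomingHeaders, 'GET /search');
  pricingService(inject(span)); // in reality this would be an HTTP call
  collector.push(span);
  return span;
}

const rootSpan = searchService({});
const trace = collector.filter((s) => s.traceId === rootSpan.traceId);
console.log(trace.map((s) => `${s.name} (parent: ${s.parentId})`));
```

Neither service knows about the other's spans. The collector can still rebuild the full request because both spans carry the same trace ID and the child points at its parent.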
What is auto-instrumentation?
Auto-instrumentation is the practice of having your tracing library automatically capture telemetry data without you writing any tracing code. You add the library, configure a few environment variables, and it patches common frameworks and drivers at runtime to start emitting spans.
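To demystify the "patches at runtime" part, here is a simplified sketch of the general technique: replacing a method on an existing object with a wrapper that times the call. `fakeDriver`, `patch`, and `recordedSpans` are hypothetical names, and the wrapper only handles synchronous functions. Real tracing libraries also deal with promises, callbacks, and async context, so treat this as an illustration of the idea rather than how any particular library is implemented.

```javascript
// Stand-in for a third-party database driver the application already uses.
const fakeDriver = {
  query(sql) {
    return `rows for: ${sql}`;
  },
};

// Stand-in for the buffer of spans a tracer would ship to its collector.
const recordedSpans = [];

// Replace obj[method] with a wrapper that times the call and records a span.
function patch(obj, method, spanName) {
  const original = obj[method];
  obj[method] = function patched(...args) {
    const start = process.hrtime.bigint();
    try {
      return original.apply(this, args);
    } finally {
      // Runs whether the call returned or threw, so failures are recorded too.
      const durationMs = Number(process.hrtime.bigint() - start) / 1e6;
      recordedSpans.push({ name: spanName, resource: String(args[0]), durationMs });
    }
  };
}

// A tracing library would do this at startup, before application code runs.
patch(fakeDriver, 'query', 'db.query');

// Application code is unchanged, yet a span is emitted.
const rows = fakeDriver.query('SELECT * FROM listings');
console.log(rows, recordedSpans);
```

This also hints at why tracing libraries usually need to be loaded before the modules they instrument: code that grabbed a reference to the original function before patching would bypass the wrapper.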
Without writing a single line of custom tracing code, you get:
HTTP requests and responses (inbound and outbound)
Database queries (SQL, NoSQL, Redis calls)
Message queue operations (Kafka, RabbitMQ, SQS)
Common framework operations (Express routes, Koa middleware, Fastify handlers)
Automatic trace context propagation across service boundaries
That last one is subtle but critical. Auto-instrumentation doesn't just trace individual services; it connects the traces across service boundaries so you get a unified view of the entire request journey.
What does auto-instrumentation miss?
Auto-instrumentation is powerful, but it has a blind spot: your business logic.
It sees HTTP calls and database queries, but it doesn't know that you're "calculating pricing" or "validating inventory" or "processing a refund." It doesn't understand the chapters of your request's story, only the infrastructure-level sentences.
This is where custom instrumentation becomes critical.
Where to add custom instrumentation
Not everywhere. Every span has a cost: CPU and memory overhead in your application, network bandwidth to ship the data, and indexing on the backend. The key is instrumenting what matters.
1. Instrument business-critical operations
If it matters to your business, it should have a span. Not just for performance tracking, but for understanding when and why operations fail.
const tracer = require('dd-trace');
async function createBooking(bookingData) {
  const span =...