Ask HN: Debugging failure in large interconnected back end systems

Ifedayo_s | 1 point | 0 comments

I’m trying to understand how teams actually debug production issues in systems made up of multiple services and external integrations (e.g. Stripe, Twilio, internal microservices, queues, webhooks, etc.).

In practice, when something breaks, it seems like the workflow is usually:

- an alert fires (Datadog/Sentry/CloudWatch/etc.)
- or a customer complains
- engineers then start checking logs, traces, dashboards across multiple systems
- and eventually manually reconstruct what happened across services

What I’m curious about:

- How do you actually trace a single failed request or transaction across multiple services today?
- What tools do you rely on most in practice (not in theory)?
- Where does it usually break down: logs, tracing, instrumentation, or just missing context?
- How long does it typically take to go from “something is wrong” → “we know exactly why it broke”?
- What part of this is still mostly manual stitching together of information?

Trying to understand what the real pain points are in practice, especially in systems with lots of external integrations and async flows.
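
To make the "manual stitching" step concrete, here is a rough sketch of what I imagine it looks like when every service already logs a shared correlation ID. Everything in it is an assumption for illustration: the record fields (`ts`, `service`, `correlation_id`, `msg`), the sample log entries, and the helper `stitch_by_correlation_id` are all hypothetical, not taken from any real tool or system.

```python
from collections import defaultdict


def stitch_by_correlation_id(*sources):
    """Merge log records from several services into one timeline per correlation ID."""
    timelines = defaultdict(list)
    for records in sources:
        for record in records:
            cid = record.get("correlation_id")
            if cid is None:
                # Records without an ID cannot be stitched automatically;
                # this is the "missing context" case.
                continue
            timelines[cid].append(record)
    # Order each timeline by timestamp so the request can be read end to end.
    for events in timelines.values():
        events.sort(key=lambda r: r["ts"])
    return dict(timelines)


# Hypothetical sample records, one list per service.
api_logs = [
    {"ts": 1.0, "service": "api", "correlation_id": "req-1", "msg": "charge requested"},
]
webhook_logs = [
    {"ts": 2.0, "service": "webhooks", "correlation_id": "req-1", "msg": "payment webhook received"},
]
worker_logs = [
    {"ts": 3.0, "service": "worker", "correlation_id": "req-1", "msg": "webhook handler failed"},
    {"ts": 2.5, "service": "worker", "msg": "heartbeat"},  # no correlation ID
]

timeline = stitch_by_correlation_id(api_logs, worker_logs, webhook_logs)["req-1"]
for event in timeline:
    print(event["ts"], event["service"], event["msg"])
```

The sketch only works when the ID survives every hop, including queues and third-party webhooks. Records without it are silently dropped here, which I suspect is where the manual work starts in real systems.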
