Ask HN: Debugging failure in large interconnected back end systems

I’m trying to understand how teams actually debug production issues in systems made up of multiple services and external integrations (e.g. Stripe, Twilio, internal microservices, queues, webhooks, etc.).

In practice, when something breaks, it seems like the workflow is usually:

- an alert fires (Datadog/Sentry/CloudWatch/etc.) or a customer complains
- engineers then start checking logs, traces, and dashboards across multiple systems
- and eventually manually reconstruct what happened across services

What I’m curious about:

- How do you actually trace a single failed request or transaction across multiple services today?
- What tools do you rely on most in practice (not in theory)?
- Where does it usually break down: logs, tracing, instrumentation, or just missing context?
- How long does it typically take to go from “something is wrong” → “we know exactly why it broke”?
- What part of this is still mostly manual stitching together of information?

Trying to understand what the real pain points are in practice, especially in systems with lots of external integrations and async flows.
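To make "manual stitching" concrete, here is a rough sketch of what I imagine people end up doing by hand: pulling log lines from several services, grouping them by a shared request ID, and sorting by timestamp. Everything in it is an assumption for illustration (the `request_id` field, the service names, the sample records); real systems may not have a shared ID at all, which is part of what I’m asking about.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log records, as if exported from several services.
# Field names ("service", "ts", "request_id", "msg") are assumptions.
LOGS = [
    {"service": "api", "ts": "2024-01-01T12:00:00.100", "request_id": "req-1", "msg": "POST /checkout received"},
    {"service": "payments", "ts": "2024-01-01T12:00:00.450", "request_id": "req-1", "msg": "charge failed: card_declined"},
    {"service": "queue-worker", "ts": "2024-01-01T12:00:00.300", "request_id": "req-1", "msg": "job enqueued"},
    {"service": "api", "ts": "2024-01-01T12:00:01.000", "request_id": "req-2", "msg": "GET /health"},
    {"service": "webhooks", "ts": "2024-01-01T12:00:02.000", "request_id": None, "msg": "provider webhook received"},
]


def stitch(logs):
    """Group records by request_id and order each group by timestamp.

    Records without a request_id are returned separately as orphans,
    since they cannot be attached to any request automatically.
    """
    timelines = defaultdict(list)
    orphans = []
    for rec in logs:
        if rec.get("request_id"):
            timelines[rec["request_id"]].append(rec)
        else:
            orphans.append(rec)
    for events in timelines.values():
        events.sort(key=lambda r: datetime.fromisoformat(r["ts"]))
    return dict(timelines), orphans


if __name__ == "__main__":
    timelines, orphans = stitch(LOGS)
    for rec in timelines["req-1"]:
        print(rec["ts"], rec["service"], rec["msg"])
    print("records with no request_id:", len(orphans))
```

The orphan bucket is the part I suspect hurts most: webhook and queue events that arrive without any ID tying them back to the original request.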