Temporal Primer - Building Long-Running Systems
Arpit Bhayani
engineering, databases, and systems. always building.
If you have ever taped together a cron job, a message queue, a database table for state, and a retry loop - only to watch the whole thing break during a network blip at 2am - you already understand the problem Temporal solves. The fix you built was a workflow engine. Temporal is a workflow engine done right.
Temporal is an open-source durable execution platform. The idea is simple: your code runs to completion no matter what happens to the underlying infrastructure. Processes crash, networks partition, VMs get killed during deployments, and none of it ends your workflow. It resumes exactly where it left off, with the exact state it had before.
This write-up is a primer on the core concepts and features of Temporal. It covers how the system works, what its major building blocks are, and where the non-obvious traps live. The goal is that after reading this, you understand the mental model well enough to evaluate whether Temporal belongs in your architecture and know what to reach for when it does.
Fun fact: Temporal can come in very handy when building long-running agentic systems.
The Problem With Distributed Workflows
Consider a multi-step process: charge a payment, provision a resource, send a confirmation email, update a billing record. In a naive implementation, you chain these calls in sequence. The first call succeeds. The third call throws a timeout. Now what?
You either retry the whole thing from the start, risking a double-charge, or you track which steps succeeded and build a resumption mechanism. That mechanism needs its own storage, its own retry logic, its own failure model. Now, you build a state machine. Then you realize it needs to handle concurrent runs. And timeouts at each step. And human-approval pauses. And you need to be able to cancel mid-flight. And you need observability into which step each run is on.
You have just reinvented what Temporal gives you for free.
The patterns we engineers reach for without a platform - status columns in databases, polling loops, dead-letter queues, hand-rolled sagas - all exist to approximate what durable execution provides natively. Temporal collapses this complexity into a programming model where you write code that looks linear, and the platform handles all the failure recovery underneath.
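To make the cost of the hand-rolled approach concrete, here is a minimal sketch of the "status column in a database" pattern, using SQLite from the standard library. The table name `runs`, the `next_step` column, and the functions `init_db` and `run_order` are all hypothetical names chosen for illustration; this is not how Temporal stores state, only the kind of code you end up writing without it.

```python
import sqlite3

# Ordered list of steps for one order; names mirror the example process above.
STEPS = ["charge_payment", "provision_resource", "send_confirmation", "update_billing"]


def init_db(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) progress table: one row per order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "order_id TEXT PRIMARY KEY, next_step INTEGER NOT NULL)"
    )
    conn.commit()


def run_order(conn: sqlite3.Connection, order_id: str, handlers: dict) -> None:
    """Run the remaining steps for an order, persisting progress after each one.

    If a handler raises, the exception propagates and the row still points at
    the failed step, so calling run_order again resumes from there.
    """
    conn.execute(
        "INSERT OR IGNORE INTO runs (order_id, next_step) VALUES (?, 0)",
        (order_id,),
    )
    conn.commit()
    (next_step,) = conn.execute(
        "SELECT next_step FROM runs WHERE order_id = ?", (order_id,)
    ).fetchone()

    for index in range(next_step, len(STEPS)):
        handlers[STEPS[index]](order_id)  # may raise (timeout, network error, ...)
        # Known gap: a crash after the handler succeeds but before this commit
        # means the step runs again on resume.
        conn.execute(
            "UPDATE runs SET next_step = ? WHERE order_id = ?",
            (index + 1, order_id),
        )
        conn.commit()
```

Even this small version has no per-step timeouts, no backoff, no cancellation, no protection against two processes running the same order, and no way to pause for a human approval. Each of those is another round of bespoke code.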
Workflows, Activities, and Workers
Everything in Temporal revolves around three concepts.
A Workflow is a function that defines your business logic. It orchestrates the overall process. It is the “what should happen and in what order” written as code in your language of choice: Go, Python, TypeScript, Java, C#, or PHP. Crucially, Workflow functions must be deterministic. More on that constraint shortly, because it is the most important thing to internalize.
An Activity is a function that does actual work in the world. It calls external APIs, writes to databases, sends emails, and invokes ML models. Activities are the “do a thing” units. They are explicitly not deterministic - they interact with systems that can fail, return different results on different calls, and take unpredictable amounts of time. Temporal handles retrying Activities automatically when they fail.
A Worker is a process you deploy that polls Temporal’s task queue and executes your Workflow and Activity code. Workers are stateless. They can crash and be replaced. Temporal’s server coordinates which worker picks up which task.
Here is a minimal example in Python to make this concrete:
    from datetime import timedelta

    from temporalio import activity, workflow

    # payments_client, provision_resource, and send_confirmation are assumed
    # to be defined elsewhere; they are not shown in this example.


    @activity.defn
    async def charge_payment(order_id: str, amount: int) -> str:
        # Calls an external payments API; may fail, may be slow
        return await payments_client.charge(order_id, amount)


    @workflow.defn
    class OrderFulfillmentWorkflow:
        @workflow.run
        async def run(self, order_id: str, amount: int) -> None:
            # Each activity call is automatically retried on failure
            payment_id = await workflow.execute_activity(
                charge_payment,
                args=[order_id, amount],  # multiple arguments go through args=
                start_to_close_timeout=timedelta(seconds=30),
            )
            await workflow.execute_activity(
                provision_resource,
                order_id,
                start_to_close_timeout=timedelta(minutes=5),
            )
            await workflow.execute_activity(
                send_confirmation,
                order_id,
                start_to_close_timeout=timedelta(seconds=10),
            )

This looks like ordinary async code. The extraordinary part is that if your worker process crashes between the charge_payment and provision_resource calls, a new worker picks up the workflow and resumes from exactly the right point. The payment is not re-charged. The workflow does not restart from scratch. This just works, without any extra code on your part.
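One way to picture why the payment is not re-charged is a toy replay model in plain Python. This is a sketch under heavy simplification, not Temporal's implementation or API: `ReplayContext` and `order_workflow` are hypothetical names, and the history here records only completed activity results. The idea it illustrates is that results are appended to a persisted history, and when the workflow function is re-executed, calls that already have a recorded result return it instead of running the activity again.

```python
from typing import Any, Callable


class ReplayContext:
    """Toy stand-in for a durable-execution runtime (hypothetical, illustration only)."""

    def __init__(self, history: list):
        self.history = history  # append-only list that outlives any one worker
        self._cursor = 0        # how far this execution has advanced through history

    def execute_activity(self, name: str, fn: Callable[..., Any], *args: Any) -> Any:
        if self._cursor < len(self.history):
            # Replay path: this step already completed in an earlier execution.
            event = self.history[self._cursor]
            if event["activity"] != name:
                # The code asked for a different step than the one recorded,
                # which is what a non-deterministic workflow would look like.
                raise RuntimeError(
                    f"history has {event['activity']!r}, code asked for {name!r}"
                )
            self._cursor += 1
            return event["result"]  # fn is NOT called again

        # Live path: run the activity and record its result before moving on.
        result = fn(*args)
        self.history.append({"activity": name, "result": result})
        self._cursor += 1
        return result


def order_workflow(ctx: ReplayContext, order_id: str, activities: dict) -> str:
    """Linear-looking workflow; every side effect goes through ctx."""
    payment_id = ctx.execute_activity(
        "charge_payment", activities["charge_payment"], order_id
    )
    ctx.execute_activity(
        "provision_resource", activities["provision_resource"], order_id
    )
    ctx.execute_activity(
        "send_confirmation", activities["send_confirmation"], order_id
    )
    return payment_id
```

If the first execution dies inside `provision_resource`, the history holds one entry for `charge_payment`. Running `order_workflow` again with a fresh `ReplayContext` over the same history returns the recorded payment ID without calling the charge function, then continues live from the second step. The model also hints at why workflow code has to be deterministic: replay only works if the code asks for the same steps in the same order.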
How Durability Actually Works
Every significant event in a workflow’s life - an activity was scheduled, an activity completed, a timer fired, a signal arrived - is recorded as an immutable entry in an event history, persisted in Temporal’s database (pluggable: Cassandra, PostgreSQL, or MySQL). This event...