Testing Azure retry logic locally: why I stopped mocking 429s and started injecting them | Topaz
Skip to main content<br>There is a certain category of test that feels good to write but does not actually test what you think it does. Retry logic sits squarely in that category.
The usual pattern is this: inject a fake HttpMessageHandler, make it return a 429 or 503 on the first N calls, assert that the code retried and eventually succeeded. The test passes. You ship with confidence. Then, in production, a real throttling event triggers a path through the Azure SDK that your mock never covered, and the retry policy does not behave the way the test implied.
The issue is not that the mock is wrong. It is that the mock bypasses the entire SDK transport layer. When you return a 429 from a fake handler, you are testing whether your own retry wrapper handles it correctly. You are not testing whether Azure.Core's built-in retry pipeline fires, whether the Retry-After header is respected, or whether the SDK's own exception hierarchy propagates through your application code the way you assumed. That is a different bar entirely.
Coming in v1.8<br>Fault injection is available in the nightly build today and will ship as a stable feature in Topaz v1.8. All commands below work against a nightly topaz-host instance.docker pull thecloudtheory/topaz-host:nightly
Chaos engineering docs → · Star on GitHub →
Where the real SDK retry pipeline lives
The Azure SDK — whether you are using .NET, Python, Java, or JavaScript — runs every outgoing request through a pipeline of policies. RetryPolicy sits in that pipeline. When a response comes back with a 429, RetryPolicy checks the Retry-After header, waits the specified duration, and retries. It does this transparently, below the level of the code that called GetSecretAsync or get_secret.
For that pipeline to actually exercise your retry logic, the 429 has to arrive through it. A fake HttpMessageHandler in .NET, a patched httpx transport in Python, or a stubbed HttpClient in JavaScript intercepts the request before it reaches the pipeline's transport step. Some policies still run. Others do not, depending on exactly where you injected the handler. The end result is that your retry test may be exercising a different code path than the one that runs in production.
What you actually want is something that lets the full SDK stack run, including pipeline initialization, token acquisition, and the retry machinery, and then injects the fault at the protocol boundary, after all of that setup, but before the real endpoint handler responds.
How Topaz injects faults
The fault injection engine in Topaz sits inside the request router, in this position:
Request
→ Authentication check
→ Provider registration check
→ Chaos fault roll ← injected here
→ Endpoint handler
→ Response
By the time a fault fires, the SDK has already acquired a token, serialized the request, and gone through its full pipeline. The fault response comes back through the same transport path as a real Azure response. If your SDK is configured to respect Retry-After on 429s, it will find a Retry-After: 5 header in the response and behave accordingly. If your retry wrapper catches RequestFailedException, it will be thrown the same way it would be thrown against real Azure.
There are two controls. A global on/off switch, which I called chaos mode because it has to be explicit, and individual fault rules that define what to inject, at what rate, and against which service namespace. Nothing fires unless chaos mode is enabled, so you cannot accidentally leave a throttle rule active and wonder why your tests are slow the next morning.
Creating a fault rule
The topaz CLI manages everything. To verify that your Key Vault retry logic actually works:
topaz chaos enable
topaz chaos rule create \
--rule-id kv-throttle \
--namespace Microsoft.KeyVault \
--fault-type Throttle \
--rate 0.5
With this rule active, roughly half of all Key Vault requests will receive a 429 Too Many Requests with a Retry-After: 5 header. The other half go through normally. That is intentional. A --rate 1.0 rule that throttles every request is useful for verifying that your retry policy eventually gives up correctly, but it is not a very interesting test. A --rate 0.5 rule means some requests succeed without any retry, some succeed after one retry, and occasionally the SDK exhausts its retry budget on a bad run. That mirrors how throttling actually behaves in a loaded Azure environment.
The four fault types cover the failure modes that Azure SDKs are expected to handle:
Fault typeWhat the SDK seesTransientError500 Internal Server Error, standard Azure error bodyThrottle429 Too Many Requests with Retry-After: 5Timeout408 Request Timeout, delayed 30 secondsServiceUnavailable503 Service Unavailable, delayed 60 seconds<br>The Timeout and ServiceUnavailable faults are the ones that expose a different class of bugs. They are not retry bugs. They are...