I Filed a GitHub Issue Against Google's GenKit AI Framework
Sign in
Subscribe
Update: A more formal propagation report for this case is now available here:<br>When Retry-After Gets Lost: A Case Study in AI Execution Semantic Degradation<br>Three weeks ago I filed an issue against genkit-ai/genkit.<br>Not a bug report. A finding. There’s a difference.<br>The finding was this: the retry middleware in Genkit includes RESOURCE_EXHAUSTED in its default retry statuses. When Anthropic returns HTTP 429 with Retry-After: 60, the middleware fires at 1000ms intervals — inside the cooldown window. All retries fail until the window expires naturally.<br>I’d seen this pattern before. It’s in the corpus.<br>What I didn’t expect was what happened next.<br>The architecture debate<br>Michael Doyle asked for a real-world example. I gave him one: Anthropic’s actual 429 response headers. He escalated to Pavel Gajdošík, a core Genkit maintainer.<br>Pavel didn’t just acknowledge the issue. He opened an architecture discussion in the thread.<br>Three options for how to propagate the Retry-After signal through GenkitError. Option 1: a dedicated retryAfterMs field. Option 2: convention in the existing detail field. Option 3: a new responseMetadata bag.<br>The thread stayed open. Two days later he linked PR #5343.<br>358 lines across 11 files. Four provider integrations: Anthropic, OpenAI, Google GenAI, Vertex AI.<br>It merged yesterday.<br>What the fix actually was<br>This is the part that matters.<br>The fix wasn’t “check the Retry-After header.” Every engineer reading the issue knew that.<br>The fix was adding a transport mechanism:<br>responseMetadata.retryAfterMs<br>A dedicated field on GenkitError that carries retry timing semantics across the framework boundary into the retry middleware.<br>The provider emitted the signal correctly. The HTTP layer delivered it correctly. The framework classified the condition correctly as RESOURCE_EXHAUSTED.<br>And yet the system still failed — because retry timing semantics didn’t survive the boundary crossing. The middleware had no channel to receive them.<br>The signal existed. The classification existed. The recovery path existed. What was missing was the propagation channel between them.<br>That’s not a retry bug. That’s a signal integrity failure.<br>Why this compounds as AI systems scale<br>A single-layer system fails loudly. You see the error. You fix the call.<br>A layered system fails silently. Provider SDKs abstract HTTP responses. Middleware frameworks abstract SDK errors. Runtimes abstract middleware state. Orchestration layers abstract runtime behavior. By the time a failure surfaces at the tool or UI boundary, the signal that would explain it — and govern the correct response — may have degraded through four or five transformations.<br>Genkit is a clean example because the layers are visible and the fix is public. But the same structural pathology appears across the corpus:<br>Retry-After parsed correctly, lost at the serialization boundary<br>Quota signal classified correctly, enforced at the wrong scope<br>STOP condition reached internally, not propagated to the parent session<br>Rate limit signal present in the runtime, non-actionable at the CLI surface<br>The layer that needs the signal doesn’t have it. Or has it in a form it can’t use.<br>Every one of those is a signal integrity failure. The failure pattern is the same. Only the boundary changes.<br>What the fix validates<br>Pavel’s PR didn’t add more retries. It added a structured field that carries cooldown semantics across a framework boundary. Then the retry logic could do its job.<br>That distinction matters. The fix itself validates the thesis:<br>Retryability classification and retry timing semantics must propagate together.<br>That’s not a tip. It’s a systems invariant. And it applies far beyond Genkit.<br>The receipt<br>The canonical receipt for this finding is now in pitstop-truth:<br>PT-2026-05-20-github-genkit-ai-genkit-5270-hidden<br>signal_failure_type: hidden<br>Hidden — not ignored. The signal wasn’t present at the decision layer until the fix added the propagation channel. That distinction produces different fixes.<br>The deeper lesson from Genkit is that:<br>modern AI systems increasingly fail not because signals are absent, but because execution-critical semantics degrade as they cross layers.<br>The systems that survive will be the ones that preserve signal integrity end-to-end.<br>Brent Williams builds Pitstop — execution reliability infrastructure for AI/API pipelines.<br>64 receipts. 35+ production systems.
github.com/SirBrenton/pitstop-truth<br>pitstop.dev<br>MACHINE READABLE SUMMARY<br>```json<br>"post": "I Filed a GitHub Issue Against Google's AI Framework. Here's What Happened.",<br>"published": "2026-05-21",<br>"thesis": "AI execution systems fail when decision-relevant signals lose integrity across layers — not because retry logic is missing",<br>"primary_finding": {<br>"repo": "genkit-ai/genkit",<br>"issue": 5270,<br>"pr": 5343,<br>"pr_lines": 358,<br>"providers_fixed": ["Anthropic", "OpenAI", "Google GenAI", "Vertex AI"],<br>"signal_failure_type":...