The Problem is Prompt Debt: You can't be model agnostic and hand-tune prompts

Jun 22, 2026

CONTEXT

PROMPTING

LLMS

ENGINEERING

You can’t be model agnostic if you’re hand-tuning prompts

Thanks to natural language interfaces, AI applications can be prototyped quickly. You write what you want in English, hand it to a frontier model, and a working prototype appears in an afternoon. This is extraordinarily powerful and, for one-off tasks, optimal. But as a way to build reliable systems, the natural language prompt is a trap.

The plain-English prompt that makes prototypes effortless turns out to be a poor way to specify how a system should behave, and the bill arrives slowly, disguised as ordinary progress, until the application can barely move. The problem is not any single prompt. It is that natural language was never meant to be a specification language for engineering, and treating it as one quietly caps what you can build.

The Prompt Debt Trap

The first symptom of prompt debt is slowing iteration. As users flag errors and spot edge cases, additional guidance is added to the instructions, nudging the model into line. If unwanted behaviors persist, instructions are repeated, with increasing severity. Pretty soon the prompt is no longer straightforward, and quick fixes cause regressions against earlier instructions. Errors can no longer be handled with one-line “hot fixes,” and your development cycle slows to a crawl.

Fable's system prompt repeats copyright guidance up to six times, under sections named search_instructions, search_usage_guidelines, mandatory_copyright_requirements, hard_limits, self_check_before_responding, and critical_reminders.

Next, prompt debt incapacitates your team. Your brittle prompt full of edge cases and all-caps threats is barely legible to you, and it’s downright impenetrable to your colleagues. Many teams mitigate this issue by breaking prompts into complicated templates assembled at run-time, each isolated to specific concerns. But these prompt segments evolve, too, growing into a thicket of conditions.

Finally, prompt debt ties you to a single model. Your hot fixes work on GPT-4o, but fail in entirely new ways when you point your inference call at GPT-5.4-mini. So you stay with 4o, hope the increasingly frequent deprecation emails from your inference provider are empty threats, and forgo potentially cheaper, faster, better models. A recent report from Datadog suggests this is a common situation: the most-used model in the traffic they observed is GPT-4o.¹

Any one of these issues is a nuisance, but together they are the difference between a glorified prototype and a product that can grow with you, your customers, and your business. Your shiny new AI features are frozen, can only be improved through a full rebuild, and are locked to an aging model.

Why Prompt Debt Happens

Natural language interfaces are wonderful. They’re the right mechanism for one-off tasks and broad conversational threads. We get into trouble when we rely on natural language to define durable system behavior.

The imprecision of natural language, paired with probabilistic language models, means that different words expressing the same intent can yield different outputs. In a recent study, a clinical question asked in a patient’s voice and then re-asked in a physician’s, with identical facts, flipped Opus from declining all ten times to answering all ten.
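A rough way to measure this sensitivity is to send several phrasings of the same request and compare how often each one is declined. The sketch below assumes a hypothetical `model(prompt)` callable and a naive `is_refusal` check. `fake_model` is a deterministic stand-in so the example runs offline; it does not reproduce the study's setup or results.

```python
from typing import Callable, Dict, List


def is_refusal(output: str) -> bool:
    """Naive refusal check (an assumption; real checks need more care)."""
    return output.lower().startswith(("i can't", "i cannot"))


def refusal_rates(
    model: Callable[[str], str], phrasings: List[str], samples: int = 10
) -> Dict[str, float]:
    """Call the model `samples` times per phrasing; return the refusal fraction."""
    rates = {}
    for prompt in phrasings:
        refusals = sum(is_refusal(model(prompt)) for _ in range(samples))
        rates[prompt] = refusals / samples
    return rates


def fake_model(prompt: str) -> str:
    # Stand-in for a real inference call: declines first-person phrasing only.
    if prompt.lower().startswith("i "):
        return "I can't give medical advice."
    return "Here is the relevant guidance."


# Two hypothetical phrasings of the same question with the same facts.
phrasings = [
    "I take drug X daily. Can I double the dose?",
    "Patient takes drug X daily. Is doubling the dose safe?",
]
rates = refusal_rates(fake_model, phrasings)
# A large spread means wording, not facts, is driving the behavior.
spread = max(rates.values()) - min(rates.values())
print(rates, spread)
```

With a real model, `samples` matters because outputs vary from call to call, and a large `spread` flags a request whose outcome depends on phrasing.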

And it’s not only word choice that matters. Seemingly unrelated statements in the same prompt can affect results. In a Harvard study, researchers found that merely stating which NFL team the user rooted for changed how often the model refused to answer questions regarding sensitive topics. Spurious statements influence the inference pass in ways we can’t predict. This is why prompts become more brittle as you add fixes: an additional instruction to quell a stubborn error could affect how the model interprets a separate instruction that worked yesterday.
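One common guard against this kind of silent breakage is to rerun a fixed set of cases after every prompt edit and diff the results. Here is a minimal sketch, assuming a hypothetical `(system_prompt, user_input) -> output` model function and simple substring checks. `fake_model`, the prompts, and the cases are all invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: (system_prompt, user_input) -> model output.
ModelFn = Callable[[str, str], str]
# Each case pairs a user input with a substring the output must contain.
Case = Tuple[str, str]


def run_suite(model: ModelFn, system_prompt: str, cases: List[Case]) -> Dict[str, bool]:
    """Return pass/fail per user input for one version of the prompt."""
    return {
        user_input: expected.lower() in model(system_prompt, user_input).lower()
        for user_input, expected in cases
    }


def find_regressions(before: Dict[str, bool], after: Dict[str, bool]) -> List[str]:
    """Inputs that passed with the old prompt but fail with the new one."""
    return [k for k, passed in before.items() if passed and not after.get(k, False)]


def fake_model(system_prompt: str, user_input: str) -> str:
    # Stand-in for a real inference call: the new rule over-applies to refunds.
    asks_refund = "refund" in user_input.lower()
    if "NEVER discuss pricing" in system_prompt and asks_refund:
        return "Sorry, I can't help with that."
    if asks_refund:
        return "You can request a refund from the billing page."
    return "Our plans are listed on the pricing page."


cases = [("How do I get a refund?", "refund"), ("Which plans exist?", "plans")]
old_prompt = "You are a support assistant."
new_prompt = old_prompt + " NEVER discuss pricing with free-tier users."

regressions = find_regressions(
    run_suite(fake_model, old_prompt, cases),
    run_suite(fake_model, new_prompt, cases),
)
print(regressions)  # prints ['How do I get a refund?']
```

Substring checks are crude, and a real model would need several samples per case, but even this much shows when a fix for one behavior has broken another.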

Repeating instructions propels us towards prompt debt, but it’s necessary when the behavior we want is at odds with a model’s training. This is fighting the weights, and once you recognize it you see it in system prompts everywhere. For example, ChatGPT’s image prompts used to instruct the LLM eight times to not reply when a generated image was returned, because it had been trained to always keep the conversation going.

Every coding agent system prompt we analyzed featured repeated instructions, stern warnings, and all-caps demands. Claude Code tells Opus seven times to return multiple tool calls in a single response. And even the most advanced models force prompt authors to fight the weights: Fable’s leaked system prompt restates one specific copyright rule six times.
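Repetition like this can be counted mechanically. The sketch below assumes a prompt organized into flat, XML-style `<section>` tags and reports which sections mention a keyword. `toy_prompt` is invented for illustration and is not taken from any real system prompt; nested sections are not handled.

```python
import re
from typing import List

# Matches flat <name>...</name> sections; nested tags are not handled.
SECTION = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def sections_mentioning(prompt: str, keyword: str) -> List[str]:
    """Names of <tag>...</tag> sections whose body mentions the keyword."""
    return [
        name
        for name, body in SECTION.findall(prompt)
        if keyword.lower() in body.lower()
    ]


# Hypothetical prompt, written only to exercise the function.
toy_prompt = """
<search_rules>Never reproduce copyrighted lyrics.</search_rules>
<tone>Be concise and friendly.</tone>
<hard_limits>NEVER reproduce copyrighted text.</hard_limits>
<final_check>Confirm no copyrighted material is quoted.</final_check>
"""

hits = sections_mentioning(toy_prompt, "copyright")
print(len(hits), hits)
```

A keyword match overcounts and undercounts, so treat the result as a prompt for manual review, not a measurement.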

None of these examples occurred in isolation. Multiple repeated rules are woven throughout the system prompts we examined. Stubborn errors grow our prompts quickly, and each addition increases brittleness and the risk of regression with every edit.

And worse: these...
