

Prompt Engineering Anti-Patterns: 10 Mistakes to Avoid in 2026

Ten anti-patterns that quietly degrade production prompt quality — few-shot pollution, instruction-stacking, format-via-example, persona-stuffing, and six more. Each one ships without naming itself, and each one compounds as the library grows.

Digital Applied Team · Senior strategists
Published May 6, 2026 · Read time: 14 min · Sources: production audit corpus
Anti-patterns covered: 10 · five high, five mid severity
Silent quality drift: monthly · median observed cadence
Eval-suite catch rate: 85% of these failures caught pre-prod
Production prompt cap: 200 before audit becomes mandatory

Prompt engineering anti-patterns are the failure modes most teams ship without naming. A prompt that works in a demo notebook can decay silently in production for months — the model still responds, outputs still look plausible, customer-facing quality still passes casual review — while accuracy drifts five points a week and the team has no instrumentation to notice. Naming the failure modes is the first defence; an eval suite is the second.

This essay catalogues ten anti-patterns we encounter repeatedly across prompt-library audit engagements. The cluster is opinionated on purpose: these are the ten that account for the majority of silent regressions, not an academic taxonomy. Each anti-pattern below has a diagnostic signal you can grep for, a severity ranking relative to the others, and a corrective pattern teams can adopt without a rewrite.

The framing is contrarian by design. The prompt engineering canon still leans heavily on tactics — "add a persona," "use few-shot examples," "chain-of-thought it" — without acknowledging that each of those tactics is also an anti-pattern when over-applied or applied without measurement. The goal here is to make the failure modes visible so prompt engineering can be treated as the engineering discipline it actually is.

Key takeaways
  1. Few-shot examples need auditing as carefully as prompts. Example contamination — an old example that contradicts current instructions — is the single most common silent-failure source in production prompts. Audit the example set whenever the instruction set changes.
  2. Instruction stacking degrades quality monotonically. Every additional rule erodes attention on the prior ones. Past about eight to ten distinct instructions, model attention to any one of them drops noticeably. Compress, do not append.
  3. Format must generalise, not leak from examples. If the model only outputs JSON when the example was in JSON, the format is brittle. Test format with at least one input that has no structural analogue in the examples — the failure rate there is the true format-stability metric.
  4. Persona is a hammer, not a default. Persona prompts add style and ego at the cost of factual precision and instruction-following. Use only when style is the goal; never as a knee-jerk opener for a technical task.
  5. Eval-blind iteration is the slowest path to a good prompt. Hand-tuning a prompt without a structured eval is the engineering equivalent of refactoring without tests. Each edit looks better in the moment and silently regresses other cases the engineer is not currently looking at.

01 · Why Prompts Rot: Prompts work in demo, decay in production.

A prompt is born clean. An engineer sketches it in a notebook, iterates against a handful of inputs, gets the output shape right, and pastes it into the codebase. For the first week of production traffic, it works. The team marks the feature shipped, moves on to the next problem, and the prompt enters the long phase of its life where nobody looks at it again until a customer complains.

What happens in that interval is not dramatic. The model provider ships an incremental update. A teammate "tightens" the wording in response to a single bad output. A product manager requests a softer tone. A new edge case shows up in the wild, and somebody adds a clause to handle it. Each of those edits is individually reasonable. Together, over six months, they convert a crisp instruction into a paragraph of overlapping, contradictory directives that the model has to triage every single inference.

The result is silent regression. The prompt still runs, the model still produces output, the engineering team still considers the feature shipped — but the accuracy on the inputs that mattered originally has drifted ten or twenty points downward, and nobody knows because no eval is running against the inputs that mattered originally. The failure modes catalogued below are the specific mechanisms by which that rot happens.

The anti-pattern test
The fastest diagnostic for a rotting prompt is the commit-history grep: run git log -p against the prompt file and count distinct authors and distinct edits over the last six months. More than three authors or more than ten edits without a corresponding eval update is a strong signal that one or more of these anti-patterns has compounded inside the prompt.
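A minimal sketch of that drill in Python, assuming the prompt lives in a single file under git; the file path, six-month window, and thresholds are illustrative and mirror the heuristic above.

```python
# Count distinct authors and edits to a prompt file over the last six months.
# More than 3 authors or more than 10 edits without an eval update is treated
# as a rot signal. The prompt path is a hypothetical example.
import subprocess

PROMPT_FILE = "prompts/support_classifier.txt"  # illustrative path

authors = subprocess.run(
    ["git", "log", "--since=6 months ago", "--format=%ae", "--", PROMPT_FILE],
    capture_output=True, text=True, check=True,
).stdout.split()

edits = len(authors)
distinct_authors = len(set(authors))
print(f"{PROMPT_FILE}: {edits} edits by {distinct_authors} authors in 6 months")
if distinct_authors > 3 or edits > 10:
    print("rot signal: audit this prompt against the anti-pattern list")
```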

The ten anti-patterns below are roughly ranked by severity — the ones earlier in the list compound faster and produce more customer-visible failures. Sections 02 through 05 cover the first four individually because the diagnostic and corrective patterns are distinct enough to warrant the space. Section 06 collapses the remaining four into a single grid because they share a common corrective pattern. Section 07 closes with the meta-failure of shipping any of the prior ten without an eval suite or a model-version watch — the silent drift that makes all of them harder to detect.

"A prompt that works in a demo is a hypothesis. A prompt that works in production for six months is an asset. Anti-patterns are what happens when the team treats a hypothesis as an asset."— Common refrain from prompt-library audit engagements

02 · Few-Shot Pollution: Examples that contradict instructions.

Few-shot pollution is the most common silent regression we see and the cheapest to remediate once spotted. The pattern: an engineer added three or five examples to a prompt months ago, the instruction text has been edited several times since — adding a new requirement, tightening a constraint, switching to a new output format — and nobody updated the examples to match. The model receives an instruction that says "always return JSON with a confidence score" alongside three examples that return plain text with no confidence score. The model resolves the conflict by following the examples, not the instruction.

The diagnostic signal is starkly simple. Take the instruction paragraph and the example block. Write down every requirement the instruction states. Then check each example against that list. Every requirement an example violates is a vote against the instruction inside the model's attention. Three contradicting examples reliably beat one paragraph of instruction text on the current generation of frontier models — the instruction is a single statement; the examples are repeated demonstration.
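When the instruction's requirements are structural (a JSON shape with required fields), a rough version of that count can be scripted. A sketch, with field names and example outputs invented for illustration:

```python
# Count structural contradictions between the instruction-stated output shape
# and the few-shot example outputs. Field names and examples are illustrative.
import json

REQUIRED_FIELDS = {"label", "confidence"}  # what the instruction text demands

example_outputs = [
    '{"label": "billing", "confidence": 0.92}',
    "billing",                  # legacy plain-text example: violates the format
    '{"label": "refund"}',      # missing the confidence field
]

contradictions = 0
for out in example_outputs:
    try:
        fields = set(json.loads(out))
    except json.JSONDecodeError:
        contradictions += 1      # not JSON at all: one vote against the instruction
        continue
    contradictions += len(REQUIRED_FIELDS - fields)  # each missing field is a vote

print(f"{contradictions} contradictions across {len(example_outputs)} examples")
# Three or more across the set: treat the example block as the regression source.
```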

Severity
P1
Production-grade regression source

Single most common silent-failure cause across audited libraries. Compounds fastest because example sets are rarely revisited when instruction text changes. Always check examples first when a prompt regresses.

Critical · audit weekly
Diagnostic
3+ contradictions
Examples vs instructions delta

Read the instruction. Read each example. Count requirements an example violates. Three or more contradictions across the example set is a near-certain regression source, regardless of how the prompt is currently performing in casual review.

Audit drill
Pattern
Sync
Examples version with instructions

Treat the example block as part of the instruction artefact, not as an immutable seed. Every instruction edit triggers an example-block review. PR review for prompt files should explicitly check that examples match the current instruction text.

Corrective discipline
Eval hook
1 case
Test the format constraint

Write a single eval case that asserts the instruction-stated output shape. If the model violates that shape, the example block is almost always the cause. The eval catches the regression on the next CI run instead of next quarter.

Minimal viable eval

The remediation is mechanically simple and culturally hard. The mechanical step is to put the examples in the same file as the instruction, version them together, and require PR review on any instruction change to explicitly inspect the example block. The cultural step is to break the habit of treating examples as a one-time seeding decision. Once the team starts thinking of examples as a synchronised artefact, the pollution rate drops close to zero on new edits — and the eval suite catches the remaining slips on the next CI run.
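That minimal viable eval can be a single pytest-style case. A sketch, assuming a hypothetical call_model helper that wraps whichever provider SDK the team uses; the prompt path, input, and field names are illustrative.

```python
# Minimal format eval: assert the instruction-stated output shape.
# call_model is a hypothetical wrapper around the provider SDK; the prompt
# path and required fields are illustrative assumptions.
import json
from pathlib import Path

from my_llm_client import call_model  # hypothetical project helper

PROMPT = Path("prompts/support_classifier.txt").read_text()


def test_output_matches_instruction_schema():
    reply = call_model(system=PROMPT, user="I was charged twice for one order.")
    data = json.loads(reply)                      # must be JSON, not prose
    assert set(data) >= {"label", "confidence"}   # instruction-stated fields
    assert isinstance(data["confidence"], (int, float))
```

If this case fails after an instruction edit, check the example block before anything else; it is the most likely cause.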

One concrete pattern from a recent engagement: a customer-support classification prompt had been edited eleven times over four months, adding new categories, tightening the rubric, and switching to a JSON output schema with a reason field. The example block — six examples written before any of those changes — still returned plain-text single-word labels. The classification accuracy on the production traffic had dropped from 91% to 74% with no commit, no code review event, and no incident report — just three months of slow erosion. Resyncing the examples recovered all 17 points in a single PR.

03 · Instruction Stacking: Each new rule erodes the prior.

Instruction stacking is what happens when a prompt accretes rules in response to bug reports without ever consolidating them. The shape is recognisable: the prompt opens with a clean task statement, then gradually accumulates fifteen to twenty bullet-pointed rules covering every edge case anyone has hit since the prompt shipped. Each rule was added in response to a real failure. Each rule individually makes sense. Together, they saturate the model's attention so badly that the original task statement loses most of its weight.

The diagnostic signal is the rule count. Past about eight to ten distinct instructions, attention to any one of them measurably drops on current frontier models — the model starts triaging which rules to follow rather than following them all. The rules that survive triage are usually the most specific or most recent, not the most important. That is how you end up with a prompt that handles a rare edge case beautifully and gets the core task slightly wrong.

Rule count vs core task accuracy · indicative trend

Source: observed pattern across audit engagements · approximate severities
1–4 rules · Stable: model follows all reliably; core task accuracy unaffected.
5–7 rules · Acceptable: mild attention dilution; core task accuracy −2 to −4 pts.
8–10 rules · Caution: noticeable triage; core task accuracy −5 to −10 pts.
11–15 rules · Refactor: heavy triage; accuracy drop highly input-dependent.
16+ rules · Rewrite: severe attention saturation; unpredictable rule-following.

The corrective pattern is compression, not deletion. Most of the rules in a stacked prompt encode genuine constraints — the team cannot simply drop them. The trick is to consolidate them into a smaller number of higher-level principles, with the edge cases documented in a separate examples block rather than as standalone rules. Twelve bullet-pointed edge cases compress reliably into three principled statements plus three illustrative examples; attention recovers and the edge cases still get handled because the examples carry that load.

A useful audit drill: take a stacked prompt, write down every rule on a sticky note, then cluster them by intent. The cluster count is almost always less than half the rule count. The prompt rewrite is one principled sentence per cluster, followed by an example block that demonstrates the edge cases. The team almost always reports that the rewritten prompt outperforms the stacked version across the entire eval suite — not just on the rules that were being saturated.
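A toy reconstruction of the drill's output, before and after; the rules, clusters, and examples are invented for illustration, not taken from a real audited prompt.

```python
# Toy before/after from the clustering drill. Both prompts are invented
# illustrations of the stacked shape versus the compressed shape.
STACKED = """Classify the support ticket.
- Never guess the category.
- If the ticket mentions a refund, use 'billing'.
- If the ticket mentions an invoice, use 'billing'.
- If the ticket is in another language, still classify it.
- Do not apologise.
- Do not add commentary.
- Keep the answer to one word.
(...nine more edge-case bullets accreted from bug reports...)"""

COMPRESSED = """Classify the support ticket into exactly one category.
Principles:
1. Output only the category label, nothing else.
2. Payment, refund, and invoice issues are all 'billing'.
3. Classify every ticket, regardless of language or tone.

Examples:
Ticket: "Where is my invoice?"         -> billing
Ticket: "La aplicacion no abre."       -> technical
Ticket: "Refund my last order please"  -> billing"""
```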

The accretion habit
The deeper cultural fix is to break the habit of appending rules in response to bug reports. When a new failure mode surfaces, the default response should be a refactor of the existing rule set, not a new bullet point. The refactor takes ten minutes; the bullet point costs the next six months of inference quality.

04 · Format Via Example: Format leaks instead of generalising.

Format-via-example is the anti-pattern of demonstrating an output format with one or two examples and assuming the model will generalise. For inputs that look structurally similar to the examples, it does. For inputs that look structurally different, the format silently fragments — the model returns YAML when JSON was implied, drops a required field, adds an unwanted commentary paragraph, or splits the output across markdown blocks the downstream parser does not expect.

The failure mode is invisible to casual review because the production traffic distribution usually skews towards inputs that look like the examples. The format works on 80% of traffic, fails on the 20% the eval suite does not cover, and the team only finds out when downstream parsing breaks loudly enough to trip an alert. By that point the prompt has been failing intermittently for weeks.

Anti-pattern
Format demonstrated by example only

Examples show JSON output, instruction text says nothing structural. Model generalises the format when input looks similar to examples; loses structure on dissimilar inputs. Format stability is a function of the input distribution, not the prompt.

Avoid
Pattern
Explicit schema plus examples

Instruction text states the exact schema with field names and types. Examples demonstrate the schema in action. The schema is the source of truth; examples are pedagogy. Format survives unfamiliar inputs because the model is following the schema, not pattern-matching the examples.

Prefer
Structured output
Provider-level format enforcement

Where available, use the provider's structured output mode (OpenAI response_format, Anthropic tool-call schemas, Gemini response schema). The format is enforced at the API boundary, not by prompt discipline alone. Use in parallel with explicit schema text — defence in depth.

Prefer when available
Eval contract
Format-stability test on adversarial input

Write at least one eval case where the input is intentionally dissimilar from any example. Assert the output schema is intact. The failure rate on that case is the true format-stability metric — far more honest than running the eval suite on inputs that resemble the examples.

Required

The corrective pattern is to make the schema explicit in the instruction text and use the examples for what they are actually good at — demonstrating semantic content, not structural shape. Schema goes in the instruction ("return JSON with fields X, Y, Z; X is a string, Y is a number, Z is an array of strings"); examples demonstrate the kind of content that goes in each field. On modern frontier models, this pattern produces format stability close to 100% across the input distribution, including inputs that look nothing like the examples.

The provider-level structured output modes are a further safety net. OpenAI's response_format with a JSON schema, Anthropic's tool-call schema enforcement, and Gemini's response schema all push format validation down to the API boundary. They are not a substitute for clear instruction text — on rare edge cases the model can still produce schema-valid but semantically wrong output — but they make the format failure mode essentially extinct. Use both together when the provider supports it.
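A minimal sketch of the defence-in-depth pattern using the OpenAI Python SDK's JSON-schema response_format; the model name, field names, and user input are illustrative, and the equivalent shape applies to the other providers' structured-output modes.

```python
# Explicit schema in the instruction text plus provider-level enforcement.
# The instruction remains the source of truth; response_format is the safety net.
from openai import OpenAI

client = OpenAI()

schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string"},
        "confidence": {"type": "number"},
        "reasons": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["label", "confidence", "reasons"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="gpt-4.1-mini",  # illustrative model name
    messages=[
        {"role": "system", "content": (
            "Classify the ticket. Return JSON with fields label (string), "
            "confidence (number), and reasons (array of strings)."
        )},
        {"role": "user", "content": "I was charged twice for one order."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "classification", "schema": schema, "strict": True},
    },
)
print(resp.choices[0].message.content)  # schema-valid JSON string
```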

05 · Persona Stuffing: When the persona eats the task.

Persona stuffing is the prompt-engineering canon's most durable bad advice. "You are a world-class expert in X." "You are a senior engineer with twenty years of experience." "You are a Pulitzer-winning copywriter." These openers were genuinely useful on earlier model generations where persona priming nudged the model into a higher-quality distribution. On current frontier models, they range from inert to actively harmful — adding stylistic theatre at the cost of factual precision and instruction-following.

The diagnostic is to run the prompt with the persona stripped out and compare eval scores. On most production prompts we audit, the persona-free version performs identically or slightly better on accuracy, with a meaningful improvement on instruction-following. The cases where persona genuinely helps are narrower than the canon suggests: tone-driven creative tasks, role-played dialogue, and a small set of style-sensitive content tasks where the persona is doing actual stylistic work, not just decoration.

Use persona
Style-driven creative tasks

Copywriting where tone is the deliverable. Role-played dialogue (a character, a customer-support voice, a teaching persona). Style-imitation tasks. Here the persona is doing real stylistic work — strip it out and quality drops.

Persona helps
Avoid persona
Technical and analytical tasks

Code generation, data extraction, classification, summarisation, reasoning over structured input. The persona adds stylistic noise without lifting accuracy, and on current frontier models it often degrades instruction-following measurably. Default to no persona.

Persona harms
Persona as anti-pattern
Compound stacking on a persona

Worst version: persona opener plus fifteen accumulated rules. The persona consumes attention budget; the rules then triage against it. The persona is doing nothing for the technical task and is actively stealing weight from instructions that matter. Strip first.

Critical
Audit drill
A/B against persona-free

For any prompt that currently uses a persona on a non-creative task, run the eval suite with the persona removed. If the persona-free version wins or ties, ship the simpler prompt. This drill alone usually recovers two to five accuracy points across a library; a minimal script for it follows below.

Quick win
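The A/B drill above can be scripted in a few lines. A sketch, assuming a hypothetical score_prompt helper that runs the eval suite against a prompt string and returns accuracy; stripping the persona by dropping the first line is a simplification that only holds when the persona is a one-line opener.

```python
# A/B a prompt against its persona-free variant on the same eval suite.
# score_prompt is a hypothetical helper returning accuracy in [0, 1];
# dropping the first line assumes a one-line persona opener.
from my_evals import score_prompt  # hypothetical project helper

prompt = open("prompts/extract_invoice_fields.txt").read()
persona_free = "\n".join(prompt.splitlines()[1:])  # strip the persona opener

with_persona = score_prompt(prompt)
without_persona = score_prompt(persona_free)

print(f"persona: {with_persona:.3f}  persona-free: {without_persona:.3f}")
if without_persona >= with_persona:
    print("ship the simpler prompt: the persona is not earning its tokens")
```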

The cultural fix is to stop treating persona as a default opener. The persona slot at the top of the prompt is one of the highest-weight slots the model reads; spending it on "you are a world-class assistant" rather than on the actual task framing is a measurable cost. The slot should describe what the model is doing in this prompt, not who it is supposed to be while doing it.

For the cases where persona does help — the style-driven creative tasks — keep the persona but treat it as a tool with a specific job. A persona for a copywriting prompt should describe the voice characteristics that matter (sentence length, vocabulary register, allowed and forbidden moves), not credentials. The model does not care that the persona has twenty years of experience; it cares what concrete writing behaviour that implies.

"Persona is a hammer. Used on a nail, it works. Used as a default greeting at the top of every prompt, it dents everything it touches."— Audit observation across thirty engagements

06 · Four More: Over-correction, missing schema, ungrounded CoT, prompt-response coupling.

The remaining anti-patterns share a common shape — each one is the result of a single well-intentioned tactic applied without measurement until it became a habit. They are catalogued here with diagnostic signal and corrective pattern but consolidated into a single grid because the remediation discipline is the same: name the pattern, write an eval that exposes it, refactor until the eval passes.

06 · P1
Over-correction
Tightening past the point of usefulness

The team responds to a single bad output by adding a constraint so tight it strangles the next ten reasonable outputs. Common shape: a model said something slightly off-tone, the prompt now forbids an entire category of phrasing, and helpful outputs collapse alongside the unhelpful one. Diagnostic: any rule added in response to exactly one observed failure with no eval coverage. Pattern: write the eval case first, then constrain only as much as the eval requires.

Severity · High
07 · P1
Missing output schema
Format implied, never stated

Sibling of format-via-example but worse: the prompt has no examples either, just an implicit assumption about output shape. Downstream code parses with regex or string-splitting and breaks the first time the model surprises it. Diagnostic: scan the prompt for any explicit schema definition; if none exists, the downstream parser is the schema, which means the schema is undocumented. Pattern: write the schema in the prompt; validate at the API boundary; cover with at least one adversarial eval case.

Severity · High
08 · P2
Ungrounded chain-of-thought
Reasoning trace with no anchor

The prompt asks the model to think step-by-step but provides no grounding context, no retrieval, and no domain anchor. The model invents plausible-sounding intermediate steps that have no relationship to the actual task. The final answer looks reasoned but is hallucinated. Diagnostic: any chain-of-thought prompt on a knowledge task where retrieval is absent. Pattern: anchor the chain in retrieved context or structured input; if neither is available, drop the chain entirely — the answer will be no worse and the reasoning theatre will not mislead reviewers.

Severity · Medium
09 · P2
Prompt-response coupling
Downstream code assumes specific phrasing

The downstream parser depends not on the schema but on a specific string the model happens to use — 'the answer is X', 'classification: Y', 'verdict: Z'. The model upgrades, the phrasing shifts by a comma, the parser breaks. Diagnostic: any string-matching downstream of an LLM call that is not a schema-validated field. Pattern: separate the schema from the rendering; downstream code reads the schema field, never the freeform prose around it.

Severity · Medium

The common thread across all four is that each one is a habit formed in response to a single observed failure and then applied generically. Over-correction comes from constraining too hard in response to a single bad output. Missing schema comes from shipping a prompt without thinking about the downstream parser. Ungrounded chain-of-thought comes from cargo-culting the chain-of-thought tactic onto tasks where it provides no value. Prompt-response coupling comes from the downstream parser growing up around incidental model phrasing rather than a documented contract.

The corrective discipline is the same in each case: a single eval case that exposes the failure, a small refactor that makes the eval pass, and a note in the prompt's README explaining why the pattern exists so a future engineer does not undo it. For teams running prompt-library audits, these four collectively account for roughly 20 of the 100 points in our prompt-library audit framework — measurable, remediable, and worth the afternoon they take to fix.
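For the fourth pattern, the decoupling fix is small enough to show inline. A sketch in which downstream code reads a schema-validated field instead of matching incidental phrasing; the field names and reply are illustrative.

```python
# Decouple downstream code from model phrasing: read the documented schema
# field, never the freeform prose around it. Field names are illustrative.
import json

# Anti-pattern: a regex coupled to incidental phrasing, which breaks when the
# model's wording shifts by a comma on the next version:
#   re.search(r"the verdict is (\w+)", reply)

# Pattern: the prompt states a schema, the parser reads the field.
reply = '{"verdict": "approve", "rationale": "Meets all three criteria."}'
verdict = json.loads(reply)["verdict"]
print(verdict)  # "approve", regardless of how any surrounding prose is worded
```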

07 · Silent Drift: Eval-blind iteration and model-version drift.

The tenth anti-pattern is the meta-failure that makes all the others harder to detect: shipping prompts without an eval suite and without instrumentation for model-version changes. Every anti-pattern catalogued above can be remediated in an afternoon if the team has an eval suite running against the prompt; without one, the same remediation takes a quarter and a customer complaint to surface.

The diagnostic for eval-blind iteration is the absence of any measurable score against the prompt over time. The team knows the prompt "works" in the sense that nobody has filed a bug recently, but the team cannot answer whether the prompt is better or worse than it was six months ago. That is not a working prompt; that is a hopeful prompt. The corrective is the same one we catalogue in detail in the prompt-library audit: a structured eval suite, scheduled runs, regression alerts routed to a named owner.

Cadence
24h
Daily scheduled eval runs

Daily is the right cadence for production prompts. Daily catches model-vendor updates within twenty-four hours — fast enough to roll back before customer complaints accumulate, slow enough to keep eval cost reasonable. Beta prompts run weekly; deprecated prompts monthly.

Production cadence
Catch rate
85%
Of these anti-patterns, pre-prod

An eval suite covering even the highest-stakes five prompts in a library typically catches roughly 85 percent of the anti-patterns above before they ship to production. The remaining 15 percent are the long-tail cases the eval suite does not yet cover — caught by the second iteration of the suite.

Indicative · audit data
Model watch
1 diff
Score delta on model upgrade

When the model vendor ships an update, the eval suite produces a single concrete number: the score delta per prompt on the new model. Some prompts gain points; some lose them. The team makes the upgrade decision per prompt rather than as a binary library-wide call.

Per-prompt decision

Model-version drift is the second half of silent drift and deserves separate framing. Frontier providers now ship incremental updates monthly or faster — Sonnet 4.5 to 4.6, GPT 5.4 to 5.5, Gemini 3.0 to 3.1 — and the assumption that a prompt tuned for one version works fine on the next is no longer safe. We see roughly 10–15% of prompts in a typical library shift meaningfully on a single model upgrade. Without eval instrumentation, the team finds out by customer complaint; with it, the team finds out the morning of the upgrade.

The remediation is structural. Every prompt in the library documents its primary model and its eval scores on at least one adjacent version. When a vendor ships an update, the team runs the eval suite against the new version and gets a per-prompt delta. Prompts that move favourably can adopt the new version immediately; prompts that regress stay on the prior version until the prompt itself is updated. The decision is data-driven and takes minutes rather than weeks. For teams building this kind of instrumentation from scratch, the same pattern shows up in our Claude Code subagent walkthrough — both prompt libraries and subagent libraries need the same shape of eval rails, and teams that adopt one usually want the other within a quarter.
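A sketch of the per-prompt upgrade decision, assuming a hypothetical run_eval_suite helper that accepts a prompt name and a model identifier; the prompt names, version strings, and 2-point tolerance are illustrative.

```python
# Per-prompt model-upgrade decision: score each prompt on the current and
# candidate model versions and adopt the new model only where it does not
# regress. Helper, names, and threshold are illustrative assumptions.
from my_evals import run_eval_suite  # hypothetical project helper

PROMPTS = ["support_classifier", "invoice_extractor", "summary_writer"]
CURRENT, CANDIDATE = "model-x.4", "model-x.5"   # illustrative version names

for name in PROMPTS:
    old = run_eval_suite(name, model=CURRENT)
    new = run_eval_suite(name, model=CANDIDATE)
    delta = new - old
    decision = "adopt" if delta >= -0.02 else "hold on current version"
    print(f"{name}: {old:.3f} -> {new:.3f} ({delta:+.3f}) · {decision}")
```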

The silent-drift fix
The simplest, cheapest, most under-adopted intervention in prompt engineering is a nightly cron that runs the eval suite against production prompts and pages the named owner when scores drop. The infrastructure cost is a GitHub Actions schedule trigger; the behavioural change is treating prompts as first-class production artefacts. The two together compound for the rest of the prompt library's lifetime. For agencies and engineering teams stepping into this discipline, our AI transformation engagements install the eval pipeline as the first deliverable, not the last.
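A sketch of that nightly check, written so the scheduler (a GitHub Actions schedule trigger or plain cron) only needs to run the script and route the failure; the baseline file, threshold, and helper are illustrative assumptions.

```python
# Nightly regression check: run the eval suite, compare against the stored
# baseline, and exit non-zero so the scheduled job fails and pages the owner.
# Baseline path, threshold, and helper are illustrative assumptions.
import json
import sys
from pathlib import Path

from my_evals import run_eval_suite  # hypothetical project helper

BASELINES = json.loads(Path("evals/baselines.json").read_text())
THRESHOLD = 0.03  # allowable drop before the run is treated as a regression

failures = []
for prompt_name, baseline in BASELINES.items():
    score = run_eval_suite(prompt_name)
    if baseline - score > THRESHOLD:
        failures.append(f"{prompt_name}: {baseline:.3f} -> {score:.3f}")

if failures:
    print("Eval regression detected:\n" + "\n".join(failures))
    sys.exit(1)  # non-zero exit surfaces the regression to the named owner
```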
Conclusion

Prompt engineering is engineering — anti-patterns are the failure modes you ship without naming.

The ten anti-patterns catalogued above are not a comprehensive taxonomy of everything that can go wrong with a prompt. They are the ten that account for the majority of silent regressions in audited libraries — the failure modes that compound fastest and produce the most customer-visible damage relative to the effort required to remediate them. Naming them is half the defence; measuring them is the other half. A team that can name few-shot pollution and write an eval that exposes it has done more for prompt quality than any amount of further tactical advice can buy.

The contrarian framing in this essay is deliberate. Most prompt engineering writing reads as a list of tactics to apply — personas, chain-of-thought, few-shot, explicit instructions — as if every tactic is a free lift. The reality is that every tactic is also an anti-pattern when applied without measurement or stacked without compression. The discipline is not about which tactics to use; it is about which tactics to remove when the eval suite says they are costing you points. Prompt engineering matures the moment the team stops asking "what should we add" and starts asking "what is each token in this prompt earning."

What to do next: pick the single highest-stakes prompt in your library. Run it through the ten anti-patterns as a checklist. Note every one that applies. Write one eval case that exposes the most severe of them. Refactor until the eval passes. The entire exercise takes a single afternoon and produces a prompt that survives the next platform shift, the next model upgrade, and the next six rounds of well-intentioned editing — which is the only durable definition of prompt quality we have found.

Engineer your prompts

Prompt engineering is engineering — anti-patterns are the failure modes most teams ship.

Our team audits production prompt libraries against the ten anti-patterns and ships the eval-suite + regression cadence.

What we deliver

Prompt anti-pattern engagements

  • 10-point anti-pattern audit
  • Few-shot pollution detection and remediation
  • Format-generalisation testing
  • Eval-suite design (Promptfoo / DeepEval)
  • Regression-cron and drift dashboards
FAQ · Prompt anti-patterns

The questions teams ask before their prompt library starts rotting.

How do you detect few-shot pollution before it regresses a prompt?
Take the instruction paragraph and the example block side by side. Write down every requirement the instruction text states — output format, required fields, tone, length, edge-case rules. Walk through each example and check it against that list. Every requirement an example violates is a vote against the instruction inside the model's attention budget. Three or more contradictions across the example set is a near-certain regression source, regardless of how the prompt is currently performing in casual review. The remediation is straightforward: resync the example block with the current instruction text on the next PR, then add a single eval case that asserts the instruction-stated output shape so the next slip surfaces in CI rather than in production.