Prompt Engineering

Claude-specific XML structuring outperforms Markdown, 2-5 few-shot examples in example tags, CoT for reasoning tasks but not with Extended Thinking, and DSPy for automated optimisation at scale.

Updated Invalid Date·

prompting xml chain-of-thought dspy claude few-shot context-engineering

The craft of eliciting the best output from a language model through input design. More accurately called context engineering now. The discipline covers what to put in the context window, not just how to phrase a question.

[Source: Perplexity research, 2026-04-29]

Why It's a Real Skill

The gap between a naive prompt and a well-engineered one is routinely 20–40% on task performance. DSPy auto-optimisation can find better prompts than human-written ones 60–80% of the time, but it needs a human-defined evaluation metric to optimise against.

The key insight: LLMs are extremely sensitive to framing, ordering, and structural signals in their input. Understanding why a prompt works makes you better at designing new ones.

Claude-Specific: XML Structuring

Claude is trained on XML-structured documents and responds best to XML-tagged inputs. This is the most important Claude-specific prompt engineering fact.

XML beats Markdown beats numbered lists beats plain prose for Claude.

<role>
You are a senior software engineer reviewing a pull request.
</role>

<context>
The PR adds a new authentication middleware to a Django REST API.
<file name="auth/middleware.py">
{{ code }}
</file>
</context>

<task>
Review for security vulnerabilities, correctness, and code quality.
</task>

<output_format>
Return a JSON object with keys: "verdict" (approve|request_changes), "issues" (list), "suggestions" (list).
</output_format>

Use <example> tags to wrap few-shot examples. Use <scratchpad> to give Claude space to think before committing to an answer.

Few-Shot Prompting

2–5 examples is the sweet spot. More examples help with consistent formatting and edge-case handling; too many dilute the context budget.

Rules for good examples:

Wrap each in <example> ... </example> tags
Include edge cases, not just happy paths
Input and output should match the exact format you expect
Order matters: hardest examples last (Claude is influenced by recency)

<examples>
<example>
<input>Classify sentiment: "The product is amazing!"</input>
<output>positive</output>
</example>
<example>
<input>Classify sentiment: "Worst experience I've had."</input>
<output>negative</output>
</example>
</examples>

Chain-of-Thought (CoT)

Asking the model to reason step-by-step before answering. Significantly improves performance on multi-step reasoning, math, and code.

Classic CoT:

Think step by step before answering.

More structured:

<task>Solve this algebra problem: 3x + 7 = 22</task>
<scratchpad>Work through the solution step by step.</scratchpad>
<answer>State the final answer here.</answer>

When NOT to use CoT:

Extended Thinking models (claude-opus-4-7 with thinking enabled) — the model reasons internally; adding explicit CoT instructions conflicts and degrades performance
Simple classification/extraction tasks — CoT adds latency and tokens for no gain
When you need exactly-formatted output — CoT can bleed into the output format

See apis/anthropic-api for extended thinking configuration.

System Prompt Design

The system prompt sets the operating context for the entire conversation. Best practices:

Role first — establish identity/persona before instructions
Constraints before capabilities — say what Claude should NOT do before what it should
Output format in system, not user message — the format is constant; keep it out of the dynamic turn
Long static context → cache it — anything > 1,024 tokens in the system prompt should use cache_control
Separate concerns with XML — <role>, <constraints>, <tools>, <output_format> as distinct blocks

Zero-Shot vs Few-Shot vs Fine-Tuning Decision

Start with zero-shot (well-structured XML prompt)
  → Wrong format / style? Add few-shot examples
  → Still inconsistent? Add DSPy optimisation
  → Style/domain mismatch that prompting can't fix? Consider fine-tuning

Fine-tuning should be the last resort, not the first. See fine-tuning/decision-framework.

DSPy

Auto-optimising prompt modules. Instead of hand-writing prompt strings, you define:

A signature (input fields → output fields)
An evaluator (ground-truth labels or LLM judge)
An optimizer (BootstrapFewShot, MIPROv2, etc.)

DSPy then searches the space of prompts and few-shot examples to find the best combination. Typical improvement: 10–40% over hand-written prompts on constrained tasks.

import dspy

class Classify(dspy.Signature):
    """Classify customer support tickets by urgency."""
    ticket: str = dspy.InputField()
    urgency: Literal["low", "medium", "high"] = dspy.OutputField()

classifier = dspy.ChainOfThought(Classify)
# Then optimise with dspy.MIPROv2 against your labelled dataset

Best used when: you have a repeatable task with measurable correctness, and you're running it at scale (thousands of calls per day).

Prompt Injection Defence

When user-provided content is included in prompts (RAG context, tool results, user messages in agents), it becomes an attack surface. See security/prompt-injection for full treatment.

Quick mitigations:

Always separate user content from instructions with XML tags
Never let user content appear before core instructions in the prompt
Validate tool results before including them as context
Use a separate model call to screen untrusted content before giving it to the main agent

Advanced Techniques

The following are less commonly used but have well-evidenced gains for specific scenarios.

Tree of Thoughts (ToT)

Instead of a single reasoning chain (CoT), generate multiple candidate reasoning paths, evaluate each, and select the best. Improves performance on tasks with multiple plausible solution paths (math puzzles, creative planning, search problems).

Cost: significantly more tokens and latency. Use only when CoT produces inconsistent results on a high-value task.

Self-Consistency

Generate the same prompt multiple times with temperature > 0, then take the majority answer. The ensemble effect reduces variance on reasoning tasks.

responses = [generate(prompt, temperature=0.7) for _ in range(5)]
final = majority_vote(responses)

Improvement: 10-20% on math/reasoning tasks. Cost: 5x tokens. Use when accuracy matters more than cost.

Reflexion

After an initial response (especially a failed tool call or code output), feed the result back to the model with an explicit reflection prompt: "Review what you did, identify errors, try again."

Useful in agent loops where the model can observe the outcome of its actions and self-correct. Similar to the human debugging loop. See agents/react-pattern.

Prompt Chaining

Break complex tasks into a sequence of focused prompts, feeding each output as input to the next. Each step is simpler and more verifiable than doing everything in one prompt.

Example pipeline:

1. Extract key claims from document → claims list
2. Verify each claim against database → verified/unverified list
3. Summarise verified claims into report → final output

When to use: tasks that require distinct reasoning steps where intermediate outputs benefit from review or branching.

Meta Prompting

Use the model itself to generate or improve prompts for a target task. Provide examples of the task and ask the model to write the best prompt to solve it. This is the manual version of what DSPy automates.

Useful for one-off tasks or as a starting point before DSPy optimisation.

Context Engineering: Beyond the Prompt

The broader discipline of managing what goes into the context window:

Prompt compression — LLMLingua and RECOMP reduce long contexts by 3-10x with minimal quality loss
Memory management — for long agent runs, summarise old turns rather than dropping them
Tool result filtering — strip verbose tool outputs before passing to the LLM
Dynamic system prompts — inject only the relevant instructions for each request (reduces tokens, reduces confusion)

At scale, context engineering decisions affect cost as much as model selection. See prompting/context-engineering for context rot, compaction, and JIT retrieval.

Quick Reference: What Works

Technique	Improvement	Cost
XML structuring (Claude)	~15–20% on formatting	Zero
Few-shot examples (3–5)	~20–30% on consistency	+tokens
Chain-of-thought	~20–40% on reasoning	+tokens + latency
DSPy optimisation	10–40% on constrained tasks	Engineering time
Prompt caching	0% quality, 90% cost reduction	Minimal setup

Key Facts

XML vs Markdown for Claude: XML tags outperform Markdown and numbered lists; this is the single most impactful Claude-specific technique
Few-shot sweet spot: 2-5 examples; wrap each in <example> tags; put hardest examples last (recency effect)
CoT improvement: ~20-40% on multi-step reasoning; zero or negative effect with Extended Thinking enabled
DSPy MIPROv2: 10-40% improvement over hand-written prompts; needs 50-100 labelled examples
Prompt caching: system prompts >1,024 tokens should use cache_control; saves 90% on repeated calls
Zero-shot → few-shot → DSPy → fine-tuning is the correct escalation order

Common Failure Cases

Chain-of-thought conflicts with Extended Thinking and degrades output
Why: adding "think step by step" instructions to a prompt using Extended Thinking (thinking: enabled) introduces conflicting reasoning pathways; the model's internal reasoning and the explicit CoT prompt interfere.
Detect: output quality drops when Extended Thinking is enabled and CoT instructions are present; answers are shorter or less coherent than with thinking alone.
Fix: remove all explicit CoT prompting when Extended Thinking is enabled; let the model reason internally via budget_tokens.

Few-shot examples include the wrong output format, model drifts to match them
Why: examples were copied from an older version of the task spec with a different output format; the model follows the examples rather than the <output_format> instruction.
Detect: output format matches examples, not the spec; switching to zero-shot produces the correct format.
Fix: always keep examples consistent with the current output format spec; when format changes, update all examples simultaneously.

XML tag names collide with user-supplied content
Why: user message contains text that looks like XML tags matching your structural tags (e.g., a user writes </task>); this closes the structural tag prematurely.
Detect: Claude treats user-supplied content after the tag as a structural boundary; output is truncated or misformatted.
Fix: use unique, unlikely tag names (e.g., <SYSTEM_TASK> not <task>); or sanitise user content to escape < and > before injection.

DSPy optimisation overfits to the eval set
Why: the eval set used during DSPy optimisation is the same set used to measure improvement; the optimiser finds prompts that score well on these specific examples.
Detect: DSPy-optimised prompt scores 30% better on the optimisation set but shows no improvement on a held-out test set.
Fix: split data into optimisation set and held-out test set before running DSPy; report improvement on the held-out set only.

Prompt compression removes critical context
Why: LLMLingua or similar compression aggressively removes tokens; it removes a constraint or caveat that the full-length prompt contained.
Detect: compressed-prompt answers violate a constraint that the full prompt enforced; faithfulness drops after compression.
Fix: mark critical constraints as anchor tokens that the compressor must preserve; validate compressed prompt against a checklist of required instructions.

Connections

apis/anthropic-api — extended thinking, prompt caching, tool use
evals/methodology — measuring whether prompt changes actually improve task performance
rag/pipeline — structuring retrieved context in prompts
security/prompt-injection — prompt injection attack patterns and defences
prompting/dspy — automated prompt optimisation
prompting/context-engineering — managing context window beyond just prompt phrasing

Open Questions

Does XML structuring advantage persist for Claude 5+ or was it an artefact of specific training data?
How does the optimal few-shot count vary by task domain — is 2-5 still right for highly technical tasks?
When DSPy-optimised prompts are 40% better than hand-written ones, what does that imply about the upper bound of manual prompt engineering skill?