Prompt Engineering
Claude-specific XML structuring outperforms Markdown, 2-5 few-shot examples in example tags, CoT for reasoning tasks but not with Extended Thinking, and DSPy for automated optimisation at scale.
The craft of eliciting the best output from a language model through input design. More accurately called context engineering now. The discipline covers what to put in the context window, not just how to phrase a question.
[Source: Perplexity research, 2026-04-29]
Why It's a Real Skill
The gap between a naive prompt and a well-engineered one is routinely 20–40% on task performance. DSPy auto-optimisation can find better prompts than human-written ones 60–80% of the time, but it needs a human-defined evaluation metric to optimise against.
The key insight: LLMs are extremely sensitive to framing, ordering, and structural signals in their input. Understanding why a prompt works makes you better at designing new ones.
Claude-Specific: XML Structuring
Claude is trained on XML-structured documents and responds best to XML-tagged inputs. This is the most important Claude-specific prompt engineering fact.
XML beats Markdown beats numbered lists beats plain prose for Claude.
<role>
You are a senior software engineer reviewing a pull request.
</role>
<context>
The PR adds a new authentication middleware to a Django REST API.
<file name="auth/middleware.py">
{{ code }}
</file>
</context>
<task>
Review for security vulnerabilities, correctness, and code quality.
</task>
<output_format>
Return a JSON object with keys: "verdict" (approve|request_changes), "issues" (list), "suggestions" (list).
</output_format>Use <example> tags to wrap few-shot examples. Use <scratchpad> to give Claude space to think before committing to an answer.
Few-Shot Prompting
2–5 examples is the sweet spot. More examples help with consistent formatting and edge-case handling; too many dilute the context budget.
Rules for good examples:
- Wrap each in
<example>...</example>tags - Include edge cases, not just happy paths
- Input and output should match the exact format you expect
- Order matters: hardest examples last (Claude is influenced by recency)
<examples>
<example>
<input>Classify sentiment: "The product is amazing!"</input>
<output>positive</output>
</example>
<example>
<input>Classify sentiment: "Worst experience I've had."</input>
<output>negative</output>
</example>
</examples>Chain-of-Thought (CoT)
Asking the model to reason step-by-step before answering. Significantly improves performance on multi-step reasoning, math, and code.
Classic CoT:
Think step by step before answering.
More structured:
<task>Solve this algebra problem: 3x + 7 = 22</task>
<scratchpad>Work through the solution step by step.</scratchpad>
<answer>State the final answer here.</answer>When NOT to use CoT:
- Extended Thinking models (claude-opus-4-7 with
thinkingenabled) — the model reasons internally; adding explicit CoT instructions conflicts and degrades performance - Simple classification/extraction tasks — CoT adds latency and tokens for no gain
- When you need exactly-formatted output — CoT can bleed into the output format
See apis/anthropic-api for extended thinking configuration.
System Prompt Design
The system prompt sets the operating context for the entire conversation. Best practices:
- Role first — establish identity/persona before instructions
- Constraints before capabilities — say what Claude should NOT do before what it should
- Output format in system, not user message — the format is constant; keep it out of the dynamic turn
- Long static context → cache it — anything > 1,024 tokens in the system prompt should use
cache_control - Separate concerns with XML —
<role>,<constraints>,<tools>,<output_format>as distinct blocks
Zero-Shot vs Few-Shot vs Fine-Tuning Decision
Start with zero-shot (well-structured XML prompt)
→ Wrong format / style? Add few-shot examples
→ Still inconsistent? Add DSPy optimisation
→ Style/domain mismatch that prompting can't fix? Consider fine-tuning
Fine-tuning should be the last resort, not the first. See fine-tuning/decision-framework.
DSPy
Auto-optimising prompt modules. Instead of hand-writing prompt strings, you define:
- A signature (input fields → output fields)
- An evaluator (ground-truth labels or LLM judge)
- An optimizer (BootstrapFewShot, MIPROv2, etc.)
DSPy then searches the space of prompts and few-shot examples to find the best combination. Typical improvement: 10–40% over hand-written prompts on constrained tasks.
import dspy
class Classify(dspy.Signature):
"""Classify customer support tickets by urgency."""
ticket: str = dspy.InputField()
urgency: Literal["low", "medium", "high"] = dspy.OutputField()
classifier = dspy.ChainOfThought(Classify)
# Then optimise with dspy.MIPROv2 against your labelled datasetBest used when: you have a repeatable task with measurable correctness, and you're running it at scale (thousands of calls per day).
Prompt Injection Defence
When user-provided content is included in prompts (RAG context, tool results, user messages in agents), it becomes an attack surface. See security/prompt-injection for full treatment.
Quick mitigations:
- Always separate user content from instructions with XML tags
- Never let user content appear before core instructions in the prompt
- Validate tool results before including them as context
- Use a separate model call to screen untrusted content before giving it to the main agent
Advanced Techniques
The following are less commonly used but have well-evidenced gains for specific scenarios.
Tree of Thoughts (ToT)
Instead of a single reasoning chain (CoT), generate multiple candidate reasoning paths, evaluate each, and select the best. Improves performance on tasks with multiple plausible solution paths (math puzzles, creative planning, search problems).
Cost: significantly more tokens and latency. Use only when CoT produces inconsistent results on a high-value task.
Self-Consistency
Generate the same prompt multiple times with temperature > 0, then take the majority answer. The ensemble effect reduces variance on reasoning tasks.
responses = [generate(prompt, temperature=0.7) for _ in range(5)]
final = majority_vote(responses)Improvement: 10-20% on math/reasoning tasks. Cost: 5x tokens. Use when accuracy matters more than cost.
Reflexion
After an initial response (especially a failed tool call or code output), feed the result back to the model with an explicit reflection prompt: "Review what you did, identify errors, try again."
Useful in agent loops where the model can observe the outcome of its actions and self-correct. Similar to the human debugging loop. See agents/react-pattern.
Prompt Chaining
Break complex tasks into a sequence of focused prompts, feeding each output as input to the next. Each step is simpler and more verifiable than doing everything in one prompt.
Example pipeline:
1. Extract key claims from document → claims list
2. Verify each claim against database → verified/unverified list
3. Summarise verified claims into report → final output
When to use: tasks that require distinct reasoning steps where intermediate outputs benefit from review or branching.
Meta Prompting
Use the model itself to generate or improve prompts for a target task. Provide examples of the task and ask the model to write the best prompt to solve it. This is the manual version of what DSPy automates.
Useful for one-off tasks or as a starting point before DSPy optimisation.
Context Engineering: Beyond the Prompt
The broader discipline of managing what goes into the context window:
- Prompt compression — LLMLingua and RECOMP reduce long contexts by 3-10x with minimal quality loss
- Memory management — for long agent runs, summarise old turns rather than dropping them
- Tool result filtering — strip verbose tool outputs before passing to the LLM
- Dynamic system prompts — inject only the relevant instructions for each request (reduces tokens, reduces confusion)
At scale, context engineering decisions affect cost as much as model selection. See prompting/context-engineering for context rot, compaction, and JIT retrieval.
Quick Reference: What Works
| Technique | Improvement | Cost |
|---|---|---|
| XML structuring (Claude) | ~15–20% on formatting | Zero |
| Few-shot examples (3–5) | ~20–30% on consistency | +tokens |
| Chain-of-thought | ~20–40% on reasoning | +tokens + latency |
| DSPy optimisation | 10–40% on constrained tasks | Engineering time |
| Prompt caching | 0% quality, 90% cost reduction | Minimal setup |
Key Facts
- XML vs Markdown for Claude: XML tags outperform Markdown and numbered lists; this is the single most impactful Claude-specific technique
- Few-shot sweet spot: 2-5 examples; wrap each in
<example>tags; put hardest examples last (recency effect) - CoT improvement: ~20-40% on multi-step reasoning; zero or negative effect with Extended Thinking enabled
- DSPy MIPROv2: 10-40% improvement over hand-written prompts; needs 50-100 labelled examples
- Prompt caching: system prompts >1,024 tokens should use
cache_control; saves 90% on repeated calls - Zero-shot → few-shot → DSPy → fine-tuning is the correct escalation order
Common Failure Cases
Chain-of-thought conflicts with Extended Thinking and degrades output
Why: adding "think step by step" instructions to a prompt using Extended Thinking (thinking: enabled) introduces conflicting reasoning pathways; the model's internal reasoning and the explicit CoT prompt interfere.
Detect: output quality drops when Extended Thinking is enabled and CoT instructions are present; answers are shorter or less coherent than with thinking alone.
Fix: remove all explicit CoT prompting when Extended Thinking is enabled; let the model reason internally via budget_tokens.
Few-shot examples include the wrong output format, model drifts to match them
Why: examples were copied from an older version of the task spec with a different output format; the model follows the examples rather than the <output_format> instruction.
Detect: output format matches examples, not the spec; switching to zero-shot produces the correct format.
Fix: always keep examples consistent with the current output format spec; when format changes, update all examples simultaneously.
XML tag names collide with user-supplied content
Why: user message contains text that looks like XML tags matching your structural tags (e.g., a user writes </task>); this closes the structural tag prematurely.
Detect: Claude treats user-supplied content after the tag as a structural boundary; output is truncated or misformatted.
Fix: use unique, unlikely tag names (e.g., <SYSTEM_TASK> not <task>); or sanitise user content to escape < and > before injection.
DSPy optimisation overfits to the eval set
Why: the eval set used during DSPy optimisation is the same set used to measure improvement; the optimiser finds prompts that score well on these specific examples.
Detect: DSPy-optimised prompt scores 30% better on the optimisation set but shows no improvement on a held-out test set.
Fix: split data into optimisation set and held-out test set before running DSPy; report improvement on the held-out set only.
Prompt compression removes critical context
Why: LLMLingua or similar compression aggressively removes tokens; it removes a constraint or caveat that the full-length prompt contained.
Detect: compressed-prompt answers violate a constraint that the full prompt enforced; faithfulness drops after compression.
Fix: mark critical constraints as anchor tokens that the compressor must preserve; validate compressed prompt against a checklist of required instructions.
Connections
- apis/anthropic-api — extended thinking, prompt caching, tool use
- evals/methodology — measuring whether prompt changes actually improve task performance
- rag/pipeline — structuring retrieved context in prompts
- security/prompt-injection — prompt injection attack patterns and defences
- prompting/dspy — automated prompt optimisation
- prompting/context-engineering — managing context window beyond just prompt phrasing
Open Questions
- Does XML structuring advantage persist for Claude 5+ or was it an artefact of specific training data?
- How does the optimal few-shot count vary by task domain — is 2-5 still right for highly technical tasks?
- When DSPy-optimised prompts are 40% better than hand-written ones, what does that imply about the upper bound of manual prompt engineering skill?
Related reading
More in Prompting