AI Engineering Learning Path
Four-stage curriculum for software engineers entering AI engineering — Foundations (1–2 weeks), Building (2–3 weeks), Production (2–3 weeks), Advanced (ongoing) — each stage has a concrete project to build.
A structured progression for software engineers moving into AI engineering. Assumes Python proficiency and general backend/frontend experience. Organised into four stages. Complete each stage before moving to the next.
Time estimates assume active building, not just reading. Reading alone won't close the gap. Each stage has a project to build.
Prerequisites — Computer Science Fundamentals
If you are not yet a working software engineer, start with cs-fundamentals/python-basics and work through the table below. These are the underlying concepts every stage assumes you know. If you're already a working SE, skim the pages that cover areas you haven't touched recently.
| Page | What it covers | Priority if you're a working SE |
|---|---|---|
| cs-fundamentals/data-structures | Arrays, hash tables, linked lists, trees, heaps, graphs, Big O | Skim — just make sure you know Big O |
| cs-fundamentals/algorithms | Sorting, binary search, recursion, DP, two pointers, backtracking | Skim — refresh DP if rusty |
| cs-fundamentals/system-design | Load balancing, caching, databases, CAP theorem, microservices vs monolith | Read — directly maps to AI infra decisions |
| cs-fundamentals/sql | SELECT/JOIN/GROUP BY, indexes, ACID, transactions, SQLAlchemy ORM | Read — pgvector is PostgreSQL; you'll use this |
| cs-fundamentals/git | Staging, branching, merge vs rebase, PR workflow, conventional commits | Skim — refresh interactive rebase if unfamiliar |
| cs-fundamentals/networking | HTTP/HTTPS, DNS, TCP/IP, status codes, headers, SSE, WebSockets | Read — LLM streaming uses SSE; you'll hit 429s |
| cs-fundamentals/oop-patterns | Classes, inheritance, composition, SOLID, Factory/Observer/Strategy/Repository | Read — these patterns appear in every framework |
Time to complete (from zero): 3–4 weeks of evening reading + exercises. Working SEs: 2–3 days of targeted gap-filling.
Stage 1 — Foundations (1–2 weeks)
Goal: understand what LLMs are, how to call them, and what you can build with a single API call.
Read (in this order)
- synthesis/getting-started — make your first API call before reading anything else
- llms/claude — understand the model family you'll use most; which model for which task
- apis/anthropic-api — the full API surface: system prompts, tool use, caching, batch, streaming
- llms/transformer-architecture — how the model actually works; you don't need the math yet, so read for intuition
- prompting/techniques — XML structuring, few-shot examples, chain-of-thought; this changes output quality immediately
- llms/hallucination — what can go wrong and why; shapes how you design every system
Build
A CLI tool that takes a question and answers it using Claude. Add a system prompt. Add multi-turn history. Add streaming output. Roughly 100–150 lines of Python.
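A minimal sketch of the core loop, assuming the official anthropic Python SDK (pip install anthropic), an ANTHROPIC_API_KEY in the environment, and a placeholder model id:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM = "You are a concise technical assistant. Answer in plain prose."
history: list[dict] = []  # multi-turn memory: alternating user/assistant turns

def ask(question: str) -> str:
    history.append({"role": "user", "content": question})
    chunks: list[str] = []
    # stream() yields text deltas as the model generates them
    with client.messages.stream(
        model="claude-sonnet-4-5",  # placeholder; use a current model id
        max_tokens=1024,
        system=SYSTEM,
        messages=history,
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            chunks.append(text)
    print()
    answer = "".join(chunks)
    history.append({"role": "assistant", "content": answer})
    return answer

if __name__ == "__main__":
    while True:
        question = input("> ")
        if question in ("exit", "quit"):
            break
        ask(question)
```

The whole Stage 1 project is variations on this shape: the system prompt, the growing messages list, and the streaming context manager.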
Done when
You can explain what tokens are, why max_tokens matters for cost, and the difference between a system prompt and a user message.
Stage 2 — Building (2–3 weeks)
Goal: build the two most common AI application patterns: RAG and agents. Together they cover ~80% of production AI systems.
Read (in this order)
- rag/pipeline — the full RAG stack end to end; read this before the detail pages
- rag/chunking — how you split documents; chunking is underrated as a quality lever
- rag/embeddings — what embeddings are and which model to use
- infra/vector-stores — where you store and search embeddings; start with pgvector or Chroma
- rag/reranking — the single biggest quality improvement after basic RAG works
- agents/react-pattern — the agent loop: think, act, observe, repeat
- agents/langgraph — the framework for building agents that need state and checkpointing
- protocols/tool-design — how to write tool definitions the model will use correctly
- synthesis/architecture-patterns — the 7 blueprints; shows how RAG and agents combine
Build
A RAG application over a document set you care about (your company's documentation, a technical spec, a book). Add a chat interface. Add citations. Then extend it with one tool (e.g. a web search or a calculator). Roughly 300–500 lines of Python.
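A minimal retrieve-then-generate sketch, assuming Chroma (pip install chromadb) with its default embedding function and the anthropic SDK; the model id is a placeholder, and the fixed-size chunking is deliberately naive (rag/chunking covers better strategies):

```python
import anthropic
import chromadb

client = anthropic.Anthropic()
collection = chromadb.Client().create_collection("docs")  # in-memory store

def index(text: str, chunk_size: int = 800) -> None:
    # Naive fixed-size chunking; overlap and structure-aware splitting help
    chunks = [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]
    collection.add(documents=chunks, ids=[f"chunk-{i}" for i in range(len(chunks))])

def answer(question: str, k: int = 4) -> str:
    hits = collection.query(query_texts=[question], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use a current model id
        max_tokens=1024,
        system="Answer only from the provided context. Say so if it is not there.",
        messages=[{
            "role": "user",
            "content": f"<context>\n{context}\n</context>\n\n{question}",
        }],
    )
    return response.content[0].text
```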
Done when
You can explain the difference between retrieval and generation, why reranking helps, what stop_reason: "tool_use" means, and how an agent loop terminates.
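What that termination looks like in code: a sketch of the bare agent loop, assuming the anthropic SDK, with an illustrative calculator tool (the tool name, schema, and model id are placeholders):

```python
import anthropic

client = anthropic.Anthropic()

TOOLS = [{
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression and return the result.",
    "input_schema": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}]

def run(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder; use a current model id
            max_tokens=1024,
            tools=TOOLS,
            messages=messages,
        )
        if response.stop_reason != "tool_use":
            return response.content[0].text  # the loop terminates here
        # Otherwise: echo the assistant turn, execute the requested tools,
        # and feed the results back as tool_result blocks.
        messages.append({"role": "assistant", "content": response.content})
        results = [{
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": str(eval(block.input["expression"])),  # demo only; never eval untrusted input
        } for block in response.content if block.type == "tool_use"]
        messages.append({"role": "user", "content": results})
```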
Stage 3 — Production (2–3 weeks)
Goal: learn what separates a demo from a system you can ship and maintain.
Read (in this order)
- evals/methodology — evaluating LLM output quality; the most important thing most engineers skip
- evals/llm-as-judge — how to use Claude to score Claude's outputs automatically
- test-automation/testing-llm-apps — how to write pytest tests for LLM applications without real API calls
- synthesis/cost-optimisation — how to reduce costs 60–90% before your bill surprises you
- observability/platforms — tracing every LLM call in production; Langfuse is the default
- observability/tracing — OpenTelemetry for LLMs; what to instrument
- security/prompt-injection — the #1 attack surface; understand it before you ship
- security/owasp-llm-top10 — the full threat model for LLM applications
- infra/deployment — Docker, CI/CD, Vercel/Fly.io, environment management
Build
Take the RAG app from Stage 2 and make it production-grade: add prompt caching, add pytest tests with mocked API calls, add a Langfuse integration to trace every call, write 10 eval cases with LLM-as-judge scoring. Deploy it.
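A sketch of the mocked-API test, assuming the answer() function and module layout from the Stage 2 sketch above (the module name rag and the fake response shape are illustrative):

```python
from types import SimpleNamespace
from unittest.mock import patch

import rag  # the Stage 2 module; name is illustrative

def fake_message(text: str):
    # Mirrors just enough of the SDK's Message shape for the code under test
    return SimpleNamespace(
        content=[SimpleNamespace(type="text", text=text)],
        stop_reason="end_turn",
        usage=SimpleNamespace(input_tokens=120, output_tokens=40),
    )

def test_answer_uses_retrieved_context():
    rag.collection.add(documents=["The capital of France is Paris."], ids=["c0"])
    with patch.object(rag.client.messages, "create",
                      return_value=fake_message("Paris")) as create:
        assert rag.answer("What is the capital of France?") == "Paris"
        # The prompt sent to the (mocked) model contains the retrieved chunk
        sent = create.call_args.kwargs["messages"][0]["content"]
        assert "capital of France" in sent
```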
Done when
You can explain what an eval golden set is, why you mock the API in tests, what prompt injection is, and how to estimate the monthly cost of your app before deploying it.
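A back-of-envelope cost estimate in code; the per-million-token prices and cache discount below are placeholders, so substitute current published pricing before trusting the number:

```python
IN_PRICE = 3.00       # $ per million input tokens (placeholder)
OUT_PRICE = 15.00     # $ per million output tokens (placeholder)
CACHE_DISCOUNT = 0.1  # cached input billed at a fraction of full price (assumption)

calls_per_day = 2_000
input_tokens = 1_500   # prompt + retrieved context per call
cached_tokens = 1_000  # system prompt + static context served from the cache
output_tokens = 300

per_call = (
    (input_tokens - cached_tokens) * IN_PRICE / 1e6
    + cached_tokens * IN_PRICE * CACHE_DISCOUNT / 1e6
    + output_tokens * OUT_PRICE / 1e6
)
print(f"${per_call:.4f}/call -> ${per_call * calls_per_day * 30:,.2f}/month")
```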
Stage 4 — Advanced (ongoing)
Goal: the deeper topics that make you a stronger AI engineer over time. These don't need to be read in strict order. Follow what's relevant to what you're building.
Model internals
- llms/transformer-architecture — revisit with the math this time; attention formula, KV cache
- math/transformer-math — shapes, memory calculations, why context length affects cost
- math/probability — softmax, temperature, sampling strategies — why temperature 0 for factual tasks (see the sketch after this list)
- llms/tokenisation — why "1 token ≠ 1 word" matters for cost and for prompt design
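A minimal sketch of that temperature intuition with numpy (the logits are made up):

```python
import numpy as np

def softmax_t(logits: np.ndarray, temperature: float) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.5])
for t in (1.0, 0.7, 0.1):
    print(t, softmax_t(logits, t).round(3))
# As temperature falls, probability mass collapses onto the argmax token;
# at (near-)zero the sampling is effectively greedy, hence temperature 0
# for factual tasks.
```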
Advanced RAG
- rag/hybrid-retrieval — BM25 + dense + RRF; better than pure vector search (see the sketch after this list)
- rag/graphrag — for complex multi-hop reasoning over large document sets
- prompting/dspy — automated prompt optimisation; replaces hand-tuning at scale
- prompting/context-engineering — managing large context windows without degrading quality
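The fusion step behind hybrid retrieval, as a sketch: reciprocal rank fusion over two ranked lists (the document ids are illustrative):

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each document scores sum(1 / (k + rank)) across every list it appears in
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["doc3", "doc1", "doc7"]   # keyword ranking
dense = ["doc1", "doc9", "doc3"]  # vector ranking
print(rrf([bm25, dense]))  # doc1 and doc3 rise: both retrievers rank them well
```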
Agents and protocols
- agents/multi-agent-patterns — Supervisor, Swarm, Parallel fan-out; when single agent isn't enough
- agents/memory — how agents remember things across sessions
- protocols/mcp — the standard protocol for agent tool connectivity; you'll see it everywhere
- agents/openai-agents-sdk — OpenAI's agent framework; you'll encounter it in production repos
Fine-tuning (when prompting + RAG isn't enough)
- synthesis/rag-vs-finetuning — read this first; most teams fine-tune when they shouldn't
- fine-tuning/decision-framework — the decision tree for when fine-tuning actually makes sense
- fine-tuning/lora-qlora — LoRA and QLoRA; how to fine-tune without a datacenter
- fine-tuning/dpo-grpo — DPO and GRPO; training on preferences not just examples
Cloud and infrastructure
- infra/cloud-platforms — AWS Bedrock, GCP Vertex AI, Azure OpenAI; when to use managed cloud vs self-hosted
- infra/inference-serving — vLLM and llama.cpp for self-hosted inference
- infra/gpu-hardware — GPU selection, VRAM requirements, cloud vs on-prem cost
Safety and alignment
- safety/constitutional-ai — how Claude is trained; shapes how you prompt and evaluate it
- safety/alignment — Anthropic's RSP; the capability thresholds that govern deployment
- safety/mechanistic-interpretability — what's actually happening inside the model
The Project Ladder
The fastest path to AI engineering competence is a series of real projects, each adding one new concept:
| Project | Concepts practiced |
|---|---|
| CLI question-answering tool | Basic API, system prompts, streaming |
| RAG over your own documents | Chunking, embeddings, vector search, retrieval |
| RAG + citations + reranker | Reranking, faithfulness, source attribution |
| Support ticket classifier | Classification, model routing, Haiku for cheap tasks |
| Agent with web search tool | Agent loop, tool use, stop_reason handling |
| Django endpoint streaming LLM responses (see sketch below) | Async, SSE, FastAPI/Django integration |
| Eval pipeline for any of the above | LLM-as-judge, golden sets, pytest evals |
| Multi-agent research pipeline | LangGraph, state management, handoffs |
Build these roughly in order. Each one compounds the last.
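For the streaming-endpoint rung, a minimal SSE sketch assuming FastAPI plus uvicorn and the anthropic SDK (Django's equivalent is a StreamingHttpResponse; the model id is a placeholder):

```python
import anthropic
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()
client = anthropic.Anthropic()

def sse(question: str):
    with client.messages.stream(
        model="claude-sonnet-4-5",  # placeholder; use a current model id
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    ) as stream:
        for text in stream.text_stream:
            yield f"data: {text}\n\n"  # SSE frame: "data: <payload>\n\n"
    yield "data: [DONE]\n\n"

@app.get("/ask")
def ask(q: str):
    return StreamingResponse(sse(q), media_type="text/event-stream")
```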
What Makes a Good AI Engineer
Technical skills matter, but these habits separate good from great:
- Evals first. Define how you'll measure quality before you write any LLM code (see the sketch after this list).
- Test the plumbing. Mock the API in tests. Never call the real model in CI.
- Cost awareness. Know your cost per call before you scale. Prompt caching and model routing are decisions, not afterthoughts.
- Scepticism about the model. The model will hallucinate. Design for it.
- Read the response object. stop_reason, usage, cache hit counts — this data tells you what's actually happening.
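A minimal evals-first sketch: a golden set plus an LLM-as-judge check, assuming the anthropic SDK (the cases, rubric, and model id are illustrative):

```python
import anthropic

client = anthropic.Anthropic()

GOLDEN_SET = [  # illustrative; a real set needs 10+ cases covering failure modes
    {"question": "What is the capital of France?", "expected": "Paris"},
]

def judge(question: str, expected: str, actual: str) -> bool:
    verdict = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use a current model id
        max_tokens=5,
        system="Reply PASS if the answer matches the expected answer in substance, otherwise FAIL.",
        messages=[{"role": "user", "content":
                   f"Question: {question}\nExpected: {expected}\nAnswer: {actual}"}],
    )
    return verdict.content[0].text.strip().startswith("PASS")
```

Write the golden set before the app; it doubles as your pytest eval suite.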
Key Facts
- Stage 1 (Foundations, 1–2 weeks): first API call, system prompts, transformer intuition, prompting techniques, hallucination awareness
- Stage 2 (Building, 2–3 weeks): RAG pipeline, chunking, embeddings, vector stores, reranking, ReAct agent loop, LangGraph, tool design
- Stage 3 (Production, 2–3 weeks): evals, LLM-as-judge, testing with mocked API calls, cost optimisation, observability, prompt injection, OWASP, deployment
- Stage 4 (Advanced, ongoing): model internals, hybrid retrieval, GraphRAG, DSPy, multi-agent, MCP, fine-tuning, cloud infra, safety
- Project ladder: CLI tool → RAG app → RAG+citations+reranker → classifier → agent → Django SSE → eval pipeline → multi-agent
- "Evals first" is the most important habit: define how you'll measure quality before writing any LLM code
- Mock the API in tests — never call the real model in CI
- Read stop_reason, usage, and cache hit counts from every response object
Connections
- synthesis/getting-started — the first page to read; your first working API call
- synthesis/architecture-patterns — the 7 blueprints that cover 90% of AI applications
- synthesis/llm-decision-guide — which model, embedding, vector store, and framework for each decision
- overview — the current state of the field in one page
- evals/methodology — the Stage 3 cornerstone; most engineers skip it and regret it
- synthesis/rag-vs-finetuning — the Stage 4 fine-tuning decision read
Open Questions
- Does the Stage 1 → Stage 2 → Stage 3 ordering hold for engineers whose primary interest is model internals rather than applications?
- Is LangGraph still the right Stage 2 agent framework recommendation, or has a simpler alternative emerged that reduces the learning curve?
- At what project complexity does the project ladder diverge for backend-focused vs frontend-focused engineers?