Backpropagation and Gradient Flow
How neural networks learn: compute the loss, attribute it to each weight via the chain rule, and update the weights to reduce it.
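A minimal sketch of that loop for a one-weight linear model with squared-error loss; the data point, starting weight, and learning rate are illustrative values, not taken from the article.

```python
# Minimal sketch of the learning loop for a one-weight linear model y = w * x,
# trained with squared-error loss; all numbers are illustrative.
x, target = 2.0, 10.0   # one training example
w = 0.5                 # initial weight
lr = 0.05               # learning rate

for step in range(20):
    y = w * x                      # forward pass
    loss = (y - target) ** 2       # compute the loss
    # chain rule: dloss/dw = dloss/dy * dy/dw = 2 * (y - target) * x
    grad_w = 2 * (y - target) * x  # attribute the loss to the weight
    w -= lr * grad_w               # update the weight to reduce the loss

print(f"w = {w:.3f}")  # converges towards 5.0, where the loss is zero
```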
Information Theory for AI Engineers
The mathematical language for uncertainty, surprise, and divergence between distributions.
Linear Algebra for AI Engineers
The linear algebra that makes LLMs legible — matrix multiplication as attention, SVD as the reason LoRA works, cosine similarity as the embedding search standard.
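A small sketch of the cosine-similarity ranking mentioned above, using NumPy; the query and document vectors are made-up 4-dimensional examples, not real embeddings.

```python
# Cosine similarity as used in embedding search: compare direction, not magnitude.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([0.2, 0.9, 0.1, 0.4])
docs = {
    "doc_a": np.array([0.1, 0.8, 0.0, 0.5]),
    "doc_b": np.array([0.9, 0.1, 0.7, 0.0]),
}
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
print(ranked)  # doc_a ranks first: it points in nearly the same direction as the query
```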
Numerical Precision — fp32, fp16, bf16, int8, fp8
Every LLM inference and training decision involves a precision trade-off: lower precision means a smaller memory footprint and faster compute, but risks numerical instability and accuracy loss.
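A back-of-envelope illustration of that trade-off: the sketch below sizes the weights of a hypothetical 7B-parameter model at each precision; the parameter count is an assumption made for the example.

```python
# Weight memory footprint by precision format for an assumed 7B-parameter model.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1, "fp8": 1}

n_params = 7e9
for fmt, nbytes in BYTES_PER_PARAM.items():
    gib = n_params * nbytes / 1024**3
    print(f"{fmt:>5}: {gib:5.1f} GiB of weights")
# fp32 ~26 GiB, fp16/bf16 ~13 GiB, int8/fp8 ~6.5 GiB:
# halving the bytes per value halves the footprint, at the cost of range/accuracy.
```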
Optimisation for Deep Learning
Gradient descent, Adam/AdamW mechanics, cosine LR schedules, gradient clipping, and a diagnostic table for the six most common training instability symptoms.
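A sketch of one of the pieces listed above, a cosine LR schedule with linear warmup; the peak LR, warmup length, and total steps are illustrative assumptions, not the article's settings.

```python
# Cosine learning-rate schedule with linear warmup, as commonly used in LLM training.
import math

def lr_at(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
          warmup: int = 100, total: int = 1000) -> float:
    if step < warmup:                       # linear warmup from 0 up to max_lr
        return max_lr * (step + 1) / warmup
    # cosine decay from max_lr down to min_lr over the remaining steps
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 50, 100, 500, 999):
    print(s, f"{lr_at(s):.2e}")  # ramps up to 3e-4, then decays towards 3e-5
```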
Probability and Information Theory for AI Engineers
Probability and information theory foundations every AI engineer needs — covers cross-entropy loss, KL divergence (used in DPO/RLHF), softmax/temperature, perplexity, and sampling strategies that drive LLM training and inference.
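A short sketch of two of those ideas, temperature-scaled softmax and perplexity as the exponential of average negative log-likelihood; the logits and token probabilities are invented for the example.

```python
# Temperature-scaled softmax over next-token logits, plus perplexity.
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.2, -1.0])
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax(logits, t), 3))  # lower T -> sharper distribution

# Perplexity = exp(mean negative log-likelihood of the observed tokens)
token_probs = np.array([0.4, 0.1, 0.25])  # model probability assigned to each true token
perplexity = np.exp(-np.log(token_probs).mean())
print(f"perplexity = {perplexity:.2f}")
```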
Transformer Mathematics
Full attention formula (O(n²d) time, O(n²) memory), LoRA's 256x parameter reduction via low-rank updates, KV cache memory calculation, and quantisation quality trade-offs by format.
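A back-of-envelope KV cache sizing sketch; the shapes are assumed to be roughly Llama-2-7B-like (32 layers, 32 KV heads, head dim 128) in fp16, not figures quoted from the article.

```python
# KV cache size per sequence:
# 2 (K and V) * layers * kv_heads * head_dim * context_length * bytes_per_value
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_value: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(f"{size / 1024**3:.2f} GiB per sequence")  # 2.00 GiB at 4k context in fp16
```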