The Axiom

Math (7 pages)

Backpropagation and Gradient Flow

How neural networks learn: compute the loss, attribute it to each weight via the chain rule, update weights to reduce the loss.

backpropagation, gradients, chain-rule, vanishing-gradients
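
A minimal sketch of the chain-rule step this piece describes, not taken from the article itself: one scalar weight, a squared-error loss, the analytic gradient checked against a finite-difference estimate, and a single gradient-descent update. The model and numbers are illustrative.

```python
import numpy as np

# Toy model: prediction y_hat = w * x, loss L = (y_hat - y)^2
x, y, w = 2.0, 1.0, 0.5

def loss(w):
    return (w * x - y) ** 2

# Chain rule: dL/dw = dL/dy_hat * dy_hat/dw = 2 * (y_hat - y) * x
y_hat = w * x
grad_analytic = 2 * (y_hat - y) * x

# Finite-difference check of the same gradient
eps = 1e-6
grad_numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

# Update step: move w against the gradient to reduce the loss
lr = 0.1
w_new = w - lr * grad_analytic

print(grad_analytic, grad_numeric, w_new)
```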

Information Theory for AI Engineers

The mathematical language for uncertainty, surprise, and divergence between distributions.

information-theory, entropy, kl-divergence, cross-entropy
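
A small worked example of the three quantities named in the tags, under an assumed pair of toy distributions (the numbers are illustrative, not from the article):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy       = -np.sum(p * np.log(p))        # H(p): average surprise under p
cross_entropy = -np.sum(p * np.log(q))        # H(p, q): surprise when coding p's data with q
kl_divergence =  np.sum(p * np.log(p / q))    # KL(p || q): extra nats paid for using q

# Identity: cross-entropy = entropy + KL divergence
assert np.isclose(cross_entropy, entropy + kl_divergence)
print(entropy, cross_entropy, kl_divergence)
```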

Linear Algebra for AI Engineers

The linear algebra that makes LLMs legible: matrix multiplication as attention, SVD as the reason LoRA works, and cosine similarity as the standard metric for embedding search.

linear-algebra, matrices, vectors, eigenvalues
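
A quick sketch of two of those claims with assumed shapes and random data: cosine similarity between embedding vectors, and SVD producing the best rank-r approximation of a weight matrix, which is the low-rank idea LoRA exploits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cosine similarity: the standard score for comparing embedding vectors
a, b = rng.normal(size=64), rng.normal(size=64)
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Low-rank idea behind LoRA: approximate a weight matrix W with rank-r factors
W = rng.normal(size=(128, 128))
U, S, Vt = np.linalg.svd(W, full_matrices=False)
r = 8
W_r = (U[:, :r] * S[:r]) @ Vt[:r]              # best rank-r approximation of W

full_params    = W.size                        # 128 * 128 entries
lowrank_params = U[:, :r].size + Vt[:r].size   # two thin factors instead of W
print(cos_sim, np.linalg.norm(W - W_r), full_params, lowrank_params)
```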

Numerical Precision — fp32, fp16, bf16, int8, fp8

Every LLM training and inference decision involves a precision trade-off: lower precision means a smaller memory footprint and faster compute, but risks numerical instability and accuracy loss.

numerical-precision, fp16, bf16, int8
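
A minimal demonstration of the trade-off using NumPy dtypes; the values are illustrative, and bf16 is only described in a comment because NumPy has no native bfloat16 type.

```python
import numpy as np

# fp16 has a narrow exponent range: values above ~65504 overflow to inf.
# (bf16 keeps fp32's exponent range to avoid this, at the cost of fewer
# mantissa bits; NumPy has no native bf16, so it isn't shown here.)
print(np.float16(70000.0))                    # inf

# fp16 also carries only ~3 decimal digits, so tiny increments vanish.
print(np.float16(1.0) + np.float16(1e-4) - np.float16(1.0))   # 0.0

# int8 quantisation: map floats onto 256 levels with a scale, then dequantise.
w = np.linspace(-1.0, 1.0, 5, dtype=np.float32)
scale = np.abs(w).max() / 127
w_int8 = np.round(w / scale).astype(np.int8)
w_back = w_int8.astype(np.float32) * scale
print(w - w_back)                             # per-weight quantisation error
```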

Optimisation for Deep Learning

Gradient descent, Adam/AdamW mechanics, cosine LR schedules, gradient clipping, and a diagnostic table for the six most common training instability symptoms.

optimisation, gradient-descent, adam, adamw
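
A compact sketch of the mechanics listed above on a toy quadratic loss; the hyperparameters, helper names, and schedule bounds are assumptions for illustration, not the article's settings.

```python
import math
import numpy as np

# One AdamW-style update for a single parameter tensor (illustrative only).
def adamw_step(w, grad, m, v, t, lr, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01, clip_norm=1.0):
    # Gradient clipping: rescale if the gradient norm exceeds clip_norm
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)
    # Adam moments: exponential averages of the gradient and its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)                # bias correction
    v_hat = v / (1 - beta2**t)
    # Decoupled weight decay (the "W" in AdamW), then the Adam step
    w = w - lr * weight_decay * w
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Cosine learning-rate schedule: decay from lr_max to lr_min over max_steps
def cosine_lr(step, max_steps, lr_max=3e-4, lr_min=3e-5):
    progress = step / max_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 6):
    grad = 2 * w                              # gradient of a toy quadratic loss
    w, m, v = adamw_step(w, grad, m, v, t, lr=cosine_lr(t, max_steps=5))
print(w)
```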

Probability and Information Theory for AI Engineers

Probability and information theory foundations every AI engineer needs — covers cross-entropy loss, KL divergence (used in DPO/RLHF), softmax/temperature, perplexity, and sampling strategies that drive LLM training and inference.

probability, information-theory, entropy, kl-divergence
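
A short sketch of the softmax/temperature, perplexity, and sampling pieces mentioned in the teaser, with assumed toy logits rather than anything from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.2, -1.0])      # model scores for 4 tokens

def softmax(z, temperature=1.0):
    z = z / temperature
    z = z - z.max()                           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Temperature reshapes the distribution: lower = sharper, higher = flatter
p_sharp = softmax(logits, temperature=0.5)
p_flat  = softmax(logits, temperature=2.0)

# Cross-entropy of the true next token (index 0) and the matching perplexity
probs = softmax(logits)
cross_entropy = -np.log(probs[0])
perplexity = np.exp(cross_entropy)            # "effective branching factor"

# Sampling: draw the next token from the temperature-scaled distribution
next_token = rng.choice(len(logits), p=p_flat)
print(p_sharp, p_flat, perplexity, next_token)
```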

Transformer Mathematics

The full attention formula (O(n²d) time, O(n²) memory), LoRA's 256x parameter reduction via low-rank updates, the KV cache memory calculation, and quantisation quality trade-offs by format.

linear-algebra, attention, softmax, optimisation
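
A sketch of the three calculations the teaser names, with assumed model dimensions (32 layers, 4096 hidden size, fp16 cache, rank-8 LoRA) chosen only so the arithmetic is concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 16                                  # sequence length, head dimension

# Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
scores = Q @ K.T / np.sqrt(d)                 # (n, n) matrix: the O(n^2 d) term
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ V                             # (n, d) attention output

# KV cache memory: 2 (K and V) * layers * tokens * hidden dim * bytes/element
layers, tokens, hidden, bytes_per = 32, 4096, 4096, 2   # fp16
kv_cache_bytes = 2 * layers * tokens * hidden * bytes_per
print(f"KV cache: {kv_cache_bytes / 2**30:.1f} GiB per sequence")

# LoRA: replace a d_model x d_model update with two rank-r factors
d_model, r = 4096, 8
full = d_model * d_model
lora = 2 * d_model * r
print(f"parameter reduction: {full / lora:.0f}x")        # 4096 / (2 * 8) = 256x
```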