Claude
The Claude 4.x model family (Opus 4.7, Sonnet 4.6, Haiku 4.5) — model selection guide, extended thinking, prompt caching, and the Responsible Scaling Policy (RSP) safety framework underlying all Claude deployments.
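As a concrete illustration of the prompt-caching point above, the sketch below marks a long, stable system block as cacheable using the Anthropic Python SDK; the model ID and document contents are placeholders, not a recommendation.

```python
# Hedged sketch: prompt caching with the Anthropic Python SDK.
# Model ID and document contents are illustrative placeholders.
import anthropic

LONG_REFERENCE_DOC = "<several thousand tokens of stable reference material>"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",          # placeholder; use whichever Claude model you target
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_REFERENCE_DOC,              # large, unchanging prefix worth caching
            "cache_control": {"type": "ephemeral"},  # marks a cache breakpoint at this block
        }
    ],
    messages=[{"role": "user", "content": "Summarise the key risks in this document."}],
)

# usage reports cache_creation_input_tokens on the first call and
# cache_read_input_tokens on later calls that reuse the cached prefix
print(response.usage)
```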
DeepSeek R1 / R2
DeepSeek-R1 is a 671B-parameter MoE reasoning model trained largely via reinforcement learning (GRPO, which drops PPO's learned critic in favour of group-relative baselines; only the R1-Zero variant skips supervised fine-tuning entirely) that matched OpenAI o1 on AIME and MATH-500 at 96% lower API cost with MIT-licensed open weights — the most disruptive open model release since Llama.
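A minimal sketch of the group-relative advantage at the heart of GRPO, assuming rule-based scalar rewards for a group of completions sampled from one prompt; function names are illustrative.

```python
# Minimal sketch of GRPO's group-relative advantage (no learned critic).
# For one prompt, sample a group of completions, score each with a rule-based
# reward, and normalise within the group; the normalised score is the advantage
# applied to every token of that completion in the policy-gradient update.
from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# e.g. 4 sampled answers to one maths problem, rewarded 1.0 if the final answer is correct
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```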
Foundation Models
Foundation models are large neural networks pretrained on massive datasets that can be adapted to many tasks via prompting or fine-tuning — the paradigm shift underlying modern AI engineering.
Hallucination
Hallucination is a fundamental property of LLMs (not a bug) — covering why it happens, six types, detection methods (faithfulness checks, self-consistency sampling), and six mitigation strategies with RAG as the most effective.
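A hedged sketch of self-consistency sampling as a detection signal: ask the same question several times at nonzero temperature and treat low answer agreement as a hallucination warning. `ask_model` is a hypothetical wrapper around whatever LLM API is in use.

```python
# Self-consistency as a hallucination signal: sample the same question several
# times and measure how often the most common answer appears.
from collections import Counter

def self_consistency(ask_model, question: str, n: int = 5) -> tuple[str, float]:
    answers = [ask_model(question, temperature=0.8) for _ in range(n)]
    top_answer, count = Counter(answers).most_common(1)[0]
    agreement = count / n          # low agreement suggests the model is guessing
    return top_answer, agreement

# usage: answer, score = self_consistency(ask_model, "In which year was X founded?")
# flag for review (or fall back to RAG) when the score drops below a threshold, e.g. 0.6
```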
Inference-Time Scaling (Test-Time Compute)
Allocating more compute at inference time — through sampling, search, or extended reasoning traces — produces quality gains that compound independently of training compute, with math and code tasks benefiting most.
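A hedged sketch of the simplest inference-time scaling recipe, best-of-N sampling against a verifier; `generate` and `score` are hypothetical stand-ins for a temperature-sampled LLM call and a task-specific checker (unit tests, exact-match grading, or a reward model).

```python
# Best-of-N sampling: spend more inference compute by drawing more candidates,
# then keep the one the verifier scores highest.
def best_of_n(generate, score, prompt: str, n: int = 16) -> str:
    candidates = [generate(prompt) for _ in range(n)]   # more compute: more samples
    return max(candidates, key=score)                   # keep the highest-scoring candidate

# quality typically improves with n on verifiable tasks (maths, code), at n times the cost
```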
LLM Model Families
The eight major LLM families (OpenAI, Anthropic, Google, Meta, Mistral, DeepSeek, Qwen, Cohere) compared by capability tier, licensing, and best use case.
ML Fundamentals
Traditional ML foundations — supervised learning (regression, classification), unsupervised learning (clustering, dimensionality reduction), and reinforcement learning — with key algorithms, evaluation metrics, and the ML lifecycle. Core material for AIF-C01 Domain 1.
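For orientation, a minimal supervised-learning round trip with scikit-learn (assumed available): split the data, fit a classifier, and evaluate with the kinds of metrics this entry covers.

```python
# Minimal supervised-learning example: train/test split, fit a classifier,
# report accuracy and a confusion matrix.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = clf.predict(X_test)

print("accuracy:", accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```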
Multi-head Latent Attention (MLA)
MLA compresses K and V into a single low-rank latent vector per token that is cached instead of the full K/V tensors, cutting KV cache size by roughly 93% vs standard MHA while preserving model quality — enabling 128K-context inference at scale.
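Back-of-the-envelope sizing shows where the saving comes from; the dimensions below are illustrative, not any published DeepSeek configuration.

```python
# KV-cache elements per token per layer: MHA caches full K and V for every head;
# MLA caches one shared low-rank latent (plus a small decoupled RoPE key).
def mha_cache_per_token(n_heads: int, head_dim: int) -> int:
    return 2 * n_heads * head_dim            # K and V elements per layer

def mla_cache_per_token(latent_dim: int, rope_dim: int) -> int:
    return latent_dim + rope_dim             # compressed latent (+ RoPE key) per layer

mha = mha_cache_per_token(n_heads=64, head_dim=128)      # 16384 elements
mla = mla_cache_per_token(latent_dim=512, rope_dim=64)   #   576 elements
print(f"MLA cache is {mla / mha:.1%} of MHA per token per layer")
```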
Small Language Models (SLMs)
Small language models (1B–14B parameters) run on consumer hardware and mobile devices; an SLM fine-tuned for a narrow task often beats frontier models at 1/100th the serving cost.
Tokenisation
LLMs read tokens, not text — the BPE algorithm, tiktoken and Anthropic tokenisers, the non-English cost penalty, and context-window budgeting at production scale.
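A short tiktoken sketch that makes the token-vs-character distinction and the non-English cost penalty concrete; the encoding name is the BPE vocabulary used by GPT-4-class models, and the Japanese sentence is an illustrative translation.

```python
# Token counts, not character counts, drive API cost and context budgeting.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["The quick brown fox jumps over the lazy dog.",
             "素早い茶色の狐がのろまな犬を飛び越える。"]:   # roughly the same sentence in Japanese
    tokens = enc.encode(text)
    print(f"{len(text):3d} chars -> {len(tokens):3d} tokens: {text}")
```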
Transformer Architecture
The transformer's core operations — scaled dot-product attention (O(n²) in sequence length), the KV cache, RoPE positional encoding, MoE routing, and Chinchilla scaling laws — and why each matters operationally.
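A minimal NumPy rendering of scaled dot-product attention, showing where the O(n²) term comes from; the dimensions are illustrative.

```python
# Scaled dot-product attention: the (n, n) score matrix is the quadratic term,
# while the cached K and V grow only linearly with context length.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (n, n) — the quadratic term
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over keys
    return weights @ V                                   # (n, d_v)

n, d = 8, 16
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, n, d))
print(scaled_dot_product_attention(Q, K, V).shape)       # (8, 16)
```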