The Axiom

Infra

19 pages

AI Gateway

An AI gateway is a proxy layer between your application and LLM providers that centralises auth, routing, failover, caching, and observability — so your app code talks to one endpoint regardless of which provider is called.

ai-gateway, litellm, portkey, kong
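
A minimal sketch of the pattern, assuming a gateway that exposes an OpenAI-compatible endpoint at a hypothetical local address. The base URL, virtual key, and model alias are placeholders; the point is that the app only ever talks to the gateway, which decides which provider serves the call.

```python
from openai import OpenAI

# The app talks to the gateway, never to a provider directly.
# Base URL and key are placeholders for your own gateway deployment.
client = OpenAI(
    base_url="http://localhost:4000/v1",  # hypothetical gateway address
    api_key="gateway-virtual-key",        # issued by the gateway, not a provider
)

# The model string is whatever alias the gateway maps to a real provider/model.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise this ticket in one line."}],
)
print(response.choices[0].message.content)
```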

AWS Bedrock AgentCore

AWS Bedrock AgentCore is a managed runtime for deploying, running, and monitoring AI agents at production scale — it handles memory, tool execution, session management, and observability so teams don't have to build agent infrastructure from scratch.

aws, bedrock, agents, agentcore

Cloud Platforms for AI Engineering

AWS (Bedrock + SageMaker), GCP (Vertex AI), and Azure (Azure OpenAI) each offer distinct AI stacks — choice depends on existing cloud contracts, compliance requirements, and whether you need frontier or self-hosted open models.

cloud, aws, gcp, azure

DeepSpeed ZeRO

Zero Redundancy Optimizer — Microsoft's distributed training system that partitions model training state across GPUs to eliminate memory redundancy.

deepspeed, zero, distributed-training, multi-gpu
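
A minimal sketch of wiring ZeRO into a PyTorch script, assuming a toy model and a hand-written config dict. Stage 3 partitions optimizer state, gradients, and parameters across the data-parallel ranks; the batch size, learning rate, and offload settings here are illustrative only.

```python
import torch
import deepspeed

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
)

# Illustrative config; real runs tune batch sizes, offload, and precision.
ds_config = {
    "train_batch_size": 32,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # partition optimizer state, grads, params
        "offload_optimizer": {"device": "cpu"},   # optional CPU offload of optimizer state
    },
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in an engine that manages the partitioning.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
```

Launched with the deepspeed CLI (e.g. `deepspeed train.py`), each rank then holds only its shard of the training state rather than a full copy.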

Deploying LLM Applications

LLM application deployment patterns covering Docker multi-stage builds, GitHub Actions CI/CD, and platform selection — Vercel for Next.js streaming, Fly.io for persistent FastAPI services, Modal for serverless GPU inference.

deployment, docker, github-actions, ci-cd

Experiment Tracking

Logging and comparing ML training runs. Distinct from production LLM observability: experiment tracking is for the training phase — comparing hyperparameter runs, catching overfitting, reproducing results.

weights-and-biases, mlflow, neptune, weave
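
A minimal Weights & Biases sketch with a stand-in training loop; the project name and metrics are illustrative. The same log-metrics-per-step pattern applies to MLflow or Neptune with their own clients.

```python
import random
import wandb

# Project name and hyperparameters are illustrative.
run = wandb.init(project="demo-finetune", config={"lr": 3e-4, "epochs": 3})

for epoch in range(run.config.epochs):
    train_loss = 1.0 / (epoch + 1) + random.random() * 0.05  # stand-in for a real loss
    eval_loss = train_loss + 0.02
    # Each logged dict becomes a point on the run's charts, keyed by step.
    wandb.log({"epoch": epoch, "train/loss": train_loss, "eval/loss": eval_loss})

run.finish()
```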

Flash Attention

IO-aware exact attention algorithm that reduces GPU memory usage from O(N²) to O(N) and achieves 2–10× speedup over standard attention. Standard in all modern LLM training and inference stacks.

flash-attention, attention, training, inference
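
Most code gets FlashAttention through a framework rather than calling the kernel directly. A minimal PyTorch sketch, assuming a recent version where `scaled_dot_product_attention` dispatches to a Flash-style fused kernel on supported GPUs; shapes are illustrative.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

# (batch, heads, sequence, head_dim)
q = torch.randn(1, 8, 2048, 64, device=device, dtype=dtype)
k = torch.randn(1, 8, 2048, 64, device=device, dtype=dtype)
v = torch.randn(1, 8, 2048, 64, device=device, dtype=dtype)

# Exact attention; on suitable GPUs PyTorch routes this to a fused FlashAttention-style
# kernel, so the full N x N score matrix is never materialised in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 2048, 64])
```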

GitHub Marketplace Listing and Billing

GitHub Marketplace supports free, flat-rate, and per-unit billing. Paid plans require 100+ App installations. You must handle purchase lifecycle webhooks. Verified Publisher status is separate from listing.

github, marketplace, billing, listing
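
A sketch of the webhook side, assuming a small Flask app receiving GitHub's `marketplace_purchase` event. Signature verification is omitted, and the access-granting logic is just a print placeholder for whatever billing state you keep.

```python
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/github", methods=["POST"])
def github_webhook():
    # GitHub names the event in this header; signature checking is omitted here.
    if request.headers.get("X-GitHub-Event") != "marketplace_purchase":
        return "", 204

    payload = request.get_json()
    action = payload.get("action")  # e.g. "purchased", "changed", "cancelled"
    account = payload["marketplace_purchase"]["account"]["login"]
    plan = payload["marketplace_purchase"]["plan"]["name"]

    # Placeholder for your own entitlement logic.
    if action == "purchased":
        print(f"grant access: {account} on plan {plan}")
    elif action == "cancelled":
        print(f"revoke access: {account}")
    return "", 200
```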

GPU Hardware for LLMs

GPU selection guide for LLM inference and training — VRAM is the binding constraint (2 bytes per parameter in BF16), with H100 at ~3x A100 throughput for inference and RTX 4090 as the consumer sweet spot for fine-tuning.

gpu, hardware, vram, h100
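
The weight-memory arithmetic is simple enough to sketch. The function below covers only the weights (2 bytes per parameter in BF16, 1 in INT8, 0.5 in 4-bit) and ignores KV cache and activation overhead, which add on top.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_vram_gib(num_params_billion: float, dtype: str = "bf16") -> float:
    """Approximate VRAM needed just to hold the model weights."""
    return num_params_billion * 1e9 * BYTES_PER_PARAM[dtype] / 1024**3

# A 70B model in BF16 needs ~130 GiB for weights alone (multiple GPUs),
# while a 4-bit quantisation of the same model fits in ~33 GiB.
print(f"{weight_vram_gib(70, 'bf16'):.0f} GiB")  # ~130
print(f"{weight_vram_gib(70, 'int4'):.0f} GiB")  # ~33
```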

HuggingFace

HuggingFace is the central infrastructure of the open-source LLM ecosystem — 700K+ models and 200K+ datasets on the Hub, with the transformers/datasets/PEFT/TRL library stack underpinning essentially all open model work.

huggingface, transformers, datasets, hub
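
A minimal sketch of the core library workflow: loading a Hub dataset through `datasets` and a Hub model through `transformers`. The dataset and model ids are just examples of public Hub repos.

```python
from datasets import load_dataset
from transformers import pipeline

# Any public Hub repo id works here; these are illustrative choices.
dataset = load_dataset("imdb", split="train[:3]")
generator = pipeline("text-generation", model="distilgpt2")

for row in dataset:
    prompt = row["text"][:100]
    out = generator(prompt, max_new_tokens=20, do_sample=False)
    print(out[0]["generated_text"])
```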

Inference Serving

Production LLM inference is memory-bandwidth-bound, not compute-bound — vLLM solves this with paged attention (2-4x throughput over naive serving) and continuous batching; llama.cpp handles quantised local inference.

inference, vllm, llama-cpp, serving
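
A minimal vLLM sketch for offline batch generation, assuming a GPU box and a model id available from the Hub (the id here is illustrative). The same engine also runs behind vLLM's OpenAI-compatible server.

```python
from vllm import LLM, SamplingParams

# Paged attention and continuous batching are handled by the engine,
# so passing many prompts to generate() at once is the intended usage.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Explain paged attention in one sentence.",
    "Why is LLM inference memory-bandwidth-bound?",
]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```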

LiteLLM

LiteLLM is a Python SDK and self-hosted proxy that gives a single OpenAI-compatible interface to 100+ LLM providers — Claude, GPT, Gemini, Bedrock, Mistral, and more. Drop it in to switch providers without rewriting code.

litellm, provider-abstraction, ai-gateway, openai-compatible
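
A minimal SDK sketch: the same OpenAI-style call shape works across providers by changing the model string, assuming the matching provider API keys are set in the environment. The model names below are examples.

```python
from litellm import completion

messages = [{"role": "user", "content": "Give me one sentence on AI gateways."}]

# Same call, different providers; only the model string changes.
for model in ["gpt-4o-mini", "claude-3-5-haiku-20241022", "gemini/gemini-1.5-flash"]:
    response = completion(model=model, messages=messages)
    print(model, "->", response.choices[0].message.content)
```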

LLM Response Caching

LLM response caching combines semantic caching (Redis + vector similarity, eliminates API calls on hits) with Anthropic prompt caching (reduces token cost to 0.1x on repeated prefixes) — complementary strategies at different layers.

caching, redis, semantic-cache, llm
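
A minimal sketch of the prompt-caching half, assuming the Anthropic Python SDK. The `cache_control` marker flags the long system prefix as cacheable, so repeated calls with the same prefix are billed at the reduced cache-read rate; the context string is an illustrative placeholder.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

long_context = "...many thousands of tokens of policy documents..."  # illustrative

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=300,
    system=[
        {
            "type": "text",
            "text": long_context,
            # Marks the prefix as cacheable; later calls reuse it at reduced cost.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Which policy covers refunds?"}],
)
print(response.content[0].text)
```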

ML Pipeline Orchestration

ML pipeline orchestration automates the multi-step ML workflow (data → train → eval → deploy) with reproducibility, lineage, and scheduling — distinct from agent orchestration, which coordinates LLM tool calls at runtime.

mlops, pipeline, zenml, metaflow
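
A minimal ZenML-style sketch, assuming its `@step`/`@pipeline` decorators and a default local stack; the step bodies are toy stand-ins. Each run is recorded with its artifacts, which is what gives you lineage and reproducibility across data → train → eval.

```python
from zenml import pipeline, step

@step
def load_data() -> list:
    return [0.1, 0.4, 0.35, 0.8]

@step
def train_model(data: list) -> float:
    # Stand-in for real training; returns a "model quality" score.
    return sum(data) / len(data)

@step
def evaluate(score: float) -> None:
    print(f"eval score: {score:.3f}")

@pipeline
def training_pipeline():
    data = load_data()
    score = train_model(data)
    evaluate(score)

if __name__ == "__main__":
    training_pipeline()  # each run is tracked by the orchestrator
```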

Model Routing

Model routing dynamically directs each LLM request to the cheapest model capable of answering it — trained classifiers or cascade strategies cut frontier-model call volume by 45-98% with under 5% quality loss.

model-routing, cost-optimisation, routellm, inference
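
A toy cascade sketch in plain Python. The two model calls and the confidence check are hypothetical placeholders, but the shape is the common one: answer with the cheap model first, escalate to the frontier model only when a confidence threshold is not met.

```python
def cheap_model(prompt: str) -> tuple[str, float]:
    # Hypothetical small-model call returning (answer, self-reported confidence).
    return "short answer", 0.62

def frontier_model(prompt: str) -> str:
    # Hypothetical expensive frontier-model call.
    return "careful long answer"

def route(prompt: str, threshold: float = 0.8) -> str:
    """Cascade routing: escalate only when the cheap model is not confident enough."""
    answer, confidence = cheap_model(prompt)
    if confidence >= threshold:
        return answer
    return frontier_model(prompt)

print(route("What is 2 + 2?"))
```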

pgvector

pgvector adds vector similarity search to PostgreSQL — nearest-neighbour queries over embedding columns alongside ordinary SQL, no separate vector database required.

pgvector, postgresql, vector-search, embeddings
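
A minimal sketch with `psycopg` against a Postgres instance that has the extension available; the connection string, table, and 3-dimensional vectors are illustrative. `<->` is pgvector's L2 distance operator (`<=>` gives cosine distance).

```python
import psycopg

# Connection string is a placeholder for your own database.
with psycopg.connect("postgresql://localhost/mydb") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id bigserial PRIMARY KEY, embedding vector(3))"
    )
    conn.execute(
        "INSERT INTO items (embedding) VALUES ('[1,0,0]'), ('[0.9,0.1,0]'), ('[0,1,0]')"
    )

    # Nearest-neighbour query by L2 distance, alongside ordinary SQL.
    rows = conn.execute(
        "SELECT id, embedding <-> '[1,0,0]' AS distance FROM items ORDER BY distance LIMIT 2"
    ).fetchall()
    print(rows)
```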

Serverless Inference Platforms

Serverless inference for open-weight models is a distinct market from managed proprietary APIs — Together AI leads on model breadth, Fireworks on raw latency, Groq/Cerebras on exotic silicon, Modal/Replicate on custom weights.

inference, serving, together-ai, fireworks

Vector Stores

Vector stores are the storage layer of RAG systems — pgvector for existing Postgres stacks, Chroma for local dev, Qdrant for production self-hosted, Pinecone for zero-ops managed, Weaviate for built-in hybrid search.

vector-store, embeddings, pgvector, chroma
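
A minimal Chroma sketch for the local-dev case: the in-process client needs no server, and the collection name and documents are illustrative. The text query is embedded with the collection's default embedding function.

```python
import chromadb

client = chromadb.Client()  # in-memory, in-process; fine for local development
collection = client.get_or_create_collection(name="docs")

collection.add(
    ids=["a", "b", "c"],
    documents=[
        "pgvector adds vector search to Postgres.",
        "Qdrant is a self-hosted vector database.",
        "Weaviate ships hybrid BM25 + dense search.",
    ],
)

# Nearest documents to the query, by embedding similarity.
results = collection.query(query_texts=["vector search inside Postgres"], n_results=1)
print(results["documents"][0])
```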

Weaviate

Open-source vector database with built-in hybrid search (BM25 + dense vector), a GraphQL API, and first-class support for module-based vectorisation.

weaviate, vector-store, hybrid-search, graphql