Deploying LLM Applications
LLM application deployment patterns covering Docker multi-stage builds, GitHub Actions CI/CD, and platform selection — Vercel for Next.js streaming, Fly.io for persistent FastAPI services, Modal for serverless GPU inference.
Getting AI applications from laptop to production. The stack: containerise with Docker, CI/CD with GitHub Actions, and deploy to the platform that matches your scale and ops budget.
Docker for LLM Services
API Service (FastAPI + Anthropic)
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
# Install uv for fast dependency installation
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY src/ ./src/
# Non-root user for security
RUN adduser --disabled-password --gecos '' appuser
USER appuser
EXPOSE 8000
CMD ["uv", "run", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
# docker-compose.yml — local dev with dependencies
services:
api:
build: .
ports: ["8000:8000"]
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- DATABASE_URL=postgresql://postgres:password@db:5432/myapp
depends_on:
db:
condition: service_healthy
db:
image: pgvector/pgvector:pg16
environment:
POSTGRES_PASSWORD: password
POSTGRES_DB: myapp
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
retries: 5
volumes:
- pgdata:/var/lib/postgresql/data
volumes:
pgdata:
GPU Inference Container
# CUDA base for running open models
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .
# Model weights mounted at runtime, not baked in
ENV MODEL_PATH=/models/llama-3-8b-instruct
CMD ["uv", "run", "python", "-m", "src.serve"]
# Run with GPU access
docker run --gpus all \
-v /data/models:/models \
-p 8000:8000 \
my-inference-service
GitHub Actions CI/CD
Standard Pipeline
# .github/workflows/deploy.yml
name: Deploy
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v3
with:
version: "latest"
- name: Install dependencies
run: uv sync --frozen
- name: Run tests (unit only, no real API calls)
run: uv run pytest -m "not integration" --tb=short
- name: Type check
run: uv run mypy src/
- name: Lint
run: uv run ruff check src/
integration-test:
runs-on: ubuntu-latest
needs: test
if: github.event_name == 'push'
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v3
- run: uv sync --frozen
- name: Run integration tests
run: uv run pytest -m "integration" --tb=short
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
deploy:
runs-on: ubuntu-latest
needs: [test, integration-test]
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v4
- name: Deploy to Fly.io
uses: superfly/flyctl-actions@v1
with:
args: "deploy --remote-only"
env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
Docker Build + Push
build-push:
runs-on: ubuntu-latest
needs: test
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push
uses: docker/build-push-action@v5
with:
push: true
tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max
Deployment Platforms
Vercel (Next.js / edge functions)
Best for: Next.js frontends with LLM-powered APIs. Streaming works out of the box.
# Install Vercel CLI
npm i -g vercel
# Deploy
vercel --prod
# Environment variables
vercel env add ANTHROPIC_API_KEY production
// app/api/chat/route.ts — streams to browser automatically
import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
export const runtime = 'edge' // runs on Vercel Edge, low latency
export const maxDuration = 60 // 60s timeout for streaming
export async function POST(req: Request) {
const { messages } = await req.json()
const result = streamText({ model: anthropic('claude-sonnet-4-6'), messages })
return result.toDataStreamResponse()
}
Fly.io (Persistent servers)
Best for: FastAPI services, anything needing persistent connections or background workers.
# fly.toml
app = "my-llm-api"
primary_region = "lhr" # London
[build]
dockerfile = "Dockerfile"
[env]
PORT = "8000"
[http_service]
internal_port = 8000
force_https = true
auto_stop_machines = "stop" # scale to zero when idle
auto_start_machines = true
min_machines_running = 0
[[vm]]
memory = "2gb"
cpu_kind = "shared"
cpus = 2
fly launch # first time setup
fly deploy # subsequent deployments
fly secrets set ANTHROPIC_API_KEY=sk-ant-...
Modal (Serverless GPU)
Best for: open model inference, fine-tuning jobs, batch processing. Serverless with GPU.
import modal
app = modal.App("llm-inference")
image = modal.Image.debian_slim().pip_install("vllm", "fastapi")
@app.cls(
image=image,
gpu="A100",
container_idle_timeout=300, # keep warm for 5 min
secrets=[modal.Secret.from_name("my-secrets")],
)
class InferenceService:
@modal.enter()
def load_model(self):
from vllm import LLM
self.llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
@modal.method()
def generate(self, prompt: str) -> str:
outputs = self.llm.generate([prompt])
return outputs[0].outputs[0].text
@app.local_entrypoint()
def main():
service = InferenceService()
print(service.generate.remote("Explain RAG in one paragraph."))
modal deploy inference_service.py # deploy
modal run inference_service.py # test locally
Railway / Render
Simpler alternatives to Fly.io for teams that want less configuration:
# Railway
railway login
railway up
# Render: connect GitHub repo in dashboard, add env vars, done
Environment Variables and Secrets
Never hardcode API keys. Standard pattern:
# src/config.py
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
anthropic_api_key: str
database_url: str
environment: str = "development"
log_level: str = "INFO"
class Config:
env_file = ".env" # local dev
env_file_encoding = "utf-8"
settings = Settings()
# .env (never commit)
ANTHROPIC_API_KEY=sk-ant-...
DATABASE_URL=postgresql://...
# Production: set via platform secrets
# Fly.io: fly secrets set KEY=value
# Vercel: vercel env add KEY production
# GitHub Actions: Settings → Secrets → Actions
Health Checks and Readiness
# FastAPI health endpoint — required by most deployment platforms
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import anthropic
app = FastAPI()
@app.get("/health")
async def health():
return {"status": "ok"}
@app.get("/ready")
async def ready():
# Check dependencies are available
try:
client = anthropic.Anthropic()
# Cheap API check (count tokens, not a full message)
client.messages.count_tokens(
model="claude-haiku-4-5-20251001",
messages=[{"role": "user", "content": "ping"}],
)
return {"status": "ready", "api": "ok"}
except Exception as e:
return {"status": "not ready", "error": str(e)}, 503Key Facts
- Use uv for dependency installation in Docker; it replaces pip + venv and is 10-100x faster
- Never bake model weights into Docker images — mount them at runtime via -v /data/models:/models
- Run containers as a non-root user for security (adduser --disabled-password)
- Vercel: export const runtime = 'edge' for lowest latency; maxDuration = 60 for streaming
- Fly.io: auto_stop_machines = "stop" scales to zero; min_machines_running = 0 for cost savings
- Modal serverless GPU cost: ~$0.00164/s billed per request; cold start 2-5s
- API keys go via platform secrets (fly secrets set, vercel env add), never in source code
Common Failure Cases
Docker image builds fine locally but fails in CI with python: not found
Why: the Dockerfile uses python but the base image only has python3; macOS has a symlink, the container doesn't.
Detect: docker build fails in GitHub Actions with executable file not found; works on developer laptop.
Fix: use python3 explicitly in CMD/ENTRYPOINT; or add RUN ln -s /usr/bin/python3 /usr/bin/python to the Dockerfile.
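A sketch of both fixes (the symlink path assumes a Debian/Ubuntu base image such as the CUDA image above):
# Fix 1: call the binary that actually exists
CMD ["python3", "-m", "src.serve"]
# Fix 2: create the symlink the code expects (add after the apt-get install line)
RUN ln -s /usr/bin/python3 /usr/bin/python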
Fly.io machine doesn't scale to zero because a background connection is held open
Why: auto_stop_machines = "stop" only works when the machine has no active connections; a Redis connection or database pool held open prevents scale-down.
Detect: Fly machines never show as stopped despite no traffic; monthly compute cost is higher than expected.
Fix: ensure connection pools have max_idle_time configured; close background connections when no requests are in-flight.
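A sketch of the pool configuration, assuming asyncpg as the Postgres driver (the parameter names are asyncpg's; the values are illustrative):
import asyncpg

async def make_pool() -> asyncpg.Pool:
    # min_size=0 lets the pool drop to zero open connections when idle;
    # idle connections are closed after 60s instead of being held open forever.
    return await asyncpg.create_pool(
        dsn="postgresql://postgres:password@db:5432/myapp",
        min_size=0,
        max_size=5,
        max_inactive_connection_lifetime=60,
    )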
Modal cold start adds 5-15 seconds for first request after idle period
Why: container_idle_timeout=300 keeps the container warm for 5 minutes; after that, the next request waits for a cold start including model loading.
Detect: p99 latency has a bimodal distribution — fast (<1s) and slow (5-15s); the slow tail corresponds to cold starts.
Fix: increase container_idle_timeout for latency-sensitive workloads; or keep a warm instance with a periodic ping; or accept cold starts and set user expectations.
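A sketch of the keep-warm ping, assuming it lives in the same inference_service.py as the class above (the function name and interval are illustrative):
@app.function(schedule=modal.Period(minutes=4))
def keep_warm_ping():
    # Fires a trivial request every 4 minutes, just under the 5-minute
    # container_idle_timeout, so a warm container stays available.
    InferenceService().generate.remote("ping")
This trades continuous idle billing for lower tail latency, so it only pays off when cold-start time matters more than the keep-warm cost.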
GitHub Actions deploys succeed but production still runs the old Docker image
Why: the deployment step pushed a new image but the service was not restarted; or the image tag is latest and the platform cached the old image.
Detect: docker inspect on the running container shows the old image digest; the new code is not reflected in behavior.
Fix: use commit SHA tags (ghcr.io/org/repo:${{ github.sha }}) not latest; ensure the deploy step restarts the service after image push.
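A sketch of a deploy step that pins the running image to the commit SHA, assuming the Fly app runs the prebuilt image from the build-push job above and Fly can pull from the registry (--image is flyctl's flag for deploying a prebuilt image instead of building):
- name: Deploy pinned image to Fly.io
  uses: superfly/flyctl-actions@v1
  with:
    args: "deploy --image ghcr.io/${{ github.repository }}:${{ github.sha }}"
  env:
    FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}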
Secrets set in Fly.io are not reflected after fly secrets set
Why: fly secrets set requires a machine restart to take effect; the current machines still have the old secret values.
Detect: log output shows the old API key being used despite fly secrets set reporting success.
Fix: run fly machines restart after setting secrets; or deploy a new version (fly deploy) which restarts machines automatically.
Connections
- infra/inference-serving — serving open models (vLLM) vs API-based serving
- infra/gpu-hardware — GPU options for self-hosted model serving on cloud instances
- web-frameworks/fastapi — the API framework used in the deployment examples
- web-frameworks/nextjs — Next.js streaming deployment on Vercel via Vercel AI SDK
- observability/tracing — what to instrument once deployed (spans, latency, token cost)
- infra/cloud-platforms — Vercel, Fly, Modal, Railway, AWS, GCP comparison for these workloads
Open Questions
- What is the practical cold start time difference between Modal and Fly.io for GPU workloads?
- When does Kubernetes become the right choice over Fly.io for LLM API services?
- How do you handle health checks for models with slow startup (large VRAM loads) without Kubernetes-style init containers?