Deploying LLM Applications
LLM application deployment patterns covering Docker multi-stage builds, GitHub Actions CI/CD, and platform selection — Vercel for Next.js streaming, Fly.io for persistent FastAPI services, Modal for serverless GPU inference.
Getting AI applications from laptop to production. The stack: containerise with Docker, CI/CD with GitHub Actions, and deploy to the platform that matches your scale and ops budget.
Docker for LLM Services
API Service (FastAPI + Anthropic)
# Dockerfile
FROM python:3.12-slim
WORKDIR /app
# Install uv for fast dependency installation
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY src/ ./src/
# Non-root user for security
RUN adduser --disabled-password --gecos '' appuser
USER appuser
EXPOSE 8000
CMD ["uv", "run", "uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
# docker-compose.yml — local dev with dependencies
services:
api:
build: .
ports: ["8000:8000"]
environment:
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
- DATABASE_URL=postgresql://postgres:password@db:5432/myapp
depends_on:
db:
condition: service_healthy
db:
image: pgvector/pgvector:pg16
environment:
POSTGRES_PASSWORD: password
POSTGRES_DB: myapp
healthcheck:
test: ["CMD-SHELL", "pg_isready -U postgres"]
interval: 5s
retries: 5
volumes:
- pgdata:/var/lib/postgresql/data
volumes:
pgdata:
GPU Inference Container
# CUDA base for running open models
FROM nvidia/cuda:12.1.1-cudnn8-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv
WORKDIR /app
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-dev
COPY . .
# Model weights mounted at runtime, not baked in
ENV MODEL_PATH=/models/llama-3-8b-instruct
CMD ["uv", "run", "python", "-m", "src.serve"]
# Run with GPU access
docker run --gpus all \
-v /data/models:/models \
-p 8000:8000 \
my-inference-service
GitHub Actions CI/CD
Standard Pipeline
# .github/workflows/deploy.yml
name: Deploy
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v3
with:
version: "latest"
- name: Install dependencies
run: uv sync --frozen
- name: Run tests (unit only, no real API calls)
run: uv run pytest -m "not integration" --tb=short
- name: Type check
run: uv run mypy src/
- name: Lint
run: uv run ruff check src/
integration-test:
runs-on: ubuntu-latest
needs: test
if: github.event_name == 'push'
steps:
- uses: actions/checkout@v4
- uses: astral-sh/setup-uv@v3
- run: uv sync --frozen
- name: Run integration tests
run: uv run pytest -m "integration" --tb=short
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
deploy:
runs-on: ubuntu-latest
needs: [test, integration-test]
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v4
- name: Deploy to Fly.io
uses: superfly/flyctl-actions@v1
with:
args: "deploy --remote-only"
env:
FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}
Docker Build + Push
build-push:
runs-on: ubuntu-latest
needs: test
steps:
- uses: actions/checkout@v4
- uses: docker/setup-buildx-action@v3
- name: Login to GitHub Container Registry
uses: docker/login-action@v3
with:
registry: ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push
uses: docker/build-push-action@v5
with:
push: true
tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
cache-from: type=gha
cache-to: type=gha,mode=max
Deployment Platforms
Vercel (Next.js / edge functions)
Best for: Next.js frontends with LLM-powered APIs. Streaming works out of the box.
# Install Vercel CLI
npm i -g vercel
# Deploy
vercel --prod
# Environment variables
vercel env add ANTHROPIC_API_KEY production
// app/api/chat/route.ts — streams to browser automatically
import { streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'
export const runtime = 'edge' // runs on Vercel Edge, low latency
export const maxDuration = 60 // 60s timeout for streaming
export async function POST(req: Request) {
const { messages } = await req.json()
const result = streamText({ model: anthropic('claude-sonnet-4-6'), messages })
return result.toDataStreamResponse()
}
Fly.io (Persistent servers)
Best for: FastAPI services, anything needing persistent connections or background workers.
# fly.toml
app = "my-llm-api"
primary_region = "lhr" # London
[build]
dockerfile = "Dockerfile"
[env]
PORT = "8000"
[http_service]
internal_port = 8000
force_https = true
auto_stop_machines = "stop" # scale to zero when idle
auto_start_machines = true
min_machines_running = 0
[[vm]]
memory = "2gb"
cpu_kind = "shared"
cpus = 2
fly launch # first time setup
fly deploy # subsequent deployments
fly secrets set ANTHROPIC_API_KEY=sk-ant-...
Modal (Serverless GPU)
Best for: open model inference, fine-tuning jobs, batch processing. Serverless with GPU.
import modal
app = modal.App("llm-inference")
image = modal.Image.debian_slim().pip_install("vllm", "fastapi")
@app.cls(
image=image,
gpu="A100",
container_idle_timeout=300, # keep warm for 5 min
secrets=[modal.Secret.from_name("my-secrets")],
)
class InferenceService:
@modal.enter()
def load_model(self):
from vllm import LLM
self.llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
@modal.method()
def generate(self, prompt: str) -> str:
outputs = self.llm.generate([prompt])
return outputs[0].outputs[0].text
@app.local_entrypoint()
def main():
service = InferenceService()
print(service.generate.remote("Explain RAG in one paragraph."))
modal deploy inference_service.py # deploy
modal run inference_service.py # test locally
Railway / Render
Simpler alternatives to Fly.io for teams that want less configuration:
# Railway
railway login
railway up
# Render: connect GitHub repo in dashboard, add env vars, done
Environment Variables and Secrets
Never hardcode API keys. Standard pattern:
# src/config.py
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
anthropic_api_key: str
database_url: str
environment: str = "development"
log_level: str = "INFO"
class Config:
env_file = ".env" # local dev
env_file_encoding = "utf-8"
settings = Settings()
# .env (never commit)
ANTHROPIC_API_KEY=sk-ant-...
DATABASE_URL=postgresql://...
# Production: set via platform secrets
# Fly.io: fly secrets set KEY=value
# Vercel: vercel env add KEY production
# GitHub Actions: Settings → Secrets → Actions
Health Checks and Readiness
# FastAPI health endpoint — required by most deployment platforms
from fastapi import FastAPI
from fastapi.responses import JSONResponse
import anthropic
app = FastAPI()
@app.get("/health")
async def health():
return {"status": "ok"}
@app.get("/ready")
async def ready():
# Check dependencies are available
try:
client = anthropic.Anthropic()
# Cheap API check (count tokens, not a full message)
client.messages.count_tokens(
model="claude-haiku-4-5-20251001",
messages=[{"role": "user", "content": "ping"}],
)
return {"status": "ready", "api": "ok"}
except Exception as e:
return {"status": "not ready", "error": str(e)}, 503Key Facts
- Use uv for dependency installation in Docker; it replaces pip + venv and is 10-100x faster
- Never bake model weights into Docker images — mount them at runtime via -v /data/models:/models
- Run containers as a non-root user for security (adduser --disabled-password)
- Vercel: export const runtime = 'edge' for lowest latency; maxDuration = 60 for streaming
- Fly.io: auto_stop_machines = "stop" scales to zero; min_machines_running = 0 for cost savings
- Modal serverless GPU cost: ~$0.00164/s billed per request; cold start 2-5s
- API keys go via platform secrets (fly secrets set, vercel env add), never in source code
Common Failure Cases
Docker image builds fine locally but fails in CI with python: not found
Why: the Dockerfile uses python but the base image only has python3; macOS has a symlink, the container doesn't.
Detect: docker build fails in GitHub Actions with executable file not found; works on developer laptop.
Fix: use python3 explicitly in CMD/ENTRYPOINT; or add RUN ln -s /usr/bin/python3 /usr/bin/python to the Dockerfile.
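A sketch of both fixes (the symlink path assumes a Debian/Ubuntu base image such as the CUDA image above):
# Fix 1: call the binary that actually exists
CMD ["python3", "-m", "src.serve"]
# Fix 2: create the symlink the code expects (add after the apt-get install line)
RUN ln -s /usr/bin/python3 /usr/bin/python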
Fly.io machine doesn't scale to zero because a background connection is held open
Why: auto_stop_machines = "stop" only works when the machine has no active connections; a Redis connection or database pool held open prevents scale-down.
Detect: Fly machines never show as stopped despite no traffic; monthly compute cost is higher than expected.
Fix: ensure connection pools have max_idle_time configured; close background connections when no requests are in-flight.
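A sketch of the pool configuration, assuming asyncpg as the Postgres driver (the parameter names are asyncpg's; the values are illustrative):
import asyncpg

async def make_pool() -> asyncpg.Pool:
    # min_size=0 lets the pool drop to zero open connections when idle;
    # idle connections are closed after 60s instead of being held open forever.
    return await asyncpg.create_pool(
        dsn="postgresql://postgres:password@db:5432/myapp",
        min_size=0,
        max_size=5,
        max_inactive_connection_lifetime=60,
    )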
Modal cold start adds 5-15 seconds for first request after idle period
Why: container_idle_timeout=300 keeps the container warm for 5 minutes; after that, the next request waits for a cold start including model loading.
Detect: p99 latency has a bimodal distribution — fast (<1s) and slow (5-15s); the slow tail corresponds to cold starts.
Fix: increase container_idle_timeout for latency-sensitive workloads; or keep a warm instance with a periodic ping; or accept cold starts and set user expectations.
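A sketch of the keep-warm ping, assuming it lives in the same inference_service.py as the class above (the function name and interval are illustrative):
@app.function(schedule=modal.Period(minutes=4))
def keep_warm_ping():
    # Fires a trivial request every 4 minutes, just under the 5-minute
    # container_idle_timeout, so a warm container stays available.
    InferenceService().generate.remote("ping")
This trades continuous idle billing for lower tail latency, so it only pays off when cold-start time matters more than the keep-warm cost.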
GitHub Actions deploys succeed but production still runs the old Docker image
Why: the deployment step pushed a new image but the service was not restarted; or the image tag is latest and the platform cached the old image.
Detect: docker inspect on the running container shows the old image digest; the new code is not reflected in behavior.
Fix: use commit SHA tags (ghcr.io/org/repo:${{ github.sha }}) not latest; ensure the deploy step restarts the service after image push.
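A sketch of a deploy step that pins the running image to the commit SHA, assuming the Fly app runs the prebuilt image from the build-push job above and Fly can pull from the registry (--image is flyctl's flag for deploying a prebuilt image instead of building):
- name: Deploy pinned image to Fly.io
  uses: superfly/flyctl-actions@v1
  with:
    args: "deploy --image ghcr.io/${{ github.repository }}:${{ github.sha }}"
  env:
    FLY_API_TOKEN: ${{ secrets.FLY_API_TOKEN }}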
Secrets set in Fly.io are not reflected after fly secrets set
Why: fly secrets set requires a machine restart to take effect; the current machines still have the old secret values.
Detect: log output shows the old API key being used despite fly secrets set reporting success.
Fix: run fly machines restart after setting secrets; or deploy a new version (fly deploy) which restarts machines automatically.
Connections
- infra/inference-serving — serving open models (vLLM) vs API-based serving
- infra/gpu-hardware — GPU options for self-hosted model serving on cloud instances
- web-frameworks/fastapi — the API framework used in the deployment examples
- web-frameworks/nextjs — Next.js streaming deployment on Vercel via Vercel AI SDK
- observability/tracing — what to instrument once deployed (spans, latency, token cost)
- infra/cloud-platforms — Vercel, Fly, Modal, Railway, AWS, GCP comparison for these workloads
Open Questions
- What is the practical cold start time difference between Modal and Fly.io for GPU workloads?
- When does Kubernetes become the right choice over Fly.io for LLM API services?
- How do you handle health checks for models with slow startup (large VRAM loads) without Kubernetes-style init containers?