AI Gateway

An AI gateway is a proxy layer between your application and LLM providers that centralises auth, routing, failover, caching, and observability — so your app code talks to one endpoint regardless of which provider is called.

This page covers the category and compares the main options. For LiteLLM implementation detail, see infra/litellm. For difficulty-based model routing (cheap vs frontier models by query difficulty), see infra/model-routing.


What an AI Gateway Is

Application code
      |
      v
[ AI Gateway ]   ← single endpoint; handles auth, retries, caching, logging
      |
  +---+----+--------+
  |        |        |
OpenAI   Claude  Bedrock   (any provider)

Your application issues standard LLM calls to the gateway. The gateway handles the concerns your app code should not need to care about:

  • Auth — one API key to the gateway; the gateway holds provider credentials
  • Routing and failover — if OpenAI is overloaded, fall back to Azure or Bedrock transparently
  • Semantic caching — return a cached response for semantically equivalent queries; eliminates API calls entirely on hits; see infra/caching
  • Budget limits and cost attribution — hard stops per team, per user, or per feature; tracked at the gateway layer
  • Observability — logs every request with latency, token count, and cost before forwarding; feeds observability/platforms
  • Rate limiting — protect downstream providers from runaway agent loops
  • Guardrails / PII filtering — inspect and sanitise inputs and outputs before they reach the model

Without a gateway, each application service manages provider credentials, retry logic, fallback chains, and cost tracking independently. At team scale this becomes unmaintainable.
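
From the application side, all of this is just an OpenAI-compatible call pointed at one endpoint. A minimal sketch, assuming a hypothetical internal gateway URL and a gateway-issued key (both placeholders, not any specific product's values):

  from openai import OpenAI

  # One key issued by the gateway; provider credentials stay inside the gateway.
  # The base URL is a placeholder for wherever your gateway is deployed.
  client = OpenAI(
      base_url="https://ai-gateway.internal.example/v1",
      api_key="GATEWAY_VIRTUAL_KEY",
  )

  # A standard chat completion call; routing, retries, caching, and logging
  # all happen inside the gateway, not in application code.
  response = client.chat.completions.create(
      model="gpt-4o",  # an alias the gateway maps to an actual deployment
      messages=[{"role": "user", "content": "Summarise this support ticket."}],
  )
  print(response.choices[0].message.content)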


Gateway vs Model Router

These are complementary layers, not alternatives:

Concern                                       | AI Gateway | Model Router
Auth and credential management                | Yes        | No
Retry and failover                            | Yes        | No
Semantic caching                              | Yes        | No
Observability and cost tracking               | Yes        | Partial
Route by query difficulty (cheap vs frontier) | No         | Yes
Budget limits                                 | Yes        | No
Guardrails and PII filtering                  | Some       | No

A model router (see infra/model-routing) decides which model tier to call based on query difficulty. A gateway handles how that call is made — reliably, cheaply, and with full audit trail. In production they stack: the router picks the model; the gateway executes the call.
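
A rough sketch of that stacking, with a toy length check standing in for a real difficulty router (the classifier, model aliases, and gateway URL are illustrative only):

  from openai import OpenAI

  gateway = OpenAI(
      base_url="https://ai-gateway.internal.example/v1",  # placeholder gateway endpoint
      api_key="GATEWAY_VIRTUAL_KEY",
  )

  def pick_model(query: str) -> str:
      # Toy difficulty router; a real one uses a trained classifier or heuristics.
      return "gpt-4o" if len(query) > 400 else "gpt-4o-mini"

  def ask(query: str) -> str:
      model = pick_model(query)                    # router picks the tier
      response = gateway.chat.completions.create(  # gateway executes the call
          model=model,
          messages=[{"role": "user", "content": query}],
      )
      return response.choices[0].message.content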


Options

LiteLLM

Open-source (MIT). Runs as a Python library (in-process) or a self-hosted proxy server (AI gateway mode). Provides a single OpenAI-compatible API endpoint across 100+ providers.

Strengths:

  • Provider coverage is the best in class — Claude, OpenAI, Gemini, Bedrock, Mistral, HuggingFace, and more
  • Model aliasing: clients use stable names like gpt-4o; the proxy maps them to actual deployments
  • Router with retry, fallback, and load balancing across deployments of the same alias (sketched below)
  • Virtual keys for team access control — each team/service gets its own key with its own spend limit
  • Budget enforcement: hard-stop users or keys when spend limit is hit
  • Integrates with Langfuse, LangSmith, Helicone for observability callbacks
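
A minimal sketch of the in-process Router with two deployments behind one alias; the model_list shape follows LiteLLM's documented pattern, but verify exact parameters against infra/litellm:

  from litellm import Router

  # Two deployments share the alias "gpt-4o"; the Router load-balances across
  # them and falls back when one errors or hits a rate limit.
  router = Router(
      model_list=[
          {
              "model_name": "gpt-4o",  # stable alias clients use
              "litellm_params": {
                  "model": "openai/gpt-4o",
                  "api_key": "OPENAI_API_KEY",
              },
          },
          {
              "model_name": "gpt-4o",
              "litellm_params": {
                  "model": "azure/gpt-4o-deployment",  # illustrative deployment name
                  "api_key": "AZURE_API_KEY",
                  "api_base": "https://example.openai.azure.com",
              },
          },
      ],
      num_retries=2,
  )

  response = router.completion(
      model="gpt-4o",  # resolved against the alias, not a hard-coded provider
      messages=[{"role": "user", "content": "ping"}],
  )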

Weaknesses:

  • No native enterprise governance (RBAC, workspaces, approval workflows) out of the box
  • Observability is callback-based — you need to wire up a separate tracing platform
  • No built-in guardrails or PII filtering
  • Setup requires YAML configuration and Docker knowledge; 15–30 minutes vs minutes for managed options

Best for: self-hosted infrastructure, cost-conscious teams, multi-tenant internal services, teams that already run their own infra.

See infra/litellm for implementation detail, proxy configuration, and common failure cases.


Portkey

Open-source core, managed cloud option. Built for teams that need governance and compliance as first-class features rather than add-ons.

Strengths:

  • Guardrails and PII filtering built into the gateway — 20+ PII categories, jailbreak detection, output validation
  • Semantic caching built-in; up to 40% cost reduction [unverified]
  • Prompt management with versioning and environment promotion from the UI
  • RBAC, workspaces, and audit logs without additional tooling
  • SOC 2 and ISO 27001 certified — relevant for regulated industries
  • One-line integration (base URL swap) for the managed option; no infrastructure to run
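
The one-line integration amounts to pointing an existing OpenAI SDK client at Portkey's endpoint and passing Portkey keys as headers. A sketch; the URL and header names here are best-effort recollections to confirm against Portkey's docs:

  from openai import OpenAI

  client = OpenAI(
      base_url="https://api.portkey.ai/v1",  # Portkey's managed gateway endpoint (verify)
      api_key="dummy",  # provider auth is resolved via the virtual key below
      default_headers={
          "x-portkey-api-key": "PORTKEY_API_KEY",
          "x-portkey-virtual-key": "VIRTUAL_KEY",  # maps to stored provider credentials
      },
  )

  response = client.chat.completions.create(
      model="gpt-4o",
      messages=[{"role": "user", "content": "ping"}],
  )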

Weaknesses:

  • Managed option means data leaves your infrastructure (self-hosted option available but requires more ops)
  • More opinionated than LiteLLM — you accept Portkey's governance model
  • Enterprise features (SLA, dedicated support) are paid tier

Best for: regulated industries (fintech, healthcare), teams with compliance requirements, teams that want guardrails and PII filtering without building them.


Kong AI Gateway

Kong's enterprise API gateway with an AI plugin layer. Sits in the Kong ecosystem (KongHQ) rather than being AI-first.

Strengths:

  • Raw throughput: Kong's own benchmarks show it running 228% faster than Portkey and 859% faster than LiteLLM [unverified — figures from Kong's own benchmark]
  • Full Kong plugin ecosystem: rate limiting, authentication, analytics, logging, mTLS
  • MCP and A2A support added in v3.12 (October 2025) — see protocols/mcp
  • RAG pipeline plugin (v3.10): automatic vector DB query to augment prompts on-the-fly
  • PII sanitisation plugin: 20+ PII categories across 12 languages (v3.10)
  • Multicloud: routes to OpenAI, Anthropic, GCP Gemini, AWS Bedrock, Azure AI, Mistral, HuggingFace

Weaknesses:

  • Significant ops overhead: Kong requires its own cluster, database (Postgres or Cassandra), and operational expertise
  • Overkill for teams not already running Kong
  • AI features are plugins on top of a general API gateway, not purpose-built for LLM workflows
  • Not open-source in the same sense as LiteLLM — Kong Gateway has a community edition but enterprise features require a licence

Best for: organisations already running Kong as their API gateway who want to extend it to LLM traffic; large enterprises with platform engineering teams who can absorb the operational overhead.


OpenRouter

Hosted service (not self-hosted). Routes to 200+ models across providers via a single API key. You do not deploy OpenRouter — you call their API.

What it is:

  • Sign up, get one API key, access every major model (OpenAI, Anthropic, Google, Meta, Mistral, and smaller open-source models)
  • Model catalogue updated continuously; includes models not available via direct API
  • Cost-based and availability-based routing: sends your call to the cheapest available endpoint for the model you request
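
In practice it is one more OpenAI-compatible endpoint with provider-prefixed model slugs. A sketch (the model slug is illustrative; check the current catalogue):

  from openai import OpenAI

  client = OpenAI(
      base_url="https://openrouter.ai/api/v1",  # OpenRouter's OpenAI-compatible endpoint
      api_key="OPENROUTER_API_KEY",
  )

  response = client.chat.completions.create(
      model="anthropic/claude-3.5-sonnet",  # provider-prefixed slug; illustrative
      messages=[{"role": "user", "content": "ping"}],
  )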

Common misconception: OpenRouter's "auto" routing balances load across equivalent endpoints and routes around unavailable providers; it does not analyse query complexity to pick a cheaper model for easy questions. For difficulty-based routing, see infra/model-routing.

Strengths:

  • Zero infrastructure — pure API service
  • Instant access to models during prototyping without managing multiple provider accounts
  • Useful for comparing model outputs in development

Weaknesses:

  • Data goes through OpenRouter's infrastructure — not suitable for sensitive data
  • No self-hosted option
  • No semantic caching, no guardrails, no PII filtering
  • Limited governance: not designed for multi-team production environments
  • Pricing adds OpenRouter's margin on top of provider costs

Best for: prototyping, personal projects, benchmarking multiple models, situations where infrastructure overhead matters more than cost or compliance.


Helicone

Observability-first proxy. Adds monitoring, semantic caching, and routing via a single base URL change. No infrastructure required.

Strengths:

  • Minimal integration: change the base URL, add two headers — done (sketched below)
  • Semantic caching: 20–30% cost reduction on repetitive query workloads; see infra/caching
  • Strong observability: per-request cost, latency, token count; session and user-level dashboards
  • Budget alerts and cost attribution by user/session via headers
  • Self-hosted option (Docker)
  • Open-source (Apache 2.0)
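
The base-URL-plus-headers integration, sketched with the OpenAI SDK; the proxy URL and header names are from memory and worth confirming against observability/helicone:

  from openai import OpenAI

  client = OpenAI(
      base_url="https://oai.helicone.ai/v1",  # Helicone's OpenAI proxy endpoint (verify)
      api_key="OPENAI_API_KEY",               # your provider key; Helicone forwards the call
      default_headers={
          "Helicone-Auth": "Bearer HELICONE_API_KEY",
          "Helicone-Cache-Enabled": "true",   # opt into response caching per request
          "Helicone-User-Id": "user-123",     # cost attribution by user (optional)
      },
  )

  response = client.chat.completions.create(
      model="gpt-4o-mini",
      messages=[{"role": "user", "content": "ping"}],
  )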

Weaknesses:

  • Less feature-complete as a router than LiteLLM or Portkey
  • No built-in guardrails or PII filtering
  • Routing is provider-level failover, not difficulty-based
  • Observability is its primary differentiator — if you already have Langfuse, the overlap is high

Best for: teams that want fast observability with caching and minimal setup, particularly when already calling a single provider and not yet needing multi-provider routing.

See observability/helicone for implementation patterns and failure cases.


Feature Comparison

Feature                    | LiteLLM              | Portkey           | Kong                   | OpenRouter             | Helicone
Self-hosted                | Yes                  | Yes / cloud       | Yes                    | No                     | Yes / cloud
Provider coverage          | 100+                 | 100+              | 8+                     | 200+                   | 100+
Semantic caching           | Yes (proxy)          | Yes               | Yes (v3.10)            | No                     | Yes
Guardrails / PII filtering | No                   | Yes               | Yes (v3.10)            | No                     | No
Budget tracking            | Yes                  | Yes               | Yes (plugin)           | No                     | Yes
Model routing / fallback   | Yes                  | Yes               | Yes                    | No (availability only) | Yes (basic)
Prompt versioning          | No                   | Yes               | No                     | No                     | Yes
RBAC / workspaces          | Basic (virtual keys) | Yes               | Yes (enterprise)       | No                     | Basic
Enterprise SLA             | No                   | Yes (paid)        | Yes (enterprise)       | No                     | Yes (paid)
Raw throughput             | Moderate             | Moderate          | Highest                | N/A                    | Moderate
Setup time                 | 15–30 min            | < 5 min           | Hours                  | < 5 min                | < 5 min
Open-source licence        | MIT                  | Apache 2.0 (core) | Community / Enterprise | No                     | Apache 2.0

When You Need a Gateway

Add a gateway when any of these are true:

  • Multiple LLM providers — you call OpenAI for some tasks, Anthropic for others, or need fallback between them
  • Multiple teams or services — virtual keys give each team its own spend limit and audit trail
  • Cost attribution — you need to know which feature or user is spending what
  • Semantic caching is viable — you have a FAQ bot, support agent, or knowledge base Q&A where queries repeat
  • Compliance or audit logging required — every LLM call must be logged with inputs, outputs, and metadata
  • Agent loops — runaway agents need rate limiting and hard budget stops at the infrastructure layer

When You Do Not Need a Gateway

Skip the gateway when:

  • Single provider, stable usage — one API key, one provider, no fallback needed
  • Low volume — under ~500 calls/day, the engineering overhead and added latency outweigh the benefits
  • Prototyping — add complexity only when it earns its place; use direct SDK calls first
  • All traffic is async batch — Anthropic's Batch API already gives 50% cost reduction on non-real-time workloads; see apis/anthropic-api

The correct order: direct provider call → add infra/caching when you see repetition → add LiteLLM when you add providers or need fallback → add a full gateway when you need team governance or compliance.


Connections

  • infra/litellm — LiteLLM implementation detail: proxy configuration, virtual keys, router patterns, failure cases
  • infra/model-routing — difficulty-based routing (cheap vs frontier model selection per query) — the complementary layer to a gateway
  • infra/caching — semantic caching architecture with Redis and RediSearch; the mechanism gateways plug into
  • observability/helicone — Helicone as gateway + observability combined
  • observability/platforms — Langfuse, LangSmith, Arize Phoenix for tracing LLM calls routed through gateways
  • protocols/mcp — Kong AI Gateway v3.12 added MCP support; MCP tool calls can be routed through gateway layers
  • synthesis/cost-optimisation — gateway features (caching, routing, batch) as cost levers in the seven-lever framework
  • apis/anthropic-api — prompt caching is an Anthropic-side mechanism that complements gateway-level semantic caching
  • security/owasp-llm-top10 — excessive agency (A09) and model denial-of-service (A04) are mitigated by gateway-level rate limiting and budget controls

Open Questions

  • At what request volume do semantic cache hit rates stabilise enough to justify the added infrastructure overhead?
  • How does LiteLLM handle Anthropic-specific features (prompt caching, extended thinking) that aren't portable to other providers?
  • Is the operational overhead of a self-hosted AI gateway worth it before roughly 10M LLM calls/month?