Serverless Inference Platforms
Serverless inference for open-weight models is a distinct market from managed proprietary APIs: Together AI leads on model breadth, Fireworks on raw latency, Groq/Cerebras on exotic silicon, Modal/Replicate on custom weights, Baseten on enterprise SLAs.
Every AI engineer deploying open-weight models (Llama, Mistral, Qwen, DeepSeek R1, Gemma) faces the same infrastructure decision: manage your own GPU cluster (via infra/inference-serving), or buy API access from a serverless platform. The serverless route removes ops overhead but introduces vendor dependency, variable pricing, and model selection constraints.
The decision is non-trivial. Cost per million tokens varies by 5-10x across providers for the same model. Latency varies by 3-5x. Enterprise SLAs and compliance posture vary enormously. And not every model is available everywhere.
Why This Decision Matters
You are choosing between fundamentally different cost structures. Anthropic and OpenAI charge $3-15/M tokens for frontier proprietary models; Llama 3.3 70B on Together or Fireworks runs ~$0.90/M, a 3-15x reduction depending on the input/output mix. For bulk batch processing or high-volume classification, this gap often decides whether a product is economically viable at all.
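A rough sketch of the arithmetic; the monthly volume and the even input/output split are assumptions, not benchmarks:

```python
# Illustrative monthly cost for a high-volume workload (assumed: 2B tokens/month, split evenly).
input_m, output_m = 1_000, 1_000  # millions of tokens per month

frontier    = input_m * 3.00 + output_m * 15.00   # Claude Sonnet-class pricing ($3/M in, $15/M out)
open_weight = input_m * 0.90 + output_m * 0.90    # Llama 3.3 70B on Together/Fireworks (~$0.90/M)

print(f"Frontier API: ${frontier:,.0f}/mo, open-weight: ${open_weight:,.0f}/mo "
      f"({frontier / open_weight:.0f}x cheaper)")
# -> Frontier API: $18,000/mo, open-weight: $1,800/mo (10x cheaper)
```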
Latency differences are real and hardware-driven. Groq's LPU produces 500+ tokens/second on Llama 3.3 70B. Standard GPU-based providers produce 150-300 tokens/second. For real-time voice agents or interactive chat where time-to-first-token dominates UX, hardware architecture matters.
Model availability constrains what you can run. Not every open-weight model is available everywhere. Groq's curated selection (~30 models) is narrower than Together AI's 200+. If you need a fine-tuned variant, a recently released model, or a custom architecture, Together, Modal, or Replicate may be the only viable options.
Provider Profiles
Together AI
Position: Broadest model catalog, best for teams wanting flexibility.
200+ open-weight models including Llama 4, DeepSeek R1, Qwen 2.5, Gemma 3, Mistral/Mixtral, and many fine-tuned variants. The largest catalog of any dedicated inference provider.
- OpenAI-compatible API (base_url = "https://api.together.xyz/v1")
- Fine-tuning pipeline built in: train on Together's GPUs, deploy from Together
- Bare GPU cluster rentals for workloads that outgrow serverless
- Llama 3.3 70B: ~$0.90/M tokens, ~917 TPS throughput, ~0.78s TTFT
Best for: Teams needing an unusual model or fine-tuned variant, researchers, or applications that cycle across many model families.
Trade-off: Slower TTFT than Fireworks on standard models (~220ms vs ~150ms for Llama 3.3 70B in published benchmarks).
Fireworks AI
Position: Fastest raw latency, best for tool-heavy agentic workloads.
Proprietary FireAttention inference kernel — a custom attention implementation that outperforms standard GPU serving on throughput and TTFT. Curated catalog (~50 models) selected for production readiness.
- OpenAI-compatible API (base_url = "https://api.fireworks.ai/inference/v1")
- First-class function calling and structured output: critical for agent tool loops
- Llama 3.3 70B: ~$0.90/M tokens, ~747 TPS throughput, ~150ms TTFT
- Aggressive speculative decoding for Llama family models
Best for: Latency-bound agent loops, function-calling-heavy pipelines, production applications where tail latency drives UX. Fireworks is the default latency recommendation for interactive workloads that are not real-time audio.
Trade-off: Narrower model catalog than Together AI; if your model is not in their catalog, it may not be available at all.
[Source: Northflank comparison, Infrabase AI Q2 2026, 2026-05-03]
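Fireworks' function calling uses the standard OpenAI tools schema, so the request shape is identical to any other OpenAI-compatible provider. A minimal sketch; the tool definition is hypothetical and the model identifier should be checked against Fireworks' catalog:

```python
from openai import OpenAI

client = OpenAI(api_key="FIREWORKS_API_KEY",
                base_url="https://api.fireworks.ai/inference/v1")

# Hypothetical tool; the schema is the standard OpenAI "tools" format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the status of an order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p3-70b-instruct",  # naming is provider-specific
    messages=[{"role": "user", "content": "Where is order 4521?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```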
Groq
Position: Exotic silicon for maximum tokens-per-second; real-time voice and interactive apps.
Groq built a Language Processing Unit (LPU) — a deterministic, memory-bandwidth-optimised chip that is fundamentally not a GPU. SRAM-based execution eliminates the memory bottleneck that limits GPU throughput. Acquired by Nvidia for ~$20B in December 2025; Groq continues operating as an independent company under a non-exclusive IP license.
- LPU architecture: SRAM-based, deterministic execution (no CUDA scheduler overhead)
- Llama 3 8B: ~877-2,100 tokens/second; Llama 3.3 70B: ~284-500 tokens/second
- Model selection: ~30 curated models (Llama, Mixtral, Gemma, Whisper) — narrower than competitors
- OpenAI-compatible API (base_url = "https://api.groq.com/openai/v1")
- Llama 3.1 8B: ~$0.06/M; Llama 3.3 70B: ~$0.64/M
- Meta partnership for official Llama API (April 2025)
Best for: Real-time voice agents where 200ms latency targets are hard constraints, interactive chat where streaming-start latency matters more than throughput, any workload where the user experiences the speed directly.
Trade-off: Narrowest model selection of the three major providers. No support for custom or fine-tuned models. Rate limits are tighter on free/starter tiers.
[Source: Groq newsroom, Artificial Analysis, Introl blog, 2026-05-03]
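For real-time targets, the number worth measuring is time-to-first-token on a streaming request. A minimal sketch against Groq's OpenAI-compatible endpoint; the model identifier is illustrative and should be confirmed against Groq's model list:

```python
import time
from openai import OpenAI

client = OpenAI(api_key="GROQ_API_KEY", base_url="https://api.groq.com/openai/v1")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # check Groq's current model list
    messages=[{"role": "user", "content": "Give a one-sentence definition of KV cache."}],
    stream=True,
)

first_token_at = None
for chunk in stream:
    # Record the wall-clock time when the first content token arrives.
    if first_token_at is None and chunk.choices and chunk.choices[0].delta.content:
        first_token_at = time.perf_counter()

print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
```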
Cerebras
Position: Wafer-scale silicon for extreme throughput; enterprise OpenAI infrastructure partner.
The Wafer Scale Engine (WSE-3) is a single chip the size of a full 300mm silicon wafer: 46,225 mm² die area (57x larger than Nvidia H100), 4 trillion transistors, 900,000 cores, 44 GB on-chip SRAM. [unverified — based on Cerebras published specs]
- $10B multi-year compute deal with OpenAI announced January 2026; powering OpenAI's Codex-Spark model (February 2026) [unverified — mark for verification against Cerebras/OpenAI press releases]
- AWS partnership for Bedrock integration using WSE-3, announced March 2026 [unverified]
- Claims 15x lower latency than GPU-based solutions for supported model sizes
- IPO filing in 2026 at ~$23B valuation [unverified]
- Primary market: enterprise customers via cloud partnerships, not direct developer API
Best for: Enterprise teams accessing Cerebras via AWS Bedrock or OpenAI infrastructure; not a direct developer API in the same sense as Together/Fireworks/Groq.
Trade-off: Not a general-purpose developer API — availability is primarily through platform partners. Model selection is constrained to what Cerebras has optimised for their architecture.
Modal
Position: Serverless GPU platform for custom containers and fine-tuned models.
Modal is not an inference provider in the same sense as Together or Fireworks — it is a serverless compute platform where you deploy your own model weights, your own serving code, and your own inference stack. You write Python functions, Modal handles packaging, scaling, and billing per second of GPU time.
- $87M Series B (September 2025), $1.1B valuation
- Cold starts: 2-4 seconds via warm container pooling
- Any model, any framework — vLLM, TGI, llama.cpp, custom code
- Pay per second of GPU time (no minimum, scales to zero)
- Full custom container support — install anything, expose any port
Best for: Teams with fine-tuned model weights, non-standard inference setups, custom tokenisers or serving logic, or experimental model architectures not available on managed providers.
Trade-off: More ops overhead than Together/Fireworks. You own the serving code, which means you own the failure modes. Cold starts matter if your traffic is bursty. Not suitable as a drop-in replacement for managed API providers if you want zero-maintenance serving.
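A minimal sketch of the deployment model, assuming current Modal SDK conventions (modal.App, per-function GPU requests); the model name and libraries are illustrative:

```python
import modal

app = modal.App("custom-llm")

# Container image with whatever serving stack you choose; packages are illustrative.
image = modal.Image.debian_slim().pip_install("transformers", "torch", "accelerate")

@app.function(gpu="A100", image=image, timeout=600)
def generate(prompt: str) -> str:
    # You own the serving code: load your own (possibly fine-tuned) weights here.
    # A production setup would keep weights warm across calls rather than reloading per request.
    from transformers import pipeline
    pipe = pipeline("text-generation", model="my-org/my-finetuned-llama")  # hypothetical weights
    return pipe(prompt, max_new_tokens=256)[0]["generated_text"]

@app.local_entrypoint()
def main():
    print(generate.remote("Summarise our returns policy in one sentence."))
```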
Replicate
Position: Model marketplace for fast prototyping and low-volume experimentation.
Replicate is a marketplace of model-as-API endpoints. Any HuggingFace model can be deployed and exposed as a REST endpoint in minutes. Thousands of models available, including image generation, audio, and video alongside LLMs.
- Widest raw model variety of any platform (including image gen, audio, video)
- Easy to try any model in minutes — no infra setup
- Cold starts: 16-60+ seconds for custom models (unsuitable for latency-sensitive production)
- Pay-per-prediction pricing
Best for: Prototyping, demos, low-volume research experiments, accessing obscure model variants. Not for production-scale cost efficiency or latency-sensitive applications.
Trade-off: Cold start latency is prohibitive for production. Per-prediction pricing is expensive at volume compared to dedicated providers.
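Trying a model takes a few lines with the Replicate Python client; the model reference and input keys below are illustrative and model-specific:

```python
import replicate  # pip install replicate; reads REPLICATE_API_TOKEN from the environment

# Model references use "owner/name" (optionally ":version"); this one is illustrative.
output = replicate.run(
    "meta/meta-llama-3-70b-instruct",
    input={"prompt": "Explain KV cache in two sentences.", "max_tokens": 128},
)
print("".join(output))  # LLM endpoints typically yield output as an iterator of strings
```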
Baseten
Position: Enterprise-grade custom model serving with SLAs.
Baseten focuses on production inference with compliance, SLAs, and enterprise support. Raised $150M Series D in late 2025. Truss framework for packaging PyTorch, TensorFlow, and HuggingFace models into serving containers.
- SLA-backed uptime guarantees — differentiator from Modal/Replicate
- Compliance posture for regulated industries
- Cold starts: 5-10 seconds with container caching; sub-second with pre-warming
- Custom model deployment — any weights, any framework
- Private model hosting — model weights do not leave your deployment environment
Best for: Teams with specific compliance requirements (HIPAA, SOC2, financial services), need for private model hosting, or explicit SLA requirements that rule out Replicate/Modal.
Trade-off: Per-minute billing hurts short-duration requests. Higher baseline cost than self-managed Modal for equivalent workloads. Less developer-friendly onboarding than Together/Fireworks.
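A sketch of the Truss packaging contract as documented (a Model class whose load() runs once at container start and whose predict() handles each request); the model name is illustrative:

```python
# model/model.py inside a Truss directory.
class Model:
    def __init__(self, **kwargs):
        self._pipeline = None

    def load(self):
        # Called once at startup: load weights into memory/GPU.
        from transformers import pipeline
        self._pipeline = pipeline("text-generation", model="my-org/finetuned-llama")  # hypothetical

    def predict(self, model_input: dict) -> dict:
        out = self._pipeline(model_input["prompt"], max_new_tokens=256)
        return {"completion": out[0]["generated_text"]}
```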
Comparison Matrix
| Provider | Latency | Cost Tier | Model Coverage | OpenAI Compatible | Enterprise SLA | Custom Weights |
|---|---|---|---|---|---|---|
| Together AI | Medium (220ms TTFT) | Low ($0.90/M 70B) | Very high (200+) | Yes | Partial | Via fine-tune |
| Fireworks AI | Low (150ms TTFT) | Low ($0.90/M 70B) | Medium (~50) | Yes | Partial | Via fine-tune |
| Groq | Very low (LPU) | Very low ($0.06-0.64/M) | Low (~30) | Yes | Improving (Nvidia) | No |
| Cerebras | Very low (WSE) | Enterprise | Very low | Via partners | Yes (enterprise) | No |
| Modal | Variable (cold start 2-4s) | Pay/sec | Any | Self-host | No | Yes |
| Replicate | High (cold start 16-60s) | Pay/prediction | Thousands | Partial | No | Yes |
| Baseten | Medium (cold start 5-10s) | Medium | Any | Partial | Yes | Yes |
Cost tier reflects relative pricing against proprietary frontier APIs (Anthropic/OpenAI at $3-15/M). "Low" means ~$0.90/M for 70B models — a 3-15x reduction vs frontier APIs.
The OpenAI-Compatible Endpoint Pattern
Every major dedicated inference provider implements the OpenAI Chat Completions API schema. Switching providers requires only a new base_url, api_key, and provider-specific model identifier. This is the foundation of multi-provider architecture.
```python
from openai import OpenAI

# Together AI
together = OpenAI(
    api_key="TOGETHER_API_KEY",
    base_url="https://api.together.xyz/v1",
)

# Fireworks AI
fireworks = OpenAI(
    api_key="FIREWORKS_API_KEY",
    base_url="https://api.fireworks.ai/inference/v1",
)

# Groq
groq = OpenAI(
    api_key="GROQ_API_KEY",
    base_url="https://api.groq.com/openai/v1",
)

# The same call shape works against any of the three clients
response = together.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # model naming varies by provider
    messages=[{"role": "user", "content": "Explain KV cache."}],
    max_tokens=512,
)
```

This pattern enables infra/litellm-style provider abstraction or infra/model-routing without architectural changes. The base_url becomes a configuration value, not a code dependency.
Multi-Provider Routing Strategy
Do not commit to a single provider. Route by workload class:
| Workload class | Provider | Reason |
|---|---|---|
| Real-time voice, sub-200ms required | Groq | LPU throughput; lowest TTFT |
| Interactive chat, agent tool loops | Fireworks | FireAttention + function calling |
| Batch processing, diverse model needs | Together AI | Best cost at volume; widest catalog |
| Fine-tuned or custom weights | Modal | Full container control |
| Enterprise with SLA requirement | Baseten | SLA-backed, compliance posture |
| Prototyping / exploration | Replicate | Fastest to try any model |
The canonical production architecture pairs Groq or Fireworks (real-time) with Together AI (batch/bulk) and Modal (custom) as three independent clients behind a routing layer. Switching between the managed providers is a configuration change (base_url, key, model identifier), not an application change.
This also provides resilience: if one provider has an outage or rate limits, the routing layer falls back without application code changes. See infra/litellm for a proxy that manages this across providers.
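A minimal sketch of a hand-rolled routing layer with naive fallback; the workload classes, model identifiers, and fallback order are assumptions to adapt:

```python
from openai import OpenAI

# One client per provider; only base_url, key, and model naming differ (values are illustrative).
PROVIDERS = {
    "realtime": (OpenAI(api_key="GROQ_API_KEY", base_url="https://api.groq.com/openai/v1"),
                 "llama-3.3-70b-versatile"),
    "interactive": (OpenAI(api_key="FIREWORKS_API_KEY", base_url="https://api.fireworks.ai/inference/v1"),
                    "accounts/fireworks/models/llama-v3p3-70b-instruct"),
    "batch": (OpenAI(api_key="TOGETHER_API_KEY", base_url="https://api.together.xyz/v1"),
              "meta-llama/Llama-3.3-70B-Instruct-Turbo"),
}
FALLBACK_ORDER = ["realtime", "interactive", "batch"]

def complete(workload: str, messages: list[dict], **kwargs):
    # Try the preferred provider for this workload class, then fall back in order.
    order = [workload] + [w for w in FALLBACK_ORDER if w != workload]
    last_error = None
    for name in order:
        client, model = PROVIDERS[name]
        try:
            return client.chat.completions.create(model=model, messages=messages, **kwargs)
        except Exception as exc:  # rate limits, outages, etc.
            last_error = exc
    raise last_error

resp = complete("interactive", [{"role": "user", "content": "Explain KV cache."}], max_tokens=256)
```

In production this logic usually lives in a proxy such as LiteLLM rather than application code, but the shape is the same: provider choice is data, not code.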
Cost Context vs Proprietary APIs
Rough order of magnitude for Llama 3.3 70B vs frontier proprietary APIs:
| Provider | Input $/M | Output $/M |
|---|---|---|
| Anthropic Claude Sonnet 4.6 | $3.00 | $15.00 |
| OpenAI GPT-4o | $2.50 | $10.00 |
| Together / Fireworks Llama 3.3 70B | ~$0.90 | ~$0.90 |
| Groq Llama 3.3 70B | $0.59 | $0.79 |
| Groq Llama 3.1 8B | $0.06 | $0.06 |
Open-weight models on dedicated inference providers run 3-15x cheaper than frontier proprietary APIs. The quality gap depends on task: for structured extraction, summarisation, and classification, Llama 3.3 70B is often competitive with GPT-4o. For complex reasoning, code generation at frontier difficulty, and multi-step agent tasks, the proprietary gap remains real.
See synthesis/cost-optimisation for the full cost reduction stack combining routing, caching, and batch processing.
Connections
- infra/inference-serving — self-hosting via vLLM, llama.cpp, TensorRT-LLM when you manage your own GPUs
- infra/litellm — OpenAI-compatible proxy that abstracts all providers into one interface
- infra/model-routing — difficulty-based routing (RouteLLM, FrugalGPT) to cut frontier model usage
- infra/gpu-hardware — when to own GPUs vs buy serverless
- llms/model-families — which open-weight models are worth deploying on these platforms
- agents/voice-agents — Groq is the primary recommendation for real-time voice agent latency budgets
- synthesis/cost-optimisation — full seven-lever cost framework
- apis/anthropic-api — the proprietary alternative (Claude); prompt caching and Batch API for cost reduction
Open Questions
- How does Groq's LPU perform post-Nvidia acquisition — does the technology roadmap change?
- At what token volume does self-hosting vLLM on reserved cloud GPUs break even against Together AI pricing?
- Does Cerebras's WSE become available as a direct developer API or remain enterprise-only through cloud partners?
Related reading