Cloud Platforms for AI Engineering
AWS (Bedrock + SageMaker), GCP (Vertex AI), and Azure (Azure OpenAI) each offer distinct AI stacks — choice depends on existing cloud contracts, compliance requirements, and whether you need frontier or self-hosted open models.
The three major cloud providers each have a distinct AI stack. AWS dominates enterprise; GCP has the deepest model integration (Gemini + Vertex); Azure has the strongest OpenAI relationship. Most production AI systems run on one of these three, even when the model itself is Claude or GPT accessed via API.
The Three Stacks at a Glance
| | AWS | GCP | Azure |
|---|---|---|---|
| Managed model API | Bedrock | Vertex AI Model Garden | Azure OpenAI Service |
| Training platform | SageMaker | Vertex AI Training | Azure Machine Learning |
| Serverless inference | Lambda + API Gateway | Cloud Run | Azure Functions |
| GPU compute | EC2 (p3/p4/p5) | GCE (A100/H100) | NDv5 (H100) |
| Object storage | S3 | GCS | Azure Blob |
| Container registry | ECR | Artifact Registry | ACR |
| Kubernetes | EKS | GKE | AKS |
| Secrets | Secrets Manager | Secret Manager | Key Vault |
AWS
Bedrock — Claude on AWS
Amazon Bedrock is the managed API gateway for frontier models including Claude, Llama, Mistral, and others. If your infrastructure is already AWS, Bedrock means no VPC egress to api.anthropic.com. Model calls stay inside AWS.
import boto3
import json
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
response = bedrock.invoke_model(
modelId="anthropic.claude-sonnet-4-6-20251001-v1:0",
body=json.dumps({
"anthropic_version": "bedrock-2023-05-31",
"max_tokens": 1024,
"messages": [{"role": "user", "content": "What is RAG?"}],
}),
contentType="application/json",
)
body = json.loads(response["body"].read())
print(body["content"][0]["text"])Bedrock uses the same model IDs as the Anthropic API but with a different SDK and different auth (IAM roles, not API keys). The request/response shape is almost identical.
Bedrock vs direct Anthropic API:
| | Bedrock | Anthropic API |
|---|---|---|
| Auth | IAM roles | API key |
| Pricing | Same model prices + small AWS markup | Direct |
| Data residency | Stays in your AWS region | Anthropic's infrastructure |
| Compliance | SOC2, HIPAA, FedRAMP | SOC2 |
| Prompt caching | Available | Available |
| Latency | Slightly higher | Lower |
Use Bedrock when: enterprise compliance requires data to stay in AWS, you already have AWS IAM infrastructure, or your security team won't approve external API keys.
SageMaker — Training and Hosting Open Models
SageMaker handles the full ML lifecycle (training, hosting, monitoring) for open-source models you bring yourself (Llama, Mistral, etc.).
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
role = sagemaker.get_execution_role()
hub = {
"HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",
"HF_TASK": "text-generation",
"SM_NUM_GPUS": "1",
}
huggingface_model = HuggingFaceModel(
image_uri=sagemaker.image_uris.retrieve(
framework="huggingface-llm",
region="us-east-1",
version="2.0.0",
),
env=hub,
role=role,
)
predictor = huggingface_model.deploy(
initial_instance_count=1,
instance_type="ml.g5.2xlarge", # A10G GPU
)
response = predictor.predict({"inputs": "What is the capital of France?"})

SageMaker instances: ml.g5.2xlarge (A10G, 24 GB VRAM, ~$1.21/hr), ml.p4d.24xlarge (8×A100, for large models).
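Endpoints bill by the hour whether or not they serve traffic, so tear down experimental deployments when you're done. A one-line sketch using the predictor object from the example above:

# Stop hourly billing by removing the endpoint (and its config) plus the backing model.
predictor.delete_endpoint()
predictor.delete_model()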
When to use SageMaker vs Bedrock:
- Open model you're fine-tuning → SageMaker
- Frontier API model (Claude, GPT) → Bedrock or direct API
- Need autoscaling inference endpoints → SageMaker
Lambda — Serverless LLM Endpoints
For lightweight inference tasks (classification, short generation) that don't need a persistent server:
# lambda_function.py
import json
import anthropic
client = anthropic.Anthropic() # reads ANTHROPIC_API_KEY from Lambda env vars
def handler(event, context):
    body = json.loads(event["body"])
    query = body["query"]
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=256,
        messages=[{"role": "user", "content": query}],
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"answer": response.content[0].text}),
    }

Lambda cold start adds 200–500 ms of latency for Python. For streaming responses, Lambda doesn't support SSE well; use API Gateway WebSocket or ECS instead.
IAM for AI Workloads
Key principle: least privilege. Each service gets only the permissions it needs.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"bedrock:InvokeModel",
"bedrock:InvokeModelWithResponseStream"
],
"Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-*"
},
{
"Effect": "Allow",
"Action": ["s3:GetObject", "s3:PutObject"],
"Resource": "arn:aws:s3:::my-ai-data-bucket/*"
},
{
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": "arn:aws:secretsmanager:us-east-1:123456789:secret:anthropic-api-key-*"
}
]
}

Store sensitive values like API keys in Secrets Manager, not in environment variables or SSM Parameter Store. Rotate them with Lambda rotation functions.
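Fetching the key at runtime with boto3 — a minimal sketch; the secret name is illustrative and matches the ARN pattern in the policy above:

import boto3

# Retrieve the Anthropic API key at call time instead of baking it into env vars.
secrets = boto3.client("secretsmanager", region_name="us-east-1")
api_key = secrets.get_secret_value(SecretId="anthropic-api-key")["SecretString"]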
GCP
Vertex AI — Google's AI Platform
Vertex AI is GCP's unified ML platform. It hosts Google's own models (Gemini) plus third-party models via Model Garden, and handles training, fine-tuning, and serving.
import vertexai
from vertexai.generative_models import GenerativeModel
vertexai.init(project="my-project-id", location="us-central1")
model = GenerativeModel("gemini-2.5-pro")
response = model.generate_content("Explain RAG in one paragraph.")
print(response.text)

Claude is also available on Vertex AI (via Model Garden) with the same IAM auth pattern:
import anthropic
from google.auth import default
from google.auth.transport.requests import Request
credentials, project = default()
credentials.refresh(Request())
client = anthropic.AnthropicVertex(
project_id=project,
region="us-east5",
)
response = client.messages.create(
model="claude-sonnet-4-6@20251001",
max_tokens=1024,
messages=[{"role": "user", "content": "Hello"}],
)

Cloud Run — Serverless Containers for AI APIs
Cloud Run is GCP's managed container runtime: you deploy a Docker container, and it scales to zero and handles traffic for you. It's a better fit than Lambda for AI workloads because it supports longer timeouts and streaming responses.
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
gcloud run deploy my-ai-api \
--image gcr.io/my-project/my-ai-api \
--region us-central1 \
--set-secrets ANTHROPIC_API_KEY=anthropic-key:latest \
--allow-unauthenticated

Cloud Run concurrency: set --concurrency 80 (the default) for stateless API endpoints; lower it if each request is memory-heavy.
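The Dockerfile above expects a main:app ASGI application. A minimal sketch of one (the /ask route, request model, and model name are illustrative assumptions; fastapi, pydantic, anthropic, and uvicorn would need to be in requirements.txt):

# main.py — minimal FastAPI app behind the Dockerfile above (illustrative sketch)
import os

import anthropic
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


class Ask(BaseModel):
    query: str


@app.post("/ask")
def ask(body: Ask):
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=512,
        messages=[{"role": "user", "content": body.query}],
    )
    return {"answer": response.content[0].text}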
GCS — Storage for Training Data
import json

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-training-data")

# Upload a training dataset
blob = bucket.blob("datasets/fine-tune-data.jsonl")
blob.upload_from_filename("local/fine-tune-data.jsonl")

# Stream large files without loading them into memory
with blob.open("r") as f:
    for line in f:
        record = json.loads(line)
        # process each training record

Azure
Azure OpenAI Service
Azure OpenAI hosts GPT-4o, GPT-4, and o-series models inside Azure infrastructure. It's the required route for enterprises that must stay within Azure — for Microsoft 365 integration, existing Azure contracts, or EU data residency via Azure regions.
import os

from openai import AzureOpenAI
client = AzureOpenAI(
api_key=os.environ["AZURE_OPENAI_KEY"],
api_version="2024-02-01",
azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
# endpoint format: https://<resource>.openai.azure.com/
)
response = client.chat.completions.create(
model="gpt-4o", # your deployment name, not the model name
messages=[{"role": "user", "content": "Hello"}],
max_tokens=512,
)
print(response.choices[0].message.content)

Note: Azure OpenAI uses deployment names, not model names. You deploy a model, give it a name, and use that name in API calls.
Claude is available via Azure Marketplace (not Azure OpenAI). Use the Anthropic SDK with an Azure-issued key.
Azure ML — Training and MLOps
Azure ML handles experiment tracking, model registry, and deployment pipelines:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient(
credential=DefaultAzureCredential(),
subscription_id="...",
resource_group_name="my-rg",
workspace_name="my-workspace",
)
# Submit a fine-tuning job
from azure.ai.ml import command
job = command(
code="./src",
command="python train.py --data ${{inputs.data}} --model ${{inputs.model}}",
inputs={"data": "azureml:training-data:1", "model": "gpt-4o"},
environment="azureml:AzureML-sklearn-1.0-ubuntu20.04:1",
compute="gpu-cluster",
)
returned_job = ml_client.jobs.create_or_update(job)

Key Vault — Secrets for Azure
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient
credential = DefaultAzureCredential()
client = SecretClient(
vault_url="https://my-keyvault.vault.azure.net/",
credential=credential,
)
api_key = client.get_secret("anthropic-api-key").value

Choosing a Cloud Provider
| Situation | Recommendation |
|---|---|
| Already deep in AWS | Bedrock for Claude, S3 + Lambda for lightweight serving |
| Need to run open models (Llama, Mistral) | SageMaker (AWS) or Vertex AI (GCP) |
| Google workspace / BigQuery shop | Vertex AI; Gemini native integration |
| Microsoft / Azure enterprise | Azure OpenAI; Key Vault for secrets |
| Greenfield, no existing cloud | GCP Cloud Run is the simplest for containerised AI APIs |
| Compliance: HIPAA / FedRAMP | AWS Bedrock or Azure OpenAI (both have compliance certs) |
| EU data residency | AWS eu-west or Azure EU regions |
| Cheapest experimentation | Direct Anthropic / OpenAI API — no cloud middleman markup |
Cost Comparison: Managed vs Self-Hosted
For frontier models (Claude, GPT-4o) you can't self-host. You pay the API price regardless of which cloud you route through.
For open models (Llama 3 8B, Mistral 7B):
| Option | Cost (8B model) | Latency | Ops overhead |
|---|---|---|---|
| SageMaker ml.g5.2xlarge | ~$1.21/hr (24/7 = ~$873/mo) | Low | Medium |
| Runpod (spot GPU) | ~$0.20/hr | Low | High |
| Modal serverless GPU | ~$0.00164/s (billed by request) | Cold start 2-5s | Low |
| Fly.io GPU (A10) | ~$0.60/hr | Low | Low |
| Managed inference (Together AI, Fireworks) | $0.10–$0.20 per M tokens | Low | None |
Below ~10M tokens/month, managed inference APIs (Together AI, Fireworks, Groq) are cheaper than self-hosting anything.
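That threshold is conservative: purely on compute cost (ignoring throughput ceilings, latency requirements, and ops effort), the break-even against an always-on g5.2xlarge sits far higher. A back-of-the-envelope sketch using the illustrative numbers from the table above:

# Rough break-even: always-on SageMaker g5.2xlarge vs managed per-token pricing.
self_hosted_per_month = 1.21 * 24 * 30   # ~$871/month, running 24/7
managed_per_m_tokens = 0.15              # $/M tokens (midpoint of $0.10–$0.20)

break_even_m_tokens = self_hosted_per_month / managed_per_m_tokens
print(f"Break-even: ~{break_even_m_tokens:,.0f}M tokens/month")  # ≈ 5,808M tokens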
Key Facts
- Bedrock uses IAM roles (not API keys); adds small AWS markup on top of model prices
- Bedrock has SOC2, HIPAA, and FedRAMP compliance; Anthropic API has SOC2 only
- SageMaker ml.g5.2xlarge (A10G 24GB): ~$1.21/hr; ml.p4d.24xlarge (8× A100): ~$32/hr
- Lambda cold start: 200-500ms for Python; SSE streaming not well-supported, use ECS or API Gateway WebSocket
- Azure OpenAI uses deployment names, not model names in API calls
- Below ~10M tokens/month, managed inference APIs (Together AI, Fireworks, Groq) are cheaper than self-hosting
- Store API keys in Secrets Manager (AWS), Secret Manager (GCP), or Key Vault (Azure), not env vars
Common Failure Cases
Lambda SSE streaming drops mid-response due to 29-second timeout
Why: AWS API Gateway has a 29-second integration timeout; LLM responses for complex queries regularly exceed this.
Detect: streaming response cuts off at ~29 seconds; the client receives a partial response followed by a gateway timeout.
Fix: switch from Lambda to ECS or Fargate for streaming LLM endpoints; or use WebSocket + API Gateway WebSocket API.
Bedrock model access not enabled in the target region causes runtime failure
Why: Claude models on Bedrock must be individually enabled per region in the AWS console; they're not on by default.
Detect: botocore.exceptions.ClientError: Access denied to model at runtime despite correct IAM permissions.
Fix: navigate to Bedrock → Model access in each required region and enable the models before deployment.
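A deploy-time smoke test can surface this before production traffic. A hedged sketch — the 1-token invocation and error handling are illustrative, not an official preflight API:

import json

import boto3
from botocore.exceptions import ClientError


def check_bedrock_access(model_id: str, region: str = "us-east-1") -> bool:
    """Smoke-test a 1-token invocation to verify model access is enabled."""
    bedrock = boto3.client("bedrock-runtime", region_name=region)
    try:
        bedrock.invoke_model(
            modelId=model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 1,
                "messages": [{"role": "user", "content": "ping"}],
            }),
            contentType="application/json",
        )
        return True
    except ClientError as e:
        # An access-denied error here usually means model access isn't enabled
        # in this region, even when IAM permissions are correct.
        print(f"Bedrock access check failed: {e}")
        return False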
Cloud Run container fails health check and never receives traffic
Why: the health check endpoint returns non-2xx during the LLM model warm-up period; Cloud Run marks the container unhealthy and stops routing to it.
Detect: Cloud Run logs show container restarting repeatedly; health check endpoint returns 503 during startup.
Fix: add a startup probe with a long initialDelaySeconds; return 200 from /health immediately even before the model is fully loaded, and use /ready for readiness.
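A minimal sketch of that liveness/readiness split in FastAPI — the background loader and model_loaded flag are illustrative placeholders for a real weight-loading step:

import threading
import time

from fastapi import FastAPI, Response

app = FastAPI()
model_loaded = False


def load_model():
    global model_loaded
    time.sleep(60)  # placeholder for a slow model load (e.g. pulling weights)
    model_loaded = True


threading.Thread(target=load_model, daemon=True).start()


@app.get("/health")
def health():
    # Liveness: return 200 immediately, even while the model is still loading.
    return {"status": "ok"}


@app.get("/ready")
def ready(response: Response):
    # Readiness: 503 until the model is actually usable.
    if not model_loaded:
        response.status_code = 503
        return {"status": "loading"}
    return {"status": "ready"}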
Azure OpenAI deployment name vs model name confusion causes DeploymentNotFound
Why: Azure OpenAI uses the deployment name (user-defined) not the model name in API calls; mixing them up returns a 404.
Detect: openai.NotFoundError: The API deployment ... does not exist.
Fix: verify deployment names in the Azure OpenAI Studio portal; model is gpt-4o from OpenAI but your deployment might be named gpt-4o-prod — use the deployment name.
Secrets stored in environment variables are exposed in crash dumps and logs
Why: unhandled exceptions that print the full environment expose API keys stored as ANTHROPIC_API_KEY=... in the stack trace.
Detect: check your application logs for lines containing API_KEY or SECRET; rotate any keys found.
Fix: store secrets in Secrets Manager / Secret Manager / Key Vault and retrieve them at runtime rather than exporting them to the process environment; scrub environment dumps from error reporting.
Connections
- infra/inference-serving — vLLM, llama.cpp for self-hosted open model inference on cloud GPUs
- infra/deployment — Docker, GitHub Actions CI/CD, Vercel, Fly.io deployment patterns
- infra/gpu-hardware — GPU selection before choosing an instance type
- apis/anthropic-api — direct Anthropic API as the baseline comparison for Bedrock
- apis/aws-bedrock — deep dive on Converse API, Knowledge Bases, Guardrails, boto3 patterns
- security/owasp-llm-top10 — cloud-specific AI security considerations (IAM, supply chain)
Open Questions
- When does the Bedrock latency overhead become significant enough to switch to direct Anthropic API?
- What are the data residency guarantees for Bedrock in non-US regions?
- How does Vertex AI Model Garden pricing compare to direct Anthropic API for Claude in production workloads?