AWS SageMaker
Managed ML platform: training, model registry, real-time inference endpoints, and batch transform — the bridge between cloud engineering and AI engineering.
SageMaker vs Alternatives
SageMaker:
- Managed training + inference + model registry in one platform
- Tight AWS integration (S3, ECR, IAM, CloudWatch)
- Supports any framework (PyTorch, TF, HuggingFace, XGBoost, custom Docker)
- Cost: endpoint hours + training hours + storage
When to use SageMaker:
- You're already on AWS and need managed ML infra
- You need autoscaling inference endpoints without managing EC2
- You need model versioning and A/B testing
- You run batch predictions (batch transform)
When NOT to use SageMaker:
- You're serving open-weight LLMs at scale → vLLM on ECS/EKS is cheaper
- You need GPU inference under 100ms with full control → custom EC2/container
- Your team doesn't know SageMaker (steep learning curve)
Real-Time Inference Endpoint
import boto3
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
# Deploy a HuggingFace model from Hub to a SageMaker endpoint
session = sagemaker.Session()
role = "arn:aws:iam::123456789:role/SageMakerExecutionRole"
# Model configuration
hub_model = HuggingFaceModel(
model_data=None, # pull directly from Hub (no S3 needed)
env={
"HF_MODEL_ID": "sentence-transformers/all-MiniLM-L6-v2",
"HF_TASK": "feature-extraction",
},
role=role,
transformers_version="4.37",
pytorch_version="2.1",
py_version="py310",
)
# Deploy to a real-time endpoint
predictor = hub_model.deploy(
initial_instance_count=1,
instance_type="ml.g4dn.xlarge", # GPU instance for transformers
endpoint_name="embedding-endpoint-prod",
)
# Invoke the endpoint
response = predictor.predict({
"inputs": ["semantic search query", "document to embed"],
})
# Returns list of embeddings
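# The predictor object only exists in this deployment script. A sketch of how
# another service would call the same endpoint through the low-level
# sagemaker-runtime client (endpoint name and payload shape taken from above):
import json
runtime = boto3.client("sagemaker-runtime")
resp = runtime.invoke_endpoint(
    EndpointName="embedding-endpoint-prod",
    ContentType="application/json",
    Body=json.dumps({"inputs": ["semantic search query"]}),
)
embeddings = json.loads(resp["Body"].read())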
# Clean up
predictor.delete_endpoint()
Autoscaling Endpoints
import boto3
autoscaling = boto3.client("application-autoscaling")  # Application Auto Scaling client, not the SageMaker runtime
# Register the endpoint variant as a scalable target
autoscaling.register_scalable_target(
ServiceNamespace="sagemaker",
ResourceId="endpoint/embedding-endpoint-prod/variant/AllTraffic",
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
MinCapacity=1,
MaxCapacity=10,
)
# Scale on invocations per instance
autoscaling.put_scaling_policy(
PolicyName="invocations-scaling",
ServiceNamespace="sagemaker",
ResourceId="endpoint/embedding-endpoint-prod/variant/AllTraffic",
ScalableDimension="sagemaker:variant:DesiredInstanceCount",
PolicyType="TargetTrackingScaling",
TargetTrackingScalingPolicyConfiguration={
"PredefinedMetricSpecification": {
"PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
},
"TargetValue": 1000, # scale when > 1000 invocations/instance/min
"ScaleInCooldown": 300, # 5 min cooldown before scale-in
"ScaleOutCooldown": 60, # 1 min before scale-out
    },
)
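# Quick sanity check that the policy is actually firing: inspect recent scaling
# activities for the same resource (resource ID taken from above).
activities = autoscaling.describe_scaling_activities(
    ServiceNamespace="sagemaker",
    ResourceId="endpoint/embedding-endpoint-prod/variant/AllTraffic",
)
for activity in activities["ScalingActivities"]:
    print(activity["StatusCode"], activity["Description"])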
Batch Transform
from sagemaker.sklearn import SKLearnModel
import boto3
# Batch transform: run inference on large datasets without a persistent endpoint
# Input: S3 bucket with CSV files. Output: S3 bucket with predictions.
s3 = boto3.client("s3")
model = SKLearnModel(
model_data="s3://my-bucket/models/classifier.tar.gz",
role=role,
framework_version="1.2-1",
py_version="py3",
entry_point="inference.py",
)
transformer = model.transformer(
instance_count=5,
instance_type="ml.m5.xlarge",
output_path="s3://my-bucket/predictions/",
strategy="MultiRecord", # batch multiple records per request
assemble_with="Line",
max_payload=6, # MB per batch
)
transformer.transform(
data="s3://my-bucket/input-data/",
content_type="text/csv",
split_type="Line",
wait=False, # async — poll status separately
)
# Poll for completion
import time
job_name = transformer.latest_transform_job.name
while True:
status = boto3.client("sagemaker").describe_transform_job(
TransformJobName=job_name
)["TransformJobStatus"]
if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(30)
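# Once the job completes, each input object typically gets a matching "<name>.out"
# object under the output path. A minimal sketch of collecting the results
# (bucket and prefix taken from the transformer config above):
if status == "Completed":
    listing = s3.list_objects_v2(Bucket="my-bucket", Prefix="predictions/")
    for obj in listing.get("Contents", []):
        body = s3.get_object(Bucket="my-bucket", Key=obj["Key"])["Body"].read()
        print(obj["Key"], body[:200])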
Model Registry and A/B Testing
import boto3
sm = boto3.client("sagemaker")
# Register a model version in the model registry
sm.create_model_package(
ModelPackageGroupName="order-classifier",
ModelPackageDescription="XGBoost order classifier v2.1.0",
InferenceSpecification={
"Containers": [{
"Image": "123456789.dkr.ecr.eu-west-1.amazonaws.com/order-classifier:v2.1.0",
"ModelDataUrl": "s3://my-bucket/models/order-classifier-v2.1.0.tar.gz",
}],
"SupportedContentTypes": ["text/csv"],
"SupportedResponseMIMETypes": ["text/csv"],
},
ModelApprovalStatus="Approved",
)
# A/B test: send 10% of traffic to the new model version.
# The endpoint config must exist before the endpoint can be updated to use it.
sm.create_endpoint_config(
    EndpointConfigName="ab-test-config",
    ProductionVariants=[
        {
            "VariantName": "ModelV1",
            "ModelName": "order-classifier-v1",
            "InitialVariantWeight": 90, # 90% of traffic
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        },
        {
            "VariantName": "ModelV2",
            "ModelName": "order-classifier-v2",
            "InitialVariantWeight": 10, # 10% of traffic
            "InstanceType": "ml.m5.xlarge",
            "InitialInstanceCount": 1,
        },
    ],
)
sm.update_endpoint(
    EndpointName="order-classifier-prod",
    EndpointConfigName="ab-test-config",
)
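# Each invocation response reports which variant served it, which is how you join
# predictions back to model versions for offline comparison (the CSV payload here
# is a placeholder for whatever features the classifier expects):
runtime = boto3.client("sagemaker-runtime")
resp = runtime.invoke_endpoint(
    EndpointName="order-classifier-prod",
    ContentType="text/csv",
    Body="3,27.5,1,0\n",
)
print(resp["InvokedProductionVariant"], resp["Body"].read())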
SageMaker vs vLLM for LLMs
Use SageMaker when:
- Model is < 7B parameters and managed inference beats self-hosting cost
- You need out-of-box monitoring + model registry
- Team is AWS-focused and doesn't want Kubernetes
Use vLLM on ECS/EKS when:
- Serving 7B+ open models (Llama 3, Mistral) at scale
- You need PagedAttention + continuous batching (vLLM's key advantage)
- SageMaker container overhead is unacceptable for latency SLA < 100ms
- Cost: vLLM on EC2 p4d is typically 40-60% cheaper than SageMaker for LLMs
Hybrid: SageMaker for model registry + endpoint management, ECS for the actual serving.
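A minimal sketch of the registry half of that hybrid: the serving layer (ECS/EKS, vLLM, whatever) asks the registry for the latest approved version and pulls the artifact from S3. The package group name matches the example above; how the serving stack consumes the URI is left as an assumption.
import boto3
sm = boto3.client("sagemaker")
# Latest approved version in the registry
latest = sm.list_model_packages(
    ModelPackageGroupName="order-classifier",
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)["ModelPackageSummaryList"][0]
detail = sm.describe_model_package(ModelPackageName=latest["ModelPackageArn"])
model_data_url = detail["InferenceSpecification"]["Containers"][0]["ModelDataUrl"]
# Hand model_data_url to the ECS/EKS serving stack (e.g., as a task definition env var)
print(model_data_url)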
Common Failure Cases
Endpoint deployment fails with "ResourceLimitExceeded" for GPU instances
Why: GPU instance types like ml.g4dn.xlarge have per-account, per-instance-type service quotas, tracked separately for endpoint usage, training jobs, and notebooks; the endpoint-usage quota for that type is exhausted.
Detect: create_endpoint raises ResourceLimitExceeded: An error occurred ... you have exceeded your service limit for instances of type ml.g4dn.xlarge.
Fix: request a quota increase via AWS Service Quotas for the specific ml.* instance type; in the interim, use a smaller instance type or delete unused endpoints.
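A hedged sketch of checking the current quota programmatically; the exact QuotaName string for GPU endpoint usage is an assumption, so verify it in the Service Quotas console.
import boto3
quotas = boto3.client("service-quotas")
paginator = quotas.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        # Assumed name pattern for the endpoint-usage quota of this instance type
        if "ml.g4dn.xlarge for endpoint usage" in quota["QuotaName"]:
            print(quota["QuotaName"], quota["Value"])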
Endpoint invocation returns 413 — payload too large
Why: SageMaker real-time endpoints have a 6 MB payload limit per request; sending raw image bytes or large document batches inline exceeds this.
Detect: botocore.exceptions.ClientError: An error occurred (413) ... Payload Too Large.
Fix: for large inputs, upload to S3 and pass the S3 URI; for batch predictions, use Batch Transform instead of real-time endpoints.
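A small guard along those lines. The 6 MB ceiling is the documented real-time limit; the bucket, key, and endpoint name are placeholders.
import json
import boto3
MAX_REALTIME_PAYLOAD = 6 * 1024 * 1024  # real-time endpoints cap requests at ~6 MB
def invoke_or_stage(records):
    payload = json.dumps({"inputs": records}).encode("utf-8")
    if len(payload) < MAX_REALTIME_PAYLOAD:
        return boto3.client("sagemaker-runtime").invoke_endpoint(
            EndpointName="embedding-endpoint-prod",
            ContentType="application/json",
            Body=payload,
        )
    # Too large for a real-time call: stage to S3 and run Batch Transform instead
    boto3.client("s3").put_object(
        Bucket="my-bucket", Key="input-data/large-batch.json", Body=payload
    )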
Autoscaling does not scale down — instances idle at minimum
Why: either MinCapacity is set above 1, which puts a permanent floor under instance count, or ScaleInCooldown is set too long relative to traffic patterns, so the scale-in policy rarely fires during low-traffic windows.
Detect: CloudWatch SageMakerVariantInvocationsPerInstance is near zero but the instance count never drops; cost is higher than expected.
Fix: shorten ScaleInCooldown to match your off-peak window (e.g., 120-300s) and lower MinCapacity to 1; standard real-time endpoints cannot scale to zero instances, so if you need scale-to-zero consider SageMaker Serverless Inference or Asynchronous Inference instead.
Model container fails health check — endpoint never becomes InService
Why: the custom inference container's /ping endpoint returns a non-200 status or does not come up within the startup health check window, so SageMaker marks the container unhealthy.
Detect: endpoint status stays Creating then transitions to Failed; CloudWatch /aws/sagemaker/Endpoints logs show health check timeouts.
Fix: ensure the container starts a web server on port 8080 that responds 200 to GET /ping within the startup window; for models with long load times, either raise the startup timeout (ContainerStartupHealthCheckTimeoutInSeconds on the production variant) or load weights in the background and only return 200 from /ping once the model is ready to serve.
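A minimal sketch of the container contract. Flask is an arbitrary choice here; SageMaker only requires a server on port 8080 answering GET /ping and POST /invocations.
import threading
import time
from flask import Flask, request, jsonify

app = Flask(__name__)
model = None  # populated once weights finish loading

def load_model():
    # Stand-in for loading weights from /opt/ml/model (simulated with a sleep)
    global model
    time.sleep(5)
    model = object()

threading.Thread(target=load_model, daemon=True).start()

@app.route("/ping", methods=["GET"])
def ping():
    # 200 only once the model is ready; SageMaker retries until the startup window closes
    return ("", 200) if model is not None else ("", 503)

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = request.get_json(force=True)
    return jsonify({"received": payload})  # replace with real inference

if __name__ == "__main__":
    # Flask dev server for brevity; production containers usually run gunicorn
    app.run(host="0.0.0.0", port=8080)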
Connections
cloud/cloud-hub · cloud/aws-core · cloud/aws-fargate · cloud/aws-eks · infra/inference-serving · llms/ae-hub · fine-tuning/frameworks
Open Questions
- What monitoring and alerting matter most when this is deployed in production?
- At what scale or workload does this approach hit its practical limits?