Blue-Green and Advanced Deployment Strategies
Zero-downtime deployment patterns with instant rollback capability.
Strategy Comparison
Strategy     Downtime   Rollback speed   Resource cost   Risk
──────────────────────────────────────────────────────────────────────────
Recreate     Yes        Minutes          1×              High
Rolling      No         Slow             1×              Medium
Blue-Green   No         Instant          2×              Low
Canary       No         Instant          1.05-1.2×       Very low
Shadow       No         N/A              2×              None (observe only)
Choose:
Blue-Green → stateless services, databases already migrated, need instant rollback
Canary → risk-averse, want gradual confidence, have good observability
Rolling → stateless, can tolerate mixed versions, minimal extra cost
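The resource-cost column translates directly into peak capacity during a rollout. A minimal sketch of that arithmetic, using the multipliers from the comparison table (the function name and the choice of the upper canary bound are assumptions made for illustration):

```python
import math

# Peak capacity multipliers copied from the comparison table.
# Canary is listed as a range (1.05-1.2x); the upper bound is used here
# so the estimate stays conservative.
PEAK_MULTIPLIER = {
    "recreate": 1.0,
    "rolling": 1.0,
    "blue-green": 2.0,
    "canary": 1.2,
    "shadow": 2.0,
}


def peak_replicas(strategy: str, steady_state_replicas: int) -> int:
    """Return the replica count needed at the busiest point of a rollout."""
    return math.ceil(PEAK_MULTIPLIER[strategy] * steady_state_replicas)


if __name__ == "__main__":
    for name in PEAK_MULTIPLIER:
        print(f"{name:<11} {peak_replicas(name, 4)} replicas at peak")
```

For a service that normally runs 4 replicas, blue-green needs headroom for 8 during the bake window, while canary needs roughly one extra replica.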
Blue-Green on Kubernetes with Argo Rollouts
# argo-rollout.yaml: blue-green strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: order-service
spec:
  replicas: 4
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
        - name: order-service
          image: myregistry/order-service:{{ .Values.image.tag }}  # Helm-templated tag
          ports:
            - containerPort: 8000
  strategy:
    blueGreen:
      activeService: order-service-active
      previewService: order-service-preview
      autoPromotionEnabled: false  # manual promotion gate
      prePromotionAnalysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: order-service-preview
      postPromotionAnalysis:
        templates:
          - templateName: success-rate
        # success-rate declares service-name without a default, so it must be passed here too
        args:
          - name: service-name
            value: order-service-active
---
# Both Services share the same base selector; Argo Rollouts adds a
# rollouts-pod-template-hash entry at runtime to pin each one to blue or green.
apiVersion: v1
kind: Service
metadata:
  name: order-service-active
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8000
---
apiVersion: v1
kind: Service
metadata:
  name: order-service-preview
spec:
  selector:
    app: order-service
  ports:
    - port: 80
      targetPort: 8000
# Promotion workflow
kubectl argo rollouts promote order-service          # promote green to active
kubectl argo rollouts abort order-service            # abort: traffic stays on blue
kubectl argo rollouts undo order-service             # roll back to the previous revision
kubectl argo rollouts status order-service --watch   # observe the state machine
Blue-Green on AWS ECS with CodeDeploy
# CDK: ECS blue-green via CodeDeploy
# Assumes vpc, cluster, task_def, and listener are defined elsewhere in the stack,
# and that listener forwards to blue_tg by default.
from aws_cdk import (
    Duration,
    aws_ecs as ecs,
    aws_codedeploy as codedeploy,
    aws_elasticloadbalancingv2 as elbv2,
)

# Two target groups, one per colour. Fargate tasks register by IP.
blue_tg = elbv2.ApplicationTargetGroup(self, "BlueTG",
    vpc=vpc, port=8000, protocol=elbv2.ApplicationProtocol.HTTP,
    target_type=elbv2.TargetType.IP,
    health_check=elbv2.HealthCheck(path="/health/live"),
)
green_tg = elbv2.ApplicationTargetGroup(self, "GreenTG",
    vpc=vpc, port=8000, protocol=elbv2.ApplicationProtocol.HTTP,
    target_type=elbv2.TargetType.IP,
    health_check=elbv2.HealthCheck(path="/health/live"),
)

# ECS service with CODE_DEPLOY deployment controller
service = ecs.FargateService(self, "OrderService",
    cluster=cluster,
    task_definition=task_def,
    deployment_controller=ecs.DeploymentController(
        type=ecs.DeploymentControllerType.CODE_DEPLOY,
    ),
)
# The service starts out registered with the blue target group
service.attach_to_application_target_group(blue_tg)

# CodeDeploy deployment group
deployment_group = codedeploy.EcsDeploymentGroup(self, "DeploymentGroup",
    service=service,
    blue_green_deployment_config=codedeploy.EcsBlueGreenDeploymentConfig(
        listener=listener,
        blue_target_group=blue_tg,
        green_target_group=green_tg,
        deployment_approval_wait_time=Duration.hours(1),  # manual bake time
        # termination_wait_time controls how long blue tasks survive after cutover
    ),
    # Shift all traffic in one step once the deployment proceeds
    deployment_config=codedeploy.EcsDeploymentConfig.ALL_AT_ONCE,
)
Feature Flags as Deployment Gate
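Flag platforms usually implement a percentage ramp, such as the 5% step in this section's deployment pattern, by giving each user a stable bucket so that the same user sees the same variant on every request. The following is a minimal sketch of that idea only; it is not LaunchDarkly's, Unleash's, or Flagsmith's actual algorithm, and the function names are hypothetical.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and fold into 100 buckets.
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_key: str, user_id: str, rollout_percent: int) -> bool:
    """A user is in the rollout when their bucket falls below the percentage."""
    return bucket(flag_key, user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    enabled = sum(is_enabled("new-checkout-flow", u, 5) for u in users)
    print(f"{enabled} of {len(users)} users see the feature at 5%")
```

Because the bucket depends only on the flag key and user ID, raising the percentage from 5 to 100 only ever adds users; nobody who already has the feature loses it mid-ramp.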
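# Use feature flags (LaunchDarkly / Unleash / Flagsmith) to decouple deploy from release.
# Blue-green moves traffic; flags control feature visibility.
import os

import ldclient
from ldclient import Context
from ldclient.config import Config

# The SDK must be configured before ldclient.get() is called.
# The environment variable name is an assumption; use whatever your setup provides.
ldclient.set_config(Config(os.environ["LAUNCHDARKLY_SDK_KEY"]))
ld_client = ldclient.get()


def is_new_checkout_enabled(user_id: str) -> bool:
    context = Context.builder(user_id).kind("user").build()
    # The third argument is the default returned if the flag cannot be evaluated
    return ld_client.variation("new-checkout-flow", context, False)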
# Deployment pattern:
# 1. Deploy new code to green (feature off)
# 2. Smoke test green with feature off (safe)
# 3. Shift all traffic to green (feature still off, so behaviour is unchanged)
# 4. Enable feature flag for 5% of users
# 5. Monitor metrics, then ramp to 100%
# 6. Keep blue on standby for 24h, then terminate
Analysis Templates: Automated Promotion Gates
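The success-rate metric in this section's AnalysisTemplate combines three settings: count (how many measurements to take), successCondition (what a passing measurement looks like), and failureLimit (how many failing measurements are tolerated). The sketch below is a simplified model of how they interact, assuming that any measurement which does not satisfy successCondition counts as a failure and that the run fails as soon as failures exceed failureLimit. The function is hypothetical and is not part of Argo Rollouts.

```python
from typing import Callable, Iterable


def run_analysis(measurements: Iterable[float],
                 success_condition: Callable[[float], bool],
                 count: int,
                 failure_limit: int) -> str:
    """Evaluate up to `count` measurements and return the resulting phase."""
    failures = 0
    taken = 0
    for value in measurements:
        if taken == count:
            break
        taken += 1
        if not success_condition(value):
            failures += 1
            if failures > failure_limit:
                # Too many bad measurements: stop early and block promotion.
                return "Failed"
    # Fewer than `count` measurements means the run has not finished yet.
    return "Successful" if taken == count else "Running"


if __name__ == "__main__":
    # Mirrors the success-rate metric: count=5, failureLimit=2, result >= 0.99
    samples = [0.995, 0.98, 0.999, 0.97, 0.995]
    print(run_analysis(samples, lambda r: r >= 0.99, count=5, failure_limit=2))
```

With count: 5 and failureLimit: 2, two dips below 99% are tolerated, and a third fails the analysis and blocks promotion.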
# argo-rollout-analysis.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      count: 5
      successCondition: result[0] >= 0.99  # 99% success rate required
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status!~"5.."
            }[2m])) /
            sum(rate(http_requests_total{
              service="{{args.service-name}}"
            }[2m]))
    - name: p99-latency
      interval: 30s
      count: 5
      successCondition: result[0] <= 0.5  # 500ms p99 threshold
      provider:
        prometheus:
          address: http://prometheus:9090
          # Aggregate buckets by le so the query returns one series across all pods
          query: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}"
              }[2m]))
            )
Common Failure Cases
Green environment not receiving production-equivalent traffic during bake time
Why: Green is tested in isolation or with synthetic traffic; real user patterns (session cookies, large payloads, authenticated flows) only appear after the full cutover.
Detect: Error rate spikes within the first 5 minutes after promotion that were absent during bake.
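Fix: Exercise green with production-like traffic during bake, for example by mirroring or replaying real requests against previewService (the blue-green strategy has no weighted routing, so previewService receives no live user traffic on its own), or use canary instead of pure blue-green for stateful-heavy flows.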
Active service selector not updating after Argo Rollouts promotion
Why: A misconfigured activeService name in the Rollout spec causes the active Service's selector to remain pointing at the old (blue) pods after promotion.
Detect: kubectl argo rollouts get rollout <name> shows Status: Healthy but live traffic still returns old responses; run kubectl get svc order-service-active -o yaml and confirm that the rollouts-pod-template-hash value in its selector matches the new (green) ReplicaSet.
Fix: Verify the activeService and previewService fields in the Rollout spec exactly match the Service resource names; re-apply the corrected spec and re-promote.
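The name check in this fix can be automated before a Rollout is applied. The sketch below assumes the manifests have already been parsed into plain dictionaries (for instance from kubectl get ... -o json); the function name is hypothetical.

```python
def check_service_names(rollout: dict, service_names: set) -> list:
    """Return a list of problems; an empty list means the names line up."""
    blue_green = rollout.get("spec", {}).get("strategy", {}).get("blueGreen")
    if blue_green is None:
        return ["spec.strategy.blueGreen is missing"]

    problems = []
    active = blue_green.get("activeService")
    preview = blue_green.get("previewService")  # optional in the Rollout spec

    if not active:
        problems.append("activeService is not set")
    elif active not in service_names:
        problems.append(f"activeService '{active}' has no matching Service")
    if preview and preview not in service_names:
        problems.append(f"previewService '{preview}' has no matching Service")
    return problems


if __name__ == "__main__":
    rollout = {"spec": {"strategy": {"blueGreen": {
        "activeService": "order-service-actve",  # deliberate typo
        "previewService": "order-service-preview",
    }}}}
    services = {"order-service-active", "order-service-preview"}
    for problem in check_service_names(rollout, services):
        print(problem)
```

Running a check like this in CI catches a misspelled Service name before it reaches the cluster.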
Database schema incompatibility between blue and green versions
Why: Green deploys a schema migration (column rename, type change) that blue cannot tolerate; if rollback is triggered, blue writes corrupt data or crashes.
Detect: Blue pods emit schema validation errors or ORM mapping failures immediately after rollback.
Fix: Enforce expand-contract migrations: green must be backward-compatible with blue's schema for the entire bake window; only run destructive DDL after blue is fully terminated.
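An expand-contract migration can be sketched with an in-memory SQLite database. The orders table, its columns, and the helper functions are hypothetical and chosen only to illustrate the pattern: green adds a column alongside the old one and writes both, so blue's queries keep working if traffic is rolled back.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    """Expand: add the new column next to the old one. Nothing is removed."""
    conn.execute("ALTER TABLE orders ADD COLUMN total_cents INTEGER")


def backfill(conn: sqlite3.Connection) -> None:
    """Fill the new column for rows written by blue (old code)."""
    conn.execute(
        "UPDATE orders SET total_cents = CAST(ROUND(total * 100) AS INTEGER) "
        "WHERE total_cents IS NULL"
    )


def blue_insert(conn: sqlite3.Connection, total: float) -> None:
    """Old code: only knows about the original column."""
    conn.execute("INSERT INTO orders (total) VALUES (?)", (total,))


def green_insert(conn: sqlite3.Connection, total_cents: int) -> None:
    """New code: dual-writes so blue can still read rows after a rollback."""
    conn.execute(
        "INSERT INTO orders (total, total_cents) VALUES (?, ?)",
        (total_cents / 100.0, total_cents),
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    blue_insert(conn, 19.99)
    expand(conn)             # safe while blue is still serving traffic
    green_insert(conn, 250)  # green writes both columns
    blue_insert(conn, 5.0)   # blue still works after the expand step
    backfill(conn)
    print(conn.execute("SELECT id, total, total_cents FROM orders").fetchall())
    # Contract (dropping the old column) would run only after blue is terminated.
```

The destructive contract step is deliberately left out of the script; it belongs in a later release, once rollback to blue is no longer possible.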
CodeDeploy deployment stuck waiting for ELB health checks on green
Why: The green task definition's health check path or port does not match the ALB target group configuration, causing targets to remain unhealthy indefinitely.
Detect: CodeDeploy console shows deployment in Created state with Waiting for ELB health check beyond the expected timeout.
Fix: Confirm the ALB target group health check path (/health/live) and port match the container's exposed port; verify the security group allows the ALB to reach the green tasks on that port.
Connections
cloud-hub · cloud/argo-rollouts · cloud/argocd · cloud/kubernetes · cloud/aws-ecs · cloud/github-actions · cloud/gitops-patterns
Open Questions
- What monitoring and alerting matter most when this is deployed in production?
- At what scale or workload does this approach hit its practical limits?
Related reading