Platform Engineering
Building and operating an Internal Developer Platform (IDP) that enables product teams to self-serve infrastructure, deployments, and tooling without needing deep ops expertise. Platform engineering treats developers as customers.
Why Platform Engineering
Without an IDP:
- Every team manages their own Kubernetes YAML, CI pipelines, monitoring
- Duplicated effort; inconsistent security posture
- Senior engineers become glue — unblocking others instead of building
With an IDP:
- Teams self-serve: "create new service" → golden path handles CI/CD, observability, secrets
- Platform team provides paved roads; teams stay in the fast lane
- Consistency without mandating every decision
The SPACE Framework
Satisfaction and well-being — are developers happy with the platform?
Performance — deployment frequency, change lead time
Activity — code commits, PR merges, deployments per team
Communication and collaboration — documentation usage, support tickets
Efficiency and flow — time to onboard a new service, time to prod
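The five dimensions can be rolled up into a simple periodic snapshot. A minimal sketch, assuming hypothetical field names and thresholds (these are illustrative, not part of any standard SPACE tooling):

```python
from dataclasses import dataclass

@dataclass
class SpaceSnapshot:
    satisfaction: float          # mean score from a quarterly developer survey (1-5)
    deploys_per_day: float       # Performance: deployment frequency
    lead_time_hours: float       # Performance: median change lead time
    prs_per_engineer_week: float # Activity: merged PRs per engineer per week
    support_tickets_month: int   # Communication: platform tickets per month
    time_to_prod_hours: float    # Efficiency: hours from "create service" to prod

    def warnings(self) -> list[str]:
        """Flag dimensions that look unhealthy (thresholds are illustrative)."""
        w = []
        if self.satisfaction < 3.5:
            w.append("satisfaction below 3.5: survey the pain points")
        if self.lead_time_hours > 24:
            w.append("lead time above 1 day: inspect the CI/CD critical path")
        if self.support_tickets_month > 20:
            w.append("ticket volume high: docs or self-service gap")
        return w

snap = SpaceSnapshot(3.2, 4.0, 30.0, 6.5, 25, 3.0)
print(snap.warnings())
```

The point of combining them in one record is the cross-check: no single dimension (e.g. deployment frequency) is allowed to look healthy in isolation.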
Backstage
CNCF project for building IDPs. Provides a service catalog, software templates (scaffolders), TechDocs, and a plugin ecosystem.
npx @backstage/create-app@latest
cd backstage
yarn dev # localhost:3000

# catalog-info.yaml (every service registers itself)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  description: Handles order creation and lifecycle
  annotations:
    github.com/project-slug: myorg/order-service
    backstage.io/techdocs-ref: dir:.
    prometheus.io/rule: |
      sum(rate(http_requests_total{job="order-service"}[5m]))
  tags: [python, fastapi, production]
  links:
    - url: https://grafana.mycompany.com/d/order-service
      title: Grafana Dashboard
spec:
  type: service
  lifecycle: production
  owner: team-commerce
  system: checkout
  dependsOn:
    - resource:default/orders-db
    - component:default/payment-service

Software Templates (Scaffolders)
# Template that creates a new Python service with all golden paths pre-wired
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: python-fastapi-service
  title: Python FastAPI Service
  description: Creates a new service with CI/CD, observability, and secrets management
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service Details
      properties:
        name:
          title: Service Name
          type: string
          pattern: '^[a-z][a-z0-9-]*$'
        owner:
          title: Owning Team
          type: string
          ui:field: OwnerPicker
        environment:
          title: Initial Environment
          type: string
          enum: [staging, production]
  steps:
    - id: fetch-template
      name: Fetch Template
      action: fetch:template
      input:
        url: ./content
        values:
          name: ${{ parameters.name }}
          owner: ${{ parameters.owner }}
    - id: publish
      name: Publish to GitHub
      action: publish:github
      input:
        repoUrl: github.com?owner=myorg&repo=${{ parameters.name }}
        defaultBranch: main
    - id: register
      name: Register in Catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml

Golden Path Elements
A golden path is the recommended, supported way to do common tasks:
New service golden path:
1. Use Backstage template → creates GitHub repo with boilerplate
2. GitHub Actions CI pre-configured → test + build + push image
3. ArgoCD ApplicationSet auto-detects new repo → deploys to staging
4. Observability pre-wired → Prometheus metrics, Loki logs, OTel tracing
5. Secrets via External Secrets Operator → team requests access to their namespace
6. Service catalog entry created automatically
Developer effort: fill in a handful of fields in the Backstage UI
Platform effort: maintain the template, update it once for all services
Platform Team KPIs
DORA metrics (measure platform impact on product teams):
Deployment frequency: daily → multiple per day (target)
Lead time for changes: < 1 day (target)
Change failure rate: < 5%
Time to restore: < 1 hour
Platform-specific:
Mean time to onboard new service: < 2 hours
% teams using golden path: > 80%
Support tickets per team per month: trending down
Self-service rate: % of requests resolved without platform team involvement
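These KPIs can be computed from the CD system's event log. A minimal sketch, with hypothetical event records (a real pipeline would pull these from the deployment tool's API rather than hard-code them):

```python
from datetime import datetime

# Hypothetical deploy log: merge time, deploy time, failure flag,
# and restoration time (minutes) when the deploy failed.
deploys = [
    {"merged": datetime(2024, 5, 1, 9),  "deployed": datetime(2024, 5, 1, 13), "failed": False},
    {"merged": datetime(2024, 5, 1, 10), "deployed": datetime(2024, 5, 2, 10), "failed": True,
     "restored_minutes": 40},
    {"merged": datetime(2024, 5, 2, 8),  "deployed": datetime(2024, 5, 2, 9),  "failed": False},
]
days_observed = 2

deploy_frequency = len(deploys) / days_observed  # deploys per day
lead_times = [(d["deployed"] - d["merged"]).total_seconds() / 3600 for d in deploys]
lead_time_hours = sorted(lead_times)[len(lead_times) // 2]  # median, in hours
failures = [d for d in deploys if d["failed"]]
change_failure_rate = len(failures) / len(deploys)
mttr_minutes = sum(d["restored_minutes"] for d in failures) / len(failures)

# Self-service rate: requests resolved without a platform engineer touching them
requests_total, requests_escalated = 120, 18
self_service_rate = 1 - requests_escalated / requests_total

print(deploy_frequency, lead_time_hours, change_failure_rate, mttr_minutes, self_service_rate)
```

Median lead time (rather than mean) keeps one stuck PR from dominating the number.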
Common Failure Cases
Backstage catalog goes stale and teams stop trusting it
Why: catalog-info.yaml files drift from the actual service state because teams create new repos without going through the golden path, and there is no automated enforcement to register catalog entries.
Detect: team members report finding outdated ownership information; the number of catalog components has not grown in weeks despite new services being deployed.
Fix: add a CI check that fails if a repo lacks a valid catalog-info.yaml; use Backstage's catalog-import GitHub Action to auto-register new repos on PR merge.
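The CI check in that fix can be a small script run on every PR. A minimal sketch (a real check would use a YAML parser and validate against Backstage's entity schema; the regex scan here is a stand-in to keep the example dependency-free):

```python
import re
import sys
from pathlib import Path

# Fields every catalog entry must carry; patterns are a rough stand-in
# for real schema validation.
REQUIRED_PATTERNS = {
    "kind": re.compile(r"^kind:\s*\w+", re.M),
    "metadata.name": re.compile(r"^\s+name:\s*\S+", re.M),
    "spec.owner": re.compile(r"^\s+owner:\s*\S+", re.M),
}

def check_catalog_info(repo_root: str) -> list[str]:
    """Return a list of problems; an empty list means the repo passes."""
    path = Path(repo_root) / "catalog-info.yaml"
    if not path.exists():
        return ["missing catalog-info.yaml (register the service in Backstage)"]
    text = path.read_text()
    return [f"catalog-info.yaml lacks {field}"
            for field, pat in REQUIRED_PATTERNS.items() if not pat.search(text)]

if __name__ == "__main__":
    problems = check_catalog_info(".")
    for p in problems:
        print(f"ERROR: {p}")
    sys.exit(1 if problems else 0)
```

Wired into the organization-wide CI workflow, this makes an unregistered repo fail its first pipeline run, which is what keeps the catalog trustworthy.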
Software template (scaffolder) generates broken repos because the template is not tested
Why: the template references a variable or action that changed in the Backstage scaffolder plugin version without the template being updated; teams run it and get a repo in a broken state.
Detect: scaffolder job shows green but the generated repo fails its initial CI run; error messages reference undefined template variables.
Fix: version-lock scaffolder plugin updates; test all templates in a staging Backstage instance before promoting to production; add a post-create CI step that installs the generated repo's dependencies and runs its build and tests (for the Python template, pip install and pytest rather than npm install).
Golden path adoption stalls because teams perceive it as slower than doing it manually
Why: the scaffolder pipeline is a bottleneck — Backstage calls GitHub, waits for repo creation, then waits for ArgoCD to detect the new app, producing multi-minute waits that feel worse than git init + copy-paste.
Detect: platform support tickets contain "it was faster to do it without Backstage"; adoption percentage stalls below 50%.
Fix: benchmark and optimise the critical path (typically ArgoCD ApplicationSet sync interval — reduce from 3m to 30s); show a live progress UI in the scaffolder step so teams see work happening rather than a spinner.
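The payoff of tuning the sync interval is easy to estimate: with a polling interval T, a newly created repo waits T/2 on average (T in the worst case) before ArgoCD notices it. A quick sketch of the arithmetic, using the 3m/30s figures from the fix above:

```python
def expected_wait_seconds(poll_interval_s: float) -> float:
    """Average detection latency for a poller: uniform arrival => interval / 2."""
    return poll_interval_s / 2

# 3-minute vs 30-second ApplicationSet sync interval
before = expected_wait_seconds(180)  # 90s average, 180s worst case
after = expected_wait_seconds(30)    # 15s average, 30s worst case
print(before, after, before / after)
```

A 6x cut in the slowest step of the critical path is often the difference between "faster than git init" and not.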
DORA metrics look good but developer experience is poor
Why: deployment frequency is high because automated rollbacks count as deployments, and lead time is measured from PR merge to deployment rather than from code-complete to customer; the metrics are gamed by the tooling.
Detect: SPACE satisfaction scores are low despite strong DORA numbers; teams report rollbacks as a daily occurrence.
Fix: separately track rollback frequency and change failure rate; set alerts when rollback rate exceeds 10%; treat rollback-inflated deployment frequency as a lagging signal, not a leading one.
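Separating rollbacks out of the deployment count is a small filtering step. A minimal sketch with hypothetical deploy-log records:

```python
ROLLBACK_ALERT_THRESHOLD = 0.10  # alert when more than 10% of deploys are rollbacks

# Hypothetical deploy log: each entry flags whether it was a rollback.
deploy_log = [
    {"sha": "a1",  "rollback": False},
    {"sha": "b2",  "rollback": False},
    {"sha": "b2r", "rollback": True},   # automated rollback of b2
    {"sha": "c3",  "rollback": False},
]

# Report only forward deploys as "deployment frequency"; counting rollbacks
# inflates the metric while hiding the underlying instability.
forward_deploys = [d for d in deploy_log if not d["rollback"]]
rollbacks = [d for d in deploy_log if d["rollback"]]

rollback_rate = len(rollbacks) / len(deploy_log)
alert = rollback_rate > ROLLBACK_ALERT_THRESHOLD

print(len(forward_deploys), round(rollback_rate, 2), alert)
```

With rollback rate tracked as its own first-class number, a "great" deployment frequency built on daily rollbacks trips the alert instead of flattering the dashboard.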
Connections
cloud-hub · cloud/gitops-patterns · cloud/argocd · cloud/kubernetes · cloud/github-actions · cloud/observability-stack · cloud/secrets-management
Open Questions
- What monitoring and alerting matter most when this is deployed in production?
- At what scale or workload does this approach hit its practical limits?
Related reading