Platform Engineering

Building and operating an Internal Developer Platform (IDP) that enables product teams to self-serve infrastructure, deployments, and tooling — without needing deep ops expertise.

Building and operating an Internal Developer Platform (IDP) that enables product teams to self-serve infrastructure, deployments, and tooling. Without needing deep ops expertise. Platform engineering treats developers as customers.


Why Platform Engineering

Without an IDP:
  - Every team manages their own Kubernetes YAML, CI pipelines, monitoring
  - Duplicated effort; inconsistent security posture
  - Senior engineers become glue — unblocking others instead of building

With an IDP:
  - Teams self-serve: "create new service" → golden path handles CI/CD, observability, secrets
  - Platform team provides paved roads; teams stay in the fast lane
  - Consistency without mandating every decision

The SPACE Framework

Satisfaction — are developers happy with the platform?
Performance — deployment frequency, change lead time
Activity — code commits, PR merges, deployments per team
Communication — documentation usage, support tickets
Efficiency — time to onboard new service, time to prod


Backstage

CNCF project for building IDPs. Provides a service catalog, software templates (scaffolders), TechDocs, and a plugin ecosystem.

npx @backstage/create-app@latest
cd backstage
yarn dev   # localhost:3000
# catalog-info.yaml (every service registers itself)
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: order-service
  description: Handles order creation and lifecycle
  annotations:
    github.com/project-slug: myorg/order-service
    backstage.io/techdocs-ref: dir:.
    prometheus.io/rule: |
      sum(rate(http_requests_total{job="order-service"}[5m]))
  tags: [python, fastapi, production]
  links:
  - url: https://grafana.mycompany.com/d/order-service
    title: Grafana Dashboard
spec:
  type: service
  lifecycle: production
  owner: team-commerce
  system: checkout
  dependsOn:
  - resource:default/orders-db
  - component:default/payment-service

Software Templates (Scaffolders)

# Template that creates a new Python service with all golden paths pre-wired
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: python-fastapi-service
  title: Python FastAPI Service
  description: Creates a new service with CI/CD, observability, and secrets management
spec:
  owner: platform-team
  type: service
  parameters:
  - title: Service Details
    properties:
      name:
        title: Service Name
        type: string
        pattern: '^[a-z][a-z0-9-]*$'
      owner:
        title: Owning Team
        type: string
        ui:field: OwnerPicker
      environment:
        title: Initial Environment
        type: string
        enum: [staging, production]
  steps:
  - id: fetch-template
    name: Fetch Template
    action: fetch:template
    input:
      url: ./content
      values:
        name: ${{ parameters.name }}
        owner: ${{ parameters.owner }}
  - id: publish
    name: Publish to GitHub
    action: publish:github
    input:
      repoUrl: github.com?owner=myorg&repo=${{ parameters.name }}
      defaultBranch: main
  - id: register
    name: Register in Catalog
    action: catalog:register
    input:
      repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
      catalogInfoPath: /catalog-info.yaml

Golden Path Elements

A golden path is the recommended, supported way to do common tasks:

New service golden path:
  1. Use Backstage template → creates GitHub repo with boilerplate
  2. GitHub Actions CI pre-configured → test + build + push image
  3. ArgoCD ApplicationSet auto-detects new repo → deploys to staging
  4. Observability pre-wired → Prometheus metrics, Loki logs, OTel tracing
  5. Secrets via External Secrets Operator → team requests access to their namespace
  6. Service catalog entry created automatically

Developer effort: fill in 5 fields in Backstage UI
Platform effort: maintain the template, update it once for all services

Platform Team KPIs

DORA metrics (measure platform impact on product teams):
  Deployment frequency: daily → multiple per day (target)
  Lead time for changes: < 1 day (target)
  Change failure rate: < 5%
  Time to restore: < 1 hour

Platform-specific:
  Mean time to onboard new service: < 2 hours
  % teams using golden path: > 80%
  Support tickets per team per month: trending down
  Self-service rate: % of requests resolved without platform team involvement

Common Failure Cases

Backstage catalog goes stale and teams stop trusting it Why: catalog-info.yaml files drift from the actual service state because teams create new repos without going through the golden path, and there is no automated enforcement to register catalog entries. Detect: team members report finding outdated ownership information; the number of catalog components has not grown in weeks despite new services being deployed. Fix: add a CI check that fails if a repo lacks a valid catalog-info.yaml; use Backstage's catalog-import GitHub Action to auto-register new repos on PR merge.

Software template (scaffolder) generates broken repos because the template is not tested Why: the template references a variable or action that changed in the Backstage scaffolder plugin version without the template being updated; teams run it and get a repo in a broken state. Detect: scaffolder job shows green but the generated repo fails its initial CI run; error messages reference undefined template variables. Fix: version-lock scaffolder plugin updates; test all templates in a staging Backstage instance before promoting to production; add a post-create CI step that runs the generated repo's npm install and build.

Golden path adoption stalls because teams perceive it as slower than doing it manually Why: the scaffolder pipeline is a bottleneck — Backstage calls GitHub, waits for repo creation, then waits for ArgoCD to detect the new app, producing multi-minute waits that feel worse than git init + copy-paste. Detect: platform support tickets contain "it was faster to do it without Backstage"; adoption percentage stalls below 50%. Fix: benchmark and optimise the critical path (typically ArgoCD ApplicationSet sync interval — reduce from 3m to 30s); show a live progress UI in the scaffolder step so teams see work happening rather than a spinner.

DORA metrics look good but developer experience is poor Why: deployment frequency is high because automated rollbacks count as deployments, and lead time is measured from PR merge to deployment rather than from code-complete to customer; the metrics are gamed by the tooling. Detect: SPACE satisfaction scores are low despite strong DORA numbers; teams report rollbacks as a daily occurrence. Fix: separately track rollback frequency and change failure rate; set alerts when rollback rate exceeds 10%; treat rollback-inflated deployment frequency as a lagging signal, not a leading one.

Connections

cloud-hub · cloud/gitops-patterns · cloud/argocd · cloud/kubernetes · cloud/github-actions · cloud/observability-stack · cloud/secrets-management

Open Questions

  • What monitoring and alerting matter most when this is deployed in production?
  • At what scale or workload does this approach hit its practical limits?