GCP Core Services

Google Cloud Platform. Second cloud for most teams; first cloud for ML/AI workloads and analytics. GCP leads on Kubernetes (invented it), BigQuery, and Vertex AI. 12% market share (2026).


Compute

Compute Engine

Virtual machines. Machine types: general purpose (N2, E2), compute-optimised (C3), memory-optimised (M3), accelerator-optimised (A3, with H100 GPUs). Spot VMs (the successor to Preemptible VMs) are up to 91% cheaper and can be preempted with 30 seconds' notice.

Cloud Run

Serverless containers. Bring your own container image; GCP handles scaling, including down to zero. Instances are request-driven (min-instances=0 to minimise cost) or kept warm (min-instances of 1 or more). Unlike Lambda's 15-minute cap, request timeouts can be raised to a maximum of 60 minutes. Ideal for APIs and event-driven workloads.

# Deploy a container
gcloud run deploy my-api \
  --image gcr.io/my-project/my-api:latest \
  --region europe-west1 \
  --allow-unauthenticated \
  --min-instances 0 \
  --max-instances 100

GKE — Google Kubernetes Engine

Managed Kubernetes. Google created Kubernetes, and GKE is the most feature-complete managed offering. Autopilot mode (GKE manages nodes, bin-packing, security hardening) vs Standard mode (you manage node pools). As of 2025, Autopilot supports Spot pods and A100/H100 GPU node pools.

Key GKE features:

  • Vertical Pod Autoscaler — automatically right-sizes CPU/memory requests
  • Node auto-provisioning — creates node pools sized for pending pods
  • Binary Authorization — only run signed container images
  • Workload Identity — a Kubernetes service account maps to a Google service account, no service account keys

# Create Autopilot cluster
gcloud container clusters create-auto my-cluster \
  --region europe-west1

# Get credentials
gcloud container clusters get-credentials my-cluster --region europe-west1

Cloud Functions

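Event-driven functions (Gen 2, now branded Cloud Run functions). Runtimes: Node.js, Python, Go, Java, .NET, Ruby, PHP. Timeout up to 60 minutes for HTTP functions and 9 minutes for event-driven functions (Gen 2). VPC connector for private network access. Trigger via HTTP, Pub/Sub, Cloud Storage, Firestore, Cloud Scheduler.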


Storage

Cloud Storage

Object storage (equivalent to S3). Global namespace, 11-nines durability. Storage classes: Standard → Nearline (30-day min) → Coldline (90-day min) → Archive (365-day min). Autoclass auto-transitions objects between classes based on access.

# Create bucket
gcloud storage buckets create gs://my-bucket --location=europe-west1

# Upload
gcloud storage cp file.txt gs://my-bucket/

# Signed URL (1-hour expiry)
gcloud storage sign-url gs://my-bucket/file.txt --duration=1h
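
The minimum storage durations above affect cost: an object deleted or moved out of its class early is billed as though it had stayed for the full minimum. A minimal sketch of that rule, using only the durations listed above. The function name is illustrative, and Standard is assumed to have no minimum:

```python
# Minimum storage duration in days per class, as listed above.
# Assumption: Standard has no minimum duration.
MIN_STORAGE_DAYS = {
    "STANDARD": 0,
    "NEARLINE": 30,
    "COLDLINE": 90,
    "ARCHIVE": 365,
}


def billable_days(storage_class: str, days_stored: int) -> int:
    """Return the days billed for an object removed after days_stored days."""
    if days_stored < 0:
        raise ValueError("days_stored must be non-negative")
    minimum = MIN_STORAGE_DAYS[storage_class.upper()]
    # Early deletion is charged up to the minimum duration.
    return max(days_stored, minimum)


print(billable_days("NEARLINE", 10))   # 30
print(billable_days("COLDLINE", 120))  # 120
```

A check like this helps decide whether short-lived objects belong in Standard rather than a colder class.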

Cloud SQL / Spanner / Firestore

  • Cloud SQL — managed PostgreSQL, MySQL, SQL Server. HA via a regional configuration (standby instance in a second zone with automatic failover); read replicas scale reads.
  • Spanner — globally distributed RDBMS. Strong consistency across regions. Expensive but unique for global transactional workloads.
  • Firestore — serverless NoSQL (document/collection). Real-time listeners. Native mobile SDK.

Persistent Disk / Filestore

  • Persistent Disk — block storage for Compute Engine and GKE. SSD (pd-ssd) or standard HDD.
  • Filestore — managed NFS. Shared filesystem for GKE workloads needing ReadWriteMany.

Analytics and AI

BigQuery

Serverless data warehouse. Columnar storage, separate compute and storage. Pricing: $6.25/TB scanned (on-demand) or slot-based capacity pricing (BigQuery editions, which replaced flat-rate). External tables query data in Cloud Storage (Iceberg/Parquet/Avro), Cloud Bigtable, and Google Sheets.

-- Query public dataset, only scan what you need
SELECT
  departure_airport,
  COUNT(*) AS flights,
  AVG(arr_delay) AS avg_delay
FROM `bigquery-public-data.airline_ontime_data.flights`
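-- Filter on the raw date column so partition pruning can apply if the table is partitioned
WHERE fl_date BETWEEN '2023-01-01' AND '2023-12-31'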
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;
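
To make the on-demand price concrete, here is a rough estimator for a query's cost from its bytes billed. It is a sketch under stated assumptions: the $6.25 rate quoted above is applied per TiB (2^40 bytes), and any free tier or per-query minimum is ignored. The function name is illustrative.

```python
ON_DEMAND_USD_PER_TIB = 6.25  # on-demand rate quoted above (assumed per TiB)
BYTES_PER_TIB = 2 ** 40


def on_demand_cost_usd(bytes_billed: int,
                       usd_per_tib: float = ON_DEMAND_USD_PER_TIB) -> float:
    """Estimate the on-demand cost of a query from its bytes billed."""
    if bytes_billed < 0:
        raise ValueError("bytes_billed must be non-negative")
    return bytes_billed / BYTES_PER_TIB * usd_per_tib


# A query that bills 500 GiB
print(f"{on_demand_cost_usd(500 * 2**30):.2f}")  # 3.05
# A query that bills 2 TiB
print(f"{on_demand_cost_usd(2 * 2**40):.2f}")    # 12.50
```

The bytes-billed figure for a real query appears in its job details, so the same arithmetic can be applied to job history.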

BigQuery ML: train models in SQL (CREATE MODEL). Support for linear regression, boosted trees, k-means, matrix factorisation, and importing TensorFlow, ONNX, and XGBoost models (PyTorch models can be imported after export to ONNX).

Vertex AI

GCP's unified ML platform (renamed from AI Platform in 2021, expanded to Gemini Enterprise Agent Platform in 2025). Key components:

  • Model Garden — 130+ foundation models (Gemini, Imagen, Codey, open-source)
  • Gemini API — same models as Google AI Studio, enterprise SLAs, VPC Service Controls
  • Training — managed custom training jobs (GPU/TPU pools)
  • Prediction — online (REST endpoints) and batch prediction
  • Pipelines — KFP-based ML pipelines with caching and artifact tracking
  • Feature Store — centralised feature serving (online: Bigtable-backed, offline: BigQuery-backed)

# Call Gemini on Vertex AI with the Google Gen AI SDK (pip install google-genai)
from google import genai

# vertexai=True routes requests through Vertex AI using Application Default Credentials
client = genai.Client(vertexai=True, project="my-project", location="europe-west1")

response = client.models.generate_content(
    model="gemini-1.5-pro",  # substitute a model ID currently available in your region
    contents="Explain BigQuery slots.",
)
print(response.text)

Pub/Sub

Managed message queue / event streaming. At-least-once delivery by default; exactly-once delivery is available on pull subscriptions. Push (HTTP webhook) or pull subscriptions. Integrates with Dataflow for streaming pipelines.
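
Because delivery is at-least-once, a subscriber may see the same message more than once, so handlers are usually written to be idempotent. A minimal in-memory sketch of deduplicating on message ID follows. The class and method names are hypothetical and this is not the Pub/Sub client API; a real deployment would keep the seen IDs in a shared store with expiry.

```python
from typing import Callable


class IdempotentHandler:
    """Wraps a handler so each message ID is processed at most once."""

    def __init__(self, handler: Callable[[bytes], None]) -> None:
        self._handler = handler
        self._seen = set()  # in-memory only; lost on restart

    def handle(self, message_id: str, data: bytes) -> bool:
        """Process the message; return False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(data)
        # Record the ID only after the handler succeeds, so a failed
        # attempt is processed again on redelivery.
        self._seen.add(message_id)
        return True


processed = []
consumer = IdempotentHandler(processed.append)
consumer.handle("m-1", b"order created")
consumer.handle("m-1", b"order created")  # redelivery is skipped
consumer.handle("m-2", b"order paid")
print(processed)  # [b'order created', b'order paid']
```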


Networking

VPC

Same concept as AWS VPC. GCP VPCs are global (subnets are regional). Shared VPC for org-wide network management. VPC peering and Cloud Interconnect (dedicated 10G/100G links).
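
Since subnets are regional inside a global VPC, address planning usually means giving each region its own non-overlapping range. A small sketch of carving per-region ranges out of one planning block with the standard library follows. The parent range, prefix length, and function name are assumptions for illustration; a GCP VPC has no CIDR of its own, so the parent block is only a planning aid.

```python
import ipaddress
from itertools import islice


def plan_subnets(parent_cidr: str, regions: list, new_prefix: int) -> dict:
    """Assign one non-overlapping subnet range per region from parent_cidr."""
    parent = ipaddress.ip_network(parent_cidr)
    blocks = list(islice(parent.subnets(new_prefix=new_prefix), len(regions)))
    if len(blocks) < len(regions):
        raise ValueError("parent range is too small for the requested regions")
    return {region: str(block) for region, block in zip(regions, blocks)}


plan = plan_subnets("10.0.0.0/16", ["europe-west1", "us-central1"], 20)
print(plan)  # {'europe-west1': '10.0.0.0/20', 'us-central1': '10.0.16.0/20'}
```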

Cloud Load Balancing

Global anycast L7 (HTTPS LB). Single IP, traffic routed to nearest healthy backend. Regional L4 (TCP/UDP). Cloud Armor for WAF and DDoS protection (integrated with L7 LB).

Cloud DNS

Managed DNS. 100% SLA. Private zones for VPC-internal resolution.


Security and Identity

IAM

Principal → Role → Permissions → Resources

Principals: Google Accounts, Service Accounts, Google Groups, Cloud Identity domains. Roles: Basic (Owner/Editor/Viewer, avoid), Predefined (e.g., roles/storage.objectViewer), Custom.
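
The principal-to-role mapping can be pictured as a policy made of bindings, each pairing one role with a list of members. The sketch below manipulates a plain dict of that shape. The helper names and the group address are hypothetical, and conditions and etags are ignored.

```python
def add_binding(policy: dict, role: str, member: str) -> dict:
    """Add member to role, merging into an existing binding for that role."""
    for binding in policy.setdefault("bindings", []):
        if binding["role"] == role:
            if member not in binding["members"]:
                binding["members"].append(member)
            return policy
    policy["bindings"].append({"role": role, "members": [member]})
    return policy


def roles_for(policy: dict, member: str) -> list:
    """List the roles directly bound to a member in this policy."""
    return sorted(
        b["role"] for b in policy.get("bindings", []) if member in b["members"]
    )


policy = {"bindings": []}
sa = "serviceAccount:my-sa@my-project.iam.gserviceaccount.com"
add_binding(policy, "roles/storage.objectViewer", sa)
add_binding(policy, "roles/storage.objectViewer", "group:data-team@example.com")
print(roles_for(policy, sa))    # ['roles/storage.objectViewer']
print(len(policy["bindings"]))  # 1
```

Keeping one binding per role, as here, mirrors how members are grouped under a role rather than roles being attached to each member separately.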

Workload Identity Federation — allow external identities (GitHub Actions OIDC, AWS IAM, Microsoft Entra ID, formerly Azure AD) to impersonate GCP service accounts without keys.

# Grant role
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:my-sa@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

Secret Manager

Store and version secrets. Automatic replication across regions. A rotation schedule publishes a Pub/Sub notification when rotation is due; your own code performs the rotation by adding a new version. Access via SDK or mounted as volume in Cloud Run/GKE.

from google.cloud import secretmanager

client = secretmanager.SecretManagerServiceClient()
name = "projects/my-project/secrets/db-password/versions/latest"
secret = client.access_secret_version(request={"name": name})
print(secret.payload.data.decode("UTF-8"))

Key CLI

# Auth
gcloud auth login
gcloud auth application-default login   # for SDK use in code
gcloud config set project my-project

# List resources
gcloud compute instances list
gcloud run services list --region europe-west1
gcloud container clusters list

# IAM
gcloud iam service-accounts list

GCP vs AWS Equivalents

GCP            | AWS
Compute Engine | EC2
Cloud Run      | App Runner / Lambda (container)
GKE            | EKS
Cloud Storage  | S3
Cloud SQL      | RDS
Spanner        | Aurora Global
BigQuery       | Redshift
Pub/Sub        | SQS + SNS
Vertex AI      | SageMaker
Cloud Armor    | WAF / Shield
Secret Manager | Secrets Manager

Common Failure Cases

Cloud Run container exits immediately with exit code 1 because the port does not match PORT env var

  • Why: Cloud Run injects the PORT environment variable (default 8080) and expects the container to listen on it; a hardcoded port in the CMD that differs from PORT causes the health check to fail and the container to be restarted in a loop.
  • Detect: Cloud Run deployment shows "Container failed to start. Failed to start and then listen on the port defined by the PORT environment variable"; the service never reaches a healthy state.
  • Fix: Read the port from os.environ.get("PORT", "8080") in the application startup code rather than hardcoding; confirm with gcloud run services describe <service> --format="value(status.conditions)".
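
A minimal sketch of the fix using only the standard library: the server takes its port from PORT and falls back to 8080. The handler and function names are illustrative, and a real service would normally use its own web framework.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def get_port(default: int = 8080) -> int:
    """Read the listening port from the PORT environment variable."""
    return int(os.environ.get("PORT", default))


class Handler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main() -> None:
    # Bind to 0.0.0.0 so the server is reachable from outside the container.
    server = HTTPServer(("0.0.0.0", get_port()), Handler)
    server.serve_forever()
```

Calling main() from the container entrypoint starts the server; running it locally with PORT=9090 set is a quick way to confirm the variable is honoured.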

Workload Identity binding on the wrong namespace or service account causes 403s

  • Why: The Kubernetes service account annotated with iam.gke.io/gcp-service-account must match exactly the Google service account bound in IAM; a typo in the namespace or KSA name means the binding never matches at token request time.
  • Detect: Pod logs show Permission denied when calling GCP APIs despite the pod using the correct service account; gcloud iam service-accounts get-iam-policy shows a binding that references a different namespace or KSA.
  • Fix: Verify the annotation on the KSA with kubectl get sa <ksa-name> -n <namespace> -o yaml; ensure the IAM binding format is exactly serviceAccount:<project>.svc.id.goog[<namespace>/<ksa-name>].
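
Because the member string has to match character for character, it can help to build and parse it in code rather than by hand. The helpers below are a sketch with hypothetical names; they only format and check the string shown above and do not call any GCP API.

```python
import re

# serviceAccount:<project>.svc.id.goog[<namespace>/<ksa-name>]
_MEMBER_RE = re.compile(
    r"^serviceAccount:(.+)\.svc\.id\.goog\[([^/\]]+)/([^/\]]+)\]$"
)


def workload_identity_member(project_id: str, namespace: str, ksa_name: str) -> str:
    """Build the IAM member string for a Kubernetes service account."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa_name}]"


def parse_member(member: str) -> tuple:
    """Split a member string into (project, namespace, ksa_name)."""
    match = _MEMBER_RE.match(member)
    if match is None:
        raise ValueError(f"not a Workload Identity member: {member!r}")
    return match.groups()


expected = workload_identity_member("my-project", "prod", "api")
print(expected)  # serviceAccount:my-project.svc.id.goog[prod/api]

# A binding copied from IAM with the wrong namespace is easy to spot.
bound = "serviceAccount:my-project.svc.id.goog[default/api]"
print(parse_member(bound))  # ('my-project', 'default', 'api')
print(bound == expected)    # False
```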

BigQuery on-demand query cost blows the monthly budget due to a missing partition filter

  • Why: A scheduled query or BI tool query omits the partition column in the WHERE clause; BigQuery scans the entire table (potentially petabytes) and charges $6.25/TB scanned.
  • Detect: BigQuery job history shows Total bytes billed in the TB range for a single query; the cost spike appears in the daily FinOps report.
  • Fix: Enable require_partition_filter on the BigQuery table so queries without a partition filter are rejected at the API level; set per-project or per-user daily custom quotas, and a maximum-bytes-billed limit on individual queries, to cap spend.

Cloud Run min-instances=0 causing P99 latency spikes from cold starts on bursty traffic

  • Why: With min-instances=0 a period of inactivity causes all containers to scale to zero; the next burst of traffic triggers cold starts (300-2000ms for Python/Java) while new containers initialise.
  • Detect: Cloud Run latency metrics show spikes to 1-3 seconds at irregular intervals correlating with periods of low traffic preceding a burst; P50 is fine but P99 is high.
  • Fix: Set min-instances=1 for latency-sensitive APIs to keep one warm instance always running; use Cloud Run's startup CPU boost (--cpu-boost) to reduce cold start duration for Python/JVM apps.

Open Questions

  • What monitoring and alerting matter most when this is deployed in production?
  • At what scale or workload does this approach hit its practical limits?