AWS Analytics Services

AWS analytics services decision guide — Athena (serverless SQL on S3), EMR (managed Spark), Glue (serverless ETL), Kinesis (real-time streaming), OpenSearch (log search), QuickSight (BI), Redshift (data warehouse).

CLF-C02 and SAA-C03 exam questions require selecting the correct analytics service. The differentiator is almost always: real-time vs batch, serverless vs managed cluster, ad-hoc vs repeated queries.


Services at a Glance

ServiceWhat it isKey differentiator
Amazon AthenaServerless SQL on S3No infrastructure, pay per query
Amazon EMRManaged Spark/Hadoop clusterCustom code, petabyte scale
AWS GlueServerless ETL + Data CatalogSchema discovery, no cluster
Amazon KinesisReal-time streamingSub-second processing, ordered
Amazon OpenSearchLog analytics + full-text searchElasticsearch successor
Amazon QuickSightServerless BI dashboardsBusiness users, SPICE engine
Amazon RedshiftColumnar data warehouseFast repeated structured queries

Amazon Athena

Serverless interactive SQL query service for data in S3.

  • Query CSV, JSON, Parquet, ORC directly in S3 — no ETL required
  • Pay per TB of data scanned (Parquet/ORC columnar formats reduce cost by 60–90%)
  • Powered by Presto; standard SQL syntax
  • Athena results cached; same query within 24h is free if data unchanged

Exam trigger: "ad-hoc queries on S3", "serverless SQL", "analyse log files", "no infrastructure to manage"

vs Redshift: Athena = occasional ad-hoc queries on S3 without a cluster. Redshift = frequent BI queries on structured data where speed matters.


Amazon EMR (Elastic MapReduce)

Managed big-data cluster running Apache Spark, Hadoop, Hive, or Presto.

  • Launch a cluster of EC2 instances running open-source frameworks
  • EMR Serverless: no cluster management; auto-scales workers per job
  • Use Spot Instances for worker nodes (up to 90% cost savings)
  • Best for complex transforms, ML pipelines, custom Spark code at petabyte scale

Exam trigger: "existing Spark code", "Hadoop jobs", "petabyte ETL with custom logic", "bring your own Spark"

vs Glue: EMR = full cluster control, any Spark code; Glue = simpler managed ETL without managing a cluster.


AWS Glue

Fully managed serverless ETL and data catalogue.

  • Glue Data Catalog: central metadata store; integrates with Athena, Redshift Spectrum, EMR
  • Glue Crawlers: automatically discover schema from S3, RDS, DynamoDB, JDBC sources
  • Glue ETL jobs: write PySpark or Python code; runs serverless
  • Glue DataBrew: visual, no-code data preparation

Exam trigger: "serverless ETL", "catalogue data automatically", "discover schema", "prepare data for Athena/Redshift"

vs EMR: Glue = managed, no cluster decisions; EMR = custom Spark with full control.


Amazon Kinesis

Real-time data streaming. Four distinct services:

ServiceWhat it doesLatencyKey trait
Kinesis Data StreamsReal-time ordered stream, custom consumers~200msReplay up to 7 days
Kinesis Data FirehoseLoad stream to S3/Redshift/OpenSearch60–900sNo custom processing
Kinesis Data AnalyticsSQL queries on live streamsSecondsApache Flink
Kinesis Video StreamsIngest and store videoML on video

Exam trigger: "real-time", "streaming", "process data as it arrives", "clickstream", "IoT telemetry"

vs SQS: Kinesis = ordered streaming with replay, multiple consumers; SQS = queue for task decoupling with at-least-once delivery. Kinesis Firehose is load-only — it cannot run custom processing logic.


Amazon OpenSearch Service

Managed search and log analytics (successor to Amazon Elasticsearch Service).

  • Full-text search, log analytics, real-time application monitoring
  • OpenSearch Dashboards (formerly Kibana) for visualisation
  • Integrates with Kinesis Firehose for streaming log ingestion from Lambda, VPC flow logs, CloudTrail

Exam trigger: "log analysis", "full-text search", "ELK stack on AWS", "visualise and search logs"

vs CloudWatch Logs: CloudWatch = AWS service logs with basic metric filters; OpenSearch = rich custom search and dashboard layer over any log data.


Amazon QuickSight

Serverless business intelligence and visualisation.

  • SPICE: in-memory query engine; data imported for fast dashboards
  • Connects to S3, Athena, Redshift, RDS, Salesforce, on-premises databases
  • ML Insights: anomaly detection and forecasting built in
  • Pricing: pay per reader session (not per-seat for all users)

Exam trigger: "dashboards", "reports for business users", "BI tool", "non-technical reporting", "visualise Redshift or Athena data"


Amazon Redshift

Fully managed columnar data warehouse.

  • Columnar storage: reads only the columns a query needs — fast for analytics, slow for OLTP
  • Redshift Spectrum: query S3 data from within Redshift without loading it
  • Result caching: identical queries return instantly from cache
  • Redshift Serverless: auto-scales, pay per compute-second (no cluster management)
  • RA3 nodes: compute and storage scaled independently

Exam trigger: "data warehouse", "fast BI queries on large datasets", "replace on-premises warehouse", "structured analytics at scale"

vs Athena: Redshift = faster for repeated queries on stable structured data; Athena = cheaper for infrequent ad-hoc exploration of S3.


Decision Framework

Data arrives in real time?
├── Yes → Kinesis Data Streams (custom consumers, replay)
│         Kinesis Data Firehose (just load to S3/Redshift)
│         Kinesis Data Analytics (SQL on the stream)
│
└── No (batch/stored) → Where is the data?
    ├── S3, occasional queries → Athena (serverless SQL)
    ├── Complex Spark/Hadoop → EMR
    ├── Serverless ETL, schema discovery → Glue
    └── Repeated BI queries, structured → Redshift

Visualise results for business users → QuickSight
Log search / full-text search → OpenSearch

Exam Traps

Athena vs Redshift: "Ad-hoc, S3, serverless" → Athena. "Data warehouse, fast repeated queries, BI" → Redshift.

Glue vs EMR: "Serverless ETL, no cluster" → Glue. "Custom Spark, full control" → EMR.

Kinesis Firehose vs Kinesis Data Streams: Firehose loads data to destinations only (no custom processing). Data Streams supports custom consumers, replay, and ordering.

QuickSight is always the BI answer. Any mention of "dashboards for business users" or "non-technical visualisation" points here.


Key Facts

  • Athena: serverless SQL on S3, pay per TB scanned; Parquet/ORC reduce cost 60–90%
  • EMR: Spark/Hadoop cluster; EMR Serverless available; Spot Instances cut cost up to 90%
  • Glue: serverless ETL + Data Catalog; Crawlers auto-discover schema from S3/RDS/DynamoDB
  • Kinesis Data Streams: real-time ordered, 7-day replay, multiple consumers, ~200ms latency
  • Kinesis Data Firehose: near-real-time load to S3/Redshift/OpenSearch; 60–900s buffer; no custom processing
  • Kinesis Data Analytics: SQL on live streams (Apache Flink)
  • OpenSearch: managed Elasticsearch; integrates with Kinesis Firehose for log pipelines
  • QuickSight: serverless BI; SPICE in-memory engine; ML anomaly detection built in
  • Redshift: columnar warehouse; Spectrum queries S3 without loading; RA3 separates compute/storage

Connections

Open Questions

  • At what query frequency does Redshift Serverless become more expensive than a provisioned cluster?
  • Is Kinesis Data Analytics (SQL on streams) converging with Apache Flink on EMR Serverless for the same use case?