Logging Best Practices

Structured, queryable logs that help you debug production issues without drowning in noise.

Updated Invalid Date·

logging structured-logging correlation-ids pii log-levels observability structlog

Structured, queryable logs that help you debug production issues without drowning in noise.

Log Levels and When to Use Them

DEBUG   — step-by-step execution details; only in development
          "Processing item 47 of 200"
          "Cache miss for key user:abc123"
          Never in production unless dynamically enabled for a session

INFO    — normal operational events; low-volume, high-signal
          "Order ord_abc123 placed by user usr_456"
          "Background job started: invoice_generation"
          Rule: if the system is working correctly, INFO tells you what it did

WARNING — something unexpected happened but the system recovered
          "Payment retry attempt 2/3 for order ord_abc"
          "Config key FEATURE_X not set, using default false"
          Not an error — worth investigating when aggregated

ERROR   — an operation failed; needs investigation
          "Payment failed for order ord_abc: card declined"
          "Database query timed out after 5s"
          Should trigger an alert if error rate exceeds threshold

CRITICAL — system is in a dangerous state; requires immediate action
          "Database connection pool exhausted"
          "Cannot write to disk — disk full"
          Should always page on-call

Structured Logging with structlog

import structlog
import logging

# Configure once at application startup
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,     # thread-local context
        structlog.stdlib.add_log_level,
        structlog.stdlib.add_logger_name,
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.ExceptionRenderer(),
        structlog.processors.JSONRenderer(),          # machine-readable output
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

log = structlog.get_logger()

# Basic usage — keyword args become structured fields
log.info("order_placed", order_id="ord_abc123", user_id="usr_456", amount=49.99)
# → {"event": "order_placed", "order_id": "ord_abc123", "user_id": "usr_456", "amount": 49.99, ...}

# Bind context for a scope
order_log = log.bind(order_id="ord_abc123", user_id="usr_456")
order_log.info("payment_started")
order_log.info("payment_succeeded", amount=49.99)
order_log.error("notification_failed", exc_info=True)

Correlation IDs

# A correlation ID ties together all logs from a single request/job.
# Without it: impossible to find "all logs for this one failing request".

import uuid
from contextvars import ContextVar
import structlog

correlation_id_var: ContextVar[str] = ContextVar("correlation_id", default="")

# FastAPI middleware to inject correlation IDs
from fastapi import Request, Response
from starlette.middleware.base import BaseHTTPMiddleware

class CorrelationIDMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next) -> Response:
        # Accept from upstream (load balancer, API gateway) or generate
        correlation_id = (
            request.headers.get("X-Correlation-ID")
            or request.headers.get("X-Request-ID")
            or str(uuid.uuid4())
        )
        correlation_id_var.set(correlation_id)

        # Bind to structlog context — auto-included in every log call
        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            correlation_id=correlation_id,
            path=request.url.path,
            method=request.method,
        )

        response = await call_next(request)
        response.headers["X-Correlation-ID"] = correlation_id
        return response

# Now every log call in this request automatically includes correlation_id
log.info("processing_payment")
# → {"event": "processing_payment", "correlation_id": "abc-123-def", ...}

PII Scrubbing

import re

# Patterns to scrub before logging
PII_PATTERNS = [
    (re.compile(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b'), "[CARD]"),
    (re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b'), "[EMAIL]"),
    (re.compile(r'\b\d{3}-?\d{2}-?\d{4}\b'), "[SSN]"),
    (re.compile(r'password["\s]*[:=]["\s]*\S+', re.IGNORECASE), "password=[REDACTED]"),
    (re.compile(r'(token|api_key|secret)["\s]*[:=]["\s]*\S+', re.IGNORECASE), r'\1=[REDACTED]'),
]

def scrub_pii(message: str) -> str:
    for pattern, replacement in PII_PATTERNS:
        message = pattern.sub(replacement, message)
    return message

# structlog processor for PII scrubbing
def scrub_pii_processor(logger, method, event_dict: dict) -> dict:
    if "event" in event_dict:
        event_dict["event"] = scrub_pii(str(event_dict["event"]))
    # Scrub common field names
    for field in ("email", "card_number", "password", "token"):
        if field in event_dict:
            event_dict[field] = "[REDACTED]"
    return event_dict

Log Query Patterns (CloudWatch Insights / Loki)

# CloudWatch Insights

# Find all errors for a specific order
fields @timestamp, event, error
| filter correlation_id = "abc-123"
| filter level = "error"

# Error rate per endpoint in the last hour
stats count(*) as errors by path
| filter level = "error"
| sort errors desc

# P95 latency by endpoint
stats pct(duration_ms, 95) as p95 by path
| sort p95 desc

# Loki LogQL (structlog JSON)
{app="order-service"} | json | level="error" | line_format "{{.event}} {{.order_id}}"

# Count errors by event type in last 15 minutes
sum by (event) (rate({app="order-service"} | json | level="error" [15m]))

What Not to Log

Never log:
  - Passwords, tokens, API keys (even hashed)
  - Full credit card numbers (log last 4 only)
  - PII in raw form (email, phone, SSN) — redact or omit
  - Request/response bodies containing sensitive data
  - Cryptographic material (private keys, seeds)

Avoid:
  - DEBUG logs in production — high cost, low signal, PII risk
  - Logging inside tight loops — log the aggregate, not each iteration
  - Stack traces in INFO — reserved for ERROR and above
  - "Successfully completed X" on every success — only log failures + key events

Do log:
  - Every state transition (order_placed → payment_started → payment_succeeded)
  - Every external call with its result (db, API, cache hit/miss)
  - Every error with enough context to reproduce
  - Job start/end with duration and count

Log Aggregation in CI/Production

# docker-compose.yml — route logs to CloudWatch
services:
  api:
    image: myapp:latest
    logging:
      driver: awslogs
      options:
        awslogs-group: /ecs/order-service
        awslogs-region: eu-west-1
        awslogs-stream-prefix: api

# ECS task definition — stdout → CloudWatch via awslogs driver
# structlog JSONRenderer ensures logs are machine-parseable

Common Failure Cases

PII written to logs in plain text Why: developers log raw request bodies or user objects for debugging convenience, and those payloads contain email addresses, card numbers, or tokens. Detect: run grep -r "email\|password\|card" /var/log/app/ or enable a static analysis rule that flags logging calls on untrusted input. Fix: add the scrub_pii_processor to the structlog pipeline and audit every log.* call that accepts user-controlled data.

Missing correlation ID makes incidents undebuggable Why: logs from different services share only a timestamp, which is insufficient to join them into a request trace when clock skew or high concurrency is present. Detect: you're spending more than two minutes manually correlating log lines across services for a single failing request. Fix: inject a CorrelationIDMiddleware at the application entry point and pass the ID in every outbound HTTP header and message queue metadata field.

DEBUG logs enabled in production, masking signal in noise Why: a developer enables DEBUG temporarily to investigate an issue, forgets to revert, and the log volume increases 100x, making error-rate alerts meaningless and inflating log storage costs. Detect: log ingestion cost spikes; grep level=debug | wc -l exceeds grep level=error | wc -l by orders of magnitude. Fix: configure log level via an environment variable with a production default of INFO; never hardcode logging.DEBUG in application startup.

Stack traces at INFO or WARNING level Why: logging exc_info=True at INFO floods structured log systems with multi-line stack traces, breaking JSON parsing and alerting thresholds. Detect: log aggregation shows INFO entries with multi-kilobyte payloads; dashboards miscount error rates. Fix: reserve exc_info=True for ERROR and above; at WARNING, log the exception type and message as structured fields (error_type, error_message) without the full traceback.

Connections

cs-fundamentals/se-hub · cs-fundamentals/observability-se · cloud/infrastructure-monitoring · cs-fundamentals/api-security · cs-fundamentals/error-handling-patterns · web-frameworks/fastapi · cloud/lambda-powertools

Open Questions

What are the most common misapplications of this concept in production codebases?
When should you explicitly choose not to use this pattern or technique?