Debug: DNS Resolution Failing

Runbook for diagnosing DNS resolution failures causing services to be unreachable by name.

Symptom: Service unreachable by hostname. Name or service not known. Works with IP address but not domain name. Intermittent resolution failures in Kubernetes.


Quick Diagnosis

PatternLikely cause
Works with IP, fails with hostnameDNS not resolving — resolver unreachable or record missing
Intermittent failures in KubernetesCoreDNS overloaded or pod DNS config incorrect
Fails only inside clusterInternal DNS record not created or service selector mismatch
Fails only outside clusterExternal DNS record not propagated or TTL cached
Was working, broke after deployService name changed, DNS record not updated

Likely Causes (ranked by frequency)

  1. DNS record does not exist or has wrong value
  2. CoreDNS pod overloaded — dropping queries under cluster load
  3. Service selector mismatch — Kubernetes service exists but points to no pods
  4. DNS TTL not expired after record change — clients caching stale record
  5. Search domain misconfiguration — FQDN required but short name used

First Checks (fastest signal first)

  • Test resolution from the failing host — nslookup hostname or dig hostname; confirm whether it is a timeout or NXDOMAIN
  • In Kubernetes: exec into the failing pod and run nslookup service-name.namespace.svc.cluster.local
  • Check CoreDNS pod status — kubectl get pods -n kube-system -l k8s-app=kube-dns
  • Check whether the Kubernetes Service exists and has endpoints — kubectl get endpoints service-name
  • Check DNS TTL — if a record was recently changed, stale entries may still be cached

Signal example: Service-to-service call fails with Name or service not known inside Kubernetes — kubectl get endpoints shows the service has 0 endpoints; the deployment label is app: my-service but the service selector is app: myservice (missing hyphen).


Drill Paths

SuspectGo to
Kubernetes service and endpoint configcloud/kubernetes
CoreDNS configuration and scalingcloud/kubernetes
External DNS and Route 53cloud/aws-networking-advanced
VPC DNS resolution settingscloud/vpc-design-patterns
Service mesh DNS interceptioncloud/service-mesh

Fix Patterns

  • Always use FQDNs inside Kubernetes — service.namespace.svc.cluster.local avoids search domain ambiguity
  • Check endpoints before checking DNS — a Service with no endpoints is a selector problem, not DNS
  • Scale CoreDNS if DNS latency is high under load — kubectl scale deployment coredns -n kube-system --replicas=3
  • Set realistic TTLs — low TTL (60s) for records that change; higher TTL (300s) for stable records
  • Enable DNS caching in the application — reduce CoreDNS pressure by not resolving the same name on every request

When This Is Not the Issue

If DNS resolves correctly but the service is still unreachable:

  • DNS is working but the service is not listening — check whether the process is bound to the correct interface and port
  • A firewall or security group may be blocking the connection after DNS resolution

Pivot to synthesis/debug-api-timeout to diagnose connection failures where DNS succeeds but the connection does not complete.


Connections

cloud/kubernetes · cloud/aws-networking-advanced · cloud/vpc-design-patterns · cloud/service-mesh · synthesis/debug-api-timeout

Open Questions

  • What has changed since this synthesis was written that would alter the conclusions?
  • What evidence would cause you to revise the key recommendation here?