Debug: DNS Resolution Failing

Runbook for diagnosing DNS resolution failures causing services to be unreachable by name.

Updated Invalid Date·

debugging dns networking kubernetes runbook

Symptom: Service unreachable by hostname. Name or service not known. Works with IP address but not domain name. Intermittent resolution failures in Kubernetes.

Quick Diagnosis

Pattern	Likely cause
Works with IP, fails with hostname	DNS not resolving — resolver unreachable or record missing
Intermittent failures in Kubernetes	CoreDNS overloaded or pod DNS config incorrect
Fails only inside cluster	Internal DNS record not created or service selector mismatch
Fails only outside cluster	External DNS record not propagated or TTL cached
Was working, broke after deploy	Service name changed, DNS record not updated

Likely Causes (ranked by frequency)

DNS record does not exist or has wrong value
CoreDNS pod overloaded — dropping queries under cluster load
Service selector mismatch — Kubernetes service exists but points to no pods
DNS TTL not expired after record change — clients caching stale record
Search domain misconfiguration — FQDN required but short name used

First Checks (fastest signal first)

Test resolution from the failing host — nslookup hostname or dig hostname; confirm whether it is a timeout or NXDOMAIN
In Kubernetes: exec into the failing pod and run nslookup service-name.namespace.svc.cluster.local
Check CoreDNS pod status — kubectl get pods -n kube-system -l k8s-app=kube-dns
Check whether the Kubernetes Service exists and has endpoints — kubectl get endpoints service-name
Check DNS TTL — if a record was recently changed, stale entries may still be cached

Signal example: Service-to-service call fails with Name or service not known inside Kubernetes — kubectl get endpoints shows the service has 0 endpoints; the deployment label is app: my-service but the service selector is app: myservice (missing hyphen).

Drill Paths

Suspect	Go to
Kubernetes service and endpoint config	cloud/kubernetes
CoreDNS configuration and scaling	cloud/kubernetes
External DNS and Route 53	cloud/aws-networking-advanced
VPC DNS resolution settings	cloud/vpc-design-patterns
Service mesh DNS interception	cloud/service-mesh

Fix Patterns

Always use FQDNs inside Kubernetes — service.namespace.svc.cluster.local avoids search domain ambiguity
Check endpoints before checking DNS — a Service with no endpoints is a selector problem, not DNS
Scale CoreDNS if DNS latency is high under load — kubectl scale deployment coredns -n kube-system --replicas=3
Set realistic TTLs — low TTL (60s) for records that change; higher TTL (300s) for stable records
Enable DNS caching in the application — reduce CoreDNS pressure by not resolving the same name on every request

When This Is Not the Issue

If DNS resolves correctly but the service is still unreachable:

DNS is working but the service is not listening — check whether the process is bound to the correct interface and port
A firewall or security group may be blocking the connection after DNS resolution

Pivot to synthesis/debug-api-timeout to diagnose connection failures where DNS succeeds but the connection does not complete.

Connections

cloud/kubernetes · cloud/aws-networking-advanced · cloud/vpc-design-patterns · cloud/service-mesh · synthesis/debug-api-timeout

Open Questions

What has changed since this synthesis was written that would alter the conclusions?
What evidence would cause you to revise the key recommendation here?