DNS Resolution Failure End-to-End Troubleshooting Handbook: Practical Diagnosis and Remediation from Linux to K8s

Posted on 2026-3-31 03:13:32

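DNS (Domain Name System) is one of the most critical pieces of Internet infrastructure. Resolution is built on two core mechanisms: hierarchical delegation and caching. When a client resolves a domain name, the query path it actually travels is far more complex than most people assume.

DNS queries come in two modes: recursive and iterative.

In a recursive query, the client sends a request to its local DNS server (usually called the Local DNS or recursive resolver), which does the legwork for the whole resolution chain and returns the final result. The client sends one request and receives one response; everything in between is handled by the Local DNS.

Iterative queries are what the Local DNS uses while doing that legwork. It first asks a root name server, which does not give the final answer but returns a referral along the lines of "go ask the .com servers". The Local DNS then asks the .com servers and is told "go ask the authoritative servers for example.com". It continues level by level until it obtains the final answer.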

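A complete DNS lookup, ordered by the cache layers it passes through:

Application-internal cache (e.g. JVM DNS cache, browser cache)
    ↓ miss
Operating system DNS cache (nscd / systemd-resolved)
    ↓ miss
/etc/hosts file (local static resolution)
    ↓ no match
Local DNS server listed in /etc/resolv.conf
    ↓ Local DNS cache miss
Root name servers (13 root server identities worldwide)
    ↓ referral to the TLD
Top-level domain servers (.com / .cn / .org and so on)
    ↓ referral to the authoritative servers
Authoritative name servers (the targets of the domain's NS records)
    ↓ final answer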

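Every link in this chain can fail. DNS failures are also hard to spot: applications rarely report "DNS resolution failed" directly. Instead they show indirect symptoms such as connection timeouts, unreachable services, or certificate validation errors.

In operations work, DNS anomalies fall into the following broad categories:

Category 1: resolution fails completely. The name resolves to no IP address at all, and dig returns NXDOMAIN (the name does not exist) or SERVFAIL (the server could not complete the query). Common causes include an expired domain registration, authoritative DNS servers being down, and network problems that prevent reaching the DNS server.

Category 2: resolution returns the wrong result. The name resolves, but to an incorrect IP address. This is the most insidious category because nothing reports an error. Common causes include DNS cache poisoning, DNS hijacking (an ISP or an intermediate network device tampering with responses), misconfigured DNS records, and CDN scheduling problems.

Category 3: resolution latency is too high. The name resolves correctly, but the lookup time jumps from the usual few milliseconds to several seconds or more than ten seconds. In microservice architectures that make many external API calls, DNS latency is amplified until it affects overall response time. Common causes include an overloaded DNS server, packet loss, an overly long recursive query path, and fallback after a DNSSEC validation failure.

Category 4: resolution fails intermittently. It works some of the time, which makes it the hardest category to troubleshoot. Possible causes are a few unhealthy nodes in a DNS server cluster, occasional UDP packet loss, or a full connection tracking table that causes DNS responses to be dropped (extremely common in Kubernetes).

Category 5: resolution problems at the operating system level. /etc/resolv.conf gets overwritten, nsswitch.conf is misconfigured, or systemd-resolved conflicts with traditional DNS configuration. These problems usually affect only the local machine, but they occur very often.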
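As a rough illustration of how the first category can be recognised from tool output, the sketch below reads dig output on stdin and maps the header status to a coarse label. The function name `classify_dig_status` and the label strings are made up for this example, and the parsing assumes the usual `status: ...` header line that dig prints.

```shell
# classify_dig_status: read dig output on stdin, print a coarse category.
# Hypothetical helper for illustration; the labels are not dig terminology.
classify_dig_status() {
    awk '
        /status: / {
            # Header line looks like:
            # ";; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 1234"
            sub(/.*status: /, ""); sub(/,.*/, "")
            status = $0
        }
        /^;; ANSWER SECTION:/ { has_answer = 1 }
        END {
            if (status == "")                            print "no-response"
            else if (status == "NXDOMAIN")               print "name-does-not-exist"
            else if (status == "SERVFAIL")               print "resolver-failure"
            else if (status == "REFUSED")                print "refused-by-server"
            else if (status == "NOERROR" && !has_answer) print "nodata"
            else if (status == "NOERROR")                print "ok"
            else                                         print "other:" status
        }'
}
```

Typical use would be `dig example.com | classify_dig_status`. A timeout produces no header at all, which the sketch reports as `no-response`.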

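Core Concepts

DNS record types and how they relate to troubleshooting:

A record: maps a domain name to an IPv4 address; the most frequently queried record type
AAAA record: maps a domain name to an IPv6 address; a common source of resolution delay in dual-stack environments
CNAME record: an alias for another name; adds an extra lookup, and long CNAME chains increase latency
NS record: names the authoritative DNS servers for a domain; a wrong NS record makes the whole domain unresolvable
SOA record: zone authority information, including the serial number and cache-control parameters
PTR record: reverse resolution, mapping an IP address to a name; a common cause of slow SSH logins
MX record: mail exchanger record; check it first when mail delivery fails
TXT record: free-form text, used for SPF, DKIM, domain verification and so on
SRV record: service location record; Kubernetes and some microservice frameworks depend on it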

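TTL (Time To Live):

TTL is the core control parameter for DNS caching, expressed in seconds. The authoritative DNS server attaches a TTL to every answer, and the Local DNS and clients use it to decide how long to cache the record. A badly chosen TTL causes two kinds of problems:

TTL too long: after a record changes, the old record lingers in caches at every level and failover takes a long time to become effective
TTL too short: almost every query goes through to the authoritative DNS, which adds latency and load on the authoritative servers

A common production TTL policy: use 300-600 seconds in normal operation, lower the TTL to 60 seconds 24 hours before a planned DNS cutover, and restore it once the cutover is complete.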
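To check which TTL a resolver is currently serving (for instance, to confirm that the lowered value has propagated before a cutover), the answer lines from `dig +noall +answer` can be parsed. The following is a minimal sketch; `min_answer_ttl` is a hypothetical helper name, and it assumes the standard answer layout of name, TTL, class, type, data.

```shell
# min_answer_ttl: read "dig +noall +answer" lines on stdin and print the
# smallest TTL found. Hypothetical helper; assumes the TTL is the 2nd column.
min_answer_ttl() {
    awk 'NF >= 5 && $2 ~ /^[0-9]+$/ {
             if (min == "" || $2 + 0 < min) min = $2 + 0
         }
         END { if (min != "") print min }'
}
```

For example, `dig +noall +answer example.com | min_answer_ttl`. Against a caching resolver the value counts down between repeated queries; against the authoritative server it stays at the configured TTL.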

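EDNS (Extension Mechanisms for DNS):

The original DNS protocol limits UDP responses to 512 bytes. EDNS0 lets the client advertise a larger UDP buffer, traditionally 4096 bytes (configurable; many current resolvers advertise 1232 bytes to avoid IP fragmentation). DNSSEC signatures and answers with many A/AAAA records both need EDNS. If a firewall blocks DNS UDP packets larger than 512 bytes, some domains fail to resolve while others work normally, which is very painful to track down.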

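Use Cases

An application on a server reports connection timeouts and DNS resolution is suspected
curl requests to an external API time out occasionally and you need to tell DNS problems from network problems
DNS lookups in Kubernetes Pods occasionally take 5 seconds (the classic conntrack race)
After a domain cutover, some users still resolve to the old IP
Internal names resolve but public names fail (or the other way round)
/etc/resolv.conf is reset after systemd-resolved restarts
A DNSSEC validation failure makes specific domains unreachable
Mail delivery fails and MX or SPF records are suspected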

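Environment Requirements

Component	Version	Notes
Operating system	Ubuntu 24.04 LTS / Rocky Linux 9.5	Mainstream LTS distributions
Linux kernel	6.12+	Supports the latest network stack features
systemd	256+	Includes the systemd-resolved service
bind-utils / dnsutils	9.20+	Provides dig / nslookup / host
dog	0.1.0+	Modern DNS query tool (written in Rust)
tcpdump	4.99+	Packet capture for DNS traffic analysis
CoreDNS	1.12+	Default DNS server in Kubernetes
BIND9	9.20+	Traditional authoritative/recursive DNS server
Unbound	1.22+	High-performance recursive DNS server
Prometheus	3.x	Metrics collection
Grafana	11.x	Monitoring dashboards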

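Detailed Steps

Preparation

Install the troubleshooting tools

On Ubuntu 24.04:

# Install the DNS troubleshooting toolset
sudo apt update
sudo apt install -y dnsutils mtr-tiny tcpdump whois iputils-ping

# dnsutils provides dig / nslookup / nsupdate
# mtr-tiny provides mtr for network path diagnosis
# tcpdump captures DNS traffic for analysis

On Rocky Linux 9.5:

# Install the DNS troubleshooting toolset
sudo dnf install -y bind-utils mtr tcpdump whois iputils

# bind-utils provides dig / nslookup / host

Verify the installation:

# Check the dig version
dig -v
# Expected output: DiG 9.20.x (the exact version depends on the distribution package)

# Check the tcpdump version
tcpdump --version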

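Confirm the current DNS configuration

# Show the DNS servers currently in effect
cat /etc/resolv.conf

# Typical output:
# nameserver 10.0.0.2
# nameserver 10.0.0.3
# search corp.example.com example.com
# options ndots:5 timeout:2 attempts:3

What each field means:

nameserver: address of a Local DNS server; at most 3 can be configured, and they are tried in order
search: the search domain list; when the queried name is not an FQDN (does not end with a dot), the resolver appends each search domain in turn
options ndots:N: if the name contains fewer than N dots, the search domains are appended before the name is tried as-is. Kubernetes defaults to ndots:5, which makes non-FQDN lookups generate many useless DNS queries
options timeout:N: timeout for a single query, in seconds
options attempts:N: how many times the server list is tried before giving up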
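The interaction between `search` and `ndots` is easier to see when the candidate names are listed in the order a glibc-style resolver would try them. The sketch below only models that ordering rule; it sends no queries, and `expand_candidates` is a hypothetical helper name.

```shell
# expand_candidates NAME NDOTS SEARCH_DOMAIN...
# Print the names a glibc-style stub resolver would try, in order.
# Illustrative model only: it performs no DNS lookups.
expand_candidates() (
    name=$1; ndots=$2; shift 2
    # A trailing dot marks an FQDN: the search list is skipped entirely.
    case $name in
        *.) printf '%s\n' "$name"; return 0 ;;
    esac
    # Count the dots by deleting every other character.
    dots=$(printf '%s' "$name" | tr -cd '.' | wc -c)
    dots=$((dots + 0))
    if [ "$dots" -ge "$ndots" ]; then
        printf '%s.\n' "$name"                         # as-is first
        for d in "$@"; do printf '%s.%s.\n' "$name" "$d"; done
    else
        for d in "$@"; do printf '%s.%s.\n' "$name" "$d"; done
        printf '%s.\n' "$name"                         # as-is last
    fi
)
```

Calling `expand_candidates example.com 5 default.svc.cluster.local svc.cluster.local cluster.local` lists three search-domain candidates before `example.com.` itself, which is the overhead that `ndots:5` adds to every external lookup.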

# Show the name resolution order
grep '^hosts' /etc/nsswitch.conf

# Typical output:
# hosts: files dns mymachines myhostname

# What each source means:
# files      - check /etc/hosts first
# dns        - then query the DNS servers from /etc/resolv.conf
# mymachines - containers registered with systemd-machined
# myhostname - fallback resolution of the local hostname

# Check whether systemd-resolved is managing DNS
systemctl is-active systemd-resolved

# If active, show the configuration systemd-resolved is actually using
resolvectl status

# Check whether /etc/resolv.conf is a symlink
ls -la /etc/resolv.conf

# If it is a symlink to /run/systemd/resolve/stub-resolv.conf,
# systemd-resolved is managing the DNS configuration

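Baseline

Before troubleshooting, record baseline DNS resolution figures for the healthy state:

# Measure resolution latency (repeat and average)
for i in $(seq 1 10); do
    dig +noall +stats example.com | grep "Query time"
done

# In the healthy state a cached answer should show a Query time of 0-5 ms
# An uncached first query is usually 20-100 ms (depending on network latency to the authoritative DNS)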
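To turn the raw `Query time` lines from the loop above into a baseline that can be compared later, they can be summarised with awk. This is a small sketch; `query_time_stats` is a hypothetical name, and it assumes dig's usual `;; Query time: N msec` line format.

```shell
# query_time_stats: read dig output on stdin and summarise the
# ";; Query time: N msec" lines. Hypothetical helper for illustration.
query_time_stats() {
    awk '/Query time:/ {
             t = $4 + 0; n++; sum += t
             if (n == 1 || t < min) min = t
             if (n == 1 || t > max) max = t
         }
         END {
             if (n == 0) { print "no samples"; exit 1 }
             printf "n=%d min=%d avg=%.1f max=%d\n", n, min, sum / n, max
         }'
}
```

One way to use it: `for i in $(seq 1 10); do dig +noall +stats example.com; done | query_time_stats`. A single slow outlier shows up in `max` even when the average looks healthy.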

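DNS Troubleshooting Tools in Detail

Using dig in depth

dig is the core DNS troubleshooting tool, and you should be fluent in its options.

Basic queries:

# Query the A record (default)
dig example.com

# Query a specific record type
dig example.com AAAA    # IPv6 address
dig example.com MX      # mail exchanger
dig example.com NS      # authoritative DNS servers
dig example.com TXT     # text records (SPF/DKIM/domain verification)
dig example.com SOA     # zone authority record
dig example.com ANY     # all record types (many DNS servers refuse or minimise ANY)

Querying a specific DNS server:

# Query a given DNS server (bypasses the local configuration)
dig @8.8.8.8 example.com         # Google DNS
dig @1.1.1.1 example.com         # Cloudflare DNS
dig @10.0.0.2 example.com        # internal DNS

# Compare answers from different DNS servers to look for DNS hijacking
dig @8.8.8.8 +short example.com
dig @114.114.114.114 +short example.com
dig @223.5.5.5 +short example.com

Tracing the full resolution path:

# +trace makes dig perform the iterative lookup itself, starting from the root
dig +trace example.com

# Example output (simplified):
# .                  IN  NS  a.root-servers.net.   (start at the root)
# com.               IN  NS  a.gtld-servers.net.    (the root returns the NS for .com)
# example.com.       IN  NS  ns1.example.com.       (.com returns the NS for example.com)
# example.com.       IN  A   93.184.216.34          (the authoritative server returns the answer)

# The value of +trace is that it shows the time taken at each hop
# If one hop is unusually slow, you know which stage has the problem

Compact output formats:

# Print only the result
dig +short example.com

# Print only the answer section
dig +noall +answer example.com

# Print the answer section and statistics
dig +noall +answer +stats example.com

# Print the answer, authority and additional sections
dig +noall +answer +authority +additional example.com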

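DNSSEC validation:

# Check whether the domain has DNSSEC enabled
dig +dnssec example.com

# Show the DNSKEY records
dig example.com DNSKEY

# Show the DS records (held by the parent zone)
dig example.com DS

# Validate the DNSSEC chain of trust
# (dig +sigchase has been removed from recent BIND 9 releases; delv replaces it)
delv +vtrace example.com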

Reverse resolution:

# Query the PTR record for an IP
dig -x 93.184.216.34

# Equivalent to querying the PTR record of 34.216.184.93.in-addr.arpa
dig 34.216.184.93.in-addr.arpa PTR

nslookup and host

# nslookup (non-interactive form)
nslookup example.com
nslookup example.com 8.8.8.8    # use a specific DNS server

# host (compact output)
host example.com
host -t MX example.com          # query a specific type
host -a example.com             # query all records
host 93.184.216.34              # reverse resolution

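Capturing DNS traffic with tcpdump

When dig and similar tools cannot reproduce the problem, capture the real DNS traffic and analyse it:

# Capture all DNS traffic (UDP and TCP port 53)
sudo tcpdump -i any -nn -l port 53

# Capture to a file for later analysis in Wireshark
sudo tcpdump -i any -c 1000 -w /tmp/dns_capture.pcap port 53

# Show only DNS traffic that mentions a given domain
sudo tcpdump -i any -nn -l port 53 | grep "example.com"

# Capture only TCP port 53 (DNS over TCP, used for large responses)
sudo tcpdump -i any -nn -l 'tcp port 53'

# Show only DNS responses
sudo tcpdump -i any -nn -l 'src port 53'

Reading the tcpdump output:

15:23:01.123456 IP 10.0.0.5.48273 > 10.0.0.2.53: 12345+ A? api.example.com. (33)
15:23:01.125789 IP 10.0.0.2.53 > 10.0.0.5.48273: 12345 1/0/0 A 192.168.1.100 (49)

The first line is the query: source IP 10.0.0.5 asks DNS server 10.0.0.2 for the A record of api.example.com. The second line is the response: the DNS server returns 192.168.1.100, and "1/0/0" means 1 answer record, 0 authority records and 0 additional records. The number 12345 is the query ID, which ties a response to its query.

If you see queries but no responses, the DNS server is unreachable or the responses are being dropped.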
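Spotting unanswered queries by eye is tedious in a busy capture. The following sketch pairs queries with responses using the client address, client port and query ID, and prints the queries that never got a reply. It assumes the text format shown above (as produced with `-nn`); `unanswered_queries` is a hypothetical name, and retransmissions that reuse the same ID are treated as a single query.

```shell
# unanswered_queries: read tcpdump -nn text output on stdin and print DNS
# queries (packets sent to port 53) for which no response was seen.
# Hypothetical helper; pairing key is "client_ip.port/query_id".
unanswered_queries() {
    awk '
    {
        # Locate the "IP" token; tcpdump -i any may print extra columns first.
        p = 0
        for (i = 1; i <= NF; i++) if ($i == "IP" || $i == "IP6") { p = i; break }
        if (!p || $(p + 2) != ">") next
        src = $(p + 1); dst = $(p + 3); sub(/:$/, "", dst)
        id = $(p + 4); sub(/[^0-9].*$/, "", id)      # "12345+" becomes "12345"
        if (dst ~ /\.53$/)      pending[src "/" id] = $0
        else if (src ~ /\.53$/) delete pending[dst "/" id]
    }
    END { for (k in pending) print pending[k] }'
}
```

It can be fed from a saved capture, for example `tcpdump -nn -r /tmp/dns_capture.pcap port 53 | unanswered_queries`. Queries near the end of the capture may simply not have been answered yet, so the last second or two of output deserves less weight.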

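Layer-by-Layer Troubleshooting of the Resolution Path

Layer 1: local files

# Check /etc/hosts for unexpected entries
cat /etc/hosts

# What to look for:
# 1. Wrong IP mappings added by hand
# 2. Hijacking entries added by malware
# 3. Correct entry format (IP address first, hostname after)
# 4. The IPv6 ::1 entry

# Check /etc/hosts permissions (ordinary users must not be able to modify it)
ls -la /etc/hosts
# Expected: -rw-r--r-- root root

# Check the hosts line in nsswitch.conf
grep ^hosts /etc/nsswitch.conf

# Common configurations:
# hosts: files dns            -- hosts file first, then DNS (most common)
# hosts: files dns myhostname -- default on Ubuntu/Fedora
# hosts: dns files            -- DNS first, then hosts file (rare, but needed in some setups)
# hosts: files mdns4_minimal [NOTFOUND=return] dns -- configuration with mDNS

# If the hosts line in nsswitch.conf has no dns entry,
# the system never queries a DNS server at all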
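The two checks on the hosts line (is `dns` present, and does it come after `files`) can be automated. Below is a minimal sketch; `check_hosts_line` is a hypothetical helper, and it treats the `resolve` source (the NSS module that talks to systemd-resolved) as an acceptable substitute for `dns`, which is an assumption about how the host is set up.

```shell
# check_hosts_line: read nsswitch.conf content on stdin and report on the
# hosts line. Hypothetical helper; output strings are for illustration.
check_hosts_line() {
    awk '
        $1 == "hosts:" {
            found = 1
            for (i = 2; i <= NF; i++) {
                if ($i == "files" && !f)   f = i
                if ($i == "dns" && !d)     d = i
                if ($i == "resolve" && !r) r = i
            }
        }
        END {
            if (!found)               print "no hosts line"
            else if (!d && !r)        print "dns missing"
            else if (f && d && d < f) print "dns before files"
            else                      print "ok"
        }'
}
```

Run it as `check_hosts_line < /etc/nsswitch.conf`. A result of `dns before files` is not necessarily wrong, but it means /etc/hosts overrides will not take effect for names the DNS can answer.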

Layer 2: resolv.conf

# Show the real content of resolv.conf
cat /etc/resolv.conf

# Check whether it is managed by NetworkManager or systemd-resolved
ls -la /etc/resolv.conf

# Possible cases:
# 1. Regular file: managed by hand
# 2. Symlink -> /run/systemd/resolve/stub-resolv.conf: managed by systemd-resolved
# 3. Symlink -> /run/systemd/resolve/resolv.conf: managed by systemd-resolved (bypassing the stub)
# 4. Symlink -> /run/NetworkManager/resolv.conf: managed by NetworkManager
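The four cases above can be told apart from the symlink target alone. The sketch below maps a target string to a description; `resolv_conf_manager` is a hypothetical name, and it matches on the path suffix so that relative targets such as `../run/systemd/resolve/stub-resolv.conf` are recognised too.

```shell
# resolv_conf_manager TARGET
# TARGET is the output of "readlink /etc/resolv.conf" (empty for a regular file).
# Hypothetical helper; the descriptions are for illustration.
resolv_conf_manager() {
    case $1 in
        "")                                 echo "regular file (managed by hand or rewritten in place)" ;;
        */systemd/resolve/stub-resolv.conf) echo "systemd-resolved (stub listener 127.0.0.53)" ;;
        */systemd/resolve/resolv.conf)      echo "systemd-resolved (upstream servers, no stub)" ;;
        */NetworkManager/resolv.conf)       echo "NetworkManager" ;;
        *)                                  echo "unknown: $1" ;;
    esac
}
```

Usage: `resolv_conf_manager "$(readlink /etc/resolv.conf)"`. A regular file can still be rewritten in place by NetworkManager or a DHCP client, so the first case does not prove that nothing manages the file.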

When resolv.conf keeps getting overwritten:

# Check whether a DHCP client or network manager is rewriting resolv.conf
# Ubuntu uses systemd-networkd or NetworkManager
systemctl is-active NetworkManager
systemctl is-active systemd-networkd

# Check NetworkManager's DNS handling mode
grep -A5 '\[main\]' /etc/NetworkManager/NetworkManager.conf

# dns=default    -- NetworkManager writes /etc/resolv.conf itself
# dns=systemd-resolved  -- hands DNS to systemd-resolved
# dns=dnsmasq    -- uses dnsmasq as a local cache
# dns=none       -- does not manage DNS

# To stop resolv.conf from being overwritten (stopgap, not recommended long term)
sudo chattr +i /etc/resolv.conf

# Check whether the immutable attribute is set
lsattr /etc/resolv.conf

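Tuning key resolv.conf parameters:

# /etc/resolv.conf example (tuned)
nameserver 10.0.0.2
nameserver 10.0.0.3
nameserver 8.8.8.8
search corp.example.com
options timeout:2 attempts:2 rotate single-request-reopen

# Parameter notes:
# timeout:2       -- 2 second timeout per query (the default of 5 seconds is too long)
# attempts:2      -- 2 attempts (this is also the default, and it is reasonable)
# rotate          -- round-robin across the nameservers (by default the first is always tried first)
# single-request-reopen -- glibc sends the A and AAAA queries from the same socket;
#   with this option, if the two requests are not handled correctly it closes the
#   socket and sends the second request from a new source port.
#   This works around firewalls/NAT devices that mix up the two responses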

Layer 3: systemd-resolved

Ubuntu 24.04 uses systemd-resolved to manage DNS by default. It runs a stub DNS server on 127.0.0.53:53; application queries reach the stub first, and systemd-resolved forwards them to the real upstream DNS servers.

# Show the systemd-resolved service state
systemctl status systemd-resolved

# Show detailed DNS configuration
resolvectl status

# Example output:
# Global
#   Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
#   resolv.conf mode: stub
#
# Link 2 (eth0)
#   Current Scopes: DNS LLMNR/IPv4 LLMNR/IPv6
#   Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
#   Current DNS Server: 10.0.0.2
#   DNS Servers: 10.0.0.2 10.0.0.3
#   DNS Domain: corp.example.com

# Show DNS cache statistics
resolvectl statistics

# Example output:
# DNSSEC supported by current servers: no
#
# Transactions
# Current Transactions: 0
#   Total Transactions: 85432
#
# Cache
#   Current Cache Size: 234
#           Cache Hits: 67891
#         Cache Misses: 17541
#
# DNSSEC Verdicts
#             Secure: 0
#           Insecure: 0
#              Bogus: 0
#      Indeterminate: 0

# If Cache Misses is far larger than Cache Hits, the cache is ineffective
# TTLs may be too small, or the cache is being flushed frequently

# Flush the systemd-resolved cache by hand
resolvectl flush-caches

# Test resolution through resolvectl
resolvectl query example.com

# Follow the systemd-resolved log (for detailed errors)
journalctl -u systemd-resolved -f --no-pager

# Enable debug-level logging at runtime (revert when finished)
sudo resolvectl log-level debug

# SIGUSR1 makes systemd-resolved dump its cache and server state to the journal
sudo kill -USR1 $(pidof systemd-resolved)

# Restore the normal log level after troubleshooting
sudo resolvectl log-level info

# Follow the debug log
journalctl -u systemd-resolved -f
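The hit/miss comparison described above is easier to track as a single percentage. A minimal sketch follows; `cache_hit_ratio` is a hypothetical name, and it assumes the `Cache Hits:` and `Cache Misses:` labels shown in the example output (the exact layout of `resolvectl statistics` can differ between systemd versions).

```shell
# cache_hit_ratio: read "resolvectl statistics" output on stdin and print
# hits / (hits + misses) as a percentage. Hypothetical helper.
cache_hit_ratio() {
    awk -F: '
        /Cache Hits/   { hits = $2 + 0 }
        /Cache Misses/ { misses = $2 + 0 }
        END {
            total = hits + misses
            if (total == 0) { print "no lookups recorded"; exit 1 }
            printf "%.1f%%\n", 100 * hits / total
        }'
}
```

Usage: `resolvectl statistics | cache_hit_ratio`. With the figures from the example output (67891 hits, 17541 misses) it prints `79.5%`.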

Common systemd-resolved problems:

# Problem 1: /etc/resolv.conf points to 127.0.0.53 but systemd-resolved is not running
# Symptom: every DNS query fails
# Fix:
sudo systemctl start systemd-resolved
sudo systemctl enable systemd-resolved

# Problem 2: systemd-resolved uses the wrong DNS servers
# Show the DNS servers configured per network interface
resolvectl dns

# Set DNS servers by hand (temporary)
sudo resolvectl dns eth0 10.0.0.2 10.0.0.3

# A permanent change requires editing the netplan or networkd configuration
# Ubuntu 24.04 uses netplan:
cat /etc/netplan/*.yaml

# Problem 3: mDNS/LLMNR interferes with normal DNS resolution
# Some .local lookups go through mDNS instead of DNS
# Disable mDNS and LLMNR:
cat << 'EOF' | sudo tee /etc/systemd/resolved.conf.d/no-mdns.conf
[Resolve]
LLMNR=no
MulticastDNS=no
EOF
sudo systemctl restart systemd-resolved

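Layer 4: the DNS server side

BIND9:

# Check the BIND9 service state
sudo systemctl status named    # RHEL/Rocky
sudo systemctl status bind9    # Ubuntu

# Check the BIND9 configuration syntax (on Ubuntu the file is /etc/bind/named.conf)
sudo named-checkconf /etc/named.conf

# Check zone file syntax
sudo named-checkzone example.com /var/named/example.com.zone

# Follow the BIND9 log
sudo journalctl -u named -f --no-pager

# Show server status (version, number of zones, recursive clients, query logging state)
sudo rndc status

# Flush the BIND9 cache
sudo rndc flush

# Flush the cache for one name
sudo rndc flushname example.com

# Dump query statistics to the statistics file and read it
sudo rndc stats
tail -50 /var/named/data/named_stats.txt

# Dump the queries currently being recursed (written to the named.recursing file)
sudo rndc recursing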

CoreDNS (Kubernetes environments):

# Show the CoreDNS Pod status
kubectl -n kube-system get pods -l k8s-app=kube-dns

# Show the CoreDNS logs
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=100

# Show the CoreDNS configuration
kubectl -n kube-system get configmap coredns -o yaml

# Test DNS resolution from inside a Pod
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
    nslookup kubernetes.default.svc.cluster.local

# Test resolution of an external name
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
    nslookup example.com

# Check that the CoreDNS Service is healthy
kubectl -n kube-system get svc kube-dns
# Confirm that the ClusterIP matches the nameserver in the Pod's /etc/resolv.conf

Common problems in the CoreDNS configuration:

# Example CoreDNS Corefile
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf {
        max_concurrent 1000
    }
    cache 30
    loop
    reload
    loadbalance
}

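What to check:

Whether the upstream DNS servers in the forward directive are reachable
Whether the cache TTL is reasonable
Whether max_concurrent is high enough for the concurrency you need
Whether the loop plugin has detected a forwarding loop (CoreDNS then exits, and Kubernetes restarts the Pod, which typically shows up as CrashLoopBackOff)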

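Layer 5: detecting DNS hijacking

# Method 1: compare the answers from several public DNS services
echo "=== Google DNS ==="
dig @8.8.8.8 +short target.example.com
echo "=== Cloudflare DNS ==="
dig @1.1.1.1 +short target.example.com
echo "=== Alibaba DNS ==="
dig @223.5.5.5 +short target.example.com
echo "=== Local DNS ==="
dig +short target.example.com

# If the local DNS answer differs from the public DNS answers, hijacking is possible
# (CDN or geo-based answers can also differ legitimately)

# Method 2: query over TCP to bypass UDP hijacking
dig +tcp target.example.com

# Some hijacking devices only intercept traffic on UDP port 53
# If the TCP answer differs from the UDP answer, hijacking is very likely

# Method 3: verify with DNS-over-HTTPS
# Call the Cloudflare DoH JSON endpoint with curl
curl -s "https://cloudflare-dns.com/dns-query?name=target.example.com&type=A" \
    -H "Accept: application/dns-json" | python3 -m json.tool

# Method 4: use dig +trace and watch for unexpected referrals
dig +trace target.example.com
# Be suspicious if an unexpected NS or A record appears at any hop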
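When comparing answers from several resolvers, the records often come back in a different order, so a plain string comparison gives false alarms. The sketch below compares two answer lists as sets; `same_answer_set` is a hypothetical helper, and the arguments are expected to be the whitespace- or newline-separated output of `dig +short`.

```shell
# same_answer_set A B
# Succeed when the two answer lists contain the same records, ignoring order
# and duplicates. Hypothetical helper; it does no DNS lookups itself.
same_answer_set() (
    # Unquoted expansion on purpose: split each list into one record per line.
    a=$(printf '%s\n' $1 | sort -u)
    b=$(printf '%s\n' $2 | sort -u)
    [ "$a" = "$b" ]
)
```

For example: `same_answer_set "$(dig @8.8.8.8 +short target.example.com)" "$(dig +short target.example.com)" || echo "answers differ"`. A difference is only a hint: domains served by a CDN or geo-aware DNS can legitimately return different addresses to different resolvers.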

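Layer 6: DNSSEC validation problems

# Check whether the domain has DNSSEC enabled
dig +dnssec example.com

# An "ad" flag (Authenticated Data) in the response header means a validating resolver verified the answer
# A "cd" flag (Checking Disabled) means the requester asked for validation to be skipped

# Check the DNSKEY records
dig example.com DNSKEY +multiline

# Check the DS records (in the parent zone)
dig example.com DS

# Check the RRSIG signatures
# (some servers refuse direct RRSIG queries; "dig +dnssec example.com A" shows the RRSIG alongside the answer)
dig example.com RRSIG

# Diagnose DNSSEC with delv (ships with BIND9 9.10+)
delv example.com

# When DNSSEC validation fails, delv prints the specific reason:
# for example an expired signature, a key mismatch, or a missing DS record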

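Common causes of DNSSEC validation failure and how to handle them:

# Cause 1: the domain owner rolled the DNSKEY but did not update the DS record in the parent zone
# Symptom: SERVFAIL, and delv reports "broken trust chain"
# Action: ask the domain owner to fix the DS record

# Cause 2: the RRSIG signature has expired
# Symptom: SERVFAIL, and the log contains "RRSIG has expired"
# Action: the authoritative DNS administrator must re-sign the zone

# Cause 3: the clock on the local recursive server is wrong
# DNSSEC validation depends on timestamps; a clock outside the signature validity window makes validation fail
# Check:
date
timedatectl status
# Fix:
sudo systemctl start chronyd  # or ntpd
sudo chronyc makestep

# Temporarily bypass DNSSEC validation (only to confirm whether DNSSEC is the problem)
dig +cd example.com
# +cd sets the Checking Disabled flag, telling the recursive server not to validate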
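Checking for the `ad` flag by eye gets tedious when many domains are involved. The sketch below extracts the header flags from dig output so they can be tested in a script; `dig_flags` and `has_flag` are hypothetical helper names, and the parsing assumes dig's usual `;; flags: qr rd ra ad; QUERY: ...` header line.

```shell
# dig_flags: read dig output on stdin, print the header flags one per line.
# Hypothetical helper; assumes the standard ";; flags: ...;" header line.
dig_flags() {
    awk '/^;; flags:/ {
             sub(/^;; flags: */, ""); sub(/;.*/, "")
             n = split($0, f, " ")
             for (i = 1; i <= n; i++) print f[i]
             exit
         }'
}

# has_flag FLAG: succeed if the dig output on stdin carries that header flag.
has_flag() { dig_flags | grep -qx "$1"; }
```

Example: `dig +dnssec example.com | has_flag ad && echo "validated"`. The `ad` flag only appears when the resolver being queried performs DNSSEC validation, so its absence does not by itself mean the zone is unsigned.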

Startup and Verification

After troubleshooting and fixing, run the following verification steps:

# Check 1: basic resolution
dig +short example.com
# Should return the correct IP address

# Check 2: resolution latency
dig example.com | grep "Query time"
# Query time should be in a reasonable range (cached <5 ms, uncached <100 ms)

# Check 3: different record types
dig +short example.com A
dig +short example.com AAAA
dig +short example.com MX
dig +short example.com NS

# Check 4: reverse resolution
dig -x <IP address> +short

# Check 5: internal name resolution
dig +short internal-service.corp.example.com

# Check 6: verify with an application-level tool
curl -o /dev/null -s -w "DNS lookup: %{time_namelookup}s\nConnect: %{time_connect}s\nTotal: %{time_total}s\n" https://example.com

# Check 7: batch resolution test
while read -r domain; do
    result=$(dig +short "$domain" 2>/dev/null)
    if [ -z "$result" ]; then
        echo "FAIL: $domain"
    else
        echo "OK: $domain -> $result"
    fi
done << 'EOF'
example.com
google.com
internal-api.corp.example.com
EOF

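Sample Code and Configuration

Complete configuration examples

systemd-resolved production configuration

# File: /etc/systemd/resolved.conf.d/production.conf
# For Ubuntu 24.04 LTS production servers

[Resolve]
# Primary DNS servers (internal DNS)
DNS=10.0.0.2 10.0.0.3
# Fallback DNS servers (public DNS). Note: these are used only when no other
# DNS server is configured at all, not when the servers above are unreachable
FallbackDNS=8.8.8.8 1.1.1.1 223.5.5.5
# Search domain
Domains=corp.example.com
# Disable mDNS (not needed in production)
MulticastDNS=no
# Disable LLMNR (not needed in production)
LLMNR=no
# DNSSEC policy: allow-downgrade validates when the server supports it and skips validation otherwise
DNSSEC=allow-downgrade
# Disable DNS-over-TLS (if the internal DNS does not support it)
DNSOverTLS=no
# Caching
Cache=yes
# DNS stub listener
DNSStubListener=yes

# Apply the configuration
sudo systemctl restart systemd-resolved

# Verify that it took effect
resolvectl status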

Unbound local cache configuration

Where systemd-resolved is not used, Unbound can be deployed as a local caching recursive DNS server:

# File: /etc/unbound/unbound.conf
# Unbound 1.22+ local caching recursive DNS configuration

server:
    # Listen addresses
    interface: 127.0.0.1
    interface: ::1
    port: 53

    # Access control
    access-control: 127.0.0.0/8 allow
    access-control: ::1/128 allow
    access-control: 10.0.0.0/8 allow

    # Performance tuning
    num-threads: 4
    msg-cache-slabs: 4
    rrset-cache-slabs: 4
    infra-cache-slabs: 4
    key-cache-slabs: 4

    # Cache sizes (adjust to available memory)
    msg-cache-size: 128m
    rrset-cache-size: 256m
    key-cache-size: 32m
    neg-cache-size: 16m

    # Cache TTL limits
    cache-min-ttl: 60
    cache-max-ttl: 86400
    cache-max-negative-ttl: 300

    # Prefetch cache entries that are about to expire
    prefetch: yes
    prefetch-key: yes

    # Hide identity and version
    hide-identity: yes
    hide-version: yes

    # DNSSEC root trust anchor
    auto-trust-anchor-file: "/var/lib/unbound/root.key"

    # Logging
    verbosity: 1
    log-queries: no
    log-replies: no
    logfile: "/var/log/unbound/unbound.log"

    # Performance options
    so-reuseport: yes
    minimal-responses: yes
    serve-expired: yes
    serve-expired-ttl: 86400

# Forward to upstream DNS (use this if you do not want full recursion)
# forward-zone:
#     name: "."
#     forward-addr: 10.0.0.2
#     forward-addr: 10.0.0.3

# Install and start Unbound
sudo apt install -y unbound

# Check the configuration syntax
sudo unbound-checkconf

# Start the service
sudo systemctl enable --now unbound

# Verify
dig @127.0.0.1 example.com

Custom CoreDNS Corefile (Kubernetes environments)

# Description: Kubernetes CoreDNS ConfigMap
# Tuned configuration for large clusters (500+ nodes)

.:53 {
    errors
    health {
        lameduck 15s
    }
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
        ttl 30
    }
    prometheus :9153
    forward . 10.0.0.2 10.0.0.3 {
        max_concurrent 3000
        policy sequential
        health_check 5s
        expire 10s
    }
    cache 60 {
        success 8192
        denial 4096
    }
    loop
    reload
    loadbalance
    bufsize 1232
}

# Internal names go to dedicated DNS servers
corp.example.com:53 {
    errors
    cache 120
    forward . 10.0.0.10 10.0.0.11
}

Real-World Cases

Case 1: resolv.conf overwritten by NetworkManager

Scenario: an operator edited /etc/resolv.conf by hand to point at the internal DNS servers, but after a reboot or a network reconnect, resolv.conf reverted to the DNS addresses handed out by DHCP and internal names stopped resolving.

Investigation:

# Step 1: confirm the content and type of resolv.conf
cat /etc/resolv.conf
ls -la /etc/resolv.conf

# It is a regular file, and nameserver points at the DHCP-assigned address
# nameserver 192.168.1.1

# Step 2: check the NetworkManager state
systemctl is-active NetworkManager
# active

# Step 3: check the NetworkManager DNS mode
grep -i dns /etc/NetworkManager/NetworkManager.conf
# Empty output means the default mode is in use, which overwrites resolv.conf

# Step 4: check the DNS servers handed out by DHCP
nmcli device show eth0 | grep DNS
# IP4.DNS[1]: 192.168.1.1

Solution:

# Option A: tell NetworkManager not to manage resolv.conf
cat << 'EOF' | sudo tee /etc/NetworkManager/conf.d/dns-none.conf
[main]
dns=none
EOF

# Write resolv.conf by hand
cat << 'EOF' | sudo tee /etc/resolv.conf
nameserver 10.0.0.2
nameserver 10.0.0.3
search corp.example.com
options timeout:2 attempts:2
EOF

sudo systemctl restart NetworkManager

# Option B: set static DNS through nmcli (recommended; NM keeps managing DNS)
sudo nmcli connection modify "eth0" ipv4.dns "10.0.0.2 10.0.0.3"
sudo nmcli connection modify "eth0" ipv4.dns-search "corp.example.com"
sudo nmcli connection modify "eth0" ipv4.ignore-auto-dns yes
sudo nmcli connection up "eth0"

# Verify
cat /etc/resolv.conf
dig +short internal-api.corp.example.com

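Case 2: DNS timeouts slow down service startup

Scenario: a Java application hangs for 30-60 seconds during startup before initialisation completes. The application log shows no obvious error, but startup time has gone from the normal 10 seconds to 40-70 seconds.

Investigation:

# Step 1: trace DNS queries during startup with strace
# Start the application first and note its PID
strace -e trace=network -f -p <PID> 2>&1 | grep -iE "connect|sendto|recvfrom"

# Or strace the application from the start
strace -e trace=network -f -tt java -jar app.jar 2>&1 | grep -E "sendto|recvfrom" | head -50

# Step 2: capture DNS traffic
sudo tcpdump -i any -nn -l port 53 &

# Start the application and watch the DNS queries
# Many AAAA (IPv6) queries time out while the matching A queries are answered at once

# Typical tcpdump output:
# 10:00:01.000 IP 10.0.0.5.42371 > 10.0.0.2.53: A? db.corp.example.com
# 10:00:01.001 IP 10.0.0.5.42372 > 10.0.0.2.53: AAAA? db.corp.example.com
# 10:00:01.002 IP 10.0.0.2.53 > 10.0.0.5.42371: A 10.0.1.100  (the A answer arrives immediately)
# 10:00:06.001 IP 10.0.0.5.42372 > 10.0.0.2.53: AAAA? db.corp.example.com  (retry after 5 seconds)
# 10:00:11.001 (times out again; only now does the lookup return to the application)

# Step 3: confirm the root cause
# The DNS server does not handle AAAA queries (the internal DNS has no IPv6 data),
# but instead of returning a proper NODATA response it silently drops the AAAA query,
# so the client waits for the timeout

Solution:

# Option A: add single-request-reopen to resolv.conf
# (helps when the AAAA reply is lost because both queries share a source port;
#  it does not help if the server drops every AAAA query)
echo "options single-request-reopen" | sudo tee -a /etc/resolv.conf

# Option B: disable IPv6 DNS lookups at the JVM level
# Add to the Java startup options:
# -Djava.net.preferIPv4Stack=true

# Option C: fix the DNS server so that AAAA queries get a correct empty response
# In BIND9, make sure the zone is loaded so that names without an AAAA record
# are answered with NOERROR and an empty answer section

# Option D: if the whole environment does not use IPv6, disable it system-wide
# (applications that resolve with AI_ADDRCONFIG then skip AAAA lookups; others may still send them)
cat << 'EOF' | sudo tee /etc/sysctl.d/99-disable-ipv6.conf
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
EOF
sudo sysctl --system

# Verify the fix
time dig +short db.corp.example.com
# Should complete within milliseconds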

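Case 3: internal DNS conflicts with public DNS

Scenario: the company uses corp.example.com as its internal domain, and resolv.conf lists both the internal DNS and a public DNS (as a backup). When the internal DNS is briefly unstable, requests fall through to the public DNS, which cannot resolve internal names and returns NXDOMAIN. The application caches this NXDOMAIN, so even after the internal DNS recovers, internal names stay unresolvable for a while.

Investigation:

# Step 1: reproduce the problem
dig @10.0.0.2 internal-api.corp.example.com
# Normal answer: 10.0.1.50

dig @8.8.8.8 internal-api.corp.example.com
# Answer: NXDOMAIN (the public DNS does not know this name)

# Step 2: check resolv.conf
cat /etc/resolv.conf
# nameserver 10.0.0.2
# nameserver 8.8.8.8    <-- the problem: a public DNS as backup

# Step 3: verify the failover behaviour
# When 10.0.0.2 times out, the system tries 8.8.8.8
# 8.8.8.8 returns NXDOMAIN (a definite negative answer, not a timeout)
# The glibc resolver accepts this NXDOMAIN as the final result

Solution:

# Option A: use a split-DNS architecture
# Internal names go to the internal DNS, public names go to public DNS
# Here done with systemd-resolved

cat << 'EOF' | sudo tee /etc/systemd/resolved.conf.d/split-dns.conf
[Resolve]
DNS=10.0.0.2 10.0.0.3
FallbackDNS=8.8.8.8 1.1.1.1
Domains=corp.example.com
EOF

sudo systemctl restart systemd-resolved

# With this, names under corp.example.com are only sent to 10.0.0.2 and 10.0.0.3
# FallbackDNS is consulted only when no other DNS server is configured at all,
# so a public resolver never answers NXDOMAIN for an internal name during a brief outage

# Option B: without systemd-resolved, use dnsmasq for split DNS
sudo apt install -y dnsmasq

cat << 'EOF' | sudo tee /etc/dnsmasq.d/split-dns.conf
# Internal names go to the internal DNS
server=/corp.example.com/10.0.0.2
server=/corp.example.com/10.0.0.3

# Reverse lookups go to the internal DNS
server=/10.in-addr.arpa/10.0.0.2

# Everything else goes to public DNS
server=8.8.8.8
server=1.1.1.1

# Cache size
cache-size=10000

# Do not read /etc/resolv.conf (avoids a loop)
no-resolv

# Listen address
listen-address=127.0.0.1
bind-interfaces
EOF

# Point resolv.conf at dnsmasq
echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf

sudo systemctl enable --now dnsmasq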

DNS health check script

#!/bin/bash
# File: dns_health_check.sh
# Purpose: periodically check DNS resolution health and send an alert on anomalies
# Requires: dig, curl (for alert delivery)
# Usage: run once a minute from crontab
#   * * * * * /opt/scripts/dns_health_check.sh >> /var/log/dns_health.log 2>&1

# ============ Configuration ============

# DNS servers to check
# (every server listed here is tested against every domain below, so a public
#  resolver such as 8.8.8.8 will report the internal names as failures)
DNS_SERVERS=("10.0.0.2" "10.0.0.3" "8.8.8.8")

# Domains to check (internal and public mixed)
declare -A TEST_DOMAINS
TEST_DOMAINS=(
    ["internal-api.corp.example.com"]="10.0.1.50" # expected IP
    ["db-master.corp.example.com"]="10.0.2.10"
    ["example.com"]="" # public name: do not check the IP, only that it resolves
    ["google.com"]=""
)

# Latency threshold (milliseconds)
LATENCY_THRESHOLD=500

# Alert webhook URL (WeCom/DingTalk/Feishu)
WEBHOOK_URL="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=YOUR_KEY_HERE"

# State file (used to suppress duplicate alerts)
STATE_FILE="/tmp/dns_health_state"

# Log timestamp
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

# ============ Functions ============

# Send an alert notification
send_alert() {
    local message="$1"
    curl -s -X POST "$WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "{\"msgtype\":\"text\",\"text\":{\"content\":\"[DNS alert] $TIMESTAMP\n$message\"}}" \
        > /dev/null 2>&1
}

# Check one domain against one DNS server
check_dns() {
    local server="$1"
    local domain="$2"
    local expected_ip="$3"
    local errors=""

    # Run dig and extract the answer and the latency
    local dig_output
    dig_output=$(dig @"$server" +time=3 +tries=1 +noall +answer +stats "$domain" 2>/dev/null)

    local resolved_ip
    resolved_ip=$(echo "$dig_output" | grep -E "^$domain" | grep -oP '\d+\.\d+\.\d+\.\d+' | head -1)

    local query_time
    query_time=$(echo "$dig_output" | grep "Query time" | grep -oP '\d+' | head -1)

    # Did the name resolve at all?
    if [ -z "$resolved_ip" ]; then
        errors="DNS server $server cannot resolve $domain"
    fi

    # If an expected IP was given, check that the answer matches
    if [ -n "$expected_ip" ] && [ -n "$resolved_ip" ] && [ "$resolved_ip" != "$expected_ip" ]; then
        errors="DNS server $server returned an unexpected answer for $domain: expected $expected_ip, got $resolved_ip"
    fi

    # Check the latency
    if [ -n "$query_time" ] && [ "$query_time" -gt "$LATENCY_THRESHOLD" ]; then
        if [ -z "$errors" ]; then
            errors="DNS server $server is slow resolving $domain: ${query_time}ms (threshold: ${LATENCY_THRESHOLD}ms)"
        else
            errors="$errors; latency is also high: ${query_time}ms"
        fi
    fi

    if [ -n "$errors" ]; then
        echo "$errors"
        return 1
    fi
    return 0
}

# ============ Main ============

all_errors=""
total_checks=0
failed_checks=0

for server in "${DNS_SERVERS[@]}"; do
    for domain in "${!TEST_DOMAINS[@]}"; do
        expected="${TEST_DOMAINS[$domain]}"
        total_checks=$((total_checks + 1))

        error_msg=$(check_dns "$server" "$domain" "$expected")
        if [ $? -ne 0 ]; then
            failed_checks=$((failed_checks + 1))
            # Keep "\n" as a literal two-character sequence so it stays valid inside the JSON payload
            all_errors="$all_errors\n$error_msg"
            echo "[$TIMESTAMP] FAIL: $error_msg"
        else
            echo "[$TIMESTAMP] OK: $server -> $domain"
        fi
    done
done

# Summary
echo "[$TIMESTAMP] Done: $total_checks checks, $failed_checks failed"

# Send the alert (with de-duplication)
if [ $failed_checks -gt 0 ]; then
    # Hash the current errors to avoid sending the same alert repeatedly
    current_hash=$(echo -e "$all_errors" | md5sum | awk '{print $1}')
    last_hash=$(cat "$STATE_FILE" 2>/dev/null)

    if [ "$current_hash" != "$last_hash" ]; then
        alert_msg="$failed_checks of $total_checks DNS checks failed:$all_errors"
        send_alert "$alert_msg"
        echo "$current_hash" > "$STATE_FILE"
        echo "[$TIMESTAMP] Alert sent"
    else
        echo "[$TIMESTAMP] Same as the previous alert, not sending again"
    fi
else
    # If there was an alert before and everything has recovered, send a recovery notice
    if [ -f "$STATE_FILE" ]; then
        send_alert "DNS resolution is back to normal, all $total_checks checks passed"
        rm -f "$STATE_FILE"
        echo "[$TIMESTAMP] Recovery notice sent"
    fi
fi

# Schedule the script
chmod +x /opt/scripts/dns_health_check.sh
echo "* * * * * root /opt/scripts/dns_health_check.sh >> /var/log/dns_health.log 2>&1" | \
    sudo tee /etc/cron.d/dns-health-check

Best Practices and Caveats

Best practices

DNS high-availability architecture

Active-active DNS design:

In production, deploy at least two Local DNS servers in different availability zones or racks, and configure both nameserver addresses in the clients' resolv.conf.

# /etc/resolv.conf high-availability configuration
nameserver 10.0.0.2    # primary DNS (availability zone A)
nameserver 10.0.1.2    # secondary DNS (availability zone B)
search corp.example.com
options timeout:2 attempts:2 rotate

Keepalived + Unbound/BIND9 for a floating VIP:

# Keepalived configuration (master node)
# File: /etc/keepalived/keepalived.conf

vrrp_script check_dns {
    script "/usr/local/bin/check_dns.sh"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance DNS_VIP {
    state MASTER
    interface eth0
    virtual_router_id 53
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass dns_ha_secret
    }
    virtual_ipaddress {
        10.0.0.100/24
    }
    track_script {
        check_dns
    }
}

#!/bin/bash
# File: /usr/local/bin/check_dns.sh
# Keepalived health check script

# Try to resolve a known name
result=$(dig @127.0.0.1 +time=2 +tries=1 +short health.check.local 2>/dev/null)

if [ -z "$result" ]; then
    exit 1  # check failed, lower the priority
fi
exit 0  # check passed

Local DNS cache tuning

# systemd-resolved cache tuning
# Show the current cache size
resolvectl statistics

# The systemd-resolved cache size is fixed at compile time, 4096 entries by default
# If you need a larger cache, use Unbound or dnsmasq instead

# Unbound cache warm-up script
#!/bin/bash
# File: dns_cache_warm.sh
# Run after Unbound restarts to pre-populate the cache with frequently used names

DOMAINS=(
    "internal-api.corp.example.com"
    "db-master.corp.example.com"
    "redis.corp.example.com"
    "kafka.corp.example.com"
    "registry.corp.example.com"
)

for domain in "${DOMAINS[@]}"; do
    dig @127.0.0.1 +short "$domain" > /dev/null 2>&1
    dig @127.0.0.1 +short "$domain" AAAA > /dev/null 2>&1
done

echo "$(date): DNS cache warm-up finished, ${#DOMAINS[@]} domains"

Security hardening

# Restrict who may use recursion (BIND9)
# Only internal IPs may use recursive queries, so the server cannot be abused for DNS amplification attacks

# In named.conf:
acl "internal" {
    10.0.0.0/8;
    172.16.0.0/12;
    192.168.0.0/16;
    127.0.0.1;
};

options {
    allow-recursion { internal; };
    allow-query { internal; };
    allow-query-cache { internal; };
    rate-limit {
        responses-per-second 10;
        window 5;
    };
};

# Enable rate limiting (to resist DNS amplification and query floods)
# Unbound configuration
server:
    # Limit the queries per second sent towards any one zone's name servers
    ratelimit: 1000
    # Limit the queries per second accepted from any one client IP
    ip-ratelimit: 200

# Encrypt DNS queries with DNS-over-TLS (protects against on-path eavesdropping)
# systemd-resolved configuration
cat << 'EOF' | sudo tee /etc/systemd/resolved.conf.d/dot.conf
[Resolve]
DNS=1.1.1.1#cloudflare-dns.com 8.8.8.8#dns.google
DNSOverTLS=yes
EOF
sudo systemctl restart systemd-resolved

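Caveats

Configuration caveats

The search and ndots trap in /etc/resolv.conf:

Kubernetes configures ndots:5 by default, which means that any name with fewer than 5 dots is tried with the search domains appended first. A simple lookup of example.com (only 1 dot) produces the following query sequence:

example.com.default.svc.cluster.local  -> NXDOMAIN
example.com.svc.cluster.local         -> NXDOMAIN
example.com.cluster.local              -> NXDOMAIN
example.com.corp.example.com           -> NXDOMAIN
example.com.                           -> success

Every external lookup generates 4 extra useless queries (doubled when both A and AAAA are requested), which hurts performance badly.

Solution: use an FQDN (ending with a dot) when looking up external names, for example example.com. with the trailing dot. Alternatively, override dnsConfig in the Pod spec:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"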

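Common errors

Symptom	Cause	Solution
dig works but curl fails	nsswitch.conf misconfiguration, or a wrong entry in /etc/hosts (dig bypasses both)	Check the order of the hosts line in nsswitch.conf and clean up /etc/hosts
Resolution works but occasionally takes 5 seconds	Concurrent A/AAAA queries collide in conntrack (common in Kubernetes)	Add the single-request-reopen option to resolv.conf
Old IP still returned after a domain cutover	DNS caches have not expired (browser/OS/Local DNS/JVM, several levels)	Flush the caches level by level and wait for the TTL to expire
SERVFAIL response code	The DNS server could not complete the recursive query, or DNSSEC validation failed	Use dig +trace to find the failing stage and dig +cd to test whether DNSSEC is the cause
REFUSED response code	The DNS server refuses recursive service to the client IP	Check allow-recursion and the ACL configuration on the DNS server
Changes to resolv.conf have no effect	The file is overwritten by NM/systemd-resolved, or the application has an internal DNS cache	Configure DNS through the component that manages it, and restart the application to clear its internal cache
Queries on TCP port 53 fail	The firewall allows only UDP 53 and not TCP 53	DNS needs both UDP and TCP port 53, especially for DNSSEC and large responses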

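Compatibility issues

glibc vs musl libc: Alpine Linux uses musl libc, whose DNS resolution behaviour differs from glibc. musl does not support options single-request-reopen in resolv.conf, and its handling of search domains also differs. Take particular care when using Alpine base images in Kubernetes.
systemd-resolved and Docker: Docker containers use the host's /etc/resolv.conf by default. If the host uses systemd-resolved (nameserver 127.0.0.53), the container cannot reach that address because it is the host's loopback. Docker 18.03+ detects this automatically and substitutes the real upstream DNS addresses, but older versions need manual handling.
IPv6 and DNS: in an IPv4-only environment, if the DNS server returns AAAA records, some applications try IPv6 first and only fall back to IPv4 after the connection times out. The preference can be changed by setting precedence ::ffff:0:0/96 100 in /etc/gai.conf.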

Troubleshooting and Monitoring

Troubleshooting

Viewing logs

# Show the systemd-resolved log
sudo journalctl -u systemd-resolved --since "1 hour ago" --no-pager

# Show the BIND9 log
sudo journalctl -u named -f --no-pager
# Or read the log file
sudo tail -f /var/log/named/queries.log

# Show the CoreDNS log (Kubernetes environments)
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=200 -f

# Show the Unbound log
sudo tail -f /var/log/unbound/unbound.log

# Show DNS-related messages across the system journal
# (journalctl -g takes a PCRE pattern, so alternation is a plain "|")
sudo journalctl -g "dns|resolve|nameserver" --since "1 hour ago" --no-pager

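Troubleshooting common problems

Problem 1: no domain resolves at all

# 1. Check whether resolv.conf is empty or has been wiped
cat /etc/resolv.conf
# If the file is empty or has no nameserver line, that is the cause

# 2. Check connectivity to the DNS server
ping -c 3 10.0.0.2
# If it is unreachable, check the network configuration

# 3. Check whether the DNS port is reachable
# (UDP port probes are unreliable; an actual query such as
#  "dig @10.0.0.2 +time=2 +tries=1 example.com" is the most direct test)
nc -zvu 10.0.0.2 53
# or
nmap -sU -p 53 10.0.0.2

# 4. Check whether the local firewall blocks outbound DNS traffic
sudo iptables -L OUTPUT -n -v | grep 53
sudo nft list ruleset | grep 53

# 5. Check that systemd-resolved is running normally
systemctl status systemd-resolved
# If it is inactive, start it
sudo systemctl start systemd-resolved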

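Problem 2: some domains fail to resolve

# 1. Compare different DNS servers
dig @10.0.0.2 failing-domain.com
dig @8.8.8.8 failing-domain.com
dig @1.1.1.1 failing-domain.com

# 2. Trace the whole resolution path with +trace
dig +trace failing-domain.com

# 3. Check for an EDNS problem (large UDP responses being dropped)
dig +bufsize=512 failing-domain.com
dig +bufsize=4096 failing-domain.com
# If the 512-byte query succeeds (possibly by retrying over TCP after truncation)
# but the 4096-byte query times out, a device on the path is dropping large or
# fragmented DNS UDP packets

# 4. Check for a DNSSEC problem
dig +cd failing-domain.com
# If the query succeeds with +cd but fails without it, DNSSEC validation is the problem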

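Problem 3: Intermittent 5-second DNS delays in Kubernetes Pods

# This is a known race condition in the Linux kernel conntrack module.
# glibc sends the A and AAAA queries in parallel from the same socket (same source port);
# when both packets pass through conntrack/DNAT at the same moment, one of them can be dropped.
# The client then waits for the default 5-second timeout before retrying.

# To confirm: capture packets inside the Pod (the image must include tcpdump)
kubectl exec -it <pod-name> -- tcpdump -i any port 53 -nn -l

# Look for queries that never receive a response

# Solution 1: change the Pod's resolv.conf options
# Add to the Pod spec (honored by glibc; musl-based images such as Alpine do not support this option):
spec:
  dnsConfig:
    options:
      - name: single-request-reopen

# Solution 2: use NodeLocal DNSCache
# Deploy the NodeLocal DNSCache DaemonSet so that every node has a local DNS cache
# Reference: https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/

# Solution 3: upgrade the kernel to 5.0+ and confirm the fix is present
# Linux 5.0 merged fixes for the conntrack race condition; not every variant
# may be covered, so treat the upgrade as a mitigation and verify with a capture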
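
Before capturing packets, it can help to confirm that slow lookups really cluster around the 5-second resolver timeout. The sketch below is an illustration only: time_lookup_ms and count_timeout_like are hypothetical helper names, date +%s%N assumes GNU date, and getent ahosts is used because it goes through getaddrinfo(), which issues the A and AAAA queries together.

```bash
# time_lookup_ms NAME: print how long one getaddrinfo()-style lookup takes, in ms.
# Assumes GNU date (nanosecond support via %N).
time_lookup_ms() {
    local start end
    start="$(date +%s%N)"
    getent ahosts "$1" > /dev/null
    end="$(date +%s%N)"
    echo $(( (end - start) / 1000000 ))
}

# count_timeout_like [THRESHOLD_MS]: read one duration in ms per line on stdin
# and print how many samples are at or above the threshold (default 5000 ms,
# the default resolver timeout).
count_timeout_like() {
    local threshold_ms="${1:-5000}"
    awk -v t="$threshold_ms" '$1 + 0 >= t { n++ } END { print n + 0 }'
}
```

A possible run inside a Pod: for i in $(seq 1 200); do time_lookup_ms example.com; done | count_timeout_like. A non-zero count with samples just above 5000 ms matches the pattern described above; it does not by itself prove that conntrack is the cause.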

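Debug Mode

# dig with full output (all sections, comments, multi-line records)
dig +all +multiline +comments example.com

# Trace an application's DNS activity with strace
# (glibc may send the A and AAAA queries with sendmmsg, so include it in the filter)
strace -e trace=network -f -p <PID> 2>&1 | grep -E "connect|sendto|sendmmsg|recvfrom"

# Run nscd in debug mode (if nscd caching is in use)
# Stop the nscd service first; -d keeps nscd in the foreground and prints debug output
sudo nscd -d

# systemd-resolved debug logging
# Raise the log level of the running service (available in recent systemd releases)
# rather than starting a second instance, which conflicts with the running one
sudo resolvectl log-level debug
# Restore when finished
sudo resolvectl log-level info

# Raise the Unbound log level
sudo unbound-control verbosity 5
# Restore when finished
sudo unbound-control verbosity 1

# Enable BIND9 query logging
sudo rndc querylog on
# Disable when finished
sudo rndc querylog off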

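Performance Monitoring

Key Metrics

Metric	Normal range	Alert threshold	Description
DNS resolution latency	< 50ms	> 200ms	Time from sending the query to receiving the response
DNS resolution failure rate	< 0.1%	> 1%	SERVFAIL + NXDOMAIN (excluding legitimate NXDOMAIN)
DNS cache hit rate	> 80%	< 50%	Cache hits / total queries
DNS server QPS	Depends on server sizing	Above 80% of capacity	Queries per second
DNS TCP query ratio	< 5%	> 20%	Share of queries sent over TCP; a high value means UDP responses exceed the size limit
Upstream DNS response time	< 100ms	> 500ms	Latency from the recursive server to the authoritative servers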
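
The ratio metrics in the table are simple to compute from raw counters. The following is a minimal sketch, assuming the counts have already been collected from a resolver's statistics; ratio_percent and exceeds are hypothetical helper names.

```bash
# ratio_percent NUM DEN: print NUM/DEN as a percentage with two decimals.
# Prints 0.00 when DEN is 0 to avoid a division by zero.
ratio_percent() {
    awk -v n="$1" -v d="$2" 'BEGIN {
        if (d + 0 == 0) printf "0.00\n"
        else printf "%.2f\n", n * 100 / d
    }'
}

# exceeds VALUE LIMIT: succeed (exit 0) when VALUE is strictly greater than LIMIT.
# awk is used because shell arithmetic is integer-only.
exceeds() {
    awk -v v="$1" -v l="$2" 'BEGIN { exit !(v + 0 > l + 0) }'
}
```

For instance, with 5 failed queries out of 1000, ratio_percent 5 1000 yields 0.50, which stays under the 1% alert threshold: exceeds "$(ratio_percent 5 1000)" 1 || echo "failure rate OK".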

Prometheus + Grafana Monitoring Configuration

Monitor DNS resolution with Blackbox Exporter:

# File: /etc/prometheus/blackbox.yml
# Blackbox Exporter DNS probe configuration

modules:
  dns_internal:
    prober: dns
    timeout: 5s
    dns:
      query_name: "internal-api.corp.example.com"
      query_type: "A"
      valid_rcodes:
        - NOERROR
      validate_answer_rrs:
        fail_if_none_matches_regexp:
          - ".*10\\.0\\.1\\.50.*"
      preferred_ip_protocol: "ip4"

  dns_external:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.com"
      query_type: "A"
      valid_rcodes:
        - NOERROR
      preferred_ip_protocol: "ip4"

  dns_soa:
    prober: dns
    timeout: 5s
    dns:
      query_name: "corp.example.com"
      query_type: "SOA"
      valid_rcodes:
        - NOERROR

Prometheus scrape configuration:

# File: /etc/prometheus/prometheus.yml (relevant fragment)

scrape_configs:
  # DNS health probes
  - job_name: 'dns_probe'
    metrics_path: /probe
    scrape_interval: 30s
    params:
      module: [dns_internal]
    static_configs:
      - targets:
        - '10.0.0.2'  # Internal DNS 1
        - '10.0.0.3'  # Internal DNS 2
        # Probe public resolvers such as 8.8.8.8 in a separate job with the
        # dns_external module; they cannot answer the internal name
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 'blackbox-exporter:9115'

  # CoreDNS metrics (Kubernetes)
  - job_name: 'coredns'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['kube-system']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_k8s_app]
        regex: kube-dns
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "9153"
        action: keep

  # BIND9 metrics (requires bind_exporter)
  - job_name: 'bind9'
    static_configs:
      - targets: ['10.0.0.2:9119', '10.0.0.3:9119']

  # Unbound metrics (requires unbound_exporter)
  - job_name: 'unbound'
    static_configs:
      - targets: ['localhost:9167']

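Alert rules:

# File: /etc/prometheus/rules/dns_alerts.yml

groups:
  - name: dns_alerts
    rules:
      # DNS resolution latency too high
      # probe_duration_seconds covers the whole DNS probe; probe_dns_lookup_time_seconds
      # only measures resolving the target's own hostname and is near zero for IP targets
      - alert: DNSResolutionSlow
        expr: probe_duration_seconds{job="dns_probe"} > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DNS resolution latency too high"
          description: "DNS server {{ $labels.instance }} resolution latency is {{ $value | humanizeDuration }}, above the 200ms threshold"

      # DNS resolution failed
      - alert: DNSResolutionFailed
        expr: probe_success{job="dns_probe"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "DNS resolution failed"
          description: "DNS server {{ $labels.instance }} has failed to resolve for 2 consecutive minutes"

      # CoreDNS error rate too high
      - alert: CoreDNSErrorRateHigh
        expr: |
          sum(rate(coredns_dns_responses_total{rcode=~"SERVFAIL|REFUSED"}[5m])) by (instance)
          /
          sum(rate(coredns_dns_responses_total[5m])) by (instance)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CoreDNS error rate too high"
          description: "CoreDNS {{ $labels.instance }} error rate is {{ $value | humanizePercentage }}, above the 5% threshold"

      # DNS cache hit rate too low
      # Uses the hit/miss counters rather than the response-time histogram, which
      # only covers recursive lookups. Metric names depend on the unbound_exporter
      # in use; adjust them to what your exporter exposes.
      - alert: DNSCacheHitRateLow
        expr: |
          sum(rate(unbound_cache_hits_total[5m]))
          /
          (sum(rate(unbound_cache_hits_total[5m])) + sum(rate(unbound_cache_misses_total[5m])))
          < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "DNS cache hit rate too low"
          description: "Unbound cache hit rate is below 50%; the cache size may need to be increased"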

Backup and Recovery

DNS Configuration Backup

#!/bin/bash
# File: dns_config_backup.sh
# Purpose: back up DNS-related configuration files

BACKUP_DIR="/backup/dns/$(date +%Y%m%d_%H%M%S)"
mkdir -p "$BACKUP_DIR" || exit 1

# Back up client-side DNS configuration
cp /etc/resolv.conf "$BACKUP_DIR/"
cp /etc/nsswitch.conf "$BACKUP_DIR/"
cp /etc/hosts "$BACKUP_DIR/"
cp -r /etc/systemd/resolved.conf.d/ "$BACKUP_DIR/" 2>/dev/null

# Back up BIND9 configuration (if present)
# These are RHEL-style paths; Debian/Ubuntu keep the configuration under /etc/bind
if [ -d /etc/named ]; then
    cp -r /etc/named/ "$BACKUP_DIR/bind9/"
    cp -r /var/named/ "$BACKUP_DIR/bind9-zones/"
fi

# Back up Unbound configuration (if present)
if [ -d /etc/unbound ]; then
    cp -r /etc/unbound/ "$BACKUP_DIR/unbound/"
fi

# Back up dnsmasq configuration (if present)
if [ -d /etc/dnsmasq.d ]; then
    cp -r /etc/dnsmasq.d/ "$BACKUP_DIR/dnsmasq/"
fi

# Record the current resolution state (used to verify a later restore)
dig +short example.com > "$BACKUP_DIR/resolve_baseline.txt"
dig +short internal-api.corp.example.com >> "$BACKUP_DIR/resolve_baseline.txt"
resolvectl status > "$BACKUP_DIR/resolved_status.txt" 2>/dev/null

# Compress, and remove the working directory only if the archive was created
tar czf "${BACKUP_DIR}.tar.gz" -C "$(dirname "$BACKUP_DIR")" "$(basename "$BACKUP_DIR")" \
    && rm -rf "$BACKUP_DIR"

echo "DNS configuration backed up to ${BACKUP_DIR}.tar.gz"

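Recovery Procedure

# 1. Extract the backup
BACKUP_FILE="/backup/dns/20260313_143000.tar.gz"
tar xzf "$BACKUP_FILE" -C /tmp/

# 2. Restore configuration files
# If /etc/resolv.conf is a symlink managed by systemd-resolved or NetworkManager,
# restore that service's configuration instead of overwriting the file directly
sudo cp /tmp/20260313_143000/resolv.conf /etc/resolv.conf
sudo cp /tmp/20260313_143000/nsswitch.conf /etc/nsswitch.conf

# 3. Restore the systemd-resolved configuration
sudo mkdir -p /etc/systemd/resolved.conf.d
sudo cp -r /tmp/20260313_143000/resolved.conf.d/* /etc/systemd/resolved.conf.d/
sudo systemctl restart systemd-resolved

# 4. Verify the result
resolvectl status
dig +short example.com
dig +short internal-api.corp.example.com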
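
The verification step can be made mechanical by comparing fresh answers with the resolve_baseline.txt file written by the backup script. The following is a minimal sketch; same_answers is a hypothetical helper name, and it assumes the records are expected to be stable (names served by a CDN or round-robin pool with changing addresses will legitimately differ).

```bash
# same_answers FILE_A FILE_B: succeed (exit 0) when both files contain the
# same set of lines. Order and duplicates are ignored because DNS servers
# may rotate the order of records between responses.
same_answers() {
    [ "$(sort -u "$1")" = "$(sort -u "$2")" ]
}
```

A possible use after the restore: write the two dig +short outputs to /tmp/resolve_now.txt in the same order as the backup script, then run same_answers /tmp/20260313_143000/resolve_baseline.txt /tmp/resolve_now.txt || echo "answers differ from baseline".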

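Summary

Key Takeaways

DNS resolution is a layered process, so troubleshoot it layer by layer: local files -> system configuration -> Local DNS -> recursion chain -> authoritative DNS
/etc/resolv.conf is the core of client-side DNS configuration, but on modern Linux systems it is often overwritten by NetworkManager or systemd-resolved, so confirm who manages the file before editing it
dig +trace is the most powerful tool for following the resolution chain and can pinpoint exactly which hop is failing
A DNSSEC validation failure results in SERVFAIL; dig +cd quickly confirms whether a problem is DNSSEC-related
With ndots:5 in Kubernetes, an external name with fewer than five dots is first tried against every search domain, so each lookup produces several wasted queries (one per search domain, typically three or more, doubled when both A and AAAA are requested). Using an FQDN with a trailing dot or lowering ndots noticeably reduces DNS latency
The intermittent 5-second DNS delay caused by the conntrack race condition is a classic Kubernetes problem; single-request-reopen or NodeLocal DNSCache can mitigate it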
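
The effect of ndots on the number of queries can be illustrated with a small sketch that lists the names a glibc-style stub resolver would try, in order. expand_candidates is a hypothetical helper name, and the search domains in the usage note are the usual defaults for a Pod in the default namespace, which is an assumption about the cluster. A real resolver stops at the first name that resolves, so the output is the worst case.

```bash
# expand_candidates NAME NDOTS SEARCH_DOMAIN...
# Print every name that would be tried for NAME, one per line, in order.
expand_candidates() {
    local name="$1" ndots="$2" dots domain
    shift 2
    # A trailing dot marks a fully qualified name: the search list is skipped.
    case "$name" in
        *.) echo "$name"; return ;;
    esac
    # Count the dots in the name.
    dots="$(printf '%s' "$name" | tr -cd '.' | wc -c | tr -d ' ')"
    if [ "$dots" -ge "$ndots" ]; then
        # Enough dots: the name is tried as-is first, then with each search domain.
        echo "$name."
        for domain in "$@"; do echo "$name.$domain."; done
    else
        # Too few dots: every search domain is tried before the name itself.
        for domain in "$@"; do echo "$name.$domain."; done
        echo "$name."
    fi
}
```

Running expand_candidates api.example.com 5 default.svc.cluster.local svc.cluster.local cluster.local prints three search-domain candidates before api.example.com. itself, while passing api.example.com. (with the trailing dot) prints a single name.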

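Further Study

DNS-over-HTTPS (DoH) / DNS-over-TLS (DoT): encrypting DNS queries protects against man-in-the-middle attacks and eavesdropping. systemd-resolved has supported opportunistic DoT since version 239 and strict mode since version 243, and mainstream browsers support DoH. Before deploying either in production, assess the impact on the internal DNS architecture.
DNSSEC signing management: if you run your own authoritative DNS, you need to know the procedures for DNSSEC key generation, zone signing, and key rollover (KSK/ZSK rotation). BIND9's inline-signing and auto-dnssec simplify this; newer BIND 9 releases replace auto-dnssec with dnssec-policy.
DNS performance tuning and capacity planning: DNS architecture design for large environments (tens of thousands of servers, millions of QPS), including anycast DNS, multi-level cache hierarchies, and DNS traffic analysis.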

References

BIND9 Administrator Reference Manual: https://bind9.readthedocs.io/
Unbound Documentation: https://unbound.docs.nlnetlabs.nl/
CoreDNS Manual: https://coredns.io/manual/toc/
systemd-resolved manual: man 8 systemd-resolved
RFC 1035 - Domain Names - Implementation and Specification
RFC 8484 - DNS Queries over HTTPS (DoH)
RFC 7858 - DNS over Transport Layer Security (DoT)
RFC 4033/4034/4035 - DNS Security Introduction and Requirements (DNSSEC)
Kubernetes DNS for Services and Pods: https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/

Appendix

Command Cheat Sheet

# Basic queries
dig example.com                     # Query the A record
dig example.com AAAA                 # Query the IPv6 record
dig example.com MX                   # Query the mail records
dig +short example.com               # Print only the answer
dig @8.8.8.8 example.com            # Use a specific DNS server

# Advanced queries
dig +trace example.com               # Trace the full resolution chain
dig +tcp example.com                 # Query over TCP
dig +dnssec example.com              # Request DNSSEC records (RRSIG)
dig +cd example.com                  # Skip DNSSEC validation
dig -x 1.2.3.4                       # Reverse lookup

# Inspect system configuration
cat /etc/resolv.conf                 # DNS server configuration
cat /etc/nsswitch.conf               # Name resolution order
cat /etc/hosts                       # Local static entries
ls -la /etc/resolv.conf              # Check whether it is a symlink

# systemd-resolved management
resolvectl status                    # Show DNS status
resolvectl statistics                # Cache statistics
resolvectl flush-caches              # Flush the cache
resolvectl query example.com         # Test resolution

# DNS server management
sudo rndc flush                      # Flush the BIND9 cache
sudo rndc status                     # BIND9 status
sudo unbound-control stats_noreset   # Unbound statistics

# Packet capture
sudo tcpdump -i any port 53 -nn -l   # Capture DNS traffic

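Configuration Parameters

Full /etc/resolv.conf parameter reference:

nameserver: IP address of a DNS server; at most 3, used in the order listed
search: search domain list; historically limited to 6 domains and 256 characters in total (glibc 2.26 and later no longer enforce this limit)
domain: default domain, mutually exclusive with search (if both are present, the one that appears last wins)
options ndots:N: threshold for the number of dots in a name, default 1. Names with fewer dots are tried with the search domains appended first
options timeout:N: timeout in seconds for a single query, default 5
options attempts:N: number of query attempts before giving up, default 2
options rotate: rotate across the configured nameservers (by default the first one is always preferred)
options single-request: send the A and AAAA queries sequentially instead of in parallel
options single-request-reopen: when the A and AAAA queries share a socket and only one response arrives, close the socket and send the second query from a new one
options edns0: enable the EDNS0 extension (glibc leaves it off by default; the stub resolv.conf generated by systemd-resolved sets it)
options trust-ad: trust the AD (Authenticated Data) flag returned by the upstream DNS server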
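
To see which values are actually in effect on a host, the options line can be read with a short helper. This is a minimal sketch: resolv_option is a hypothetical helper name, the defaults passed in are the ones listed above, and it assumes that when an option appears more than once the last occurrence wins.

```bash
# resolv_option FILE NAME DEFAULT: print the value of "options NAME:VALUE"
# from a resolv.conf-style file, or DEFAULT when the option is not set.
resolv_option() {
    local file="$1" name="$2" fallback="$3"
    awk -v name="$name" -v def="$fallback" '
        $1 == "options" {
            for (i = 2; i <= NF; i++) {
                # Options look like key:value; flags such as "rotate" have no value.
                split($i, kv, ":")
                if (kv[1] == name && kv[2] != "") value = kv[2]
            }
        }
        END { print (value != "" ? value : def) }
    ' "$file"
}
```

For example, resolv_option /etc/resolv.conf ndots 1 prints 5 inside a typical Kubernetes Pod and 1 on a host that does not set the option.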

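Hopefully this handbook makes DNS problems less daunting. From basic configuration checks to end-to-end chain analysis and production best practices, following this workflow resolves most DNS failures. If you discover new techniques or cases in practice, you are welcome to share them with peers in the 云栈社区 community.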
