In the cloud-native era, Kubernetes has become the de facto standard for container orchestration, but its complex architecture brings operational challenges of a new order. From frequently restarting Pods to broken service-to-service communication, from failed scheduling to volume-mount problems, every fault can threaten business stability. Mastering a systematic troubleshooting method is therefore essential for business continuity.
2. Technical Background
2.1 A Review of Key Kubernetes Components
Kubernetes uses the classic control-plane/worker architecture; understanding how the components interact is the foundation of troubleshooting:
Control plane components:
- kube-apiserver: the entry point for all operations, handling REST requests
- etcd: the single source of truth for cluster state
- kube-scheduler: makes Pod scheduling decisions
- kube-controller-manager: runs the built-in controllers (ReplicaSet, Deployment, and so on)
- cloud-controller-manager: integrates with the cloud provider's APIs
Node components:
- kubelet: the node agent that manages the Pod lifecycle
- kube-proxy: maintains the network rules that implement the Service abstraction
- Container runtime: e.g. Docker, containerd, CRI-O
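Before blaming a workload, it is worth confirming that the control plane itself is healthy. A minimal sketch, assuming a kubeadm-style cluster where the control-plane components run as static Pods in kube-system (managed offerings hide these):
# Control-plane Pods and their restart counts
kubectl get pods -n kube-system -o wide
# API server health endpoints
kubectl get --raw='/readyz?verbose'
kubectl get --raw='/livez?verbose'
# etcd health, if you have direct access (flags depend on your certificate layout)
# etcdctl endpoint health --cluster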
2.2 A Taxonomy of Common Faults
Based on years of operations experience, Kubernetes faults fall into six broad categories:
- Abnormal Pod states: Pending, CrashLoopBackOff, ImagePullBackOff, Error
- Scheduling problems: insufficient resources, conflicting affinity rules, unmatched taints and tolerations
- Network failures: unreachable Services, DNS resolution failures, broken cross-node communication
- Storage and mount problems: PVCs stuck unbound, mount timeouts, permission errors
- Node-level faults: NotReady, disk pressure, memory pressure
- Configuration errors: YAML syntax mistakes, insufficient RBAC permissions, misconfigured resource limits
2.3 A Troubleshooting Methodology
The three-step approach:
Step 1: gather information
# Check resource status
kubectl get pods -o wide
kubectl describe pod <pod-name>
kubectl get events --sort-by='.lastTimestamp'
Step 2: analyze logs
# View container logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container-name>
kubectl logs <pod-name> --previous  # logs from the previous, crashed container
Step 3: dig deeper
# Exec into the container
kubectl exec -it <pod-name> -- /bin/sh
# Inspect the node
kubectl describe node <node-name>
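The three steps can be bundled into a small helper so the same evidence is collected every time. A minimal sketch; the k8s_triage name and output layout are ours, not a standard tool:
# Run the three troubleshooting steps in order (hypothetical helper)
k8s_triage() {
  local pod="$1" ns="${2:-default}"
  echo "=== Step 1: status and events ==="
  kubectl get pod "$pod" -n "$ns" -o wide
  kubectl describe pod "$pod" -n "$ns" | tail -n 20
  echo "=== Step 2: current and previous logs ==="
  kubectl logs "$pod" -n "$ns" --tail=50 || true
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null || true
  echo "=== Step 3: node conditions ==="
  local node
  node=$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')
  kubectl describe node "$node" | grep -A 8 'Conditions:'
}
# Usage: k8s_triage webapp-7d8f9c default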
3. Core Topics
3.1 The Pod Lifecycle and Its States
A Pod moves through several phases, and each state has a specific meaning:
Key states:
| State | Meaning | Common causes |
| --- | --- | --- |
| Pending | Being scheduled or waiting for resources | Insufficient resources, image still pulling, storage not ready |
| Running | Running normally | - |
| Succeeded | Completed successfully (Job/CronJob) | - |
| Failed | Execution failed | Container exited with a non-zero code |
| Unknown | State cannot be determined | Node communication failure |
| CrashLoopBackOff | Crashing and restarting repeatedly | Application fails to start, failing health checks |
| ImagePullBackOff | Image pull failed | Image does not exist, authentication failure, network problems |
Inspecting container status:
# Full Pod status
kubectl get pod <pod-name> -o yaml | grep -A 10 status
# Container restart counts (containerStatuses is an array)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].restartCount}'
# Container readiness
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].ready}'
3.2 Essential Troubleshooting Commands
Gathering basic information:
# Pods across all namespaces
kubectl get pods -A
# Pod details (IP, node, start time)
kubectl get pods -o wide -n <namespace>
# The Pod's full YAML
kubectl get pod <pod-name> -o yaml
# Detailed Pod description (the single most useful troubleshooting command)
kubectl describe pod <pod-name> -n <namespace>
Log-viewing techniques:
# Last 100 lines
kubectl logs <pod-name> --tail=100
# Follow logs in real time (like tail -f)
kubectl logs -f <pod-name>
# Logs from one container in a multi-container Pod
kubectl logs <pod-name> -c <container-name>
# Logs from all containers
kubectl logs <pod-name> --all-containers=true
# Logs from before the last crash
kubectl logs <pod-name> --previous
# Add timestamps
kubectl logs <pod-name> --timestamps=true
# Only the last hour
kubectl logs <pod-name> --since=1h
Viewing events:
# Cluster events, sorted by time
kubectl get events --sort-by='.lastTimestamp'
# Events in a specific namespace
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
# Events for a specific Pod
kubectl get events --field-selector involvedObject.name=<pod-name>
# Warning-level events only
kubectl get events --field-selector type=Warning
Checking resource usage:
# Node resource usage (requires metrics-server)
kubectl top nodes
# Pod resource usage
kubectl top pods -n <namespace>
# A specific Pod, broken down by container
kubectl top pod <pod-name> --containers
3.3 Log Analysis Techniques
Key points:
- Check the container exit code (a combined helper is sketched at the end of this subsection)
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
Common exit codes:
- 0: clean exit
- 1: application error
- 137: killed by SIGKILL (128 + 9; typically OOMKilled)
- 143: received SIGTERM (128 + 15; graceful shutdown)
- 255: exit status out of range
- Check for OOMKilled
kubectl describe pod <pod-name> | grep -i "OOMKilled"
kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
- Aggregate logs across Pods
# Logs from every Pod matching a label
kubectl logs -l app=nginx --tail=50
# Using stern (recommended)
stern <pod-prefix> -n <namespace>
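The exit-code and OOM checks above can be folded into a single command. A minimal sketch; why_died is a hypothetical name, not a standard tool:
# Print exit code, termination reason, and the last crash logs in one go
# (hypothetical helper, shown for illustration)
why_died() {
  local pod="$1" ns="${2:-default}"
  kubectl get pod "$pod" -n "$ns" -o jsonpath='exit code: {.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}reason: {.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'
  kubectl logs "$pod" -n "$ns" --previous --tail=20 2>/dev/null
}
# Usage: why_died webapp-7d8f9c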
3.4 Resource Limits and Scheduling Problems
A resource configuration example:
apiVersion: v1
kind: Pod
metadata:
  name: resource-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"
Scheduling diagnostics:
# Why did scheduling fail?
kubectl describe pod <pod-name> | grep -A 5 "Events"
# Available resources per node
kubectl describe nodes | grep -A 5 "Allocated resources"
# Node labels
kubectl get nodes --show-labels
# Node taints
kubectl describe node <node-name> | grep Taints
Common scheduling failure messages (a triage command for stuck Pods follows below):
# Insufficient resources
0/3 nodes are available: 3 Insufficient memory.
# Node selector mismatch
0/3 nodes are available: 3 node(s) didn't match node selector.
# Taints not tolerated
0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
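To see every Pod currently stuck in scheduling together with the scheduler's most recent complaint, something like this works:
# All Pending Pods across the cluster
kubectl get pods -A --field-selector=status.phase=Pending
# The latest FailedScheduling events, newest last
kubectl get events -A --field-selector=reason=FailedScheduling --sort-by='.lastTimestamp' | tail -n 10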
3.5 Storage and Persistence Problems
Checking PVC status:
# PVC status
kubectl get pvc -n <namespace>
# PV status
kubectl get pv
# PVC details
kubectl describe pvc <pvc-name>
# Storage classes
kubectl get storageclass
A storage configuration example:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  storageClassName: standard
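If a claim like this stays Pending, the provisioner usually says why in the PVC's events. A quick way to watch binding progress:
# Events scoped to this PVC; look for Provisioning/ProvisioningFailed
kubectl get events --field-selector involvedObject.kind=PersistentVolumeClaim,involvedObject.name=mysql-pvc --sort-by='.lastTimestamp'
# WaitForFirstConsumer storage classes bind only once a Pod uses the PVC
kubectl get storageclass standard -o jsonpath='{.volumeBindingMode}{"\n"}'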
3.6 Network Troubleshooting Tools and Methods
Testing Service connectivity:
# Service details
kubectl get svc -o wide
kubectl describe svc <service-name>
# Endpoints
kubectl get endpoints <service-name>
# Test the connection from inside a Pod
kubectl exec -it <pod-name> -- curl <service-name>:<port>
kubectl exec -it <pod-name> -- nslookup <service-name>
# Test cross-namespace access
kubectl exec -it <pod-name> -- curl <service-name>.<namespace>.svc.cluster.local
Inspecting network policies:
# List network policies
kubectl get networkpolicies -n <namespace>
kubectl describe networkpolicy <policy-name>
DNS diagnostics:
# CoreDNS status
kubectl get pods -n kube-system -l k8s-app=kube-dns
# CoreDNS logs
kubectl logs -n kube-system -l k8s-app=kube-dns
# Test DNS from inside a Pod
kubectl exec -it <pod-name> -- nslookup kubernetes.default
4. Hands-On Case Studies
Case 1: Diagnosing and Fixing a CrashLoopBackOff Pod
Symptoms:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
webapp-deployment-7d8f9c 0/1 CrashLoopBackOff 5 3m
Investigation:
Step 1: describe the Pod
$ kubectl describe pod webapp-deployment-7d8f9c
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 5m default-scheduler Successfully assigned default/webapp-deployment-7d8f9c to node-1
Normal Pulled 3m (x4 over 5m) kubelet Container image "myapp:v1.0" already present on machine
Normal Created 3m (x4 over 5m) kubelet Created container webapp
Normal Started 3m (x4 over 5m) kubelet Started container webapp
Warning BackOff 1m (x10 over 4m) kubelet Back-off restarting failed container
Step 2: check the container logs
$ kubectl logs webapp-deployment-7d8f9c
panic: Failed to connect to database: dial tcp 10.0.1.100:3306: connect: connection refused
goroutine 1 [running]:
main.initDB()
/app/main.go:25 +0x1e5
main.main()
/app/main.go:15 +0x25
Step 3: check the exit code
$ kubectl get pod webapp-deployment-7d8f9c -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
2
Analysis: the application cannot reach its database at startup and panics. The container exits with code 2 (a Go runtime panic exits with status 2).
Fix:
- First confirm the database Service is healthy
$ kubectl get svc mysql-service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
mysql-service ClusterIP 10.96.100.10 <none> 3306/TCP 10m
$ kubectl get endpoints mysql-service
NAME ENDPOINTS AGE
mysql-service 10.244.1.50:3306 10m
- Then make the application resilient: add a retry and health checks (an initContainer alternative is sketched after the manifest)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp
        image: myapp:v1.0
        env:
        - name: DB_HOST
          value: "mysql-service"
        - name: DB_RETRY_INTERVAL
          value: "5"  # have the app retry instead of crashing
        livenessProbe:
          httpGet:
            path: /healthz
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
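If the application cannot be changed to retry, an initContainer that blocks until the database answers achieves the same effect from the outside. A minimal sketch; the wait-for-db name and busybox image are our illustration, and some busybox builds lack nc -z, so swap in any image with a full netcat:
# Add a wait-for-db initContainer to the Pod template (illustrative patch)
kubectl patch deployment webapp-deployment --type=strategic -p '{
  "spec": {"template": {"spec": {"initContainers": [{
    "name": "wait-for-db",
    "image": "busybox:1.36",
    "command": ["sh", "-c", "until nc -z mysql-service 3306; do echo waiting for db; sleep 2; done"]
  }]}}}}'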
Verification:
$ kubectl apply -f webapp-deployment.yaml
$ kubectl get pods -w
NAME READY STATUS RESTARTS AGE
webapp-deployment-9k4h2 1/1 Running 0 1m
Case 2: ImagePullBackOff, a Failed Image Pull
Symptoms:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-app-5d7f8b 0/1 ImagePullBackOff 0 2m
Investigation:
Inspect the detailed error:
$ kubectl describe pod nginx-app-5d7f8b
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m default-scheduler Successfully assigned default/nginx-app-5d7f8b to node-2
Normal Pulling 1m (x3 over 3m) kubelet Pulling image "harbor.company.com/prod/nginx:v2.0"
Warning Failed 1m (x3 over 3m) kubelet Failed to pull image "harbor.company.com/prod/nginx:v2.0": rpc error: code = Unknown desc = Error response from daemon: pull access denied for harbor.company.com/prod/nginx, repository does not exist or may require 'docker login'
Warning Failed 1m (x3 over 3m) kubelet Error: ErrImagePull
Normal BackOff 30s (x5 over 3m) kubelet Back-off pulling image "harbor.company.com/prod/nginx:v2.0"
Warning Failed 30s (x5 over 3m) kubelet Error: ImagePullBackOff
Analysis: the image pull fails; either credentials are required or the image does not exist.
Fix:
Option 1: create a Docker registry Secret
# Create the Secret
kubectl create secret docker-registry harbor-secret \
--docker-server=harbor.company.com \
--docker-username=admin \
--docker-password=Harbor12345 \
--docker-email=admin@company.com \
-n default
# Inspect the Secret
kubectl get secret harbor-secret -o yaml
Option 2: reference the Secret in the Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      imagePullSecrets:
      - name: harbor-secret
      containers:
      - name: nginx
        image: harbor.company.com/prod/nginx:v2.0
        ports:
        - containerPort: 80
Option 3: make it the default via the ServiceAccount
# Attach imagePullSecrets to the default ServiceAccount
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "harbor-secret"}]}'
Verification:
$ kubectl apply -f nginx-deployment.yaml
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-app-7c8d9f 1/1 Running 0 30s
Case 3: A Service That Cannot Be Reached
Symptoms:
# Calling another service from a Pod fails
$ kubectl exec -it client-pod -- curl backend-service:8080
curl: (6) Could not resolve host: backend-service
Investigation:
Step 1: check the Service configuration
$ kubectl get svc backend-service
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
backend-service ClusterIP 10.96.150.200 <none> 8080/TCP 5m
$ kubectl describe svc backend-service
Name: backend-service
Namespace: default
Selector: app=backend
Type: ClusterIP
IP: 10.96.150.200
Port: http 8080/TCP
TargetPort: 8080/TCP
Endpoints: <none> # note: the Endpoints list is empty
Step 2: check the Endpoints
$ kubectl get endpoints backend-service
NAME ENDPOINTS AGE
backend-service <none> 5m
Analysis: the Service has no Endpoints, which means its selector matches no Pods.
Step 3: compare the Pod labels
$ kubectl get pods -l app=backend
No resources found in default namespace.
$ kubectl get pods --show-labels
NAME READY STATUS LABELS
backend-deploy-5f6c7d 1/1 Running app=backend-app,version=v1
Root cause: the Service selects app=backend, but the Pods are labeled app=backend-app; the labels do not match.
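A selector-versus-labels mismatch like this can be spotted in two commands, without reading the full describe output:
# The Service's selector, as JSON
kubectl get svc backend-service -o jsonpath='{.spec.selector}{"\n"}'
# Pods that actually carry that label (empty output confirms the mismatch)
kubectl get pods -l app=backend --show-labels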
Fix:
Option 1: correct the Service selector
apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  selector:
    app: backend-app  # corrected to the label the Pods actually carry
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
Option 2: correct the Pod labels
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend-deploy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: backend  # unified label
  template:
    metadata:
      labels:
        app: backend  # unified label
    spec:
      containers:
      - name: backend
        image: backend:v1.0
        ports:
        - containerPort: 8080
Verification:
$ kubectl apply -f backend-service.yaml
$ kubectl get endpoints backend-service
NAME ENDPOINTS AGE
backend-service 10.244.1.10:8080,10.244.2.15:8080,10.244.3.20:8080 1m
$ kubectl exec -it client-pod -- curl backend-service:8080
{"status":"ok","version":"v1.0"}
Bonus check: DNS resolution
# Test DNS resolution
$ kubectl exec -it client-pod -- nslookup backend-service
Server: 10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local
Name: backend-service
Address 1: 10.96.150.200 backend-service.default.svc.cluster.local
# Test cross-namespace access
$ kubectl exec -it client-pod -- curl backend-service.production.svc.cluster.local:8080
Case 4: Diagnosing a NotReady Node
Symptoms:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
node-1 Ready master 30d v1.24.0
node-2 NotReady worker 30d v1.24.0
node-3 Ready worker 30d v1.24.0
Investigation:
Step 1: describe the node
$ kubectl describe node node-2
Conditions:
Type Status Reason Message
---- ------ ------ -------
MemoryPressure False KubeletHasSufficientMemory kubelet has sufficient memory
DiskPressure True KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False KubeletHasSufficientPID kubelet has sufficient PID
Ready False KubeletNotReady container runtime not ready: RuntimeReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ContainerRuntimeUnhealthy 5m kubelet container runtime is down: failed to connect to containerd
Step 2: log in to the affected node
# SSH to the node
ssh root@node-2
# Check kubelet
systemctl status kubelet
systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled)
Active: active (running) since Mon 2024-01-15 10:00:00 CST; 5min ago
# kubelet logs
journalctl -u kubelet -n 100
# Check the container runtime
systemctl status containerd
● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled)
Active: failed (Result: exit-code) since Mon 2024-01-15 10:05:00 CST
Step 3: check disk space
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 48G 2.0G 96% /
/dev/sdb1 100G 95G 5.0G 95% /var/lib/containerd
Analysis:
- Disk usage is high enough to trigger DiskPressure
- containerd has failed
- The network plugin is therefore not ready
Fix:
Step 1: free up disk space
# Remove unused images
crictl rmi --prune
# Remove exited containers
crictl rm $(crictl ps -a -q --state=Exited)
# Prune log files
find /var/log/pods -name "*.log" -mtime +7 -delete
journalctl --vacuum-time=7d
# Check the disk again
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 35G 15G 70% /
Step 2: restart containerd
systemctl restart containerd
systemctl status containerd
● containerd.service - containerd container runtime
Loaded: loaded (/usr/lib/systemd/system/containerd.service; enabled)
Active: active (running) since Mon 2024-01-15 10:10:00 CST
Step 3: restart kubelet
systemctl restart kubelet
systemctl status kubelet
Verification:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
node-1 Ready master 30d v1.24.0
node-2 Ready worker 30d v1.24.0
node-3 Ready worker 30d v1.24.0
$ kubectl describe node node-2 | grep -A 5 Conditions
Conditions:
Type Status
---- ------
MemoryPressure False
DiskPressure False
PIDPressure False
Ready True
Case 5: Scheduling Failure Due to Insufficient Resources
Symptoms:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
java-app-deployment-7f8d 0/1 Pending 0 5m
Investigation:
Step 1: check the Pod's events
$ kubectl describe pod java-app-deployment-7f8d
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 5m default-scheduler 0/3 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
Warning FailedScheduling 3m default-scheduler 0/3 nodes are available: 3 Insufficient memory.
Step 2: check the Pod's resource requests
$ kubectl get pod java-app-deployment-7f8d -o yaml | grep -A 10 resources
resources:
  requests:
    memory: "4Gi"
    cpu: "2000m"
  limits:
    memory: "8Gi"
    cpu: "4000m"
Step 3: check available resources per node
$ kubectl describe nodes | grep -A 5 "Allocated resources"
Node: node-1
Allocated resources:
Resource Requests Limits
-------- -------- ------
cpu 3500m (87%) 7000m (175%)
memory 6Gi (75%) 12Gi (150%)
Node: node-2
Allocated resources:
Resource Requests Limits
-------- -------- ------
cpu 3000m (75%) 6000m (150%)
memory 7Gi (87%) 14Gi (175%)
Node: node-3
Allocated resources:
Resource Requests Limits
-------- -------- ------
cpu 2800m (70%) 5600m (140%)
memory 6.5Gi (81%) 13Gi (162%)
Analysis: no node has 4Gi of unrequested memory left, so the Pod's request cannot be satisfied anywhere.
Fix:
Option 1: lower the resource requests (recommended)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: java-app-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: java-app
  template:
    metadata:
      labels:
        app: java-app
    spec:
      containers:
      - name: java-app
        image: java-app:v1.0
        resources:
          requests:
            memory: "2Gi"  # lowered request
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
Option 2: add nodes
# Scale out in the cloud (example: Alibaba Cloud; flags vary by CLI version)
aliyun cs ScaleOutCluster --ClusterId=c1234567890 --count=2 --worker-instance-types=ecs.g6.2xlarge
Option 3: reclaim capacity
# Find the heaviest memory consumers
kubectl top pods -A --sort-by=memory | head -20
# Delete Pods that are no longer needed
kubectl delete pod <unused-pod> -n <namespace>
# Scale down replicas
kubectl scale deployment <deployment-name> --replicas=1
Verification:
$ kubectl apply -f java-app-deployment.yaml
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
java-app-deployment-9k2h 1/1 Running 0 1m
Case 6: A PVC That Fails to Mount
Symptoms:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
mysql-statefulset-0 0/1 ContainerCreating 0 3m
Investigation:
Step 1: check the Pod's events
$ kubectl describe pod mysql-statefulset-0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m default-scheduler Successfully assigned default/mysql-statefulset-0 to node-2
Warning FailedAttachVolume 3m attachdetach-controller Multi-Attach error for volume "pvc-abc123" Volume is already exclusively attached to one node and can't be attached to another
Warning FailedMount 1m kubelet Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[data kube-api-access-xyz]: timed out waiting for the condition
Step 2: check the PVC
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
data-mysql-statefulset-0 Bound pvc-abc123 20Gi RWO standard 5m
$ kubectl describe pvc data-mysql-statefulset-0
Name: data-mysql-statefulset-0
Namespace: default
StorageClass: standard
Status: Bound
Volume: pvc-abc123
Labels: app=mysql
Annotations: pv.kubernetes.io/bind-completed: yes
volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
Capacity: 20Gi
Access Modes: RWO
VolumeMode: Filesystem
Step 3: check the PV
$ kubectl get pv pvc-abc123
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS AGE
pvc-abc123 20Gi RWO Delete Bound default/data-mysql-statefulset-0 standard 5m
$ kubectl describe pv pvc-abc123
Name: pvc-abc123
StorageClass: standard
Status: Bound
Claim: default/data-mysql-statefulset-0
Reclaim Policy: Delete
Access Modes: RWO
VolumeMode: Filesystem
Capacity: 20Gi
Node Affinity:
Required Terms:
Term 0: topology.kubernetes.io/zone in [us-east-1a]
Analysis: the PV is ReadWriteOnce (RWO), so it can only be attached to one node at a time. The previous Pod most likely never released the volume cleanly.
Fix:
Option 1: force-delete the stale Pod
# Find Pods that use this claim
kubectl get pods -A -o json | jq -r '.items[] | select(.spec.volumes[]?.persistentVolumeClaim.claimName=="data-mysql-statefulset-0") | .metadata.name'
# Force deletion
kubectl delete pod mysql-statefulset-0 --grace-period=0 --force
Option 2: check the mount on the node
# SSH to the node
ssh root@node-2
# Inspect mount points
mount | grep pvc-abc123
/dev/xvdf on /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~aws-ebs/pvc-abc123 type ext4
# Unmount manually (use with care)
umount /var/lib/kubelet/pods/xxx/volumes/kubernetes.io~aws-ebs/pvc-abc123
Option 3: switch to ReadWriteMany (if the storage backend supports it)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mysql-data
spec:
  accessModes:
  - ReadWriteMany  # changed to RWX
  resources:
    requests:
      storage: 20Gi
  storageClassName: nfs-client  # a storage class that supports RWX
Verification:
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
mysql-statefulset-0 1/1 Running 0 1m
$ kubectl exec -it mysql-statefulset-0 -- df -h /var/lib/mysql
Filesystem Size Used Avail Use% Mounted on
/dev/xvdf 20G 1.2G 18G 6% /var/lib/mysql
5. Best Practices
5.1 Monitoring and Alerting
A Prometheus + Grafana monitoring stack:
Key alerting rules:
# Pod restart alerting rules
groups:
- name: kubernetes-pods
  rules:
  - alert: PodRestartingTooOften
    expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is restarting too often"
  - alert: PodNotReady
    expr: kube_pod_status_phase{phase!~"Running|Succeeded"} > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is in an abnormal state"
Node alerting rules:
- alert: NodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} is NotReady"
- alert: NodeDiskPressure
  expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Node {{ $labels.node }} is under disk pressure"
5.2 Log Collection
An EFK (Elasticsearch + Fluentd + Kibana) pipeline:
Fluentd DaemonSet configuration:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: fluentd
  template:
    metadata:
      labels:
        name: fluentd
    spec:
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: fluentd
        image: fluent/fluentd-kubernetes-daemonset:v1-debian-elasticsearch
        env:
        - name: FLUENT_ELASTICSEARCH_HOST
          value: "elasticsearch.logging.svc.cluster.local"
        - name: FLUENT_ELASTICSEARCH_PORT
          value: "9200"
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
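Deploying and verifying the collector (the file name is ours; one fluentd Pod should appear per node):
# Apply the manifest and confirm one collector per node
kubectl apply -f fluentd-daemonset.yaml
kubectl get daemonset fluentd -n kube-system
kubectl get pods -n kube-system -l name=fluentd -o wide
# Spot-check that logs are flowing
kubectl logs -n kube-system -l name=fluentd --tail=20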
5.3 A Troubleshooting Toolbox
Essential tools:
- kubectl plugins
# Install kubectl-debug (a debugging tool)
curl -Lo kubectl-debug.tar.gz https://github.com/aylei/kubectl-debug/releases/download/v0.1.1/kubectl-debug_0.1.1_linux_amd64.tar.gz
tar -zxvf kubectl-debug.tar.gz kubectl-debug
mv kubectl-debug /usr/local/bin/
# Usage
kubectl debug <pod-name> --agentless --port-forward=true
- stern (tail logs from many Pods at once)
# Install
wget https://github.com/stern/stern/releases/download/v1.22.0/stern_1.22.0_linux_amd64.tar.gz
tar -zxvf stern_1.22.0_linux_amd64.tar.gz
mv stern /usr/local/bin/
# Usage
stern -n production backend-* # tail every Pod whose name starts with backend-
stern -l app=nginx # filter by label
- A network-debugging Pod
apiVersion: v1
kind: Pod
metadata:
  name: netshoot
spec:
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command:
    - sleep
    - "3600"
Usage:
kubectl apply -f netshoot.yaml
kubectl exec -it netshoot -- bash
# Run the usual network tools inside the container
ping backend-service
nslookup backend-service
curl -v backend-service:8080
traceroute backend-service
tcpdump -i any port 8080
5.4 Preventive Measures
1. Resource quotas and limits
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: production
spec:
  hard:
    requests.cpu: "100"
    requests.memory: 200Gi
    limits.cpu: "200"
    limits.memory: 400Gi
    persistentvolumeclaims: "50"
2. Pod Disruption Budgets (PDB)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: backend
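Both guardrails are easy to verify once applied:
# How much of the quota is consumed
kubectl describe resourcequota compute-quota -n production
# Whether voluntary disruptions are currently allowed
kubectl get pdb backend-pdb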
3. Health-check best practices
apiVersion: v1
kind: Pod
metadata:
  name: webapp
spec:
  containers:
  - name: app
    image: webapp:v1.0
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 60  # allow plenty of startup time
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      timeoutSeconds: 3
      failureThreshold: 3
    startupProbe:  # for slow-starting applications
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 0
      periodSeconds: 10
      failureThreshold: 30  # allows up to 300 seconds (30 x 10s) to start
4. A daily inspection script (a cron entry for it follows below)
#!/bin/bash
# k8s_health_check.sh
echo "=== Node status ==="
kubectl get nodes -o wide
echo -e "\n=== Abnormal Pods ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
echo -e "\n=== Pods with high restart counts ==="
kubectl get pods -A -o json | jq -r '.items[] | select(.status.containerStatuses[]?.restartCount > 5) | "\(.metadata.namespace)/\(.metadata.name) - restarts: \(.status.containerStatuses[0].restartCount)"'
echo -e "\n=== Top resource consumers ==="
kubectl top nodes
kubectl top pods -A --sort-by=memory | head -10
echo -e "\n=== Recent warning events ==="
kubectl get events -A --sort-by='.lastTimestamp' | grep -i "warning\|error" | tail -20
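To run the check every morning and keep a dated report, a crontab entry along these lines will do; the paths are illustrative:
# Run the health check at 08:00 daily and archive the output
# (install with: crontab -e; /opt/scripts and /var/log/k8s-checks are examples)
0 8 * * * /opt/scripts/k8s_health_check.sh > /var/log/k8s-checks/$(date +\%F).log 2>&1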
6. Summary and Outlook
Kubernetes troubleshooting is a systematic discipline that spans everything from cluster architecture to hands-on technique. This article walked through troubleshooting methods for the core areas of the Pod lifecycle, resource scheduling, networking, and storage, and used six realistic cases to demonstrate the full path from symptom to root cause to fix.
Key takeaways:
- A systematic approach: the three-step method of gathering information, analyzing logs, then digging deeper
- Fluency with the tools: kubectl get, describe, logs, events, and friends
- Understanding the fundamentals: Pod scheduling, the network model, volume binding
- A monitoring foundation: Prometheus metrics + EFK logging + alerting rules
- Prevention over cure: resource quotas, health checks, PDBs, regular inspections
Looking ahead:
- AIOps: machine-learning-driven fault prediction and automatic remediation
- eBPF-based observability: finer-grained network and performance insight (Cilium, Pixie)
- Service meshes: Istio/Linkerd for richer traffic management and fault isolation
- GitOps: Argo CD/Flux for declarative configuration and automated rollback
- Edge computing: KubeEdge and similar projects extending Kubernetes to edge nodes
Continuous learning is part of the job for any operations engineer. Follow the Kubernetes blog and CNCF project news, and keep practicing and taking notes in real environments. Every incident is a chance to level up; a solid knowledge base and good automation are what make cloud-native operations sustainable.
Finally, consider saving the commands and configurations in this article to your team's knowledge base and extending them to fit your own workloads. If you have experience or questions to share, you are welcome to join the discussion in the 云栈社区 community.