eBPF and Cilium in Practice: Troubleshooting Network Jitter in a Production Kubernetes Cluster

Posted on 2025-12-25 19:15:09

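What is eBPF

eBPF (extended Berkeley Packet Filter) is a technology for running custom programs inside the kernel safely and efficiently. It works like a small virtual machine in the kernel: without modifying kernel source or loading kernel modules, you can attach probes to key code paths and gain deep observability and control over networking, security, and performance.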

┌─────────────────────────────────────────────────────────────────┐
│                    User Space                                   │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────────────┐ │
│  │   App    │  │   App    │  │ BPF Tools│  │ Cilium/Hubble   │ │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘  └────────┬─────────┘ │
│       │             │             │                  │          │
├───────┼─────────────┼─────────────┼──────────────────┼──────────┤
│       │             │     System Calls               │          │
├───────┼─────────────┼─────────────┼──────────────────┼──────────┤
│       ▼             ▼             ▼                  ▼          │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                  BPF Virtual Machine                       │ │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────────┐   │ │
│  │  │ XDP     │  │ TC      │  │ Socket  │  │ Kprobes/    │   │ │
│  │  │ Programs│  │ Programs│  │ Programs│  │ Tracepoints │   │ │
│  │  └─────────┘  └─────────┘  └─────────┘  └─────────────┘   │ │
│  └────────────────────────────────────────────────────────────┘ │
│                     Kernel Space                                │
└─────────────────────────────────────────────────────────────────┘

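eBPF programs can be attached at several key points in the kernel, including:

XDP (eXpress Data Path): the NIC driver layer; packets are processed before they enter the kernel network stack, which gives very high performance.
TC (Traffic Control): the ingress and egress paths of the kernel network stack.
Socket: socket-related operations.
Kprobes/Kretprobes: the entry and return of almost any (non-inlined) kernel function.
Tracepoints: static tracing points predefined in the kernel.
Perf Events: events tied to performance counters.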

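Why Cilium

Cilium is a Kubernetes CNI (Container Network Interface) plugin built on eBPF and one of the most mature of its kind. Compared with traditional iptables-based setups (for example Calico or Flannel in their default modes), it has clear advantages: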

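Feature	iptables (Calico/Flannel)	eBPF (Cilium)
Packet processing	Full kernel network stack + netfilter	Can be handled at the XDP/TC layer, bypassing much of the stack
Rule matching	O(n) linear matching	O(1) hash lookup
Service implementation	kube-proxy + iptables chains	Pure eBPF; fully replaces kube-proxy
Observability	Needs extra tools (such as tcpdump)	Built-in Hubble with L7 traffic visibility
Network policy	L3/L4	L3/L4/L7 (HTTP, gRPC, Kafka, etc.)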
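The difference between O(n) rule matching and O(1) hash lookup can be illustrated with a small sketch. The rule set, addresses, and backend names below are made up for illustration; this is not how iptables or Cilium store their state, only the shape of the two lookup strategies.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, int]  # (service IP, port); hypothetical key layout

# Hypothetical rule set: 200 services, each mapped to one backend name.
RULES: List[Tuple[Key, str]] = [
    (("10.96.0.%d" % i, 80), "backend-%d" % i) for i in range(1, 201)
]
RULE_MAP: Dict[Key, str] = dict(RULES)


def match_linear(rules: List[Tuple[Key, str]], key: Key) -> Optional[str]:
    """Walk the rules in order, like traversing a chain of rules: O(n)."""
    for rule_key, backend in rules:
        if rule_key == key:
            return backend
    return None


def match_hash(rule_map: Dict[Key, str], key: Key) -> Optional[str]:
    """One hash lookup keyed by the service tuple: O(1) on average."""
    return rule_map.get(key)


if __name__ == "__main__":
    key = ("10.96.0.200", 80)  # the last rule, worst case for the linear walk
    print(match_linear(RULES, key), match_hash(RULE_MAP, key))
```

With the linear walk, the cost of a lookup grows with the number of Services; with the hash map it stays roughly constant, which is the property the table attributes to the eBPF datapath.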

We chose Cilium mainly for its built-in observability, which proved decisive in the troubleshooting described below.

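Environment requirements

Kernel version: eBPF capabilities are tightly coupled to the kernel version, and the supported feature set differs considerably between versions.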

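# Check the kernel version
uname -r
# Example output: 5.15.0-91-generic

# Check eBPF support
ls /sys/fs/bpf
# The directory should exist

# Check BTF support (required for CO-RE)
ls /sys/kernel/btf/vmlinux
# Should exist on kernels with BTF enabled (CONFIG_DEBUG_INFO_BTF=y); see the table below for minimum versions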

Feature	Minimum kernel version
Basic eBPF	4.1
BPF Maps	4.1
XDP	4.8
Socket Programs	4.10
BTF (BPF Type Format)	5.2
CO-RE (Compile Once, Run Everywhere)	5.2
Ring Buffer	5.8
Full Cilium feature set	5.4+ (5.10+ recommended)
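A minimal sketch of checking a node against the table above. The minimum versions are copied from the table as given; the function names are hypothetical, and a version comparison only says what the kernel could support, not what the distribution actually compiled in.

```python
import platform
import re
from typing import Dict, Tuple

# Minimum (major, minor) kernel versions, copied from the table in this article.
MIN_KERNEL: Dict[str, Tuple[int, int]] = {
    "Basic eBPF": (4, 1),
    "BPF Maps": (4, 1),
    "XDP": (4, 8),
    "Socket Programs": (4, 10),
    "BTF": (5, 2),
    "CO-RE": (5, 2),
    "Ring Buffer": (5, 8),
    "Full Cilium feature set": (5, 4),
}


def parse_release(release: str) -> Tuple[int, int]:
    """Extract (major, minor) from a string such as '5.15.0-91-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def supported_features(release: str) -> Dict[str, bool]:
    """Map each feature to whether the release meets its minimum version."""
    version = parse_release(release)
    return {name: version >= minimum for name, minimum in MIN_KERNEL.items()}


if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r` on Linux.
    for name, ok in supported_features(platform.release()).items():
        print(f"{name:<26} {'yes' if ok else 'no'}")
```

Running it on each node before the rollout gives a quick first filter; the `ls /sys/kernel/btf/vmlinux` check is still needed to confirm BTF is really present.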

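Hardware:

NIC: native XDP mode, which gives the best performance, requires driver support (Intel X710 or Mellanox ConnectX-5 and later are recommended).
CPU: eBPF programs consume some CPU; reserve roughly 10% headroom when planning node resources.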

Our production environment:

OS: Ubuntu 22.04 LTS (Kernel 5.15.0)
Kubernetes: 1.28.4
Cilium: 1.14.5
NIC: Mellanox ConnectX-6 Dx (100GbE)
Scale: 78 compute nodes, 3000+ Pods

2. Deploying Cilium

2.1 Preparation

Uninstall the existing CNI (if one is installed):

# Uninstall the existing CNI (Calico as an example)
kubectl delete -f https://raw.githubusercontent.com/projectcalico/calico/v3.26.1/manifests/calico.yaml

# Clean up CNI configuration
rm -rf /etc/cni/net.d/*
rm -rf /var/lib/cni/

# Flush iptables rules (run on every node)
iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X

Install the Cilium CLI:

# Download the Cilium CLI
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
CLI_ARCH=amd64
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}
sha256sum --check cilium-linux-${CLI_ARCH}.tar.gz.sha256sum
sudo tar xzvfC cilium-linux-${CLI_ARCH}.tar.gz /usr/local/bin
rm cilium-linux-${CLI_ARCH}.tar.gz{,.sha256sum}

# Verify the installation
cilium version

2.2 Installing Cilium with Helm

We install with Helm, which simplifies configuration management and later upgrades.

# Add the Cilium Helm repository
helm repo add cilium https://helm.cilium.io/
helm repo update

Below is the values.yaml we use in production:

# cilium-values.yaml
cluster:
  name: prod-cluster-01
  id: 1

# Enable kube-proxy replacement
kubeProxyReplacement: strict
k8sServiceHost: "10.0.0.10"  # API server address
k8sServicePort: 6443

# Use native routing (better performance)
routingMode: native
ipv4NativeRoutingCIDR: "10.244.0.0/16"
autoDirectNodeRoutes: true

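# IPAM configuration
# Note: the operator.clusterPool* values below apply to mode "cluster-pool";
# with mode "kubernetes", Pod CIDRs come from each Node's spec.podCIDR.
ipam:
  mode: kubernetes
  operator:
    clusterPoolIPv4PodCIDRList:
      - "10.244.0.0/16"
    clusterPoolIPv4MaskSize: 24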

# BPF datapath options (masquerading, host routing, tproxy)
bpf:
  masquerade: true
  hostLegacyRouting: false
  tproxy: true
  lbExternalClusterIP: true

# Enable XDP acceleration (requires a compatible NIC)
loadBalancer:
  acceleration: native
  mode: dsr
  algorithm: maglev

# Enable the bandwidth manager for rate limiting
bandwidthManager:
  enabled: true
  bbr: true

# Enable Hubble for observability
hubble:
  enabled: true
  relay:
    enabled: true
    replicas: 3
    resources:
      limits:
        cpu: "1"
        memory: "512Mi"
      requests:
        cpu: "100m"
        memory: "128Mi"
  ui:
    enabled: true
    replicas: 2
    ingress:
      enabled: true
      className: nginx
      hosts:
        - hubble.internal.company.com
      tls:
        - secretName: hubble-tls
          hosts:
            - hubble.internal.company.com
  metrics:
    enabled:
      - dns
      - drop
      - tcp
      - flow
      - icmp
      - http
    serviceMonitor:
      enabled: true
      labels:
        release: prometheus
    dashboards:
      enabled: true
      namespace: monitoring

# Enable L7 visibility (for HTTP)
envoy:
  enabled: true
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true

# Operator configuration
operator:
  replicas: 2
  resources:
    limits:
      cpu: "1"
      memory: "1Gi"
    requests:
      cpu: "100m"
      memory: "128Mi"
  prometheus:
    enabled: true
    serviceMonitor:
      enabled: true

# Agent resources
resources:
  limits:
    cpu: "4"
    memory: "4Gi"
  requests:
    cpu: "500m"
    memory: "512Mi"

# Enable encryption (optional but recommended)
encryption:
  enabled: true
  type: wireguard
  wireguard:
    userspaceFallback: false

# Enable the host firewall
hostFirewall:
  enabled: true

# Enable local redirect policy
localRedirectPolicy: true

# Prometheus metrics
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
    labels:
      release: prometheus

# Enable socket LB for better performance
socketLB:
  enabled: true
  hostNamespaceOnly: false

# Security context
securityContext:
  capabilities:
    ciliumAgent:
      - CHOWN
      - KILL
      - NET_ADMIN
      - NET_RAW
      - IPC_LOCK
      - SYS_MODULE
      - SYS_ADMIN
      - SYS_RESOURCE
      - DAC_OVERRIDE
      - FOWNER
      - SETGID
      - SETUID
    cleanCiliumState:
      - NET_ADMIN
      - SYS_ADMIN
      - SYS_RESOURCE
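The Pod CIDR and per-node mask in the configuration above imply a fixed address budget. As a rough sketch of that arithmetic (the function name is hypothetical, and a few addresses in every node block are reserved, so usable Pod IPs are slightly fewer than the raw count):

```python
import ipaddress
from typing import Tuple


def node_subnet_capacity(pod_cidr: str, node_mask: int) -> Tuple[int, int]:
    """Return (number of per-node subnets, raw addresses per node subnet)."""
    network = ipaddress.ip_network(pod_cidr)
    if node_mask < network.prefixlen or node_mask > network.max_prefixlen:
        raise ValueError("node mask must lie between the CIDR prefix and /32")
    node_subnets = 2 ** (node_mask - network.prefixlen)
    addresses_per_node = 2 ** (network.max_prefixlen - node_mask)
    return node_subnets, addresses_per_node


if __name__ == "__main__":
    subnets, per_node = node_subnet_capacity("10.244.0.0/16", 24)
    nodes = 78  # cluster size stated in this article
    print(f"{subnets} node subnets, {per_node} addresses each")
    print(f"room for the {nodes}-node cluster: {subnets >= nodes}")
```

For 10.244.0.0/16 split into /24 blocks this gives 256 node subnets of 256 addresses, which covers the 78-node cluster with room to grow.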

Run the installation:

helm upgrade --install cilium cilium/cilium --version 1.14.5 \
    --namespace kube-system \
    --values cilium-values.yaml \
    --wait

# Verify the installation
cilium status --wait

The expected status output looks like this:

    /¯¯\
 /¯¯\__/¯¯\    Cilium:             OK
 \__/¯¯\__/    Operator:           OK
 /¯¯\__/¯¯\    Envoy DaemonSet:    OK
 \__/¯¯\__/    Hubble Relay:       OK
    \__/       ClusterMesh:        disabled

DaemonSet         cilium             Desired: 78, Ready: 78/78, Available: 78/78
DaemonSet         cilium-envoy       Desired: 78, Ready: 78/78, Available: 78/78
Deployment        cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
Deployment        hubble-relay       Desired: 3, Ready: 3/3, Available: 3/3
Deployment        hubble-ui          Desired: 2, Ready: 2/2, Available: 2/2

2.3 Verifying eBPF functionality

# Inspect BPF maps
kubectl exec -n kube-system ds/cilium -- cilium bpf lb list
kubectl exec -n kube-system ds/cilium -- cilium bpf ct list global
kubectl exec -n kube-system ds/cilium -- cilium bpf policy get --all

# Check the XDP mode
kubectl exec -n kube-system ds/cilium -- cilium status | grep XDP

# Connectivity test
cilium connectivity test

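3. A Production Troubleshooting Walkthrough

Next we walk through how we used Cilium and the eBPF tool ecosystem to track down a stubborn network latency jitter problem.

3.1 Symptoms

Alert details:

Prometheus alert: service_request_duration_seconds{quantile="0.99"} > 0.5
Affected path: calls from order-service to inventory-service
Pattern: recurs every 10-15 minutes and lasts 2-3 minutes

Initial checks:

Application logs: no errors or exceptions.
Kubernetes Events: no Pod restarts, evictions, or scheduling events.
Node load: CPU, memory, and disk I/O all at normal levels.
Network monitoring: overall bandwidth utilization below 30%.

Conventional methods could not locate the problem, so we turned to eBPF to look inside the kernel.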

3.2 Observing traffic with Hubble

Hubble is Cilium's built-in observability component and provides Pod-level network flow details.

Install the Hubble CLI:

HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --remote-name-all https://github.com/cilium/hubble/releases/download/$HUBBLE_VERSION/hubble-linux-amd64.tar.gz{,.sha256sum}
sha256sum --check hubble-linux-amd64.tar.gz.sha256sum
sudo tar xzvfC hubble-linux-amd64.tar.gz /usr/local/bin
rm hubble-linux-amd64.tar.gz{,.sha256sum}

Set up port forwarding for local access:

# Forward a local port to the Hubble Relay service
kubectl port-forward -n kube-system svc/hubble-relay 4245:80 &
export HUBBLE_SERVER=localhost:4245

Observe order-service traffic:

# Watch traffic from order-service in real time
hubble observe --namespace production --from-pod order-service --protocol tcp -f

# Keep only connections with latency above 100 ms
hubble observe --namespace production --from-pod order-service --verdict FORWARDED -o json | \
    jq 'select(.flow.l4.TCP != null) | select(.flow.latency_ns > 100000000)'

The output revealed high-latency connections:

{
  "flow": {
    "time": "2024-12-15T03:42:17.123456789Z",
    "source": {
      "identity": 12345,
      "namespace": "production",
      "labels": ["k8s:app=order-service"],
      "pod_name": "order-service-7d8f9c6b5-x2k4m"
    },
    "destination": {
      "identity": 23456,
      "namespace": "production",
      "labels": ["k8s:app=inventory-service"],
      "pod_name": "inventory-service-5c4d3b2a1-m8n7p",
      "IP": "10.244.15.87",
      "port": 8080
    },
    "l4": {
      "TCP": {
        "source_port": 45678,
        "destination_port": 8080,
        "flags": {"SYN": true}
      }
    },
    "latency_ns": 523456789,
    "verdict": "FORWARDED"
  }
}

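Key finding: the latency reached 523 ms and occurred on the TCP SYN packet. This indicates that the problem lies in the early stage of TCP connection establishment rather than in slow application-layer processing.

3.3 Analyzing TCP handshake latency

Use Hubble to further analyze the timing of the TCP flows: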

hubble observe --namespace production \
    --from-pod order-service \
    --to-pod inventory-service \
    --protocol tcp \
    --verdict FORWARDED \
    -o json | jq '.flow | {
        time: .time,
        src: .source.pod_name,
        dst: .destination.pod_name,
        src_port: .l4.TCP.source_port,
        dst_port: .l4.TCP.destination_port,
        flags: .l4.TCP.flags,
        latency_ms: (.latency_ns / 1000000)
    }'

After collecting data for a while, we wrote a Python script to analyze it:

#!/usr/bin/env python3
# analyze_tcp_latency.py
import json
import sys
from collections import defaultdict

def analyze_flows(flows):
    connections = defaultdict(list)
    for flow in flows:
        # Parse flow
        # Use "or" fallbacks: jq emits null (None) for fields missing from a flow
        src = flow.get('src') or ''
        dst = flow.get('dst') or ''
        src_port = flow.get('src_port') or 0
        flags = flow.get('flags') or {}
        latency_ms = flow.get('latency_ms') or 0
        timestamp = flow.get('time') or ''

        # Group by connection (src:port -> dst)
        conn_key = f"{src}:{src_port}->{dst}"
        connections[conn_key].append({
            'flags': flags,
            'latency_ms': latency_ms,
            'time': timestamp
        })

    # Analyze each connection
    slow_syns = []
    for conn_key, packets in connections.items():
        syn_packets = [p for p in packets if p['flags'].get('SYN') and not p['flags'].get('ACK')]
        for syn in syn_packets:
            if syn['latency_ms'] > 100:
                slow_syns.append({
                    'connection': conn_key,
                    'latency_ms': syn['latency_ms'],
                    'time': syn['time']
                })

    # Sort by latency
    slow_syns.sort(key=lambda x: x['latency_ms'], reverse=True)
    print(f"Found {len(slow_syns)} slow SYN packets (>100ms)")
    print("\nTop 10 slowest:")
    for syn in slow_syns[:10]:
        print(f"  {syn['time']}: {syn['connection']} - {syn['latency_ms']:.2f}ms")

    # Analyze time distribution
    if slow_syns:
        print("\nTime distribution of slow SYNs:")
        time_buckets = defaultdict(int)
        for syn in slow_syns:
            minute = syn['time'][:16]  # Group by minute so a ~10-minute cycle stays visible
            time_buckets[minute] += 1
        for minute, count in sorted(time_buckets.items()):
            print(f"  {minute}: {count} slow SYNs")

if __name__ == '__main__':
    flows = []
    for line in sys.stdin:
        try:
            flows.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    analyze_flows(flows)

Run the analysis script (jq -c emits one JSON object per line, which is what the script parses):

hubble observe --namespace production \
    --from-pod order-service \
    --to-pod inventory-service \
    --protocol tcp \
    -o json --last 1h | \
    jq -c '.flow | {
        time: .time,
        src: .source.pod_name,
        dst: .destination.pod_name,
        src_port: .l4.TCP.source_port,
        dst_port: .l4.TCP.destination_port,
        flags: .l4.TCP.flags,
        latency_ms: (.latency_ns / 1000000)
    }' | python3 analyze_tcp_latency.py

The analysis showed a clear pattern in the high-latency SYN packets: a burst roughly every 10 minutes.
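One way to make the periodicity explicit is to group slow-SYN timestamps into bursts and measure the time between burst starts. The sketch below is an illustration only: the two-minute quiet gap is an assumed threshold, and the timestamps in the demo are made up in the same format as the flow records shown earlier.

```python
from datetime import datetime, timedelta
from typing import Iterable, List


def parse_ts(ts: str) -> datetime:
    """Parse a flow timestamp, keeping second precision (nanoseconds are dropped)."""
    return datetime.strptime(ts[:19], "%Y-%m-%dT%H:%M:%S")


def burst_starts(timestamps: Iterable[datetime],
                 quiet_gap: timedelta = timedelta(minutes=2)) -> List[datetime]:
    """A new burst starts whenever the previous event is more than quiet_gap old."""
    starts: List[datetime] = []
    previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous > quiet_gap:
            starts.append(ts)
        previous = ts
    return starts


def burst_intervals_minutes(timestamps: Iterable[datetime]) -> List[float]:
    """Minutes between the starts of consecutive bursts."""
    starts = burst_starts(timestamps)
    return [(b - a).total_seconds() / 60 for a, b in zip(starts, starts[1:])]


if __name__ == "__main__":
    # Made-up sample timestamps, for illustration only.
    sample = [
        "2024-12-15T03:42:17.123456789Z",
        "2024-12-15T03:42:50.000000000Z",
        "2024-12-15T03:53:01.000000000Z",
        "2024-12-15T03:53:30.000000000Z",
        "2024-12-15T04:04:10.000000000Z",
    ]
    print(burst_intervals_minutes(parse_ts(t) for t in sample))
```

A stable interval between bursts suggests a timer-driven cause, such as a periodic cleanup job, rather than load-driven congestion.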

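3.4 Kernel-level tracing with bpftrace

Hubble gives the Cilium-level view, but digging into kernel behavior requires a lower-level tracing tool such as bpftrace. When troubleshooting complex system-level problems, understanding how the network and the system interact is essential.

Install bpftrace: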

# Ubuntu/Debian
apt-get install -y bpftrace
bpftrace --version

Write and run a script that traces TCP connection establishment latency:

#!/usr/bin/env bpftrace
// tcp_connect_latency.bt
// Trace TCP connection establishment latency

#include <net/sock.h>
#include <linux/tcp.h>

BEGIN
{
    printf("Tracing TCP connect latency... Hit Ctrl-C to end.\n");
    printf("%-8s %-6s %-16s %-6s %-16s %-6s %s\n",
           "TIME", "PID", "SADDR", "SPORT", "DADDR", "DPORT", "LAT(ms)");
}

// Record when the connect starts. Key by socket pointer: the SYN-ACK is
// handled in softirq context, where tid no longer identifies the caller.
kprobe:tcp_v4_connect
{
    @start[arg0] = nsecs;
    @pid[arg0] = pid;
}

// tcp_v4_connect() returns once the SYN is queued, so timing its return would
// miss the handshake. The SYN-ACK is processed in tcp_rcv_state_process().
kprobe:tcp_rcv_state_process
/@start[arg0]/
{
    $sk = (struct sock *)arg0;
    if ($sk->__sk_common.skc_state == 2) {  // 2 == TCP_SYN_SENT
        $dport = $sk->__sk_common.skc_dport;
        $dport = ($dport >> 8) | (($dport << 8) & 0xff00);  // network to host byte order
        $delta = (nsecs - @start[arg0]) / 1000000;  // Convert to ms
        if ($delta > 10) {  // Only show connections > 10ms
            time("%H:%M:%S ");
            printf("%-6d %-16s %-6d %-16s %-6d %d\n",
                   @pid[arg0],
                   ntop(AF_INET, $sk->__sk_common.skc_rcv_saddr),
                   $sk->__sk_common.skc_num,
                   ntop(AF_INET, $sk->__sk_common.skc_daddr),
                   $dport,
                   $delta);
        }
    }
    delete(@start[arg0]);
    delete(@pid[arg0]);
}

END
{
    clear(@start);
    clear(@pid);
}

bpftrace tcp_connect_latency.bt
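The tracing scripts in this article convert `skc_dport` with `($dport >> 8) | (($dport << 8) & 0xff00)`. The kernel stores the destination port in network byte order, so a little-endian CPU reads the two bytes swapped. A small sketch of the same arithmetic (the helper names are hypothetical):

```python
import struct


def swap16(value: int) -> int:
    """Swap the two bytes of a 16-bit value (what ntohs does on a little-endian host)."""
    return ((value >> 8) | ((value << 8) & 0xFF00)) & 0xFFFF


def port_as_seen_by_le_cpu(port: int) -> int:
    """Pack the port in network byte order, then read those bytes as a little-endian u16."""
    return struct.unpack("<H", struct.pack("!H", port))[0]


if __name__ == "__main__":
    raw = port_as_seen_by_le_cpu(8080)  # what the probe reads from skc_dport
    print(hex(raw), "->", swap16(raw))
```

The source port field `skc_num` needs no such conversion because the kernel keeps it in host byte order.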

The output confirmed that the latency occurs at the kernel level:

Tracing TCP connect latency... Hit Ctrl-C to end.
TIME                 PID    SADDR            SPORT  DADDR            DPORT  LAT(ms)
15:42:17.123         12345  10.244.8.45      45678  10.244.15.87     8080   523
15:42:17.125         12346  10.244.8.45      45679  10.244.15.87     8080   518
15:42:17.127         12347  10.244.8.45      45680  10.244.15.87     8080   521

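3.5 Tracing latency across the network stack layers

To find which layer of the network stack introduced the delay, we wrote a more detailed tracing script: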

#!/usr/bin/env bpftrace
// network_stack_latency.bt
// Trace packet latency through network stack

#include <linux/skbuff.h>
#include <linux/netdevice.h>

BEGIN
{
    printf("Tracing network stack latency... Ctrl-C to end.\n");
}

// Note: kretprobes cannot read argN, so each entry timestamp is keyed by tid.
// Some of these functions (for example bpf_prog_run_xdp and sch_handle_ingress)
// are inlined on many kernels; check with `bpftrace -l 'kprobe:<name>'` and
// remove any probe that is not listed.

// XDP layer
kprobe:bpf_prog_run_xdp
{
    @xdp_start[tid] = nsecs;
}

kretprobe:bpf_prog_run_xdp
/@xdp_start[tid]/
{
    @xdp_latency = hist((nsecs - @xdp_start[tid]) / 1000);  // us
    delete(@xdp_start[tid]);
}

// TC ingress
kprobe:sch_handle_ingress
{
    @tc_ingress_start[tid] = nsecs;
}

kretprobe:sch_handle_ingress
/@tc_ingress_start[tid]/
{
    @tc_ingress_latency = hist((nsecs - @tc_ingress_start[tid]) / 1000);
    delete(@tc_ingress_start[tid]);
}

// Netfilter / conntrack
kprobe:nf_conntrack_in
{
    @conntrack_start[tid] = nsecs;
}

kretprobe:nf_conntrack_in
/@conntrack_start[tid]/
{
    @conntrack_latency = hist((nsecs - @conntrack_start[tid]) / 1000);
    delete(@conntrack_start[tid]);
}

// IP layer
kprobe:ip_rcv
{
    @ip_rcv_start[tid] = nsecs;
}

kretprobe:ip_rcv
/@ip_rcv_start[tid]/
{
    @ip_rcv_latency = hist((nsecs - @ip_rcv_start[tid]) / 1000);
    delete(@ip_rcv_start[tid]);
}

// TCP layer
kprobe:tcp_v4_rcv
{
    @tcp_rcv_start[tid] = nsecs;
}

kretprobe:tcp_v4_rcv
/@tcp_rcv_start[tid]/
{
    @tcp_rcv_latency = hist((nsecs - @tcp_rcv_start[tid]) / 1000);
    delete(@tcp_rcv_start[tid]);
}

END
{
    printf("\n\n=== XDP Latency (us) ===\n");
    print(@xdp_latency);
    printf("\n\n=== TC Ingress Latency (us) ===\n");
    print(@tc_ingress_latency);
    printf("\n\n=== Conntrack Latency (us) ===\n");
    print(@conntrack_latency);
    printf("\n\n=== IP Receive Latency (us) ===\n");
    print(@ip_rcv_latency);
    printf("\n\n=== TCP Receive Latency (us) ===\n");
    print(@tcp_rcv_latency);
}

After the script had run for about 5 minutes, the output showed a clear anomaly:

=== TC Ingress Latency (us) ===
@tc_ingress_latency:
[0]                  892456 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                   45678 |@@                                                  |
[2, 4)                12345 |                                                    |
[4, 8)                 5678 |                                                    |
[8, 16)                2345 |                                                    |
[16, 32)               1234 |                                                    |
[32, 64)                567 |                                                    |
[64, 128)               234 |                                                    |
[128, 256)              123 |                                                    |
[256, 512)               89 |                                                    |
[512, 1K)               456 |                                                    |  <-- anomaly!
[1K, 2K)                234 |                                                    |
[2K, 4K)                 12 |                                                    |

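Locating the problem: the TC (Traffic Control) ingress histogram shows a second peak of packets taking more than 512 microseconds, some over 1 millisecond, whereas almost all packets finish in under 2 microseconds. These per-packet delays are far smaller than the roughly 500 ms end-to-end latency, so they do not account for it on their own, but they point to the eBPF programs that Cilium attaches at TC ingress as the place to look next.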
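Spotting a second peak by eye works for one histogram; for repeated runs it can be scripted. The sketch below parses the text of a bpftrace `hist()` and reports buckets whose count rises again after the distribution had been falling. The sample counts are the ones shown above (bars shortened); the function names are hypothetical.

```python
import re
from typing import List, Tuple

SUFFIX = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
BUCKET = re.compile(r"^\[(\w+)(?:,\s*\w+)?[\])]\s+(\d+)\s+\|")

SAMPLE = """\
@tc_ingress_latency:
[0]                  892456 |@@@@@@@@|
[1]                   45678 |@@      |
[2, 4)                12345 |        |
[4, 8)                 5678 |        |
[8, 16)                2345 |        |
[16, 32)               1234 |        |
[32, 64)                567 |        |
[64, 128)               234 |        |
[128, 256)              123 |        |
[256, 512)               89 |        |
[512, 1K)               456 |        |
[1K, 2K)                234 |        |
[2K, 4K)                 12 |        |
"""


def to_int(token: str) -> int:
    """Convert a bucket bound such as '512' or '1K' to an integer."""
    if token[-1] in SUFFIX:
        return int(token[:-1]) * SUFFIX[token[-1]]
    return int(token)


def parse_hist(text: str) -> List[Tuple[int, int]]:
    """Return (bucket lower bound, count) pairs in the order printed."""
    buckets = []
    for line in text.splitlines():
        match = BUCKET.match(line.strip())
        if match:
            buckets.append((to_int(match.group(1)), int(match.group(2))))
    return buckets


def tail_count(buckets: List[Tuple[int, int]], threshold: int) -> int:
    """Number of samples in buckets starting at or above the threshold."""
    return sum(count for lower, count in buckets if lower >= threshold)


def rebounds(buckets: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Buckets whose count is higher than the bucket before them."""
    return [cur for prev, cur in zip(buckets, buckets[1:]) if cur[1] > prev[1]]


if __name__ == "__main__":
    parsed = parse_hist(SAMPLE)
    total = sum(count for _, count in parsed)
    print("samples >= 512us:", tail_count(parsed, 512), "of", total)
    print("rebound buckets:", rebounds(parsed))
```

On the sample, the only rebound is the bucket starting at 512 microseconds, which matches the line flagged in the output above.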

3.6 Tracing Cilium's eBPF programs

The Cilium agent exposes a rich set of eBPF-related metrics:

# List Cilium agent metrics
kubectl exec -n kube-system ds/cilium -- cilium metrics list | grep bpf
# Key metrics:
# cilium_bpf_map_ops_total
# cilium_bpf_map_pressure
# cilium_datapath_errors_total

We noticed that cilium_bpf_map_pressure was elevated. We then checked the connection tracking table:

kubectl exec -n kube-system ds/cilium -- cilium bpf ct list global | wc -l
# Output: 2847563

The connection tracking table held about 2.8 million entries, which immediately raised a red flag.

3.7 Root cause

Look at the detailed BPF map usage in Cilium:

kubectl exec -n kube-system ds/cilium -- cilium status --verbose | grep -A 10 "BPF Maps"

The output revealed the root cause:

BPF Maps:
  cilium_ct4_global         max:524288  in-use:512000 (97.7%)
  cilium_ct_any4_global     max:524288  in-use:487234 (92.9%)
  cilium_lb4_services_v2    max:65536   in-use:3456   (5.3%)
  cilium_lb4_backends_v3    max:65536   in-use:12345  (18.8%)
  ...

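Root cause: the connection tracking table (cilium_ct4_global) has a default maximum of 524,288 entries and was holding 512,000, a utilization of 97.7%. When the table is nearly full, new connections have to wait for old entries to time out or be reclaimed by garbage collection, and that wait directly delays TCP SYN packets, which shows up as periodic network jitter.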
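The utilization figures can be recomputed from the `max` and `in-use` columns. The sketch below parses lines in the format shown in this article (treat that format as an assumption; real output may differ between Cilium versions) and lists maps above a threshold. The 80% threshold is an assumed value.

```python
import re
from typing import Dict, List, Tuple

LINE = re.compile(r"^\s*(\S+)\s+max:(\d+)\s+in-use:(\d+)")

# Lines copied from the output shown in this article.
SAMPLE = """\
BPF Maps:
  cilium_ct4_global         max:524288  in-use:512000 (97.7%)
  cilium_ct_any4_global     max:524288  in-use:487234 (92.9%)
  cilium_lb4_services_v2    max:65536   in-use:3456   (5.3%)
  cilium_lb4_backends_v3    max:65536   in-use:12345  (18.8%)
"""


def parse_map_usage(text: str) -> Dict[str, Tuple[int, int]]:
    """Return {map name: (max entries, entries in use)}."""
    usage = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            usage[match.group(1)] = (int(match.group(2)), int(match.group(3)))
    return usage


def maps_over(usage: Dict[str, Tuple[int, int]], threshold: float) -> List[str]:
    """Names of maps whose in-use/max ratio exceeds the threshold, in input order."""
    return [name for name, (maximum, in_use) in usage.items()
            if maximum and in_use / maximum > threshold]


if __name__ == "__main__":
    usage = parse_map_usage(SAMPLE)
    for name, (maximum, in_use) in usage.items():
        print(f"{name:<26} {in_use / maximum:6.1%}")
    print("over 80%:", maps_over(usage, 0.8))
```

Both connection tracking maps exceed the threshold while the load-balancer maps are far below it, which narrows the problem to connection tracking capacity.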

3.8 The fix

The solution is to enlarge the BPF maps. Update the Cilium configuration:

# Update the bpf section in cilium-values.yaml
bpf:
  ctTcpMax: 2097152     # raised from 524288 to 2M
  ctAnyMax: 2097152
  natMax: 2097152
  policyMapMax: 65536
  lbMapMax: 65536

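Apply the updated configuration and roll the Cilium agents:

# Pin the chart version so the upgrade changes only the configuration
helm upgrade cilium cilium/cilium --version 1.14.5 --namespace kube-system --values cilium-values.yaml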
kubectl rollout restart ds/cilium -n kube-system

# Verify that the new configuration took effect
kubectl exec -n kube-system ds/cilium -- cilium status --verbose | grep -A 10 "BPF Maps"

After the fix, connection tracking table utilization returned to a healthy level and P99 network latency went back to normal (<10ms).

4. An eBPF Troubleshooting Toolkit

We turned the tools used in this investigation into scripts for future reference.

4.1 Network latency tracer

#!/usr/bin/env bpftrace
// pod_network_latency.bt
// Trace network latency for specific pod CIDR

#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/tcp.h>

BEGIN
{
    printf("Tracing pod network latency (10.244.0.0/16)...\n");
    printf("%-20s %-16s %-6s %-16s %-6s %-10s\n",
           "TIME", "SADDR", "SPORT", "DADDR", "DPORT", "LAT(us)");
}

kprobe:ip_output
{
    // ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
    $skb = (struct sk_buff *)arg2;
    $iph = (struct iphdr *)($skb->head + $skb->network_header);
    $saddr = ntop(AF_INET, $iph->saddr);
    $daddr = ntop(AF_INET, $iph->daddr);
    // Filter for pod CIDR 10.244.0.0/16. Addresses are in network byte order,
    // so on little-endian hosts the first two octets (0x0a, 0xf4) sit in the low 16 bits.
    if (($iph->saddr & 0xffff) == 0xf40a ||
        ($iph->daddr & 0xffff) == 0xf40a) {
        @start[$skb] = nsecs;
        @saddr[$skb] = $saddr;
        @daddr[$skb] = $daddr;
    }
}

kprobe:dev_queue_xmit
/@start[arg0]/
{
    $delta = (nsecs - @start[arg0]) / 1000;
    if ($delta > 100) {  // Only show > 100us
        time("%H:%M:%S ");
        printf("%-16s %-6d %-16s %-6d %-10d\n",
               @saddr[arg0], 0,
               @daddr[arg0], 0,
               $delta);
    }
    delete(@start[arg0]);
    delete(@saddr[arg0]);
    delete(@daddr[arg0]);
}
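A CIDR filter written against raw `saddr`/`daddr` fields has to account for byte order: the fields hold the address in network byte order, so a little-endian CPU that loads them as a 32-bit integer sees the first octet in the lowest byte. The sketch below reproduces that view for 10.244.0.0/16 (helper names are hypothetical, and it assumes a little-endian host such as x86-64 or typical arm64).

```python
import socket
import struct


def as_loaded_by_le_cpu(ip: str) -> int:
    """Value a little-endian CPU gets when it loads the 4 network-order bytes as a u32."""
    return struct.unpack("<I", socket.inet_aton(ip))[0]


def in_pod_cidr(raw: int) -> bool:
    """True for 10.244.0.0/16: octets 0x0a, 0xf4 land in the low 16 bits as 0xf40a."""
    return (raw & 0xFFFF) == 0xF40A


if __name__ == "__main__":
    for ip in ("10.244.15.87", "10.244.8.45", "10.0.0.10"):
        raw = as_loaded_by_le_cpu(ip)
        print(f"{ip:<14} raw=0x{raw:08x} in_pod_cidr={in_pod_cidr(raw)}")
```

Masking the high 16 bits and comparing against 0x0af40000 would only match on a big-endian machine, which is an easy mistake to make when porting a filter from host-order code.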

4.2 TCP retransmission tracer

#!/usr/bin/env bpftrace
// tcp_retransmit.bt
// Track TCP retransmissions

#include <linux/tcp.h>
#include <net/sock.h>

BEGIN
{
    printf("Tracing TCP retransmissions...\n");
    printf("%-20s %-6s %-16s %-6s %-16s %-6s %-10s\n",
           "TIME", "PID", "SADDR", "SPORT", "DADDR", "DPORT", "STATE");
}

kprobe:tcp_retransmit_skb
{
    $sk = (struct sock *)arg0;
    $inet_family = $sk->__sk_common.skc_family;
    if ($inet_family == AF_INET) {
        $daddr = ntop(AF_INET, $sk->__sk_common.skc_daddr);
        $saddr = ntop(AF_INET, $sk->__sk_common.skc_rcv_saddr);
        $dport = $sk->__sk_common.skc_dport;
        $dport = ($dport >> 8) | (($dport << 8) & 0xff00);
        $sport = $sk->__sk_common.skc_num;
        $state = $sk->__sk_common.skc_state;
        time("%H:%M:%S ");
        printf("%-6d %-16s %-6d %-16s %-6d %-10d\n",
               pid, $saddr, $sport, $daddr, $dport, $state);
        @retrans[$saddr, $daddr] = count();
    }
}

END
{
    printf("\n\nRetransmission counts by src->dst:\n");
    print(@retrans);
}

4.3 DNS query latency tracer

#!/usr/bin/env bpftrace
// dns_latency.bt
// Track DNS query latency

#include <linux/skbuff.h>
#include <linux/udp.h>
#include <linux/ip.h>
#include <net/sock.h>

BEGIN
{
    printf("Tracing DNS latency (port 53)...\n");
}

kprobe:udp_sendmsg
{
    $sk = (struct sock *)arg0;
    $dport = $sk->__sk_common.skc_dport;
    $dport = ($dport >> 8) | (($dport << 8) & 0xff00);
    if ($dport == 53) {
        @dns_start[tid] = nsecs;
        @dns_server[tid] = ntop(AF_INET, $sk->__sk_common.skc_daddr);
    }
}

kretprobe:udp_recvmsg
/@dns_start[tid]/
{
    $delta = (nsecs - @dns_start[tid]) / 1000000;  // ms
    if ($delta > 10) {  // Only show > 10ms
        time("%H:%M:%S ");
        printf("DNS to %s took %dms\n", @dns_server[tid], $delta);
    }
    @dns_latency = hist($delta);
    delete(@dns_start[tid]);
    delete(@dns_server[tid]);
}

END
{
    printf("\nDNS Latency Distribution (ms):\n");
    print(@dns_latency);
}

4.4 Cilium debugging script

#!/bin/bash
# cilium_debug.sh
# Comprehensive Cilium debugging script

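# No -e: a diagnostic script should continue when one check fails
# (for example grep finding no match, or head closing a pipe early).
set -uo pipefail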

NAMESPACE="kube-system"
POD_NAMESPACE="${1:-default}"
POD_NAME="${2:-}"

echo "=== Cilium Status ==="
cilium status

echo -e "\n=== Cilium Version ==="
kubectl exec -n ${NAMESPACE} ds/cilium -- cilium version

echo -e "\n=== BPF Maps Usage ==="
kubectl exec -n ${NAMESPACE} ds/cilium -- cilium bpf ct list global 2>/dev/null | head -20
kubectl exec -n ${NAMESPACE} ds/cilium -- cilium status --verbose 2>/dev/null | grep -A 15 "BPF Maps"

echo -e "\n=== Endpoints Status ==="
if [ -n "${POD_NAME}" ]; then
    kubectl exec -n ${NAMESPACE} ds/cilium -- cilium endpoint list | grep -E "(ENDPOINT|${POD_NAME})"
else
    kubectl exec -n ${NAMESPACE} ds/cilium -- cilium endpoint list | head -20
fi

echo -e "\n=== Service List ==="
kubectl exec -n ${NAMESPACE} ds/cilium -- cilium service list | head -20

echo -e "\n=== Recent Drops ==="
hubble observe --verdict DROPPED --last 100 2>/dev/null || echo "Hubble not available"

echo -e "\n=== Cilium Agent Logs (Errors) ==="
kubectl logs -n ${NAMESPACE} ds/cilium --tail=50 2>/dev/null | grep -iE "(error|warn|fail)" | tail -20

echo -e "\n=== Network Policies ==="
kubectl get ciliumnetworkpolicies --all-namespaces 2>/dev/null | head -20

echo -e "\n=== Cilium Health ==="
kubectl exec -n ${NAMESPACE} ds/cilium -- cilium-health status

echo -e "\n=== Monitor (10 seconds) ==="
timeout 10 kubectl exec -n ${NAMESPACE} ds/cilium -- cilium monitor --type drop 2>/dev/null || true

5. Best Practices and Caveats

5.1 Cilium tuning recommendations

BPF map sizing

Plan BPF map sizes for the cluster scale up front to avoid capacity bottlenecks later:

Cluster size	Pod count	Suggested CT map size
Small	< 500	262144 (256K)
Medium	500-2000	524288 (512K)
Large	2000-5000	1048576 (1M)
Very large	> 5000	2097152 (2M)
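The sizing table can be turned into a small lookup for use in provisioning scripts. This is a sketch only: the table leaves the boundary values (500, 2000, 5000) ambiguous, so treating the upper bounds of the middle tiers as inclusive is an assumption, and Pod count is a coarse proxy for the real driver, which is the number of concurrent connections.

```python
# (inclusive upper Pod-count bound, suggested CT map size), from the table above.
# Boundary handling is an assumption; the table does not specify it.
SIZING = [
    (499, 262144),    # small: < 500 Pods
    (2000, 524288),   # medium: 500-2000 Pods
    (5000, 1048576),  # large: 2000-5000 Pods
]
VERY_LARGE = 2097152  # > 5000 Pods


def suggested_ct_map_size(pod_count: int) -> int:
    """Return the CT map size the table suggests for a given Pod count."""
    if pod_count < 0:
        raise ValueError("pod_count must be non-negative")
    for upper_bound, size in SIZING:
        if pod_count <= upper_bound:
            return size
    return VERY_LARGE


if __name__ == "__main__":
    for pods in (100, 1500, 3000, 8000):
        print(pods, "->", suggested_ct_map_size(pods))
```

For the 3000+ Pod cluster in this article the table suggests 1M entries; the fix in section 3.8 went one tier higher to 2M, which leaves extra headroom for connection-heavy workloads.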

Connection tracking tuning

bpf:
  ctTcpMax: 2097152
  ctAnyMax: 2097152
  # Related datapath options (these do not set CT timeouts)
  monitorAggregation: medium
  preallocateMaps: true

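Tune kernel parameters to reduce how long connection state lingers:

# Shorten the FIN_WAIT_2 timeout for orphaned connections (TIME_WAIT itself is fixed at 60 s)
sysctl -w net.ipv4.tcp_fin_timeout=30
# Widen the local port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# Allow TIME_WAIT sockets to be reused for new outgoing connections
sysctl -w net.ipv4.tcp_tw_reuse=1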

5.2 Monitoring and alerting

Key Prometheus alert rules for Cilium are an important part of running a cloud-native environment reliably.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cilium-alerts
  namespace: monitoring
spec:
  groups:
    - name: cilium
      rules:
        - alert: CiliumBPFMapPressureHigh
          expr: |
            cilium_bpf_map_pressure{map_name=~"cilium_ct.*"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Cilium BPF map {{ $labels.map_name }} usage high"
            description: "Map usage is {{ $value | humanizePercentage }}"
        - alert: CiliumDropsHigh
          expr: |
            rate(cilium_drop_count_total[5m]) > 100
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High packet drop rate on {{ $labels.node }}"
            description: "{{ $value }} drops/sec"
        - alert: CiliumPolicyDenied
          expr: |
            rate(cilium_policy_verdict_total{verdict="denied"}[5m]) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Policy denying traffic"
            description: "{{ $value }} denied/sec"
        - alert: CiliumAgentUnhealthy
          expr: |
            cilium_agent_unhealthy > 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "Cilium agent unhealthy on {{ $labels.node }}"
        - alert: CiliumEndpointNotReady
          expr: |
            cilium_endpoint_state{state!="ready"} > 0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Cilium endpoint not ready"

5.3 Common problems and fixes

Problem 1: Pods cannot communicate and Hubble shows DROP

Steps:

# Check the drop reason
hubble observe --verdict DROPPED --to-pod <pod-name> -o json | jq '.flow.drop_reason_desc'
# Common drop reasons and fixes:
# - "Policy denied": check the CiliumNetworkPolicy
# - "No route": check routing and IP allocation
# - "Invalid packet": check the MTU settings
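When there are many dropped flows, a tally by reason shows which fix to try first. The sketch below counts `flow.drop_reason_desc` values from JSON lines, the same field the jq filter above reads. The sample records and reason strings are made up for illustration; real values depend on the Hubble version.

```python
import json
from collections import Counter
from typing import Iterable


def tally_drop_reasons(lines: Iterable[str]) -> Counter:
    """Count drop reasons in JSON-lines flow records, skipping malformed lines."""
    reasons: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        reason = (record.get("flow") or {}).get("drop_reason_desc")
        if reason:
            reasons[reason] += 1
    return reasons


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        '{"flow": {"verdict": "DROPPED", "drop_reason_desc": "Policy denied"}}',
        '{"flow": {"verdict": "DROPPED", "drop_reason_desc": "Policy denied"}}',
        '{"flow": {"verdict": "DROPPED", "drop_reason_desc": "No route"}}',
        'not json',
    ]
    for reason, count in tally_drop_reasons(sample).most_common():
        print(f"{count:>5}  {reason}")
```

To use it on live data, feed it the lines produced by `hubble observe --verdict DROPPED -o json`, for example by iterating over `sys.stdin`.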

Problem 2: Slow Service access

# Check that socket LB is working
kubectl exec -n kube-system ds/cilium -- cilium bpf lb list
# Check that the Service is programmed into the BPF maps
kubectl exec -n kube-system ds/cilium -- cilium service list | grep <service-name>
# If the Service is missing, restart the Cilium agent
kubectl rollout restart ds/cilium -n kube-system

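Problem 3: Slow DNS resolution

# Check whether DNS traffic is being proxied correctly
hubble observe --protocol DNS --last 100
# Check the CoreDNS service endpoints
kubectl exec -n kube-system ds/cilium -- cilium service list | grep kube-dns
# If the Cilium DNS proxy is in use, check its status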
kubectl exec -n kube-system ds/cilium -- cilium status | grep DNS

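Problem 4: XDP mode is not working

# Check the XDP status
kubectl exec -n kube-system ds/cilium -- cilium status | grep XDP
# Check the NIC driver
ethtool -i eth0 | grep driver
# If the driver lacks native XDP support, turn acceleration off or let Cilium fall back
# Update values.yaml:
# loadBalancer:
#   acceleration: disabled  # or "best-effort"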

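6. Summary

Lessons from this investigation

eBPF observability is powerful: traditional tools (such as tcpdump and netstat) struggle to capture intermittent, kernel-level problems, whereas eBPF can attach probes on critical kernel paths and expose behavior those tools cannot see.
Monitoring must cover BPF resources: BPF map utilization and operation latency are easy to overlook, but once they become a bottleneck they cause serious network performance problems.
Capacity planning matters: the default sizes of BPF maps such as the connection tracking and NAT tables may not be enough for large production environments and should be estimated and adjusted for the actual workload size and connection patterns.
Layer-by-layer investigation is the key to complex problems: a network problem can originate in the application, transport, network, or data link layer, or even in the hardware driver. The eBPF toolchain lets us observe and measure each layer precisely, from top to bottom.

Further study

Writing custom eBPF programs: use libbpf or the cilium/ebpf library to build tracing or optimization tools for specific scenarios.
High-performance networking with XDP: study XDP programming in depth for high-performance load balancing, DDoS mitigation, or network stack bypass.
eBPF for security: explore eBPF-based runtime security monitoring and enforcement tools such as Falco and Tetragon.
Service mesh observability: combine with Cilium Service Mesh for non-intrusive, eBPF-based end-to-end tracing.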

Appendix

Command cheat sheet

# Cilium status
cilium status
cilium status --verbose
cilium health status

# BPF Maps
cilium bpf ct list global
cilium bpf lb list
cilium bpf policy get --all
cilium bpf nat list

# Endpoints
cilium endpoint list
cilium endpoint get <id>
cilium endpoint log <id>

# Services
cilium service list
cilium service get <id>

# Hubble
hubble observe
hubble observe --verdict DROPPED
hubble observe --protocol tcp
hubble observe --from-pod <pod>
hubble observe --to-pod <pod>
hubble status

# bpftrace
bpftrace -l 'kprobe:tcp*'
bpftrace -e 'kprobe:tcp_v4_connect { @[comm] = count(); }'

# Debugging
cilium monitor
cilium monitor --type drop
cilium monitor --type trace
cilium debuginfo

Common eBPF probe points

Probe	Purpose
kprobe:tcp_v4_connect	TCP connection initiation
kprobe:tcp_rcv_established	TCP data receive
kprobe:tcp_retransmit_skb	TCP retransmission
kprobe:ip_output	IP packet send
kprobe:ip_rcv	IP packet receive
kprobe:nf_conntrack_in	Connection tracking
kprobe:dev_queue_xmit	NIC transmit queue
kprobe:netif_receive_skb	NIC receive
tracepoint:skb:kfree_skb	Packet drop

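Glossary

Term	Description
eBPF	Extended Berkeley Packet Filter, an in-kernel virtual machine technology
XDP	eXpress Data Path, packet processing at the NIC driver layer
TC	Traffic Control, the Linux traffic control subsystem
CT	Connection Tracking
BPF Map	Data structure shared between eBPF programs and user space
Hubble	Cilium's network observability component
DSR	Direct Server Return, a load-balancing mode
Maglev	A consistent-hashing load-balancing algorithm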


eBPF, Cilium, Kubernetes, network troubleshooting, performance tuning