This article summarizes several typical problems you may run into when using the DPDK driver for Mellanox NICs in a Kubernetes environment, along with their solutions.
Do not mount the host /sys directory into the Pod
For NICs from other vendors, such as Intel's X710, using a DPDK driver in Kubernetes usually requires mounting the host's /sys directory into the Pod, because DPDK needs to read files under it during startup. Mellanox NICs are different: their DPDK driver stays bound on top of the kernel driver mlx5_core.
If the host /sys directory is mounted into the Pod, the following error appears:
net_mlx5: port 0 cannot get MAC address, is mlx5_en loaded? (errno: No such file or directory)
net_mlx5: probe of PCI device 0000:00:09.0 aborted after encountering an error: No such device
EAL: Requested device 0000:00:09.0 cannot be used
The reason is that the host's /sys/ overrides the content of the Pod's own /sys/, while the mlx5 driver reads paths such as /sys/devices/pci0000:00/0000:00:09.0/net/. Once that content is overridden, the error above is triggered.
Let's walk through the relevant code:
mlx5_pci_probe
  -> mlx5_dev_spawn
        /* Configure the first MAC address by default. */
        if (mlx5_get_mac(eth_dev, &mac.addr_bytes)) {
                DRV_LOG(ERR,
                        "port %u cannot get MAC address, is mlx5_en"
                        " loaded? (errno: %s)",
                        eth_dev->data->port_id, strerror(rte_errno));
                err = ENODEV;
                goto error;
        }
// If the host's /sys/ overrides the Pod's /sys/ content, the lookup below goes wrong (exactly which line fails has not been pinned down yet).
int
mlx5_get_mac(struct rte_eth_dev *dev, uint8_t (*mac)[ETHER_ADDR_LEN])
{
        struct ifreq request;
        int ret;

        /*
         * mlx5_ifreq() internally does roughly:
         *     int sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
         *     mlx5_get_ifname(dev, &ifr->ifr_name);   // interface name resolved from sysfs
         *     ioctl(sock, req, ifr);
         */
        ret = mlx5_ifreq(dev, SIOCGIFHWADDR, &request);
        if (ret)
                return ret;
        memcpy(mac, request.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
        return 0;
}
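The exact failing line was left open above, but mlx5_get_ifname resolves the kernel interface name by listing a net/ directory under /sys before the ioctl is issued. The standalone sketch below only illustrates that mechanism and is not the actual DPDK code; the path form and PCI address are assumptions for the example. When the mounted host /sys does not expose the expected net/ entry for the device, this kind of lookup comes back empty and the MAC query fails.

#include <dirent.h>
#include <stdio.h>

/* Illustration only: resolve the netdev name of a PCI device from sysfs,
 * the way the mlx5 PMD does it conceptually. */
static int get_ifname_from_sysfs(const char *pci_addr, char *ifname, size_t len)
{
        char path[256];
        DIR *dir;
        struct dirent *ent;

        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/net", pci_addr);
        dir = opendir(path);
        if (dir == NULL)
                return -1;                      /* no net/ entry visible */
        while ((ent = readdir(dir)) != NULL) {
                if (ent->d_name[0] == '.')
                        continue;               /* skip "." and ".." */
                snprintf(ifname, len, "%s", ent->d_name);
                closedir(dir);
                return 0;
        }
        closedir(dir);
        return -1;
}

int main(void)
{
        char ifname[64];

        if (get_ifname_from_sysfs("0000:00:09.0", ifname, sizeof(ifname)) == 0)
                printf("netdev for 0000:00:09.0: %s\n", ifname);
        else
                printf("no netdev visible under sysfs for 0000:00:09.0\n");
        return 0;
}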
DPDK cannot find the Mellanox NIC during startup
Sometimes the following error appears, saying that no Verbs device matches the Mellanox NIC and asking whether the kernel drivers are loaded. Outside of containers that hint is usually accurate, but in a Kubernetes environment there can be another cause.
EAL: PCI device 0000:00:06.0 on NUMA socket -1
EAL: Invalid NUMA socket, default to 0
EAL: probe driver: 15b3:101a net_mlx5
net_mlx5: no Verbs device matches PCI device 0000:00:06.0, are kernel drivers loaded?
EAL: Requested device 0000:00:06.0 cannot be used
The symptom: if the Pod is started with privileged permissions, DPDK detects the NIC and starts up successfully; without that permission, the error above appears.
Let's first analyze why the error is reported. The relevant DPDK code:
mlx5_pci_probe
  unsigned int n = 0;

  /* Call ibv_get_device_list() from libibverbs to obtain the Verbs devices. */
  ibv_list = mlx5_glue->get_device_list(&ret);
  while (ret-- > 0) {
          ibv_match[n++] = ibv_list[ret];
  }
  /*
   * The warning below is the one we hit, so n is 0. n is assigned while
   * iterating over ret above, so n == 0 means ret (the number of Verbs
   * devices returned) is also 0.
   */
  if (!n) {
          DRV_LOG(WARNING,
                  "no Verbs device matches PCI device " PCI_PRI_FMT ","
                  " are kernel drivers loaded?",
                  pci_dev->addr.domain, pci_dev->addr.bus,
                  pci_dev->addr.devid, pci_dev->addr.function);
          rte_errno = ENOENT;
          ret = -rte_errno;
  }
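Before digging into the libibverbs internals, a quick sanity check from inside the Pod is to call ibv_get_device_list directly. A minimal sketch (assumes the libibverbs headers are installed; link with -libverbs):

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        int num = 0;
        struct ibv_device **list = ibv_get_device_list(&num);

        if (list == NULL || num == 0) {
                printf("no Verbs devices visible (num=%d)\n", num);
        } else {
                for (int i = 0; i < num; i++)
                        printf("found Verbs device: %s\n",
                               ibv_get_device_name(list[i]));
        }
        if (list != NULL)
                ibv_free_device_list(list);
        return 0;
}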
To understand why ret is 0, we need the libibverbs source code. It ships with the OFED package, for example under MLNX_OFED_LINUX-5.2-2.2.0.0-ubuntu16.04-x86_64/src/MLNX_OFED_SRC-5.2-2.2.0.0/SOURCES/rdma-core-52mlnx1/libibverbs.
LATEST_SYMVER_FUNC(ibv_get_device_list, 1_1, "IBVERBS_1.1",
                   struct ibv_device **,
                   int *num)
  -> ibverbs_get_device_list(&device_list);
     -> find_sysfs_devs(&sysfs_list);
        -> setup_sysfs_dev
           -> try_access_device(sysfs_dev)
                  struct stat cdev_stat;
                  char *devpath;
                  int ret;

                  /* Check whether the character device file exists,
                   * e.g. /dev/infiniband/uverbs0. */
                  if (asprintf(&devpath, RDMA_CDEV_DIR"/%s", sysfs_dev->sysfs_name) < 0)
                          return ENOMEM;
                  ret = stat(devpath, &cdev_stat);
                  free(devpath);
                  return ret;
As the code shows, ibverbs checks whether /dev/infiniband/uverbs0 exists (each NIC has its own uverbsX). If the file is missing, ibverbs concludes that no device was found.
If the Pod has privileged permissions it can read /dev/infiniband/uverbs0 and the related files; without that permission, the /dev/infiniband directory simply does not exist inside the Pod.
// Pod without privileged permission
root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ls /dev
core fd full mqueue null ptmx pts random shm stderr stdin stdout termination-log tty urandom zero
// Pod with privileged permission: infiniband is visible
root@pod-dpdk:~/# ls /dev/
autofs infiniband mqueue sda2 tty12 tty28 tty43 tty59 ttyS16 ttyS31 vcs4 vfio
...
root@pod-dpdk:~# ls /dev/infiniband/
uverbs0
So the key question is whether /dev/infiniband/uverbs0 exists inside the Pod. Without this file, the error above is reported.
How, then, does the Pod get access to these files? There are two ways:
a. Grant the Pod privileged permissions. With this approach there is no need to mount /dev/ into the Pod, because a privileged Pod can access those files on the host directly.
b. Use the Kubernetes sriov-network-device-plugin. Mind the plugin version: only relatively recent versions avoid this problem. For Mellanox NICs, the plugin passes the required device files to the container through Docker's --device mechanism, mounting for example /dev/infiniband/uverbs2 and the related files into the container.
root# docker inspect 1dfe96c8eff4
[
{
"Id": "1dfe96c8eff4c8ede0d8eb4e480fec9f002f68c4da1bb5265580ee968c6d7502",
"Created": "2021-04-12T03:24:22.598030845Z",
...
"HostConfig": {
...
"CapAdd": [
"NET_RAW",
"NET_ADMIN",
"IPC_LOCK"
],
"Privileged": false,
"Devices": [
{
"PathOnHost": "/dev/infiniband/ucm2",
"PathInContainer": "/dev/infiniband/ucm2",
"CgroupPermissions": "rwm"
},
{
"PathOnHost": "/dev/infiniband/issm2",
"PathInContainer": "/dev/infiniband/issm2",
"CgroupPermissions": "rwm"
},
{
"PathOnHost": "/dev/infiniband/umad2",
"PathInContainer": "/dev/infiniband/umad2",
"CgroupPermissions": "rwm"
},
{
"PathOnHost": "/dev/infiniband/uverbs2",
"PathInContainer": "/dev/infiniband/uverbs2",
"CgroupPermissions": "rwm"
},
{
"PathOnHost": "/dev/infiniband/rdma_cm",
"PathInContainer": "/dev/infiniband/rdma_cm",
"CgroupPermissions": "rwm"
}
],
...
},
In the sriov-network-device-plugin code, the corresponding host paths are mounted into the Pod:
// NewRdmaSpec returns the RdmaSpec
func NewRdmaSpec(pciAddrs string) types.RdmaSpec {
deviceSpec := make([]*pluginapi.DeviceSpec, 0)
isSupportRdma := false
rdmaResources := rdmamap.GetRdmaDevicesForPcidev(pciAddrs)
if len(rdmaResources) > 0 {
isSupportRdma = true
for _, res := range rdmaResources {
resRdmaDevices := rdmamap.GetRdmaCharDevices(res)
for _, rdmaDevice := range resRdmaDevices {
deviceSpec = append(deviceSpec, &pluginapi.DeviceSpec{
HostPath: rdmaDevice,
ContainerPath: rdmaDevice,
Permissions: "rwm",
})
}
}
}
return &rdmaSpec{isSupportRdma: isSupportRdma, deviceSpec: deviceSpec}
}
The sriov-network-device-plugin log also shows that the files under /dev/infiniband are passed to the Pod:
###/var/log/sriovdp/sriovdp.INFO
I0412 03:17:57.120886 5327 server.go:123] AllocateResponse send: &AllocateResponse{ContainerResponses:[]
*ContainerAllocateResponse{&ContainerAllocateResponse{Envs:map[string]string{PCIDEVICE_INTEL_COM_DP_SRIOV_MLX5: 0000:00:0a.0,},
Mounts:[]*Mount{},Devices:[]*DeviceSpec{&DeviceSpec{ContainerPath:/dev/infiniband/ucm2,HostPath:/dev/infiniband/ucm2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/issm2,HostPath:/dev/infiniband/issm2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/umad2,HostPath:/dev/infiniband/umad2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/uverbs2,HostPath:/dev/infiniband/uverbs2,Permissions:rwm,},
&DeviceSpec{ContainerPath:/dev/infiniband/rdma_cm,HostPath:/dev/infiniband/rdma_cm,Permissions:rwm,},},
Annotations:map[string]string{},},},}
DPDK fails to start without privileged permissions
For security reasons, Pods are usually not given privileged permissions. Starting DPDK inside such a Pod then fails with the following error:
root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ./l2fwd -cf -n4 -w 00:09.0 -- -p1
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: Cannot obtain physical addresses: No such file or directory. Only vfio will function.
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
error allocating rte services array
The fix is to add --iova-mode=va to the DPDK EAL arguments; privileged permissions are then no longer required.
root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ./l2fwd -cf -n4 -w 00:09.0 --iova-mode=va -- -p1
EAL: Detected 4 lcore(s)
EAL: Detected 1 NUMA nodes
EAL: Multi-process socket /var/run/dpdk/rte/mp_socket
EAL: Probing VFIO support...
EAL: WARNING: cpu flags constant_tsc=yes nonstop_tsc=no -> using unreliable clock cycles !
EAL: PCI device 0000:00:09.0 on NUMA socket -1
EAL: Invalid NUMA socket, default to 0
EAL: probe driver: 15b3:1016 net_mlx5
Failure to read NIC statistics counters
DPDK provides two functions for reading NIC statistics: rte_eth_stats_get and rte_eth_xstats_get. The former returns the fixed counters (for example received/transmitted packet and byte counts); the latter returns extended counters, and every NIC type has its own set of extended counters.
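As a reminder of how an application consumes these two APIs, here is a minimal sketch (it assumes the EAL is already initialized and the port is started; error handling is trimmed):

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <rte_ethdev.h>

/* Dump the fixed and extended counters of one port. */
static void dump_port_stats(uint16_t port_id)
{
        struct rte_eth_stats stats;
        struct rte_eth_xstat *xstats = NULL;
        struct rte_eth_xstat_name *names = NULL;
        int n, i;

        /* Fixed counters: rx/tx packets, bytes, errors, ... */
        if (rte_eth_stats_get(port_id, &stats) == 0)
                printf("rx=%" PRIu64 " tx=%" PRIu64 " rx_bytes=%" PRIu64 "\n",
                       stats.ipackets, stats.opackets, stats.ibytes);

        /* Extended counters: the available set depends on the PMD (mlx5 here). */
        n = rte_eth_xstats_get(port_id, NULL, 0);       /* query how many */
        if (n <= 0)
                return;
        xstats = malloc(n * sizeof(*xstats));
        names = malloc(n * sizeof(*names));
        if (xstats == NULL || names == NULL)
                goto out;
        if (rte_eth_xstats_get(port_id, xstats, n) == n &&
            rte_eth_xstats_get_names(port_id, names, n) == n)
                for (i = 0; i < n; i++)
                        printf("%s = %" PRIu64 "\n",
                               names[xstats[i].id].name, xstats[i].value);
out:
        free(xstats);
        free(names);
}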
For Mellanox NICs, the mlx5 DPDK driver implements the corresponding callbacks:
rte_eth_stats_get -> stats_get -> mlx5_stats_get
rte_eth_xstats_get -> xstats_get -> mlx5_xstats_get
mlx5_stats_get works fine inside the Pod, but mlx5_xstats_get runs into a problem.
mlx5_xstats_get -> mlx5_read_dev_counters obtains the extended counters by reading files under the following path:
root# ls /sys/devices/pci0000\:00/0000\:00\:09.0/infiniband/mlx5_0/ports/1/hw_counters/
duplicate_request out_of_buffer req_cqe_flush_error resp_cqe_flush_error rx_atomic_requests
implied_nak_seq_err out_of_sequence req_remote_access_errors resp_local_length_error rx_read_requests
lifespan packet_seq_err req_remote_invalid_request resp_remote_access_errors rx_write_requests
local_ack_timeout_err req_cqe_error resp_cqe_error rnr_nak_retry_err
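mlx5_read_dev_counters essentially opens each of these files and parses the numeric value. The standalone sketch below illustrates that kind of read (the path matches the listing above; it shows the mechanism only, not the exact DPDK implementation):

#include <stdio.h>
#include <inttypes.h>

/* Illustration only: read one extended counter from the hw_counters
 * directory. Inside a Pod this fails when the directory is not visible. */
static int read_hw_counter(const char *name, uint64_t *value)
{
        char path[512];
        FILE *f;
        int ok;

        snprintf(path, sizeof(path),
                 "/sys/devices/pci0000:00/0000:00:09.0/infiniband/mlx5_0/"
                 "ports/1/hw_counters/%s", name);
        f = fopen(path, "r");
        if (f == NULL)
                return -1;                      /* counter file not visible */
        ok = (fscanf(f, "%" SCNu64, value) == 1);
        fclose(f);
        return ok ? 0 : -1;
}

int main(void)
{
        uint64_t value;

        if (read_hw_counter("out_of_buffer", &value) == 0)
                printf("out_of_buffer = %" PRIu64 "\n", value);
        else
                printf("hw_counters not readable from here\n");
        return 0;
}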
However, inside the Pod the hw_counters directory does not exist under the same path. The directory is created on the host when the driver is loaded, and it is not visible inside the Pod.
root@pod-dpdk:~/dpdk/x86_64-native-linuxapp-gcc/app# ls /sys/devices/pci0000\:00/0000\:00\:09.0/infiniband/mlx5_0/ports/1/
cap_mask gid_attrs gids lid lid_mask_count link_layer phys_state pkeys rate sm_lid sm_sl state
Possible solutions:
- Manually mount the host's /sys/devices/pci0000:00/0000:00:09.0/infiniband/mlx5_0/ports/1/hw_counters/ directory into the Pod (using a hostPath volume, like the /var/run mount in the test file below).
- Modify the sriov-network-device-plugin code so that it mounts the directory above automatically.
Test file
root# cat dpdk-mlx.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-dpdk
  annotations:
    k8s.v1.cni.cncf.io/networks: host-device1
spec:
  nodeName: node1
  containers:
  - name: appcntr3
    image: l2fwd:v3
    imagePullPolicy: IfNotPresent
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 300000; done;" ]
    securityContext:
      privileged: true
    resources:
      requests:
        memory: 100Mi
        hugepages-2Mi: 500Mi
        cpu: '3'
      limits:
        hugepages-2Mi: 500Mi
        cpu: '3'
        memory: 100Mi
    volumeMounts:
    - mountPath: /mnt/huge
      name: hugepage
      readOnly: False
    - mountPath: /var/run
      name: var
      readOnly: False
  volumes:
  - name: hugepage
    emptyDir:
      medium: HugePages
  - name: var
    hostPath:
      path: /var/run/
Hopefully this article helps you deploy high-performance networking applications more smoothly in cloud-native environments. If you hit deeper issues at the networking or driver-development level, such as low-level C/C++ interactions, further discussion in the technical community is welcome.