I recently ran into a networking problem in a project: when sending packets through an af_packet socket in mmap mode, if the bound NIC is brought down and then back up, the first packet sent after the recovery always fails, and the error is ENETDOWN ("Network is down"; internally the kernel returns -100).
Reproducing the problem
At first I suspected the problem was tied to a particular NIC driver, because the affected environment used VFs created with SR-IOV on an X710 NIC. However, the problem persisted after switching to a virtio NIC, which ruled out the driver layer.
The issue was originally observed inside a Kubernetes Pod, an environment with many confounding factors. After a first pass over the application code I suspected it had nothing to do with business logic and was caused by the packet mmap mechanism itself. To verify this, I modified the packet mmap example program that ships with the kernel source (linux-stable/tools/testing/selftests/net/psock_tpacket.c) and ran it in a clean, freshly created VM; the problem reproduced there as well.
The full modified code is attached at the end of this article. Let's go through the reproduction steps first:
Compile and run the code below; the argument ens8 names the NIC the packets are sent out of. The program keeps polling the NIC's link state: if the NIC is up it sends one packet, if it is down it skips the send. To reproduce the problem, we manually bring ens8 down and then back up.
root@node2:~# gcc psock_tpacket.c -o psock_tpacket
root@node2:~# ./psock_tpacket ens8
send data to NIC ens8
NIC is up, send one packet
send success
NIC is up, send one packet
send success
NIC is up, send one packet
send success
NIC is down, don't send packet ---> ifconfig ens8 down
NIC is down, don't send packet
sleep 5 s ---> ifconfig ens8 up
NIC is up, send one packet --> the first packet sent after ens8 comes back up fails
sendto fail: Network is down
NIC is up, send one packet --> all subsequent packets are sent successfully
send success
The output shows clearly that after ens8 goes from down back to up, the first packet the program tries to send fails, and the error message is exactly "Network is down"; every packet after that is sent normally. A problem that reproduces reliably is already more than half solved.
Root cause analysis
In the Linux network stack, the simplified path of a packet sent via the sendto system call looks like this (tc/qdisc handling omitted):
sys_sendto -> sock_sendmsg -> __sock_sendmsg -> packet_sendmsg -> tpacket_snd -> dev_queue_xmit -> __dev_queue_xmit -> dev_hard_start_xmit -> xmit_one -> netdev_start_xmit -> __netdev_start_xmit -> ndo_start_xmit
The first step is to locate where the ENETDOWN error code gets set. A quick scan shows two functions on this path that explicitly return ENETDOWN: tpacket_snd and __dev_queue_xmit. The relevant excerpts:
static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
{
    ...
    err = -ENETDOWN;
    if (unlikely(!(dev->flags & IFF_UP)))
        goto out_put;
    ...
out_put:
    dev_put(dev);
out:
    mutex_unlock(&po->pg_vec_lock);
    return err;
}

static int __dev_queue_xmit(struct sk_buff *skb, void *accel_priv)
{
    ...
    if (dev->flags & IFF_UP) {
        ...
        skb = dev_hard_start_xmit(skb, dev, txq, &rc);
        ...
    }

    rc = -ENETDOWN;
drop:
    rcu_read_unlock_bh();
    atomic_long_inc(&dev->tx_dropped);
    kfree_skb_list(skb);
    return rc;
}
For tpacket_snd: our program only sends after confirming that the NIC's IFF_UP flag is set (i.e. the NIC is up), so this branch is unlikely to be the one taken.
For __dev_queue_xmit: if the packet were dropped here, the NIC's tx_dropped counter would increase. We checked that counter and it did not grow; see the sketch below for one way to read it.
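The per-device drop counter is exported via sysfs at /sys/class/net/<dev>/statistics/tx_dropped (it also appears in the output of ip -s link). A minimal C helper to sample it before and after the failing send could look like this; it is only an illustration, not part of the original test program:

#include <stdio.h>

/* Read the kernel's tx_dropped counter for an interface from sysfs.
 * Returns -1 if the counter cannot be read. */
static long read_tx_dropped(const char *ifname)
{
    char path[128];
    long val = -1;
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/class/net/%s/statistics/tx_dropped", ifname);
    f = fopen(path, "r");
    if (!f)
        return -1;
    if (fscanf(f, "%ld", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

If the value read before and after the failing sendto is unchanged, the __dev_queue_xmit drop path can be ruled out.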
So neither of these two obvious ENETDOWN paths appeared to be the direct cause.
Since the send fails, the kernel must free the corresponding skb (socket buffer) somewhere. To find the release point, we used kretprobes to capture the return values and call stacks of kfree_skb and of tpacket_snd (the function packet_sendmsg actually calls on this path).
# First get the pid of psock_tpacket; we will use it below to filter the trace output
root@node2:/root# ps -ef | grep psock_tpacket
root 30924 29468 0 21:45 pts/2 00:00:00 ./psock_tpacket ens8
cd /sys/kernel/debug/tracing
# Enable stack traces so each event also records its call chain
echo 1 > options/stacktrace
# Add a kretprobe on tpacket_snd
echo 'r tpacket_snd ret=$retval' >> kprobe_events
# Enable the event
echo 1 > events/kprobes/r_tpacket_snd_0/enable
# Only keep events generated by our test process
echo 'common_pid==30924' > events/kprobes/r_tpacket_snd_0/filter
# Add and enable a kretprobe on kfree_skb as well
echo 'r kfree_skb skb=+0(%si) ret=$retval' >> kprobe_events
echo 1 > events/kprobes/r_kfree_skb_0/enable
echo 'common_pid==30924' > events/kprobes/r_kfree_skb_0/filter
# Master switch: once enabled, the two events defined above start being recorded
echo 1 > tracing_on
After ens8 came back up and the program sent its first (failing) packet, the trace captured the following two key events:
root@node2:/sys/kernel/debug/tracing# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 6/6 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
psock_tpacket-30924 [000] d... 633441.532440: r_kfree_skb_0: (tpacket_snd+0x582/0xf10 <- kfree_skb) skb=0x8e00000001 ret=0x1
psock_tpacket-30924 [000] d... 633441.532447: <stack trace>
=> [unknown/kretprobe'd]
=> sock_sendmsg
=> __sys_sendto
=> __x64_sys_sendto
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
psock_tpacket-30924 [000] d... 633441.532449: r_tpacket_snd_0: (packet_sendmsg+0x1f/0x30 <- tpacket_snd) ret=0xffffff9c
psock_tpacket-30924 [000] d... 633441.532451: <stack trace>
=> [unknown/kretprobe'd]
=> sock_sendmsg
=> __sys_sendto
=> __x64_sys_sendto
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe
The r_kfree_skb_0 event confirms that an skb was indeed freed, and its call stack shows the free happened inside tpacket_snd. The return value of the r_tpacket_snd_0 event, 0xffffff9c (as a signed 32-bit value, 0xffffff9c - 2^32 = -100), also proves directly that the ENETDOWN error is produced inside tpacket_snd.
We had pinned down the function, but still needed the exact line that returns -100. Reading af_packet.c again carefully, one callback, packet_notifier, caught our attention:
packet_notifier() in net/packet/af_packet.c (linux-3.18.79): sk->sk_err = ENETDOWN;
Let's look at the relevant kernel code:
static struct notifier_block packet_netdev_notifier = {
    .notifier_call = packet_notifier,
};

/* packet_init() registers a notifier for network device state changes */
module_init(packet_init);
static int __init packet_init(void)
{
    ...
    register_netdevice_notifier(&packet_netdev_notifier);
}

/* In the notifier, when a NIC goes down, every af_packet socket bound to
 * that NIC gets its sk->sk_err set to ENETDOWN */
static int packet_notifier(struct notifier_block *this,
                           unsigned long msg, void *ptr)
{
    sk_for_each_rcu(sk, &net->packet.sklist) {
        struct packet_sock *po = pkt_sk(sk);
        switch (msg) {
        case NETDEV_DOWN:
            if (dev->ifindex == po->ifindex) {
                ...
                sk->sk_err = ENETDOWN;
                ...
            }
            break;
        case NETDEV_UP:
            ...
        }
    }
}
On tpacket_snd's send path, sock_error is called to check the socket's pending error state:
tpacket_snd -> sock_alloc_send_skb -> sock_alloc_send_pskb -> sock_error
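To make that last step concrete, here is an abridged sketch of sock_alloc_send_pskb from net/core/sock.c (paraphrased; exact details vary between kernel versions): if the socket carries a pending error, the function bails out before anything is allocated, and that error is what tpacket_snd ultimately returns to user space.

struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
                                     unsigned long data_len, int noblock,
                                     int *errcode, int max_page_order)
{
    ...
    for (;;) {
        /* sock_error() reads and clears sk->sk_err, returning -ENETDOWN here */
        err = sock_error(sk);
        if (err != 0)
            goto failure;   /* the path taken by our failing send */
        ...
    }
    ...
failure:
    *errcode = err;         /* propagated back up to tpacket_snd */
    return NULL;
}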
At this point the root cause is finally clear:
When the NIC is brought down, the packet_notifier callback runs. It walks every af_packet socket bound to that NIC and sets its sk->sk_err to ENETDOWN. When the NIC comes back up, however, the kernel does not clear this pending error. (The first failing send then reads sk_err via sock_error, which clears it as a side effect; that is why only the first packet after the recovery fails.)
Even in the latest kernel sources we found no code that clears the ENETDOWN when the device comes back up. This may well be a deliberate design choice: it tells the application that the device its socket is bound to went down at some point.
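For illustration, this is roughly what the NETDEV_UP branch of packet_notifier looks like (paraphrased; details vary by kernel version). It re-registers the protocol hook, but never touches sk->sk_err, so the ENETDOWN set on NETDEV_DOWN survives the link coming back up:

case NETDEV_UP:
    if (dev->ifindex == po->ifindex) {
        spin_lock(&po->bind_lock);
        if (po->num)
            register_prot_hook(sk);   /* resume packet delivery */
        spin_unlock(&po->bind_lock);  /* sk->sk_err is left as-is */
    }
    break;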
Solution
With the cause understood, the fix is straightforward: after the NIC comes back up, and before sending the first packet, proactively clear the socket's pending error.
Concretely, call getsockopt with SO_ERROR. The key point is that when the kernel reads sk_err, it also resets it to zero. The relevant user-space and kernel-space code is shown below:
static int get_sock_err(int sock)
{
int ret = 0;
int opt_val = 0;
socklen_t optlen = sizeof(opt_val);
ret = getsockopt(sock, SOL_SOCKET, SO_ERROR, &opt_val, &optlen);
if (ret)
{
perror("a error getsockopt SO_ERROR");
return 0;
}
return opt_val;
}
/* Kernel side: reading SO_ERROR returns sk_err and clears it at the same time */
int sock_getsockopt(struct socket *sock, int level, int optname,
                    char __user *optval, int __user *optlen)
{
    ...
    case SO_ERROR:
        v.val = -sock_error(sk);
        break;
    ...
}

static inline int sock_error(struct sock *sk)
{
    int err;

    if (likely(!sk->sk_err))
        return 0;
    err = xchg(&sk->sk_err, 0);
    return -err;
}
So all we need to do is call get_sock_err(sock) once after the NIC comes back up and before sending data; this clears the stale ENETDOWN and subsequent packets are sent normally again.
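As a minimal sketch of the call site (mirroring the commented-out lines in walk_tx in the appendix; the wrapper below and its name are ours, not part of the original program):

#include <errno.h>
#include <stdio.h>

/* Call once right after observing the down -> up transition and before the
 * next sendto(). get_sock_err() is the helper shown above: reading SO_ERROR
 * makes the kernel clear sk->sk_err as a side effect. */
static void clear_stale_sock_err(int sock)
{
    int err = get_sock_err(sock);

    if (err == ENETDOWN)
        printf("cleared stale ENETDOWN left over from the down/up cycle\n");
    else if (err)
        printf("cleared stale socket error %d\n", err);
}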
References and further reading
This issue was actually discovered and discussed a long time ago, although in the receive path at that time (the original link is no longer available). What we hit here is the same mechanism, triggered on the transmit path.
If you want to dig deeper into the packet mmap mechanism, the kernel's in-tree documentation (Documentation/networking/packet_mmap.txt in the source tree) is the place to start.
This problem is more likely to show up in Kubernetes and other containerized or virtualized network environments, where NICs are restarted or reconfigured relatively often. Understanding the underlying mechanism helps us write more robust network applications.
Appendix: full test code
Below is the complete C program used to reproduce and verify the issue. Compile it and follow the steps described above to reproduce the problem. The key part of the fix lives in walk_tx: uncomment the get_sock_err(sock); line to clear the socket error and make the problem go away.
psock_tpacket.c:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <linux/if_packet.h>
#include <linux/filter.h>
#include <ctype.h>
#include <fcntl.h>
#include <unistd.h>
#include <bits/wordsize.h>
#include <net/ethernet.h>
#include <netinet/ip.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>
#include <net/if.h>
#include <inttypes.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <netinet/in.h>
#include <linux/if_ether.h>
#ifndef __aligned_tpacket
# define __aligned_tpacket __attribute__((aligned(TPACKET_ALIGNMENT)))
#endif
#ifndef __align_tpacket
# define __align_tpacket(x) __attribute__((aligned(TPACKET_ALIGN(x))))
#endif
#define NUM_PACKETS 100
#define ALIGN_8(x) (((x) + 8 - 1) & ~(8 - 1))
#define DATA_LEN 100
#define DATA_CHAR 'a'
#define DATA_CHAR_1 'b'
struct ring {
struct iovec *rd;
uint8_t *mm_space;
size_t mm_len, rd_len;
struct sockaddr_ll ll;
void (*walk)(int sock, struct ring *ring, char* name);
int type, rd_num, flen, version;
union {
struct tpacket_req req;
struct tpacket_req3 req3;
};
};
struct block_desc {
uint32_t version;
uint32_t offset_to_priv;
struct tpacket_hdr_v1 h1;
};
union frame_map {
struct {
struct tpacket_hdr tp_h __aligned_tpacket;
struct sockaddr_ll s_ll __align_tpacket(sizeof(struct tpacket_hdr));
} *v1;
struct {
struct tpacket2_hdr tp_h __aligned_tpacket;
struct sockaddr_ll s_ll __align_tpacket(sizeof(struct tpacket2_hdr));
} *v2;
void *raw;
};
static int get_status(int sock, char* name)
{
struct ifreq ifr;
strcpy(ifr.ifr_name, name);
if (ioctl(sock, SIOCGIFFLAGS, &ifr) == -1)
{
perror("a error ioctl SIOCGIFFLAGS");
return 0;
}
return (ifr.ifr_flags & IFF_UP);
}
static int get_sock_err(int sock)
{
int ret = 0;
int opt_val = 0;
socklen_t optlen = sizeof(opt_val);
ret = getsockopt(sock, SOL_SOCKET, SO_ERROR, &opt_val, &optlen);
if (ret)
{
perror("a error getsockopt SO_ERROR");
return 0;
}
return opt_val;
}
static unsigned int total_packets, total_bytes;
static int pfsocket(int ver)
{
int ret, sock = socket(PF_PACKET, SOCK_RAW, 0);
if (sock == -1) {
perror("socket");
exit(1);
}
ret = setsockopt(sock, SOL_PACKET, PACKET_VERSION, &ver, sizeof(ver));
if (ret == -1) {
perror("setsockopt");
exit(1);
}
return sock;
}
static void status_bar_update(void)
{
if (total_packets % 10 == 0) {
fprintf(stderr, ".");
fflush(stderr);
}
}
static void create_payload(void *pay, size_t *len)
{
int i;
struct ethhdr *eth = pay;
struct iphdr *ip = pay + sizeof(*eth);
/* Lets create some broken crap, that still passes
* our BPF filter.
*/
*len = DATA_LEN + 42;
memset(pay, 0xff, ETH_ALEN * 2);
eth->h_proto = htons(ETH_P_IP);
for (i = 0; i < sizeof(*ip); ++i)
((uint8_t *) pay)[i + sizeof(*eth)] = (uint8_t) rand();
ip->ihl = 5;
ip->version = 4;
ip->protocol = 0x11;
ip->frag_off = 0;
ip->ttl = 64;
ip->tot_len = htons((uint16_t) *len - sizeof(*eth));
ip->saddr = inet_addr("2.2.2.2");
ip->daddr = inet_addr("2.2.2.4");
memset(pay + sizeof(*eth) + sizeof(*ip),
DATA_CHAR, DATA_LEN);
}
static inline int __v1_tx_kernel_ready(struct tpacket_hdr *hdr)
{
return !(hdr->tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}
static inline void __v1_tx_user_ready(struct tpacket_hdr *hdr)
{
hdr->tp_status = TP_STATUS_SEND_REQUEST;
__sync_synchronize();
}
static inline int __v2_tx_kernel_ready(struct tpacket2_hdr *hdr)
{
return !(hdr->tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}
static inline void __v2_tx_user_ready(struct tpacket2_hdr *hdr)
{
hdr->tp_status = TP_STATUS_SEND_REQUEST;
__sync_synchronize();
}
static inline int __v3_tx_kernel_ready(struct tpacket3_hdr *hdr)
{
return !(hdr->tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}
static inline void __v3_tx_user_ready(struct tpacket3_hdr *hdr)
{
hdr->tp_status = TP_STATUS_SEND_REQUEST;
__sync_synchronize();
}
static inline int __tx_kernel_ready(void *base, int version)
{
switch (version) {
case TPACKET_V1:
return __v1_tx_kernel_ready(base);
case TPACKET_V2:
return __v2_tx_kernel_ready(base);
case TPACKET_V3:
return __v3_tx_kernel_ready(base);
default:
return 0;
}
}
static inline void __tx_user_ready(void *base, int version)
{
switch (version) {
case TPACKET_V1:
__v1_tx_user_ready(base);
break;
case TPACKET_V2:
__v2_tx_user_ready(base);
break;
case TPACKET_V3:
__v3_tx_user_ready(base);
break;
}
}
static void __v1_v2_set_packet_loss_discard(int sock)
{
int ret, discard = 1;
ret = setsockopt(sock, SOL_PACKET, PACKET_LOSS, (void *) &discard,
sizeof(discard));
if (ret == -1) {
perror("setsockopt");
exit(1);
}
}
static inline void *get_next_frame(struct ring *ring, int n)
{
uint8_t *f0 = ring->rd[0].iov_base;
switch (ring->version) {
case TPACKET_V1:
case TPACKET_V2:
return ring->rd[n].iov_base;
case TPACKET_V3:
return f0 + (n * ring->req3.tp_frame_size);
default:
return NULL;
}
}
static void walk_tx(int sock, struct ring *ring, char* name)
{
struct pollfd pfd;
int ret;
size_t packet_len;
union frame_map ppd;
char packet[1024];
unsigned int frame_num = 0, got = 0;
struct sockaddr_ll ll = {
.sll_family = PF_PACKET,
.sll_halen = ETH_ALEN,
};
int nframes;
/* TPACKET_V{1,2} sets up the ring->rd* related variables based
* on frames (e.g., rd_num is tp_frame_nr) whereas V3 sets these
* up based on blocks (e.g, rd_num is tp_block_nr)
*/
if (ring->version <= TPACKET_V2)
nframes = ring->rd_num;
else
nframes = ring->req3.tp_frame_nr;
memset(&pfd, 0, sizeof(pfd));
pfd.fd = sock;
pfd.events = POLLOUT | POLLERR;
pfd.revents = 0;
total_packets = NUM_PACKETS;
create_payload(packet, &packet_len);
int nic_down= 0;
int err = 0;
while (total_packets > 0) {
if (get_status(sock, name)) {
if (nic_down) {
printf("sleep 5 s\n");
sleep(2);
nic_down = 0;
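/* Fix: uncommenting the two lines below reads SO_ERROR, which makes the
 * kernel clear the stale ENETDOWN left on this socket by the down/up cycle */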
//err = get_sock_err(sock);
//printf("socket error is %d\n", err);
}
void *next = get_next_frame(ring, frame_num);
if (!__tx_kernel_ready(next, ring->version) || total_packets <= 0)
continue;
printf("NIC is up, send one packet\n");
ppd.raw = next;
switch (ring->version) {
case TPACKET_V1:
ppd.v1->tp_h.tp_snaplen = packet_len;
ppd.v1->tp_h.tp_len = packet_len;
memcpy((uint8_t *) ppd.raw + TPACKET_HDRLEN -
sizeof(struct sockaddr_ll), packet,
packet_len);
total_bytes += ppd.v1->tp_h.tp_snaplen;
break;
case TPACKET_V2:
ppd.v2->tp_h.tp_snaplen = packet_len;
ppd.v2->tp_h.tp_len = packet_len;
memcpy((uint8_t *) ppd.raw + TPACKET2_HDRLEN -
sizeof(struct sockaddr_ll), packet,
packet_len);
total_bytes += ppd.v2->tp_h.tp_snaplen;
break;
case TPACKET_V3: {
struct tpacket3_hdr *tx = next;
tx->tp_snaplen = packet_len;
tx->tp_len = packet_len;
tx->tp_next_offset = 0;
memcpy((uint8_t *)tx + TPACKET3_HDRLEN -
sizeof(struct sockaddr_ll), packet,
packet_len);
total_bytes += tx->tp_snaplen;
break;
}
}
status_bar_update();
total_packets--;
__tx_user_ready(next, ring->version);
frame_num = (frame_num + 1) % nframes;
ret = sendto(sock, NULL, 0, 0, NULL, 0);
if (ret != -1) {
printf("send success\n");
}
else {
perror("sendto fail");
}
}
else {
printf("NIC is down, don't send packet\n");
nic_down = 1;
}
sleep(1);
poll(&pfd, 1, 1);
}
}
static void __v1_v2_fill(struct ring *ring, unsigned int blocks)
{
ring->req.tp_block_size = getpagesize() << 2;
ring->req.tp_frame_size = TPACKET_ALIGNMENT << 7;
ring->req.tp_block_nr = blocks;
ring->req.tp_frame_nr = ring->req.tp_block_size /
ring->req.tp_frame_size *
ring->req.tp_block_nr;
ring->mm_len = ring->req.tp_block_size * ring->req.tp_block_nr;
ring->walk = walk_tx;
ring->rd_num = ring->req.tp_frame_nr;
ring->flen = ring->req.tp_frame_size;
}
static void setup_ring(int sock, struct ring *ring, int version, int type)
{
int ret = 0;
unsigned int blocks = 256;
ring->type = type;
ring->version = version;
switch (version) {
case TPACKET_V1:
case TPACKET_V2:
if (type == PACKET_TX_RING)
__v1_v2_set_packet_loss_discard(sock);
__v1_v2_fill(ring, blocks);
ret = setsockopt(sock, SOL_PACKET, type, &ring->req,
sizeof(ring->req));
break;
}
if (ret == -1) {
perror("setsockopt");
exit(1);
}
ring->rd_len = ring->rd_num * sizeof(*ring->rd);
ring->rd = malloc(ring->rd_len);
if (ring->rd == NULL) {
perror("malloc");
exit(1);
}
total_packets = 0;
total_bytes = 0;
}
static void mmap_ring(int sock, struct ring *ring)
{
int i;
ring->mm_space = mmap(0, ring->mm_len, PROT_READ | PROT_WRITE,
MAP_SHARED | MAP_LOCKED | MAP_POPULATE, sock, 0);
if (ring->mm_space == MAP_FAILED) {
perror("mmap");
exit(1);
}
memset(ring->rd, 0, ring->rd_len);
for (i = 0; i < ring->rd_num; ++i) {
ring->rd[i].iov_base = ring->mm_space + (i * ring->flen);
ring->rd[i].iov_len = ring->flen;
}
}
static void bind_ring(int sock, struct ring *ring, char* name)
{
int ret;
ring->ll.sll_family = PF_PACKET;
ring->ll.sll_protocol = htons(ETH_P_ALL);
ring->ll.sll_ifindex = if_nametoindex(name);
ring->ll.sll_hatype = 0;
ring->ll.sll_pkttype = 0;
ring->ll.sll_halen = 0;
ret = bind(sock, (struct sockaddr *) &ring->ll, sizeof(ring->ll));
if (ret == -1) {
perror("bind");
exit(1);
}
}
static void walk_ring(int sock, struct ring *ring, char* name)
{
ring->walk(sock, ring, name);
}
static void unmap_ring(int sock, struct ring *ring)
{
munmap(ring->mm_space, ring->mm_len);
free(ring->rd);
}
static int test_tpacket(int version, int type, char* name)
{
int sock;
struct ring ring;
sock = pfsocket(version);
memset(&ring, 0, sizeof(ring));
setup_ring(sock, &ring, version, type);
mmap_ring(sock, &ring);
bind_ring(sock, &ring, name);
printf("send data to NIC %s\n", name);
walk_ring(sock, &ring, name);
unmap_ring(sock, &ring);
close(sock);
fprintf(stderr, "\n");
return 0;
}
int main(int argc, char* argv[])
{
if (argc < 2) {
fprintf(stderr, "usage: %s <interface>\n", argv[0]);
return 1;
}
test_tpacket(TPACKET_V2, PACKET_TX_RING, argv[1]);
return 0;
}