With the maturation of lower-level mechanisms such as TCP Small Queues (TSQ), TSO auto sizing, and TCP pacing, the Linux kernel was able to introduce an Automatic Corking feature. This optimization targets applications that issue frequent write() or sendmsg() system calls carrying small payloads, improving their network performance significantly without any code changes.
The core idea is a change to the kernel's tcp_push() function: before pushing a packet, it checks whether the payload length of the current socket buffer (skb) is below the optimal size (typically a multiple of the MSS).
Auto corking is triggered when both of the following conditions hold:
- the current skb is smaller than the predefined size_goal (the optimal size);
- at least one other packet for this socket is still waiting in the qdisc queue or the NIC's TX queue.
When both hold, the kernel sets the socket's TCP Small Queues throttling flag (TSQ_THROTTLED) and defers the push. The deferral lasts at most until the previous packet completes transmission (TX completion).
This brief delay window gives the application a chance to coalesce more data into the same skb via subsequent write(), sendmsg(), or sendfile() calls, raising the payload ratio of outgoing packets and cutting down the number of small packets on the wire.
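As an illustration only (the helper name below is made up, not part of the patch), this sketch reproduces the workload auto corking targets: a burst of consecutive small write() calls on a loopback TCP connection. The byte count delivered is the same either way; what tcp_autocorking changes is how many segments carry those bytes.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send `nwrites` chunks of `chunk` bytes (chunk <= 256) over a
 * loopback TCP connection and return the total bytes read back. */
long small_write_roundtrip(int nwrites, int chunk)
{
    int ls = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in a;
    memset(&a, 0, sizeof(a));
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    a.sin_port = 0;                       /* ephemeral port */
    bind(ls, (struct sockaddr *)&a, sizeof(a));
    listen(ls, 1);
    socklen_t alen = sizeof(a);
    getsockname(ls, (struct sockaddr *)&a, &alen);

    int cs = socket(AF_INET, SOCK_STREAM, 0);
    connect(cs, (struct sockaddr *)&a, sizeof(a));
    int as = accept(ls, NULL, NULL);

    /* Many consecutive small writes: exactly the pattern that
     * auto corking tries to coalesce into fewer segments. */
    char buf[256];
    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < nwrites; i++)
        write(cs, buf, (size_t)chunk);
    close(cs);                            /* FIN ends the read loop */

    char rbuf[4096];
    long total = 0;
    ssize_t n;
    while ((n = read(as, rbuf, sizeof(rbuf))) > 0)
        total += n;
    close(as);
    close(ls);
    return total;
}
```

small_write_roundtrip(1000, 16) returns 16000 whether the sysctl is 0 or 1; comparing the TcpExtTCPAutoCorking counter (via nstat) before and after the run is what reveals how often the kernel deferred a push.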
⚠️ Note: the actual delay depends on the system's runtime behavior. If the flow has no packets pending in the qdisc or the NIC TX ring, the delay is zero and the push happens immediately.
In practice, using the FQ (Fair Queueing) packet scheduler together with TCP pacing markedly increases the chance that auto corking is triggered, making the optimization more consistent.
To control the feature, the kernel adds a new sysctl knob:
- path: /proc/sys/net/ipv4/tcp_autocorking
- default: 1 (enabled)
The kernel also adds an SNMP counter to monitor how often auto corking fires; it can be read with:
nstat -a | grep TcpExtTCPAutoCorking
The counter is incremented each time the kernel detects an under-filled skb and defers its transmission.
Performance results
- The effect is especially pronounced for line-buffered commands run over SSH (e.g. echo), markedly reducing the number of small packets on the network.
- Significant improvements were observed in both CPU usage and total throughput.
Test case 1: auto corking enabled (tcp_autocorking = 1)
lpq83:~# echo 1 > /proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
9410.39 # ← throughput: 9410.39 Mbps
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
35209.439626 task-clock # 2.901 CPUs utilized
2,294 context-switches # 0.065 K/sec
101 CPU-migrations # 0.003 K/sec
4,079 page-faults # 0.116 K/sec
97,923,241,298 cycles # 2.781 GHz [83.31%]
51,832,908,236 stalled-cycles-frontend # 52.93% frontend cycles idle [83.30%]
25,697,986,603 stalled-cycles-backend # 26.24% backend cycles idle [66.70%]
102,225,978,536 instructions # 1.04 insns per cycle
# 0.51 stalled cycles per insn [83.38%]
18,657,696,819 branches # 529.906 M/sec [83.29%]
91,679,646 branch-misses # 0.49% of all branches [83.40%]
12.136204899 seconds time elapsed
Test case 2: auto corking disabled (tcp_autocorking = 0)
lpq83:~# echo 0 > /proc/sys/net/ipv4/tcp_autocorking
lpq83:~# perf stat ./super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128
6624.89 # ← throughput drops to 6624.89 Mbps (roughly a 30% loss!)
Performance counter stats for './super_netperf 4 -t TCP_STREAM -H lpq84 -- -m 128':
40045.864494 task-clock # 3.301 CPUs utilized ↑
171 context-switches # 0.004 K/sec
53 CPU-migrations # 0.001 K/sec
4,080 page-faults # 0.102 K/sec
111,340,458,645 cycles # 2.780 GHz ↑ 13.6%
61,778,039,277 stalled-cycles-frontend # 55.49% frontend cycles idle ↑
29,295,522,759 stalled-cycles-backend # 26.31% backend cycles idle ↑
108,654,349,355 instructions # 0.98 insns per cycle ↓
# 0.57 stalled cycles per insn ↑
19,552,170,748 branches # 488.243 M/sec
157,875,417 branch-misses # 0.81% of all branches ↑
12.130267788 seconds time elapsed
📊 Comparison: with tcp_autocorking enabled we observe:
- throughput up roughly 42% (from 6624 Mbps to 9410 Mbps);
- higher CPU efficiency: task-clock for the same workload drops by about 12%;
- a sharply lower branch-miss rate: from 0.81% down to 0.49%;
- fewer frontend stall cycles: from 55.49% down to 52.93%, indicating a smoother instruction pipeline.
💡 Background:
"Corking" literally means stoppering a bottle; in network programming it denotes holding data back and delaying transmission so that more of it can be merged. Traditionally an application had to set the TCP_CORK socket option explicitly to get this behavior; Automatic Corking instead lets the Linux kernel make that decision from live network state, fully transparently to the application, making it a notable optimization in a modern high-performance TCP/IP stack.
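A minimal sketch of the explicit variant (set_cork is a hypothetical helper, not kernel code): the application brackets a batch of small writes with the TCP_CORK socket option, corking before the writes and uncorking to flush.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Set TCP_CORK on `fd` to `on` (0 or 1) and read it back;
 * returns the option's resulting value, or -1 on error. */
int set_cork(int fd, int on)
{
    if (setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on)) < 0)
        return -1;
    int flag = 0;
    socklen_t len = sizeof(flag);
    if (getsockopt(fd, IPPROTO_TCP, TCP_CORK, &flag, &len) < 0)
        return -1;
    return flag != 0;
}
```

Typical use: set_cork(fd, 1), several write() calls for header and body, then set_cork(fd, 0) to push whatever is still pending. Auto corking removes the need for this bracketing in the common case, while TCP_CORK remains available when the application knows exactly when to uncork.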
Kernel implementation summary
The patch excerpt below shows the key pieces of the tcp_autocorking implementation: the sysctl knob, the SNMP counter, and the core changes to tcp_push():
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
Documentation/networking/ip-sysctl.txt | 10 +++
include/net/tcp.h | 1
include/uapi/linux/snmp.h | 1
net/ipv4/proc.c | 1
net/ipv4/sysctl_net_ipv4.c | 9 +++
net/ipv4/tcp.c | 63 ++++++++++++++++++-----
6 files changed, 72 insertions(+), 13 deletions(-)
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 3c12d9a..12ba2cd 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -156,6 +156,16 @@ tcp_app_win - INTEGER
buffer. Value 0 is special, it means that nothing is reserved.
Default: 31
+tcp_autocorking - BOOLEAN
+ Enable TCP auto corking :
+ When applications do consecutive small write()/sendmsg() system calls,
+ we try to coalesce these small writes as much as possible, to lower
+ total amount of sent packets. This is done if at least one prior
+ packet for the flow is waiting in Qdisc queues or device transmit
+ queue. Applications can still use TCP_CORK for optimal behavior
+ when they know how/when to uncork their sockets.
+ Default : 1
+
tcp_available_congestion_control - STRING
Shows the available congestion control choices that are registered.
More congestion control algorithms may be available as modules,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70e55d2..f7e1ab2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -282,6 +282,7 @@ extern int sysctl_tcp_limit_output_bytes;
extern int sysctl_tcp_challenge_ack_limit;
extern unsigned int sysctl_tcp_notsent_lowat;
extern int sysctl_tcp_min_tso_segs;
+extern int sysctl_tcp_autocorking;
extern atomic_long_t tcp_memory_allocated;
extern struct percpu_counter tcp_sockets_allocated;
diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 1bdb4a3..bbaba22 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -258,6 +258,7 @@ enum
LINUX_MIB_TCPFASTOPENCOOKIEREQD, /* TCPFastOpenCookieReqd */
LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES, /* TCPSpuriousRtxHostQueues */
LINUX_MIB_BUSYPOLLRXPACKETS, /* BusyPollRxPackets */
+ LINUX_MIB_TCPAUTOCORKING, /* TCPAutoCorking */
__LINUX_MIB_MAX
};
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 4a03358..8ecd7ad 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -279,6 +279,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPFastOpenCookieReqd", LINUX_MIB_TCPFASTOPENCOOKIEREQD),
SNMP_MIB_ITEM("TCPSpuriousRtxHostQueues", LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES),
SNMP_MIB_ITEM("BusyPollRxPackets", LINUX_MIB_BUSYPOLLRXPACKETS),
+ SNMP_MIB_ITEM("TCPAutoCorking", LINUX_MIB_TCPAUTOCORKING),
SNMP_MIB_SENTINEL
};
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 3d69ec8..38c8ec9 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -733,6 +733,15 @@ static struct ctl_table ipv4_table[] = {
.extra2 = &gso_max_segs,
},
{
+ .procname = "tcp_autocorking",
+ .data = &sysctl_tcp_autocorking,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one,
+ },
+ {
.procname = "udp_mem",
.data = &sysctl_udp_mem,
.maxlen = sizeof(sysctl_udp_mem),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c4638e6..0ca8754 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -285,6 +285,8 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
int sysctl_tcp_min_tso_segs __read_mostly = 2;
+int sysctl_tcp_autocorking __read_mostly = 1;
+
struct percpu_counter tcp_orphan_count;
EXPORT_SYMBOL_GPL(tcp_orphan_count);
@@ -619,19 +621,52 @@ static inline void tcp_mark_urg(struct tcp_sock *tp, int flags)
tp->snd_up = tp->write_seq;
}
-static inline void tcp_push(struct sock *sk, int flags, int mss_now,
- int nonagle)
+/* If a not yet filled skb is pushed, do not send it if
+ * we have packets in Qdisc or NIC queues :
+ * Because TX completion will happen shortly, it gives a chance
+ * to coalesce future sendmsg() payload into this skb, without
+ * need for a timer, and with no latency trade off.
+ * As packets containing data payload have a bigger truesize
+ * than pure acks (dataless) packets, the last check prevents
+ * autocorking if we only have an ACK in Qdisc/NIC queues.
+ */
+static bool tcp_should_autocork(struct sock *sk, struct sk_buff *skb,
+ int size_goal)
{
- if (tcp_send_head(sk)) {
- struct tcp_sock *tp = tcp_sk(sk);
+ return skb->len < size_goal &&
+ sysctl_tcp_autocorking &&
+ atomic_read(&sk->sk_wmem_alloc) > skb->truesize;
+}
+
+static void tcp_push(struct sock *sk, int flags, int mss_now,
+ int nonagle, int size_goal)
+{
+ struct tcp_sock *tp = tcp_sk(sk);
+ struct sk_buff *skb;
- if (!(flags & MSG_MORE) || forced_push(tp))
- tcp_mark_push(tp, tcp_write_queue_tail(sk));
+ if (!tcp_send_head(sk))
+ return;
- tcp_mark_urg(tp, flags);
+ skb = tcp_write_queue_tail(sk);
+ if (!(flags & MSG_MORE) || forced_push(tp))
+ tcp_mark_push(tp, skb);
- __tcp_push_pending_frames(sk, mss_now,
- (flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
+ tcp_mark_urg(tp, flags);
+
+ if (tcp_should_autocork(sk, skb, size_goal)) {
+ /* avoid atomic op if TSQ_THROTTLED bit is already set */
+ if (!test_bit(TSQ_THROTTLED, &tp->tsq_flags)) {
+ NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPAUTOCORKING);
+ set_bit(TSQ_THROTTLED, &tp->tsq_flags);
+ }
+ return;
}
+
+ if (flags & MSG_MORE)
+ nonagle = TCP_NAGLE_CORK;
+
+ __tcp_push_pending_frames(sk, mss_now, nonagle);
}
static int tcp_splice_data_recv(read_descriptor_t *rd_desc, struct sk_buff *skb,
@@ -934,7 +969,8 @@ new_segment:
wait_for_sndbuf:
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
- tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
+ tcp_push(sk, flags & ~MSG_MORE, mss_now,
+ TCP_NAGLE_PUSH, size_goal);
if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
goto do_error;
@@ -944,7 +980,7 @@ wait_for_memory:
out:
if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
- tcp_push(sk, flags, mss_now, tp->nonagle);
+ tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
return copied;
do_error:
@@ -1225,7 +1261,8 @@ wait_for_sndbuf:
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
if (copied)
- tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
+ tcp_push(sk, flags & ~MSG_MORE, mss_now,
+ TCP_NAGLE_PUSH, size_goal);
if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
goto do_error;
@@ -1236,7 +1273,7 @@ wait_for_memory:
out:
if (copied)
- tcp_push(sk, flags, mss_now, tp->nonagle);
+ tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
release_sock(sk);
return copied + copied_syn;