
Note: this article involves a lot of calculation and many steps, so omissions are inevitable; corrections are welcome. The functions used at each calculation step are provided so you can modify them yourself. The article represents only my personal views and has nothing to do with my employer.

Whenever the words "Low Latency" show up in code or slideware, many people instinctively jump to "reduce static latency, add bandwidth, build a large Scale-Up network." But is that really the whole story?

This is the first article in the series: it analyzes the peak EP-parallel inference performance of DeepSeek-R1 on H800 and H20. Follow-up articles will look at instances such as B200-NVL72 to see whether a Scale-Up network really has an advantage.

TL;DR

The analysis results for H800 and H20 are summarized below (the H800 figures line up fairly well with DeepSeek's official data).

Prefill stage

| | H800 | H20 |
|---|---|---|
| TPS (Overlap) | 52240.1 | 9377.0 |
| TPS | 33741.0 | 8536.9 |

Decoding stage

| | H800 (TP1) | H800 (TP1) | H800 (TP1) | H20 (TP4) | H20_3e (TP8) | H20_3e (TP8) |
|---|---|---|---|---|---|---|
| BatchSize | 32 | 64 | 128 | 32 | 32 | 64 |
| TPOT (Overlap) | 9.858 | 19.716 | 39.431 | 35.367 | 29.613 | 49.005 |
| TPOT | 17.023 | 34.045 | 68.090 | 42.532 | 36.778 | 63.334 |
| TPS (Overlap) | 101.442 | 50.721 | 25.360 | 28.275 | 33.768 | 20.406 |
| TPS | 58.746 | 29.373 | 14.686 | 23.512 | 27.190 | 15.789 |
| Total (Overlap) | 3246.137 | 3246.137 | 3246.137 | 904.803 | 1080.591 | 1306.001 |
| Total | 1879.856 | 1879.856 | 1879.856 | 752.383 | 870.082 | 1010.516 |

In the decoding stage, H20-3E (HBM3e, 141 GB) delivers close to a 1.4x gain over the regular H20.

Contents

1. DeepSeek-V3/R1 Model Architecture and Computational Complexity
1.1 MLA Computational Complexity
1.1.1 Standard implementation
1.1.2 Matrix-absorption mode
1.2 DenseMLP Computational Complexity
1.3 MoE Expert Computational Complexity
1.4 Summary
2. Prefill Stage
2.1 MLA Compute Time
2.2 DenseMLP Compute Time
2.3 MoE Compute Time
2.4 AlltoAll Communication Time
2.5 Total Time
2.6 Overlap Analysis
2.7 KVCache Calculation
3. Decoding Stage
3.1 EP Strategy Analysis
3.2 Memory Utilization Analysis
3.3 MLA Time
3.4 DenseMLP Time
3.5 MoE Compute Time
3.6 AlltoAll Communication Time
3.7 Total Time
3.8 Overlap Analysis
4. Summary

1. DeepSeek-V3/R1 Model Architecture and Computational Complexity

The DeepSeek-V3/R1 model architecture is as follows:

[Figure: DeepSeek-V3/R1 architecture]

The model parameters are defined below:

class ModelArgs:
    max_batch_size: int = 8
    max_seq_len: int = 4096 * 4
    vocab_size: int = 129280
    dim: int = 7168
    inter_dim: int = 18432
    moe_inter_dim: int = 2048
    n_layers: int = 61
    n_dense_layers: int = 3
    n_heads: int = 128
    # moe
    n_routed_experts: int = 256
    n_shared_experts: int = 1
    n_activated_experts: int = 8
    n_expert_groups: int = 8
    n_limited_groups: int = 4
    route_scale: float = 2.5
    # mla
    q_lora_rank: int = 1536
    kv_lora_rank: int = 512
    qk_nope_head_dim: int = 128
    qk_rope_head_dim: int = 64
    v_head_dim: int = 128

There is a very convenient way to get each module's FLOPs and parameter count: feed the block into ptflops' get_model_complexity_info. However, that library miscounts some of the more complex operators, so this article manually corrects the key parts.


1.1 MLA Computational Complexity

1.1.1 Standard implementation

The MLA module code below comes from the DeepSeek-V3 GitHub example and is a standard MLA implementation:

import torch
from torch import nn

# Simplified from the DeepSeek-V3 reference code: world_size, freqs_cis and
# apply_rotary_emb are assumed to be provided by the surrounding code as in the original repo.
class MLA(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.dim = args.dim  # hidden dimension
        self.n_heads = args.n_heads
        self.n_local_heads = args.n_heads // world_size
        self.q_lora_rank = args.q_lora_rank  # rank of the low-rank compression of Q
        self.kv_lora_rank = args.kv_lora_rank  # rank of the low-rank compression of KV
        self.qk_nope_head_dim = args.qk_nope_head_dim  # per-head Q/K dim without rotary position encoding
        self.qk_rope_head_dim = args.qk_rope_head_dim  # per-head Q/K dim with rotary position encoding
        self.qk_head_dim = args.qk_nope_head_dim + args.qk_rope_head_dim
        self.v_head_dim = args.v_head_dim  # per-head dim of V in the multi-head attention

        self.wq_a = nn.Linear(self.dim, self.q_lora_rank)  # Q down-projection matrix
        self.q_norm = nn.RMSNorm(self.q_lora_rank)
        self.wq_b = nn.Linear(self.q_lora_rank, self.n_heads * self.qk_head_dim)  # Q up-projection matrix

        self.wkv_a = nn.Linear(self.dim, self.kv_lora_rank + self.qk_rope_head_dim)  # wkv_a is the K/V down-projection matrix
        self.kv_norm = nn.RMSNorm(self.kv_lora_rank)
        self.wkv_b = nn.Linear(self.kv_lora_rank, self.n_heads * (self.qk_nope_head_dim + self.v_head_dim))  # wkv_b is the K/V up-projection matrix

        self.wo = nn.Linear(self.n_heads * self.v_head_dim, self.dim)  # output weight matrix
        self.softmax_scale = self.qk_head_dim ** -0.5  # 1/sqrt(d_k)
        self.register_buffer("kv_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.kv_lora_rank), persistent=False)
        self.register_buffer("pe_cache", torch.zeros(args.max_batch_size, args.max_seq_len, self.qk_rope_head_dim), persistent=False)

    def forward(self, x: torch.Tensor):
        bsz, seqlen, _ = x.size()
        start_pos = 1  # simplified; the reference implementation passes start_pos in
        end_pos = start_pos + seqlen
        # ---- compute Q ----
        q = self.wq_b(self.q_norm(self.wq_a(x)))
        q = q.view(bsz, seqlen, self.n_local_heads, self.qk_head_dim)
        q_nope, q_pe = torch.split(q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)  # split nope / rope parts
        q_pe = apply_rotary_emb(q_pe, freqs_cis)  # apply RoPE

        # ---- compute K/V ----
        kv = self.wkv_a(x)
        # the per-token KV-Cache size equals the wkv_a output dim (self.kv_lora_rank + self.qk_rope_head_dim)
        kv, k_pe = torch.split(kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)  # split KV and the K positional part
        k_pe = apply_rotary_emb(k_pe.unsqueeze(2), freqs_cis)  # apply RoPE

        # ---- handle the K/V up-projection matrix ----
        wkv_b = self.wkv_b.weight
        wkv_b = wkv_b.view(self.n_local_heads, -1, self.kv_lora_rank)

        # multiply the non-RoPE part of Q with the non-RoPE K up-projection weights first
        q_nope = torch.einsum("bshd,hdc->bshc", q_nope, wkv_b[:, :self.qk_nope_head_dim])
        self.kv_cache[:bsz, start_pos:end_pos] = self.kv_norm(kv)  # store the KV cache
        self.pe_cache[:bsz, start_pos:end_pos] = k_pe.squeeze(2)   # store the K positional-encoding cache (pe cache)

        # compute QK^T / sqrt(d_k)
        scores = (torch.einsum("bshc,btc->bsht", q_nope, self.kv_cache[:bsz, :end_pos]) +
                  torch.einsum("bshr,btr->bsht", q_pe, self.pe_cache[:bsz, :end_pos])) * self.softmax_scale
        scores = scores.softmax(dim=-1, dtype=torch.float32).type_as(x)

        # compute the V side
        x = torch.einsum("bsht,btc->bshc", scores, self.kv_cache[:bsz, :end_pos])
        x = torch.einsum("bshc,hdc->bshd", x, wkv_b[:, -self.v_head_dim:])

        x = self.wo(x.flatten(2))  # output projection: n_heads * v_head_dim -> dim
        return x

To make the execution flow easier to follow, the corresponding computation flow chart is:

[Figure: MLA execution flow chart]

As the chart shows, the per-token KVCache footprint can be derived from kv = self.wkv_a(x) in forward: its dimension is kv_lora_rank (512) + qk_rope_head_dim (64) = 576.

Using ptflops for a complexity count (note that it misses some terms):

from ptflops import get_model_complexity_info

args = ModelArgs()
m = MLA(args)
num_tokens = 1

mla_flops, mla_params = get_model_complexity_info(m, (num_tokens, args.dim), as_strings=True, print_per_layer_stat=True)

## Output:
MLA(
187.17 M, 99.999% Params, 170.36 MMac, 100.000% MACs,
  (wq_a): Linear(11.01 M, 5.883% Params, 11.01 MMac, 6.464% MACs, in_features=7168, out_features=1536, bias=True)
  (q_norm): RMSNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (1536,), eps=None, elementwise_affine=True)
  (wq_b): Linear(37.77 M, 20.181% Params, 37.77 MMac, 22.172% MACs, in_features=1536, out_features=24576, bias=True)
  (wkv_a): Linear(4.13 M, 2.206% Params, 4.13 MMac, 2.424% MACs, in_features=7168, out_features=576, bias=True)
  (kv_norm): RMSNorm(0, 0.000% Params, 0.0 Mac, 0.000% MACs, (512,), eps=None, elementwise_affine=True)
  (wkv_b): Linear(16.81 M, 8.981% Params, 0.0 Mac, 0.000% MACs, in_features=512, out_features=32768, bias=True)
  (wo): Linear(117.45 M, 62.748% Params, 117.45 MMac, 68.940% MACs, in_features=16384, out_features=7168, bias=True)
)

The 187.17M parameters per MLA block is fine, but the 170.36 MMac per token is wrong: the compute spent when wkv_b is split into w_uk and w_uv is not counted. So we define a manual calculation function:

def mla_flops(q_len, kv_len, args: ModelArgs, kv_cache_rate=0):
    # calculate MACs and estimate FLOPs as approx. 2x MACs.
    q_down_proj = q_len * args.dim * args.q_lora_rank  # wq_a
    q_up_proj = q_len * args.q_lora_rank * args.n_heads * (args.qk_nope_head_dim + args.qk_rope_head_dim)  # wq_b
    kv_down_proj = kv_len * args.dim * (args.kv_lora_rank + args.qk_rope_head_dim)  # wkv_a
    k_up_proj = kv_len * args.kv_lora_rank * args.n_heads * args.qk_nope_head_dim  # w_uk
    v_up_proj = kv_len * args.kv_lora_rank * args.n_heads * args.v_head_dim  # w_uv

    kv_down_proj = kv_down_proj * (1 - kv_cache_rate)
    gemm_sum = q_down_proj + q_up_proj + kv_down_proj + k_up_proj + v_up_proj

    # treat the attention itself as a standard MHA with args.n_heads heads
    mha = args.n_heads * (q_len * args.qk_rope_head_dim * kv_len    # QK_score_rope
                          + q_len * args.qk_nope_head_dim * kv_len  # QK_score_nope
                          + q_len * kv_len * args.v_head_dim)       # Score*V
    wo = q_len * args.n_heads * args.v_head_dim * args.dim  # wo
    attn_sum = mha + wo

    # return GFLOPs as 2 * sum(MACs)
    GEMM_FP8_FLOPS = gemm_sum * 2 / 1e9
    ATTN_FP16_FLOPS = attn_sum * 2 / 1e9

    return GEMM_FP8_FLOPS + ATTN_FP16_FLOPS, GEMM_FP8_FLOPS, ATTN_FP16_FLOPS

The actual per-token complexity:

mla_flops(1,1,args,0)

(0.37429248000000004, 0.139329536, 0.234962944)

1.1.2 Matrix-absorption mode

The DeepSeek-V2 paper notes:

Fortunately, due to the associative law of matrix multiplication, we can absorb W_UK into W_UQ, and W_UV into W_O

Here W_UQ corresponds to wq_b above. Before step (3) of the flow chart, w_uk can be pre-multiplied into wq_b; at step (7), w_uv can be multiplied into wo:

[Figure: MLA matrix-absorption diagram]

  • wq_b_nope: [q_lora_rank(1536), n_head(128) x qk_nope_head_dim(128)]
  • w_uk: [kv_lora_rank(512), n_head(128) x qk_nope_head_dim(128)]

After absorption, q_absorb: [q_lora_rank(1536), n_head(128) x kv_lora_rank(512)]

Similarly, absorbing w_uv into wo:

  • wo: [n_head(128) x v_head_dim(128), dim(7168)]
  • w_uv: [kv_lora_rank(512), n_head(128) x v_head_dim(128)]

After absorption, o_absorb: [dim(7168), n_head(128) x kv_lora_rank(512)]
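As a hedged sketch (the tensor names below are illustrative, not identifiers from the reference code), the two absorbed matrices can be precomputed once from the MLA weights with a pair of einsums:

import torch

n_heads, q_lora_rank, kv_lora_rank = 128, 1536, 512
qk_nope_head_dim, v_head_dim, dim = 128, 128, 7168

# per-head views of the relevant weights; in practice these are sliced out of
# wq_b.weight, wkv_b.weight and wo.weight
wq_b_nope = torch.randn(n_heads, qk_nope_head_dim, q_lora_rank)
w_uk = torch.randn(n_heads, qk_nope_head_dim, kv_lora_rank)
w_uv = torch.randn(n_heads, v_head_dim, kv_lora_rank)
wo = torch.randn(dim, n_heads, v_head_dim)

# absorb W_UK into W_UQ: per head, maps q_lora_rank -> kv_lora_rank
q_absorb = torch.einsum('hdq,hdc->hqc', wq_b_nope, w_uk)   # [128, 1536, 512]
# absorb W_UV into W_O: per head, maps kv_lora_rank -> dim
o_absorb = torch.einsum('ohd,hdc->ohc', wo, w_uv)          # [7168, 128, 512]
print(q_absorb.shape, o_absorb.shape)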

The corresponding FLOPs function:

def mla_matabsob_flops(q_len, kv_len, args: ModelArgs, kv_cache_rate=0):
    # calculate MACs and estimate FLOPs as approx. 2x MACs.
    q_down_proj = q_len * args.dim * args.q_lora_rank  # wq_a
    q_rope_up_proj = q_len * args.q_lora_rank * args.n_heads * args.qk_rope_head_dim  # wq_b_rope
    q_absorb = q_len * args.n_heads * args.q_lora_rank * args.kv_lora_rank

    kv_down_proj = kv_len * args.dim * (args.kv_lora_rank + args.qk_rope_head_dim)  # wkv_a
    kv_down_proj = kv_down_proj * (1 - kv_cache_rate)  # correct for the KV-Cache hit rate
    gemm_sum = q_down_proj + q_rope_up_proj + q_absorb + kv_down_proj

    # treat the attention itself as a standard MQA with args.n_heads heads
    mqa = args.n_heads * (q_len * args.qk_rope_head_dim * kv_len   # Score_rope
                          + q_len * args.kv_lora_rank * kv_len     # Score_nope
                          + q_len * kv_len * args.kv_lora_rank)    # Score*V
    o_absorb = q_len * args.n_heads * args.kv_lora_rank * args.dim
    attn_sum = mqa + o_absorb

    # return GFLOPs as 2 * sum(MACs)
    gemm_sum = gemm_sum * 2 / 1e9
    attn_sum = attn_sum * 2 / 1e9

    return gemm_sum + attn_sum, gemm_sum, attn_sum

Per-token complexity:

mla_matabsob_flops(1,1,args,0)

(1.196572672, 0.256770048, 0.939802624)

Compared with the non-absorbed version:

mla_matabsob_flops(1,1,args,0)[0] / mla_flops(1,1,args,0)[0] shows the compute actually increases by a factor of 3.197.

Estimated parameter count after absorption:

def mla_matabsob_mem(args: ModelArgs):
    q_down_proj = args.dim * args.q_lora_rank  # wq_a
    q_rope_up_proj = args.q_lora_rank * args.n_heads * args.qk_rope_head_dim  # wq_b_rope
    q_absorb = args.n_heads * args.q_lora_rank * args.kv_lora_rank
    kv_down_proj = args.dim * (args.kv_lora_rank + args.qk_rope_head_dim)  # wkv_a
    o_absorb = args.n_heads * args.kv_lora_rank * args.dim
    return q_down_proj + q_rope_up_proj + q_absorb + kv_down_proj + o_absorb

mla_matabsob_mem(args)/1e6
598.147072

The parameter count likewise grows to 598.14M (again about 3.197x).

MLA_Absorb does pay off in the decoding stage, however. According to the official write-up "DeepSeek-V3 / R1 Inference System Overview" [1], the average KVCache length per output token is about 4989; on that basis the two versions differ significantly:

# Prefill
mla_matabsob_flops(4989,4989,args,0)[0] / mla_flops(4989,4989,args,0)[0]

3.3028

# Decoding: q_len=1 and the KV projections are served from the cache, so kv_cache_rate=1
mla_matabsob_flops(1,4989,args,1)[0] / mla_flops(1,4989,args,1)[0]

0.015

Conclusion: use the non-absorbed version for the prefill stage and the matrix-absorbed version for the decoding stage.

1.2 DenseMLP Computational Complexity

The first three layers of the model use a dense MLP:

import torch.nn.functional as F

class DenseMLP(nn.Module):
    def __init__(self, dim: int, inter_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, inter_dim, dtype=torch.bfloat16)
        self.w2 = nn.Linear(inter_dim, dim, dtype=torch.bfloat16)
        self.w3 = nn.Linear(dim, inter_dim, dtype=torch.bfloat16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

args = ModelArgs()
# dim=7168, inter_dim=18432
d = DenseMLP(args.dim, args.inter_dim)
num_tokens = 1

mlp_flops, mlp_params = get_model_complexity_info(d, (1, num_tokens, args.dim), as_strings=True, print_per_layer_stat=True)
## Output:
DenseMLP(
396.41 M, 100.000% Params, 396.41 MMac, 99.995% MACs,
  (w1): Linear(132.14 M, 33.334% Params, 132.14 MMac, 33.333% MACs, in_features=7168, out_features=18432, bias=True)
  (w2): Linear(132.13 M, 33.331% Params, 132.13 MMac, 33.330% MACs, in_features=18432, out_features=7168, bias=True)
  (w3): Linear(132.14 M, 33.334% Params, 132.14 MMac, 33.333% MACs, in_features=7168, out_features=18432, bias=True)
)

A single MLP block has 396.41M parameters and a per-token complexity of roughly 396.41 MMac (about 792.82 MFLOPs). Define the complexity function:

def densmlp_flops(args: ModelArgs, seq_len):
    return 3 * seq_len * args.dim * args.inter_dim * 2 / 1e9

1.3 MoE Expert Computational Complexity

The remaining 58 layers use MoE; the per-expert computational complexity:

class Expert(nn.Module):
    def __init__(self, dim: int, inter_dim: int):
        super().__init__()
        self.w1 = nn.Linear(dim, inter_dim, dtype=torch.bfloat16)
        self.w2 = nn.Linear(inter_dim, dim, dtype=torch.bfloat16)
        self.w3 = nn.Linear(dim, inter_dim, dtype=torch.bfloat16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

args = ModelArgs()
num_tokens = 1

# dim=7168, moe_inter_dim=2048
e = Expert(args.dim, args.moe_inter_dim)

moe_flops, moe_params = get_model_complexity_info(e, (1, num_tokens, args.dim), as_strings=True, print_per_layer_stat=True)
## Output:
Expert(
44.05 M, 100.000% Params, 44.05 MMac, 99.995% MACs,
  (w1): Linear(14.68 M, 33.329% Params, 14.68 MMac, 33.328% MACs, in_features=7168, out_features=2048, bias=True)
  (w2): Linear(14.69 M, 33.341% Params, 14.69 MMac, 33.340% MACs, in_features=2048, out_features=7168, bias=True)
  (w3): Linear(14.68 M, 33.329% Params, 14.68 MMac, 33.328% MACs, in_features=7168, out_features=2048, bias=True)
)

A single expert has 44.05M parameters and a per-token complexity of roughly 44.05 MMac (about 88.1 MFLOPs). Define the complexity function:

def moe_expert_flops(args: ModelArgs, seq_len):
    return 3 * seq_len * args.dim * args.moe_inter_dim * 2 / 1e9

1.4 Summary

The overall parameter distribution is shown below (it also includes the MoE gating parameters, dim x n_routed_experts + n_routed_experts (bias) = 1.83M, and the embedding/output parameters, vocab_size x dim = 926.67M).

| Block | Params per layer | Layers | Total params |
|---|---|---|---|
| MLA | 187.17M | 61 | 11.41B |
| DenseMLP | 396.41M | 3 | 1.19B |
| Expert | 44.05M x (256 routed + 1 shared) | 58 | 656.6B |
| Gate | 1.83M | 58 | 106.14M |
| Embedding | 926.67M | 1 | 926.67M |
| Output | 926.67M | 1 | 926.67M |
| SUM | - | - | 671.16B |
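As a cross-check of the totals above, they can be recomputed directly from ModelArgs; a rough sketch that ignores the norm weights and biases:

def total_params_b(args: ModelArgs):
    # wq_a, wq_b, wkv_a, wkv_b, wo
    mla = (args.dim * args.q_lora_rank
           + args.q_lora_rank * args.n_heads * (args.qk_nope_head_dim + args.qk_rope_head_dim)
           + args.dim * (args.kv_lora_rank + args.qk_rope_head_dim)
           + args.kv_lora_rank * args.n_heads * (args.qk_nope_head_dim + args.v_head_dim)
           + args.n_heads * args.v_head_dim * args.dim)
    dense_mlp = 3 * args.dim * args.inter_dim
    expert = 3 * args.dim * args.moe_inter_dim
    gate = args.dim * args.n_routed_experts + args.n_routed_experts
    embed = args.vocab_size * args.dim
    n_moe = args.n_layers - args.n_dense_layers
    total = (mla * args.n_layers + dense_mlp * args.n_dense_layers
             + expert * (args.n_routed_experts + args.n_shared_experts) * n_moe
             + gate * n_moe + 2 * embed)  # embedding + output head
    return total / 1e9

print(total_params_b(args))  # ~671B, in line with the table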

Compute cost per block type:

| Block | Params | FLOPs per token | KVCache per token |
|---|---|---|---|
| MLA | 187.17M | 374.29M | 576 B (FP8) |
| MLA_absorb | 598.14M | 1196.57M | 576 B (FP8) |
| DenseMLP | 396.41M | 792.82M | - |
| Expert | 44.05M | 88.1M | - |

The actual evaluation combines these figures with the prefill/decode stages and the KVCache hit rate.

KVCache footprint: each token's KVCache accumulates across 61 layers; stored in FP16 that is 2 x 576 x 61 = 68.62 KB, or 34.31 KB in FP8.
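A quick sketch of that per-token KVCache arithmetic:

def kvcache_per_token_kb(args: ModelArgs, bytes_per_elem):
    # each layer caches kv_lora_rank + qk_rope_head_dim = 576 values per token
    return (args.kv_lora_rank + args.qk_rope_head_dim) * args.n_layers * bytes_per_elem / 1024

print(kvcache_per_token_kb(args, 2))  # FP16: ~68.62 KB
print(kvcache_per_token_kb(args, 1))  # FP8:  ~34.31 KB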

The H20/H800 specifications are:

| GPU | SMs | FP16 (TFLOPS) | FP8 (TFLOPS) | Memory (GB) | Memory BW (GB/s) | NVLink BW (GB/s) | PCIe BW (GB/s) |
|---|---|---|---|---|---|---|---|
| H800 | 132 | 989.5 | 1979 | 80 | 3350 | 200 | 50 |
| H20 | 78 | 148 | 296 | 96 | 3350 | 450 | 50 |

To simplify the later calculations, define a GPU performance class; overall performance is taken at 85% of peak. H800 needs 24 SMs for communication; given H20's weaker floating-point throughput, we estimate that H20 needs 10 communication SMs:

class GPU_perf():
    def __init__(self, sm, comm_sm, fp16_flops, fp8_flops, mem, mem_bw, nvlink_bw, pcie_bw, discount_rate):
        self.sm = sm
        self.comm_sm = comm_sm  # number of SMs reserved for communication
        self.fp16_flops = fp16_flops
        self.fp8_flops = fp8_flops
        self.mem = mem
        self.mem_bw = mem_bw
        self.nvlink_bw = nvlink_bw
        self.pcie_bw = pcie_bw
        self.discount_rate = discount_rate  # overall performance as a fraction of peak
        # TODO: the network discount and the compute discount could be separated

    def get_fp16_flops(self):
        return self.fp16_flops * self.discount_rate * (self.sm - self.comm_sm) / self.sm

    def get_fp8_flops(self):
        return self.fp8_flops * self.discount_rate * (self.sm - self.comm_sm) / self.sm

    def get_mem_bw(self):
        return self.mem_bw * self.discount_rate

    def get_nvlink_bw(self):
        return self.nvlink_bw * self.discount_rate

    def get_pcie_bw(self):
        return self.pcie_bw * self.discount_rate

h800 = GPU_perf( sm = 132 ,comm_sm = 24,
                 fp16_flops = 791.6, fp8_flops = 1583.2,
                 mem = 80,mem_bw = 3350,
                 nvlink_bw = 200,pcie_bw = 50,
                 discount_rate = 0.85)

h20 = GPU_perf( sm = 78 ,comm_sm = 10,
                 fp16_flops = 118.4, fp8_flops = 236.8,
                 mem = 96,mem_bw = 3350,
                 nvlink_bw = 400,pcie_bw = 50,
                 discount_rate = 0.85)

gpu = dict({'H800': h800, 'H20': h20})

2. Prefill Stage

According to DeepSeek's official report, prefill uses EP32 for the routed experts and DP32 for MLA and the shared experts. The minimum deployment unit is 4 nodes with 32 GPUs, with 9 routed experts plus 1 shared expert per GPU. The paper describes the attention parallelism strategy as:

The minimum deployment unit of prefilling stage consists of 4 nodes with 32 GPUs. The attention part employs 4-way Tensor Parallelism (TP4) with Sequence Parallelism (SP), combined with 8-way Data Parallelism (DP8). For the MoE part, we use 32-way Expert Parallelism (EP32)

From the attention perspective: inference requests are load-balanced at the API server across the DP groups (DP=8) on the prefill nodes; within each DP group, four H800s form a TP+SP group that runs MLA.

From the MoE perspective: the 32 GPUs form an EP32 group. With 256 experts per layer, each GPU averages 8 routed experts; each GPU also hosts 1 shared expert and, per the paper, 1 redundant expert, for roughly 10 experts in total.

This kind of parallelism split and communication/compute trade-off is a classic distributed-systems engineering problem: compute, memory bandwidth and network bandwidth all have to be accounted for together.

2.1 MLA Compute Time

Following the prefill/decoding length estimates in "DeepSeek V3/R1 Inference Efficiency Analysis (2): Reverse-Engineering the Full DeepSeek Deployment" [2]:

Let P be the average input length and D the average output length; then the average KVCache length per output token is roughly P + D/2 = 4989. Combined with P/D = 608 B / 168 B, this gives P ≈ 4383 and D ≈ 1210.

We therefore compute with an average prefill seq_len = 4383, the official KVCache hit rate of 56.3%, and GPU performance at 85% of peak.
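For reference, a tiny sketch of how those P and D figures follow from the two relations above (assuming the 608 B / 168 B figure is read as the input/output length ratio):

# P + D/2 = 4989 and P/D = 608/168
ratio = 608 / 168
D = 4989 / (ratio + 0.5)
P = ratio * D
print(round(P), round(D))  # roughly 4383 and 1211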

The calculation function is below (TP parallelism is included; assuming 56.3% of seq_len can be served from the KVCache, prefill only needs to compute (1 - kv_cache_rate) of the tokens):

import pandas as pd

def prefill_mla_elapse_time(args: ModelArgs, gpu: GPU_perf, seq_len, kv_cache_rate):
    # q_len = kv_len = seq_len; kv_cache_rate discounts the cached part of the KV down-projection
    _, gemm_fp8_flops, attn_fp16_flops = mla_flops(seq_len, seq_len, args, kv_cache_rate)
    gemm_fp8_time = gemm_fp8_flops / gpu.get_fp8_flops()
    print("GEMM_FP8 Elapsed time(ms): %.3f" % gemm_fp8_time)
    attn_fp16_time = attn_fp16_flops / gpu.get_fp16_flops()
    print("ATTN_FP16 Elapsed time(ms): %.3f" % attn_fp16_time)
    total_time = gemm_fp8_time + attn_fp16_time
    print("Total Elapsed time(ms):%.3f" % total_time)

    all_reduce_comm_size = seq_len * args.dim * 2 / 1024 / 1024  # fp16 takes 2 bytes
    ar_elapsed_time = all_reduce_comm_size / gpu.get_nvlink_bw()
    print("AR Elapsed time(ms):%.3f" % ar_elapsed_time)

    tp4_time = total_time / 4 + ar_elapsed_time
    print("TP4 Elapsed time(ms):%.3f" % tp4_time)

    tp8_time = total_time / 8 + ar_elapsed_time
    print("TP8 Elapsed time(ms):%.3f" % tp8_time)
    return total_time, tp4_time, tp8_time

def prefill_mla(args: ModelArgs, gpu_dict, seq_len, kv_cache_rate):
    df = pd.DataFrame(columns=['GPU', 'TP1', 'TP4', 'TP8'])
    for key in gpu_dict.keys():
        print('------------ %s --------------' % key)
        tp1, tp4, tp8 = prefill_mla_elapse_time(args, gpu_dict[key], seq_len, kv_cache_rate)
        df.loc[len(df)] = [key, tp1, tp4, tp8]
    print(df.set_index('GPU').to_markdown(floatfmt=".3f"))

H800 uses 24 communication SMs and H20 uses 10; both TP=4 and TP=8 are computed. The TP AllReduce traffic is seq_len x dim x 2 bytes (BF16).
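As a quick sanity check on that formula (mirroring the MiB-over-GB/s convention used in prefill_mla_elapse_time), the H800 AllReduce time reported below can be reproduced by hand:

seq_len, dim = 4383, 7168
ar_mib = seq_len * dim * 2 / 1024 / 1024   # BF16 AllReduce payload, ~59.9 MiB
print(ar_mib / (200 * 0.85))               # effective NVLink bandwidth -> ~0.35 ms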

seq_len = 4383
kv_cache_rate = 0.563

prefill_mla(args,gpu,seq_len,kv_cache_rate)
------------ H800 --------------
GEMM_FP8 Elapsed time(ms): 0.536
ATTN_FP16 Elapsed time(ms): 4.729
Total Elapsed time(ms):5.265
AR Elapsed time(ms):0.352
TP4 Elapsed time(ms):1.669
TP8 Elapsed time(ms):1.011
------------ H20 --------------
GEMM_FP8 Elapsed time(ms): 3.364
ATTN_FP16 Elapsed time(ms): 29.671
Total Elapsed time(ms):33.035
AR Elapsed time(ms):0.176
TP4 Elapsed time(ms):8.435
TP8 Elapsed time(ms):4.306

Summary of the MLA GPU compute time (ms):

| GPU | TP1 | TP4 | TP8 |
|---|---|---|---|
| H800 | 5.265 | 1.669 | 1.011 |
| H20 | 33.035 | 8.435 | 4.306 |

2.2 DenseMLP Compute Time

DenseMLP compute-time functions:

def densmlp_flops(args: ModelArgs, seq_len):
    return 3 * seq_len * args.dim * args.inter_dim * 2 / 1e9

def dense_mlp_elapse_time(args: ModelArgs, gpu: GPU_perf, seq_len):
    gemm_fp8_flops = densmlp_flops(args, seq_len)
    gemm_fp8_time = gemm_fp8_flops / gpu.get_fp8_flops()
    print("Elapsed time(ms): %.3f" % gemm_fp8_time)
    return gemm_fp8_time

def prefill_dense_mlp(args: ModelArgs, gpu_dict, seq_len):
    df = pd.DataFrame(columns=['GPU', 'DenseMLP time'])
    for key in gpu_dict.keys():
        print('------------ %s --------------' % key)
        t = dense_mlp_elapse_time(args, gpu_dict[key], seq_len)
        df.loc[len(df)] = [key, t]
    print(df.set_index('GPU').to_markdown(floatfmt=".3f"))

Actual computed length:

q_len = seq_len *( 1- kv_cache_rate)
------------ H800 --------------
Elapsed time(ms): 3.156
------------ H20 --------------
Elapsed time(ms): 19.801

DenseMLP time (ms):

| GPU | DenseMLP time |
|---|---|
| H800 | 3.156 |
| H20 | 19.801 |

2.3 MoE Compute Time

With TP=4 there are DP=8 groups, so MLA produces 8 groups of seq_len tokens at the same time. The number of tokens each GPU's shared expert processes is roughly seq_len * dp_group / num_gpu.

For the routed experts: with topk=8, the total routed-token compute is seq_len * dp_group * topk; spread evenly over the 32 GPUs, each GPU handles seq_len * dp_group * topk / num_gpu routed tokens.

def moe_expert_flops(args: ModelArgs, seq_len):
    return 3 * seq_len * args.dim * args.moe_inter_dim * 2 / 1e9

def moe_expert_elapse_time(args: ModelArgs, gpu: GPU_perf, seq_len, tp, dp):
    num_device = tp * dp
    num_shared_token = dp * seq_len / num_device
    shared_flops = moe_expert_flops(args, num_shared_token)
    shared_time = shared_flops / gpu.get_fp8_flops()
    print("Shared Expert Elapsed time(ms): %.3f" % shared_time)

    num_routed_token = seq_len * dp * args.n_activated_experts / num_device
    routed_flops = moe_expert_flops(args, num_routed_token)
    routed_time = routed_flops / gpu.get_fp8_flops()
    print("Routed Expert Elapsed time(ms): %.3f" % routed_time)

    return shared_time, routed_time

def prefill_moe(args: ModelArgs, gpu_dict, seq_len, tp, dp):
    df = pd.DataFrame(columns=['GPU', 'Shared Expert', 'Routed Expert'])
    for key in gpu_dict.keys():
        print('------------ %s --------------' % key)
        s, r = moe_expert_elapse_time(args, gpu_dict[key], seq_len, tp, dp)
        df.loc[len(df)] = [key, s, r]
    print(df.set_index('GPU').to_markdown(floatfmt=".3f"))

TP=4, DP=8:

prefill_moe(args,gpu, seq_len, tp=4,dp=8)
------------ H800 --------------
Shared Expert Elapsed time(ms): 0.088
Routed Expert Elapsed time(ms): 0.701
------------ H20 --------------
Shared Expert Elapsed time(ms): 0.550
Routed Expert Elapsed time(ms): 4.400
| GPU | Shared Expert | Routed Expert |
|---|---|---|
| H800 | 0.088 | 0.701 |
| H20 | 0.550 | 4.400 |

TP=8, DP=4:

prefill_moe(args,gpu, seq_len, tp=8,dp=4)
------------ H800 --------------
Shared Expert Elapsed time(ms): 0.044
Routed Expert Elapsed time(ms): 0.351
------------ H20 --------------
Shared Expert Elapsed time(ms): 0.275
Routed Expert Elapsed time(ms): 2.200
| GPU | Shared Expert | Routed Expert |
|---|---|---|
| H800 | 0.044 | 0.351 |
| H20 | 0.275 | 2.200 |

2.4 AlltoAll Communication Time

DeepSeek-V3 introduced MoE groups to balance NVLink and IB bandwidth: after gating, a token is dispatched to at most 4 nodes. Assuming a perfectly balanced EP load, the cross-node RDMA traffic is roughly 3 x the number of tokens.

Dispatch traffic estimate:

  • TP=4: each node has 2 DP groups and needs to send 2 * 3 * seq_len * dim
  • TP=8: each node has 1 DP group and needs to send 3 * seq_len * dim

The combine phase is FP16, so its traffic doubles. H800 and H20 have the same scale-out bandwidth: per DeepEP it is about 45 GB/s per GPU, or roughly 40 GB/s at an assumed 80% utilization, for a node total of about 40 GB/s x 8 = 320 GB/s.

def prefill_alltoall_time(args: ModelArgs, gpu, seq_len, dispatch_node, tp):
    ## traffic estimate
    gpu_per_node = 8
    dp = gpu_per_node / tp
    dispatch_size = (dispatch_node - 1) * dp * seq_len * args.dim / 1024 / 1024
    combine_size = 2 * dispatch_size  # fp16
    comm_bw = gpu.get_pcie_bw() * gpu_per_node
    dispatch_time = dispatch_size / comm_bw
    combine_time = combine_size / comm_bw
    return dispatch_time, combine_time

def prefill_alltoall(args: ModelArgs, gpu_dict, seq_len, dispatch_node, tp):
    df = pd.DataFrame(columns=['GPU', 'Dispatch', 'Combine'])
    for key in gpu_dict.keys():
        print('------------ %s --------------' % key)
        dispatch_time, combine_time = prefill_alltoall_time(args, gpu_dict[key], seq_len, dispatch_node, tp)
        print("Dispatch Elapsed time(ms): %.3f" % dispatch_time)
        print("Combine Elapsed time(ms): %.3f" % combine_time)
        df.loc[len(df)] = [key, dispatch_time, combine_time]
    print(df.set_index('GPU').to_markdown(floatfmt=".3f"))

TP=4 (2 DP groups per node):

prefill_alltoall(args,gpu,seq_len,dispatch_node=4,tp=4)
------------ H800 --------------
Dispatch Elapsed time(ms): 0.529
Combine Elapsed time(ms): 1.057
------------ H20 --------------
Dispatch Elapsed time(ms): 0.529
Combine Elapsed time(ms): 1.057
| GPU | Dispatch | Combine |
|---|---|---|
| H800 | 0.529 | 1.057 |
| H20 | 0.529 | 1.057 |

TP=8 (1 DP group per node):

prefill_alltoall(args,gpu,seq_len,dispatch_node=4,tp=8)
------------ H800 --------------
Dispatch Elapsed time(ms): 0.264
Combine Elapsed time(ms): 0.529
------------ H20 --------------
Dispatch Elapsed time(ms): 0.264
Combine Elapsed time(ms): 0.529
| GPU | Dispatch | Combine |
|---|---|---|
| H800 | 0.264 | 0.529 |
| H20 | 0.264 | 0.529 |

2.5 Total Time

Total time (without overlap):

  • 3 x (MLA_tp1 + DenseMLP) + 58 x (MLA_tpN + Shared Expert + Routed Expert + Dispatch + Combine)

Total time (fully overlapped):

  • 3 x (MLA_tp1 + DenseMLP) + 58 x (MLA_tpN + Shared Expert + Routed Expert)

Calculation function:

def prefill_time(args: ModelArgs, gpu, seq_len, kv_cache_rate, tp, dp):
    dispatch_node = 4
    gpu_per_node = 8
    num_device = tp * dp
    dense_mla, tp4_mla, tp8_mla = prefill_mla_elapse_time(args, gpu, seq_len, kv_cache_rate)
    tp_mla = tp4_mla if tp == 4 else tp8_mla
    dense_mlp = dense_mlp_elapse_time(args, gpu, seq_len)
    shared, routed = moe_expert_elapse_time(args, gpu, seq_len, tp, dp)
    dispatch, combine = prefill_alltoall_time(args, gpu, seq_len, dispatch_node, tp)
    return dense_mla, dense_mlp, tp_mla, shared, routed, dispatch, combine

def prefill_time_sum(args: ModelArgs, gpu_dict, seq_len, kv_cache_rate, tp, dp):
    df = pd.DataFrame(columns=['MLA', 'DenseMLP', 'TP_MLA', 'Shared Expert', 'Routed Expert', 'Dispatch', 'Combine', 'GPU'])
    df2 = pd.DataFrame(columns=['Sum(Overlap)', 'Sum', 'GPU'])
    n_sparse_layers = args.n_layers - args.n_dense_layers
    df.loc[len(df)] = [args.n_dense_layers, args.n_dense_layers,           # MLA + DenseMLP
                       n_sparse_layers, n_sparse_layers, n_sparse_layers,  # sparse-layer MLA + MoE
                       n_sparse_layers, n_sparse_layers, 'Layers']          # dispatch & combine layers
    for key in gpu_dict.keys():
        t = list(prefill_time(args, gpu_dict[key], seq_len, kv_cache_rate, tp, dp))
        t.append(key)
        df.loc[len(df)] = t
        sum_overlap = args.n_dense_layers * (t[0] + t[1]) + n_sparse_layers * (t[2] + t[3] + t[4])
        sum_non_overlap = sum_overlap + n_sparse_layers * (t[5] + t[6])  # alltoall
        df2.loc[len(df2)] = [sum_overlap, sum_non_overlap, key]
    df = df.set_index('GPU').T
    df['Layers'] = df['Layers'].astype(int).astype(str)
    print(df.to_markdown(floatfmt=".3f"))
    print('-----------SUM-------------')
    df2 = df2.set_index('GPU').T
    print(df2.to_markdown(floatfmt=".3f"))

    return df, df2

TP=4, DP=8 (ms):

tp4_detail,tp4_sum = prefill_time_sum(args, gpu, seq_len, kv_cache_rate,tp=4 , dp=8)
| | Layers | H800 | H20 |
|---|---|---|---|
| MLA | 3 | 5.265 | 33.035 |
| DenseMLP | 3 | 3.156 | 19.801 |
| TP_MLA | 58 | 1.669 | 8.435 |
| Shared Expert | 58 | 0.088 | 0.550 |
| Routed Expert | 58 | 0.701 | 4.400 |
| Dispatch | 58 | 0.529 | 0.529 |
| Combine | 58 | 1.057 | 1.057 |

Total time (ms):

| | H800 | H20 |
|---|---|---|
| Sum(Overlap) | 167.802 | 934.839 |
| Sum | 259.803 | 1026.840 |
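As a sanity check, the H800 totals can be reproduced by hand from the per-layer table above (small differences come from rounding):

# 3 dense layers + 58 sparse layers, H800, TP=4
print(3 * (5.265 + 3.156) + 58 * (1.669 + 0.088 + 0.701))                    # ~167.8 ms, Sum(Overlap)
print(3 * (5.265 + 3.156) + 58 * (1.669 + 0.088 + 0.701 + 0.529 + 1.057))    # ~259.8 ms, Sum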

TP=8, DP=4 (ms):

tp8_detail,tp8_sum  = prefill_time_sum(args, gpu, seq_len, kv_cache_rate,tp=8 , dp=4)
| | Layers | H800 | H20 |
|---|---|---|---|
| MLA | 3 | 5.265 | 33.035 |
| DenseMLP | 3 | 3.156 | 19.801 |
| TP_MLA | 58 | 1.011 | 4.306 |
| Shared Expert | 58 | 0.044 | 0.275 |
| Routed Expert | 58 | 0.351 | 2.200 |
| Dispatch | 58 | 0.264 | 0.264 |
| Combine | 58 | 0.529 | 0.529 |

Total time (ms):

| | H800 | H20 |
|---|---|---|
| Sum(Overlap) | 106.754 | 551.784 |
| Sum | 152.754 | 597.784 |

Since this is the time to prefill seq_len tokens for every DP group, the number of tokens a single node can process per second is:

DP * seq_len * (1000 ms / compute time) / num_nodes

Official TP=4 deployment:

dp = 8
num_node = 4
print(tp4_sum.apply(lambda x: dp * seq_len * (1000/ x)/num_node).to_markdown(floatfmt=".1f"))
| | H800 | H20 |
|---|---|---|
| Sum(Overlap) | 52240.1 | 9377.0 |
| Sum | 33741.0 | 8536.9 |

Throughput of the TP=8 deployment:

| | H800 | H20 |
|---|---|---|
| Sum(Overlap) | 41057.0 | 7943.3 |
| Sum | 28693.1 | 7332.1 |

So the official choice of TP=4 is indeed the higher-throughput configuration. The official figure is 73.7K tokens/s per node (including cache hits); the non-hit tokens that actually need computing average about 73.7K x (1 - 0.563) ≈ 32,207 tokens/s, which is in line with expectations once peak/off-peak effects are considered.

On the other hand, H20 hurts TTFT (time to first token) noticeably: with TP=4 it already exceeds 1 s, so TP=8 can be used to lower the first-token latency.

2.6 Overlap Analysis

The official deployment can overlap two micro-batches:

[Figure: prefill overlap diagram]

Annotating the prefill.json trace gives the following (in the actual trace some communication is still not overlapped):

[Figure: annotated prefill trace]

The key per-layer prefill times (ms) for TP=4 are listed below; the communication can be hidden behind the computation:

| | Layers | H800 | H20 |
|---|---|---|---|
| TP_MLA | 58 | 1.669 | 8.435 |
| Shared Expert | 58 | 0.088 | 0.550 |
| Combine | 58 | 1.057 | 1.057 |
| - | - | - | - |
| Routed Expert | 58 | 0.701 | 4.400 |
| Dispatch | 58 | 0.529 | 0.529 |

In particular, for H20 the RDMA scale-out bandwidth could also be reduced; a first-order estimate:

h20_32 = GPU_perf( sm = 78 ,comm_sm = 10,
                 fp16_flops = 118.4, fp8_flops = 236.8,
                 mem = 96,mem_bw = 3350,
                 nvlink_bw = 400,pcie_bw = 50,
                 discount_rate = 0.85)

h20_16 = GPU_perf( sm = 78 ,comm_sm = 10,
                 fp16_flops = 118.4, fp8_flops = 236.8,
                 mem = 96,mem_bw = 3350,
                 nvlink_bw = 400,pcie_bw = 25,
                 discount_rate = 0.85)

h20_8 = GPU_perf( sm = 78 ,comm_sm = 10,
                 fp16_flops = 118.4, fp8_flops = 236.8,
                 mem = 96,mem_bw = 3350,
                 nvlink_bw = 400,pcie_bw = 12.5,
                 discount_rate = 0.85)

gpu_h20 = dict({ 'H20-3.2T': h20_32, 'H20-1.6T': h20_16 , 'H20-800G': h20_8})
tp4_detail,tp4_sum = prefill_time_sum(args, gpu_h20, seq_len, kv_cache_rate,tp=4 , dp=8)
| | Layers | H20-3.2T | H20-1.6T | H20-800G |
|---|---|---|---|---|
| TP_MLA | 58 | 8.435 | 8.435 | 8.435 |
| Shared Expert | 58 | 0.550 | 0.550 | 0.550 |
| Dispatch | 58 | 0.529 | 1.057 | 2.115 |
| - | - | - | - | - |
| Routed Expert | 58 | 4.400 | 4.400 | 4.400 |
| Combine | 58 | 1.057 | 2.115 | 4.230 |

| | H20-3.2T | H20-1.6T | H20-800G |
|---|---|---|---|
| Sum(Overlap) | 934.839 | 934.839 | 934.839 |
| Sum | 1026.840 | 1118.841 | 1302.842 |

  • Note: if a large share of prefill requests are only around 1000-2000 tokens long, 1.6 Tbps to 3.2 Tbps of RDMA is still needed.

2.7 KVCache Calculation

Converting tokens/s into the total KVCache transfer volume:

dp = 8
num_node = 4

tp4_detail,tp4_sum = prefill_time_sum(args, gpu, seq_len, kv_cache_rate,tp=4 , dp=8)
kvcache_fp8 = tp4_sum.apply(lambda x: dp * seq_len * (1000/ x)/num_node * (args.kv_lora_rank + args.qk_rope_head_dim)/1024/1024)
kvcache_fp16 = kvcache_fp8 *2
kvcache=kvcache_fp8.join(kvcache_fp16, lsuffix='(FP8)',rsuffix='(FP16)')
print(kvcache.to_markdown(floatfmt=".1f"))
| GB/s | H800(FP8) | H20(FP8) | H800(FP16) | H20(FP16) |
|---|---|---|---|---|
| Sum(Overlap) | 28.7 | 5.2 | 57.4 | 10.3 |
| Sum | 18.5 | 4.7 | 37.1 | 9.4 |

  • Note: the KVCache hit rate is not considered here; once it is, the bandwidth should account for both the read and write directions.

For H800, if the KVCache is stored in FP16 this already exceeds the 400 Gbps (50 GB/s) bandwidth of the NIC attached to the CPU, so the transfer needs the GPU-direct RDMA scale-out network; with well-scheduled communication kernels and a properly implemented EP parallel scheme, the overall impact can be made nearly negligible.

3. Decoding Stage

The decoding cluster uses an 18-node deployment: routed experts EP144, MLA and shared experts DP144; 32 redundant routed experts; 2 routed experts plus 1 shared expert per GPU. The paper instead describes a 40-node deployment with EP320, 1 expert per GPU, TP=4 and DP=80. The decode stage does not need dedicated communication SMs, so the GPUs are modeled as:

h800 = GPU_perf( sm = 132 ,comm_sm = 0,
                 fp16_flops = 791.6, fp8_flops = 1583.2,
                 mem = 80,mem_bw = 3350,
                 nvlink_bw = 200,pcie_bw = 50,
                 discount_rate = 0.85)

h20 = GPU_perf( sm = 78 ,comm_sm = 0,
                 fp16_flops = 118.4, fp8_flops = 236.8,
                 mem = 96,mem_bw = 3350,
                 nvlink_bw = 400,pcie_bw = 50,
                 discount_rate = 0.85)
h20_3e = GPU_perf( sm = 78 ,comm_sm = 0,
                 fp16_flops = 118.4, fp8_flops = 236.8,
                 mem = 141,mem_bw = 4800,
                 nvlink_bw = 400,pcie_bw = 50,
                 discount_rate = 0.85)

gpu_decode = dict({'H800': h800, 'H20': h20,'H20_3e': h20_3e})
gpu_decode2 = dict({'H800': h800, 'H20': h20})

3.1 EP Strategy Analysis

In production the expert load is usually imbalanced, so enough redundant experts are needed for EPLB scheduling. Assuming at least 16 redundant experts, the common EP strategies are:

| | Redundant experts | Experts per GPU |
|---|---|---|
| EP34 | 16 | 8 |
| EP72 | 32 | 4 |
| EP144 | 32 | 2 |
| EP320 | 64 | 1 |

Estimated communication/compute token counts under the different parallel strategies:

class MoE_EP():
    def __init__(self, args: ModelArgs, ep_num, redundant_exp):
        self.ep_num = ep_num
        self.redundant_exp = redundant_exp
        self.dispatch_num = args.n_activated_experts
        self.n_routed_experts = args.n_routed_experts
        self.expert_num = (args.n_routed_experts + redundant_exp) / self.ep_num

    def expert_per_gpu(self):
        return self.expert_num

    def total_tokens(self, bs):
        return bs * self.ep_num

    def comm_tokens(self, bs):
        # on average each token has probability expert_num / n_routed_experts of being handled locally
        return bs * self.dispatch_num * (1 - self.expert_num / self.n_routed_experts)

    def compute_tokens(self, bs):
        # total tokens are bs * dispatch_num * ep_num, spread evenly across the ep_num GPUs
        return bs * self.dispatch_num

ep_dict = {'EP34': MoE_EP(args, 34, 16),
           'EP72': MoE_EP(args, 72, 32),
           'EP144': MoE_EP(args, 144, 32),
           'EP320': MoE_EP(args, 320, 64)}
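A quick usage sketch: for EP144 at a per-GPU batch size of 32, each GPU hosts 2 experts, computes 256 expert-token activations, and on average sends about 254 of those tokens off-GPU:

ep = ep_dict['EP144']
print(ep.expert_per_gpu())     # 2.0 experts per GPU
print(ep.compute_tokens(32))   # 256 routed expert-tokens computed per GPU
print(ep.comm_tokens(32))      # ~254 tokens dispatched away on average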

3.2 Memory Utilization Analysis

First derive the maximum batch size from the memory capacity. The decoding stage uses matrix-absorbed MLA. Excluding the MLA and expert parameters, the remaining parameters are:

671.16B - MLA (187.17M) x 61 - Expert (44.05M) x (256 routed + 1 shared) x 58 = 3.13B

which is about 3.13 x (1000/1024)^3 = 2.91 GB of memory (at one byte per FP8 parameter).

The batch size is computed as follows (decoding length 1210; the number of experts per GPU follows the EP strategy):

def _decoding_batchsize(args: ModelArgs, gpu: GPU_perf, seq_len, decode_len, tp, expert_num, absorb=True, kvcache_fp16=False):
    mem_util_rate = 0.9  # discount for torch/activation and other overheads
    mla = 598.14 if absorb else 187.17  # MLA parameters (in M)
    expert_mem = 44.05  # expert parameters (in M)
    others_parameter = 2.91  # other parameters, 2.91 GB
    kv_cache = (seq_len + decode_len) * (args.kv_lora_rank + args.qk_rope_head_dim) * args.n_layers * tp
    if kvcache_fp16:
        kv_cache *= 2
    mem = gpu.mem * mem_util_rate - others_parameter - mla * args.n_layers / tp / 1024
    mem -= expert_mem * (args.n_layers - args.n_dense_layers) * expert_num / 1024
    return mem * 1024 * 1024 * 1024 / kv_cache

def decode_batchsize(args: ModelArgs, gpu_dict, seq_len, decode_len, tp):
    df = pd.DataFrame(columns=['GPU', 'EP320', 'EP144', 'EP72', 'EP34'])
    for fp16_kvcache in range(0, 2):
        for key in gpu_dict.keys():
            for absorb in range(0, 2):
                item = key
                if bool(fp16_kvcache):
                    item += '_FP16'
                else:
                    item += '_FP8'
                if bool(absorb):
                    item += '_Absorb'
                value = [item]
                for exp_num in [2, 3, 5, 9]:  # experts per GPU including the shared expert
                    bs = _decoding_batchsize(args, gpu_dict[key], seq_len, decode_len, tp, exp_num, bool(absorb), bool(fp16_kvcache))
                    value.append(bs)
                df.loc[len(df)] = value
    print(df.set_index('GPU').to_markdown(floatfmt=".0f"))
    return df

decode_len = 1210
df = decode_batchsize(args,gpu_decode, seq_len,decode_len, tp=1)

(The original result table is omitted here.)

Key takeaways:

  • The more memory there is, the easier it is to fit a large enough batch size.
  • The number of routed experts per GPU must not exceed 8.
  • With matrix-absorbed MLA, H800 needs the KVCache stored in FP8 to support batchsize = 128.

3.3 MLA Time

Decoding uses matrix-absorbed MLA. Because the compute latency is low, the time to load the KVCache must also be counted.

bs_list = [32, 64, 128, 256]

def decode_mla_elapse_time(args: ModelArgs, gpu: GPU_perf, seq_len, bs, absorb=True):
    mla_flops_func = mla_matabsob_flops if absorb else mla_flops
    # during decoding q_len=1 and the KV projections hit the cache, so kv_cache_rate=1
    _, gemm_fp8_flops, attn_fp16_flops = mla_flops_func(1, seq_len, args, 1)

    gemm_fp8_time = gemm_fp8_flops / gpu.get_fp8_flops() * bs
    print("GEMM_FP8 Elapsed time(ms): %.3f" % gemm_fp8_time)
    attn_fp16_time = attn_fp16_flops / gpu.get_fp16_flops() * bs
    print("ATTN_FP16 Elapsed time(ms): %.3f" % attn_fp16_time)
    total_time = gemm_fp8_time + attn_fp16_time
    print("Total Elapsed time(ms):%.3f" % total_time)
    all_reduce_comm_size = seq_len * args.dim * 2 / 1024 / 1024  # fp16 takes 2 bytes
    ar_elapsed_time = all_reduce_comm_size / gpu.get_nvlink_bw()
    print("AR Elapsed time(ms):%.3f" % ar_elapsed_time)
    tp4_time = total_time / 4 + ar_elapsed_time
    print("TP4 Elapsed time(ms):%.3f" % tp4_time)
    tp8_time = total_time / 8 + ar_elapsed_time
    print("TP8 Elapsed time(ms):%.3f" % tp8_time)
    return total_time, tp4_time, tp8_time

def decode_kvcache_load_time(args: ModelArgs, gpu: GPU_perf, seq_len, bs):
    kv_cache = seq_len * (args.kv_lora_rank + args.qk_rope_head_dim) * bs
    load_kv_time = kv_cache / 1024 / 1024 / 1024 / gpu.get_mem_bw() * 1000
    return load_kv_time

def decode_mla(args: ModelArgs, gpu_dict, seq_len, absorb=True):
    df = pd.DataFrame(columns=['GPU', 'BatchSize', 'TP1', 'TP4', 'TP8', 'LoadKV_FP8', 'LoadKV_FP16'])
    for key in gpu_dict.keys():
        for bs in bs_list:
            tp1, tp4, tp8 = decode_mla_elapse_time(args, gpu_dict[key], seq_len, bs, absorb)
            kv = decode_kvcache_load_time(args, gpu_dict[key], seq_len, bs)
            df.loc[len(df)] = [key, bs, tp1, tp4, tp8, kv, kv * 2]
    df['BatchSize'] = df['BatchSize'].astype(int).astype(str)
    print(df.set_index('GPU').to_markdown(floatfmt=".3f"))
    return df

(The original result table is omitted here.)

Key takeaways:

  • Both H800 and H20 need matrix-absorbed MLA in the decoding stage.
  • H800: at batchsize=128, TP=4 and TP=1 take about the same time, and at batchsize=64 TP=1 is even faster, so TP is not required in the EP144 deployment.
  • H20: MLA must use TP parallelism; but too large a TP increases the KVCache overhead and limits the batch size, so overall TP=4 is the best choice.

3.4 DenseMLP Time

def decode_dense_mlp(args: ModelArgs, gpu_dict):
    df = pd.DataFrame(columns=['GPU', 'BatchSize', 'DenseMLP'])
    for key in gpu_dict.keys():
        for bs in bs_list:
            t = dense_mlp_elapse_time(args, gpu_dict[key], bs)
            df.loc[len(df)] = [key, bs, t]
    df['BatchSize'] = df['BatchSize'].astype(int).astype(str)
    print(df.set_index('GPU').to_markdown(floatfmt=".3f"))
    return df

decode_dense_mlp(args, gpu_decode)

(The original result table is omitted here.)

3.5 MoE Compute Time

Simplification: every token is dispatched as 8 copies to other nodes. Since GroupGEMM is hard to saturate at small batch sizes, a derating factor of 0.7 is applied based on DeepGEMM.

def _moe_expert_time(args: ModelArgs, gpu: GPU_perf, bs):
    group_gemm_discount_rate = 0.7
    shared_flops = moe_expert_flops(args, bs)
    shared_time = shared_flops / gpu.get_fp8_flops() / group_gemm_discount_rate

    num_routed_token = bs * args.n_activated_experts
    routed_flops = moe_expert_flops(args, num_routed_token)
    routed_time = routed_flops / gpu.get_fp8_flops() / group_gemm_discount_rate
    return shared_time, routed_time

(The original result table is omitted here.)

3.6 AlltoAll Communication Time

With IBGDA, the RDMA transfers go out directly, so the calculation mainly considers the GPU's PCIe bandwidth:

def _moe_a2a(args: ModelArgs, gpu: GPU_perf, bs):
    dispatch_size = bs * args.dim * args.n_activated_experts / 1024 / 1024
    combine_size = dispatch_size * 2  # FP16
    dispatch_t = dispatch_size / gpu.get_pcie_bw()
    combine_t = combine_size / gpu.get_pcie_bw()
    return dispatch_t, combine_t

(The original result table is omitted here.)
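The per-GPU wrappers moe_expert_time and decode_a2a, which _decoding_time below relies on, are not shown in the original article; a minimal sketch, assuming they follow the same batch-size loop pattern as decode_dense_mlp and produce the column names consumed later:

def moe_expert_time(args: ModelArgs, gpu_dict):
    df = pd.DataFrame(columns=['GPU', 'BatchSize', 'SharedExpert', 'RoutedExpert'])
    for key in gpu_dict.keys():
        for bs in bs_list:
            s, r = _moe_expert_time(args, gpu_dict[key], bs)
            df.loc[len(df)] = [key, bs, s, r]
    df['BatchSize'] = df['BatchSize'].astype(int).astype(str)
    return df

def decode_a2a(args: ModelArgs, gpu_dict):
    df = pd.DataFrame(columns=['GPU', 'BatchSize', 'Dispatch', 'Combine'])
    for key in gpu_dict.keys():
        for bs in bs_list:
            d, c = _moe_a2a(args, gpu_dict[key], bs)
            df.loc[len(df)] = [key, bs, d, c]
    df['BatchSize'] = df['BatchSize'].astype(int).astype(str)
    return df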

3.7 Total Time

Function that assembles the total-time table (following the original article's logic; the subsequent calculation only covers H800/H20):

from functools import reduce

def _decoding_time(args: ModelArgs, gpu: GPU_perf, seq_len):
    mla = decode_mla(args, gpu, seq_len)
    dense_mlp = decode_dense_mlp(args, gpu)
    moe = moe_expert_time(args, gpu)
    a2a = decode_a2a(args, gpu)
    dfs = [mla, dense_mlp, moe, a2a]
    df = reduce(lambda left, right: pd.merge(left, right, on=['GPU', 'BatchSize'], how='left'), dfs)
    print(df.set_index('GPU').T.to_markdown(floatfmt=".3f"))
    return df

dfs = _decoding_time(args,gpu_decode2,seq_len)

Then correct the estimate with the best TP strategy and compute TPOT:

def decoding_time(args: ModelArgs, gpu_dict, seq_len):
    df = _decoding_time(args, gpu_dict, seq_len)

    def mla_tp(r):
        if r['TP1'] > r['TP4']:
            if r['GPU'].find('H20_3e') != -1:
                return 'TP8'
            else:
                return 'TP4'
        else:
            return 'TP1'

    def mla_tp2(r):
        tp = r['MLA_TP']
        return r[tp]

    # use the best TP strategy for the estimate
    df['MLA_TP'] = df.apply(lambda row: mla_tp(row), axis=1)
    df['SparseMLA'] = df.apply(lambda row: mla_tp2(row), axis=1)

    # adjust the TP execution time; KVCache loads are assumed to be FP8
    df['DenseMLA'] = df['TP1'] + df['LoadKV_FP8']
    df['SparseMLA'] = df['SparseMLA'] + df['LoadKV_FP8']

    df['TPOT(Overlap)'] = (df['DenseMLA'] + df['DenseMLP']) * args.n_dense_layers
    df['TPOT(Overlap)'] += (df['SparseMLA'] + df['SharedExpert'] + df['RoutedExpert']) * (args.n_layers - args.n_dense_layers)
    df['TPOT'] = df['TPOT(Overlap)'] + (df['Dispatch'] + df['Combine']) * (args.n_layers - args.n_dense_layers)
    df['GPU'] = df['GPU'] + "(" + df['MLA_TP'] + ")"
    df = df[['GPU', 'BatchSize', 'DenseMLA', 'DenseMLP', 'SparseMLA', 'Combine', 'SharedExpert', 'RoutedExpert', 'Dispatch', 'TPOT(Overlap)', 'TPOT']]
    df['TPS_O'] = 1000 / df['TPOT(Overlap)']
    df['TPS'] = 1000 / df['TPOT']
    df['Total_O'] = df['TPS_O'] * df['BatchSize'].astype(int)
    df['Total'] = df['TPS'] * df['BatchSize'].astype(int)
    print(df.set_index('GPU').T.to_markdown(floatfmt=".3f"))
    return df

dfs= decoding_time(args,gpu_decode,seq_len)

(The original result table is omitted here.)

Filtering for TPS > 20, the key conclusions are:

  • H800: BatchSize <= 128 is required to keep per-user TPS > 20; the peak is about 3246 tokens/s per GPU, and the online average is about 1850 tokens/s (accounting for peak/off-peak effects and the non-overlappable latency caused by expert-load imbalance).
  • H20: BatchSize <= 32 is required for TPS > 20, giving about 900 tokens/s.
  • H20_3e: with more memory it can run BatchSize <= 64 with TP=8 while keeping TPS > 20, for a throughput of about 1306 tokens/s.

3.8 Overlap Analysis

The official traces do not include decode details; only an overlap diagram is provided:

[Figure: decode overlap diagram]

From the summary table, Combine is much smaller than Attention, which is why the official scheme splits Attention to overlap it. The TimeBudget is evaluated as follows (the original article's code is kept unchanged):

dfo=dfs[dfs['TPS_O']>20]
dfo['TimeBudget'] =  dfo['SparseMLA'] + dfo['SharedExpert'] - (dfo['Dispatch']+dfo['Combine'])
print(dfo[['GPU','BatchSize','TimeBudget']].set_index('GPU').to_markdown(floatfmt=".3f"))

Conclusion: H800 has very little communication headroom, which is why IBGDA is needed; H20 still has some headroom, but an 800G instance cannot meet the requirement while 1.6T leaves plenty of margin; H20_3e, to preserve performance at larger batch sizes, still needs a 3.2T network, which is a cost trade-off. Also, from the time-budget perspective, static latencies such as the NIC's are negligible.

4. Summary

Starting from the compute, memory-bandwidth and network-bandwidth constraints, this article reverse-engineered DeepSeek-R1 inference performance on H800 and H20. The conclusions:

  • H800: the best deployment essentially matches the official EP144 scheme, and the analysis agrees well with the official data; the overlap time budget also shows that IBGDA is required to keep latency down.
  • H20: the compute constraint makes MLA slow, so TP parallelism is required to speed it up; but too large a TP limits the batch size because of KVCache usage. Here H20-3E (141 GB) shows an extra performance benefit, and the interconnect-bandwidth assessment shows that with a well-implemented EP parallel scheme, 1.6 Tbps of bandwidth is sufficient.


References

[1] DeepSeek-V3 / R1 Inference System Overview (DeepSeek official)
[2] DeepSeek V3/R1 Inference Efficiency Analysis (2): Reverse-Engineering the Full DeepSeek Deployment