1. Overview
1.1 Background
Last month our e-commerce platform ran a major promotion and traffic hit 10x the usual level. The Nginx Ingress Controller was the first thing to buckle: request latency spiked to several seconds and some requests timed out outright. The post-mortem showed the Ingress layer was the bottleneck, and its configuration was essentially the defaults with no tuning at all.
After that incident I spent two weeks digging into Kubernetes Ingress performance tuning, and ran comparative benchmarks against Envoy Gateway along the way. This article shares the lessons and pitfalls from that process: deep tuning of the Nginx Ingress Controller, and the scenarios where switching to Envoy Gateway is worth considering.
1.2 Technical Characteristics
Nginx Ingress Controller:
- Mature and stable, with a large community
- Intuitive configuration that ops teams already know
- Tuned via ConfigMap entries and Annotations
Envoy Gateway:
- Cloud-native design built on the Gateway API
- Dynamic configuration, no restarts needed
- Rich advanced traffic-management features
- Better observability
1.3 Applicable Scenarios
- Scenario 1: An existing Nginx Ingress has hit a performance ceiling and needs deep tuning
- Scenario 2: Choosing between Nginx and Envoy for a new cluster
- Scenario 3: Advanced traffic management is required (canary releases, traffic mirroring, etc.)
- Scenario 4: Observability requirements are high
1.4 Environment Requirements
| Component | Version Requirement | Notes |
| --- | --- | --- |
| Kubernetes | 1.25+ | Gateway API needs a recent version |
| Nginx Ingress Controller | 1.9+ | Latest stable release recommended |
| Envoy Gateway | 1.0+ | Already GA |
| Load-testing tools | wrk/hey/k6 | For benchmarking |
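Before starting, it's worth confirming the cluster and tooling meet these versions (assuming kubectl and helm are already installed):
# Check cluster and tool versions
kubectl version
helm version --short
# Check whether the Gateway API CRDs are already present
kubectl get crd gatewayclasses.gateway.networking.k8s.io 2>/dev/null || echo "Gateway API CRDs not installed yet"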
2. Detailed Steps
2.1 Preparation
◆ 2.1.1 Baseline Test Environment
Before tuning anything, establish baseline numbers:
# Deploy a test backend service
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
name: echo-server
namespace: default
spec:
replicas: 10
selector:
matchLabels:
app: echo-server
template:
metadata:
labels:
app: echo-server
spec:
containers:
- name: echo
image: ealen/echo-server:latest
ports:
- containerPort: 80
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "256Mi"
---
apiVersion: v1
kind: Service
metadata:
name: echo-server
namespace: default
spec:
selector:
app: echo-server
ports:
- port: 80
targetPort: 80
EOF
◆ 2.1.2 Load-Testing Tools
# Install hey (a Go-based load generator)
# macOS
brew install hey
# Linux
wget https://hey-release.s3.us-east-2.amazonaws.com/hey_linux_amd64
chmod +x hey_linux_amd64
sudo mv hey_linux_amd64 /usr/local/bin/hey
# Install wrk
# Ubuntu
sudo apt install -y wrk
# Baseline test command (note: hey's -q is a rate limit per worker,
# so the overall cap here is 200 workers * 1000 QPS each)
hey -n 100000 -c 200 -q 1000 http://your-ingress-domain/
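If you prefer wrk, a roughly equivalent baseline run (wrk has no built-in rate limit, so it measures maximum throughput rather than latency at a fixed QPS):
# 8 threads, 200 connections, 60 seconds, with a latency distribution
wrk -t 8 -c 200 -d 60s --latency http://your-ingress-domain/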
2.2 Deep Tuning of the Nginx Ingress Controller
◆ 2.2.1 Deploying the Nginx Ingress Controller
# Deploy with Helm
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm repo update
# Dump the configurable values
helm show values ingress-nginx/ingress-nginx > nginx-ingress-values.yaml
◆ 2.2.2 Core Tuning Configuration
# nginx-ingress-values.yaml
# Nginx Ingress Controller tuning configuration
controller:
  # Replica count - at least 2 in production
  replicaCount: 3
  # Resource configuration
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"
  # Anti-affinity - spread replicas across nodes
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
              - ingress-nginx
          topologyKey: kubernetes.io/hostname
  # hostNetwork avoids one network hop (optional; weigh the security implications)
  # hostNetwork: true
  # Service type
  service:
    type: LoadBalancer
    externalTrafficPolicy: Local # preserve the client source IP
  # HPA configuration
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 10
    targetCPUUtilizationPercentage: 70
    targetMemoryUtilizationPercentage: 80
  # ConfigMap settings - the core tuning knobs
  config:
    # Number of worker processes ("auto" = one per CPU core)
    worker-processes: "auto"
    # Maximum connections per worker
    max-worker-connections: "65535"
    # Pin worker processes to CPUs
    worker-cpu-affinity: "auto"
    # Keep-alive settings
    keep-alive: "75"
    keep-alive-requests: "10000"
    upstream-keepalive-connections: "320"
    upstream-keepalive-timeout: "60"
    upstream-keepalive-requests: "10000"
    # Proxy buffers
    proxy-buffer-size: "16k"
    proxy-buffers-number: "4"
    proxy-body-size: "50m"
    # Timeouts
    proxy-connect-timeout: "10"
    proxy-read-timeout: "60"
    proxy-send-timeout: "60"
    # Enable gzip compression
    use-gzip: "true"
    gzip-level: "4"
    gzip-min-length: "1000"
    gzip-types: "application/json application/javascript text/css text/plain application/xml"
    # Access log format
    log-format-upstream: '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_length $request_time [$proxy_upstream_name] [$proxy_alternative_upstream_name] $upstream_addr $upstream_response_length $upstream_response_time $upstream_status $req_id'
    # Enable HTTP/2
    use-http2: "true"
    # TLS tuning
    ssl-protocols: "TLSv1.2 TLSv1.3"
    ssl-ciphers: "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384"
    ssl-session-cache: "true"
    ssl-session-cache-size: "10m"
    ssl-session-timeout: "10m"
    ssl-session-tickets: "true"
    # Load-balancing algorithm
    load-balance: "ewma" # exponentially weighted moving average; adapts to upstream latency, unlike round_robin
    # Rate-limit responses
    limit-req-status-code: "429"
    limit-conn-status-code: "429"
  # Metrics
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
      namespace: monitoring
# Apply the configuration
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx \
  --create-namespace \
  -f nginx-ingress-values.yaml
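After the upgrade, it's worth confirming the new parameters actually landed in the rendered config. A quick check, assuming the default release name ingress-nginx (which yields a deployment called ingress-nginx-controller):
# Wait for the rollout, then grep the rendered nginx.conf for the tuned values
kubectl rollout status deployment/ingress-nginx-controller -n ingress-nginx
kubectl exec -n ingress-nginx deploy/ingress-nginx-controller -- \
  nginx -T 2>/dev/null | grep -E "worker_processes|worker_connections|keepalive"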
◆ 2.2.3 Per-Ingress Annotation Tuning
# High-performance Ingress example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: high-performance-ingress
  namespace: default
  annotations:
    # Load-balancing algorithm
    nginx.ingress.kubernetes.io/load-balance: "ewma"
    # Note: upstream keepalive and gzip are global settings - configure them
    # via the ConfigMap keys shown above (upstream-keepalive-*, use-gzip);
    # ingress-nginx does not expose per-Ingress annotations for them
    # Proxy buffers
    nginx.ingress.kubernetes.io/proxy-buffer-size: "16k"
    nginx.ingress.kubernetes.io/proxy-buffers-number: "4"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    # Timeouts
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "10"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    # Rate limiting (optional)
    nginx.ingress.kubernetes.io/limit-rps: "1000"
    nginx.ingress.kubernetes.io/limit-connections: "100"
    # Retry/failover behavior
    nginx.ingress.kubernetes.io/proxy-next-upstream: "error timeout http_502 http_503 http_504"
    nginx.ingress.kubernetes.io/proxy-next-upstream-tries: "3"
spec:
ingressClassName: nginx
rules:
- host: api.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: echo-server
port:
number: 80
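Once the Ingress is admitted, a quick smoke test through the load balancer confirms the path end to end. This assumes the default controller service name and uses an explicit Host header in case DNS isn't pointed at the LB yet:
# Resolve the LB IP and send one request with the expected Host header
INGRESS_IP=$(kubectl get svc -n ingress-nginx ingress-nginx-controller \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
  -H "Host: api.example.com" "http://$INGRESS_IP/"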
2.3 Deploying and Configuring Envoy Gateway
◆ 2.3.1 Installing Envoy Gateway
# Install the Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.0.0/standard-install.yaml
# Install Envoy Gateway (the chart is published to an OCI registry)
helm install envoy-gateway oci://docker.io/envoyproxy/gateway-helm \
  --version v1.0.0 \
  --namespace envoy-gateway-system \
  --create-namespace
# Verify the installation
kubectl get pods -n envoy-gateway-system
◆ 2.3.2 Configuring the GatewayClass and Gateway
# gateway-class.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
name: envoy-gateway
spec:
controllerName: gateway.envoyproxy.io/gatewayclass-controller
parametersRef:
group: gateway.envoyproxy.io
kind: EnvoyProxy
name: envoy-proxy-config
namespace: envoy-gateway-system
---
# EnvoyProxy configuration - performance tuning
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: EnvoyProxy
metadata:
name: envoy-proxy-config
namespace: envoy-gateway-system
spec:
provider:
type: Kubernetes
kubernetes:
envoyDeployment:
replicas: 3
container:
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "2"
memory: "2Gi"
pod:
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app.kubernetes.io/name
operator: In
values:
- envoy
topologyKey: kubernetes.io/hostname
envoyService:
type: LoadBalancer
bootstrap:
type: Merge
value: |
layered_runtime:
layers:
- name: static_layer
static_layer:
overload:
global_downstream_max_connections: 50000
---
# Gateway resource
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
name: high-performance-gateway
namespace: default
spec:
gatewayClassName: envoy-gateway
listeners:
- name: http
port: 80
protocol: HTTP
allowedRoutes:
namespaces:
from: Same
- name: https
port: 443
protocol: HTTPS
tls:
mode: Terminate
certificateRefs:
- name: tls-secret
allowedRoutes:
namespaces:
from: Same
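After applying these manifests, the Gateway should report a Programmed condition once Envoy Gateway has provisioned the proxy fleet; a minimal check:
kubectl apply -f gateway-class.yaml
kubectl get gateway high-performance-gateway -n default
# Block until the Gateway is ready (or the timeout expires)
kubectl wait --for=condition=Programmed gateway/high-performance-gateway -n default --timeout=120s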
◆ 2.3.3 Configuring an HTTPRoute
# http-route.yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: echo-route
namespace: default
spec:
parentRefs:
- name: high-performance-gateway
namespace: default
hostnames:
- "api.example.com"
rules:
- matches:
- path:
type: PathPrefix
value: /
backendRefs:
- name: echo-server
port: 80
weight: 100
  # Timeout configuration (HTTPRoute timeouts graduated to the standard channel in Gateway API v1.1; on v1.0 they require the experimental channel)
timeouts:
request: 30s
backendRequest: 25s
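To verify the route, check its status and send a test request through the generated Envoy service. The label-based lookup below matches how Envoy Gateway tags the resources it creates, but double-check the names in your cluster:
# Route status as reported by the Gateway
kubectl get httproute echo-route -n default -o jsonpath='{.status.parents[0].conditions}'
# Find the Envoy LoadBalancer IP and send a request
ENVOY_IP=$(kubectl get svc -n envoy-gateway-system \
  -l gateway.envoyproxy.io/owning-gateway-name=high-performance-gateway \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
curl -s -H "Host: api.example.com" "http://$ENVOY_IP/" | head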
◆ 2.3.4 Advanced Traffic Management
# Canary release configuration
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: canary-route
namespace: default
spec:
parentRefs:
- name: high-performance-gateway
hostnames:
- "api.example.com"
rules:
- matches:
- path:
type: PathPrefix
value: /api
backendRefs:
    # 90% of traffic to the stable version
    - name: api-stable
      port: 80
      weight: 90
    # 10% of traffic to the canary version
- name: api-canary
port: 80
weight: 10
---
# Header-based traffic routing
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
name: header-based-route
namespace: default
spec:
parentRefs:
- name: high-performance-gateway
hostnames:
- "api.example.com"
rules:
  # Route test traffic to the test environment
- matches:
- headers:
- name: X-Test-User
value: "true"
backendRefs:
- name: api-test
port: 80
  # Route normal traffic to production
- matches:
- path:
type: PathPrefix
value: /
backendRefs:
- name: api-prod
port: 80
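A rough way to sanity-check the 90/10 split is to send a batch of requests and count which backend answered. This sketch assumes api-stable and api-canary return a distinguishable body (the "version" field here is hypothetical) and reuses the $ENVOY_IP lookup from 2.3.3:
# Send 100 requests and tally the versions that answered
for i in $(seq 1 100); do
  curl -s -H "Host: api.example.com" "http://$ENVOY_IP/api" | grep -o '"version":"[^"]*"'
done | sort | uniq -c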
3. Example Code and Configuration
3.1 Performance Test Script
#!/bin/bash
# performance-test.sh
# Ingress performance test script
TARGET_URL=${1:-"http://your-ingress-domain/"}
CONNECTIONS=${2:-200}
REQUESTS=${3:-100000}
QPS=${4:-2000}
echo "=========================================="
echo "Ingress performance test"
echo "Target: $TARGET_URL"
echo "Concurrency: $CONNECTIONS"
echo "Requests: $REQUESTS"
echo "Rate limit (per worker): $QPS"
echo "=========================================="
# Warm up before measuring
echo ""
echo "[1/4] Warming up..."
hey -n 1000 -c 10 "$TARGET_URL" > /dev/null 2>&1
# Main run
echo "[2/4] Running the main test..."
RESULT=$(hey -n "$REQUESTS" -c "$CONNECTIONS" -q "$QPS" "$TARGET_URL")
echo "[3/4] Test finished, results:"
echo ""
echo "$RESULT"
# Extract key metrics
echo ""
echo "[4/4] Key metrics summary:"
echo "----------------------------------------"
echo "$RESULT" | grep "Requests/sec:"
echo "$RESULT" | grep "Average:"
echo "$RESULT" | grep "Fastest:"
echo "$RESULT" | grep "Slowest:"
echo "$RESULT" | grep "99%"
# Save the full report
REPORT_FILE="perf_report_$(date +%Y%m%d_%H%M%S).txt"
echo "$RESULT" > "$REPORT_FILE"
echo ""
echo "Full report saved to: $REPORT_FILE"
3.2 Full Tuning Configuration Comparison
◆ 3.2.1 Configuration Before and After Tuning
# Before tuning (defaults)
controller:
  replicaCount: 1
  resources: {} # no resource limits
  config: {} # default parameters
# After tuning
controller:
replicaCount: 3
resources:
requests:
cpu: "500m"
memory: "512Mi"
limits:
cpu: "2"
memory: "2Gi"
config:
worker-processes: "auto"
max-worker-connections: "65535"
upstream-keepalive-connections: "320"
upstream-keepalive-timeout: "60"
use-gzip: "true"
load-balance: "ewma"
◆ 3.2.2 Performance Test Results
Test conditions: 200 concurrent connections, 100,000 requests
| Metric | Nginx Ingress (default) | Nginx Ingress (tuned) | Envoy Gateway |
| --- | --- | --- | --- |
| Requests/sec | 2,847 | 8,234 (+189%) | 9,156 (+221%) |
| Average latency | 68.32 ms | 23.45 ms (-66%) | 21.12 ms (-69%) |
| P99 latency | 234.56 ms | 89.12 ms (-62%) | 76.34 ms (-67%) |
| Error rate | 0.2% | 0% | 0% |
3.3 Monitoring and Alerting
◆ 3.3.1 Prometheus Alerting Rules
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: ingress-alerts
namespace: monitoring
spec:
groups:
- name: ingress-nginx
rules:
    # Request latency too high
    - alert: IngressHighLatency
      expr: |
        histogram_quantile(0.95,
          sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{status!~"5.."}[5m])) by (le, ingress)
        ) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Ingress latency is high"
        description: "Ingress {{ $labels.ingress }} P95 latency is {{ $value }}s"
    # 5xx error rate too high
    - alert: IngressHighErrorRate
      expr: |
        sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress)
        /
        sum(rate(nginx_ingress_controller_requests[5m])) by (ingress)
        > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Ingress 5xx error rate is high"
        description: "Ingress {{ $labels.ingress }} 5xx error rate is {{ $value | humanizePercentage }}"
    # Connection count too high
    - alert: IngressHighConnections
      expr: |
        sum(nginx_ingress_controller_nginx_process_connections) by (instance) > 10000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Ingress connection count is high"
        description: "Instance {{ $labels.instance }} has {{ $value }} connections"
    # Controller restart detection: a counter reset implies the process restarted
    # (increase() never goes negative, so checking resets() is the reliable signal)
    - alert: IngressControllerRestart
      expr: |
        resets(nginx_ingress_controller_nginx_process_connections_total[5m]) > 0
      for: 1m
      labels:
        severity: warning
      annotations:
        summary: "Ingress Controller may have restarted"
        description: "A counter reset was detected, which usually means the controller restarted"
  - name: envoy-gateway
    rules:
    # Envoy upstream errors
    - alert: EnvoyUpstreamErrors
      expr: |
        sum(rate(envoy_cluster_upstream_rq_xx{envoy_response_code_class="5"}[5m])) by (envoy_cluster_name)
        /
        sum(rate(envoy_cluster_upstream_rq_total[5m])) by (envoy_cluster_name)
        > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Envoy upstream error rate is high"
        description: "Cluster {{ $labels.envoy_cluster_name }} 5xx error rate is {{ $value | humanizePercentage }}"
    # Envoy connection pool exhausted
    - alert: EnvoyConnectionPoolExhausted
      expr: |
        envoy_cluster_upstream_cx_pool_overflow > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Envoy connection pool overflow"
        description: "Cluster {{ $labels.envoy_cluster_name }} is overflowing its connection pool"
◆ 3.3.2 Key Grafana Dashboard Metrics
{
"title": "Ingress Performance Dashboard",
"panels": [
{
"title": "请求 QPS",
"type": "graph",
"targets": [
{
"expr": "sum(rate(nginx_ingress_controller_requests[1m])) by (ingress)",
"legendFormat": "{{ingress}}"
}
]
},
{
"title": "P95 延迟",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le, ingress))",
"legendFormat": "{{ingress}}"
}
]
},
{
"title": "错误率",
"type": "graph",
"targets": [
{
"expr": "sum(rate(nginx_ingress_controller_requests{status=~\"5..\"}[5m])) by (ingress) / sum(rate(nginx_ingress_controller_requests[5m])) by (ingress)",
"legendFormat": "{{ingress}}"
}
]
},
{
"title": "活跃连接数",
"type": "graph",
"targets": [
{
"expr": "sum(nginx_ingress_controller_nginx_process_connections{state=\"active\"}) by (instance)",
"legendFormat": "{{instance}}"
}
]
}
]
}
4. Best Practices and Caveats
4.1 Best Practices
◆ 4.1.1 Resource Planning
- CPU: match the Nginx worker process count to the CPU cores available to the Pod
# Assuming the Pod is limited to 2 cores
controller:
  resources:
    limits:
      cpu: "2"
  config:
    worker-processes: "2" # or "auto"
    worker-cpu-affinity: "auto"
Memory estimation formula:
base memory + (max connections * per-connection memory) + (buffer count * buffer size * concurrent requests)
Example:
512MB base + (65535 * 8KB) + (4 * 16KB * 10000) ≈ 1.6GB
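A quick back-of-envelope check of that example in shell (all sizes in KB):
# 512MB base + 65535 connections * 8KB + 4 buffers * 16KB * 10000 requests
echo "$(( (512*1024 + 65535*8 + 4*16*10000) / 1024 )) MB"  # prints 1649 MB, i.e. ~1.6GB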
◆ 4.1.2 Connection Pool Tuning
# Nginx Ingress upstream connection pool
config:
  # Idle keepalive connections per upstream
  upstream-keepalive-connections: "320"
  # Idle connection timeout
  upstream-keepalive-timeout: "60"
  # Max requests per connection
  upstream-keepalive-requests: "10000"
# Envoy connection limits (via BackendTrafficPolicy; Envoy Gateway exposes
# these knobs as circuitBreaker/timeout fields in the v1alpha1 API)
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: connection-pool-policy
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: echo-route
  circuitBreaker:
    maxConnections: 1024
    maxPendingRequests: 1024
    maxParallelRequests: 1024
  timeout:
    tcp:
      connectTimeout: 10s
◆ 4.1.3 Timeout Configuration Guidelines
# Layered timeout strategy
# Principle: upstream timeout < Ingress timeout < client timeout
# Assuming the backend service times out at 20s, a sensible configuration:
config:
  proxy-connect-timeout: "5" # connection establishment timeout
  proxy-read-timeout: "25" # read timeout (slightly longer than the backend's)
  proxy-send-timeout: "25" # send timeout
# The client (caller) should then use a 30s timeout
4.2 Caveats
◆ 4.2.1 Common Problems
| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| 502 Bad Gateway | Backend unavailable or timing out | Check backend Pod status; adjust timeouts |
| 503 Service Unavailable | Upstream connection pool exhausted | Increase keepalive connections |
| 504 Gateway Timeout | Request timed out | Raise timeouts or speed up the backend |
| High latency with normal CPU/memory | Connection setup overhead | Enable keepalive to increase connection reuse |
| Mass 5xx under high concurrency | Too few worker processes | Raise worker process and connection limits |
◆ 4.2.2 Production Checklist
- [ ] At least 3 Ingress Controller replicas
- [ ] Resource requests and limits configured
- [ ] Pod anti-affinity enabled
- [ ] HPA autoscaling configured
- [ ] Upstream keepalive configured
- [ ] Timeouts set sensibly
- [ ] Monitoring and alerting in place
- [ ] Regular load tests scheduled
◆ 4.2.3 Choosing Between Nginx and Envoy
Pick Nginx Ingress when:
- The team knows Nginx well and has deep operational experience with it
- Requirements are relatively simple, with no advanced traffic management
- The existing architecture is stable and you don't want major changes
Pick Envoy Gateway when:
- You need canary releases, traffic mirroring, and similar advanced features
- Observability matters and you want detailed metrics
- It's a new cluster and you can start fresh
- You plan to standardize on the Gateway API
5. Troubleshooting and Monitoring
5.1 Troubleshooting
◆ 5.1.1 Log Analysis
# Nginx Ingress Controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -f
# Filter for errors
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx | grep -E "(error|warn|5[0-9]{2})"
# Envoy Gateway logs
kubectl logs -n envoy-gateway-system -l app.kubernetes.io/name=envoy -f
# Envoy access logs (if enabled; the generated proxy pods live in envoy-gateway-system)
kubectl logs -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-name=high-performance-gateway -f
◆ 5.1.2 Real-Time Monitoring Commands
# Nginx real-time status (requires stub_status, exposed on the controller's internal port)
kubectl exec -n ingress-nginx -it $(kubectl get pods -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx -o jsonpath='{.items[0].metadata.name}') -- curl localhost:10246/nginx_status
# Sample output:
# Active connections: 234
# server accepts handled requests
# 15234567 15234567 48234567
# Reading: 0 Writing: 12 Waiting: 222
# Envoy admin interface (it listens on 127.0.0.1:19000 inside the proxy pod,
# so port-forward to a pod rather than the service; pod names carry a generated hash)
kubectl port-forward -n envoy-gateway-system \
  $(kubectl get pods -n envoy-gateway-system -l gateway.envoyproxy.io/owning-gateway-name=high-performance-gateway -o jsonpath='{.items[0].metadata.name}') 19000:19000
# Then browse http://localhost:19000/stats for metrics
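With the port-forward active, a few admin endpoints I find useful during tuning (these are standard Envoy admin API paths):
# Active connections and in-flight requests per cluster
curl -s localhost:19000/clusters | grep -E "cx_active|rq_active" | head
# Listener/connection-level counters
curl -s localhost:19000/stats | grep -E "downstream_cx_active|upstream_rq_pending"
# Dump the full effective configuration for offline inspection
curl -s localhost:19000/config_dump > envoy-config.json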
◆ 5.1.3 Common Troubleshooting Scenarios
Symptom: requests time out but the backend looks healthy
# 1. Check connectivity from the Ingress to the Service
kubectl exec -n ingress-nginx -it <nginx-pod> -- curl -v http://<service-name>.<namespace>.svc.cluster.local
# 2. Check that backend Pods are ready
kubectl get endpoints <service-name> -n <namespace>
# 3. Inspect the Ingress configuration
kubectl describe ingress <ingress-name> -n <namespace>
# 4. Verify the rendered Nginx configuration
kubectl exec -n ingress-nginx -it <nginx-pod> -- cat /etc/nginx/nginx.conf | grep -A 20 "upstream"
Symptom: connections are refused
# 1. Check the Ingress Controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx | grep "connection refused"
# 2. Check backend Pod resource usage
kubectl top pods -n <namespace>
# 3. Check the connection limit
kubectl exec -n ingress-nginx -it <nginx-pod> -- cat /etc/nginx/nginx.conf | grep "worker_connections"
5.2 Performance Monitoring
◆ 5.2.1 Key Metrics
# Key Nginx Ingress metrics
# QPS
sum(rate(nginx_ingress_controller_requests[1m]))
# Latency quantiles
histogram_quantile(0.50, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.95, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le))
histogram_quantile(0.99, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le))
# Error rate
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) / sum(rate(nginx_ingress_controller_requests[5m]))
# Connection counts
nginx_ingress_controller_nginx_process_connections{state="active"}
nginx_ingress_controller_nginx_process_connections{state="writing"}
nginx_ingress_controller_nginx_process_connections{state="waiting"}
◆ 5.2.2 Capacity Planning
# Estimate capacity from monitoring data
# Assume the current setup:
# - 3 Ingress Pods
# - 2 cores / 2GB each
# - current load 5000 QPS at 60% CPU
# Estimated maximum QPS
# max QPS = current QPS * (100% / 60%) = 5000 * 1.67 ≈ 8333
# If traffic is expected to double to 10000 QPS:
# pods needed = ceil(10000 / 8333 * 3) = 4
# Keep ~30% headroom:
# actual count = ceil(4 * 1.3) = ceil(5.2) -> 6 Pods
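The same arithmetic as a small shell sketch, so the rule is easy to re-run with your own numbers (all inputs are the assumptions listed above):
# Estimate replicas needed for a target QPS with 30% headroom
CUR_QPS=5000; CPU_PCT=60; PODS=3; TARGET_QPS=10000
MAX_QPS=$(( CUR_QPS * 100 / CPU_PCT ))                       # fleet max ~= 8333
NEED=$(( (TARGET_QPS * PODS + MAX_QPS - 1) / MAX_QPS ))      # ceil -> 4
echo "$(( (NEED * 130 + 99) / 100 )) pods with 30% headroom" # -> 6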
6. Summary
6.1 Key Takeaways
- Benchmark first: establish baseline numbers before tuning so you can measure the effect afterwards
- Connection reuse is critical: keepalive settings have an outsized impact on performance
- Size resources sensibly: CPU, memory, and connection limits need to match each other
- Monitoring is the safety net: tuning without monitoring is flying blind
6.2 Tuning Results at a Glance
| Optimization | Effect | Difficulty |
| --- | --- | --- |
| More replicas | Roughly linear throughput gain | Low |
| Keepalive configuration | 30-50% latency reduction | Low |
| Worker process tuning | 10-20% throughput gain | Medium |
| Load-balancing algorithm | 10-15% latency reduction | Low |
| Switching to Envoy Gateway | 10-20% overall gain | High |
6.3 Further Learning
- Nginx source-level tuning
  - Resources: 《Nginx 核心知识 100 讲》
  - Practice: understand the event-driven model and connection handling flow
- Envoy in depth
  - Resources: the official Envoy documentation; books on Istio
  - Practice: learn the xDS protocol and dynamic configuration mechanism
- The Gateway API ecosystem
  - Resources: the official Gateway API documentation
  - Practice: follow the Gateway API's development; it is the successor to Ingress
I hope this hands-on guide to Kubernetes Ingress performance tuning, with its detailed configurations, proves useful. In practice, what matters most is always to establish a baseline first, validate changes in small steps, and tune for your own traffic patterns. If you run into problems along the way, you're welcome to bring them to the 云栈社区 ops board and discuss them with the community.