分布式系统的可观测性已成现代架构标配,而全链路追踪恰恰是其中最复杂、也最容易失控的一环。OpenTelemetry(简称 OTel)作为 CNCF 毕业项目,统一了 OpenTracing 和 OpenCensus,一跃成为事实上的可观测性标准。但“标准”不等于“好用”。OTel 的设计哲学是灵活、可扩展、厂商中立,付出的代价便是配置繁复、默认值激进、文档分散。不少团队照着官方 Quick Start 一通操作直接上线,结果被各类性能问题打了个措手不及。在云栈社区的技术讨论中,大家常常感慨:全链路追踪没用好,反倒先把系统拖垮了。
一、概述
1.1 背景介绍
OTel 的核心组件包括:
- SDK:嵌入应用的埋点库,负责生成 Span
- Collector:独立部署的数据管道,负责接收、处理、导出数据
- Exporter:将数据发送到后端存储(如 Jaeger、Zipkin、Tempo)
- Propagator:在服务间传递 Trace Context
每一层都可自定义,这意味着每一层都可能出问题。
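为了先建立直观印象,下面用一段最小的 Java 初始化代码示意这几层在应用里如何对应(仅为示意,端点地址等均为假设值;Collector 是独立进程,不出现在应用代码中):
// 最小初始化示意(非生产配置):SDK + OTLP Exporter + W3C Propagator
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public class MinimalOtelSetup {
    public static OpenTelemetry init() {
        // Exporter:把 Span 发往 Collector(示例地址)
        OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
                .setEndpoint("http://otel-collector:4317")
                .build();
        // SDK:生成 Span 并异步批量导出
        SdkTracerProvider provider = SdkTracerProvider.builder()
                .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
                .build();
        // Propagator:跨服务传递 W3C Trace Context
        return OpenTelemetrySdk.builder()
                .setTracerProvider(provider)
                .setPropagators(ContextPropagators.create(W3CTraceContextPropagator.getInstance()))
                .build();
    }
}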
1.2 适用场景
全链路追踪适合:
- 微服务架构,服务间调用关系复杂
- 定位跨服务的性能瓶颈
- 排查分布式事务的一致性问题
- 分析请求在各环节的耗时分布
不适合:
- 单体应用,直接用 APM 更省心
- 对延迟极度敏感的高频交易系统
- 预算有限、运维人力不足的小团队
1.3 环境要求
本文实践基于以下环境:
| 组件 | 版本 | 说明 |
| --- | --- | --- |
| OpenTelemetry SDK | 1.32.0 | Java/Go/Python 均测试通过 |
| OpenTelemetry Collector | 0.96.0 | 使用 contrib 版本 |
| Jaeger | 1.54.0 | 后端存储用 Elasticsearch 8.x |
| Kubernetes | 1.28 | 部署环境 |
二、详细步骤:从入门到踩坑
2.1 准备工作:别急于写代码
采样率设多少?
很多人一上来就是 100% 采样,觉得数据越全越好。但生产环境日均请求量达 5 亿时,100% 采样每天会产生几十 TB 的 Trace 数据,存储成本直接爆炸。
经验值:
- 开发/测试环境:100%
- 预发环境:10%–50%
- 生产环境:0.1%–1%,并配合动态采样
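落到代码上,可以按环境注入不同的采样率。下面是一个 Java 示意(环境变量名 ENV 和各档比例均为假设值),生产环境建议再包一层 parentBased,跟随上游的采样决策:
// 按环境选择采样率的示意(假设用环境变量 ENV 区分环境)
import io.opentelemetry.sdk.trace.samplers.Sampler;

public final class EnvSampler {
    public static Sampler build() {
        String env = System.getenv().getOrDefault("ENV", "development");
        double ratio;
        switch (env) {
            case "production": ratio = 0.01; break;   // 1%
            case "staging":    ratio = 0.1;  break;   // 10%
            default:           ratio = 1.0;           // 开发/测试全采
        }
        // parentBased:上游已采样则跟随,避免 Trace 断链
        return Sampler.parentBased(Sampler.traceIdRatioBased(ratio));
    }
}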
Collector 怎么部署?
三种模式各有优劣:
# 模式一:Sidecar(每个 Pod 一个 Collector)
# 优点:隔离性好,故障不扩散
# 缺点:资源占用大,N 个 Pod 就要 N 份内存
# 模式二:DaemonSet(每个 Node 一个 Collector)
# 优点:资源利用率高
# 缺点:单节点 Collector 挂了影响整个节点
# 模式三:Deployment(独立的 Collector 集群)
# 优点:统一管理,便于扩缩容
# 缺点:增加一跳网络延迟
我们最终采用 DaemonSet + Deployment 混合模式:DaemonSet 负责接收和初步处理,Deployment 负责聚合和导出。
后端存储选什么?
| 后端 | 优点 | 缺点 | 适用场景 |
| --- | --- | --- | --- |
| Jaeger + ES | 功能全面,查询灵活 | ES 运维成本高 | 大规模生产环境 |
| Jaeger + Cassandra | 写入性能好 | 查询能力弱 | 超大规模写入 |
| Tempo + S3 | 成本低,免运维 | 查询需要 TraceID | 成本敏感场景 |
| Zipkin | 简单易用 | 功能相对基础 | 小规模试点 |
2.2 核心配置:魔鬼在细节里
2.2.1 SDK 配置
以 Java 为例,Spring Boot 项目最常用的配置方式:
# application.yml
otel:
service:
name: order-service
traces:
exporter: otlp
exporter:
otlp:
endpoint: http://otel-collector:4317
# 别用 4318,那是 HTTP 端口,gRPC 用 4317
protocol: grpc
resource:
attributes:
deployment.environment: production
service.version: 1.2.3
坑点一:endpoint 配错协议
4317 是 gRPC 端口,4318 是 HTTP 端口。如果用 gRPC 协议却连了 4318,不会报错,但数据根本发不出去。我们排查了整整两天才定位到这个问题。
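把这组对应关系写成代码更直观。下面的 Java 片段只是示意(Collector 主机名为假设值):gRPC 导出器连 4317,HTTP/protobuf 导出器连 4318,而且 HTTP 端点要带上 /v1/traces 路径:
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.exporter.otlp.http.trace.OtlpHttpSpanExporter;

// gRPC:端口 4317,端点不带路径
OtlpGrpcSpanExporter grpcExporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint("http://otel-collector:4317")
        .build();

// HTTP/protobuf:端口 4318,端点要带 /v1/traces
OtlpHttpSpanExporter httpExporter = OtlpHttpSpanExporter.builder()
        .setEndpoint("http://otel-collector:4318/v1/traces")
        .build();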
坑点二:resource attributes 没设全
service.name 是必填项,但光有它远远不够。建议至少加上:
- deployment.environment:区分环境
- service.version:版本号
- service.instance.id:实例标识
- k8s.pod.name:K8s 环境下的 Pod 名
这些信息在排查问题时极为关键,否则你只知道“订单服务出了问题”,却不清楚是哪个实例、哪个版本。
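如果不方便全走配置文件,也可以在代码里构建 Resource。下面是一个 Java 示意(POD_NAME、HOSTNAME 等环境变量名为假设值,K8s 下一般通过 Downward API 注入):
import io.opentelemetry.sdk.resources.Resource;

Resource resource = Resource.getDefault().merge(Resource.builder()
        .put("service.name", "order-service")
        .put("service.version", "1.2.3")
        .put("service.instance.id",
                System.getenv().getOrDefault("HOSTNAME", "unknown"))
        .put("deployment.environment",
                System.getenv().getOrDefault("ENV", "development"))
        // K8s 下建议用 Downward API 把 Pod 名注入为环境变量
        .put("k8s.pod.name",
                System.getenv().getOrDefault("POD_NAME", "unknown"))
        .build());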
2.2.2 Collector 配置
Collector 的配置才是最容易翻车的地方。一个完整的生产级配置如下:
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
max_recv_msg_size_mib: 16
        # 默认 4MiB 偏小,超过上限的批次会被整批拒绝
http:
endpoint: 0.0.0.0:4318
processors:
# 批处理:减少网络请求次数
batch:
timeout: 5s
send_batch_size: 8192
send_batch_max_size: 16384
# 别设太大,内存会爆
# 内存限制:防止 OOM
memory_limiter:
check_interval: 1s
limit_mib: 2048
spike_limit_mib: 512
# 超过限制会丢数据,但总比 OOM 强
# 采样:控制数据量
probabilistic_sampler:
sampling_percentage: 1
# 1% 采样率
# 属性处理:脱敏和过滤
attributes:
actions:
- key: http.request.header.authorization
action: delete
# 删除敏感头信息
- key: db.statement
action: hash
# 对 SQL 语句做哈希,防止泄露数据
exporters:
otlp/jaeger:
endpoint: jaeger-collector:4317
tls:
insecure: false
cert_file: /etc/ssl/certs/collector.crt
key_file: /etc/ssl/private/collector.key
retry_on_failure:
enabled: true
initial_interval: 5s
max_interval: 30s
max_elapsed_time: 300s
sending_queue:
enabled: true
num_consumers: 10
queue_size: 10000
# 队列满了会丢数据,设大一点
# Debug 用,生产环境关掉
# logging:
# loglevel: debug
extensions:
health_check:
endpoint: 0.0.0.0:13133
zpages:
endpoint: 0.0.0.0:55679
# zpages 是调试神器,能看到内部状态
service:
extensions: [health_check, zpages]
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, batch, probabilistic_sampler, attributes]
exporters: [otlp/jaeger]
# processors 顺序很重要!memory_limiter 必须在最前面
坑点三:processors 顺序错误
memory_limiter 必须放在第一个!如果放在 batch 后面,batch processor 已经将数据攒起来了,memory_limiter 就来不及限制了。
坑点四:batch 参数设置不当
send_batch_size 设太小(比如几百),网络请求太频繁;设太大(比如 100000),内存占用高,而且一批数据出问题就全丢了。我们的经验是:根据单个 Span 的平均大小来算。假设平均每个 Span 1KB,send_batch_size 取默认的 8192 就是约 8MB 一批,这个量级比较合适。
2.2.3 Kubernetes 部署配置
DaemonSet 部署示例:
# otel-collector-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: otel-collector
namespace: monitoring
spec:
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: collector
image: otel/opentelemetry-collector-contrib:0.96.0
# 用 contrib 版本,包含更多组件
args:
- --config=/conf/otel-collector-config.yaml
ports:
- containerPort: 4317
hostPort: 4317
protocol: TCP
- containerPort: 4318
hostPort: 4318
protocol: TCP
- containerPort: 13133
protocol: TCP
resources:
limits:
cpu: "2"
memory: 4Gi
requests:
cpu: "500m"
memory: 1Gi
# 内存给够!Collector 很吃内存
livenessProbe:
httpGet:
path: /
port: 13133
initialDelaySeconds: 10
periodSeconds: 10
readinessProbe:
httpGet:
path: /
port: 13133
initialDelaySeconds: 5
periodSeconds: 5
volumeMounts:
- name: config
mountPath: /conf
volumes:
- name: config
configMap:
name: otel-collector-config
tolerations:
- operator: Exists
# 让 Collector 能调度到所有节点
坑点五:资源限制设太小
Collector 的内存需求被严重低估了。官方示例给的 512Mi 在生产环境根本不够用。我们的经验是:按每秒处理的 Span 数量来估算,每 1000 Span/s 大约需要 200MB 内存。如果你每秒处理 10000 个 Span,至少给 2GB。
2.3 启动验证
部署完成后,用这个 checklist 验证:
# 1. 检查 Collector 是否正常
kubectl get pods -n monitoring -l app=otel-collector
kubectl logs -n monitoring -l app=otel-collector --tail=100
# 2. 检查健康检查端点
curl http://otel-collector:13133/
# 3. 查看 zpages,确认数据在流动
# 浏览器访问 http://otel-collector:55679/debug/tracez
# 4. 检查 Jaeger 是否收到数据
curl "http://jaeger-query:16686/api/services"
# 5. 发一个测试请求,看能不能在 Jaeger 查到
curl -H "traceparent: 00-12345678901234567890123456789012-1234567890123456-01" \
http://your-service/api/test
# 然后去 Jaeger 搜索 TraceID: 12345678901234567890123456789012
三、示例代码和配置
3.1 完整配置示例:Java Spring Boot
<!-- pom.xml -->
<!-- BOM(scope=import)必须放在 dependencyManagement 中,不能直接放进 dependencies -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>io.opentelemetry</groupId>
      <artifactId>opentelemetry-bom</artifactId>
      <version>1.32.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
    <dependency>
      <groupId>io.opentelemetry.instrumentation</groupId>
      <artifactId>opentelemetry-instrumentation-bom</artifactId>
      <version>1.32.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>

<dependencies>
  <!-- Auto instrumentation agent is recommended over manual SDK,
       but the manual setup below needs these artifacts -->
  <dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-api</artifactId>
  </dependency>
  <dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-sdk</artifactId>
  </dependency>
  <dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
  </dependency>
</dependencies>
// CustomSpanProcessor.java
// Handle sensitive data and add custom attributes
package com.example.tracing;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.ReadWriteSpan;
import io.opentelemetry.sdk.trace.ReadableSpan;
import io.opentelemetry.sdk.trace.SpanProcessor;
public class CustomSpanProcessor implements SpanProcessor {
@Override
public void onStart(Context parentContext, ReadWriteSpan span) {
// Add custom attributes at span creation
span.setAttribute("custom.thread.name", Thread.currentThread().getName());
span.setAttribute("custom.timestamp", System.currentTimeMillis());
}
@Override
public boolean isStartRequired() {
return true;
}
@Override
public void onEnd(ReadableSpan span) {
// Log slow spans for debugging
long durationMs = span.getLatencyNanos() / 1_000_000;
if (durationMs > 1000) {
System.err.println("Slow span detected: " + span.getName()
+ ", duration: " + durationMs + "ms"
+ ", traceId: " + span.getSpanContext().getTraceId());
}
}
@Override
public boolean isEndRequired() {
return true;
}
}
// TracingConfig.java
// Production-ready OTel configuration
package com.example.tracing;
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.semconv.resource.attributes.ResourceAttributes;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.time.Duration;
import java.util.concurrent.TimeUnit;
@Configuration
public class TracingConfig {
@Value("${otel.service.name:unknown-service}")
private String serviceName;
@Value("${otel.exporter.otlp.endpoint:http://localhost:4317}")
private String otlpEndpoint;
@Value("${otel.traces.sampler.probability:0.01}")
private double samplerProbability;
@Bean
public OpenTelemetry openTelemetry() {
// Build resource with service info
Resource resource = Resource.getDefault()
.merge(Resource.builder()
.put(ResourceAttributes.SERVICE_NAME, serviceName)
.put(ResourceAttributes.SERVICE_VERSION, "1.0.0")
.put(ResourceAttributes.DEPLOYMENT_ENVIRONMENT,
System.getenv().getOrDefault("ENV", "development"))
.build());
// Configure OTLP exporter with timeout and retry
OtlpGrpcSpanExporter spanExporter = OtlpGrpcSpanExporter.builder()
.setEndpoint(otlpEndpoint)
.setTimeout(Duration.ofSeconds(10))
.build();
// Batch processor with tuned parameters
BatchSpanProcessor batchProcessor = BatchSpanProcessor.builder(spanExporter)
.setMaxQueueSize(10000)
.setMaxExportBatchSize(512)
.setScheduleDelay(5, TimeUnit.SECONDS)
.setExporterTimeout(30, TimeUnit.SECONDS)
.build();
// Build tracer provider with sampling
SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
.setResource(resource)
.addSpanProcessor(new CustomSpanProcessor())
.addSpanProcessor(batchProcessor)
.setSampler(Sampler.traceIdRatioBased(samplerProbability))
.build();
// Register shutdown hook
Runtime.getRuntime().addShutdownHook(new Thread(tracerProvider::close));
return OpenTelemetrySdk.builder()
.setTracerProvider(tracerProvider)
.buildAndRegisterGlobal();
}
@Bean
public Tracer tracer(OpenTelemetry openTelemetry) {
return openTelemetry.getTracer(serviceName, "1.0.0");
}
}
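配置就绪后,业务代码里用注入的 Tracer 手动创建 Span 的典型写法大致如下(示意代码,OrderService、order.id 等命名均为假设;出异常时记得记录异常并标记状态):
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public class OrderService {
    private final Tracer tracer;

    public OrderService(Tracer tracer) {
        this.tracer = tracer;
    }

    public void createOrder(String orderId) {
        Span span = tracer.spanBuilder("createOrder")
                .setAttribute("order.id", orderId)
                .startSpan();
        try (Scope scope = span.makeCurrent()) {
            // 业务逻辑;span 在 current context 中,下游调用会自动挂为子 Span
        } catch (RuntimeException e) {
            span.recordException(e);
            span.setStatus(StatusCode.ERROR);
            throw e;
        } finally {
            span.end();
        }
    }
}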
3.2 完整配置示例:Go
// tracing/tracing.go
// Production-grade OpenTelemetry setup for Go services
package tracing
import (
"context"
"fmt"
"os"
"time"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
"go.opentelemetry.io/otel/propagation"
"go.opentelemetry.io/otel/sdk/resource"
sdktrace "go.opentelemetry.io/otel/sdk/trace"
semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
)
type Config struct {
ServiceName string
ServiceVersion string
Environment string
OTLPEndpoint string
SamplingRatio float64
}
func InitTracer(ctx context.Context, cfg Config) (func(), error) {
// Create OTLP exporter with connection options
conn, err := grpc.DialContext(ctx, cfg.OTLPEndpoint,
grpc.WithTransportCredentials(insecure.NewCredentials()),
grpc.WithBlock(),
grpc.WithTimeout(5*time.Second),
)
if err != nil {
return nil, fmt.Errorf("failed to connect to collector: %w", err)
}
exporter, err := otlptrace.New(ctx,
otlptracegrpc.NewClient(otlptracegrpc.WithGRPCConn(conn)),
)
if err != nil {
return nil, fmt.Errorf("failed to create exporter: %w", err)
}
// Build resource with service attributes
res, err := resource.Merge(
resource.Default(),
resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceName(cfg.ServiceName),
semconv.ServiceVersion(cfg.ServiceVersion),
semconv.DeploymentEnvironment(cfg.Environment),
attribute.String("host.name", getHostname()),
),
)
if err != nil {
return nil, fmt.Errorf("failed to create resource: %w", err)
}
// Configure batch span processor
bsp := sdktrace.NewBatchSpanProcessor(exporter,
sdktrace.WithMaxQueueSize(10000),
sdktrace.WithMaxExportBatchSize(512),
sdktrace.WithBatchTimeout(5*time.Second),
sdktrace.WithExportTimeout(30*time.Second),
)
// Build tracer provider with sampling
tp := sdktrace.NewTracerProvider(
sdktrace.WithResource(res),
sdktrace.WithSpanProcessor(bsp),
sdktrace.WithSampler(sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(cfg.SamplingRatio),
)),
)
// Set global tracer provider and propagator
otel.SetTracerProvider(tp)
otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
propagation.TraceContext{},
propagation.Baggage{},
))
// Return cleanup function
cleanup := func() {
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
if err := tp.Shutdown(ctx); err != nil {
fmt.Fprintf(os.Stderr, "failed to shutdown tracer: %v\n", err)
}
}
return cleanup, nil
}
func getHostname() string {
hostname, err := os.Hostname()
if err != nil {
return "unknown"
}
return hostname
}
// main.go
// Example usage
package main
import (
"context"
"log"
"net/http"
"os"
"os/signal"
"syscall"
"your-project/tracing"
"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
"go.opentelemetry.io/otel"
)
func main() {
ctx := context.Background()
// Initialize tracer
cleanup, err := tracing.InitTracer(ctx, tracing.Config{
ServiceName: "order-service",
ServiceVersion: "1.0.0",
Environment: os.Getenv("ENV"),
OTLPEndpoint: os.Getenv("OTEL_EXPORTER_OTLP_ENDPOINT"),
SamplingRatio: 0.01, // 1% sampling in production
})
if err != nil {
log.Fatalf("failed to init tracer: %v", err)
}
defer cleanup()
tracer := otel.Tracer("order-service")
// HTTP handler with tracing
handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
ctx, span := tracer.Start(r.Context(), "handleOrder")
defer span.End()
// Your business logic here
processOrder(ctx)
w.WriteHeader(http.StatusOK)
})
// Wrap with OTel HTTP middleware
wrappedHandler := otelhttp.NewHandler(handler, "HTTP")
server := &http.Server{
Addr: ":8080",
Handler: wrappedHandler,
}
// Graceful shutdown
go func() {
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
<-sigCh
server.Shutdown(ctx)
}()
log.Println("starting server on :8080")
if err := server.ListenAndServe(); err != http.ErrServerClosed {
log.Fatalf("server error: %v", err)
}
}
func processOrder(ctx context.Context) {
tracer := otel.Tracer("order-service")
_, span := tracer.Start(ctx, "processOrder")
defer span.End()
// Business logic
}
3.3 实际应用案例:动态采样
固定采样率在实际场景中往往不够灵活。比如:出错的请求要 100% 保留,慢请求要提高采样率,特定用户的请求要全量采集。我们实现了一个自定义 Sampler:
// DynamicSampler.java
// Smart sampling based on request characteristics
package com.example.tracing;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.context.Context;
import io.opentelemetry.sdk.trace.data.LinkData;
import io.opentelemetry.sdk.trace.samplers.Sampler;
import io.opentelemetry.sdk.trace.samplers.SamplingDecision;
import io.opentelemetry.sdk.trace.samplers.SamplingResult;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;
public class DynamicSampler implements Sampler {
private final double defaultRatio;
private final double errorRatio;
private final double slowRequestRatio;
private final Set<String> alwaysSampleUsers;
public DynamicSampler(double defaultRatio, Set<String> alwaysSampleUsers) {
this.defaultRatio = defaultRatio;
this.errorRatio = 1.0; // Always sample errors
this.slowRequestRatio = 0.5; // 50% for slow requests
this.alwaysSampleUsers = alwaysSampleUsers;
}
@Override
public SamplingResult shouldSample(
Context parentContext,
String traceId,
String name,
SpanKind spanKind,
Attributes attributes,
List<LinkData> parentLinks) {
// Check if this is a user we always want to sample
String userId = attributes.get(AttributeKey.stringKey("user.id"));
if (userId != null && alwaysSampleUsers.contains(userId)) {
return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
}
        // Check for error indicators (http.status_code is a long per semantic conventions)
        Long httpStatus = attributes.get(AttributeKey.longKey("http.status_code"));
        if (httpStatus != null && httpStatus >= 500) {
            return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
        }
// Check for debug flag in headers
String debugFlag = attributes.get(AttributeKey.stringKey("http.request.header.x_debug"));
if ("true".equalsIgnoreCase(debugFlag)) {
return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
}
// Default probabilistic sampling
if (ThreadLocalRandom.current().nextDouble() < defaultRatio) {
return SamplingResult.create(SamplingDecision.RECORD_AND_SAMPLE);
}
return SamplingResult.create(SamplingDecision.DROP);
}
@Override
public String getDescription() {
return String.format("DynamicSampler{defaultRatio=%f}", defaultRatio);
}
}
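使用时把它包进 parentBased 再交给 SdkTracerProvider,这样子 Span 仍然跟随父 Span 的采样决策,只有 root Span 走自定义逻辑(示意代码,用户 ID 为假设值):
import java.util.Set;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

Sampler sampler = Sampler.parentBased(
        new DynamicSampler(0.01, Set.of("vip-user-42")));

SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .setSampler(sampler)
        // resource、span processor 等其余配置同 TracingConfig
        .build();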
3.4 Collector 的高级配置:Tail-Based Sampling
Head-based sampling(在请求开始时决定是否采样)的问题在于:做采样决策时,你根本不知道这个请求会不会出错、会不会很慢。Tail-based sampling(在请求结束后决定)解决了这个问题,但需要 Collector 集群来支撑:
# otel-collector-tail-sampling.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
processors:
# Tail-based sampling requires grouping spans by trace
groupbytrace:
wait_duration: 10s
num_traces: 100000
# Must buffer traces until they complete
tail_sampling:
decision_wait: 10s
num_traces: 100000
expected_new_traces_per_sec: 1000
policies:
# Policy 1: Always sample errors
- name: errors-policy
type: status_code
status_code:
status_codes: [ERROR]
# Policy 2: Sample slow requests (>2s)
- name: latency-policy
type: latency
latency:
threshold_ms: 2000
# Policy 3: Sample requests with specific attribute
- name: debug-policy
type: string_attribute
string_attribute:
key: debug
values: ["true"]
# Policy 4: Probabilistic for everything else
- name: probabilistic-policy
type: probabilistic
probabilistic:
sampling_percentage: 1
      # 上面四条策略是"或"的关系,命中任意一条即保留该 Trace。
      # 如果还想在策略之间做速率分配,可以改用 composite 策略:
      # 注意子策略必须定义在 composite_sub_policy 下,不能直接引用上面的策略名。
      # - name: composite-policy
      #   type: composite
      #   composite:
      #     max_total_spans_per_second: 10000
      #     policy_order: [sub-errors, sub-probabilistic]
      #     composite_sub_policy:
      #       - name: sub-errors
      #         type: status_code
      #         status_code:
      #           status_codes: [ERROR]
      #       - name: sub-probabilistic
      #         type: probabilistic
      #         probabilistic:
      #           sampling_percentage: 1
      #     rate_allocation:
      #       - policy: sub-errors
      #         percent: 70
      #       - policy: sub-probabilistic
      #         percent: 30
exporters:
otlp/jaeger:
endpoint: jaeger-collector:4317
tls:
insecure: true
service:
pipelines:
traces:
receivers: [otlp]
processors: [groupbytrace, tail_sampling]
exporters: [otlp/jaeger]
坑点六:Tail-based sampling 的内存问题
Tail sampling 需要把完整的 Trace 缓存在内存里,直到所有 Span 都到齐。如果你的服务调用链很长,或者某些服务响应很慢,内存占用会非常可观。num_traces: 100000 + 平均每个 Trace 10KB = 1GB 内存。如果你的 Trace 更大或者并发更高,需要相应调整。
四、最佳实践和注意事项
4.1 性能优化
4.1.1 减少 Span 数量
不是所有操作都需要创建 Span。以下场景可以跳过:
// Bad: Too many spans
public void processItems(List<Item> items) {
for (Item item : items) {
Span span = tracer.spanBuilder("processItem").startSpan();
try {
// process item
} finally {
span.end();
}
}
}
// Good: Single span with attributes
public void processItems(List<Item> items) {
Span span = tracer.spanBuilder("processItems")
.setAttribute("item.count", items.size())
.startSpan();
try {
for (Item item : items) {
// process item
}
} finally {
span.end();
}
}
4.1.2 异步导出
SDK 默认使用 BatchSpanProcessor,已经是异步的。但要注意配置:
BatchSpanProcessor.builder(exporter)
.setMaxQueueSize(10000) // Queue size before dropping
.setMaxExportBatchSize(512) // Spans per batch
.setScheduleDelay(5, TimeUnit.SECONDS) // Max wait time
.build();
setMaxQueueSize 太小会导致 Span 被丢弃,太大会占用过多内存。建议设为预期 QPS 的 5–10 倍。
4.1.3 Context Propagation 开销
每次跨服务调用都需要注入和提取 Trace Context,这有一定开销。如果服务是超高频调用(比如每秒几万次),可以考虑:减少 Header 数量,只传 traceparent,不传 tracestate 和 baggage;或者使用二进制传播格式,比 W3C 文本格式更紧凑。
// W3C 文本传播器(标准方案)
TextMapPropagator propagator = W3CTraceContextPropagator.getInstance();
// OTel Java 并未内置二进制传播器,如需二进制格式需要自行实现(此处仅示意)
// BinaryPropagator binaryPropagator = ...; // Custom implementation
4.2 安全加固
4.2.1 敏感数据脱敏
Span 的属性可能包含敏感数据,必须在导出前脱敏:
# In Collector config
processors:
attributes:
actions:
# Delete sensitive headers
- key: http.request.header.authorization
action: delete
- key: http.request.header.cookie
action: delete
# Hash PII data
- key: user.email
action: hash
- key: user.phone
action: hash
  # 截断和正则脱敏 attributes 处理器做不了,要用 transform 处理器(OTTL)
  transform:
    trace_statements:
      - context: span
        statements:
          # Truncate long values(把所有属性截断到 1000 字符,含 db.statement)
          - truncate_all(attributes, 1000)
          # Redact patterns:替换 URL 中的密码参数
          - replace_pattern(attributes["http.url"], "password=[^&]*", "password=REDACTED")
4.2.2 网络安全
Collector 和后端之间应使用 TLS:
exporters:
otlp/jaeger:
endpoint: jaeger-collector:4317
tls:
insecure: false
cert_file: /etc/ssl/certs/collector.crt
key_file: /etc/ssl/private/collector.key
ca_file: /etc/ssl/certs/ca.crt
在 Kubernetes 内部,可以用 Istio 或 Linkerd 的 mTLS 代替手动配置证书。
4.3 高可用配置
4.3.1 Collector 高可用
单点 Collector 是危险的。推荐多副本 + HPA 配置:
# Deployment with multiple replicas
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector-gateway
spec:
replicas: 3
---
# HPA for auto-scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: otel-collector-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: otel-collector-gateway
minReplicas: 3
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
4.3.2 后端存储高可用
Jaeger + Elasticsearch 的高可用配置:
# Elasticsearch should be a cluster
# At least 3 master-eligible nodes, 2 data nodes
# Jaeger Collector should be stateless and scalable
# jaeger-collector deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger-collector
spec:
replicas: 3
template:
spec:
containers:
- name: jaeger-collector
image: jaegertracing/jaeger-collector:1.54.0
env:
- name: SPAN_STORAGE_TYPE
value: elasticsearch
- name: ES_SERVER_URLS
value: http://elasticsearch:9200
- name: ES_NUM_SHARDS
value: "5"
- name: ES_NUM_REPLICAS
value: "1"
4.4 常见错误
| 错误现象 | 可能原因 | 解决方案 |
| --- | --- | --- |
| Trace 断链 | Context 没有正确传递 | 检查 HTTP Client 是否注入了 Propagator |
| Span 丢失 | Queue 满了 | 增大 maxQueueSize,或增加 Collector 实例 |
| 高延迟 | 同步导出 | 确认使用 BatchSpanProcessor |
| OOM | 内存限制太小 | 增加 Collector 内存,或降低 num_traces |
| 数据不一致 | 采样决策不一致 | 使用 Parent-based sampling |
五、故障排查和监控
5.1 日志查看
5.1.1 SDK 日志
Java SDK 使用 JUL,需要配置日志级别:
# logging.properties
io.opentelemetry.level = FINE
io.opentelemetry.exporter.level = FINE
Go SDK:
// Go SDK 的内部日志走 go-logr 输出,OTEL_LOG_LEVEL 环境变量对 Go SDK 不生效
// 需要 github.com/go-logr/stdr 和 go.opentelemetry.io/otel 两个包
stdr.SetVerbosity(8) // 数值越大越详细
otel.SetLogger(stdr.New(log.New(os.Stderr, "otel ", log.LstdFlags)))
5.1.2 Collector 日志
# Enable debug logging in Collector
service:
telemetry:
logs:
level: debug
# Be careful: debug logs are VERY verbose
查看 Collector 日志:
# Kubernetes
kubectl logs -n monitoring deployment/otel-collector -f
# Filter for errors
kubectl logs -n monitoring deployment/otel-collector | grep -i error
# Check for dropped spans
kubectl logs -n monitoring deployment/otel-collector | grep -i "dropped"
5.2 常见问题排查
5.2.1 Trace 断链问题
症状:在 Jaeger 中看到的 Trace 只有部分 Span,或者完全分离的多个小 Trace。
排查步骤:
# 1. Check if TraceID is consistent across services
# Add logging in your service
logger.info("TraceID: {}", Span.current().getSpanContext().getTraceId());
# 2. Verify HTTP headers are propagated
curl -v http://your-service/api/test 2>&1 | grep -i traceparent
# 3. Check if different services use compatible propagators
# W3C TraceContext is the standard, but some old services might use B3
修复方案:
// Ensure all services use the same propagator
TextMapPropagator propagator = TextMapPropagator.composite(
W3CTraceContextPropagator.getInstance(),
W3CBaggagePropagator.getInstance(),
B3Propagator.injectingMultiHeaders() // For compatibility with old services
);
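如果某个 HTTP 客户端没有自动埋点,还可以手动注入 Context。下面以 JDK 自带的 HttpClient 请求构建器为例做一个示意(下游 URL 为假设值),确保 traceparent 头被带到下游:
import java.net.URI;
import java.net.http.HttpRequest;
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapSetter;

// 告诉 Propagator 如何往 HttpRequest.Builder 里写 header
TextMapSetter<HttpRequest.Builder> setter =
        (builder, key, value) -> builder.header(key, value);

HttpRequest.Builder requestBuilder = HttpRequest.newBuilder()
        .uri(URI.create("http://downstream-service/api/test"));

// 把当前 Context(含 traceparent)注入到请求头
GlobalOpenTelemetry.getPropagators()
        .getTextMapPropagator()
        .inject(Context.current(), requestBuilder, setter);

HttpRequest request = requestBuilder.GET().build();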
5.2.2 性能问题排查
症状:应用 P99 延迟上升,CPU 或内存使用率增加。
# 1. Check if spans are being exported synchronously
grep -r "SimpleSpanProcessor" src/
# 2. Check batch processor queue status via zpages
curl http://otel-collector:55679/debug/tracez
# 3. Profile your application
# For Java
async-profiler -e cpu -d 30 -f profile.html <pid>
# 4. Check Collector metrics
curl http://otel-collector:8888/metrics | grep otelcol
关键指标:
- otelcol_processor_batch_batch_send_size:批次大小
- otelcol_exporter_sent_spans:成功发送的 Span 数
- otelcol_exporter_send_failed_spans:发送失败的 Span 数
- otelcol_processor_dropped_spans:丢弃的 Span 数
5.2.3 数据丢失问题
症状:发送的 Span 数量和 Jaeger 中看到的不一致。
# 1. Check sampling rate
grep "SamplingDecision" app.log | sort | uniq -c
# 2. Check Collector queue status
curl http://otel-collector:8888/metrics | grep queue
# 3. Check Jaeger collector status
curl http://jaeger-collector:14269/metrics | grep spans
# 4. Check Elasticsearch indexing
curl http://elasticsearch:9200/_cat/indices?v | grep jaeger
5.3 性能监控
建议为 OTel 组件本身设置监控:
# Prometheus scrape config for Collector
- job_name: otel-collector
static_configs:
- targets: ['otel-collector:8888']
metric_relabel_configs:
- source_labels: [__name__]
regex: 'otelcol_.*'
action: keep
关键告警规则:
# Prometheus alerting rules
groups:
- name: otel-collector
rules:
- alert: OTelCollectorHighMemory
expr: |
container_memory_usage_bytes{container="otel-collector"}
/ container_spec_memory_limit_bytes{container="otel-collector"} > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "OTel Collector memory usage above 90%"
- alert: OTelCollectorSpansDropped
expr: |
rate(otelcol_processor_dropped_spans[5m]) > 100
for: 5m
labels:
severity: critical
annotations:
summary: "OTel Collector dropping spans"
- alert: OTelExporterFailures
expr: |
rate(otelcol_exporter_send_failed_spans[5m])
/ rate(otelcol_exporter_sent_spans[5m]) > 0.01
for: 5m
labels:
severity: warning
annotations:
summary: "OTel exporter failure rate above 1%"
5.4 备份恢复
Trace 数据通常不需要长期保留,但如果需要备份:
# Elasticsearch snapshot
curl -X PUT "localhost:9200/_snapshot/jaeger_backup" -H 'Content-Type: application/json' -d'
{
"type": "fs",
"settings": {
"location": "/mnt/backups/jaeger"
}
}'
# Create snapshot
curl -X PUT "localhost:9200/_snapshot/jaeger_backup/snapshot_$(date +%Y%m%d)"
# Restore (if needed)
curl -X POST "localhost:9200/_snapshot/jaeger_backup/snapshot_20240101/_restore"
六、总结
6.1 技术要点回顾
- 采样率是关键:生产环境 100% 采样是灾难,从 1% 开始,配合 tail-based sampling 智能采集重要 Trace
- Collector 资源要给够:每 1000 Span/s 约需 200MB 内存,memory_limiter 必须配置
- processors 顺序很重要:memory_limiter 在前,batch 在后
- Context Propagation 要统一:所有服务用同样的 Propagator,否则 Trace 会断链
- 敏感数据要脱敏:在 Collector 层统一处理,不要依赖各应用自己做
- 监控 OTel 组件本身:别让可观测性系统成为不可观测的黑盒
6.2 进阶学习方向
- Exemplars:将 Trace 和 Metrics 关联起来,从 Metrics 异常直接跳转到相关 Trace
- Continuous Profiling:结合 Pyroscope 等工具,从 Trace 钻取到火焰图,深入了解后端架构的性能特征
- eBPF-based Tracing:无侵入式的内核级追踪,性能开销更低
- OpenTelemetry Operator:Kubernetes 原生的 OTel 管理方案
6.3 参考资料
- OpenTelemetry 官方文档
- OpenTelemetry Collector Contrib
- Jaeger 官方文档
- W3C Trace Context 规范
附录
A. 命令速查表
# Collector health check
curl http://otel-collector:13133/
# Collector metrics
curl http://otel-collector:8888/metrics
# Collector zpages (debug UI)
# Browser: http://otel-collector:55679/debug/tracez
# Jaeger services list
curl http://jaeger-query:16686/api/services
# Jaeger trace by ID
curl http://jaeger-query:16686/api/traces/<trace-id>
# Test trace propagation
curl -H "traceparent: 00-12345678901234567890123456789012-1234567890123456-01" \
http://your-service/api/test
# Check Elasticsearch indices
curl http://elasticsearch:9200/_cat/indices?v | grep jaeger
# Gracefully stop Collector(SIGTERM 会触发 pipeline 排空并导出队列中的数据)
kill -TERM <collector-pid>
B. 配置参数详解
SDK 参数
| 参数 | 默认值 | 说明 |
| --- | --- | --- |
| otel.traces.exporter | otlp | 导出器类型 |
| otel.exporter.otlp.endpoint | http://localhost:4317 | Collector 地址 |
| otel.exporter.otlp.protocol | grpc | 协议类型(grpc/http) |
| otel.traces.sampler | parentbased_always_on | 采样器类型 |
| otel.traces.sampler.arg | 1.0 | 采样率参数 |
Collector Batch Processor 参数
| 参数 | 默认值 | 推荐值 | 说明 |
| --- | --- | --- | --- |
| timeout | 200ms | 5s | 批次等待时间 |
| send_batch_size | 8192 | 8192 | 触发发送的 Span 数 |
| send_batch_max_size | 0 | 16384 | 最大批次大小 |
Collector Memory Limiter 参数
| 参数 | 默认值 | 推荐值 | 说明 |
| --- | --- | --- | --- |
| check_interval | 0s | 1s | 检查间隔 |
| limit_mib | 0 | 2048 | 内存硬限制 |
| spike_limit_mib | 0 | 512 | 突发限制 |
| limit_percentage | 0 | 80 | 内存限制百分比 |
C. 术语表
| 术语 | 解释 |
| --- | --- |
| Span | 一次操作的记录,包含名称、时间戳、属性等 |
| Trace | 一组相关 Span 的集合,代表一次完整请求 |
| TraceID | Trace 的唯一标识,128 位 |
| SpanID | Span 的唯一标识,64 位 |
| Parent Span | 当前 Span 的父级,用于构建调用树 |
| Baggage | 跨服务传递的键值对,会随 Context 传播 |
| Propagator | 负责注入和提取 Trace Context 的组件 |
| Sampler | 决定是否采样的组件 |
| Head-based Sampling | 在请求开始时决定是否采样 |
| Tail-based Sampling | 在请求结束后决定是否采样 |
| Collector | 独立的数据收集和处理组件 |
| Receiver | Collector 中接收数据的组件 |
| Processor | Collector 中处理数据的组件 |
| Exporter | Collector 中导出数据的组件 |