ELK's Pain Points
First, the problems we ran into with ELK:
- Staggering resource usage: a 15-node ES cluster at 64GB RAM per node is 960GB of memory in total, and it still OOMed regularly
- Index management nightmare: 2TB of logs per day; too many index shards pegged the master nodes' CPU
- Unstable query performance: queries spanning multiple days frequently timed out
- Stubbornly high costs: SSD storage, hardware, and the operations headcount to run it all
- Inflexible scaling: adding a node meant rebalancing shards, a painful process
Why VictoriaMetrics + Loki
We evaluated several options:
| Option | Pros | Cons | Best fit |
|--------|------|------|----------|
| ELK | Strong full-text search | Heavy resource usage | Complex log analysis |
| Loki | Lightweight, low cost | No full-text indexing | Cloud-native logging |
| ClickHouse | Fast queries | Complex to operate | Analytical workloads |
| VictoriaLogs | Excellent performance | Relatively new product | High-performance logging |
The final architecture:
┌──────────────────────────────────────────────────────────────────────────────┐
│                             Log Collection Layer                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Vector    │  │  Promtail   │  │ Fluent Bit  │  │   OpenTelemetry     │  │
│  │ (preferred) │  │             │  │             │  │     Collector       │  │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────────┬──────────┘  │
│         │                │                │                    │             │
│         └────────────────┴───────┬────────┴────────────────────┘             │
│                                  │                                           │
│                            ┌─────▼─────┐                                     │
│                            │   Kafka   │  (optional, for large volumes)      │
│                            └─────┬─────┘                                     │
│                                  │                                           │
├──────────────────────────────────┼───────────────────────────────────────────┤
│            Storage Layer         │                                           │
│  ┌───────────────────────────────┼───────────────────────────────────┐       │
│  │                               ▼                                   │       │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐    │       │
│  │  │ VictoriaMetrics │  │  Grafana Loki   │  │  VictoriaLogs   │    │       │
│  │  │    (metrics)    │  │     (logs)      │  │ (high-perf logs)│    │       │
│  │  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘    │       │
│  │           │                    │                    │             │       │
│  │           │     Object Storage (S3/MinIO)           │             │       │
│  │           │     ┌─────────────────────┐             │             │       │
│  │           │     │                     │             │             │       │
│  └───────────┼─────┤   Long-term Store   ├─────────────┼─────────────┘       │
│              │     │                     │             │                     │
│              │     └─────────────────────┘             │                     │
├──────────────┼─────────────────────────────────────────┼─────────────────────┤
│              │          Visualization Layer            │                     │
│              │         ┌─────────────────┐             │                     │
│              └────────►│     Grafana     │◄────────────┘                     │
│                        │                 │                                   │
│                        └─────────────────┘                                   │
└──────────────────────────────────────────────────────────────────────────────┘
Technical Highlights
VictoriaMetrics:
- The single-node version is 5-10x faster than Prometheus
- High compression ratio; disk usage is roughly 1/7 of Prometheus
- PromQL-compatible, so it can replace Prometheus seamlessly
- Supports multi-tenancy and long-term storage
Grafana Loki:
- Indexes only labels, never log content
- Storage costs are 10x+ lower than ES
- Deeply integrated with Grafana
- Supports S3/MinIO as backend storage
Environment Requirements
Hardware:

| Component | Spec | Count | Notes |
|-----------|------|-------|-------|
| VictoriaMetrics | 16C/64G/2TB NVMe | 3 | Cluster mode |
| Loki | 8C/32G/500GB NVMe | 3 | Read/write-separated mode |
| MinIO | 8C/16G/20TB HDD | 4 | Erasure-coding mode |
| Grafana | 4C/8G | 2 | High availability |
Software versions:
- VictoriaMetrics: 1.96.0 (cluster version)
- Grafana Loki: 2.9.4
- Grafana: 10.3.1
- Vector: 0.35.0
- MinIO: RELEASE.2024-01-18T22-51-28Z
Log volume:
- Daily log volume: 3TB
- Retention: 7 days hot, 90 days cold
- Log lines: ~5 billion/day
- Query QPS: 500 at peak
2. VictoriaMetrics Cluster Deployment
2.1 Architecture
A VictoriaMetrics cluster consists of three components:
                    ┌──────────────────┐
                    │     vminsert     │
                    │  (write gateway) │
                    └────────┬─────────┘
                             │
                             ▼
        ┌────────────────────┼───────────────────┐
        │                    │                   │
        ▼                    ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   vmstorage-0   │ │   vmstorage-1   │ │   vmstorage-2   │
└─────────────────┘ └─────────────────┘ └─────────────────┘
        │                    │                   │
        └────────────────────┼───────────────────┘
                             │
                             ▼
                    ┌──────────────────┐
                    │     vmselect     │
                    │  (query gateway) │
                    └──────────────────┘
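Writes and reads go through different gateways and URL prefixes: vminsert listens on 8480 under /insert/<tenant>/..., while vmselect listens on 8481 under /select/<tenant>/.... A quick way to see both paths in action, assuming the in-cluster service names used in the Helm values below and tenant 0:
# Push one sample in Prometheus text format through vminsert's import API
curl -d 'smoke_test_metric{source="manual"} 42' \
  http://vminsert-vmcluster.monitoring.svc:8480/insert/0/prometheus/api/v1/import/prometheus
# Read it back through vmselect with a standard PromQL instant query
curl -s 'http://vmselect-vmcluster.monitoring.svc:8481/select/0/prometheus/api/v1/query' \
  --data-urlencode 'query=smoke_test_metric'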
2.2 Helm Deployment
# Add Helm repository
helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update
The complete values.yaml:
# victoriametrics-cluster-values.yaml
vmcluster:
  enabled: true
  spec:
    retentionPeriod: "90d"
    replicationFactor: 2
    # VMSelect - query component
    vmselect:
      replicaCount: 3
      image:
        repository: victoriametrics/vmselect
        tag: v1.96.0-cluster
      extraArgs:
        search.maxConcurrentRequests: "32"
        search.maxQueueDuration: "30s"
        search.maxQueryDuration: "120s"
        search.maxUniqueTimeseries: "1000000"
        search.maxSamplesPerQuery: "1000000000"
        search.cacheTimestampOffset: "5m"
        dedup.minScrapeInterval: "30s"
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "8"
          memory: "16Gi"
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: vmselect
            topologyKey: kubernetes.io/hostname
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: rook-ceph-block
            resources:
              requests:
                storage: 50Gi
    # VMInsert - ingestion component
    vminsert:
      replicaCount: 3
      image:
        repository: victoriametrics/vminsert
        tag: v1.96.0-cluster
      extraArgs:
        maxLabelsPerTimeseries: "50"
        maxLabelValueLen: "1024"
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: vminsert
            topologyKey: kubernetes.io/hostname
    # VMStorage - storage component
    vmstorage:
      replicaCount: 3
      image:
        repository: victoriametrics/vmstorage
        tag: v1.96.0-cluster
      extraArgs:
        dedup.minScrapeInterval: "30s"
        search.maxUniqueTimeseries: "5000000"
        retentionTimezoneOffset: "8h"
        bigMergeConcurrency: "2"
        smallMergeConcurrency: "8"
        finalMergeDelay: "30s"
      resources:
        requests:
          cpu: "4"
          memory: "32Gi"
        limits:
          cpu: "16"
          memory: "64Gi"
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: rook-ceph-block
            resources:
              requests:
                storage: 2Ti
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: vmstorage
            topologyKey: kubernetes.io/hostname

# VMAgent for scraping
vmagent:
  enabled: true
  spec:
    image:
      repository: victoriametrics/vmagent
      tag: v1.96.0
    replicaCount: 2
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "2"
        memory: "2Gi"
    extraArgs:
      promscrape.streamParse: "true"
      promscrape.maxScrapeSize: "256MB"
      remoteWrite.maxDiskUsagePerURL: "10GB"
      remoteWrite.queues: "8"
      remoteWrite.showURL: "true"
    remoteWrite:
      - url: "http://vminsert-vmcluster.monitoring.svc:8480/insert/0/prometheus/"
    selectAllByDefault: true

# VMAlert for alerting
vmalert:
  enabled: true
  spec:
    image:
      repository: victoriametrics/vmalert
      tag: v1.96.0
    replicaCount: 2
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "1Gi"
    datasource:
      url: "http://vmselect-vmcluster.monitoring.svc:8481/select/0/prometheus/"
    notifier:
      url: "http://alertmanager.monitoring.svc:9093/"
    extraArgs:
      external.url: "https://grafana.internal.company.com"
      external.alert.source: "{{ .ExternalURL }}/explore?orgId=1&left=%5B%22now-1h%22,%22now%22,%22VictoriaMetrics%22,%7B%22expr%22:%22{{ .Expr | urlquery }}%22%7D%5D"
    remoteWrite:
      url: "http://vminsert-vmcluster.monitoring.svc:8480/insert/0/prometheus/"
    remoteRead:
      url: "http://vmselect-vmcluster.monitoring.svc:8481/select/0/prometheus/"
Deploy:
helm upgrade --install vmcluster vm/victoria-metrics-k8s-stack \
  --namespace monitoring \
  --create-namespace \
  --values victoriametrics-cluster-values.yaml \
  --wait
# Verify deployment
kubectl get pods -n monitoring -l app.kubernetes.io/instance=vmcluster
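Beyond checking that the pods are Running, each component exposes a /health endpoint that should return OK. A small smoke-test loop; the service names depend on the chart release name, so adjust if yours differ:
for comp in vmselect-vmcluster:8481 vminsert-vmcluster:8480 vmstorage-vmcluster:8482; do
  svc=${comp%%:*}; port=${comp##*:}
  kubectl -n monitoring port-forward "svc/${svc}" "${port}" >/dev/null 2>&1 &
  pf=$!; sleep 2
  printf '%s: ' "${svc}"; curl -s "http://127.0.0.1:${port}/health"; echo
  kill "${pf}"
done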
2.3 Performance Tuning
Create a ConfigMap with OS-level tuning for the storage nodes:
# vmstorage tuning ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: vmstorage-tuning
  namespace: monitoring
data:
  tuning.sh: |
    #!/bin/bash
    # Increase file descriptor limits
    ulimit -n 1000000
    # Tune kernel parameters for storage workload
    sysctl -w vm.max_map_count=262144
    sysctl -w net.core.somaxconn=65535
    sysctl -w net.ipv4.tcp_max_syn_backlog=65535
    # Disable transparent hugepages
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
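Note that this script writes sysctls and THP settings, so it only takes effect when run with privileged host access (for example from a privileged init container or your node provisioning tooling). One way to spot-check a node afterwards, assuming kubectl debug is available in your cluster; worker-01 is a placeholder node name:
kubectl debug node/worker-01 -it --image=busybox -- chroot /host sh -c \
  'sysctl vm.max_map_count net.core.somaxconn; cat /sys/kernel/mm/transparent_hugepage/enabled'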
3. Grafana Loki Deployment
3.1 Choosing an Architecture
Loki offers three deployment modes; we went with the read/write-separated mode:
┌─────────────────────────────────────────────────────────────────┐
│              Loki read/write-separated architecture             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Write path:                                                    │
│  ┌───────────┐      ┌──────────┐      ┌──────────┐              │
│  │Distributor│─────►│ Ingester │─────►│    S3    │              │
│  └───────────┘      └────┬─────┘      └──────────┘              │
│                          │                                      │
│                          ▼                                      │
│                     ┌──────────┐                                │
│                     │  Kafka   │  (WAL)                         │
│                     └──────────┘                                │
│                                                                 │
│  Query path:                                                    │
│  ┌──────────┐      ┌──────────┐      ┌──────────┐               │
│  │  Query   │─────►│ Querier  │─────►│  Index   │               │
│  │ Frontend │      │          │      │ Gateway  │               │
│  └──────────┘      └────┬─────┘      └────┬─────┘               │
│                         │                 │                     │
│                         ▼                 ▼                     │
│                    ┌──────────┐      ┌──────────┐               │
│                    │    S3    │      │  BoltDB  │               │
│                    │ (chunks) │      │ (index)  │               │
│                    └──────────┘      └──────────┘               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
3.2 Deploying Loki with Helm
# loki-distributed-values.yaml
loki:
  image:
    repository: grafana/loki
    tag: 2.9.4
  # Common configuration
  commonConfig:
    replication_factor: 3
    ring:
      kvstore:
        store: memberlist
  # Storage configuration
  storage:
    bucketNames:
      chunks: loki-chunks
      ruler: loki-ruler
      admin: loki-admin
    type: s3
    s3:
      endpoint: minio.storage.svc:9000
      region: us-east-1
      secretAccessKey: ${MINIO_SECRET_KEY}
      accessKeyId: ${MINIO_ACCESS_KEY}
      s3ForcePathStyle: true
      insecure: true
  # Schema configuration
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h
  # Limits configuration
  limits_config:
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    max_cache_freshness_per_query: 10m
    split_queries_by_interval: 15m
    max_query_parallelism: 32
    max_query_series: 500
    ingestion_rate_mb: 100
    ingestion_burst_size_mb: 200
    per_stream_rate_limit: 10MB
    per_stream_rate_limit_burst: 50MB
    max_entries_limit_per_query: 50000
    max_label_name_length: 1024
    max_label_value_length: 2048
    max_label_names_per_series: 30
    retention_period: 2160h  # 90 days
  # Query frontend configuration
  frontend:
    max_outstanding_per_tenant: 4096
    compress_responses: true
    log_queries_longer_than: 10s
  # Query scheduler
  query_scheduler:
    max_outstanding_requests_per_tenant: 32768
  # Ingester configuration
  ingester:
    chunk_idle_period: 30m
    chunk_block_size: 262144
    chunk_retain_period: 1m
    max_transfer_retries: 0
    wal:
      enabled: true
      dir: /loki/wal
      flush_on_shutdown: true
      replay_memory_ceiling: 4GB

# Write components
write:
  replicas: 3
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "8"
      memory: "16Gi"
  persistence:
    enabled: true
    size: 100Gi
    storageClass: rook-ceph-block
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: write
          topologyKey: kubernetes.io/hostname

# Read components
read:
  replicas: 3
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: read
          topologyKey: kubernetes.io/hostname

# Backend components (compactor, ruler)
backend:
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  persistence:
    enabled: true
    size: 50Gi
    storageClass: rook-ceph-block

# Gateway
gateway:
  enabled: true
  replicas: 2
  resources:
    requests:
      cpu: "100m"
      memory: "128Mi"
    limits:
      cpu: "500m"
      memory: "256Mi"
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - host: loki.internal.company.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: loki-tls
        hosts:
          - loki.internal.company.com

# Memcached for caching
memcached:
  enabled: true
  image:
    repository: memcached
    tag: 1.6.23-alpine

memcachedChunks:
  enabled: true
  replicas: 3
  resources:
    requests:
      cpu: "500m"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  extraArgs:
    - -m 3072
    - -I 32m
    - -c 4096

memcachedFrontend:
  enabled: true
  replicas: 2
  resources:
    requests:
      cpu: "200m"
      memory: "1Gi"
    limits:
      cpu: "1"
      memory: "2Gi"
  extraArgs:
    - -m 1536
    - -I 16m
    - -c 2048

memcachedIndexQueries:
  enabled: true
  replicas: 2
  resources:
    requests:
      cpu: "200m"
      memory: "512Mi"
    limits:
      cpu: "1"
      memory: "1Gi"
  extraArgs:
    - -m 512
    - -I 8m
    - -c 1024
Deploy:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install loki grafana/loki-distributed \
  --namespace logging \
  --create-namespace \
  --values loki-distributed-values.yaml \
  --wait
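Before pointing any collectors at it, it is worth pushing a single test line through the gateway and reading it back. Both calls below use Loki's standard push and query APIs; the gateway service name comes from the values above:
# Push one test line (the timestamp is nanoseconds since the epoch)
now=$(date +%s%N)
curl -s -H 'Content-Type: application/json' -X POST \
  http://loki-gateway.logging.svc/loki/api/v1/push \
  -d "{\"streams\":[{\"stream\":{\"namespace\":\"smoke\",\"app\":\"curl-test\"},\"values\":[[\"${now}\",\"hello loki\"]]}]}"
# Read it back with an instant query
curl -s -G http://loki-gateway.logging.svc/loki/api/v1/query \
  --data-urlencode 'query={app="curl-test"}' | jq '.data.result'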
3.3 MinIO Object Storage
Loki needs an object store as its long-term backend:
# minio-values.yaml
mode: distributed
replicas: 4
persistence:
  enabled: true
  size: 5Ti
  storageClass: local-storage
resources:
  requests:
    cpu: "2"
    memory: "8Gi"
  limits:
    cpu: "8"
    memory: "16Gi"
rootUser: admin
rootPassword: ${MINIO_ROOT_PASSWORD}
buckets:
  - name: loki-chunks
    policy: none
    purge: false
  - name: loki-ruler
    policy: none
    purge: false
  - name: loki-admin
    policy: none
    purge: false
policies:
  - name: loki-policy
    statements:
      - resources:
          - "arn:aws:s3:::loki-*"
          - "arn:aws:s3:::loki-*/*"
        actions:
          - "s3:*"
users:
  - accessKey: loki
    secretKey: ${LOKI_MINIO_SECRET}
    policy: loki-policy
ingress:
  enabled: true
  ingressClassName: nginx
  hosts:
    - minio.internal.company.com
  tls:
    - secretName: minio-tls
      hosts:
        - minio.internal.company.com
consoleIngress:
  enabled: true
  ingressClassName: nginx
  hosts:
    - minio-console.internal.company.com
  tls:
    - secretName: minio-console-tls
      hosts:
        - minio-console.internal.company.com
metrics:
  serviceMonitor:
    enabled: true
    namespace: monitoring
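After the chart is up, a quick sanity check with the mc client confirms that the buckets exist and that the loki user's credentials work (endpoint and credentials as configured above):
mc alias set lokiminio http://minio.storage.svc:9000 loki "${LOKI_MINIO_SECRET}"
mc ls lokiminio                 # expect loki-chunks, loki-ruler, loki-admin
mc du lokiminio/loki-chunks     # should start growing once ingestion begins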
4. Log Collection
4.1 The Vector Agent
Vector is a high-performance log collector written in Rust; in our tests it outperformed even Fluent Bit:
# vector-values.yaml
role: Agent
image:
  repository: timberio/vector
  tag: 0.35.0-alpine
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
podLabels:
  app: vector
  component: log-collector
tolerations:
  - operator: Exists
    effect: NoSchedule
customConfig:
  data_dir: /vector-data-dir
  api:
    enabled: true
    address: 0.0.0.0:8686
    playground: false
  # Sources
  sources:
    kubernetes_logs:
      type: kubernetes_logs
      auto_partial_merge: true
      exclude_paths_glob_patterns:
        - "**/kube-system/**"
        - "**/monitoring/**"
      pod_annotation_fields:
        container_image: container_image
        container_name: container_name
        pod_ip: pod_ip
        pod_name: pod_name
        pod_namespace: namespace
        pod_node_name: node_name
        pod_labels: pod_labels
        pod_annotations: pod_annotations
    internal_metrics:
      type: internal_metrics
  # Transforms
  transforms:
    # Parse structured logs
    parse_json:
      type: remap
      inputs:
        - kubernetes_logs
      source: |
        # Try to parse the line as JSON and merge it into the event
        parsed, err = parse_json(.message)
        if err == null && is_object(parsed) {
          . = merge!(., parsed)
        }
        # Extract log level
        .level = .level || .severity || .log_level || "info"
        .level = downcase!(.level)
        # Normalize timestamp
        .timestamp = .timestamp || .time || ."@timestamp" || now()
        # Add processing metadata
        .vector_processed_at = now()
    # Filter noisy logs
    filter_noise:
      type: filter
      inputs:
        - parse_json
      condition:
        type: vrl
        source: |
          # Filter out health check logs
          !match(to_string(.message) ?? "", r'(healthz|readyz|livez|health.*check)') &&
          # Filter out debug/trace logs in production
          .level != "debug" && .level != "trace"
    # Add labels for Loki
    add_loki_labels:
      type: remap
      inputs:
        - filter_noise
      source: |
        # Create Loki labels from Kubernetes metadata
        .loki_labels = {
          "namespace": .namespace,
          "pod": .pod_name,
          "container": .container_name,
          "node": .node_name,
          "app": .pod_labels.app || .pod_labels."app.kubernetes.io/name" || "unknown",
          "level": .level
        }
        # Remove high-cardinality fields from labels
        del(.pod_labels)
        del(.pod_annotations)
    # Reduce log size
    reduce_size:
      type: remap
      inputs:
        - add_loki_labels
      source: |
        # Truncate very long messages
        if length!(.message) > 10000 {
          .message = slice!(.message, 0, 10000) + "... [truncated]"
        }
        # Remove unnecessary fields
        del(.file)
        del(.source_type)
  # Sinks
  sinks:
    loki:
      type: loki
      inputs:
        - reduce_size
      endpoint: http://loki-gateway.logging.svc:80
      encoding:
        codec: json
      labels:
        namespace: "{{ loki_labels.namespace }}"
        pod: "{{ loki_labels.pod }}"
        container: "{{ loki_labels.container }}"
        node: "{{ loki_labels.node }}"
        app: "{{ loki_labels.app }}"
        level: "{{ loki_labels.level }}"
      out_of_order_action: accept
      remove_label_fields: true
      remove_timestamp: false
      batch:
        max_bytes: 10485760
        timeout_secs: 5
      buffer:
        type: disk
        max_size: 268435456
        when_full: block
    prometheus:
      type: prometheus_exporter
      inputs:
        - internal_metrics
      address: 0.0.0.0:9090

# Service monitor for Vector metrics
service:
  enabled: true
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
podMonitor:
  enabled: true
  metricsEndpoints:
    - port: metrics
      interval: 30s
Deploy Vector:
helm repo add vector https://helm.vector.dev
helm repo update
helm upgrade --install vector vector/vector \
  --namespace logging \
  --values vector-values.yaml \
  --wait
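To confirm an agent is healthy, you can validate its rendered config and hit the API we enabled on :8686. The config path below is the Helm chart's default mount point and may differ in your setup:
POD=$(kubectl -n logging get pods -l app=vector -o jsonpath='{.items[0].metadata.name}')
kubectl -n logging exec "$POD" -- vector validate /etc/vector/vector.yaml
kubectl -n logging exec "$POD" -- wget -qO- http://127.0.0.1:8686/health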
4.2 Multi-Cluster Log Collection
For multi-cluster environments, we use Kafka as an intermediate layer:
# vector-aggregator-values.yaml
role: Aggregator
replicas: 3
image:
  repository: timberio/vector
  tag: 0.35.0-alpine
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "8"
    memory: "16Gi"
customConfig:
  data_dir: /vector-data-dir
  sources:
    kafka:
      type: kafka
      bootstrap_servers: kafka.messaging.svc:9092
      group_id: vector-aggregator
      topics:
        - logs-dev
        - logs-staging
        - logs-production
      auto_offset_reset: latest
      decoding:
        codec: json
  transforms:
    add_cluster_label:
      type: remap
      inputs:
        - kafka
      source: |
        # Extract cluster from the Kafka topic name
        .cluster = replace!(.topic, "logs-", "")
        # Add it to the Loki labels
        .loki_labels.cluster = .cluster
  sinks:
    loki:
      type: loki
      inputs:
        - add_cluster_label
      endpoint: http://loki-gateway.logging.svc:80
      encoding:
        codec: json
      labels:
        cluster: "{{ loki_labels.cluster }}"
        namespace: "{{ loki_labels.namespace }}"
        app: "{{ loki_labels.app }}"
        level: "{{ loki_labels.level }}"
      batch:
        max_bytes: 52428800
        timeout_secs: 10
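The main thing to watch with this topology is consumer lag: if the Loki sink backs up, the aggregator group falls behind on the topics. A periodic check with the standard Kafka tooling, run wherever the Kafka CLI is installed:
kafka-consumer-groups.sh --bootstrap-server kafka.messaging.svc:9092 \
  --describe --group vector-aggregator
# LAG should hover near zero; steady growth means the sink is back-pressuring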
5. Grafana Configuration
5.1 Data Sources
# grafana-values.yaml
grafana:
  image:
    repository: grafana/grafana
    tag: 10.3.1
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"
  persistence:
    enabled: true
    size: 20Gi
    storageClassName: rook-ceph-block
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: VictoriaMetrics
          type: prometheus
          url: http://vmselect-vmcluster.monitoring.svc:8481/select/0/prometheus/
          access: proxy
          isDefault: true
          jsonData:
            timeInterval: "30s"
            httpMethod: POST
            manageAlerts: true
            prometheusType: Prometheus
            prometheusVersion: 2.48.0
        - name: Loki
          type: loki
          url: http://loki-gateway.logging.svc:80
          access: proxy
          jsonData:
            maxLines: 5000
            timeout: 300
            derivedFields:
              - datasourceUid: VictoriaMetrics
                matcherRegex: "trace_id=(\\w+)"
                name: TraceID
                url: '$${__value.raw}'
        - name: Loki (Streaming)
          type: loki
          url: http://loki-gateway.logging.svc:80
          access: proxy
          jsonData:
            maxLines: 1000
            timeout: 60
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
        - name: 'infrastructure'
          orgId: 1
          folder: 'Infrastructure'
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/infrastructure
  dashboards:
    default:
      victoriametrics-cluster:
        gnetId: 11176
        revision: 18
        datasource: VictoriaMetrics
      loki-overview:
        gnetId: 13639
        revision: 2
        datasource: Loki
      vector-metrics:
        gnetId: 12539
        revision: 1
        datasource: VictoriaMetrics
  grafana.ini:
    server:
      root_url: https://grafana.internal.company.com
    auth:
      disable_login_form: false
    auth.generic_oauth:
      enabled: true
      name: Keycloak
      client_id: grafana
      client_secret: ${GRAFANA_OAUTH_SECRET}
      scopes: openid profile email groups
      auth_url: https://sso.internal.company.com/realms/company/protocol/openid-connect/auth
      token_url: https://sso.internal.company.com/realms/company/protocol/openid-connect/token
      api_url: https://sso.internal.company.com/realms/company/protocol/openid-connect/userinfo
      role_attribute_path: contains(groups, 'platform-team') && 'Admin' || contains(groups, 'developers') && 'Editor' || 'Viewer'
    feature_toggles:
      enable: tempoSearch,tempoApmTable,traceToLogs
    unified_alerting:
      enabled: true
    alerting:
      enabled: false
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.internal.company.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.internal.company.com
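Once Grafana is up, the provisioned datasources can be verified over its HTTP API; GRAFANA_TOKEN below is a placeholder for a service-account token with admin rights:
curl -s https://grafana.internal.company.com/api/health | jq .
curl -s -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  https://grafana.internal.company.com/api/datasources | jq '.[].name'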
5.2 Log Query Best Practices
Save the team's most-used LogQL queries as a dashboard:
# grafana-dashboard-logs.json (ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards-logs
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-logs.json: |
    {
      "title": "Application Logs",
      "panels": [
        {
          "title": "Error Logs by App",
          "type": "logs",
          "targets": [
            {
              "expr": "{namespace=\"production\", level=\"error\"} |= \"$search\"",
              "refId": "A"
            }
          ]
        },
        {
          "title": "Error Rate by App",
          "type": "timeseries",
          "targets": [
            {
              "expr": "sum by (app) (rate({namespace=\"production\", level=\"error\"}[5m]))",
              "legendFormat": "{{ app }}",
              "refId": "A"
            }
          ]
        },
        {
          "title": "Log Volume",
          "type": "timeseries",
          "targets": [
            {
              "expr": "sum(rate({namespace=\"production\"}[5m])) by (app)",
              "legendFormat": "{{ app }}",
              "refId": "A"
            }
          ]
        }
      ],
      "templating": {
        "list": [
          {
            "name": "namespace",
            "type": "query",
            "query": "label_values(namespace)"
          },
          {
            "name": "app",
            "type": "query",
            "query": "label_values({namespace=\"$namespace\"}, app)"
          },
          {
            "name": "search",
            "type": "textbox",
            "current": {
              "value": ""
            }
          }
        ]
      }
    }
Commonly used LogQL queries:
# Error logs for a given application
{namespace="production", app="order-service", level="error"}
# Search for a specific error string
{namespace="production"} |= "NullPointerException" | json
# Filter on a parsed duration field
{namespace="production", app="payment-service"} | json | duration > 1s
# Error rate
sum(rate({namespace="production", level="error"}[5m])) by (app)
# Extract fields and filter
{namespace="production"} | json | status_code >= 500
# Log volume statistics
sum by (app) (count_over_time({namespace="production"}[1h]))
# Multi-condition query with custom formatting
{namespace="production"} |~ "error|exception|failed" | json | line_format "{{.timestamp}} [{{.level}}] {{.message}}"
6. Migration and Comparison
6.1 Migration Strategy
We migrated incrementally:
Week 1-2: deploy the new stack, dual-write logs
              ┌────────────────┐
              │  Applications  │
              └───────┬────────┘
                      │
              ┌───────▼────────┐
              │     Vector     │
              └───────┬────────┘
                      │
           ┌──────────┴──────────┐
           ▼                     ▼
   ┌───────────────┐     ┌───────────────┐
   │ Elasticsearch │     │     Loki      │
   │  (existing)   │     │     (new)     │
   └───────────────┘     └───────────────┘
Week 3-4: validate the new stack, gradually shift queries over
Week 5-6: stop writing to ES, cut over fully to Loki
Week 7-8: decommission the ES cluster
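During the dual-write weeks we periodically compared counts between the two backends to build confidence before the cutover. A rough consistency check might look like the following; the ES hostname, index pattern, and field names are placeholders for your own:
# Errors in the last hour according to Elasticsearch
curl -s "http://es.internal:9200/logs-production-*/_count" \
  -H 'Content-Type: application/json' \
  -d '{"query":{"bool":{"filter":[{"term":{"level":"error"}},{"range":{"@timestamp":{"gte":"now-1h"}}}]}}}' | jq .count
# The same window according to Loki
logcli instant-query 'sum(count_over_time({namespace="production", level="error"}[1h]))'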
6.2 Cost Comparison
Before and after the migration:

| Metric | ELK | VM + Loki | Change |
|--------|-----|-----------|--------|
| Hardware | | | |
| ES cluster | 15 nodes x 64GB x 2TB SSD | - | - |
| VM/Loki | - | 6 nodes x 64GB + 4 nodes x 20TB HDD | - |
| Monthly hardware cost | ¥45,000 | ¥18,000 | -60% |
| Storage | | | |
| Daily stored volume | 2TB (hot) + 2TB (compressed) | 0.4TB (compressed) | -80% |
| Monthly storage cost | ¥15,000 | ¥3,000 | -80% |
| Operations | | | |
| Headcount | 1.5 FTE | 0.5 FTE | -67% |
| Mean recovery time | ~4 hours | ~30 minutes | -88% |
| Performance | | | |
| Write throughput | 50k EPS | 200k EPS | +300% |
| Query latency (P99) | 5s | 0.5s | -90% |
6.3 Query Performance Comparison
Scenario: query the last hour of error logs for one application
ELK:
  Query time: 2.3s
  Memory: 8GB peak
  Hits: 45,678
Loki:
  Query time: 0.3s
  Memory: 256MB peak
  Hits: 45,678
Scenario: query the last 7 days for a specific keyword
ELK:
  Query time: 15s
  Frequently timed out and required pagination
Loki:
  Query time: 3s
  Label-based pre-filtering keeps the scan small
7. Best Practices
7.1 Log Format Standard
We defined a unified structured log format for all services:
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "error",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "xyz789",
  "message": "Failed to process order",
  "error": {
    "type": "PaymentException",
    "message": "Insufficient funds",
    "stack": "..."
  },
  "context": {
    "order_id": "ORD-12345",
    "user_id": "USR-67890",
    "amount": 99.99
  }
}
Example application logger configuration (Go):
// logger/logger.go
package logger

import (
    "os"

    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

func NewLogger() (*zap.Logger, error) {
    config := zap.NewProductionConfig()
    config.EncoderConfig.TimeKey = "timestamp"
    config.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
    config.EncoderConfig.LevelKey = "level"
    config.EncoderConfig.EncodeLevel = zapcore.LowercaseLevelEncoder
    config.EncoderConfig.MessageKey = "message"
    // Add the service name to every entry
    config.InitialFields = map[string]interface{}{
        "service": os.Getenv("SERVICE_NAME"),
    }
    return config.Build()
}

// Usage
func main() {
    logger, _ := NewLogger()
    defer logger.Sync()

    // traceID, orderID and err come from the surrounding request context
    logger.Info("Order processed",
        zap.String("order_id", "ORD-12345"),
        zap.String("trace_id", traceID),
    )

    logger.Error("Payment failed",
        zap.Error(err),
        zap.String("order_id", orderID),
    )
}
7.2 Label Design Principles
Loki's performance depends heavily on label design:
# Good - low-cardinality labels
labels:
  namespace: production
  app: order-service
  level: error
  node: node-01
# Bad - high-cardinality labels (do not do this!)
labels:
  user_id: USR-12345     # user ID cardinality is far too high
  request_id: abc123     # request ID cardinality is far too high
  pod_name: order-xxx    # pod names churn constantly
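logcli can quantify this directly: its series analysis reports how many unique values each label contributes, which makes a bad label obvious before it becomes a production problem:
logcli series --analyze-labels '{namespace="production"}'
# Labels with thousands of unique values (user_id, request_id, ...) belong in
# the log line itself and should be queried with `| json`, not used as labels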
7.3 Alerting Rules
# Loki alerting rules
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-alerting-rules
  namespace: logging
data:
  rules.yaml: |
    groups:
      - name: application-alerts
        rules:
          - alert: HighErrorRate
            expr: |
              sum(rate({level="error"}[5m])) by (app) > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High error rate for {{ $labels.app }}"
              description: "Error rate is {{ $value }} errors/sec"
          - alert: NoLogsReceived
            expr: |
              absent_over_time({namespace="production"}[5m]) == 1
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "No logs received from production"
          - alert: LogVolumeSpike
            expr: |
              sum(rate({namespace="production"}[5m])) /
              sum(rate({namespace="production"}[1h] offset 1d)) > 2
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Log volume 2x higher than yesterday"
7.4 性能调优清单
# Performance tuning checklist
vector:
- buffer_size: 根据内存调整,建议 256MB
- batch_size: Loki 建议 10MB
- batch_timeout: 5-10s 平衡延迟和效率
loki:
- ingester_memory: 每个 ingester 建议 16GB+
- chunk_idle_period: 30m 平衡写入和压缩
- max_query_parallelism: 根据 CPU 调整
- split_queries_by_interval: 15m 加速大范围查询
- memcached: 必须开启,显著提升查询性能
victoriametrics:
- dedup_interval: 设置为采集间隔
- merge_concurrency: 根据 CPU 核数调整
- search_cache: 建议开启 timestamp offset
storage:
- use_ssd: 热数据用 NVMe SSD
- compression: 开启 zstd 压缩
- retention: 根据需求设置,过长会增加存储成本
8. Troubleshooting
8.1 VictoriaMetrics
# Check cluster status
curl http://vmselect:8481/select/0/prometheus/api/v1/status/tsdb
# Check ingestion rate
curl http://vminsert:8480/metrics | grep vm_rows_inserted
# Check storage usage
curl http://vmstorage:8482/metrics | grep vm_data_size
# Force merge (maintenance)
curl -X POST http://vmstorage:8482/internal/force_merge
# Debug slow queries
curl "http://vmselect:8481/select/0/prometheus/api/v1/query?query=up&stats=1"
8.2 Loki
# Check Loki ring status
curl http://loki:3100/ring
# Check readiness
curl http://loki:3100/ready
# Trigger a flush of in-memory chunks
curl -X POST http://loki:3100/flush
# Check compactor status
curl http://loki:3100/compactor/ring
# Debug query
curl -G -s "http://loki:3100/loki/api/v1/query_range" \
--data-urlencode 'query={namespace="production"}' \
--data-urlencode 'start=1705305600000000000' \
--data-urlencode 'end=1705309200000000000' \
--data-urlencode 'limit=100'
# Check stream info
curl -G -s "http://loki:3100/loki/api/v1/series" \
--data-urlencode 'match[]={namespace="production"}'
8.3 Common Problems
Problem 1: Loki writes fail with rate-limit errors
# Check the currently active limits
curl http://loki:3100/config | jq '.limits_config'
# Solution: raise the limits in the config
limits_config:
  ingestion_rate_mb: 200
  ingestion_burst_size_mb: 400
  per_stream_rate_limit: 20MB
Problem 2: Queries time out
# Solution 1: narrow the query range (e.g. the last hour only)
{app="myapp"} | json | __error__=""
# Solution 2: add more specific label filters
{namespace="production", app="myapp", level="error"} | json
# Solution 3: raise the query limits
query_frontend:
  max_outstanding_per_tenant: 8192
query_scheduler:
  max_outstanding_requests_per_tenant: 65536
Problem 3: Storage grows too fast
# Check stream cardinality
logcli series '{}' | wc -l
# If there are too many streams, revisit the label design:
# high-cardinality labels cause stream explosion
9. Conclusion
What We Gained
- Dramatically lower cost: hardware and storage spend dropped 60-80%
- Much better performance: query latency went from seconds to milliseconds
- Simpler operations: no more wrangling ES shards and indices
- A better fit for the Kubernetes ecosystem: the label model maps naturally onto K8s
Caveats
- Loki is not an ES replacement: if you need full-text search, Loki is the wrong tool
- Label design matters: high-cardinality labels will wreck performance
- Standardize the log format: structured logs are what let Loki shine
- Monitor the monitoring: VictoriaMetrics and Loki need to be watched too
Next Steps
- Tempo integration: jump from logs straight to traces
- VictoriaLogs: VictoriaMetrics' own log backend, with even better performance
- OpenTelemetry: unified collection of logs, metrics, and traces
Appendix
Command Cheat Sheet
# VictoriaMetrics
vmctl prometheus ...                  # import data from Prometheus
vmctl remote-read ...                 # import via the remote-read API
vmbackup -src=... -dst=...            # back up data
vmrestore -src=... -dst=...           # restore data
# Loki CLI (logcli)
logcli query '{namespace="production"}'
logcli labels namespace
logcli series '{namespace="production"}'
logcli instant-query 'count_over_time({app="myapp"} | json [5m])'
# Vector
vector validate vector.toml           # validate a config file
vector top                            # live pipeline monitoring
vector test vector.toml               # run config unit tests
# MinIO
mc alias set myminio http://minio:9000 admin password
mc ls myminio/loki-chunks
mc du myminio/loki-chunks
Glossary

| Term | Meaning |
|------|---------|
| TSDB | Time-series database |
| Cardinality | The number of distinct values a label can take |
| Ingester | Loki's write component |
| Querier | Loki's query component |
| Chunk | A block of compressed log data |
| LogQL | Loki's query language |
| PromQL | Prometheus's query language |
| Stream | In Loki, the set of log lines sharing one exact label set |
A logging system at this scale is very hard to get right single-handedly; no one dodges every pitfall alone. During the evaluation and the later tuning rounds, regularly trading notes with peers in the 云栈社区 community sparked many of the monitoring and automation ideas in this post. The move from ELK to VictoriaMetrics + Loki was not just a tool swap; it was a full upgrade in how the team thinks about observability.