ELK's Pain Points
First, the problems we ran into with ELK:
- Staggering resource usage: a 15-node ES cluster at 64GB RAM per node is 960GB of memory in total, and it still OOMed regularly
- Index management nightmare: 2TB of logs per day; too many index shards pegged the master nodes' CPU
- Unstable query performance: queries spanning multiple days frequently timed out
- Stubbornly high costs: SSD storage, hardware, and the operations headcount to run it all
- Inflexible scaling: adding a node meant rebalancing shards, a painful process
Why VictoriaMetrics + Loki
We evaluated several options:
| Option | Pros | Cons | Best fit |
|--------|------|------|----------|
| ELK | Strong full-text search | Heavy resource usage | Complex log analysis |
| Loki | Lightweight, low cost | No full-text indexing | Cloud-native logging |
| ClickHouse | Fast queries | Complex to operate | Analytical workloads |
| VictoriaLogs | Excellent performance | Relatively new product | High-performance logging |
The final architecture:
┌──────────────────────────────────────────────────────────────────────────────┐
│                             Log Collection Layer                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │   Vector    │  │  Promtail   │  │ Fluent Bit  │  │   OpenTelemetry     │  │
│  │ (preferred) │  │             │  │             │  │     Collector       │  │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────────┬──────────┘  │
│         │                │                │                    │             │
│         └────────────────┴───────┬────────┴────────────────────┘             │
│                                  │                                           │
│                            ┌─────▼─────┐                                     │
│                            │   Kafka   │  (optional, for large volumes)      │
│                            └─────┬─────┘                                     │
│                                  │                                           │
├──────────────────────────────────┼───────────────────────────────────────────┤
│            Storage Layer         │                                           │
│  ┌───────────────────────────────┼───────────────────────────────────┐       │
│  │                               ▼                                   │       │
│  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐    │       │
│  │  │ VictoriaMetrics │  │  Grafana Loki   │  │  VictoriaLogs   │    │       │
│  │  │    (metrics)    │  │     (logs)      │  │ (high-perf logs)│    │       │
│  │  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘    │       │
│  │           │                    │                    │             │       │
│  │           │     Object Storage (S3/MinIO)           │             │       │
│  │           │     ┌─────────────────────┐             │             │       │
│  │           │     │                     │             │             │       │
│  └───────────┼─────┤   Long-term Store   ├─────────────┼─────────────┘       │
│              │     │                     │             │                     │
│              │     └─────────────────────┘             │                     │
├──────────────┼─────────────────────────────────────────┼─────────────────────┤
│              │          Visualization Layer            │                     │
│              │         ┌─────────────────┐             │                     │
│              └────────►│     Grafana     │◄────────────┘                     │
│                        │                 │                                   │
│                        └─────────────────┘                                   │
└──────────────────────────────────────────────────────────────────────────────┘
Technical Highlights
VictoriaMetrics:
- The single-node version is 5-10x faster than Prometheus
- High compression ratio; disk usage is roughly 1/7 of Prometheus
- PromQL-compatible, so it can replace Prometheus seamlessly
- Supports multi-tenancy and long-term storage
Grafana Loki:
- Indexes only labels, never log content
- Storage costs are 10x+ lower than ES
- Deeply integrated with Grafana
- Supports S3/MinIO as backend storage
Environment Requirements
Hardware:

| Component | Spec | Count | Notes |
|-----------|------|-------|-------|
| VictoriaMetrics | 16C/64G/2TB NVMe | 3 | Cluster mode |
| Loki | 8C/32G/500GB NVMe | 3 | Read/write-separated mode |
| MinIO | 8C/16G/20TB HDD | 4 | Erasure-coding mode |
| Grafana | 4C/8G | 2 | High availability |
Software versions:
- VictoriaMetrics: 1.96.0 (cluster version)
- Grafana Loki: 2.9.4
- Grafana: 10.3.1
- Vector: 0.35.0
- MinIO: RELEASE.2024-01-18T22-51-28Z
Log volume:
- Daily log volume: 3TB
- Retention: 7 days hot, 90 days cold
- Log lines: ~5 billion/day
- Query QPS: 500 at peak
2. VictoriaMetrics Cluster Deployment
2.1 Architecture
A VictoriaMetrics cluster consists of three components:
                    ┌──────────────────┐
                    │     vminsert     │
                    │  (write gateway) │
                    └────────┬─────────┘
                             │
                             ▼
        ┌────────────────────┼───────────────────┐
        │                    │                   │
        ▼                    ▼                   ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│   vmstorage-0   │ │   vmstorage-1   │ │   vmstorage-2   │
└─────────────────┘ └─────────────────┘ └─────────────────┘
        │                    │                   │
        └────────────────────┼───────────────────┘
                             │
                             ▼
                    ┌──────────────────┐
                    │     vmselect     │
                    │  (query gateway) │
                    └──────────────────┘
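Writes and reads go through different gateways and URL prefixes: vminsert listens on 8480 under /insert/<tenant>/..., while vmselect listens on 8481 under /select/<tenant>/.... A quick way to see both paths in action, assuming the in-cluster service names used in the Helm values below and tenant 0:
# Push one sample in Prometheus text format through vminsert's import API
curl -d 'smoke_test_metric{source="manual"} 42' \
  http://vminsert-vmcluster.monitoring.svc:8480/insert/0/prometheus/api/v1/import/prometheus
# Read it back through vmselect with a standard PromQL instant query
curl -s 'http://vmselect-vmcluster.monitoring.svc:8481/select/0/prometheus/api/v1/query' \
  --data-urlencode 'query=smoke_test_metric'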
2.2 Helm Deployment
# Add Helm repository
helm repo add vm https://victoriametrics.github.io/helm-charts/
helm repo update
The complete values.yaml:
# victoriametrics-cluster-values.yaml
vmcluster:
  enabled: true
  spec:
    retentionPeriod: "90d"
    replicationFactor: 2
    # VMSelect - query component
    vmselect:
      replicaCount: 3
      image:
        repository: victoriametrics/vmselect
        tag: v1.96.0-cluster
      extraArgs:
        search.maxConcurrentRequests: "32"
        search.maxQueueDuration: "30s"
        search.maxQueryDuration: "120s"
        search.maxUniqueTimeseries: "1000000"
        search.maxSamplesPerQuery: "1000000000"
        search.cacheTimestampOffset: "5m"
        dedup.minScrapeInterval: "30s"
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
        limits:
          cpu: "8"
          memory: "16Gi"
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: vmselect
            topologyKey: kubernetes.io/hostname
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: rook-ceph-block
            resources:
              requests:
                storage: 50Gi
    # VMInsert - ingestion component
    vminsert:
      replicaCount: 3
      image:
        repository: victoriametrics/vminsert
        tag: v1.96.0-cluster
      extraArgs:
        maxLabelsPerTimeseries: "50"
        maxLabelValueLen: "1024"
      resources:
        requests:
          cpu: "1"
          memory: "2Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: vminsert
            topologyKey: kubernetes.io/hostname
    # VMStorage - storage component
    vmstorage:
      replicaCount: 3
      image:
        repository: victoriametrics/vmstorage
        tag: v1.96.0-cluster
      extraArgs:
        dedup.minScrapeInterval: "30s"
        search.maxUniqueTimeseries: "5000000"
        retentionTimezoneOffset: "8h"
        bigMergeConcurrency: "2"
        smallMergeConcurrency: "8"
        finalMergeDelay: "30s"
      resources:
        requests:
          cpu: "4"
          memory: "32Gi"
        limits:
          cpu: "16"
          memory: "64Gi"
      storage:
        volumeClaimTemplate:
          spec:
            storageClassName: rook-ceph-block
            resources:
              requests:
                storage: 2Ti
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: vmstorage
            topologyKey: kubernetes.io/hostname

# VMAgent for scraping
vmagent:
  enabled: true
  spec:
    image:
      repository: victoriametrics/vmagent
      tag: v1.96.0
    replicaCount: 2
    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        cpu: "2"
        memory: "2Gi"
    extraArgs:
      promscrape.streamParse: "true"
      promscrape.maxScrapeSize: "256MB"
      remoteWrite.maxDiskUsagePerURL: "10GB"
      remoteWrite.queues: "8"
      remoteWrite.showURL: "true"
    remoteWrite:
      - url: "http://vminsert-vmcluster.monitoring.svc:8480/insert/0/prometheus/"
    selectAllByDefault: true

# VMAlert for alerting
vmalert:
  enabled: true
  spec:
    image:
      repository: victoriametrics/vmalert
      tag: v1.96.0
    replicaCount: 2
    resources:
      requests:
        cpu: "200m"
        memory: "256Mi"
      limits:
        cpu: "1"
        memory: "1Gi"
    datasource:
      url: "http://vmselect-vmcluster.monitoring.svc:8481/select/0/prometheus/"
    notifier:
      url: "http://alertmanager.monitoring.svc:9093/"
    extraArgs:
      external.url: "https://grafana.internal.company.com"
      external.alert.source: "{{ .ExternalURL }}/explore?orgId=1&left=%5B%22now-1h%22,%22now%22,%22VictoriaMetrics%22,%7B%22expr%22:%22{{ .Expr | urlquery }}%22%7D%5D"
    remoteWrite:
      url: "http://vminsert-vmcluster.monitoring.svc:8480/insert/0/prometheus/"
    remoteRead:
      url: "http://vmselect-vmcluster.monitoring.svc:8481/select/0/prometheus/"
Deploy:
helm upgrade --install vmcluster vm/victoria-metrics-k8s-stack \
  --namespace monitoring \
  --create-namespace \
  --values victoriametrics-cluster-values.yaml \
  --wait
# Verify deployment
kubectl get pods -n monitoring -l app.kubernetes.io/instance=vmcluster
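Beyond checking that the pods are Running, each component exposes a /health endpoint that should return OK. A small smoke-test loop; the service names depend on the chart release name, so adjust if yours differ:
for comp in vmselect-vmcluster:8481 vminsert-vmcluster:8480 vmstorage-vmcluster:8482; do
  svc=${comp%%:*}; port=${comp##*:}
  kubectl -n monitoring port-forward "svc/${svc}" "${port}" >/dev/null 2>&1 &
  pf=$!; sleep 2
  printf '%s: ' "${svc}"; curl -s "http://127.0.0.1:${port}/health"; echo
  kill "${pf}"
done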
2.3 Performance Tuning
Create a ConfigMap with OS-level tuning for the storage nodes:
# vmstorage tuning ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: vmstorage-tuning
  namespace: monitoring
data:
  tuning.sh: |
    #!/bin/bash
    # Increase file descriptor limits
    ulimit -n 1000000
    # Tune kernel parameters for storage workload
    sysctl -w vm.max_map_count=262144
    sysctl -w net.core.somaxconn=65535
    sysctl -w net.ipv4.tcp_max_syn_backlog=65535
    # Disable transparent hugepages
    echo never > /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/defrag
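Note that this script writes sysctls and THP settings, so it only takes effect when run with privileged host access (for example from a privileged init container or your node provisioning tooling). One way to spot-check a node afterwards, assuming kubectl debug is available in your cluster; worker-01 is a placeholder node name:
kubectl debug node/worker-01 -it --image=busybox -- chroot /host sh -c \
  'sysctl vm.max_map_count net.core.somaxconn; cat /sys/kernel/mm/transparent_hugepage/enabled'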
3. Grafana Loki Deployment
3.1 Choosing an Architecture
Loki offers three deployment modes; we went with the read/write-separated mode:
┌─────────────────────────────────────────────────────────────────┐
│              Loki read/write-separated architecture             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Write path:                                                    │
│  ┌───────────┐      ┌──────────┐      ┌──────────┐              │
│  │Distributor│─────►│ Ingester │─────►│    S3    │              │
│  └───────────┘      └────┬─────┘      └──────────┘              │
│                          │                                      │
│                          ▼                                      │
│                     ┌──────────┐                                │
│                     │  Kafka   │  (WAL)                         │
│                     └──────────┘                                │
│                                                                 │
│  Query path:                                                    │
│  ┌──────────┐      ┌──────────┐      ┌──────────┐               │
│  │  Query   │─────►│ Querier  │─────►│  Index   │               │
│  │ Frontend │      │          │      │ Gateway  │               │
│  └──────────┘      └────┬─────┘      └────┬─────┘               │
│                         │                 │                     │
│                         ▼                 ▼                     │
│                    ┌──────────┐      ┌──────────┐               │
│                    │    S3    │      │  BoltDB  │               │
│                    │ (chunks) │      │ (index)  │               │
│                    └──────────┘      └──────────┘               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
3.2 Deploying Loki with Helm
# loki-distributed-values.yaml
loki:
  image:
    repository: grafana/loki
    tag: 2.9.4
  # Common configuration
  commonConfig:
    replication_factor: 3
    ring:
      kvstore:
        store: memberlist
  # Storage configuration
  storage:
    bucketNames:
      chunks: loki-chunks
      ruler: loki-ruler
      admin: loki-admin
    type: s3
    s3:
      endpoint: minio.storage.svc:9000
      region: us-east-1
      secretAccessKey: ${MINIO_SECRET_KEY}
      accessKeyId: ${MINIO_ACCESS_KEY}
      s3ForcePathStyle: true
      insecure: true
  # Schema configuration
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h
  # Limits configuration
  limits_config:
    enforce_metric_name: false
    reject_old_samples: true
    reject_old_samples_max_age: 168h
    max_cache_freshness_per_query: 10m
    split_queries_by_interval: 15m
    max_query_parallelism: 32
    max_query_series: 500
    ingestion_rate_mb: 100
    ingestion_burst_size_mb: 200
    per_stream_rate_limit: 10MB
    per_stream_rate_limit_burst: 50MB
    max_entries_limit_per_query: 50000
    max_label_name_length: 1024
    max_label_value_length: 2048
    max_label_names_per_series: 30
    retention_period: 2160h  # 90 days
  # Query frontend configuration
  frontend:
    max_outstanding_per_tenant: 4096
    compress_responses: true
    log_queries_longer_than: 10s
  # Query scheduler
  query_scheduler:
    max_outstanding_requests_per_tenant: 32768
  # Ingester configuration
  ingester:
    chunk_idle_period: 30m
    chunk_block_size: 262144
    chunk_retain_period: 1m
    max_transfer_retries: 0
    wal:
      enabled: true
      dir: /loki/wal
      flush_on_shutdown: true
      replay_memory_ceiling: 4GB

# Write components
write:
  replicas: 3
  resources:
    requests:
      cpu: "2"
      memory: "4Gi"
    limits:
      cpu: "8"
      memory: "16Gi"
  persistence:
    enabled: true
    size: 100Gi
    storageClass: rook-ceph-block
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: write
          topologyKey: kubernetes.io/hostname

# Read components
read:
  replicas: 3
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "4"
      memory: "8Gi"
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: read
          topologyKey: kubernetes.io/hostname

# Backend components (compactor, ruler)
backend:
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  persistence:
    enabled: true
    size: 50Gi
    storageClass: rook-ceph-block

# Gateway
gateway:
  enabled: true
  replicas: 2
  resources:
    requests:
      cpu: "100m"
      memory: "128Mi"
    limits:
      cpu: "500m"
      memory: "256Mi"
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - host: loki.internal.company.com
        paths:
          - path: /
            pathType: Prefix
    tls:
      - secretName: loki-tls
        hosts:
          - loki.internal.company.com

# Memcached for caching
memcached:
  enabled: true
  image:
    repository: memcached
    tag: 1.6.23-alpine

memcachedChunks:
  enabled: true
  replicas: 3
  resources:
    requests:
      cpu: "500m"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
  extraArgs:
    - -m 3072
    - -I 32m
    - -c 4096

memcachedFrontend:
  enabled: true
  replicas: 2
  resources:
    requests:
      cpu: "200m"
      memory: "1Gi"
    limits:
      cpu: "1"
      memory: "2Gi"
  extraArgs:
    - -m 1536
    - -I 16m
    - -c 2048

memcachedIndexQueries:
  enabled: true
  replicas: 2
  resources:
    requests:
      cpu: "200m"
      memory: "512Mi"
    limits:
      cpu: "1"
      memory: "1Gi"
  extraArgs:
    - -m 512
    - -I 8m
    - -c 1024
Deploy:
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm upgrade --install loki grafana/loki-distributed \
  --namespace logging \
  --create-namespace \
  --values loki-distributed-values.yaml \
  --wait
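Before pointing any collectors at it, it is worth pushing a single test line through the gateway and reading it back. Both calls below use Loki's standard push and query APIs; the gateway service name comes from the values above:
# Push one test line (the timestamp is nanoseconds since the epoch)
now=$(date +%s%N)
curl -s -H 'Content-Type: application/json' -X POST \
  http://loki-gateway.logging.svc/loki/api/v1/push \
  -d "{\"streams\":[{\"stream\":{\"namespace\":\"smoke\",\"app\":\"curl-test\"},\"values\":[[\"${now}\",\"hello loki\"]]}]}"
# Read it back with an instant query
curl -s -G http://loki-gateway.logging.svc/loki/api/v1/query \
  --data-urlencode 'query={app="curl-test"}' | jq '.data.result'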
3.3 MinIO Object Storage
Loki needs an object store as its long-term backend:
# minio-values.yaml
mode: distributed
replicas: 4
persistence:
  enabled: true
  size: 5Ti
  storageClass: local-storage
resources:
  requests:
    cpu: "2"
    memory: "8Gi"
  limits:
    cpu: "8"
    memory: "16Gi"
rootUser: admin
rootPassword: ${MINIO_ROOT_PASSWORD}
buckets:
  - name: loki-chunks
    policy: none
    purge: false
  - name: loki-ruler
    policy: none
    purge: false
  - name: loki-admin
    policy: none
    purge: false
policies:
  - name: loki-policy
    statements:
      - resources:
          - "arn:aws:s3:::loki-*"
          - "arn:aws:s3:::loki-*/*"
        actions:
          - "s3:*"
users:
  - accessKey: loki
    secretKey: ${LOKI_MINIO_SECRET}
    policy: loki-policy
ingress:
  enabled: true
  ingressClassName: nginx
  hosts:
    - minio.internal.company.com
  tls:
    - secretName: minio-tls
      hosts:
        - minio.internal.company.com
consoleIngress:
  enabled: true
  ingressClassName: nginx
  hosts:
    - minio-console.internal.company.com
  tls:
    - secretName: minio-console-tls
      hosts:
        - minio-console.internal.company.com
metrics:
  serviceMonitor:
    enabled: true
    namespace: monitoring
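After the chart is up, a quick sanity check with the mc client confirms that the buckets exist and that the loki user's credentials work (endpoint and credentials as configured above):
mc alias set lokiminio http://minio.storage.svc:9000 loki "${LOKI_MINIO_SECRET}"
mc ls lokiminio                 # expect loki-chunks, loki-ruler, loki-admin
mc du lokiminio/loki-chunks     # should start growing once ingestion begins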
4. Log Collection
4.1 The Vector Agent
Vector is a high-performance log collector written in Rust; in our tests it outperformed even Fluent Bit:
# vector-values.yaml
role: Agent
image:
  repository: timberio/vector
  tag: 0.35.0-alpine
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "1"
    memory: "512Mi"
podLabels:
  app: vector
  component: log-collector
tolerations:
  - operator: Exists
    effect: NoSchedule
customConfig:
  data_dir: /vector-data-dir
  api:
    enabled: true
    address: 0.0.0.0:8686
    playground: false
  # Sources
  sources:
    kubernetes_logs:
      type: kubernetes_logs
      auto_partial_merge: true
      exclude_paths_glob_patterns:
        - "**/kube-system/**"
        - "**/monitoring/**"
      pod_annotation_fields:
        container_image: container_image
        container_name: container_name
        pod_ip: pod_ip
        pod_name: pod_name
        pod_namespace: namespace
        pod_node_name: node_name
        pod_labels: pod_labels
        pod_annotations: pod_annotations
    internal_metrics:
      type: internal_metrics
  # Transforms
  transforms:
    # Parse structured logs
    parse_json:
      type: remap
      inputs:
        - kubernetes_logs
      source: |
        # Try to parse the line as JSON and merge it into the event
        parsed, err = parse_json(.message)
        if err == null && is_object(parsed) {
          . = merge!(., parsed)
        }
        # Extract log level
        .level = .level || .severity || .log_level || "info"
        .level = downcase!(.level)
        # Normalize timestamp
        .timestamp = .timestamp || .time || ."@timestamp" || now()
        # Add processing metadata
        .vector_processed_at = now()
    # Filter noisy logs
    filter_noise:
      type: filter
      inputs:
        - parse_json
      condition:
        type: vrl
        source: |
          # Filter out health check logs
          !match(to_string(.message) ?? "", r'(healthz|readyz|livez|health.*check)') &&
          # Filter out debug/trace logs in production
          .level != "debug" && .level != "trace"
    # Add labels for Loki
    add_loki_labels:
      type: remap
      inputs:
        - filter_noise
      source: |
        # Create Loki labels from Kubernetes metadata
        .loki_labels = {
          "namespace": .namespace,
          "pod": .pod_name,
          "container": .container_name,
          "node": .node_name,
          "app": .pod_labels.app || .pod_labels."app.kubernetes.io/name" || "unknown",
          "level": .level
        }
        # Remove high-cardinality fields from labels
        del(.pod_labels)
        del(.pod_annotations)
    # Reduce log size
    reduce_size:
      type: remap
      inputs:
        - add_loki_labels
      source: |
        # Truncate very long messages
        if length!(.message) > 10000 {
          .message = slice!(.message, 0, 10000) + "... [truncated]"
        }
        # Remove unnecessary fields
        del(.file)
        del(.source_type)
  # Sinks
  sinks:
    loki:
      type: loki
      inputs:
        - reduce_size
      endpoint: http://loki-gateway.logging.svc:80
      encoding:
        codec: json
      labels:
        namespace: "{{ loki_labels.namespace }}"
        pod: "{{ loki_labels.pod }}"
        container: "{{ loki_labels.container }}"
        node: "{{ loki_labels.node }}"
        app: "{{ loki_labels.app }}"
        level: "{{ loki_labels.level }}"
      out_of_order_action: accept
      remove_label_fields: true
      remove_timestamp: false
      batch:
        max_bytes: 10485760
        timeout_secs: 5
      buffer:
        type: disk
        max_size: 268435456
        when_full: block
    prometheus:
      type: prometheus_exporter
      inputs:
        - internal_metrics
      address: 0.0.0.0:9090

# Service monitor for Vector metrics
service:
  enabled: true
  ports:
    - name: metrics
      port: 9090
      targetPort: 9090
      protocol: TCP
podMonitor:
  enabled: true
  metricsEndpoints:
    - port: metrics
      interval: 30s
Deploy Vector:
helm repo add vector https://helm.vector.dev
helm repo update
helm upgrade --install vector vector/vector \
  --namespace logging \
  --values vector-values.yaml \
  --wait
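To confirm an agent is healthy, you can validate its rendered config and hit the API we enabled on :8686. The config path below is the Helm chart's default mount point and may differ in your setup:
POD=$(kubectl -n logging get pods -l app=vector -o jsonpath='{.items[0].metadata.name}')
kubectl -n logging exec "$POD" -- vector validate /etc/vector/vector.yaml
kubectl -n logging exec "$POD" -- wget -qO- http://127.0.0.1:8686/health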
4.2 Multi-Cluster Log Collection
For multi-cluster environments, we use Kafka as an intermediate layer:
# vector-aggregator-values.yaml
role: Aggregator
replicas: 3
image:
  repository: timberio/vector
  tag: 0.35.0-alpine
resources:
  requests:
    cpu: "2"
    memory: "4Gi"
  limits:
    cpu: "8"
    memory: "16Gi"
customConfig:
  data_dir: /vector-data-dir
  sources:
    kafka:
      type: kafka
      bootstrap_servers: kafka.messaging.svc:9092
      group_id: vector-aggregator
      topics:
        - logs-dev
        - logs-staging
        - logs-production
      auto_offset_reset: latest
      decoding:
        codec: json
  transforms:
    add_cluster_label:
      type: remap
      inputs:
        - kafka
      source: |
        # Extract cluster from the Kafka topic name
        .cluster = replace!(.topic, "logs-", "")
        # Add it to the Loki labels
        .loki_labels.cluster = .cluster
  sinks:
    loki:
      type: loki
      inputs:
        - add_cluster_label
      endpoint: http://loki-gateway.logging.svc:80
      encoding:
        codec: json
      labels:
        cluster: "{{ loki_labels.cluster }}"
        namespace: "{{ loki_labels.namespace }}"
        app: "{{ loki_labels.app }}"
        level: "{{ loki_labels.level }}"
      batch:
        max_bytes: 52428800
        timeout_secs: 10
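The main thing to watch with this topology is consumer lag: if the Loki sink backs up, the aggregator group falls behind on the topics. A periodic check with the standard Kafka tooling, run wherever the Kafka CLI is installed:
kafka-consumer-groups.sh --bootstrap-server kafka.messaging.svc:9092 \
  --describe --group vector-aggregator
# LAG should hover near zero; steady growth means the sink is back-pressuring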
5. Grafana Configuration
5.1 Data Sources
# grafana-values.yaml
grafana:
  image:
    repository: grafana/grafana
    tag: 10.3.1
  replicas: 2
  resources:
    requests:
      cpu: "500m"
      memory: "512Mi"
    limits:
      cpu: "2"
      memory: "2Gi"
  persistence:
    enabled: true
    size: 20Gi
    storageClassName: rook-ceph-block
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: VictoriaMetrics
          type: prometheus
          url: http://vmselect-vmcluster.monitoring.svc:8481/select/0/prometheus/
          access: proxy
          isDefault: true
          jsonData:
            timeInterval: "30s"
            httpMethod: POST
            manageAlerts: true
            prometheusType: Prometheus
            prometheusVersion: 2.48.0
        - name: Loki
          type: loki
          url: http://loki-gateway.logging.svc:80
          access: proxy
          jsonData:
            maxLines: 5000
            timeout: 300
            derivedFields:
              - datasourceUid: VictoriaMetrics
                matcherRegex: "trace_id=(\\w+)"
                name: TraceID
                url: '$${__value.raw}'
        - name: Loki (Streaming)
          type: loki
          url: http://loki-gateway.logging.svc:80
          access: proxy
          jsonData:
            maxLines: 1000
            timeout: 60
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: 'default'
          orgId: 1
          folder: ''
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
        - name: 'infrastructure'
          orgId: 1
          folder: 'Infrastructure'
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/infrastructure
  dashboards:
    default:
      victoriametrics-cluster:
        gnetId: 11176
        revision: 18
        datasource: VictoriaMetrics
      loki-overview:
        gnetId: 13639
        revision: 2
        datasource: Loki
      vector-metrics:
        gnetId: 12539
        revision: 1
        datasource: VictoriaMetrics
  grafana.ini:
    server:
      root_url: https://grafana.internal.company.com
    auth:
      disable_login_form: false
    auth.generic_oauth:
      enabled: true
      name: Keycloak
      client_id: grafana
      client_secret: ${GRAFANA_OAUTH_SECRET}
      scopes: openid profile email groups
      auth_url: https://sso.internal.company.com/realms/company/protocol/openid-connect/auth
      token_url: https://sso.internal.company.com/realms/company/protocol/openid-connect/token
      api_url: https://sso.internal.company.com/realms/company/protocol/openid-connect/userinfo
      role_attribute_path: contains(groups, 'platform-team') && 'Admin' || contains(groups, 'developers') && 'Editor' || 'Viewer'
    feature_toggles:
      enable: tempoSearch,tempoApmTable,traceToLogs
    unified_alerting:
      enabled: true
    alerting:
      enabled: false
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.internal.company.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.internal.company.com
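Once Grafana is up, the provisioned datasources can be verified over its HTTP API; GRAFANA_TOKEN below is a placeholder for a service-account token with admin rights:
curl -s https://grafana.internal.company.com/api/health | jq .
curl -s -H "Authorization: Bearer ${GRAFANA_TOKEN}" \
  https://grafana.internal.company.com/api/datasources | jq '.[].name'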
5.2 Log Query Best Practices
Save the team's most-used LogQL queries as a dashboard:
# grafana-dashboard-logs.json (ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards-logs
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  application-logs.json: |
    {
      "title": "Application Logs",
      "panels": [
        {
          "title": "Error Logs by App",
          "type": "logs",
          "targets": [
            {
              "expr": "{namespace=\"production\", level=\"error\"} |= \"$search\"",
              "refId": "A"
            }
          ]
        },
        {
          "title": "Error Rate by App",
          "type": "timeseries",
          "targets": [
            {
              "expr": "sum by (app) (rate({namespace=\"production\", level=\"error\"}[5m]))",
              "legendFormat": "{{ app }}",
              "refId": "A"
            }
          ]
        },
        {
          "title": "Log Volume",
          "type": "timeseries",
          "targets": [
            {
              "expr": "sum(rate({namespace=\"production\"}[5m])) by (app)",
              "legendFormat": "{{ app }}",
              "refId": "A"
            }
          ]
        }
      ],
      "templating": {
        "list": [
          {
            "name": "namespace",
            "type": "query",
            "query": "label_values(namespace)"
          },
          {
            "name": "app",
            "type": "query",
            "query": "label_values({namespace=\"$namespace\"}, app)"
          },
          {
            "name": "search",
            "type": "textbox",
            "current": {
              "value": ""
            }
          }
        ]
      }
    }
Commonly used LogQL queries:
# Error logs for a given application
{namespace="production", app="order-service", level="error"}
# Search for a specific error string
{namespace="production"} |= "NullPointerException" | json
# Filter on a parsed duration field
{namespace="production", app="payment-service"} | json | duration > 1s
# Error rate
sum(rate({namespace="production", level="error"}[5m])) by (app)
# Extract fields and filter
{namespace="production"} | json | status_code >= 500
# Log volume statistics
sum by (app) (count_over_time({namespace="production"}[1h]))
# Multi-condition query with custom formatting
{namespace="production"} |~ "error|exception|failed" | json | line_format "{{.timestamp}} [{{.level}}] {{.message}}"
6. Migration and Comparison
6.1 Migration Strategy
We migrated incrementally:
Week 1-2: deploy the new stack, dual-write logs
              ┌────────────────┐
              │  Applications  │
              └───────┬────────┘
                      │
              ┌───────▼────────┐
              │     Vector     │
              └───────┬────────┘
                      │
           ┌──────────┴──────────┐
           ▼                     ▼
   ┌───────────────┐     ┌───────────────┐
   │ Elasticsearch │     │     Loki      │
   │  (existing)   │     │     (new)     │
   └───────────────┘     └───────────────┘
Week 3-4: validate the new stack, gradually shift queries over
Week 5-6: stop writing to ES, cut over fully to Loki
Week 7-8: decommission the ES cluster
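During the dual-write weeks we periodically compared counts between the two backends to build confidence before the cutover. A rough consistency check might look like the following; the ES hostname, index pattern, and field names are placeholders for your own:
# Errors in the last hour according to Elasticsearch
curl -s "http://es.internal:9200/logs-production-*/_count" \
  -H 'Content-Type: application/json' \
  -d '{"query":{"bool":{"filter":[{"term":{"level":"error"}},{"range":{"@timestamp":{"gte":"now-1h"}}}]}}}' | jq .count
# The same window according to Loki
logcli instant-query 'sum(count_over_time({namespace="production", level="error"}[1h]))'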
6.2 Cost Comparison
Before and after the migration:

| Metric | ELK | VM + Loki | Change |
|--------|-----|-----------|--------|
| Hardware | | | |
| ES cluster | 15 nodes x 64GB x 2TB SSD | - | - |
| VM/Loki | - | 6 nodes x 64GB + 4 nodes x 20TB HDD | - |
| Monthly hardware cost | ¥45,000 | ¥18,000 | -60% |
| Storage | | | |
| Daily stored volume | 2TB (hot) + 2TB (compressed) | 0.4TB (compressed) | -80% |
| Monthly storage cost | ¥15,000 | ¥3,000 | -80% |
| Operations | | | |
| Headcount | 1.5 FTE | 0.5 FTE | -67% |
| Mean recovery time | ~4 hours | ~30 minutes | -88% |
| Performance | | | |
| Write throughput | 50k EPS | 200k EPS | +300% |
| Query latency (P99) | 5s | 0.5s | -90% |
6.3 Query Performance Comparison
Scenario: query the last hour of error logs for one application
ELK:
  Query time: 2.3s
  Memory: 8GB peak
  Hits: 45,678
Loki:
  Query time: 0.3s
  Memory: 256MB peak
  Hits: 45,678
Scenario: query the last 7 days for a specific keyword
ELK:
  Query time: 15s
  Frequently timed out and required pagination
Loki:
  Query time: 3s
  Label-based pre-filtering keeps the scan small
7. Best Practices
7.1 Log Format Standard
We defined a unified structured log format for all services:
{
  "timestamp": "2024-01-15T10:30:45.123Z",
  "level": "error",
  "service": "order-service",
  "trace_id": "abc123def456",
  "span_id": "xyz789",
  "message": "Failed to process order",
  "error": {
    "type": "PaymentException",
    "message": "Insufficient funds",
    "stack": "..."
  },
  "context": {
    "order_id": "ORD-12345",
    "user_id": "USR-67890",
    "amount": 99.99
  }
}
Example application logger configuration (Go):
// logger/logger.go
package logger

import (
    "os"

    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

func NewLogger() (*zap.Logger, error) {
    config := zap.NewProductionConfig()
    config.EncoderConfig.TimeKey = "timestamp"
    config.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
    config.EncoderConfig.LevelKey = "level"
    config.EncoderConfig.EncodeLevel = zapcore.LowercaseLevelEncoder
    config.EncoderConfig.MessageKey = "message"
    // Add the service name to every entry
    config.InitialFields = map[string]interface{}{
        "service": os.Getenv("SERVICE_NAME"),
    }
    return config.Build()
}

// Usage
func main() {
    logger, _ := NewLogger()
    defer logger.Sync()

    // traceID, orderID and err come from the surrounding request context
    logger.Info("Order processed",
        zap.String("order_id", "ORD-12345"),
        zap.String("trace_id", traceID),
    )

    logger.Error("Payment failed",
        zap.Error(err),
        zap.String("order_id", orderID),
    )
}
7.2 Label Design Principles
Loki's performance depends heavily on label design:
# Good - low-cardinality labels
labels:
  namespace: production
  app: order-service
  level: error
  node: node-01
# Bad - high-cardinality labels (do not do this!)
labels:
  user_id: USR-12345     # user ID cardinality is far too high
  request_id: abc123     # request ID cardinality is far too high
  pod_name: order-xxx    # pod names churn constantly
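logcli can quantify this directly: its series analysis reports how many unique values each label contributes, which makes a bad label obvious before it becomes a production problem:
logcli series --analyze-labels '{namespace="production"}'
# Labels with thousands of unique values (user_id, request_id, ...) belong in
# the log line itself and should be queried with `| json`, not used as labels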
7.3 Alerting Rules
# Loki alerting rules
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-alerting-rules
  namespace: logging
data:
  rules.yaml: |
    groups:
      - name: application-alerts
        rules:
          - alert: HighErrorRate
            expr: |
              sum(rate({level="error"}[5m])) by (app) > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High error rate for {{ $labels.app }}"
              description: "Error rate is {{ $value }} errors/sec"
          - alert: NoLogsReceived
            expr: |
              absent_over_time({namespace="production"}[5m]) == 1
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "No logs received from production"
          - alert: LogVolumeSpike
            expr: |
              sum(rate({namespace="production"}[5m])) /
              sum(rate({namespace="production"}[1h] offset 1d)) > 2
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Log volume 2x higher than yesterday"
7.4 性能调优清单
# Performance tuning checklist
vector:
- buffer_size: 根据内存调整,建议 256MB
- batch_size: Loki 建议 10MB
- batch_timeout: 5-10s 平衡延迟和效率
loki:
- ingester_memory: 每个 ingester 建议 16GB+
- chunk_idle_period: 30m 平衡写入和压缩
- max_query_parallelism: 根据 CPU 调整
- split_queries_by_interval: 15m 加速大范围查询
- memcached: 必须开启,显著提升查询性能
victoriametrics:
- dedup_interval: 设置为采集间隔
- merge_concurrency: 根据 CPU 核数调整
- search_cache: 建议开启 timestamp offset
storage:
- use_ssd: 热数据用 NVMe SSD
- compression: 开启 zstd 压缩
- retention: 根据需求设置,过长会增加存储成本
8. Troubleshooting
8.1 VictoriaMetrics
# Check cluster status
curl http://vmselect:8481/select/0/prometheus/api/v1/status/tsdb
# Check ingestion rate
curl http://vminsert:8480/metrics | grep vm_rows_inserted
# Check storage usage
curl http://vmstorage:8482/metrics | grep vm_data_size
# Force merge (maintenance)
curl -X POST http://vmstorage:8482/internal/force_merge
# Debug slow queries
curl "http://vmselect:8481/select/0/prometheus/api/v1/query?query=up&stats=1"
8.2 Loki
# Check Loki ring status
curl http://loki:3100/ring
# Check readiness
curl http://loki:3100/ready
# Trigger a flush of in-memory chunks
curl -X POST http://loki:3100/flush
# Check compactor status
curl http://loki:3100/compactor/ring
# Debug query
curl -G -s "http://loki:3100/loki/api/v1/query_range" \
--data-urlencode 'query={namespace="production"}' \
--data-urlencode 'start=1705305600000000000' \
--data-urlencode 'end=1705309200000000000' \
--data-urlencode 'limit=100'
# Check stream info
curl -G -s "http://loki:3100/loki/api/v1/series" \
--data-urlencode 'match[]={namespace="production"}'
8.3 Common Problems
Problem 1: Loki writes fail with rate-limit errors
# Check the currently active limits
curl http://loki:3100/config | jq '.limits_config'
# Solution: raise the limits in the config
limits_config:
  ingestion_rate_mb: 200
  ingestion_burst_size_mb: 400
  per_stream_rate_limit: 20MB
Problem 2: Queries time out
# Solution 1: narrow the query range (e.g. the last hour only)
{app="myapp"} | json | __error__=""
# Solution 2: add more specific label filters
{namespace="production", app="myapp", level="error"} | json
# Solution 3: raise the query limits
query_frontend:
  max_outstanding_per_tenant: 8192
query_scheduler:
  max_outstanding_requests_per_tenant: 65536
Problem 3: Storage grows too fast
# Check stream cardinality
logcli series '{}' | wc -l
# If there are too many streams, revisit the label design:
# high-cardinality labels cause stream explosion
9. Conclusion
What We Gained
- Dramatically lower cost: hardware and storage spend dropped 60-80%
- Much better performance: query latency went from seconds to milliseconds
- Simpler operations: no more wrangling ES shards and indices
- A better fit for the Kubernetes ecosystem: the label model maps naturally onto K8s
Caveats
- Loki is not an ES replacement: if you need full-text search, Loki is the wrong tool
- Label design matters: high-cardinality labels will wreck performance
- Standardize the log format: structured logs are what let Loki shine
- Monitor the monitoring: VictoriaMetrics and Loki need to be watched too
Next Steps
- Tempo integration: jump from logs straight to traces
- VictoriaLogs: VictoriaMetrics' own log backend, with even better performance
- OpenTelemetry: unified collection of logs, metrics, and traces
Appendix
Command Cheat Sheet
# VictoriaMetrics
vmctl prometheus ...                  # import data from Prometheus
vmctl remote-read ...                 # import via the remote-read API
vmbackup -src=... -dst=...            # back up data
vmrestore -src=... -dst=...           # restore data
# Loki CLI (logcli)
logcli query '{namespace="production"}'
logcli labels namespace
logcli series '{namespace="production"}'
logcli instant-query 'count_over_time({app="myapp"} | json [5m])'
# Vector
vector validate vector.toml           # validate a config file
vector top                            # live pipeline monitoring
vector test vector.toml               # run config unit tests
# MinIO
mc alias set myminio http://minio:9000 admin password
mc ls myminio/loki-chunks
mc du myminio/loki-chunks
Glossary

| Term | Meaning |
|------|---------|
| TSDB | Time-series database |
| Cardinality | The number of distinct values a label can take |
| Ingester | Loki's write component |
| Querier | Loki's query component |
| Chunk | A block of compressed log data |
| LogQL | Loki's query language |
| PromQL | Prometheus's query language |
| Stream | In Loki, the set of log lines sharing one exact label set |
A logging system at this scale is very hard to get right single-handedly; no one dodges every pitfall alone. During the evaluation and the later tuning rounds, regularly trading notes with peers in the 云栈社区 community sparked many of the monitoring and automation ideas in this post. The move from ELK to VictoriaMetrics + Loki was not just a tool swap; it was a full upgrade in how the team thinks about observability.