1. Overview
1.1 Background
I once handled a ticket that looked completely routine: a microservice needed its base image upgraded. A developer opened the PR, CI passed, code review was approved, I clicked merge and went to refill my water.
By the time I got back, the security team was already on the phone: "The service you just released ships an image with CVE-2023-0286, CVSS 9.8, a critical vulnerability."
My first reaction: how is that possible? Doesn't CI scan our images?
Digging in, we found that the image scan did exist, but it was configured to "warn only, never block". The results sat in the CI logs, and since nobody reads logs when CI is green, the vulnerability walked straight into production.
That incident made our team rethink the image security mechanisms in our CI/CD pipeline from scratch. Over two months we took image scanning from "better than nothing" to "shift left, protect the whole chain, zero tolerance for vulnerabilities". This article is the full write-up of what we learned.
1.2 How It Works
The container image scanning space has many tools, but only a handful of core ideas:
Static scanning (SAST for images)
Scanning after the image is built, before it is pushed to the registry. It covers:
- Known vulnerabilities in OS packages
- Vulnerabilities in application dependency libraries
- Leaked secrets (keys, passwords)
- Image misconfigurations (running as root, unsafe ENTRYPOINT)
Dynamic scanning (DAST for containers)
Runtime security detection: checking whether actual container behavior matches expectations. Out of scope for this article.
Software Bill of Materials (SBOM)
An inventory of every software component in an image, used for tracking and auditing. With supply-chain security more important than ever, an SBOM has become indispensable.
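An SPDX JSON SBOM, the format Trivy emits with `--format spdx-json`, is just a document with a `packages` array, so pulling a component inventory out of it takes only a few lines. A minimal sketch (the `name`/`versionInfo` fields follow the SPDX 2.x JSON schema):

```python
import json

def list_components(sbom: dict) -> list:
    """Extract (name, version) pairs from an SPDX-JSON document."""
    return [
        (p.get("name", ""), p.get("versionInfo", ""))
        for p in sbom.get("packages", [])
    ]

# Typical use: list_components(json.load(open("sbom.json")))
sbom = {"packages": [{"name": "openssl", "versionInfo": "3.0.2"}]}
print(list_components(sbom))  # [('openssl', '3.0.2')]
```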
Vulnerability databases
A scanner is only as good as its vulnerability database. Mainstream tools query several sources at once:
- NVD (National Vulnerability Database)
- The CVE database
- Security advisories from Linux distributions (RHEL, Ubuntu, Debian, etc.)
- Security advisories from language ecosystems (npm, PyPI, Maven, etc.)
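At its core, a scanner's job is to compare each installed package's version against the vulnerable ranges published in these sources. A toy sketch of that matching step (the `Advisory` shape and the single `fixed_in` field are simplifications for illustration; real feeds carry multiple affected ranges per advisory):

```python
from dataclasses import dataclass

def parse_version(v: str) -> tuple:
    """Split a dotted version string into comparable integer parts."""
    return tuple(int(p) for p in v.split("."))

@dataclass
class Advisory:
    cve_id: str
    package: str
    fixed_in: str  # first version containing the fix

def match(installed: dict, advisories: list) -> list:
    """Return CVE IDs for installed packages below the fixed version."""
    hits = []
    for adv in advisories:
        version = installed.get(adv.package)
        if version and parse_version(version) < parse_version(adv.fixed_in):
            hits.append(adv.cve_id)
    return hits

advisories = [Advisory("CVE-2020-14343", "pyyaml", "5.4")]
print(match({"pyyaml": "5.3.1"}, advisories))  # ['CVE-2020-14343']
```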
1.3 When to Use It
Image scanning should be standard practice for every team that runs containers. Concretely:
Mandatory:
- Containerized applications running in production
- Public images used as base images
- Compliance requirements (PCI DSS, HIPAA, China's MLPS, etc.)
- Microservice architectures with a large number of images
Pay special attention when:
- Operating in sensitive industries such as finance, healthcare, or government
- Services are exposed to the public internet
- Services handle user data
- Supply-chain security requirements apply
May be relaxed for:
- Internal dev/test environments (scanning is still recommended, to build the habit)
- Fully internal services with no external exposure (a vulnerability is still a vulnerability)
1.4 Environment
Our stack runs on the environment below; every configuration in this article was validated against it:
# CI/CD Platform
gitlab_version: "16.6"
gitlab_runner: "16.6"
runner_executor: "docker"
# Container Runtime
docker_version: "24.0.7"
containerd_version: "1.7.11"
# Container Registry
registry: "harbor"
harbor_version: "2.9.1"
# Kubernetes (deployment target)
kubernetes_version: "1.28.4"
# Scanning Tools
trivy_version: "0.48.1"
grype_version: "0.74.0"
syft_version: "0.100.0"
# Policy Engine
opa_version: "0.60.0"
Hardware requirements:
Image scanning is CPU-intensive. We recommend:
- GitLab Runner: 4 CPU cores, 8 GB RAM
- Scan cache: 50 GB SSD (for the vulnerability database cache)
- Network: outbound access to the vulnerability database update sources
2. Step-by-Step Setup
2.1 Preparation
2.1.1 Choosing a Scanner
There are plenty of image scanners on the market; we evaluated the mainstream options:
| Tool | Pros | Cons | Our verdict |
|------|------|------|-------------|
| Trivy | Fast, accurate, easy to integrate, free | Enterprise features require payment | Primary CI tool |
| Grype | Built by Anchore, good SBOM integration | Vulnerability DB updates slightly slower | Secondary verification |
| Clair | Long-established, stable | Complex to deploy, dated API | Dropped |
| Snyk | Good fix suggestions, developer-friendly | Free tier is limited | PR scanning |
| Harbor built-in | Integrated with the registry | Relatively basic | Registry-level protection |
We settled on Trivy as the primary tool because it is:
- Open source and free, with an active community
- Fast (a few seconds per scan with a warm cache)
- Able to scan many target types (images, filesystems, Git repositories, Kubernetes)
- Backed by a promptly updated vulnerability database
- Simple to integrate into CI/CD
2.1.2 Installing Trivy
On the GitLab Runner:
#!/bin/bash
# install-trivy.sh
# Install Trivy on GitLab Runner
set -euo pipefail
TRIVY_VERSION="0.48.1"
INSTALL_DIR="/usr/local/bin"
# Download Trivy
curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b ${INSTALL_DIR} v${TRIVY_VERSION}
# Verify installation
trivy --version
# Pre-download vulnerability database (takes a while first time)
trivy image --download-db-only
# Create cache directory
mkdir -p /var/cache/trivy
chmod 755 /var/cache/trivy
# Setup database update cron (update every 6 hours)
cat > /etc/cron.d/trivy-db-update << 'EOF'
0 */6 * * * root /usr/local/bin/trivy image --download-db-only --cache-dir /var/cache/trivy > /var/log/trivy-db-update.log 2>&1
EOF
echo "Trivy installation completed"
Alternatively, run Trivy as a Docker image (recommended; versions are easier to manage):
# Use official Trivy image in CI
image: aquasec/trivy:0.48.1
2.1.3 Caching the Vulnerability Database
The vulnerability database is roughly 100 MB, and downloading it on every scan is slow, so we cache it:
# .gitlab-ci.yml - Trivy cache configuration
variables:
  # GitLab can only cache paths inside the project directory,
  # so the cache dir must live in the workspace
  TRIVY_CACHE_DIR: .trivy-cache
  TRIVY_DB_REPOSITORY: ghcr.io/aquasecurity/trivy-db
# Cache the Trivy database across pipelines
cache:
  key: trivy-db-${CI_COMMIT_REF_SLUG}
  paths:
    - .trivy-cache/
  policy: pull-push
2.2 Core Configuration
2.2.1 Base Scan Configuration
Create a Trivy configuration file so every scan uses the same policy:
# trivy.yaml - Trivy configuration file
# Place in project root or /etc/trivy/trivy.yaml
# Cache settings
cache:
dir: /var/cache/trivy
ttl: 24h
# Database settings
db:
repository: ghcr.io/aquasecurity/trivy-db
skip-update: false
# Scan settings
scan:
  # File patterns given special handling during scanning
  file-patterns:
    - "*.jar"
    - "*.war"
    - "*.ear"
  # Enable these scanners ("security-checks" is the deprecated name)
  scanners:
    - vuln      # Vulnerabilities
    - misconfig # Misconfigurations
    - secret    # Exposed secrets
# Vulnerability settings
vulnerability:
# Only report these severities
severity:
- CRITICAL
- HIGH
- MEDIUM
# Ignore unfixed vulnerabilities in production scans
ignore-unfixed: false
# Vulnerability sources
vuln-type:
- os # OS packages
- library # Application libraries
# Report settings
report:
format: table
output: trivy-report.txt
# Template for custom reports
template: ""
# Ignore file for specific vulnerabilities (use with caution!)
ignorefile: .trivyignore
2.2.2 GitLab CI Integration
This is the complete CI configuration we run in production:
# .gitlab-ci.yml - Complete image scanning pipeline
stages:
- build
- scan
- push
- deploy
variables:
# Docker settings
DOCKER_HOST: tcp://docker:2376
DOCKER_TLS_CERTDIR: "/certs"
DOCKER_DRIVER: overlay2
# Image settings
IMAGE_NAME: ${CI_REGISTRY_IMAGE}/${CI_PROJECT_NAME}
# Bash-style substring expansion (${CI_COMMIT_SHA:0:8}) is not supported in
# GitLab variables; use the predefined 8-character short SHA instead
IMAGE_TAG: ${CI_COMMIT_SHORT_SHA}
# Trivy settings
TRIVY_CACHE_DIR: ".trivy-cache"
TRIVY_SEVERITY: "CRITICAL,HIGH"
TRIVY_EXIT_CODE: "1"
TRIVY_NO_PROGRESS: "true"
# Security policy
BLOCK_ON_CRITICAL: "true"
BLOCK_ON_HIGH: "true"
MAX_HIGH_VULNS: "0"
MAX_MEDIUM_VULNS: "10"
# Reusable templates
.docker-login: &docker-login
before_script:
- echo "${CI_REGISTRY_PASSWORD}" | docker login -u "${CI_REGISTRY_USER}" --password-stdin ${CI_REGISTRY}
# Build stage
build:
stage: build
image: docker:24.0.7
services:
- docker:24.0.7-dind
<<: *docker-login
script:
- docker build -t ${IMAGE_NAME}:${IMAGE_TAG} .
- docker save ${IMAGE_NAME}:${IMAGE_TAG} -o image.tar
artifacts:
paths:
- image.tar
expire_in: 1 hour
tags:
- docker
# Security scanning stage
scan:vulnerability:
stage: scan
image: aquasec/trivy:0.48.1
dependencies:
- build
variables:
GIT_STRATEGY: none
cache:
key: trivy-db
paths:
- ${TRIVY_CACHE_DIR}/
policy: pull-push
script:
# The Trivy container has no Docker CLI, so scan the saved image tarball directly
# Update vulnerability database
- trivy image --download-db-only --cache-dir ${TRIVY_CACHE_DIR}
# Run vulnerability scan
- |
  trivy image \
    --cache-dir ${TRIVY_CACHE_DIR} \
    --severity ${TRIVY_SEVERITY} \
    --exit-code 0 \
    --format table \
    --output trivy-table-report.txt \
    --input image.tar
# Generate JSON report for further processing
- |
  trivy image \
    --cache-dir ${TRIVY_CACHE_DIR} \
    --severity CRITICAL,HIGH,MEDIUM,LOW \
    --exit-code 0 \
    --format json \
    --output trivy-report.json \
    --input image.tar
# Generate SBOM
- |
  trivy image \
    --cache-dir ${TRIVY_CACHE_DIR} \
    --format spdx-json \
    --output sbom.json \
    --input image.tar
# Check policy and decide whether to block
- |
echo "=== Security Scan Results ==="
cat trivy-table-report.txt
CRITICAL_COUNT=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity=="CRITICAL")] | length' trivy-report.json)
HIGH_COUNT=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity=="HIGH")] | length' trivy-report.json)
MEDIUM_COUNT=$(jq '[.Results[]?.Vulnerabilities[]? | select(.Severity=="MEDIUM")] | length' trivy-report.json)
echo ""
echo "=== Vulnerability Summary ==="
echo "Critical: ${CRITICAL_COUNT}"
echo "High: ${HIGH_COUNT}"
echo "Medium: ${MEDIUM_COUNT}"
echo ""
# Policy enforcement
if [ "${BLOCK_ON_CRITICAL}" = "true" ] && [ "${CRITICAL_COUNT}" -gt 0 ]; then
echo "BLOCKED: Found ${CRITICAL_COUNT} CRITICAL vulnerabilities"
exit 1
fi
if [ "${BLOCK_ON_HIGH}" = "true" ] && [ "${HIGH_COUNT}" -gt ${MAX_HIGH_VULNS} ]; then
echo "BLOCKED: Found ${HIGH_COUNT} HIGH vulnerabilities (max allowed: ${MAX_HIGH_VULNS})"
exit 1
fi
if [ "${MEDIUM_COUNT}" -gt ${MAX_MEDIUM_VULNS} ]; then
echo "WARNING: Found ${MEDIUM_COUNT} MEDIUM vulnerabilities (max allowed: ${MAX_MEDIUM_VULNS})"
fi
echo "Security scan passed!"
artifacts:
when: always
paths:
- trivy-table-report.txt
- trivy-report.json
- sbom.json
reports:
  # Note: GitLab's security dashboard expects its own container-scanning
  # schema; to populate it, generate the report with Trivy's GitLab template
  # (--format template --template "@/contrib/gitlab.tpl")
  container_scanning: trivy-report.json
expire_in: 30 days
tags:
- docker
scan:secrets:
stage: scan
image: aquasec/trivy:0.48.1
dependencies:
- build
script:
# Scan the saved image tarball for exposed secrets (no Docker CLI needed)
- |
  trivy image \
    --cache-dir ${TRIVY_CACHE_DIR} \
    --scanners secret \
    --exit-code 1 \
    --format table \
    --input image.tar
allow_failure: false
tags:
- docker
scan:config:
stage: scan
image: aquasec/trivy:0.48.1
script:
# Scan Dockerfile for misconfigurations
- |
trivy config \
--severity CRITICAL,HIGH,MEDIUM \
--exit-code 1 \
--format table \
.
allow_failure: true
tags:
- docker
# Push to registry only if scans pass
push:
stage: push
image: docker:24.0.7
services:
- docker:24.0.7-dind
<<: *docker-login
dependencies:
- build
needs:
- build
- scan:vulnerability
- scan:secrets
script:
- docker load -i image.tar
- docker tag ${IMAGE_NAME}:${IMAGE_TAG} ${IMAGE_NAME}:latest
- docker push ${IMAGE_NAME}:${IMAGE_TAG}
- docker push ${IMAGE_NAME}:latest
only:
- main
- master
- /^release\/.*$/
tags:
- docker
# Deploy only if push succeeds
deploy:production:
stage: deploy
image: bitnami/kubectl:1.28
dependencies: []
needs:
- push
script:
- kubectl set image deployment/${CI_PROJECT_NAME} app=${IMAGE_NAME}:${IMAGE_TAG}
- kubectl rollout status deployment/${CI_PROJECT_NAME} --timeout=300s
environment:
name: production
url: https://app.example.com
only:
- main
- master
when: manual
tags:
- kubernetes
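The jq-based counting in the scan job above can also be expressed in a few lines of Python, which is easier to unit-test than shell. A sketch that applies the same CRITICAL/HIGH thresholds to Trivy's JSON report (field names follow Trivy's JSON output; the threshold defaults mirror the pipeline variables):

```python
from collections import Counter

def count_severities(report: dict) -> Counter:
    """Tally vulnerabilities by severity across all scan results."""
    counts = Counter()
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts

def enforce(report: dict, max_critical: int = 0, max_high: int = 0) -> bool:
    """Return True if the report passes policy, False if it should block."""
    counts = count_severities(report)
    return counts["CRITICAL"] <= max_critical and counts["HIGH"] <= max_high

# Typical use: enforce(json.load(open("trivy-report.json")))
report = {"Results": [{"Vulnerabilities": [{"Severity": "HIGH"}]}]}
print(enforce(report))  # False
```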
2.2.3 Ignore Lists for Vulnerabilities
Sometimes a vulnerability has to be ignored temporarily (a false positive, say, or an issue awaiting an upstream fix). We manage these in a .trivyignore file:
# .trivyignore - Vulnerability ignore list
# Format: CVE-ID [until:YYYY-MM-DD] # reason
# False positive - this CVE doesn't affect our use case
CVE-2023-12345
# Waiting for upstream fix, review on 2024-02-01
CVE-2023-67890 until:2024-02-01 # No fix available, mitigated by WAF
# Disputed vulnerability
CVE-2023-11111 # Disputed by vendor, see https://example.com/issue/123
# Development dependency only, not in production image
CVE-2023-22222 # dev-only: test framework vulnerability
But we enforce strict rules around it:
- Every ignore entry must state an explicit reason
- Ignoring a CRITICAL vulnerability requires security-team approval
- The ignore list is reviewed regularly (every two weeks)
- `until:` gives each entry an expiry date
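These rules are easy to enforce mechanically. A sketch of a small linter for the `.trivyignore` convention above (note the `until:` syntax is our own convention, not native Trivy, and this sketch is stricter than our real file in one way: it expects the reason as a trailing comment on the same line as the CVE):

```python
import re
from datetime import date

LINE_RE = re.compile(
    r"^(?P<cve>CVE-\d{4}-\w+)"                      # CVE ID
    r"(?:\s+until:(?P<until>\d{4}-\d{2}-\d{2}))?"   # optional expiry
    r"(?:\s*#\s*(?P<reason>.+))?$"                  # trailing reason comment
)

def lint_trivyignore(text: str, today: date) -> list:
    """Return a list of problems: missing reasons and expired entries."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and pure comment lines are fine
        m = LINE_RE.match(line)
        if not m:
            problems.append(f"line {lineno}: unparseable entry")
            continue
        if not m.group("reason"):
            problems.append(f"line {lineno}: {m.group('cve')} has no reason")
        until = m.group("until")
        if until and date.fromisoformat(until) < today:
            problems.append(f"line {lineno}: {m.group('cve')} expired on {until}")
    return problems

text = "CVE-2023-67890 until:2024-02-01 # No fix available\nCVE-2023-11111"
# Flags line 1 as expired and line 2 as missing a reason
print(lint_trivyignore(text, date(2024, 3, 1)))
```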
2.2.4 Policy Engine with OPA
For more complex policy decisions we use Open Policy Agent (OPA):
# policy/image-security.rego
# OPA policy for image security scanning
package imagesecurity

import future.keywords.contains
import future.keywords.if
import future.keywords.in

# Default deny
default allow := false

# Configuration
max_critical := 0
max_high := 0
max_medium := 10
max_age_days := 90

# Allow if no policy violations
allow if {
  count(violations) == 0
}

# Collect all violations (with future keywords, partial sets use "contains";
# "violations[msg] if" would define an object rule instead)
violations contains msg if {
  critical_count > max_critical
  msg := sprintf("Found %d critical vulnerabilities (max: %d)", [critical_count, max_critical])
}

violations contains msg if {
  high_count > max_high
  msg := sprintf("Found %d high vulnerabilities (max: %d)", [high_count, max_high])
}

violations contains msg if {
  medium_count > max_medium
  msg := sprintf("Found %d medium vulnerabilities (max: %d)", [medium_count, max_medium])
}

violations contains msg if {
  image_too_old
  msg := sprintf("Base image is older than %d days", [max_age_days])
}

violations contains msg if {
  running_as_root
  msg := "Container is configured to run as root"
}

violations contains msg if {
  has_exposed_secrets
  msg := "Exposed secrets detected in image"
}

# Helper rules
critical_count := count([v | v := input.Results[_].Vulnerabilities[_]; v.Severity == "CRITICAL"])
high_count := count([v | v := input.Results[_].Vulnerabilities[_]; v.Severity == "HIGH"])
medium_count := count([v | v := input.Results[_].Vulnerabilities[_]; v.Severity == "MEDIUM"])

image_too_old if {
  created := time.parse_rfc3339_ns(input.Metadata.Created)
  age_ns := time.now_ns() - created
  age_days := age_ns / (24 * 60 * 60 * 1000000000)
  age_days > max_age_days
}

running_as_root if {
  input.Metadata.Config.User == ""
}

running_as_root if {
  input.Metadata.Config.User == "root"
}

running_as_root if {
  input.Metadata.Config.User == "0"
}

has_exposed_secrets if {
  count(input.Results[_].Secrets) > 0
}
Using the OPA policy in CI:
# .gitlab-ci.yml - OPA policy check
scan:policy:
stage: scan
# The plain OPA image is distroless (no shell); the -debug tag includes a
# busybox shell. jq must also be available, e.g. via a small custom image.
image: openpolicyagent/opa:0.60.0-debug
dependencies:
- scan:vulnerability
script:
# Evaluate policy against scan results
- |
opa eval \
--input trivy-report.json \
--data policy/image-security.rego \
--format pretty \
"data.imagesecurity.allow"
# Get detailed violations
- |
RESULT=$(opa eval \
--input trivy-report.json \
--data policy/image-security.rego \
--format json \
"data.imagesecurity")
echo "${RESULT}" | jq .
ALLOWED=$(echo "${RESULT}" | jq -r '.result[0].expressions[0].value.allow')
if [ "${ALLOWED}" != "true" ]; then
echo "Policy violations detected:"
echo "${RESULT}" | jq -r '.result[0].expressions[0].value.violations[]'
exit 1
fi
tags:
- docker
2.3 Bring-Up and Verification
2.3.1 Local Testing
Test the scan configuration locally before pushing code:
#!/bin/bash
# scripts/local-scan.sh
# Local image scanning for development
set -euo pipefail
IMAGE_NAME="${1:-myapp:latest}"
echo "=== Building image ==="
docker build -t ${IMAGE_NAME} .
echo ""
echo "=== Running vulnerability scan ==="
trivy image \
--severity CRITICAL,HIGH,MEDIUM \
--format table \
${IMAGE_NAME}
echo ""
echo "=== Running secret scan ==="
trivy image \
--scanners secret \
--format table \
${IMAGE_NAME}
echo ""
echo "=== Scanning Dockerfile ==="
trivy config \
--severity CRITICAL,HIGH,MEDIUM \
.
echo ""
echo "=== Generating SBOM ==="
trivy image \
--format spdx-json \
--output sbom.json \
${IMAGE_NAME}
echo ""
echo "Local scan completed. Check sbom.json for full software inventory."
2.3.2 Verifying the Scan Pipeline
#!/bin/bash
# scripts/verify-pipeline.sh
# Verify scanning pipeline is working correctly
set -euo pipefail
echo "=== Verification Checklist ==="
# Check Trivy installation
echo -n "1. Trivy installed: "
if command -v trivy &> /dev/null; then
echo "OK ($(trivy --version | head -1))"
else
echo "FAILED"
exit 1
fi
# Check vulnerability database
echo -n "2. Vulnerability DB: "
DB_AGE=$(find /var/cache/trivy -name "trivy.db" -mtime -1 2>/dev/null | wc -l)
if [ "$DB_AGE" -gt 0 ]; then
echo "OK (updated within 24h)"
else
echo "WARNING (may be outdated)"
fi
# Test scan functionality
echo -n "3. Scan functionality: "
if trivy image alpine:3.19 --severity CRITICAL --exit-code 0 &> /dev/null; then
echo "OK"
else
echo "FAILED"
exit 1
fi
# Check GitLab integration
echo -n "4. GitLab artifacts: "
if [ -f "trivy-report.json" ]; then
echo "OK"
else
echo "NOT FOUND (run pipeline first)"
fi
# Check policy engine
echo -n "5. OPA policy: "
if [ -f "policy/image-security.rego" ]; then
if opa check policy/image-security.rego &> /dev/null; then
echo "OK"
else
echo "FAILED (syntax error)"
exit 1
fi
else
echo "NOT CONFIGURED"
fi
echo ""
echo "=== Verification Complete ==="
2.3.3 Integration Testing
Build a test image with known vulnerabilities and verify that scanning detects and blocks it correctly:
# test/Dockerfile.vulnerable
# DO NOT USE IN PRODUCTION - for testing only
FROM python:3.8-slim-buster
# Install package with known vulnerability
RUN pip install PyYAML==5.3.1 # CVE-2020-14343
# Add a "secret" for testing secret detection
RUN echo "aws_access_key_id=AKIAIOSFODNN7EXAMPLE" > /tmp/creds
# Run as root (misconfiguration)
USER root
CMD ["python", "-c", "print('vulnerable image')"]
# Run test
docker build -f test/Dockerfile.vulnerable -t vulnerable-test .
trivy image --severity CRITICAL,HIGH --exit-code 1 vulnerable-test
# Expected: Should exit with code 1 and report vulnerabilities
3. Example Code and Configurations
3.1 Complete Configuration Examples
3.1.1 Per-Environment Scan Policies
Different environments have different security requirements; we manage them with shared templates:
# .gitlab-ci.yml - Multi-environment scanning
.scan-template: &scan-template
stage: scan
image: aquasec/trivy:0.48.1
cache:
key: trivy-db
paths:
- .trivy-cache/
artifacts:
when: always
paths:
- trivy-report.json
- sbom.json
reports:
container_scanning: trivy-report.json
# Development - warn only, don't block
scan:dev:
<<: *scan-template
variables:
TRIVY_SEVERITY: "CRITICAL,HIGH"
TRIVY_EXIT_CODE: "0" # Don't block
script:
- trivy image --exit-code ${TRIVY_EXIT_CODE} --severity ${TRIVY_SEVERITY} ${IMAGE_NAME}:${IMAGE_TAG}
only:
- /^feature\/.*$/
- /^dev\/.*$/
# Staging - block on critical only
scan:staging:
<<: *scan-template
variables:
TRIVY_SEVERITY: "CRITICAL"
TRIVY_EXIT_CODE: "1" # Block on critical
script:
- |
trivy image \
--exit-code ${TRIVY_EXIT_CODE} \
--severity ${TRIVY_SEVERITY} \
--ignore-unfixed \
${IMAGE_NAME}:${IMAGE_TAG}
only:
- staging
- /^release\/.*$/
# Production - strict policy
scan:production:
<<: *scan-template
variables:
TRIVY_SEVERITY: "CRITICAL,HIGH"
TRIVY_EXIT_CODE: "1" # Block on critical and high
script:
- |
trivy image \
--exit-code ${TRIVY_EXIT_CODE} \
--severity ${TRIVY_SEVERITY} \
${IMAGE_NAME}:${IMAGE_TAG}
# Additional checks for production
- trivy image --scanners secret --exit-code 1 ${IMAGE_NAME}:${IMAGE_TAG}
- trivy config --exit-code 1 .
only:
- main
- master
3.1.2 Harbor Integration
Our registry is Harbor, configured with admission control. The YAML below documents the per-project policy we apply (Harbor itself is configured through its UI and API):
# harbor-config.yaml
# Harbor vulnerability scanning configuration
# Global scanner settings
scanner:
type: trivy
endpoint: http://trivy-adapter:8080
# Project-level policy
projects:
production:
vulnerability_scanning:
enabled: true
scan_on_push: true
prevent_vulnerable: true
severity_threshold: high # Block high and critical
content_trust:
enabled: true
require_signed: true
retention_policy:
rules:
- retain_latest: 10
- retain_days: 90
staging:
vulnerability_scanning:
enabled: true
scan_on_push: true
prevent_vulnerable: true
severity_threshold: critical # Only block critical
content_trust:
enabled: false
retention_policy:
rules:
- retain_latest: 5
- retain_days: 30
development:
vulnerability_scanning:
enabled: true
scan_on_push: true
prevent_vulnerable: false # Warn only
retention_policy:
rules:
- retain_latest: 3
- retain_days: 7
Configure project policy through the Harbor API:
#!/bin/bash
# scripts/configure-harbor-policy.sh
# Configure Harbor vulnerability policy via API
HARBOR_URL="https://harbor.example.com"
HARBOR_USER="admin"
HARBOR_PASS="${HARBOR_ADMIN_PASSWORD}"
PROJECT_NAME="production"
# Get project ID
PROJECT_ID=$(curl -s -u "${HARBOR_USER}:${HARBOR_PASS}" \
"${HARBOR_URL}/api/v2.0/projects?name=${PROJECT_NAME}" | jq -r '.[0].project_id')
# Configure vulnerability policy
curl -X PUT \
-u "${HARBOR_USER}:${HARBOR_PASS}" \
-H "Content-Type: application/json" \
"${HARBOR_URL}/api/v2.0/projects/${PROJECT_ID}" \
-d '{
"metadata": {
"auto_scan": "true",
"prevent_vul": "true",
"severity": "high",
"reuse_sys_cve_allowlist": "true"
}
}'
echo "Policy configured for project ${PROJECT_NAME}"
3.1.3 Kubernetes Admission Control
Use Gatekeeper to enforce the image-scan policy at admission time:
# kubernetes/gatekeeper/constraint-template.yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
name: k8simagescanned
spec:
crd:
spec:
names:
kind: K8sImageScanned
validation:
openAPIV3Schema:
type: object
properties:
registries:
type: array
items:
type: string
maxAge:
type: integer
targets:
- target: admission.k8s.gatekeeper.sh
rego: |
package k8simagescanned
violation[{"msg": msg}] {
container := input.review.object.spec.containers[_]
not image_scanned(container.image)
msg := sprintf("Image %v has not been scanned or scan is outdated", [container.image])
}
violation[{"msg": msg}] {
container := input.review.object.spec.initContainers[_]
not image_scanned(container.image)
msg := sprintf("Init container image %v has not been scanned", [container.image])
}
image_scanned(image) {
# Check if image is from allowed registry
registry := input.parameters.registries[_]
startswith(image, registry)
# In production, you would check against an external API
# that tracks scan status
}
---
# kubernetes/gatekeeper/constraint.yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sImageScanned
metadata:
name: require-scanned-images
spec:
match:
kinds:
- apiGroups: [""]
kinds: ["Pod"]
namespaces:
- production
- staging
parameters:
registries:
- "harbor.example.com/production/"
- "harbor.example.com/staging/"
maxAge: 7 # days
3.2 Real-World Cases
Case 1: Emergency Vulnerability Response (Log4Shell)
When Log4Shell hit in December 2021, we finished scanning and remediating every image within 2 hours. This was our response flow at the time:
#!/bin/bash
# scripts/emergency-scan-log4j.sh
# Emergency scan for Log4Shell (CVE-2021-44228)
set -euo pipefail
REGISTRY="harbor.example.com"
OUTPUT_DIR="/tmp/log4j-scan-$(date +%Y%m%d%H%M%S)"
AFFECTED_IMAGES=()
mkdir -p ${OUTPUT_DIR}
echo "=== Log4Shell Emergency Scan ==="
echo "Started at: $(date)"
echo ""
# Get all images from Harbor
PROJECTS=$(curl -s -u "${HARBOR_USER}:${HARBOR_PASS}" \
"${REGISTRY}/api/v2.0/projects" | jq -r '.[].name')
for PROJECT in ${PROJECTS}; do
echo "Scanning project: ${PROJECT}"
REPOS=$(curl -s -u "${HARBOR_USER}:${HARBOR_PASS}" \
  "${REGISTRY}/api/v2.0/projects/${PROJECT}/repositories" | jq -r '.[].name')
for REPO in ${REPOS}; do
  # Harbor returns repository names as "<project>/<repo>"; the artifacts
  # endpoint wants only the repo part, with slashes double URL-encoded
  REPO_NAME=$(echo "${REPO#${PROJECT}/}" | sed 's|/|%252F|g')
  TAGS=$(curl -s -u "${HARBOR_USER}:${HARBOR_PASS}" \
    "${REGISTRY}/api/v2.0/projects/${PROJECT}/repositories/${REPO_NAME}/artifacts" | jq -r '.[].tags[0].name // empty')
for TAG in ${TAGS}; do
IMAGE="${REGISTRY}/${REPO}:${TAG}"
echo " Scanning: ${IMAGE}"
# Scan for Log4j specifically
RESULT=$(trivy image \
--severity CRITICAL,HIGH \
--format json \
--quiet \
${IMAGE} 2>/dev/null || echo "{}")
# Check for Log4j CVEs
LOG4J_VULNS=$(echo "${RESULT}" | jq -r '
[.Results[]?.Vulnerabilities[]? |
select(.VulnerabilityID | test("CVE-2021-44228|CVE-2021-45046|CVE-2021-45105"))] |
length')
if [ "${LOG4J_VULNS}" -gt 0 ]; then
echo " AFFECTED! Found ${LOG4J_VULNS} Log4j vulnerabilities"
AFFECTED_IMAGES+=("${IMAGE}")
echo "${RESULT}" > "${OUTPUT_DIR}/${REPO//\//_}_${TAG}.json"
fi
done
done
done
echo ""
echo "=== Scan Complete ==="
echo "Total affected images: ${#AFFECTED_IMAGES[@]}"
echo ""
echo "Affected images:"
printf '%s\n' "${AFFECTED_IMAGES[@]}"
# Generate report
cat > "${OUTPUT_DIR}/summary.txt" << EOF
Log4Shell Emergency Scan Report
===============================
Scan Time: $(date)
Total Projects Scanned: $(echo "${PROJECTS}" | wc -l)
Affected Images: ${#AFFECTED_IMAGES[@]}
Affected Image List:
$(printf '%s\n' "${AFFECTED_IMAGES[@]}")
Recommended Actions:
1. Block affected images in Harbor
2. Update base images to patched versions
3. Rebuild and redeploy affected services
4. Monitor for exploitation attempts
EOF
echo ""
echo "Report saved to: ${OUTPUT_DIR}/summary.txt"
# Send alert
if [ ${#AFFECTED_IMAGES[@]} -gt 0 ]; then
curl -X POST "${SLACK_WEBHOOK_URL}" \
-H "Content-Type: application/json" \
-d "{
\"text\": \"Log4Shell Scan Complete: ${#AFFECTED_IMAGES[@]} affected images found. Check ${OUTPUT_DIR}/summary.txt for details.\"
}"
fi
Case 2: Supply-Chain Security (SBOM Tracking)
At one point, an open-source library we depended on was found to ship malicious code. Because we keep full SBOMs, we pinpointed every affected service within 30 minutes:
#!/usr/bin/env python3
# scripts/sbom-search.py
# Search for specific package across all SBOMs
import argparse
import json
import sys
from pathlib import Path
def search_spdx_sbom(sbom_path: str, package_name: str, version_pattern: str = None) -> list:
"""Search for a package in SPDX SBOM format"""
results = []
with open(sbom_path) as f:
sbom = json.load(f)
for package in sbom.get('packages', []):
name = package.get('name', '')
version = package.get('versionInfo', '')
if package_name.lower() in name.lower():
if version_pattern is None or version_pattern in version:
results.append({
'name': name,
'version': version,
'sbom_file': sbom_path,
'image': sbom.get('name', 'unknown'),
'purl': package.get('externalRefs', [{}])[0].get('referenceLocator', '')
})
return results
def search_all_sboms(sbom_dir: str, package_name: str, version_pattern: str = None) -> list:
"""Search across all SBOM files"""
all_results = []
for sbom_file in Path(sbom_dir).rglob('*.json'):
try:
results = search_spdx_sbom(str(sbom_file), package_name, version_pattern)
all_results.extend(results)
except Exception as e:
print(f"Error processing {sbom_file}: {e}", file=sys.stderr)
return all_results
def main():
parser = argparse.ArgumentParser(description='Search for packages in SBOMs')
parser.add_argument('package', help='Package name to search for')
parser.add_argument('--version', help='Version pattern to match')
parser.add_argument('--sbom-dir', default='/var/lib/sbom', help='Directory containing SBOMs')
parser.add_argument('--format', choices=['text', 'json'], default='text', help='Output format')
args = parser.parse_args()
results = search_all_sboms(args.sbom_dir, args.package, args.version)
if args.format == 'json':
print(json.dumps(results, indent=2))
else:
print(f"Search Results for '{args.package}'")
print(f"{'=' * 60}")
print(f"Found {len(results)} matches\n")
for r in results:
print(f"Image: {r['image']}")
print(f" Package: {r['name']}")
print(f" Version: {r['version']}")
print(f" PURL: {r['purl']}")
print()
# Exit with code 1 if matches found (useful for CI)
sys.exit(1 if results else 0)
if __name__ == '__main__':
main()
Usage example:
# Search for compromised package
./scripts/sbom-search.py event-stream --version "3.3.6"
# Output:
# Search Results for 'event-stream'
# ============================================================
# Found 3 matches
#
# Image: harbor.example.com/production/payment-service:v1.2.3
# Package: event-stream
# Version: 3.3.6
# PURL: pkg:npm/event-stream@3.3.6
Case 3: Continuous Compliance (Periodic Audits)
We run a full scan every week and generate a compliance report:
# .gitlab-ci.yml - Weekly compliance scan
compliance:weekly-scan:
stage: scan
image: aquasec/trivy:0.48.1
rules:
- if: $CI_PIPELINE_SOURCE == "schedule"
variables:
REGISTRY: "harbor.example.com"
script:
# Pull and scan all production images
- mkdir -p reports sbom
- |
for IMAGE in $(cat production-images.txt); do
echo "Scanning ${IMAGE}..."
# Full scan
trivy image \
--severity CRITICAL,HIGH,MEDIUM,LOW \
--format json \
--output "reports/${IMAGE//\//_}.json" \
${IMAGE}
# Generate SBOM
trivy image \
--format spdx-json \
--output "sbom/${IMAGE//\//_}.json" \
${IMAGE}
done
# Generate compliance report
- python scripts/generate-compliance-report.py
# Upload to compliance system (the Python script below writes Markdown)
- curl -X POST "${COMPLIANCE_API}/reports" -F "report=@compliance-report.md"
artifacts:
paths:
- reports/
- sbom/
- compliance-report.md
expire_in: 1 year
tags:
- docker
#!/usr/bin/env python3
# scripts/generate-compliance-report.py
# Generate compliance report from scan results
import json
from collections import defaultdict
from datetime import datetime
from pathlib import Path
def generate_report():
report_dir = Path('reports')
summary = {
'scan_date': datetime.now().isoformat(),
'total_images': 0,
'images_with_critical': 0,
'images_with_high': 0,
'total_vulnerabilities': defaultdict(int),
'top_vulnerabilities': [],
'images': []
}
vuln_count = defaultdict(int)
for report_file in report_dir.glob('*.json'):
summary['total_images'] += 1
with open(report_file) as f:
scan_result = json.load(f)
image_summary = {
'name': report_file.stem.replace('_', '/'),
'critical': 0,
'high': 0,
'medium': 0,
'low': 0
}
for result in scan_result.get('Results', []):
for vuln in result.get('Vulnerabilities', []):
severity = vuln.get('Severity', 'UNKNOWN')
image_summary[severity.lower()] = image_summary.get(severity.lower(), 0) + 1
summary['total_vulnerabilities'][severity] += 1
vuln_count[vuln.get('VulnerabilityID')] += 1
if image_summary['critical'] > 0:
summary['images_with_critical'] += 1
if image_summary['high'] > 0:
summary['images_with_high'] += 1
summary['images'].append(image_summary)
# Top 10 most common vulnerabilities
summary['top_vulnerabilities'] = sorted(
vuln_count.items(),
key=lambda x: x[1],
reverse=True
)[:10]
# Generate markdown report
report_md = f"""# Container Security Compliance Report
**Generated:** {summary['scan_date']}
## Executive Summary
| Metric | Value |
|--------|-------|
| Total Images Scanned | {summary['total_images']} |
| Images with Critical Vulnerabilities | {summary['images_with_critical']} |
| Images with High Vulnerabilities | {summary['images_with_high']} |
## Vulnerability Distribution
| Severity | Count |
|----------|-------|
| Critical | {summary['total_vulnerabilities']['CRITICAL']} |
| High | {summary['total_vulnerabilities']['HIGH']} |
| Medium | {summary['total_vulnerabilities']['MEDIUM']} |
| Low | {summary['total_vulnerabilities']['LOW']} |
## Top 10 Most Common Vulnerabilities
| CVE ID | Occurrence Count |
|--------|-----------------|
"""
for cve, count in summary['top_vulnerabilities']:
report_md += f"| {cve} | {count} |\n"
report_md += """
## Compliance Status
Based on our security policy:
"""
if summary['images_with_critical'] > 0:
report_md += "- **NON-COMPLIANT**: Critical vulnerabilities detected\n"
elif summary['images_with_high'] > 0:
report_md += "- **WARNING**: High vulnerabilities detected\n"
else:
report_md += "- **COMPLIANT**: No critical or high vulnerabilities\n"
# Save report
with open('compliance-report.md', 'w') as f:
f.write(report_md)
# Save JSON for further processing
with open('compliance-summary.json', 'w') as f:
json.dump(summary, f, indent=2, default=str)
print("Compliance report generated: compliance-report.md")
# Return exit code based on compliance
return 1 if summary['images_with_critical'] > 0 else 0
if __name__ == '__main__':
exit(generate_report())
4. Best Practices and Caveats
4.1 Best Practices
4.1.1 Shift Security Left
Don't wait until the image is built to scan. Catch problems at earlier stages:
# Early scanning in development workflow
# 1. Scan during code review
scan:pr:
stage: test
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
script:
# Scan Dockerfile
- trivy config --severity HIGH,CRITICAL .
# Scan dependencies before build
- trivy fs --scanners vuln --severity HIGH,CRITICAL .
# 2. Scan base images separately
scan:base-image:
stage: .pre
script:
- |
BASE_IMAGE=$(grep "^FROM" Dockerfile | head -1 | awk '{print $2}')
trivy image --severity CRITICAL ${BASE_IMAGE}
# 3. Build and scan in same job
build-and-scan:
stage: build
script:
# Build
- docker build -t ${IMAGE}:${TAG} .
# Immediate scan
- trivy image --exit-code 1 ${IMAGE}:${TAG}
# Only push if scan passes
- docker push ${IMAGE}:${TAG}
4.1.2 Base Image Management
Maintain a set of security-hardened base images:
# base-images/python-3.11.Dockerfile
# Hardened Python 3.11 base image
FROM python:3.11-slim-bookworm
# Update and install security patches
RUN apt-get update && \
apt-get upgrade -y && \
apt-get install -y --no-install-recommends \
ca-certificates \
curl \
dumb-init && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Create non-root user
RUN groupadd -r appgroup && \
useradd -r -g appgroup -d /app -s /sbin/nologin appuser
# Set secure defaults
ENV PYTHONDONTWRITEBYTECODE=1 \
PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1
# Security hardening (useradd -d does not create /app, so create it explicitly)
RUN mkdir -p /app && \
    chown appuser:appgroup /app && \
    chmod 700 /root && \
    chmod 755 /app
WORKDIR /app
USER appuser
ENTRYPOINT ["/usr/bin/dumb-init", "--"]
A pipeline that updates the base images automatically:
# .gitlab-ci.yml - Base image auto-update
base-image:update:
stage: build
rules:
- if: $CI_PIPELINE_SOURCE == "schedule"
script:
# Build all base images
- |
for DOCKERFILE in base-images/*.Dockerfile; do
IMAGE_NAME=$(basename ${DOCKERFILE} .Dockerfile)
docker build -f ${DOCKERFILE} -t harbor.example.com/base/${IMAGE_NAME}:latest .
done
# Scan all base images
- |
for DOCKERFILE in base-images/*.Dockerfile; do
IMAGE_NAME=$(basename ${DOCKERFILE} .Dockerfile)
trivy image --exit-code 1 --severity CRITICAL,HIGH \
harbor.example.com/base/${IMAGE_NAME}:latest
done
# Push if all scans pass
- |
for DOCKERFILE in base-images/*.Dockerfile; do
IMAGE_NAME=$(basename ${DOCKERFILE} .Dockerfile)
docker push harbor.example.com/base/${IMAGE_NAME}:latest
done
# Notify teams to rebuild their images
- |
curl -X POST "${SLACK_WEBHOOK}" \
-H "Content-Type: application/json" \
-d '{"text": "Base images updated. Please rebuild your applications."}'
4.1.3 Vulnerability Response Process
Establish a clear vulnerability response process:
# vulnerability-response-policy.yaml
# Vulnerability response SLA and procedures
severity_response:
critical:
sla: 24_hours
actions:
- immediate_notification:
channels: [slack, pagerduty, email]
recipients: [security-team, on-call-sre, team-leads]
- block_deployment: true
- mandatory_patch: true
- incident_ticket: required
escalation:
- after_hours: 4
escalate_to: security-manager
- after_hours: 12
escalate_to: cto
high:
sla: 7_days
actions:
- notification:
channels: [slack, email]
recipients: [security-team, team-leads]
- block_deployment: true
- mandatory_patch: true
escalation:
- after_days: 3
escalate_to: security-manager
medium:
sla: 30_days
actions:
- notification:
channels: [slack]
recipients: [development-team]
- block_deployment: false
- recommended_patch: true
low:
sla: 90_days
actions:
- notification:
channels: [weekly-report]
- block_deployment: false
- optional_patch: true
exception_process:
requires:
- security_team_approval: true
- risk_assessment: documented
- mitigation_plan: documented
- review_date: required
max_duration: 30_days
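The SLA windows in this policy are simple to compute against. A sketch of the deadline arithmetic (the hour counts transcribe the `sla` fields above):

```python
from datetime import datetime, timedelta

# SLA windows from the response policy, expressed in hours
SLA_HOURS = {"CRITICAL": 24, "HIGH": 7 * 24, "MEDIUM": 30 * 24, "LOW": 90 * 24}

def patch_deadline(severity: str, found_at: datetime) -> datetime:
    """Deadline by which a vulnerability of this severity must be patched."""
    return found_at + timedelta(hours=SLA_HOURS[severity])

def is_overdue(severity: str, found_at: datetime, now: datetime) -> bool:
    """True once the SLA window for this finding has elapsed."""
    return now > patch_deadline(severity, found_at)

found = datetime(2024, 1, 1, 9, 0)
print(patch_deadline("CRITICAL", found))                 # 2024-01-02 09:00:00
print(is_overdue("HIGH", found, datetime(2024, 1, 10)))  # True
```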
4.1.4 Defense in Depth
Don't rely solely on CI scanning; put checks at every layer:
┌─────────────────────────────────────────────────────────────┐
│ Security Layers │
├─────────────────────────────────────────────────────────────┤
│ 1. IDE/Local │
│ └─> Developer runs local scan before commit │
│ │
│ 2. Pre-commit Hook │
│ └─> Scan Dockerfile and dependencies │
│ │
│ 3. CI Pipeline │
│ └─> Full image scan, block on policy violation │
│ │
│ 4. Container Registry (Harbor) │
│ └─> Scan on push, prevent pull of vulnerable images │
│ │
│ 5. Kubernetes Admission │
│ └─> Gatekeeper policy, only allow scanned images │
│ │
│ 6. Runtime Security │
│ └─> Falco monitors container behavior │
│ │
│ 7. Continuous Monitoring │
│ └─> Weekly full scan, alert on new CVEs │
│ │
└─────────────────────────────────────────────────────────────┘
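Layer 2, for example, can be as small as a pre-commit hook. A sketch, assuming staged Dockerfiles are named `Dockerfile*` (the filename filter is split out as a plain shell function so it can be tested on its own):

```shell
#!/bin/bash
# Sketch of a pre-commit hook (layer 2): scan staged Dockerfiles for
# misconfigurations before the commit lands. Paths are assumptions.
staged_dockerfiles() {
  # Keep only paths whose basename starts with "Dockerfile"
  grep -E '(^|/)Dockerfile' || true
}
run_hook() {
  files=$(git diff --cached --name-only | staged_dockerfiles)
  [ -z "${files}" ] && return 0     # nothing container-related staged
  for f in ${files}; do
    trivy config --severity HIGH,CRITICAL --exit-code 1 "${f}" || return 1
  done
}
# In .git/hooks/pre-commit you would simply call: run_hook
```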
4.2 Caveats
4.2.1 Avoid Being Too Strict
Enforcing a "zero vulnerabilities" policy from day one will grind development to a halt. Tighten gradually instead:
# Phase 1: Awareness (Month 1-2)
# - Scan all images, warn only
# - Build team awareness
TRIVY_EXIT_CODE: "0" # Don't block
# Phase 2: Critical Blocking (Month 3-4)
# - Block CRITICAL only
# - Establish fix workflow
TRIVY_SEVERITY: "CRITICAL"
TRIVY_EXIT_CODE: "1"
# Phase 3: High Blocking (Month 5-6)
# - Block CRITICAL and HIGH
# - Refine exception process
TRIVY_SEVERITY: "CRITICAL,HIGH"
TRIVY_EXIT_CODE: "1"
# Phase 4: Full Enforcement (Month 7+)
# - Full policy enforcement
# - SLA tracking
# - Compliance reporting
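The scan job itself can stay identical across all four phases; only the two variables change. A sketch (it prints the resulting command instead of running it, so the example works without Trivy installed; the image name is illustrative):

```shell
#!/bin/bash
# Sketch: one scan script for every rollout phase. The phase is selected
# purely by TRIVY_SEVERITY / TRIVY_EXIT_CODE from the CI configuration,
# so tightening policy never requires editing the script itself.
scan_cmd() {
  severity="${TRIVY_SEVERITY:-CRITICAL,HIGH}"
  exit_code="${TRIVY_EXIT_CODE:-1}"
  # Echo the command for illustration; in CI you would execute it directly.
  echo "trivy image --severity ${severity} --exit-code ${exit_code} ${1}"
}
# Phase 2 style invocation: block on CRITICAL only
TRIVY_SEVERITY=CRITICAL TRIVY_EXIT_CODE=1 scan_cmd "registry.example.com/app:latest"
```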
4.2.2 Handling False Positives
False positives are unavoidable, so you need a mechanism for dealing with them:
# .trivyignore.yaml - Structured ignore file
vulnerabilities:
# False positive: package is not actually used in runtime
- id: CVE-2023-12345
reason: "Development dependency only, not included in final image"
approved_by: "security-team"
approved_date: "2024-01-15"
review_date: "2024-04-15"
jira_ticket: "SEC-1234"
# Disputed vulnerability
- id: CVE-2023-67890
reason: "Disputed by vendor, see https://github.com/vendor/project/issues/123"
approved_by: "security-team"
approved_date: "2024-01-20"
review_date: "2024-02-20"
# Mitigated by other controls
- id: CVE-2023-11111
reason: "Mitigated by WAF rules and network segmentation"
approved_by: "security-manager"
approved_date: "2024-01-25"
review_date: "2024-03-25"
mitigation_details: "WAF rule ID: waf-rule-789, Network policy: np-secure-api"
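The `review_date` field is only useful if something enforces it. A small sketch that flags entries whose review date has passed (plain awk over the layout shown above; the bookkeeping fields are a team convention, and this checker is likewise an assumption, not a Trivy feature):

```shell
#!/bin/bash
# Sketch: flag ignore-file entries whose review_date has passed, so
# accepted risks get re-examined instead of living forever.
expired_entries() {
  # $1 = ignore file, $2 = today as YYYY-MM-DD
  # ISO dates compare correctly as strings, so awk string comparison works.
  awk -v today="$2" '
    /- id:/        { id = $NF }
    /review_date:/ { gsub(/"/, "", $NF); if ($NF < today) print id }
  ' "$1"
}
```

Run it in a scheduled pipeline and fail (or page) when it prints anything.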
4.2.3 Performance Optimization
Image scans can be slow, especially on large images. Some suggestions:
# Performance optimization for large images
# 1. Use layer caching
variables:
TRIVY_CACHE_DIR: /var/cache/trivy
cache:
key: trivy-${CI_COMMIT_REF_SLUG}
paths:
- ${TRIVY_CACHE_DIR}
# 2. Skip non-essential scans in feature branches
scan:
rules:
- if: $CI_COMMIT_BRANCH == "main"
variables:
SCAN_SCOPE: "full"
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
variables:
SCAN_SCOPE: "quick"
script:
- |
if [ "${SCAN_SCOPE}" = "quick" ]; then
trivy image --severity CRITICAL --skip-files "*.jar" ${IMAGE}
else
trivy image --severity CRITICAL,HIGH,MEDIUM ${IMAGE}
fi
# 3. Parallel scanning for multi-image projects
scan:parallel:
parallel:
matrix:
- IMAGE: ["service-a", "service-b", "service-c"]
script:
- trivy image ${REGISTRY}/${IMAGE}:${TAG}
4.2.4 Keep the Vulnerability Database Fresh
A stale vulnerability database is a common problem:
#!/bin/bash
# scripts/check-db-freshness.sh
# Check if vulnerability database is up to date
CACHE_DIR="/var/cache/trivy"
MAX_AGE_HOURS=12
if [ ! -d "${CACHE_DIR}" ]; then
echo "ERROR: Cache directory not found"
exit 1
fi
DB_FILE=$(find ${CACHE_DIR} -name "trivy.db" -type f 2>/dev/null | head -1)
if [ -z "${DB_FILE}" ]; then
echo "WARNING: Database file not found, will download"
trivy image --download-db-only
exit 0
fi
# Check file age
DB_AGE_SECONDS=$(( $(date +%s) - $(stat -c %Y "${DB_FILE}") ))
DB_AGE_HOURS=$(( DB_AGE_SECONDS / 3600 ))
echo "Database age: ${DB_AGE_HOURS} hours"
if [ ${DB_AGE_HOURS} -gt ${MAX_AGE_HOURS} ]; then
echo "WARNING: Database is older than ${MAX_AGE_HOURS} hours, updating..."
trivy image --download-db-only
else
echo "OK: Database is fresh"
fi
4.2.5 Don't Rely Entirely on Automation
Automation matters, but it cannot replace human review:
- Review the ignore list regularly
- Manually assess impact when new vulnerabilities are disclosed
- Have the security team periodically audit scan configurations
- Follow the security community rather than just waiting for CVEs to be published
5. Troubleshooting and Monitoring
5.1 Troubleshooting
5.1.1 Common Problems
Problem 1: Scan timeouts
# Symptoms: CI job times out during scan
# Solution 1: Increase timeout
variables:
TRIVY_TIMEOUT: "30m"
# Solution 2: Use offline database
trivy image --skip-db-update --cache-dir ${CACHE_DIR} ${IMAGE}
# Solution 3: Reduce scan scope
trivy image --skip-files "*.jar,*.war" ${IMAGE}
Problem 2: Vulnerability database download fails
# Symptoms: "failed to download vulnerability DB"
# Check network connectivity
curl -I https://ghcr.io/v2/
# Use mirror if available
export TRIVY_DB_REPOSITORY=mirror.example.com/trivy-db
# Or use offline mode with pre-downloaded DB
trivy --cache-dir /offline-cache image ${IMAGE}
Problem 3: Out of memory
# Symptoms: OOM killed during scan
# Solution: Limit parallel analysis
trivy image --parallel 2 ${IMAGE}
# For large images, scan layers separately
trivy image --skip-db-update --scanners vuln ${IMAGE}
trivy image --skip-db-update --scanners secret ${IMAGE}
Problem 4: Inconsistent results
# Different results between local and CI
# Check Trivy versions
trivy --version
# Check database versions
trivy version --format json | jq '.VulnerabilityDB'
# Ensure same configuration
diff local-trivy.yaml ci-trivy.yaml
5.1.2 Debugging Tips
# Enable debug logging
trivy image --debug ${IMAGE}
# Show scan progress
trivy image --no-progress=false ${IMAGE}
# Output raw vulnerability data
trivy image --format json ${IMAGE} | jq '.Results[].Vulnerabilities[] | select(.Severity == "CRITICAL")'
# Check what files are being scanned
trivy image --list-all-pkgs ${IMAGE}
5.2 Performance Monitoring
5.2.1 Prometheus Metrics
# prometheus/trivy-metrics.yaml
# Custom metrics for Trivy scanning
groups:
- name: trivy_scanning
rules:
# Scan duration by image
- record: trivy_scan_duration_seconds
expr: |
histogram_quantile(0.95,
sum(rate(gitlab_ci_job_duration_seconds_bucket{job_name=~".*scan.*"}[5m])) by (le, project)
)
# Vulnerabilities found per severity
- record: trivy_vulnerabilities_total
expr: |
sum(trivy_scan_vulnerabilities) by (severity, project)
# Blocked deployments
- record: trivy_blocked_deployments_total
expr: |
sum(increase(gitlab_ci_job_status{status="failed", job_name=~".*scan.*"}[24h])) by (project)
- name: trivy_alerts
rules:
# Alert on critical vulnerabilities
- alert: CriticalVulnerabilityDetected
expr: trivy_scan_vulnerabilities{severity="CRITICAL"} > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Critical vulnerability detected in {{ $labels.image }}"
description: "Found {{ $value }} critical vulnerabilities"
# Alert on scan failures
- alert: ImageScanFailed
expr: increase(gitlab_ci_job_status{status="failed", job_name=~".*scan.*"}[1h]) > 3
for: 10m
labels:
severity: warning
annotations:
summary: "Multiple image scan failures"
description: "{{ $value }} scan failures in the last hour"
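One thing the rules above gloss over: `trivy_scan_vulnerabilities` is not a metric Trivy exports by itself. The CI job has to derive it from the JSON report and push it somewhere Prometheus can scrape, such as a Pushgateway. A dependency-free sketch of that conversion (the Pushgateway URL and job name are assumptions; `jq` would be more robust than `grep` for real use):

```shell
#!/bin/bash
# Sketch: turn a trivy JSON report into Prometheus exposition text for the
# trivy_scan_vulnerabilities metric referenced by the recording rules above.
vulns_metric() {
  # $1 = trivy JSON report, $2 = severity label
  count=$(grep -c "\"Severity\": *\"$2\"" "$1" || true)
  echo "trivy_scan_vulnerabilities{severity=\"$2\"} ${count}"
}
# In CI you would then push, for example:
#   vulns_metric report.json CRITICAL | \
#     curl --data-binary @- http://pushgateway.example.com:9091/metrics/job/trivy
```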
5.2.2 Grafana Dashboard
{
"dashboard": {
"title": "Container Security Dashboard",
"panels": [
{
"title": "Vulnerabilities by Severity",
"type": "piechart",
"targets": [
{
"expr": "sum(trivy_vulnerabilities_total) by (severity)",
"legendFormat": "{{ severity }}"
}
]
},
{
"title": "Scan Duration Trend",
"type": "timeseries",
"targets": [
{
"expr": "avg(trivy_scan_duration_seconds) by (project)",
"legendFormat": "{{ project }}"
}
]
},
{
"title": "Blocked Deployments (24h)",
"type": "stat",
"targets": [
{
"expr": "sum(trivy_blocked_deployments_total)"
}
]
},
{
"title": "Images with Critical Vulnerabilities",
"type": "table",
"targets": [
{
"expr": "trivy_scan_vulnerabilities{severity=\"CRITICAL\"} > 0",
"format": "table"
}
]
}
]
}
}
5.3 Backup and Recovery
5.3.1 Archiving Scan Results
# .gitlab-ci.yml - Archive scan results
scan:archive:
stage: scan
script:
- trivy image --format json --output scan-result.json ${IMAGE}
# Generate the SBOM that after_script and artifacts reference below
- trivy image --format cyclonedx --output sbom.json ${IMAGE}
after_script:
# Archive to S3
- |
aws s3 cp scan-result.json \
s3://security-reports/trivy/${CI_PROJECT_PATH}/${CI_COMMIT_SHA}/scan-result.json
# Archive SBOM
- |
aws s3 cp sbom.json \
s3://security-reports/sbom/${CI_PROJECT_PATH}/${CI_COMMIT_SHA}/sbom.json
artifacts:
paths:
- scan-result.json
- sbom.json
expire_in: 1 year
5.3.2 Backing Up the Vulnerability Database
#!/bin/bash
# scripts/backup-trivy-db.sh
# Backup Trivy vulnerability database
BACKUP_DIR="/var/backup/trivy"
CACHE_DIR="/var/cache/trivy"
DATE=$(date +%Y%m%d)
mkdir -p ${BACKUP_DIR}
# Create backup
tar -czf ${BACKUP_DIR}/trivy-db-${DATE}.tar.gz -C ${CACHE_DIR} .
# Keep only last 7 days
find ${BACKUP_DIR} -name "trivy-db-*.tar.gz" -mtime +7 -delete
# Upload to remote storage
aws s3 sync ${BACKUP_DIR}/ s3://backup-bucket/trivy-db/
5.3.3 Restore Procedure
#!/bin/bash
# scripts/restore-trivy-db.sh
# Restore Trivy vulnerability database from backup
BACKUP_FILE="$1"
CACHE_DIR="/var/cache/trivy"
if [ -z "${BACKUP_FILE}" ]; then
echo "Usage: $0 <backup-file>"
echo "Available backups:"
ls -la /var/backup/trivy/
exit 1
fi
# Stop any running scans (in production, coordinate with CI)
echo "Restoring from ${BACKUP_FILE}..."
# Backup current DB
mv ${CACHE_DIR} ${CACHE_DIR}.old
# Restore
mkdir -p ${CACHE_DIR}
tar -xzf ${BACKUP_FILE} -C ${CACHE_DIR}
# Verify
trivy image --skip-db-update alpine:3.19 &> /dev/null
if [ $? -eq 0 ]; then
echo "Restore successful"
rm -rf ${CACHE_DIR}.old
else
echo "Restore failed, rolling back"
rm -rf ${CACHE_DIR}
mv ${CACHE_DIR}.old ${CACHE_DIR}
exit 1
fi
6. Summary
Image security scanning is as complex or as simple as you make it. The complexity lies in embedding scanning into the whole DevOps pipeline, which means thinking through performance, policy, and exception handling. The simplicity is that the core comes down to three things: pick a good tool, configure sensible policies, and build a response process.
Looking back at our team's journey, there were a few key turning points:
The first was moving from "better than nothing" to "shift-left security". We used to think that having a scan at all was enough, yet vulnerabilities still slipped into production. Now scanning starts at the commit stage: the earlier a problem is found, the cheaper it is to fix.
The second was moving from "warn" to "block". Warnings are easy to ignore, especially when CI passes. It took real effort to convince development teams to accept blocking, but it proved right: paying attention to security is now a habit across the team.
The third was moving from single-point protection to defense in depth. CI scans can be bypassed, Harbor can be misconfigured, and Kubernetes admission control can have gaps. Any single layer can fail, so we placed checks at every critical point.
After running this system for over a year, the results are clear:
- Critical and high vulnerabilities in production dropped by 90%
- Mean time to remediate fell from 2 weeks to 3 days
- Not a single vulnerability has slipped into production since
Security is never a finished state, only a process of continuous improvement. I hope this article saves you a few of the pitfalls we hit. If you would like to discuss operations, security, and DevOps practice with more engineers, you are welcome to drop by the 云栈社区 community.
Appendix
A. Tool Comparison

| Feature | Trivy | Grype | Clair | Snyk |
| --- | --- | --- | --- | --- |
| Open source | Yes | Yes | Yes | Partially |
| Scan speed | Fast | Fast | Medium | Medium |
| Vulnerability DB | Multi-source | Anchore | Multi-source | Proprietary |
| CI integration | Excellent | Excellent | Fair | Excellent |
| SBOM generation | Yes | Yes | No | Yes |
| Config scanning | Yes | No | No | Yes |
| Secret scanning | Yes | No | No | Yes |
| Fix recommendations | Basic | Basic | None | Detailed |
| Enterprise support | Paid | Paid | Paid | Paid |
B. Vulnerability Severity Levels

| Level | CVSS Score | Response Time | Description |
| --- | --- | --- | --- |
| Critical | 9.0-10.0 | 24 hours | Remotely exploitable with no user interaction |
| High | 7.0-8.9 | 7 days | Serious impact, but with preconditions |
| Medium | 4.0-6.9 | 30 days | Exploitable only under specific conditions |
| Low | 0.1-3.9 | 90 days | Limited impact |
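If you ever need to classify raw CVSS scores yourself, the banding above is a small helper (sketch; the function name is illustrative):

```shell
#!/bin/bash
# Sketch: map a CVSS v3 base score to the severity bands in the table
# above. awk handles the floating-point comparison portably.
cvss_level() {
  awk -v s="$1" 'BEGIN {
    if      (s >= 9.0) print "Critical"
    else if (s >= 7.0) print "High"
    else if (s >= 4.0) print "Medium"
    else if (s >= 0.1) print "Low"
    else               print "None"
  }'
}
cvss_level 9.8
```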
C. Common Trivy Commands
# Basic scan
trivy image nginx:1.25
# Scan with severity filter
trivy image --severity CRITICAL,HIGH nginx:1.25
# Generate JSON report
trivy image --format json --output report.json nginx:1.25
# Generate SBOM
trivy image --format spdx-json --output sbom.json nginx:1.25
# Scan filesystem
trivy fs --scanners vuln,secret /path/to/project
# Scan Kubernetes
trivy k8s --report summary cluster
# Scan config files
trivy config --severity HIGH,CRITICAL .
# Update database
trivy image --download-db-only
# Offline scan
trivy image --skip-db-update --cache-dir /offline-cache nginx:1.25
D. References
- Trivy official documentation
- NIST NVD
- CVE database
- Harbor documentation
- OPA Gatekeeper
- SLSA supply-chain security framework
- SBOM format specification (SPDX)