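I. Overview
1.1 Background
Managing Kubernetes resources directly with kubectl means that ten microservices turn into dozens of YAML files, with versioning done through folder names and rollbacks done by swapping files by hand. Helm packages a set of related Kubernetes resources into a Chart and supports templating, versioning, one-command deployment, and rollback. It is the de facto package manager of the Kubernetes ecosystem.
A typical production pain point: on every deployment the development team has to edit the image tag, replica count, and resource limits in a dozen YAML files, and missing one causes an incident. With Helm, these variable parameters move into values.yaml, a deployment is a single command such as helm upgrade -f prod-values.yaml, and CI/CD integration becomes simpler.
This article is based on Helm v3.13.x and covers Chart development, repository management, production deployment, and rollback strategy. Helm v2 reached end of life in 2020 and is not discussed.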
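1.2 Technical Features
- Template engine: built on Go templates, with variables, conditionals, loops, and functions, so one set of templates serves multiple environments
- Release management: every install or upgrade creates a new revision of a Release, the revision history is recorded, and you can roll back to any retained revision with one command
- Dependency management: a Chart can declare dependencies on other Charts; the subcharts are pulled in and installed together with it
- Repository ecosystem: Artifact Hub lists thousands of community Charts, and common middleware (MySQL, Redis, Kafka) works out of the box
1.3 Use Cases
- Scenario 1: multi-environment deployment (dev/staging/prod), where the same Chart is configured through different values files
- Scenario 2: CI/CD pipeline integration, with Jenkins or GitLab CI running helm commands to automate deployment
- Scenario 3: orchestrating complex applications, where one Chart contains a Deployment, Service, ConfigMap, Ingress, and other resources
- Scenario 4: deploying third-party middleware, using community Charts to bring up Prometheus, Grafana, Nginx Ingress, and similar components quickly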
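1.4 Environment Requirements
| Component | Version | Notes |
| --- | --- | --- |
| Kubernetes | 1.25+ | Per the Helm version skew policy, Helm 3.13.x supports Kubernetes 1.25 to 1.28 |
| Helm | 3.13+ | The version this article is based on |
| kubectl | Matching the cluster version | Helm accesses the cluster through kubeconfig |
| OCI Registry | Harbor 2.x / Docker Registry | OCI registry used to store Helm Charts |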
II. Detailed Steps
2.1 Preparation
2.1.1 Installing Helm
# Option 1: official install script
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# Option 2: manual download
wget https://get.helm.sh/helm-v3.13.3-linux-amd64.tar.gz
tar xzf helm-v3.13.3-linux-amd64.tar.gz
mv linux-amd64/helm /usr/local/bin/helm
# Verify
helm version
# version.BuildInfo{Version:"v3.13.3", ...}
# Enable shell completion
helm completion bash > /etc/bash_completion.d/helm
source /etc/bash_completion.d/helm
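A pipeline can fail early when the installed Helm is older than the version this article assumes. The following is a minimal sketch of such a check; the function name version_ge is made up for illustration, and the comparison only looks at the major.minor.patch part (pre-release and build suffixes are stripped). In a real pipeline the first argument would come from the output of helm version --short.

```bash
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B.
# Accepts an optional leading "v"; anything after "-" or "+" is ignored.
version_ge() {
  local a="${1#v}" b="${2#v}"
  a="${a%%[-+]*}"
  b="${b%%[-+]*}"
  local -a pa pb
  IFS=. read -r -a pa <<< "$a"
  IFS=. read -r -a pb <<< "$b"
  local i x y
  for i in 0 1 2; do
    x="${pa[i]:-0}"
    y="${pb[i]:-0}"
    if (( 10#$x > 10#$y )); then return 0; fi
    if (( 10#$x < 10#$y )); then return 1; fi
  done
  return 0
}

# Hard-coded sample value; replace with the real helm output in a pipeline.
if version_ge "v3.13.3" "3.13.0"; then
  echo "Helm version OK"
fi
```

The components are compared numerically, so 3.8.0 correctly sorts below 3.13.0, which a plain string comparison would get wrong.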
2.1.2 Configuring Chart Repositories
# Add commonly used repositories
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
# Refresh the repository index
helm repo update
# List configured repositories
helm repo list
# Search for Charts
helm search repo nginx
helm search repo mysql --versions  # list all versions
# Search Artifact Hub
helm search hub prometheus
2.1.3 Basic Helm Operations
# Inspect a Chart
helm show chart bitnami/nginx
helm show values bitnami/nginx  # show the default values
helm show readme bitnami/nginx
# Install a Chart
helm install my-nginx bitnami/nginx -n web --create-namespace
# List Releases
helm list -A
helm status my-nginx -n web
# Show Release history
helm history my-nginx -n web
# Upgrade a Release
helm upgrade my-nginx bitnami/nginx -n web --set replicaCount=3
# Roll back
helm rollback my-nginx 1 -n web  # roll back to revision 1
# Uninstall
helm uninstall my-nginx -n web
2.2 Core Configuration
2.2.1 Creating a Custom Chart
# Create the Chart skeleton
helm create myapp
# Directory layout
# myapp/
# ├── Chart.yaml              # Chart metadata
# ├── values.yaml             # default configuration values
# ├── charts/                 # dependency Charts
# ├── templates/              # template files
# │   ├── deployment.yaml
# │   ├── service.yaml
# │   ├── ingress.yaml
# │   ├── hpa.yaml
# │   ├── serviceaccount.yaml
# │   ├── configmap.yaml      # added by hand; helm create does not generate it
# │   ├── _helpers.tpl        # template helper functions
# │   ├── NOTES.txt           # message shown after installation
# │   └── tests/
# │       └── test-connection.yaml
# └── .helmignore             # files ignored when packaging
The Chart.yaml definition:
# File: myapp/Chart.yaml
apiVersion: v2
name: myapp
description: Production backend API service
type: application
version: 1.0.0
appVersion: "2.1.0"
keywords:
  - api
  - backend
maintainers:
  - name: ops-team
    email: ops@company.com
dependencies:
  - name: redis
    version: "18.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: redis.enabled
  - name: postgresql
    version: "13.x.x"
    repository: "https://charts.bitnami.com/bitnami"
    condition: postgresql.enabled
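The dependency versions above use wildcard ranges such as "18.x.x", which accept any 18.*.* release of the subchart. The sketch below illustrates that matching rule in plain bash. It is a simplified model, not Helm's actual semver implementation: the function name matches_wildcard is made up, and only three-part numeric versions with "x" wildcards are handled.

```bash
#!/usr/bin/env bash
# matches_wildcard VERSION RANGE: succeeds when VERSION fits a wildcard
# range such as "18.x.x". Each "x" stands for any numeric component.
matches_wildcard() {
  local version="$1" range="$2"
  local -a v r
  IFS=. read -r -a v <<< "$version"
  IFS=. read -r -a r <<< "$range"
  # Only plain MAJOR.MINOR.PATCH versions are supported in this sketch.
  [ "${#v[@]}" -eq 3 ] && [ "${#r[@]}" -eq 3 ] || return 1
  local i
  for i in 0 1 2; do
    [[ "${v[i]}" =~ ^[0-9]+$ ]] || return 1
    [ "${r[i]}" = "x" ] || [ "${r[i]}" = "${v[i]}" ] || return 1
  done
  return 0
}

if matches_wildcard "18.6.1" "18.x.x"; then
  echo "18.6.1 satisfies 18.x.x"
fi
```

Because a range can resolve to a different subchart version over time, committing the Chart.lock file produced by helm dependency update keeps builds reproducible.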
2.2.2 Writing values.yaml
# File: myapp/values.yaml
# Image settings
image:
  repository: registry.company.com/backend/myapp
  tag: ""  # defaults to the Chart's appVersion
  pullPolicy: IfNotPresent
imagePullSecrets:
  - name: registry-secret
# Replica count
replicaCount: 3
# Resource requests and limits
resources:
  requests:
    cpu: 200m
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 1Gi
# Service settings
service:
  type: ClusterIP
  port: 8080
# Ingress settings
ingress:
  enabled: true
  className: nginx
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
  hosts:
    - host: api.company.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: api-tls
      hosts:
        - api.company.com
# HPA autoscaling
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 20
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
# Health checks
livenessProbe:
  httpGet:
    path: /health
    port: http
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: http
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
# Environment variables
env:
  - name: APP_ENV
    value: "production"
  - name: LOG_LEVEL
    value: "info"
  - name: DB_HOST
    valueFrom:
      secretKeyRef:
        name: myapp-db-secret
        key: host
# ConfigMap data
config:
  app.conf: |
    server.port=8080
    server.graceful-shutdown=30s
    cache.ttl=300
# Persistent storage
persistence:
  enabled: false
  storageClass: "ceph-rbd"
  size: 10Gi
# Pod scheduling
nodeSelector: {}
tolerations: []
affinity: {}
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: myapp
# Switches for dependency services
redis:
  enabled: true
  architecture: standalone
  auth:
    password: "redis-password"  # placeholder only; see 4.1.2 on keeping secrets out of values
postgresql:
  enabled: false
2.2.3 Writing Template Files
Deployment template:
# File: myapp/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "myapp.fullname" . }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  {{- if not .Values.autoscaling.enabled }}
  replicas: {{ .Values.replicaCount }}
  {{- end }}
  selector:
    matchLabels:
      {{- include "myapp.selectorLabels" . | nindent 6 }}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0
  template:
    metadata:
      annotations:
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
      labels:
        {{- include "myapp.selectorLabels" . | nindent 8 }}
    spec:
      {{- with .Values.imagePullSecrets }}
      imagePullSecrets:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      serviceAccountName: {{ include "myapp.serviceAccountName" . }}
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: {{ .Chart.Name }}
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}
          ports:
            - name: http
              containerPort: {{ .Values.service.port }}
              protocol: TCP
          {{- with .Values.env }}
          env:
            {{- toYaml . | nindent 12 }}
          {{- end }}
          {{- with .Values.livenessProbe }}
          livenessProbe:
            {{- toYaml . | nindent 12 }}
          {{- end }}
          {{- with .Values.readinessProbe }}
          readinessProbe:
            {{- toYaml . | nindent 12 }}
          {{- end }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
          volumeMounts:
            - name: config
              mountPath: /app/config
              readOnly: true
            {{- if .Values.persistence.enabled }}
            - name: data
              mountPath: /app/data
            {{- end }}
      volumes:
        - name: config
          configMap:
            name: {{ include "myapp.fullname" . }}-config
        {{- if .Values.persistence.enabled }}
        - name: data
          persistentVolumeClaim:
            claimName: {{ include "myapp.fullname" . }}-data
        {{- end }}
      {{- with .Values.nodeSelector }}
      nodeSelector:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.tolerations }}
      tolerations:
        {{- toYaml . | nindent 8 }}
      {{- end }}
      {{- with .Values.topologySpreadConstraints }}
      topologySpreadConstraints:
        {{- toYaml . | nindent 8 }}
      {{- end }}
Notes:
checksum/config annotation: the annotation holds a hash of the rendered ConfigMap, so a change in ConfigMap content changes the Pod template and triggers a rolling update on the next helm upgrade. Without it, the ConfigMap is updated but the Pods are not restarted.
maxUnavailable: 0: no Pod may become unavailable during a rolling update, which, together with a working readinessProbe, is the basis for zero-downtime deployment.
securityContext: run as a non-root user, a baseline for production security.
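The mechanism behind the checksum/config annotation can be reproduced outside Helm: hash the configuration text and observe that any change in content yields a different digest. This is a minimal sketch that assumes the GNU coreutils sha256sum command is available; the function name config_checksum is made up, and the config lines are taken from the values.yaml example above.

```bash
#!/usr/bin/env bash
# config_checksum: print the SHA-256 hex digest of the text on stdin,
# comparable to piping a rendered template through sha256sum in Helm.
config_checksum() {
  sha256sum | cut -d' ' -f1
}

old=$(printf 'cache.ttl=300\n' | config_checksum)
new=$(printf 'cache.ttl=600\n' | config_checksum)

# A different digest means a different Pod template annotation, which is
# what makes the Deployment roll its Pods.
if [ "$old" != "$new" ]; then
  echo "config changed: Pod template annotation differs"
fi
```

The same pattern applies to Secrets: add a second annotation that hashes the rendered Secret template.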
Helper templates:
# File: myapp/templates/_helpers.tpl
{{/*
Build the full application name
*/}}
{{- define "myapp.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
{{/*
Common labels
*/}}
{{- define "myapp.labels" -}}
helm.sh/chart: {{ include "myapp.chart" . }}
{{ include "myapp.selectorLabels" . }}
app.kubernetes.io/version: {{ .Chart.AppVersion | quote }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
{{/*
Selector labels
*/}}
{{- define "myapp.selectorLabels" -}}
app.kubernetes.io/name: {{ include "myapp.name" . }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
{{/*
Chart name
*/}}
{{- define "myapp.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
Chart name and version label
*/}}
{{- define "myapp.chart" -}}
{{- printf "%s-%s" .Chart.Name .Chart.Version | replace "+" "_" | trunc 63 | trimSuffix "-" }}
{{- end }}
{{/*
ServiceAccount name.
The values.yaml above does not define a serviceAccount block, so fall back
to an empty dict to avoid a nil pointer error when reading .name.
*/}}
{{- define "myapp.serviceAccountName" -}}
{{- $sa := .Values.serviceAccount | default dict }}
{{- default (include "myapp.fullname" .) $sa.name }}
{{- end }}
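The naming rule in myapp.fullname is easier to reason about with concrete inputs. The sketch below re-implements the main branch of that helper in bash for illustration only; it skips the fullnameOverride and nameOverride branches, and the function name fullname is made up. The 63-character limit exists because Kubernetes resource names and label values are bounded by DNS label length.

```bash
#!/usr/bin/env bash
# fullname RELEASE CHART: emulate the naming rule of the "myapp.fullname"
# helper (without the fullnameOverride/nameOverride branches).
fullname() {
  local release="$1" chart="$2" name
  if [[ "$release" == *"$chart"* ]]; then
    # Release name already contains the chart name: do not repeat it.
    name="$release"
  else
    name="${release}-${chart}"
  fi
  name="${name:0:63}"   # trunc 63
  name="${name%-}"      # trimSuffix "-" removes one trailing dash
  printf '%s\n' "$name"
}

fullname myapp-prod myapp   # release name already contains "myapp"
fullname web myapp          # chart name is appended
```

With the release names used later in this article (myapp-dev, myapp-prod), the resources are therefore named after the release alone.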
2.2.4 Per-Environment values Files
# File: values-dev.yaml (development overrides)
replicaCount: 1
image:
  tag: "latest"
  pullPolicy: Always
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
ingress:
  enabled: true
  hosts:
    - host: api-dev.company.com
      paths:
        - path: /
          pathType: Prefix
  tls: []
autoscaling:
  enabled: false
# Note: Helm merges maps key by key but replaces lists wholesale, so this
# env list replaces the whole default list. Entries from values.yaml that
# are still needed (for example DB_HOST) must be repeated here.
env:
  - name: APP_ENV
    value: "development"
  - name: LOG_LEVEL
    value: "debug"
# File: values-prod.yaml (production overrides)
replicaCount: 5
image:
  tag: "2.1.0"
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: "2"
    memory: 2Gi
autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 30
# The same list-replacement rule applies to env here.
env:
  - name: APP_ENV
    value: "production"
  - name: LOG_LEVEL
    value: "warn"
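2.2.5 Packaging and Publishing the Chart
# Update dependencies (resolves versions and rewrites Chart.lock)
helm dependency update myapp/
# Rebuild charts/ from the existing Chart.lock (not needed right after an update)
helm dependency build myapp/
# Render the templates locally (nothing is deployed; only the generated YAML is shown)
helm template myapp-release myapp/ -f values-prod.yaml -n production
# Lint
helm lint myapp/
# Package
helm package myapp/
# Output: myapp-1.0.0.tgz
# Push to an OCI registry (Harbor); run helm registry login first
helm push myapp-1.0.0.tgz oci://harbor.company.com/helm-charts
# Push to ChartMuseum
curl --data-binary "@myapp-1.0.0.tgz" http://chartmuseum.company.com/api/charts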
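2.3 Deployment and Verification
2.3.1 Deploying to Each Environment
# Deploy to the development environment
helm upgrade --install myapp-dev myapp/ \
  -f values-dev.yaml \
  -n dev --create-namespace \
  --wait --timeout 5m
# Deploy to the production environment
helm upgrade --install myapp-prod myapp/ \
  -f values-prod.yaml \
  -n production --create-namespace \
  --wait --timeout 10m \
  --atomic
# --atomic: on failure, roll back automatically to the previous revision (implies --wait)
# --wait: the operation succeeds only once all Pods are Ready
# --timeout: time limit; exceeding it counts as a failure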
2.3.2 Verifying the Deployment
# Show Release status
helm status myapp-prod -n production
# Show the resources that were actually generated
helm get manifest myapp-prod -n production
# Show the user-supplied values
helm get values myapp-prod -n production
# Show everything
helm get all myapp-prod -n production
# Verify that the Pods are running
kubectl get pods -n production -l app.kubernetes.io/name=myapp
kubectl get svc -n production -l app.kubernetes.io/name=myapp
kubectl get ingress -n production
2.3.3 Revision Management and Rollback
# Show Release history
helm history myapp-prod -n production
# REVISION  UPDATED              STATUS      CHART        APP VERSION  DESCRIPTION
# 1         2024-01-15 10:00:00  superseded  myapp-1.0.0  2.0.0        Install complete
# 2         2024-01-20 14:30:00  superseded  myapp-1.1.0  2.1.0        Upgrade complete
# 3         2024-01-25 09:00:00  deployed    myapp-1.2.0  2.2.0        Upgrade complete
# Roll back to a specific revision
helm rollback myapp-prod 2 -n production --wait
# Preview what an upgrade would change compared with the deployed Release
helm diff upgrade myapp-prod myapp/ -f values-prod.yaml -n production
# Compare two existing revisions
helm diff revision myapp-prod 2 3 -n production
# Both require the helm-diff plugin: helm plugin install https://github.com/databus23/helm-diff
III. Example Code and Configuration
3.1 Complete Configuration Examples
3.1.1 Ingress Template
# File: myapp/templates/ingress.yaml
{{- if .Values.ingress.enabled -}}
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "myapp.fullname" . }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
  {{- with .Values.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
spec:
  {{- if .Values.ingress.className }}
  ingressClassName: {{ .Values.ingress.className }}
  {{- end }}
  {{- if .Values.ingress.tls }}
  tls:
    {{- range .Values.ingress.tls }}
    - hosts:
        {{- range .hosts }}
        - {{ . | quote }}
        {{- end }}
      secretName: {{ .secretName }}
    {{- end }}
  {{- end }}
  rules:
    {{- range .Values.ingress.hosts }}
    - host: {{ .host | quote }}
      http:
        paths:
          {{- range .paths }}
          - path: {{ .path }}
            pathType: {{ .pathType }}
            backend:
              service:
                name: {{ include "myapp.fullname" $ }}
                port:
                  number: {{ $.Values.service.port }}
          {{- end }}
    {{- end }}
{{- end }}
3.1.2 HPA Template
# File: myapp/templates/hpa.yaml
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ include "myapp.fullname" . }}
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ include "myapp.fullname" . }}
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    {{- if .Values.autoscaling.targetCPUUtilizationPercentage }}
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
    {{- end }}
    {{- if .Values.autoscaling.targetMemoryUtilizationPercentage }}
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetMemoryUtilizationPercentage }}
    {{- end }}
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
        - type: Pods
          value: 4
          periodSeconds: 60
      selectPolicy: Max
{{- end }}
Note: scaleDown.stabilizationWindowSeconds: 300 makes the HPA wait 5 minutes before scaling down, which prevents traffic fluctuations from causing repeated scale-down and scale-up cycles. Do not set this value too low in production: in the author's testing, a value of 60 seconds led to more than 200 scaling events in a single day.
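To get a feel for how fast the scaleUp policies above let the workload grow, the arithmetic can be worked through by hand. The sketch below is a simplified model, not the HPA controller's algorithm: it ignores the stabilization window and the controller's bookkeeping of replicas at the start of each period, and the function name max_after_scale_up is made up. With selectPolicy: Max, the larger of the two policy allowances applies in each 60-second period.

```bash
#!/usr/bin/env bash
# max_after_scale_up CURRENT [MAX_REPLICAS]: upper bound on the replica
# count after one 60s period under the policies "Percent 50" and "Pods 4"
# with selectPolicy Max. The result is capped at MAX_REPLICAS when given.
max_after_scale_up() {
  local current="$1" max_replicas="${2:-0}"
  local by_percent=$(( (current * 50 + 99) / 100 ))  # 50% of current, rounded up
  local by_pods=4
  local step=$(( by_percent > by_pods ? by_percent : by_pods ))
  local result=$(( current + step ))
  if (( max_replicas > 0 && result > max_replicas )); then
    result="$max_replicas"
  fi
  echo "$result"
}

max_after_scale_up 4       # small deployment: the "Pods 4" policy dominates
max_after_scale_up 20      # larger deployment: the "Percent 50" policy dominates
max_after_scale_up 16 20   # capped by maxReplicas
```

The Pods policy matters mainly at low replica counts, where 50% of the current size would otherwise allow only one or two extra Pods per minute.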
3.1.3 CI/CD Integration Script
#!/bin/bash
# File: deploy.sh
# Deployment script called from GitLab CI or Jenkins
set -euo pipefail
# Arguments
RELEASE_NAME="${1:?Usage: deploy.sh <release-name> <namespace> <env>}"
NAMESPACE="${2:?}"
ENV="${3:?}"
CHART_PATH="./helm/myapp"
VALUES_FILE="./helm/values-${ENV}.yaml"
IMAGE_TAG="${CI_COMMIT_SHORT_SHA:-latest}"
echo "=== Deploying ${RELEASE_NAME} to ${NAMESPACE} (${ENV}) ==="
echo "Image tag: ${IMAGE_TAG}"
# Lint
helm lint "${CHART_PATH}" -f "${VALUES_FILE}"
# Dry-run validation
helm upgrade --install "${RELEASE_NAME}" "${CHART_PATH}" \
  -f "${VALUES_FILE}" \
  -n "${NAMESPACE}" \
  --set image.tag="${IMAGE_TAG}" \
  --dry-run
# Actual deployment
helm upgrade --install "${RELEASE_NAME}" "${CHART_PATH}" \
  -f "${VALUES_FILE}" \
  -n "${NAMESPACE}" --create-namespace \
  --set image.tag="${IMAGE_TAG}" \
  --wait --timeout 10m \
  --atomic \
  --history-max 10
echo "=== Deployment completed ==="
helm status "${RELEASE_NAME}" -n "${NAMESPACE}"
3.2 Real-World Cases
Case 1: Deploying the Prometheus monitoring stack with Helm
Scenario: use the kube-prometheus-stack Chart to deploy Prometheus, Grafana, and AlertManager in one step, instead of maintaining dozens of YAML files by hand.
Implementation:
# Add the repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# File: prometheus-values.yaml
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: ceph-rbd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
    resources:
      requests:
        cpu: "1"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 4Gi
grafana:
  adminPassword: "your-secure-password"
  persistence:
    enabled: true
    storageClassName: ceph-rbd
    size: 10Gi
  ingress:
    enabled: true
    ingressClassName: nginx
    hosts:
      - grafana.company.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.company.com
alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: ceph-rbd
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
  config:
    route:
      receiver: 'slack-notifications'
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: 'slack-notifications'
        slack_configs:
          - api_url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
            channel: '#alerts'
            send_resolved: true
# Deploy
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
  -f prometheus-values.yaml \
  -n monitoring --create-namespace \
  --wait --timeout 10m
# Verify
kubectl get pods -n monitoring
helm list -n monitoring
Case 2: Database migration with Helm Hooks
Scenario: an application upgrade must run the database schema migration first, and the new version is deployed only after the migration succeeds. This is implemented with Helm's pre-upgrade Hook.
Implementation:
# File: myapp/templates/db-migration-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ include "myapp.fullname" . }}-db-migrate
  labels:
    {{- include "myapp.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "-5"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 300
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag | default .Chart.AppVersion }}"
          command: ["./migrate", "--direction=up"]
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: myapp-db-secret
                  key: host
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: myapp-db-secret
                  key: password
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
Notes:
- helm.sh/hook: pre-upgrade: the Job runs before the upgrade is applied. It does not run on the first install; add pre-install to the annotation if the initial installation needs the migration as well.
- helm.sh/hook-weight: "-5": when there are several Hooks, they are sorted by weight and the lowest number runs first.
- helm.sh/hook-delete-policy: before-hook-creation,hook-succeeded: the previous Job is deleted before the next run, and the Job is also deleted after it succeeds.
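The effect of hook weights is simply an ascending numeric sort. As an illustration, the sketch below orders a few hypothetical hooks (the names warm-cache, db-migrate, and schema-check are invented for this example) the way their weights would order them. Ties are broken by name here as a simplification; Helm also takes the resource kind into account.

```bash
#!/usr/bin/env bash
# order_hooks: read "WEIGHT NAME" lines on stdin and print the names in
# execution order (ascending weight; ties broken by name in this sketch).
order_hooks() {
  sort -k1,1n -k2,2 | awk '{print $2}'
}

# Hypothetical hooks with their helm.sh/hook-weight values.
printf '%s\n' \
  "5 warm-cache" \
  "-5 db-migrate" \
  "0 schema-check" | order_hooks
```

Negative weights are allowed, which is why the migration Job above uses "-5" to run ahead of hooks that keep the default weight of 0.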
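IV. Best Practices and Caveats
4.1 Best Practices
4.1.1 Performance
- Limit Release history: every upgrade stores a Release record in a Kubernetes Secret. In Helm 3, --history-max defaults to 10 (0 means no limit). Setting it explicitly in CI/CD keeps the limit visible; with no limit, a frequently deployed service can accumulate hundreds of revisions within a year and take up etcd space.
# Limit the number of revisions at deploy time
helm upgrade --install myapp ./myapp -n production --history-max 10
# Inspect the existing revisions
helm history myapp -n production
# Helm has no command for deleting individual revisions; old ones are pruned by --history-max on the next upgrade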
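- Use an OCI registry instead of ChartMuseum: Helm 3.8+ supports storing Charts in OCI format natively, and Harbor 2.x supports it directly. An OCI registry removes the need to run and maintain a separate ChartMuseum service.
# Log in to the OCI registry
helm registry login harbor.company.com
# Push the Chart
helm push myapp-1.0.0.tgz oci://harbor.company.com/helm-charts
# Install from the OCI registry
helm install myapp oci://harbor.company.com/helm-charts/myapp --version 1.0.0
- Cache rendered templates: rendering a large Chart (50+ template files) can take more than 10 seconds, and frequent helm template calls slow down a CI/CD pipeline. Cache the rendered output and re-render only when the Chart or the values change.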
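For the history limit described in the first item of this list, it can help to see which revisions a given --history-max would drop. The sketch below is a simplified model of that pruning, not Helm's storage code: it only keeps the newest MAX entries and does not model details such as how the currently deployed revision is treated. The function name revisions_to_prune is made up.

```bash
#!/usr/bin/env bash
# revisions_to_prune MAX REV...: given revision numbers in ascending order,
# print the ones that fall outside the newest MAX entries.
# MAX=0 means "no limit", so nothing is printed.
revisions_to_prune() {
  local max="$1"; shift
  if (( max == 0 )); then return 0; fi
  local excess=$(( $# - max ))
  local rev
  for rev in "$@"; do
    if (( excess <= 0 )); then break; fi
    echo "$rev"
    excess=$(( excess - 1 ))
  done
}

# 13 revisions with a limit of 10: the three oldest would be dropped.
revisions_to_prune 10 $(seq 1 13)
```

A practical consequence is that helm rollback can only target revisions that are still retained, so the limit also bounds how far back a rollback can go.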
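4.1.2 Security Hardening
- Do not store sensitive data in values: database passwords, API keys, and similar secrets should not be written into values.yaml. Use External Secrets Operator or Sealed Secrets to inject them from an external secret management system.
# Not recommended: password in plain text in values
database:
  password: "my-secret-password"
# Recommended: reference an existing Secret
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: myapp-db-secret  # created by External Secrets
        key: password
- Chart signature verification: sign the Chart package with GPG and verify the signature at install time to detect tampering.
# Sign (GnuPG 2.1+ no longer writes secring.gpg; export it first with
# gpg --export-secret-keys > ~/.gnupg/secring.gpg)
helm package --sign --key 'ops-team' --keyring ~/.gnupg/secring.gpg myapp/
# Verify
helm verify myapp-1.0.0.tgz --keyring ~/.gnupg/pubring.gpg
# Verify during installation
helm install myapp myapp-1.0.0.tgz --verify --keyring ~/.gnupg/pubring.gpg
- Restrict Helm permissions with RBAC: each team may operate Helm Releases only in its own namespace, enforced by limiting ServiceAccount permissions through Kubernetes RBAC.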
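4.1.3 High Availability
- HA option 1: the --atomic flag rolls back automatically when a deployment fails, so the Release is not left half-deployed
- HA option 2: GitOps (ArgoCD + Helm), with the Chart and values stored in a Git repository and synchronized automatically by ArgoCD, which avoids manual mistakes
- Backup strategy: keep Chart sources under Git version control, retain 10 revisions of Release history, and manage values files per environment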
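4.2 Caveats
4.2.1 Configuration Caveats
Warning: helm upgrade modifies cluster resources directly. In production, always run a dry-run first to confirm the changes.
- Note: when helm upgrade is given values (-f or --set), Helm 3 by default starts from the Chart's default values and applies only what this command supplies; values from the previous revision are not carried over. When no values are given, the previous revision's values are reused. --reuse-values merges the previous revision's values with the new overrides, which can carry stale parameters forward. --reset-values forces the default-plus-supplied behaviour, so the supplied values must be complete. In CI/CD, passing the full values file on every upgrade is the most predictable approach.
- Note: helm install and helm upgrade are two separate commands. helm upgrade --install combines them: the first run installs and later runs upgrade. Use this form consistently in CI/CD.
- Note: a Chart's version (Chart version) and appVersion (application version) are independent fields. Bump version whenever the Chart templates change; when only the image tag changes, bumping version is not required, but bumping appVersion is recommended.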
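The version bump mentioned in the last note is easy to automate in a pipeline. The sketch below increments one component of a MAJOR.MINOR.PATCH string in plain bash; the function name bump_version is made up, and pre-release or build suffixes are not handled. How the result is written back into Chart.yaml is left out on purpose.

```bash
#!/usr/bin/env bash
# bump_version VERSION PART: print VERSION with PART (major, minor, or
# patch) incremented and the lower components reset to zero.
bump_version() {
  local version="$1" part="$2" major minor patch
  IFS=. read -r major minor patch <<< "$version"
  case "$part" in
    major) major=$(( major + 1 )); minor=0; patch=0 ;;
    minor) minor=$(( minor + 1 )); patch=0 ;;
    patch) patch=$(( patch + 1 )) ;;
    *) echo "unknown part: $part" >&2; return 1 ;;
  esac
  echo "${major}.${minor}.${patch}"
}

bump_version 1.0.0 minor   # template change that adds a feature
bump_version 1.2.9 patch   # small template fix
```

One possible convention, stated here only as an assumption, is to bump the patch component for template fixes and the minor component when new values keys are introduced.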
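4.2.2 Common Errors
| Symptom | Cause | Fix |
| --- | --- | --- |
| Error: UPGRADE FAILED: another operation is in progress | The previous operation did not finish or was interrupted, leaving the Release in a pending state | Check helm history, then run helm rollback <release> <revision> to return to the last successful revision |
| Error: rendered manifests contain a resource that already exists | The resource exists but is not owned by the current Release | Add the Helm ownership metadata to the existing resource: the label app.kubernetes.io/managed-by: Helm and the annotations meta.helm.sh/release-name and meta.helm.sh/release-namespace |
| Template rendering fails with nil pointer evaluating interface | A required field is missing from values | Guard the access with {{ default "" .Values.xxx }} or {{- if .Values.xxx }}; for nested keys, make sure the parent key exists in values.yaml |
| Pods are not updated after an upgrade | ConfigMap/Secret content changed but the Pods were not restarted | Add the checksum/config annotation to the Deployment template |
| Dependency Chart download fails | The repository is unreachable or the version does not exist | Run helm repo update to refresh the index; check the dependency versions in Chart.yaml |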
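4.2.3 Compatibility
- Version compatibility: Helm 3.13.x supports Kubernetes 1.25 to 1.28; the Release Secret format may differ between Helm versions
- Platform compatibility: the Helm CLI supports Linux, macOS, and Windows, but path separators on Windows can cause template rendering problems
- Component dependencies: the helm-diff plugin version must match the Helm version; ArgoCD's Helm support has its own version requirements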
V. Troubleshooting and Monitoring
5.1 Troubleshooting
5.1.1 Viewing Logs
# Capture Helm operation logs (with debug output)
helm upgrade --install myapp ./myapp -n production --debug 2>&1 | tee helm-debug.log
# Show the Release manifest
helm get manifest myapp -n production
# Show the Release hooks
helm get hooks myapp -n production
# Show the Release notes
helm get notes myapp -n production
# Compare the current Release with the Chart
helm diff upgrade myapp ./myapp -f values-prod.yaml -n production
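5.1.2 Common Problems
Problem 1: the upgrade hangs and fails with a timeout
# Diagnostic commands
helm status myapp -n production
kubectl get pods -n production -l app.kubernetes.io/name=myapp
kubectl describe pod <pending-pod> -n production
Fixes:
- Image pull failure: check the image address and imagePullSecrets
- Insufficient resources for the Pod: check the resources available on the nodes
- readinessProbe failure: check the health check path and port
- Hook Job failure: kubectl logs job/<hook-job-name> -n production
Problem 2: the Release is in the failed state and cannot be upgraded
# Diagnostic commands
helm history myapp -n production
helm status myapp -n production
Fixes:
# Option 1: roll back to the last successful revision
helm rollback myapp <last-successful-revision> -n production
# Option 2: force resource replacement (use with caution)
helm upgrade --install myapp ./myapp -n production --force
# Option 3: uninstall and reinstall (causes a short service interruption)
helm uninstall myapp -n production
helm install myapp ./myapp -f values-prod.yaml -n production
Problem 3: template rendering errors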
5.1.3 Debug Mode
# Helm debug mode
helm install myapp ./myapp --debug --dry-run -n production -f values-prod.yaml
# Show the environment settings Helm uses (including kubeconfig-related variables)
helm env
# Inspect the Helm cache
ls ~/.cache/helm/
# List Helm plugins
helm plugin list
# Validate the Chart structure
helm lint ./myapp --strict
5.2 Monitoring
5.2.1 Key Indicators
# Show all Releases in every state
helm list -A --all
# Show failed Releases
helm list -A --failed
# Show pending Releases
helm list -A --pending
# Count Releases per namespace
helm list -A -o json | jq -r '.[].namespace' | sort | uniq -c | sort -rn
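5.2.2 Indicator Reference
| Indicator | Normal range | Alert threshold | Notes |
| --- | --- | --- | --- |
| Release status | deployed | failed/pending | failed needs investigation; pending means an operation is stuck |
| Release revision count | ≤10 | >20 | Too many revisions take up etcd space |
| Deployment duration | <5 minutes | >10 minutes | Timeouts are usually caused by slow Pod startup or slow image pulls |
| Hook execution time | <2 minutes | >5 minutes | Hooks such as database migrations should not be slow |
| Chart lint warnings | 0 | >0 | Warnings can lead to deployment problems |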
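The first row of the table can be turned into a small check that a cron job or pipeline step might run over the status column of helm list. The sketch below only shows the classification step; the function name classify_release and the three result labels are made up, and feeding it real data (for example from helm list -A -o json) is left to the surrounding script.

```bash
#!/usr/bin/env bash
# classify_release STATUS: map a Helm release status to a monitoring
# verdict following the table above.
classify_release() {
  case "$1" in
    deployed)
      echo "ok" ;;
    failed|pending-install|pending-upgrade|pending-rollback)
      echo "alert" ;;
    *)
      # superseded, uninstalling, uninstalled, unknown, ...
      echo "review" ;;
  esac
}

for status in deployed failed pending-upgrade superseded; do
  printf '%s -> %s\n' "$status" "$(classify_release "$status")"
done
```

A pending status that persists after the deploying process has exited usually indicates an interrupted operation, which matches the "another operation is in progress" error in 4.2.2.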
5.2.3 Prometheus Alerting Rules
# File: helm-alerts.yaml
# Monitors Helm-managed resources through kube-state-metrics.
# Note: kube-state-metrics v2+ exposes Kubernetes labels in kube_deployment_labels
# only when they are allowed through --metric-labels-allowlist.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: helm-release-alerts
  namespace: monitoring
spec:
  groups:
    - name: helm-releases
      rules:
        - alert: HelmReleaseDeploymentUnavailable
          expr: |
            kube_deployment_status_replicas_unavailable{namespace=~"production|staging"}
            * on(deployment, namespace) group_left()
            label_replace(
              kube_deployment_labels{label_app_kubernetes_io_managed_by="Helm"},
              "deployment", "$1", "deployment", "(.*)"
            ) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Helm-managed deployment {{ $labels.deployment }} has unavailable replicas"
        - alert: HelmReleasePodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{namespace=~"production|staging"}[1h]) > 5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is crash looping"
5.3 Backup and Recovery
5.3.1 Backup Strategy
#!/bin/bash
# Helm Release backup script
# File: /opt/scripts/helm-backup.sh
set -euo pipefail
BACKUP_DIR="/data/helm-backup/$(date +%Y%m%d)"
mkdir -p "${BACKUP_DIR}"
# Back up the values and manifest of every Release
# ([*] is required to iterate over all namespaces in the jsonpath expression)
for ns in $(kubectl get ns -o jsonpath='{.items[*].metadata.name}'); do
  for release in $(helm list -n "$ns" -q 2>/dev/null); do
    mkdir -p "${BACKUP_DIR}/${ns}"
    helm get values "$release" -n "$ns" -o yaml > "${BACKUP_DIR}/${ns}/${release}-values.yaml" 2>/dev/null || true
    helm get manifest "$release" -n "$ns" > "${BACKUP_DIR}/${ns}/${release}-manifest.yaml" 2>/dev/null || true
    echo "[$(date)] Backed up: ${ns}/${release}"
  done
done
# Compress
tar czf "/data/helm-backup/helm-backup-$(date +%Y%m%d).tar.gz" -C "/data/helm-backup" "$(date +%Y%m%d)"
rm -rf "${BACKUP_DIR}"
echo "[$(date)] Helm backup completed"
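5.3.2 Recovery Procedure
- Stop the service: notify the teams involved and schedule a maintenance window
- Restore data: helm upgrade --install <release> <chart> -f <backed-up-values.yaml> -n <namespace>
- Verify integrity: helm status <release> -n <namespace>
- Restart the service: confirm that all Pods are Running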
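VI. Summary
6.1 Key Points
- Point 1: helm upgrade --install unifies install and upgrade, and combined with --atomic it rolls back automatically when a deployment fails
- Point 2: split values.yaml by environment (dev/staging/prod); keep sensitive data out of values and inject it with External Secrets
- Point 3: add the checksum/config annotation to the Deployment template so that ConfigMap changes trigger a rolling update of the Pods
- Point 4: --history-max 10 limits the number of Release revisions and keeps etcd usage in check
- Point 5: keep Chart sources under Git version control, and use helm lint and --dry-run in CI/CD to validate before deployment
6.2 Further Learning
- Helmfile for multi-Release orchestration: a declarative tool for managing multiple Helm Releases, where one helmfile.yaml defines every service in an environment
  - Resource: Helmfile GitHub
  - Practical advice: start with a single environment and expand to multi-environment management step by step
- ArgoCD + Helm GitOps: ArgoCD watches the Chart and values in a Git repository and synchronizes changes to the cluster automatically
  - Resource: ArgoCD Helm support
  - Practical advice: practice the GitOps workflow in the staging environment first
- Custom Library Charts: extract the company's common templates into a Library Chart that business Charts reference, reducing duplicated code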
Appendix
A. Command Cheat Sheet
# Repository management
helm repo add <name> <url>              # add a repository
helm repo update                        # refresh the repository index
helm repo list                          # list repositories
helm search repo <keyword>              # search for Charts
# Release management
helm install <release> <chart> -n <ns>                  # install
helm upgrade --install <release> <chart> -n <ns>        # install or upgrade
helm upgrade <release> <chart> -n <ns> -f values.yaml   # upgrade
helm rollback <release> <revision> -n <ns>              # roll back
helm uninstall <release> -n <ns>                        # uninstall
helm list -A                                            # list all Releases
helm history <release> -n <ns>                          # show history
helm status <release> -n <ns>                           # show status
# Chart development
helm create <name>                      # create a Chart skeleton
helm lint <chart-path>                  # lint
helm template <release> <chart>         # render templates locally
helm package <chart-path>               # package
helm push <chart.tgz> <repo>            # push to an OCI registry
helm dependency update <chart>          # update dependencies
# Debugging
helm get values <release> -n <ns>       # show values
helm get manifest <release> -n <ns>     # show the manifest
helm get all <release> -n <ns>          # show everything
helm diff upgrade <release> <chart>     # show differences (requires the plugin)
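B. Configuration Flags in Detail
Common helm upgrade flags:
| Flag | Description | Recommendation |
| --- | --- | --- |
| --install | Install when the Release does not exist | Always use in CI/CD |
| --atomic | Roll back automatically on failure (implies --wait) | Always use in production |
| --wait | Wait until Pods are Ready | Always use in production |
| --timeout | Time limit | 10m for production, 5m for development |
| --history-max | Maximum number of retained revisions (default 10, 0 = no limit) | 10 recommended |
| --dry-run | Simulate the operation | Validate before deploying |
| --debug | Debug output | Use when troubleshooting |
| --force | Force resource replacement | Use with caution |
| --reset-values | Ignore the previous revision's values and start from the Chart defaults | Use when the supplied values are complete |
| --reuse-values | Reuse the previous revision's values and merge in new overrides | Use when changing only a few parameters |
| -f | Specify a values file | Can be repeated; later files override earlier ones |
| --set | Set a value on the command line | Highest precedence |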
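The precedence described for -f and --set boils down to "the last layer that sets a key wins": Chart defaults first, then each -f file in order, then --set. The sketch below models that rule for a single scalar key in plain bash. It is an illustration only (real values are nested YAML, and maps are merged key by key); the function name effective_value and the sample layers are made up.

```bash
#!/usr/bin/env bash
# effective_value KEY LAYER...: each LAYER is a "key=value" pair, listed in
# the order Helm applies them (chart defaults, each -f file, then --set).
# Prints the value from the last layer that sets KEY; fails if none does.
effective_value() {
  local key="$1"; shift
  local layer result="" found=1
  for layer in "$@"; do
    if [ "${layer%%=*}" = "$key" ]; then
      result="${layer#*=}"
      found=0
    fi
  done
  [ "$found" -eq 0 ] && printf '%s\n' "$result"
}

# values.yaml says 3, values-prod.yaml says 5, --set says 8.
effective_value replicaCount \
  "replicaCount=3" \
  "replicaCount=5" \
  "replicaCount=8"
```

This is also why the deploy script in 3.1.3 can pass image.tag with --set on top of the environment's values file: the command-line value overrides whatever tag the file contains.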
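Helm Hook types:
| Hook | When it runs | Typical use |
| --- | --- | --- |
| pre-install | Before install | Create prerequisite resources |
| post-install | After install | Send notifications |
| pre-upgrade | Before upgrade | Database migration |
| post-upgrade | After upgrade | Clear caches |
| pre-delete | Before uninstall | Data backup |
| post-delete | After uninstall | Clean up external resources |
| pre-rollback | Before rollback | Database rollback |
| post-rollback | After rollback | Notifications |
| test | On helm test | Connectivity test |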
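C. Glossary
| Term | Full name | Explanation |
| --- | --- | --- |
| Chart | - | Helm's package format, containing Kubernetes resource templates and configuration |
| Release | - | An installed instance of a Chart; every install or upgrade produces a new revision |
| Repository | - | A Chart repository that stores and distributes Chart packages |
| Values | - | A Chart's configuration parameters, passed through values.yaml or --set |
| Template | - | A Kubernetes resource template in Go template format |
| Hook | - | An action executed at a specific stage of the Release lifecycle |
| Dependency | - | A Chart's dependency on another Chart |
| OCI | Open Container Initiative | The container image standard; Helm 3.8+ supports storing Charts in OCI format |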