写给那些每天被业务方追着要资源、手动处理工单到怀疑人生的运维开发同学
1.1 背景介绍
从事运维开发多年,我目睹了太多同行陷入一种困境:技术能力并不弱,却日复一日地处理重复性工作 —— 开通账号、配置权限、搭建流水线、排查网络问题。业务方一条消息发来,就得立刻放下手头的事情去响应。年复一年,技术栈似乎永远停留在编写 Shell 脚本和维护 Jenkins 的水平。
自 2022 年起,Gartner 连续两年将 Platform Engineering 列入年度技术趋势。这并非炒作概念,而是对行业在实践 DevOps 近十年后暴露出的一个根本性问题的回应:DevOps 把大量运维职责转移给了开发,但多数开发人员既没有精力、也没有意愿去深入管理这些底层设施。
我们团队从 2021 年开始尝试转型,如今已搭建起完整的内部开发者平台,服务于 300 多名开发者,日均处理超 2000 次部署。本文将我们踩过的坑、总结的路线整理出来,希望能为有志于转型的同学提供一份切实的参考。
1.2 什么是 Platform Engineering
简单来说,Platform Engineering 的核心在于**将运维能力产品化**。
传统模式下,开发向运维申请资源,运维进行手动操作,开发则被动等待。这个过程可能耗时数小时甚至数天。
在 Platform Engineering 模式下,运维将这些能力封装成标准化的自助服务,开发人员只需在平台上点击几下即可完成所需操作。至此,运维的角色从“被动的工具人”转变为“主动的平台建设者”。
一个形象的比喻是:过去的运维像是餐厅服务员,客人点什么就上什么;现在的运维则是自助餐厅的设计师,负责规划取餐动线、准备食材、维护设备,而客人可以自行取餐。
与 DevOps 的核心区别在于:
- DevOps:强调开发与运维的融合,要求开发人员具备一定的运维知识。
- Platform Engineering:由运维团队构建统一平台,让开发人员能够专注业务逻辑,实现更清晰的责任划分。
1.3 为什么现在要转型
根据近年来的观察,以下几个变化尤为显著:
技术复杂度爆炸式增长
五年前部署一个应用,可能只需在服务器上安装 JDK、配置 Nginx。而现在呢?Kubernetes、Service Mesh、可观测性三件套、GitOps... 要求每位开发人员都精通这些技术栈,显然不切实际。
认知负荷问题
我们曾做过统计,一名后端开发若要部署一个简单的微服务,需要了解:
- Dockerfile 编写(即便有模板,也常常需要修改)
- Kubernetes 资源定义(Deployment、Service、Ingress、ConfigMap...)
- CI/CD 流水线配置
- 监控告警规则
- 日志采集配置
这些知识与他的核心业务代码开发关系甚微,却必须掌握,导致认知负荷过重。
效率瓶颈
我们团队曾由 5 名运维开发人员服务 20 个业务团队。如果每个需求都手动处理,根本应接不暇。虽然编写了大量自动化脚本,但它们分散各处,维护成本也越来越高。
1.4 适用场景
并非所有团队都需要立即实施 Platform Engineering。根据我们的经验,符合以下条件的团队更为适合:
- 开发团队规模超过 50 人
- 微服务数量超过 30 个
- 每周部署次数超过 100 次
- 运维开发团队至少 3 人
- 已具备基本的容器化和 CI/CD 基础
如果团队仅有十几人、两三个服务,确实没有必要搞得太复杂,编写几个高效的自动化脚本足矣。
1.5 环境要求
转型所需的基础设施概览:
| 组件 | 最低要求 | 推荐配置 |
|------|----------|----------|
| Kubernetes | 1.24+ | 1.28+,至少 3 节点 |
| 代码仓库 | GitLab 14+ / GitHub | GitLab 16+ |
| CI/CD | Jenkins / GitLab CI | ArgoCD + Tekton |
| 监控 | Prometheus + Grafana | VictoriaMetrics + Grafana |
| 日志 | ELK | Loki + Grafana |
| 制品仓库 | Harbor 2.0+ | Harbor 2.9+ |
硬件资源估算(支撑 100 个微服务):
- 平台组件:8C16G * 3(高可用部署)
- 数据存储:500GB SSD(用于监控数据、日志索引)
- 对象存储:2TB(用于制品、日志归档)
## 二、转型路线图
2.1 阶段一:奠定基础(1-2 个月)
此阶段切勿急于编码,首要任务是全面摸清现状。
现状调研
我们当时向所有开发团队发放了一份问卷:
1. 你们每周大概部署多少次?
2. 一次部署从提交代码到上线大概需要多久?
3. 部署过程中遇到的最大痛点是什么?
4. 如果平台只能帮你解决一个问题,你最希望解决的是什么?
5. 你觉得现在的 CI/CD 流程哪里最浪费时间?
收集回来的结果颇具启发性,排名前三的痛点是:
- 等待运维开通资源、配置环境(42%)
- 流水线配置复杂,不知如何修改(31%)
- 出现问题后不知去哪里查看日志(27%)
技术栈统一
调研后发现一个突出问题:各个团队的技术栈五花八门。有的用 Jenkins,有的用 GitLab CI,还有的直接在服务器上运行脚本。Kubernetes 版本更是从 1.18 到 1.26 不等。
统一技术栈是第一步。我们花费了一个月时间进行以下工作:
- 将所有集群升级到 Kubernetes 1.26(当时的稳定版本)
- 将 CI/CD 统一迁移至 GitLab CI
- 镜像仓库统一使用 Harbor
- 监控体系统一为 Prometheus + Grafana
这个过程相当痛苦,需要协调各个业务团队,有些团队坚决不愿改动。后来我们找到了一个有效方法:不强制迁移,但只支持新标准。旧的流水线若能运行则暂时保留,但所有新功能只在新平台上提供。渐渐地,团队们便主动迁移了过来。
团队能力评估
评估团队成员的技术栈构成:
```yaml
# skills_assessment.yaml
team_members:
  - name: "张三"
    skills:
      kubernetes: 4   # 1-5 分
      golang: 3
      python: 5
      frontend: 2
      architecture: 3
  - name: "李四"
    skills:
      kubernetes: 5
      golang: 5
      python: 3
      frontend: 1
      architecture: 4
```
根据评估结果分配后续任务。Platform Engineering 所需的关键技能包括:
- 后端开发能力(至少精通 Go/Python 之一)
- 对 Kubernetes 的深度理解(不仅是会用 kubectl,更要懂其原理)
- 前端基础(平台总得有个用户界面)
- 产品思维(这一点最容易被忽视)
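拿到这份评估后,可以先用一个小脚本汇总各项技能的团队平均分,快速看出整体短板,再决定内部培养还是对外补充。下面是一个读取上面 skills_assessment.yaml 的示意脚本(仅作草稿,文件路径、字段名按实际情况调整):

```go
// tools/skills-report/main.go —— 示意脚本:读取 skills_assessment.yaml,
// 输出每项技能的团队平均分,分数越低越需要补强
package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

type member struct {
	Name   string         `yaml:"name"`
	Skills map[string]int `yaml:"skills"`
}

type assessment struct {
	TeamMembers []member `yaml:"team_members"`
}

func main() {
	data, err := os.ReadFile("skills_assessment.yaml")
	if err != nil {
		fmt.Fprintln(os.Stderr, "read file:", err)
		os.Exit(1)
	}

	var a assessment
	if err := yaml.Unmarshal(data, &a); err != nil {
		fmt.Fprintln(os.Stderr, "parse yaml:", err)
		os.Exit(1)
	}

	// 按技能累加分数并计数
	sum := map[string]int{}
	count := map[string]int{}
	for _, m := range a.TeamMembers {
		for skill, score := range m.Skills {
			sum[skill] += score
			count[skill]++
		}
	}

	for skill, total := range sum {
		fmt.Printf("%-14s 平均分 %.1f\n", skill, float64(total)/float64(count[skill]))
	}
}
```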
2.2 阶段二:核心能力建设(3-6 个月)
此阶段开始搭建平台的核心能力。建议按以下顺序推进:
应用模板标准化
这是投入产出比最高的工作之一。
我们定义了一套应用模板,开发人员只需填写几个参数,复杂的 Kubernetes 配置便会自动生成。
# app_template.yaml
apiVersion: platform.internal/v1
kind: ApplicationTemplate
metadata:
name: springboot-web
labels:
type: web
framework: springboot
spec:
# developer needs to fill these
parameters:
- name: appName
description: "Application name"
type: string
required: true
pattern: "^[a-z][a-z0-9-]{2,30}$"
- name: replicas
description: "Number of replicas"
type: integer
default: 2
minimum: 1
maximum: 10
- name: memory
description: "Memory limit"
type: string
default: "1Gi"
enum: ["512Mi", "1Gi", "2Gi", "4Gi"]
- name: enableIngress
description: "Expose via Ingress"
type: boolean
default: true
- name: healthCheckPath
description: "Health check endpoint"
type: string
default: "/actuator/health"
# generated resources
resources:
deployment:
template: |
apiVersion: apps/v1
kind: Deployment
metadata:
name: {{ .appName }}
labels:
app: {{ .appName }}
version: {{ .version }}
spec:
replicas: {{ .replicas }}
selector:
matchLabels:
app: {{ .appName }}
template:
metadata:
labels:
app: {{ .appName }}
version: {{ .version }}
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/actuator/prometheus"
spec:
containers:
- name: {{ .appName }}
image: harbor.internal/{{ .team }}/{{ .appName }}:{{ .version }}
ports:
- containerPort: 8080
name: http
resources:
requests:
memory: {{ div .memory 2 }}
cpu: "100m"
limits:
memory: {{ .memory }}
cpu: "1000m"
livenessProbe:
httpGet:
path: {{ .healthCheckPath }}
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: {{ .healthCheckPath }}
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
env:
- name: JAVA_OPTS
value: "-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -Xms{{ div .memory 2 }} -Xmx{{ .memory }}"
- name: SPRING_PROFILES_ACTIVE
value: "{{ .environment }}"
自助服务门户
开发人员最反感的就是四处找人、被动等待。我们将常用操作都做成了自助服务:
// internal/service/project.go
package service
import (
	"context"
	"fmt"
	"time"

	gitlab "github.com/xanzy/go-gitlab" // 假设使用 xanzy/go-gitlab 客户端
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	// harbor、log 为内部封装包,导入路径略
)
type ProjectService struct {
k8sClient *kubernetes.Clientset
gitlabClient *gitlab.Client
harborClient *harbor.Client
}
// CreateProject handles the entire project initialization flow
func (s *ProjectService) CreateProject(ctx context.Context, req *CreateProjectRequest) (*Project, error) {
// validate request
if err := req.Validate(); err != nil {
return nil, fmt.Errorf("validation failed: %w", err)
}
// check quota
quota, err := s.getTeamQuota(ctx, req.TeamID)
if err != nil {
return nil, err
}
if quota.RemainingProjects <= 0 {
return nil, ErrQuotaExceeded
}
// create namespace in kubernetes
ns := &corev1.Namespace{
ObjectMeta: metav1.ObjectMeta{
Name: fmt.Sprintf("%s-%s", req.TeamID, req.ProjectName),
Labels: map[string]string{
"platform.internal/team": req.TeamID,
"platform.internal/project": req.ProjectName,
"platform.internal/env": req.Environment,
},
Annotations: map[string]string{
"platform.internal/created-by": req.Creator,
"platform.internal/created-at": time.Now().Format(time.RFC3339),
},
},
}
if _, err := s.k8sClient.CoreV1().Namespaces().Create(ctx, ns, metav1.CreateOptions{}); err != nil {
return nil, fmt.Errorf("failed to create namespace: %w", err)
}
// create resource quota
resourceQuota := &corev1.ResourceQuota{
ObjectMeta: metav1.ObjectMeta{
Name: "default-quota",
Namespace: ns.Name,
},
Spec: corev1.ResourceQuotaSpec{
Hard: corev1.ResourceList{
corev1.ResourceCPU: resource.MustParse("8"),
corev1.ResourceMemory: resource.MustParse("16Gi"),
corev1.ResourcePods: resource.MustParse("20"),
},
},
}
if _, err := s.k8sClient.CoreV1().ResourceQuotas(ns.Name).Create(ctx, resourceQuota, metav1.CreateOptions{}); err != nil {
// rollback namespace creation
s.k8sClient.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{})
return nil, fmt.Errorf("failed to create resource quota: %w", err)
}
// create GitLab project
gitlabProject, err := s.gitlabClient.Projects.CreateProject(&gitlab.CreateProjectOptions{
Name: gitlab.String(req.ProjectName),
NamespaceID: gitlab.Int(req.GitLabGroupID),
Visibility: gitlab.Visibility(gitlab.InternalVisibility),
})
if err != nil {
// rollback
s.k8sClient.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{})
return nil, fmt.Errorf("failed to create GitLab project: %w", err)
}
// create Harbor project
harborReq := &harbor.ProjectReq{
ProjectName: fmt.Sprintf("%s-%s", req.TeamID, req.ProjectName),
Public: false,
StorageLimit: 10 * 1024 * 1024 * 1024, // 10GB
}
if err := s.harborClient.Projects.CreateProject(ctx, harborReq); err != nil {
// rollback
s.k8sClient.CoreV1().Namespaces().Delete(ctx, ns.Name, metav1.DeleteOptions{})
s.gitlabClient.Projects.DeleteProject(gitlabProject.ID)
return nil, fmt.Errorf("failed to create Harbor project: %w", err)
}
// setup CI/CD pipeline (add .gitlab-ci.yml to repo)
if err := s.setupDefaultPipeline(ctx, gitlabProject.ID, req); err != nil {
// log warning but don't fail
log.Warnf("failed to setup default pipeline: %v", err)
}
return &Project{
ID: generateProjectID(),
Name: req.ProjectName,
TeamID: req.TeamID,
Namespace: ns.Name,
GitLabURL: gitlabProject.WebURL,
HarborURL: fmt.Sprintf("harbor.internal/%s-%s", req.TeamID, req.ProjectName),
CreatedAt: time.Now(),
CreatedBy: req.Creator,
}, nil
}
流水线模板化
CI/CD 流水线配置是开发人员吐槽最多的地方。我们制作了几套标准模板,覆盖了 90% 的常见场景。
# .gitlab-ci-templates/springboot.yml
# Standard CI/CD template for Spring Boot applications
variables:
MAVEN_OPTS: "-Dmaven.repo.local=.m2/repository"
DOCKER_DRIVER: overlay2
DOCKER_TLS_CERTDIR: ""
stages:
- test
- build
- scan
- deploy-dev
- deploy-staging
- deploy-prod
cache:
key: ${CI_COMMIT_REF_SLUG}
paths:
- .m2/repository/
- target/
# unit test and code quality
test:
stage: test
image: maven:3.9-eclipse-temurin-17
script:
- mvn clean test -B
- mvn sonar:sonar -Dsonar.host.url=$SONAR_HOST -Dsonar.token=$SONAR_TOKEN
coverage: '/Total.*?([0-9]{1,3})%/'
artifacts:
when: always
reports:
junit: target/surefire-reports/*.xml
coverage_report:
coverage_format: cobertura
path: target/site/jacoco/jacoco.xml
rules:
- if: $CI_PIPELINE_SOURCE == "merge_request_event"
- if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
# build docker image
build:
stage: build
image: docker:24-dind
services:
- docker:24-dind
before_script:
- echo $HARBOR_PASSWORD | docker login harbor.internal -u $HARBOR_USER --password-stdin
script:
- |
VERSION=${CI_COMMIT_TAG:-${CI_COMMIT_SHORT_SHA}}
docker build \
--build-arg JAR_FILE=target/*.jar \
--build-arg BUILD_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ") \
--build-arg GIT_COMMIT=${CI_COMMIT_SHA} \
-t harbor.internal/${CI_PROJECT_NAMESPACE}/${CI_PROJECT_NAME}:${VERSION} \
-t harbor.internal/${CI_PROJECT_NAMESPACE}/${CI_PROJECT_NAME}:latest \
.
docker push harbor.internal/${CI_PROJECT_NAMESPACE}/${CI_PROJECT_NAME}:${VERSION}
docker push harbor.internal/${CI_PROJECT_NAMESPACE}/${CI_PROJECT_NAME}:latest
rules:
- if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
- if: $CI_COMMIT_TAG
# security scan
scan:
stage: scan
image: aquasec/trivy:latest
script:
- |
trivy image \
--severity HIGH,CRITICAL \
--exit-code 1 \
--ignore-unfixed \
--format json \
--output trivy-report.json \
harbor.internal/${CI_PROJECT_NAMESPACE}/${CI_PROJECT_NAME}:${CI_COMMIT_SHORT_SHA}
artifacts:
when: always
paths:
- trivy-report.json
expire_in: 1 week
allow_failure: false
rules:
- if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
# deploy to dev environment
deploy-dev:
stage: deploy-dev
image: bitnami/kubectl:latest
script:
- |
kubectl set image deployment/${CI_PROJECT_NAME} \
${CI_PROJECT_NAME}=harbor.internal/${CI_PROJECT_NAMESPACE}/${CI_PROJECT_NAME}:${CI_COMMIT_SHORT_SHA} \
-n ${CI_PROJECT_NAMESPACE}-dev
kubectl rollout status deployment/${CI_PROJECT_NAME} -n ${CI_PROJECT_NAMESPACE}-dev --timeout=300s
environment:
name: development
url: https://${CI_PROJECT_NAME}-dev.internal.company.com
rules:
- if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
# deploy to staging (manual trigger)
deploy-staging:
stage: deploy-staging
image: bitnami/kubectl:latest
script:
- |
kubectl set image deployment/${CI_PROJECT_NAME} \
${CI_PROJECT_NAME}=harbor.internal/${CI_PROJECT_NAMESPACE}/${CI_PROJECT_NAME}:${CI_COMMIT_SHORT_SHA} \
-n ${CI_PROJECT_NAMESPACE}-staging
kubectl rollout status deployment/${CI_PROJECT_NAME} -n ${CI_PROJECT_NAMESPACE}-staging --timeout=300s
environment:
name: staging
url: https://${CI_PROJECT_NAME}-staging.internal.company.com
rules:
- if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
when: manual
# deploy to production (manual trigger with approval)
deploy-prod:
stage: deploy-prod
image: bitnami/kubectl:latest
script:
- |
kubectl set image deployment/${CI_PROJECT_NAME} \
${CI_PROJECT_NAME}=harbor.internal/${CI_PROJECT_NAMESPACE}/${CI_PROJECT_NAME}:${CI_COMMIT_SHORT_SHA} \
-n ${CI_PROJECT_NAMESPACE}-prod
kubectl rollout status deployment/${CI_PROJECT_NAME} -n ${CI_PROJECT_NAMESPACE}-prod --timeout=600s
environment:
name: production
url: https://${CI_PROJECT_NAME}.company.com
rules:
- if: $CI_COMMIT_TAG
when: manual
needs:
- job: deploy-staging
  optional: true  # tag 触发的流水线中可能没有 deploy-staging,避免 needs 校验失败
2.3 阶段三:体验优化(2-3 个月)
平台虽然可用,但若不好用,开发人员依然会抱怨。此阶段重点在于提升用户体验。
统一门户
将所有入口收敛到一个统一的地方。我们选择 Backstage 作为开发者门户的基座。
// packages/app/src/App.tsx
import React from 'react';
import { createApp } from '@backstage/app-defaults';
import { AppRouter, FlatRoutes } from '@backstage/core-app-api';
import { CatalogIndexPage, CatalogEntityPage } from '@backstage/plugin-catalog';
import { TechDocsPage } from '@backstage/plugin-techdocs';
import { SearchPage } from '@backstage/plugin-search';
import { UserSettingsPage } from '@backstage/plugin-user-settings';
import { AlertDisplay, OAuthRequestDialog } from '@backstage/core-components';
import { Route } from 'react-router-dom';
import { ThemeProvider } from '@material-ui/core/styles';
// 内部主题与首页组件(导入路径为示意)
import { platformTheme } from './theme/platformTheme';
import { HomepageDashboard } from './components/home/HomepageDashboard';
// custom plugins
import { DeploymentPage } from '@internal/plugin-deployment';
import { MonitoringPage } from '@internal/plugin-monitoring';
import { CostCenterPage } from '@internal/plugin-cost';
const app = createApp({
apis: [],
plugins: [],
themes: [{
id: 'platform-theme',
title: 'Platform Theme',
variant: 'light',
Provider: ({ children }) => (
<ThemeProvider theme={platformTheme}>{children}</ThemeProvider>
),
}],
});
const routes = (
<FlatRoutes>
<Route path="/" element={<HomepageDashboard />} />
<Route path="/catalog" element={<CatalogIndexPage />} />
<Route path="/catalog/:namespace/:kind/:name" element={<CatalogEntityPage />} />
<Route path="/docs" element={<TechDocsPage />} />
<Route path="/deploy" element={<DeploymentPage />} />
<Route path="/monitoring" element={<MonitoringPage />} />
<Route path="/cost" element={<CostCenterPage />} />
<Route path="/search" element={<SearchPage />} />
<Route path="/settings" element={<UserSettingsPage />} />
</FlatRoutes>
);
export default app.createRoot(
<>
<AlertDisplay />
<OAuthRequestDialog />
<AppRouter>
{routes}
</AppRouter>
</>,
);
Golden Path 设计
Golden Path 是 Platform Engineering 的核心概念:为开发者铺设一条“黄金路径”,让他们以最少的决策成本完成任务。
# golden-paths/new-microservice.yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
name: springboot-microservice
title: Spring Boot Microservice
description: Create a new Spring Boot microservice with all best practices built-in
tags:
- java
- springboot
- recommended
spec:
owner: platform-team
type: service
parameters:
- title: Service Information
required:
- name
- owner
properties:
name:
title: Service Name
type: string
description: Unique name for your service
pattern: '^[a-z][a-z0-9-]{2,30}$'
ui:autofocus: true
ui:help: 'lowercase letters, numbers, hyphens only'
owner:
title: Owner Team
type: string
description: Team responsible for this service
ui:field: OwnerPicker
ui:options:
catalogFilter:
kind: Group
description:
title: Description
type: string
description: Brief description of what this service does
- title: Technical Configuration
properties:
javaVersion:
title: Java Version
type: string
default: '17'
enum: ['17', '21']
enumNames: ['Java 17 (LTS)', 'Java 21 (LTS)']
database:
title: Database
type: string
default: 'postgresql'
enum: ['none', 'postgresql', 'mysql', 'mongodb']
enumNames: ['No Database', 'PostgreSQL', 'MySQL', 'MongoDB']
messaging:
title: Message Queue
type: string
default: 'none'
enum: ['none', 'kafka', 'rabbitmq']
enumNames: ['No MQ', 'Kafka', 'RabbitMQ']
caching:
title: Cache
type: boolean
default: false
description: Enable Redis caching
steps:
- id: fetch-template
name: Fetch Template
action: fetch:template
input:
url: ./skeleton
values:
name: ${{ parameters.name }}
owner: ${{ parameters.owner }}
description: ${{ parameters.description }}
javaVersion: ${{ parameters.javaVersion }}
database: ${{ parameters.database }}
messaging: ${{ parameters.messaging }}
caching: ${{ parameters.caching }}
- id: create-repo
name: Create Repository
action: publish:gitlab
input:
repoUrl: gitlab.internal?owner=${{ parameters.owner }}&repo=${{ parameters.name }}
defaultBranch: main
repoVisibility: internal
- id: create-k8s-resources
name: Setup Kubernetes Resources
action: kubernetes:apply
input:
namespaceTemplate: team-namespace
values:
serviceName: ${{ parameters.name }}
team: ${{ parameters.owner }}
- id: register-catalog
name: Register in Catalog
action: catalog:register
input:
repoContentsUrl: ${{ steps['create-repo'].output.repoContentsUrl }}
catalogInfoPath: /catalog-info.yaml
output:
links:
- title: Repository
url: ${{ steps['create-repo'].output.remoteUrl }}
- title: Open in Catalog
icon: catalog
entityRef: ${{ steps['register-catalog'].output.entityRef }}
- title: CI/CD Pipeline
url: ${{ steps['create-repo'].output.remoteUrl }}/-/pipelines
2.4 阶段四:规模化运营(持续)
平台搭建完成后,运营工作至关重要。这个阶段容易被忽视,但其实决定了平台的长期成败。
平台即产品
将平台当作一个持续迭代的产品来运营,而不是一个一次性交付的项目。
我们每两周进行一次用户访谈,每月发送一次满意度调查,并用 NPS(净推荐值)来衡量平台的整体健康度。
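NPS 的算法本身很简单:推荐者(9-10 分)占比减去贬损者(0-6 分)占比。下面是一个最小的计算示例(示意代码,打分数据为虚构):

```go
// nps.go —— NPS 计算示意:推荐者(9-10 分)占比减去贬损者(0-6 分)占比
package main

import "fmt"

// CalcNPS 接收 0-10 的打分,返回 -100 ~ 100 的净推荐值
func CalcNPS(scores []int) float64 {
	if len(scores) == 0 {
		return 0
	}
	var promoters, detractors int
	for _, s := range scores {
		switch {
		case s >= 9:
			promoters++
		case s <= 6:
			detractors++
		}
	}
	total := float64(len(scores))
	return (float64(promoters) - float64(detractors)) / total * 100
}

func main() {
	// 虚构的一批调查打分,仅用于演示
	scores := []int{10, 9, 8, 7, 9, 6, 10, 5, 9, 8}
	fmt.Printf("本月平台 NPS: %.1f\n", CalcNPS(scores))
}
```

更完整的运营指标(部署次数、成功率、自助化比例等)则由下面这类脚本定期从监控系统汇总: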
# scripts/collect_platform_metrics.py
#!/usr/bin/env python3
"""Collect platform usage metrics for monthly report."""
import json
from datetime import datetime, timedelta
from typing import Dict, List
import requests
from dataclasses import dataclass
@dataclass
class PlatformMetrics:
period_start: datetime
period_end: datetime
total_deployments: int
unique_deployers: int
avg_deployment_time_seconds: float
deployment_success_rate: float
self_service_ratio: float # percentage of operations done without human intervention
incident_count: int
mttr_minutes: float # mean time to recovery
def collect_deployment_metrics(prometheus_url: str, start: datetime, end: datetime) -> Dict:
"""Collect deployment metrics from Prometheus."""
period_label = ''  # 可选的 label 过滤条件,例如 'env="prod"';留空表示不过滤
queries = {
'total_deployments': f'sum(increase(deployment_total{{{period_label}}}[30d]))',
'successful_deployments': f'sum(increase(deployment_success_total{{{period_label}}}[30d]))',
'deployment_duration_avg': f'avg(deployment_duration_seconds{{{period_label}}})',
}
results = {}
for name, query in queries.items():
resp = requests.get(
f"{prometheus_url}/api/v1/query",
params={'query': query, 'time': end.timestamp()}
)
data = resp.json()
if data['status'] == 'success' and data['data']['result']:
results[name] = float(data['data']['result'][0]['value'][1])
else:
results[name] = 0
return results
def calculate_self_service_ratio(
total_operations: int,
manual_tickets: int
) -> float:
"""Calculate what percentage of operations were self-service."""
if total_operations == 0:
return 0.0
return (total_operations - manual_tickets) / total_operations * 100
def generate_monthly_report(metrics: PlatformMetrics) -> str:
"""Generate monthly platform report in Markdown."""
report = f"""
# Platform Engineering Monthly Report
## {metrics.period_start.strftime('%Y-%m')}
### Key Metrics
| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| Total Deployments | {metrics.total_deployments:,} | - | - |
| Unique Deployers | {metrics.unique_deployers} | - | - |
| Avg Deployment Time | {metrics.avg_deployment_time_seconds:.1f}s | <300s | {'OK' if metrics.avg_deployment_time_seconds < 300 else 'WARN'} |
| Deployment Success Rate | {metrics.deployment_success_rate:.1f}% | >95% | {'OK' if metrics.deployment_success_rate > 95 else 'WARN'} |
| Self-Service Ratio | {metrics.self_service_ratio:.1f}% | >80% | {'OK' if metrics.self_service_ratio > 80 else 'WARN'} |
| Platform Incidents | {metrics.incident_count} | <5 | {'OK' if metrics.incident_count < 5 else 'WARN'} |
| MTTR | {metrics.mttr_minutes:.0f}min | <30min | {'OK' if metrics.mttr_minutes < 30 else 'WARN'} |
### Highlights
- Deployment frequency increased by X% compared to last month
- Self-service ratio improved by X points
- Top 3 most active teams: ...
### Action Items
- [ ] Improve documentation for X feature
- [ ] Add support for Y use case
- [ ] Investigate performance issue with Z
"""
return report
if __name__ == '__main__':
# run monthly report generation
end = datetime.now()
start = end - timedelta(days=30)
# collect metrics from various sources
deployment_metrics = collect_deployment_metrics(
'http://prometheus.internal:9090',
start, end
)
metrics = PlatformMetrics(
period_start=start,
period_end=end,
total_deployments=int(deployment_metrics.get('total_deployments', 0)),
unique_deployers=150, # from user database
avg_deployment_time_seconds=deployment_metrics.get('deployment_duration_avg', 0),
deployment_success_rate=deployment_metrics.get('successful_deployments', 0) /
max(deployment_metrics.get('total_deployments', 1), 1) * 100,
self_service_ratio=calculate_self_service_ratio(2000, 100),
incident_count=3,
mttr_minutes=25,
)
report = generate_monthly_report(metrics)
print(report)
成本分摊与展示
让每个团队清晰地看到自己消耗了多少资源、产生了多少费用,能有效控制资源浪费。
// internal/cost/calculator.go
package cost
import (
	"context"
	"fmt"
	"time"

	promapi "github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)
// UnitPrice defines the cost per resource unit per hour
type UnitPrice struct {
CPUCore float64 // cost per CPU core per hour
MemoryGB float64 // cost per GB memory per hour
StorageGB float64 // cost per GB storage per hour
NetworkGB float64 // cost per GB network transfer
}
// default prices (adjust based on your cloud provider)
var DefaultPrices = UnitPrice{
CPUCore: 0.05, // $0.05 per core per hour
MemoryGB: 0.01, // $0.01 per GB per hour
StorageGB: 0.0001, // $0.0001 per GB per hour
NetworkGB: 0.05, // $0.05 per GB
}
type CostCalculator struct {
promClient promv1.API
prices UnitPrice
}
func NewCostCalculator(promURL string, prices UnitPrice) (*CostCalculator, error) {
client, err := promapi.NewClient(promapi.Config{Address: promURL})
if err != nil {
return nil, err
}
return &CostCalculator{
promClient: promv1.NewAPI(client),
prices: prices,
}, nil
}
// CalculateNamespaceCost calculates the cost for a namespace over a time period
func (c *CostCalculator) CalculateNamespaceCost(
ctx context.Context,
namespace string,
start, end time.Time,
) (*NamespaceCost, error) {
hours := end.Sub(start).Hours()
// query average CPU usage
cpuQuery := fmt.Sprintf(
`avg_over_time(sum(rate(container_cpu_usage_seconds_total{namespace="%s"}[5m]))[%dh:1h])`,
namespace, int(hours),
)
cpuResult, _, err := c.promClient.Query(ctx, cpuQuery, end)
if err != nil {
return nil, err
}
cpuCores := extractScalarValue(cpuResult)
// query average memory usage
memQuery := fmt.Sprintf(
`avg_over_time(sum(container_memory_usage_bytes{namespace="%s"})[%dh:1h]) / 1024 / 1024 / 1024`,
namespace, int(hours),
)
memResult, _, err := c.promClient.Query(ctx, memQuery, end)
if err != nil {
return nil, err
}
memoryGB := extractScalarValue(memResult)
// query storage usage
storageQuery := fmt.Sprintf(
`sum(kubelet_volume_stats_used_bytes{namespace="%s"}) / 1024 / 1024 / 1024`,
namespace,
)
storageResult, _, err := c.promClient.Query(ctx, storageQuery, end)
if err != nil {
return nil, err
}
storageGB := extractScalarValue(storageResult)
// calculate costs
return &NamespaceCost{
Namespace: namespace,
Period: fmt.Sprintf("%s - %s", start.Format("2006-01-02"), end.Format("2006-01-02")),
CPUCores: cpuCores,
MemoryGB: memoryGB,
StorageGB: storageGB,
CPUCost: cpuCores * hours * c.prices.CPUCore,
MemoryCost: memoryGB * hours * c.prices.MemoryGB,
StorageCost: storageGB * hours * c.prices.StorageGB,
TotalCost: cpuCores*hours*c.prices.CPUCore + memoryGB*hours*c.prices.MemoryGB + storageGB*hours*c.prices.StorageGB,
}, nil
}
type NamespaceCost struct {
Namespace string
Period string
CPUCores float64
MemoryGB float64
StorageGB float64
CPUCost float64
MemoryCost float64
StorageCost float64
TotalCost float64
}
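下面是一个调用示意,统计某团队生产命名空间最近 30 天的成本并打印(platform/internal/cost 这个导入路径是假设的,按你的模块名调整):

```go
// cmd/cost-report/main.go —— 调用示意
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"platform/internal/cost" // 假设的内部模块路径
)

func main() {
	calc, err := cost.NewCostCalculator("http://prometheus.internal:9090", cost.DefaultPrices)
	if err != nil {
		log.Fatal(err)
	}

	end := time.Now()
	start := end.AddDate(0, 0, -30) // 最近 30 天

	nsCost, err := calc.CalculateNamespaceCost(context.Background(), "team-a-prod", start, end)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("%s %s 总成本 $%.2f(CPU $%.2f / 内存 $%.2f / 存储 $%.2f)\n",
		nsCost.Namespace, nsCost.Period,
		nsCost.TotalCost, nsCost.CPUCost, nsCost.MemoryCost, nsCost.StorageCost)
}
```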
## 三、示例代码和配置
3.1 完整的平台 API 设计
# api/openapi/platform-api.yaml
openapi: 3.0.3
info:
title: Internal Developer Platform API
description: API for self-service platform operations
version: 1.0.0
contact:
name: Platform Team
email: platform@company.com
servers:
- url: https://platform-api.internal.company.com/v1
description: Production
security:
- bearerAuth: []
paths:
/projects:
get:
summary: List all projects
operationId: listProjects
tags:
- Projects
parameters:
- name: team
in: query
schema:
type: string
description: Filter by team
- name: page
in: query
schema:
type: integer
default: 1
- name: limit
in: query
schema:
type: integer
default: 20
maximum: 100
responses:
'200':
description: List of projects
content:
application/json:
schema:
type: object
properties:
items:
type: array
items:
$ref: '#/components/schemas/Project'
total:
type: integer
page:
type: integer
limit:
type: integer
post:
summary: Create a new project
operationId: createProject
tags:
- Projects
requestBody:
required: true
content:
application/json:
schema:
$ref: '#/components/schemas/CreateProjectRequest'
responses:
'201':
description: Project created
content:
application/json:
schema:
$ref: '#/components/schemas/Project'
'400':
description: Invalid request
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
'409':
description: Project already exists
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
/projects/{projectId}/deployments:
get:
summary: List deployments for a project
operationId: listDeployments
tags:
- Deployments
parameters:
- name: projectId
in: path
required: true
schema:
type: string
- name: environment
in: query
schema:
type: string
enum: [dev, staging, prod]
- name: status
in: query
schema:
type: string
enum: [pending, running, success, failed, cancelled]
responses:
'200':
description: List of deployments
content:
application/json:
schema:
type: array
items:
$ref: '#/components/schemas/Deployment'
post:
summary: Trigger a new deployment
operationId: createDeployment
tags:
- Deployments
parameters:
- name: projectId
in: path
required: true
schema:
type: string
requestBody:
required: true
content:
application/json:
schema:
$ref: '#/components/schemas/CreateDeploymentRequest'
responses:
'202':
description: Deployment triggered
content:
application/json:
schema:
$ref: '#/components/schemas/Deployment'
/projects/{projectId}/environments/{env}/scale:
put:
summary: Scale application replicas
operationId: scaleApplication
tags:
- Operations
parameters:
- name: projectId
in: path
required: true
schema:
type: string
- name: env
in: path
required: true
schema:
type: string
enum: [dev, staging, prod]
requestBody:
required: true
content:
application/json:
schema:
type: object
required:
- replicas
properties:
replicas:
type: integer
minimum: 0
maximum: 50
responses:
'200':
description: Scaling initiated
'400':
description: Invalid replica count
components:
securitySchemes:
bearerAuth:
type: http
scheme: bearer
bearerFormat: JWT
schemas:
Project:
type: object
properties:
id:
type: string
format: uuid
name:
type: string
team:
type: string
description:
type: string
gitRepoUrl:
type: string
format: uri
imageRegistry:
type: string
environments:
type: array
items:
$ref: '#/components/schemas/Environment'
createdAt:
type: string
format: date-time
createdBy:
type: string
CreateProjectRequest:
type: object
required:
- name
- team
properties:
name:
type: string
pattern: '^[a-z][a-z0-9-]{2,30}$'
team:
type: string
description:
type: string
maxLength: 500
template:
type: string
enum: [springboot, nodejs, python, golang]
default: springboot
Environment:
type: object
properties:
name:
type: string
enum: [dev, staging, prod]
namespace:
type: string
replicas:
type: integer
currentVersion:
type: string
status:
type: string
enum: [healthy, degraded, unavailable]
Deployment:
type: object
properties:
id:
type: string
format: uuid
projectId:
type: string
environment:
type: string
version:
type: string
status:
type: string
enum: [pending, running, success, failed, cancelled]
triggeredBy:
type: string
triggeredAt:
type: string
format: date-time
completedAt:
type: string
format: date-time
logs:
type: string
format: uri
CreateDeploymentRequest:
type: object
required:
- environment
- version
properties:
environment:
type: string
enum: [dev, staging, prod]
version:
type: string
description: Git tag or commit SHA
strategy:
type: string
enum: [rolling, blue-green, canary]
default: rolling
Error:
type: object
properties:
code:
type: string
message:
type: string
details:
type: object
3.2 团队环境的 Terraform 模块
用 Terraform 把"给一个团队开通一整套环境"固化成模块:命名空间、资源配额、Limit Range、网络策略、RBAC 以及 Harbor 项目一次性创建到位。
# modules/team-environment/main.tf
variable "team_name" {
type = string
description = "Team name, used as prefix for all resources"
validation {
condition = can(regex("^[a-z][a-z0-9-]{2,20}$", var.team_name))
error_message = "Team name must be lowercase alphanumeric with hyphens, 3-21 characters."
}
}
variable "environments" {
type = list(string)
description = "List of environments to create"
default = ["dev", "staging", "prod"]
}
variable "cpu_quota" {
type = map(string)
description = "CPU quota per environment"
default = {
dev = "4"
staging = "8"
prod = "16"
}
}
variable "memory_quota" {
type = map(string)
description = "Memory quota per environment"
default = {
dev = "8Gi"
staging = "16Gi"
prod = "32Gi"
}
}
# Create namespaces for each environment
resource "kubernetes_namespace" "team_ns" {
for_each = toset(var.environments)
metadata {
name = "${var.team_name}-${each.key}"
labels = {
"platform.internal/team" = var.team_name
"platform.internal/environment" = each.key
"istio-injection" = each.key == "prod" ? "enabled" : "disabled"
}
annotations = {
"platform.internal/created-by" = "terraform"
"platform.internal/managed" = "true"
}
}
}
# Resource quotas
resource "kubernetes_resource_quota" "team_quota" {
for_each = toset(var.environments)
metadata {
name = "team-quota"
namespace = kubernetes_namespace.team_ns[each.key].metadata[0].name
}
spec {
hard = {
"requests.cpu" = var.cpu_quota[each.key]
"requests.memory" = var.memory_quota[each.key]
"limits.cpu" = var.cpu_quota[each.key]
"limits.memory" = var.memory_quota[each.key]
"pods" = each.key == "prod" ? "100" : "50"
"services" = "20"
"secrets" = "50"
"configmaps" = "50"
}
}
}
# Limit ranges
resource "kubernetes_limit_range" "team_limits" {
for_each = toset(var.environments)
metadata {
name = "team-limits"
namespace = kubernetes_namespace.team_ns[each.key].metadata[0].name
}
spec {
limit {
type = "Container"
default = {
cpu = "500m"
memory = "512Mi"
}
default_request = {
cpu = "100m"
memory = "128Mi"
}
max = {
cpu = "4"
memory = "8Gi"
}
min = {
cpu = "10m"
memory = "32Mi"
}
}
limit {
type = "PersistentVolumeClaim"
max = {
storage = each.key == "prod" ? "100Gi" : "20Gi"
}
min = {
storage = "1Gi"
}
}
}
}
# Network policies - default deny with allow for specific traffic
resource "kubernetes_network_policy" "default_deny" {
for_each = toset(var.environments)
metadata {
name = "default-deny"
namespace = kubernetes_namespace.team_ns[each.key].metadata[0].name
}
spec {
pod_selector {}
policy_types = ["Ingress", "Egress"]
# Allow egress to DNS
egress {
to {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "kube-system"
}
}
pod_selector {
match_labels = {
"k8s-app" = "kube-dns"
}
}
}
ports {
port = 53
protocol = "UDP"
}
ports {
port = 53
protocol = "TCP"
}
}
# Allow egress to same namespace
egress {
to {
pod_selector {}
}
}
# Allow ingress from ingress controller
ingress {
from {
namespace_selector {
match_labels = {
"kubernetes.io/metadata.name" = "ingress-nginx"
}
}
}
}
# Allow ingress from same namespace
ingress {
from {
pod_selector {}
}
}
}
}
# RBAC - team members can manage resources in their namespaces
resource "kubernetes_role" "team_developer" {
for_each = toset(var.environments)
metadata {
name = "team-developer"
namespace = kubernetes_namespace.team_ns[each.key].metadata[0].name
}
rule {
api_groups = ["", "apps", "batch"]
resources = ["pods", "pods/log", "pods/exec", "deployments", "services", "configmaps", "secrets", "jobs", "cronjobs"]
verbs = each.key == "prod" ? ["get", "list", "watch"] : ["*"]
}
rule {
api_groups = ["networking.k8s.io"]
resources = ["ingresses"]
verbs = each.key == "prod" ? ["get", "list", "watch"] : ["*"]
}
rule {
api_groups = ["autoscaling"]
resources = ["horizontalpodautoscalers"]
verbs = ["get", "list", "watch", "create", "update"]
}
}
# Create RoleBinding for the team's GitLab group
resource "kubernetes_role_binding" "team_binding" {
for_each = toset(var.environments)
metadata {
name = "team-developer-binding"
namespace = kubernetes_namespace.team_ns[each.key].metadata[0].name
}
role_ref {
api_group = "rbac.authorization.k8s.io"
kind = "Role"
name = kubernetes_role.team_developer[each.key].metadata[0].name
}
subject {
kind = "Group"
name = "gitlab:${var.team_name}"
api_group = "rbac.authorization.k8s.io"
}
}
# Harbor project for team's container images
resource "harbor_project" "team_project" {
name = var.team_name
public = false
vulnerability_scanning = true
storage_quota = 53687091200 # 50GB
cve_allowlist {
# allow specific CVEs if needed (not recommended)
items = []
}
}
# Harbor robot account for CI/CD
resource "harbor_robot_account" "ci_robot" {
name = "${var.team_name}-ci"
description = "Robot account for ${var.team_name} CI/CD pipelines"
level = "project"
permissions {
kind = "project"
namespace = harbor_project.team_project.name
access {
resource = "repository"
action = "push"
}
access {
resource = "repository"
action = "pull"
}
access {
resource = "tag"
action = "list"
}
}
}
# Outputs
output "namespaces" {
value = {
for env in var.environments : env => kubernetes_namespace.team_ns[env].metadata[0].name
}
}
output "harbor_project" {
value = harbor_project.team_project.name
}
output "harbor_robot_name" {
value = harbor_robot_account.ci_robot.full_name
}
output "harbor_robot_secret" {
value = harbor_robot_account.ci_robot.secret
sensitive = true
}
3.3 GitOps 实现:ArgoCD Application 模板
# argocd/application-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: team-applications
namespace: argocd
spec:
generators:
# Generate applications from Git directory structure
- git:
repoURL: https://gitlab.internal/platform/deployments.git
revision: HEAD
directories:
- path: 'teams/*/apps/*'
# Also support explicit list from ConfigMap
- matrix:
generators:
- list:
elementsYaml: "{{ .teams }}"
- list:
elements:
- env: dev
cluster: https://k8s-dev.internal:6443
autoSync: true
- env: staging
cluster: https://k8s-staging.internal:6443
autoSync: true
- env: prod
cluster: https://k8s-prod.internal:6443
autoSync: false # manual sync for production
template:
metadata:
name: '{{path.basename}}-{{env}}'
labels:
team: '{{path[1]}}'
app: '{{path.basename}}'
environment: '{{env}}'
annotations:
notifications.argoproj.io/subscribe.on-sync-succeeded.slack: platform-deployments
notifications.argoproj.io/subscribe.on-sync-failed.slack: platform-alerts
spec:
project: '{{path[1]}}' # team name as project
source:
repoURL: https://gitlab.internal/platform/deployments.git
targetRevision: HEAD
path: '{{path}}/overlays/{{env}}'
# Kustomize support
kustomize:
images:
- 'harbor.internal/{{path[1]}}/{{path.basename}}:{{env}}'
destination:
server: '{{cluster}}'
namespace: '{{path[1]}}-{{env}}'
syncPolicy:
automated:
prune: '{{autoSync}}'
selfHeal: '{{autoSync}}'
allowEmpty: false
syncOptions:
- CreateNamespace=false
- PrunePropagationPolicy=foreground
- PruneLast=true
- ApplyOutOfSyncOnly=true
retry:
limit: 3
backoff:
duration: 5s
factor: 2
maxDuration: 3m
# Health checks
ignoreDifferences:
- group: apps
kind: Deployment
jsonPointers:
- /spec/replicas # ignore if HPA is managing replicas
# Notifications
info:
- name: team
value: '{{path[1]}}'
- name: slack
value: '#{{path[1]}}-deploys'
---
# ArgoCD Project for team isolation
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
name: team-template
namespace: argocd
spec:
description: Template for team projects
sourceRepos:
- 'https://gitlab.internal/{{.team}}/*'
- 'https://gitlab.internal/platform/deployments.git'
destinations:
- namespace: '{{.team}}-*'
server: '*'
clusterResourceWhitelist:
- group: ''
kind: Namespace
namespaceResourceWhitelist:
- group: '*'
kind: '*'
roles:
- name: developer
description: Team developers
policies:
- p, proj:{{.team}}:developer, applications, get, {{.team}}/*, allow
- p, proj:{{.team}}:developer, applications, sync, {{.team}}/*, allow
- p, proj:{{.team}}:developer, logs, get, {{.team}}/*, allow
groups:
- gitlab:{{.team}}
- name: lead
description: Team leads with more permissions
policies:
- p, proj:{{.team}}:lead, applications, *, {{.team}}/*, allow
- p, proj:{{.team}}:lead, logs, get, {{.team}}/*, allow
- p, proj:{{.team}}:lead, exec, create, {{.team}}/*, allow
groups:
- gitlab:{{.team}}-leads
## 四、最佳实践和注意事项
4.1 性能优化
API 性能
平台 API 是所有操作的入口,其响应速度直接影响用户体验。
// internal/middleware/cache.go
package middleware
import (
"context"
"crypto/sha256"
"encoding/hex"
"time"
"github.com/go-redis/redis/v8"
"github.com/gofiber/fiber/v2"
)
type CacheMiddleware struct {
redis *redis.Client
ttl time.Duration
}
func NewCacheMiddleware(redis *redis.Client) *CacheMiddleware {
return &CacheMiddleware{
redis: redis,
ttl: 5 * time.Minute,
}
}
func (c *CacheMiddleware) Handler() fiber.Handler {
return func(ctx *fiber.Ctx) error {
// only cache GET requests
if ctx.Method() != fiber.MethodGet {
return ctx.Next()
}
// skip if no-cache header
if ctx.Get("Cache-Control") == "no-cache" {
return ctx.Next()
}
// generate cache key
key := c.generateKey(ctx)
// try to get from cache
cached, err := c.redis.Get(context.Background(), key).Bytes()
if err == nil {
ctx.Set("X-Cache", "HIT")
ctx.Set("Content-Type", "application/json")
return ctx.Send(cached)
}
// call next handler
if err := ctx.Next(); err != nil {
return err
}
// cache successful responses
if ctx.Response().StatusCode() == 200 {
body := ctx.Response().Body()
c.redis.Set(context.Background(), key, body, c.ttl)
}
ctx.Set("X-Cache", "MISS")
return nil
}
}
func (c *CacheMiddleware) generateKey(ctx *fiber.Ctx) string {
// include user ID to ensure per-user caching
userID := ctx.Locals("userID").(string)
raw := ctx.OriginalURL() + "|" + userID
hash := sha256.Sum256([]byte(raw))
return "api:cache:" + hex.EncodeToString(hash[:])
}
Kubernetes API 调用优化
直接调用 Kubernetes API 不仅延迟较高,还会给 apiserver 带来额外压力;使用 Informer 在本地维护缓存是更好的选择。
// internal/k8s/cache.go
package k8s
import (
"context"
"sync"
"time"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/labels"
"k8s.io/client-go/informers"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/tools/cache"
)
type ResourceCache struct {
informerFactory informers.SharedInformerFactory
stopCh chan struct{}
mu sync.RWMutex
}
func NewResourceCache(clientset *kubernetes.Clientset) *ResourceCache {
factory := informers.NewSharedInformerFactoryWithOptions(
clientset,
30*time.Second, // resync period
)
rc := &ResourceCache{
informerFactory: factory,
stopCh: make(chan struct{}),
}
// start informers
factory.Start(rc.stopCh)
// wait for cache sync
factory.WaitForCacheSync(rc.stopCh)
return rc
}
func (rc *ResourceCache) ListPods(namespace string, labelSelector map[string]string) ([]*corev1.Pod, error) {
rc.mu.RLock()
defer rc.mu.RUnlock()
selector := labels.SelectorFromSet(labelSelector)
pods, err := rc.informerFactory.Core().V1().Pods().Lister().Pods(namespace).List(selector)
if err != nil {
return nil, err
}
return pods, nil
}
func (rc *ResourceCache) GetDeploymentStatus(namespace, name string) (*DeploymentStatus, error) {
deployment, err := rc.informerFactory.Apps().V1().Deployments().Lister().Deployments(namespace).Get(name)
if err != nil {
return nil, err
}
return &DeploymentStatus{
Name: deployment.Name,
Replicas: deployment.Status.Replicas,
ReadyReplicas: deployment.Status.ReadyReplicas,
AvailableReplicas: deployment.Status.AvailableReplicas,
UpdatedReplicas: deployment.Status.UpdatedReplicas,
}, nil
}
func (rc *ResourceCache) Stop() {
close(rc.stopCh)
}
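使用方式大致如下:平台 API 启动时构建一次缓存,之后的 Pod、Deployment 查询都走本地 Informer,而不是每次请求都打到 apiserver(示意代码,internal 包导入路径和示例应用名均为假设):

```go
// cmd/api 初始化阶段的使用示意
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"

	k8scache "platform/internal/k8s" // 假设的内部模块路径
)

func main() {
	cfg, err := rest.InClusterConfig() // 平台 API 以 Pod 形式运行在集群内
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	rc := k8scache.NewResourceCache(clientset)
	defer rc.Stop()

	// 之后的 Pod 查询读取的是本地缓存,不再直接请求 apiserver
	pods, err := rc.ListPods("team-a-prod", map[string]string{"app": "order-service"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("order-service 当前共 %d 个 Pod\n", len(pods))
}
```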
4.2 安全加固
API 认证与授权
// internal/auth/middleware.go
package auth
import (
	"context"
	"fmt"
	"strings"

	"github.com/gofiber/fiber/v2"
	"github.com/golang-jwt/jwt/v5"
)
type AuthMiddleware struct {
jwtSecret []byte
rbacClient *RBACClient
}
func (a *AuthMiddleware) Authenticate() fiber.Handler {
return func(c *fiber.Ctx) error {
// extract token from header
authHeader := c.Get("Authorization")
if authHeader == "" {
return c.Status(401).JSON(fiber.Map{
"error": "missing authorization header",
})
}
parts := strings.Split(authHeader, " ")
if len(parts) != 2 || parts[0] != "Bearer" {
return c.Status(401).JSON(fiber.Map{
"error": "invalid authorization header format",
})
}
// parse and validate JWT
token, err := jwt.Parse(parts[1], func(token *jwt.Token) (interface{}, error) {
if _, ok := token.Method.(*jwt.SigningMethodHMAC); !ok {
return nil, fmt.Errorf("unexpected signing method")
}
return a.jwtSecret, nil
})
if err != nil || !token.Valid {
return c.Status(401).JSON(fiber.Map{
"error": "invalid token",
})
}
claims := token.Claims.(jwt.MapClaims)
// store user info in context
c.Locals("userID", claims["sub"])
c.Locals("userName", claims["name"])
c.Locals("userGroups", claims["groups"])
return c.Next()
}
}
func (a *AuthMiddleware) Authorize(resource, action string) fiber.Handler {
return func(c *fiber.Ctx) error {
userID := c.Locals("userID").(string)
groups := c.Locals("userGroups").([]string)
// extract resource ID from path if exists
resourceID := c.Params("id", "")
// check permissions
allowed, err := a.rbacClient.Check(context.Background(), &CheckRequest{
Subject: userID,
Groups: groups,
Resource: resource,
ResourceID: resourceID,
Action: action,
})
if err != nil {
return c.Status(500).JSON(fiber.Map{
"error": "authorization check failed",
})
}
if !allowed {
return c.Status(403).JSON(fiber.Map{
"error": "forbidden",
})
}
return c.Next()
}
}
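路由层的装配大致如下(示意代码:假设 auth 包额外提供了 NewAuthMiddleware 构造函数,RBAC 客户端的注入从略,业务 handler 仅为占位):

```go
// cmd/api/main.go —— 认证、授权中间件的装配示意
package main

import (
	"log"
	"os"

	"github.com/gofiber/fiber/v2"

	"platform/internal/auth" // 假设的内部模块路径
)

func main() {
	app := fiber.New()

	// NewAuthMiddleware 为假设的构造函数,实际使用时还需注入 RBAC 客户端
	authMW := auth.NewAuthMiddleware([]byte(os.Getenv("PLATFORM_JWT_SECRET")))

	api := app.Group("/v1", authMW.Authenticate())
	// 先认证,再按"资源 + 动作"授权,最后进入业务 handler
	api.Get("/projects", authMW.Authorize("project", "list"), listProjects)
	api.Post("/projects", authMW.Authorize("project", "create"), createProject)

	log.Fatal(app.Listen(":8080"))
}

// 占位 handler,仅用于演示
func listProjects(c *fiber.Ctx) error  { return c.JSON(fiber.Map{"items": []string{}}) }
func createProject(c *fiber.Ctx) error { return c.SendStatus(fiber.StatusCreated) }
```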
Secrets 管理
永远不要将密钥以明文形式存储在代码或配置文件中。可以利用 External Secrets Operator 等工具进行集中管理。
# external-secrets operator configuration
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
name: vault-backend
spec:
provider:
vault:
server: "https://vault.internal:8200"
path: "secret"
version: "v2"
auth:
kubernetes:
mountPath: "kubernetes"
role: "external-secrets"
serviceAccountRef:
name: "external-secrets"
namespace: "external-secrets"
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
name: app-secrets
namespace: team-a-prod
spec:
refreshInterval: 1h
secretStoreRef:
name: vault-backend
kind: ClusterSecretStore
target:
name: app-secrets
creationPolicy: Owner
template:
type: Opaque
data:
DATABASE_URL: "postgresql://{{ .db_user }}:{{ .db_password }}@{{ .db_host }}:5432/{{ .db_name }}"
REDIS_URL: "redis://:{{ .redis_password }}@{{ .redis_host }}:6379"
data:
- secretKey: db_user
remoteRef:
key: team-a/prod/database
property: username
- secretKey: db_password
remoteRef:
key: team-a/prod/database
property: password
- secretKey: db_host
remoteRef:
key: team-a/prod/database
property: host
- secretKey: db_name
remoteRef:
key: team-a/prod/database
property: name
- secretKey: redis_password
remoteRef:
key: team-a/prod/redis
property: password
- secretKey: redis_host
remoteRef:
key: team-a/prod/redis
property: host
4.3 高可用配置
平台本身也必须具备高可用性,避免成为单点故障。
# platform-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: platform-api
namespace: platform-system
spec:
replicas: 3
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 1
maxSurge: 1
selector:
matchLabels:
app: platform-api
template:
metadata:
labels:
app: platform-api
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchLabels:
app: platform-api
topologyKey: kubernetes.io/hostname
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels:
app: platform-api
containers:
- name: api
image: harbor.internal/platform/api:v1.5.0
ports:
- containerPort: 8080
name: http
- containerPort: 8081
name: metrics
resources:
requests:
cpu: 500m
memory: 512Mi
limits:
cpu: 2000m
memory: 2Gi
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 10
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /readyz
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
volumeMounts:
- name: config
mountPath: /etc/platform
readOnly: true
volumes:
- name: config
configMap:
name: platform-api-config
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: platform-api-pdb
namespace: platform-system
spec:
minAvailable: 2
selector:
matchLabels:
app: platform-api
4.4 常见错误
错误一:一开始就想做太多
我们初期曾试图一次性实现所有功能,结果三个月过去了平台仍未上线。后来我们调整策略,先推出一个最小可行版本,解决最迫切的那个痛点,再逐步迭代完善。
错误二:不重视文档
平台再好用,缺乏文档开发人员也无从下手。我们现在的做法是:文档与代码同步编写,每个新功能必须附带相应的文档才能合并。
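"必须附带文档才能合并"这条规则可以直接做进 CI 门禁。下面是一种可能的实现思路(示意:目录约定、环境变量按自己的仓库调整):

```go
// tools/docscheck/main.go —— 文档门禁示意:
// 若 internal/ 或 api/ 有改动,但 docs/ 没有任何变更,则让流水线失败
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	base := os.Getenv("CI_MERGE_REQUEST_DIFF_BASE_SHA") // GitLab MR 流水线提供
	if base == "" {
		base = "origin/main"
	}

	out, err := exec.Command("git", "diff", "--name-only", base+"...HEAD").Output()
	if err != nil {
		fmt.Fprintln(os.Stderr, "git diff failed:", err)
		os.Exit(1)
	}

	var codeChanged, docsChanged bool
	for _, f := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		switch {
		case strings.HasPrefix(f, "docs/"):
			docsChanged = true
		case strings.HasPrefix(f, "internal/"), strings.HasPrefix(f, "api/"):
			codeChanged = true
		}
	}

	if codeChanged && !docsChanged {
		fmt.Fprintln(os.Stderr, "代码有改动但 docs/ 未更新,请补充文档后再合并")
		os.Exit(1)
	}
}
```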
错误三:忽视存量系统
许多团队都存在历史包袱,不可能一夜之间全部迁移到新平台。必须提供清晰的迁移路径和兼容性方案。
# migration-helper.yaml - helps teams migrate from legacy systems
apiVersion: batch/v1
kind: Job
metadata:
name: migration-helper
spec:
template:
spec:
containers:
- name: migrator
image: harbor.internal/platform/migrator:latest
env:
- name: LEGACY_JENKINS_URL
value: "http://jenkins.internal:8080"
- name: NEW_GITLAB_URL
value: "https://gitlab.internal"
- name: TEAM_NAME
value: "team-a"
command:
- /bin/sh
- -c
- |
# Export Jenkins job configs
python /scripts/export_jenkins_jobs.py \
--jenkins-url $LEGACY_JENKINS_URL \
--team $TEAM_NAME \
--output /data/jenkins-jobs.json
# Convert to GitLab CI
python /scripts/convert_to_gitlab_ci.py \
--input /data/jenkins-jobs.json \
--output /data/gitlab-ci-configs/
# Create merge requests with new configs
python /scripts/create_migration_mrs.py \
--gitlab-url $NEW_GITLAB_URL \
--configs-dir /data/gitlab-ci-configs/
## 五、故障排查和监控
5.1 日志查看
统一的日志格式能极大方便问题排查。建立完善的监控告警规则体系是保障平台稳定的关键。
// internal/logging/logger.go
package logging
import (
"context"
"os"
"go.uber.org/zap"
"go.uber.org/zap/zapcore"
)
type contextKey string
const (
requestIDKey contextKey = "requestID"
userIDKey contextKey = "userID"
)
var logger *zap.Logger
func Init(env string) {
var config zap.Config
if env == "production" {
config = zap.NewProductionConfig()
config.EncoderConfig.TimeKey = "timestamp"
config.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
} else {
config = zap.NewDevelopmentConfig()
config.EncoderConfig.EncodeLevel = zapcore.CapitalColorLevelEncoder
}
config.OutputPaths = []string{"stdout"}
config.InitialFields = map[string]interface{}{
"service": "platform-api",
"version": os.Getenv("APP_VERSION"),
}
var err error
logger, err = config.Build()
if err != nil {
panic(err)
}
}
func WithContext(ctx context.Context) *zap.Logger {
l := logger
if requestID, ok := ctx.Value(requestIDKey).(string); ok {
l = l.With(zap.String("request_id", requestID))
}
if userID, ok := ctx.Value(userIDKey).(string); ok {
l = l.With(zap.String("user_id", userID))
}
return l
}
// structured logging helpers
func LogDeployment(ctx context.Context, project, env, version string, success bool, duration float64) {
WithContext(ctx).Info("deployment completed",
zap.String("project", project),
zap.String("environment", env),
zap.String("version", version),
zap.Bool("success", success),
zap.Float64("duration_seconds", duration),
)
}
func LogAPIRequest(ctx context.Context, method, path string, statusCode int, duration float64) {
level := zap.InfoLevel
if statusCode >= 500 {
level = zap.ErrorLevel
} else if statusCode >= 400 {
level = zap.WarnLevel
}
WithContext(ctx).Log(level, "api request",
zap.String("method", method),
zap.String("path", path),
zap.Int("status_code", statusCode),
zap.Float64("duration_ms", duration),
)
}
Loki 日志查询示例
# 查找特定用户的所有操作
{service="platform-api"} | json | user_id="u-12345"
# 查找失败的部署
{service="platform-api"} | json | line_format "{{.message}}" |= "deployment completed" | success="false"
# 查找慢请求(>1秒)
{service="platform-api"} | json | duration_ms > 1000
# 按团队统计部署次数
sum by (project) (count_over_time({service="platform-api"} | json |= "deployment completed" [24h]))
5.2 常见问题排查
问题一:部署卡住
#!/bin/bash
# scripts/debug_stuck_deployment.sh
NAMESPACE=$1
DEPLOYMENT=$2
echo "=== Deployment Status ==="
kubectl get deployment $DEPLOYMENT -n $NAMESPACE -o wide
echo -e "\n=== ReplicaSet Status ==="
kubectl get rs -n $NAMESPACE -l app=$DEPLOYMENT
echo -e "\n=== Pod Status ==="
kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT -o wide
echo -e "\n=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | grep -i $DEPLOYMENT | tail -20
echo -e "\n=== Pod Descriptions ==="
for pod in $(kubectl get pods -n $NAMESPACE -l app=$DEPLOYMENT -o jsonpath='{.items[*].metadata.name}'); do
echo "--- Pod: $pod ---"
kubectl describe pod $pod -n $NAMESPACE | grep -A 20 "Events:"
done
echo -e "\n=== Container Logs (last 50 lines) ==="
kubectl logs -n $NAMESPACE -l app=$DEPLOYMENT --tail=50 --all-containers=true
问题二:资源配额不足
#!/bin/bash
# scripts/check_quota.sh
NAMESPACE=$1
echo "=== Resource Quota ==="
kubectl get resourcequota -n $NAMESPACE -o yaml
echo -e "\n=== Current Usage ==="
kubectl describe resourcequota -n $NAMESPACE
echo -e "\n=== Top Consumers ==="
kubectl top pods -n $NAMESPACE --sort-by=memory | head -10
5.3 性能监控
# prometheus/platform-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: platform-api-rules
namespace: monitoring
spec:
groups:
- name: platform-api.rules
interval: 30s
rules:
# API latency SLO
- record: platform:api_latency:p99
expr: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket{service="platform-api"}[5m])) by (le, path)
)
# API availability
- record: platform:api_availability:ratio
expr: |
sum(rate(http_requests_total{service="platform-api", status_code!~"5.."}[5m]))
/
sum(rate(http_requests_total{service="platform-api"}[5m]))
# Deployment success rate
- record: platform:deployment_success:ratio
expr: |
sum(rate(deployment_total{status="success"}[1h]))
/
sum(rate(deployment_total[1h]))
- name: platform-api.alerts
rules:
- alert: PlatformAPIHighLatency
expr: platform:api_latency:p99 > 2
for: 5m
labels:
severity: warning
annotations:
summary: "Platform API latency is high"
description: "P99 latency is {{ $value }}s (threshold: 2s)"
- alert: PlatformAPILowAvailability
expr: platform:api_availability:ratio < 0.99
for: 5m
labels:
severity: critical
annotations:
summary: "Platform API availability is low"
description: "Availability is {{ $value | humanizePercentage }} (SLO: 99%)"
- alert: DeploymentSuccessRateLow
expr: platform:deployment_success:ratio < 0.95
for: 15m
labels:
severity: warning
annotations:
summary: "Deployment success rate is low"
description: "Success rate is {{ $value | humanizePercentage }} (threshold: 95%)"
5.4 备份恢复
平台数据必须定期备份。
# backup/velero-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: platform-daily-backup
namespace: velero
spec:
schedule: "0 2 * * *" # 2:00 AM daily
template:
includedNamespaces:
- platform-system
- argocd
includedResources:
- configmaps
- secrets
- persistentvolumeclaims
- customresourcedefinitions
excludedResources:
- events
labelSelector:
matchLabels:
backup: "true"
storageLocation: default
volumeSnapshotLocations:
- default
ttl: 720h # 30 days retention
---
# Restore procedure documented
apiVersion: v1
kind: ConfigMap
metadata:
name: disaster-recovery-runbook
namespace: platform-system
data:
restore.md: |
# Platform Disaster Recovery Runbook
## Prerequisites
- velero CLI installed
- Access to backup storage (S3/MinIO)
- kubectl configured for target cluster
## Steps
1. List available backups:
```
velero backup get
```
2. Describe backup to verify contents:
```
velero backup describe <backup-name> --details
```
3. Perform restore:
```
velero restore create --from-backup <backup-name>
```
4. Monitor restore progress:
```
velero restore describe <restore-name>
```
5. Verify platform services:
```
kubectl get pods -n platform-system
kubectl get pods -n argocd
```
6. Run smoke tests:
```
./scripts/platform-smoke-test.sh
```
## 六、总结
### 6.1 技术要点回顾
向 Platform Engineering 转型并非一蹴而就,需要持续的投入和迭代。其核心要点包括:
1. **优先解决最痛的问题**:避免贪多求全,找到开发人员最大的痛点并优先解决。
2. **自动化一切可自动化的环节**:人工操作效率低下且容易出错。
3. **抽象但要避免过度设计**:Golden Path 应覆盖 80% 的常规场景,同时为剩下的 20% 特殊需求留出定制空间。
4. **文档即代码**:确保平台文档与平台功能同步更新。
5. **数据驱动决策**:用数据证明平台价值,也用数据发现潜在问题。
### 6.2 进阶学习方向
* **Backstage 深入**:学习如何开发 Backstage 插件以扩展平台能力。
* **Crossplane**:探索使用 Kubernetes 理念来声明式地管理云资源。
* **Kratix**:了解声明式的平台构建框架。
* **Humanitec/Port**:参考商业化的 IDP 解决方案设计思路。
* **产品管理**:学习产品思维,真正将平台当作一个产品来运营。
### 6.3 参考资料
* [Team Topologies](https://teamtopologies.com/) - 了解平台团队的定位与协作模式
* [Platform Engineering on Kubernetes](https://www.manning.com/books/platform-engineering-on-kubernetes) - 技术实现参考书籍
* [Backstage.io](https://backstage.io/) - 开发者门户开源方案
* [CNCF Platforms White Paper](https://tag-app-delivery.cncf.io/whitepapers/platforms/) - CNCF 对平台工程的定义与白皮书
* [Internal Developer Platform](https://internaldeveloperplatform.org/) - IDP 社区资源
## 附录
### A. 命令速查表
```bash
# 项目管理
platform project create --name myapp --team myteam --template springboot
platform project list --team myteam
platform project delete --name myapp --force
# 部署操作
platform deploy --project myapp --env dev --version v1.2.3
platform deploy status --project myapp --env dev
platform rollback --project myapp --env dev --to-version v1.2.2
# 资源管理
platform scale --project myapp --env prod --replicas 5
platform quota show --team myteam
platform quota request --team myteam --cpu 16 --memory 32Gi
# 调试
platform logs --project myapp --env prod --tail 100 --follow
platform exec --project myapp --env dev --command "sh"
platform port-forward --project myapp --env dev --port 8080:8080
```

### B. 配置参数详解

| 参数 | 说明 | 默认值 | 可选值 |
|------|------|--------|--------|
| replicas | Pod 副本数 | 2 | 1-50 |
| memory | 内存限制 | 1Gi | 256Mi-16Gi |
| cpu | CPU 限制 | 1000m | 100m-8000m |
| deployStrategy | 部署策略 | rolling | rolling/blue-green/canary |
| healthCheckPath | 健康检查路径 | /health | 任意路径 |
| ingressEnabled | 启用 Ingress | true | true/false |
| metricsEnabled | 启用指标采集 | true | true/false |
### C. 术语表

| 术语 | 英文 | 说明 |
|------|------|------|
| 内部开发者平台 | Internal Developer Platform (IDP) | 为开发者提供自助服务的内部平台 |
| 黄金路径 | Golden Path | 推荐的、经过验证的标准化流程 |
| 认知负荷 | Cognitive Load | 开发者需要了解的非业务知识的负担 |
| 自助服务 | Self-Service | 用户无需人工介入即可完成的操作 |
| 平台即产品 | Platform as a Product | 把平台当作产品来运营的理念 |
| 护栏 | Guardrails | 限制但不阻止的安全边界 |
本文基于我们团队近三年的 Platform Engineering 实践经验整理而成。每个团队的具体情况不同,切勿生搬硬套,请根据自身实际情况进行调整和优化。欢迎在云栈社区交流讨论。