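Overview
Background
When a new project starts, the first step is usually to build the base infrastructure on a cloud platform: a VPC, subnets, security groups, NAT gateways, and so on. We used to do this by hand in the console. One environment could take half an hour, and four environments (Dev, Test, Staging, Prod) took more than two hours. Worse, manual work is error-prone, so small differences creep in between environments and surface only during troubleshooting, for example "the subnet CIDRs in Dev and Prod do not match".
Terraform, HashiCorp's Infrastructure as Code (IaC) tool, lets us define cloud resources in code, which brings version control, code reuse, and repeatable deployment. This article shows how to use Terraform to deploy a standardized multi-environment VPC architecture, finishing in about 30 minutes the work that used to take half a day by hand.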
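Technical features
- Declarative definitions: describe "what you want" rather than "how to do it".
- Multi-cloud support: the same tool and workflow manage AWS, Alibaba Cloud, Tencent Cloud, and other platforms; the resource definitions themselves remain provider-specific.
- State management: tracks resource state, enabling incremental updates and configuration drift detection.
- Modularity: code reuse, so one template can deploy several environments.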
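Use cases
- Case 1: quickly building the infrastructure for several environments of a new project.
- Case 2: standardizing existing environments and eliminating configuration drift.
- Case 3: building a disaster recovery environment in one step.
- Case 4: automated testing and validation of infrastructure configuration.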
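Environment requirements
| Component | Version | Notes |
| --- | --- | --- |
| Terraform | 1.5+ | The latest stable release is recommended |
| Cloud account | AWS / Alibaba Cloud / Tencent Cloud | Needs sufficient permissions to create network resources |
| Operating system | Windows / macOS / Linux | Terraform supports all of these platforms |
| Git | 2.x | For version control (optional but recommended) |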
Detailed steps
Preparation
# macOS (Homebrew, using the official HashiCorp tap)
brew tap hashicorp/tap
brew install hashicorp/tap/terraform
# Linux (Ubuntu/Debian)
wget -O- https://apt.releases.hashicorp.com/gpg | sudo gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/hashicorp.list
sudo apt update && sudo apt install terraform
# Windows (Chocolatey)
choco install terraform
# Verify the installation
terraform version
Configure cloud account credentials
AWS configuration:
# Option 1: credentials file
mkdir -p ~/.aws
cat > ~/.aws/credentials << EOF
[default]
aws_access_key_id = YOUR_ACCESS_KEY
aws_secret_access_key = YOUR_SECRET_KEY
EOF
# Option 2: environment variables
export AWS_ACCESS_KEY_ID="YOUR_ACCESS_KEY"
export AWS_SECRET_ACCESS_KEY="YOUR_SECRET_KEY"
export AWS_DEFAULT_REGION="ap-northeast-1"
Alibaba Cloud configuration:
# Environment variables
export ALICLOUD_ACCESS_KEY="YOUR_ACCESS_KEY"
export ALICLOUD_SECRET_KEY="YOUR_SECRET_KEY"
export ALICLOUD_REGION="cn-hangzhou"
Create the project directory structure
mkdir -p terraform-vpc-project
cd terraform-vpc-project
# Recommended layout
mkdir -p modules/vpc
mkdir -p environments/{dev,staging,prod}
# Create the base files (each environment also needs a variables.tf)
touch modules/vpc/{main.tf,variables.tf,outputs.tf}
touch environments/dev/{main.tf,variables.tf,terraform.tfvars}
touch environments/staging/{main.tf,variables.tf,terraform.tfvars}
touch environments/prod/{main.tf,variables.tf,terraform.tfvars}
Write the VPC module
Module main configuration (AWS example)
# modules/vpc/main.tf
# AWS VPC module: VPC, subnets, route tables, NAT gateways, and related resources
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
# Look up the Availability Zones that are currently available
data "aws_availability_zones" "available" {
state = "available"
}
# Local values
locals {
azs = slice(data.aws_availability_zones.available.names, 0, var.az_count)
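# Generate the subnet CIDRs: public subnets take indexes 0..az_count-1, private subnets the next az_count indexes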
public_subnets = [
for i, az in local.azs :
cidrsubnet(var.vpc_cidr, var.subnet_newbits, i)
]
private_subnets = [
for i, az in local.azs :
cidrsubnet(var.vpc_cidr, var.subnet_newbits, i + var.az_count)
]
# Common tags
common_tags = merge(var.tags, {
Environment = var.environment
ManagedBy = "terraform"
})
}
# VPC
resource "aws_vpc" "main" {
cidr_block = var.vpc_cidr
enable_dns_hostnames = true
enable_dns_support = true
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-vpc"
})
}
# Internet Gateway
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-igw"
})
}
# Public subnets
resource "aws_subnet" "public" {
count = var.az_count
vpc_id = aws_vpc.main.id
cidr_block = local.public_subnets[count.index]
availability_zone = local.azs[count.index]
map_public_ip_on_launch = true
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-public-${local.azs[count.index]}"
Type = "public"
})
}
# Private subnets
resource "aws_subnet" "private" {
count = var.az_count
vpc_id = aws_vpc.main.id
cidr_block = local.private_subnets[count.index]
availability_zone = local.azs[count.index]
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-private-${local.azs[count.index]}"
Type = "private"
})
}
# Elastic IP for NAT Gateway
resource "aws_eip" "nat" {
count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : var.az_count) : 0
domain = "vpc"
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-nat-eip-${count.index + 1}"
})
depends_on = [aws_internet_gateway.main]
}
# NAT Gateway
resource "aws_nat_gateway" "main" {
count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : var.az_count) : 0
allocation_id = aws_eip.nat[count.index].id
subnet_id = aws_subnet.public[count.index].id
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-nat-${count.index + 1}"
})
depends_on = [aws_internet_gateway.main]
}
# Public route table
resource "aws_route_table" "public" {
vpc_id = aws_vpc.main.id
route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.main.id
}
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-public-rt"
})
}
# Public route table associations
resource "aws_route_table_association" "public" {
count = var.az_count
subnet_id = aws_subnet.public[count.index].id
route_table_id = aws_route_table.public.id
}
# Private route tables (one per NAT gateway, or a single table when NAT is disabled)
resource "aws_route_table" "private" {
count = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : var.az_count) : 1
vpc_id = aws_vpc.main.id
dynamic "route" {
for_each = var.enable_nat_gateway ? [1] : []
content {
cidr_block = "0.0.0.0/0"
nat_gateway_id = var.single_nat_gateway ? aws_nat_gateway.main[0].id : aws_nat_gateway.main[count.index].id
}
}
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-private-rt-${count.index + 1}"
})
}
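# Private route table associations
resource "aws_route_table_association" "private" {
  count     = var.az_count
  subnet_id = aws_subnet.private[count.index].id
  # A per-AZ table exists only when every AZ has its own NAT gateway.
  # With a single NAT gateway, or with NAT disabled, only table 0 exists,
  # so every private subnet must share it.
  route_table_id = aws_route_table.private[(var.enable_nat_gateway && !var.single_nat_gateway) ? count.index : 0].id
}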
# Default security group
resource "aws_security_group" "default" {
name = "${var.project_name}-${var.environment}-default-sg"
description = "Default security group for ${var.project_name} ${var.environment}"
vpc_id = aws_vpc.main.id
# Allow all traffic from inside the VPC
ingress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = [var.vpc_cidr]
description = "Allow all traffic within VPC"
}
# Allow all outbound traffic
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
description = "Allow all outbound traffic"
}
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-default-sg"
})
}
# VPC Flow Logs (optional)
resource "aws_flow_log" "main" {
count = var.enable_flow_logs ? 1 : 0
iam_role_arn = aws_iam_role.flow_logs[0].arn
log_destination = aws_cloudwatch_log_group.flow_logs[0].arn
traffic_type = "ALL"
vpc_id = aws_vpc.main.id
tags = merge(local.common_tags, {
Name = "${var.project_name}-${var.environment}-flow-logs"
})
}
resource "aws_cloudwatch_log_group" "flow_logs" {
count = var.enable_flow_logs ? 1 : 0
name = "/aws/vpc-flow-logs/${var.project_name}-${var.environment}"
retention_in_days = var.flow_logs_retention_days
tags = local.common_tags
}
resource "aws_iam_role" "flow_logs" {
count = var.enable_flow_logs ? 1 : 0
name = "${var.project_name}-${var.environment}-flow-logs-role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = "sts:AssumeRole"
Effect = "Allow"
Principal = {
Service = "vpc-flow-logs.amazonaws.com"
}
}
]
})
tags = local.common_tags
}
resource "aws_iam_role_policy" "flow_logs" {
count = var.enable_flow_logs ? 1 : 0
name = "${var.project_name}-${var.environment}-flow-logs-policy"
role = aws_iam_role.flow_logs[0].id
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Action = [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams"
]
Effect = "Allow"
Resource = "*"
}
]
})
}
Variable definitions
# modules/vpc/variables.tf
variable "project_name" {
  description = "Project name, used in resource names"
  type        = string
}
variable "environment" {
  description = "Environment name (dev/staging/prod)"
  type        = string
}
variable "vpc_cidr" {
  description = "VPC CIDR block"
  type        = string
  default     = "10.0.0.0/16"
}
variable "az_count" {
  description = "Number of Availability Zones to use"
  type        = number
  default     = 2
}
variable "subnet_newbits" {
  description = "Extra prefix bits for each subnet CIDR (passed to cidrsubnet)"
  type        = number
  default     = 8
}
variable "enable_nat_gateway" {
  description = "Whether to create NAT gateways"
  type        = bool
  default     = true
}
variable "single_nat_gateway" {
  description = "Whether to use a single NAT gateway (saves cost)"
  type        = bool
  default     = false
}
variable "enable_flow_logs" {
  description = "Whether to enable VPC Flow Logs"
  type        = bool
  default     = false
}
variable "flow_logs_retention_days" {
  description = "Flow Logs retention in days"
  type        = number
  default     = 30
}
variable "tags" {
  description = "Additional resource tags"
  type        = map(string)
  default     = {}
}
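The variables above accept any string or number, so a typo such as an invalid CIDR only fails once the provider calls the cloud API. As a minimal sketch, `validation` blocks can reject bad input at plan time. The declarations below would replace the ones of the same name rather than sit next to them; the allowed environment names and the upper bound of 3 Availability Zones are assumptions made for illustration, not requirements stated in this article.

```hcl
# modules/vpc/variables.tf (sketch): replaces the same-named declarations

variable "environment" {
  description = "Environment name (dev/staging/prod)"
  type        = string

  validation {
    # Assumption: only these three environment names are in use.
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "The environment value must be one of: dev, staging, prod."
  }
}

variable "vpc_cidr" {
  description = "VPC CIDR block"
  type        = string
  default     = "10.0.0.0/16"

  validation {
    # cidrhost() fails on malformed input; can() turns that failure into false.
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "The vpc_cidr value must be a valid CIDR block such as 10.0.0.0/16."
  }
}

variable "az_count" {
  description = "Number of Availability Zones to use"
  type        = number
  default     = 2

  validation {
    # Assumption: at most 3 AZs; floor() rejects fractional values.
    condition     = var.az_count >= 1 && var.az_count <= 3 && floor(var.az_count) == var.az_count
    error_message = "The az_count value must be a whole number between 1 and 3."
  }
}
```

With these in place, `terraform plan` stops with the error message before any resource is touched.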
Output definitions
# modules/vpc/outputs.tf
# Subnets, NAT gateways, EIPs, and private route tables are created with count,
# so their list outputs need the splat expression [*].
output "vpc_id" {
  description = "VPC ID"
  value       = aws_vpc.main.id
}
output "vpc_cidr" {
  description = "VPC CIDR block"
  value       = aws_vpc.main.cidr_block
}
output "public_subnet_ids" {
  description = "List of public subnet IDs"
  value       = aws_subnet.public[*].id
}
output "private_subnet_ids" {
  description = "List of private subnet IDs"
  value       = aws_subnet.private[*].id
}
output "public_subnet_cidrs" {
  description = "List of public subnet CIDRs"
  value       = aws_subnet.public[*].cidr_block
}
output "private_subnet_cidrs" {
  description = "List of private subnet CIDRs"
  value       = aws_subnet.private[*].cidr_block
}
output "private_route_table_ids" {
  description = "List of private route table IDs (used by the VPC endpoint example in the cost section)"
  value       = aws_route_table.private[*].id
}
output "nat_gateway_ids" {
  description = "List of NAT gateway IDs"
  value       = aws_nat_gateway.main[*].id
}
output "nat_gateway_public_ips" {
  description = "List of NAT gateway public IPs"
  value       = aws_eip.nat[*].public_ip
}
output "default_security_group_id" {
  description = "Default security group ID"
  value       = aws_security_group.default.id
}
output "internet_gateway_id" {
  description = "Internet Gateway ID"
  value       = aws_internet_gateway.main.id
}
output "availability_zones" {
  description = "List of Availability Zones in use"
  value       = local.azs
}
Configure each environment
Development environment
Main configuration file (main.tf):
# environments/dev/main.tf
terraform {
required_version = ">= 1.5.0"
# Backend configuration (optional; a remote backend is recommended)
# backend "s3" {
# bucket = "your-terraform-state-bucket"
# key = "dev/vpc/terraform.tfstate"
# region = "ap-northeast-1"
# encrypt = true
# dynamodb_table = "terraform-locks"
# }
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Project = var.project_name
Environment = "dev"
ManagedBy = "terraform"
}
}
}
module "vpc" {
source = "../../modules/vpc"
project_name = var.project_name
environment = "dev"
vpc_cidr = var.vpc_cidr
az_count = 2
enable_nat_gateway = true
single_nat_gateway = true # One NAT gateway in dev to save cost
enable_flow_logs = false # Flow Logs are not needed in dev
tags = {
CostCenter = "development"
}
}
# Outputs
output "vpc_id" {
value = module.vpc.vpc_id
}
output "public_subnet_ids" {
value = module.vpc.public_subnet_ids
}
output "private_subnet_ids" {
value = module.vpc.private_subnet_ids
}
output "nat_gateway_ips" {
value = module.vpc.nat_gateway_public_ips
}
Variables file (variables.tf):
# environments/dev/variables.tf
variable "aws_region" {
description = "AWS region"
type = string
default = "ap-northeast-1"
}
variable "project_name" {
description = "Project name"
type = string
}
variable "vpc_cidr" {
description = "VPC CIDR"
type = string
}
Variable values file (terraform.tfvars):
# environments/dev/terraform.tfvars
aws_region = "ap-northeast-1"
project_name = "myproject"
vpc_cidr = "10.1.0.0/16"
Production environment
Main configuration file (main.tf):
# environments/prod/main.tf
terraform {
required_version = ">= 1.5.0"
# A remote backend is strongly recommended for production
# backend "s3" {
# bucket = "your-terraform-state-bucket"
# key = "prod/vpc/terraform.tfstate"
# region = "ap-northeast-1"
# encrypt = true
# dynamodb_table = "terraform-locks"
# }
}
provider "aws" {
region = var.aws_region
default_tags {
tags = {
Project = var.project_name
Environment = "prod"
ManagedBy = "terraform"
}
}
}
module "vpc" {
source = "../../modules/vpc"
project_name = var.project_name
environment = "prod"
vpc_cidr = var.vpc_cidr
az_count = 3 # Three Availability Zones in production
enable_nat_gateway = true
single_nat_gateway = false # One NAT gateway per AZ in production
enable_flow_logs = true # Flow Logs enabled in production
flow_logs_retention_days = 90
tags = {
CostCenter = "production"
Compliance = "required"
}
}
output "vpc_id" {
value = module.vpc.vpc_id
}
output "public_subnet_ids" {
value = module.vpc.public_subnet_ids
}
output "private_subnet_ids" {
value = module.vpc.private_subnet_ids
}
output "nat_gateway_ips" {
value = module.vpc.nat_gateway_public_ips
}
Variable values file (terraform.tfvars); variables.tf is the same as in the development environment:
# environments/prod/terraform.tfvars
aws_region = "ap-northeast-1"
project_name = "myproject"
vpc_cidr = "10.0.0.0/16"
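The directory layout includes a staging environment, but only dev and prod are spelled out. A possible staging configuration, sketched under stated assumptions, sits between the two: it reuses the same module, keeps the cheaper single NAT gateway, and turns Flow Logs on so that behavior can be checked before production. The settings chosen here (2 Availability Zones, 30-day retention, the `CostCenter` value) are illustrative assumptions; the `10.3.0.0/16` default follows the pre-production range in the CIDR plan later in this article. Variables are declared inline only to keep the sketch self-contained; in the project they would live in variables.tf.

```hcl
# environments/staging/main.tf (sketch, assumptions noted inline)
terraform {
  required_version = ">= 1.5.0"
}

variable "aws_region" {
  description = "AWS region"
  type        = string
  default     = "ap-northeast-1"
}

variable "project_name" {
  description = "Project name"
  type        = string
  default     = "myproject"
}

variable "vpc_cidr" {
  description = "VPC CIDR"
  type        = string
  default     = "10.3.0.0/16" # pre-production range from the CIDR plan
}

provider "aws" {
  region = var.aws_region
}

module "vpc" {
  source = "../../modules/vpc"

  project_name = var.project_name
  environment  = "staging"
  vpc_cidr     = var.vpc_cidr

  az_count                 = 2    # assumption: same footprint as dev
  enable_nat_gateway       = true
  single_nat_gateway       = true # assumption: cost matters more than AZ redundancy here
  enable_flow_logs         = true # assumption: exercise Flow Logs before prod
  flow_logs_retention_days = 30

  tags = {
    CostCenter = "staging" # hypothetical tag value
  }
}

output "vpc_id" {
  value = module.vpc.vpc_id
}
```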
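Deployment
Initialize and deploy (development environment as the example)
# Enter the development environment directory (from the project root)
cd environments/dev
# Initialize Terraform
terraform init
# Check formatting
terraform fmt -check
# Validate the configuration
terraform validate
# Preview the changes and save the plan
terraform plan -out=tfplan
# Apply the saved plan
terraform apply tfplan
# Show the outputs
terraform output
Deploy the production environment
# Enter the production environment directory (from the project root)
cd environments/prod
# Initialize
terraform init
# Preview (review production plans carefully)
terraform plan -out=tfplan
# Confirm once more before deploying
terraform show tfplan
# Apply the saved plan
terraform apply tfplan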
Example code and configuration
Complete project structure
terraform-vpc-project/
├── modules/
│   └── vpc/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── terraform.tfvars
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       └── terraform.tfvars
├── scripts/
│   └── deploy.sh
└── README.md
Automated deployment script
#!/bin/bash
# scripts/deploy.sh
# Multi-environment deployment script; run it from the project root,
# for example: ./scripts/deploy.sh dev plan
set -e
ENVIRONMENT=${1:-"dev"}
ACTION=${2:-"plan"}
# Color definitions
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
log_info() {
  echo -e "${GREEN}[INFO]${NC} $1"
}
log_warn() {
  echo -e "${YELLOW}[WARN]${NC} $1"
}
log_error() {
  echo -e "${RED}[ERROR]${NC} $1"
}
# Check that the environment exists
if [[ ! -d "environments/${ENVIRONMENT}" ]]; then
  log_error "Environment ${ENVIRONMENT} does not exist"
  echo "Available environments: $(ls environments/ | tr '\n' ' ')"
  exit 1
fi
cd "environments/${ENVIRONMENT}"
log_info "Environment: ${ENVIRONMENT}"
log_info "Action: ${ACTION}"
case ${ACTION} in
  init)
    log_info "Initializing Terraform..."
    terraform init
    ;;
  plan)
    log_info "Generating the execution plan..."
    terraform plan -out=tfplan
    ;;
  apply)
    if [[ ! -f "tfplan" ]]; then
      log_warn "No plan file found, running plan first..."
      terraform plan -out=tfplan
    fi
    if [[ "${ENVIRONMENT}" == "prod" ]]; then
      log_warn "About to deploy to production!"
      read -p "Confirm deployment? (yes/no): " confirm
      if [[ "${confirm}" != "yes" ]]; then
        log_info "Deployment cancelled"
        exit 0
      fi
    fi
    log_info "Deploying..."
    terraform apply tfplan
    rm -f tfplan
    ;;
  destroy)
    # Production can never be destroyed through this script
    if [[ "${ENVIRONMENT}" == "prod" ]]; then
      log_error "Destroying the production environment is not allowed!"
      exit 1
    fi
    log_warn "About to destroy ALL resources in the ${ENVIRONMENT} environment!"
    read -p "Confirm destroy? (yes/no): " confirm
    if [[ "${confirm}" == "yes" ]]; then
      terraform destroy
    else
      log_info "Operation cancelled"
    fi
    ;;
  output)
    terraform output
    ;;
  *)
    echo "Usage: $0 <environment> <action>"
    echo "Environments: dev, staging, prod"
    echo "Actions: init, plan, apply, destroy, output"
    exit 1
    ;;
esac
log_info "Done"
Alibaba Cloud version of the module
# modules/vpc-alicloud/main.tf
# Alibaba Cloud VPC module (uses the same input variables as the AWS module)
terraform {
required_providers {
alicloud = {
source = "aliyun/alicloud"
version = "~> 1.200"
}
}
}
data "alicloud_zones" "available" {
available_resource_creation = "VSwitch"
}
locals {
# zones is a list of objects, so the splat [*] is needed to collect the IDs
azs = slice(data.alicloud_zones.available.zones[*].id, 0, var.az_count)
common_tags = merge(var.tags, {
Environment = var.environment
ManagedBy = "terraform"
})
}
# VPC
resource "alicloud_vpc" "main" {
vpc_name = "${var.project_name}-${var.environment}-vpc"
cidr_block = var.vpc_cidr
description = "VPC for ${var.project_name} ${var.environment}"
tags = local.common_tags
}
# Public VSwitches
resource "alicloud_vswitch" "public" {
count = var.az_count
vpc_id = alicloud_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, var.subnet_newbits, count.index)
zone_id = local.azs[count.index]
vswitch_name = "${var.project_name}-${var.environment}-public-${local.azs[count.index]}"
tags = merge(local.common_tags, {
Type = "public"
})
}
# Private VSwitches
resource "alicloud_vswitch" "private" {
count = var.az_count
vpc_id = alicloud_vpc.main.id
cidr_block = cidrsubnet(var.vpc_cidr, var.subnet_newbits, count.index + var.az_count)
zone_id = local.azs[count.index]
vswitch_name = "${var.project_name}-${var.environment}-private-${local.azs[count.index]}"
tags = merge(local.common_tags, {
Type = "private"
})
}
# NAT gateway
resource "alicloud_nat_gateway" "main" {
count = var.enable_nat_gateway ? 1 : 0
vpc_id = alicloud_vpc.main.id
nat_gateway_name = "${var.project_name}-${var.environment}-nat"
payment_type = "PayAsYouGo"
vswitch_id = alicloud_vswitch.public[0].id
nat_type = "Enhanced"
internet_charge_type = "PayByLcu"
tags = local.common_tags
}
# EIP
resource "alicloud_eip_address" "nat" {
count = var.enable_nat_gateway ? 1 : 0
address_name = "${var.project_name}-${var.environment}-nat-eip"
bandwidth = 100
internet_charge_type = "PayByTraffic"
tags = local.common_tags
}
# Bind the EIP to the NAT gateway
resource "alicloud_eip_association" "nat" {
count = var.enable_nat_gateway ? 1 : 0
allocation_id = alicloud_eip_address.nat[0].id
instance_id = alicloud_nat_gateway.main[0].id
instance_type = "Nat"
}
# SNAT entries (let the private VSwitches reach the internet)
resource "alicloud_snat_entry" "main" {
count = var.enable_nat_gateway ? var.az_count : 0
snat_table_id = alicloud_nat_gateway.main[0].snat_table_ids
source_vswitch_id = alicloud_vswitch.private[count.index].id
snat_ip = alicloud_eip_address.nat[0].ip_address
depends_on = [alicloud_eip_association.nat]
}
# Security group
resource "alicloud_security_group" "default" {
name = "${var.project_name}-${var.environment}-default-sg"
description = "Default security group"
vpc_id = alicloud_vpc.main.id
tags = local.common_tags
}
# Security group rules
resource "alicloud_security_group_rule" "allow_vpc_ingress" {
type = "ingress"
ip_protocol = "all"
nic_type = "intranet"
policy = "accept"
port_range = "-1/-1"
priority = 1
security_group_id = alicloud_security_group.default.id
cidr_ip = var.vpc_cidr
description = "Allow all traffic within VPC"
}
resource "alicloud_security_group_rule" "allow_all_egress" {
type = "egress"
ip_protocol = "all"
nic_type = "intranet"
policy = "accept"
port_range = "-1/-1"
priority = 1
security_group_id = alicloud_security_group.default.id
cidr_ip = "0.0.0.0/0"
description = "Allow all outbound traffic"
}
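The Alibaba Cloud module above defines resources but no outputs, so an environment that calls it has nothing to read back. A minimal outputs file might look like the sketch below. The file path and output names are assumptions chosen to mirror the AWS module; every attribute referenced is one the module itself already uses, or a resource ID.

```hcl
# modules/vpc-alicloud/outputs.tf (sketch; output names are assumptions)

output "vpc_id" {
  description = "VPC ID"
  value       = alicloud_vpc.main.id
}

output "public_vswitch_ids" {
  description = "List of public VSwitch IDs"
  value       = alicloud_vswitch.public[*].id
}

output "private_vswitch_ids" {
  description = "List of private VSwitch IDs"
  value       = alicloud_vswitch.private[*].id
}

output "nat_gateway_id" {
  description = "NAT gateway ID, or null when the NAT gateway is disabled"
  # one() returns the single element of a 0-or-1 element list, or null if it is empty.
  value = one(alicloud_nat_gateway.main[*].id)
}

output "nat_public_ip" {
  description = "Public IP of the NAT gateway EIP, or null when disabled"
  value       = one(alicloud_eip_address.nat[*].ip_address)
}

output "default_security_group_id" {
  description = "Default security group ID"
  value       = alicloud_security_group.default.id
}
```

The module also needs a variables.tf; the declarations from the AWS module that it references (`project_name`, `environment`, `vpc_cidr`, `az_count`, `subnet_newbits`, `enable_nat_gateway`, `tags`) can be copied over unchanged.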
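Best practices and caveats
Best practices
State management
Production environments should always keep Terraform state in a remote backend. It gives the team one shared copy of the state, supports locking against concurrent runs, and lets the state file be encrypted and versioned.
# backend.tf - S3 backend configuration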
terraform {
backend "s3" {
bucket = "your-company-terraform-state"
key = "prod/vpc/terraform.tfstate"
region = "ap-northeast-1"
encrypt = true
dynamodb_table = "terraform-state-locks" # Prevents concurrent operations
# Optional: assume a role
# role_arn = "arn:aws:iam::ACCOUNT_ID:role/TerraformRole"
}
}
# Create the S3 bucket and DynamoDB table (one-time setup)
aws s3 mb s3://your-company-terraform-state --region ap-northeast-1
aws s3api put-bucket-versioning \
--bucket your-company-terraform-state \
--versioning-configuration Status=Enabled
aws dynamodb create-table \
--table-name terraform-state-locks \
--attribute-definitions AttributeName=LockID,AttributeType=S \
--key-schema AttributeName=LockID,KeyType=HASH \
--billing-mode PAY_PER_REQUEST
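Handling sensitive information
# Outputs that carry secrets should set sensitive = true.
# NAT gateway IPs are not secret, so this one stays false.
output "nat_gateway_ips" {
  value     = module.vpc.nat_gateway_public_ips
  sensitive = false # IPs can be public
}
# Never hard-code credentials in the configuration
# Wrong:
# provider "aws" {
# access_key = "AKIAXXXXXXXX"
# secret_key = "xxxxxxxx"
# }
# Right: use environment variables or the AWS credentials file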
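For a value that really is secret, a sketch of the `sensitive` flag in use might look like this. The variable name `api_token` is hypothetical and stands in for any secret input; it would be supplied through the `TF_VAR_api_token` environment variable rather than written into a tfvars file.

```hcl
# Hypothetical secret input; set it with: export TF_VAR_api_token="..."
variable "api_token" {
  description = "Hypothetical secret, shown only to illustrate the sensitive flag"
  type        = string
  sensitive   = true
}

output "api_token" {
  value = var.api_token
  # Required here: Terraform rejects a root-module output that is derived
  # from a sensitive value unless the output is marked sensitive as well.
  sensitive = true
}
```

Terraform then redacts the value in plan and apply output. The value is still written to the state file in plain text, which is one more reason to keep state in an encrypted remote backend.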
Code standards
# Format the code
terraform fmt -recursive
# Validate the configuration
terraform validate
# Lint for best practices with tflint
tflint --init
tflint
# Run a security scan with tfsec
tfsec .
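The commands above check the code statically. Terraform 1.5 and later also support `check` blocks, which assert properties of the planned or deployed infrastructure on every plan and apply. The sketch below assumes it lives in an environment directory next to the `module "vpc"` call and that the module's list outputs return lists; the file name and the two assertions are illustrative choices, not part of the original project.

```hcl
# environments/dev/checks.tf (hypothetical file name)

check "vpc_layout" {
  # Every Availability Zone in use should have exactly one private subnet.
  assert {
    condition     = length(module.vpc.private_subnet_ids) == length(module.vpc.availability_zones)
    error_message = "Expected one private subnet per Availability Zone."
  }

  # Public and private subnets must not share a CIDR block.
  assert {
    condition = length(setintersection(
      module.vpc.public_subnet_cidrs,
      module.vpc.private_subnet_cidrs
    )) == 0
    error_message = "A CIDR block is used by both a public and a private subnet."
  }
}
```

A failed check is reported as a warning and does not block the apply, so it complements `terraform validate` rather than replacing a hard gate in CI.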
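Caveats
Common errors
| Error message | Cause | Solution |
| --- | --- | --- |
| Error: No valid credential sources found | Cloud credentials are not configured | Set the environment variables or the credentials file |
| Error: Error locking state | The state file is locked | Check whether another operation is in progress, or unlock manually with terraform force-unlock |
| Error: CIDR block overlaps | CIDR conflict | Review the CIDR plan for the VPC and its subnets |
| Error: limit exceeded | A resource quota has been reached | Request a quota increase or clean up unused resources |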
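Production checklist
- [ ] State is stored in a remote backend
- [ ] State file encryption is enabled
- [ ] The DynamoDB lock table is configured
- [ ] The code passes terraform validate
- [ ] The code passes the security scan (tfsec)
- [ ] The plan output has been reviewed carefully
- [ ] A rollback plan exists
- [ ] The relevant teams have been notified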
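CIDR planning recommendations
Example company-wide VPC CIDR plan:
10.0.0.0/8 - company-wide range
├── 10.0.0.0/16 - production
│   ├── 10.0.0.0/24 - public subnet AZ-a
│   ├── 10.0.1.0/24 - public subnet AZ-b
│   ├── 10.0.2.0/24 - public subnet AZ-c
│   ├── 10.0.10.0/24 - private subnet AZ-a
│   ├── 10.0.11.0/24 - private subnet AZ-b
│   └── 10.0.12.0/24 - private subnet AZ-c
├── 10.1.0.0/16 - development
├── 10.2.0.0/16 - test
└── 10.3.0.0/16 - pre-production
Notes:
- Give each environment a different second octet
- Reserve enough address space for future growth
- Avoid overlapping with office network and VPN ranges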
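The plan above places the private subnets at 10.0.10.0/24 onward, while the AWS module in this article numbers them directly after the public ones (`i + var.az_count`). The standalone sketch below works through `cidrsubnet` for the production VPC to show both layouts. `private_offset` is a hypothetical local introduced only for this comparison; it is not a variable of the module.

```hcl
# cidr-demo/main.tf (standalone sketch; needs no provider or credentials)

locals {
  vpc_cidr       = "10.0.0.0/16"
  subnet_newbits = 8 # /16 + 8 bits = /24 subnets
  az_count       = 3
  private_offset = 10 # hypothetical: start private subnets at index 10

  # What the module generates for public subnets: indexes 0, 1, 2
  public = [
    for i in range(local.az_count) :
    cidrsubnet(local.vpc_cidr, local.subnet_newbits, i)
  ]

  # What the module generates for private subnets: indexes 3, 4, 5
  private_as_in_module = [
    for i in range(local.az_count) :
    cidrsubnet(local.vpc_cidr, local.subnet_newbits, i + local.az_count)
  ]

  # What the plan above describes: indexes 10, 11, 12
  private_as_in_plan = [
    for i in range(local.az_count) :
    cidrsubnet(local.vpc_cidr, local.subnet_newbits, i + local.private_offset)
  ]
}

output "public" {
  value = local.public # ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
}

output "private_as_in_module" {
  value = local.private_as_in_module # ["10.0.3.0/24", "10.0.4.0/24", "10.0.5.0/24"]
}

output "private_as_in_plan" {
  value = local.private_as_in_plan # ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
}
```

A fixed offset leaves room to add public subnets later without renumbering the private ones. The cost is that changing the offset on an existing VPC changes the subnet CIDRs, which forces the subnets to be replaced.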
Troubleshooting and monitoring
Troubleshooting
Common issues
Issue 1: terraform init fails
# Check network connectivity
curl -I https://registry.terraform.io
# In mainland China, configure a mirror
# Create ~/.terraformrc
cat > ~/.terraformrc << EOF
provider_installation {
network_mirror {
url = "https://mirrors.aliyun.com/terraform/"
}
}
EOF
# Or use a proxy
export HTTPS_PROXY="http://proxy:8080"
terraform init
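Issue 2: corrupted or out-of-date state
# Back up the current state from the remote backend
terraform state pull > terraform.tfstate.backup
# If the state does not match reality, import the resource
# (from an environment directory, resources inside the module are addressed as module.vpc.<resource>)
terraform import module.vpc.aws_vpc.main vpc-xxxxxxxx
# Refresh the state (replaces the deprecated `terraform refresh` command)
terraform apply -refresh-only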
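Since this article targets Terraform 1.5+, the same import can also be written declaratively with an `import` block, which has the advantage of showing up in `terraform plan` before anything is written to state. This is a sketch: the file name is hypothetical, the VPC ID is the same placeholder used in the command above, and the target address assumes the block lives in an environment directory that calls the module as `module "vpc"`.

```hcl
# environments/prod/imports.tf (hypothetical file name)

import {
  # Address of the resource inside the VPC module, as seen from the environment root
  to = module.vpc.aws_vpc.main
  # Placeholder: replace with the real VPC ID
  id = "vpc-xxxxxxxx"
}
```

Running `terraform plan` then lists the resource as "will be imported", and `terraform apply` performs the import. The block can be deleted afterwards.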
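Issue 3: detecting resource drift
# Detect differences between the real resources and the state
terraform plan -refresh-only
# If there is drift, decide whether to accept it into state or to restore the resources
# Accept it: update the state to the real values
terraform apply -refresh-only
# Restore: re-apply the configuration
terraform apply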
Debug mode
# Enable verbose logging
export TF_LOG=DEBUG
export TF_LOG_PATH=./terraform.log
terraform plan
# Follow the log
tail -f terraform.log
# Log only the provider plugins (not Terraform core)
export TF_LOG_PROVIDER=DEBUG
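Cost monitoring
Estimating cost with Infracost
# Install Infracost
brew install infracost
# Log in to obtain an API key
infracost auth login
# Estimate costs
infracost breakdown --path .
# Save a JSON baseline of the current configuration
infracost breakdown --path . --format json --out-file /tmp/infracost.json
# After changing the configuration, compare against that baseline
# (--compare-to expects an Infracost JSON file, not a Terraform state file)
infracost diff --path . --compare-to /tmp/infracost.json
# Generate an HTML report from the JSON file
infracost output --path /tmp/infracost.json --format html > cost-report.html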
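Cost optimization suggestions
# Option 1: a single NAT gateway in dev to save cost
module "vpc" {
  source = "../../modules/vpc"
  # ...
  enable_nat_gateway = true
  single_nat_gateway = true # Saves roughly $30-50 per month for each NAT gateway avoided
}
# Option 2 (an alternative to the block above, not an addition): no NAT gateway in dev at all;
# instances in private subnets reach AWS services through VPC endpoints
module "vpc" {
  source = "../../modules/vpc"
  # ...
  enable_nat_gateway = false
}
# Add VPC endpoints (cheaper and more secure)
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = module.vpc.vpc_id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  # Requires the module to export the IDs of its private route tables
  route_table_ids = module.vpc.private_route_table_ids
}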
CI/CD integration
Integrating Terraform with CI/CD automates the deployment workflow, which is an important part of DevOps practice.
# .github/workflows/terraform.yml
name: Terraform CI/CD
on:
  push:
    branches: [main]
    paths:
      - 'environments/**'
      - 'modules/**'
  pull_request:
    branches: [main]
env:
  TF_VERSION: '1.5.0'
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Terraform Format Check
        run: terraform fmt -check -recursive
      - name: Terraform Validate
        run: |
          for env in environments/*/; do
            echo "Validating $env"
            cd "$env"
            terraform init -backend=false
            terraform validate
            cd ../..
          done
  security-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run tfsec
        uses: aquasecurity/tfsec-action@v1.0.0
  plan:
    needs: [validate, security-scan]
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    strategy:
      matrix:
        environment: [dev, staging]
    steps:
      - uses: actions/checkout@v3
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v2
        with:
          terraform_version: ${{ env.TF_VERSION }}
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ap-northeast-1
      - name: Terraform Plan
        run: |
          cd environments/${{ matrix.environment }}
          terraform init
          terraform plan -no-color -out=tfplan
      - name: Comment Plan on PR
        uses: actions/github-script@v6
        with:
          script: |
            // Add the plan output to a PR comment
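Summary
Key technical takeaways
- Modular design: one codebase supports deployment to several environments and cuts repeated work.
- State management: use a remote backend with locking enabled so the team can collaborate safely.
- Progressive rollout: deploy to dev first, then staging, and finally prod.
- Cost awareness: keep dev lean to save money, and give prod the redundancy it needs for high availability.
Further learning
- Advanced Terraform features
  - Learning resources: the official HashiCorp Learn tutorials.
  - Practice: study concepts such as workspaces, data sources, and provisioners in depth.
- GitOps with Terraform
  - Learning resources: documentation for tools such as Terraform Cloud and Atlantis.
  - Practice: try building a flow where a PR triggers an automatic plan and a merge triggers an automatic apply.
- Multi-cloud management
  - Learning resources: the official documentation of each cloud's Terraform provider.
  - Practice: try managing AWS and Alibaba Cloud resources from one Terraform codebase, with a separate module per provider.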
Appendix
Command cheat sheet
# Initialize
terraform init
terraform init -upgrade # Upgrade providers
# Format
terraform fmt
terraform fmt -recursive
# Validate
terraform validate
# Plan
terraform plan
terraform plan -out=tfplan
terraform plan -target=aws_vpc.main # Plan only a specific resource
# Apply
terraform apply
terraform apply tfplan
terraform apply -auto-approve # Skip the confirmation prompt (use with care)
# Destroy
terraform destroy
terraform destroy -target=aws_vpc.main # Destroy only a specific resource
# State management
terraform state list
terraform state show aws_vpc.main
terraform state mv aws_vpc.main aws_vpc.new_name
terraform state rm aws_vpc.main # Remove from state (the real resource is not deleted)
# Import
terraform import aws_vpc.main vpc-xxxxxxxx
# Outputs
terraform output
terraform output -json
Resource naming convention
{project}-{environment}-{resource_type}-{purpose}
Examples:
myproject-prod-vpc
myproject-prod-subnet-public-a
myproject-prod-nat-1
myproject-prod-sg-default
myproject-prod-igw
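One way to keep names consistent with this convention is to build them in a single place and reference them everywhere. The sketch below is standalone and illustrative: the `names` map and its keys are hypothetical, and the defaults simply reproduce the example names listed above.

```hcl
# naming-demo/main.tf (standalone sketch; needs no provider or credentials)

variable "project_name" {
  type    = string
  default = "myproject"
}

variable "environment" {
  type    = string
  default = "prod"
}

locals {
  # {project}-{environment} prefix shared by every resource name
  name_prefix = "${var.project_name}-${var.environment}"

  # Hypothetical lookup table: {prefix}-{resource_type}-{purpose}
  names = {
    vpc             = "${local.name_prefix}-vpc"
    igw             = "${local.name_prefix}-igw"
    public_subnet_a = "${local.name_prefix}-subnet-public-a"
    nat_1           = "${local.name_prefix}-nat-1"
    default_sg      = "${local.name_prefix}-sg-default"
  }
}

output "names" {
  value = local.names
}
```

Resources would then use, for example, `Name = local.names.vpc` in their tags instead of repeating the string interpolation.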
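Glossary
| Term | Chinese | Explanation |
| --- | --- | --- |
| Infrastructure as Code (IaC) | 基础设施即代码 | A methodology for defining and managing infrastructure in code |
| State File | 状态文件 | The file Terraform uses to record the current state of resources |
| Provider | 提供者 | A Terraform plugin that interacts with a specific cloud platform or service |
| Module | 模块 | A reusable unit of Terraform configuration |
| Data Source | 数据源 | A configuration block that reads information from outside the configuration |
| Output | 输出 | An attribute value exposed from a module or resource configuration |
| Backend | 后端 | Configuration that defines where and how the Terraform state file is stored |
| Drift | 漂移 | A mismatch between the real infrastructure and the state defined in code |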