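1. Overview
1.1 Background
At 6 p.m. on a Friday in 2025, our team was getting ready to leave when the message came in: a security patch had to be deployed to 200 servers, urgently. Xiao Wang from ops confidently opened the Ansible Playbook he had written and ran it. The result? The Playbook hard-coded a path that existed only on his own machine, and it was not idempotent, so it broke the Nginx configuration on 50 servers.
The whole team spent that weekend working overtime to repair the damage. We took the lesson to heart and spent three months putting together a set of conventions for writing Ansible Playbooks. Over the following three years, those conventions helped the team manage more than 2,000 servers and run over ten thousand automated deployments without a repeat of that kind of incident.
To be honest, Ansible has a low barrier to entry: a few lines of YAML and you are up and running. But that very "simplicity" leads many people to overlook conventions. A Playbook that runs for one person is not necessarily one a team can maintain; one that runs today is not guaranteed to run tomorrow.
1.2 Technical Characteristics
As an agentless configuration management tool, Ansible has several features that are easy to like:
Declarative configuration: you tell Ansible what state you want, not how to get there. Say "I want the Nginx service running" and Ansible works out whether it needs to start the service, restart it, or do nothing.
Idempotent design: in theory, running the same Playbook once or a hundred times should produce the same result. That "in theory" only holds if the Playbook is written carefully enough.
Modular architecture: Ansible ships thousands of modules across ansible-core and its collections, from file management to cloud platforms, so almost any operation you can think of already has a module.
SSH-based: no agent has to be installed on the target machines. SSH access plus a Python interpreter (see 1.4) is enough, which is especially useful in environments with strict security requirements.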
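The gap between "declare a state" and "run a command" is easiest to see side by side. The following is a minimal sketch, not one of our production Playbooks: the play name and the demo file `/tmp/idempotency-demo.conf` are made up for illustration, and it assumes the service is called `nginx` on the target.

```yaml
---
- name: Idempotent vs. non-idempotent tasks (illustrative sketch)
  hosts: webservers
  become: true
  tasks:
    # Non-idempotent: appends the same line on every run,
    # and the task always reports "changed".
    - name: Append a setting with a raw shell command (avoid this pattern)
      ansible.builtin.shell: echo 'vm.swappiness=10' >> /tmp/idempotency-demo.conf

    # Idempotent: the line is written only if it is missing or different,
    # so a second run reports "ok".
    - name: Ensure the setting is present
      ansible.builtin.lineinfile:
        path: /tmp/idempotency-demo.conf
        regexp: '^vm\.swappiness='
        line: vm.swappiness=10
        create: true
        mode: "0644"

    # Declarative: describe the desired state; the module decides
    # whether it has to start the service or do nothing.
    - name: Ensure Nginx is running
      ansible.builtin.service:
        name: nginx
        state: started
```

Running the play twice makes the difference visible: the shell task is "changed" both times and the demo file keeps growing, while the other two tasks settle to "ok".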
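But Ansible also has its headaches:
- YAML is extremely sensitive to indentation; one missing space and it errors out
- Variable precedence has 22 levels, so it is hard to tell which value overrode which
- Jinja2 templates easily turn unreadable in complex scenarios
- Performance becomes a problem at large scale
1.3 Use Cases
In our experience, Ansible is a particularly good fit for:
Configuration management: managing system configuration, packages, user permissions and so on in one place. We use Ansible to manage SSH configuration, firewall rules, and kernel parameter tuning on every server.
Application deployment: automating the deployment of web applications, microservices, and databases. Java applications and Node.js services can both be deployed with Ansible.
Environment initialization: standardized configuration when a new server comes online. We have a set of Playbooks that turns a bare machine into a server that meets company standards within 10 minutes.
Bulk operations: running the same operation across many servers, such as updating packages or changing configuration files in bulk.
Disaster recovery: rebuilding an environment quickly. With all configuration managed by Ansible, the whole production environment can in theory be rebuilt within a few hours.
Less suitable scenarios:
- Anything that needs a real-time response (Ansible runs have latency)
- Logic with complex conditionals and loops (Python or Go is a better fit)
- Target machines that cannot be reached over SSH
1.4 Environment Requirements
Control node requirements: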
# Operating System
os: Linux/macOS/WSL2 (native Windows is not supported as a control node)
python: ">=3.9"
ansible: ">=2.14"
# Recommended specs
cpu: 4 cores
memory: 8GB
disk: 50GB SSD
Target node requirements:
# Minimum requirements
os: Linux (RHEL/CentOS/Ubuntu/Debian)
python: ">=3.6"
ssh: enabled
sudo: configured for ansible user
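Network requirements:
- The control node can reach every target node over SSH
- A dedicated management network is recommended
- SSH listens on port 22 by default; a custom port is fine
What we actually run in production: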
# Ansible version we use in production
ansible --version
# ansible [core 2.15.6]
# python version = 3.11.4
# jinja version = 3.1.2
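# Install Ansible on control node
# (the 2.15.x series is published on PyPI as ansible-core; collections such as
# community.general are then installed separately via requirements.yml)
pip install ansible-core==2.15.6 ansible-lint==6.22.0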
# Create ansible user on target nodes
useradd -m -s /bin/bash ansible
echo "ansible ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ansible
chmod 440 /etc/sudoers.d/ansible
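The same bootstrap can be expressed as a small Playbook, so that the sudoers file is syntax-checked before it lands. This is a sketch under a few assumptions that are not in the commands above: the first run connects as an existing admin account (for example with `-u root`), the public key sits next to the private key from ansible.cfg as `~/.ssh/ansible_ed25519.pub`, and the `ansible.posix` collection is installed.

```yaml
---
- name: Bootstrap the ansible account (sketch)
  hosts: all
  become: true
  tasks:
    - name: Create the ansible user
      ansible.builtin.user:
        name: ansible
        shell: /bin/bash
        create_home: true

    - name: Install the control node's public key
      ansible.posix.authorized_key:
        user: ansible
        # Assumed key location; adjust to wherever the key pair lives.
        key: "{{ lookup('ansible.builtin.file', '~/.ssh/ansible_ed25519.pub') }}"

    - name: Grant passwordless sudo, validated before it is installed
      ansible.builtin.copy:
        content: "ansible ALL=(ALL) NOPASSWD:ALL\n"
        dest: /etc/sudoers.d/ansible
        owner: root
        group: root
        mode: "0440"
        # visudo rejects a broken file, so a typo cannot lock sudo.
        validate: /usr/sbin/visudo -cf %s
```

The `validate` step is the main reason to prefer this over `echo ... >`: a malformed file under /etc/sudoers.d can break sudo for every user on the host.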
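2. Detailed Steps
2.1 Preparation
2.1.1 Project Directory Structure
A well-organized Ansible project needs a clear directory structure. After several iterations we settled on this layout: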
ansible-infrastructure/
├── ansible.cfg                      # Ansible configuration
├── requirements.yml                 # Role dependencies from Galaxy
├── inventory/
│   ├── production/
│   │   ├── hosts.yml                # Production inventory
│   │   ├── group_vars/
│   │   │   ├── all.yml              # Variables for all hosts
│   │   │   ├── webservers.yml       # Variables for web servers
│   │   │   └── databases.yml        # Variables for databases
│   │   └── host_vars/               # File names must match inventory hostnames
│   │       ├── web01.prod.example.com.yml   # Host-specific variables
│   │       └── db01.prod.example.com.yml
│   └── staging/
│       ├── hosts.yml
│       ├── group_vars/
│       └── host_vars/
├── playbooks/
│   ├── site.yml                     # Main playbook
│   ├── webservers.yml               # Web server playbook
│   ├── databases.yml                # Database playbook
│   └── deploy-app.yml               # Application deployment
├── roles/
│   ├── common/                      # Common configurations
│   │   ├── tasks/
│   │   │   └── main.yml
│   │   ├── handlers/
│   │   │   └── main.yml
│   │   ├── templates/
│   │   ├── files/
│   │   ├── vars/
│   │   │   └── main.yml
│   │   ├── defaults/
│   │   │   └── main.yml
│   │   └── meta/
│   │       └── main.yml
│   ├── nginx/
│   ├── mysql/
│   └── monitoring/
├── library/                         # Custom modules
├── filter_plugins/                  # Custom filters
├── callback_plugins/                # Custom callbacks
├── files/                           # Static files
├── templates/                       # Jinja2 templates
└── scripts/                         # Helper scripts
    ├── run-playbook.sh
    └── vault-password.sh
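The tree lists requirements.yml without showing its content. A plausible sketch, based only on the collections whose modules appear in this article, could look like this. Versions are left out on purpose because the ones we pinned are not recorded here; in practice each entry would carry a `version:` constraint.

```yaml
# requirements.yml (sketch)
---
collections:
  - name: community.general   # timezone, pam_limits, yaml stdout callback
  - name: ansible.posix       # sysctl, authorized_key, profile_tasks callback
  - name: community.mysql     # mysql_replication, mysql_variables

# Roles pulled from Galaxy or Git would be listed here.
# The roles shown in this article all live in roles/ instead.
roles: []
```

The collections can then be installed with `ansible-galaxy collection install -r requirements.yml` on the control node.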
2.1.2 Ansible Configuration File
ansible.cfg is the core configuration file that controls Ansible's behavior. This is the configuration we use in production:
# ansible.cfg - Production configuration
[defaults]
# Inventory settings
inventory = inventory/production/hosts.yml
roles_path = roles:~/.ansible/roles:/usr/share/ansible/roles
# Performance tuning
forks = 50
poll_interval = 5
timeout = 30
# SSH settings
remote_user = ansible
private_key_file = ~/.ssh/ansible_ed25519
host_key_checking = False
transport = smart
# Output settings
stdout_callback = yaml
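# callbacks_enabled replaces the deprecated callback_whitelist option
callbacks_enabled = ansible.posix.timer, ansible.posix.profile_tasks, ansible.posix.profile_roles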
deprecation_warnings = True
system_warnings = True
# Fact caching (significantly improves performance)
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 86400
# Retry settings
retry_files_enabled = True
retry_files_save_path = ~/.ansible/retry
# Logging
log_path = /var/log/ansible/ansible.log
# Vault settings
vault_password_file = scripts/vault-password.sh
# Misc
nocows = 1
any_errors_fatal = False
error_on_undefined_vars = True
[privilege_escalation]
become = True
become_method = sudo
become_user = root
become_ask_pass = False
[ssh_connection]
ssh_args = -C -o ControlMaster=auto -o ControlPersist=600s -o PreferredAuthentications=publickey
pipelining = True
control_path_dir = ~/.ansible/cp
control_path = %(directory)s/%%h-%%r
[persistent_connection]
connect_timeout = 30
command_timeout = 30
2.1.3 Inventory Configuration
The inventory defines the servers you manage. The YAML format is clearer than INI:
# inventory/production/hosts.yml
all:
  children:
    webservers:
      hosts:
        web01.prod.example.com:
          ansible_host: 10.0.1.11
          nginx_worker_processes: 4
        web02.prod.example.com:
          ansible_host: 10.0.1.12
          nginx_worker_processes: 4
        web03.prod.example.com:
          ansible_host: 10.0.1.13
          nginx_worker_processes: 8
      vars:
        nginx_worker_connections: 4096
        app_port: 8080
    databases:
      children:
        mysql_primary:
          hosts:
            db01.prod.example.com:
              ansible_host: 10.0.2.11
              mysql_server_id: 1
        mysql_replica:
          hosts:
            db02.prod.example.com:
              ansible_host: 10.0.2.12
              mysql_server_id: 2
            db03.prod.example.com:
              ansible_host: 10.0.2.13
              mysql_server_id: 3
      vars:
        mysql_port: 3306
        mysql_datadir: /data/mysql
    redis:
      hosts:
        redis01.prod.example.com:
          ansible_host: 10.0.3.11
          redis_port: 6379
        redis02.prod.example.com:
          ansible_host: 10.0.3.12
          redis_port: 6379
        redis03.prod.example.com:
          ansible_host: 10.0.3.13
          redis_port: 6379
      vars:
        redis_maxmemory: 8gb
        redis_maxmemory_policy: allkeys-lru
    monitoring:
      hosts:
        prometheus01.prod.example.com:
          ansible_host: 10.0.4.11
        grafana01.prod.example.com:
          ansible_host: 10.0.4.12
  vars:
    ansible_user: ansible
    ansible_become: true
    ansible_python_interpreter: /usr/bin/python3
    ntp_servers:
      - 10.0.0.1
      - 10.0.0.2
    dns_servers:
      - 10.0.0.10
      - 10.0.0.11
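2.2 Core Configuration
2.2.1 Variable Management Conventions
Variable management is where Ansible goes wrong most easily. After falling into countless pits, we distilled these principles:
Variable naming conventions: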
# Good: Prefixed with role name, clear meaning
nginx_worker_processes: 4
nginx_worker_connections: 4096
mysql_max_connections: 500
redis_maxmemory: 8gb
# Bad: Ambiguous, no prefix
workers: 4 # Which service?
max_conn: 500 # MySQL? Redis? Nginx?
memory: 8gb # What memory?
Organizing variable files:
# inventory/production/group_vars/all.yml
# Global variables for all hosts
---
# Environment identifier
env: production
datacenter: dc1
region: cn-north-1
# Common system settings
timezone: Asia/Shanghai
locale: en_US.UTF-8
# Security settings
security_ssh_permit_root_login: false
security_ssh_password_authentication: false
security_fail2ban_enabled: true
# NTP configuration
ntp_enabled: true
ntp_servers:
- ntp1.aliyun.com
- ntp2.aliyun.com
# Package repository
package_repo_base_url: http://mirrors.aliyun.com
# Monitoring endpoints
prometheus_pushgateway: http://10.0.4.11:9091
# inventory/production/group_vars/webservers.yml
# Variables specific to web servers
---
# Nginx settings
nginx_user: www-data
nginx_worker_processes: auto
nginx_worker_connections: 4096
nginx_keepalive_timeout: 65
nginx_client_max_body_size: 100m
nginx_gzip_enabled: true
nginx_gzip_types:
- text/plain
- text/css
- application/json
- application/javascript
- text/xml
- application/xml
# SSL settings
nginx_ssl_protocols: TLSv1.2 TLSv1.3
nginx_ssl_ciphers: ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256
nginx_ssl_session_timeout: 1d
nginx_ssl_session_cache: shared:SSL:50m
# Application settings
app_name: myapp
app_user: deploy
app_group: deploy
app_port: 8080
app_deploy_path: /opt/apps/{{ app_name }}
app_log_path: /var/log/{{ app_name }}
app_java_opts: "-Xms2g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
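Encrypt sensitive values with Vault:
# Create encrypted file
# (group_vars/<name>.yml only applies to a group called <name>, so the vault file
# goes under group_vars/all/; this assumes all.yml moves to group_vars/all/vars.yml)
ansible-vault create inventory/production/group_vars/all/vault.yml
# Edit encrypted file
ansible-vault edit inventory/production/group_vars/all/vault.yml
# inventory/production/group_vars/all/vault.yml (shown decrypted; stored encrypted)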
---
vault_mysql_root_password: "xK9#mP2$vL5nQ8wR"
vault_mysql_repl_password: "aB3cD4eF5gH6iJ7k"
vault_redis_password: "rEdIs_PaSsWoRd_2024!"
vault_app_secret_key: "7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c"
vault_ssl_key_passphrase: "ssl_key_passphrase_here"
# Reference vault variables in group_vars/all.yml
mysql_root_password: "{{ vault_mysql_root_password }}"
mysql_repl_password: "{{ vault_mysql_repl_password }}"
redis_password: "{{ vault_redis_password }}"
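Because the same variable name shows up in several of the files above, it helps to ask Ansible directly which value wins. The throwaway play below is a sketch for that purpose; the file name `playbooks/debug-vars.yml` is hypothetical.

```yaml
# playbooks/debug-vars.yml (hypothetical helper, not part of the repo layout above)
---
- name: Show the effective value of overlapping variables
  hosts: web01.prod.example.com
  gather_facts: false
  tasks:
    - name: Print values after precedence has been resolved
      ansible.builtin.debug:
        msg:
          # Host entry in hosts.yml (4) vs. group_vars/webservers.yml (auto)
          nginx_worker_processes: "{{ nginx_worker_processes | default('undefined') }}"
          # "all" vars in hosts.yml (10.0.0.x) vs. group_vars/all.yml (aliyun)
          ntp_servers: "{{ ntp_servers | default([]) }}"
          # group vars in hosts.yml vs. group_vars/webservers.yml (both 8080)
          app_port: "{{ app_port | default('undefined') }}"
```

With the files shown in this article, the host-level `nginx_worker_processes: 4` should win over the group-level `auto`, and the `ntp_servers` list from group_vars/all.yml should win over the one defined under `all.vars` in hosts.yml, because variables set in the inventory file rank below the group_vars directory next to it. Adding `-e nginx_worker_processes=2` on the command line overrides both, since extra vars always take precedence.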
2.2.2 Role Conventions
Roles are the core of code reuse in Ansible. A good Role is self-contained, configurable, and documented.
# roles/nginx/tasks/main.yml
---
- name: Include OS-specific variables
  ansible.builtin.include_vars: "{{ ansible_os_family | lower }}.yml"
  tags: [nginx, nginx:install]

- name: Install Nginx packages
  ansible.builtin.package:
    name: "{{ nginx_packages }}"
    state: present
  tags: [nginx, nginx:install]

- name: Create Nginx directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: root
    group: root
    mode: "0755"
  loop:
    - /etc/nginx/conf.d
    - /etc/nginx/ssl
    - /var/cache/nginx
    - /var/log/nginx
  tags: [nginx, nginx:config]

- name: Deploy Nginx main configuration
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    owner: root
    group: root
    mode: "0644"
    validate: nginx -t -c %s
  notify: Reload nginx
  tags: [nginx, nginx:config]

- name: Deploy SSL certificates
  ansible.builtin.copy:
    content: "{{ item.content }}"
    dest: "{{ item.dest }}"
    owner: root
    group: root
    mode: "{{ item.mode }}"
  loop:
    - content: "{{ nginx_ssl_certificate }}"
      dest: /etc/nginx/ssl/server.crt
      mode: "0644"
    - content: "{{ nginx_ssl_certificate_key }}"
      dest: /etc/nginx/ssl/server.key
      mode: "0600"
  when: nginx_ssl_enabled | default(false)
  notify: Reload nginx
  no_log: true
  tags: [nginx, nginx:ssl]

- name: Deploy virtual host configurations
  ansible.builtin.template:
    src: vhost.conf.j2
    dest: "/etc/nginx/conf.d/{{ item.server_name }}.conf"
    owner: root
    group: root
    mode: "0644"
    # No `validate` here: the command must contain %s, and a vhost fragment
    # cannot be checked on its own. The full configuration is tested below.
  loop: "{{ nginx_vhosts }}"
  notify: Reload nginx
  tags: [nginx, nginx:vhosts]

- name: Validate the complete Nginx configuration
  ansible.builtin.command: nginx -t
  changed_when: false
  tags: [nginx, nginx:vhosts]

- name: Remove default virtual host
  ansible.builtin.file:
    path: "{{ item }}"
    state: absent
  loop:
    - /etc/nginx/sites-enabled/default
    - /etc/nginx/conf.d/default.conf
  notify: Reload nginx
  tags: [nginx, nginx:config]

- name: Ensure Nginx is started and enabled
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true
  tags: [nginx, nginx:service]
# roles/nginx/handlers/main.yml
---
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded
  listen: Reload nginx

- name: Restart nginx
  ansible.builtin.service:
    name: nginx
    state: restarted
  listen: Restart nginx
# roles/nginx/defaults/main.yml
---
# Package settings
nginx_packages:
- nginx
# Worker settings
nginx_user: www-data
nginx_worker_processes: auto
nginx_worker_connections: 4096
nginx_multi_accept: true
# Performance settings
nginx_keepalive_timeout: 65
nginx_keepalive_requests: 1000
nginx_client_max_body_size: 64m
nginx_client_body_buffer_size: 128k
nginx_client_header_buffer_size: 1k
nginx_large_client_header_buffers: 4 16k
# Gzip settings
nginx_gzip_enabled: true
nginx_gzip_comp_level: 6
nginx_gzip_min_length: 1024
nginx_gzip_types:
- text/plain
- text/css
- text/javascript
- application/json
- application/javascript
- application/xml
- application/xml+rss
- image/svg+xml
# Logging settings
nginx_access_log: /var/log/nginx/access.log
nginx_error_log: /var/log/nginx/error.log
nginx_log_format: |
  '$remote_addr - $remote_user [$time_local] "$request" '
  '$status $body_bytes_sent "$http_referer" '
  '"$http_user_agent" "$http_x_forwarded_for" '
  'rt=$request_time uct="$upstream_connect_time" '
  'uht="$upstream_header_time" urt="$upstream_response_time"'
# SSL settings
nginx_ssl_enabled: false
nginx_ssl_protocols: TLSv1.2 TLSv1.3
nginx_ssl_ciphers: ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384
# Virtual hosts
nginx_vhosts: []
{# roles/nginx/templates/nginx.conf.j2 #}
# Managed by Ansible - DO NOT EDIT MANUALLY
{# A per-run timestamp here would make this template report "changed" on every run #}
# {{ ansible_managed }}
user {{ nginx_user }};
worker_processes {{ nginx_worker_processes }};
pid /run/nginx.pid;
error_log {{ nginx_error_log }} warn;
events {
worker_connections {{ nginx_worker_connections }};
multi_accept {{ 'on' if nginx_multi_accept else 'off' }};
use epoll;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
# Logging configuration
log_format main {{ nginx_log_format }};
access_log {{ nginx_access_log }} main;
# Basic settings
sendfile on;
tcp_nopush on;
tcp_nodelay on;
keepalive_timeout {{ nginx_keepalive_timeout }};
keepalive_requests {{ nginx_keepalive_requests }};
types_hash_max_size 2048;
server_tokens off;
# Client settings
client_max_body_size {{ nginx_client_max_body_size }};
client_body_buffer_size {{ nginx_client_body_buffer_size }};
client_header_buffer_size {{ nginx_client_header_buffer_size }};
large_client_header_buffers {{ nginx_large_client_header_buffers }};
{% if nginx_gzip_enabled %}
# Gzip settings
gzip on;
gzip_vary on;
gzip_proxied any;
gzip_comp_level {{ nginx_gzip_comp_level }};
gzip_min_length {{ nginx_gzip_min_length }};
gzip_types {{ nginx_gzip_types | join(' ') }};
{% endif %}
{% if nginx_ssl_enabled %}
# SSL settings
ssl_protocols {{ nginx_ssl_protocols }};
ssl_ciphers {{ nginx_ssl_ciphers }};
ssl_prefer_server_ciphers on;
ssl_session_cache shared:SSL:50m;
ssl_session_timeout 1d;
ssl_session_tickets off;
{% endif %}
# Include virtual hosts
include /etc/nginx/conf.d/*.conf;
}
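The "documented" part of a good Role can be made machine-checkable with an argument spec, which ansible-core validates before the role's tasks run. The file below is a sketch that covers only a few of the variables from defaults/main.yml; the descriptions are ours, and the `server_name` sub-option is inferred from how the vhost task builds its file name.

```yaml
# roles/nginx/meta/argument_specs.yml (sketch)
---
argument_specs:
  main:
    short_description: Install and configure Nginx
    options:
      nginx_worker_connections:
        type: int
        default: 4096
        description: Maximum simultaneous connections per worker process.
      nginx_ssl_enabled:
        type: bool
        default: false
        description: Whether to deploy certificates and SSL settings.
      nginx_vhosts:
        type: list
        elements: dict
        default: []
        description: Virtual hosts rendered from vhost.conf.j2.
        options:
          server_name:
            type: str
            required: true
            description: Used as the file name under /etc/nginx/conf.d.
```

With this in place, passing `nginx_vhosts` entries without a `server_name` should fail at the start of the role with a clear validation message instead of halfway through the template task.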
2.2.3 Playbook Conventions
A Playbook is where Roles are combined to accomplish a specific job.
# playbooks/site.yml
# Master playbook - configures entire infrastructure
---
- name: Apply common configuration to all hosts
  hosts: all
  become: true
  gather_facts: true
  any_errors_fatal: false
  pre_tasks:
    - name: Verify Ansible version
      ansible.builtin.assert:
        that:
          - ansible_version.full is version('2.14', '>=')
        fail_msg: "Ansible 2.14 or higher required"
        success_msg: "Ansible version check passed"
      run_once: true
      delegate_to: localhost
      tags: [always]

    - name: Display target information
      ansible.builtin.debug:
        msg: |
          Target: {{ inventory_hostname }}
          OS: {{ ansible_distribution }} {{ ansible_distribution_version }}
          IP: {{ ansible_default_ipv4.address | default('N/A') }}
      tags: [always]
  roles:
    - role: common
      tags: [common]

- name: Configure web servers
  hosts: webservers
  become: true
  serial: "30%"
  max_fail_percentage: 10
  roles:
    - role: nginx
      tags: [nginx]
    - role: app-deploy
      tags: [app]

- name: Configure database servers
  hosts: databases
  become: true
  serial: 1
  roles:
    - role: mysql
      tags: [mysql]
      when: "'mysql' in group_names or 'mysql_primary' in group_names or 'mysql_replica' in group_names"

- name: Configure Redis servers
  hosts: redis
  become: true
  roles:
    - role: redis
      tags: [redis]

- name: Configure monitoring
  hosts: monitoring
  become: true
  roles:
    - role: prometheus
      tags: [prometheus]
      when: "'prometheus' in inventory_hostname"
    - role: grafana
      tags: [grafana]
      when: "'grafana' in inventory_hostname"
# playbooks/deploy-app.yml
# Application deployment playbook with zero-downtime strategy
---
- name: Pre-deployment checks
  hosts: webservers
  become: true
  gather_facts: true
  any_errors_fatal: true
  vars:
    app_version: "{{ lookup('env', 'APP_VERSION') | default('latest', true) }}"
    deployment_id: "{{ lookup('pipe', 'date +%Y%m%d%H%M%S') }}"
  pre_tasks:
    - name: Validate deployment parameters
      ansible.builtin.assert:
        that:
          - app_version != ''
          - app_name is defined
          - app_deploy_path is defined
        fail_msg: "Missing required deployment parameters"
      tags: [always]

    - name: Check disk space
      ansible.builtin.assert:
        that:
          - item.size_available > 5368709120 # 5GB
        fail_msg: "Insufficient disk space on {{ item.mount }}"
      loop: "{{ ansible_mounts | selectattr('mount', 'in', ['/', '/opt']) | list }}"
      tags: [checks]

    - name: Verify application artifact exists
      ansible.builtin.uri:
        url: "{{ artifact_repo_url }}/{{ app_name }}/{{ app_version }}/{{ app_name }}-{{ app_version }}.tar.gz"
        method: HEAD
        status_code: 200
      delegate_to: localhost
      run_once: true
      tags: [checks]
  tasks:
    - name: Create deployment fact
      ansible.builtin.set_fact:
        current_deployment:
          id: "{{ deployment_id }}"
          version: "{{ app_version }}"
          timestamp: "{{ ansible_date_time.iso8601 }}"
          user: "{{ lookup('env', 'USER') }}"

- name: Deploy application with rolling update
  hosts: webservers
  become: true
  serial: "30%"
  max_fail_percentage: 10
  vars:
    app_version: "{{ lookup('env', 'APP_VERSION') | default('latest', true) }}"
  pre_tasks:
    - name: Remove server from load balancer
      ansible.builtin.uri:
        url: "http://{{ nginx_upstream_manager }}/api/upstream/{{ inventory_hostname }}/down"
        method: POST
        status_code: [200, 204]
      delegate_to: localhost
      when: nginx_upstream_manager is defined
      tags: [lb]

    - name: Wait for connections to drain
      ansible.builtin.wait_for:
        timeout: 30
      tags: [lb]
  roles:
    - role: app-deploy
      vars:
        app_artifact_url: "{{ artifact_repo_url }}/{{ app_name }}/{{ app_version }}/{{ app_name }}-{{ app_version }}.tar.gz"
      tags: [deploy]
  post_tasks:
    - name: Verify application health
      ansible.builtin.uri:
        url: "http://127.0.0.1:{{ app_port }}/health"
        method: GET
        status_code: 200
        return_content: true
      register: health_check
      retries: 10
      delay: 5
      until: health_check.status == 200
      tags: [verify]

    - name: Add server back to load balancer
      ansible.builtin.uri:
        url: "http://{{ nginx_upstream_manager }}/api/upstream/{{ inventory_hostname }}/up"
        method: POST
        status_code: [200, 204]
      delegate_to: localhost
      when: nginx_upstream_manager is defined
      tags: [lb]

- name: Post-deployment verification
  hosts: webservers
  become: true
  gather_facts: false
  any_errors_fatal: true
  tasks:
    - name: Run smoke tests
      ansible.builtin.uri:
        url: "http://127.0.0.1:{{ app_port }}{{ item.path }}"
        method: "{{ item.method | default('GET') }}"
        status_code: "{{ item.status | default(200) }}"
      loop:
        - path: /health
          status: 200
        - path: /api/v1/ping
          status: 200
        - path: /metrics
          status: 200
      tags: [smoke-test]

    - name: Notify deployment success
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Deployment successful: {{ app_name }} {{ app_version }} deployed to {{ ansible_play_hosts | length }} hosts"
          channel: "#deployments"
      delegate_to: localhost
      run_once: true
      when: slack_webhook_url is defined
      tags: [notify]
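One pattern that fits a rolling deployment like this is `block`/`rescue`, which lets a host report what went wrong before it is counted as failed. The tasks file below is a sketch rather than part of our app-deploy role: the file name is hypothetical, and it assumes the application runs as a service named after `app_name`.

```yaml
# roles/app-deploy/tasks/guarded-restart.yml (hypothetical file name)
---
- name: Restart the application with a safety net
  block:
    - name: Restart the application service
      ansible.builtin.service:
        name: "{{ app_name }}"   # assumption: service name equals app_name
        state: restarted

    - name: Wait for the health endpoint
      ansible.builtin.uri:
        url: "http://127.0.0.1:{{ app_port }}/health"
        status_code: 200
      register: health_check
      retries: 10
      delay: 5
      until: health_check.status == 200
  rescue:
    - name: Report which task failed
      ansible.builtin.debug:
        msg: "Deploy failed on {{ inventory_hostname }} at task '{{ ansible_failed_task.name }}'"

    - name: Keep the host marked as failed
      ansible.builtin.fail:
        msg: "Deployment failed; host stays out of the load balancer"
```

The final `fail` matters: a `rescue` section that completes successfully clears the failure, so without it the host would rejoin the load balancer and `max_fail_percentage` would never trigger.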
2.3 Running and Verifying
2.3.1 Pre-run Checks
Before running any Playbook, we go through a standard set of checks:
#!/bin/bash
# scripts/run-playbook.sh
# Standard playbook execution wrapper
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
# Color codes
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
# Default values
INVENTORY="production"
PLAYBOOK=""
TAGS=""
LIMIT=""
CHECK_MODE=""
DIFF_MODE=""
VERBOSE=""
usage() {
    echo "Usage: $0 -p PLAYBOOK [-i INVENTORY] [-t TAGS] [-l LIMIT] [-C] [-D] [-v]"
    echo ""
    echo "Options:"
    echo "  -p PLAYBOOK   Playbook to run (required)"
    echo "  -i INVENTORY  Inventory to use (default: production)"
    echo "  -t TAGS       Tags to run"
    echo "  -l LIMIT      Limit to specific hosts"
    echo "  -C            Check mode (dry run)"
    echo "  -D            Show diff"
    echo "  -v            Verbose output"
    exit 1
}

while getopts "p:i:t:l:CDvh" opt; do
    case $opt in
        p) PLAYBOOK="$OPTARG" ;;
        i) INVENTORY="$OPTARG" ;;
        t) TAGS="--tags $OPTARG" ;;
        l) LIMIT="--limit $OPTARG" ;;
        C) CHECK_MODE="--check" ;;
        D) DIFF_MODE="--diff" ;;
        v) VERBOSE="-vvv" ;;
        h) usage ;;
        *) usage ;;
    esac
done

if [[ -z "$PLAYBOOK" ]]; then
    echo -e "${RED}Error: Playbook is required${NC}"
    usage
fi
cd "$PROJECT_DIR"
echo -e "${GREEN}=== Pre-flight Checks ===${NC}"
# Check Ansible version
echo -n "Checking Ansible version... "
ANSIBLE_VERSION=$(ansible --version | head -1 | awk '{print $3}' | tr -d '[]')
if [[ "$(printf '%s\n' "2.14" "$ANSIBLE_VERSION" | sort -V | head -n1)" == "2.14" ]]; then
    echo -e "${GREEN}OK ($ANSIBLE_VERSION)${NC}"
else
    echo -e "${RED}FAILED (requires >= 2.14, found $ANSIBLE_VERSION)${NC}"
    exit 1
fi

# Check inventory exists
echo -n "Checking inventory... "
if [[ -d "inventory/$INVENTORY" ]]; then
    echo -e "${GREEN}OK${NC}"
else
    echo -e "${RED}FAILED (inventory/$INVENTORY not found)${NC}"
    exit 1
fi

# Check playbook exists
echo -n "Checking playbook... "
if [[ -f "playbooks/$PLAYBOOK" ]]; then
    echo -e "${GREEN}OK${NC}"
else
    echo -e "${RED}FAILED (playbooks/$PLAYBOOK not found)${NC}"
    exit 1
fi

# Syntax check
echo -n "Running syntax check... "
if ansible-playbook "playbooks/$PLAYBOOK" \
    -i "inventory/$INVENTORY/hosts.yml" \
    --syntax-check >/dev/null; then
    echo -e "${GREEN}OK${NC}"
else
    echo -e "${RED}FAILED${NC}"
    ansible-playbook "playbooks/$PLAYBOOK" \
        -i "inventory/$INVENTORY/hosts.yml" \
        --syntax-check
    exit 1
fi

# Lint check
echo -n "Running lint check... "
if command -v ansible-lint >/dev/null; then
    if ansible-lint "playbooks/$PLAYBOOK" >/dev/null; then
        echo -e "${GREEN}OK${NC}"
    else
        echo -e "${YELLOW}WARNINGS (check output below)${NC}"
        ansible-lint "playbooks/$PLAYBOOK"
    fi
else
    echo -e "${YELLOW}SKIPPED (ansible-lint not installed)${NC}"
fi
echo ""
echo -e "${GREEN}=== Execution ===${NC}"
echo "Inventory: $INVENTORY"
echo "Playbook: $PLAYBOOK"
[[ -n "$TAGS" ]] && echo "Tags: $TAGS"
[[ -n "$LIMIT" ]] && echo "Limit: $LIMIT"
[[ -n "$CHECK_MODE" ]] && echo "Mode: DRY RUN"
echo ""
read -p "Continue? [y/N] " -n 1 -r
echo ""
if [[ ! $REPLY =~ ^[Yy]$ ]]; then
    echo "Aborted."
    exit 0
fi
# Build the command as an array so arguments keep their quoting (no eval needed)
CMD=(ansible-playbook "playbooks/$PLAYBOOK" -i "inventory/$INVENTORY/hosts.yml")
# TAGS and LIMIT hold "--flag value" pairs, so they are split on whitespace on purpose
# shellcheck disable=SC2206
[[ -n "$TAGS" ]] && CMD+=($TAGS)
# shellcheck disable=SC2206
[[ -n "$LIMIT" ]] && CMD+=($LIMIT)
[[ -n "$CHECK_MODE" ]] && CMD+=("$CHECK_MODE")
[[ -n "$DIFF_MODE" ]] && CMD+=("$DIFF_MODE")
[[ -n "$VERBOSE" ]] && CMD+=("$VERBOSE")
echo "Executing: ${CMD[*]}"
echo ""
# Execute
"${CMD[@]}"
echo ""
echo -e "${GREEN}=== Execution Complete ===${NC}"
2.3.2 Execution Examples
# Dry run with diff - see what would change without making changes
./scripts/run-playbook.sh -p site.yml -C -D
# Run specific tags on specific hosts
./scripts/run-playbook.sh -p site.yml -t nginx -l "web01.prod.example.com"
# Full deployment
./scripts/run-playbook.sh -p deploy-app.yml -i production
# Verbose output for debugging
./scripts/run-playbook.sh -p site.yml -l webservers -v
2.3.3 Verifying the Results
# playbooks/verify-deployment.yml
# Post-deployment verification playbook
---
- name: Verify deployment across all hosts
  hosts: all
  become: true
  gather_facts: true
  tasks:
    - name: Check system uptime
      ansible.builtin.command: uptime
      register: uptime_result
      changed_when: false

    - name: Display uptime
      ansible.builtin.debug:
        var: uptime_result.stdout

    - name: Check listening ports
      ansible.builtin.shell: ss -tlnp | grep -E ':(80|443|8080|3306|6379)\s'
      register: ports_result
      changed_when: false
      failed_when: false

    - name: Display listening ports
      ansible.builtin.debug:
        var: ports_result.stdout_lines

    - name: Check service status
      ansible.builtin.systemd:
        name: "{{ item }}"
      register: service_status
      loop: "{{ services_to_check | default([]) }}"
      when: services_to_check is defined

    - name: Generate verification report
      ansible.builtin.template:
        src: verification-report.j2
        dest: "/tmp/verification-{{ inventory_hostname }}-{{ ansible_date_time.epoch }}.txt"
      delegate_to: localhost
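The playbook above collects information but never fails on it. Where a hard pass/fail is wanted, the service check could be tightened roughly as follows. This is a sketch: `expected_units` is a hypothetical variable (separate from `services_to_check`), and it assumes systemd hosts, where `service_facts` reports units under names like `nginx.service`.

```yaml
---
- name: Fail fast when an expected service is not running (sketch)
  hosts: webservers
  become: true
  gather_facts: false
  vars:
    # Hypothetical list; unit names as reported by systemd.
    expected_units:
      - nginx.service
  tasks:
    - name: Collect service state
      ansible.builtin.service_facts:

    - name: Assert that every expected unit is running
      ansible.builtin.assert:
        that:
          - item in ansible_facts.services
          - ansible_facts.services[item].state == 'running'
        fail_msg: "{{ item }} is not running on {{ inventory_hostname }}"
      loop: "{{ expected_units }}"
```

Unlike the informational `systemd` task, this turns a stopped service into a failed host, which makes the verification usable as a gate in a pipeline.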
3. Example Code and Configuration
3.1 Complete Configuration Example
This is a complete Role we actually use: general-purpose server initialization.
# roles/common/tasks/main.yml
# Static files use import_tasks so that --tags reaches the tasks inside them;
# tags on a plain include_tasks only apply to the include statement itself.
---
- name: Import OS-specific variables
  ansible.builtin.include_vars: "{{ item }}"
  with_first_found:
    - "{{ ansible_distribution | lower }}-{{ ansible_distribution_major_version }}.yml"
    - "{{ ansible_distribution | lower }}.yml"
    - "{{ ansible_os_family | lower }}.yml"
    - "default.yml"
  tags: [common]

- name: Set hostname
  ansible.builtin.hostname:
    name: "{{ inventory_hostname_short }}"
  when: common_set_hostname | default(true)
  tags: [common, hostname]

- name: Configure /etc/hosts
  ansible.builtin.template:
    src: hosts.j2
    dest: /etc/hosts
    owner: root
    group: root
    mode: "0644"
    backup: true
  tags: [common, hosts]

- name: Set timezone
  community.general.timezone:
    name: "{{ timezone }}"
  tags: [common, timezone]

- name: Configure NTP
  ansible.builtin.import_tasks: ntp.yml
  when: ntp_enabled | default(true)
  tags: [common, ntp]

- name: Configure package repositories
  # The file name depends on a fact, so this include stays dynamic;
  # `apply` passes the tags on to the included tasks.
  ansible.builtin.include_tasks:
    file: "repos-{{ ansible_os_family | lower }}.yml"
    apply:
      tags: [common, repos]
  tags: [common, repos]

- name: Update package cache
  # update_cache and cache_valid_time are apt options, so use the apt module directly
  ansible.builtin.apt:
    update_cache: true
    cache_valid_time: 3600
  when: ansible_os_family == 'Debian'
  tags: [common, packages]

- name: Install essential packages
  ansible.builtin.package:
    name: "{{ common_packages }}"
    state: present
  tags: [common, packages]

- name: Configure sysctl parameters
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    sysctl_file: /etc/sysctl.d/99-ansible.conf
    reload: true
  loop: "{{ common_sysctl_params | dict2items }}"
  tags: [common, sysctl]

- name: Configure system limits
  community.general.pam_limits:
    domain: "{{ item.domain }}"
    limit_type: "{{ item.type }}"
    limit_item: "{{ item.item }}"
    value: "{{ item.value }}"
  loop: "{{ common_limits }}"
  tags: [common, limits]

- name: Create standard directories
  ansible.builtin.file:
    path: "{{ item.path }}"
    state: directory
    owner: "{{ item.owner | default('root') }}"
    group: "{{ item.group | default('root') }}"
    mode: "{{ item.mode | default('0755') }}"
  loop: "{{ common_directories }}"
  tags: [common, directories]

- name: Configure SSH server
  ansible.builtin.import_tasks: sshd.yml
  tags: [common, sshd]

- name: Configure firewall
  ansible.builtin.import_tasks: firewall.yml
  when: common_firewall_enabled | default(true)
  tags: [common, firewall]

- name: Configure fail2ban
  ansible.builtin.import_tasks: fail2ban.yml
  when: security_fail2ban_enabled | default(true)
  tags: [common, security, fail2ban]

- name: Setup monitoring agent
  ansible.builtin.import_tasks: monitoring.yml
  when: common_monitoring_enabled | default(true)
  tags: [common, monitoring]

- name: Configure log rotation
  ansible.builtin.template:
    src: logrotate-ansible.j2
    dest: /etc/logrotate.d/ansible-managed
    owner: root
    group: root
    mode: "0644"
  tags: [common, logrotate]
# roles/common/defaults/main.yml
---
# Hostname settings
common_set_hostname: true
# Package settings
common_packages:
- vim
- curl
- wget
- git
- htop
- iotop
- sysstat
- net-tools
- bind-utils
- lsof
- tcpdump
- strace
- tree
- jq
- unzip
- tar
- rsync
- tmux
# Sysctl parameters for production servers
common_sysctl_params:
  # Network performance
  net.core.somaxconn: 65535
  net.core.netdev_max_backlog: 65535
  net.ipv4.tcp_max_syn_backlog: 65535
  net.ipv4.tcp_fin_timeout: 15
  net.ipv4.tcp_keepalive_time: 300
  net.ipv4.tcp_keepalive_probes: 5
  net.ipv4.tcp_keepalive_intvl: 15
  net.ipv4.tcp_tw_reuse: 1
  net.ipv4.ip_local_port_range: 1024 65535
  # Memory management
  vm.swappiness: 10
  vm.dirty_ratio: 60
  vm.dirty_background_ratio: 5
  vm.overcommit_memory: 1
  # File system
  fs.file-max: 2097152
  fs.inotify.max_user_watches: 524288
  # Security
  net.ipv4.conf.all.rp_filter: 1
  net.ipv4.conf.default.rp_filter: 1
  net.ipv4.icmp_echo_ignore_broadcasts: 1
  net.ipv4.conf.all.accept_redirects: 0
  net.ipv4.conf.default.accept_redirects: 0
# System limits
common_limits:
  - domain: "*"
    type: soft
    item: nofile
    value: 1048576
  - domain: "*"
    type: hard
    item: nofile
    value: 1048576
  - domain: "*"
    type: soft
    item: nproc
    value: 65535
  - domain: "*"
    type: hard
    item: nproc
    value: 65535
  - domain: root
    type: soft
    item: nofile
    value: 1048576
  - domain: root
    type: hard
    item: nofile
    value: 1048576
# Standard directories
common_directories:
  - path: /opt/apps
    owner: root
    group: root
    mode: "0755"
  - path: /opt/scripts
    owner: root
    group: root
    mode: "0755"
  - path: /var/log/apps
    owner: root
    group: root
    mode: "0755"
  - path: /data
    owner: root
    group: root
    mode: "0755"
# Firewall settings
common_firewall_enabled: true
common_firewall_allowed_ports:
  - port: 22
    proto: tcp
  - port: 80
    proto: tcp
  - port: 443
    proto: tcp
# Monitoring settings
common_monitoring_enabled: true
node_exporter_version: "1.7.0"
node_exporter_port: 9100
# roles/common/tasks/sshd.yml
---
- name: Backup original sshd_config
  ansible.builtin.copy:
    src: /etc/ssh/sshd_config
    dest: /etc/ssh/sshd_config.orig
    remote_src: true
    force: false
    owner: root
    group: root
    mode: "0600"

- name: Configure SSH server
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
    validate: /usr/sbin/sshd -t -f %s
    backup: true
  notify: Restart sshd

- name: Ensure SSH service is running
  ansible.builtin.service:
    name: sshd
    state: started
    enabled: true
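The template task notifies a handler called `Restart sshd`, which has to exist in the role or the run fails with an unknown-handler error. A minimal sketch of the matching handlers file, assuming the service is reachable under the same `sshd` name used in the task above:

```yaml
# roles/common/handlers/main.yml (sketch)
---
- name: Restart sshd
  ansible.builtin.service:
    name: sshd
    state: restarted
```

Because the template is validated with `sshd -t` before it replaces the live file, the restart only ever runs against a configuration that has already passed a syntax check.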
{# roles/common/templates/sshd_config.j2 #}
# Managed by Ansible - DO NOT EDIT MANUALLY
{# A per-run timestamp here would rewrite the file and restart sshd on every run #}
# {{ ansible_managed }}
# Basic settings
Port {{ ssh_port | default(22) }}
AddressFamily any
ListenAddress 0.0.0.0
Protocol 2
# Host keys
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
# Security settings
PermitRootLogin {{ 'yes' if security_ssh_permit_root_login | default(false) else 'no' }}
PasswordAuthentication {{ 'yes' if security_ssh_password_authentication | default(false) else 'no' }}
PubkeyAuthentication yes
PermitEmptyPasswords no
ChallengeResponseAuthentication no
# Authentication
AuthorizedKeysFile .ssh/authorized_keys
MaxAuthTries 3
MaxSessions 10
LoginGraceTime 60
# Session settings
ClientAliveInterval 300
ClientAliveCountMax 3
TCPKeepAlive yes
# Logging
SyslogFacility AUTH
LogLevel INFO
# Environment
AcceptEnv LANG LC_*
X11Forwarding no
PrintMotd no
PrintLastLog yes
# Subsystems
Subsystem sftp /usr/lib/openssh/sftp-server
# Allow specific users/groups
{% if ssh_allow_users is defined and ssh_allow_users | length > 0 %}
AllowUsers {{ ssh_allow_users | join(' ') }}
{% endif %}
{% if ssh_allow_groups is defined and ssh_allow_groups | length > 0 %}
AllowGroups {{ ssh_allow_groups | join(' ') }}
{% endif %}
# Deny specific users/groups
{% if ssh_deny_users is defined and ssh_deny_users | length > 0 %}
DenyUsers {{ ssh_deny_users | join(' ') }}
{% endif %}
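3.2 Real-World Cases
Case 1: Fixing the Log4j vulnerability in bulk
In December 2021, when the Log4j vulnerability (CVE-2021-44228) broke out, every server had to be scanned and fixed within 2 hours. This is the emergency Playbook we used at the time: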
# playbooks/emergency-log4j-fix.yml
# Emergency playbook for CVE-2021-44228 (Log4Shell)
---
- name: Log4j vulnerability detection and remediation
  hosts: all
  become: true
  gather_facts: true
  serial: 50
  vars:
    log4j_scan_paths:
      - /opt/apps
      - /usr/local
      - /var/lib
    log4j_vulnerable_versions:
      - "2.0"
      - "2.1"
      - "2.2"
      - "2.3"
      - "2.4"
      - "2.5"
      - "2.6"
      - "2.7"
      - "2.8"
      - "2.9"
      - "2.10"
      - "2.11"
      - "2.12.0"
      - "2.12.1"
      - "2.13"
      - "2.14"
      - "2.14.0"
      - "2.14.1"
  tasks:
    - name: Search for Log4j JAR files
      ansible.builtin.find:
        paths: "{{ log4j_scan_paths }}"
        patterns:
          - "log4j-core-*.jar"
          - "log4j-api-*.jar"
        recurse: true
        file_type: file
      register: log4j_files
    - name: Analyze found Log4j files
      ansible.builtin.set_fact:
        vulnerable_jars: "{{ log4j_files.files | selectattr('path', 'search', 'log4j-core-2\\.(0|1[0-4]|[0-9])[\\.\\-]') | list }}"
    - name: Report vulnerable files
      ansible.builtin.debug:
        msg: |
          Host: {{ inventory_hostname }}
          Vulnerable JARs found: {{ vulnerable_jars | length }}
          Files: {{ vulnerable_jars | map(attribute='path') | list }}
      when: vulnerable_jars | length > 0
    - name: Create backup directory
      ansible.builtin.file:
        path: /opt/backup/log4j-{{ ansible_date_time.date }}
        state: directory
        mode: "0755"
      when: vulnerable_jars | length > 0
    - name: Backup vulnerable JARs
      ansible.builtin.copy:
        src: "{{ item.path }}"
        dest: "/opt/backup/log4j-{{ ansible_date_time.date }}/{{ item.path | basename }}.{{ ansible_date_time.epoch }}"
        remote_src: true
      loop: "{{ vulnerable_jars }}"
      when: vulnerable_jars | length > 0
    - name: Apply JndiLookup class removal mitigation
      ansible.builtin.shell: |
        zip -q -d "{{ item.path }}" org/apache/logging/log4j/core/lookup/JndiLookup.class 2>/dev/null || true
      loop: "{{ vulnerable_jars }}"
      when: vulnerable_jars | length > 0
      register: mitigation_result
    - name: Generate vulnerability report
      ansible.builtin.template:
        src: log4j-report.j2
        dest: "/tmp/log4j-report-{{ inventory_hostname }}.txt"
      delegate_to: localhost
    - name: Aggregate reports
      ansible.builtin.fetch:
        src: "/tmp/log4j-report-{{ inventory_hostname }}.txt"
        dest: "reports/log4j/{{ inventory_hostname }}.txt"
        flat: true
      delegate_to: localhost
  post_tasks:
    - name: Send alert if vulnerabilities found
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Log4j scan complete on {{ inventory_hostname }}: {{ vulnerable_jars | length }} vulnerable JARs found and mitigated"
      delegate_to: localhost
      when:
        - vulnerable_jars | length > 0
        - slack_webhook_url is defined
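Before running a scan like this against a whole fleet, it can help to check offline which file names the `search` pattern in the set_fact task actually selects. The sketch below reuses that pattern in Python; the sample paths are made up for illustration. Note that the pattern flags every log4j-core JAR from 2.0 through 2.14, including 2.12.2, which the `log4j_vulnerable_versions` list does not contain.

```python
import re

# Same pattern as the set_fact task (YAML's "\\." is "\." in a raw string).
VULNERABLE_CORE = re.compile(r"log4j-core-2\.(0|1[0-4]|[0-9])[\.\-]")


def is_flagged(path: str) -> bool:
    """Return True when the path would pass the playbook's 'search' test."""
    return VULNERABLE_CORE.search(path) is not None


# Hypothetical paths, for illustration only.
samples = [
    "/opt/apps/a/lib/log4j-core-2.14.1.jar",
    "/opt/apps/b/lib/log4j-core-2.9.1.jar",
    "/opt/apps/c/lib/log4j-core-2.12.2.jar",
    "/opt/apps/d/lib/log4j-core-2.17.1.jar",
    "/opt/apps/e/lib/log4j-api-2.14.1.jar",
]
for path in samples:
    print(f"{path}: {'flagged' if is_flagged(path) else 'skipped'}")
```

Only log4j-core files are selected; log4j-api JARs are found by the search task but never reach `vulnerable_jars`.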
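Case 2: Database Master-Slave Failover
This Playbook handles MySQL master-slave failover. When the master's disk failed at 3 a.m. one night, this Playbook completed the switchover within 5 minutes: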
# playbooks/mysql-failover.yml
# MySQL master-slave failover playbook
---
- name: Pre-failover checks
  hosts: mysql_replica
  become: true
  gather_facts: true
  any_errors_fatal: true
  vars_prompt:
    - name: confirm_failover
      prompt: "Are you sure you want to perform MySQL failover? (type 'yes' to confirm)"
      private: false
    - name: new_master
      prompt: "Enter the hostname of the new master"
      private: false
  tasks:
    - name: Validate confirmation
      ansible.builtin.assert:
        that:
          - confirm_failover == 'yes'
        fail_msg: "Failover not confirmed. Aborting."
    - name: Validate new master is in replica group
      ansible.builtin.assert:
        that:
          - new_master in groups['mysql_replica']
        fail_msg: "{{ new_master }} is not a valid replica host"
    - name: Check replication status on all replicas
      community.mysql.mysql_replication:
        mode: getreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"
      register: repl_status
    - name: Display replication lag
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }}: Seconds_Behind_Master={{ repl_status.Seconds_Behind_Master | default('N/A') }}"
    - name: Ensure replication is caught up
      ansible.builtin.assert:
        that:
          - repl_status.Seconds_Behind_Master is defined
          - repl_status.Seconds_Behind_Master | int <= 10
        fail_msg: "Replication lag too high on {{ inventory_hostname }}"
      when: inventory_hostname == new_master
    # vars_prompt values are scoped to this play, so record the choice in an
    # in-memory group that the later plays can target.
    - name: Register new master for subsequent plays
      ansible.builtin.add_host:
        name: "{{ new_master }}"
        groups: mysql_new_primary
- name: Stop writes on old master
  hosts: mysql_primary
  become: true
  tasks:
    - name: Set read_only on old master
      community.mysql.mysql_variables:
        variable: read_only
        value: "ON"
        login_user: root
        login_password: "{{ mysql_root_password }}"
    - name: Find long running queries
      community.mysql.mysql_query:
        login_user: root
        login_password: "{{ mysql_root_password }}"
        query: |
          SELECT id AS thread_id
          FROM information_schema.processlist
          WHERE command != 'Sleep'
          AND time > 5
          AND user != 'system user'
      register: long_queries
    - name: Kill long running queries
      community.mysql.mysql_query:
        login_user: root
        login_password: "{{ mysql_root_password }}"
        query: "KILL {{ item.thread_id }}"
      loop: "{{ long_queries.query_result[0] }}"
      failed_when: false # A query may finish before it is killed
    - name: Wait for all transactions to complete
      ansible.builtin.wait_for:
        timeout: 30
- name: Promote new master
  hosts: mysql_new_primary
  become: true
  tasks:
    - name: Stop replication on new master
      community.mysql.mysql_replication:
        mode: stopreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"
    - name: Reset replica configuration
      community.mysql.mysql_replication:
        mode: resetreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"
    - name: Disable read_only on new master
      community.mysql.mysql_variables:
        variable: read_only
        value: "OFF"
        login_user: root
        login_password: "{{ mysql_root_password }}"
    - name: Get master status
      community.mysql.mysql_replication:
        mode: getprimary
        login_user: root
        login_password: "{{ mysql_root_password }}"
      register: new_master_status
    - name: Display new master status
      ansible.builtin.debug:
        msg: |
          New Master: {{ inventory_hostname }}
          Binlog File: {{ new_master_status.File }}
          Binlog Position: {{ new_master_status.Position }}
- name: Reconfigure other replicas
  hosts: mysql_replica:!mysql_new_primary
  become: true
  serial: 1
  vars:
    new_master: "{{ groups['mysql_new_primary'] | first }}"
  tasks:
    - name: Stop replication
      community.mysql.mysql_replication:
        mode: stopreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"
    - name: Point to new master
      community.mysql.mysql_replication:
        mode: changeprimary
        primary_host: "{{ new_master }}"
        primary_user: repl_user
        primary_password: "{{ mysql_repl_password }}"
        primary_log_file: "{{ hostvars[new_master]['new_master_status']['File'] }}"
        primary_log_pos: "{{ hostvars[new_master]['new_master_status']['Position'] }}"
        login_user: root
        login_password: "{{ mysql_root_password }}"
    - name: Start replication
      community.mysql.mysql_replication:
        mode: startreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"
    - name: Verify replication is running
      community.mysql.mysql_replication:
        mode: getreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"
      register: new_repl_status
      retries: 5
      delay: 2
      until:
        - new_repl_status.Slave_IO_Running == 'Yes'
        - new_repl_status.Slave_SQL_Running == 'Yes'
- name: Update DNS and notify
  hosts: localhost
  gather_facts: true # ansible_date_time is used in the notification below
  vars:
    new_master: "{{ groups['mysql_new_primary'] | first }}"
  tasks:
    - name: Update DNS record for mysql-master
      community.general.nsupdate:
        key_name: "ansible-key"
        key_secret: "{{ dns_update_key }}"
        server: "{{ dns_server }}"
        zone: "prod.example.com"
        record: "mysql-master"
        type: "A"
        value: "{{ hostvars[new_master]['ansible_default_ipv4']['address'] }}"
      when: dns_update_enabled | default(false)
    - name: Send notification
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: |
            :rotating_light: MySQL Failover Complete :rotating_light:
            New Master: {{ new_master }}
            Time: {{ ansible_date_time.iso8601 }}
            Performed by: {{ lookup('env', 'USER') }}
      when: slack_webhook_url is defined
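4. Best Practices and Caveats
4.1 Best Practices
4.1.1 Code Organization
Use Fully Qualified Collection Names (FQCN)
Since Ansible 2.10, the official recommendation has been to use FQCNs. This avoids module name collisions and makes the code clearer: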
# Good: FQCN makes it clear which module is being used
- name: Install packages
  ansible.builtin.package:
    name: nginx
    state: present
# Bad: Short name could be ambiguous
- name: Install packages
  package:
    name: nginx
    state: present
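Give every task a name
A name is more than a comment: it also appears in the execution log. A good name lets you pinpoint a problem quickly when you are reading logs in the middle of the night: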
# Good: Clear, descriptive names
- name: Install Nginx web server
  ansible.builtin.package:
    name: nginx
    state: present
- name: Deploy Nginx configuration for api.example.com
  ansible.builtin.template:
    src: nginx-api.conf.j2
    dest: /etc/nginx/conf.d/api.conf
# Bad: Vague or missing names
- package:
    name: nginx
    state: present
- name: Deploy config # Too vague
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
Use Tags sensibly
Tags let you run only part of a Playbook. Our convention: every Role gets one top-level Tag, and every task file gets a second-level Tag:
# Role-level tag in playbook
- role: nginx
  tags: [nginx]
# Task-level tags in role
- name: Install Nginx
  ansible.builtin.package:
    name: nginx
  tags: [nginx, nginx:install]
- name: Configure Nginx
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  tags: [nginx, nginx:config]
# Usage examples
# ansible-playbook site.yml --tags nginx # All nginx tasks
# ansible-playbook site.yml --tags nginx:config # Only config tasks
# ansible-playbook site.yml --skip-tags nginx:install # Skip installation
4.1.2 Security Practices
Encrypt sensitive data
All passwords, keys, and certificates must be encrypted with Ansible Vault:
# Create vault password script (more secure than plain file)
cat > scripts/vault-password.sh << 'EOF'
#!/bin/bash
# Fetch vault password from HashiCorp Vault or AWS Secrets Manager
# For demo, using environment variable
echo "${ANSIBLE_VAULT_PASSWORD}"
EOF
chmod 700 scripts/vault-password.sh
# Encrypt sensitive files
ansible-vault encrypt inventory/production/group_vars/vault.yml
# Encrypt specific string
ansible-vault encrypt_string 'super_secret_password' --name 'db_password'
Use no_log to protect sensitive output
- name: Configure database password
  community.mysql.mysql_user: # mysql_user lives in the community.mysql collection, not ansible.builtin
    name: app_user
    password: "{{ db_password }}"
    priv: "app_db.*:ALL"
    login_user: root
    login_password: "{{ mysql_root_password }}"
  no_log: true # Prevents password from appearing in logs
- name: Deploy application config with secrets
  ansible.builtin.template:
    src: app-config.yml.j2
    dest: /opt/app/config.yml
    mode: "0600"
  no_log: true
Principle of least privilege
# Create dedicated ansible user with limited sudo
- name: Create ansible automation user
  ansible.builtin.user:
    name: ansible
    shell: /bin/bash
    create_home: true
    groups: [] # No extra groups by default
- name: Configure sudo for specific commands only
  ansible.builtin.copy:
    content: |
      # Ansible automation user sudo rules
      ansible ALL=(root) NOPASSWD: /usr/bin/systemctl restart nginx
      ansible ALL=(root) NOPASSWD: /usr/bin/systemctl reload nginx
      ansible ALL=(root) NOPASSWD: /usr/bin/apt-get update
      ansible ALL=(root) NOPASSWD: /usr/bin/apt-get install *
    dest: /etc/sudoers.d/ansible
    mode: "0440"
    validate: visudo -cf %s
4.1.3 Performance Optimization
Enable Pipelining
Pipelining reduces the number of SSH connections and can improve performance significantly:
# ansible.cfg
[ssh_connection]
pipelining = True
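Note, however, that requiretty must not be set in the target servers' sudoers.
Use Fact Caching
For server information that rarely changes, caching facts can greatly reduce execution time: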
# ansible.cfg
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
# 24 hours (ansible.cfg accepts only ";" for inline comments, so "#" comments go on their own line)
fact_caching_timeout = 86400
Set concurrency sensibly
# ansible.cfg
[defaults]
# Adjust based on control node capacity
forks = 50
# For sensitive operations, use serial to limit parallelism
- hosts: databases
  serial: 1 # One at a time for database operations
- hosts: webservers
  serial: "30%" # 30% of hosts at a time for rolling updates
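To get a feel for what a `serial` value means for a given group, the batching can be sketched in a few lines of Python. This is a simplified model, not Ansible's own code: it assumes percentages are rounded down with a minimum of one host per batch, and the host names are invented.

```python
def serial_batches(hosts, serial):
    """Split hosts into rolling-update batches for a fixed count or an "N%" string."""
    if isinstance(serial, str) and serial.endswith("%"):
        # Assumption: round down, but never below one host per batch.
        size = int(len(hosts) * float(serial[:-1]) / 100) or 1
    else:
        size = int(serial)
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


# Hypothetical inventory group of ten web servers.
web = [f"web{n:02d}" for n in range(1, 11)]
for batch in serial_batches(web, "30%"):
    print(batch)
```

With ten hosts and `"30%"` this yields three batches of three and a final batch of one, so a bad change stops after at most three servers are affected.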
Use asynchronous tasks
Long-running operations can be executed asynchronously:
- name: Run long-running backup task
  ansible.builtin.shell: /opt/scripts/full-backup.sh
  async: 3600 # Allow up to 1 hour
  poll: 0 # Don't wait, fire and forget
  register: backup_job
- name: Wait for backup to complete
  ansible.builtin.async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: backup_result
  until: backup_result.finished
  retries: 60
  delay: 60
4.1.4 Error Handling
Use block/rescue/always
- name: Deploy application with rollback capability
  block:
    - name: Backup current version
      ansible.builtin.copy:
        src: /opt/app/current/
        dest: /opt/app/backup-{{ ansible_date_time.epoch }}/
        remote_src: true
    - name: Deploy new version
      ansible.builtin.unarchive:
        src: "{{ app_artifact_url }}"
        dest: /opt/app/current/
        remote_src: true
    - name: Restart application
      ansible.builtin.systemd:
        name: myapp
        state: restarted
    - name: Verify application health
      ansible.builtin.uri:
        url: http://127.0.0.1:8080/health
        status_code: 200
      register: health_check
      until: health_check.status == 200 # Without until, retries has no effect on ansible-core 2.15
      retries: 5
      delay: 10
  rescue:
    - name: Restore from backup
      ansible.builtin.copy:
        src: /opt/app/backup-{{ ansible_date_time.epoch }}/
        dest: /opt/app/current/
        remote_src: true
    - name: Restart application with old version
      ansible.builtin.systemd:
        name: myapp
        state: restarted
    - name: Send failure notification
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Deployment failed on {{ inventory_hostname }}, rolled back to previous version"
  always:
    - name: Clean up old backups
      ansible.builtin.find:
        paths: /opt/app/
        patterns: "backup-*"
        file_type: directory # Backups are directories; find matches only regular files by default
        age: 7d
      register: old_backups
    - name: Remove old backups
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_backups.files }}"
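For readers more at home in a programming language, block/rescue/always maps closely onto try/except/finally. The Python sketch below is only an analogy under that assumption; the step names and the simulated failure are invented for illustration.

```python
def deploy(steps, rollback, cleanup):
    """Run steps in order; on the first failure run rollback; always run cleanup."""
    log = []
    try:                              # block:
        for name, action in steps:
            action()
            log.append(f"ok: {name}")
    except Exception as exc:          # rescue:
        log.append(f"failed: {exc}")
        rollback()
        log.append("rolled back")
    finally:                          # always:
        cleanup()
        log.append("cleaned up")
    return log


def failing_health_check():
    # Simulated failure, standing in for the uri health check.
    raise RuntimeError("health check returned 503")


steps = [
    ("backup", lambda: None),
    ("deploy", lambda: None),
    ("health check", failing_health_check),
]
for line in deploy(steps, rollback=lambda: None, cleanup=lambda: None):
    print(line)
```

As with a rescued block in Ansible, the failure is handled inside the structure: the remaining steps are skipped, the rollback runs, and the cleanup runs whether or not anything failed.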
4.2 Caveats
4.2.1 Idempotency Pitfalls
The most common problem is that the shell and command modules are not idempotent:
# Bad: Not idempotent, runs every time
- name: Add line to file
  ansible.builtin.shell: echo "export PATH=/opt/bin:$PATH" >> /etc/profile
# Good: Idempotent, only adds if not present
- name: Add PATH to profile
  ansible.builtin.lineinfile:
    path: /etc/profile
    line: 'export PATH=/opt/bin:$PATH'
    state: present
# If you must use shell, add creates/removes conditions
- name: Initialize database
  ansible.builtin.shell: /opt/db/init.sh
  args:
    creates: /opt/db/.initialized # Only runs if this file doesn't exist
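The difference between the two approaches is easy to see in a small Python sketch. The `ensure_line` helper below is hypothetical and much simpler than lineinfile (no regexp matching, no insert position); it only demonstrates the check-then-change pattern that makes a second run a no-op.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Append line only if it is not already present; return True when the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False
    with path.open("a") as fh:
        fh.write(line + "\n")
    return True


# A throwaway file stands in for /etc/profile.
profile = Path(tempfile.mkdtemp()) / "profile"
print(ensure_line(profile, "export PATH=/opt/bin:$PATH"))  # first run changes the file
print(ensure_line(profile, "export PATH=/opt/bin:$PATH"))  # second run is a no-op
```

The return value plays the role of Ansible's `changed` status: the blind `echo >>` version would report a change and grow the file on every run.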
4.2.2 Variable Precedence Confusion
Ansible has 22 levels of variable precedence, which are easy to mix up. Our rules of thumb:
# Priority (simplified, high to low):
# 1. Extra vars (-e) - Use for one-time overrides
# 2. Task vars - Avoid, hard to track
# 3. Block vars - Avoid, hard to track
# 4. Role vars (vars/main.yml) - Use for internal role variables
# 5. Host vars - Use for host-specific settings
# 6. Group vars - Use for group-specific settings
# 7. Role defaults - Use for user-overridable defaults
# Our convention:
# - defaults/main.yml: All variables that users might want to override
# - vars/main.yml: Internal variables that shouldn't be changed
# - group_vars/: Environment and group-specific values
# - host_vars/: Host-specific values only
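When a variable holds an unexpected value, walking the layers from highest to lowest priority usually finds the culprit. The sketch below models only the seven simplified layers listed above, not Ansible's full 22-level order; the layer keys and the example variable are made up.

```python
# Simplified layers from the list above, highest priority first.
PRECEDENCE = [
    "extra_vars", "task_vars", "block_vars", "role_vars",
    "host_vars", "group_vars", "role_defaults",
]


def resolve(name, layers):
    """Return (value, source) from the highest-priority layer that defines name."""
    for source in PRECEDENCE:
        if name in layers.get(source, {}):
            return layers[source][name], source
    raise KeyError(name)


# Hypothetical values for one host.
layers = {
    "role_defaults": {"nginx_port": 80},
    "group_vars": {"nginx_port": 8080},
    "host_vars": {},
}
print(resolve("nginx_port", layers))
```

Here the group_vars value wins over the role default, and passing the same variable with `-e` would override both.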
4.2.3 Handler Execution Order
Handlers run only after all tasks have finished, and each notified handler runs only once:
# Problem: Handler runs at the end, not immediately
- name: Deploy config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Reload nginx
- name: Deploy SSL cert # This runs before nginx reload!
  ansible.builtin.copy:
    src: ssl.crt
    dest: /etc/nginx/ssl/
# Solution: Use meta flush_handlers
- name: Deploy config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Reload nginx
- name: Flush handlers
  ansible.builtin.meta: flush_handlers # Force handler to run now
- name: Deploy SSL cert
  ansible.builtin.copy:
    src: ssl.crt
    dest: /etc/nginx/ssl/
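The queue-then-flush behaviour can be modelled with a toy class. This is a deliberately simplified sketch: the `Play` class is hypothetical, and it runs handlers in notification order, whereas real Ansible runs them in the order they are defined.

```python
class Play:
    """Toy model: a task may notify a handler; flush() runs each queued handler once."""

    def __init__(self):
        self.pending = []
        self.log = []

    def task(self, name, notify=None):
        self.log.append(f"task: {name}")
        # Notifying the same handler twice still queues it only once.
        if notify and notify not in self.pending:
            self.pending.append(notify)

    def flush(self):
        for handler in self.pending:
            self.log.append(f"handler: {handler}")
        self.pending.clear()


play = Play()
play.task("Deploy config", notify="Reload nginx")
play.task("Deploy vhost", notify="Reload nginx")
play.flush()                     # meta: flush_handlers
play.task("Deploy SSL cert")
play.flush()                     # implicit flush at the end of the tasks
print("\n".join(play.log))
```

The reload appears exactly once, between the second and third tasks; without the explicit flush it would only appear after "Deploy SSL cert".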
4.2.4 YAML Indentation in Templates
Generating YAML from Jinja2 templates is especially error-prone:
{# Bad: Indentation issues (whitespace in front of block tags leaks into the output) #}
servers:
  {% for server in backend_servers %}
  - {{ server }}
  {% endfor %}
{# Good: Use indentation filters #}
servers:
  {{ backend_servers | to_nice_yaml | indent(2) }}
{# Or control whitespace explicitly: keep block tags at column 0 (Ansible enables trim_blocks by default) #}
servers:
{% for server in backend_servers %}
  - {{ server }}
{% endfor %}
4.2.5 Check Mode Compatibility
Some operations fail in check mode and need special handling:
- name: Get current user
  ansible.builtin.command: whoami
  register: current_user
  changed_when: false
  check_mode: false # Always run, even in check mode
- name: Task that depends on registered variable
  ansible.builtin.debug:
    msg: "Running as {{ current_user.stdout }}"
5. Troubleshooting and Monitoring
5.1 Troubleshooting
5.1.1 Debugging Techniques
Enable verbose output
# Verbosity levels
ansible-playbook site.yml -v # Show task results
ansible-playbook site.yml -vv # Show task input parameters
ansible-playbook site.yml -vvv # Show connection debugging
ansible-playbook site.yml -vvvv # Show connection plugin debugging (very verbose)
Use the debug module
- name: Debug variable content
  ansible.builtin.debug:
    var: some_variable
- name: Debug with message
  ansible.builtin.debug:
    msg: "Value is {{ some_variable }} and type is {{ some_variable | type_debug }}"
- name: Debug all variables for a host
  ansible.builtin.debug:
    var: hostvars[inventory_hostname]
Validate with assert
- name: Validate prerequisites
  ansible.builtin.assert:
    that:
      - ansible_distribution == "Ubuntu"
      - ansible_distribution_major_version | int >= 20
      - ansible_memtotal_mb >= 4096
    fail_msg: "Host {{ inventory_hostname }} does not meet requirements"
    success_msg: "All prerequisites met"
5.1.2 Common Problems
SSH connection problems
# Test SSH connection manually
ssh -vvv ansible@target-host
# Common fixes in ansible.cfg
# Caution: these options disable host key verification; use them only on disposable test hosts
[ssh_connection]
ssh_args = -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
Permission problems
# Check if become is working
- name: Test privilege escalation
  ansible.builtin.command: whoami
  become: true
  register: whoami_result
- name: Show result
  ansible.builtin.debug:
    msg: "Running as {{ whoami_result.stdout }}"
Module not found
# Check if collection is installed
ansible-galaxy collection list
# Install missing collection
ansible-galaxy collection install community.general
5.1.3 Playbook Debugging Workflow
# Step 1: Syntax check
ansible-playbook site.yml --syntax-check
# Step 2: List tasks without running
ansible-playbook site.yml --list-tasks
# Step 3: List hosts that would be affected
ansible-playbook site.yml --list-hosts
# Step 4: Dry run with diff
ansible-playbook site.yml --check --diff
# Step 5: Run on single host first
ansible-playbook site.yml --limit web01.example.com
# Step 6: Step through interactively
ansible-playbook site.yml --step
# Step 7: Start at specific task
ansible-playbook site.yml --start-at-task="Deploy Nginx configuration"
5.2 Performance Monitoring
5.2.1 Callback Plugins
Use callback plugins to monitor Playbook execution:
# ansible.cfg
[defaults]
# callbacks_enabled replaces callback_whitelist, which was deprecated in ansible-core 2.11
callbacks_enabled = timer, profile_tasks, profile_roles
stdout_callback = yaml
Sample profile_tasks output:
PLAY RECAP *********************************************************************
web01.example.com : ok=25 changed=3 unreachable=0 failed=0 skipped=5 rescued=0 ignored=0
Thursday 19 December 2024 15:30:45 +0800 (0:00:01.234) 0:02:15.678 ******
===============================================================================
nginx : Deploy Nginx configuration ------------------------------------ 45.23s
common : Install essential packages ----------------------------------- 32.15s
nginx : Install Nginx packages ---------------------------------------- 28.67s
common : Configure sysctl parameters ---------------------------------- 12.34s
5.2.2 Custom Execution Reports
# playbooks/generate-report.yml
---
- name: Generate execution report
  hosts: localhost
  gather_facts: true # ansible_date_time below comes from gathered facts
  tasks:
    - name: Create report directory
      ansible.builtin.file:
        path: "{{ playbook_dir }}/../reports/{{ ansible_date_time.date }}"
        state: directory
    - name: Generate HTML report
      ansible.builtin.template:
        src: execution-report.html.j2
        dest: "{{ playbook_dir }}/../reports/{{ ansible_date_time.date }}/report-{{ ansible_date_time.epoch }}.html"
5.2.3 Prometheus Integration
Use the Pushgateway to record Ansible execution metrics:
# callback_plugins/prometheus_metrics.py
from ansible.plugins.callback import CallbackBase
import requests
import time


class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'prometheus_metrics'

    def __init__(self):
        super(CallbackModule, self).__init__()
        self.start_time = None
        self.task_times = {}
        self.host_results = {}

    def v2_playbook_on_start(self, playbook):
        self.start_time = time.time()
        self.playbook_name = playbook._file_name

    def v2_playbook_on_stats(self, stats):
        duration = time.time() - self.start_time
        metrics = []
        metrics.append(f'ansible_playbook_duration_seconds{{playbook="{self.playbook_name}"}} {duration}')
        for host in stats.processed:
            summary = stats.summarize(host)
            for status in ['ok', 'changed', 'failures', 'skipped']:
                metrics.append(f'ansible_host_{status}{{host="{host}",playbook="{self.playbook_name}"}} {summary[status]}')
        # Push to Prometheus Pushgateway
        try:
            requests.post(
                'http://pushgateway:9091/metrics/job/ansible',
                # The text exposition format requires a trailing newline
                data='\n'.join(metrics) + '\n',
                timeout=5,
            )
        except Exception as e:
            self._display.warning(f"Failed to push metrics: {e}")
5.3 Backup and Recovery
5.3.1 Configuration Backup
# roles/common/tasks/backup.yml
---
- name: Create backup directory
  ansible.builtin.file:
    path: /opt/backup/ansible/{{ ansible_date_time.date }}
    state: directory
    mode: "0700"
- name: Backup critical configurations
  ansible.builtin.archive:
    path:
      - /etc/nginx
      - /etc/mysql
      - /etc/redis
      - /etc/ssh/sshd_config
      - /etc/sysctl.d
    dest: /opt/backup/ansible/{{ ansible_date_time.date }}/config-backup.tar.gz
    format: gz
- name: Sync backup to central storage
  ansible.posix.synchronize:
    src: /opt/backup/ansible/
    dest: "{{ backup_server }}:/backup/{{ inventory_hostname }}/"
    mode: push
    delete: false
    recursive: true
  delegate_to: "{{ inventory_hostname }}"
5.3.2 Disaster Recovery Playbook
# playbooks/disaster-recovery.yml
---
- name: Disaster recovery - rebuild server from scratch
  hosts: "{{ target_host }}"
  become: true
  gather_facts: true
  vars_prompt:
    - name: confirm_rebuild
      prompt: "This will rebuild the server from scratch. Type 'REBUILD' to confirm"
      private: false
  pre_tasks:
    - name: Validate confirmation
      ansible.builtin.assert:
        that:
          - confirm_rebuild == 'REBUILD'
        fail_msg: "Rebuild not confirmed"
  tasks:
    # unarchive cannot read a host:path source, so copy the archive over first.
    # Assumption: the target host can rsync from backup_server over SSH.
    - name: Fetch backup archive from backup server
      ansible.builtin.command:
        cmd: rsync -a {{ backup_server }}:/backup/{{ inventory_hostname }}/latest/config-backup.tar.gz /tmp/config-backup.tar.gz
      changed_when: true
      when: restore_from_backup | default(false)
    - name: Restore from backup
      ansible.builtin.unarchive:
        src: /tmp/config-backup.tar.gz
        dest: /
        remote_src: true
      when: restore_from_backup | default(false)
    - name: Apply full configuration
      ansible.builtin.include_role:
        name: "{{ item }}"
      loop:
        - common
        - "{{ server_role }}"
  post_tasks:
    - name: Verify all services
      ansible.builtin.service:
        name: "{{ item }}"
        state: started
      loop: "{{ required_services }}"
    - name: Run health checks
      ansible.builtin.uri:
        url: "http://127.0.0.1:{{ item.port }}{{ item.path }}"
        status_code: 200
      loop: "{{ health_endpoints }}"
5.3.3 Backing Up the Ansible Control Node
The control node's configuration needs to be backed up as well, and all Ansible code is managed in Git:
#!/bin/bash
# scripts/backup-control-node.sh
# Backup Ansible control node configuration
BACKUP_DIR="/opt/backup/ansible-control"
DATE=$(date +%Y%m%d)
# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"
# Backup Ansible configuration
tar -czf "$BACKUP_DIR/$DATE/ansible-config.tar.gz" \
/etc/ansible \
~/.ansible \
~/.ssh/ansible_* \
2>/dev/null
# Backup custom plugins
tar -czf "$BACKUP_DIR/$DATE/ansible-plugins.tar.gz" \
/usr/share/ansible/plugins \
2>/dev/null
# Commit Ansible repository
cd /opt/ansible-infrastructure || exit 1
git add -A
git commit -m "Backup: $DATE" || true
git push origin main
# Cleanup old backups (keep 30 days); -mindepth 1 keeps $BACKUP_DIR itself from being removed
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +
echo "Backup completed: $BACKUP_DIR/$DATE"
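6. Summary
Writing Ansible Playbooks is easy to start and hard to master. Our team went from "as long as it runs" to the conventions described here, and hit countless pitfalls along the way. The most painful was the Friday-evening incident described at the start of this article, which is what made us commit three months to writing the conventions down.
Looking back on these years of practice, the most important points are:
Structured thinking: put everything where it belongs. Variables go in group_vars, sensitive data is encrypted with Vault, and reusable logic is extracted into Roles. That way, whether it is you coming back three months later or a new colleague taking over, the code can be understood quickly.
Idempotency awareness: for every task you write, ask yourself what happens if it runs twice. If the answer is "something breaks", it has to change. Use the shell module less, and declarative modules such as lineinfile and template more.
Security first: never store passwords in plain text, manage SSH keys carefully, and keep sudo privileges to a minimum. Security feels like a hassle before an incident and a bitter regret after one.
Continuous improvement: conventions are not set in stone. Hold a retrospective after every incident and turn the lessons into rules. Our Playbook library is still evolving; we review it every quarter, clearing out what is obsolete and adding new best practices.
On the road of automated operations, Ansible is only the starting point. Once you have mastered it, you can move on to infrastructure-as-code tools such as Terraform and Pulumi, or go deeper into Kubernetes' declarative configuration. The core idea is the same throughout: manage everything as code, so that operations work is repeatable, traceable, and collaborative.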
Appendix
A. Command Quick Reference
# Inventory operations
ansible-inventory --list -i inventory/production/hosts.yml
ansible-inventory --graph -i inventory/production/hosts.yml
ansible all -m ping -i inventory/production/hosts.yml
# Playbook operations
ansible-playbook site.yml --syntax-check
ansible-playbook site.yml --list-tasks
ansible-playbook site.yml --list-hosts
ansible-playbook site.yml --check --diff
ansible-playbook site.yml --tags nginx
ansible-playbook site.yml --skip-tags install
ansible-playbook site.yml --limit webservers
ansible-playbook site.yml --start-at-task="Deploy config"
# Vault operations
ansible-vault create secrets.yml
ansible-vault edit secrets.yml
ansible-vault encrypt secrets.yml
ansible-vault decrypt secrets.yml
ansible-vault encrypt_string 'password' --name 'db_password'
ansible-vault rekey secrets.yml
# Galaxy operations
ansible-galaxy init my_role
ansible-galaxy install -r requirements.yml
ansible-galaxy collection install community.general
# Ad-hoc commands
ansible webservers -m shell -a "uptime"
ansible databases -m service -a "name=mysql state=restarted"
ansible all -m setup -a "filter=ansible_distribution*"
B. Ansible Lint Rules
# .ansible-lint
---
profile: production
exclude_paths:
- .cache/
- .git/
- test/
skip_list:
- yaml[line-length]
- no-changed-when
warn_list:
- command-instead-of-shell
- risky-shell-pipe
enable_list:
- fqcn-builtins
- no-same-owner
use_default_rules: true
verbosity: 1
C. Recommended Collections
# requirements.yml
---
collections:
  - name: ansible.posix
    version: ">=1.5.0"
  - name: community.general
    version: ">=8.0.0"
  - name: community.mysql
    version: ">=3.8.0"
  - name: community.docker
    version: ">=3.4.0"
  - name: community.crypto
    version: ">=2.16.0"
  - name: amazon.aws
    version: ">=7.0.0"
  - name: google.cloud
    version: ">=1.3.0"
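D. Project Template
We maintain an Ansible project template on GitHub that contains all the conventions and example code from this article. For a new project, simply fork the template to get started: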
# Clone template and start new project
git clone https://github.com/your-org/ansible-template.git my-infrastructure
cd my-infrastructure
./scripts/init-project.sh
