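1. Overview

1.1 Background

At 6 p.m. one Friday in 2025, just as our team was about to leave for the day, word came in that a security patch had to be rolled out urgently to 200 servers. Xiao Wang, one of our ops engineers, confidently opened the Ansible Playbook he had written and ran it. The result? The Playbook hard-coded a path that existed only on his own machine and had no idempotency handling, so it broke the Nginx configuration on 50 servers.

The whole team spent that weekend working overtime on the repair. Having learned the lesson the hard way, we spent three months putting together a set of Ansible Playbook authoring standards. Over the following three years those standards helped the team manage more than 2,000 servers and run well over ten thousand automated deployments without a repeat of that kind of incident.

To be honest, Ansible's barrier to entry is low: a few lines of YAML are enough to get something running. But that very simplicity leads many people to overlook the importance of standards. A Playbook that runs for its author is not necessarily one the team can maintain, and one that runs today will not necessarily run tomorrow.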

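1.2 Technical Characteristics

As an agentless configuration management tool, Ansible has several especially likable traits:

Declarative configuration: you tell Ansible what state you want, not how to get there. If you say "I want the Nginx service running", Ansible works out whether it needs to start the service, restart it, or do nothing at all.

Idempotent by design: in theory, running the same Playbook once or a hundred times should produce the same result. That "in theory" holds only if the Playbook is written with enough discipline.

Modular architecture: Ansible has thousands of official modules, from file management to cloud platform operations. For almost any operation you can think of, a ready-made module exists.

SSH-based: no agent has to be installed on the target machines; being able to SSH in is enough. This is particularly useful in environments with strict security requirements.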
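
The idempotency point above can be sketched outside Ansible as well. The snippet below is a minimal shell illustration, not Ansible code: `append_blindly` and `ensure_line` are hypothetical helper names, and the temporary file stands in for a real configuration file. It contrasts an action that repeats its effect on every run with one that acts only when the desired state is missing, which is roughly the behavior a module such as `ansible.builtin.lineinfile` provides.

```shell
#!/usr/bin/env bash
# Illustration only: the temp file and the setting below are made up for this demo.
set -euo pipefail

conf="$(mktemp)"

# Non-idempotent: every run appends another copy of the line.
append_blindly() {
    echo "$1" >> "$2"
}

# Idempotent: describe the desired state ("this exact line is present")
# and change the file only when that state does not hold yet.
ensure_line() {
    local line="$1" file="$2"
    grep -qxF -- "$line" "$file" || echo "$line" >> "$file"
}

append_blindly "worker_processes 4;" "$conf"
append_blindly "worker_processes 4;" "$conf"
blind_count="$(grep -c "worker_processes" "$conf")"

: > "$conf"   # reset the file before the second experiment
ensure_line "worker_processes 4;" "$conf"
ensure_line "worker_processes 4;" "$conf"
ensured_count="$(grep -c "worker_processes" "$conf")"

echo "blind append: $blind_count line(s), ensure_line: $ensured_count line(s)"
rm -f "$conf"
```

Running the blind version twice leaves two copies of the line, while the state-checking version leaves one no matter how often it runs.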

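Ansible also has its headaches:

  • YAML syntax is extremely sensitive to indentation; one missing space causes an error
  • Variable precedence has 22 levels, so it is easy to lose track of which variable overrides which
  • Jinja2 templates tend to become unreadable in complex scenarios
  • Performance becomes an issue when executing at large scale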

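1.3 Use Cases

In our experience, Ansible is a particularly good fit for the following scenarios:

Configuration management: centrally manage system configuration, packages, user permissions and so on. We use Ansible to manage SSH configuration, firewall rules and system parameter tuning on every server.

Application deployment: automate the deployment of web applications, microservices, databases and the like. Java applications and Node.js services can both be deployed with Ansible.

Environment initialization: standardized configuration when a new server goes live. We have a Playbook that turns a bare machine into a server meeting company standards within 10 minutes.

Batch operations: any situation where the same operation must run on a large number of servers, such as bulk package updates or bulk configuration file changes.

Disaster recovery: rebuild an environment quickly. With all configuration managed by Ansible, the entire production environment can in theory be rebuilt within a few hours.

Scenarios where it fits less well:

  • Scenarios that need real-time response (Ansible execution has latency)
  • Scenarios that need complex conditionals and loops (Python or Go is a better fit)
  • Scenarios where the target machines cannot be reached over SSH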

1.4 Environment Requirements

Control node requirements

# Operating System
os: Linux/macOS/WSL2 (Windows native is not recommended)
python: ">=3.9"
ansible: ">=2.14"

# Recommended specs
cpu: 4 cores
memory: 8GB
disk: 50GB SSD

Target node requirements

# Minimum requirements
os: Linux (RHEL/CentOS/Ubuntu/Debian)
python: ">=3.6"
ssh: enabled
sudo: configured for ansible user

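Network requirements

  • The control node can reach every target node over SSH
  • A dedicated management network is recommended
  • SSH uses port 22 by default; this can be customized

Actual production configuration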

# Ansible version we use in production
ansible --version
# ansible [core 2.15.6]
# python version = 3.11.4
# jinja version = 3.1.2

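# Install Ansible on control node
# (on PyPI the 2.15.x version series belongs to ansible-core; the "ansible"
# bundle uses a different version scheme, so ansible==2.15.6 does not resolve)
pip install ansible-core==2.15.6 ansible-lint==6.22.0
# Collections used later (community.general, ansible.posix) are installed separately
ansible-galaxy collection install community.general ansible.posix

# Create ansible user on target nodes
useradd -m -s /bin/bash ansible
echo "ansible ALL=(ALL) NOPASSWD:ALL" > /etc/sudoers.d/ansible
chmod 440 /etc/sudoers.d/ansible
# Validate the sudoers fragment before relying on it
visudo -cf /etc/sudoers.d/ansible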

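2. Detailed Steps

2.1 Preparation

2.1.1 Project Directory Structure

A well-organized Ansible project should have a clear directory structure. After many iterations we settled on the following: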

ansible-infrastructure/
├── ansible.cfg                 # Ansible configuration
├── requirements.yml            # Role dependencies from Galaxy
├── inventory/
│   ├── production/
│   │   ├── hosts.yml          # Production inventory
│   │   ├── group_vars/
│   │   │   ├── all.yml        # Variables for all hosts
│   │   │   ├── webservers.yml # Variables for web servers
│   │   │   └── databases.yml  # Variables for databases
│   │   └── host_vars/
│   │       ├── web01.yml      # Host-specific variables
│   │       └── db01.yml
│   └── staging/
│       ├── hosts.yml
│       ├── group_vars/
│       └── host_vars/
├── playbooks/
│   ├── site.yml               # Main playbook
│   ├── webservers.yml         # Web server playbook
│   ├── databases.yml          # Database playbook
│   └── deploy-app.yml         # Application deployment
├── roles/
│   ├── common/                # Common configurations
│   │   ├── tasks/
│   │   │   └── main.yml
│   │   ├── handlers/
│   │   │   └── main.yml
│   │   ├── templates/
│   │   ├── files/
│   │   ├── vars/
│   │   │   └── main.yml
│   │   ├── defaults/
│   │   │   └── main.yml
│   │   └── meta/
│   │       └── main.yml
│   ├── nginx/
│   ├── mysql/
│   └── monitoring/
├── library/                   # Custom modules
├── filter_plugins/            # Custom filters
├── callback_plugins/          # Custom callbacks
├── files/                     # Static files
├── templates/                 # Jinja2 templates
└── scripts/                   # Helper scripts
    ├── run-playbook.sh
    └── vault-password.sh
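
A layout like the one above is easy to scaffold with a few lines of shell. The sketch below is only an illustration: `scaffold_project` is a hypothetical helper, it assumes every role follows the same skeleton shown for `common`, and it creates empty placeholder files rather than working content.

```shell
#!/usr/bin/env bash
# Sketch: scaffold the project layout shown above under a target directory.
set -euo pipefail

scaffold_project() {
    local root="$1" env role part
    mkdir -p "$root"/{playbooks,library,filter_plugins,callback_plugins,files,templates,scripts}
    for env in production staging; do
        mkdir -p "$root/inventory/$env"/{group_vars,host_vars}
        touch "$root/inventory/$env/hosts.yml"
    done
    # Assumption: nginx, mysql and monitoring use the same skeleton as common
    for role in common nginx mysql monitoring; do
        mkdir -p "$root/roles/$role"/{tasks,handlers,templates,files,vars,defaults,meta}
        for part in tasks handlers vars defaults meta; do
            touch "$root/roles/$role/$part/main.yml"
        done
    done
    touch "$root/ansible.cfg" "$root/requirements.yml"
}

# Demo run in a throwaway directory
demo_root="$(mktemp -d)/ansible-infrastructure"
scaffold_project "$demo_root"
find "$demo_root" -maxdepth 2 -type d | sort
```

Starting every new repository from the same skeleton keeps inventories, roles and playbooks in predictable places, which is the main point of agreeing on a structure.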

2.1.2 Ansible Configuration File

ansible.cfg is the core file controlling Ansible's behavior. This is the configuration used in production:

# ansible.cfg - Production configuration
[defaults]
# Inventory settings
inventory = inventory/production/hosts.yml
roles_path = roles:~/.ansible/roles:/usr/share/ansible/roles

# Performance tuning
forks = 50
poll_interval = 5
timeout = 30

# SSH settings
remote_user = ansible
private_key_file = ~/.ssh/ansible_ed25519
host_key_checking = False
transport = smart

# Output settings
stdout_callback = yaml
# callback_whitelist was deprecated in favor of callbacks_enabled
callbacks_enabled = timer, profile_tasks, profile_roles
deprecation_warnings = True
system_warnings = True

# Fact caching (significantly improves performance)
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
fact_caching_timeout = 86400

# Retry settings
retry_files_enabled = True
retry_files_save_path = ~/.ansible/retry

# Logging
log_path = /var/log/ansible/ansible.log

# Vault settings
vault_password_file = scripts/vault-password.sh

# Misc
nocows = 1
any_errors_fatal = False
error_on_undefined_vars = True

[privilege_escalation]
become = True
become_method = sudo
become_user = root
become_ask_pass = False

[ssh_connection]
ssh_args = -C -o ControlMaster=auto -o ControlPersist=600s -o PreferredAuthentications=publickey
pipelining = True
control_path_dir = ~/.ansible/cp
control_path = %(directory)s/%%h-%%r

[persistent_connection]
connect_timeout = 30
command_timeout = 30

2.1.3 Inventory Configuration

The inventory defines the servers you manage. The YAML format is clearer than the INI format:

# inventory/production/hosts.yml
all:
  children:
    webservers:
      hosts:
        web01.prod.example.com:
          ansible_host: 10.0.1.11
          nginx_worker_processes: 4
        web02.prod.example.com:
          ansible_host: 10.0.1.12
          nginx_worker_processes: 4
        web03.prod.example.com:
          ansible_host: 10.0.1.13
          nginx_worker_processes: 8
      vars:
        nginx_worker_connections: 4096
        app_port: 8080

    databases:
      children:
        mysql_primary:
          hosts:
            db01.prod.example.com:
              ansible_host: 10.0.2.11
              mysql_server_id: 1
        mysql_replica:
          hosts:
            db02.prod.example.com:
              ansible_host: 10.0.2.12
              mysql_server_id: 2
            db03.prod.example.com:
              ansible_host: 10.0.2.13
              mysql_server_id: 3
      vars:
        mysql_port: 3306
        mysql_datadir: /data/mysql

    redis:
      hosts:
        redis01.prod.example.com:
          ansible_host: 10.0.3.11
          redis_port: 6379
        redis02.prod.example.com:
          ansible_host: 10.0.3.12
          redis_port: 6379
        redis03.prod.example.com:
          ansible_host: 10.0.3.13
          redis_port: 6379
      vars:
        redis_maxmemory: 8gb
        redis_maxmemory_policy: allkeys-lru

    monitoring:
      hosts:
        prometheus01.prod.example.com:
          ansible_host: 10.0.4.11
        grafana01.prod.example.com:
          ansible_host: 10.0.4.12

  vars:
    ansible_user: ansible
    ansible_become: true
    ansible_python_interpreter: /usr/bin/python3
    ntp_servers:
      - 10.0.0.1
      - 10.0.0.2
    dns_servers:
      - 10.0.0.10
      - 10.0.0.11

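2.2 Core Configuration

2.2.1 Variable Management Standards

Variable management is where Ansible goes wrong most easily. After hitting countless pitfalls, we distilled the following principles:

Variable naming conventions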

# Good: Prefixed with role name, clear meaning
nginx_worker_processes: 4
nginx_worker_connections: 4096
mysql_max_connections: 500
redis_maxmemory: 8gb

# Bad: Ambiguous, no prefix
workers: 4        # Which service?
max_conn: 500     # MySQL? Redis? Nginx?
memory: 8gb       # What memory?
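
The prefix rule can be checked mechanically. The following is a rough sketch rather than a replacement for a real linter: `unprefixed_vars` is a hypothetical helper, the sample file is made up, and the check only looks at top-level keys that start in the first column.

```shell
#!/usr/bin/env bash
# Sketch: list top-level variables in a role defaults file that lack the role prefix.
set -euo pipefail

# Prints every top-level key in FILE that does not start with "<ROLE>_".
unprefixed_vars() {
    local role="$1" file="$2"
    # Top-level keys start in column 1 and end with a colon;
    # "|| true" keeps an empty result from aborting under "set -e -o pipefail".
    grep -oE '^[A-Za-z_][A-Za-z0-9_]*:' "$file" \
        | tr -d ':' \
        | grep -v "^${role}_" || true
}

# Made-up defaults file mixing good and bad names
demo="$(mktemp)"
cat > "$demo" <<'EOF'
---
nginx_worker_processes: 4
nginx_worker_connections: 4096
workers: 4
max_conn: 500
EOF

echo "Variables without the nginx_ prefix:"
unprefixed_vars nginx "$demo"
rm -f "$demo"
```

A check like this could run in CI against each role's defaults/main.yml so that ambiguous names are caught before review.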

Variable file organization

# inventory/production/group_vars/all.yml
# Global variables for all hosts
---
# Environment identifier
env: production
datacenter: dc1
region: cn-north-1

# Common system settings
timezone: Asia/Shanghai
locale: en_US.UTF-8

# Security settings
security_ssh_permit_root_login: false
security_ssh_password_authentication: false
security_fail2ban_enabled: true

# NTP configuration
ntp_enabled: true
ntp_servers:
  - ntp1.aliyun.com
  - ntp2.aliyun.com

# Package repository
package_repo_base_url: http://mirrors.aliyun.com

# Monitoring endpoints
prometheus_pushgateway: http://10.0.4.11:9091

# inventory/production/group_vars/webservers.yml
# Variables specific to web servers
---
# Nginx settings
nginx_user: www-data
nginx_worker_processes: auto
nginx_worker_connections: 4096
nginx_keepalive_timeout: 65
nginx_client_max_body_size: 100m
nginx_gzip_enabled: true
nginx_gzip_types:
  - text/plain
  - text/css
  - application/json
  - application/javascript
  - text/xml
  - application/xml

# SSL settings
nginx_ssl_protocols: TLSv1.2 TLSv1.3
nginx_ssl_ciphers: ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256
nginx_ssl_session_timeout: 1d
nginx_ssl_session_cache: shared:SSL:50m

# Application settings
app_name: myapp
app_user: deploy
app_group: deploy
app_port: 8080
app_deploy_path: /opt/apps/{{ app_name }}
app_log_path: /var/log/{{ app_name }}
app_java_opts: "-Xms2g -Xmx4g -XX:+UseG1GC -XX:MaxGCPauseMillis=200"

Encrypt sensitive information with Vault

# Create encrypted file
ansible-vault create inventory/production/group_vars/vault.yml

# Edit encrypted file
ansible-vault edit inventory/production/group_vars/vault.yml
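
# inventory/production/group_vars/vault.yml (encrypted content)
# Note: Ansible matches files under group_vars/ by group name, so a file named
# vault.yml is loaded only for a group called "vault"; keeping it at
# group_vars/all/vault.yml is a common way to apply it to every host.
---
vault_mysql_root_password: "xK9#mP2$vL5nQ8wR"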
vault_mysql_repl_password: "aB3cD4eF5gH6iJ7k"
vault_redis_password: "rEdIs_PaSsWoRd_2024!"
vault_app_secret_key: "7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c"
vault_ssl_key_passphrase: "ssl_key_passphrase_here"

# Reference vault variables in group_vars/all.yml
mysql_root_password: "{{ vault_mysql_root_password }}"
mysql_repl_password: "{{ vault_mysql_repl_password }}"
redis_password: "{{ vault_redis_password }}"
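
Because a vault file is only safe while it stays encrypted, a small guard before committing or running can help. The sketch below relies on the fact that files written by ansible-vault begin with a header line starting with `$ANSIBLE_VAULT;`. Everything else is an assumption for illustration: `is_vault_encrypted` is a hypothetical helper, and the two demo files (including the fake payload) are made up.

```shell
#!/usr/bin/env bash
# Sketch: detect vault files that were accidentally saved in plain text.
set -euo pipefail

# Succeeds when the first line of FILE looks like an ansible-vault header,
# for example "$ANSIBLE_VAULT;1.1;AES256".
is_vault_encrypted() {
    local first_line=""
    IFS= read -r first_line < "$1" || true
    [[ "$first_line" == '$ANSIBLE_VAULT;'* ]]
}

# Demo files: one with a vault header and a fake payload, one in plain text
demo_dir="$(mktemp -d)"
printf '%s\n' '$ANSIBLE_VAULT;1.1;AES256' '62313365396662343061393464336163' > "$demo_dir/encrypted.yml"
printf '%s\n' 'vault_redis_password: "plain"' > "$demo_dir/plain.yml"

for f in "$demo_dir"/*.yml; do
    if is_vault_encrypted "$f"; then
        echo "OK        $(basename "$f")"
    else
        echo "PLAINTEXT $(basename "$f")"
    fi
done
```

Wiring a check like this into a pre-commit hook is one way to keep decrypted secrets out of version control.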

2.2.2 Role Authoring Standards

Roles are the core of code reuse in Ansible. A good Role should be self-contained, configurable and documented.

# roles/nginx/tasks/main.yml
---
- name: Include OS-specific variables
  ansible.builtin.include_vars: "{{ ansible_os_family | lower }}.yml"
  tags: [nginx, nginx:install]

- name: Install Nginx packages
  ansible.builtin.package:
    name: "{{ nginx_packages }}"
    state: present
  tags: [nginx, nginx:install]

- name: Create Nginx directories
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: root
    group: root
    mode: "0755"
  loop:
    - /etc/nginx/conf.d
    - /etc/nginx/ssl
    - /var/cache/nginx
    - /var/log/nginx
  tags: [nginx, nginx:config]

- name: Deploy Nginx main configuration
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
    owner: root
    group: root
    mode: "0644"
    validate: nginx -t -c %s
  notify: Reload nginx
  tags: [nginx, nginx:config]

- name: Deploy SSL certificates
  ansible.builtin.copy:
    content: "{{ item.content }}"
    dest: "{{ item.dest }}"
    owner: root
    group: root
    mode: "{{ item.mode }}"
  loop:
    - content: "{{ nginx_ssl_certificate }}"
      dest: /etc/nginx/ssl/server.crt
      mode: "0644"
    - content: "{{ nginx_ssl_certificate_key }}"
      dest: /etc/nginx/ssl/server.key
      mode: "0600"
  when: nginx_ssl_enabled | default(false)
  notify: Reload nginx
  no_log: true
  tags: [nginx, nginx:ssl]

- name: Deploy virtual host configurations
  ansible.builtin.template:
    src: vhost.conf.j2
    dest: "/etc/nginx/conf.d/{{ item.server_name }}.conf"
    owner: root
    group: root
    mode: "0644"
    # No `validate` here: the option requires a %s placeholder, and a vhost
    # fragment cannot be tested as a standalone nginx configuration
  loop: "{{ nginx_vhosts }}"
  notify: Reload nginx
  tags: [nginx, nginx:vhosts]

- name: Remove default virtual host
  ansible.builtin.file:
    path: "{{ item }}"
    state: absent
  loop:
    - /etc/nginx/sites-enabled/default
    - /etc/nginx/conf.d/default.conf
  notify: Reload nginx
  tags: [nginx, nginx:config]

- name: Ensure Nginx is started and enabled
  ansible.builtin.service:
    name: nginx
    state: started
    enabled: true
  tags: [nginx, nginx:service]

# roles/nginx/handlers/main.yml
---
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded
  listen: Reload nginx

- name: Restart nginx
  ansible.builtin.service:
    name: nginx
    state: restarted
  listen: Restart nginx

# roles/nginx/defaults/main.yml
---
# Package settings
nginx_packages:
  - nginx

# Worker settings
nginx_user: www-data
nginx_worker_processes: auto
nginx_worker_connections: 4096
nginx_multi_accept: true

# Performance settings
nginx_keepalive_timeout: 65
nginx_keepalive_requests: 1000
nginx_client_max_body_size: 64m
nginx_client_body_buffer_size: 128k
nginx_client_header_buffer_size: 1k
nginx_large_client_header_buffers: 4 16k

# Gzip settings
nginx_gzip_enabled: true
nginx_gzip_comp_level: 6
nginx_gzip_min_length: 1024
nginx_gzip_types:
  - text/plain
  - text/css
  - text/javascript
  - application/json
  - application/javascript
  - application/xml
  - application/xml+rss
  - image/svg+xml

# Logging settings
nginx_access_log: /var/log/nginx/access.log
nginx_error_log: /var/log/nginx/error.log
nginx_log_format: |
  '$remote_addr - $remote_user [$time_local] "$request" '
  '$status $body_bytes_sent "$http_referer" '
  '"$http_user_agent" "$http_x_forwarded_for" '
  'rt=$request_time uct="$upstream_connect_time" '
  'uht="$upstream_header_time" urt="$upstream_response_time"'

# SSL settings
nginx_ssl_enabled: false
nginx_ssl_protocols: TLSv1.2 TLSv1.3
nginx_ssl_ciphers: ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384

# Virtual hosts
nginx_vhosts: []
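
{# roles/nginx/templates/nginx.conf.j2 #}
# Managed by Ansible - DO NOT EDIT MANUALLY
{# No "Last modified" timestamp: ansible_date_time differs on every run, so the file would always be reported as changed and nginx reloaded each time #}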

user {{ nginx_user }};
worker_processes {{ nginx_worker_processes }};
pid /run/nginx.pid;
error_log {{ nginx_error_log }} warn;

events {
    worker_connections {{ nginx_worker_connections }};
    multi_accept {{ 'on' if nginx_multi_accept else 'off' }};
    use epoll;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    # Logging configuration
    log_format main {{ nginx_log_format }};
    access_log {{ nginx_access_log }} main;

    # Basic settings
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout {{ nginx_keepalive_timeout }};
    keepalive_requests {{ nginx_keepalive_requests }};
    types_hash_max_size 2048;
    server_tokens off;

    # Client settings
    client_max_body_size {{ nginx_client_max_body_size }};
    client_body_buffer_size {{ nginx_client_body_buffer_size }};
    client_header_buffer_size {{ nginx_client_header_buffer_size }};
    large_client_header_buffers {{ nginx_large_client_header_buffers }};

{% if nginx_gzip_enabled %}
    # Gzip settings
    gzip on;
    gzip_vary on;
    gzip_proxied any;
    gzip_comp_level {{ nginx_gzip_comp_level }};
    gzip_min_length {{ nginx_gzip_min_length }};
    gzip_types {{ nginx_gzip_types | join(' ') }};
{% endif %}

{% if nginx_ssl_enabled %}
    # SSL settings
    ssl_protocols {{ nginx_ssl_protocols }};
    ssl_ciphers {{ nginx_ssl_ciphers }};
    ssl_prefer_server_ciphers on;
    ssl_session_cache shared:SSL:50m;
    ssl_session_timeout 1d;
    ssl_session_tickets off;
{% endif %}

    # Include virtual hosts
    include /etc/nginx/conf.d/*.conf;
}

2.2.3 Playbook Authoring Standards

A Playbook is where Roles are combined to accomplish a specific task.

# playbooks/site.yml
# Master playbook - configures entire infrastructure
---
- name: Apply common configuration to all hosts
  hosts: all
  become: true
  gather_facts: true
  any_errors_fatal: false

  pre_tasks:
    - name: Verify Ansible version
      ansible.builtin.assert:
        that:
          - ansible_version.full is version('2.14', '>=')
        fail_msg: "Ansible 2.14 or higher required"
        success_msg: "Ansible version check passed"
      run_once: true
      delegate_to: localhost
      tags: [always]

    - name: Display target information
      ansible.builtin.debug:
        msg: |
          Target: {{ inventory_hostname }}
          OS: {{ ansible_distribution }} {{ ansible_distribution_version }}
          IP: {{ ansible_default_ipv4.address | default('N/A') }}
      tags: [always]

  roles:
    - role: common
      tags: [common]

- name: Configure web servers
  hosts: webservers
  become: true
  serial: "30%"
  max_fail_percentage: 10

  roles:
    - role: nginx
      tags: [nginx]
    - role: app-deploy
      tags: [app]

- name: Configure database servers
  hosts: databases
  become: true
  serial: 1

  roles:
    - role: mysql
      tags: [mysql]
      when: "'mysql' in group_names or 'mysql_primary' in group_names or 'mysql_replica' in group_names"

- name: Configure Redis servers
  hosts: redis
  become: true

  roles:
    - role: redis
      tags: [redis]

- name: Configure monitoring
  hosts: monitoring
  become: true

  roles:
    - role: prometheus
      tags: [prometheus]
      when: "'prometheus' in inventory_hostname"
    - role: grafana
      tags: [grafana]
      when: "'grafana' in inventory_hostname"
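
The web server play above uses `serial: "30%"`, and it helps to see what that means for a given host count. The sketch below is an approximation under a stated assumption: the batch size is taken as the percentage of the host count rounded down, with a minimum of one host. `serial_batches` is a hypothetical helper, not an Ansible command.

```shell
#!/usr/bin/env bash
# Sketch: how a percentage `serial` value splits hosts into batches.
set -euo pipefail

# serial_batches HOST_COUNT PERCENT
# Assumption: batch size = floor(HOST_COUNT * PERCENT / 100), at least 1.
serial_batches() {
    local hosts="$1" pct="$2" size remaining batches=()
    size=$(( hosts * pct / 100 ))
    (( size >= 1 )) || size=1
    remaining="$hosts"
    while (( remaining > 0 )); do
        # The last batch may be smaller than the others
        if (( remaining < size )); then size="$remaining"; fi
        batches+=("$size")
        remaining=$(( remaining - size ))
    done
    echo "${batches[*]}"
}

echo "3 hosts at 30%:  $(serial_batches 3 30)"    # the three web servers in the sample inventory
echo "10 hosts at 30%: $(serial_batches 10 30)"
```

With only three web servers, a 30% batch works out to one host at a time, so the rollout is effectively sequential; the percentage form pays off as the group grows.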

# playbooks/deploy-app.yml
# Application deployment playbook with zero-downtime strategy
---
- name: Pre-deployment checks
  hosts: webservers
  become: true
  gather_facts: true
  any_errors_fatal: true

  vars:
    app_version: "{{ lookup('env', 'APP_VERSION') | default('latest', true) }}"
    deployment_id: "{{ lookup('pipe', 'date +%Y%m%d%H%M%S') }}"

  pre_tasks:
    - name: Validate deployment parameters
      ansible.builtin.assert:
        that:
          - app_version != ''
          - app_name is defined
          - app_deploy_path is defined
        fail_msg: "Missing required deployment parameters"
      tags: [always]

    - name: Check disk space
      ansible.builtin.assert:
        that:
          - item.size_available > 5368709120  # 5GB
        fail_msg: "Insufficient disk space on {{ item.mount }}"
      loop: "{{ ansible_mounts | selectattr('mount', 'in', ['/', '/opt']) | list }}"
      tags: [checks]

    - name: Verify application artifact exists
      ansible.builtin.uri:
        url: "{{ artifact_repo_url }}/{{ app_name }}/{{ app_version }}/{{ app_name }}-{{ app_version }}.tar.gz"
        method: HEAD
        status_code: 200
      delegate_to: localhost
      run_once: true
      tags: [checks]

  tasks:
    - name: Create deployment fact
      ansible.builtin.set_fact:
        current_deployment:
          id: "{{ deployment_id }}"
          version: "{{ app_version }}"
          timestamp: "{{ ansible_date_time.iso8601 }}"
          user: "{{ lookup('env', 'USER') }}"

- name: Deploy application with rolling update
  hosts: webservers
  become: true
  serial: "30%"
  max_fail_percentage: 10

  vars:
    app_version: "{{ lookup('env', 'APP_VERSION') | default('latest', true) }}"

  pre_tasks:
    - name: Remove server from load balancer
      ansible.builtin.uri:
        url: "http://{{ nginx_upstream_manager }}/api/upstream/{{ inventory_hostname }}/down"
        method: POST
        status_code: [200, 204]
      delegate_to: localhost
      when: nginx_upstream_manager is defined
      tags: [lb]

    - name: Wait for connections to drain
      ansible.builtin.wait_for:
        timeout: 30
      tags: [lb]

  roles:
    - role: app-deploy
      vars:
        app_artifact_url: "{{ artifact_repo_url }}/{{ app_name }}/{{ app_version }}/{{ app_name }}-{{ app_version }}.tar.gz"
      tags: [deploy]

  post_tasks:
    - name: Verify application health
      ansible.builtin.uri:
        url: "http://127.0.0.1:{{ app_port }}/health"
        method: GET
        status_code: 200
        return_content: true
      register: health_check
      retries: 10
      delay: 5
      until: health_check.status == 200
      tags: [verify]

    - name: Add server back to load balancer
      ansible.builtin.uri:
        url: "http://{{ nginx_upstream_manager }}/api/upstream/{{ inventory_hostname }}/up"
        method: POST
        status_code: [200, 204]
      delegate_to: localhost
      when: nginx_upstream_manager is defined
      tags: [lb]

- name: Post-deployment verification
  hosts: webservers
  become: true
  gather_facts: false
  any_errors_fatal: true

  tasks:
    - name: Run smoke tests
      ansible.builtin.uri:
        url: "http://127.0.0.1:{{ app_port }}{{ item.path }}"
        method: "{{ item.method | default('GET') }}"
        status_code: "{{ item.status | default(200) }}"
      loop:
        - path: /health
          status: 200
        - path: /api/v1/ping
          status: 200
        - path: /metrics
          status: 200
      tags: [smoke-test]

    - name: Notify deployment success
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Deployment successful: {{ app_name }} {{ app_version }} deployed to {{ ansible_play_hosts | length }} hosts"
          channel: "#deployments"
      delegate_to: localhost
      run_once: true
      when: slack_webhook_url is defined
      tags: [notify]
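
The health check in the rolling update uses `retries`, `delay` and `until` to poll until the application answers. The same pattern can be sketched as a plain shell loop, for example for use in a script outside Ansible. `retry_until` and `flaky_probe` are hypothetical names, the probe is a stand-in for a real HTTP check, and the attempt counting here is this sketch's own and may not match Ansible's exactly.

```shell
#!/usr/bin/env bash
# Sketch: poll a command until it succeeds or the attempt budget runs out.
set -euo pipefail

# retry_until MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry_until() {
    local max="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if (( attempt >= max )); then
            echo "giving up after $attempt attempts" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
    echo "succeeded on attempt $attempt"
}

# Stand-in for a health probe: fails twice, then succeeds.
counter_file="$(mktemp)"
echo 0 > "$counter_file"
flaky_probe() {
    local n
    n=$(( $(cat "$counter_file") + 1 ))
    echo "$n" > "$counter_file"
    (( n >= 3 ))
}

# Zero delay keeps the demo fast; a real check would wait a few seconds.
retry_until 10 0 flaky_probe
```

Bounding the number of attempts matters as much as retrying: a deployment that waits forever on an unhealthy host blocks the whole rollout.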

2.3 Running and Verifying

2.3.1 Pre-execution Checks

Before any Playbook is executed, a standard set of checks runs first:

#!/bin/bash
# scripts/run-playbook.sh
# Standard playbook execution wrapper

set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"

# Color codes
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'

# Default values
INVENTORY="production"
PLAYBOOK=""
TAGS=""
LIMIT=""
CHECK_MODE=""
DIFF_MODE=""
VERBOSE=""

usage() {
    echo "Usage: $0 -p PLAYBOOK [-i INVENTORY] [-t TAGS] [-l LIMIT] [-C] [-D] [-v]"
    echo ""
    echo "Options:"
    echo "  -p PLAYBOOK   Playbook to run (required)"
    echo "  -i INVENTORY  Inventory to use (default: production)"
    echo "  -t TAGS       Tags to run"
    echo "  -l LIMIT      Limit to specific hosts"
    echo "  -C            Check mode (dry run)"
    echo "  -D            Show diff"
    echo "  -v            Verbose output"
    exit 1
}

while getopts "p:i:t:l:CDvh" opt; do
    case $opt in
        p) PLAYBOOK="$OPTARG" ;;
        i) INVENTORY="$OPTARG" ;;
        t) TAGS="--tags $OPTARG" ;;
        l) LIMIT="--limit $OPTARG" ;;
        C) CHECK_MODE="--check" ;;
        D) DIFF_MODE="--diff" ;;
        v) VERBOSE="-vvv" ;;
        h) usage ;;
        *) usage ;;
    esac
done

if [[ -z "$PLAYBOOK" ]]; then
    echo -e "${RED}Error: Playbook is required${NC}"
    usage
fi

cd "$PROJECT_DIR"

echo -e "${GREEN}=== Pre-flight Checks ===${NC}"

# Check Ansible version
echo -n "Checking Ansible version... "
ANSIBLE_VERSION=$(ansible --version | head -1 | awk '{print $3}' | tr -d '[]')
if [[ "$(printf '%s\n' "2.14" "$ANSIBLE_VERSION" | sort -V | head -n1)" == "2.14" ]]; then
    echo -e "${GREEN}OK ($ANSIBLE_VERSION)${NC}"
else
    echo -e "${RED}FAILED (requires >= 2.14, found $ANSIBLE_VERSION)${NC}"
    exit 1
fi

# Check inventory exists
echo -n "Checking inventory... "
if [[ -d "inventory/$INVENTORY" ]]; then
    echo -e "${GREEN}OK${NC}"
else
    echo -e "${RED}FAILED (inventory/$INVENTORY not found)${NC}"
    exit 1
fi

# Check playbook exists
echo -n "Checking playbook... "
if [[ -f "playbooks/$PLAYBOOK" ]]; then
    echo -e "${GREEN}OK${NC}"
else
    echo -e "${RED}FAILED (playbooks/$PLAYBOOK not found)${NC}"
    exit 1
fi

# Syntax check
echo -n "Running syntax check... "
if ansible-playbook "playbooks/$PLAYBOOK" \
    -i "inventory/$INVENTORY/hosts.yml" \
    --syntax-check >/dev/null; then
    echo -e "${GREEN}OK${NC}"
else
    echo -e "${RED}FAILED${NC}"
    ansible-playbook "playbooks/$PLAYBOOK" \
        -i "inventory/$INVENTORY/hosts.yml" \
        --syntax-check
    exit 1
fi

# Lint check
echo -n "Running lint check... "
if command -v ansible-lint >/dev/null; then
    if ansible-lint "playbooks/$PLAYBOOK" >/dev/null; then
        echo -e "${GREEN}OK${NC}"
    else
        echo -e "${YELLOW}WARNINGS (check output below)${NC}"
        ansible-lint "playbooks/$PLAYBOOK"
    fi
else
    echo -e "${YELLOW}SKIPPED (ansible-lint not installed)${NC}"
fi

echo ""
echo -e "${GREEN}=== Execution ===${NC}"
echo "Inventory: $INVENTORY"
echo "Playbook: $PLAYBOOK"
[[ -n "$TAGS" ]] && echo "Tags: $TAGS"
[[ -n "$LIMIT" ]] && echo "Limit: $LIMIT"
[[ -n "$CHECK_MODE" ]] && echo "Mode: DRY RUN"

echo ""
read -p "Continue? [y/N] " -n 1 -r
echo ""

if [[ ! $REPLY =~ ^[Yy]$ ]]; then
    echo "Aborted."
    exit 0
fi

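# Build command as an array so arguments reach ansible-playbook without eval
CMD=(ansible-playbook "playbooks/$PLAYBOOK" -i "inventory/$INVENTORY/hosts.yml")
# TAGS and LIMIT hold "--flag value" pairs, so they are split on whitespace on purpose
# shellcheck disable=SC2206
[[ -n "$TAGS" ]] && CMD+=($TAGS)
# shellcheck disable=SC2206
[[ -n "$LIMIT" ]] && CMD+=($LIMIT)
[[ -n "$CHECK_MODE" ]] && CMD+=("$CHECK_MODE")
[[ -n "$DIFF_MODE" ]] && CMD+=("$DIFF_MODE")
[[ -n "$VERBOSE" ]] && CMD+=("$VERBOSE")

echo "Executing: ${CMD[*]}"
echo ""

# Execute
"${CMD[@]}"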

echo ""
echo -e "${GREEN}=== Execution Complete ===${NC}"
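
The version gate in the wrapper script depends on version-aware sorting, because plain string comparison would rank 2.9 above 2.14. The sketch below factors that check into a function; `version_ge` is a hypothetical name, and the approach assumes a `sort` that supports `-V`, as GNU coreutils does.

```shell
#!/usr/bin/env bash
# Sketch: the version gate from the wrapper script as a reusable function.
set -euo pipefail

# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) to order dotted version numbers.
version_ge() {
    local actual="$1" required="$2" lowest
    lowest="$(printf '%s\n' "$required" "$actual" | sort -V | head -n1)"
    [[ "$lowest" == "$required" ]]
}

# Sample versions chosen for the demo
for v in 2.9.27 2.15.6; do
    if version_ge "$v" 2.14; then
        echo "$v: meets the 2.14 minimum"
    else
        echo "$v: older than 2.14"
    fi
done
```

Keeping the comparison in one function makes it easy to reuse the same gate for other tools the wrapper depends on.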

2.3.2 Execution Examples

# Dry run with diff - see what would change without making changes
./scripts/run-playbook.sh -p site.yml -C -D

# Run specific tags on specific hosts
./scripts/run-playbook.sh -p site.yml -t nginx -l "web01.prod.example.com"

# Full deployment
./scripts/run-playbook.sh -p deploy-app.yml -i production

# Verbose output for debugging
./scripts/run-playbook.sh -p site.yml -l webservers -v

2.3.3 Verifying the Results

# playbooks/verify-deployment.yml
# Post-deployment verification playbook
---
- name: Verify deployment across all hosts
  hosts: all
  become: true
  gather_facts: true

  tasks:
    - name: Check system uptime
      ansible.builtin.command: uptime
      register: uptime_result
      changed_when: false

    - name: Display uptime
      ansible.builtin.debug:
        var: uptime_result.stdout

    - name: Check listening ports
      ansible.builtin.shell: ss -tlnp | grep -E ':(80|443|8080|3306|6379)\s'
      register: ports_result
      changed_when: false
      failed_when: false

    - name: Display listening ports
      ansible.builtin.debug:
        var: ports_result.stdout_lines

    - name: Check service status
      ansible.builtin.systemd:
        name: "{{ item }}"
      register: service_status
      loop: "{{ services_to_check | default([]) }}"
      when: services_to_check is defined

    - name: Generate verification report
      ansible.builtin.template:
        src: verification-report.j2
        dest: "/tmp/verification-{{ inventory_hostname }}-{{ ansible_date_time.epoch }}.txt"
      delegate_to: localhost

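3. Example Code and Configuration

3.1 Complete Configuration Example

Here is a complete Role example from real use, covering baseline server initialization: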

# roles/common/tasks/main.yml
---
- name: Import OS-specific variables
  ansible.builtin.include_vars: "{{ item }}"
  with_first_found:
    - "{{ ansible_distribution | lower }}-{{ ansible_distribution_major_version }}.yml"
    - "{{ ansible_distribution | lower }}.yml"
    - "{{ ansible_os_family | lower }}.yml"
    - "default.yml"
  tags: [common]

- name: Set hostname
  ansible.builtin.hostname:
    name: "{{ inventory_hostname_short }}"
  when: common_set_hostname | default(true)
  tags: [common, hostname]

- name: Configure /etc/hosts
  ansible.builtin.template:
    src: hosts.j2
    dest: /etc/hosts
    owner: root
    group: root
    mode: "0644"
    backup: true
  tags: [common, hosts]

- name: Set timezone
  community.general.timezone:
    name: "{{ timezone }}"
  tags: [common, timezone]

- name: Configure NTP
  ansible.builtin.include_tasks: ntp.yml
  when: ntp_enabled | default(true)
  tags: [common, ntp]

- name: Configure package repositories
  ansible.builtin.include_tasks: "repos-{{ ansible_os_family | lower }}.yml"
  tags: [common, repos]

- name: Update package cache
  # update_cache and cache_valid_time are apt options, so call the apt module directly
  ansible.builtin.apt:
    update_cache: true
    cache_valid_time: 3600
  when: ansible_os_family == 'Debian'
  tags: [common, packages]

- name: Install essential packages
  ansible.builtin.package:
    name: "{{ common_packages }}"
    state: present
  tags: [common, packages]

- name: Configure sysctl parameters
  ansible.posix.sysctl:
    name: "{{ item.key }}"
    value: "{{ item.value }}"
    sysctl_file: /etc/sysctl.d/99-ansible.conf
    reload: true
  loop: "{{ common_sysctl_params | dict2items }}"
  tags: [common, sysctl]

- name: Configure system limits
  community.general.pam_limits:
    domain: "{{ item.domain }}"
    limit_type: "{{ item.type }}"
    limit_item: "{{ item.item }}"
    value: "{{ item.value }}"
  loop: "{{ common_limits }}"
  tags: [common, limits]

- name: Create standard directories
  ansible.builtin.file:
    path: "{{ item.path }}"
    state: directory
    owner: "{{ item.owner | default('root') }}"
    group: "{{ item.group | default('root') }}"
    mode: "{{ item.mode | default('0755') }}"
  loop: "{{ common_directories }}"
  tags: [common, directories]

- name: Configure SSH server
  ansible.builtin.include_tasks: sshd.yml
  tags: [common, sshd]

- name: Configure firewall
  ansible.builtin.include_tasks: firewall.yml
  when: common_firewall_enabled | default(true)
  tags: [common, firewall]

- name: Configure fail2ban
  ansible.builtin.include_tasks: fail2ban.yml
  when: security_fail2ban_enabled | default(true)
  tags: [common, security, fail2ban]

- name: Setup monitoring agent
  ansible.builtin.include_tasks: monitoring.yml
  when: common_monitoring_enabled | default(true)
  tags: [common, monitoring]

- name: Configure log rotation
  ansible.builtin.template:
    src: logrotate-ansible.j2
    dest: /etc/logrotate.d/ansible-managed
    owner: root
    group: root
    mode: "0644"
  tags: [common, logrotate]

# roles/common/defaults/main.yml
---
# Hostname settings
common_set_hostname: true

# Package settings
common_packages:
  - vim
  - curl
  - wget
  - git
  - htop
  - iotop
  - sysstat
  - net-tools
  - bind-utils
  - lsof
  - tcpdump
  - strace
  - tree
  - jq
  - unzip
  - tar
  - rsync
  - tmux

# Sysctl parameters for production servers
common_sysctl_params:
  # Network performance
  net.core.somaxconn: 65535
  net.core.netdev_max_backlog: 65535
  net.ipv4.tcp_max_syn_backlog: 65535
  net.ipv4.tcp_fin_timeout: 15
  net.ipv4.tcp_keepalive_time: 300
  net.ipv4.tcp_keepalive_probes: 5
  net.ipv4.tcp_keepalive_intvl: 15
  net.ipv4.tcp_tw_reuse: 1
  net.ipv4.ip_local_port_range: 1024 65535

  # Memory management
  vm.swappiness: 10
  vm.dirty_ratio: 60
  vm.dirty_background_ratio: 5
  vm.overcommit_memory: 1

  # File system
  fs.file-max: 2097152
  fs.inotify.max_user_watches: 524288

  # Security
  net.ipv4.conf.all.rp_filter: 1
  net.ipv4.conf.default.rp_filter: 1
  net.ipv4.icmp_echo_ignore_broadcasts: 1
  net.ipv4.conf.all.accept_redirects: 0
  net.ipv4.conf.default.accept_redirects: 0

# System limits
common_limits:
  - domain: "*"
    type: soft
    item: nofile
    value: 1048576
  - domain: "*"
    type: hard
    item: nofile
    value: 1048576
  - domain: "*"
    type: soft
    item: nproc
    value: 65535
  - domain: "*"
    type: hard
    item: nproc
    value: 65535
  - domain: root
    type: soft
    item: nofile
    value: 1048576
  - domain: root
    type: hard
    item: nofile
    value: 1048576

# Standard directories
common_directories:
  - path: /opt/apps
    owner: root
    group: root
    mode: "0755"
  - path: /opt/scripts
    owner: root
    group: root
    mode: "0755"
  - path: /var/log/apps
    owner: root
    group: root
    mode: "0755"
  - path: /data
    owner: root
    group: root
    mode: "0755"

# Firewall settings
common_firewall_enabled: true
common_firewall_allowed_ports:
  - port: 22
    proto: tcp
  - port: 80
    proto: tcp
  - port: 443
    proto: tcp

# Monitoring settings
common_monitoring_enabled: true
node_exporter_version: "1.7.0"
node_exporter_port: 9100

# roles/common/tasks/sshd.yml
---
- name: Backup original sshd_config
  ansible.builtin.copy:
    src: /etc/ssh/sshd_config
    dest: /etc/ssh/sshd_config.orig
    remote_src: true
    force: false
    owner: root
    group: root
    mode: "0600"

- name: Configure SSH server
  ansible.builtin.template:
    src: sshd_config.j2
    dest: /etc/ssh/sshd_config
    owner: root
    group: root
    mode: "0600"
    validate: /usr/sbin/sshd -t -f %s
    backup: true
  notify: Restart sshd

- name: Ensure SSH service is running
  ansible.builtin.service:
    name: sshd
    state: started
    enabled: true
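
{# roles/common/templates/sshd_config.j2 #}
# Managed by Ansible - DO NOT EDIT MANUALLY
{# No "Last modified" timestamp: ansible_date_time differs on every run, so the file would always be reported as changed and sshd restarted each time #}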

# Basic settings
Port {{ ssh_port | default(22) }}
AddressFamily any
ListenAddress 0.0.0.0
Protocol 2

# Host keys
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key

# Security settings
PermitRootLogin {{ 'yes' if security_ssh_permit_root_login | default(false) else 'no' }}
PasswordAuthentication {{ 'yes' if security_ssh_password_authentication | default(false) else 'no' }}
PubkeyAuthentication yes
PermitEmptyPasswords no
ChallengeResponseAuthentication no

# Authentication
AuthorizedKeysFile .ssh/authorized_keys
MaxAuthTries 3
MaxSessions 10
LoginGraceTime 60

# Session settings
ClientAliveInterval 300
ClientAliveCountMax 3
TCPKeepAlive yes

# Logging
SyslogFacility AUTH
LogLevel INFO

# Environment
AcceptEnv LANG LC_*
X11Forwarding no
PrintMotd no
PrintLastLog yes

# Subsystems
Subsystem sftp /usr/lib/openssh/sftp-server

# Allow specific users/groups
{% if ssh_allow_users is defined and ssh_allow_users | length > 0 %}
AllowUsers {{ ssh_allow_users | join(' ') }}
{% endif %}
{% if ssh_allow_groups is defined and ssh_allow_groups | length > 0 %}
AllowGroups {{ ssh_allow_groups | join(' ') }}
{% endif %}

# Deny specific users/groups
{% if ssh_deny_users is defined and ssh_deny_users | length > 0 %}
DenyUsers {{ ssh_deny_users | join(' ') }}
{% endif %}

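3.2 Real-World Use Cases

Case 1: Patching the Log4j Vulnerability in Bulk

When the Log4j vulnerability (CVE-2021-44228) broke in December 2021, every server had to be scanned and remediated within 2 hours. This is the emergency playbook used at the time: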

# playbooks/emergency-log4j-fix.yml
# Emergency playbook for CVE-2021-44228 (Log4Shell)
---
- name: Log4j vulnerability detection and remediation
  hosts: all
  become: true
  gather_facts: true
  serial: 50

  vars:
    log4j_scan_paths:
      - /opt/apps
      - /usr/local
      - /var/lib
    log4j_vulnerable_versions:
      - "2.0"
      - "2.1"
      - "2.2"
      - "2.3"
      - "2.4"
      - "2.5"
      - "2.6"
      - "2.7"
      - "2.8"
      - "2.9"
      - "2.10"
      - "2.11"
      - "2.12.0"
      - "2.12.1"
      - "2.13"
      - "2.14"
      - "2.14.0"
      - "2.14.1"

  tasks:
    - name: Search for Log4j JAR files
      ansible.builtin.find:
        paths: "{{ log4j_scan_paths }}"
        patterns:
          - "log4j-core-*.jar"
          - "log4j-api-*.jar"
        recurse: true
        file_type: file
      register: log4j_files

    - name: Analyze found Log4j files
      ansible.builtin.set_fact:
        vulnerable_jars: "{{ log4j_files.files | selectattr('path', 'search', 'log4j-core-2\\.(0|1[0-4]|[0-9])[\\.\\-]') | list }}"

    - name: Report vulnerable files
      ansible.builtin.debug:
        msg: |
          Host: {{ inventory_hostname }}
          Vulnerable JARs found: {{ vulnerable_jars | length }}
          Files: {{ vulnerable_jars | map(attribute='path') | list }}
      when: vulnerable_jars | length > 0

    - name: Create backup directory
      ansible.builtin.file:
        path: /opt/backup/log4j-{{ ansible_date_time.date }}
        state: directory
        mode: "0755"
      when: vulnerable_jars | length > 0

    - name: Backup vulnerable JARs
      ansible.builtin.copy:
        src: "{{ item.path }}"
        dest: "/opt/backup/log4j-{{ ansible_date_time.date }}/{{ item.path | basename }}.{{ ansible_date_time.epoch }}"
        remote_src: true
      loop: "{{ vulnerable_jars }}"
      when: vulnerable_jars | length > 0

    - name: Apply JndiLookup class removal mitigation
      ansible.builtin.command:
        argv:
          - zip
          - -q
          - -d
          - "{{ item.path }}"
          - org/apache/logging/log4j/core/lookup/JndiLookup.class
      loop: "{{ vulnerable_jars }}"
      register: mitigation_result
      # zip exits with 12 ("nothing to do") when the class has already been removed
      changed_when: mitigation_result.rc == 0
      failed_when: mitigation_result.rc not in [0, 12]

    - name: Generate vulnerability report
      ansible.builtin.template:
        src: log4j-report.j2
        dest: "/tmp/log4j-report-{{ inventory_hostname }}.txt"
      delegate_to: localhost
      become: false  # the play-level become would otherwise attempt sudo on the control node

    - name: Aggregate reports
      ansible.builtin.fetch:
        src: "/tmp/log4j-report-{{ inventory_hostname }}.txt"
        dest: "reports/log4j/{{ inventory_hostname }}.txt"
        flat: true
      delegate_to: localhost
      become: false

  post_tasks:
    - name: Send alert if vulnerabilities found
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Log4j scan complete on {{ inventory_hostname }}: {{ vulnerable_jars | length }} vulnerable JARs found and mitigated"
      delegate_to: localhost
      when:
        - vulnerable_jars | length > 0
        - slack_webhook_url is defined
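
The version filter in the "Analyze found Log4j files" task is easy to get wrong, so it is worth checking the regular expression outside Ansible first. The sketch below reuses the same pattern with Python's re.search, which is what the search test does under the hood. The sample paths are hypothetical. Note that the pattern also flags later 2.12.x maintenance releases (2.12.2 and up), which as far as we know carry a backported fix, so matches there are candidates for review rather than confirmed hits.

```python
import re

# Same expression the playbook passes to selectattr('path', 'search', ...).
VULNERABLE_CORE = re.compile(r"log4j-core-2\.(0|1[0-4]|[0-9])[\.\-]")


def is_flagged(path: str) -> bool:
    """Return True when the playbook would select this path as vulnerable."""
    return VULNERABLE_CORE.search(path) is not None


# Hypothetical file names, chosen to exercise each branch of the pattern.
SAMPLES = [
    "/opt/apps/lib/log4j-core-2.14.1.jar",
    "/opt/apps/lib/log4j-core-2.9.1.jar",
    "/opt/apps/lib/log4j-core-2.0-beta9.jar",
    "/opt/apps/lib/log4j-core-2.12.4.jar",
    "/opt/apps/lib/log4j-core-2.15.0.jar",
    "/opt/apps/lib/log4j-core-2.17.1.jar",
    "/opt/apps/lib/log4j-api-2.14.1.jar",
]

if __name__ == "__main__":
    for sample in SAMPLES:
        verdict = "flagged" if is_flagged(sample) else "ignored"
        print(f"{sample}: {verdict}")
```

Running a check like this before an emergency rollout costs a minute and shows exactly which versions the playbook will touch.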

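Case 2: MySQL Primary/Replica Failover

This playbook handles MySQL primary/replica failover. When the primary's disk failed at 3 a.m. one night, it completed the switch within 5 minutes:

# playbooks/mysql-failover.yml
# MySQL master-slave failover playbook
# NOTE: vars_prompt values are scoped to the play that prompts for them, so
# pass new_master as an extra var to make it visible to the later plays:
#   ansible-playbook playbooks/mysql-failover.yml -e new_master=<replica-hostname>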
---
- name: Pre-failover checks
  hosts: mysql_replica
  become: true
  gather_facts: true
  any_errors_fatal: true

  vars_prompt:
    - name: confirm_failover
      prompt: "Are you sure you want to perform MySQL failover? (type 'yes' to confirm)"
      private: false

    - name: new_master
      prompt: "Enter the hostname of the new master"
      private: false

  tasks:
    - name: Validate confirmation
      ansible.builtin.assert:
        that:
          - confirm_failover == 'yes'
        fail_msg: "Failover not confirmed. Aborting."

    - name: Validate new master is in replica group
      ansible.builtin.assert:
        that:
          - new_master in groups['mysql_replica']
        fail_msg: "{{ new_master }} is not a valid replica host"

    - name: Check replication status on all replicas
      community.mysql.mysql_replication:
        mode: getreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"
      register: repl_status

    - name: Display replication lag
      ansible.builtin.debug:
        msg: "{{ inventory_hostname }}: Seconds_Behind_Master={{ repl_status.Seconds_Behind_Master | default('N/A') }}"

    - name: Ensure replication is caught up
      ansible.builtin.assert:
        that:
          - repl_status.Seconds_Behind_Master is defined
          - repl_status.Seconds_Behind_Master | int <= 10
        fail_msg: "Replication lag too high on {{ inventory_hostname }}"
      when: inventory_hostname == new_master

- name: Stop writes on old master
  hosts: mysql_primary
  become: true

  tasks:
    - name: Set read_only on old master
      community.mysql.mysql_variables:
        variable: read_only
        value: "ON"
        login_user: root
        login_password: "{{ mysql_root_password }}"

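    - name: Find long running queries
      community.mysql.mysql_query:
        login_user: root
        login_password: "{{ mysql_root_password }}"
        query: |
          SELECT id AS id
          FROM information_schema.processlist
          WHERE command != 'Sleep'
          AND time > 5
          AND user != 'system user'
      register: long_queries

    # The SELECT above only lists the sessions; this task actually terminates them
    - name: Kill long running queries
      community.mysql.mysql_query:
        login_user: root
        login_password: "{{ mysql_root_password }}"
        query: "KILL {{ item.id | int }}"
      loop: "{{ long_queries.query_result[0] | default([]) }}"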

    - name: Wait for all transactions to complete
      ansible.builtin.wait_for:
        timeout: 30

- name: Promote new master
  hosts: "{{ new_master }}"
  become: true

  tasks:
    - name: Stop replication on new master
      community.mysql.mysql_replication:
        mode: stopreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"

    - name: Reset replica configuration
      community.mysql.mysql_replication:
        mode: resetreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"

    - name: Disable read_only on new master
      community.mysql.mysql_variables:
        variable: read_only
        value: "OFF"
        login_user: root
        login_password: "{{ mysql_root_password }}"

    - name: Get master status
      community.mysql.mysql_replication:
        mode: getprimary
        login_user: root
        login_password: "{{ mysql_root_password }}"
      register: new_master_status

    - name: Display new master status
      ansible.builtin.debug:
        msg: |
          New Master: {{ inventory_hostname }}
          Binlog File: {{ new_master_status.File }}
          Binlog Position: {{ new_master_status.Position }}

- name: Reconfigure other replicas
  hosts: mysql_replica:!{{ new_master }}
  become: true
  serial: 1

  tasks:
    - name: Stop replication
      community.mysql.mysql_replication:
        mode: stopreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"

    - name: Point to new master
      community.mysql.mysql_replication:
        mode: changeprimary
        primary_host: "{{ new_master }}"
        primary_user: repl_user
        primary_password: "{{ mysql_repl_password }}"
        primary_log_file: "{{ hostvars[new_master]['new_master_status']['File'] }}"
        primary_log_pos: "{{ hostvars[new_master]['new_master_status']['Position'] }}"
        login_user: root
        login_password: "{{ mysql_root_password }}"

    - name: Start replication
      community.mysql.mysql_replication:
        mode: startreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"

    - name: Verify replication is running
      community.mysql.mysql_replication:
        mode: getreplica
        login_user: root
        login_password: "{{ mysql_root_password }}"
      register: new_repl_status
      retries: 5
      delay: 2
      until:
        - new_repl_status.Slave_IO_Running == 'Yes'
        - new_repl_status.Slave_SQL_Running == 'Yes'

- name: Update DNS and notify
  hosts: localhost
  gather_facts: false

  tasks:
    - name: Update DNS record for mysql-master
      community.general.nsupdate:
        key_name: "ansible-key"
        key_secret: "{{ dns_update_key }}"
        server: "{{ dns_server }}"
        zone: "prod.example.com"
        record: "mysql-master"
        type: "A"
        value: "{{ hostvars[new_master]['ansible_default_ipv4']['address'] }}"
      when: dns_update_enabled | default(false)

    - name: Send notification
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: |
            :rotating_light: MySQL Failover Complete :rotating_light:
            New Master: {{ new_master }}
            Time: {{ now(utc=true, fmt='%Y-%m-%dT%H:%M:%SZ') }}
            Performed by: {{ lookup('env', 'USER') }}
      when: slack_webhook_url is defined

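4. Best Practices and Caveats

4.1 Best Practices

4.1.1 Code Organization

Use fully qualified collection names (FQCN)

Since Ansible 2.10, the official recommendation is to use FQCNs. This avoids module name collisions and makes the code clearer: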

# Good: FQCN makes it clear which module is being used
- name: Install packages
  ansible.builtin.package:
    name: nginx
    state: present

# Bad: Short name could be ambiguous
- name: Install packages
  package:
    name: nginx
    state: present

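Give every task a name

A name is more than a comment: it also appears in the execution log. A good name lets you pinpoint a problem quickly when you are reading logs in the middle of the night: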

# Good: Clear, descriptive names
- name: Install Nginx web server
  ansible.builtin.package:
    name: nginx
    state: present

- name: Deploy Nginx configuration for api.example.com
  ansible.builtin.template:
    src: nginx-api.conf.j2
    dest: /etc/nginx/conf.d/api.conf

# Bad: Vague or missing names
- package:
    name: nginx
    state: present

- name: Deploy config  # Too vague
  template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf

Use tags sensibly

Tags let you run only part of a playbook. Our convention: every role gets one top-level tag, and every task file gets a second-level tag:

# Role-level tag in playbook
- role: nginx
  tags: [nginx]

# Task-level tags in role
- name: Install Nginx
  ansible.builtin.package:
    name: nginx
  tags: [nginx, nginx:install]

- name: Configure Nginx
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  tags: [nginx, nginx:config]

# Usage examples
# ansible-playbook site.yml --tags nginx           # All nginx tasks
# ansible-playbook site.yml --tags nginx:config    # Only config tasks
# ansible-playbook site.yml --skip-tags nginx:install  # Skip installation

4.1.2 Security Practices

Encrypt sensitive data

All passwords, keys, and certificates must be encrypted with Ansible Vault:

# Create vault password script (more secure than plain file)
cat > scripts/vault-password.sh << 'EOF'
#!/bin/bash
# Fetch vault password from HashiCorp Vault or AWS Secrets Manager
# For demo, using environment variable
echo "${ANSIBLE_VAULT_PASSWORD}"
EOF
chmod 700 scripts/vault-password.sh

# Encrypt sensitive files
ansible-vault encrypt inventory/production/group_vars/vault.yml

# Encrypt specific string
ansible-vault encrypt_string 'super_secret_password' --name 'db_password'

Protect sensitive output with no_log

- name: Configure database password
  community.mysql.mysql_user:
    name: app_user
    password: "{{ db_password }}"
    priv: "app_db.*:ALL"
    login_user: root
    login_password: "{{ mysql_root_password }}"
  no_log: true  # Prevents password from appearing in logs

- name: Deploy application config with secrets
  ansible.builtin.template:
    src: app-config.yml.j2
    dest: /opt/app/config.yml
    mode: "0600"
  no_log: true

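Principle of least privilege

# Create dedicated ansible user with limited sudo
# NOTE: Ansible runs modules through a generated wrapper, so `become` itself
# cannot be restricted to a list of commands. Rules like the ones below only
# cover commands the ansible user runs directly, and the trailing "*" on
# apt-get install accepts arbitrary arguments, so keep such rules narrow.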
- name: Create ansible automation user
  ansible.builtin.user:
    name: ansible
    shell: /bin/bash
    create_home: true
    groups: []  # No extra groups by default

- name: Configure sudo for specific commands only
  ansible.builtin.copy:
    content: |
      # Ansible automation user sudo rules
      ansible ALL=(root) NOPASSWD: /usr/bin/systemctl restart nginx
      ansible ALL=(root) NOPASSWD: /usr/bin/systemctl reload nginx
      ansible ALL=(root) NOPASSWD: /usr/bin/apt-get update
      ansible ALL=(root) NOPASSWD: /usr/bin/apt-get install *
    dest: /etc/sudoers.d/ansible
    mode: "0440"
    validate: visudo -cf %s

4.1.3 Performance Tuning

Enable pipelining

Pipelining cuts the number of SSH operations needed per task and can significantly improve performance:

# ansible.cfg
[ssh_connection]
pipelining = True

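Note, however, that requiretty must not be enabled in sudoers on the target servers.

Use fact caching

For server information that rarely changes, caching facts can greatly reduce execution time: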

# ansible.cfg
[defaults]
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts_cache
# 24 hours (ansible.cfg does not accept inline "#" comments after a value)
fact_caching_timeout = 86400

Set a reasonable number of forks

# ansible.cfg
[defaults]
# Adjust based on control node capacity
forks = 50

# In the playbook: for sensitive operations, use serial to limit parallelism
- hosts: databases
  serial: 1  # One at a time for database operations

- hosts: webservers
  serial: "30%"  # 30% of hosts at a time for rolling updates
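
To see how a percentage such as serial: "30%" turns into concrete batches, a small Python sketch can help. It assumes that Ansible rounds the percentage down and never uses a batch smaller than one host; the host names are hypothetical and the helper names are ours, not part of Ansible.

```python
from typing import List, Union


def batch_size(serial: Union[int, str], host_count: int) -> int:
    """Resolve a serial value (host count or percentage string) to a batch size."""
    if isinstance(serial, str) and serial.endswith("%"):
        # Assumption: round down, but never below one host per batch.
        return int(host_count * float(serial[:-1]) / 100.0) or 1
    return int(serial)


def split_batches(hosts: List[str], serial: Union[int, str]) -> List[List[str]]:
    """Split the host list into consecutive batches of the resolved size."""
    size = batch_size(serial, len(hosts))
    return [hosts[i:i + size] for i in range(0, len(hosts), size)]


if __name__ == "__main__":
    # Hypothetical inventory of ten web servers.
    webservers = [f"web{n:02d}" for n in range(1, 11)]
    for batch in split_batches(webservers, "30%"):
        print(batch)
```

With ten hosts the rolling update runs in four passes, which is worth knowing before promising a maintenance window.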

Use asynchronous tasks

Long-running operations can be executed asynchronously:

- name: Run long-running backup task
  ansible.builtin.shell: /opt/scripts/full-backup.sh
  async: 3600  # Allow up to 1 hour
  poll: 0  # Don't wait, fire and forget
  register: backup_job

- name: Wait for backup to complete
  ansible.builtin.async_status:
    jid: "{{ backup_job.ansible_job_id }}"
  register: backup_result
  until: backup_result.finished
  retries: 60
  delay: 60

4.1.4 Error Handling

Use block/rescue/always

- name: Deploy application with rollback capability
  block:
    - name: Backup current version
      ansible.builtin.copy:
        src: /opt/app/current/
        dest: /opt/app/backup-{{ ansible_date_time.epoch }}/
        remote_src: true

    - name: Deploy new version
      ansible.builtin.unarchive:
        src: "{{ app_artifact_url }}"
        dest: /opt/app/current/
        remote_src: true

    - name: Restart application
      ansible.builtin.systemd:
        name: myapp
        state: restarted

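    - name: Verify application health
      ansible.builtin.uri:
        url: http://127.0.0.1:8080/health
        status_code: 200
      # retries and delay are task keywords, not uri parameters, and need until
      register: health_check
      until: health_check.status == 200
      retries: 5
      delay: 10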

  rescue:
    - name: Restore from backup
      ansible.builtin.copy:
        src: /opt/app/backup-{{ ansible_date_time.epoch }}/
        dest: /opt/app/current/
        remote_src: true

    - name: Restart application with old version
      ansible.builtin.systemd:
        name: myapp
        state: restarted

    - name: Send failure notification
      ansible.builtin.uri:
        url: "{{ slack_webhook_url }}"
        method: POST
        body_format: json
        body:
          text: "Deployment failed on {{ inventory_hostname }}, rolled back to previous version"

  always:
    - name: Clean up old backups
      ansible.builtin.find:
        paths: /opt/app/
        patterns: "backup-*"
        file_type: directory  # backups are directories; find matches only files by default
        age: 7d
      register: old_backups

    - name: Remove old backups
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_backups.files }}"

4.2 Caveats

4.2.1 Idempotency Pitfalls

The most common problem is that the shell and command modules are not idempotent:

# Bad: Not idempotent, runs every time
- name: Add line to file
  ansible.builtin.shell: echo "export PATH=/opt/bin:$PATH" >> /etc/profile

# Good: Idempotent, only adds if not present
- name: Add PATH to profile
  ansible.builtin.lineinfile:
    path: /etc/profile
    line: 'export PATH=/opt/bin:$PATH'
    state: present

# If you must use shell, add creates/removes conditions
- name: Initialize database
  ansible.builtin.shell: /opt/db/init.sh
  args:
    creates: /opt/db/.initialized  # Only runs if this file doesn't exist
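
The difference between the two approaches is easiest to see by running each one several times. The following Python sketch is only an illustration of the idea, not how lineinfile is implemented: append_line behaves like the shell redirect, and ensure_line writes only when the line is missing and reports whether it changed anything. It assumes the file ends with a newline.

```python
import tempfile
from pathlib import Path


def append_line(path: Path, line: str) -> None:
    """Shell-style `echo ... >> file`: appends on every call."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")


def ensure_line(path: Path, line: str) -> bool:
    """Declarative style: append only if missing; return True when changed."""
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    if line in existing:
        return False
    append_line(path, line)
    return True


def demo() -> tuple:
    """Run both approaches three times and compare the resulting files."""
    line = "export PATH=/opt/bin:$PATH"
    with tempfile.TemporaryDirectory() as tmp:
        naive, managed = Path(tmp) / "naive", Path(tmp) / "managed"
        for _ in range(3):
            append_line(naive, line)
        changes = [ensure_line(managed, line) for _ in range(3)]
        return (
            len(naive.read_text(encoding="utf-8").splitlines()),
            len(managed.read_text(encoding="utf-8").splitlines()),
            changes,
        )


if __name__ == "__main__":
    print(demo())
```

The list of booleans is the equivalent of Ansible's changed status: only the first run reports a change, which is what makes repeated runs safe.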

4.2.2 Variable Precedence Confusion

Ansible has 22 levels of variable precedence, and they are easy to mix up. The guiding principles are:

# Priority (simplified, high to low):
# 1. Extra vars (-e)           - Use for one-time overrides
# 2. Task vars                 - Avoid, hard to track
# 3. Block vars                - Avoid, hard to track
# 4. Role vars (vars/main.yml) - Use for internal role variables
# 5. Host vars                 - Use for host-specific settings
# 6. Group vars                - Use for group-specific settings
# 7. Role defaults             - Use for user-overridable defaults

# Our convention:
# - defaults/main.yml: All variables that users might want to override
# - vars/main.yml: Internal variables that shouldn't be changed
# - group_vars/: Environment and group-specific values
# - host_vars/: Host-specific values only
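
One way to build intuition for the simplified list above is to model it as layered dictionaries, where each higher level overwrites the ones below it. This Python sketch covers only the seven simplified levels, not Ansible's full precedence rules, and the variable names (nginx_port, nginx_workers) are hypothetical.

```python
from typing import Dict, Optional

# Simplified precedence, lowest first (the list above, reversed).
PRECEDENCE = [
    "role_defaults",
    "group_vars",
    "host_vars",
    "role_vars",
    "block_vars",
    "task_vars",
    "extra_vars",
]


def resolve(sources: Dict[str, dict]) -> dict:
    """Merge all levels so that higher-precedence values overwrite lower ones."""
    merged: dict = {}
    for level in PRECEDENCE:
        merged.update(sources.get(level, {}))
    return merged


def winner(sources: Dict[str, dict], name: str) -> Optional[str]:
    """Return the level whose value wins for the given variable, if any."""
    for level in reversed(PRECEDENCE):
        if name in sources.get(level, {}):
            return level
    return None


if __name__ == "__main__":
    sources = {
        "role_defaults": {"nginx_port": 80, "nginx_workers": 2},
        "group_vars": {"nginx_port": 8080},
        "extra_vars": {"nginx_workers": 8},
    }
    print(resolve(sources))
    print(winner(sources, "nginx_port"))
```

A helper like winner answers the question that usually matters during debugging: not what the value is, but where it came from.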

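4.2.3 Handler Execution Order

Handlers run only after all tasks in the current section (pre_tasks, tasks, or post_tasks) have finished, and each notified handler runs only once no matter how many tasks notified it: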

# Problem: Handler runs at the end, not immediately
- name: Deploy config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Reload nginx

- name: Deploy SSL cert  # This runs before nginx reload!
  ansible.builtin.copy:
    src: ssl.crt
    dest: /etc/nginx/ssl/

# Solution: Use meta flush_handlers
- name: Deploy config
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Reload nginx

- name: Flush handlers
  ansible.builtin.meta: flush_handlers  # Force handler to run now

- name: Deploy SSL cert
  ansible.builtin.copy:
    src: ssl.crt
    dest: /etc/nginx/ssl/
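
The queueing behaviour can be sketched in a few lines of Python. This is a toy model under stated assumptions: it only tracks one handler, it ignores the fact that real Ansible runs handlers in the order they are defined rather than the order they are notified, and the "Deploy vhost" task is a hypothetical addition to show deduplication.

```python
from typing import List, Optional


class HandlerQueue:
    """Toy model of notify/flush: handlers are queued, deduplicated, then run."""

    def __init__(self) -> None:
        self.pending: List[str] = []
        self.log: List[str] = []

    def run_task(self, name: str, changed: bool, notify: Optional[str] = None) -> None:
        self.log.append(f"task: {name}")
        # Only changed tasks notify, and a handler is queued at most once.
        if changed and notify and notify not in self.pending:
            self.pending.append(notify)

    def flush(self) -> None:
        """Equivalent of meta: flush_handlers, or the end of a task section."""
        for handler in self.pending:
            self.log.append(f"handler: {handler}")
        self.pending.clear()


def simulate(flush_midway: bool) -> List[str]:
    queue = HandlerQueue()
    queue.run_task("Deploy config", changed=True, notify="Reload nginx")
    queue.run_task("Deploy vhost", changed=True, notify="Reload nginx")
    if flush_midway:
        queue.flush()
    queue.run_task("Deploy SSL cert", changed=True)
    queue.flush()  # implicit flush at the end of the section
    return queue.log


if __name__ == "__main__":
    print(simulate(flush_midway=False))
    print(simulate(flush_midway=True))
```

In both runs the handler appears exactly once; the only thing flush_handlers changes is where in the sequence it appears.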

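4.2.4 YAML Indentation in Templates

Generating YAML from Jinja2 templates is especially error-prone. The template module enables trim_blocks by default, which already removes the newline after each block tag, so extra whitespace control can strip more than intended:

{# Bad: with trim_blocks on, the extra "-" also eats the newlines and the list collapses onto one line #}
servers:
{%- for server in backend_servers %}
  - {{ server }}
{%- endfor %}

{# Good: plain block tags at column 0; trim_blocks removes the newline after each tag #}
servers:
{% for server in backend_servers %}
  - {{ server }}
{% endfor %}

{# Or let a filter produce the YAML; first=true also indents the first line #}
servers:
{{ backend_servers | to_nice_yaml | indent(2, first=true) }}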

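4.2.5 Check Mode Compatibility

Some tasks (command and shell, for example) are skipped in check mode, which breaks later tasks that depend on their registered results, so they need special handling: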

- name: Get current user
  ansible.builtin.command: whoami
  register: current_user
  changed_when: false
  check_mode: false  # Always run, even in check mode

- name: Task that depends on registered variable
  ansible.builtin.debug:
    msg: "Running as {{ current_user.stdout }}"

5. Troubleshooting and Monitoring

5.1 Troubleshooting

5.1.1 Debugging Techniques

Enable verbose output

# Verbosity levels
ansible-playbook site.yml -v      # Show task results
ansible-playbook site.yml -vv     # Show task input parameters
ansible-playbook site.yml -vvv    # Show connection debugging
ansible-playbook site.yml -vvvv   # Show connection plugin debugging (very verbose)

Use the debug module

- name: Debug variable content
  ansible.builtin.debug:
    var: some_variable

- name: Debug with message
  ansible.builtin.debug:
    msg: "Value is {{ some_variable }} and type is {{ some_variable | type_debug }}"

- name: Debug all variables for a host
  ansible.builtin.debug:
    var: hostvars[inventory_hostname]

Validate with assert

- name: Validate prerequisites
  ansible.builtin.assert:
    that:
      - ansible_distribution == "Ubuntu"
      - ansible_distribution_major_version | int >= 20
      - ansible_memtotal_mb >= 4096
    fail_msg: "Host {{ inventory_hostname }} does not meet requirements"
    success_msg: "All prerequisites met"

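5.1.2 Common Problems

SSH connection problems

# Test SSH connection manually
ssh -vvv ansible@target-host

# Common workaround in ansible.cfg. Use it in lab or test environments only:
# it disables host key verification (man-in-the-middle risk) and replaces the
# default ssh_args, which enable ControlMaster/ControlPersist connection reuse.
[ssh_connection]
ssh_args = -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null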

Privilege problems

# Check if become is working
- name: Test privilege escalation
  ansible.builtin.command: whoami
  become: true
  register: whoami_result

- name: Show result
  ansible.builtin.debug:
    msg: "Running as {{ whoami_result.stdout }}"

Module not found

# Check if collection is installed
ansible-galaxy collection list

# Install missing collection
ansible-galaxy collection install community.general

5.1.3 Playbook Debugging Workflow

# Step 1: Syntax check
ansible-playbook site.yml --syntax-check

# Step 2: List tasks without running
ansible-playbook site.yml --list-tasks

# Step 3: List hosts that would be affected
ansible-playbook site.yml --list-hosts

# Step 4: Dry run with diff
ansible-playbook site.yml --check --diff

# Step 5: Run on single host first
ansible-playbook site.yml --limit web01.example.com

# Step 6: Step through interactively
ansible-playbook site.yml --step

# Step 7: Start at specific task
ansible-playbook site.yml --start-at-task="Deploy Nginx configuration"

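5.2 Performance Monitoring

5.2.1 Callback Plugins

Use callback plugins to monitor playbook execution:

# ansible.cfg
[defaults]
# callbacks_enabled is the current setting name; the older callback_whitelist is deprecated
callbacks_enabled = timer, profile_tasks, profile_roles
stdout_callback = yaml

Sample profile_tasks output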

PLAY RECAP *********************************************************************
web01.example.com          : ok=25   changed=3    unreachable=0    failed=0    skipped=5    rescued=0    ignored=0

Thursday 19 December 2024  15:30:45 +0800 (0:00:01.234)       0:02:15.678 ******
===============================================================================
nginx : Deploy Nginx configuration ------------------------------------ 45.23s
common : Install essential packages ----------------------------------- 32.15s
nginx : Install Nginx packages ---------------------------------------- 28.67s
common : Configure sysctl parameters ---------------------------------- 12.34s

5.2.2 Custom Execution Report

# playbooks/generate-report.yml
---
- name: Generate execution report
  hosts: localhost
  # Facts are required here: the tasks below use ansible_date_time
  gather_facts: true

  tasks:
    - name: Create report directory
      ansible.builtin.file:
        path: "{{ playbook_dir }}/../reports/{{ ansible_date_time.date }}"
        state: directory

    - name: Generate HTML report
      ansible.builtin.template:
        src: execution-report.html.j2
        dest: "{{ playbook_dir }}/../reports/{{ ansible_date_time.date }}/report-{{ ansible_date_time.epoch }}.html"

5.2.3 Prometheus Integration

Use the Pushgateway to record Ansible execution metrics:

# callback_plugins/prometheus_metrics.py
from ansible.plugins.callback import CallbackBase
import requests
import time

class CallbackModule(CallbackBase):
    CALLBACK_VERSION = 2.0
    CALLBACK_TYPE = 'aggregate'
    CALLBACK_NAME = 'prometheus_metrics'

    def __init__(self):
        super(CallbackModule, self).__init__()
        self.start_time = None
        self.task_times = {}
        self.host_results = {}

    def v2_playbook_on_start(self, playbook):
        self.start_time = time.time()
        self.playbook_name = playbook._file_name

    def v2_playbook_on_stats(self, stats):
        duration = time.time() - self.start_time

        metrics = []
        metrics.append(f'ansible_playbook_duration_seconds{{playbook="{self.playbook_name}"}} {duration}')

        for host in stats.processed:
            summary = stats.summarize(host)
            for status in ['ok', 'changed', 'failures', 'skipped']:
                metrics.append(f'ansible_host_{status}{{host="{host}",playbook="{self.playbook_name}"}} {summary[status]}')

        # Push to Prometheus Pushgateway
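        try:
            # The text exposition format must end with a newline, otherwise
            # the Pushgateway rejects the payload.
            response = requests.post(
                'http://pushgateway:9091/metrics/job/ansible',
                data='\n'.join(metrics) + '\n',
                timeout=5,
            )
            response.raise_for_status()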
        except Exception as e:
            self._display.warning(f"Failed to push metrics: {e}")
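
The payload format is the part most likely to go wrong, and it can be tested without a Pushgateway. The sketch below builds the same metric lines as the callback as a pure function, adds label escaping, and ends the body with a newline. The function names and the sample host statistics are hypothetical, not part of the plugin above.

```python
from typing import Dict


def escape_label(value: str) -> str:
    """Escape backslashes, double quotes, and newlines inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(playbook: str, duration: float, host_stats: Dict[str, dict]) -> str:
    """Render playbook and per-host metrics in the Prometheus text format."""
    name = escape_label(playbook)
    lines = [f'ansible_playbook_duration_seconds{{playbook="{name}"}} {duration}']
    for host, summary in sorted(host_stats.items()):
        label = escape_label(host)
        for status in ("ok", "changed", "failures", "skipped"):
            lines.append(
                f'ansible_host_{status}{{host="{label}",playbook="{name}"}} {summary[status]}'
            )
    # The exposition format requires the last line to end with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    stats = {"web01": {"ok": 3, "changed": 1, "failures": 0, "skipped": 2}}
    print(render_metrics("site.yml", 1.5, stats), end="")
```

Keeping the rendering separate from the HTTP call also makes it easy to unit test the callback's output before enabling it in production.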

5.3 Backup and Recovery

5.3.1 Configuration Backup

# roles/common/tasks/backup.yml
---
- name: Create backup directory
  ansible.builtin.file:
    path: /opt/backup/ansible/{{ ansible_date_time.date }}
    state: directory
    mode: "0700"

- name: Backup critical configurations
  ansible.builtin.archive:
    path:
      - /etc/nginx
      - /etc/mysql
      - /etc/redis
      - /etc/ssh/sshd_config
      - /etc/sysctl.d
    dest: /opt/backup/ansible/{{ ansible_date_time.date }}/config-backup.tar.gz
    format: gz

- name: Sync backup to central storage
  ansible.posix.synchronize:
    src: /opt/backup/ansible/
    dest: "{{ backup_server }}:/backup/{{ inventory_hostname }}/"
    mode: push
    delete: false
    recursive: true
  delegate_to: "{{ inventory_hostname }}"

5.3.2 Disaster Recovery Playbook

# playbooks/disaster-recovery.yml
---
- name: Disaster recovery - rebuild server from scratch
  hosts: "{{ target_host }}"
  become: true
  gather_facts: true

  vars_prompt:
    - name: confirm_rebuild
      prompt: "This will rebuild the server from scratch. Type 'REBUILD' to confirm"
      private: false

  pre_tasks:
    - name: Validate confirmation
      ansible.builtin.assert:
        that:
          - confirm_rebuild == 'REBUILD'
        fail_msg: "Rebuild not confirmed"

  tasks:
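    # NOTE: with remote_src: true, unarchive expects src to be a path on the
    # target host or a URL; an rsync-style "host:/path" is not fetched for you,
    # so the archive has to be staged on the target before this task runs.
    - name: Restore from backup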
      ansible.builtin.unarchive:
        src: "{{ backup_server }}:/backup/{{ inventory_hostname }}/latest/config-backup.tar.gz"
        dest: /
        remote_src: true
      when: restore_from_backup | default(false)

    - name: Apply full configuration
      ansible.builtin.include_role:
        name: "{{ item }}"
      loop:
        - common
        - "{{ server_role }}"

  post_tasks:
    - name: Verify all services
      ansible.builtin.service:
        name: "{{ item }}"
        state: started
      loop: "{{ required_services }}"

    - name: Run health checks
      ansible.builtin.uri:
        url: "http://127.0.0.1:{{ item.port }}{{ item.path }}"
        status_code: 200
      loop: "{{ health_endpoints }}"

5.3.3 Backing Up the Ansible Control Node

The control node's configuration needs backing up too, and all Ansible code is managed in Git:

#!/bin/bash
# scripts/backup-control-node.sh
# Backup Ansible control node configuration

# Abort on unset variables so a typo cannot expand to an empty path
set -u

BACKUP_DIR="/opt/backup/ansible-control"
DATE=$(date +%Y%m%d)

# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"

# Backup Ansible configuration
tar -czf "$BACKUP_DIR/$DATE/ansible-config.tar.gz" \
    /etc/ansible \
    ~/.ansible \
    ~/.ssh/ansible_* \
    2>/dev/null

# Backup custom plugins
tar -czf "$BACKUP_DIR/$DATE/ansible-plugins.tar.gz" \
    /usr/share/ansible/plugins \
    2>/dev/null

# Commit Ansible repository (stop if the checkout is missing, so git never
# runs in the wrong directory)
cd /opt/ansible-infrastructure || exit 1
git add -A
git commit -m "Backup: $DATE" || true
git push origin main

# Cleanup old backups (keep 30 days); only top-level date directories are considered
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

echo "Backup completed: $BACKUP_DIR/$DATE"
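
Retention rules that delete data deserve a dry run. As a rough sketch, the cleanup step can be modelled in Python so that the list of directories to be pruned can be inspected before anything is removed. It assumes that only top-level directories under the backup root count and that age is measured by modification time; find -mtime +30 rounds ages down to whole days, so its boundary differs slightly. The directory names are hypothetical.

```python
import os
import tempfile
import time
from pathlib import Path
from typing import List

SECONDS_PER_DAY = 86400


def expired_backups(root: Path, keep_days: int, now: float) -> List[Path]:
    """Return top-level backup directories last modified more than keep_days ago."""
    cutoff = now - keep_days * SECONDS_PER_DAY
    return sorted(
        entry for entry in root.iterdir()
        if entry.is_dir() and entry.stat().st_mtime < cutoff
    )


def demo() -> List[str]:
    """Build a throwaway backup tree and report which directories would be pruned."""
    now = time.time()
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name, age_days in (("20240101", 40), ("20240205", 5)):
            backup = root / name
            backup.mkdir()
            stamp = now - age_days * SECONDS_PER_DAY
            os.utime(backup, (stamp, stamp))  # backdate the directory
        return [path.name for path in expired_backups(root, keep_days=30, now=now)]


if __name__ == "__main__":
    print(demo())
```

Printing the candidates first and deleting only after review is a cheap safeguard for a script that runs unattended.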

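6. Summary

Writing Ansible playbooks is easy to start and hard to master. Our team went from "as long as it runs" to the conventions described here, and stepped into countless pitfalls along the way. The most painful was the Friday evening incident described at the start of this article, which is what made us commit three months to writing the conventions down.

Looking back on these years of practice, the most important points are:

Structured thinking: put everything where it belongs. Variables go in group_vars, sensitive data is encrypted with Vault, and reusable logic is extracted into roles. That way both you, coming back three months later, and a new colleague taking over can understand it quickly.

Idempotency awareness: for every task you write, ask yourself what happens if it runs twice. If the answer is "something breaks", it has to change. Use the shell module less, and declarative modules such as lineinfile and template more.

Security first: never store passwords in plain text, manage SSH keys carefully, and keep sudo privileges minimal. Security feels like a nuisance before an incident and a bitter regret after one.

Continuous improvement: conventions are not set in stone. Hold a retrospective after every incident and turn the lessons into conventions. Our playbook library is still evolving: we review it every quarter, clear out what is obsolete, and add new best practices.

On the road to automated operations, Ansible is only the starting point. Once you have mastered it, you can move on to infrastructure-as-code tools such as Terraform and Pulumi, or go deeper into Kubernetes' declarative configuration. The core idea is the same throughout: manage everything as code so that operations work is repeatable, traceable, and collaborative.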

Appendix

A. Command Quick Reference

# Inventory operations
ansible-inventory --list -i inventory/production/hosts.yml
ansible-inventory --graph -i inventory/production/hosts.yml
ansible all -m ping -i inventory/production/hosts.yml

# Playbook operations
ansible-playbook site.yml --syntax-check
ansible-playbook site.yml --list-tasks
ansible-playbook site.yml --list-hosts
ansible-playbook site.yml --check --diff
ansible-playbook site.yml --tags nginx
ansible-playbook site.yml --skip-tags install
ansible-playbook site.yml --limit webservers
ansible-playbook site.yml --start-at-task="Deploy config"

# Vault operations
ansible-vault create secrets.yml
ansible-vault edit secrets.yml
ansible-vault encrypt secrets.yml
ansible-vault decrypt secrets.yml
ansible-vault encrypt_string 'password' --name 'db_password'
ansible-vault rekey secrets.yml

# Galaxy operations
ansible-galaxy init my_role
ansible-galaxy install -r requirements.yml
ansible-galaxy collection install community.general

# Ad-hoc commands
ansible webservers -m shell -a "uptime"
ansible databases -m service -a "name=mysql state=restarted"
ansible all -m setup -a "filter=ansible_distribution*"

B. Ansible Lint Rules

# .ansible-lint
---
profile: production

exclude_paths:
  - .cache/
  - .git/
  - test/

skip_list:
  - yaml[line-length]
  - no-changed-when

warn_list:
  - command-instead-of-shell
  - risky-shell-pipe

enable_list:
  - fqcn-builtins
  - no-same-owner

use_default_rules: true

verbosity: 1

C. Recommended Collections

# requirements.yml
---
collections:
  - name: ansible.posix
    version: ">=1.5.0"
  - name: community.general
    version: ">=8.0.0"
  - name: community.mysql
    version: ">=3.8.0"
  - name: community.docker
    version: ">=3.4.0"
  - name: community.crypto
    version: ">=2.16.0"
  - name: amazon.aws
    version: ">=7.0.0"
  - name: google.cloud
    version: ">=1.3.0"

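D. Project Template

We maintain an Ansible project template on GitHub that contains all the conventions and sample code described in this article. New projects can start by forking the template: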

# Clone template and start new project
git clone https://github.com/your-org/ansible-template.git my-infrastructure
cd my-infrastructure
./scripts/init-project.sh
