Deploying Prometheus + node-exporter with Ansible
A simple deployment of the Prometheus monitoring system.
Install Ansible with yum
yum install ansible
The Ansible hosts file
[alertmanagers]
10.9.119.1
[prometheus]
10.9.119.1
[node-exporter]
10.9.119.1
10.9.119.2
10.9.119.3
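For reference, the same three groups could also be written as a YAML inventory, which Ansible accepts alongside the INI format above. This is only a sketch: the file name hosts.yml is an assumption, and it would be passed with `-i hosts.yml`.

```yaml
# hosts.yml (hypothetical file name): YAML form of the INI inventory above
all:
  children:
    alertmanagers:
      hosts:
        10.9.119.1:
    prometheus:
      hosts:
        10.9.119.1:
    node-exporter:
      hosts:
        10.9.119.1:
        10.9.119.2:
        10.9.119.3:
```

Recent Ansible versions may warn that a group name containing a hyphen is not a valid variable name. The name is kept here because the templates in this post look up groups['node-exporter'].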
The file layout is as follows:

prometheus
prometheus.yml
- hosts: prometheus
  remote_user: root
  tasks:
    - name: create dir
      file:
        path: /opt/prometheus
        state: directory  # create the directory if it does not exist
    - name: copy file
      unarchive:
        src: prometheus-2.24.0.linux-amd64.tar.gz
        dest: /opt/prometheus
    - name: create link
      file:
        src: /opt/prometheus/prometheus-2.24.0.linux-amd64
        dest: /opt/prometheus/prometheus
        state: link  # symbolic link
    - name: copy service file
      template:
        src: prometheus.service.j2
        dest: /usr/lib/systemd/system/prometheus.service
    - name: copy config yaml
      template:
        src: prometheus.yml.j2
        dest: /opt/prometheus/prometheus/prometheus.yml
      notify:
        - restart prometheus
    - name: create rules dir
      file:
        path: /opt/prometheus/prometheus/rules
        state: directory
    - name: copy rules yaml  # node.yml contains special characters ({{ }} expressions), so copy is used instead of template
      copy:
        src: node.yml
        dest: /opt/prometheus/prometheus/rules/node.yml
      notify:  # this triggers the handlers section
        - restart prometheus
    - name: start prometheus
      service:
        name: prometheus
        state: started
        enabled: yes
  handlers:
    - name: restart prometheus
      service:
        name: prometheus
        state: restarted
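One thing the playbook above does not cover is what happens when the unit file itself changes on a later run: systemd normally needs a daemon-reload before it picks up an edited unit. A minimal sketch of one way to handle that, assuming the same paths as above (the handler name "reload systemd" is made up for illustration):

```yaml
- hosts: prometheus
  remote_user: root
  tasks:
    - name: copy service file
      template:
        src: prometheus.service.j2
        dest: /usr/lib/systemd/system/prometheus.service
      notify:
        - reload systemd      # hypothetical handler name
        - restart prometheus
  handlers:
    # Handlers run in the order they are defined here,
    # so systemd is reloaded before the service is restarted.
    - name: reload systemd
      systemd:
        daemon_reload: yes
    - name: restart prometheus
      service:
        name: prometheus
        state: restarted
```

The handlers only fire when the template task reports a change, so an unchanged unit file causes no restart.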
prometheus.service.j2 (the copy module would also work; template is used here)
[Unit]
Description=Prometheus
Documentation=
After=network.target
[Service]
WorkingDirectory=/opt/prometheus/prometheus
ExecStart=/opt/prometheus/prometheus/prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -KILL $MAINPID
Type=simple
KillMode=control-group
Restart=on-failure
RestartSec=3s
[Install]
WantedBy=multi-user.target
prometheus.yml.j2
# Global configuration
global:
  scrape_interval: 30s  # how often targets are scraped
  evaluation_interval: 30s  # how often the rule engine evaluates rules
  query_log_file: ./promql.log
# Alerting configuration
alerting:
  alertmanagers:  # Alertmanager configuration
    - static_configs:  # static Alertmanager configuration
        - targets:  # Alertmanager instances that alerts are sent to
{% for alertmanager in groups['alertmanagers'] %}
            - {{ alertmanager }}:9093
{% endfor %}
rule_files:  # rule file configuration
  - "rules/*.yml"
scrape_configs:  # scrape configuration
  - job_name: 'prometheus'  # job name, used to group scrape targets
    static_configs:  # static scrape targets
      - targets:
{% for prometheu in groups['prometheus'] %}
          - "{{ prometheu }}:9090"  # scrape target
{% endfor %}
  - job_name: "node"
    static_configs:
      - targets:
{% for node in groups['node-exporter'] %}
          - "{{ node }}:9100"
{% endfor %}
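To make the loops concrete, this is roughly what the alerting and scrape sections should look like once the template is rendered against the hosts file from the beginning of this post. It is a sketch that assumes standard YAML nesting and the template module's default whitespace handling; comments are omitted.

```yaml
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 10.9.119.1:9093
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets:
          - "10.9.119.1:9090"
  - job_name: "node"
    static_configs:
      - targets:
          - "10.9.119.1:9100"
          - "10.9.119.2:9100"
          - "10.9.119.3:9100"
```

Adding a host to the node-exporter group and re-running the playbook is then enough to have it scraped.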
The node alerting rules file, node.yml
groups:
  - name: node.rules  # name of the alerting rule group
    rules:
      - alert: node is Down
        expr: up == 0
        for: 30s  # duration: the alert fires if the target cannot be scraped for 30 seconds
        labels:
          severity: serious  # custom label: serious
        annotations:
          summary: "Instance {{ $labels.instance }} down"  # custom summary
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 seconds."  # custom description
      - alert: node Filesystem
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: {{$labels.mountpoint }} partition usage is too high"
          description: "{{$labels.instance}}: {{$labels.mountpoint }} partition usage is above 80% (current value: {{ $value }})"
      - alert: node Memory
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: memory usage is too high"
          description: "{{$labels.instance}}: memory usage is above 80% (current value: {{ $value }})"
      - alert: node CPU
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{$labels.instance}}: CPU usage is too high"
          description: "{{$labels.instance}}: CPU usage is above 80% (current value: {{ $value }})"
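A broken rules file will stop Prometheus from starting after the restart, so it can be worth checking the file before it is put in place. A possible variant of the "copy rules yaml" task, assuming promtool was unpacked next to the prometheus binary (it ships in the same tarball) and that the symlink from the playbook already exists:

```yaml
- hosts: prometheus
  remote_user: root
  tasks:
    - name: copy rules yaml
      copy:
        src: node.yml
        dest: /opt/prometheus/prometheus/rules/node.yml
        # %s is replaced by the path of a temporary copy of the file;
        # the copy only happens if promtool exits successfully.
        validate: /opt/prometheus/prometheus/promtool check rules %s
      notify:
        - restart prometheus
  handlers:
    - name: restart prometheus
      service:
        name: prometheus
        state: restarted
```

If validation fails, the task fails, the old rules file stays in place, and the restart handler is not triggered.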
node-exporter
node-exporter.yml
- hosts: node-exporter
  remote_user: root
  tasks:
    - name: create dir
      file:
        path: /opt/prometheus
        state: directory
    - name: copy file
      unarchive:
        src: node_exporter-1.0.1.linux-amd64.tar.gz
        dest: /opt/prometheus
    - name: create link
      file:
        src: /opt/prometheus/node_exporter-1.0.1.linux-amd64
        dest: /opt/prometheus/node_exporter
        state: link
    - name: copy service file
      template:
        src: node_exporter.service.j2
        dest: /usr/lib/systemd/system/node_exporter.service
    - name: start node_exporter
      service:
        name: node_exporter
        state: restarted
        enabled: yes
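The playbook ends as soon as systemd accepts the restart, which does not prove that the exporter is serving metrics. A small follow-up play could check that, assuming node_exporter listens on its default port 9100 (the same port the Prometheus template scrapes); the task names are made up for illustration.

```yaml
- hosts: node-exporter
  remote_user: root
  tasks:
    - name: wait for node_exporter port  # hypothetical task name
      wait_for:
        port: 9100
        timeout: 30
    - name: check metrics endpoint  # hypothetical task name
      uri:
        url: http://127.0.0.1:9100/metrics
        status_code: 200
```

Both tasks run on the managed host itself, so they work even if port 9100 is not reachable from the control node.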
node_exporter.service.j2
[Unit]
Description=Node Exporter
Documentation=
After=network.target
[Service]
WorkingDirectory=/opt/prometheus/node_exporter/
ExecStart=/opt/prometheus/node_exporter/node_exporter
ExecStop=/bin/kill -KILL $MAINPID
Type=simple
KillMode=control-group
Restart=on-failure
RestartSec=3s
[Install]
WantedBy=multi-user.target
alertmanager
alertmanager.yaml
- hosts: alertmanagers
  remote_user: root
  tasks:
    - name: create dir
      file:
        path: /opt/prometheus
        state: directory
    - name: copy file
      unarchive:
        src: alertmanager-0.21.0.linux-amd64.tar.gz
        dest: /opt/prometheus
    - name: create link
      file:
        src: /opt/prometheus/alertmanager-0.21.0.linux-amd64
        dest: /opt/prometheus/alertmanager
        state: link
    - name: copy service file
      template:
        src: alertmanager.service.j2
        dest: /usr/lib/systemd/system/alertmanager.service
    - name: copy config yaml
      template:
        src: alertmanager.yml.j2
        dest: /opt/prometheus/alertmanager/alertmanager.yml
      notify:
        - restart alertmanager
    - name: start server
      service:
        name: alertmanager
        state: restarted
        enabled: yes
  handlers:
    - name: restart alertmanager
      service:
        name: alertmanager
        state: restarted
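The version string appears in both the tarball name and the symlink source, so an upgrade means editing two places. One way to reduce that, sketched here with a play-level variable (alertmanager_version is a made-up name, not something the playbook above defines):

```yaml
- hosts: alertmanagers
  remote_user: root
  vars:
    alertmanager_version: "0.21.0"  # hypothetical variable
  tasks:
    - name: create dir
      file:
        path: /opt/prometheus
        state: directory
    - name: copy file
      unarchive:
        src: "alertmanager-{{ alertmanager_version }}.linux-amd64.tar.gz"
        dest: /opt/prometheus
    - name: create link
      file:
        src: "/opt/prometheus/alertmanager-{{ alertmanager_version }}.linux-amd64"
        dest: /opt/prometheus/alertmanager
        state: link
```

The same pattern would apply to the Prometheus and node_exporter playbooks, and it maps naturally onto role defaults if the playbooks are later converted to roles.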
alertmanager.service.j2
[Unit]
Description=AlertManager
Documentation=
After=network.target
[Service]
WorkingDirectory=/opt/prometheus/alertmanager/
ExecStart=/opt/prometheus/alertmanager/alertmanager
ExecReload=/bin/kill -HUP $MAINPID
ExecStop=/bin/kill -KILL $MAINPID
Type=simple
KillMode=control-group
Restart=on-failure
RestartSec=3s
[Install]
WantedBy=multi-user.target
alertmanager.yml.j2 (email alerting is used here)
global:
  resolve_timeout: 5m  # time after which an alert is declared resolved if it has not been updated
  smtp_from: "123456789@qq.com"
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_auth_username: "123456789@qq.com"  # mailbox account
  smtp_auth_password: "bcvizcgqbgojjjeb"  # authorization code, not the QQ password
  smtp_require_tls: false  # port 465 is used, so this is set to false
route:
  group_by: ['alertname']  # label(s) used to group alerts
  group_wait: 10s  # how long to wait before sending the first notification for a new group
  group_interval: 10s  # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 24h  # how long to wait before re-sending a notification for alerts that are still firing (default 4h)
  receiver: 'default-receiver'  # default receiver
  # Alerts that match none of the child routes below stay at the root node and are sent to 'default-receiver'
  routes:  # child routes
    - receiver: 'db'
      group_wait: 10s
      match_re:
        # a regular expression covering two services, routed to db
        service: mysql|redis  # all alerts with service=mysql or service=redis go to the db receiver
    - receiver: 'web'
      group_by: [product, environment]  # group by the product and environment labels
      match:
        team: frontend  # all alerts labeled team=frontend go to web
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '123456789@qq.com'  # alert recipient
  - name: 'db'
    # send alerts by email
    email_configs:
      - to: '111111111@qq.com'
  - name: 'web'
    email_configs:
      - to: '222222222@qq.com'
inhibit_rules:  # inhibition: when both fire, the critical alert suppresses the warning, so only the critical one is sent
  - source_match:
      severity: 'critical'  # critical alerts inhibit warning-level alerts
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
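Routing is driven entirely by alert labels, so an alert only reaches the 'db' receiver if its rule attaches a matching service label. A hypothetical rule illustrating that (the group name, alert name and the "mysql" scrape job are assumptions, not part of the setup above):

```yaml
groups:
  - name: mysql.rules  # hypothetical rule group
    rules:
      - alert: mysql is Down  # hypothetical alert
        expr: up{job="mysql"} == 0  # assumes a scrape job named "mysql" exists
        for: 30s
        labels:
          severity: critical
          service: mysql  # matches match_re "service: mysql|redis", so it goes to 'db'
```

The node rules shown earlier carry neither a service nor a team label, so with this configuration they fall through to 'default-receiver'.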
Deploying with ansible-playbook
Running ansible-playbook with -C (check mode) first performs a dry run without changing anything.
ansible-playbook prometheus.yml
ansible-playbook node-exporter.yml
ansible-playbook alertmanager.yaml
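The three runs could also be chained from a single entry point. A minimal sketch, assuming the playbooks sit in the same directory under the file names introduced in their sections above (site.yml is a made-up name):

```yaml
# site.yml (hypothetical file name): runs the three playbooks in order
- import_playbook: prometheus.yml
- import_playbook: node-exporter.yml
- import_playbook: alertmanager.yaml
```

A single `ansible-playbook site.yml` would then deploy everything, and `-C` still works for a dry run.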
This will later be reworked to use roles.
