Basic environment information:
Role | IP | Port | Hostname
prometheus server | 172.16.23.120 | 9090 | master
node_exporter master | 172.16.23.120 | 9100 | master
node_exporter node1 | 172.16.23.121 | 9100 | node1
Install node_exporter on node1 and configure it; here this is done from the master machine with ansible:
[root@master prometheus]# ansible -i /etc/ansible/hosts 172.16.23.121 -m shell -a "hostname"
172.16.23.121 | CHANGED | rc=0 >>
node1
[root@master prometheus]# ansible 172.16.23.121 -m shell -a "hostname"
172.16.23.121 | CHANGED | rc=0 >>
node1
[root@master prometheus]# ansible 172.16.23.121 -m copy -a "src=/usr/local/node_exporter-1.0.1.linux-amd64.tar.gz dest=/usr/local/node_exporter-1.0.1.linux-amd64.tar.gz"
172.16.23.121 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "279ec99af23dd18ac319cfb27ee544d88c3c14c9",
    "dest": "/usr/local/node_exporter-1.0.1.linux-amd64.tar.gz",
    "gid": 0,
    "group": "root",
    "md5sum": "e39f161ae0e7b486b98772c4aeb93805",
    "mode": "0644",
    "owner": "root",
    "secontext": "system_u:object_r:usr_t:s0",
    "size": 9520728,
    "src": "/root/.ansible/tmp/ansible-tmp-1600599484.89-31367299931714/source",
    "state": "file",
    "uid": 0
}
[root@master prometheus]# ansible 172.16.23.121 -m shell -a "ls -l /usr/local/node_exporter-1.0.1.linux-amd64.tar.gz"
172.16.23.121 | CHANGED | rc=0 >>
-rw-r--r--. 1 root root 9520728 Sep 20 18:58 /usr/local/node_exporter-1.0.1.linux-amd64.tar.gz
[root@master prometheus]# ansible 172.16.23.121 -m shell -a "cd /usr/local/;tar xf node_exporter-1.0.1.linux-amd64.tar.gz;ln -sv node_exporter-1.0.1.linux-amd64 node_exporter"
172.16.23.121 | CHANGED | rc=0 >>
"node_exporter" -> "node_exporter-1.0.1.linux-amd64"
Then do some configuration for the service:
[root@master prometheus]# ansible 172.16.23.121 -m copy -a "src=/usr/lib/systemd/system/node_exporter.service dest=/usr/lib/systemd/system/node_exporter.service"
172.16.23.121 | CHANGED => {
    "ansible_facts": {
        "discovered_interpreter_python": "/usr/bin/python"
    },
    "changed": true,
    "checksum": "085e9b628d8e7419bab151285b8aa1c717a40892",
    "dest": "/usr/lib/systemd/system/node_exporter.service",
    "gid": 0,
    "group": "root",
    "md5sum": "ea66958b12a29bef63e3b52563f1dcbe",
    "mode": "0644",
    "owner": "root",
    "secontext": "system_u:object_r:systemd_unit_file_t:s0",
    "size": 283,
    "src": "/root/.ansible/tmp/ansible-tmp-1600599820.88-3938928668863/source",
    "state": "file",
    "uid": 0
}
[root@master prometheus]# ansible 172.16.23.121 -m shell -a "systemctl enable node_exporter;systemctl start node_exporter;ps -ef|grep node_exporter;lsof -i:9100"
172.16.23.121 | CHANGED | rc=0 >>
root      2567  2566  0 19:09 pts/1   00:00:00 /bin/sh -c systemctl enable node_exporter;systemctl start node_exporter;ps -ef|grep node_exporter;lsof -i:9100
root      2592     1  0 19:09 ?       00:00:00 /usr/local/node_exporter/node_exporter --web.listen-address=:9100 --collector.systemd --collector.systemd.unit-whitelist="(ssh|docker|rsyslog|redis-server).service"
root      2594  2567  0 19:09 pts/1   00:00:00 grep node_exporter
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
node_expo 2592 root    3u  IPv6  27161      0t0  TCP *:jetdirect (LISTEN)
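The enable/start step could likewise use Ansible's systemd module, which is idempotent and also handles the daemon reload; a sketch:

[root@master prometheus]# ansible 172.16.23.121 -m systemd -a "name=node_exporter state=started enabled=yes daemon_reload=yes"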
Modify the prometheus server configuration to add node1 and collect its data:
[root@master prometheus]# cat prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'linux_node'
    static_configs:
    - targets: ['172.16.23.120:9100']
      labels:
        nodename: master
        role: master
    - targets: ['172.16.23.121:9100']
      labels:
        nodename: node1
        role: node1
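Before restarting, the edited file can be validated with promtool, which ships in the Prometheus release tarball (the install path below is an assumption):

# /usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml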
Then restart the prometheus server:
# systemctl restart prometheus
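A full restart is not strictly necessary: prometheus also reloads its configuration on SIGHUP, or via the lifecycle endpoint if the server was started with --web.enable-lifecycle (that flag is an assumption here):

# curl -X POST http://172.16.23.120:9090/-/reload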
Then refresh the prometheus console:
An error is visible: node1's data cannot be fetched, due to the network or some other cause. Time to troubleshoot:
[root@master prometheus]# ping 172.16.23.121
PING 172.16.23.121 (172.16.23.121) 56(84) bytes of data.
64 bytes from 172.16.23.121: icmp_seq=1 ttl=64 time=1.21 ms
64 bytes from 172.16.23.121: icmp_seq=2 ttl=64 time=1.34 ms
The network is clearly reachable, yet opening http://172.16.23.121:9100/metrics in a browser also fails, so the cause must be the firewall:
[root@node1 local]# systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
   Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2020-09-20 18:37:34 CST; 39min ago
     Docs: man:firewalld(1)
 Main PID: 714 (firewalld)
   CGroup: /system.slice/firewalld.service
           └─714 /usr/bin/python -Es /usr/sbin/firewalld --nofork --nopid

Sep 20 18:37:31 node1 systemd[1]: Starting firewalld - dynamic firewall daemon...
Sep 20 18:37:34 node1 systemd[1]: Started firewalld - dynamic firewall daemon.
So stop the firewall and refresh the console:
At this point node1 has been added to monitoring successfully and its data is being collected normally.
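Stopping firewalld entirely is fine for a lab; on a production node it would be safer to open only the exporter port with standard firewalld commands:

[root@node1 local]# firewall-cmd --permanent --add-port=9100/tcp
[root@node1 local]# firewall-cmd --reload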
Analysis of label keywords in the prometheus configuration file:
In the prometheus console above there is an instance key under labels, yet it does not appear in the configuration file. When instance is not set, it defaults to the target's address; when it is set, the configured value is used instead:
  - job_name: 'linux_node'
    static_configs:
    - targets: ['172.16.23.120:9100']
      labels:
        instance: 172.16.23.120
        nodename: master
        role: master
    - targets: ['172.16.23.121:9100']
      labels:
        instance: 172.16.23.121
        nodename: node1
        role: node1
Check the effect in the console:
We can also define other label attributes to distinguish different nodes, and query by them, as sketched below.
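These labels can be used directly in PromQL selectors; for example, querying the up metric for everything labeled role="node1" through the HTTP API (server address from the table above):

[root@master prometheus]# curl -s 'http://172.16.23.120:9090/api/v1/query' --data-urlencode 'query=up{role="node1"}'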
So far the prometheus server has been pulling each node's metrics into its TSDB. A node can also feed self-generated data to the server side; there are two ways:
1. Configure the textfile collector (--collector.textfile.directory) on the monitored node, as follows:
Modify node_exporter's startup parameters to add the --collector.textfile.directory parameter:
[root@node1 node_exporter]# cat /usr/lib/systemd/system/node_exporter.service
[Unit]
Description=node_exporter

[Service]
ExecStart=/usr/local/node_exporter/node_exporter \
  --web.listen-address=:9100 \
  --collector.systemd \
  --collector.systemd.unit-whitelist="(ssh|docker|rsyslog|redis-server).service" \
  --collector.textfile.directory=/usr/local/node_exporter/textfile.collected
# the --collector.textfile.directory line above is the newly added one
Restart=on-failure

[Install]
WantedBy=multi-user.target
Create the directory referenced above:
# mkdir /usr/local/node_exporter/textfile.collected
Prometheus's exposition format is simply key value pairs, so a small script is enough:
# cd /usr/local/node_exporter
# mkdir scripts
# cd scripts/
[root@node1 node_exporter]# cat scripts/login_user_count.sh
#!/bin/bash
echo "node_login_users `who|wc -l`" > /usr/local/node_exporter/textfile.collected/login_users.prom
[root@node1 node_exporter]# crontab -l
*/1 * * * * /usr/local/node_exporter/scripts/login_user_count.sh
Check the locally collected data:
[root@node1 scripts]# cat ../textfile.collected/login_users.prom
node_login_users 2
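One caveat: node_exporter may read the .prom file while cron is rewriting it. A common pattern (a sketch, same paths as above) is to write to a temporary file and rename it, since rename is atomic within a filesystem:

#!/bin/bash
# Write to a temp file first, then mv into place so node_exporter
# never sees a half-written file (rename is atomic on the same fs).
dir=/usr/local/node_exporter/textfile.collected
echo "node_login_users $(who | wc -l)" > "${dir}/login_users.prom.$$"
mv "${dir}/login_users.prom.$$" "${dir}/login_users.prom"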
Restart the node_exporter service:
# systemctl daemon-reload
# systemctl restart node_exporter
Check the local metrics data:
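This can be verified from the shell as well (node1's address as in the table above):

[root@node1 scripts]# curl -s http://172.16.23.121:9100/metrics | grep node_login_users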
Then query the data from the prometheus server side, i.e. the console:
In this way a monitored node can actively feed locally generated metric data to the server side (node_exporter exposes the file's contents on /metrics and the server scrapes them).
2. Add a new node, node2, and push text data with a custom label to the prometheus server, as follows:
2.1 Install node_exporter following the steps above, and add the startup parameter --collector.textfile.directory=/usr/local/node_exporter/textfile.collected
2.2 Write the script and configure the cron job:
[root@node2 node_exporter]# cat /usr/local/node_exporter/scripts/login_user_count.sh
#!/bin/bash
num=$(who|wc -l)
echo "ops_bash{hostname=\"node2\"} ${num}" > /usr/local/node_exporter/textfile.collected/ops_bash.prom
[root@node2 node_exporter]# crontab -l
*/1 * * * * /usr/local/node_exporter/scripts/login_user_count.sh
Check the locally generated data:
[root@node2 node_exporter]# cat /usr/local/node_exporter/textfile.collected/ops_bash.prom
ops_bash{hostname="node2"} 2
Then query the local metric data:
[root@node2 node_exporter]# curl -s http://172.16.23.122:9100/metrics|grep ops_bash
node_textfile_mtime_seconds{file="ops_bash.prom"} 1.600615621e+09
# HELP ops_bash Metric read from /usr/local/node_exporter/textfile.collected/ops_bash.prom
# TYPE ops_bash untyped
ops_bash{hostname="node2"} 2
2.3 Once node2's local data is confirmed, add node2 to the prometheus server configuration:
- targets: ['172.16.23.122:9100']
2.4 Restart the prometheus service:
2.5 Query with PromQL:
As you can see, labels can be defined not only in the prometheus configuration file but also in the node's .prom file:
[root@node2 node_exporter]# cat textfile.collected/ops_bash.prom
ops_bash{hostname="node2"} 2
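The same query the console runs can also be issued against the server's HTTP API; a sketch, with the server address from the table above:

[root@master prometheus]# curl -s 'http://172.16.23.120:9090/api/v1/query' --data-urlencode 'query=ops_bash{hostname="node2"}'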