Adding hardware-level monitoring to Prometheus with ipmi_exporter


Monitoring hardware with Prometheus

Install ipmitool and load the required kernel modules

yum install ipmitool freeipmi  -y
modprobe ipmi_msghandler
modprobe ipmi_devintf
modprobe ipmi_poweroff
modprobe ipmi_si
modprobe ipmi_watchdog
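
The modprobe calls above only last until the next reboot. A minimal sketch for making them persistent, assuming a systemd-based distribution that reads /etc/modules-load.d/ (the helper name `ipmi_modules_conf` and the file name ipmi.conf are illustrative, not part of the original setup):

```shell
# Print the modules loaded above, one per line (the modules-load.d format).
# Drop any module you do not actually need before persisting the list.
ipmi_modules_conf() {
  printf '%s\n' ipmi_msghandler ipmi_devintf ipmi_poweroff ipmi_si ipmi_watchdog
}

# Usage (as root, not run here):
#   ipmi_modules_conf > /etc/modules-load.d/ipmi.conf
#   ls -l /dev/ipmi0        # device node, typically present once the drivers are loaded
#   ipmitool sensor list    # quick in-band check that the BMC answers
```

If `ipmitool sensor list` returns readings, the local IPMI interface is usable before the exporter is installed.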

Download the ipmi_exporter release package (a prebuilt binary, not the source)

wget https://github.com/soundcloud/ipmi_exporter/releases/download/v1.0.0/ipmi_exporter-v1.0.0.linux-amd64.tar.gz
tar -xf ipmi_exporter-v1.0.0.linux-amd64.tar.gz -C /opt/
cd /opt/ipmi_exporter-v1.0.0.linux-amd64/

Add the configuration file

cat ipmi_remote.yml
modules:
  10.193.x.x:               # BMC (remote management card) IP address
    user: "root"            # BMC user
    pass: "xxxxxxxxxxxxx"   # BMC password
    # Available collectors are bmc, ipmi, chassis, and dcmi
    collectors:
    - bmc
    - ipmi
    - dcmi
    - chassis
    # Got any sensors you don't care about? Add them here.
    exclude_sensor_ids:
    - 2
    - 29
    - 32
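
Because this file holds the BMC password in clear text, it may be worth generating it with restrictive permissions rather than typing the password into an editor. The following is only a sketch: `render_ipmi_module` and the `IPMI_PASS` variable are hypothetical names, and it assumes the password contains no double quotes or backslashes.

```shell
# Print a module entry for one BMC.
# $1 = module name (here the BMC address), $2 = BMC user.
# The password is taken from the IPMI_PASS environment variable.
render_ipmi_module() {
  cat <<EOF
modules:
  "${1}":
    user: "${2}"
    pass: "${IPMI_PASS:?set IPMI_PASS first}"
    collectors:
    - bmc
    - ipmi
    - dcmi
    - chassis
EOF
}

# Usage (not run here):
#   umask 077    # new files become readable by the owner only
#   IPMI_PASS='...' render_ipmi_module 10.193.x.x root > ipmi_remote.yml
```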

Start ipmi_exporter

./ipmi_exporter --config.file=/opt/ipmi_exporter-v1.0.0.linux-amd64/ipmi_remote.yml --web.listen-address=:19293 &
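
Once the exporter is running, a quick request shows whether it is serving data. As far as I understand the exporter's README, /metrics exposes sensors read locally on the host, while /ipmi queries a remote BMC named by a `target` parameter, with the config module selected by a `module` parameter. Treat these paths and parameters as assumptions to verify against your version. `ipmi_probe_url` is a hypothetical helper.

```shell
# Build the URL for a remote probe.
# $1 = exporter address (host:port), $2 = BMC address, $3 = module name.
ipmi_probe_url() {
  printf 'http://%s/ipmi?target=%s&module=%s\n' "$1" "$2" "$3"
}

# Usage (not run here):
#   curl -s http://127.0.0.1:19293/metrics | grep '^ipmi_'
#   curl -s "$(ipmi_probe_url 127.0.0.1:19293 10.193.x.x 10.193.x.x)" | grep '^ipmi_'
```

If neither request returns any `ipmi_` series, check the exporter's output first, since the process was started in the background.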

Add the Prometheus server job configuration

# Add the rule files for the ipmi_exporter alerts
rule_files:
  - "rules/Memory_hardware.yml"
  - "rules/power.yml"
  - "rules/fan.yml"
  - "rules/processor.yml"
  - "rules/harddisk.yml"

# Add the job to the main configuration file (under scrape_configs)
# cat /usr/local/prometheus/prometheus.yml
  - job_name: 'ipmi_exporter'
    file_sd_configs:
    - refresh_interval: 5s  
      files:
      - ./conf.d/ipmi_exporter.json
# cat /usr/local/prometheus/conf.d/ipmi_exporter.json
[
  {
    "targets": ["10.65.x.x:19293"],
    "labels": {
      "hostname": "lgy-storage-glusterxxx"
    }
  }
]
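
With more than a handful of hosts, the target file is easier to generate than to edit by hand, and file-based discovery re-reads it on change. A minimal sketch, assuming a plain-text list with one "target hostname" pair per line (the function name `render_file_sd` and the file hosts.txt are hypothetical):

```shell
# Read "target hostname" pairs from stdin and print a file_sd JSON document.
render_file_sd() {
  first=1
  printf '[\n'
  while read -r target hostname; do
    [ -n "$target" ] || continue          # skip blank lines
    [ "$first" -eq 1 ] || printf ',\n'    # comma between entries, not after the last
    printf '  {"targets": ["%s"], "labels": {"hostname": "%s"}}' "$target" "$hostname"
    first=0
  done
  printf '\n]\n'
}

# Usage (not run here):
#   render_file_sd < hosts.txt > /usr/local/prometheus/conf.d/ipmi_exporter.json
```

The sketch does no JSON escaping, so it only suits host names and addresses without quotes or backslashes.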

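Add the rule files

The exporter's state metrics report 0 for nominal, 1 for warning and 2 for critical, so the rules below fire on the warning state.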

# cd /usr/local/prometheus/rules
# cat Memory_hardware.yml  (memory module monitoring)
groups:
- name: Memory_hardware
  rules:
  - alert: Memory_hardware error
    expr: ipmi_sensor_state{type="Memory"} == 1
    for: 3m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} memory hardware warning"
      description: "{{ $labels.instance }} of job {{$labels.job}}: memory hardware warning, current state [{{ $value }}]."



# cat power.yml  (server power supply monitoring)
groups:
- name: power status
  rules:
  - alert: power bad
    expr: ipmi_sensor_state{name="Status",type="Power Supply"} == 1
    for: 3m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} power supply failed"
      description: "{{ $labels.instance }} of job {{$labels.job}}: power supply failed, current state [{{ $value }}]."


# cat fan.yml  (server fan monitoring)
groups:
- name: fan status
  rules:
  - alert: speed fan bad
    expr: ipmi_fan_speed_state{} == 1
    for: 3m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} fan failed"
      description: "{{ $labels.instance }} of job {{$labels.job}}: fan failed, current state [{{ $value }}]."


# cat processor.yml  (server processor monitoring)
groups:
- name: Processor
  rules:
  - alert: Processor hardware error
    expr: ipmi_sensor_state{name="Status",type="Processor"} == 1
    for: 3m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} processor hardware warning"


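# cat harddisk.yml  (hard disk monitoring, mainly of the RAID groups; the system disks and data disks are built as separate RAID groups, so there will be two entries)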
groups:
- name: harddisk
  rules:
  - alert: hard disk bad
    expr: ipmi_sensor_state{type="Drive Slot"} == 1
    for: 3m
    labels:
      user: caizh
    annotations:
      summary: "Instance {{ $labels.instance }} hard disk failed"
      description: "{{ $labels.instance }} of job {{$labels.job}}: hard disk failed, current state [{{ $value }}]."
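
Before reloading Prometheus, the rule files can be checked for syntax errors with promtool, which ships with Prometheus. A small sketch; the wrapper `check_rules` and the `PROMTOOL` override are hypothetical conveniences, while `promtool check rules` and `promtool check config` are the actual subcommands:

```shell
# Validate each rule file with promtool and stop at the first failure.
# PROMTOOL may point at a promtool binary that is not on PATH.
check_rules() {
  for f in "$@"; do
    "${PROMTOOL:-promtool}" check rules "$f" || return 1
  done
}

# Usage (not run here):
#   cd /usr/local/prometheus/rules
#   check_rules Memory_hardware.yml power.yml fan.yml processor.yml harddisk.yml
#   promtool check config /usr/local/prometheus/prometheus.yml
```

The final `check config` call also verifies that every file listed in the main configuration can be loaded.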



