Prometheus has two metrics that can be used to alert on GPU card errors. Both are ECC error counters:
# HELP dcgm_ecc_dbe_volatile_total Total number of double-bit volatile ECC errors.
# TYPE dcgm_ecc_dbe_volatile_total counter
dcgm_ecc_dbe_volatile_total{gpu="0",uuid="GPU-4d52e430-b8c7-a0b9-7fda-4aa825af5c97"} 0
# HELP dcgm_ecc_sbe_volatile_total Total number of single-bit volatile ECC errors.
# TYPE dcgm_ecc_sbe_volatile_total counter
dcgm_ecc_sbe_volatile_total{gpu="0",uuid="GPU-4d52e430-b8c7-a0b9-7fda-4aa825af5c97"} 0
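
As a rough illustration of how these two counters could drive an alert, the sketch below parses exposition text like the sample above and reports any GPU whose ECC counter is above a threshold. The function names (`parse_ecc_samples`, `gpus_with_errors`) and the default threshold of 0 are assumptions made for this example, not part of any exporter or Prometheus API. In a real setup the check would more likely live in a Prometheus alerting rule than in a script.

```python
import re

# Sample scrape output, copied from the metrics shown above.
SAMPLE = """\
# HELP dcgm_ecc_dbe_volatile_total Total number of double-bit volatile ECC errors.
# TYPE dcgm_ecc_dbe_volatile_total counter
dcgm_ecc_dbe_volatile_total{gpu="0",uuid="GPU-4d52e430-b8c7-a0b9-7fda-4aa825af5c97"} 0
# HELP dcgm_ecc_sbe_volatile_total Total number of single-bit volatile ECC errors.
# TYPE dcgm_ecc_sbe_volatile_total counter
dcgm_ecc_sbe_volatile_total{gpu="0",uuid="GPU-4d52e430-b8c7-a0b9-7fda-4aa825af5c97"} 0
"""

# The two ECC counters named in the note.
ECC_METRICS = ("dcgm_ecc_dbe_volatile_total", "dcgm_ecc_sbe_volatile_total")

# Simplified patterns: enough for lines shaped like the sample, not a full
# parser for the exposition format (no escaped quotes, no timestamps).
SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)"
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_ecc_samples(text):
    """Return (metric_name, labels, value) tuples for the two ECC counters."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") not in ECC_METRICS:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def gpus_with_errors(samples, threshold=0.0):
    """Return (metric_name, gpu_index, value) for counters above the threshold."""
    return [
        (name, labels.get("gpu"), value)
        for name, labels, value in samples
        if value > threshold
    ]


if __name__ == "__main__":
    for name, gpu, value in gpus_with_errors(parse_ecc_samples(SAMPLE)):
        print(f"GPU {gpu}: {name} = {value}")
```

With the sample above both counters are 0, so nothing is reported. Because both metrics are counters, alerting on a recent increase rather than on the absolute value is usually the more useful signal.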
