Prometheus and Grafana Monitoring: A Comprehensive Walkthrough

Reposted article, 2019-10-28 11:41. Category: Operations.

This time we will not deploy the services with Docker, which should make things easier to follow. Enjoy the read.

Introduction

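Prometheus is a monitoring system and also a time series database, written in Go (see the official documentation). It collects timestamped metrics, such as server memory usage or the number of database connections, from specific targets such as hosts, MySQL, or Redis, processes them, and displays them in time series order.
You can configure rules to process these metrics, and when a metric matches a rule, an alert is triggered. Project page: https://github.com/prometheus/prometheus. Prometheus has by now become the de facto standard for monitoring Kubernetes, which is great.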

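Why use this tool? Because we have a lot of machines to monitor, and we have to operate them! Besides, generous volunteers have built many tools that collect monitoring metrics from all kinds of software services, so it is well worth using.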

The tool advertises the following features:

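A multi-dimensional data model (time series identified by a metric name and a set of key/value dimensions)
A flexible query language
No reliance on distributed storage; single server nodes are autonomous
Data is always pulled over HTTP
Pushing time series is supported through an intermediary gateway
Targets are found through service discovery or static configuration
Multiple modes of graphing and dashboard support
Support for federation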

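In plain words: you can run this tool as a standalone single-machine service, or combine several servers into a distributed setup through federation.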

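It can pull data from somewhere, and you can also push data to it (through the intermediary gateway). Every request uses HTTP, which makes it easy to build your own service that produces data in the expected format for Prometheus to collect.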

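You can give Prometheus static addresses for its data sources, or configure service discovery so that it knows where to pull metrics from. (What is service discovery? Say my data is served by a service named ServiceABC, which may be running on machine A or on machine B, and I do not know which; service discovery tells you where it currently is.) On top of that, it is supported by all kinds of really good-looking visualization UI components.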
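To make the idea of service discovery a little more concrete, here is a minimal sketch of the simplest variant, file-based discovery: an external script writes the current target list to a JSON file, and Prometheus re-reads that file when it changes. The function name write_file_sd, the file name, and the env label are hypothetical; the sketch assumes Prometheus has been pointed at the file through a file_sd_configs entry.

```python
import json
import os
import tempfile


def write_file_sd(path, targets, labels=None):
    """Write a file-based discovery file: a JSON list of target groups."""
    groups = [{"targets": list(targets), "labels": dict(labels or {})}]
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, then rename it into place, so that
    # a reader never sees a half-written target list.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(groups, handle, indent=2)
    os.replace(tmp_path, path)
    return groups
```

Calling write_file_sd("node_targets.json", ["127.0.0.1:9100"], {"env": "demo"}) whenever ServiceABC moves would keep the target list current without editing the main configuration.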

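And what is Grafana? It is a cross-platform, open-source metrics analytics and visualization tool, written in Go (see the official documentation), that queries collected data and displays it visually. Data sources include Graphite, InfluxDB, OpenTSDB, Prometheus, Elasticsearch, CloudWatch, and KairosDB. Project page: https://github.com/grafana/grafana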

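Prometheus's own UI is rather ugly and Grafana's is beautiful, so the two work well together. Grafana does not always need Prometheus, though: it can be configured to query various databases directly, and Prometheus is just one of its data sources.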

A First Try

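Where does the time series data come from? Prometheus only collects data; it has to pull it from somewhere, so there must be services that provide it. Of course, you can also build your own data service.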
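As a rough illustration of what "your own data service" could look like, the sketch below serves a single gauge at /metrics using only the Python standard library. The metric name demo_uptime_seconds, the port 8000, and the function names are made up for this example; a real exporter would more likely use an official client library.

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START = time.time()


def render_metrics(uptime_seconds):
    """Render one gauge in the Prometheus text exposition format."""
    return (
        "# HELP demo_uptime_seconds Seconds since the demo exporter started.\n"
        "# TYPE demo_uptime_seconds gauge\n"
        f"demo_uptime_seconds {uptime_seconds:.3f}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Only /metrics is served, mirroring Prometheus's default metrics_path.
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(time.time() - START).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever, answering scrapes on the given (arbitrary) port."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Running serve() and adding 127.0.0.1:8000 as a scrape target would be enough for Prometheus to start collecting the value.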

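The official site provides a number of such data-producing services. Let's start with the most basic one, the machine node metrics exporter node_exporter, which exports a machine's CPU usage, memory usage, and so on.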

Let's try it out.

You can download the latest node_exporter build for your operating system from the releases page.
I am on macOS, so I downloaded https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.darwin-amd64.tar.gz.

After unpacking, it is ready to use:

tar xvf node_exporter-0.18.1.darwin-amd64.tar.gz
cd node_exporter-0.18.1.darwin-amd64

./node_exporter
INFO[0000] Starting node_exporter (version=0.18.1, branch=HEAD, revision=3db77732e925c08f675d7404a8c46466b2ece83e)  source="node_exporter.go:156"
INFO[0000] Build context (go=go1.11.10, user=root@4a30727bb68c, date=20190604-16:47:36)  source="node_exporter.go:157"
INFO[0000] Enabled collectors:                           source="node_exporter.go:97"
INFO[0000]  - boottime                                   source="node_exporter.go:104"
INFO[0000]  - cpu                                        source="node_exporter.go:104"
INFO[0000]  - diskstats                                  source="node_exporter.go:104"
INFO[0000]  - filesystem                                 source="node_exporter.go:104"
INFO[0000]  - loadavg                                    source="node_exporter.go:104"
INFO[0000]  - meminfo                                    source="node_exporter.go:104"
INFO[0000] Listening on :9100                            source="node_exporter.go:170"

This data service exposes port 9100 for Prometheus to scrape.

Open http://127.0.0.1:9100/metrics to see what the metrics provided by node_exporter look like:

# HELP node_network_receive_bytes_total Network device statistic receive_bytes.
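# TYPE node_network_receive_bytes_total counter
# Note: a counter only ever increases (or resets to zero when the process restarts). Values are floating-point numbers, and large values are printed in scientific notation.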
node_network_receive_bytes_total{device="XHC20"} 0
node_network_receive_bytes_total{device="awdl0"} 3072
node_network_receive_bytes_total{device="bridge0"} 0
node_network_receive_bytes_total{device="en0"} 0
node_network_receive_bytes_total{device="en1"} 4.133417984e+09

# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 42696.99
node_cpu_seconds_total{cpu="0",mode="nice"} 0
node_cpu_seconds_total{cpu="0",mode="system"} 9593.58
node_cpu_seconds_total{cpu="0",mode="user"} 27073.77

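The lines starting with # are comments: HELP describes the metric and TYPE gives its value type. A sample is written as metric_name{key1="value1",key2="value2"} value. The key/value pairs inside the braces are labels that subdivide the metric; you can think of them as second-level metrics.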

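For example, node_network_receive_bytes_total is the top-level metric that counts the bytes the machine has received over the network. Beneath it are the per-interface series, such as the network interfaces en0 and en1, and the number at the end is the actual value: bytes received.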
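The line format described above is simple enough to pick apart by hand. The following sketch splits one sample line into its name, labels, and value; parse_sample is a hypothetical helper, and it deliberately ignores optional timestamps and does not unescape label values, so it is not a full parser for the exposition format.

```python
import re

# metric_name, optional {labels}, then the value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
# key="value" pairs inside the braces
LABEL_RE = re.compile(r'(?P<key>[a-zA-Z_][a-zA-Z0-9_]*)="(?P<val>(?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None  # blank lines and HELP/TYPE comments carry no sample
    match = SAMPLE_RE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {
        m.group("key"): m.group("val")
        for m in LABEL_RE.finditer(match.group("labels") or "")
    }
    return match.group("name"), labels, float(match.group("value"))
```

Applied to the en1 line from the output above, it yields the metric name, the single label device="en1", and the value 4133417984.0.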

The architecture diagram helps clarify how Prometheus uses this data:

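Jobs/exporters: exporters, such as the node_exporter we used above, belong to this part and are Prometheus's main source of metrics.
Prometheus Server is the core component. It stores data on disk using its time series database (TSDB); because the storage has been specifically optimized for SSDs, it performs better on an SSD.
Service discovery: Prometheus targets can be written directly in the YAML file, but if the configuration grows long you can also put them in separate files and enable file-based discovery (file_sd) so that Prometheus watches those files for changes. You can even use service discovery backends such as Consul or Kubernetes to update the targets dynamically and keep up with frequent node changes.
Prometheus uses a pull model to scrape metrics from the ports exposed by nodes; compared with a push model, this makes it easier to avoid the disruption and tedious work caused by misbehaving nodes.
Pushgateway is like a relay station. The Prometheus server only ever pulls data, but some nodes, for one reason or another, can only push it, so Pushgateway holds the pushed data until it is scraped.
Alertmanager is the alerting component. Prometheus evaluates the alerting rules you add to its configuration and sends firing alerts to Alertmanager, which delivers notifications through channels such as email.
For stored historical data, Prometheus provides the PromQL language for querying, and ships with a simple UI for running queries, drawing graphs, and viewing configuration, alerts, and so on. These days, of course, most people use the nicer-looking Grafana for that.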
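To illustrate the Pushgateway path described above, here is a small sketch of how a short-lived job might push one value. It assumes a Pushgateway listening on its default port 9091; the job name, metric name, and helper names are hypothetical. The request is built separately from being sent so that it can be inspected without a running gateway.

```python
import urllib.request
from urllib.parse import quote


def build_push_request(gateway, job, metric_name, value):
    """Build (but do not send) a PUT request that pushes one gauge sample."""
    body = (
        f"# TYPE {metric_name} gauge\n"
        f"{metric_name} {value}\n"
    ).encode("utf-8")
    # The job name becomes part of the URL path and is attached as a label.
    url = f"{gateway.rstrip('/')}/metrics/job/{quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def push(request):
    """Send the request; a Pushgateway must be running at the target address."""
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Prometheus would then scrape the Pushgateway like any other target and pick up the pushed value from there.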

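In one sentence: Prometheus fetches monitoring data from somewhere and stores it; paired with a client tool such as Grafana, monitoring becomes easy and worry-free.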

Now let's put the pieces together. First download Prometheus from the official site. I am on macOS, so I downloaded https://github.com/prometheus/prometheus/releases/download/v2.12.0/prometheus-2.12.0.darwin-amd64.tar.gz.

After unpacking, it is ready to use:

tar xvfz prometheus-*.tar.gz
cd prometheus-*

./prometheus

By default Prometheus monitors only itself. We need it to monitor other targets as well, so edit the configuration file prometheus.yml:

vim prometheus.yml

# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']
    
  # We added this job; the target is the address and port exposed by node_exporter.
  - job_name: "node"
    static_configs:
    - targets: ["127.0.0.1:9100"]

Then run:

# By default, Prometheus stores its database in ./data (flag --storage.tsdb.path).
./prometheus --config.file=prometheus.yml

level=info ts=2019-09-11T06:41:06.459Z caller=main.go:740 msg="Loading configuration file" filename=prometheus.yml
level=info ts=2019-09-11T06:41:46.492Z caller=main.go:768 msg="Completed loading of configuration file" filename=prometheus.yml
level=info ts=2019-09-11T06:41:46.492Z caller=main.go:623 msg="Server is ready to receive web requests."

By default the metric data is stored under ./data in the current directory. Open http://127.0.0.1:9090 to view Prometheus.

At http://127.0.0.1:9090/targets you can see the targets we are monitoring; a green State means the target is healthy.
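The same health information is available from Prometheus's HTTP API, which is handy for scripted checks. The sketch below assumes the server from this walkthrough is running at 127.0.0.1:9090; the helper names are hypothetical, and only the scrapeUrl and health fields of the response are used.

```python
import json
import urllib.request


def summarize_targets(payload):
    """Map each active target's scrape URL to its health string."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus API call did not succeed")
    return {
        target["scrapeUrl"]: target["health"]
        for target in payload["data"]["activeTargets"]
    }


def fetch_targets(base_url="http://127.0.0.1:9090"):
    """Call the targets endpoint of a running Prometheus server."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)
```

With both jobs from the configuration above healthy, summarize_targets(fetch_targets()) would report "up" for each scrape URL.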

At http://127.0.0.1:9090/graph, click "insert metric at cursor" to pick a metric from the drop-down list and view its data, as shown in the figure below:

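You can also type a query yourself, for example node_cpu_seconds_total{cpu="0",mode="idle"}, which returns the cumulative number of seconds CPU 0 has spent idle. The Console tab shows node_cpu_seconds_total{cpu="0",instance="127.0.0.1:9100",job="node",mode="idle"} 46541.04.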
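The query typed into the UI can also be sent to the HTTP API as an instant query. This is a sketch under the assumption that Prometheus is reachable at 127.0.0.1:9090; build_query_url, extract_vector, and run_query are hypothetical helper names, and only instant-vector results are handled.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_query_url(expr, base_url="http://127.0.0.1:9090"):
    """URL-encode a PromQL expression for the instant query endpoint."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def extract_vector(payload):
    """Return [(labels, float value)] from an instant-vector response."""
    if payload.get("status") != "success":
        raise ValueError("query failed")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    # Each value is a [timestamp, "string value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


def run_query(expr, base_url="http://127.0.0.1:9090"):
    """Run the query against a live server and return the parsed samples."""
    with urllib.request.urlopen(build_query_url(expr, base_url), timeout=5) as response:
        return extract_vector(json.load(response))
```

run_query('node_cpu_seconds_total{cpu="0",mode="idle"}') would return the same single sample that the Console tab displays.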

To compute the share of total CPU time that all cores together spend busy, you can use a more complex expression:

(((count(count(node_cpu_seconds_total) by (cpu))) - avg(sum by (mode)(irate(node_cpu_seconds_total{mode='idle'}[5m])))) * 100) / count(count(node_cpu_seconds_total) by (cpu))
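The arithmetic behind this expression is easier to see outside PromQL. The inner count(count(...) by (cpu)) is the number of cores, and the irate over the idle counters, summed across cores, is how many cores' worth of time was idle per second. The sketch below reproduces that calculation from two raw samples per core; the function names and the sample numbers in the usage note are made up for illustration.

```python
def irate(previous, current):
    """Per-second increase between two (timestamp, counter value) samples."""
    (t0, v0), (t1, v1) = previous, current
    if v1 < v0:
        # The counter was reset (for example the exporter restarted),
        # so the increase is counted from zero.
        v0 = 0.0
    return (v1 - v0) / (t1 - t0)


def cpu_busy_percent(idle_samples):
    """Busy share of all cores, given per-core idle counter samples.

    idle_samples maps a cpu id to a (previous sample, current sample) pair,
    each sample being (timestamp in seconds, idle seconds counter).
    """
    cores = len(idle_samples)
    idle_rate = sum(irate(prev, cur) for prev, cur in idle_samples.values())
    return (cores - idle_rate) * 100 / cores
```

For two cores sampled 15 seconds apart, where core 0 accumulated 12 idle seconds and core 1 accumulated 9, the idle rates are 0.8 and 0.6, so the busy share is (2 - 1.4) * 100 / 2, or about 30 percent.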

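PromQL (Prometheus Query Language) is worth learning. But we do not really need to care about these complex expressions, because others have already done the work for us. On to Grafana! It has plenty of ready-made dashboards and plugins that integrate all of this for you.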

First go to https://grafana.com/grafana/download/6.3.5 and download the package you need.

I installed it the macOS way; for other operating systems, see the other installation methods. Here is what I did:

brew update
brew install grafana

brew tap homebrew/services
brew services start grafana

Open http://127.0.0.1:3000 to view Grafana. The initial username and password are admin/admin.

Open the side menu by clicking the Grafana icon in the top header.
In the side menu under the Dashboards link you should find a link named Data Sources.
Click the + Add data source button in the top header.
Select Prometheus from the Type dropdown.

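Add a data source: click Add data source, select Prometheus, enter http://127.0.0.1:9090 in the URL field, and click Save & Test. If the green message shown in the figure below appears, the configuration is valid; otherwise the address, the port, or something else is wrong and needs to be fixed.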
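If you would rather script this step than click through the UI, Grafana also accepts data sources over its HTTP API. The following is a sketch that assumes the default admin/admin credentials and the addresses used in this walkthrough; the data source name and the helper names are hypothetical. The request is built separately so it can be checked before anything is sent.

```python
import base64
import json
import urllib.request


def build_datasource_request(grafana_url, prometheus_url, user, password):
    """Build (but do not send) a POST request that registers Prometheus."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend talks to Prometheus
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{grafana_url.rstrip('/')}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def send(request):
    """Send the request to a running Grafana and return the decoded reply."""
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.load(response)
```

Calling send(build_datasource_request("http://127.0.0.1:3000", "http://127.0.0.1:9090", "admin", "admin")) against a fresh install should have the same effect as the manual steps, assuming the password has not yet been changed.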

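Click Home on the left to return to the home page, create a Dashboard, and search for Node Exporter. A whole list of dashboards turns up; follow their instructions and you end up with a very nice-looking monitoring view.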

For reference, see this article: Book

In Practice

The Prometheus project recommends a number of monitoring exporters: EXPORTERS.

Grafana officially recommends a number of data sources: Data Source.

Combining the two lets you assemble many different monitoring setups. See the official documentation for details.

Thanks for reading, and goodbye 👋.
