Prometheus + Grafana (10): System Monitoring for Elasticsearch

Reposted article; original published 2020-01-06 15:32. Tags: ElasticSearch / Prometheus+Grafana

Preface

According to the Exporters and Integrations page on the Prometheus website, an exporter is available for Elasticsearch: elasticsearch_exporter.

Following the instructions in the elasticsearch_exporter documentation, this article shows how to monitor Elasticsearch performance with Grafana, Prometheus, and elasticsearch_exporter.

1. Install elasticsearch_exporter

1.1. Download

Download page: https://github.com/justwatchcom/elasticsearch_exporter/releases

1.2. Download and extract

Download the elasticsearch_exporter-1.1.0.linux-amd64.tar.gz package and extract it under /usr/local:

cd /usr/local
wget https://github.com/justwatchcom/elasticsearch_exporter/releases/download/v1.1.0/elasticsearch_exporter-1.1.0.linux-amd64.tar.gz
tar -xvf elasticsearch_exporter-1.1.0.linux-amd64.tar.gz
cd elasticsearch_exporter-1.1.0.linux-amd64/

1.3. Start

nohup ./elasticsearch_exporter --es.uri http://10.x.xx.100:9200 &

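## Parameters:
--es.uri                Default http://localhost:9200. Address (host and port) of the Elasticsearch node to connect to. This can be a local node (e.g. localhost:9200) or the address of a remote Elasticsearch server.
--es.all                Default false. If true, query stats for all nodes in the cluster, not just the node we connect to.
--es.cluster_settings   Default false. If true, query stats for cluster settings.
--es.indices            Default false. If true, query stats for all indices in the cluster.
--es.indices_settings   Default false. If true, query settings stats for all indices in the cluster.
--es.shards             Default false. If true, query stats for all indices in the cluster, including shard-level stats (implies es.indices=true).
--es.snapshots          Default false. If true, query stats for the cluster snapshots.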

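Run the following command to check the log and confirm that the exporter started successfully:

tail -n 1000 -f nohup.out

Once it is running, open http://10.xx.xxx.100:9114/metrics to see the metrics being exposed.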
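The same check can be done from the shell by filtering the endpoint's output for the cluster-health samples. This is a minimal sketch, assuming the exporter listens on its default port 9114 on the local host; the function name cluster_health_samples is made up for illustration.

```shell
# Hypothetical helper: read Prometheus text-format metrics on stdin and
# print only the elasticsearch_cluster_health_status samples.
cluster_health_samples() {
  grep '^elasticsearch_cluster_health_status{'
}

# Usage against a running exporter (the address is an assumption, adjust as needed):
#   curl -s http://localhost:9114/metrics | cluster_health_samples
```

The exporter emits one sample per color (green, yellow, red); the sample whose value is 1 is the current cluster status.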

2. Prometheus configuration

2.1. Configuration

Edit the prometheus.yml file of the Prometheus component and add the elasticsearch_exporter endpoint as a scrape target.
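A minimal sketch of such a scrape job follows. It is an assumption rather than the author's exact configuration: the job name elasticsearch, the intervals, and the scratch file name elasticsearch_scrape.yml are illustrative, and the target reuses the masked exporter address from section 1.3, which must be replaced with the real host.

```shell
# Write an example scrape job to a scratch file; merge this entry by hand
# under the existing scrape_configs: key in prometheus.yml.
cat <<'EOF' > elasticsearch_scrape.yml
  # Assumed job name and intervals; adjust to your environment.
  - job_name: 'elasticsearch'
    scrape_interval: 60s
    scrape_timeout: 30s
    metrics_path: /metrics
    static_configs:
      # Masked exporter address from the text; replace with the real host:port.
      - targets: ['10.xx.xxx.100:9114']
EOF
```

After merging the entry, `promtool check config prometheus.yml` can validate the file before Prometheus is restarted.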

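2.2. Restart and verify

After saving the file, restart Prometheus and check the Targets page.

Note: State=UP means the target is being scraped successfully.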

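3. Grafana configuration

3.1. Download the dashboard

Download page: https://grafana.com/grafana/dashboards/2322

3.2. Import the dashboard

3.3. View the dashboard

Note: the dashboard was imported and then modified to fit our own business needs.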

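3.4. Alert metrics

No.	Alert name	Alert rule
1	Cluster status alert	Alert when the cluster status is not as expected [health status is red or yellow]
2	Cluster health alert	Alert when the cluster health status is not as expected [!=1]
3	Node status alert	Alert when the node status is not as expected [!=1]
4	Node count alert	Alert when the number of nodes in the cluster crosses the threshold [<5]
5	Circuit breaker tripped alert	Alert when circuit breaker trips in the cluster cross the threshold [>0]
6	Memory alert	Alert when memory usage crosses the threshold [>80%]
7	GC time alert	Alert when GC time crosses the threshold [>0.3s]
8	GC count alert	Alert when the number of GC runs per second crosses the threshold [>5]
9	Disk alert	Alert when disk usage crosses the threshold [>80%]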
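One way to express some of these thresholds is as Prometheus alerting rules. The sketch below is an assumption, not the author's setup: the file name elasticsearch_alerts.yml, the alert names, the `for` durations, and the severity labels are illustrative, and elasticsearch_jvm_memory_max_bytes is an exporter metric that is not listed elsewhere in this article. Only the thresholds come from the table.

```shell
# Write example alerting rules to a scratch file (all names are illustrative).
cat <<'EOF' > elasticsearch_alerts.yml
groups:
  - name: elasticsearch
    rules:
      # Row 1: cluster status is red or yellow.
      - alert: ElasticsearchClusterNotGreen
        expr: elasticsearch_cluster_health_status{color=~"red|yellow"} == 1
        for: 1m
        labels:
          severity: critical
      # Row 4: fewer than 5 nodes in the cluster.
      - alert: ElasticsearchTooFewNodes
        expr: elasticsearch_cluster_health_number_of_nodes < 5
        for: 1m
        labels:
          severity: critical
      # Row 6: JVM heap usage above 80%.
      - alert: ElasticsearchHeapUsageHigh
        expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.8
        for: 5m
        labels:
          severity: warning
      # Row 8: more than 5 GC runs per second.
      - alert: ElasticsearchGcTooFrequent
        expr: rate(elasticsearch_jvm_gc_collection_seconds_count[1m]) > 5
        for: 5m
        labels:
          severity: warning
EOF
```

The file would then be referenced from the rule_files section of prometheus.yml; `promtool check rules elasticsearch_alerts.yml` can validate it first.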

4. Miscellaneous

Register the exporter as a system service that starts automatically at boot

## Prepare the unit file
cat <<\EOF >/etc/systemd/system/elasticsearch_exporter.service
[Unit]
Description=Elasticsearch stats exporter for Prometheus
Documentation=https://github.com/justwatchcom/elasticsearch_exporter

[Service]
ExecStart=/usr/local/elasticsearch_exporter-1.1.0.linux-amd64/elasticsearch_exporter --es.uri http://10.x.xx.100:9200

[Install]
WantedBy=multi-user.target
EOF


## Start the service and enable it at boot
systemctl daemon-reload
systemctl enable elasticsearch_exporter.service
systemctl stop elasticsearch_exporter.service
systemctl start elasticsearch_exporter.service
systemctl status elasticsearch_exporter.service

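5. Core metrics

5.1. Cluster health and node availability

The cluster health API reports the overall health of the cluster. Treat the health status as a key signal that the cluster is running smoothly; any change in status deserves attention. The important fields returned by the API and the corresponding Prometheus metrics are:

Field	Notes	Metric name
status	Cluster status: green (all primary and replica shards are active), yellow (all primary shards are active, but not all replica shards), red (at least one primary shard is not active)	elasticsearch_cluster_health_status
number_of_nodes/number_of_data_nodes	Number of nodes / number of data nodes in the cluster	elasticsearch_cluster_health_number_of_nodes / elasticsearch_cluster_health_number_of_data_nodes
active_primary_shards	Total number of active primary shards	elasticsearch_cluster_health_active_primary_shards
active_shards	Total number of active shards (including replica shards)	elasticsearch_cluster_health_active_shards
relocating_shards	Number of shards currently being moved from one node to another; usually 0, and it rises when nodes join or leave the cluster	elasticsearch_cluster_health_relocating_shards
initializing_shards	Number of shards that are being initialized	elasticsearch_cluster_health_initializing_shards
unassigned_shards	Number of unassigned shards; usually 0, and it rises when a node's replica shards are lost	elasticsearch_cluster_health_unassigned_shards
number_of_pending_tasks	Number of cluster-level changes waiting to be executed. Only the master node can process changes to cluster-level metadata (creating indices, updating mappings, allocating shards, and so on); the `pending-tasks` API shows the tasks waiting in the queue. In most cases the queue stays at or near zero	elasticsearch_cluster_health_number_of_pending_tasks

Based on these metrics, configure a Singlestat panel for the cluster status so that health is visible at a glance.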

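5.2. Host-level system and network metrics

metric name	description
elasticsearch_process_cpu_percent	Percent CPU used by the process (CPU usage)
elasticsearch_filesystem_data_free_bytes	Free space on block device in bytes (free disk space)
elasticsearch_process_open_files_count	Open file descriptors (held by the ES process)
elasticsearch_transport_rx_packets_total	Count of packets received (inbound traffic between ES nodes)
elasticsearch_transport_tx_packets_total	Count of packets sent (outbound traffic between ES nodes)

If CPU usage keeps rising, the cause is usually load from heavy search or indexing work. More nodes may be needed to redistribute the load.

File descriptors are used for node-to-node communication, client connections, and file operations. If the number of open file descriptors reaches the system limit (Linux typically allows 1024 file descriptors per process by default; raising this to 65535 is recommended in production), new connections and file operations will fail until old ones are closed.

If the ES cluster is write-heavy, SSDs are recommended, and disk space usage needs close attention. Elasticsearch performs a large amount of disk reads and writes when segments are created, queried, and merged.

Traffic between nodes is one of the key indicators of whether the cluster is balanced; the rate of bytes sent and received shows how much network traffic the cluster is handling.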

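5.3. JVM memory and garbage collection

metric name	description
elasticsearch_jvm_gc_collection_seconds_count	Count of JVM GC runs
elasticsearch_jvm_gc_collection_seconds_sum	GC run time in seconds
elasticsearch_jvm_memory_committed_bytes	JVM memory currently committed by area (memory guaranteed to be available to the JVM, not the maximum limit)
elasticsearch_jvm_memory_used_bytes	JVM memory currently used by area

The main things to watch are JVM heap usage and the proportion of time spent in JVM GC, in order to spot GC problems. Elasticsearch relies on garbage collection to free heap memory. With the CMS collector settings that Elasticsearch shipped by default at the time of writing, collection starts when heap usage reaches 75%. An alert on heap usage shows whether garbage collection is keeping up with garbage creation; if it is not, increase the heap size or add nodes.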
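For a quick spot check of heap usage without Grafana, the percentage can be computed directly from the exporter's output. This is a sketch under assumptions: heap_used_percent is a made-up helper name, and elasticsearch_jvm_memory_max_bytes is an exporter metric that is not listed in the table above.

```shell
# Hypothetical helper: read exporter metrics on stdin and print
# "<node name> <heap used %>" for every node that reports heap figures.
heap_used_percent() {
  awk '
    /^elasticsearch_jvm_memory_(used|max)_bytes[{]/ && /area="heap"/ {
      node = "unknown"
      # Extract the value of the name label, which holds the node name.
      if (match($0, /[{,]name="[^"]*"/)) {
        node = substr($0, RSTART + 7, RLENGTH - 8)
      }
      if ($0 ~ /^elasticsearch_jvm_memory_used_bytes/) {
        used[node] = $NF
      } else {
        limit[node] = $NF
      }
    }
    END {
      for (n in limit) {
        if (limit[n] + 0 > 0) {
          printf "%s %.1f\n", n, 100 * used[n] / limit[n]
        }
      }
    }
  '
}

# Usage against a running exporter (the address is an assumption):
#   curl -s http://localhost:9114/metrics | heap_used_percent
```

A node that stays above the 75% mark after collections have run is a sign that GC is not keeping up.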

5.4. Search and indexing performance

Search requests

metric name	description
elasticsearch_indices_search_query_total	Total number of queries
elasticsearch_indices_search_query_time_seconds	Total query time in seconds
elasticsearch_indices_search_fetch_total	Total number of fetches
elasticsearch_indices_search_fetch_time_seconds	Total fetch time in seconds

Indexing requests

metric name	description
elasticsearch_indices_indexing_index_total	Total index calls
elasticsearch_indices_indexing_index_time_seconds_total	Cumulative index time in seconds
elasticsearch_indices_refresh_total	Total refreshes
elasticsearch_indices_refresh_time_seconds_total	Total time spent refreshing in seconds
elasticsearch_indices_flush_total	Total flushes
elasticsearch_indices_flush_time_seconds	Cumulative flush time in seconds

Plot time and operation count on the same graph, with time on the left y-axis and the corresponding operation count on the right y-axis. Dividing time by operation count gives the average time per operation, which shows whether performance is abnormal. The average indexing latency is computed this way; if it keeps growing, too many documents may be going into a single bulk request.
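The time-per-operation calculation can be sketched as Prometheus recording rules that divide the rate of the time counter by the rate of the operation counter. The rule names, the 5m window, and the file name elasticsearch_latency.rules.yml are assumptions chosen for illustration; the metric names are the ones listed in the tables above.

```shell
# Write example recording rules to a scratch file (rule names are illustrative).
cat <<'EOF' > elasticsearch_latency.rules.yml
groups:
  - name: elasticsearch_latency
    rules:
      # Average seconds per index operation over the last 5 minutes.
      # The result is NaN while no index operations occur in the window.
      - record: elasticsearch:indexing_index_latency_seconds:avg5m
        expr: rate(elasticsearch_indices_indexing_index_time_seconds_total[5m]) / rate(elasticsearch_indices_indexing_index_total[5m])
      # Average seconds per search query over the last 5 minutes.
      - record: elasticsearch:search_query_latency_seconds:avg5m
        expr: rate(elasticsearch_indices_search_query_time_seconds[5m]) / rate(elasticsearch_indices_search_query_total[5m])
EOF
```

Once the file is listed under rule_files in prometheus.yml, the recorded series can be graphed in Grafana like any other metric. The same expressions also work directly as panel queries.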

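Elasticsearch persists data to disk through flush operations. If flush latency keeps growing, disk I/O capacity may be insufficient, and if this continues, data can eventually no longer be indexed.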

5.5. Resource saturation

metric name	description
elasticsearch_thread_pool_queue_count	Thread pool operations queued
elasticsearch_thread_pool_rejected_count	Thread pool operations rejected
elasticsearch_indices_fielddata_memory_size_bytes	Field data cache memory usage in bytes
elasticsearch_indices_fielddata_evictions	Evictions from the field data cache
elasticsearch_indices_filter_cache_memory_size_bytes	Filter cache memory usage in bytes
elasticsearch_indices_filter_cache_evictions	Evictions from the filter cache
elasticsearch_cluster_health_number_of_pending_tasks	Cluster level changes which have not yet been executed (pending tasks)
elasticsearch_indices_get_missing_total	Total GET requests for documents that do not exist
elasticsearch_indices_get_missing_time_seconds	Total time in seconds spent on GET requests for documents that do not exist

Build panels from the metrics above. Elasticsearch nodes use thread pools to manage how threads consume memory and CPU, so the queue and rejection counts indicate whether the nodes have enough capacity.

Each Elasticsearch node maintains many types of thread pools. Generally, the most important ones are search, index, merge, and bulk (newer Elasticsearch versions replace the index and bulk pools with a single write pool).

The queue size of each thread pool shows how many requests are waiting to be served on the node. Once a pool reaches its maximum queue size (the default differs by pool type), further requests are rejected by that pool.
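To see quickly which pools have rejected work, the exporter output can be filtered for non-zero rejection counters. This is a minimal sketch; rejected_pools is a made-up helper name and the exporter address in the usage comment is an assumption.

```shell
# Hypothetical helper: read exporter metrics on stdin and print the
# thread-pool samples whose rejected count is greater than zero.
rejected_pools() {
  awk '/^elasticsearch_thread_pool_rejected_count[{]/ && $NF + 0 > 0'
}

# Usage against a running exporter (the address is an assumption):
#   curl -s http://localhost:9114/metrics | rejected_pools
```

The counter is cumulative, so a non-zero value only shows that rejections have happened at some point; to detect ongoing rejections, graph or alert on `rate(elasticsearch_thread_pool_rejected_count[5m])` in Prometheus instead.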

References:

https://shenshengkun.github.io/posts/550bdf86.html

https://yq.aliyun.com/articles/548354
