Elasticsearch Monitoring and Deployment

Reposted article, 2017-05-27 10:47

Elasticsearch:

Reference: https://elasticsearch.cn/book/elasticsearch_definitive_guide_2.x/_cluster_health.html

【Monitoring】
server command:
1.Cluster health:
a.health check --- command：GET _cluster/health
example:
{
"cluster_name": "elasticsearch_zach",
"status": "green",
"timed_out": false,
"number_of_nodes": 1,
"number_of_data_nodes": 1,
"active_primary_shards": 10,
"active_shards": 10,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0
}
b.issue example:
command：GET _cluster/health?level=indices
{
"cluster_name": "elasticsearch_zach",
"status": "red",
"timed_out": false,
"number_of_nodes": 8,
"number_of_data_nodes": 8,
"active_primary_shards": 90,
"active_shards": 180,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 20,
"indices": {
"v1": {
"status": "green",
"number_of_shards": 10,
"number_of_replicas": 1,
"active_primary_shards": 10,
"active_shards": 20,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0
},
"v2": {
"status": "red", -----------issue
"number_of_shards": 10,
"number_of_replicas": 1,
"active_primary_shards": 0,
"active_shards": 0,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 20 ----------issue
},
"v3": {
"status": "green",
"number_of_shards": 10,
"number_of_replicas": 1,
"active_primary_shards": 10,
"active_shards": 20,
"relocating_shards": 0,
"initializing_shards": 0,
"unassigned_shards": 0
},
....
}
}
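The per-index breakdown above lends itself to a scripted check. The following is a minimal Python sketch, assuming the `level=indices` response has already been fetched and decoded into a dict. The function name `unhealthy_indices` is hypothetical, and the sample data is a trimmed copy of the example response.

```python
def unhealthy_indices(health):
    """Return {index_name: status} for every index that is not green.

    `health` is the decoded JSON body of GET _cluster/health?level=indices.
    """
    return {
        name: info["status"]
        for name, info in health.get("indices", {}).items()
        if info.get("status") != "green"
    }


# Trimmed copy of the example response above.
sample = {
    "status": "red",
    "unassigned_shards": 20,
    "indices": {
        "v1": {"status": "green", "unassigned_shards": 0},
        "v2": {"status": "red", "unassigned_shards": 20},
        "v3": {"status": "green", "unassigned_shards": 0},
    },
}

print(unhealthy_indices(sample))
```

With the sample data, only v2 is reported, which matches the index marked as the issue in the example.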
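c.When building unit and integration tests, or automation scripts that work with Elasticsearch,
the cluster-health API has another useful trick. You can specify a wait_for_status parameter,
which makes the call return only once the cluster has reached that status. For example:
command: GET _cluster/health?wait_for_status=green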
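For automation scripts, the blocking call can be wrapped in a small helper. This is a sketch using only the Python standard library. The helper names and the `http://localhost:9200` address are assumptions, and so is the `timeout` query parameter, which the notes above do not mention. Depending on the Elasticsearch version, a timed-out wait may be answered with an HTTP error status, so the sketch treats that case as "not green".

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen


def health_url(base_url, wait_for_status=None, timeout=None):
    """Build a cluster-health URL, optionally waiting for a status."""
    params = {}
    if wait_for_status is not None:
        params["wait_for_status"] = wait_for_status
    if timeout is not None:
        params["timeout"] = timeout
    query = "?" + urlencode(params) if params else ""
    return base_url.rstrip("/") + "/_cluster/health" + query


def wait_for_green(base_url, timeout="30s"):
    """Block until the cluster is green or the server-side wait expires.

    Performs a real HTTP request, so it needs a reachable cluster.
    """
    try:
        with urlopen(health_url(base_url, "green", timeout)) as resp:
            body = json.load(resp)
    except HTTPError:
        # Assumption: an error status here means the wait timed out.
        return False
    return body["status"] == "green" and not body["timed_out"]


print(health_url("http://localhost:9200", wait_for_status="green", timeout="30s"))
```

A test fixture could call `wait_for_green(...)` once after creating its indices instead of polling in a loop.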

2.Per-node statistics:
a.GET _nodes/stats
At the start of the output we can see the cluster name and our first node:

{
"cluster_name": "elasticsearch_zach",
"nodes": {
"UNr6ZMf5Qk-YCPA_L18BOQ": {
"timestamp": 1408474151742,
"name": "Zach",
"transport_address": "inet[zacharys-air/192.168.1.131:9300]",
"host": "zacharys-air",
"ip": [
"inet[zacharys-air/192.168.1.131:9300]",
"NONE"
],

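b.Indices
c.Operating system and process sections
CPU/load/memory usage/swap usage/open file descriptors
d.Thread pools
e.File system and network sections
f.Circuit breakers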

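3.Cluster statistics
GET _cluster/stats

4.Index statistics:
GET my_index/_stats
GET my_index,another_index/_stats
GET _all/_stats
The first request returns statistics for the my_index index.
Separate index names with commas to request statistics for several indices.
Use the special _all name to request statistics for all indices.

5.Pending tasks: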
GET _cluster/pending_tasks

6.cat API
a.GET /_cat

eg：GET /_cat

=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}

b.To enable column headers, add the ?v parameter:

GET /_cat/health?v

epoch time cluster status node.total node.data shards pri relo init unassign
1408[..] 12[..] el[..] 1 1 114 114 0 0 114
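Because `?v` adds a header row, cat output can be turned into records with a few lines of code. This is a rough Python sketch that assumes every column holds a single whitespace-free value; the function name `parse_cat` is hypothetical and the sample rows are made up for illustration.

```python
def parse_cat(text):
    """Parse whitespace-separated _cat output (requested with ?v) into dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    # Pair each value with the header name in the same position.
    return [dict(zip(header, line.split())) for line in lines[1:]]


# Made-up sample rows, for illustration only.
sample = """\
epoch      status node.total shards unassign
1408474151 green  1          114    0
"""

rows = parse_cat(sample)
print(rows[0]["status"], rows[0]["unassign"])
```

Values stay as strings, so numeric columns need an explicit `int(...)` before comparison.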

【Deployment】
1.Hardware/memory/CPU/disks/network
2.Java Virtual Machine
3.Transport Client versus Node Client
4.Configuration management

elasticsearch.yml:
a.Cluster/node
cluster.name: elasticsearch_production
node.name: elasticsearch_005_data

b.Paths for plugins, logs, and most importantly your data:
path.data: /path/to/data1,/path/to/data2
# Path to log files:
path.logs: /path/to/logs
# Path to where plugins are installed:
path.plugins: /path/to/plugins

c.Minimum master nodes:
minimum_master_nodes
eg：discovery.zen.minimum_master_nodes: 2
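The notes do not say how to pick the value. The usual guidance is a quorum of master-eligible nodes, (N / 2) + 1 with integer division. A minimal Python sketch of that rule follows; the function name is hypothetical.

```python
def minimum_master_nodes(master_eligible_nodes):
    """Quorum of master-eligible nodes: (N / 2) + 1, using integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


for n in (1, 2, 3, 5, 10):
    print(n, "->", minimum_master_nodes(n))
```

For three master-eligible nodes this gives 2, the value used in the example setting above.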

d.Cluster recovery settings:
keywords example:
gateway.recover_after_nodes: 8
gateway.expected_nodes: 10
gateway.recover_after_time: 5m
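eg:
This means Elasticsearch will do the following:
Wait until at least 8 nodes are present in the cluster.
Then begin data recovery after 5 minutes or once 10 nodes have joined, whichever comes first.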
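The interaction of the three settings can be modelled as a small decision function. This is a simplified Python sketch of the behaviour described above, not Elasticsearch's actual implementation. The function and parameter names are hypothetical, and `seconds_waited` is assumed to count from the moment the 8-node floor is reached.

```python
def should_start_recovery(nodes_present, seconds_waited,
                          recover_after_nodes=8,
                          expected_nodes=10,
                          recover_after_seconds=300):
    """Model the recovery gate described above.

    Nothing happens below `recover_after_nodes`. Once that floor is met,
    recovery starts when `expected_nodes` are present or when
    `recover_after_seconds` have elapsed, whichever comes first.
    """
    if nodes_present < recover_after_nodes:
        return False
    return (nodes_present >= expected_nodes
            or seconds_waited >= recover_after_seconds)


print(should_start_recovery(8, 10))    # floor met, still waiting
print(should_start_recovery(8, 300))   # 5 minutes elapsed
print(should_start_recovery(10, 0))    # all expected nodes present
```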

e.Unicast instead of multicast:
#discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["host1", "host2:port"]

5.Settings you should not touch:
a.Garbage collector
b.Thread pools

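6.Heap memory: sizing (less than 32 GB) and swapping
a.Setting it:
(1)Set the ES_HEAP_SIZE environment variable, for example: export ES_HEAP_SIZE=10g
(2)Alternatively, if you find it simpler, pass the heap size as command-line arguments
when starting the process: ./bin/elasticsearch -Xmx10g -Xms10g
Make sure the minimum (Xms) and maximum (Xmx) heap sizes are the same,
so that the heap is not resized at runtime, which is a costly operation.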

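b.Give (less than) half of your memory to Lucene
Lucene is designed to leverage the underlying operating system for caching in-memory data structures.
The standard recommendation is to give 50% of the available memory to the Elasticsearch heap, leaving the other half to Lucene through the operating system's file cache.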
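The two sizing rules (half of RAM, and less than 32 GB) combine into a one-line calculation. The Python sketch below is illustrative only: the function name is hypothetical, and the 31 GB cap is an assumed value chosen simply to stay under the 32 GB limit.

```python
def recommended_heap_gb(total_ram_gb, cap_gb=31):
    """Half of RAM for the heap, never exceeding `cap_gb` (kept below 32 GB)."""
    if total_ram_gb < 2:
        raise ValueError("total_ram_gb must be at least 2")
    return min(total_ram_gb // 2, cap_gb)


for ram in (16, 64, 128):
    heap = recommended_heap_gb(ram)
    # Same value for Xms and Xmx so the heap is never resized at runtime.
    print(f"{ram} GB RAM -> -Xms{heap}g -Xmx{heap}g")
```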

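c.Performance tuning:
Disable swapping:
1.Temporarily: sudo swapoff -a
2.To disable it permanently, you will probably need to edit /etc/fstab; consult your operating system's documentation.
3.Lower the swappiness value:
On most Linux systems this is configured in sysctl:
vm.swappiness = 1
4.Enable the mlockall switch in the configuration file. It lets the JVM lock its memory
and prevents the operating system from swapping it out. In your elasticsearch.yml, set:
bootstrap.mlockall: true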

7.File descriptors and MMap
a.GET /_nodes/process

{
"cluster_name": "elasticsearch__zach",
"nodes": {
"TGn9iO2_QQKb0kavcLbnDw": {
"name": "Zach",
"transport_address": "inet[/192.168.1.131:9300]",
"host": "zacharys-air",
"ip": "192.168.1.131",
"version": "2.0.0-SNAPSHOT",
"build": "612f461",
"http_address": "inet[/192.168.1.131:9200]",
"process": {
"refresh_interval_in_millis": 1000,
"id": 19808,
"max_file_descriptors": 64000, --------【the number of file descriptors available to the Elasticsearch process】
"mlockall": true
}
}
}
}
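The same response can be checked across all nodes in a script. This Python sketch assumes the decoded JSON of `GET /_nodes/process`; the function name is hypothetical, the 64000 default simply mirrors the value in the example, and the second node in the sample data is invented for illustration.

```python
def low_fd_nodes(nodes_process, minimum=64000):
    """Return {node_name: max_file_descriptors} for nodes below `minimum`."""
    flagged = {}
    for node in nodes_process.get("nodes", {}).values():
        limit = node.get("process", {}).get("max_file_descriptors")
        if limit is not None and limit < minimum:
            flagged[node.get("name", "?")] = limit
    return flagged


sample = {
    "nodes": {
        "TGn9iO2_QQKb0kavcLbnDw": {
            "name": "Zach",
            "process": {"max_file_descriptors": 64000, "mlockall": True},
        },
        # Invented node, present only to show a failing case.
        "hypothetical-node-id": {
            "name": "Example",
            "process": {"max_file_descriptors": 4096, "mlockall": True},
        },
    }
}

print(low_fd_nodes(sample))
```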
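b.Elasticsearch uses a mix of NioFs (note: non-blocking file system) and
MMapFs (note: memory-mapped file system) for its various files. Make sure the maximum map count
is configured so that enough virtual memory is available for mmapped files. It can be set temporarily:
sysctl -w vm.max_map_count=262144
Or you can set it permanently by modifying vm.max_map_count in /etc/sysctl.conf.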

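【Post-deployment】
1.Changing settings dynamically
eg:
PUT /_cluster/settings
{
"persistent" : {
"discovery.zen.minimum_master_nodes" : 2 -- this persistent setting survives a full cluster restart
},
"transient" : {
"indices.store.throttle.max_bytes_per_sec" : "50mb" -- this transient setting is removed after the first full cluster restart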
}
}

2.Logging:
a.Elasticsearch writes many logs, all placed under the ES_HOME/logs directory
b.Raise the logging level for node discovery:
PUT /_cluster/settings
{
"transient" : {
"logger.discovery" : "DEBUG"
}
}
c.Slow logs:
PUT /my_index/_settings
{
"index.search.slowlog.threshold.query.warn" : "10s", -- log a WARN entry for queries slower than 10 seconds
"index.search.slowlog.threshold.fetch.debug": "500ms", -- log a DEBUG entry for fetches slower than 500 milliseconds
"index.indexing.slowlog.threshold.index.info": "5s" -- log an INFO entry for indexing operations slower than 5 seconds
}
PUT /_cluster/settings
{
"transient" : {
"logger.index.search.slowlog" : "DEBUG", -- set the search slow log to DEBUG level
"logger.index.indexing.slowlog" : "WARN" -- set the indexing slow log to WARN level
}
}
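To see how several thresholds interact, the selection of a log level can be modelled in a few lines. This is a simplified Python sketch, not Elasticsearch's code. The helper names are hypothetical, and the `info` and `debug` query thresholds in the sample are invented values; only the 10s `warn` threshold comes from the settings above. The logger level set through `/_cluster/settings` must also admit the chosen level for the entry to appear.

```python
import re

_UNITS_MS = {"ms": 1, "s": 1000, "m": 60_000}

# Most severe first, so the first threshold that is exceeded wins.
_LEVELS = ("warn", "info", "debug", "trace")


def to_millis(value):
    """Convert a duration such as '500ms', '5s' or '1m' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m)", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS_MS[match.group(2)]


def slowlog_level(took_ms, thresholds):
    """Return the level an operation would be logged at, or None."""
    for level in _LEVELS:
        limit = thresholds.get(level)
        if limit is not None and took_ms > to_millis(limit):
            return level
    return None


# 'warn' comes from the example above; 'info' and 'debug' are invented.
query_thresholds = {"warn": "10s", "info": "5s", "debug": "2s"}
print(slowlog_level(6000, query_thresholds))
```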
d.Indexing performance tips:
(1)Test performance scientifically
(2)Use bulk requests and tune their size
(3)Storage
(4)Segments and merging
PUT /_cluster/settings
{
"persistent" : {
"indices.store.throttle.max_bytes_per_sec" : "100mb"
}
}
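PUT /_cluster/settings
{
"transient" : {
"indices.store.throttle.type" : "none" -- setting the throttle type to none disables merge throttling entirely. Once the import is finished, remember to set it back to merge to re-enable throttling.
}
}
If you are using spinning disks rather than SSDs, add the following setting to your elasticsearch.yml: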

index.merge.scheduler.max_thread_count: 1
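(5)Other
Finally, a few other things are worth keeping in mind:

If your search results don't need near-real-time accuracy, consider raising each index's index.refresh_interval to 30s. During a large bulk import you can disable refreshes by setting this value to -1. Don't forget to re-enable it when you are done.
If you are doing a large bulk import, consider disabling replicas by setting index.number_of_replicas: 0. When a document is replicated, the whole document is sent to the replica node and the indexing process is repeated verbatim. This means each replica also performs the analysis, indexing, and potential merging.

Conversely, if you index with zero replicas and enable replicas only after the import finishes, recovery is essentially a byte-for-byte network transfer, which is far more efficient than repeating the indexing process.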
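The two bulk-import tips above amount to one settings change before the import and one after it. The Python sketch below only builds the request bodies for `PUT /<index>/_settings`. The function name is hypothetical, and the restore defaults ("1s" and 1 replica) are assumptions that should be replaced with the index's real values.

```python
import json


def bulk_import_settings(start, refresh_interval="1s", replicas=1):
    """Build the body for PUT /<index>/_settings around a bulk import.

    start=True  -> disable refresh and replicas before loading.
    start=False -> restore the given refresh interval and replica count.
    """
    if start:
        return {"index": {"refresh_interval": "-1", "number_of_replicas": 0}}
    return {"index": {"refresh_interval": refresh_interval,
                      "number_of_replicas": replicas}}


print(json.dumps(bulk_import_settings(start=True)))
print(json.dumps(bulk_import_settings(start=False, refresh_interval="30s")))
```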

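If your documents don't come with their own IDs, use Elasticsearch's auto-ID feature. It is optimized to skip version lookups, because an auto-generated ID is known to be unique.
If you use your own IDs, try to pick one that is friendly to Lucene. Examples include zero-padded sequential IDs, UUID-1, and nanotime; these IDs have consistent, sequential patterns that compress well. In contrast, IDs such as UUID-4 are essentially random, compress poorly, and noticeably slow Lucene down.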
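A short Python illustration of the ID styles mentioned above, using only the standard library. The helper `padded_id` and the width of 10 are hypothetical choices for the example.

```python
import uuid


def padded_id(n, width=10):
    """Zero-padded sequential ID; lexicographic order matches numeric order."""
    return str(n).zfill(width)


ids = [padded_id(n) for n in (1, 2, 10, 100)]
print(ids)

# Without padding, "10" would sort before "2"; with padding the order holds.
print(sorted(ids) == ids)

# uuid1 is time-based (sequential pattern); uuid4 is random.
print(uuid.uuid1().version, uuid.uuid4().version)
```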

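f.Delaying shard allocation
Adjust the delayed_timeout parameter; the default wait time can be set globally or changed at the index level:

eg:PUT /_all/_settings -- using the _all index name applies this setting to every index in the cluster
{
"settings": {
"index.unassigned.node_left.delayed_timeout": "5m" -- the default delay is changed to 5 minutes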
}
}

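g.Rolling restart
1.If possible, stop indexing new data. This is not always feasible, but it helps speed up recovery.
2.Disable shard allocation. This prevents Elasticsearch from rebalancing missing shards until you tell it to. If you know the maintenance window will be short, this is a good idea. You can disable allocation like this:

PUT /_cluster/settings
{
"transient" : {
"cluster.routing.allocation.enable" : "none"
}
}
3.Shut down a single node.
4.Perform the maintenance/upgrade.
5.Restart the node and confirm that it has joined the cluster.
6.Re-enable shard allocation with the following command:
PUT /_cluster/settings
{
"transient" : {
"cluster.routing.allocation.enable" : "all"
}
}
Shard rebalancing may take some time. Wait until the cluster has returned to green status before continuing.
7.Repeat steps 2 through 6 for the remaining nodes.
8.At this point you can safely resume indexing (if you had stopped it), although waiting until the cluster is fully balanced before resuming will also help processing speed.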
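The per-node loop in the rolling restart can be written out as an ordered plan. This Python sketch only generates the sequence of actions and sends nothing to a cluster. The function names and the node names `node-1` and `node-2` are hypothetical, and the manual steps (shutdown, maintenance, restart) are represented as a plain description.

```python
ALLOCATION_ENDPOINT = "PUT /_cluster/settings"


def allocation_body(mode):
    """Request body that sets shard allocation to 'none' or 'all'."""
    return {"transient": {"cluster.routing.allocation.enable": mode}}


def rolling_restart_plan(nodes):
    """Return (node, action, body) tuples in the order described above."""
    plan = []
    for node in nodes:
        plan.append((node, ALLOCATION_ENDPOINT, allocation_body("none")))
        plan.append((node, "shut down, maintain, restart, confirm rejoin", None))
        plan.append((node, ALLOCATION_ENDPOINT, allocation_body("all")))
        plan.append((node, "GET _cluster/health?wait_for_status=green", None))
    return plan


for node, action, body in rolling_restart_plan(["node-1", "node-2"]):
    print(node, action, body or "")
```

Waiting for green after each node keeps the next shutdown from starting while shards are still recovering.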

h.Backing up the cluster: (https://elasticsearch.cn/book/elasticsearch_definitive_guide_2.x/backing-up-your-cluster.html)
Use the snapshot API
Create a repository:
PUT _snapshot/my_backup -- give the repository a name; in this example it is my_backup
{
"type": "fs", -- the repository type is a shared file system
"settings": {
"location": "/mount/backups/my_backup" -- finally, a mounted device is given as the destination
}
}
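Snapshot all open indices
Snapshot specific indices
List information about snapshots
Delete a snapshot
Monitor snapshot progress
Cancel a snapshot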

i.Restoring from a snapshot (https://elasticsearch.cn/book/elasticsearch_definitive_guide_2.x/_restoring_from_a_snapshot.html#_restoring_from_a_snapshot)
