Troubleshooting an Elasticsearch 5.0.1 cluster: a summary of approaches


1. First, check the overall cluster health

# curl -XGET http://10.27.35.94:9200/_cluster/health?pretty
{
"cluster_name" : "yunva-es",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 7,
"number_of_data_nodes" : 6,
"active_primary_shards" : 85,
"active_shards" : 157,
"relocating_shards" : 0,
"initializing_shards" : 6,
"unassigned_shards" : 19,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 86.26373626373626
}

 

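A red status means at least one primary shard is unassigned, which usually indicates that a node has gone down. Find the affected index shards and nodes.

In the example below, both copies of shard 0 of the voice:live:logout index are unassigned, which means the shard is down. To see where it used to live, check the shard allocation from when the cluster was healthy (it is worth recording the allocation periodically).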

# curl 10.26.241.237:9200/_cat/shards
....
voice:live:logout 2 p STARTED 428 62.9kb 10.27.65.121 yunva_etl_es6
voice:live:logout 2 r STARTED 428 62.9kb 10.26.241.239 yunva_etl_es3
voice:live:logout 4 r STARTED 444 99.8kb 10.45.150.115 yunva_etl_es9
voice:live:logout 4 p STARTED 444 99.8kb 10.25.177.47 yunva_etl_es11
voice:live:logout 1 p STARTED 419 97.7kb 10.26.241.239 yunva_etl_es3
voice:live:logout 1 r STARTED 419 97.7kb 10.25.177.47 yunva_etl_es11
voice:live:logout 3 p STARTED 440 73.2kb 10.27.35.94 yunva_etl_es7
voice:live:logout 3 r STARTED 440 73.2kb 10.27.78.228 yunva_etl_es5
voice:live:logout 0 p UNASSIGNED 
voice:live:logout 0 r UNASSIGNED
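
On a cluster with many indices it is easier to filter the listing on the state column than to read it by eye. The following is a minimal sketch, assuming the default `_cat/shards` column order shown above (index, shard, prirep, state, ...); the function name `unassigned_shards` is made up for illustration.

```shell
#!/bin/bash
# Print "index shard prirep" for every shard whose state column is UNASSIGNED.
# Reads _cat/shards output on stdin; assumes the default column order
# (index, shard, prirep, state, ...), as in the listing above.
unassigned_shards() {
    awk '$4 == "UNASSIGNED" { print $1, $2, $3 }'
}

# Example usage against a live node (address taken from the example above):
#   curl -s 'http://10.26.241.237:9200/_cat/shards' | unassigned_shards
```

For the listing above this would print the primary and the replica of shard 0 of voice:live:logout.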

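Script for recording the shard allocation periodically

# cat es_shard.sh
#!/bin/bash
# Append a timestamp and the current shard allocation to the log file

date +"%Y-%m-%d %H:%M:%S" >> /data/es_shards.txt
curl -s -XGET http://10.26.241.237:9200/_cat/shards >> /data/es_shards.txt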
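
Once /data/es_shards.txt has some history, it can answer the question "which node held this shard before it became unassigned?". The sketch below is one possible way to do that lookup, assuming the file contains timestamp lines followed by default-format `_cat/shards` lines as written by the script above; the function name `last_known_node` is hypothetical.

```shell
#!/bin/bash
# Print the most recent node on which each copy (p = primary, r = replica)
# of a shard was recorded as STARTED in the snapshot file.
# Usage: last_known_node FILE INDEX SHARD
# Note: if an index has several replicas, only the last one seen is kept.
last_known_node() {
    awk -v idx="$2" -v shard="$3" '
        $1 == idx && $2 == shard && $4 == "STARTED" { last[$3] = $7 " " $8 }
        END {
            if ("p" in last) print "p", last["p"]
            if ("r" in last) print "r", last["r"]
        }
    ' "$1"
}

# Example usage:
#   last_known_node /data/es_shards.txt voice:live:logout 0
```

Later entries in the file overwrite earlier ones, so the output reflects the newest snapshot in which the shard copy was still running.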

 

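2. Query the health of each node in turn. If a node does not respond, or responds very slowly, it may have run out of memory and should be restarted directly.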

# curl -XGET http://IP:9200/_cluster/health?pretty

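A typical sign of an out-of-memory failure is that files like the following appear in the elasticsearch/bin directory (a JVM crash log and a heap dump):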

hs_err_pid27186.log
java_pid1151.hprof

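3. Add Zabbix monitoring
① Restart Elasticsearch automatically if it goes down (note: it must not be started as the root user)

Script that starts elasticsearch automatically: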

# cat /usr/local/zabbix-agent/scripts/start_es.sh

#!/bin/bash
# If any elasticsearch process exists, kill it
source /etc/profile

pids=$(ps -ef | grep elasticsearch | grep -v grep | awk '{print $2}')
if [ -n "$pids" ]; then
    /bin/kill $pids
fi
# Remove old heap dumps
rm -f /data/elasticsearch-5.0.1/bin/java_pid*.hprof
# Start it as the non-root user yunva
su yunva -c "cd /data/elasticsearch-5.0.1/bin && /bin/bash elasticsearch &"

 

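② If hs_err*.log or hprof files exist, delete them and restart the node (this can trigger the start_es.sh script directly)

Monitoring item for Elasticsearch errors (the patterns are quoted so that the shell does not expand them before find sees them):
UserParameter=es_debug,sudo /bin/find /data/elasticsearch-5.0.1/bin/ -name 'hs_err_pid*.log' -o -name 'java_pid*.hprof'|wc -l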

 

Monitoring item for Java errors:

UserParameter=java_error,sudo /bin/find /home -name 'hs_err_pid*.log' -o -name 'java_pid*.hprof' -o -name jvm.log|wc -l

③ Restart a node if curl -XGET http://IP:9200/_cluster/health?pretty takes longer than 30 seconds to respond

# --max-time 30 makes curl give up on a node after 30 seconds
for IP in 10.28.50.131 10.26.241.239 10.25.135.215 10.26.241.237 10.27.78.228 10.27.65.121 10.27.35.94 10.30.136.143 10.174.12.230 10.45.150.115 10.25.177.47
do
curl -s --max-time 30 -XGET "http://$IP:9200/_cluster/health?pretty"
done
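
To turn the 30-second rule into something a script can act on, curl's exit code can be used: 7 means the connection could not be established and 28 means the time limit was exceeded. The following is a rough sketch rather than a finished check; the function names `classify_curl_exit` and `check_node` and the state labels are invented for illustration.

```shell
#!/bin/bash
# Map a curl exit code to a coarse node state.
# 0 = got a response, 7 = could not connect, 28 = --max-time exceeded.
classify_curl_exit() {
    case "$1" in
        0)  echo "ok" ;;
        7)  echo "down" ;;
        28) echo "slow" ;;
        *)  echo "error" ;;
    esac
}

# Query one node with a 30-second limit and print "IP state".
check_node() {
    local ip="$1" rc=0
    curl -s -o /dev/null --max-time 30 "http://$ip:9200/_cluster/health" || rc=$?
    echo "$ip $(classify_curl_exit "$rc")"
}

# Example usage:
#   for IP in 10.28.50.131 10.26.241.239; do check_node "$IP"; done
```

A node reported as "slow" or "down" is a candidate for the restart described in this step. Note that "ok" only means the node answered in time; it says nothing about the cluster status in the response.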

 

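4. Configuration tuning:

# The following settings reduce the wasted disk I/O caused by shards being redistributed when an ES node goes down briefly or restarts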

discovery.zen.fd.ping_timeout: 300s
discovery.zen.fd.ping_retries: 8
discovery.zen.fd.ping_interval: 30s
discovery.zen.ping_timeout: 300s
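
A complementary approach, which is not part of the configuration above, is the index-level setting `index.unassigned.node_left.delayed_timeout`. It delays the reallocation of replicas after a node leaves the cluster and is what the `delayed_unassigned_shards` counter in the health output refers to. The sketch below only builds the request body; the helper name `delayed_allocation_body` and the value 5m are illustrative assumptions, not recommendations from the original author.

```shell
#!/bin/bash
# Build the JSON body for the index-level delayed allocation setting.
# Usage: delayed_allocation_body TIMEOUT   (for example 5m)
delayed_allocation_body() {
    printf '{"settings":{"index.unassigned.node_left.delayed_timeout":"%s"}}\n' "$1"
}

# Example usage (IP is a placeholder; 5m is an arbitrary illustrative value):
#   curl -XPUT "http://IP:9200/_all/_settings" -d "$(delayed_allocation_body 5m)"
```

Unlike the discovery.zen settings, this can be changed on a running cluster without editing elasticsearch.yml.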

 

5. Cluster status check
UserParameter=es_cluster_status,curl -sXGET http://10.11.117.18:9200/_cluster/health/?pretty | grep "status"|awk -F '[ "]+' '{print $4}'|grep -c 'green'
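
The item above returns 1 when the cluster is green and 0 otherwise, so yellow and red cannot be told apart. If that distinction matters, the status can be mapped to a number instead. This is a minimal sketch under that assumption; the function names `health_status` and `status_code` and the numeric values are hypothetical.

```shell
#!/bin/bash
# Extract the "status" value from _cluster/health JSON read on stdin.
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Map the status to a number that is easy to trigger on:
# green = 0, yellow = 1, red = 2, anything else (including no answer) = 3.
status_code() {
    case "$1" in
        green)  echo 0 ;;
        yellow) echo 1 ;;
        red)    echo 2 ;;
        *)      echo 3 ;;
    esac
}

# Example usage:
#   status_code "$(curl -s 'http://10.11.117.18:9200/_cluster/health' | health_status)"
```

A trigger could then fire at 1 for a warning and at 2 or above for a critical alert.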

I will add other useful methods here as I come across them.

 

After an index has been modified, the index pattern must be refreshed; otherwise the changes will not be recognized correctly.

 

