Troubleshooting and Resolving a Red Health Status on a Production ELK Cluster


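A data analytics platform that had been running normally stopped generating log index data. Nobody noticed for a while, so this went on for many days. Current state: a single machine running a single-node Elasticsearch (ES below) cluster, with 1000+ shards and about 200 GB of data.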

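Troubleshooting

Checking server memory and CPU

Use top to check the server's CPU and memory usage. At the time, the ES process on my server was using more than 90% of the CPU, so something was clearly wrong.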

top

Memory usage was also extremely high: the server has 8 GB of RAM and only about 150 MB was free, which again pointed to ES.

free

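ES cluster status

Checking the ES cluster health shows that status is red, which means at least one primary shard is unavailable. In my case, historical data could still be queried, but no new index data could be written.

curl "http://localhost:9200/_cluster/health?pretty"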

{
  "cluster_name" : "elasticsearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 663,
  "active_shards" : 663,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 6,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.10313901345292
}
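
The active_shards_percent_as_number value can be cross-checked by hand. The following is a minimal sketch, assuming the percentage is simply active shards divided by all shards (active plus unassigned plus initializing), which is consistent with the numbers above. The function name active_percent is only illustrative.

```shell
# Hypothetical helper: percentage of active shards.
# Usage: active_percent ACTIVE UNASSIGNED [INITIALIZING]
active_percent() {
  awk -v a="$1" -v u="$2" -v i="${3:-0}" \
    'BEGIN { printf "%.2f\n", 100 * a / (a + u + i) }'
}

# Values taken from the health output above: 663 active, 6 unassigned.
active_percent 663 6
```

With 663 active and 6 unassigned shards this prints 99.10, in line with the 99.103... reported by ES.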

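Checking the status of each index shows that most of the indices are red and therefore unavailable. Because too many indices were open, ES consumed a large amount of CPU and memory, which left Logstash unable to work. As a result, no new index data could be created and log data was lost.

curl -XGET "http://localhost:9200/_cat/indices?v"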

health status index          pri rep docs.count docs.deleted store.size pri.store.size
red    open   jr-2016.12.20    3   0
red    open   jr-2016.12.21    3   0
red    open   jr-2016.12.22    3   0
red    open   jr-2016.12.23    3   0
red    open   jr-2016.12.24    3   0
red    open   jr-2016.12.25    3   0
red    open   jr-2016.12.26    3   0
red    open   jr-2016.12.27    3   0
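
With hundreds of indices, scanning this listing by eye is tedious. The sketch below filters it down to the names of the red indices. It assumes the column order shown above (health, status, index, ...), and the function name red_indices is made up for this example.

```shell
# Hypothetical helper: print the names of red indices.
# Reads `_cat/indices?v` output on stdin (columns: health status index ...).
red_indices() {
  awk '$1 == "red" { print $3 }'
}

# Assumed usage against the local node from this article:
#   curl -s "http://localhost:9200/_cat/indices?v" | red_indices
```

The header row is skipped automatically because its first column is the word health rather than red.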

Query failures caused by unavailable shards

Exception thrown when querying ES:

[2018-08-06 18:27:24,553][DEBUG][action.search            ] [Godfrey Calthrop] All shards failed for phase: [query]
[jr-2018.08.06][[jr-2018.08.06][2]] NoShardAvailableActionException[null]
    at org.elasticsearch.action.search.AbstractSearchAsyncAction.start(AbstractSearchAsyncAction.java:129)
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:115)
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:47)
    at org.elasticsearch.action.support.TransportAction.doExecute(TransportAction.java:149)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:137)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:85)
    at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
    at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:52)
    at org.elasticsearch.rest.BaseRestHandler$HeadersAndContextCopyClient.doExecute(BaseRestHandler.java:83)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
    at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:582)
    at org.elasticsearch.rest.action.search.RestSearchAction.handleRequest(RestSearchAction.java:85)
    at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:54)
    at org.elasticsearch.rest.RestController.executeHandler(RestController.java:205)
    at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:166)
    at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:128)
    at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:86)
    at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:449)
    at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:61)
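
The exception names shard 2 of jr-2018.08.06. To list every shard that currently has no node, the _cat/shards API can be filtered on its state column. This is a rough sketch that assumes the default column order (index, shard, prirep, state, ...); unassigned_shards is a hypothetical helper name.

```shell
# Hypothetical helper: list unassigned shards as "index shard p|r".
# Reads `_cat/shards` output on stdin, assuming the default column
# order: index shard prirep state ...
unassigned_shards() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3 }'
}

# Assumed usage:
#   curl -s "http://localhost:9200/_cat/shards" | unassigned_shards
```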

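Resolution

The investigation above suggests that too many historical indices were left open, which drove ES CPU and memory usage high enough to make the cluster unusable.

# Close indices that are no longer needed to reduce memory usage
curl -XPOST "http://localhost:9200/index_name/_close"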
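
Closing indices one by one does not scale to hundreds of daily indices. The sketch below picks out indices whose name ends in a date older than a cutoff and closes each of them. It assumes names of the form prefix-YYYY.MM.DD, as with the jr-* indices above. Both function names and the cutoff date are made up for illustration.

```shell
# Hypothetical helper: keep index names whose trailing date (YYYY.MM.DD)
# is earlier than the cutoff. Reads one index name per line on stdin.
# Usage: indices_before 2018.07.01
indices_before() {
  awk -v cutoff="$1" '
    match($1, /[0-9][0-9][0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]$/) {
      # Zero-padded dates compare correctly as plain strings.
      if ((substr($1, RSTART) "") < (cutoff "")) print $1
    }'
}

# Hypothetical helper: close every index named on stdin.
# Usage: close_indices http://localhost:9200
close_indices() {
  while IFS= read -r name; do
    curl -s -XPOST "$1/$name/_close"
    echo
  done
}

# Assumed usage (the cutoff date is an example, not from the article):
#   curl -s "http://localhost:9200/_cat/indices?h=index" \
#     | indices_before 2018.07.01 \
#     | close_indices http://localhost:9200
```

Running only the first two stages of the pipeline prints the candidate list without closing anything, which is a reasonable dry run before touching the cluster.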

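A side note

After closing the non-hot indices, the cluster health was still red. It finally occurred to me that a single red index could be affecting the overall cluster status, and sure enough, that was the case:

curl -XGET "http://10.252.148.85:9200/_cluster/health?level=indices"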

{
	"cluster_name": "elasticsearch",
	"status": "red",
	"timed_out": false,
	"number_of_nodes": 1,
	"number_of_data_nodes": 1,
	"active_primary_shards": 660,
	"active_shards": 660,
	"relocating_shards": 0,
	"initializing_shards": 0,
	"unassigned_shards": 9,
	"delayed_unassigned_shards": 0,
	"number_of_pending_tasks": 0,
	"number_of_in_flight_fetch": 0,
	"task_max_waiting_in_queue_millis": 0,
	"active_shards_percent_as_number": 98.65470852017937,
	"indices": {
		"jr-2018.08.06": {
			"status": "red",
			"number_of_shards": 3,
			"number_of_replicas": 0,
			"active_primary_shards": 0,
			"active_shards": 0,
			"relocating_shards": 0,
			"initializing_shards": 0,
			"unassigned_shards": 3
		}
	}
}

The fix is to delete this index. Its contents were dirty data produced while I was troubleshooting, so it can simply be deleted.

curl -XDELETE 'http://10.252.148.85:9200/jr-2018.08.06'

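Summary

When ES runs as a single node, keep an eye on index status and server monitoring, and clean up or close unnecessary indices promptly to avoid this situation. On the road of technical growth, I walk with you.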

