etcd error: failed to send out heartbeat on time


Error output:

2019-06-05 02:09:03.008888 W | rafthttp: health check for peer 8816eaa680e63c73 could not connect: dial tcp 192.168.49.138:2380: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2019-06-05 02:09:03.010827 W | rafthttp: health check for peer 8816eaa680e63c73 could not connect: dial tcp 192.168.49.138:2380: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2019-06-05 02:09:04.631367 I | rafthttp: peer 8816eaa680e63c73 became active
2019-06-05 02:09:04.631405 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream MsgApp v2 reader)
2019-06-05 02:09:04.632227 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream Message reader)
2019-06-05 02:09:04.634697 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream MsgApp v2 writer)
2019-06-05 02:09:04.635154 I | rafthttp: established a TCP streaming connection with peer 8816eaa680e63c73 (stream Message writer)
2019-06-05 02:09:04.961320 I | etcdserver: updating the cluster version from 3.0 to 3.3
2019-06-05 02:09:04.965052 N | etcdserver/membership: updated the cluster version from 3.0 to 3.3
2019-06-05 02:09:04.965231 I | etcdserver/api: enabled capabilities for version 3.3

2019-06-05 02:20:39.344648 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 237.022208ms, to a3d1fb0d28ed2953)
2019-06-05 02:20:39.344676 W | etcdserver: server is likely overloaded
2019-06-05 02:20:39.344685 W | etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for 237.127928ms, to 8816eaa680e63c73)
2019-06-05 02:20:39.344689 W | etcdserver: server is likely overloaded

The key message is: failed to send out heartbeat on time (exceeded the 100ms timeout for 401.80886ms)
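
To see how often the heartbeat is late and by how much, the warning lines can be pulled out of the log. The following is a minimal Python sketch and not part of etcd: the names parse_heartbeat_warning and HeartbeatDelay are made up for illustration, and the pattern is modeled only on the log lines shown above. It assumes each duration is a single number followed by one unit, such as 237.022208ms.

```python
import re
from typing import NamedTuple, Optional

# Conversion factors to milliseconds for the duration units we expect to see.
_UNIT_TO_MS = {"ns": 1e-6, "µs": 1e-3, "us": 1e-3, "ms": 1.0, "s": 1000.0}
_DURATION = r"([\d.]+)(ns|µs|us|ms|s)"

# Modeled on the warning text in the log excerpt; the ", to <peer>" part is optional.
_HEARTBEAT_RE = re.compile(
    r"failed to send out heartbeat on time \(exceeded the "
    + _DURATION + r" timeout for " + _DURATION
    + r"(?:, to ([0-9a-f]+))?\)"
)


class HeartbeatDelay(NamedTuple):
    timeout_ms: float     # the timeout named in the message
    exceeded_ms: float    # the duration reported after "for"
    peer: Optional[str]   # follower ID, if the line includes one


def parse_heartbeat_warning(line: str) -> Optional[HeartbeatDelay]:
    """Return the parsed warning, or None if the line is not a heartbeat warning."""
    m = _HEARTBEAT_RE.search(line)
    if m is None:
        return None
    timeout_val, timeout_unit, exceeded_val, exceeded_unit, peer = m.groups()
    return HeartbeatDelay(
        float(timeout_val) * _UNIT_TO_MS[timeout_unit],
        float(exceeded_val) * _UNIT_TO_MS[exceeded_unit],
        peer,
    )
```

Running this over every line of the log and grouping the results by peer shows whether the delay affects one follower or all of them; delays to all followers at the same instant, as in the excerpt, point at the leader itself.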

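Heartbeat warnings like this are mainly related to the following factors (disk speed, CPU performance, and network instability):

  • etcd uses the Raft algorithm, in which the leader periodically sends a heartbeat to every follower. If the leader goes two heartbeat intervals without sending a heartbeat to a follower, etcd prints this warning. The usual cause is a slow disk: the leader typically attaches some metadata to the heartbeat and must persist that data to disk before it can send. The disk write may be competing with other applications, or the disk may simply be slow because it is virtualized or a SATA drive; in that case only faster disk hardware fixes the problem. The metric wal_fsync_duration_seconds that etcd exposes to Prometheus (full name etcd_disk_wal_fsync_duration_seconds) shows how long WAL fsync calls take. It should normally stay below 10ms; the etcd documentation applies that threshold to the 99th percentile.

  • The second cause is insufficient CPU capacity. If monitoring confirms that CPU utilization is high, move etcd to a better machine, then use cgroups to give the etcd process exclusive use of certain cores, or raise the priority of the etcd process.

  • The third cause is a slow network. If Prometheus shows poor network quality, such as high latency or a high packet loss rate, moving etcd to an uncongested network resolves the problem. If etcd is deployed across data centers, however, high latency is unavoidable; in that case tune heartbeat-interval to the RTT between the data centers, and set election-timeout to at least 5 times heartbeat-interval.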
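
The relationship between RTT, heartbeat-interval and election-timeout can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name suggest_timing is made up, using the measured RTT directly as the heartbeat interval is an assumption (the text above only says to tune it to the RTT), and the default factor of 10 simply mirrors etcd's default values of 100ms and 1000ms. The only rule taken from the text is that election-timeout must be at least 5 times heartbeat-interval.

```python
import math

ETCD_DEFAULT_HEARTBEAT_MS = 100  # etcd's default --heartbeat-interval
MIN_ELECTION_FACTOR = 5          # election-timeout >= 5 x heartbeat-interval


def suggest_timing(rtt_ms: float, election_factor: int = 10) -> dict:
    """Return illustrative heartbeat/election values (in ms) for a given peer RTT."""
    if rtt_ms <= 0:
        raise ValueError("rtt_ms must be positive")
    if election_factor < MIN_ELECTION_FACTOR:
        raise ValueError("election-timeout must be at least 5x heartbeat-interval")
    # Assumption: take the RTT as the heartbeat interval, never below the default.
    heartbeat = max(ETCD_DEFAULT_HEARTBEAT_MS, math.ceil(rtt_ms))
    return {
        "heartbeat-interval": heartbeat,
        "election-timeout": heartbeat * election_factor,
    }


def as_flags(timing: dict) -> str:
    """Render the values as etcd command-line flags."""
    return " ".join(f"--{name}={value}" for name, value in timing.items())
```

For example, as_flags(suggest_timing(250)) yields flags with a 250ms heartbeat interval and a 2500ms election timeout. All members of a cluster should be started with the same values.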

Reference
https://blog.csdn.net/linux_player_c/article/details/79875806

