Our project recently ran into some problems using Eureka on Kubernetes. I found a good article online explaining how Eureka's self-preservation mechanism works; my summary follows:
Eureka's self-preservation feature is mainly used to reduce inconsistency problems under network partitions or otherwise unstable network conditions.
- Why Eureka self-preservation kicks in:
While running, Eureka Server tracks the heartbeats it receives and checks whether the count in the last minute has fallen below 85% of the expected number (the expected number itself is recalculated every 15 minutes by default). If it has, Eureka Server protects the current instance registrations and raises a warning. Once in self-preservation mode, Eureka Server tries to protect the information in its service registry and no longer deletes registry data; in other words, it will not deregister any microservice. For example, with 10 registered instances the server expects 2 * 10 = 20 heartbeats per minute, so receiving fewer than 20 * 0.85 = 17 renewals in a minute trips self-preservation.
- The Kubernetes environment
In a Kubernetes environment, however, a node's kubelet going offline can easily trigger self-preservation. The sequence is as follows:
- When the kubelet goes offline, a large number of Pods end up in the Unknown state. To maintain the replica count specified by the Deployment, Kubernetes starts the services on other nodes, and they register with Eureka. Self-preservation is not triggered at this point.
- When the kubelet is restarted and the node becomes Ready again, Kubernetes finds that the number of Pods exceeds the desired count and terminates the previously Unknown Pods. A large number of services then go offline at once, which triggers self-preservation.
- A Eureka server in self-preservation mode no longer synchronizes service information, and it no longer stays in sync with its peer instance either.
This is a fairly fundamental problem: once it happens, the only remedy is to manually delete the Eureka instance and let it be rebuilt, restoring normal operation.
So in a Kubernetes environment, disable self-preservation and let Eureka stay in sync with the actual state of the services.
The current solution
- Eureka Server side: disable self-preservation and, as needed, configure the interval at which Eureka Server evicts stale instances.
eureka.server.enable-self-preservation # set to false to disable self-preservation
eureka.server.eviction-interval-timer-in-ms # eviction interval (in milliseconds; default is 60 * 1000)
- Eureka Client side: enable health checks and, as needed, configure the lease renewal interval and lease expiration duration.
eureka.client.healthcheck.enabled # enable health checks (requires the spring-boot-starter-actuator dependency)
eureka.instance.lease-renewal-interval-in-seconds # lease renewal interval (default 30 seconds)
eureka.instance.lease-expiration-duration-in-seconds # lease expiration duration (default 90 seconds)
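For reference, here is a minimal sketch of how these settings might be combined in the two application.properties files. The concrete values are illustrative assumptions, not recommendations from the original article.

On the Eureka Server:

# Disable self-preservation so stale instances keep getting evicted even after mass deregistrations.
eureka.server.enable-self-preservation=false
# Run the eviction task every 15 seconds instead of the default 60 (illustrative value).
eureka.server.eviction-interval-timer-in-ms=15000

On each Eureka Client:

# Report actuator health through the heartbeat (requires spring-boot-starter-actuator).
eureka.client.healthcheck.enabled=true
# Renew every 10 seconds and expire after 30 (illustrative values; defaults are 30 and 90 seconds).
eureka.instance.lease-renewal-interval-in-seconds=10
eureka.instance.lease-expiration-duration-in-seconds=30

Shortening the lease settings is only reasonable because self-preservation is off; as the article's conclusion notes, the heartbeat interval should not be tuned while self-preservation is enabled.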
The original article follows.
Eureka is an AP system in terms of CAP theorem which in turn makes the information in the registry inconsistent between servers during a network partition. The self-preservation feature is an effort to minimize this inconsistency.
Defining self-preservation
Self-preservation is a feature where Eureka servers stop expiring instances from the registry when they do not receive heartbeats (from peers and client microservices) beyond a certain threshold.
Let’s try to understand this concept in detail.
Starting with a healthy system
Consider the following healthy system.
The healthy system — before encountering network partition
Suppose that all the microservices are in a healthy state and registered with the Eureka server. In case you are wondering why: that is because Eureka instances register with and send heartbeats only to the very first server configured in the service-url list, i.e.
eureka.client.service-url.defaultZone=server1,server2
Eureka servers replicate the registry information with adjacent peers, and the registry indicates that all the microservice instances are in the UP state. Also suppose that instance 2 invokes instance 4 after discovering it from the Eureka registry.
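To make the abbreviated setting above concrete, here is a hedged example with hypothetical hostnames and the default port; with this list, clients register with and heartbeat against server1, falling back to server2 only if server1 is unreachable:

eureka.client.service-url.defaultZone=http://server1:8761/eureka/,http://server2:8761/eureka/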
Encountering a network partition
Assume a network partition happens and the system transitions to the following state.
During network partition - enters self-preservation
Due to the network partition, instances 4 and 5 lose connectivity with the servers; instance 2, however, still has connectivity to instance 4. The Eureka server evicts instances 4 and 5 from the registry since it no longer receives their heartbeats. It then observes that it has suddenly lost more than 15% of the heartbeats, so it enters self-preservation mode.
From then onward, the Eureka server stops expiring instances in the registry, even if the remaining instances go down.
During self-preservation - stops expiring instances
Instance 3 has gone down, but it remains active in the server registry. The servers do, however, still accept new registrations.
The rationale behind self-preservation
The self-preservation feature can be justified by the following two reasons:
- Servers not receiving heartbeats could be due to a poor network partition (i.e. it does not necessarily mean the clients are down), which could be resolved soon.
- Even though connectivity is lost between the servers and some clients, the clients might still have connectivity with each other; i.e. instance 2 still has connectivity to instance 4 during the network partition, as in the diagram above.
Configurations (with defaults)
Listed below are the configurations that can directly or indirectly impact self-preservation behavior.
eureka.instance.lease-renewal-interval-in-seconds = 30
Indicates the frequency at which the client sends heartbeats to the server to indicate that it is still alive. It's not advisable to change this value, since self-preservation assumes that heartbeats are always received at intervals of 30 seconds.
eureka.instance.lease-expiration-duration-in-seconds = 90
Indicates the duration the server waits, after the last received heartbeat, before it can evict an instance from its registry. This value should be greater than lease-renewal-interval-in-seconds. Setting it too long impacts the precision of the actual-heartbeats-per-minute calculation described in the next section, since the liveliness of the registry depends on this value. Setting it too small could make the system intolerant of temporary network glitches.
eureka.server.eviction-interval-timer-in-ms = 60 * 1000
A scheduler runs at this frequency and evicts instances from the registry if their leases have expired, as configured by lease-expiration-duration-in-seconds. Setting this value too long will delay the system entering self-preservation mode.
eureka.server.renewal-percent-threshold = 0.85
This value is used to calculate the expected heartbeats per minute as described in the next section.
eureka.server.renewal-threshold-update-interval-ms = 15 * 60 * 1000
A scheduler runs at this frequency and recalculates the expected heartbeats per minute as described in the next section.
eureka.server.enable-self-preservation = true
Last but not least, self-preservation can be disabled if required.
Making sense of configurations
Eureka server enters self-preservation mode if the actual number of heartbeats in the last minute is less than the expected number of heartbeats per minute.
Expected number of heartbeats per minute
Let's look at how the expected-heartbeats-per-minute threshold is calculated. The Netflix code assumes that heartbeats are always received at intervals of 30 seconds for this calculation.
Suppose the number of registered application instances at some point in time is N and the configured renewal-percent-threshold is 0.85.
- Number of heartbeats expected from one instance per minute = 2
- Number of heartbeats expected from N instances per minute = 2 * N
- Expected minimum heartbeats per minute = 2 * N * 0.85
Since N is a variable, 2 * N * 0.85 is recalculated every 15 minutes by default (or at the frequency configured by renewal-threshold-update-interval-ms).
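As a quick sanity check on this arithmetic, here is a minimal Java sketch; the class and variable names are illustrative, not the actual Netflix Eureka code:

public class SelfPreservationThreshold {
    public static void main(String[] args) {
        int n = 10;                            // N: registered instances at calculation time
        double renewalPercentThreshold = 0.85; // eureka.server.renewal-percent-threshold
        int heartbeatsPerInstancePerMin = 2;   // one heartbeat every 30 seconds is assumed

        int expectedPerMin = heartbeatsPerInstancePerMin * n;             // 20
        int threshold = (int) (expectedPerMin * renewalPercentThreshold); // 17

        System.out.println("Expected heartbeats per minute: " + expectedPerMin);
        System.out.println("Self-preservation threshold: " + threshold);
        // Fewer than 17 heartbeats received in the last minute trips self-preservation.
    }
}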
Actual number of heartbeats in the last minute
This is calculated by a scheduler that runs at a frequency of one minute.
Also, as described above, two schedulers run independently to calculate the actual and expected numbers of heartbeats. However, it is a third scheduler, the EvictionTask, that compares these two values and identifies whether the system is in self-preservation mode. This scheduler runs at a frequency of eviction-interval-timer-in-ms and evicts expired instances, but it checks whether the system has reached self-preservation mode (by comparing actual and expected heartbeats) before evicting.
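To make that check concrete, here is a hedged Java sketch of the guard the eviction task consults, modeled loosely on isLeaseExpirationEnabled() in Netflix Eureka's AbstractInstanceRegistry; the fields are simplified stand-ins for the real counters and configuration:

public class EvictionGuard {
    boolean selfPreservationEnabled = true; // eureka.server.enable-self-preservation
    int renewsPerMinThreshold = 17;         // expected heartbeats/min * renewal-percent-threshold
    int renewsInLastMin = 12;               // measured by the one-minute renewal counter

    // The eviction task only expires leases when this returns true.
    boolean isLeaseExpirationEnabled() {
        if (!selfPreservationEnabled) {
            return true; // feature disabled: expired leases are always evicted
        }
        // At or below the threshold, the server is in self-preservation and stops evicting.
        return renewsPerMinThreshold > 0 && renewsInLastMin > renewsPerMinThreshold;
    }

    public static void main(String[] args) {
        System.out.println(new EvictionGuard().isLeaseExpirationEnabled()); // prints false
    }
}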
The Eureka dashboard also performs this comparison every time you launch it, in order to display the message '…INSTANCES ARE NOT BEING EXPIRED JUST TO BE SAFE'.
Conclusion
- My experience with self-preservation is that it is a false positive most of the time: it incorrectly assumes a few down microservice instances to be a poor network partition.
- Self-preservation never expires until the down microservices are brought back (or the network glitch is resolved).
- If self-preservation is enabled, we cannot fine-tune the instance heartbeat interval, since self-preservation assumes heartbeats are received at intervals of 30 seconds.
- Unless these kinds of network glitches are common in your environment, I would suggest turning it off (even though most people recommend keeping it on).