outlier detection
In the field of anomaly detection, one often needs to decide whether a newly observed point belongs to the same distribution as the existing observations (in which case it is called an inlier) or should be considered different (an outlier). An outlier is anomalous data, but not necessarily an erroneous data point.
In Envoy, outlier detection is the process of dynamically determining whether some hosts in an upstream cluster are behaving abnormally and removing them from the healthy load-balancing set. Outlier detection can be enabled together with, or independently of, active health checking, and the two form the basis of a complete upstream health-checking solution.
The concepts are not explained in depth here; refer to the official documentation for details.
Detection types
- Consecutive 5xx responses (consecutive_5xx)
- Consecutive gateway failures (consecutive_gateway_failure)
- Consecutive local origin failures (consecutive_local_origin_failure)
See the official outlier detection documentation for more details.
Testing outlier detection
Note: this test runs on a single machine only; for a fuller picture, verify against a real environment.
Environment setup
Use docker-compose to simulate five backend nodes:
version: '3'
services:
  envoy:
    image: envoyproxy/envoy-alpine:v1.15-latest
    environment:
      - ENVOY_UID=0
    ports:
      - 80:80
      - 443:443
      - 82:9901
    volumes:
      - ./envoy.yaml:/etc/envoy/envoy.yaml
    networks:
      envoymesh:
        aliases:
          - envoy
    depends_on:
      - webserver1
      - webserver2
  webserver1:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
          - myservice
          - webservice
    expose:
      - 90
  webserver2:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
          - myservice
          - webservice
    expose:
      - 90
  webserver3:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
          - myservice
          - webservice
    expose:
      - 90
  webserver4:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
          - myservice
          - webservice
    expose:
      - 90
  webserver5:
    image: sealloong/envoy-end:latest
    networks:
      envoymesh:
        aliases:
          - myservice
          - webservice
    expose:
      - 90
networks:
  envoymesh: {}
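With both files in the same directory, the stack can be brought up and sanity-checked from the shell; a minimal sketch, assuming the port mappings above:

# start Envoy and the five backends in the background
docker-compose up -d
# confirm that all six containers are running
docker-compose ps
# the Envoy admin interface is published on host port 82
curl -s http://localhost:82/clusters | head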
Envoy configuration file
admin:
  access_log_path: /dev/null
  address:
    socket_address: { address: 0.0.0.0, port_value: 9901 }
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address: { address: 0.0.0.0, port_value: 80 }
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: [ "*" ]
              routes:
              - match: { prefix: "/" }
                route: { cluster: local_service }
          http_filters:
          - name: envoy.filters.http.router
  clusters:
  - name: local_service
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    load_assignment:
      cluster_name: local_service
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: webservice, port_value: 90 }
    health_checks:
    - timeout: 3s
      interval: 90s
      unhealthy_threshold: 5
      healthy_threshold: 5
      no_traffic_interval: 240s
      http_health_check:
        path: "/ping"
        expected_statuses:
        - start: 200
          end: 201
    outlier_detection:
      consecutive_5xx: 2
      base_ejection_time: 30s
      max_ejection_percent: 40
      interval: 20s
      success_rate_minimum_hosts: 5
      success_rate_request_volume: 10
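Before starting the stack, the configuration can be checked with Envoy's validate mode; a quick sketch, reusing the same image:

# parse and validate envoy.yaml without actually starting the proxy
docker run --rm -v "$(pwd)/envoy.yaml:/etc/envoy/envoy.yaml" \
  envoyproxy/envoy-alpine:v1.15-latest --mode validate -c /etc/envoy/envoy.yaml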
Configuration notes
outlier_detection:
  consecutive_5xx: 2              # number of consecutive 5xx responses before a host is ejected
  base_ejection_time: 30s         # base ejection time; the actual time equals the base time multiplied by the number of times the host has been ejected
  max_ejection_percent: 40        # maximum percentage of the cluster that may be ejected; the default is 10%, raised to 40% here, i.e. 2 of the 5 nodes
  interval: 20s                   # interval between ejection analysis sweeps
  success_rate_minimum_hosts: 5   # minimum number of hosts with enough request volume for success-rate ejection to run
  success_rate_request_volume: 10 # minimum number of requests a host must receive in one interval to be included in success-rate detection
To make the effect easier to observe, the active health-check interval and the host ejection time have been increased here.
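The detector's view of the cluster can be watched through the admin interface while the test runs; a minimal sketch, assuming the 82:9901 admin mapping from the compose file:

# per-host health flags (healthy / failed_outlier_check) and the
# cluster's outlier-detection success-rate statistics
curl -s http://localhost:82/clusters | grep -E 'health_flags|outlier'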
Routes
/502bad
Simulates a 502 error.
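A simple way to generate the mixed traffic used below; the loop is illustrative, and any pattern that produces two consecutive 502s from one host (consecutive_5xx: 2) will trigger an ejection:

# interleave healthy requests with 502s through Envoy on port 80
for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost/
  curl -s -o /dev/null -w '%{http_code}\n' http://localhost/502bad
done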
Results
Simulate a mix of 5xx and 200 requests:
envoy_1 | [2020-09-13 06:10:01.093][1][warning][main] [source/server/server.cc:537] there is no configured limit to the number of allowed active connections. Set a limit via the runtime key overload.global_downstream_max_connections
webserver2_1 | [GIN] 2020/09/13 - 06:10:08 | 200 | 63.272µs | 172.22.0.7 | GET "/"
webserver5_1 | [GIN] 2020/09/13 - 06:10:10 | 200 | 46.732µs | 172.22.0.7 | GET "/"
webserver1_1 | [GIN] 2020/09/13 - 06:10:11 | 200 | 45.43µs | 172.22.0.7 | GET "/"
webserver3_1 | [GIN] 2020/09/13 - 06:10:13 | 502 | 43.858µs | 172.22.0.7 | GET "/502bad"
webserver4_1 | [GIN] 2020/09/13 - 06:10:14 | 502 | 47.486µs | 172.22.0.7 | GET "/502bad"
webserver2_1 | [GIN] 2020/09/13 - 06:10:15 | 200 | 15.691µs | 172.22.0.7 | GET "/"
webserver5_1 | [GIN] 2020/09/13 - 06:10:16 | 200 | 14.719µs | 172.22.0.7 | GET "/"
webserver1_1 | [GIN] 2020/09/13 - 06:10:16 | 200 | 15.758µs | 172.22.0.7 | GET "/"
webserver3_1 | [GIN] 2020/09/13 - 06:10:17 | 502 | 15.697µs | 172.22.0.7 | GET "/502bad"
webserver2_1 | [GIN] 2020/09/13 - 06:10:17 | 502 | 14.002µs | 172.22.0.7 | GET "/502bad"
webserver5_1 | [GIN] 2020/09/13 - 06:10:17 | 502 | 14.913µs | 172.22.0.7 | GET "/502bad"
webserver1_1 | [GIN] 2020/09/13 - 06:10:18 | 502 | 14.911µs | 172.22.0.7 | GET "/502bad"
webserver4_1 | [GIN] 2020/09/13 - 06:10:18 | 502 | 30.429µs | 172.22.0.7 | GET "/502bad"
webserver5_1 | [GIN] 2020/09/13 - 06:10:19 | 200 | 14.377µs | 172.22.0.7 | GET "/"
webserver1_1 | [GIN] 2020/09/13 - 06:10:19 | 200 | 14.861µs | 172.22.0.7 | GET "/"
webserver2_1 | [GIN] 2020/09/13 - 06:10:19 | 200 | 18.924µs | 172.22.0.7 | GET "/"
webserver5_1 | [GIN] 2020/09/13 - 06:10:19 | 200 | 15.899µs | 172.22.0.7 | GET "/"
webserver1_1 | [GIN] 2020/09/13 - 06:10:19 | 200 | 24.849µs | 172.22.0.7 | GET "/"
The cluster has ejected 40% of its nodes (webserver3 and webserver4, each of which returned two consecutive 502s); their health-check status is failed_outlier_check.
Requests are now distributed across the remaining three nodes.
After 30 seconds, the ejected hosts have returned to normal.
Simulate requests again.
After 30 seconds, if no new requests arrive within the detection interval, the nodes still show failed_outlier_check; they recover once new requests come in.
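One way to observe that recovery from the admin interface, assuming the same setup:

# send a fresh request so the host is re-evaluated, then
# inspect the per-host health flags again
curl -s -o /dev/null http://localhost/
curl -s http://localhost:82/clusters | grep health_flags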