SkyWalking Distributed Tracing System: Alerting


1. Overview

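SkyWalking alerting works by polling, at a fixed interval, the tracing data collected by skywalking-oap and evaluating it against the configured alarm rules (for example service response time or service response-time percentiles). When a threshold is reached, the corresponding alarm message is sent. Alarm messages are delivered by calling a webhook endpoint asynchronously from a thread pool. The webhook endpoint is defined by the user, so any notification channel (DingTalk, email, and so on) can be implemented behind it. Alarms can also be viewed in the RocketBot UI.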
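To make the webhook side more concrete, here is a minimal sketch of what a user-defined receiver might do with the request body. It assumes the plain webhook posts a JSON array whose items carry the fields scopeId, scope, name, id0, id1, ruleName, alarmMessage and startTime (epoch milliseconds); check the field names against the OAP version you run. The function name summarize_alarms and the sample values are made up for illustration.

```python
import json
from datetime import datetime, timezone


def summarize_alarms(body):
    """Turn a webhook request body (a JSON array of alarms) into text lines."""
    lines = []
    for alarm in json.loads(body):
        # Assumption: startTime is expressed in epoch milliseconds.
        started = datetime.fromtimestamp(alarm["startTime"] / 1000, tz=timezone.utc)
        lines.append(
            f"[{alarm['scope']}] {alarm['name']}: {alarm['alarmMessage']} "
            f"(rule={alarm['ruleName']}, started={started.isoformat()})"
        )
    return lines


# Hypothetical payload with a single service-level alarm.
sample_body = json.dumps([
    {
        "scopeId": 1,
        "scope": "SERVICE",
        "name": "serviceA",
        "id0": "12",
        "id1": "",
        "ruleName": "service_resp_time_rule",
        "alarmMessage": "Response time of service serviceA is more than 1000ms",
        "startTime": 1560524171000,
    }
])

for line in summarize_alarms(sample_body):
    print(line)
```

A real receiver would wrap this in an HTTP handler and forward each line to email, chat, or whatever channel is needed.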

The 8.4.0 release deployed in my earlier articles supports the following alarm hooks:

  • Plain webhook
  • gRPCHook
  • Slack Chat Hook
  • WeChat Hook
  • Dingtalk Hook
  • Feishu Hook

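2. Alarm rules

2.1 Default alarm rules

In SkyWalking an alarm rule is called a rule. The default SkyWalking OAP server installation ships with an alarm rule configuration file, alarm-settings.yml, in the config folder of the installation directory. The same is true when running in a container: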

# kubectl -n monitoring exec -it skywalking-oap-57d7f454f5-w4k4j -- bash
bash-5.0# pwd
/skywalking       
bash-5.0# cat config/alarm-settings.yml

The default alarm rule configuration file is shown below:

rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_avg_rule:
#    metrics-name: endpoint_avg
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/

2.2 Alarm rules in detail

Let's take one of the default rules and walk through it:

rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.

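The comment at the top states that the rule name must be unique and must end with _rule; here it is service_resp_time_rule (service response time).

  • metrics-name: the metric to alert on; the metric value must be of type long, double, or int

  • op: how the metric value is compared with the threshold, here "greater than"

  • threshold: the threshold, 1000 here; the unit for this metric is milliseconds

  • period: the length of time over which the metric is evaluated, that is, the alarm check window, in minutes

  • count: how many times the condition must be matched within the window before the alarm is triggered

  • silence-period: how long identical alarms are suppressed, defaulting to the same value as period. Put simply, once an alarm fires at time N, the rule stays silent from N to N + silence-period and does not fire again. This is similar to alert suppression in Alertmanager

  • message: the body of the alarm message; variables such as {name} are substituted automatically when the message is sent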
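How period, count and silence-period interact is easier to see with a small simulation. The sketch below is a deliberately simplified model, not the OAP implementation: it assumes one metric value per minute, keeps the last period values in a window, fires when at least count of them match, and then skips the next silence_period checks. The function name simulate_rule and the sample data are made up for illustration.

```python
import operator
from collections import deque

# Comparison operators as they appear in the `op` field.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "=": operator.eq}


def simulate_rule(values, op=">", threshold=1000, period=10, count=3,
                  silence_period=5):
    """Return the minute indexes at which the simplified rule fires."""
    compare = OPS[op]
    window = deque(maxlen=period)  # the last `period` one-minute values
    silence_left = 0
    fired = []
    for minute, value in enumerate(values):
        window.append(value)
        if silence_left > 0:
            # Still inside the silence period: skip this check.
            silence_left -= 1
            continue
        matches = sum(1 for v in window if compare(v, threshold))
        if matches >= count:
            fired.append(minute)
            silence_left = silence_period
    return fired


# One fast minute followed by eleven slow ones (values in milliseconds).
response_times = [500] + [1500] * 11
print(simulate_rule(response_times))
```

With the default parameters of service_resp_time_rule, the third slow minute triggers the alarm, the following five checks are silent, and the rule fires again on the first check after that.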

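In addition, the following optional (advanced) settings are available:

  • Include or exclude specific services; by default a rule matches every service that reports the metric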

    ...
      service_percent_rule:
        metrics-name: service_percent
        include-names:
          - service_a
          - service_b
        exclude-names:
          - service_c
    ...
    
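  • Thresholds for multi-value metrics, for example separate thresholds for P50, P75, P90, P95 and P99. Percentiles describe how the samples are distributed: P50 > 1000 means that half of the responses in the evaluation window took longer than 1000ms. This is the same concept as quantiles in Prometheus. Each value in the comma-separated list is the threshold for the corresponding percentile, and a check counts as a match as soon as any one percentile crosses its threshold

    For example, the rule below triggers an alarm when, within the last 10 minutes, the percentile conditions (p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000) are matched 3 times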

    ...
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        threshold: 1000,1000,1000,1000,1000
        period: 10
        count: 3
        silence-period: 5
        message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
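As a worked example of what the five thresholds are compared against, the sketch below computes P50 to P99 with the nearest-rank method and lists the percentiles that exceed their threshold. It is illustrative only: the OAP computes percentiles internally, the helper names are hypothetical, and treating "-" as "no threshold for this percentile" follows my reading of the official alarm documentation and should be verified for your version.

```python
PERCENTILES = (50, 75, 90, 95, 99)


def nearest_rank_percentile(sorted_values, p):
    """Nearest-rank percentile: the value at rank ceil(p * n / 100)."""
    rank = max(1, (p * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def parse_thresholds(text):
    """Parse '1000,1000,-,1000,1000'; '-' is assumed to mean 'skip'."""
    return [None if item.strip() == "-" else int(item)
            for item in text.split(",")]


def breached_percentiles(response_times_ms, threshold_text):
    """Return the names of the percentiles that exceed their threshold."""
    ordered = sorted(response_times_ms)
    breached = []
    for p, limit in zip(PERCENTILES, parse_thresholds(threshold_text)):
        if limit is None:
            continue  # no threshold configured for this percentile
        if nearest_rank_percentile(ordered, p) > limit:
            breached.append(f"p{p}")
    return breached


# Ten hypothetical response times in milliseconds.
samples = [200, 300, 400, 500, 600, 800, 900, 1100, 1500, 3000]
print(breached_percentiles(samples, "1000,1000,1000,1000,1000"))
```

In this sample the median is still under one second, while the upper percentiles are not, which is exactly the situation percentile rules are meant to catch.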
    
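  • Composite rules (composite-rules): these combine rules of the same entity level, for example service-level alarm rules, and trigger when the rules named in the expression are satisfied together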

    rules:
      endpoint_percent_rule:
        # Metrics value need to be long, double or int
        metrics-name: endpoint_percent
    ...
        # Specify if the rule can send notification or just as a condition of composite rule
        only-as-condition: false
      service_percent_rule:
        metrics-name: service_percent
    ...
        only-as-condition: false
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
    ...
        only-as-condition: false
      meter_service_status_code_rule:
        metrics-name: meter_status_code
    ...
        only-as-condition: false
    composite-rules:
      comp_rule:
        # Must satisfy both the percent rule and the resp time rule
        expression: service_percent_rule && service_resp_time_percentile_rule
        message: Service {name} successful rate is less than 80% and P50 of response time is over 1000ms
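The way an expression combines individual rules can be sketched as follows. This is only an illustration of the boolean logic, under the assumption that each rule has already been evaluated to true or false for the same entity. It handles && and || but not parentheses, and evaluate_composite is a hypothetical helper, not part of SkyWalking.

```python
def evaluate_composite(expression, triggered):
    """Evaluate an expression such as 'a_rule && b_rule || c_rule'.

    `triggered` maps rule names to booleans. '&&' binds tighter than '||'.
    Parentheses are not supported in this sketch.
    """
    def rule_state(name):
        name = name.strip()
        if name not in triggered:
            raise KeyError(f"unknown rule in expression: {name}")
        return triggered[name]

    return any(
        all(rule_state(name) for name in clause.split("&&"))
        for clause in expression.split("||")
    )


expression = "service_percent_rule && service_resp_time_percentile_rule"
states = {
    "service_percent_rule": True,
    "service_resp_time_percentile_rule": False,
}
# Only one of the two rules matched, so the composite rule stays quiet.
print(evaluate_composite(expression, states))
```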
    

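With this, the meaning of every default alarm rule listed above can be worked out. In order:

1 The average service response time exceeded 1 second in 3 of the last 10 minutes
2 The service success rate was below 80% in 2 of the last 10 minutes
3 The service response-time percentiles exceeded 1 second in 3 of the last 10 minutes
4 The average response time of a service instance exceeded 1 second in 2 of the last 10 minutes
5 The average response time of database access exceeded 1 second in 2 of the last 10 minutes
6 The average response time of an endpoint relation exceeded 1 second in 2 of the last 10 minutes
7 The average response time of an endpoint exceeded 1 second in 2 of the last 10 minutes
  This rule is commented out by default, with a note that because there are far more endpoints than services and instances, active endpoint metric alarms consume more memory than service and service instance metric alarms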

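3. Custom alarm rules

Most SkyWalking settings are configured through the application's application.yml and system environment variables. Dynamic configuration from the following sources is also supported:

  • gRPC service
  • Zookeeper
  • Etcd
  • Consul
  • Apollo
  • Nacos
  • k8s configmap

According to the SkyWalking dynamic configuration documentation, once dynamic configuration is enabled, the key alarm.default.alarm-settings overrides the default configuration file alarm-settings.yml.

This article covers a SkyWalking deployment installed on Kubernetes with Helm, so custom configuration can be injected through k8s-configmap. The resulting SkyWalking configuration is shown below. The file contains many variables; after reading the chart I found that it already has logic to inject all of them automatically depending on whether dynamic configuration is enabled, so there is no need to declare them in values.yaml.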

cluster:
  selector: ${SW_CLUSTER:standalone}
...
  kubernetes:
    namespace: ${SW_CLUSTER_K8S_NAMESPACE:default}
    labelSelector: ${SW_CLUSTER_K8S_LABEL:app=collector,release=skywalking}
    uidEnvName: ${SW_CLUSTER_K8S_UID:SKYWALKING_COLLECTOR_UID}
...
configuration:
  selector: ${SW_CONFIGURATION:k8s-configmap}
...
  k8s-configmap:
      # Sync period in seconds. Defaults to 60 seconds.
      period: ${SW_CONFIG_CONFIGMAP_PERIOD:60}
      # Which namespace is configmap deployed in.
      namespace: ${SW_CLUSTER_K8S_NAMESPACE:default}
      # Labelselector is used to locate specific configmap
      labelSelector: ${SW_CLUSTER_K8S_LABEL:app=collector,release=skywalking}

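Along with the custom alarm rules, add the configuration for the webhook alerting backend. For the ConfigMap syntax, refer to the official Helm ConfigMap example.

Here I only changed the default alarm messages to Chinese; the parameters of each rule are unchanged. I also added a DingTalk webhook configuration. The steps are as follows.

Edit the chart's values.yaml to enable dynamic configuration: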

...
oap:
  name: oap
  dynamicConfigEnabled: true # enable dynamic configuration
...

Edit templates/oap-configmap.yaml in the chart to configure the custom rules and the DingTalk webhook:

{{- if .Values.oap.dynamicConfigEnabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-dynamic-config
  labels:
    app: {{ template "skywalking.name" . }}
    release: {{ .Release.Name }}
    component: {{ .Values.oap.name }}
data:
  alarm.default.alarm-settings: |-
    rules:
      # Rule unique name, must be ended with `_rule`.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 5
        message: 最近3分鍾內服務 {name} 的平均響應時間超過1秒
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        threshold: 8000
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 2
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 3
        message: 最近2分鍾內服務 {name} 的成功率低於80%
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        threshold: 1000,1000,1000,1000,1000
        period: 10
        count: 3
        silence-period: 5
        message: 最近3分鍾的服務 {name} 的響應時間百分比超過1秒
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 5
        message: 最近2分鍾內服務實例 {name} 的平均響應時間超過1秒
      database_access_resp_time_rule:
        metrics-name: database_access_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        # message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
        message: 最近2分鍾內數據庫訪問 {name} 的平均響應時間超過1秒
      endpoint_relation_resp_time_rule:
        metrics-name: endpoint_relation_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        message: 最近2分鍾內端點關系 {name} 的平均響應時間超過1秒
    dingtalkHooks:
      textTemplate: |-
        {
          "msgtype": "text",
          "text": {
            "content": "SkyWalking 鏈路追蹤告警: \n %s."
          }
        }
      webhooks:
        - url: https://oapi.dingtalk.com/robot/send?access_token=<DingTalk robot access token>
          secret: <DingTalk robot signing secret>
{{- end }}
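For readers wondering what the textTemplate and secret fields in dingtalkHooks are for, the sketch below reproduces the two steps outside of SkyWalking. It assumes that the %s placeholder in the template is replaced by the alarm message, and it uses DingTalk's documented signing scheme for custom robots (HMAC-SHA256 over "timestamp\nsecret", Base64-encoded, URL-encoded, appended as timestamp and sign query parameters). The function names, the English template text, and the TOKEN and SECRET values are placeholders for illustration, not SkyWalking APIs.

```python
import base64
import hashlib
import hmac
import json
import time
import urllib.parse

# Raw string so that \n stays a JSON escape sequence inside the template.
TEXT_TEMPLATE = r'''{
  "msgtype": "text",
  "text": {
    "content": "SkyWalking alarm: \n %s."
  }
}'''


def render_text(alarm_message):
    """Fill the %s placeholder; the message must not contain raw quotes."""
    return TEXT_TEMPLATE % alarm_message


def sign_url(url, secret, timestamp_ms):
    """Append the timestamp and HMAC-SHA256 signature DingTalk expects."""
    string_to_sign = f"{timestamp_ms}\n{secret}".encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), string_to_sign,
                      hashlib.sha256).digest()
    sign = urllib.parse.quote_plus(base64.b64encode(digest))
    return f"{url}&timestamp={timestamp_ms}&sign={sign}"


payload = render_text("Response time of service demo is too high")
print(json.loads(payload)["text"]["content"])

signed = sign_url(
    "https://oapi.dingtalk.com/robot/send?access_token=TOKEN",
    "SECRET",
    int(time.time() * 1000),
)
print(signed)
```

Posting the rendered payload to the signed URL with curl or Postman is a quick way to check a robot token and secret before wiring them into the ConfigMap.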

After making the changes, run helm to upgrade the release:

# ls                                                                                
skywalking
# helm -n monitoring upgrade skywalking skywalking --values ./skywalking/values.yaml
# helm -n monitoring list                                                           
NAME            NAMESPACE       REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
skywalking      monitoring      3               2021-03-22 13:35:36.779541 +0800 CST    deployed        skywalking-4.0.0
# helm -n monitoring history skywalking                                             
REVISION        UPDATED                         STATUS          CHART                   APP VERSION     DESCRIPTION                                                                              
1               Sun Mar 21 17:45:34 2021        superseded      skywalking-4.0.0                        Install complete                                                                         
2               Mon Mar 22 13:35:36 2021        deployed        skywalking-4.0.0                        Upgrade complete 

Watch the pod status until all pods are healthy:

# kubectl -n monitoring get pods                               
NAME                              READY   STATUS      RESTARTS   AGE
elasticsearch-logging-0           1/1     Running     0          19h
elasticsearch-logging-1           1/1     Running     0          19h
elasticsearch-logging-2           1/1     Running     0          19h
skywalking-es-init-ktdcn          0/1     Completed   0          19h
skywalking-oap-7bbb775965-49895   1/1     Running     0          15s
skywalking-oap-7bbb775965-s89dz   1/1     Running     0          43s
skywalking-ui-698cdb4dbc-mjl2m    1/1     Running     0          19h

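4. Testing alerts

To test alerting, I asked a developer on the application team to add a simple URL to the project that responds only after a 5-second delay.

Then request the application's /api/timeout with a browser or Postman to test.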
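The actual test endpoint lives in the business application and is not shown here. As a stand-in, the sketch below serves /api/timeout with an artificial delay and includes a small client helper that measures the round trip. Everything in it (make_handler, start_server, timed_get, the port) is hypothetical, and a service like this only shows up in SkyWalking if it is instrumented with a SkyWalking agent.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(delay_seconds):
    """Build a handler whose /api/timeout responds after a fixed delay."""
    class SlowHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/api/timeout":
                self.send_error(404)
                return
            time.sleep(delay_seconds)  # simulate a slow business call
            body = b"done\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the console quiet

    return SlowHandler


def start_server(delay_seconds=5.0, port=8080):
    """Start the slow server in a background thread and return it."""
    server = ThreadingHTTPServer(("127.0.0.1", port), make_handler(delay_seconds))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def timed_get(url):
    """Return (HTTP status, elapsed seconds) for a GET request."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    started = time.monotonic()
    with opener.open(url, timeout=30) as response:
        status = response.status
        response.read()
    return status, time.monotonic() - started
```

Calling start_server() and then timed_get("http://127.0.0.1:8080/api/timeout") a few times per minute for several minutes mirrors the manual test. A single slow request may not be enough, because service_resp_time_rule needs 3 matching minutes within the 10-minute window.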

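The slow request then shows up in the trace view of the SkyWalking UI, the alarm appears on the UI's alarm page, and the alarm message arrives in DingTalk.

With that, alerting is configured in SkyWalking.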

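Appendix: notes from an online SkyWalking sharing session on how to locate problems with SkyWalking:

  • Get the big picture from the SkyWalking topology map
  • Monitoring and alerting: use metrics and tracing to confirm that a fault exists (alert on metrics, and compare against tracing statistics)
  • Locate the fault: use the tracing call relationships to determine which service or endpoint is failing
  • Use profiling (a newer SkyWalking capability) or traditional performance troubleshooting to pin down the problem on a single node (for example CPU, memory, I/O, network -> dynamic trace sampling -> flame graph); this is usually enough to resolve almost all problems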

