Log Monitoring and DingTalk Alerting with ElastAlert + ELK

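Introduction

Our company currently uses ELK to collect, display, and analyze logs, so we wanted monitoring and alerting for certain key logs, such as Nginx 5xx entries and php-fpm Fatal errors. The approach is to monitor the log data in ES and then call the DingTalk API from Python to send the alerts.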


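About ElastAlert

ElastAlert is an open-source tool that retrieves data from Elasticsearch and raises alerts based on matching patterns. The GitHub project is at https://github.com/Yelp/elastalert

The official documentation is at https://elastalert.readthedocs.io/en/latest/elastalert.html

It supports many rule types and alert channels; see the GitHub project description for details. ElastAlert itself does not support DingTalk alerts, but there is a third-party DingTalk plugin written in Python on GitHub: https://github.com/xuyaoqiang/elastalert-dingtalk-plugin

The third-party DingTalk plugin cannot @-mention specific people, so I extended it to add that feature.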


Installing ElastAlert

Newer versions of ElastAlert no longer support Python 2, so a Python 3 environment is required.
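
Before installing anything with pip3, it can help to confirm which interpreter will actually run the tools. The following is a minimal sketch; the (3, 6) minimum is an assumption taken from the python3.6 packages used in this article, and is_supported is a hypothetical helper name.

```python
import sys

# Assumed minimum version, based on the python3.6 packages used in this article.
MIN_VERSION = (3, 6)


def is_supported(version_info=None):
    """Return True when the given (major, minor, ...) tuple meets MIN_VERSION."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= MIN_VERSION


if __name__ == "__main__":
    print("Python", sys.version.split()[0], "supported:", is_supported())
```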

  • Install dependencies
sudo yum -y install python3 python3-devel python3-libs python3-setuptools git gcc

On Ubuntu:

sudo apt update
sudo apt -y upgrade
sudo apt install -y python3.6-dev
sudo apt install -y libffi-dev libssl-dev
sudo apt install -y python3-pip
sudo apt install -y python3-venv
  • Install the elastalert module
pip3 install elastalert -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
  • Clone the ElastAlert project
git clone https://github.com/Yelp/elastalert.git
cp -r elastalert /data/
  • Install the required modules
cd /data/elastalert/
pip3 install "setuptools>=11.3"
pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
  • Create the ElastAlert index
sudo elastalert-create-index --index elastalert
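  • Edit the ElastAlert configuration file
cp config.yaml.example config.yaml
vim config.yaml

rules_folder: rule # Directory holding the rule files; any directory relative to /data/elastalert can be used
run_every: # How often ElastAlert queries Elasticsearch
  minutes: 1
# If some log sources are not real-time, ElastAlert buffers results from the most recent period.
# The default is 15 minutes, but alerts did not fire for me with that value; 1 minute works.
buffer_time:
  minutes: 1
es_host: localhost # An ES cluster node; any one of them will do
es_port: 9200 # ES port
es_username: elastic # Needed only if ES uses X-Pack security; otherwise comment it out
es_password: password # Same as above
writeback_index: elastalert_status # Name of the ElastAlert index
alert_time_limit: # If sending an alert fails, retry within this period
  days: 2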
  • Configure the DingTalk alerter
git clone https://github.com/xuyaoqiang/elastalert-dingtalk-plugin.git
cd elastalert-dingtalk-plugin
pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
cp -r elastalert_modules /data/elastalert/
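
The run_every, buffer_time and alert_time_limit settings in config.yaml are each written as a time unit with a number. As a rough illustration of how such mappings can be read, the sketch below turns them into datetime.timedelta values. The config dict is a hand-written stand-in for the parsed file, and to_timedelta is a hypothetical helper, not part of ElastAlert.

```python
from datetime import timedelta

# Stand-in for the parsed config.yaml (values copied from this article).
config = {
    "rules_folder": "rule",
    "run_every": {"minutes": 1},
    "buffer_time": {"minutes": 1},
    "alert_time_limit": {"days": 2},
}


def to_timedelta(setting):
    """Convert a mapping such as {"minutes": 1} into a datetime.timedelta."""
    return timedelta(**setting)


durations = {
    key: to_timedelta(config[key])
    for key in ("run_every", "buffer_time", "alert_time_limit")
}

for key, value in durations.items():
    print(f"{key}: {value.total_seconds():.0f} seconds")
```

Writing the unit as a key keeps the file readable and allows units to be combined, for example hours together with minutes.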

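Rules

ElastAlert officially supports many rule types, and the example_rules directory contains plenty of reference rules. The most commonly used type is frequency.

A rule's YAML file must be placed in the directory defined by rules_folder in config.yaml; here that is the rule directory.

The following rule monitors Nginx 5XX status codes and sends alerts through DingTalk.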

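# Rule name; must be unique
name: the count of servnginx log that reponse status code is 5xx and it appears greater than 5 in the period 1 minute

# Rule type; ElastAlert provides several
type: frequency

# ES index; wildcards are supported
index: logstash-*-nginx-access-*

# Alert once this many events match within timeframe
num_events: 1

# Time window within which num_events must occur
timeframe:
  seconds: 5

# Match condition
filter:
- range:
    status:
      from: 500
      to: 599

# Alerter; this calls the third-party DingTalk alerter
alert:
- "elastalert_modules.dingtalk_alert.DingTalkAlerter"

# DingTalk webhook
dingtalk_webhook: "https://oapi.dingtalk.com/robot/send?access_token=" # Reference address; supply your own access token
dingtalk_msgtype: text

# The built-in alert message is hard to read, so define a custom format.
# Labels in order: domain, request method, request URL, status code, upstream server, count
alert_text: "
  域 名: {}\n
  調用方式: {}\n
  請求鏈接: {}\n
  狀 態 碼: {}\n
  后端服務器: {}\n
  數 量: {}
  "
alert_text_type: alert_text_only

# Fields substituted into alert_text, in order
alert_text_args:
- domain
- request_method
- request
- status
- upstreamaddr
- num_hits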
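
To make the num_events and timeframe options more concrete, here is a simplified model of a frequency check: count events inside a sliding window and fire once the count reaches the threshold. It is an illustration only, not ElastAlert's implementation. FrequencyWindow and is_5xx are hypothetical names, the sample data is made up, and the example uses num_events=3 rather than the rule's value so that the window is visible.

```python
from collections import deque
from datetime import datetime, timedelta


class FrequencyWindow:
    """Fire when at least num_events events fall within a sliding timeframe."""

    def __init__(self, num_events, timeframe):
        self.num_events = num_events
        self.timeframe = timeframe
        self.events = deque()

    def add(self, timestamp):
        """Record one event (in ascending time order); return True when the threshold is reached."""
        self.events.append(timestamp)
        # Drop events that fell out of the window ending at this timestamp.
        while self.events and timestamp - self.events[0] > self.timeframe:
            self.events.popleft()
        if len(self.events) >= self.num_events:
            self.events.clear()  # start counting afresh after firing
            return True
        return False


def is_5xx(status):
    """Mirror the rule's range filter: status codes 500 through 599."""
    return 500 <= int(status) <= 599


# Hypothetical (seconds offset, status code) samples.
samples = [(0, 200), (1, 502), (2, 500), (9, 503), (10, 504), (11, 500)]
start = datetime(2020, 8, 26, 9, 0, 0)
window = FrequencyWindow(num_events=3, timeframe=timedelta(seconds=5))

fired_at = []
for offset, status in samples:
    if is_5xx(status) and window.add(start + timedelta(seconds=offset)):
        fired_at.append(offset)

print(fired_at)
```

With num_events: 1, as in the rule above, every matching event would reach the threshold on its own.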

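Test whether the rule file is correct. Run the following command in the elastalert directory to check that a rule works:

/usr/local/bin/elastalert-test-rule --config config.yaml rule/nginx.yaml

This step may fail, usually because of module version or dependency problems. When the jira distribution is reported as missing, for example, run pip3 install jira==2.0.0. The first traceback below is a SyntaxError: the installed elastalert 0.1.20 still contains a Python 2-only raise statement: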

Traceback (most recent call last):
File "/usr/local/bin/elastalert-test-rule", line 11, in <module>
load_entry_point('elastalert==0.1.20', 'console_scripts', 'elastalert-test-rule')()
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 476, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2700, in load_entry_point
return ep.load()
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2318, in load
return self.resolve()
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2324, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "/usr/local/lib/python3.6/site-packages/elastalert/test_rule.py", line 20, in <module>
import elastalert.config
File "/usr/local/lib/python3.6/site-packages/elastalert/config.py", line 99
raise EAException("Could not import module %s: %s" % (module_name, e)), None, sys.exc_info()[2]
^
SyntaxError: invalid syntax

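Or the following error, in which the installed elastalert 0.1.20 conflicts with the required elastalert==0.2.4 and the jira>=2.0.0 distribution is missing: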

[work@idc-function-elk10 elastalert]$ /usr/local/bin/elastalert-test-rule --config config.yaml rule/nginx.yaml
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 570, in _build_master
ws.require(__requires__)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 888, in require
needed = self.resolve(parse_requirements(requirements))
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 779, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.VersionConflict: (elastalert 0.1.20 (/usr/local/lib/python3.6/site-packages), Requirement.parse('elastalert==0.2.4'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/bin/elastalert-test-rule", line 6, in <module>
from pkg_resources import load_entry_point
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3095, in <module>
@_call_aside
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3079, in _call_aside
f(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3108, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 572, in _build_master
return cls._build_from_requirements(__requires__)
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 585, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 774, in resolve
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'jira>=2.0.0' distribution was not found and is required by elastalert
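
Version conflicts like these are easier to diagnose when the installed versions are listed first. The sketch below is one way to do that from Python. It assumes Python 3.8 or later, where importlib.metadata is in the standard library (the python3.6 environment in this article would need a different approach, such as pip3 show); installed_version and report are hypothetical helper names.

```python
from importlib import metadata  # standard library from Python 3.8


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages):
    """Map each package name to its installed version (None when missing)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report(["elastalert", "jira"]).items():
        print(f"{name}: {version or 'not installed'}")
```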

Running ElastAlert

Once everything checks out, ElastAlert can be started. To run a single rule, use the following command (in the ElastAlert directory).

[root@idc-function-elk10 elastalert]# python3  -m elastalert.elastalert --verbose --rule /data/elastalert/rule/nginx.yaml


1 rules loaded
INFO:elastalert:Starting up
INFO:elastalert:Disabled rules are: []
INFO:elastalert:Sleeping for 59.999906 seconds
INFO:elastalert:Queried rule the count of servnginx log that reponse status code is 5xx and it appears greater than 5 in the period 1 minute from 2020-08-26 09:11 CST to 2020-08-26 09:26 CST: 10000 / 10000 hits (scrolling..)
INFO:elastalert:Queried rule the count of servnginx log that reponse status code is 5xx and it appears greater than 5 in the period 1 minute from 2020-08-26 09:11 CST to 2020-08-26 09:26 CST: 20000 / 10000 hits (scrolling..)

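After a few seconds, DingTalk receives the alert (I tested with status code 200 here). The alert content follows the custom format and fields defined in the rule file.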



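Rule 2. Monitoring php-fpm Fatal errors

The fpm error logs are also collected into ELK. We want an alert as soon as the keyword "Fatal" appears in the fpm log. The initial plan was to use ElastAlert's blacklist rule type, but the fpm error log is not parsed into fields and is stored as the raw log line, so that type does not fit.

See the issue I opened on GitHub: balacklist query hits but no matches no alerts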
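
One way to picture the mismatch is to compare an exact-value check with a phrase search. The sketch below is a simplified model, under the assumption that a blacklist rule compares a field's entire value with each listed term; it is not ElastAlert's code. The function names and the sample log line are hypothetical.

```python
def blacklist_match(field_value, blacklist):
    """Exact comparison: the whole field value must be one of the listed terms."""
    return field_value in blacklist


def phrase_match(message, phrase):
    """Substring comparison, standing in for a phrase query on the message field."""
    return phrase in message


# Hypothetical unparsed php-fpm error line, stored whole in the message field.
raw_line = "[26-Aug-2020 09:20:01 UTC] PHP Fatal error:  Uncaught Exception: demo"

print(blacklist_match(raw_line, ["PHP Fatal"]))  # exact comparison on the full line
print(phrase_match(raw_line, "PHP Fatal"))       # phrase searched inside the line
```

Under this model, a raw log line never equals a short keyword, while a phrase search over the same line still finds it.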

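The any rule type can be used instead. The rule file is as follows (the alert_text labels read, in order: host, IP address, business line, log type, full log):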

[root@idc-function-elk10 elastalert]# sed '/^#/d' rule/php-fpm.yaml | sed '/^$/d'
name: monitor the fatal,error log in php-fpm log
type: any
index: logstash-*-fpm-error-*
num_events: 1
timeframe:
  seconds: 5
filter:
- query:
    query_string:
      query: "message: \"PHP Fatal\"" # Match the Fatal keyword
alert:
- "elastalert_modules.dingtalk_alert.DingTalkAlerter"
dingtalk_webhook: "https://oapi.dingtalk.com/robot/send?access_token="
dingtalk_msgtype: text
alert_text: "
  主機: {}\n
  IP地址: {}\n
  業務線: {}\n
  日志類型: {}\n
  完整日志: {}
  "
alert_text_type: alert_text_only
alert_text_args:
- host.name
- host.ip
- fields.project
- fields.type
- message
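
The alert_text and alert_text_args options work as a template and an ordered list of fields. The sketch below illustrates the idea, including dotted names such as host.name that point into nested documents. It is a simplified stand-in rather than ElastAlert's own code: lookup, render, the MISSING placeholder, the English labels and the sample document are all assumptions made for the example.

```python
MISSING = "<MISSING VALUE>"  # assumed placeholder for fields absent from the document


def lookup(document, dotted_key):
    """Resolve a key such as "host.name" against a nested dict; MISSING if absent."""
    if dotted_key in document:  # flat key that itself contains dots
        return document[dotted_key]
    current = document
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return MISSING
        current = current[part]
    return current


def render(alert_text, alert_text_args, match):
    """Fill each {} in alert_text with the corresponding field, in order."""
    values = [lookup(match, key) for key in alert_text_args]
    return alert_text.format(*values)


# Hypothetical matched document.
match = {
    "host": {"name": "web01", "ip": "10.0.0.1"},
    "fields": {"project": "hsq", "type": "fpm"},
    "message": "PHP Fatal error: demo",
}

print(render("Host: {}\nProject: {}\nLog: {}",
             ["host.name", "fields.project", "message"], match))
```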

Starting ElastAlert

Open a screen session, then run ElastAlert in the background with nohup.

nohup python3 -m elastalert.elastalert --verbose &

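Extending the DingTalk alerter

So far the alerts are simply posted to the alert group. Because nobody is @-mentioned, people still do not see them right away, so this feature needs to be added. The basic idea is to @-mention the person responsible for each business line.

The middle-platform business line is more complicated, because its projects have different owners, so it needs special handling.

Preparation:

The alert text must contain the following fields:

  • Business line
  • Log type
  • For Nginx logs, the Nginx domain

Modify the alert method of the original DingTalk alerter as follows: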

import re

# Replacement for the alert() method of DingTalkAlerter in
# elastalert_modules/dingtalk_alert.py. It relies on names that module is expected to
# import already: json, requests, RequestException, EAException and DateTimeEncoder.
def alert(self, matches):
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json;charset=utf-8"
    }
    # body is the alert text as a string
    body = self.create_alert_body(matches)
    # Find the log type via the "type" keyword; currently Nginx and fpm logs
    res_type = re.findall(r"type: ([a-z]+)", body)
    # Find the business line (the 業務線 label); currently hsq, iqg, msf, bbh, mg
    res_project = re.findall(r"業務線: ([a-z]+)", body)
    # Raise if the alert text lacks either field
    if not res_type or not res_project:
        raise EAException("Log type or business line is not configured in the alert fields")

    # Use the first occurrence, so a body holding several matches still routes correctly
    log_type = res_type[0]
    project = res_project[0]

    # @-mention the owner of each business line
    at_list = None
    if project == "hsq":
        at_list = ["1560xxxxxx"]
    elif project == "iqg":
        at_list = ["137xxxxxx"]
    elif project == "bbh":
        at_list = ["176xxxxxx"]
    elif project == "msf":
        at_list = ["180xxxxxx"]
    elif project == "mg":
        # Middle-platform line: default to the middle-platform lead
        at_list = ["186xxxxx"]
        if "nginx" in log_type:
            # For Nginx alerts, route by subdomain when one is present
            domains = re.findall(r"domain: (.*)\.doweidu\.com", body)
            mg_project = domains[0] if domains else ""
            if mg_project == "trade":             # trading middle platform
                at_list = ["177xxxxxx"]
            elif mg_project == "message.center":  # messaging middle platform
                at_list = ["170xxxxxx"]
            elif mg_project == "goods.center":    # goods middle platform
                at_list = ["186xxxxxx"]

    # Fallback so that no alert goes unnoticed: @-mention the author (number masked)
    if at_list is None:
        at_list = ["177xxxxxxxx"]
    payload = {
        "msgtype": self.dingtalk_msgtype,
        "text": {
            "content": body
        },
        "at": {
            "atMobiles": at_list,  # people to @-mention
            "isAtAll": False
        }
    }
    try:
        response = requests.post(self.dingtalk_webhook_url,
                                 data=json.dumps(payload, cls=DateTimeEncoder),
                                 headers=headers)
        response.raise_for_status()
    except RequestException as e:
        raise EAException("Error request to Dingtalk: {0}".format(str(e)))
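
The routing decision in the alert method can also be separated from the HTTP call, so that it can be tested without contacting DingTalk. The sketch below is one possible arrangement, not the plugin's code: pick_at_list and the table names are hypothetical, the phone numbers are the masked placeholders from the listing above, and FALLBACK stands in for whoever should catch unrouted alerts.

```python
import re

# Hypothetical routing tables built from the masked placeholders in the listing above.
PROJECT_OWNERS = {
    "hsq": ["1560xxxxxx"],
    "iqg": ["137xxxxxx"],
    "bbh": ["176xxxxxx"],
    "msf": ["180xxxxxx"],
}
MG_DOMAIN_OWNERS = {
    "trade": ["177xxxxxx"],
    "message.center": ["170xxxxxx"],
    "goods.center": ["186xxxxxx"],
}
MG_DEFAULT = ["186xxxxx"]
FALLBACK = ["<fallback-mobile>"]  # placeholder for the person who catches unrouted alerts


def pick_at_list(body):
    """Return the mobile numbers to @-mention for an alert body."""
    project = re.search(r"業務線: ([a-z]+)", body)
    log_type = re.search(r"type: ([a-z]+)", body)
    if project is None or log_type is None:
        raise ValueError("alert body lacks the log type or business line field")
    name = project.group(1)
    if name != "mg":
        return PROJECT_OWNERS.get(name, FALLBACK)
    if "nginx" in log_type.group(1):
        domain = re.search(r"domain: (.*)\.doweidu\.com", body)
        if domain is not None:
            return MG_DOMAIN_OWNERS.get(domain.group(1), MG_DEFAULT)
    return MG_DEFAULT


print(pick_at_list("業務線: mg\ntype: nginx\ndomain: trade.doweidu.com"))
```

Keeping the owners in tables means that adding a business line is a data change rather than another elif branch.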

Reference: https://www.jesse.top/2020/08/25/elk/%E4%BD%BF%E7%94%A8ElastAlert+ELK%E5%AE%9E%E7%8E%B0%E6%97%A5%E5%BF%97%E7%9B%91%E6%8E%A7%E5%91%8A%E8%AD%A6/

