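Log Monitoring with DingTalk Alerts Using ElastAlert + ELK
Introduction
Our company currently uses ELK to collect, display, and analyze logs, so we wanted monitoring and alerting on a few critical kinds of log entries, such as Nginx 5xx responses and php-fpm Fatal errors. The approach is to monitor the log data in ES and then call the DingTalk API from Python to send the alerts.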
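About ElastAlert
ElastAlert is an open-source tool that retrieves data from Elasticsearch and sends alerts based on matching patterns. The GitHub project is at https://github.com/Yelp/elastalert
The official documentation is at https://elastalert.readthedocs.io/en/latest/elastalert.html
It supports many rule types and alert channels; see the GitHub project page for details. ElastAlert does not support DingTalk alerts out of the box, but there is a third-party DingTalk Python project on GitHub: https://github.com/xuyaoqiang/elastalert-dingtalk-plugin
The third-party DingTalk plugin cannot @-mention the people responsible, so I extended it to add that feature.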
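As a rough illustration of what an @-mention means at the message level, the sketch below builds the JSON body of a DingTalk robot text message. The field layout follows the payload used in the modified alert method later in this post; the function name `build_text_message` and the sample content are hypothetical.

```python
import json


def build_text_message(content, at_mobiles=(), at_all=False):
    """Build the JSON body for a DingTalk robot text message (hypothetical helper)."""
    return {
        "msgtype": "text",
        "text": {"content": content},
        # atMobiles lists the phone numbers of the people to @-mention
        "at": {"atMobiles": list(at_mobiles), "isAtAll": at_all},
    }


# Placeholder content and phone number, for illustration only
payload = build_text_message("nginx 5xx detected", ["1560xxxxxx"])
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

The body would then be POSTed to the robot webhook URL; that step is left out here so the sketch runs without network access.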
Installing ElastAlert
Newer versions of ElastAlert no longer support Python 2, so a Python 3 environment is required.

    sudo yum -y install python3 python3-devel python3-libs python3-setuptools git gcc

On Ubuntu:

    sudo apt update
    sudo apt -y upgrade
    sudo apt install -y python3.6-dev
    sudo apt install -y libffi-dev libssl-dev
    sudo apt install -y python3-pip
    sudo apt install -y python3-venv


    # Install ElastAlert from the Aliyun PyPI mirror
    pip3 install elastalert -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com


    # Fetch the source tree and copy it to /data/
    git clone https://github.com/Yelp/elastalert.git
    cp -r elastalert /data/


    cd /data/elastalert/
    pip3 install "setuptools>=11.3"
    # Use pip3 here as well so the requirements land in the Python 3 environment
    pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com


    # Create the ElastAlert index in Elasticsearch
    sudo elastalert-create-index --index elastalert

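
    cp config.yaml.example config.yaml
    vim config.yaml

The relevant settings in config.yaml:

    # Directory holding the rule files; a custom path relative to /data/elastalert can be used
    rules_folder: rule
    # How often ElastAlert sends a query to Elasticsearch
    run_every:
      minutes: 1
    # If some log sources are not real-time, ElastAlert buffers the results of the most recent period.
    # The default is 15, but alerts did not fire with that value here; setting it to 1 works.
    buffer_time:
      minutes: 1
    # ES cluster node; any node of the cluster will do
    es_host: localhost
    # ES port
    es_port: 9200
    # Required if ES uses X-Pack security; otherwise comment these two out
    es_username: elastic
    es_password: password
    # Name of the ElastAlert index
    writeback_index: elastalert_status
    # If sending an alert fails, retry within this time span
    alert_time_limit:
      days: 2
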
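To make `run_every` and `buffer_time` more concrete, here is a minimal sketch of how such duration mappings translate into a polling schedule and a query window. It assumes, following my reading of the ElastAlert documentation, that `buffer_time` is the size of the query window stretching back from the time of each query. The helper names are hypothetical, and real ElastAlert keeps more state (for example the end time of the previous query) than this shows.

```python
from datetime import datetime, timedelta

# The two duration settings from config.yaml, as they look once parsed
config = {
    "run_every": {"minutes": 1},
    "buffer_time": {"minutes": 1},
}


def query_window(now, conf):
    """Return the (start, end) time range one query would cover (simplified)."""
    buffer_time = timedelta(**conf["buffer_time"])
    return now - buffer_time, now


def next_run(now, conf):
    """Return the time of the next query."""
    return now + timedelta(**conf["run_every"])


now = datetime(2020, 8, 26, 9, 26)
start, end = query_window(now, config)
print("query from", start, "to", end)
print("next query at", next_run(now, config))
```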

    # Install the third-party DingTalk plugin and copy its module into the ElastAlert directory
    git clone https://github.com/xuyaoqiang/elastalert-dingtalk-plugin.git
    cd elastalert-dingtalk-plugin
    pip3 install -r requirements.txt -i http://mirrors.aliyun.com/pypi/simple/ --trusted-host mirrors.aliyun.com
    cp -r elastalert_modules /data/elastalert/

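Rules
ElastAlert officially supports many rule types, and the example_rules directory contains plenty of sample rules for reference. The most commonly used type is frequency.
The YAML file for a rule must be placed in the directory defined in config.yaml; here that is the rule directory.
The rule below monitors Nginx 5xx status codes and sends the alert through DingTalk.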

    # Rule name; must be unique
    name: the count of servnginx log that reponse status code is 5xx and it appears greater than 5 in the period 1 minute
    # Rule type; several types are provided officially
    type: frequency
    # ES index; wildcards are supported
    index: logstash-*-nginx-access-*
    # Alert once this many results match within timeframe
    num_events: 1
    # Time window over which matches are counted
    timeframe:
      seconds: 5
    # Match pattern
    filter:
    - range:
        status:
          from: 500
          to: 599
    # Alert method; here the third-party DingTalk alerter
    alert:
    - "elastalert_modules.dingtalk_alert.DingTalkAlerter"
    # DingTalk webhook; fill in your own access_token
    dingtalk_webhook: "https://oapi.dingtalk.com/robot/send?access_token="
    dingtalk_msgtype: text
    # The built-in alert text is hard to read, so define a custom format.
    # Labels in order: domain, request method, request URL, status code, upstream server, count
    alert_text: " 域 名: {}\n 調用方式: {}\n 請求鏈接: {}\n 狀 態 碼: {}\n 后端服務器: {}\n 數 量: {} "
    alert_text_type: alert_text_only
    # Fields substituted into alert_text, in order
    alert_text_args:
    - domain
    - request_method
    - request
    - status
    - upstreamaddr
    - num_hits

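The interplay of `num_events` and `timeframe` in a frequency rule can be pictured as a sliding window over event timestamps. The sketch below is a simplified illustration of that idea, not ElastAlert's actual implementation; the class name `FrequencyWindow` is hypothetical and events are assumed to arrive in time order.

```python
from collections import deque
from datetime import datetime, timedelta


class FrequencyWindow:
    """Fire once at least num_events events fall within timeframe (simplified)."""

    def __init__(self, num_events, timeframe):
        self.num_events = num_events
        self.timeframe = timeframe
        self.events = deque()

    def add(self, timestamp):
        """Record one matching event; return True if the threshold is reached."""
        self.events.append(timestamp)
        # Drop events that are older than the window ending at this timestamp
        while timestamp - self.events[0] > self.timeframe:
            self.events.popleft()
        if len(self.events) >= self.num_events:
            self.events.clear()  # start counting afresh after firing
            return True
        return False


t0 = datetime(2020, 8, 26, 9, 0, 0)
window = FrequencyWindow(num_events=3, timeframe=timedelta(seconds=5))
results = [window.add(t0 + timedelta(seconds=s)) for s in (0, 1, 10, 11, 12)]
print(results)  # [False, False, False, False, True]
```

With `num_events: 1`, as in the rule above, every single matching event reaches the threshold in this model.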
To check that a rule file is correct, run the following command from the elastalert directory; it tests whether a given rule works.

    /usr/local/bin/elastalert-test-rule --config config.yaml rule/nginx.yaml

This step may produce errors, usually caused by module versions or dependency problems. The errors below, for example, were resolved by running pip3 install jira==2.0.0:

    Traceback (most recent call last):
      File "/usr/local/bin/elastalert-test-rule", line 11, in <module>
        load_entry_point('elastalert==0.1.20', 'console_scripts', 'elastalert-test-rule')()
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 476, in load_entry_point
        return get_distribution(dist).load_entry_point(group, name)
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2700, in load_entry_point
        return ep.load()
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2318, in load
        return self.resolve()
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2324, in resolve
        module = __import__(self.module_name, fromlist=['__name__'], level=0)
      File "/usr/local/lib/python3.6/site-packages/elastalert/test_rule.py", line 20, in <module>
        import elastalert.config
      File "/usr/local/lib/python3.6/site-packages/elastalert/config.py", line 99
        raise EAException("Could not import module %s: %s" % (module_name, e)), None, sys.exc_info()[2]
                                                                              ^
    SyntaxError: invalid syntax

Or this error:

    [work@idc-function-elk10 elastalert]$ /usr/local/bin/elastalert-test-rule --config config.yaml rule/nginx.yaml
    Traceback (most recent call last):
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 570, in _build_master
        ws.require(__requires__)
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 888, in require
        needed = self.resolve(parse_requirements(requirements))
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 779, in resolve
        raise VersionConflict(dist, req).with_context(dependent_req)
    pkg_resources.VersionConflict: (elastalert 0.1.20 (/usr/local/lib/python3.6/site-packages), Requirement.parse('elastalert==0.2.4'))

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/local/bin/elastalert-test-rule", line 6, in <module>
        from pkg_resources import load_entry_point
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3095, in <module>
        @_call_aside
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3079, in _call_aside
        f(*args, **kwargs)
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 3108, in _initialize_master_working_set
        working_set = WorkingSet._build_master()
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 572, in _build_master
        return cls._build_from_requirements(__requires__)
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 585, in _build_from_requirements
        dists = ws.resolve(reqs, Environment())
      File "/usr/lib/python3.6/site-packages/pkg_resources/__init__.py", line 774, in resolve
        raise DistributionNotFound(req, requirers)
    pkg_resources.DistributionNotFound: The 'jira>=2.0.0' distribution was not found and is required by elastalert

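When chasing version conflicts like the ones in these tracebacks, it can help to see which versions are actually installed. The following is a small sketch using the standard library's `importlib.metadata`, which assumes Python 3.8 or newer (the tracebacks above come from Python 3.6, where this module is not available); the helper names are hypothetical.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages):
    """Map each package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


for name, version in report(["elastalert", "jira"]).items():
    print(f"{name}: {version or 'not installed'}")
```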
Running ElastAlert
Once everything checks out, ElastAlert can be run. To run a single rule, use the following command from the ElastAlert directory.

    [root@idc-function-elk10 elastalert]# python3 -m elastalert.elastalert --verbose --rule /data/elastalert/rule/nginx.yaml
    1 rules loaded
    INFO:elastalert:Starting up
    INFO:elastalert:Disabled rules are: []
    INFO:elastalert:Sleeping for 59.999906 seconds
    INFO:elastalert:Queried rule the count of servnginx log that reponse status code is 5xx and it appears greater than 5 in the period 1 minute from 2020-08-26 09:11 CST to 2020-08-26 09:26 CST: 10000 / 10000 hits (scrolling..)
    INFO:elastalert:Queried rule the count of servnginx log that reponse status code is 5xx and it appears greater than 5 in the period 1 minute from 2020-08-26 09:11 CST to 2020-08-26 09:26 CST: 20000 / 10000 hits (scrolling..)

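After a few seconds, DingTalk receives the alert (I tested with status code 200 here). The alert text follows the custom format and content defined in the rule file.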

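Rule 2. Monitoring php-fpm Fatal errors
The fpm error logs are also collected into ELK. We want an alert as soon as the keyword "Fatal" appears in the fpm log. The original plan was to use ElastAlert's blacklist rule type, but because the fpm error logs are not parsed and are stored as raw log lines, that type does not fit.
See the issue I opened on GitHub: "balacklist query hits but no matches no alerts"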
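One way to picture why a blacklist-style comparison does not fit unparsed logs: if the comparison checks whether a field's whole value equals a listed term, a raw log line never equals the bare keyword, whereas a keyword search inside the line does hit. The sketch below is only a rough stand-in for that difference (a real `query_string` works on analyzed tokens rather than plain substrings), and the sample log line is made up for illustration.

```python
def blacklist_style_match(field_value, blacklist):
    """Exact comparison: the whole field value must equal a listed term."""
    return field_value in blacklist


def keyword_style_match(field_value, keyword):
    """Keyword search inside the field value (rough stand-in for query_string)."""
    return keyword in field_value


# Made-up raw php-fpm log line, stored unparsed in the message field
raw_message = "[26-Aug-2020 09:11:02] PHP Fatal error:  Uncaught Exception in demo.php"

print(blacklist_style_match(raw_message, {"Fatal", "PHP Fatal"}))  # False
print(keyword_style_match(raw_message, "PHP Fatal"))               # True
```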
The any rule type can be used instead. The rule file is as follows:

    [root@idc-function-elk10 elastalert]# sed '/^#/d' rule/php-fpm.yaml | sed '/^$/d'
    name: monitor the fatal,error log in php-fpm log
    type: any
    index: logstash-*-fpm-error-*
    num_events: 1
    timeframe:
      seconds: 5
    filter:
    - query:
        query_string:
          query: "message: \"PHP Fatal\""  # match the Fatal keyword
    alert:
    - "elastalert_modules.dingtalk_alert.DingTalkAlerter"
    dingtalk_webhook: "https://oapi.dingtalk.com/robot/send?access_token="
    dingtalk_msgtype: text
    alert_text: " 主機: {}\n IP地址: {}\n 業務線: {}\n 日志類型: {}\n 完整日志: {} "  # labels: host, IP address, business line, log type, full log
    alert_text_type: alert_text_only
    alert_text_args:
    - host.name
    - host.ip
    - fields.project
    - fields.type
    - message

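The way `alert_text` and `alert_text_args` fit together can be sketched as positional string formatting, with dotted names such as `host.name` resolved against the nested matched document. This is a simplified illustration rather than ElastAlert's own code; the helper names, the English template, the sample document, and the placeholder used for missing fields are all assumptions.

```python
def lookup(match, dotted_key, missing="<MISSING VALUE>"):
    """Walk a nested dict following a dotted key such as 'host.name'."""
    value = match
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return missing
        value = value[part]
    return value


def render_alert(alert_text, alert_text_args, match):
    """Fill each {} in alert_text with the corresponding field, in order."""
    return alert_text.format(*(lookup(match, key) for key in alert_text_args))


# Made-up matched document, shaped like the fields used in the rule above
match = {
    "host": {"name": "web01", "ip": "10.0.0.1"},
    "fields": {"project": "hsq", "type": "fpm"},
    "message": "PHP Fatal error: demo",
}
template = "Host: {}\nIP: {}\nProject: {}\nLog type: {}\nFull log: {}"
args = ["host.name", "host.ip", "fields.project", "fields.type", "message"]
text = render_alert(template, args, match)
print(text)
```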
Starting ElastAlert
Open a screen session, then run ElastAlert in the background with nohup.

    nohup python3 -m elastalert.elastalert --verbose &

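Extending the DingTalk alerter
At this point the log alerts are simply posted to the alert group. Because nobody is @-mentioned, people still do not see the alerts right away, so this feature needs to be added. The general idea is to @-mention the responsible person according to the business line.
The middle platform (mg) business line is a bit more complicated, because different projects have different owners, so it needs special handling.
Preparation:
The alert text must contain the following fields:
- Business line
- Log type
- For Nginx logs, the Nginx domain
Modify the alert method of the original DingTalk alerter as follows: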

    import re

    # requests, json, RequestException, EAException and DateTimeEncoder are assumed
    # to be imported already at the top of the plugin's dingtalk_alert.py.

    def alert(self, matches):
        headers = {
            "Content-Type": "application/json",
            "Accept": "application/json;charset=utf-8"
        }
        # body is the alert text as a string
        body = self.create_alert_body(matches)
        # Find the log type in the alert text; currently nginx logs and fpm logs
        res_type = re.findall(r"type: ([a-z]+)", body)
        # Find the business line; currently hsq, iqg, msf, bbh, mg and so on
        res_project = re.findall(r"業務線: ([a-z]+)", body)
        # Raise an exception if the alert text lacks either field
        if not res_type or not res_project:
            raise EAException("log type or business line is missing from the alert fields")
        # Join the matched lists into strings
        log_type = "".join(res_type)
        project = "".join(res_project)

        # Fallback so that nothing is missed: @ myself if no branch below applies
        at_list = ["177xxxxxxxx"]
        # @ the responsible person for each business line
        if project == "hsq":
            at_list = ["1560xxxxxx"]
        elif project == "iqg":
            at_list = ["137xxxxxx"]
        elif project == "bbh":
            at_list = ["176xxxxxx"]
        elif project == "msf":
            at_list = ["180xxxxxx"]
        elif project == "mg":
            # Default to the owner of the middle platform
            at_list = ["186xxxxx"]
            # For Nginx alerts on the middle platform, @ the owner of the specific project
            if "nginx" in log_type:
                # Extract the subdomain; guard against alerts without a domain
                domains = re.findall(r"domain: (.*)\.doweidu\.com", body)
                mg_project = domains[0] if domains else ""
                if mg_project == "trade":             # trade platform
                    at_list = ["177xxxxxx"]
                elif mg_project == "message.center":  # message platform
                    at_list = ["170xxxxxx"]
                elif mg_project == "goods.center":    # goods platform
                    at_list = ["186xxxxxx"]

        payload = {
            "msgtype": self.dingtalk_msgtype,
            "text": {
                "content": body
            },
            "at": {
                "atMobiles": at_list,  # people to @-mention
                "isAtAll": False
            }
        }
        try:
            response = requests.post(self.dingtalk_webhook_url,
                                     data=json.dumps(payload, cls=DateTimeEncoder),
                                     headers=headers)
            response.raise_for_status()
        except RequestException as e:
            raise EAException("Error request to Dingtalk: {0}".format(str(e)))

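As the number of business lines grows, the if/elif chain can become hard to maintain. One possible alternative, sketched below under the assumption that the alert text carries the same `業務線:`, `type:` and `domain:` labels the method above searches for, is to keep the routing in lookup tables. The function `pick_at_list` and the table names are hypothetical, the phone numbers are placeholders, and unlike the method above this sketch uses only the first match of each pattern.

```python
import re

# Placeholder phone numbers, for illustration only
DEFAULT_AT = ["177xxxxxxxx"]   # fallback contact
PROJECT_OWNERS = {
    "hsq": ["1560xxxxxx"],
    "iqg": ["137xxxxxx"],
    "bbh": ["176xxxxxx"],
    "msf": ["180xxxxxx"],
}
MG_DEFAULT = ["186xxxxx"]      # owner of the middle platform
MG_DOMAIN_OWNERS = {
    "trade": ["177xxxxxx"],
    "message.center": ["170xxxxxx"],
    "goods.center": ["186xxxxxx"],
}


def pick_at_list(body):
    """Choose whom to @-mention from the alert text (table-driven sketch)."""
    project = re.search(r"業務線: ([a-z]+)", body)
    log_type = re.search(r"type: ([a-z]+)", body)
    if not project or not log_type:
        raise ValueError("log type or business line is missing from the alert text")
    project, log_type = project.group(1), log_type.group(1)
    if project == "mg":
        domain = re.search(r"domain: (.*)\.doweidu\.com", body)
        if "nginx" in log_type and domain:
            return MG_DOMAIN_OWNERS.get(domain.group(1), MG_DEFAULT)
        return MG_DEFAULT
    return PROJECT_OWNERS.get(project, DEFAULT_AT)


print(pick_at_list("業務線: mg\ntype: nginx\ndomain: trade.doweidu.com"))
```

Adding a new business line then means adding one dictionary entry rather than another branch.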
Reference: https://www.jesse.top/2020/08/25/elk/%E4%BD%BF%E7%94%A8ElastAlert+ELK%E5%AE%9E%E7%8E%B0%E6%97%A5%E5%BF%97%E7%9B%91%E6%8E%A7%E5%91%8A%E8%AD%A6/