BlueKing: Analyzing and Fixing a Failed Installation of the SaaS Component bk_monitor


While installing the SaaS components with ./bk_install saas-o, I found that bk_monitor (BlueKing Monitor) failed with "ERROR deploy failed: timeout".

I then tried installing each component on its own:

# Fault auto-healing
[root@rbtnode1 install]# ./bk_install saas-o bk_fta_solutions

# Log search
[root@rbtnode1 install]# ./bk_install saas-o bk_log_search

# Node management
[root@rbtnode1 install]# ./bk_install saas-o bk_nodeman

# Standard OPS
[root@rbtnode1 install]# ./bk_install saas-o bk_sops

# BlueKing Monitor
[root@rbtnode1 install]# ./bk_install saas-o bk_monitor
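
The same sequence can be scripted so that one failing component does not hide the results for the others. This is only a minimal sketch: install_each is a hypothetical helper name, and it assumes that the installer exits with a non-zero status when a deployment fails, which this post does not confirm.

```shell
# Install each SaaS app separately and report which ones fail.
# Assumption: the installer command exits non-zero on a failed deployment.
install_each() {
    # $1: installer command (for example ./bk_install); remaining args: app codes
    installer="$1"
    shift
    failed=""
    for app in "$@"; do
        if "$installer" saas-o "$app"; then
            echo "OK $app"
        else
            echo "FAILED $app"
            failed="$failed $app"
        fi
    done
    # Succeed only if no app failed
    [ -z "$failed" ]
}

# Usage (not executed here):
#   install_each ./bk_install bk_fta_solutions bk_log_search bk_nodeman bk_sops bk_monitor
```

Passing the installer as the first argument keeps the loop easy to try out with a stub before pointing it at the real script.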

The first four, bk_fta_solutions, bk_log_search, bk_nodeman and bk_sops, all installed successfully. Only bk_monitor still failed, with the following error:

[root@rbtnode1 install]# ./bk_install saas-o bk_monitor
(output omitted)..
2020-03-09 13:27:36 125  INFO    check deploy result. retry 132
2020-03-09 13:27:39 125  INFO    check deploy result. retry 133
2020-03-09 13:27:39 134  ERROR  deploy failed: timeout
[192.168.1.6]20200309-132739 153   Deploy saas bk_monitor failed.
[192.168.1.6]20200309-132739 47   Abort

Looking further at the agent log (/data/bkce/logs/paas_agent/agent.log), the deployment task was ultimately terminated because it timed out, and there was no other obvious error:

2020/03/09 13:24:57 job.go:279: Building wheels for collected packages: gevent, netifaces, arrow, msgpack-python, wrapt, itypes, backports.shutil-get-terminal-size, simplegeneric, scandir
2020/03/09 13:24:57 job.go:279:   Running setup.py bdist_wheel for gevent: started
2020/03/09 13:27:32 job.go:279:   Running setup.py bdist_wheel for gevent: still running...
2020/03/09 13:27:38 job.go:297: Deployment task execution timeout
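
When the log is long, it can help to pull out only the lines that look like failures. The following is a rough sketch rather than an official procedure: scan_agent_log is a hypothetical helper, and the patterns are guesses based on the log excerpts in this post.

```shell
# Print lines of an agent log that look like failures, prefixed with line numbers.
# The patterns are assumptions drawn from the excerpts in this post, not an official list.
scan_agent_log() {
    # $1: path to agent.log
    grep -n -i -E 'timeout|error|traceback|exit status [1-9]' "$1"
}

# Usage (not executed here):
#   scan_agent_log /data/bkce/logs/paas_agent/agent.log | tail -n 20
```

The match is case-insensitive, so exception names such as InternalError are caught as well.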

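Some material I found online said the cause was an under-provisioned machine and that raising the core count to 6 would fix it, but in my tests that had no effect and the error was unchanged.
I asked in the official BlueKing user group, and the support staff offered a solution:

However, that case was not actually the same as mine, because I did not see that error in my logs.
In the evening I went back over the problem, borrowed the environment cleanup approach from that case, and redeployed. This time agent.log showed an actual error: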

2020/03/10 02:29:54 job.go:279:   File "/opt/py27_e/lib/python2.7/site-packages/pymysql/connections.py", line 906, in _read_packet
2020/03/10 02:29:54 job.go:279:     packet.check_error()
2020/03/10 02:29:54 job.go:279:   File "/opt/py27_e/lib/python2.7/site-packages/pymysql/connections.py", line 367, in check_error
2020/03/10 02:29:54 job.go:279:     err.raise_mysql_exception(self._data)
2020/03/10 02:29:54 job.go:279:   File "/opt/py27_e/lib/python2.7/site-packages/pymysql/err.py", line 120, in raise_mysql_exception
2020/03/10 02:29:54 job.go:279:     _check_mysql_exception(errinfo)
2020/03/10 02:29:54 job.go:279:   File "/opt/py27_e/lib/python2.7/site-packages/pymysql/err.py", line 115, in _check_mysql_exception
2020/03/10 02:29:54 job.go:279:     raise InternalError(errno, errorvalue)
2020/03/10 02:29:54 job.go:279: django.db.utils.InternalError: (1049, u"Unknown database 'bkdata_monitor_alert'")
2020/03/10 02:29:55 job.go:304: error waiting for Cmd exit status 1

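So the message says that no database named bkdata_monitor_alert exists??
The earlier agent logs confirm that the table-creation steps had succeeded, which suggests that the environment cleanup most likely also dropped the databases belonging to this component.

Without digging into that for now, I checked the current list of databases directly: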

MySQL [(none)]> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| bk_fta_solutions   |
| bk_log_search      |
| bk_monitor         |
| bk_nodeman         |
| bk_sops            |
| bksuite_common     |
| job                |
| jobLog             |
| mysql              |
| open_paas          |
| performance_schema |
| sys                |
+--------------------+
13 rows in set (0.00 sec)
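
The lookup can also be scripted instead of reading the list by eye. A minimal sketch, assuming the mysql client picks up its host and credentials from its own configuration (for example ~/.my.cnf); has_db is a hypothetical helper name.

```shell
# Succeed if the database name given as $1 appears, as a whole line,
# in the list of database names read from standard input.
has_db() {
    # -x: match the whole line, -F: treat the name literally, -q: no output
    grep -qxF -- "$1"
}

# Usage (not executed here; connection options are assumed to come from the client config):
#   mysql -N -B -e 'SHOW DATABASES' | has_db bkdata_monitor_alert || echo "missing"
```

Matching whole lines matters here, because bk_monitor exists and a substring match on a similar name could give a false positive.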

Sure enough, there is no bkdata_monitor_alert database. As a first attempt, I simply created an empty one:

MySQL [(none)]> create database bkdata_monitor_alert character set utf8;
Query OK, 1 row affected (0.01 sec)
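
If this step has to be repeated, for example after another environment cleanup, the statement can be made safe to rerun with IF NOT EXISTS. This is a sketch under assumptions: create_db_sql is a hypothetical helper, and the mysql client is assumed to read its connection settings from its own configuration.

```shell
# Print a CREATE DATABASE statement that is safe to run more than once.
# The utf8 character set mirrors the statement used in this post.
create_db_sql() {
    # $1: database name, back-quoted in the generated SQL
    printf 'CREATE DATABASE IF NOT EXISTS `%s` CHARACTER SET utf8;\n' "$1"
}

# Usage (not executed here):
#   create_db_sql bkdata_monitor_alert | mysql
```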

Then I tried installing bk_monitor again:

# Install bk_monitor again
[root@rbtnode1 install]# ./bk_install saas-o bk_monitor

# Watch agent.log
[root@rbtnode1 paas_agent]# pwd
/data/bkce/logs/paas_agent
[root@rbtnode1 paas_agent]# tail -20f agent.log 

This time agent.log ended by showing that the job completed normally:

(part of the log omitted)..

2020/03/10 02:45:38 job.go:279:   Applying sessions.0001_initial... OK
2020/03/10 02:45:38 job.go:279: ------change db success------
2020/03/10 02:47:25 job.go:279: ------ start app server ------
2020/03/10 02:47:25 job.go:279: su: ignore --preserve-environment, it's mutually exclusive to --login.
2020/03/10 02:47:25 job.go:279: /etc/profile: line 77: ulimit: open files: cannot modify limit: Operation not permitted
2020/03/10 02:47:25 job.go:279: /etc/profile: line 78: ulimit: open files: cannot modify limit: Operation not permitted
2020/03/10 02:47:25 job.go:279: /etc/profile: line 79: ulimit: open files: cannot modify limit: Operation not permitted
2020/03/10 02:47:25 job.go:279: /etc/profile: line 80: ulimit: open files: cannot modify limit: Operation not permitted
2020/03/10 02:47:26 job.go:279: Last login: Mon Mar  9 14:01:54 CST 2020
2020/03/10 02:47:28 job.go:279: Job Done
2020/03/10 02:47:28 job.go:306: RunJob end ... ...
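
The end-of-job markers seen in the failed and successful runs ("Deployment task execution timeout", "error waiting for Cmd exit status" and "Job Done") can be used to classify an attempt without reading the whole log. A rough sketch: job_status is a hypothetical helper, and treating these three strings as the only outcome markers is an assumption based on the excerpts in this post.

```shell
# Classify a deployment attempt from agent log text read on standard input.
# Prints done, failed, or running; the last marker seen decides the result.
job_status() {
    awk '
        BEGIN { s = "running" }
        /Job Done/ { s = "done" }
        /Deployment task execution timeout/ { s = "failed" }
        /error waiting for Cmd exit status/ { s = "failed" }
        END { print s }
    '
}

# Usage (not executed here):
#   tail -n 200 /data/bkce/logs/paas_agent/agent.log | job_status
```

Because agent.log accumulates across attempts, the result is only meaningful when the input is limited to the lines of the current attempt.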

I quickly checked the installation window and found that bk_monitor had finally installed successfully:

[root@rbtnode1 install]# ./bk_install saas-o bk_monitor
(part of the log omitted)..

2020-03-10 02:47:24 125  INFO    check deploy result. retry 107
2020-03-10 02:47:26 125  INFO    check deploy result. retry 108
2020-03-10 02:47:29 125  INFO    check deploy result. retry 109
2020-03-10 02:47:30 131  INFO   bk_monitor have been deployed successfully
[192.168.1.6]20200310-024730 151   SaaS application bk_monitor has been deployed successfully
[192.168.1.6]20200310-024730 56   install saas-o(bk_monitor) done

Logging in to the BlueKing workbench also confirmed that the BlueKing Monitor component was now installed and could be used normally.

