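When installing with ./bk_install saas-o, the bk_monitor (BlueKing Monitor) component failed with "ERROR deploy failed: timeout".
I then tried installing each component on its own:
# Fault Auto-recovery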
[root@rbtnode1 install]# ./bk_install saas-o bk_fta_solutions
# Log Search
[root@rbtnode1 install]# ./bk_install saas-o bk_log_search
# Node Management
[root@rbtnode1 install]# ./bk_install saas-o bk_nodeman
# Standard OPS
[root@rbtnode1 install]# ./bk_install saas-o bk_sops
# BlueKing Monitor
[root@rbtnode1 install]# ./bk_install saas-o bk_monitor
The first four (bk_fta_solutions, bk_log_search, bk_nodeman, and bk_sops) all installed successfully. Only bk_monitor still failed, with the following error:
[root@rbtnode1 install]# ./bk_install saas-o bk_monitor
(output omitted)..
2020-03-09 13:27:36 125 INFO check deploy result. retry 132
2020-03-09 13:27:39 125 INFO check deploy result. retry 133
2020-03-09 13:27:39 134 ERROR deploy failed: timeout
[192.168.1.6]20200309-132739 153 Deploy saas bk_monitor failed.
[192.168.1.6]20200309-132739 47 Abort
Looking further at the agent log (/data/bkce/logs/paas_agent/agent.log), the deployment task was terminated because it timed out, and there were no other obvious errors:
2020/03/09 13:24:57 job.go:279: Building wheels for collected packages: gevent, netifaces, arrow, msgpack-python, wrapt, itypes, backports.shutil-get-terminal-size, simplegeneric, scandir
2020/03/09 13:24:57 job.go:279: Running setup.py bdist_wheel for gevent: started
2020/03/09 13:27:32 job.go:279: Running setup.py bdist_wheel for gevent: still running...
2020/03/09 13:27:38 job.go:297: Deployment task execution timeout
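To see which step was still running when the deployment timed out, the line immediately before the timeout message can be pulled out of agent.log. This is only a minimal sketch: the function name last_step_before_timeout is hypothetical (it is not part of BlueKing), and it assumes the timeout message reads exactly as in the log above.

```shell
# Hypothetical helper, not part of BlueKing.
# Prints the log line that immediately precedes the first
# "Deployment task execution timeout" message in the given file.
last_step_before_timeout() {
    # $1: path to agent.log
    awk '/Deployment task execution timeout/ { print prev; exit }
         { prev = $0 }' "$1"
}
```

Run against the excerpt above, it prints the "Running setup.py bdist_wheel for gevent: still running..." line, which shows that the gevent wheel build was in progress when the timeout fired.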
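Some material I found online said the cause was insufficient machine resources and that raising the CPU core count to 6 would fix it, but in my tests this made no difference and the error stayed the same.
I asked in the official BlueKing group, and support pointed me to a known case and its fix.
However, that case did not actually match my situation, because I did not see the error it described.
That evening I rethought the problem, borrowed the environment cleanup approach from that case, and redeployed. This time agent.log showed an actual error: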
2020/03/10 02:29:54 job.go:279: File "/opt/py27_e/lib/python2.7/site-packages/pymysql/connections.py", line 906, in _read_packet
2020/03/10 02:29:54 job.go:279: packet.check_error()
2020/03/10 02:29:54 job.go:279: File "/opt/py27_e/lib/python2.7/site-packages/pymysql/connections.py", line 367, in check_error
2020/03/10 02:29:54 job.go:279: err.raise_mysql_exception(self._data)
2020/03/10 02:29:54 job.go:279: File "/opt/py27_e/lib/python2.7/site-packages/pymysql/err.py", line 120, in raise_mysql_exception
2020/03/10 02:29:54 job.go:279: _check_mysql_exception(errinfo)
2020/03/10 02:29:54 job.go:279: File "/opt/py27_e/lib/python2.7/site-packages/pymysql/err.py", line 115, in _check_mysql_exception
2020/03/10 02:29:54 job.go:279: raise InternalError(errno, errorvalue)
2020/03/10 02:29:54 job.go:279: django.db.utils.InternalError: (1049, u"Unknown database 'bkdata_monitor_alert'")
2020/03/10 02:29:55 job.go:304: error waiting for Cmd exit status 1
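So the message says there is no database named bkdata_monitor_alert?
The earlier agent log confirmed that the table creation steps had succeeded, so the environment cleanup most likely dropped this component's database as well.
I will not dig into that here. First, check the current list of databases: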
MySQL [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| bk_fta_solutions |
| bk_log_search |
| bk_monitor |
| bk_nodeman |
| bk_sops |
| bksuite_common |
| job |
| jobLog |
| mysql |
| open_paas |
| performance_schema |
| sys |
+--------------------+
13 rows in set (0.00 sec)
Sure enough, there is no bkdata_monitor_alert database. As a first attempt, I simply created an empty one:
MySQL [(none)]> create database bkdata_monitor_alert character set utf8;
Query OK, 1 row affected (0.01 sec)
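The same fix can be scripted: read the missing database name from the error 1049 line in agent.log and generate an idempotent CREATE statement. This is a rough sketch only. The function names missing_db_from_log and create_db_sql are hypothetical, the character set simply mirrors the statement above, and how you connect to MySQL (user, host, password) is an assumption you need to adapt.

```shell
# Hypothetical helpers, not part of BlueKing.

# Print each distinct database name reported as "Unknown database '...'"
# (MySQL error 1049) in the given log file.
missing_db_from_log() {
    # $1: path to agent.log
    sed -n "s/.*Unknown database '\([^']*\)'.*/\1/p" "$1" | sort -u
}

# Print a CREATE statement that is safe to run even if the database exists.
create_db_sql() {
    # $1: database name
    printf 'CREATE DATABASE IF NOT EXISTS `%s` CHARACTER SET utf8;\n' "$1"
}

# Example usage (connection options are an assumption, adjust as needed):
#   create_db_sql "$(missing_db_from_log /data/bkce/logs/paas_agent/agent.log)" | mysql -u root -p
```

Using IF NOT EXISTS means the statement can be rerun without failing if the database has already been created.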
Then I tried installing bk_monitor again:
# Install bk_monitor again
[root@rbtnode1 install]# ./bk_install saas-o bk_monitor
# Watch agent.log
[root@rbtnode1 paas_agent]# pwd
/data/bkce/logs/paas_agent
[root@rbtnode1 paas_agent]# tail -20f agent.log
This time agent.log showed that the job completed normally:
(part of the log omitted)..
2020/03/10 02:45:38 job.go:279: Applying sessions.0001_initial... OK
2020/03/10 02:45:38 job.go:279: ------change db success------
2020/03/10 02:47:25 job.go:279: ------ start app server ------
2020/03/10 02:47:25 job.go:279: su: ignore --preserve-environment, it's mutually exclusive to --login.
2020/03/10 02:47:25 job.go:279: /etc/profile: line 77: ulimit: open files: cannot modify limit: Operation not permitted
2020/03/10 02:47:25 job.go:279: /etc/profile: line 78: ulimit: open files: cannot modify limit: Operation not permitted
2020/03/10 02:47:25 job.go:279: /etc/profile: line 79: ulimit: open files: cannot modify limit: Operation not permitted
2020/03/10 02:47:25 job.go:279: /etc/profile: line 80: ulimit: open files: cannot modify limit: Operation not permitted
2020/03/10 02:47:26 job.go:279: Last login: Mon Mar 9 14:01:54 CST 2020
2020/03/10 02:47:28 job.go:279: Job Done
2020/03/10 02:47:28 job.go:306: RunJob end ... ...
I went straight back to the install window and found that bk_monitor had finally installed successfully:
[root@rbtnode1 install]# ./bk_install saas-o bk_monitor
(part of the log omitted)..
2020-03-10 02:47:24 125 INFO check deploy result. retry 107
2020-03-10 02:47:26 125 INFO check deploy result. retry 108
2020-03-10 02:47:29 125 INFO check deploy result. retry 109
2020-03-10 02:47:30 131 INFO bk_monitor have been deployed successfully
[192.168.1.6]20200310-024730 151 SaaS application bk_monitor has been deployed successfully
[192.168.1.6]20200310-024730 56 install saas-o(bk_monitor) done
Logging in to the BlueKing workbench also confirmed that the BlueKing Monitor component was installed and working normally.