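I. Discovering the problem
While connected to a database instance from the command line to run a manual ALTER on a table, I hit the following errors.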
mysql> use xx_xxx;
No connection. Trying to reconnect...
Connection id: 5
Current database: *** NONE ***
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> show tables;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id: 3
Current database: xx_xxx
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (111)
ERROR: Can't connect to the server
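II. Locating the problem
Errors like these usually mean that the MySQL instance was shut down or crashed, that the connection timed out, or that the request thread was killed. In this case the audit platform could connect to the instance normally and could even create tables in the problem database, the MySQL service was reachable from outside, the error log showed no abnormal output, and no developer or tester had reported trouble using the problem database. However, opening the problem database in Navicat produced the error (MySQL server has gone away). From the command line, other databases could be entered and their commands ran normally, but inside the problem database even the most basic variable and status values could not be shown.
I first tried to trace the source of brute-force connection attempts against the database to narrow things down (this turned out to be a detour), and the problem persisted. What was hard to understand was that the audit platform could still create tables in the problem database, which differed from an earlier case where the data files of an entire database were corrupted. I therefore suspected that the data file of a single table was corrupted and causing the error.
The log showed that the instance had gone through an abnormal shutdown and crash recovery, but that was not enough to pin down the exact cause. What was clear was that only a single database was affected and that it could be restored by other means. Still, a DBA earns their keep by first trying to solve the problem at the lowest possible cost. I had previously seen data files corrupted by hardware failures several times, where recovery from other cluster instances and backups had little impact on the business. In my own tests I had also seen a single database whose files were thoroughly damaged but could still be restored by dumping the data out.
Tracing the cause after the fix, I found the following failure point in the log. (These entries are fairly old. When an incident hits there may not be much time for analysis: you need to identify the immediate problem quickly and resolve it, and you may not notice this key piece of evidence while doing so. The log is included only for reference and for tracing the root cause.)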
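One way to tell these causes apart is to look at the order of the client errors. The sketch below is only a heuristic under stated assumptions: the function names are hypothetical, and it assumes that an error 2002 (the local socket refusing the connection) appearing after an error 2006 points to the server process itself going down rather than a mere timeout or a killed thread.

```python
import re

# Matches lines such as "ERROR 2006 (HY000): MySQL server has gone away".
# The trailing "ERROR: Can't connect to the server" line has no numeric
# code and is deliberately not matched.
ERROR_RE = re.compile(r"ERROR (\d+) \(([0-9A-Z]+)\)")


def extract_error_codes(output):
    """Return the numeric error codes in the order they appear."""
    codes = []
    for line in output.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            codes.append(int(match.group(1)))
    return codes


def server_probably_down(output):
    """Heuristic (an assumption, not a MySQL rule): a 2002 after a 2006
    suggests the server stopped accepting connections altogether."""
    codes = extract_error_codes(output)
    if 2006 not in codes:
        return False
    first_gone = codes.index(2006)
    return 2002 in codes[first_gone + 1:]
```

Applied to the session shown earlier, the codes come out as 2006, 2006, 2002, which is consistent with the instance crashing when the problem database was touched.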
[ERROR] [FATAL] InnoDB: Tablespace id is 974 in the data dictionary but in file ./xx_xxxxxx/xx_xxxxxx_fans_person.ibd it is 2760!
2018-12-18 20:30:29 0x7f014872b700 InnoDB: Assertion failure in thread 139643487172352 in file ut0ut.cc line 916
InnoDB: We intentionally generate a memory trap.
2018-12-18T12:30:29.409505Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 3514141201
2018-12-18T12:30:29.409520Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 3514141210
2018-12-18T12:30:29.409677Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 3514141210
2018-12-18T12:30:29.409682Z 0 [Note] InnoDB: Database was not shutdown normally!
2018-12-18T12:30:29.409685Z 0 [Note] InnoDB: Starting crash recovery.
2018-12-18T12:30:30.026781Z 0 [ERROR] InnoDB: In file './xx_xxxxxx/xx_xxxxxx_fans_person.ibd', tablespace id and flags are 2760 and 33, but in the InnoDB data dictionary they are 974 and 33. Have you moved InnoDB .ibd files around without using the commands DISCARD TABLESPACE and IMPORT TABLESPACE? Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2018-12-18T12:30:30.026819Z 0 [ERROR] InnoDB: Operating system error number 2 in a file operation.
2018-12-18T12:30:30.026823Z 0 [ERROR] InnoDB: The error means the system cannot find the path specified.
2018-12-18T12:30:30.026827Z 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them.
2018-12-18T12:30:30.026831Z 0 [ERROR] InnoDB: Could not find a valid tablespace file for `xx_xxxxxx/xx_xxxxxx_fans_person`. Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2018-12-18T12:30:30.026841Z 0 [Warning] InnoDB: Ignoring tablespace `xx_xxxxxx/xx_xxxxxx_fans_person` because it could not be opened.
2018-12-18T12:30:32.199013Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2018-12-18T12:30:32.199035Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2018-12-18T12:30:32.199088Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2018-12-18T12:30:32.286423Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
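Because this evidence is easy to miss in a long error log, a small scan for the mismatch message can surface it quickly. The following is a minimal sketch, not an official tool: the function name is hypothetical, the pattern is written against the MySQL 5.7 wording quoted in the log, and it assumes each log entry sits on a single line.

```python
import re

# Pattern based on the message wording in the log; other MySQL versions
# may phrase this differently (assumption).
MISMATCH_RE = re.compile(
    r"In file '(?P<path>[^']+)', tablespace id and flags are "
    r"(?P<file_id>\d+) and \d+, but in the InnoDB data dictionary "
    r"they are (?P<dict_id>\d+) and \d+"
)


def find_tablespace_mismatches(log_text):
    """Return (ibd_path, id_in_file, id_in_dictionary) for every entry
    where the .ibd file and the data dictionary disagree."""
    results = []
    for line in log_text.splitlines():
        match = MISMATCH_RE.search(line)
        if match:
            results.append((
                match.group("path"),
                int(match.group("file_id")),
                int(match.group("dict_id")),
            ))
    return results
```

Run against the log from this incident, it would report the single file ./xx_xxxxxx/xx_xxxxxx_fans_person.ibd with ID 2760 in the file and 974 in the dictionary.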
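III. Solving the problem
When corrupted data files make data inaccessible, the usual fix is to restore from backup, including backing up and restoring the damaged part. The idea is sound, but putting it into practice is not necessarily easy: sure enough, when I tried to dump the data before attempting a repair, I got the error MySQL server has gone away. Good problems are worth sharing, because when a problem is broad and hard to pin down it is easy to overlook the right direction. Prompted by a friend, I noticed that the output of use mentioned the -A option, and looking it up showed that it lets you enter a database without loading its metadata.
Meaning of the -A option
When we open a database with use dbname, the client pre-reads the database information (the table and column names used for completion). If the database is large, that is, if it contains a great many tables, this pre-read is very slow and the client appears to hang. With only a few tables there is no problem. Starting the client with -A turns the pre-read off.
Fortunately, without the pre-read I could list all tables in the current database and view system variables and status values normally. I then tried a batch repair of the InnoDB and MyISAM tables. At this point you should first back up the data with select...into; because this was a test environment with a redundant copy, I went straight to the repair without a backup. The statements below query all base tables and concatenate the SQL. They did turn up a bad table that could not be repaired, which confirmed the information in the MySQL error log.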
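For scripted checks it can help to build the client command line with the pre-read always disabled. This is a sketch under assumptions: the helper name is hypothetical, and credentials are assumed to come from an option file rather than the command line. The flags themselves (--no-auto-rehash, which is the long form of -A, plus --database, --execute, and --user) are standard mysql client options.

```python
def build_mysql_argv(database, statement, user=None):
    """Build an argument list for the mysql client that skips the
    table/column name pre-read (-A, long form --no-auto-rehash)."""
    argv = ["mysql", "--no-auto-rehash"]
    if user is not None:
        argv.append("--user=" + user)
    argv.append("--database=" + database)
    argv.append("--execute=" + statement)
    return argv


# Usage (not executed here): pass the list to subprocess.run, e.g.
#   subprocess.run(build_mysql_argv("xx_xxx", "show tables"), check=True)
```

Returning a list rather than a single string avoids shell quoting issues when the statement contains spaces or quotes.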
## Batch repair of MyISAM tables
select concat('repair table ',table_name,';')
from information_schema.tables
where table_schema = 'xx_xxxx' and table_type = 'BASE TABLE' and engine = 'MyISAM';
# Batch repair of InnoDB tables
select concat('optimize table ',table_name,';')
from information_schema.tables
where table_schema = 'xx_xxxx' and table_type = 'BASE TABLE' and engine = 'InnoDB';
##### optimize table xx_xxxxxx_fans_person;
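The same statement generation can be done outside the server once the table names and engines are known. The sketch below assumes the caller already has (table_name, engine) pairs, for example from information_schema.tables. The function name is hypothetical, and unlike the concat queries it quotes identifiers with backticks, which is an addition of this sketch.

```python
def build_repair_statements(tables):
    """Map (table_name, engine) pairs to maintenance statements:
    REPAIR TABLE for MyISAM and OPTIMIZE TABLE for InnoDB.
    Tables using other engines are skipped."""
    statements = []
    for name, engine in tables:
        # Quote the identifier; embedded backticks are doubled.
        quoted = "`" + name.replace("`", "``") + "`"
        kind = engine.lower()
        if kind == "myisam":
            statements.append("REPAIR TABLE " + quoted + ";")
        elif kind == "innodb":
            statements.append("OPTIMIZE TABLE " + quoted + ";")
    return statements
```

Running the generated statements one at a time, rather than as a single batch, makes it easier to see which table fails.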
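These statements identified the table whose repair did not return OK. After checking its row count to confirm that the data could no longer be exported, I dropped the bad table and recreated it (drop table may report that the table does not exist, or the create may fail with error 1068), imported the most recent data backup, and restarted the MySQL instance. The problem was resolved and the problem database could be accessed normally.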
1. Drop the bad table xx_xxxxxx_fans_person
2. Recreate the table
mysql> CREATE TABLE `xx_xxxxxx_fans_person` (
    -> `person_id` int(20) NOT NULL AUTO_INCREMENT,
    -> `person_circle_id` int(20) NOT NULL,
    -> `person_user_id` int(20) NOT NULL,
    -> `person_time` datetime NOT NULL,
    -> `type` int(4) DEFAULT '1' COMMENT '1. group leader 2. member',
    -> `merchant_id` int(11) DEFAULT '0',
    -> `leave_type` int(11) DEFAULT '0' COMMENT 'leave status 0. not on leave 1. on leave',
    -> `leave_start_time` datetime DEFAULT NULL COMMENT 'leave start time',
    -> `leave_end_time` datetime DEFAULT NULL COMMENT 'leave end time',
    -> `is_invalid` int(10) DEFAULT '0' COMMENT 'invalid flag 0 valid 1 invalid',
    -> `invalid_id` int(10) DEFAULT '0' COMMENT 'associated invalidation record ID ',
    -> PRIMARY KEY (`person_id`),
    -> KEY `person_circle_id` (`person_circle_id`) USING BTREE,
    -> KEY `person_user_id` (`person_user_id`) USING BTREE
    -> ) ENGINE=InnoDB;
ERROR 1030 (HY000): Got error 168 from storage engine
3. Restart
mysql> select count(*) from xx_xxxxxx_fans_person;
ERROR 1812 (HY000): Tablespace is missing for table `xx_xxxxxx`.`xx_xxxxxx_fans_person`.
mysql> drop table xx_xxxxxx_fans_person;
Query OK, 0 rows affected (0.00 sec)
mysql> CREATE TABLE `xx_xxxxxx_fans_person` (
    -> `person_id` int(20) NOT NULL AUTO_INCREMENT,
    -> `person_circle_id` int(20) NOT NULL,
    -> `person_user_id` int(20) NOT NULL,
    -> `person_time` datetime NOT NULL,
    -> `type` int(4) DEFAULT '1' COMMENT '1. group leader 2. member',
    -> `merchant_id` int(11) DEFAULT '0',
    -> `leave_type` int(11) DEFAULT '0' COMMENT 'leave status 0. not on leave 1. on leave',
    -> `leave_start_time` datetime DEFAULT NULL COMMENT 'leave start time',
    -> `leave_end_time` datetime DEFAULT NULL COMMENT 'leave end time',
    -> `is_invalid` int(10) DEFAULT '0' COMMENT 'invalid flag 0 valid 1 invalid',
    -> `invalid_id` int(10) DEFAULT '0' COMMENT 'associated invalidation record ID ',
    -> PRIMARY KEY (`person_id`),
    -> KEY `person_circle_id` (`person_circle_id`) USING BTREE,
    -> KEY `person_user_id` (`person_user_id`) USING BTREE
    -> ) ENGINE=InnoDB;
ERROR 1813 (HY000): Tablespace '`xx_xxxxxx`.`xx_xxxxxx_fans_person`' exists.
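IV. Summary
1. Back up databases on a schedule to guard against data file corruption caused by hardware or other problems.
2. Analyze the problem first and rule out the basic impossibilities. Always check the logs, and pay attention to the hints in command error output (they may help with diagnosis or repair).
3. The -A option can be used to skip loading database information when attempting table repair. Take a backup beforehand.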