Recovering a multi-database MySQL instance after a corrupted data file for one table made the xxx database inaccessible


I. Discovering the problem

  While logged in to the database instance from the command line to run a manual ALTER on a table, I hit the following errors.

mysql> use xx_xxx;
No connection. Trying to reconnect...
Connection id:    5
Current database: *** NONE ***

Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    3
Current database: xx_xxx

ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/tmp/mysql.sock' (111)
ERROR: 
Can't connect to the server

II. Locating the problem

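  Errors like these usually mean that the MySQL instance was shut down or crashed, that the connection timed out, or that the request thread was killed. In this case the audit platform could still connect to the instance and could even create tables in the affected database, the MySQL service looked healthy from the outside, the error log showed no abnormal output, and no developer or tester had reported trouble using the affected database. Yet opening the affected database in Navicat failed with "MySQL server has gone away". From the command line, every other database accepted commands normally, but inside the affected database even the most basic SHOW of variables and status values failed.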

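  I first tracked down the source of brute-force connection attempts against the database to narrow things down (a detour, in hindsight), but the problem persisted. What was hard to explain was that creating a table in the affected database through the audit platform succeeded, which was unlike the whole-database data file corruption I had seen before. That led me to suspect that the data file of a single table was corrupted.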

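  The log showed that the instance had gone through an abnormal shutdown and a crash recovery, but that did not yet pin down the cause. What was clear was that only one database was affected and that it could be recovered by other means. Still, a DBA earns their keep by first trying to fix the problem at the lowest possible cost. I had previously dealt with several hardware failures that corrupted data files, where recovering from other instances in the cluster and from backups had little impact on the business. In my own tests I had also seen a single database's files thoroughly corrupted and still recovered the data by dumping it out.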

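  Tracing the fault back after it was fixed, I found the following failure point in the log. (This log entry is fairly old. When a problem hits there may be no time for analysis: you need to locate and fix the immediate issue quickly, and you may not notice this key piece of evidence while doing so. The log is included only for reference and for tracing the root cause.)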

2018-12-18T12:30:29.409505Z 0 [Note] InnoDB: Log scan progressed past the checkpoint lsn 3514141201
2018-12-18T12:30:29.409520Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 3514141210
2018-12-18T12:30:29.409677Z 0 [Note] InnoDB: Doing recovery: scanned up to log sequence number 3514141210
2018-12-18T12:30:29.409682Z 0 [Note] InnoDB: Database was not shutdown normally!
2018-12-18T12:30:29.409685Z 0 [Note] InnoDB: Starting crash recovery.
2018-12-18T12:30:30.026781Z 0 [ERROR] InnoDB: In file './xx_xxxxxx/xx_xxxxxx_fans_person.ibd', tablespace id and flags are 2760 and 33, but in the InnoDB data dictionary they are 974 and 33. Have you moved InnoDB .ibd files around without using the commands DISCARD TABLESPACE and IMPORT TABLESPACE? Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2018-12-18T12:30:30.026819Z 0 [ERROR] InnoDB: Operating system error number 2 in a file operation.
2018-12-18T12:30:30.026823Z 0 [ERROR] InnoDB: The error means the system cannot find the path specified.
2018-12-18T12:30:30.026827Z 0 [ERROR] InnoDB: If you are installing InnoDB, remember that you must create directories yourself, InnoDB does not create them.
2018-12-18T12:30:30.026831Z 0 [ERROR] InnoDB: Could not find a valid tablespace file for `xx_xxxxxx/xx_xxxxxx_fans_person`. Please refer to http://dev.mysql.com/doc/refman/5.7/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2018-12-18T12:30:30.026841Z 0 [Warning] InnoDB: Ignoring tablespace `xx_xxxxxx/xx_xxxxxx_fans_person` because it could not be opened.
2018-12-18T12:30:32.199013Z 0 [Note] InnoDB: Removed temporary tablespace data file: "ibtmp1"
2018-12-18T12:30:32.199035Z 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2018-12-18T12:30:32.199088Z 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2018-12-18T12:30:32.286423Z 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.

[ERROR] [FATAL] InnoDB: Tablespace id is 974 in the data dictionary but in file ./xx_xxxxxx/xx_xxxxxx_fans_person.ibd it is 2760!
2018-12-18 20:30:29 0x7f014872b700 InnoDB: Assertion failure in thread 139643487172352 in file ut0ut.cc line 916
InnoDB: We intentionally generate a memory trap.
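
When scanning a long error log, the tablespace id mismatch reported above can be picked out mechanically. The following is a minimal Python sketch and was not part of the original troubleshooting. The function name `find_tablespace_mismatches` and the regular expression are illustrative assumptions written against the 5.7 message wording shown in this log; other MySQL versions may word the message differently.

```python
import re

# Pattern written against the MySQL 5.7 message quoted in this article;
# the wording in other versions may differ (assumption).
MISMATCH_RE = re.compile(
    r"In file '(?P<path>[^']+\.ibd)', tablespace id and flags are "
    r"(?P<file_id>\d+) and \d+, but in the InnoDB data dictionary they are "
    r"(?P<dict_id>\d+) and \d+"
)


def find_tablespace_mismatches(log_text):
    """Return (ibd_path, id_in_file, id_in_dictionary) for each mismatch."""
    # Terminal wrapping can split one message across several lines,
    # so drop the newlines before matching.
    joined = log_text.replace("\n", "")
    return [
        (m.group("path"), int(m.group("file_id")), int(m.group("dict_id")))
        for m in MISMATCH_RE.finditer(joined)
    ]


# A wrapped excerpt of the log entry, with the id "974" split by a newline.
sample = (
    "2018-12-18T12:30:30.026781Z 0 [ERROR] InnoDB: In file "
    "'./xx_xxxxxx/xx_xxxxxx_fans_person.ibd', tablespace id and flags are "
    "2760 and 33, but in the InnoDB data dictionary they are 97\n"
    "4 and 33. Have you moved InnoDB .ibd files around?"
)
print(find_tablespace_mismatches(sample))
```

For this excerpt the sketch reports the .ibd file with id 2760 in the file against id 974 in the data dictionary, which points straight at the single damaged table.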

III. Resolving the problem

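  When a corrupted data file makes data inaccessible, the usual fix is to restore from backup, including backing up and restoring around the damaged spot. The idea is sound, but only trying it shows how hard it can be in practice: sure enough, when I tried to dump the data as a backup before attempting a repair, the dump failed with "MySQL server has gone away". A good problem is worth sharing, because when a problem is broad and hard to pin down it is easy to overlook the right direction. A hint from friends made me notice that the output of use mentions the -A option, and on looking it up I learned that it lets you enter a database without loading its metadata.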

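What the -A option does
    When we open a database with use dbname, the client pre-reads the database's table and column names for completion. If the database is very large, that is, if it holds a great many tables, this pre-read is very slow and the client appears to hang. With only a few tables there is no problem.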
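
For scripted access, the same behavior can be requested when the client is started. Below is a small Python sketch that assembles such a command line; `mysql_argv` is a hypothetical helper name, and the socket path is taken from the error output earlier in this article. `--no-auto-rehash` is the long form of `-A` in the mysql client.

```python
import shlex


def mysql_argv(database, socket_path=None, skip_rehash=True):
    """Build an argument list for the mysql command-line client."""
    argv = ["mysql"]
    if skip_rehash:
        # Long form of -A: do not pre-read table and column names.
        argv.append("--no-auto-rehash")
    if socket_path:
        argv.append("--socket=" + socket_path)
    argv.append(database)
    return argv


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(mysql_argv("xx_xxx", "/tmp/mysql.sock")))
```

The list form can be passed directly to `subprocess.run` once credentials are supplied by whatever means the environment uses.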

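  Fortunately, with the pre-read disabled I could list all tables in the current database and view system variables and status values. I then attempted a batch repair of the InnoDB and MyISAM tables. The data should be backed up first with select ... into; I skipped that step only because this was a test environment with a redundant copy. The statements below query all base tables and build the SQL to run. Sure enough, they turned up a damaged table that could not be repaired, which matched the MySQL error log.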

## Batch-repair MyISAM tables
select concat('repair table ',table_name,';') from information_schema.tables where table_schema = 'xx_xxxx' and table_type = 'BASE TABLE' and engine = 'MyISAM';
 
# Batch-repair InnoDB tables
select concat('optimize table ',table_name,';') from information_schema.tables where table_schema = 'xx_xxxx' and table_type = 'BASE TABLE' and engine = 'InnoDB';

##### optimize table xx_xxxxxx_fans_person;  
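
The statement generation above can also be done in a script once the table list has been fetched. The Python sketch below is an illustration under assumptions, not the author's tooling: `build_repair_statements` and `quote_ident` are hypothetical names, the input is assumed to be (table_name, engine) pairs already read from information_schema.tables, and `legacy_log` is a made-up MyISAM table. It keeps the same mapping as the two queries (repair for MyISAM, optimize for InnoDB) and adds backtick quoting and a schema prefix.

```python
def quote_ident(name):
    """Backtick-quote a MySQL identifier, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def build_repair_statements(schema, tables):
    """Build one statement per table from (table_name, engine) pairs."""
    # Same mapping as the two queries above.
    verbs = {"myisam": "REPAIR TABLE", "innodb": "OPTIMIZE TABLE"}
    statements = []
    for table_name, engine in tables:
        verb = verbs.get((engine or "").lower())
        if verb is None:
            continue  # other engines are skipped in this sketch
        statements.append(
            f"{verb} {quote_ident(schema)}.{quote_ident(table_name)};"
        )
    return statements


# Example rows; legacy_log is a hypothetical table used for illustration.
rows = [("xx_xxxxxx_fans_person", "InnoDB"), ("legacy_log", "MyISAM")]
for stmt in build_repair_statements("xx_xxxx", rows):
    print(stmt)
```

Qualifying each table with its schema means the generated statements do not depend on a prior use, which is convenient when that step itself is what fails.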

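  The statements above identified the table whose repair did not succeed, and a row count confirmed that its data could no longer be exported. I dropped the bad table and recreated it (drop table may report that the table does not exist, and create table may fail with errors such as 1030, "Got error 168 from storage engine", or 1813, as in the transcript below), imported the most recent backup, and restarted the MySQL instance. That resolved the problem, and the affected database was accessible again.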

1. Drop the bad table xx_xxxxxx_fans_person
2. Recreate the table
mysql>   CREATE TABLE `xx_xxxxxx_fans_person` (
    ->   `person_id` int(20) NOT NULL AUTO_INCREMENT,
    ->   `person_circle_id` int(20) NOT NULL,
    ->   `person_user_id` int(20) NOT NULL,
    ->   `person_time` datetime NOT NULL,
    ->   `type` int(4) DEFAULT '1' COMMENT '1. group leader 2. member',
    ->   `merchant_id` int(11) DEFAULT '0',
    ->   `leave_type` int(11) DEFAULT '0' COMMENT 'leave status: 0. not on leave 1. on leave',
    ->   `leave_start_time` datetime DEFAULT NULL COMMENT 'leave start time',
    ->   `leave_end_time` datetime DEFAULT NULL COMMENT 'leave end time',
    ->   `is_invalid` int(10) DEFAULT '0' COMMENT 'invalidated: 0 valid 1 invalid',
    ->   `invalid_id` int(10) DEFAULT '0' COMMENT 'ID of the associated invalidation record',
    ->   PRIMARY KEY (`person_id`),
    ->   KEY `person_circle_id` (`person_circle_id`) USING BTREE,
    ->   KEY `person_user_id` (`person_user_id`) USING BTREE
    -> ) ENGINE=InnoDB;
ERROR 1030 (HY000): Got error 168 from storage engine
3. Restart the instance
mysql> select count(*) from xx_xxxxxx_fans_person;
ERROR 1812 (HY000): Tablespace is missing for table `xx_xxxxxx`.`xx_xxxxxx_fans_person`.

mysql> drop table xx_xxxxxx_fans_person;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE `xx_xxxxxx_fans_person` (
    ->   `person_id` int(20) NOT NULL AUTO_INCREMENT,
    ->   `person_circle_id` int(20) NOT NULL,
    ->   `person_user_id` int(20) NOT NULL,
    ->   `person_time` datetime NOT NULL,
    ->   `type` int(4) DEFAULT '1' COMMENT '1. group leader 2. member',
    ->   `merchant_id` int(11) DEFAULT '0',
    ->   `leave_type` int(11) DEFAULT '0' COMMENT 'leave status: 0. not on leave 1. on leave',
    ->   `leave_start_time` datetime DEFAULT NULL COMMENT 'leave start time',
    ->   `leave_end_time` datetime DEFAULT NULL COMMENT 'leave end time',
    ->   `is_invalid` int(10) DEFAULT '0' COMMENT 'invalidated: 0 valid 1 invalid',
    ->   `invalid_id` int(10) DEFAULT '0' COMMENT 'ID of the associated invalidation record',
    ->   PRIMARY KEY (`person_id`),
    ->   KEY `person_circle_id` (`person_circle_id`) USING BTREE,
    ->   KEY `person_user_id` (`person_user_id`) USING BTREE
    -> ) ENGINE=InnoDB;
ERROR 1813 (HY000): Tablespace '`xx_xxxxxx`.`xx_xxxxxx_fans_person`' exists.

IV. Summary

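  1. Back up databases on a schedule to guard against data file corruption caused by hardware or other problems.

  2. Analyze the problem first and rule out the basic impossibilities. Always check the logs, and pay attention to the hints printed with command errors (they may help with diagnosis or repair).

  3. The -A option, which skips loading database information, can be used to get into the database and attempt table repair. Make a backup first.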

