How to restart a MariaDB Galera cluster after it has been completely stopped


1. Problem scenario

1. This rarely happens in production.

2. It does happen in test environments: for example, after testing on a few virtual machines on your own computer, you shut them all down, and when you later try to start the cluster again it fails with an error.

[System environment]

CentOS 7 + MariaDB 10.1.22 + Galera Cluster

[Solution]

1. For the initial bootstrap of the cluster, use the command galera_new_cluster (other versions may use a different command; consult their documentation).

2. To restart after the whole cluster has been shut down, log in to one node (ideally the last one to leave the cluster, i.e. the one with the highest seqno in grastate.dat) and run:

vim /var/lib/mysql/grastate.dat

# GALERA saved state
version: 2.1
uuid:    <your cluster's id>
seqno:   -1
safe_to_bootstrap: 0

Change safe_to_bootstrap: 0 to safe_to_bootstrap: 1.

3. Then bootstrap the cluster again: galera_new_cluster

4. On the other nodes: systemctl start mariadb
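The steps above can be sketched as a small helper that picks the right node to bootstrap: the one whose grastate.dat carries the highest seqno. This is only a sketch over sample files; node1.dat, node2.dat, node3.dat are made-up names standing in for copies of each node's /var/lib/mysql/grastate.dat.

```shell
#!/bin/sh
# Sketch: pick the node with the highest seqno from copies of each
# node's /var/lib/mysql/grastate.dat. File names are hypothetical.
set -eu

tmpdir=$(mktemp -d)
# Sample grastate.dat contents standing in for the three nodes.
printf 'version: 2.1\nuuid: f84f94a1\nseqno: 12\nsafe_to_bootstrap: 0\n' > "$tmpdir/node1.dat"
printf 'version: 2.1\nuuid: f84f94a1\nseqno: 15\nsafe_to_bootstrap: 0\n' > "$tmpdir/node2.dat"
printf 'version: 2.1\nuuid: f84f94a1\nseqno: 9\nsafe_to_bootstrap: 0\n'  > "$tmpdir/node3.dat"

best_node=""
best_seqno=-1
for f in "$tmpdir"/node*.dat; do
    s=$(awk -F': *' '/^seqno:/ {print $2}' "$f")
    # seqno -1 means the node crashed without recording its position;
    # prefer any node that has a real (higher) seqno.
    if [ "$s" -gt "$best_seqno" ]; then
        best_seqno=$s
        best_node=$f
    fi
done
echo "bootstrap from: $best_node (seqno $best_seqno)"
```

On the chosen node you would then set safe_to_bootstrap: 1 and run galera_new_cluster; the remaining nodes only need systemctl start mariadb.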

 

Problem 2:

[root@controller1 haproxy]# galera_new_cluster 
Job for mariadb.service failed because the control process exited with error code.
See "systemctl status mariadb.service" and "journalctl -xe" for details.
[root@controller1 haproxy]# tail /var/log/mariadb/mariadb.log 
2018-03-21 12:16:18 140168333977920 [Note] WSREP: GCache history reset: f84f94a1-2c38-11e8-8ede-96f87262fb85:0 -> f84f94a1-2c38-11e8-8ede-96f87262fb85:-1
2018-03-21 12:16:18 140168333977920 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2018-03-21 12:16:18 140168333977920 [Note] WSREP: wsrep_sst_grab()
2018-03-21 12:16:18 140168333977920 [Note] WSREP: Start replication
2018-03-21 12:16:18 140168333977920 [Note] WSREP: 'wsrep-new-cluster' option used, bootstrapping the cluster
2018-03-21 12:16:18 140168333977920 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2018-03-21 12:16:18 140168333977920 [ERROR] WSREP: It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .
2018-03-21 12:16:18 140168333977920 [ERROR] WSREP: wsrep::connect(gcomm://controller1,controller2,controller3) failed: 7
2018-03-21 12:16:18 140168333977920 [ERROR] Aborting

# Solution
[root@controller1 haproxy]# cat /var/lib/mysql/grastate.dat 
# GALERA saved state
version: 2.1
uuid:    d6aea58b-2cbe-11e8-9c9d-b72d8fdd0931
seqno:   -1
safe_to_bootstrap: 0  

Change safe_to_bootstrap: 0 to safe_to_bootstrap: 1

# Then bootstrap the cluster again
[root@controller1 haproxy]# galera_new_cluster 

# Start the service on the other nodes:
systemctl start mariadb
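The one-line edit above can also be made with sed. A minimal sketch, shown here on a temporary copy of the file; on a real node the path is /var/lib/mysql/grastate.dat and the edit needs root (GNU sed assumed for -i):

```shell
#!/bin/sh
# Sketch: flip safe_to_bootstrap from 0 to 1, which is exactly the edit
# the WSREP error message asks for. Performed on a temp copy here.
set -eu

f=$(mktemp)
cat > "$f" <<'EOF'
# GALERA saved state
version: 2.1
uuid:    d6aea58b-2cbe-11e8-9c9d-b72d8fdd0931
seqno:   -1
safe_to_bootstrap: 0
EOF

sed -i 's/^safe_to_bootstrap: *0/safe_to_bootstrap: 1/' "$f"
grep '^safe_to_bootstrap' "$f"
```

After making this change on the real file, galera_new_cluster on that node should succeed.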

 

2. Handling common MySQL Galera cluster problems

Problem 1: After a prolonged network outage, or after all nodes have gone down, the MySQL HA cluster can be recovered as follows:

Solution 1:
1. Once the three machines can reach each other again, MySQL is in an abnormal state and cannot rejoin the cluster, so first make sure mysqld is stopped on every node. Then run /usr/libexec/mysqld --wsrep-new-cluster --wsrep-cluster-address='gcomm://' & on each node. Only one machine will keep this process alive; on the others it exits abnormally after a few seconds. Call the machine where it succeeds the master.

2. Only that master can now open a mysql session (i.e. run the mysql command). On the other two nodes, run systemctl start mysqld one node at a time; wait for one to come up successfully before starting the next.

3. In mysql, run show status like "wsrep%"; (the original output was a screenshot and is not reproduced here).

Make sure wsrep_local_state_comment is Synced and wsrep_incoming_addresses lists the IPs of all three MySQL nodes.

4. If step 3 shows the expected result, the cluster has recovered. Kill the /usr/libexec/mysqld --wsrep-new-cluster --wsrep-cluster-address='gcomm://' process on the master, then run systemctl start mysqld there.
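The check in step 3 can be scripted. A sketch that parses tab-separated status output like that of mysql -N -e 'show status like "wsrep%"'; since no live cluster is assumed, a canned sample (with made-up IPs) stands in for the real query:

```shell
#!/bin/sh
# Sketch: verify the two wsrep status values from step 3.
# The sample below stands in for real `mysql -N -e ...` output.
set -eu

status=$(printf 'wsrep_local_state_comment\tSynced\nwsrep_incoming_addresses\t10.0.0.1:3306,10.0.0.2:3306,10.0.0.3:3306\n')

state=$(printf '%s\n' "$status" | awk -F'\t' '$1=="wsrep_local_state_comment" {print $2}')
addrs=$(printf '%s\n' "$status" | awk -F'\t' '$1=="wsrep_incoming_addresses" {print $2}')
# Count cluster members by splitting the comma-separated address list.
n_nodes=$(printf '%s\n' "$addrs" | tr ',' '\n' | wc -l)

echo "state=$state nodes=$n_nodes"
```

A healthy three-node cluster should report state Synced and three incoming addresses.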

Problem 2: A node in the MySQL HA cluster goes down unexpectedly and stays down for a while. Recover it as follows:

1. If the log contains the following:

160119 14:11:05 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (eb9f50c6-bc95-11e5-a735-9f48e437dc03): 1 (Operation not permitted)

Solution: delete the file /var/lib/mysql/grastate.dat (if the node still fails to sync, also delete the galera.cache file); the node will then rejoin via a full state transfer.

2. If the downed node logs the following:

(Abnormal case: the whole cluster crashed) [ERROR] Found 1 prepared transactions! It means that mysqld was not shut down properly last time and critical recovery information (last binlog or tc.log file) was manually deleted after a crash. You have to start mysqld with --tc-heuristic-recover switch to commit or rollback pending transactions

Solutions:
1. Start mysqld with --innodb_force_recovery=6 (for example /usr/libexec/mysqld --innodb_force_recovery=6, or set innodb_force_recovery = 6 in the config). The recovery levels are:
1 (SRV_FORCE_IGNORE_CORRUPT): ignore corrupt pages that are detected.
2 (SRV_FORCE_NO_BACKGROUND): keep the master thread from running (a full purge by the master thread could otherwise cause a crash).
3 (SRV_FORCE_NO_TRX_UNDO): do not roll back transactions.
4 (SRV_FORCE_NO_IBUF_MERGE): do not merge the insert buffer.
5 (SRV_FORCE_NO_UNDO_LOG_SCAN): do not scan the undo logs; InnoDB treats uncommitted transactions as committed.
6 (SRV_FORCE_NO_LOG_REDO): do not apply the redo log (no roll-forward).
If, after this setting, the log keeps repeating the following:

130507 14:14:01  InnoDB: Waiting for the background threads to start
130507 14:14:02  InnoDB: Waiting for the background threads to start
130507 14:14:03  InnoDB: Waiting for the background threads to start
(... the same line repeats every second ...)

then also add this to galera.cfg: whenever innodb_force_recovery > 2 is set, set innodb_purge_threads = 0 as well.
2. Run mysqld --tc-heuristic-recover=ROLLBACK.
3. Delete /var/lib/mysql/ib_logfile*.
4. When a MySQL node goes down and the three MySQL hosts span different subnets, the node needs an SST to rejoin, and the SST must advertise an address the donor can reach. Set the --wsrep-sst-receive-address parameter explicitly, otherwise the sync may use an IP on a subnet the three machines do not share.
Reference:
http://blog.itpub.net/22664653/viewspace-1441389/


Problem 3: When a node has been down for a while, it needs time to resync data on restart. If the systemd start timeout is too short, the service fails to start. Fix it as follows:

The correct way to adjust systemd settings so they don't get overwritten is to create a directory and file as such:

/etc/systemd/system/mariadb.service.d/timeout.conf

[Service]
TimeoutStartSec=12min

Or edit /usr/lib/systemd/system/mariadb.service directly:

[Service]
TimeoutStartSec=12min

The timeout must be greater than 90s (the default). After editing, run systemctl daemon-reload, then restart the service.
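The drop-in approach can be sketched as follows. For safety this sketch writes under a temporary directory; on a real host the target is /etc/systemd/system/mariadb.service.d/timeout.conf, the commands need root, and you would follow up with systemctl daemon-reload && systemctl restart mariadb:

```shell
#!/bin/sh
# Sketch: create the systemd drop-in shown above. Written to a temp
# directory here so the snippet is runnable without root; the real
# path is /etc/systemd/system/mariadb.service.d/timeout.conf.
set -eu

root=$(mktemp -d)
dropin="$root/etc/systemd/system/mariadb.service.d"
mkdir -p "$dropin"

cat > "$dropin/timeout.conf" <<'EOF'
[Service]
TimeoutStartSec=12min
EOF

cat "$dropin/timeout.conf"
```

A drop-in survives package upgrades, whereas edits to /usr/lib/systemd/system/mariadb.service can be overwritten, which is why the drop-in is the preferred form.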
Problem 4: The log shows errors similar to the following:
160428 13:54:49 [ERROR] Slave SQL: Error 'Table 'manage_operations' already exists' on query. Default database: 'horizon'. Query: 'CREATE TABLE `manage_operations` (
    `id` integer AUTO_INCREMENT NOT NULL PRIMARY KEY,
    `name` varchar(50) NOT NULL,
    `type` varchar(20) NOT NULL,
    `operation` varchar(20) NOT NULL,
    `status` varchar(20) NOT NULL,
    `time` date NOT NULL,
    `operator` varchar(50) NOT NULL
) default charset=utf8', Error_code: 1050
160428 13:54:49 [Warning] WSREP: RBR event 1 Query apply warning: 1, 28585
160428 13:54:49 [Warning] WSREP: Ignoring error for TO isolated action: source: 752eecd1-0ce0-11e6-83fc-3e0502d0bdd2 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 24053 trx_id: -1 seqnos (l: 28668, g: 28585, s: 28584, d: 28584, ts: 80224119986850)
These errors cause the process to shut down abnormally. Run mysqladmin flush-tables to flush the tables. The cause is an inconsistency in table synchronization among the three nodes; flushing the tables fixes it.


Problem 5: The log shows the following errors:
160520 10:48:23 [Note] WSREP: COMMIT failed, MDL released: 367194
160520 10:48:23 [Note] WSREP: cert failure, thd: 358780 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 358784 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: COMMIT failed, MDL released: 367188
160520 10:48:23 [Note] WSREP: cert failure, thd: 359683 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 358808 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 367191 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 367196 is_AC: 0, retry: 0 - 1 SQL: commit
160520 10:48:23 [Note] WSREP: cert failure, thd: 367194 is_AC: 0, retry: 0 - 1 SQL: commit

160520 10:48:23 [Note] WSREP: cert failure, thd: 367188 is_AC: 0, retry: 0 - 1 SQL: commit

These certification failures occur when two nodes try to commit conflicting writes at nearly the same time; Galera rolls one of the transactions back and the application should retry it. In multi-master use this is expected behavior rather than a fault.

Problem 6: The log shows the following errors:

160820  3:13:41 [ERROR] Error in accept: Too many open files
160820  3:19:42 [ERROR] Error in accept: Too many open files
160827  3:16:24 [ERROR] Error in accept: Too many open files
160831 17:20:52 [ERROR] Error in accept: Too many open files
160831 19:54:29 [ERROR] Error in accept: Too many open files
160831 20:21:53 [ERROR] Error in accept: Too many open files
160901 11:25:57 [ERROR] Error in accept: Too many open files

Solution:

vim /usr/lib/systemd/system/mariadb.service

[Service]
LimitNOFILE=10000

MySQL's default open_files_limit is 1024; raise it here, and also raise the open_files_limit value in /etc/my.cnf.d/server.cnf.

systemctl daemon-reload

systemctl restart mariadb

Check whether the new open-files limit took effect:

cat /proc/$pid/limits

where $pid is the PID of the mysqld process. Verify the "Max open files" value there, and watch the log to confirm the errors above no longer appear.
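A quick way to sanity-check the effective limit is sketched below. For a running mysqld you would read /proc/<pid>/limits on Linux; this sketch checks the current shell via ulimit instead, so it runs anywhere:

```shell
#!/bin/sh
# Sketch: check the effective open-files soft limit for the current
# process. For mysqld itself, read /proc/<pid>/limits instead.
set -eu

soft_limit=$(ulimit -n)
echo "current open-files soft limit: $soft_limit"

# Flag it if the limit is still at or below the low default noted above.
if [ "$soft_limit" != "unlimited" ] && [ "$soft_limit" -le 1024 ]; then
    echo "limit is low; raise LimitNOFILE and open_files_limit"
fi
```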

