Architecture Design and Required Configuration
Host Environment
IP                Hostname        Role
192.168.192.128   node_master     MySQL-Master | MHA-Node
192.168.192.129   node_slave      MySQL-Slave  | MHA-Node (candidate master)
192.168.192.130   manager_slave   MySQL-Slave  | MHA-Manager
To save machines, the read-only slave 192.168.192.129 (it serves no external read traffic) is chosen as the candidate master, or it can be dedicated to backups.
Likewise, to save machines, the slave 192.168.192.130 doubles as the manager server (in a real production environment, with machines to spare, a dedicated host is normally used as the manager server).

Tip:
Quickly change the hostnames (mine needed changing):
[root@master1 ~]# hostnamectl set-hostname node_master
[root@node1 ~]# hostnamectl set-hostname node_slave
[root@node2 ~]# hostnamectl set-hostname manager_slave
Required Configuration
1. Add entries to every machine's hosts file so the machines can be reached by hostname (do this on all 3 machines; strictly speaking this step is optional)

[root@node_master ~]# vim /etc/hosts
..............
192.168.192.128 node_master
192.168.192.129 node_slave
192.168.192.130 manager_slave

You can also edit the file on one machine and push it out with scp (provided the other machines have no hosts entries of their own):
[root@node_master ~]# scp /etc/hosts root@192.168.192.129:/etc/hosts
[root@node_master ~]# scp /etc/hosts root@192.168.192.130:/etc/hosts

2. Set up passwordless SSH login between the servers (do this on all 3 machines; this step is mandatory and absolutely critical)

[root@node_master ~]# ssh-keygen -t rsa -P "" -f /root/.ssh/id_rsa
[root@node_master ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.192.128"
[root@node_master ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.192.129"
[root@node_master ~]# ssh-copy-id -i /root/.ssh/id_rsa.pub "root@192.168.192.130"

Test passwordless login (normally you are logged in directly, with no password prompt):
[root@node_master ~]# ssh 192.168.192.128
Last login: Tue Dec 18 15:56:41 2018 from 192.168.192.1
[root@node_master ~]#
[root@node_master ~]# ssh 192.168.192.129
Last login: Tue Dec 18 16:03:56 2018 from 192.168.192.128
[root@node_slave ~]#
[root@node_master ~]# ssh 192.168.192.130
Last login: Tue Dec 18 16:04:02 2018 from 192.168.192.128
[root@manager_slave ~]#

To be absolutely safe, repeat the passwordless-login test from the other two machines as well; the sketch below automates the check.
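A minimal sketch that runs the full-mesh SSH test from whichever node you execute it on. The host list comes from the table above; the script name check_ssh.sh is an arbitrary choice, and -o BatchMode=yes makes ssh fail instead of prompting when key-based login is broken:

#!/bin/bash
# check_ssh.sh - verify passwordless SSH to every node (run on each of the 3 machines)
for h in 192.168.192.128 192.168.192.129 192.168.192.130; do
    # BatchMode forces a non-zero exit code instead of a password prompt
    if ssh -o BatchMode=yes -o ConnectTimeout=5 root@$h hostname >/dev/null 2>&1; then
        echo "OK  : $h"
    else
        echo "FAIL: $h (key-based login not working)"
    fi
done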
Installing MySQL and Setting Up Replication
Replication environment design and installation (MySQL 5.7)
Environment design (one master, two slaves):
192.168.192.128 MySQL-Master (master)
192.168.192.129 MySQL-Slave  (slave)
192.168.192.130 MySQL-Slave  (slave)
Installation guide: https://www.cnblogs.com/brianzhu/p/8575243.html (installed via yum)
Configuring Replication
1. Master/slave my.cnf settings
Edit the replication configuration files (for a yum-installed MySQL the config file is /etc/my.cnf) as follows:

------ Master (192.168.192.128) ------
server-id=1                    # unique server ID; master and slave IDs must never collide
log-bin=mysql-bin              # enable the binlog and set its directory and filename prefix
binlog-ignore-db=mysql         # do not replicate the mysql system database; to skip several databases, either repeat this line for each one or list them on one line separated by commas
sync_binlog=1                  # make sure binlog writes are synced to disk
binlog_checksum=crc32          # skip existing checksummed events; from MySQL 5.6.5 on the default is binlog_checksum=crc32, older versions use binlog_checksum=none
binlog_format=mixed            # binlog format; MIXED helps prevent duplicate primary keys
validate_password_policy=0     # set the password policy
validate_password=off          # disable password validation

Save the configuration, then restart MySQL:
[root@node_master ~]# systemctl restart mysqld

------ Slave 1 (192.168.192.129) ------
server-id=2                    # unique server ID; master and slave IDs must never collide
log-bin=mysql-bin              # enable the binlog and set its directory and filename prefix
binlog-ignore-db=mysql         # do not replicate the mysql system database (important: the replication filter settings must be identical on master and slaves, otherwise masterha_check_repl will fail later!)
slave-skip-errors=all          # skip all replication errors and keep replicating
validate_password_policy=0     # set the password policy
validate_password=off          # disable password validation

Save the configuration, then restart MySQL:
[root@node_slave ~]# systemctl restart mysqld

------ Slave 2 (192.168.192.130) ------
server-id=3                    # unique server ID; master and slave IDs must never collide
log-bin=mysql-bin              # enable the binlog and set its directory and filename prefix
binlog-ignore-db=mysql         # do not replicate the mysql system database (important: the replication filter settings must be identical on master and slaves, otherwise masterha_check_repl will fail later!)
slave-skip-errors=all          # skip all replication errors and keep replicating
validate_password_policy=0     # set the password policy
validate_password=off          # disable password validation

Save the configuration, then restart MySQL:
[root@manager_slave ~]# systemctl restart mysqld

Note:
If you use the binlog-ignore-db or replicate-ignore-db filter rules, they must be identical on master and slaves: if one server uses binlog-ignore-db, all of them must use binlog-ignore-db; if one uses replicate-ignore-db, all of them must use replicate-ignore-db. Never mix the two, because MHA checks the filter rules at startup and will not start monitoring or perform failover if they differ.

2. Create the MHA management account (run on all three nodes)
mysql> grant super,reload,replication client,select on *.* to manager@'192.168.192.%' identified by 'Manager_1234';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> grant create,insert,update,delete,drop on *.* to manager@'192.168.192.%';
Query OK, 0 rows affected (0.00 sec)

3. Create the replication account (run on all three nodes)
mysql> grant reload,super,replication slave on *.* to 'slave'@'192.168.192.%' identified by 'Slave_1234';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.01 sec)

4. Configure replication
On the master (192.168.192.128), run:
mysql> show master status;
+------------------+----------+--------------+------------------+-------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-------------------+
| mysql-bin.000001 |     1169 |              | mysql            |                   |
+------------------+----------+--------------+------------------+-------------------+
1 row in set (0.00 sec)

On both slaves (192.168.192.129 and 192.168.192.130):
Stop the slave first:
mysql> stop slave;
Query OK, 0 rows affected, 1 warning (0.00 sec)

Point the slave at the master:
mysql> change master to master_host='192.168.192.128',master_port=3306,master_user='slave',master_password='Slave_1234',master_log_file='mysql-bin.000001',master_log_pos=1169;
Query OK, 0 rows affected, 2 warnings (0.01 sec)

Start replication:
mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

Check the replication status (Slave_IO_Running and Slave_SQL_Running both Yes means replication is working):
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 192.168.192.128
                  Master_User: slave
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 1169
               Relay_Log_File: node_slave-relay-bin.000002
                Relay_Log_Pos: 320
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
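Before moving on, it can save time to confirm both slaves are healthy without opening a mysql shell on each one. A rough sketch, assuming the passwordless SSH set up earlier and the local root password 12345 that this article uses later for relay-log maintenance:

#!/bin/bash
# check_repl.sh - print the replication thread state and lag of both slaves
for h in 192.168.192.129 192.168.192.130; do
    echo "== $h =="
    ssh root@$h "mysql -uroot -p12345 -e 'show slave status\G'" 2>/dev/null \
        | egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'
done

Both thread lines should print Yes and the lag should be near 0.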
For the details of how replication works, see: https://www.cnblogs.com/brianzhu/p/10153802.html
For a detailed replication deployment walkthrough, see: https://www.cnblogs.com/brianzhu/p/10154446.html
Installing and Configuring MHA
MHA Downloads
----------------------------------------------MHA downloads----------------------------------------------
MHA consists of manager and node (data) packages:
Data nodes: the hosts in the existing MySQL replication topology, at least 3 (i.e. 1 master, 2 slaves) so that a master/slave structure survives a master failover; these only need the node package.
Manager server: runs the monitoring scripts and is responsible for monitoring and auto-failover; it needs both the node and manager packages.
Downloads:
MHA Node:    https://coding.net/u/brian_zhu/p/mha4/git/raw/master/mha4mysql-node-0.58.tar.gz
MHA Manager: https://coding.net/u/brian_zhu/p/mha4/git/raw/master/mha4mysql-manager-0.58.tar.gz
Installing MHA
1. Installing MHA Node
----------------------------------------------MHA Node installation----------------------------------------------
Install MHA Node on all data nodes (all three machines need MHA Node).

First install the required perl modules:
[root@node_master ~]# yum -y install perl perl-DBD-MySQL perl-ExtUtils-CBuilder perl-ExtUtils-MakeMaker perl-CPAN

Download and unpack:
[root@node_master ~]# wget https://coding.net/u/brian_zhu/p/mha4/git/raw/master/mha4mysql-node-0.58.tar.gz
[root@node_master ~]# tar zxf mha4mysql-node-0.58.tar.gz
[root@node_master ~]# cd mha4mysql-node-0.58

Build and install:
[root@node_master mha4mysql-node-0.58]# perl Makefile.PL
[root@node_master mha4mysql-node-0.58]# make && make install

2. Installing MHA Manager
----------------------------------------------MHA Manager installation----------------------------------------------
Install MHA Manager on the manager node (192.168.192.130); note that the manager node needs MHA Node as well.

Install the epel-release repository:
[root@manager_slave ~]# yum -y install epel-release

Install the perl MySQL packages:
[root@manager_slave ~]# yum install -y perl-DBD-MySQL perl-Config-Tiny perl-Log-Dispatch perl-Parallel-ForkManager perl-Config-IniFiles perl-Time-HiRes -y

Download and unpack:
[root@manager_slave ~]# wget https://coding.net/u/brian_zhu/p/mha4/git/raw/master/mha4mysql-manager-0.58.tar.gz
[root@manager_slave ~]# tar zxf mha4mysql-manager-0.58.tar.gz

Build and install:
[root@manager_slave ~]# cd mha4mysql-manager-0.58/
[root@manager_slave mha4mysql-manager-0.58]# perl Makefile.PL
[root@manager_slave mha4mysql-manager-0.58]# make && make install

After MHA Manager is installed, the following scripts appear in /usr/local/bin:
[root@manager_slave mha4mysql-manager-0.58]# ll /usr/local/bin/
total 39060
-r-xr-xr-x 1 root root    17639 Dec 18 16:56 apply_diff_relay_logs
-rwxr-xr-x 1 root root 11739376 Oct 18 09:41 docker-compose
-rwxr-xr-x 1 root root 28160480 Oct 23 16:42 docker-machine
-r-xr-xr-x 1 root root     4807 Dec 18 16:56 filter_mysqlbinlog
-r-xr-xr-x 1 root root     1995 Dec 18 17:00 masterha_check_repl
-r-xr-xr-x 1 root root     1779 Dec 18 17:00 masterha_check_ssh
-r-xr-xr-x 1 root root     1865 Dec 18 17:00 masterha_check_status
-r-xr-xr-x 1 root root     3201 Dec 18 17:00 masterha_conf_host
-r-xr-xr-x 1 root root     2517 Dec 18 17:00 masterha_manager
-r-xr-xr-x 1 root root     2165 Dec 18 17:00 masterha_master_monitor
-r-xr-xr-x 1 root root     2373 Dec 18 17:00 masterha_master_switch
-r-xr-xr-x 1 root root     5172 Dec 18 17:00 masterha_secondary_check
-r-xr-xr-x 1 root root     1739 Dec 18 17:00 masterha_stop
-r-xr-xr-x 1 root root     8337 Dec 18 16:56 purge_relay_logs
-r-xr-xr-x 1 root root     7525 Dec 18 16:56 save_binary_logs

Where:
masterha_check_repl        check MySQL replication health
masterha_check_ssh         check MHA's SSH configuration
masterha_check_status      check the current MHA running state
masterha_conf_host         add or remove configured server entries
masterha_manager           start MHA
masterha_stop              stop MHA
masterha_master_monitor    detect whether the master is down
masterha_master_switch     control failover (automatic or manual)
masterha_secondary_check   check master reachability over multiple routes

In addition, ../mha4mysql-manager-0.58/samples/scripts/ contains the scripts below, which need to be copied to /usr/local/bin:
[root@manager_slave mha4mysql-manager-0.58]# ll ../mha4mysql-manager-0.58/samples/scripts/
total 32
-rwxr-xr-x 1 1000 1000  3648 Mar 23  2018 master_ip_failover       # VIP management script for automatic failover; not mandatory. With keepalived you can write your own VIP management, e.g. monitor mysql and stop keepalived when mysql fails so the VIP floats away automatically
-rwxr-xr-x 1 1000 1000  9870 Mar 23  2018 master_ip_online_change  # VIP script for online switchover; not mandatory, a simple shell script will also do
-rwxr-xr-x 1 1000 1000 11867 Mar 23  2018 power_manager            # script to power off the master after a failure; not mandatory
-rwxr-xr-x 1 1000 1000  1360 Mar 23  2018 send_report              # alerting script run after a failover; not mandatory, a simple shell script will also do
[root@manager_slave mha4mysql-manager-0.58]# cp ../mha4mysql-manager-0.58/samples/scripts/* /usr/local/bin/
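After installing on all machines, a quick loop can confirm the MHA Node tools actually landed on every host. A sketch, assuming the default /usr/local/bin install location and the SSH setup from earlier:

#!/bin/bash
# confirm the MHA Node binaries exist on every node
for h in 192.168.192.128 192.168.192.129 192.168.192.130; do
    echo "== $h =="
    ssh root@$h 'for b in save_binary_logs apply_diff_relay_logs purge_relay_logs filter_mysqlbinlog; do
        command -v $b >/dev/null && echo "  $b: ok" || echo "  $b: MISSING"
    done'
done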
Configuring MHA
1. MHA Manager configuration (the MHA configuration file)
----------------------------------------------MHA Manager configuration----------------------------------------------
On the manager node (192.168.192.130):
[root@manager_slave mha4mysql-manager-0.58]# mkdir -p /etc/masterha
[root@manager_slave mha4mysql-manager-0.58]# cp samples/conf/app1.cnf /etc/masterha/
[root@manager_slave mha4mysql-manager-0.58]# vim /etc/masterha/app1.cnf
[server default]
manager_workdir=/var/log/masterha/app1              # manager working directory
manager_log=/var/log/masterha/app1/manager.log      # manager log file
ssh_user=root                # account used for passwordless SSH login
user=manager                 # manager user
password=Manager_1234        # manager user's password
repl_user=slave              # MySQL replication account, used to sync binlogs between master and slaves
repl_password=Slave_1234     # password of the monitoring/replication account created earlier
ping_interval=1              # interval in seconds between pings that check whether the master is alive; the default is 3 seconds, and failover is triggered automatically after three attempts with no response
master_ip_failover_script= /usr/local/bin/master_ip_failover            # switch script used during automatic failover
master_ip_online_change_script= /usr/local/bin/master_ip_online_change  # switch script used during manual switchover

[server1]
hostname=192.168.192.128
port=3306
master_binlog_dir=/var/lib/mysql/   # where the master keeps its binlogs, so MHA can find them; here it is the MySQL data directory

[server2]
hostname=192.168.192.129
port=3306
candidate_master=1           # mark this host as candidate master: when the master dies it is promoted first, even if it is not the slave with the most recent events in the cluster
check_repl_delay=0           # by default MHA will not choose a slave as the new master if it is more than 100M of relay logs behind, because recovering it would take a long time; with check_repl_delay=0 MHA ignores replication delay when choosing the new master. Very useful together with candidate_master=1, since the candidate must become the new master during a switchover
master_binlog_dir=/var/lib/mysql/

[server3]
hostname=192.168.192.130
port=3306
#candidate_master=1
master_binlog_dir=/var/lib/mysql/

#[server4]
#hostname=host4
#no_master=1

2. Set the relay log purging policy (on both slave nodes)
Tip:
During a switchover MHA relies on relay log data to recover the other slaves, so automatic relay log purging must be set to OFF and relay logs purged by hand instead.
By default a slave deletes its relay logs automatically once the SQL thread has executed them. In an MHA environment, however, those relay logs may be needed when recovering other slaves, so automatic deletion has to be disabled. Periodic manual purging then has to allow for replication delay: on an ext3 filesystem, deleting a large file takes long enough to cause serious replication delay. To avoid this, a hard link to the relay log is created first, because on Linux deleting a large file through a hard link is fast. (The same hard-link trick is commonly used when dropping large tables in MySQL.)
The MHA node package includes the purge_relay_logs tool, which creates hard links for the relay logs, executes SET GLOBAL relay_log_purge=1, waits a few seconds for the SQL thread to switch to a new relay log, and then executes SET GLOBAL relay_log_purge=0.
purge_relay_logs parameters:
--user                     MySQL user name
--password                 MySQL password
--port                     port number
--workdir                  where to create the relay log hard links; defaults to /var/tmp. Hard links cannot be created across filesystems, so point this at a location on the same partition; once the script succeeds, the hard-linked relay log files are deleted
--disable_relay_log_purge  by default, if relay_log_purge=1 the script cleans nothing and exits; with this flag it sets relay_log_purge to 0, purges the relay logs, and finally leaves the variable OFF

Disable automatic relay log purging (both slave nodes):
[root@node_slave ~]# mysql -uroot -p12345 -e 'set global relay_log_purge=0'
[root@manager_slave ~]# mysql -uroot -p12345 -e 'set global relay_log_purge=0'

Set up a script that purges relay logs periodically (on both slave nodes):
[root@node_slave ~]# vim /root/purge_relay_log.sh
#!/bin/bash
user=root
passwd=12345
port=3306
host=localhost
log_dir='/data/masterha/log'
work_dir='/data'
purge='/usr/local/bin/purge_relay_logs'
if [ ! -d $log_dir ]
then
    mkdir -p $log_dir
fi
$purge --user=$user --password=$passwd --disable_relay_log_purge --port=$port --workdir=$work_dir >> $log_dir/purge_relay_logs.log 2>&1

Make the script executable:
[root@node_slave ~]# chmod 755 /root/purge_relay_log.sh

Schedule it with crontab:
[root@node_slave ~]# crontab -e
0 5 * * * /bin/bash /root/purge_relay_log.sh

Test the script:
purge_relay_logs does not block the SQL thread while deleting relay logs. Run it by hand to see what it does:
[root@node_slave ~]# /usr/local/bin/purge_relay_logs --user=root --host=localhost --password=12345 --disable_relay_log_purge --port=3306 --workdir=/data
2018-12-18 17:48:25: purge_relay_logs script started.
 Found relay_log.info: /var/lib/mysql/relay-log.info
 Opening /var/lib/mysql/node_slave-relay-bin.000001 ..
 Opening /var/lib/mysql/node_slave-relay-bin.000002 ..
 Executing SET GLOBAL relay_log_purge=1; FLUSH LOGS; sleeping a few seconds so that SQL thread can delete older relay log files (if it keeps up); SET GLOBAL relay_log_purge=0; .. ok.
2018-12-18 17:48:28: All relay log purging operations succeeded.

Run the wrapper script:
[root@node_slave ~]# sh purge_relay_log.sh

It produced a log file:
[root@node_slave ~]# ll /data/masterha/log/
total 4
-rw-r--r-- 1 root root 234 Dec 18 17:49 purge_relay_logs.log
Testing MHA
Checking the state of the MHA cluster

1. Check the cluster's SSH setup
------------------------------Checking the SSH configuration------------------------------
Check the SSH configuration (on the manager node, i.e. 192.168.192.130).
Check SSH connectivity from the MHA Manager to every MHA Node:
[root@manager_slave ~]# masterha_check_ssh --conf=/etc/masterha/app1.cnf
Tue Dec 18 17:53:34 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Dec 18 17:53:34 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Tue Dec 18 17:53:34 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Tue Dec 18 17:53:34 2018 - [info] Starting SSH connection tests..
Tue Dec 18 17:53:35 2018 - [debug]
Tue Dec 18 17:53:34 2018 - [debug]  Connecting via SSH from root@192.168.192.128(192.168.192.128:22) to root@192.168.192.129(192.168.192.129:22)..
Tue Dec 18 17:53:34 2018 - [debug]   ok.
Tue Dec 18 17:53:34 2018 - [debug]  Connecting via SSH from root@192.168.192.128(192.168.192.128:22) to root@192.168.192.130(192.168.192.130:22)..
Tue Dec 18 17:53:35 2018 - [debug]   ok.
Tue Dec 18 17:53:36 2018 - [debug]
Tue Dec 18 17:53:35 2018 - [debug]  Connecting via SSH from root@192.168.192.129(192.168.192.129:22) to root@192.168.192.128(192.168.192.128:22)..
Tue Dec 18 17:53:35 2018 - [debug]   ok.
Tue Dec 18 17:53:35 2018 - [debug]  Connecting via SSH from root@192.168.192.129(192.168.192.129:22) to root@192.168.192.130(192.168.192.130:22)..
Tue Dec 18 17:53:35 2018 - [debug]   ok.
Tue Dec 18 17:53:36 2018 - [debug]
Tue Dec 18 17:53:35 2018 - [debug]  Connecting via SSH from root@192.168.192.130(192.168.192.130:22) to root@192.168.192.128(192.168.192.128:22)..
Tue Dec 18 17:53:35 2018 - [debug]   ok.
Tue Dec 18 17:53:35 2018 - [debug]  Connecting via SSH from root@192.168.192.130(192.168.192.130:22) to root@192.168.192.129(192.168.192.129:22)..
Tue Dec 18 17:53:36 2018 - [debug]   ok.
Tue Dec 18 17:53:36 2018 - [info] All SSH connection tests passed successfully.

Output like the above means the SSH configuration is good.

2. Check the replication status
------------------------------Checking MySQL replication------------------------------
Check the replication environment with the MHA tool (on the manager node, i.e. 192.168.192.130):
[root@manager_slave ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf

A small hiccup occurred here: the command failed with the following error:
......................
Bareword "FIXME_xxx" not allowed while "strict subs" in use at /usr/local/bin/master_ip_failover line 93.
Execution of /usr/local/bin/master_ip_failover aborted due to compilation errors.
Tue Dec 18 18:28:28 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln229] Failed to get master_ip_failover_script status with return code 255:0.
Tue Dec 18 18:28:28 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations. at /usr/local/bin/masterha_check_repl line 48.
Tue Dec 18 18:28:28 2018 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Tue Dec 18 18:28:28 2018 - [info] Got exit code 1 (Not master dead).

The reason for the error:
There are two failover approaches: a virtual IP address or a global configuration file. MHA does not force either one on you; the virtual-IP approach pulls in additional software, such as keepalived, and also requires editing the master_ip_failover script.
Workaround:
For now, comment out the master_ip_failover_script= /usr/local/bin/master_ip_failover line in /etc/masterha/app1.cnf on the manager node. It will be re-enabled later, once keepalived is introduced and the script has been modified.
[root@manager_slave /]# cat /etc/masterha/app1.cnf | grep master_ip_failover_script
#master_ip_failover_script= /usr/local/bin/master_ip_failover

Then check the replication state of the whole MySQL cluster again with masterha_check_repl:
[root@manager_slave /]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Tue Dec 18 18:32:43 2018 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Dec 18 18:32:43 2018 - [info] Reading application default configuration from /etc/masterha/app1.cnf..
Tue Dec 18 18:32:43 2018 - [info] Reading server configuration from /etc/masterha/app1.cnf..
Tue Dec 18 18:32:43 2018 - [info] MHA::MasterMonitor version 0.58.
Tue Dec 18 18:32:45 2018 - [info] GTID failover mode = 0
Tue Dec 18 18:32:45 2018 - [info] Dead Servers:
Tue Dec 18 18:32:45 2018 - [info] Alive Servers:
Tue Dec 18 18:32:45 2018 - [info]   192.168.192.128(192.168.192.128:3306)
Tue Dec 18 18:32:45 2018 - [info]   192.168.192.129(192.168.192.129:3306)
Tue Dec 18 18:32:45 2018 - [info]   192.168.192.130(192.168.192.130:3306)
......................
Tue Dec 18 18:32:48 2018 - [info] Checking replication health on 192.168.192.129..
Tue Dec 18 18:32:48 2018 - [info]  ok.
Tue Dec 18 18:32:48 2018 - [info] Checking replication health on 192.168.192.130..
Tue Dec 18 18:32:48 2018 - [info]  ok.
Tue Dec 18 18:32:48 2018 - [warning] master_ip_failover_script is not defined.
Tue Dec 18 18:32:48 2018 - [warning] shutdown_script is not defined.
Tue Dec 18 18:32:48 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

This time the whole replication environment checks out OK!

3. Check the MHA Manager status
------------------------------Checking the MHA Manager status------------------------------
[root@manager_slave /]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
Note: when the manager is running this prints "PING_OK"; "NOT_RUNNING" means MHA monitoring has not been started.

Start MHA Manager monitoring
Start it in the background with:
[root@manager_slave /]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &

Startup options:
--remove_dead_master_conf   after a master/slave switchover, remove the old master's ip from the configuration file
--manager_log               log file location
--ignore_last_failover      by default, if MHA detects consecutive outages less than 8 hours apart it refuses to fail over again; this limit exists to avoid ping-pong failovers. After a switchover MHA writes an app1.failover.complete file into its working directory (here /var/log/masterha/app1), and the next switchover is refused while that file exists, unless it is deleted first. For convenience, this flag tells MHA to ignore the file from the last switchover.

Check MHA Manager monitoring again:
[root@manager_slave /]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:8162) is running(0:PING_OK), master:192.168.192.128

Monitoring is now active, and the master is 192.168.192.128.

Check the startup log:
[root@manager_slave /]# tail -n20 /var/log/masterha/app1/manager.log
  Relay log found at /var/lib/mysql, up to manager_slave-relay-bin.000002
  Temporary relay log file is /var/lib/mysql/manager_slave-relay-bin.000002
  Checking if super_read_only is defined and turned on.. not present or turned off, ignoring.
  Testing mysql connection and privileges..
mysql: [Warning] Using a password on the command line interface can be insecure.
 done.
  Testing mysqlbinlog output.. done.
  Cleaning up test file(s).. done.
Tue Dec 18 18:35:11 2018 - [info] Slaves settings check done.
Tue Dec 18 18:35:11 2018 - [info]
192.168.192.128(192.168.192.128:3306) (current master)
 +--192.168.192.129(192.168.192.129:3306)
 +--192.168.192.130(192.168.192.130:3306)

Tue Dec 18 18:35:11 2018 - [warning] master_ip_failover_script is not defined.
Tue Dec 18 18:35:11 2018 - [warning] shutdown_script is not defined.
Tue Dec 18 18:35:11 2018 - [info] Set master ping interval 1 seconds.
Tue Dec 18 18:35:11 2018 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Tue Dec 18 18:35:11 2018 - [info] Starting ping health check on 192.168.192.128(192.168.192.128:3306)..
Tue Dec 18 18:35:11 2018 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..

The line "Ping(SELECT) succeeded, waiting until MySQL doesn't respond.." shows that monitoring is up and running.

4. Stop MHA Manager monitoring
------------------------------Stopping MHA Manager monitoring------------------------------
Stopping is simple: just use the masterha_stop command:
[root@manager_slave ~]# masterha_stop --conf=/etc/masterha/app1.cnf
MHA Manager is not running on app1(2:NOT_RUNNING).

Check MHA Manager monitoring again; it is now stopped:
[root@manager_slave ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 is stopped(2:NOT_RUNNING).
Configuring the VIP
Configuring keepalived to provide the VIP
The VIP can be managed in two ways: with keepalived, which floats the virtual IP between hosts, or with a script that brings the virtual IP up and down (no keepalived/heartbeat-style software required).
Managing the VIP with keepalived
---------------------------------------------------------Method 1: managing the VIP with keepalived---------------------------------------------------------
Download and install the software (on both masters; to be precise, one is the master (192.168.192.128) and the other is the candidate master (192.168.192.129), which is a slave until a switchover happens):
[root@node_master ~]# yum -y install openssl-devel
[root@node_master ~]# wget http://www.keepalived.org/software/keepalived-2.0.10.tar.gz
[root@node_master ~]# tar zxf keepalived-2.0.10.tar.gz
[root@node_master ~]# cd keepalived-2.0.10/
[root@node_master keepalived-2.0.10]# ./configure --prefix=/usr/local/keepalived
[root@node_master keepalived-2.0.10]# make && make install
[root@node_master keepalived-2.0.10]# cp keepalived/etc/init.d/keepalived /etc/init.d/
[root@node_master keepalived-2.0.10]# cp /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
[root@node_master keepalived-2.0.10]# mkdir /etc/keepalived
[root@node_master keepalived-2.0.10]# cp /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
[root@node_master keepalived-2.0.10]# cp /usr/local/keepalived/sbin/keepalived /usr/sbin/

keepalived configuration
------------Configuration on the master (the 192.168.192.128 node)------------------
[root@node_master keepalived-2.0.10]# cp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.bak
[root@node_master keepalived-2.0.10]# vim /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
   notification_email {
     1024331014@qq.com
   }
   notification_email_from 1024331014@qq.com
   smtp_server 127.0.0.1
   smtp_connect_timeout 30
   router_id MySQL-HA
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens37
    virtual_router_id 51
    priority 150
    advert_int 1
    nopreempt
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.192.131
    }
}

Here router_id MySQL-HA names the keepalived group; the virtual IP 192.168.192.131 is bound to this host's ens37 interface; the state is set to BACKUP and keepalived runs in non-preemptive mode (nopreempt); priority 150 sets the priority to 150.

------------Configuration on the candidate master (the 192.168.192.129 node)------------------
[root@node_slave keepalived-2.0.10]# vim /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
   notification_email {
     1024331014@qq.com
   }
   notification_email_from 1024331014@qq.com
   smtp_server 127.0.0.1
   smtp_connect_timeout 30
   router_id MySQL-HA
}

vrrp_instance VI_1 {
    state BACKUP
    interface ens37
    virtual_router_id 51
    priority 120
    advert_int 1
    nopreempt
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.192.131
    }
}

Start the keepalived service
--------------Start it on the master and check the logs (the 192.168.192.128 node)------------------------------
[root@node_master keepalived-2.0.10]# /etc/init.d/keepalived start
Starting keepalived (via systemctl):  [  OK  ]

Check that the VIP is in place (192.168.192.131 appears below):
[root@node_master keepalived-2.0.10]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:19:14:db brd ff:ff:ff:ff:ff:ff
    inet 192.168.52.129/24 brd 192.168.52.255 scope global noprefixroute dynamic ens33
       valid_lft 1250sec preferred_lft 1250sec
3: ens37: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:19:14:e5 brd ff:ff:ff:ff:ff:ff
    inet 192.168.192.128/24 brd 192.168.192.255 scope global noprefixroute ens37
       valid_lft forever preferred_lft forever
    inet 192.168.192.131/32 scope global ens37
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:4b:9c:ed:04 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever
[root@node_master keepalived-2.0.10]# tail -10 /var/log/messages
Dec 19 10:43:05 node_master Keepalived_vrrp[6856]: Sending gratuitous ARP on ens37 for 192.168.192.131
Dec 19 10:43:05 node_master Keepalived_vrrp[6856]: (VI_1) Sending/queueing gratuitous ARPs on ens37 for 192.168.192.131
Dec 19 10:43:05 node_master Keepalived_vrrp[6856]: Sending gratuitous ARP on ens37 for 192.168.192.131
Dec 19 10:43:05 node_master Keepalived_vrrp[6856]: Sending gratuitous ARP on ens37 for 192.168.192.131
Dec 19 10:43:05 node_master Keepalived_vrrp[6856]: Sending gratuitous ARP on ens37 for 192.168.192.131
Dec 19 10:43:05 node_master Keepalived_vrrp[6856]: Sending gratuitous ARP on ens37 for 192.168.192.131
Dec 19 10:43:07 node_master crond: sendmail: fatal: parameter inet_interfaces: no local interface found for ::1
Dec 19 10:44:02 node_master systemd: Started Session 90 of user root.
Dec 19 10:44:02 node_master systemd: Starting Session 90 of user root.
Dec 19 10:44:08 node_master crond: sendmail: fatal: parameter inet_interfaces: no local interface found for ::1

The VIP is now bound to the master node, 192.168.192.128.

--------------Start it on the candidate master (the 192.168.192.129 node)----------------------------
[root@node_slave keepalived-2.0.10]# /etc/init.d/keepalived start
Starting keepalived (via systemctl):  [  OK  ]
[root@node_slave keepalived-2.0.10]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:f8:41:0f brd ff:ff:ff:ff:ff:ff
    inet 192.168.52.130/24 brd 192.168.52.255 scope global noprefixroute dynamic ens33
       valid_lft 1083sec preferred_lft 1083sec
3: ens37: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 00:0c:29:f8:41:19 brd ff:ff:ff:ff:ff:ff
    inet 192.168.192.129/24 brd 192.168.192.255 scope global noprefixroute ens37
       valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
    link/ether 02:42:b0:14:23:57 brd ff:ff:ff:ff:ff:ff
    inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
       valid_lft forever preferred_lft forever

The output above shows that keepalived is configured and working.

Note:
Both servers run keepalived in BACKUP state. keepalived has two modes, master->backup and backup->backup, and they behave very differently.
In master->backup mode, once the primary fails, the VIP floats to the backup; when the primary is repaired and keepalived restarts, it takes the VIP back, and this preemption happens even with non-preemptive mode (nopreempt) configured.
In backup->backup mode, when the primary fails the VIP floats to the backup, but when the old primary comes back and keepalived starts, it does not seize the new primary's VIP, even if its priority is higher.
To reduce the number of VIP moves, the repaired old master is usually treated as the new standby.
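To see at a glance which node currently holds the VIP, something like the following sketch can be run from the manager node (the ens37 interface name and the VIP 192.168.192.131 come from the configuration above; everything else is an assumption for illustration):

#!/bin/bash
# report which of the two keepalived nodes currently owns the VIP
for h in 192.168.192.128 192.168.192.129; do
    if ssh root@$h "ip addr show ens37 | grep -q 192.168.192.131"; then
        echo "$h holds the VIP"
    else
        echo "$h does not hold the VIP"
    fi
done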
Integrating keepalived with MHA
Integrating keepalived with MHA (MHA stops keepalived when the MySQL process dies)
To bring the keepalived service into MHA, only the switchover trigger script, master_ip_failover, needs modifying: add handling that deals with keepalived when the master goes down.
Edit /usr/local/bin/master_ip_failover; after the changes it looks like this:
[root@manager_slave ~]# cp /usr/local/bin/master_ip_failover /usr/local/bin/master_ip_failover.bak
[root@manager_slave ~]# vim /usr/local/bin/master_ip_failover
# note: the VIP address in this script must be adjusted to your own environment

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '192.168.192.131';
my $ssh_start_vip = "/etc/init.d/keepalived start";
my $ssh_stop_vip = "/etc/init.d/keepalived stop";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {
    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";
    if ( $command eq "stop" || $command eq "stopssh" ) {
        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {
        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        #`ssh $ssh_user\@cluster1 \" $ssh_start_vip \"`;
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

# A simple system call that enable the VIP on the new master
sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
# A simple system call that disable the VIP on the old_master
sub stop_vip() {
    return 0 unless ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

Now that the script is modified, uncomment master_ip_failover_script in /etc/masterha/app1.cnf and re-check the cluster state to see whether it still errors out:
[root@manager_slave ~]# grep 'master_ip_failover_script' /etc/masterha/app1.cnf
master_ip_failover_script= /usr/local/bin/master_ip_failover
[root@manager_slave ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
....................
Checking the Status of the script.. OK
Wed Dec 19 10:55:37 2018 - [info]  OK.
Wed Dec 19 10:55:37 2018 - [warning] shutdown_script is not defined.
Wed Dec 19 10:55:37 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

Replication checks out fine!
What the additions to /usr/local/bin/master_ip_failover mean: when the master database fails, MHA triggers a switchover; the MHA Manager stops the keepalived service on the old master, which lets the virtual IP float to the candidate slave and completes the switch. You can of course also hook a script into keepalived itself that monitors whether mysql is running normally and, if not, kills the keepalived process; a sketch of that idea follows.
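The monitoring idea mentioned above can be as simple as the watchdog below, run from cron on both keepalived nodes (or wired into keepalived itself). This is only a sketch: the script name mysql_watchdog.sh is arbitrary, and the root/12345 credentials match the local logins used earlier in this article; adjust both for your environment:

#!/bin/bash
# mysql_watchdog.sh - if the local mysqld stops answering, stop keepalived
# so the VIP floats to the other node
if ! mysqladmin -uroot -p12345 ping >/dev/null 2>&1; then
    /etc/init.d/keepalived stop
fi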
Using a Script to Manage the VIP
Managing the VIP via a script
---------------------------------------------------------Method 2: managing the VIP with a script---------------------------------------------------------
To test the second method, I first stopped keepalived.
Here /usr/local/bin/master_ip_failover is modified again; the full contents after modification are below. The VIP also has to be bound on the master by hand.

First bind the VIP on the master node (192.168.192.128):
[root@node_master ~]# ifconfig ens37:0 192.168.192.131/24    # mind the interface name and the address
[root@node_master ~]# ifconfig
ens33: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.52.129  netmask 255.255.255.0  broadcast 192.168.52.255
        ether 00:0c:29:19:14:db  txqueuelen 1000  (Ethernet)
        RX packets 7850  bytes 8852518 (8.4 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 2478  bytes 176378 (172.2 KiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens37: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.192.128  netmask 255.255.255.0  broadcast 192.168.192.255
        ether 00:0c:29:19:14:e5  txqueuelen 1000  (Ethernet)
        RX packets 4200  bytes 448286 (437.7 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 5620  bytes 2671664 (2.5 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

ens37:0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.192.131  netmask 255.255.255.0  broadcast 192.168.192.255
        ether 00:0c:29:19:14:e5  txqueuelen 1000  (Ethernet)

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

On the manager node (192.168.192.130), edit /usr/local/bin/master_ip_failover:
[root@manager_slave ~]# cp /usr/local/bin/master_ip_failover /usr/local/bin/master_ip_failover.bak.keep
[root@manager_slave ~]# vim /usr/local/bin/master_ip_failover
# note: the VIP address, interface name and alias number in this script must match your environment;
# the sample uses eth0:1, but this environment binds the VIP to ens37:0, so the script uses the same

#!/usr/bin/env perl

use strict;
use warnings FATAL => 'all';

use Getopt::Long;

my (
    $command,          $ssh_user,        $orig_master_host, $orig_master_ip,
    $orig_master_port, $new_master_host, $new_master_ip,    $new_master_port
);

my $vip = '192.168.192.131/24';
my $key = '0';
my $ssh_start_vip = "/sbin/ifconfig ens37:$key $vip";
my $ssh_stop_vip = "/sbin/ifconfig ens37:$key down";

GetOptions(
    'command=s'          => \$command,
    'ssh_user=s'         => \$ssh_user,
    'orig_master_host=s' => \$orig_master_host,
    'orig_master_ip=s'   => \$orig_master_ip,
    'orig_master_port=i' => \$orig_master_port,
    'new_master_host=s'  => \$new_master_host,
    'new_master_ip=s'    => \$new_master_ip,
    'new_master_port=i'  => \$new_master_port,
);

exit &main();

sub main {
    print "\n\nIN SCRIPT TEST====$ssh_stop_vip==$ssh_start_vip===\n\n";
    if ( $command eq "stop" || $command eq "stopssh" ) {
        my $exit_code = 1;
        eval {
            print "Disabling the VIP on old master: $orig_master_host \n";
            &stop_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn "Got Error: $@\n";
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "start" ) {
        my $exit_code = 10;
        eval {
            print "Enabling the VIP - $vip on the new master - $new_master_host \n";
            &start_vip();
            $exit_code = 0;
        };
        if ($@) {
            warn $@;
            exit $exit_code;
        }
        exit $exit_code;
    }
    elsif ( $command eq "status" ) {
        print "Checking the Status of the script.. OK \n";
        exit 0;
    }
    else {
        &usage();
        exit 1;
    }
}

sub start_vip() {
    `ssh $ssh_user\@$new_master_host \" $ssh_start_vip \"`;
}
sub stop_vip() {
    return 0 unless ($ssh_user);
    `ssh $ssh_user\@$orig_master_host \" $ssh_stop_vip \"`;
}

sub usage {
    print
    "Usage: master_ip_failover --command=start|stop|stopssh|status --orig_master_host=host --orig_master_ip=ip --orig_master_port=port --new_master_host=host --new_master_ip=ip --new_master_port=port\n";
}

Check that MHA replication is healthy:
[root@manager_slave ~]# masterha_check_repl --conf=/etc/masterha/app1.cnf
Checking the Status of the script.. OK
Wed Dec 19 11:06:36 2018 - [info]  OK.
Wed Dec 19 11:06:36 2018 - [warning] shutdown_script is not defined.
Wed Dec 19 11:06:36 2018 - [info] Got exit code 0 (Not master dead).

MySQL Replication Health is OK.

Note: the master_ip_failover_script line in /etc/masterha/app1.cnf must be uncommented.
To avoid split brain, managing the virtual IP with the script is recommended for production rather than keepalived. At this point the basic MHA cluster is fully configured.
Next comes the actual testing, which shows how MHA really behaves.
Failover Switching
Automatic failover
Automatic failover (the MHA Manager must be started first, otherwise no automatic switchover can happen; manual switchover, by contrast, does not require MHA Manager monitoring)
Start MHA Manager monitoring (on 192.168.192.130; skip this if it is already running).
Start it in the background with:
[root@manager_slave /]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &

Stop the MySQL service on the master (192.168.192.128) to simulate a master failure and trigger an automatic failover:
[root@node_master ~]# systemctl stop mysqld

Read the MHA switchover log on the manager node (192.168.192.130) to follow the whole process:
[root@manager_slave ~]# cat /var/log/masterha/app1/manager.log
................
................
----- Failover Report -----

app1: MySQL Master failover 192.168.192.128(192.168.192.128:3306) to 192.168.192.129(192.168.192.129:3306) succeeded

Master 192.168.192.128(192.168.192.128:3306) is down!

Check MHA Manager logs at manager_slave:/var/log/masterha/app1/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 192.168.192.128(192.168.192.128:3306)
The latest slave 192.168.192.129(192.168.192.129:3306) has all relay logs for recovery.
Selected 192.168.192.129(192.168.192.129:3306) as a new master.
192.168.192.129(192.168.192.129:3306): OK: Applying all logs succeeded.
192.168.192.129(192.168.192.129:3306): OK: Activated master IP address.
192.168.192.130(192.168.192.130:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.192.130(192.168.192.130:3306): OK: Applying all logs succeeded. Slave started, replicating from 192.168.192.129(192.168.192.129:3306)
192.168.192.129(192.168.192.129:3306): Resetting slave info succeeded.
Master failover to 192.168.192.129(192.168.192.129:3306) completed successfully.

The final line, "Master failover to 192.168.192.129(192.168.192.129:3306) completed successfully.", shows that the candidate master has been promoted.

The output above shows the whole MHA switchover, which consists of the following steps:
1) Configuration check phase: the whole cluster's configuration is validated
2) Handling of the dead master, including removing the virtual IP and powering the host off (the latter is not implemented here yet and needs further work)
3) Copying the relay logs by which the dead master is ahead of the most up-to-date slave, and saving them to a directory on the MHA Manager
4) Identifying the slave with the most recent updates
5) Applying the binlog events saved from the master
6) Promoting one slave to be the new master
7) Repointing the remaining slaves at the new master

Finally, start MHA Manager monitoring again and check who the master is now:
[root@manager_slave ~]# nohup masterha_manager --conf=/etc/masterha/app1.cnf --remove_dead_master_conf --ignore_last_failover < /dev/null > /var/log/masterha/app1/manager.log 2>&1 &
[root@manager_slave ~]# masterha_check_status --conf=/etc/masterha/app1.cnf
app1 (pid:7036) is running(0:PING_OK), master:192.168.192.129
Manual failover
Manual failover (the MHA Manager must not be running)
A manual failover means MHA's automatic switchover is not enabled for the service: when the master fails, an operator invokes MHA by hand to perform the switch, as follows.

Make sure the MHA Manager is stopped:
[root@manager_slave ~]# masterha_stop --conf=/etc/masterha/app1.cnf

Note: if the MHA Manager detects that no server is dead, it reports an error and aborts the failover:
[root@manager_slave ~]# masterha_master_switch --master_state=dead --conf=/etc/masterha/app1.cnf --dead_master_host=192.168.192.128 --dead_master_port=3306 --new_master_host=192.168.192.129 --new_master_port=3306 --ignore_last_failover

The output asks whether to proceed with the switch (yes/NO): enter yes
.............
.............
----- Failover Report -----

app1: MySQL Master failover 192.168.192.128(192.168.192.128:3306) to 192.168.192.129(192.168.192.129:3306) succeeded

Master 192.168.192.128(192.168.192.128:3306) is down!

Check MHA Manager logs at manager_slave for details.

Started manual(interactive) failover.
Invalidated master IP address on 192.168.192.128(192.168.192.128:3306)
The latest slave 192.168.192.129(192.168.192.129:3306) has all relay logs for recovery.
Selected 192.168.192.129(192.168.192.129:3306) as a new master.
192.168.192.129(192.168.192.129:3306): OK: Applying all logs succeeded.
192.168.192.129(192.168.192.129:3306): OK: Activated master IP address.
192.168.192.130(192.168.192.130:3306): This host has the latest relay log events.
Generating relay diff files from the latest slave succeeded.
192.168.192.130(192.168.192.130:3306): OK: Applying all logs succeeded. Slave started, replicating from 192.168.192.129(192.168.192.129:3306)
192.168.192.129(192.168.192.129:3306): Resetting slave info succeeded.
Master failover to 192.168.192.129(192.168.192.129:3306) completed successfully.

The output above shows the switch succeeded, simulating the manual promotion of 192.168.192.129 to master after the master (192.168.192.128) went down.
Online switchover
In many situations the current master has to be migrated to another machine: the master has a hardware fault, the RAID controller card needs rebuilding, the master is being moved to faster hardware, and so on. Maintenance on the master degrades performance and causes downtime during which, at the very least, no data can be written. Moreover, blocking or killing the currently running sessions can leave the two masters with inconsistent data.
MHA provides a fast switchover that blocks writes gracefully: the switch itself takes only 0.5-2 s, during which data cannot be written. In many cases a 0.5-2 s write stall is acceptable, so the master can be switched without scheduling a maintenance window.
The MHA online switchover roughly proceeds as follows:
1) Check the replication setup and identify the current master
2) Identify the new master
3) Block writes to the current master
4) Wait for all slaves to catch up with replication
5) Grant writes on the new master
6) Repoint the slaves
Note that during an online switchover the application architecture has to address two issues:
1) Automatically telling master from slave (the master machine can change); using a VIP largely solves this.
2) Load balancing (roughly define the read/write ratio and the load each machine can carry; when a machine leaves the cluster this must be taken into account).
To guarantee full data consistency and complete the switch as fast as possible, MHA's online switchover only succeeds when all of the following conditions hold, otherwise it fails (a pre-check sketch follows this list):
1) The IO thread is running on every slave
2) The SQL thread is running on every slave
3) In the show slave status output of every slave, Seconds_Behind_Master is less than or equal to running_updates_limit seconds; if running_updates_limit is not specified for the switch, it defaults to 1 second
4) On the master, the show processlist output contains no update that has been running longer than running_updates_limit seconds
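Before attempting an online switch, the slave-side conditions above can be eyeballed with a sketch like this (assuming the root/12345 credentials and SSH setup used earlier; LIMIT is whatever you plan to pass as --running_updates_limit):

#!/bin/bash
# pre-switch sanity check: slave threads running and replication lag within limit
LIMIT=10000
for h in 192.168.192.129 192.168.192.130; do
    echo "== $h =="
    ssh root@$h "mysql -uroot -p12345 -e 'show slave status\G'" 2>/dev/null \
      | awk -v lim=$LIMIT '
        /Slave_IO_Running|Slave_SQL_Running/ { print "  " $1 " " $2 }
        /Seconds_Behind_Master/ {
            print "  " $1 " " $2
            if ($2 > lim) print "  WARNING: lag exceeds limit"
        }'
done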
The online switchover steps are as follows:
First, stop MHA monitoring on the manager node:
[root@manager_slave ~]# masterha_stop --conf=/etc/masterha/app1.cnf

Then perform the online switchover (simulating an online master switch: the old master 192.168.192.128 becomes a slave and 192.168.192.129 is promoted to the new master):
[root@manager_slave ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=192.168.192.129 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000

This failed with the following error:
..........
Starting master switch from 192.168.192.128(192.168.192.128:3306) to 192.168.192.129(192.168.192.129:3306)? (yes/NO): yes
Thu Dec 20 12:04:47 2018 - [info] Checking whether 192.168.192.129(192.168.192.129:3306) is ok for the new master..
Thu Dec 20 12:04:47 2018 - [info]  ok.
Thu Dec 20 12:04:47 2018 - [info] 192.168.192.128(192.168.192.128:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host.
Thu Dec 20 12:04:47 2018 - [info] 192.168.192.128(192.168.192.128:3306): Resetting slave pointing to the dummy host.
Thu Dec 20 12:04:47 2018 - [info] ** Phase 1: Configuration Check Phase completed.
Thu Dec 20 12:04:47 2018 - [info]
Thu Dec 20 12:04:47 2018 - [info] * Phase 2: Rejecting updates Phase..
Thu Dec 20 12:04:47 2018 - [info]
Thu Dec 20 12:04:47 2018 - [info] Executing master ip online change script to disable write on the current master:
Thu Dec 20 12:04:47 2018 - [info]   /usr/local/bin/master_ip_online_change --command=stop --orig_master_host=192.168.192.128 --orig_master_ip=192.168.192.128 --orig_master_port=3306 --orig_master_user='manager' --new_master_host=192.168.192.129 --new_master_ip=192.168.192.129 --new_master_port=3306 --new_master_user='manager' --orig_master_ssh_user=root --new_master_ssh_user=root --orig_master_is_new_slave --orig_master_password=xxx --new_master_password=xxx
Thu Dec 20 12:04:47 2018 811566 Set read_only on the new master.. ok.
Thu Dec 20 12:04:47 2018 815337 Drpping app user on the orig master..
Got Error: Undefined subroutine &main::FIXME_xxx_drop_app_user called at /usr/local/bin/master_ip_online_change line 152.
Thu Dec 20 12:04:47 2018 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR:  at /usr/local/bin/masterha_master_switch line 53.

The cause of the error: the subroutine FIXME_xxx_drop_app_user is undefined. Not knowing perl well, I simply commented out the drop-user/FIXME_xxx lines for the time being; this does not affect the rest of the process.
The fix:
[root@manager_slave ~]# cp /usr/local/bin/master_ip_online_change /usr/local/bin/master_ip_online_change.bak    # back the file up before modifying it
[root@manager_slave ~]# vim /usr/local/bin/master_ip_online_change    # edit the file
Find the following two lines and comment them out with #:
FIXME_xxx_drop_app_user($orig_master_handler);
FIXME_xxx_create_app_user($new_master_handler);

Rerun the online switchover:
[root@manager_slave ~]# masterha_master_switch --conf=/etc/masterha/app1.cnf --master_state=alive --new_master_host=192.168.192.129 --new_master_port=3306 --orig_master_is_new_slave --running_updates_limit=10000
........
Thu Dec 20 12:11:06 2018 - [info] -- Slave switch on host 192.168.192.130(192.168.192.130:3306) succeeded.
Thu Dec 20 12:11:06 2018 - [info] Unlocking all tables on the orig master:
Thu Dec 20 12:11:06 2018 - [info] Executing UNLOCK TABLES..
Thu Dec 20 12:11:06 2018 - [info]  ok.
Thu Dec 20 12:11:06 2018 - [info] Starting orig master as a new slave..
Thu Dec 20 12:11:06 2018 - [info]  Resetting slave 192.168.192.128(192.168.192.128:3306) and starting replication from the new master 192.168.192.129(192.168.192.129:3306)..
Thu Dec 20 12:11:06 2018 - [info]  Executed CHANGE MASTER.
Thu Dec 20 12:11:06 2018 - [info]  Slave started.
Thu Dec 20 12:11:06 2018 - [info] All new slave servers switched successfully.
Thu Dec 20 12:11:06 2018 - [info]
Thu Dec 20 12:11:06 2018 - [info] * Phase 5: New master cleanup phase..
Thu Dec 20 12:11:06 2018 - [info]
Thu Dec 20 12:11:06 2018 - [info]  192.168.192.129: Resetting slave info succeeded.
Thu Dec 20 12:11:06 2018 - [info] Switching master to 192.168.192.129(192.168.192.129:3306) completed successfully.

The output above shows the online switch succeeded.

What the parameters mean:
--orig_master_is_new_slave     turn the original master into a slave of the new master during the switch; without this flag the original master is left stopped
--running_updates_limit=10000  during a switchover, if the candidate master lags, MHA cannot switch; with this flag the switch is allowed as long as the replication delay is within this many seconds. How long the switch itself takes, however, is determined by the size of the relay logs to recover
This completes the deployment of the MySQL high-availability solution, MHA!