Ⅰ. MGR Basic Parameters
- binlog/relaylog
server_id = 1
log_bin = bin
log_slave_updates = 1
relay_log = relay
binlog_format = row #must be ROW; there are plenty of reasons to use ROW, not detailed here
binlog_checksum = NONE #5.7.17 does not support binlog events that carry a checksum
master_info_repository = table #GR relies on multiple replication channels, so the slave channel metadata must be stored in tables
relay_log_info_repository = table
- gtid
gtid_mode = 1
enforce_gtid_consistency = 1
- parallel replication
slave_parallel_type = 'logical_clock'
slave_parallel_workers = N
slave_preserve_commit_order = 1
- primary key / write-set collection
transaction_write_set_extraction = XXHASH64 #the server layer extracts primary key info, hashes it and stores it; must be identical on all nodes
- gr plugin settings
plugin_load_add = 'group_replication.so' #load the GR plugin
loose-group_replication_group_name = "745c2bc6-9fe4-11ea-8c89-fa163e98606f" #group name
loose-group_replication_start_on_boot = 0 #do not start GR automatically when the server starts
loose-group_replication_local_address = "192.168.0.65:33061" #local member address
loose-group_replication_group_seeds = "192.168.0.32:33061,192.168.0.65:33061,192.168.0.185:33061" #seed member addresses
loose-group_replication_bootstrap_group = 0 #nodes are non-bootstrap by default; turn this on temporarily on one node to initialize the group, then turn it off right away
loose-group_replication_ip_whitelist = 'IP addresses or subnets' #whitelist; if left unset, only hosts on the same subnets as this machine are allowed to join
The 'loose-' prefix is mostly used for plugin parameters: if the server does not recognize such a parameter, MySQL still starts up normally instead of reporting an error.
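Once the server is up, a quick sanity check that the plugin loaded and the settings took effect (a minimal sketch; the exact variable list varies by version):

SELECT PLUGIN_NAME, PLUGIN_STATUS
  FROM information_schema.PLUGINS
 WHERE PLUGIN_NAME = 'group_replication';
SHOW GLOBAL VARIABLES LIKE 'group_replication%';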
Ⅱ. MGR Mode Switching
By default a freshly deployed MGR group runs in single-primary mode; here we switch modes.
Single-primary versus multi-primary is controlled by a single parameter:
group_replication_single_primary_mode
We switch modes the offline way: all members leave the group, the parameter above is changed on every node, the group is re-bootstrapped in the desired mode, and the members are added back in.
Let's give it a go.
(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master | 3306 | ONLINE | PRIMARY | 8.0.19 |
| group_replication_applier | a6a83b07-a4b0-11ea-ace2-fa163e6f3efc | slave1 | 3306 | ONLINE | SECONDARY | 8.0.19 |
| group_replication_applier | a7dbfa5f-a4b0-11ea-b21b-fa163e98606f | slave2 | 3306 | ONLINE | SECONDARY | 8.0.19 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)
Right now the group is in single-primary mode: one PRIMARY and two SECONDARYs. Let's get started.
On every node, run:
(root@localhost) [(none)]> stop group_replication;
Query OK, 0 rows affected (4.41 sec)
(root@localhost) [(none)]> set global group_replication_single_primary_mode = 0;
Query OK, 0 rows affected (0.00 sec)
On any one node, re-bootstrap the group:
(root@localhost) [(none)]> SET GLOBAL group_replication_bootstrap_group=1;
Query OK, 0 rows affected (0.00 sec)
(root@localhost) [(none)]> start group_replication;
Query OK, 0 rows affected (3.12 sec)
(root@localhost) [(none)]> SET GLOBAL group_replication_bootstrap_group=0;
Query OK, 0 rows affected (0.00 sec)
Join the remaining nodes to the group:
(root@localhost) [(none)]> start group_replication;
Query OK, 0 rows affected (3.58 sec)
Check the current group status:
(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master | 3306 | ONLINE | PRIMARY | 8.0.19 |
| group_replication_applier | a6a83b07-a4b0-11ea-ace2-fa163e6f3efc | slave1 | 3306 | ONLINE | PRIMARY | 8.0.19 |
| group_replication_applier | a7dbfa5f-a4b0-11ea-b21b-fa163e98606f | slave2 | 3306 | ONLINE | PRIMARY | 8.0.19 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)
Three PRIMARYs, nice. Switching from multi-primary back to single-primary is the same procedure.
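As a side note, from MySQL 8.0.13 onward the group replication plugin also provides UDFs that can switch modes online, without tearing the group down. A minimal sketch (the UUID passed to the second call is just the master's member id from the output above):

SELECT group_replication_switch_to_multi_primary_mode();
SELECT group_replication_switch_to_single_primary_mode('a50df330-a4b0-11ea-bfd3-fa163ed6ef91');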
Ⅲ. Handling Group Failures
Whether the group is single-primary or multi-primary, once more than half of its members fail the whole group stops serving requests. That is the scenario discussed here, not the normal addition or removal of members, so keep the distinction in mind. (With N members the group needs a majority, i.e. more than N/2 reachable members, to make progress, so a 3-member group tolerates only one failure.)
We will simulate the failure by kill -9'ing the mysqld process.
At this point the group is in multi-primary mode.
First take down one node, slave2:
[root@slave2 ~]# ps auxwf |grep mysql |grep -v grep |awk '{print $2}' |tail -n 1 |xargs kill -9
I kill the last mysql process on the server because mysqld was started via mysqld_safe, so there is a guardian process in front of it that we leave alone.
Check the group from the two remaining nodes:
(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master | 3306 | ONLINE | PRIMARY | 8.0.19 |
| group_replication_applier | a6a83b07-a4b0-11ea-ace2-fa163e6f3efc | slave1 | 3306 | ONLINE | PRIMARY | 8.0.19 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)
(root@localhost) [(none)]> insert into t.t values(6);
Query OK, 1 row affected (0.01 sec)
(root@localhost) [(none)]> select * from t.t;
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
+----+
6 rows in set (0.00 sec)
Good, slave2 is gone, yet we can still insert data.
Now kill slave1 as well:
[root@slave1 ~]# ps auxwf |grep mysql |grep -v grep |awk '{print $2}' |tail -n 1 |xargs kill -9
Check the group status on master:
(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master | 3306 | ONLINE | PRIMARY | 8.0.19 |
| group_replication_applier | a6a83b07-a4b0-11ea-ace2-fa163e6f3efc | slave1 | 3306 | UNREACHABLE | PRIMARY | 8.0.19 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)
(root@localhost) [(none)]> insert into t.t values(7);
(the statement hangs)
slave1 shows as UNREACHABLE, the insert does not go through, and the group can no longer serve requests.
During testing I also found that if the two nodes are killed within a short interval, both of them show up here as UNREACHABLE.
Open a new session on master and take a look.
innodb status
---TRANSACTION 3123, ACTIVE (PREPARED) 19 sec
mysql tables in use 1, locked 1
1 lock struct(s), heap size 1136, 0 row lock(s), undo log entries 1
MySQL thread id 8, OS thread handle 139976409593600, query id 49 localhost root waiting for handler commit
insert into t.t values(7)
----------------
innodb_trx
(root@localhost) [(none)]> select * from information_schema.innodb_trx\G
*************************** 1. row ***************************
trx_id: 3123
trx_state: RUNNING
trx_started: 2020-06-07 11:52:01
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 2
trx_mysql_thread_id: 8
trx_query: insert into t.t values(7)
trx_operation_state: NULL
trx_tables_in_use: 1
trx_tables_locked: 1
trx_lock_structs: 1
trx_lock_memory_bytes: 1136
trx_rows_locked: 0
trx_rows_modified: 1
trx_concurrency_tickets: 0
trx_isolation_level: REPEATABLE READ
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.00 sec)
----------------
processlist
(root@localhost) [(none)]> show processlist;
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
| 4 | event_scheduler | localhost | NULL | Daemon | 2162 | Waiting on empty queue | NULL |
| 8 | root | localhost | NULL | Query | 106 | waiting for handler commit | insert into t.t values(7) |
| 10 | system user | | NULL | Connect | 560 | waiting for handler commit | Group replication applier module |
| 13 | system user | | NULL | Query | 560 | Slave has read all relay log; waiting for more updates | NULL |
| 22 | root | localhost | NULL | Query | 0 | starting | show processlist |
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
5 rows in set (0.01 sec)
(root@localhost) [(none)]> kill 8;
Query OK, 0 rows affected (0.00 sec)
Back in the session that was hanging:
(root@localhost) [(none)]> insert into t.t values(7);
ERROR 2013 (HY000): Lost connection to MySQL server during query
Observed from the newly opened session:
(root@localhost) [(none)]> show processlist;
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
| 4 | event_scheduler | localhost | NULL | Daemon | 2202 | Waiting on empty queue | NULL |
| 8 | root | localhost | NULL | Killed | 146 | waiting for handler commit | insert into t.t values(7) |
| 10 | system user | | NULL | Connect | 600 | waiting for handler commit | Group replication applier module |
| 13 | system user | | NULL | Query | 600 | Slave has read all relay log; waiting for more updates | NULL |
| 22 | root | localhost | NULL | Query | 0 | starting | show processlist |
+----+-----------------+-----------+------+---------+------+--------------------------------------------------------+----------------------------------+
5 rows in set (0.00 sec)
The transaction state shown in innodb status and innodb_trx is unchanged from before, so I won't paste it again.
Killing the hung session did not actually take effect; the thread just shows Killed. Why does it stay in the Killed state?
Let's park that question for now.
Look at master's error log:
2020-06-06T14:04:52.234039Z 212 [Warning] [MY-010056] [Server] Host name 'hn.kd.jz.adsl' could not be resolved: Name or service not known
2020-06-06T15:54:42.751689Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address slave2:3306 has become unreachable.'
2020-06-06T15:54:44.753896Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: slave2:3306'
2020-06-06T15:59:53.600136Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address slave1:3306 has become unreachable.'
2020-06-06T15:59:53.600173Z 0 [ERROR] [MY-011495] [Repl] Plugin group_replication reported: 'This server is not able to reach a majority of members in the group. This server will now block all updates. The server will remain blocked until contact with the majority is restored. It is possible to use group_replication_force_members to force a new group membership.'
You get what that last line means, right?
This server cannot reach a majority of the members in the group.
It will therefore block all updates.
It stays blocked until contact with a majority of the members is restored.
The group_replication_force_members parameter can be used to force a new group membership.
In other words: rebuild a new group out of the members that are still alive.
So let's do exactly that.
(root@localhost) [(none)]> set global group_replication_force_members = "192.168.0.65:33061";
Query OK, 0 rows affected (9.74 sec)
(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master | 3306 | ONLINE | PRIMARY | 8.0.19 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
1 row in set (0.00 sec)
(root@localhost) [(none)]> insert into t.t values(7);
ERROR 1062 (23000): Duplicate entry '7' for key 't.PRIMARY'
(root@localhost) [(none)]> select * from t.t;
+----+
| id |
+----+
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
| 7 |
+----+
7 rows in set (0.00 sec)
(root@localhost) [(none)]> insert into t.t values(8);
Query OK, 1 row affected (0.01 sec)
Service is now restored; afterwards just add the nodes back in.
A few things to note:
① When adding a node, make sure every node uses the same mode, single-primary or multi-primary. We are multi-primary here, so before joining, set group_replication_single_primary_mode to 0 on the new node (the default is 1, single-primary); see the sketch after this list.
② Afterwards it is best to reset group_replication_force_members back to empty, though leaving it set is usually not a big problem.
③ You cannot add a node directly to a group that has become unavailable; the group must be made usable again before nodes can be added.
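A minimal sketch of rejoining a failed node to this multi-primary group, assuming the node has been restarted, its data is still consistent with the group, and the recovery channel credentials were configured when the group was first built:

-- on the rejoining node
SET GLOBAL group_replication_single_primary_mode = 0;
START GROUP_REPLICATION;
-- back on master, clear the forced membership list (note ②)
SET GLOBAL group_replication_force_members = '';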
One odd thing: the transaction we killed earlier ended up committed instead of rolled back.
But since I killed it, you would expect it to be rolled back once service was restored, so is something wrong here?
During testing I found that if you do not kill it and simply restart the service, the transaction does get rolled back.
Most likely my own understanding of the Killed state is off; let's set it aside, since it is not the focus of this section.
Ⅳ. MGR Monitoring
Calling this monitoring is a bit grand; it is really just the metadata commonly used when operating the group.
- Checking member status
This is the SQL I used most while learning. Every column is self-explanatory, so there is not much more to say.
(root@localhost) [(none)]> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | a50df330-a4b0-11ea-bfd3-fa163ed6ef91 | master | 3306 | ONLINE | PRIMARY | 8.0.19 |
| group_replication_applier | a6a83b07-a4b0-11ea-ace2-fa163e6f3efc | slave1 | 3306 | ONLINE | PRIMARY | 8.0.19 |
| group_replication_applier | a7dbfa5f-a4b0-11ea-b21b-fa163e98606f | slave2 | 3306 | ONLINE | PRIMARY | 8.0.19 |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)
Still, the MEMBER_STATE column deserves a few words:
| MEMBER_STATE | Meaning |
|---|---|
| OFFLINE | the GR plugin is not running |
| RECOVERING | replicating the data produced before this member joined the group |
| ONLINE | recovery finished, the member can serve requests |
| ERROR | the member hit an error and GR cannot run normally |
| UNREACHABLE | cannot communicate with the other members, due to a network problem or an abnormal exit on their side |
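For routine monitoring, a query along these lines (my own sketch, not from the original deployment) flags any member that is not ONLINE:

SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE
  FROM performance_schema.replication_group_members
 WHERE MEMBER_STATE <> 'ONLINE';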
- Detailed member statistics
Notes on the important columns are added inline below.
(root@localhost) [(none)]> select * from performance_schema.replication_group_member_stats limit 1\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 15915014697349018:6 group view id, explained later in the internals write-up
MEMBER_ID: a50df330-a4b0-11ea-bfd3-fa163ed6ef91
COUNT_TRANSACTIONS_IN_QUEUE: 0 transactions queued waiting for certification (global conflict detection)
COUNT_TRANSACTIONS_CHECKED: 2 total transactions certified since this member joined the group
COUNT_CONFLICTS_DETECTED: 0 total transactions that hit a conflict during certification
COUNT_TRANSACTIONS_ROWS_VALIDATING: 0 total number of rows used for conflict detection
TRANSACTIONS_COMMITTED_ALL_MEMBERS: 745c2bc6-9fe4-11ea-8c89-fa163e98606f:1-69:1000003-1000005:2000005-2000006 GTID set committed on all members, roughly the intersection of their gtid_executed sets; not real-time, refreshed periodically
LAST_CONFLICT_FREE_TRANSACTION: 745c2bc6-9fe4-11ea-8c89-fa163e98606f:67 GTID of the last conflict-free transaction
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0 transactions received from the group that are still waiting to be applied on this member
COUNT_TRANSACTIONS_REMOTE_APPLIED: 5 transactions received from the group that have been applied on this member
COUNT_TRANSACTIONS_LOCAL_PROPOSED: 2 transactions that originated on this member and were sent to the group
COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0 transactions that originated on this member and were rolled back by the group
1 row in set (0.00 sec)
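These counters make a handy backlog check. For example (a sketch, not from the original post), the certification and applier queues on every member:

SELECT MEMBER_ID,
       COUNT_TRANSACTIONS_IN_QUEUE,
       COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE
  FROM performance_schema.replication_group_member_stats;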
- Other tables
replication_connection_status
When a new member joins the group, it first pulls the data produced before it joined via asynchronous replication; this table lets you monitor that process.
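For example (a sketch), the state of the recovery channel:

SELECT CHANNEL_NAME, SERVICE_STATE, LAST_ERROR_MESSAGE
  FROM performance_schema.replication_connection_status
 WHERE CHANNEL_NAME = 'group_replication_recovery';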
replication_applier_status
Binlog events are applied through the group_replication_applier channel; this table lets you monitor that process.
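For example (a sketch):

SELECT CHANNEL_NAME, SERVICE_STATE
  FROM performance_schema.replication_applier_status
 WHERE CHANNEL_NAME = 'group_replication_applier';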
threads
This table tracks the threads created by the GR components.
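For example (a sketch), listing the group replication threads:

SELECT NAME, TYPE, PROCESSLIST_STATE
  FROM performance_schema.threads
 WHERE NAME LIKE 'thread/group_rpl/%';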
