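When a Kudu tserver is taken offline, migrated, or has its hostname changed, the old tserver keeps showing up in the dead state, and the tserver logs fill with connection retry messages. The error logs can grow by several GB per day: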
W0322 22:13:59.202749 16927 tablet_service.cc:290] Invalid argument: UpdateConsensus: Wrong destination UUID requested. Local UUID: e2f80a1fcf0c47f6b7f220a44d69297f. Requested UUID: 45bfb5b3e3ff41d9b1b1d2afab78d65c: from {username='kudu'} at 192.168.0.1:34724: tablet_id: "9933f18e59554ae6b5354e2a948469e9" caller_uuid: "9b164f37d04a484c8634ea86eae1b048" caller_term: 3 preceding_id { term: 2 index: 1873 } ops { id { term: 3 index: 1874 } timestamp: 6359719759241142272 op_type: NO_OP noop_request { } } dest_uuid: "45bfb5b3e3ff41d9b1b1d2afab78d65c" committed_index: 1874 all_replicated_index: 0 safe_timestamp: 6359719761707556864 last_idx_appended_to_leader: 1874
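To see which stale UUIDs are responsible for the log volume, warnings like the one above can be grouped by requested UUID. This is a minimal sketch: the function name `summarize_stale_uuids` is hypothetical, and it assumes every relevant line contains the text `Wrong destination UUID requested` followed by `Requested UUID: <32 hex characters>`, as in the sample line.

```shell
# Summarize "Wrong destination UUID" warnings from a tserver log read on stdin.
# Prints one line per stale UUID in the form: "<count> <requested_uuid>".
summarize_stale_uuids() {
  grep -F 'Wrong destination UUID requested' \
    | sed -n 's/.*Requested UUID: \([0-9a-f]\{32\}\).*/\1/p' \
    | sort | uniq -c | awk '{print $1, $2}'
}
```

It can be fed a log file on stdin, for example `summarize_stale_uuids < kudu-tserver.WARNING` (the file name is an assumption and depends on how logging is configured).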
There is no direct command for removing these dead tservers. The official documentation gives the following procedure:
Kudu does not currently have an automated way to remove a tablet server from a cluster permanently. Instead, use the following steps:
- 1 Ensure the cluster is in good health using ksck. See Checking Cluster Health with ksck.
- First, make sure the cluster is healthy (using the ksck command).
- 2 If the tablet server contains any replicas of tables with replication factor 1, these replicas must be manually moved off the tablet server prior to shutting it down. The kudu tablet change_config move_replica tool can be used for this.
- Move the replicas off the dead server. Any data with replication factor 1 must be moved manually before the server is taken offline.
- 3 Shut down the tablet server. After -follower_unavailable_considered_failed_sec, which defaults to 5 minutes, Kudu will begin to re-replicate the tablet server’s replicas to other servers. Wait until the process is finished. Progress can be monitored using ksck.
- Once a tserver has been offline for more than 5 minutes, its replicas are re-replicated to other servers automatically.
- 4 Once all the copies are complete, ksck will continue to report the tablet server as unavailable. The cluster will otherwise operate fine without the tablet server. To completely remove it from the cluster so ksck shows the cluster as completely healthy, restart the masters. In the case of a single master, this will cause cluster downtime. With multimaster, restart the masters in sequence to avoid cluster downtime.
- After all replicas have been re-replicated, ksck still reports the tserver as unavailable. To remove these dead servers completely, restart the masters.
Do not shut down multiple tablet servers at once. To remove multiple tablet servers from the cluster, follow the above instructions for each tablet server, ensuring that the previous tablet server is removed from the cluster and ksck is healthy before shutting down the next.
Finally, after restarting the masters, restart the tservers one at a time, making sure the cluster stays healthy throughout.
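Several of the steps above amount to "wait until ksck is healthy before moving on". A small polling helper can make that wait explicit. This is a sketch under stated assumptions: `wait_until_healthy` is a hypothetical name, and it relies on the health-check command exiting non-zero while the cluster is unhealthy, which is worth confirming for `kudu cluster ksck` on your version.

```shell
# Poll a health-check command until it succeeds or the attempts run out.
# Usage: wait_until_healthy <max_attempts> <sleep_seconds> <command> [args...]
wait_until_healthy() {
  local max_attempts=$1 sleep_seconds=$2 attempt=1
  shift 2
  while :; do
    # Run the check quietly; only its exit status matters here.
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $attempt attempt(s)"
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "still unhealthy after $attempt attempt(s)" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$sleep_seconds"
  done
}
```

For example, `wait_until_healthy 60 30 sudo -u kudu kudu cluster ksck master-01:7051` would check every 30 seconds for up to 30 minutes; the master address here is a placeholder.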
If errors persist after this procedure, a leader replica may have been lost. For example, ksck may report:
Tablet c58cef3f36a846b4bdf58447f77a6bcf of table 'impala::impala.test_kudu' is unavailable: 2 replica(s) not RUNNING
  a46f0fd38eba4a5286098ff7fe260eb1: TS unavailable
  45bfb5b3e3ff41d9b1b1d2afab78d65c: TS unavailable
  9b164f37d04a484c8634ea86eae1b048 (server02:7050): RUNNING [LEADER]
All reported replicas are:
  A = a46f0fd38eba4a5286098ff7fe260eb1
  B = 45bfb5b3e3ff41d9b1b1d2afab78d65c
  C = 9b164f37d04a484c8634ea86eae1b048
The consensus matrix is:
 Config source |        Replicas        | Current term | Config index | Committed?
---------------+------------------------+--------------+--------------+------------
 master        | A   B   C*             |              |              | Yes
 A             | [config not available] |              |              |
 B             | [config not available] |              |              |
 C             | [config not available] |              |              |
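When ksck prints many such blocks, it helps to list only the tablets it calls unavailable, as opposed to merely under-replicated ones. The following is a rough sketch: `list_unavailable_tablets` is a hypothetical name, and the pattern assumes the summary line has exactly the wording shown in the report above.

```shell
# List tablets that ksck reports as unavailable, reading ksck output on stdin.
# Prints one line per tablet in the form: "<tablet_id> <replicas_not_running>".
list_unavailable_tablets() {
  sed -n "s/^Tablet \([0-9a-f]\{32\}\) of table '.*' is unavailable: \([0-9][0-9]*\) replica(s) not RUNNING.*/\1 \2/p"
}
```

A possible invocation is `kudu cluster ksck localhost | list_unavailable_tablets`.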
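The surviving replica may have lagged behind during replication and be missing some data. If you have confirmed that the leader replica cannot be recovered, you can force the remaining available replica to become the leader and bring the tablet back to a healthy state: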
The remaining replica is not the leader, so the leader replica failed as well. This means the chance of data loss is higher since the remaining replica on tserver-00 may have been lagging.
$ sudo -u kudu kudu remote_replica unsafe_change_config tserver-00:7150 <tablet-id> <tserver-00-uuid>
where <tablet-id> is e822cab6c0584bc0858219d1539a17e6 and <tserver-00-uuid> is the uuid of tserver-00, 638a20403e3e4ae3b55d4d07d920e6de.
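Here <tablet-id> is the unhealthy tablet, tserver-00:7150 is the tserver that hosts the available replica, and <tserver-00-uuid> is the UUID of that tserver. This restores the tablet, possibly at the cost of a small amount of data.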
If many tablets are affected, a command like the following prints one recovery command per tablet for review:
$ kudu cluster ksck localhost | grep -e '^Tablet ' | awk '{print $2}' | xargs -I {} echo "sudo -u kudu kudu remote_replica unsafe_change_config tserver-00:7150 {} <tserver-00-uuid>"
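The one-liner above matches every line starting with `Tablet ` and hardcodes a single tserver. A slightly more careful variant, sketched below, only considers tablets that ksck calls unavailable and takes the address and UUID from the RUNNING replica listed under each one. `emit_recovery_commands` is a hypothetical name, and the parsing assumes the ksck layout shown earlier (a `Tablet ... is unavailable:` line followed by `<uuid> (<host:port>): RUNNING` replica lines). It only prints commands, so each one should be reviewed before it is run.

```shell
# Read ksck output on stdin. For each unavailable tablet, print one
# unsafe_change_config command per RUNNING replica listed under it.
emit_recovery_commands() {
  awk '
    /^Tablet / {
      # Remember the tablet id only if ksck calls the tablet unavailable.
      tablet = ($0 ~ / is unavailable: /) ? $2 : ""
      next
    }
    tablet != "" && /^[ \t]*[0-9a-f]+ \([^)]*\): RUNNING/ {
      uuid = $1
      addr = $2                  # looks like "(host:port):"
      sub(/^\(/, "", addr)       # strip the leading "("
      sub(/\):$/, "", addr)      # strip the trailing "):"
      print "sudo -u kudu kudu remote_replica unsafe_change_config " addr " " tablet " " uuid
    }
  '
}
```

For example, `kudu cluster ksck localhost | emit_recovery_commands > recover.sh` collects the commands in a file that can be inspected first. If an unavailable tablet still has more than one RUNNING replica, one command is printed for each and the choice is left to the operator.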
References:
https://kudu.apache.org/docs/administration.html#tablet_server_decommissioning
https://kudu.apache.org/docs/administration.html#tablet_majority_down_recovery