Resolving a Redis Cluster Scale-Out Failure

Reposted article. 2016-11-07 18:05. Tags: NoSQL / redis cluster / redis / migrate

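Double 11 (the November 11 shopping festival) is almost here. To add capacity to the Redis cluster behind our product detail pages, we scheduled the scale-out for tonight. As luck would have it, tonight turned out to be an eventful one.

We did one data recovery and one cluster migration, and ran into a pitfall during the migration.

One node in the cluster crashed with the following error: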
------ STACK TRACE ------

EIP:
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](migrateCloseSocket+0x52)[0x4644f2]

Backtrace:
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](logStackTrace+0x3c)[0x45bd5c]
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](sigsegvHandler+0xa1)[0x45cc41]
/lib64/libpthread.so.0[0x336b60f710]
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](migrateCloseSocket+0x52)[0x4644f2]
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](migrateCommand+0x7cd)[0x46744d]
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](call+0x72)[0x424192]
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](processCommand+0x365)[0x428d75]
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](processInputBuffer+0x109)[0x435089]
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](aeProcessEvents+0x13d)[0x41f86d]
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](aeMain+0x2b)[0x41fb6b]
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster](main+0x370)[0x427220]
/lib64/libc.so.6(__libc_start_main+0xfd)[0x336b21ed1d]
/usr/local/bin/redis-server 0.0.0.0:6380 [cluster][0x41d039]

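After the crash, we checked the state of the cluster with the redis-trib.rb script and found one slot that had been stuck between the importing and migrating states for a long time.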
[WARNING] Node 10.112.142.21:7210 has slots in importing state (45).
[WARNING] Node 10.112.142.20:6380 has slots in migrating state (45).
We then repaired the cluster with the fix subcommand, and only after that was the whole cluster OK again.
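
When these checks run from a script, the warnings can be picked out of the tool's output mechanically. The following is a minimal sketch, assuming the warning lines keep exactly the wording shown above; the helper name open_slots and the regular expression are ours, not part of redis-trib.rb.

```ruby
# Hypothetical helper: extracts open-slot warnings from redis-trib.rb check output.
# The pattern assumes the exact "[WARNING] Node ... has slots in ... state (...)" wording.
OPEN_SLOT_RE = /\[WARNING\] Node (\S+) has slots in (importing|migrating) state \(([\d,\s]+)\)/

def open_slots(check_output)
  check_output.each_line.flat_map do |line|
    m = OPEN_SLOT_RE.match(line)
    next [] unless m
    # One warning line may list several slots separated by commas.
    m[3].split(",").map { |s| { node: m[1], state: m[2], slot: s.strip.to_i } }
  end
end

sample = <<~LOG
  [WARNING] Node 10.112.142.21:7210 has slots in importing state (45).
  [WARNING] Node 10.112.142.20:6380 has slots in migrating state (45).
LOG

open_slots(sample).each { |w| puts "slot #{w[:slot]} is #{w[:state]} on #{w[:node]}" }
```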

Next we tried rebalance, to let the tool adjust the slot distribution across the whole cluster for us:

[root@GZ-JSQ-JP-REDIS-CLUSTER-142-21 ~]# /usr/local/bin/redis-trib.rb rebalance 10.112.142.21:7211
>>> Performing Cluster Check (using node 10.112.142.21:7211)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Rebalancing across 7 nodes. Total weight = 7
Moving 892 slots from 10.112.142.20:6380 to 10.112.142.21:7210
[ERR] Calling MIGRATE: ERR Target instance replied with error: BUSYKEY Target key name already exists.
>>> Check for open slots...
[WARNING] Node 10.112.142.20:6380 has slots in migrating state (45).
[WARNING] The following slots are open: 45
>>> Fixing open slot 45
*** Found keys about slot 45 in node 10.112.142.21:7210!
Set as migrating in: 10.112.142.20:6380
Set as importing in: 10.112.142.21:7210
Moving slot 45 from 10.112.142.20:6380 to 10.112.142.21:7210: 
*** Target key exists. Replacing it for FIX.
/usr/local/lib/ruby/gems/2.3.0/gems/redis-3.3.1/lib/redis/client.rb:121:in `call': MOVED 45 10.112.142.21:6380 (Redis::CommandError)
from /usr/local/bin/redis-trib.rb:942:in `rescue in move_slot'
from /usr/local/bin/redis-trib.rb:937:in `move_slot'
from /usr/local/bin/redis-trib.rb:607:in `fix_open_slot'
from /usr/local/bin/redis-trib.rb:422:in `block in check_open_slots'
from /usr/local/bin/redis-trib.rb:422:in `each'
from /usr/local/bin/redis-trib.rb:422:in `check_open_slots'
from /usr/local/bin/redis-trib.rb:360:in `check_cluster'
from /usr/local/bin/redis-trib.rb:1140:in `fix_cluster_cmd'
from /usr/local/bin/redis-trib.rb:1696:in `<main>'

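Clearly this did not work. The tool reported some odd errors claiming that our key already existed, which made us wonder whether the cluster contained dirty data. So we dumped all keys from the two nodes and compared them, and found no duplicate keys, which means there was no dirty data. What now? The cluster had to be expanded, otherwise it would certainly not hold up on Double 11.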
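
The comparison itself is just a set intersection. A minimal sketch, assuming the key names of each node have already been dumped into two lists; the helper name duplicate_keys and the sample key names are made up for illustration.

```ruby
require "set"

# Returns the key names present in both lists, sorted for stable output.
def duplicate_keys(keys_a, keys_b)
  (Set.new(keys_a) & Set.new(keys_b)).to_a.sort
end

# Made-up key names standing in for the dumps of the two nodes.
node_a_keys = %w[key:a key:b key:c]
node_b_keys = %w[key:x key:y]

dups = duplicate_keys(node_a_keys, node_b_keys)
puts dups.empty? ? "no key exists on both nodes" : "duplicates: #{dups.join(', ')}"
```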

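Fortunately the tool offers another, manual way to move slots: reshard. Once we knew about it we happily migrated most of the slots, but migrating slot 45 failed again, with an error similar to the one above.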

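The problem seems to lie with slot 45. If we do not solve it, the load on this node will stay very high and it may become the bottleneck on Double 11. Let's look at what reshard actually does.
The redis-trib.rb source shows the following commands: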
cluster setslot importing

cluster setslot migrating

cluster getkeysinslot

migrate

setslot node
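
To make the order of these steps concrete, here is a minimal sketch that only builds the command sequence for one slot and sends nothing. The helper name move_slot_commands, the node ids, addresses, batch size and timeout are all placeholders of ours; the real tool fetches keys in batches and loops, which this sketch flattens into a single pass.

```ruby
# Builds, in order, the commands for moving one slot from source to target.
# Each entry is [which node receives the command, the command as an array].
# Nothing is sent to Redis; the list of keys is passed in by the caller.
def move_slot_commands(slot, source, target, keys, batch_size: 10, timeout_ms: 60_000)
  cmds = []
  cmds << [:target, ["CLUSTER", "SETSLOT", slot, "IMPORTING", source[:id]]]
  cmds << [:source, ["CLUSTER", "SETSLOT", slot, "MIGRATING", target[:id]]]
  cmds << [:source, ["CLUSTER", "GETKEYSINSLOT", slot, batch_size]]
  keys.each do |key|
    # MIGRATE host port key destination-db timeout
    cmds << [:source, ["MIGRATE", target[:host], target[:port], key, 0, timeout_ms]]
  end
  # Finally the slot owner is announced to the masters.
  cmds << [:all_masters, ["CLUSTER", "SETSLOT", slot, "NODE", target[:id]]]
  cmds
end

# Placeholder node descriptions, not real node ids.
src = { id: "source-node-id", host: "10.112.142.20", port: 6380 }
tgt = { id: "target-node-id", host: "10.112.142.21", port: 7210 }

move_slot_commands(45, src, tgt, ["t2526767", "t2593793"]).each do |node, cmd|
  puts "#{node}: #{cmd.join(' ')}"
end
```

Because MIGRATE blocks the instances involved while a key is serialized and transferred, a single very large key in the loop can hold everything up.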

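These are the operations performed during a reshard. Judging from the failed reshard above, the cause could be a timeout during migration, or a very large key blocking Redis.
OK, that is our hypothesis so far. Next we verify it by checking whether slot 45 holds an unusually large key.
First, using the information given above, let's see which keys are in this slot: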

10.205.142.21:6380> CLUSTER GETKEYSINSLOT 45 100
1) "JIUKUIYOU_COM_GetCouponInfo_3479num"
2) "com.juanpi.api.user_hbase_type"
3) "t2526767"
4) "t2593793"
............

Then we look at how much space each key takes up once serialized:
10.205.142.21:6380> DEBUG OBJECT com.juanpi.api.user_hbase_type
Value at:0x7f973b7226d0 refcount:1 encoding:hashtable serializedlength:489435339 lru:1802371 lru_seconds_idle:3013 [466 MB!]
(7.11s)
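
The serializedlength field in the reply above is a byte count, so turning it into something readable is a small parsing exercise. A minimal sketch, assuming the reply keeps the space-separated name:value layout shown above; the helper names are ours.

```ruby
# Parses a DEBUG OBJECT reply into a hash of field name => string value.
# Assumes the reply is a single line of space-separated "name:value" pairs.
def parse_debug_object(reply)
  reply.scan(/(\w+):(\S+)/).to_h
end

# Serialized size in MiB, rounded to one decimal place.
def serialized_mib(reply)
  bytes = parse_debug_object(reply).fetch("serializedlength").to_i
  (bytes / (1024.0 * 1024)).round(1)
end

reply = "Value at:0x7f973b7226d0 refcount:1 encoding:hashtable " \
        "serializedlength:489435339 lru:1802371 lru_seconds_idle:3013"

puts "#{serialized_mib(reply)} MiB"
```

Note that this is the size of the serialized value, which is what has to be shipped during a migration, not the memory the key occupies in the running instance.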
So here we have found a rather large key. Let's see how many fields it holds:

10.205.142.21:6380> HLEN com.juanpi.api.user_hbase_type
-> Redirected to slot [45] located at 10.205.142.20:6380
(integer) 6589164

10.205.142.20:6380> HSCAN com.juanpi.api.user_hbase_type 0
1) "3670016"
2) 1) "6581cc3950e071873763e4b016b66914"
2) "{\"type\":\"A1\",\"time\":1434625093}"
3) "4baca6b94be68d704f348ee0a3e45915"
4) "{\"type\":{\"A\":\"A3\",\"C\":\"C1\"},\"time\":1442105611}"
5) "b83ce222b54890bf4de03cfad2362e9e"
6) "{\"type\":\"A1\",\"time\":1434672804}"
7) "821d1f1a27c63d6d40bc1d969bcec5f6"
8) "{\"type\":{\"A\":\"A6\",\"C\":\"C3\"},\"time\":1435705301}"
9) "cbce79067ad2773b87360fb91b9a325c"
10) "{\"type\":\"A3\",\"time\":1433925484}"
11) "8eef2bc24fd819687e017f7bd1ad8e1c"
12) "{\"type\":{\"A\":\"A6\",\"C\":\"C2\"},\"time\":1435543716}"
13) "d4112308e47066f2e3d35dbcf96ba092"
14) "{\"type\":{\"A\":\"A1\",\"C\":\"C4\"},\"time\":1435141321}"
15) "2ca72205e82f5d9188cc9c56285ea161"
16) "{\"type\":{\"A\":\"A2\",\"C\":\"C3\"},\"time\":1434946176}"
17) "b259ce800a0112dbd316f54aae7679a6"
18) "{\"type\":{\"A\":\"A6\",\"C\":\"\"},\"time\":1441892174}"
19) "039b9011d1470791143563a2660d8dc2"
20) "{\"type\":{\"A\":\"A6\",\"C\":\"C3\"},\"time\":1435442319}"
21) "82def11dfe206501074eb200558fb8a5"
22) "{\"type\":\"C2\",\"time\":1434466702}"

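At this point we had essentially narrowed the problem down to an abnormally large key in this slot. But is this key really the cause? And if it is, how do we verify that?
Our first thought was: can we skip the migration of this slot, or alternatively migrate only specified slots?
Migrating specified slots is not supported by the stock tool. To implement it we would have to take the steps above apart by hand and write the logic ourselves.
What about skipping this slot? That looks somewhat easier to implement, so we simply patch the redis-trib.rb source:

First, find the function entry point: def reshard_cluster_cmd(argv,opt) (around line 1200).
Then find this statement: print "Do you want to proceed with the proposed reshard plan (yes/no)? "
A few lines below it, add a check: if the current slot is slot 45, skip it; otherwise migrate it.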

    if !opt['yes']
        print "Do you want to proceed with the proposed reshard plan (yes/no)? "
        yesno = STDIN.gets.chop
        exit(1) if (yesno != "yes")
    end
    reshard_table.each{|e|
        xputs "------------------------> #{e[:slot]}"
        case e[:slot]
        when 45
            # Skip slot 45: it holds the oversized key.
            puts "skipping slot 45"
        else
            move_slot(e[:source],target,e[:slot],
                      :dots=>true,
                      :pipeline=>opt['pipeline'])
        end
    }
end # closes def reshard_cluster_cmd
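
The patch above hard-codes slot 45. A slightly more general variant would split the reshard plan into entries to move and entries to skip before the loop runs. The following is only a sketch of that idea, not code from redis-trib.rb: the helper name split_reshard_table is ours, and the plan entries are simplified stand-ins that keep just the :source and :slot fields used above.

```ruby
# Splits a reshard plan into [entries to move, entries to skip].
# skip_slots is any collection of slot numbers that should be left alone.
def split_reshard_table(reshard_table, skip_slots)
  skipped, to_move = reshard_table.partition { |e| skip_slots.include?(e[:slot]) }
  [to_move, skipped]
end

# Simplified stand-in for the reshard plan; the source names are placeholders.
plan = [
  { source: "node-a", slot: 44 },
  { source: "node-a", slot: 45 },
  { source: "node-a", slot: 46 }
]

to_move, skipped = split_reshard_table(plan, [45])
puts "moving:  #{to_move.map { |e| e[:slot] }.join(', ')}"
puts "skipped: #{skipped.map { |e| e[:slot] }.join(', ')}"
```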

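OK, let's give it a try and see whether the reshard succeeds once this slot is skipped. The result matched our guess: the failure was caused by the huge key in this slot. After talking with the business team afterwards, we learned that the key was obsolete and could simply be deleted.

With that, the problem was resolved.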

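To sum up:
An error message does not always point accurately at the real problem. We need to work out which operation we were running when the error occurred, which concrete steps that operation consists of, and how the system involved works underneath. Step-by-step reasoning, guessing and verification eventually lead to the answer.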

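The crude source patch above still leaves some things unconsidered.
For example, besides this slot, do any other slots hold keys as large as this one?
Shouldn't we scan and check the size of every key in a slot before migrating it?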
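
Such a pre-migration check could look roughly like the sketch below. It takes the key names of a slot and a callable that returns each key's serialized size, and reports the keys above a threshold. The helper name oversized_keys, the 64 MiB threshold and the sizes of the two small keys are assumptions for illustration; against a live cluster the sizes would have to come from something like DEBUG OBJECT, which, as seen above, can itself take seconds on a huge key.

```ruby
# Returns [key, bytes] pairs whose size is at least limit_bytes, largest first.
# size_of is any callable that maps a key name to its serialized size in bytes.
def oversized_keys(keys, limit_bytes, size_of)
  keys.map { |k| [k, size_of.call(k)] }
      .select { |_, bytes| bytes >= limit_bytes }
      .sort_by { |_, bytes| -bytes }
end

# Stand-in for live lookups: only the large value comes from the article,
# the two small sizes are made up.
fake_sizes = {
  "t2526767" => 120,
  "t2593793" => 98,
  "com.juanpi.api.user_hbase_type" => 489_435_339
}
limit = 64 * 1024 * 1024 # 64 MiB, an arbitrary threshold

oversized_keys(fake_sizes.keys, limit, ->(k) { fake_sizes.fetch(k) }).each do |key, bytes|
  puts "#{key}: #{bytes} bytes"
end
```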
