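Problem description
The MongoDB cluster test environment showed no problems right after deployment, but a few days later it reported the following error: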
mongos> sh.status()
--- Sharding Status ---
sharding version: {
"_id" : 1,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("61e13befe1b0433c21305391")
}
shards:
{ "_id" : "rs_shardsvr0", "host" : "rs_shardsvr0/10.150.57.13:37031,10.150.57.13:37032,10.150.57.13:37033", "state" : 1, "topologyTime" : Timestamp(1642151681, 2) }
{ "_id" : "rs_shardsvr1", "host" : "rs_shardsvr1/10.150.57.13:37041,10.150.57.13:37042,10.150.57.13:37043", "state" : 1, "topologyTime" : Timestamp(1642151714, 1) }
{ "_id" : "rs_shardsvr2", "host" : "rs_shardsvr2/10.150.57.13:37051,10.150.57.13:37052,10.150.57.13:37053", "state" : 1, "topologyTime" : Timestamp(1642151722, 2) }
active mongoses:
"5.0.5" : 3
autosplit:
Currently enabled: yes
balancer:
Currently enabled: yes
Currently running: no
Failed balancer rounds in last 5 attempts: 5
Last reported error: Could not find host matching read preference { mode: "primary" } for set rs_shardsvr2
Time of reported error: Fri Feb 25 2022 22:00:25 GMT+0800 (CST)
Migration results for the last 24 hours:
No recent migrations
databases:
{ "_id" : "config", "primary" : "config", "partitioned" : true }
config.system.sessions
shard key: { "_id" : 1 }
unique: false
balancing: true
chunks:
rs_shardsvr0 342
rs_shardsvr1 341
rs_shardsvr2 341
too many chunks to print, use verbose if you want to force print
The key part of the error output:
Last reported error: Could not find host matching read preference { mode: "primary" } for set rs_shardsvr2
Time of reported error: Fri Feb 25 2022 22:00:25 GMT+0800 (CST)
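Notes:
Taken literally, the message says that no connection could be made to the PRIMARY node of the rs_shardsvr2 replica set.
My MongoDB replica set has three data nodes and can tolerate the failure of one of them. One possible cause is that two nodes had network problems at the same time, which left the set without the majority of votes needed to elect a new PRIMARY.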
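The majority reasoning above can be sketched as a few lines of arithmetic. This is only an illustration, assuming every member of the set is a voting member; the function names `majority`, `faultTolerance`, and `canElectPrimary` are made up for this note and are not part of the MongoDB shell.

```javascript
// Number of votes required to elect a primary.
// Assumption: every member of the replica set is a voting member.
function majority(votingMembers) {
  return Math.floor(votingMembers / 2) + 1;
}

// How many voting members may be unreachable while an election can still succeed.
function faultTolerance(votingMembers) {
  return votingMembers - majority(votingMembers);
}

// True if the members that can still reach each other form a majority.
function canElectPrimary(votingMembers, reachableMembers) {
  return reachableMembers >= majority(votingMembers);
}

console.log(majority(3), faultTolerance(3)); // 2 1
console.log(canElectPrimary(3, 1));          // false: two of three nodes are cut off
```

With three members, losing one node still leaves two votes, but losing two leaves a single node that cannot become PRIMARY on its own, which would match the balancer error shown above.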
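Troubleshooting approach
1. Check whether the network or the keyfile has a problem.
2. Check whether the rs_shardsvr2 replica set is healthy: rs.status().
3. Create a new hashed-sharded collection to verify that data is distributed evenly across the shards:
1. Create the damocles database
> use damocles
2. Enable sharding on the damocles database
> sh.enableSharding("damocles")
3. Shard the damocles.order collection with a hashed key on the _id field
> sh.shardCollection("damocles.order", {"_id": "hashed" })
4. Insert 10,000 test documents
> use damocles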
> for (let i = 1; i <= 10000; i++) { db.order.insertOne({ price: 1 }) }
5. Verify the document count on each shard
rs_shardsvr0:PRIMARY> db.order.find().count()
3315
rs_shardsvr1:PRIMARY> db.order.find().count()
3318
rs_shardsvr2:PRIMARY> db.order.find().count()
3367
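To judge whether these counts really are "even", the three numbers can be compared against a perfect split. The following is a minimal sketch in plain JavaScript; the helper name `shardSkew` is hypothetical, and the input is simply the three counts reported above typed in by hand.

```javascript
// Compare per-shard document counts with a perfectly even split.
// `counts` maps a shard name to its document count.
function shardSkew(counts) {
  const entries = Object.entries(counts);
  if (entries.length === 0) {
    throw new Error("no shard counts given");
  }
  const total = entries.reduce((sum, [, n]) => sum + n, 0);
  const even = total / entries.length; // ideal count per shard
  const perShard = entries.map(([shard, n]) => ({
    shard,
    count: n,
    sharePct: (n / total) * 100,            // share of all documents
    deviationPct: ((n - even) / even) * 100 // signed distance from the ideal
  }));
  const maxDeviationPct = Math.max(
    ...perShard.map((s) => Math.abs(s.deviationPct))
  );
  return { total, perShard, maxDeviationPct };
}

// Counts taken from the three shard primaries above.
const result = shardSkew({
  rs_shardsvr0: 3315,
  rs_shardsvr1: 3318,
  rs_shardsvr2: 3367
});
console.log(result.total); // 10000
console.log(result.maxDeviationPct.toFixed(2) + "%");
```

The counts add up to the 10,000 inserted documents, and the largest deviation from an even split is about 1%, which is what a hashed shard key should produce. On a mongos, db.order.getShardDistribution() reports a similar per-shard breakdown without connecting to each shard separately.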
None of these checks turned up a problem in my environment; the cluster is currently healthy.
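Step 2 of the checklist, reading the rs.status() output, can be partly automated. The sketch below is an assumption-laden example rather than a definitive check: `summarizeReplSet` is a hypothetical helper, it reads only the `set`, `members[].name`, `members[].health`, and `members[].stateStr` fields that rs.status() returns, it assumes every member votes, and the sample document is invented to show a failure case (it was not captured from the cluster above).

```javascript
// Summarize a document shaped like the output of rs.status().
// Assumption: all members are voting members, so majority = floor(n / 2) + 1.
function summarizeReplSet(status) {
  const members = status.members || [];
  const healthy = members.filter((m) => m.health === 1);
  const primary = members.find(
    (m) => m.health === 1 && m.stateStr === "PRIMARY"
  );
  return {
    set: status.set,
    primary: primary ? primary.name : null, // null means no usable primary
    healthyCount: healthy.length,
    majorityNeeded: Math.floor(members.length / 2) + 1,
    unhealthy: members
      .filter((m) => m.health !== 1)
      .map((m) => `${m.name} (${m.stateStr})`)
  };
}

// Hypothetical sample: two of three members unreachable, so no primary exists.
const sampleStatus = {
  set: "rs_shardsvr2",
  members: [
    { name: "10.150.57.13:37051", health: 1, stateStr: "SECONDARY" },
    { name: "10.150.57.13:37052", health: 0, stateStr: "(not reachable/healthy)" },
    { name: "10.150.57.13:37053", health: 0, stateStr: "(not reachable/healthy)" }
  ]
};

const summary = summarizeReplSet(sampleStatus);
console.log(summary.primary);                               // null
console.log(summary.healthyCount < summary.majorityNeeded); // true
```

When connected to a member of the replica set, the same function can be pasted into the shell and called as summarizeReplSet(rs.status()). A null primary together with fewer healthy members than the required majority would explain the "Could not find host matching read preference" error.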