Symptom:
The data center reported that around 09:00 a switch in the machine room failed, causing network problems.
The business side reported that one of their interfaces was timing out.
Initial check: the application logs showed that writes to one MongoDB collection were failing with the error below:
Mongo::Error::OperationFailure - no progress was made executing batch write op in jdb3.images after 5 rounds (0 ops completed in 6 rounds total) (82):
This initially pointed to a problem in the MongoDB sharded cluster.
Logging into a mongos node and running a findOne returned:
"errmsg" : "None of the hosts for replica set configReplSet could be contacted."
Checking the sharding status:
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("58c99a8257905f85f1828f52")
  }
  shards:
    { "_id" : "shard01", "host" : "shard01/100.106.23.22:27017,100.106.23.32:27017,100.111.9.19:27017" }
    { "_id" : "shard02", "host" : "shard02/100.106.23.23:27017,100.106.23.33:27017,100.111.9.20:27017" }
    { "_id" : "shard03", "host" : "shard03/100.106.23.24:27017,100.106.23.34:27017,100.111.17.3:27017" }
    { "_id" : "shard04", "host" : "shard04/100.106.23.25:27017,100.106.23.35:27017,100.111.17.4:27017" }
  active mongoses:
    "3.2.7" : 6
  balancer:
    Currently enabled:  yes
    Currently running:  no
    Balancer active window is set between 2:00 and 6:00 server local time
    Failed balancer rounds in last 5 attempts:  0
    Migration Results for the last 24 hours:
      9 : Success
  databases:
    { "_id" : "jdb3", "primary" : "shard01", "partitioned" : true }
      jdb3.images
        shard key: { "uuid" : 1 }
        unique: false
        balancing: true
        chunks:
          shard01  41109
          shard02  41109
          shard03  41108
          shard04  41108
        too many chunks to print, use verbose if you want to force print
    { "_id" : "gongan", "primary" : "shard02", "partitioned" : true }
    { "_id" : "tmp", "primary" : "shard03", "partitioned" : false }
    { "_id" : "1_n", "primary" : "shard04", "partitioned" : true }
    { "_id" : "upload", "primary" : "shard04", "partitioned" : true }
      upload.images
        shard key: { "uuid" : 1 }
        unique: false
        balancing: true
        chunks:
          shard01  259
          shard02  258
          shard03  258
          shard04  259
        too many chunks to print, use verbose if you want to force print
    { "_id" : "test", "primary" : "shard03", "partitioned" : false }
Nothing unusual there, so I went through each shard node's log in turn.
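As a quick sanity check on the status output, the per-shard chunk counts can be parsed out and compared; a difference of at most one chunk means the balancer has kept the collection evenly distributed. A minimal stdlib-Python sketch, fed an abbreviated excerpt of the status above:

```python
import re

# Abbreviated sh.status() excerpt for jdb3.images (from the output above).
status = """
chunks:
    shard01  41109
    shard02  41109
    shard03  41108
    shard04  41108
"""

# Collect per-shard chunk counts with a simple regex.
counts = {m.group(1): int(m.group(2))
          for m in re.finditer(r"(shard\d+)\s+(\d+)", status)}

# A spread of 0 or 1 chunks means the balancer has kept things even.
spread = max(counts.values()) - min(counts.values())
print(counts)
print("balanced" if spread <= 1 else "imbalanced")
```

For jdb3.images the spread is 1 (41109 vs 41108), consistent with "nothing unusual" at this layer.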
On shard04, the secondary at 100.106.23.25 could not find a primary, so I also checked the error log on shard04's primary.
Error in the 100.106.23.25 log:
2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.
Error in the log of the primary, 100.106.23.35:
2018-12-10T09:12:02.282+0800 W SHARDING [conn7204619] could not remotely refresh metadata for jdb3.images :: caused by :: None of the hosts for replica set configReplSet could be contacted.
Queries run directly on server 35 failed with the same error seen on mongos:
"errmsg" : "None of the hosts for replica set configReplSet could be contacted."
Locating the problem:
Fetched a sample document from each of shard01-shard03, then queried each one by its indexed key through a mongos node: all were returned. Every document fetched from shard04, however, errored out when queried through mongos.
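This pattern makes sense given how mongos routes queries: jdb3.images is range sharded on { "uuid" : 1 }, so each chunk owns a key interval mapped to one shard, and any uuid falling in a shard04-owned interval hits the broken shard. A simplified Python sketch of that range routing, with hypothetical chunk boundaries (the real boundaries live in the config database's chunks collection):

```python
import bisect

# Hypothetical chunk boundaries for the { "uuid" : 1 } shard key:
# chunk i covers [bounds[i], bounds[i+1]) and lives on shards[i].
bounds = ["00000000", "40000000", "80000000", "c0000000"]
shards = ["shard01", "shard02", "shard03", "shard04"]

def owning_shard(uuid: str) -> str:
    """Return the shard whose chunk range contains this uuid."""
    # bisect_right finds the first boundary greater than uuid;
    # the chunk just before it owns the key.
    i = bisect.bisect_right(bounds, uuid) - 1
    return shards[max(i, 0)]

# Queries routed to shard04 were the ones failing via mongos.
print(owning_shard("d3adbeef"))  # prints shard04
```

Under these assumed boundaries, any uuid at or above "c0000000" routes to shard04 and fails, while the rest succeed, matching what was observed.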
Fix: restarted the secondary, 25, and watched its log: the errors were gone.
Then restarted the primary, server 35: the errors disappeared there too, and checking the replica set status showed the primary had failed over to 25.
The business side confirmed the fault was resolved.
Open questions:
1. If this was caused by the network outage, why did the following error keep appearing after the network recovered:
2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.
Do the mongo shards use long-lived (persistent) connections to mongos?
If anyone knows the answer, please share. Many thanks!