Symptom:
The data center reported that around 09:00 a switch in the machine room failed, causing network problems.
The business side reported that one of their interfaces was timing out.
Initial check: the application logs showed that writes to one MongoDB collection were failing with the error below:
Mongo::Error::OperationFailure - no progress was made executing batch write op in jdb3.images after 5 rounds (0 ops completed in 6 rounds total) (82):
This initially pointed to a problem in the MongoDB sharded cluster.
Logging into a mongos node and running a findOne returned:
"errmsg" : "None of the hosts for replica set configReplSet could be contacted."
Checking the sharding status:
--- Sharding Status ---
  sharding version: {
    "_id" : 1,
    "minCompatibleVersion" : 5,
    "currentVersion" : 6,
    "clusterId" : ObjectId("58c99a8257905f85f1828f52")
  }
  shards:
    { "_id" : "shard01", "host" : "shard01/100.106.23.22:27017,100.106.23.32:27017,100.111.9.19:27017" }
    { "_id" : "shard02", "host" : "shard02/100.106.23.23:27017,100.106.23.33:27017,100.111.9.20:27017" }
    { "_id" : "shard03", "host" : "shard03/100.106.23.24:27017,100.106.23.34:27017,100.111.17.3:27017" }
    { "_id" : "shard04", "host" : "shard04/100.106.23.25:27017,100.106.23.35:27017,100.111.17.4:27017" }
  active mongoses:
    "3.2.7" : 6
  balancer:
    Currently enabled:  yes
    Currently running:  no
    Balancer active window is set between 2:00 and 6:00 server local time
    Failed balancer rounds in last 5 attempts:  0
    Migration Results for the last 24 hours:
      9 : Success
  databases:
    { "_id" : "jdb3", "primary" : "shard01", "partitioned" : true }
      jdb3.images
        shard key: { "uuid" : 1 }
        unique: false
        balancing: true
        chunks:
          shard01  41109
          shard02  41109
          shard03  41108
          shard04  41108
        too many chunks to print, use verbose if you want to force print
    { "_id" : "gongan", "primary" : "shard02", "partitioned" : true }
    { "_id" : "tmp", "primary" : "shard03", "partitioned" : false }
    { "_id" : "1_n", "primary" : "shard04", "partitioned" : true }
    { "_id" : "upload", "primary" : "shard04", "partitioned" : true }
      upload.images
        shard key: { "uuid" : 1 }
        unique: false
        balancing: true
        chunks:
          shard01  259
          shard02  258
          shard03  258
          shard04  259
        too many chunks to print, use verbose if you want to force print
    { "_id" : "test", "primary" : "shard03", "partitioned" : false }
Nothing unusual there, so I went through each shard node's log in turn.
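As a quick sanity check on the status output, the per-shard chunk counts can be parsed out and compared; a difference of at most one chunk means the balancer has kept the collection evenly distributed. A minimal stdlib-Python sketch, fed an abbreviated excerpt of the status above:

```python
import re

# Abbreviated sh.status() excerpt for jdb3.images (from the output above).
status = """
chunks:
    shard01  41109
    shard02  41109
    shard03  41108
    shard04  41108
"""

# Collect per-shard chunk counts with a simple regex.
counts = {m.group(1): int(m.group(2))
          for m in re.finditer(r"(shard\d+)\s+(\d+)", status)}

# A spread of 0 or 1 chunks means the balancer has kept things even.
spread = max(counts.values()) - min(counts.values())
print(counts)
print("balanced" if spread <= 1 else "imbalanced")
```

For jdb3.images the spread is 1 (41109 vs 41108), consistent with "nothing unusual" at this layer.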
On shard04, the secondary at 100.106.23.25 could not find a primary, so I also checked the error log on shard04's primary.
Error in the 100.106.23.25 log:
2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.
Error in the log of the primary, 100.106.23.35:
2018-12-10T09:12:02.282+0800 W SHARDING [conn7204619] could not remotely refresh metadata for jdb3.images :: caused by :: None of the hosts for replica set configReplSet could be contacted.
Queries run directly on server 35 failed with the same error seen on mongos:
"errmsg" : "None of the hosts for replica set configReplSet could be contacted."
Locating the problem:
Fetched a sample document from each of shard01-shard03, then queried each one by its indexed key through a mongos node: all were returned. Every document fetched from shard04, however, errored out when queried through mongos.
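This pattern makes sense given how mongos routes queries: jdb3.images is range sharded on { "uuid" : 1 }, so each chunk owns a key interval mapped to one shard, and any uuid falling in a shard04-owned interval hits the broken shard. A simplified Python sketch of that range routing, with hypothetical chunk boundaries (the real boundaries live in the config database's chunks collection):

```python
import bisect

# Hypothetical chunk boundaries for the { "uuid" : 1 } shard key:
# chunk i covers [bounds[i], bounds[i+1]) and lives on shards[i].
bounds = ["00000000", "40000000", "80000000", "c0000000"]
shards = ["shard01", "shard02", "shard03", "shard04"]

def owning_shard(uuid: str) -> str:
    """Return the shard whose chunk range contains this uuid."""
    # bisect_right finds the first boundary greater than uuid;
    # the chunk just before it owns the key.
    i = bisect.bisect_right(bounds, uuid) - 1
    return shards[max(i, 0)]

# Queries routed to shard04 were the ones failing via mongos.
print(owning_shard("d3adbeef"))  # prints shard04
```

Under these assumed boundaries, any uuid at or above "c0000000" routes to shard04 and fails, while the rest succeed, matching what was observed.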
Fix: restarted the secondary, 25, and watched its log: the errors were gone.
Then restarted the primary, server 35: the errors disappeared there too, and checking the replica set status showed the primary had failed over to 25.
The business side confirmed the fault was resolved.
Open questions:
1. If this was caused by the network outage, why did the following error keep appearing after the network recovered:
2018-12-10T11:40:53.546+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ReplicaSetNotFound: None of the hosts for replica set configReplSet could be contacted.
Do the mongo shards use long-lived (persistent) connections to mongos?
If anyone knows the answer, please share. Many thanks!