======================================================
The problem in the title came up as an intermediate step of a bigger task.
The original goal: convert a previously standalone MongoDB instance into a replica set. The initial sync would get about halfway and then crash this MongoDB instance.
For how to set up a MongoDB replica set, see another post: http://www.cnblogs.com/zhzhang/p/6783425.html
======================================================
Server (VM): 6 cores, 8 GB RAM.
The problem:
MongoDB version 3.2.7 (installed via yum).
I needed to mongodump a single collection, like so:
mongodump --collection abc --db db
Collection abc held close to 200 million documents at roughly 200 B each.
Every run of mongodump got to about 52.5%, then failed with the error below, and the mongod service died and had to be restarted:
2017-05-02T17:08:51.663+0800 [############............] db.abc 91363661/177602822 (51.4%)
2017-05-02T17:08:54.663+0800 [############............] db.abc 91744632/177602822 (51.7%)
2017-05-02T17:08:57.663+0800 [############............] db.abc 92279192/177602822 (52.0%)
2017-05-02T17:09:00.663+0800 [############............] db.abc 92629211/177602822 (52.2%)
2017-05-02T17:09:03.663+0800 [############............] db.abc 93112828/177602822 (52.4%)
2017-05-02T17:09:05.619+0800 [############............] db.abc 93288043/177602822 (52.5%)
2017-05-02T17:09:09.823+0800 Failed: error reading collection: EOF
You have mail in /var/spool/mail/admin
[admin@syslog-1.dev.abc-inc.com /abc_log_nas]
$ps aux | grep mongo
admin 30931 0.0 0.0 103244 860 pts/2 S+ 17:14 0:00 grep mongo
You have mail in /var/spool/mail/admin
[root@syslog-1.dev.abc-inc.com /home/admin/bin]
#/etc/init.d/mongod status
mongod dead but subsys locked
[root@syslog-1.dev.abc-inc.com /home/admin/bin]
#/etc/init.d/mongod restart
Stopping mongod: [ OK ]
Starting mongod: [ OK ]
[root@syslog-1.dev.abc-inc.com /home/admin/bin]
#tail -n 10 /var/spool/mail/admin
X-Cron-Env: <PATH=/usr/bin:/bin>
X-Cron-Env: <LOGNAME=admin>
X-Cron-Env: <USER=admin>
Message-Id: <20160601115558.741D2601FD@syslog-1.dev.abc-inc.com>
Date: Mon, 11 Apr 2016 05:16:11 +0800 (CST)
ssh: Could not resolve hostname syslog-1: Temporary failure in name resolution
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]
rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
rsync error: unexplained error (code 255) at io.c(600) [receiver=3.0.6]
I did eventually find a solution. Many thanks to the anonymous experts around the net. I'm writing this up in the hope it helps someone else.
=====================================================================
Attempt 1 (did not work): raise the connection limit; see http://www.cnblogs.com/zhzhang/p/6762239.html for the open-files limit configuration.
Attempt 2 (did not work): increase the oplogSize, from the previous 1 GB to 10 GB, following the MongoDB manual:
https://docs.mongodb.com/manual/tutorial/change-oplog-size/
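For reference, the equivalent setting in a YAML-style mongod.conf looks roughly like this (10240 MB matches the 10 GB above; note it only applies when the oplog is first created, so an existing oplog still has to be resized per the linked tutorial):

```yaml
replication:
  oplogSizeMB: 10240
```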
Attempt 3 (did not work): since the collection had only the default _id index, I tried building a secondary index. That also crashed the server at around 50%.
Worse, on restart mongod resumes the interrupted index build by default, so this must be disabled explicitly in the config file:

storage:
  indexBuildRetry: false

Only with that in place does a restart escape the build-index... crash... restart... build-index loop.
Attempt 4 (did not work): I started thinking about how to split the collection apart, and it had no user-created indexes at all. Indexes... wait, indexes! Suddenly I spotted what looked like a lifeline: the index mongo builds automatically on _id (an ObjectId uses 12 bytes of storage, two hex digits per byte, i.e. a 24-character string, and its first 4 bytes encode the creation timestamp).
So you can start a mongod instance on this or another server and use that built-in index to clone out several smaller collections partitioned by time.
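Because the leading 4 bytes of an ObjectId are a big-endian Unix timestamp, a range boundary for any point in time can be synthesized by zero-filling the remaining bytes. A minimal sketch in plain Python (no mongo driver; the function name is mine):

```python
from datetime import datetime, timezone

def objectid_boundary(dt):
    """Return the 24-char hex ObjectId marking the start of `dt` (UTC):
    the first 4 bytes hold the big-endian Unix timestamp, and the
    remaining 8 bytes are zeroed to form a range boundary."""
    ts = int(dt.replace(tzinfo=timezone.utc).timestamp())
    return format(ts, "08x") + "0" * 16

# Boundaries for one time slice, e.g. everything created in December 2016:
lo = objectid_boundary(datetime(2016, 12, 1))
hi = objectid_boundary(datetime(2017, 1, 1))
print(lo, hi)
```

The resulting strings plug straight into ObjectId(...) in the $gt/$lte range query passed to cloneCollection below.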
The clone procedure, run on the newly created mongo instance:
db.runCommand({cloneCollection: "db.abc", from: "syslog-1:27017", query: {"_id": {$gt: ObjectId("583aa21d382653813be7c18d"),$lte: ObjectId("587aa21d382653813be7c18d")}}})
db.getCollection("abc").renameCollection("abc_587")
Then, on the original instance:
db.runCommand({cloneCollection: "db.abc_587", from: "syslog-3:37017", query: {}})
After checking that the copy looks good, the corresponding documents can be deleted from the huge collection:
db.abc.remove({"_id": {$gt: ObjectId("583aa21d382653813be7c18d"),$lte: ObjectId("587aa21d382653813be7c18d")}})
I have to complain here: batch-deleting documents from a mongo collection is painfully slow. Deleting tens of millions of documents looked like it would take tens of hours, averaging somewhere around 50k per minute (when the machine was lightly loaded, it seemed more like 30-40k per second).
So I waited. And waited...
Attempt 5 (confirmed effective, and the root cause of the problem):
Take another look at the detailed error messages:
2017-05-04T19:14:06.533+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56728 #35 (11 connections now open)
2017-05-04T19:14:06.545+0800 I NETWORK  [conn35] end connection 127.0.0.1:56728 (10 connections now open)
2017-05-04T19:14:06.550+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56730 #36 (11 connections now open)
2017-05-04T19:14:06.563+0800 I NETWORK  [conn36] end connection 127.0.0.1:56730 (10 connections now open)
2017-05-04T19:14:06.818+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56731 #37 (11 connections now open)
2017-05-04T19:14:06.831+0800 I NETWORK  [conn37] end connection 127.0.0.1:56731 (10 connections now open)
2017-05-04T19:14:06.837+0800 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:56732 #38 (11 connections now open)
2017-05-04T19:14:06.870+0800 I NETWORK  [conn38] end connection 127.0.0.1:56732 (10 connections now open)
2017-05-04T19:14:24.465+0800 I COMMAND  [conn24] query service.client_agent query: { $query: {}, $orderby: { _id: 1 } } planSummary: IXSCAN { _id: 1 } cursorid:51274086361 ntoreturn:0 ntoskip:11350867 keysExamined:11350968 docsExamined:11350968 keyUpdates:0 writeConflicts:0 numYields:88679 nreturned:101 reslen:12747 locks:{ Global: { acquireCount: { r: 177360 } }, Database: { acquireCount: { r: 88680 } }, Collection: { acquireCount: { r: 88680 } } } 10756ms
2017-05-04T19:14:24.531+0800 E STORAGE  [conn24] WiredTiger (0) [1493896464:531510][14803:0x7f40712c3700], file:collection-4--1812812328855925336.wt, WT_CURSOR.search: read checksum error for 8192B block at offset 3412144128: block header checksum of 943205936 doesn't match expected checksum of 3037857471
2017-05-04T19:14:24.531+0800 E STORAGE  [conn24] WiredTiger (0) [1493896464:531635][14803:0x7f40712c3700], file:collection-4--1812812328855925336.wt, WT_CURSOR.search: collection-4--1812812328855925336.wt: encountered an illegal file format or internal value
2017-05-04T19:14:24.531+0800 E STORAGE  [conn24] WiredTiger (-31804) [1493896464:531656][14803:0x7f40712c3700], file:collection-4--1812812328855925336.wt, WT_CURSOR.search: the process must exit and restart: WT_PANIC: WiredTiger library panic
2017-05-04T19:14:24.531+0800 I -        [conn24] Fatal Assertion 28558
2017-05-04T19:14:24.531+0800 I -        [conn24] ***aborting after fassert() failure
2017-05-04T19:14:24.599+0800 I -        [WTJournalFlusher] Fatal Assertion 28559
2017-05-04T19:14:24.599+0800 I -        [WTJournalFlusher] ***aborting after fassert() failure
2017-05-04T19:14:24.609+0800 F -        [conn24] Got signal: 6 (Aborted).
{"processInfo":{ "mongodbVersion" : "3.2.7", "gitVersion" : "4249c1d2b5999ebbf1fdf3bc0e0e3b3ff5c0aaf2", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "2.6.32-504.el6.x86_64", "version" : "#1 SMP Wed Oct 15 04:27:16 UTC 2014", "machine" : "x86_64" } }}
----- BEGIN BACKTRACE -----
mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x1304482]
mongod(+0xF033A9) [0x13033a9]
mongod(+0xF03BB2) [0x1303bb2]
libpthread.so.0(+0xF7E0) [0x7f40828dd7e0]
libc.so.6(gsignal+0x35) [0x7f408256c625]
libc.so.6(abort+0x175) [0x7f408256de05]
mongod(_ZN5mongo13fassertFailedEi+0x82) [0x128a472]
mongod(+0xC72BB3) [0x1072bb3]
mongod(__wt_eventv+0x42C) [0x1a7945c]
mongod(__wt_err+0x8D) [0x1a7991d]
mongod(__wt_panic+0x24) [0x1a79d04]
mongod(__wt_bm_read+0x77) [0x19acfb7]
mongod(__wt_bt_read+0x85) [0x19c9c85]
mongod(__wt_page_in_func+0x180) [0x19cf380]
mongod(__wt_row_search+0x677) [0x19f0207]
mongod(__wt_btcur_search+0xB08) [0x19ba7a8]
mongod(+0x160C71C) [0x1a0c71c]
mongod(_ZN5mongo21WiredTigerRecordStore6Cursor9seekExactERKNS_8RecordIdE+0x53) [0x1067a83]
mongod(_ZN5mongo16WorkingSetCommon5fetchEPNS_16OperationContextEPNS_10WorkingSetEmNS_11unowned_ptrINS_20SeekableRecordCursorEEE+0x99) [0xbdc2c9]
mongod(_ZN5mongo10FetchStage4workEPm+0x2FE) [0xb9916e]
mongod(_ZN5mongo9SkipStage4workEPm+0x45) [0xbbf2b5]
mongod(_ZN5mongo12PlanExecutor11getNextImplEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0x275) [0xdee255]
mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x39) [0xdee919]
mongod(+0x9AAF72) [0xdaaf72]
mongod(_ZN5mongo7getMoreEPNS_16OperationContextEPKcixPbS4_+0x52D) [0xdab66d]
mongod(_ZN5mongo15receivedGetMoreEPNS_16OperationContextERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1A9) [0xc82da9]
mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xE35) [0xc89075]
mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortE+0xEC) [0x94ed5c]
mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x325) [0x12aea65]
libpthread.so.0(+0x7AA1) [0x7f40828d5aa1]
libc.so.6(clone+0x6D) [0x7f408262293d]
----- END BACKTRACE -----
What?! The data file is corrupt. When on earth did I ever touch that file...
Nothing for it; back to fixing.
The core steps of the actual solution follow.
1. Download and build the necessary software:
wget http://source.wiredtiger.com/releases/wiredtiger-2.7.0.tar.bz2
tar xvf wiredtiger-2.7.0.tar.bz2
cd wiredtiger-2.7.0
sudo apt-get install libsnappy-dev build-essential
./configure --enable-snappy
make
(The apt-get line comes from the original walkthrough; on a yum-based box like this one the rough equivalent is yum install snappy-devel plus a C build toolchain.)
2. Copy the broken .wt file out of the dbpath. To see which file backs the collection, query the collection's stats:
db.abc.stats().wiredTiger.uri
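The uri comes back as something like "statistics:table:collection-2657--1723320556100349955"; the component after the last colon, plus a .wt suffix, is the file name inside the dbpath. A tiny helper to do the mapping (hypothetical name, plain Python; assumes the uri shape just described):

```python
def wt_filename(uri):
    """Map the value of db.<coll>.stats().wiredTiger.uri to the backing
    file name in the dbpath: take the part after the last ':' and
    append the '.wt' suffix."""
    return uri.rsplit(":", 1)[-1] + ".wt"

print(wt_filename("statistics:table:collection-2657--1723320556100349955"))
```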
3. Salvage the corrupt collection file:
./wt -v -h ../mongo-bak -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R salvage collection-2657--1723320556100349955.wt
4. Dump the salvaged wt table out to a flat file (it will be loaded into a MongoDB collection in the following steps):
./wt -v -h ../data -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R dump -f ../collection.dump collection-2657--1723320556100349955
5. Create a brand-new mongo instance; the goal is to obtain an empty collection whose backing file the dumped data can be loaded into:
mongod --dbpath tmp-mongo --storageEngine wiredTiger --nojournal
Then, in the mongo shell:
use Recovery
db.borkedCollection.insert({test: 1})
db.borkedCollection.remove({})
db.borkedCollection.stats()
6. Load the collection.dump generated in step 4 into the data directory of the instance just created (that mongod must be stopped first).
The target collection-****** ident is the one reported by db.borkedCollection.stats().wiredTiger.uri for the collection created in step 5:
./wt -v -h ../data -C "extensions=[./ext/compressors/snappy/.libs/libwiredtiger_snappy.so]" -R load -f ../collection.dump -r collection-2-880383588247732034
7. Start the mongod and log in; the document count shows empty:
db.borkedCollection.count()
0
8. Yet a query with an _id projection shows the data is actually there:
db.borkedCollection.find({}, {_id: 1})
9. Use mongodump to dump the collection's data out:
mongodump
10. Use mongorestore to load the data back into mongo:
mongorestore --drop
11. Log in to MongoDB: the data is recovered.
Because the data file was corrupt, some documents may be lost; in my case roughly 100k out of 170 million, which was acceptable.
Problem solved. Reference: http://www.alexbevi.com/blog/2016/02/10/recovering-a-wiredtiger-collection-from-a-corrupt-mongodb-installation/
I hope this helps someone.
PS: digging through a problem like this really is a great way to learn.
This investigation taught me quite a bit about MongoDB.