Solr 4.8.0 Source Code Analysis (20): SolrCloud Recovery Strategy (Part 1)

Reposted article. 2014-12-05 00:07. Tags: Java / Solr / Search engine

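Preface:

When running SolrCloud, we often see a replica shard in the Recovering state. This means the data in SolrCloud is inconsistent and a Recovery is needed. While this is going on, indexing requests are not written to the index files (each shard that receives an update writes it to its own ulog instead). Recovery is covered in three articles; this first one describes the causes of Recovery and its overall flow.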

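1. Causes of Recovery

Recovery generally happens in the following three situations:

- When SolrCloud starts up. This is mainly caused by an unexpected shutdown during indexing, which leaves some shards inconsistent with the leader; on startup, the newly started shard synchronizes its data from the leader.
- When an error occurs during leader election, typically while a replica is being elected leader after the previous leader has gone down.
- During an update, when for some reason the leader fails to forward the update to a replica. The replica is then forced into recovery to synchronize its data.

The first two situations are not covered for now; this article describes the third. The rough principle is as follows.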

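As described in <Solr 4.8.0 Source Code Analysis (15): SolrCloud Indexing In Depth (2)>, no matter which shard an update request is sent to, distribution inside SolrCloud always flows from the Leader to the Replicas. After receiving the update request, the Leader first adds the document to its own index and writes the update to its ulog, and then forwards the update to all Replicas at the same time. This is the add path of the update chain described in that earlier article.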
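The leader-side flow described above can be modeled with a small toy example. This is only a sketch: `LeaderForwardSketch`, `Replica`, `index`, and `ulog` are hypothetical names invented for illustration and are not Solr classes, and the forwarding here is sequential, whereas the text says the Leader forwards to all Replicas at the same time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these types are Solr classes.
class LeaderForwardSketch {
  /** A replica that either accepts (true) or rejects (false) an update. */
  interface Replica {
    boolean apply(String doc);
  }

  final List<String> index = new ArrayList<>(); // stands in for the leader's index
  final List<String> ulog = new ArrayList<>();  // stands in for the leader's update log

  /** Applies the update locally, forwards it, and returns the names of replicas that failed. */
  List<String> add(String doc, Map<String, Replica> replicas) {
    index.add(doc); // 1. add to the local index
    ulog.add(doc);  // 2. record the update in the local ulog
    List<String> failed = new ArrayList<>();
    for (Map.Entry<String, Replica> entry : replicas.entrySet()) {
      boolean ok;
      try {
        ok = entry.getValue().apply(doc); // 3. forward to each replica
      } catch (RuntimeException ex) {
        ok = false; // a thrown error counts as a failed forward
      }
      if (!ok) {
        failed.add(entry.getKey());
      }
    }
    return failed;
  }

  public static void main(String[] args) {
    Map<String, Replica> replicas = new LinkedHashMap<>();
    replicas.put("replica1", doc -> true);
    replicas.put("replica2", doc -> false);
    LeaderForwardSketch leader = new LeaderForwardSketch();
    System.out.println(leader.add("doc-1", replicas)); // prints [replica2]
  }
}
```

The returned list corresponds to the replicas whose copy of the data may now lag behind the Leader.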

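Once the add step of the update chain has finished, SolrCloud calls finish(), which collects the response from each Replica and checks whether its update succeeded. If any Replica failed, the Leader sends a RequestRecovery command to that Replica to force it into Recovery.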

// Excerpt from DistributedUpdateProcessor (Solr 4.8.0)
private void doFinish() {
  // TODO: if not a forward and replication req is not specified, we could
  // send in a background thread

  cmdDistrib.finish();
  List<Error> errors = cmdDistrib.getErrors();
  // TODO - we may need to tell about more than one error...

  // if its a forward, any fail is a problem - 
  // otherwise we assume things are fine if we got it locally
  // until we start allowing min replication param
  if (errors.size() > 0) {
    // if one node is a RetryNode, this was a forward request
    if (errors.get(0).req.node instanceof RetryNode) {
      rsp.setException(errors.get(0).e);
    } else {
      if (log.isWarnEnabled()) {
        for (Error error : errors) {
          log.warn("Error sending update", error.e);
        }
      }
    }
    // else
    // for now we don't error - we assume if it was added locally, we
    // succeeded 
  }


  // if it is not a forward request, for each fail, try to tell them to
  // recover - the doc was already added locally, so it should have been
  // legit

  for (final SolrCmdDistributor.Error error : errors) {
    if (error.req.node instanceof RetryNode) {
      // we don't try to force a leader to recover
      // when we cannot forward to it
      continue;
    }
    // TODO: we should force their state to recovering ??
    // TODO: do retries??
    // TODO: what if its is already recovering? Right now recoveries queue up -
    // should they?
    final String recoveryUrl = error.req.node.getBaseUrl();

    Thread thread = new Thread() {
      {
        setDaemon(true);
      }
      @Override
      public void run() {
        log.info("try and ask " + recoveryUrl + " to recover");
        HttpSolrServer server = new HttpSolrServer(recoveryUrl);
        try {
          server.setSoTimeout(60000);
          server.setConnectionTimeout(15000);

          RequestRecovery recoverRequestCmd = new RequestRecovery();
          recoverRequestCmd.setAction(CoreAdminAction.REQUESTRECOVERY);
          recoverRequestCmd.setCoreName(error.req.node.getCoreName());
          try {
            server.request(recoverRequestCmd);
          } catch (Throwable t) {
            SolrException.log(log, recoveryUrl
                + ": Could not tell a replica to recover", t);
          }
        } finally {
          server.shutdown();
        }
      }
    };
    ExecutorService executor = req.getCore().getCoreDescriptor().getCoreContainer().getUpdateShardHandler().getUpdateExecutor();
    executor.execute(thread);

  }
}
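The request that the Leader sends in the code above is a CoreAdmin call with action REQUESTRECOVERY. As a rough illustration, the sketch below only builds the URL for such a call by hand, without contacting any server. The helper name `requestRecoveryUrl`, the host, and the core name are made up for the example, and the `/admin/cores` path assumes the default CoreAdmin path has not been changed.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch only: builds the CoreAdmin URL as a string; it performs no network I/O.
class RequestRecoveryUrlSketch {
  /** Builds a REQUESTRECOVERY URL for the given base URL and core name. */
  static String requestRecoveryUrl(String baseUrl, String coreName) {
    // Drop a trailing slash so the path is not doubled.
    String base = baseUrl.endsWith("/")
        ? baseUrl.substring(0, baseUrl.length() - 1)
        : baseUrl;
    try {
      // Assumes the default CoreAdmin path, /admin/cores.
      return base + "/admin/cores?action=REQUESTRECOVERY&core="
          + URLEncoder.encode(coreName, "UTF-8");
    } catch (UnsupportedEncodingException e) {
      throw new IllegalStateException("UTF-8 is always supported", e);
    }
  }

  public static void main(String[] args) {
    // Hypothetical host and core name, used only for illustration.
    System.out.println(
        requestRecoveryUrl("http://localhost:8983/solr/", "collection1_shard1_replica2"));
  }
}
```

Sending this request to a replica should have the same effect as the RequestRecovery command built in doFinish(): the target core is asked to start recovering.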

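2. Overall Recovery Flow

After a Replica receives the RequestRecovery command from the Leader, it starts a RecoveryStrategy thread and performs the Recovery. The overall flow is as follows:

For the step that checks whether a request is a RequestRecovery, only some of the other request commands are listed (not all of them); those follow the normal indexing chain.
If the received command is RequestRecovery, the shard starts a RecoveryStrategy thread to perform the Recovery.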

// Excerpt from DefaultSolrCoreState.doRecovery()
// if true, we are recovering after startup and shouldn't have (or be receiving) additional updates (except for local tlog recovery)
boolean recoveringAfterStartup = recoveryStrat == null;

recoveryStrat = new RecoveryStrategy(cc, cd, this);
recoveryStrat.setRecoveringAfterStartup(recoveringAfterStartup);
recoveryStrat.start();
recoveryRunning = true;

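The shard then publishes its state as recovering. Note that if the shard detects that it has itself become the leader, the Recovery process exits, because Recovery works by synchronizing data from the leader.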

zkController.publish(core.getCoreDescriptor(), ZkStateReader.RECOVERING);

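Next, firstTime is evaluated. When a shard restarts, it checks whether a replication was in progress and was cut short by the shutdown. firstTime controls whether the PeerSync strategy is tried first: if it is false, PeerSync is skipped and Recovery goes straight to Replicate.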

if (recoveringAfterStartup) {
  // if we're recovering after startup (i.e. we have been down), then we need to know what the last versions were
  // when we went down.  We may have received updates since then.
  recentVersions = startingVersions;
  try {
    if ((ulog.getStartingOperation() & UpdateLog.FLAG_GAP) != 0) {
      // last operation at the time of startup had the GAP flag set...
      // this means we were previously doing a full index replication
      // that probably didn't complete and buffering updates in the
      // meantime.
      log.info("Looks like a previous replication recovery did not complete - skipping peer sync. core="
          + coreName);
      firstTime = false; // skip peersync
    }
  } catch (Exception e) {
    SolrException.log(log, "Error trying to get ulog starting operation. core="
        + coreName, e);
    firstTime = false; // skip peersync
  }
}
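The GAP check above is an ordinary bit-flag test: the starting operation is an integer whose low bits encode the operation type and whose higher bits carry flags. A minimal sketch of the same test follows. The constant values here are illustrative assumptions chosen for the example and are not guaranteed to match UpdateLog, and `shouldTryPeerSync` is a hypothetical helper name.

```java
// Sketch of the bit-flag test; the constant values are assumptions for illustration.
class GapFlagSketch {
  static final int ADD = 0x01;      // assumed code for an "add" operation
  static final int FLAG_GAP = 0x10; // assumed flag bit meaning "updates were being buffered"

  /** Mirrors the role of firstTime: true means PeerSync should be attempted first. */
  static boolean shouldTryPeerSync(int startingOperation) {
    // The flag is set only if a previous full replication was still in progress.
    return (startingOperation & FLAG_GAP) == 0;
  }

  public static void main(String[] args) {
    System.out.println(shouldTryPeerSync(ADD));            // prints true
    System.out.println(shouldTryPeerSync(ADD | FLAG_GAP)); // prints false
  }
}
```

Keeping the flag in a separate bit means the operation type itself can still be read from the low bits, independent of whether the flag is set.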

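Finally, Recovery chooses between the PeerSync strategy and the Replicate strategy. The difference between the two was mentioned briefly in <Solr In Action Notes (4): SolrCloud Distributed Indexing Basics>; the details are covered in the next two articles.
- Peer sync: if the outage was short and the recovering node has missed only a small number of update requests, it can fetch them from the leader's update log. The threshold is 100 update requests; if more than 100 were missed, a full index snapshot recovery from the leader is performed instead.
- Replication: if the node has been offline too long to synchronize from the leader, it uses Solr's HTTP-based replication to restore the index from a snapshot.

At the end, the shard's state is set to active. The code also checks whether the Recovery succeeded (successfulRecovery); if it did not, Recovery is attempted again, multiple times if necessary.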
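The choice between the two strategies, together with the retry behaviour, can be summarized in a simplified model. This is a sketch under stated assumptions rather than the actual RecoveryStrategy code: `choose`, `recoverWithRetries`, `Attempt`, and the `maxRetries` parameter are hypothetical, and the only value taken from the text is the threshold of 100 missed updates.

```java
// Simplified model of the strategy choice and retry loop; not actual Solr code.
class RecoveryChoiceSketch {
  enum Strategy { PEER_SYNC, REPLICATION }

  static final int PEER_SYNC_LIMIT = 100; // threshold quoted in the text

  /** Picks PeerSync only when it is allowed (firstTime) and few enough updates were missed. */
  static Strategy choose(boolean firstTime, int missedUpdates) {
    if (firstTime && missedUpdates <= PEER_SYNC_LIMIT) {
      return Strategy.PEER_SYNC;
    }
    return Strategy.REPLICATION;
  }

  /** One recovery attempt; returns true on success. */
  interface Attempt {
    boolean run(int attemptNumber);
  }

  /** Retries until an attempt succeeds; returns the successful attempt number, or -1. */
  static int recoverWithRetries(Attempt attempt, int maxRetries) {
    for (int i = 1; i <= maxRetries; i++) {
      if (attempt.run(i)) {
        return i; // at this point the shard would be published as active
      }
    }
    return -1; // every attempt failed
  }

  public static void main(String[] args) {
    System.out.println(choose(true, 40));   // prints PEER_SYNC
    System.out.println(choose(true, 250));  // prints REPLICATION
    System.out.println(choose(false, 40));  // prints REPLICATION
    System.out.println(recoverWithRetries(n -> n == 3, 5)); // prints 3
  }
}
```

In this model a false firstTime always leads to replication, which matches the case where an earlier replication was interrupted and PeerSync is skipped.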

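Summary:

This article described the causes of Recovery and the overall Recovery process. Because it is only an overview, the content is fairly simple; it mainly introduced the two Recovery strategies, each of which is covered in detail in one of the next two articles.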
