How etcd raft implements Linearizable Read

Reposted article (see the original). 2017-07-13 17:56. Tags: go / distributed systems / consensus protocols

Put simply, Linearizable Read means that a read request must see the latest committed data and never stale data.

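In a system that uses the raft protocol to keep multiple replicas strongly consistent, both reads and writes can be served by running them through a round of the raft protocol. In real systems, however, reads usually make up a large share of the traffic, and if every read has to go through a round of raft and be persisted to disk, performance suffers badly. Optimizing read performance is therefore critical.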

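The raft protocol tells us that the leader holds the latest state, so if all reads go to the leader, the leader could return results to the client directly. However, under a network partition combined with significant clock drift, this can return old data, i.e. a stale read, which violates Linearizable Read. For example, suppose a partition separates the leader from the followers, the followers have elected a new leader, and the new leader has already committed a batch of data. Because clocks on different machines run at different speeds, the old leader may not notice that its lease has expired. It still believes it is the legitimate leader and answers clients directly, which produces a stale read.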

The author of Raft proposed a scheme called ReadIndex:

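When the leader receives a read request, it records the current commit index as the read index. Before returning a result to the client, the leader must first confirm that it really is still the leader. It does so by sending one round of heartbeats to all other peers; if a majority respond, then this node was still the leader at least at the moment the read request arrived. It then only needs to wait until the state machine has applied entries up to the read index before returning the result.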
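The scheme just described can be sketched as a toy, single-process model. This is only an illustration under stated assumptions: the names `leader`, `peer`, `ackHeartbeat`, and `readIndex` are hypothetical and are not etcd's API, and heartbeats are plain function calls rather than network messages.

```go
package main

import (
	"errors"
	"fmt"
)

// peer is a hypothetical stand-in for a remote node.
type peer struct {
	term      uint64 // highest term this peer has seen
	reachable bool   // false simulates a network partition
}

// ackHeartbeat reports whether the peer still accepts the sender as leader.
func (p *peer) ackHeartbeat(leaderTerm uint64) bool {
	return p.reachable && leaderTerm >= p.term
}

// leader holds only the state needed to illustrate ReadIndex.
type leader struct {
	term        uint64
	commitIndex uint64
	peers       []*peer
}

var errNotLeader = errors.New("leadership not confirmed by a quorum")

// readIndex records the commit index, confirms leadership with one heartbeat
// round, and returns the index the state machine must reach before the read
// can be served.
func (l *leader) readIndex() (uint64, error) {
	ri := l.commitIndex // remember the commit index at arrival time
	acks := 1           // the leader counts itself
	for _, p := range l.peers {
		if p.ackHeartbeat(l.term) {
			acks++
		}
	}
	quorum := (len(l.peers)+1)/2 + 1
	if acks < quorum {
		return 0, errNotLeader
	}
	return ri, nil
}

func main() {
	l := &leader{term: 2, commitIndex: 10,
		peers: []*peer{{term: 2, reachable: true}, {term: 2, reachable: true}}}
	ri, err := l.readIndex()
	fmt.Println(ri, err) // the read may be served once applied index >= ri

	// A partitioned old leader: both peers moved to term 3 and are unreachable.
	l.peers = []*peer{{term: 3}, {term: 3}}
	_, err = l.readIndex()
	fmt.Println(err)
}
```

The point of the sketch is that the partitioned old leader fails the quorum check and therefore never answers the read, which is what rules out the stale read described earlier.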

func (n *node) ReadIndex(ctx context.Context, rctx []byte) error {
	return n.step(ctx, pb.Message{Type: pb.MsgReadIndex, Entries: []pb.Entry{{Data: rctx}}})
}

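When handling a read request, the application's goroutine calls ReadIndex, where the rctx argument acts as the read request ID and must be globally unique. step pushes a MsgReadIndex message into recvc, and the goroutine running the node's main loop function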

func (n *node) run(r *raft)

takes this message out of recvc and processes it:

case m := <-n.recvc:
	// filter out response message from unknown From.
	if _, ok := r.prs[m.From]; ok || !IsResponseMsg(m.Type) {
		r.Step(m) // raft never returns an error
	}

Step(m) eventually calls step(m) on the raft struct. step is a function pointer; depending on the node's role, it runs stepLeader(), stepFollower(), or stepCandidate().
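The role-based dispatch can be illustrated with a miniature model. Everything here is an assumption made for illustration: the `node` type, the string messages, and the bodies of the step functions are hypothetical and much simpler than the raft struct.

```go
package main

import "fmt"

// node is a hypothetical miniature of the raft struct: the step field is
// swapped whenever the role changes.
type node struct {
	role string
	step func(n *node, msg string) string
}

func stepFollower(n *node, msg string) string {
	return "follower forwards " + msg + " to leader"
}

func stepLeader(n *node, msg string) string {
	return "leader handles " + msg
}

func (n *node) becomeFollower() { n.role, n.step = "follower", stepFollower }
func (n *node) becomeLeader()   { n.role, n.step = "leader", stepLeader }

// Step is the single entry point; it dispatches through the current function value.
func (n *node) Step(msg string) string { return n.step(n, msg) }

func main() {
	n := &node{}
	n.becomeFollower()
	fmt.Println(n.Step("MsgReadIndex"))
	n.becomeLeader()
	fmt.Println(n.Step("MsgReadIndex"))
}
```

Storing the handler as a field keeps the caller free of role checks: a role transition replaces the function once, and every later message goes to the right handler.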

If the node is the leader, the main part of stepLeader() is:

case pb.MsgReadIndex:
	if r.raftLog.zeroTermOnErrCompacted(r.raftLog.term(r.raftLog.committed)) != r.Term {
		// Reject read only request when this leader has not committed any log entry at its term.
		return
	}

	if r.quorum() > 1 {
		switch r.readOnly.option {
		case ReadOnlySafe:
			r.readOnly.addRequest(r.raftLog.committed, m)
			r.bcastHeartbeatWithCtx(m.Entries[0].Data)
		case ReadOnlyLeaseBased:
			var ri uint64
			if r.checkQuorum {
				ri = r.raftLog.committed
			}
			if m.From == None || m.From == r.id { // from local member
				r.readStates = append(r.readStates, ReadState{Index: r.raftLog.committed, RequestCtx: m.Entries[0].Data})
			} else {
				r.send(pb.Message{To: m.From, Type: pb.MsgReadIndexResp, Index: ri, Entries: m.Entries})
			}
		}
	}

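First, the r.raftLog.zeroTermOnErrCompacted check verifies that the leader has already committed an entry in its current term. Section 5.4 (Safety) of the short Raft paper explains why this is needed, describes what goes wrong without it, and gives a counterexample.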
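The idea behind that check can be shown on a toy log. This is a sketch under assumptions: `entry` and `committedInCurrentTerm` are hypothetical names, and the log is a plain slice with no compaction, so it does not reproduce what zeroTermOnErrCompacted does for compacted entries.

```go
package main

import "fmt"

// entry is a hypothetical, minimal log entry: only its term matters here.
type entry struct{ term uint64 }

// committedInCurrentTerm reports whether the entry at the commit index was
// written in currentTerm. log[i] holds the entry at raft index i+1, and
// committed == 0 means nothing has been committed yet.
func committedInCurrentTerm(log []entry, committed, currentTerm uint64) bool {
	if committed == 0 || committed > uint64(len(log)) {
		return false
	}
	return log[committed-1].term == currentTerm
}

func main() {
	log := []entry{{1}, {1}, {2}}

	// A leader of term 2 whose commit index still points at a term-1 entry
	// should reject read-only requests.
	fmt.Println(committedInCurrentTerm(log, 2, 2))

	// Once its own term-2 entry is committed, reads can be accepted.
	fmt.Println(committedInCurrentTerm(log, 3, 2))
}
```

A freshly elected leader cannot be sure its commit index is up to date until an entry from its own term has been committed, which is why the read is rejected rather than answered with a possibly outdated index.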

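Second, the ReadIndex scheme discussed in this article corresponds to the ReadOnlySafe option branch. addRequest(...) saves the commit index at the time the read request arrived and maintains some bookkeeping state, while bcastHeartbeatWithCtx(...) prepares the MsgHeartbeat messages to be sent to the peers. When the node receives a MsgHeartbeatResp heartbeat response, it handles it as follows: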

Only the relevant logic is kept:

case pb.MsgHeartbeatResp:

		if r.readOnly.option != ReadOnlySafe || len(m.Context) == 0 {
			return
		}

		ackCount := r.readOnly.recvAck(m)
		if ackCount < r.quorum() {
			return
		}

		rss := r.readOnly.advance(m)
		for _, rs := range rss {
			req := rs.req
			if req.From == None || req.From == r.id { // from local member
				r.readStates = append(r.readStates, ReadState{Index: rs.index, RequestCtx: req.Entries[0].Data})
			} else {
				r.send(pb.Message{To: req.From, Type: pb.MsgReadIndexResp, Index: rs.index, Entries: req.Entries})
			}
		}

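First, execution continues only under the ReadOnlySafe option. Once heartbeat responses from a majority have been received, the commit index recorded for the corresponding read request and its request ID are taken from the saved state and filled into a ReadState. The ReadState struct is: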

type ReadState struct {
	Index      uint64
	RequestCtx []byte
}

As you can see, ReadState holds the raft commit index at the moment the read request reached the node, together with the request ID.

The ReadState is then appended to the readStates slice in the raft struct. readStates is included in the Ready struct, which is popped from readyc for the application to consume.
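To make the bookkeeping behind addRequest, recvAck, and advance more concrete, here is a simplified sketch. It is not etcd's implementation: the `readQueue` and `pendingRead` types are hypothetical, request IDs are strings instead of byte slices, and messages are reduced to a peer ID plus a request ID.

```go
package main

import "fmt"

// pendingRead is a hypothetical, simplified per-request record.
type pendingRead struct {
	index uint64          // commit index when the read arrived
	acks  map[uint64]bool // peer IDs that acknowledged the heartbeat
}

// readQueue keeps pending reads in arrival order, keyed by request ID.
type readQueue struct {
	pending map[string]*pendingRead
	order   []string
}

func newReadQueue() *readQueue {
	return &readQueue{pending: make(map[string]*pendingRead)}
}

// addRequest remembers the commit index for a new request ID.
func (q *readQueue) addRequest(index uint64, ctx string) {
	if _, ok := q.pending[ctx]; ok {
		return
	}
	q.pending[ctx] = &pendingRead{index: index, acks: make(map[uint64]bool)}
	q.order = append(q.order, ctx)
}

// recvAck records a heartbeat ack and returns the ack count, counting the
// leader itself. Unknown request IDs yield 0.
func (q *readQueue) recvAck(from uint64, ctx string) int {
	p, ok := q.pending[ctx]
	if !ok {
		return 0
	}
	p.acks[from] = true
	return len(p.acks) + 1
}

// advance removes ctx and every request queued before it, returning their
// read indexes in arrival order.
func (q *readQueue) advance(ctx string) []uint64 {
	for i, c := range q.order {
		if c != ctx {
			continue
		}
		var indexes []uint64
		for _, done := range q.order[:i+1] {
			indexes = append(indexes, q.pending[done].index)
			delete(q.pending, done)
		}
		q.order = q.order[i+1:]
		return indexes
	}
	return nil
}

func main() {
	q := newReadQueue()
	q.addRequest(7, "req-1")
	q.addRequest(9, "req-2")
	const quorum = 2 // majority of a 3-node cluster
	if q.recvAck(2, "req-2") >= quorum {
		fmt.Println(q.advance("req-2"))
	}
}
```

In this sketch, confirming a later request also releases the earlier ones, on the reasoning that a heartbeat round sent after they arrived confirms leadership for them as well.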

Let's look at how etcdserver uses it:

First, in the goroutine that consumes Ready:

if len(rd.ReadStates) != 0 {
	select {
	case r.readStateC <- rd.ReadStates[len(rd.ReadStates)-1]:
	case <-time.After(internalTimeout):
		plog.Warningf("timed out sending read state")
	case <-r.stopped:
		return
	}
}

The key point here is that the ReadState from Ready (only the last element of rd.ReadStates) is put into readStateC, which is a channel with a buffer size of 1.
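The hand-off through a size-1 buffered channel with a timeout can be tried in isolation. This is a minimal sketch: `readState` and `publish` are hypothetical names that only mirror the shape of the snippet quoted from etcdserver, and the stop channel is left out.

```go
package main

import (
	"fmt"
	"time"
)

// readState mirrors the two fields of raft's ReadState, for illustration only.
type readState struct {
	Index      uint64
	RequestCtx []byte
}

// publish tries to hand the newest read state to the consumer and gives up
// after timeout. It returns false if the send timed out.
func publish(ch chan<- readState, states []readState, timeout time.Duration) bool {
	if len(states) == 0 {
		return true
	}
	select {
	case ch <- states[len(states)-1]:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	ch := make(chan readState, 1)
	states := []readState{
		{Index: 5, RequestCtx: []byte("a")},
		{Index: 8, RequestCtx: []byte("b")},
	}
	fmt.Println(publish(ch, states, 10*time.Millisecond)) // the buffer has room
	fmt.Println(publish(ch, states, 10*time.Millisecond)) // buffer full, nobody is reading
	fmt.Println((<-ch).Index)
}
```

With a buffer of one, the producer does not block as long as the consumer keeps up, and the timeout keeps the Ready loop from stalling when it does not.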

Then, in another goroutine, where etcdserver runs linearizableReadLoop():

// Call ReadIndex; ctx is the request ID
if err := s.r.ReadIndex(cctx, ctx); err != nil {
	cancel()
	if err == raft.ErrStopped {
		return
	}
	plog.Errorf("failed to get read index from raft: %v", err)
	nr.notify(err)
	continue
}

// Wait for the ReadState matching the request ID to come out of readStateC
for !timeout && !done {
	select {
	case rs = <-s.r.readStateC:
		done = bytes.Equal(rs.RequestCtx, ctx)
		if !done {
			// a previous request might time out. now we should ignore the response of it and
			// continue waiting for the response of the current requests.
			plog.Warningf("ignored out-of-date read index response (want %v, got %v)", rs.RequestCtx, ctx)
		}
	case <-time.After(s.Cfg.ReqTimeout()):
		plog.Warningf("timed out waiting for read index response")
		nr.notify(ErrTimeout)
		timeout = true
	case <-s.stopping:
		return
	}
}

if !done {
	continue
}

// Wait until the apply index is greater than or equal to the commit index
if ai := s.getAppliedIndex(); ai < rs.Index {
	select {
	case <-s.applyWait.Wait(rs.Index):
	case <-s.stopping:
		return
	}
}
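The final wait, blocking until the applied index catches up with the read index, can be modeled with a small waiter. This is a sketch only: `indexWaiter` is a hypothetical type written for this example and is not the implementation behind s.applyWait; it only offers a similar Wait and Trigger shape.

```go
package main

import (
	"fmt"
	"sync"
)

// indexWaiter is a hypothetical, simplified stand-in for applyWait.
type indexWaiter struct {
	mu      sync.Mutex
	applied uint64
	waiters map[uint64][]chan struct{}
}

func newIndexWaiter() *indexWaiter {
	return &indexWaiter{waiters: make(map[uint64][]chan struct{})}
}

// Wait returns a channel that is closed once the applied index reaches index.
func (w *indexWaiter) Wait(index uint64) <-chan struct{} {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan struct{})
	if w.applied >= index {
		close(ch) // already applied: the caller proceeds immediately
		return ch
	}
	w.waiters[index] = append(w.waiters[index], ch)
	return ch
}

// Trigger records a new applied index and wakes every waiter at or below it.
func (w *indexWaiter) Trigger(applied uint64) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if applied > w.applied {
		w.applied = applied
	}
	for idx, chs := range w.waiters {
		if idx <= w.applied {
			for _, ch := range chs {
				close(ch)
			}
			delete(w.waiters, idx)
		}
	}
}

func main() {
	w := newIndexWaiter()
	ready := w.Wait(10) // the read index obtained from ReadIndex
	go func() {
		for i := uint64(8); i <= 10; i++ {
			w.Trigger(i) // the apply loop making progress
		}
	}()
	<-ready
	fmt.Println("applied index caught up; safe to serve the read")
}
```

Closing a channel rather than sending on it lets any number of readers waiting on the same index wake up at once.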

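That completes the ReadIndex flow. To summarize, there are four steps:

1. The leader checks whether it has committed an entry in the current term.
2. The leader records the current commit index, then broadcasts a heartbeat to all peers.
3. Responses from a majority mean the node was still the leader when the read arrived; it then waits until the apply index is greater than or equal to the recorded commit index.
4. The result is returned.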

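etcd implements read-only queries not only on the leader but also on followers. The principle is the same, except that when a read request arrives at a follower, the follower has to ask the leader for the commit index. Before returning the commit index to the follower, the leader still goes through the ReadIndex flow above, because it likewise needs to check whether it is still the leader. The code is not repeated here.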
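Since the follower path is only described in words, here is a rough model of it. All of it is assumed for illustration: `leaderNode`, `followerNode`, the `quorumOK` flag standing in for the heartbeat round, and the in-memory `kv` map are hypothetical. A real follower would wait for its applied index to catch up, whereas this sketch returns an error so that the caller can retry.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errUnconfirmed = errors.New("leader could not confirm leadership")
	errBehind      = errors.New("follower has not applied up to the read index yet")
)

// leaderNode is hypothetical; quorumOK stands in for the heartbeat quorum round.
type leaderNode struct {
	commitIndex uint64
	quorumOK    bool
}

// readIndex returns the commit index only after leadership is confirmed.
func (l *leaderNode) readIndex() (uint64, error) {
	ri := l.commitIndex
	if !l.quorumOK {
		return 0, errUnconfirmed
	}
	return ri, nil
}

// followerNode serves reads from its own state machine copy.
type followerNode struct {
	leader  *leaderNode
	applied uint64
	kv      map[string]string
}

// read asks the leader for a read index and serves locally only once the
// follower has applied at least that far.
func (f *followerNode) read(key string) (string, error) {
	ri, err := f.leader.readIndex()
	if err != nil {
		return "", err
	}
	if f.applied < ri {
		return "", errBehind
	}
	return f.kv[key], nil
}

func main() {
	l := &leaderNode{commitIndex: 5, quorumOK: true}
	f := &followerNode{leader: l, applied: 4, kv: map[string]string{"k": "old"}}

	_, err := f.read("k")
	fmt.Println(err) // the follower is one entry behind

	f.applied, f.kv["k"] = 5, "new" // the follower applies entry 5
	v, err := f.read("k")
	fmt.Println(v, err)
}
```

The data is read from the follower's own copy; only the index comes from the leader, so the leader does the leadership check while the follower carries the read load.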
