LevelDB Source Code Analysis

Reposted article. 2015-12-05 11:08

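LevelDB's common components are not complicated, but to make the implementation of its core modules easier to follow, a few key components are noted here first.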

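Arena (memory arena)

The Arena class handles memory management. Its value lies in the following:

It improves performance by reducing the number of heap calls: Arena allocates memory in blocks and hands pieces of them out to the application layer.
No per-allocation deallocation is needed: when the Arena object is destroyed, all memory it created is released at once.
It makes memory accounting easy, for example the total amount of memory the Arena has allocated.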

    class Arena {
    public:
        Arena();
        ~Arena();

        // Return a pointer to a newly allocated memory block of "bytes" bytes.
        char* Allocate(size_t bytes);

        // Allocate memory with the normal alignment guarantees provided by malloc
        char* AllocateAligned(size_t bytes);

        // Returns an estimate of the total memory usage of data allocated
        // by the arena (including space allocated but not yet used for user
        // allocations).
        size_t MemoryUsage() const {
            return blocks_memory_ + blocks_.capacity() * sizeof(char*);
        }

    private:
        char* AllocateFallback(size_t bytes);
        char* AllocateNewBlock(size_t block_bytes);

        // Allocation state
        char* alloc_ptr_;                // current position within the current block
        size_t alloc_bytes_remaining_;   // bytes still available in the current block

        // Array of new[] allocated memory blocks
        std::vector<char*> blocks_;      // every memory block created so far

        // Bytes of memory in blocks allocated so far
        size_t blocks_memory_;           // total memory allocated so far

        // No copying allowed
        Arena(const Arena&);
        void operator=(const Arena&);
    };
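
To make the idea behind the declaration above concrete, here is a minimal bump-allocator sketch. It is not LevelDB's implementation: the class name MiniArena, the default block size of 4096, and the policy of simply abandoning the tail of the current block are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical MiniArena: a simplified bump allocator, not LevelDB's Arena.
class MiniArena {
public:
    explicit MiniArena(std::size_t block_size = 4096)
        : block_size_(block_size), ptr_(nullptr), remaining_(0), total_(0) {}

    // Returns "bytes" bytes carved out of the current block. Starts a new
    // block when the current one cannot satisfy the request.
    char* Allocate(std::size_t bytes) {
        assert(bytes > 0);
        if (bytes > remaining_) {
            const std::size_t size = bytes > block_size_ ? bytes : block_size_;
            blocks_.emplace_back(new char[size]);
            ptr_ = blocks_.back().get();
            remaining_ = size;
            total_ += size;
        }
        char* result = ptr_;
        ptr_ += bytes;
        remaining_ -= bytes;
        return result;
    }

    std::size_t BlockCount() const { return blocks_.size(); }
    std::size_t TotalBytes() const { return total_; }

private:
    std::size_t block_size_;
    char* ptr_;               // current position within the current block
    std::size_t remaining_;   // bytes left in the current block
    std::size_t total_;       // bytes obtained from the heap so far
    std::vector<std::unique_ptr<char[]>> blocks_;  // all freed together on destruction
};
```

Callers never free individual allocations; destroying the MiniArena releases every block, which is the property the list above describes.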

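Slice (data slice)

As its name suggests, a Slice represents a chunk of data: data_ is the address of the data and size_ is its length.

Slice is typically used together with Arena. It only holds a reference to the data and does not own it; the data stays valid for the whole lifetime of the Arena object.

In LevelDB, Slice is generally used to pass keys, values, or chunks of encoded/decoded data.

Compared with string, the clear benefits of Slice are that it avoids unnecessary copies and that it can refer to arbitrary content (any byte range, not only NUL-terminated text) without owning it.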

```
class Slice {
public:
    ......
private:
    const char* data_;
    size_t size_;
};
```
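
The non-owning nature of such a type can be illustrated with a small sketch. MiniSlice below is a hypothetical stand-in written for this note, not LevelDB's Slice; it only shows that a (pointer, length) pair can be narrowed and passed around without copying the underlying bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical MiniSlice: a non-owning (pointer, length) view over bytes
// that are owned by someone else.
class MiniSlice {
public:
    MiniSlice(const char* d, std::size_t n) : data_(d), size_(n) {}
    MiniSlice(const std::string& s) : data_(s.data()), size_(s.size()) {}

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }

    // Drops the first n bytes of the view; the underlying bytes are untouched.
    void remove_prefix(std::size_t n) {
        assert(n <= size_);
        data_ += n;
        size_ -= n;
    }

    // Copies the viewed bytes into an owning string (the only copy made here).
    std::string ToString() const { return std::string(data_, size_); }

private:
    const char* data_;
    std::size_t size_;
};
```

The view is only valid while the storage it points into is alive, which is why the text ties the lifetime of the data to the Arena.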
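LevelDB Source Code, Part 1: SkipList

A SkipList (skip list) supports insertion, deletion, and lookup in expected O(log n) time. It is an alternative to balanced trees; unlike a balanced tree, a skip list does not guarantee strict "balance" and instead takes a more casual approach: randomized balancing.

For a complete introduction to skip lists, see the article "跳表(SkipList)"; a few figures are borrowed from it here for a brief explanation: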

Figure 1.1 Skip list

Figure 1.2 Search and insert

Figure 1.3 Search and delete

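In Figure 1.1 the red part is the initial state: the next pointer of head at every level is NULL.
A skip list is layered, with levels numbered 1, 2, 3... from bottom to top, so a level-assignment algorithm is needed.
The data at every level is stored in sorted order, so a Comparator is needed.
A search starts at the top level and proceeds in order until the data is found or the search fails.
An insertion only affects the nodes immediately before and after the insertion position; other nodes are untouched.
A deletion only affects the nodes immediately before and after the deletion position; other nodes are untouched.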

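Level-assignment algorithm

The level-assignment algorithm decides the height at which new data is inserted, and the balance of the SkipList depends entirely on it. In the extreme case where the SkipList only has level 0, it degenerates into a sorted linked list, and search, insert, and delete all take O(n) rather than O(log n).

LevelDB implements the algorithm as follows (SkipList<Key, Comparator>::RandomHeight()):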

    // enum { kMaxHeight = 12 };
    template<typename Key, class Comparator>
    int SkipList<Key, Comparator>::RandomHeight()
    {
        // Increase height with probability 1 in kBranching
        static const unsigned int kBranching = 4;
        int height = 1;
        while (height < kMaxHeight && ((rnd_.Next() % kBranching) == 0)) {
            height++;
        }
        assert(height > 0);
        assert(height <= kMaxHeight);
        return height;
    }

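Code 1.1 RandomHeight

kMaxHeight is the maximum height of the SkipList, i.e. the maximum number of levels allowed. It is a key parameter that directly affects performance. When kMaxHeight is lowered, performance drops noticeably; when it is raised, even to 10000, there is no noticeable difference from the default kMaxHeight = 12, and the same holds for memory usage.

Why? The key is the loop condition: height < kMaxHeight && ((rnd_.Next() % kBranching) == 0). Besides the kMaxHeight check, the test (rnd_.Next() % kBranching) == 0 makes each level hold roughly 1/4 as many nodes as the level below it. So with kMaxHeight = 12 and a single node at the top level, the list can evenly accommodate about 4^11 = 4194304 keys (roughly 4 million). Raising kMaxHeight alone therefore does not make the SkipList taller: a node reaches height h or more with probability (1/4)^(h-1), so the extra levels are almost never used. kMaxHeight = 12 is an empirical value that suits data sets in the millions especially well.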
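
The 1/4 ratio between adjacent levels can be checked empirically. The sketch below reuses the loop shape of RandomHeight() but substitutes std::mt19937 for rnd_; the function names SampleHeight and HeightHistogram are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Same loop shape as RandomHeight(), with std::mt19937 standing in for rnd_.
int SampleHeight(std::mt19937& rng, int max_height, unsigned branching = 4) {
    int height = 1;
    while (height < max_height && (rng() % branching) == 0) {
        height++;
    }
    return height;
}

// Counts how many of n sampled nodes end up with each height.
// The returned vector is indexed by height; index 0 is unused.
std::vector<int> HeightHistogram(int n, int max_height, std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::vector<int> hist(max_height + 1, 0);
    for (int i = 0; i < n; i++) {
        hist[SampleHeight(rng, max_height)]++;
    }
    return hist;
}
```

With branching 4, about three quarters of the samples should stay at height 1, and each further height should receive roughly a quarter of the count of the one below, regardless of how large max_height is.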

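Comparator

Like a binary search tree, a SkipList is ordered, and key comparison is done by a comparator (Comparator).

SkipList has only one requirement for the Comparator: it must overload operator(), in the following form:

// returns a value < 0 if a < b, > 0 if a > b, and 0 if a == b

int operator()(const Key& a, const Key& b) const;

Key and Comparator are both template parameters, so the Comparator is implemented by whoever uses the SkipList. LevelDB has an abstract Comparator class, but that class does not overload operator(). For how the comparator is used, and how the abstract Comparator class relates to the Comparator here, see the MemTable section.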
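
As an illustration of the contract above, here is a hypothetical comparator functor together with a small template that consumes it. StringComparator and LowerBound are names invented for this sketch; they are not part of LevelDB.

```cpp
#include <cassert>
#include <string>

// Hypothetical comparator satisfying the three-way operator() contract.
struct StringComparator {
    int operator()(const std::string& a, const std::string& b) const {
        return a.compare(b);  // negative, zero, or positive
    }
};

// Hypothetical consumer: index of the first element >= key in a sorted array.
// Like SkipList, it takes both the key type and the comparator as template
// parameters and only ever calls cmp(a, b).
template <typename Key, class Comparator>
int LowerBound(const Key* sorted, int n, const Key& key, const Comparator& cmp) {
    int lo = 0, hi = n;
    while (lo < hi) {
        const int mid = lo + (hi - lo) / 2;
        if (cmp(sorted[mid], key) < 0) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return lo;
}
```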

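Search, insert, delete

The SkipList implemented in LevelDB has no delete operation, which follows from how LevelDB uses it, so deletion is not covered here.

Search and insert are the read and write paths. As Figure 1.2 shows, an insert first performs a search and then inserts at the position found.

Search and insert in LevelDB are almost "lock-free" under concurrency, which is well worth learning from and is the focus of these notes. First, the search: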

    template<typename Key, class Comparator>
    typename SkipList<Key, Comparator>::Node*
        SkipList<Key, Comparator>::FindGreaterOrEqual(const Key& key, Node** prev) const
    {
        Node* x = head_;
        int level = GetMaxHeight() - 1;
        while (true) {
            Node* next = x->Next(level);
            if (KeyIsAfterNode(key, next)) {
                // Keep searching in this list
                x = next;
            }
            else {
                if (prev != NULL) prev[level] = x;
                if (level == 0) {
                    return next;
                }
                else {
                    // Switch to next list
                    level--;
                }
            }
        }
    }

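Code 1.2 FindGreaterOrEqual

There is nothing special in the implementation: the search starts at the top level and works its way down to level 0. It finds the first node whose key is greater than or equal to the given key and returns NULL if there is none. Now look at the SkipList Node structure: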

    template<typename Key, class Comparator>
    struct SkipList<Key, Comparator>::Node {
        explicit Node(const Key& k) : key(k) { }

        Key const key;

        // Accessors/mutators for links.  Wrapped in methods so we can
        // add the appropriate barriers as necessary.
        Node* Next(int n) {
            assert(n >= 0);
            // Use an 'acquire load' so that we observe a fully initialized
            // version of the returned Node.
            return reinterpret_cast<Node*>(next_[n].Acquire_Load());
        }
        void SetNext(int n, Node* x) {
            assert(n >= 0);
            // Use a 'release store' so that anybody who reads through this
            // pointer observes a fully initialized version of the inserted node.
            next_[n].Release_Store(x);
        }

        // No-barrier variants that can be safely used in a few locations.
        Node* NoBarrier_Next(int n) {
            assert(n >= 0);
            return reinterpret_cast<Node*>(next_[n].NoBarrier_Load());
        }
        void NoBarrier_SetNext(int n, Node* x) {
            assert(n >= 0);
            next_[n].NoBarrier_Store(x);
        }

    private:
        // Array of length equal to the node height.  next_[0] is lowest level link.
        port::AtomicPointer next_[1];    // see NewNode: the actual length is the node height
    };

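Code 1.3 Node

Node has two member variables, key and the next_ array. key is of course the node's data. The next_ array (note that its type is AtomicPointer) points to the next node at the node's own level and at every level below it (see Figure 1.1). The actual length of next_ equals the node's height; see Node's factory method NewNode: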

    template<typename Key, class Comparator>
    typename SkipList<Key, Comparator>::Node*
        SkipList<Key, Comparator>::NewNode(const Key& key, int height)
    {
        char* mem = arena_->AllocateAligned( sizeof(Node) +
                 sizeof(port::AtomicPointer) * (height - 1));
        return new (mem) Node(key);    // placement new: constructs the Node in the memory just allocated
    }

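Code 1.4 NewNode

Next, Node's two pairs of methods: SetNext/Next and NoBarrier_SetNext/NoBarrier_Next. Both pairs read and write the next-node pointer at a given level. The former pair uses release/acquire barriers and is safe when readers run concurrently; the latter uses no barriers and is not. Now the insert implementation: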

    template<typename Key, class Comparator>
    void SkipList<Key, Comparator>::Insert(const Key& key) 
    {
        // TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()
        // here since Insert() is externally synchronized.
        Node* prev[kMaxHeight];
        Node* x = FindGreaterOrEqual(key, prev);

        // Our data structure does not allow duplicate insertion
        assert(x == NULL || !Equal(key, x->key));

        int height = RandomHeight();
        if (height > GetMaxHeight()) 
        {
            for (int i = GetMaxHeight(); i < height; i++) {
                prev[i] = head_;
            }
            //fprintf(stderr, "Change height from %d to %d\n", max_height_, height);

            // It is ok to mutate max_height_ without any synchronization
            // with concurrent readers.  A concurrent reader that observes
            // the new value of max_height_ will see either the old value of
            // new level pointers from head_ (NULL), or a new value set in
            // the loop below.  In the former case the reader will
            // immediately drop to the next level since NULL sorts after all
            // keys.  In the latter case the reader will use the new node.
            max_height_.NoBarrier_Store(reinterpret_cast<void*>(height));
        }

        x = NewNode(key, height);
        for (int i = 0; i < height; i++) {
            // NoBarrier_SetNext() suffices since we will add a barrier when
            // we publish a pointer to "x" in prev[i].
            x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
            prev[i]->SetNext(i, x);
        }
    }

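Code 1.5 Insert

An insert modifies two kinds of data: max_height_, and the next pointer of the preceding node at every level.

max_height_ has no concurrency protection. The author's comment explains this clearly: a reader that observes the new max_height_ may see either the old NULL value of the new level pointers from head_, or level pointers that have already been updated. In the former case the reader immediately drops to the next level and keeps searching; in the latter case it uses the newly inserted node.

The node insertion that follows is what makes lock-free concurrency a reality:

First the next pointers of the new node are set. The node is not yet visible to readers, so there is no concurrency issue here.
Then the next pointer of the node preceding the insertion position is modified, using SetNext to handle concurrency.
Inserting from the bottom level upward guarantees that once the node is linked at some level, the levels below it are already updated and usable.
Of course, the SkipList is not thread-safe for concurrent writers; LevelDB's MemTable uses a different technique to deal with concurrent writes.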
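
The publish step described above can be sketched on a single-level list using standard C++ atomics. This is an illustration under assumptions: std::atomic with explicit memory orders stands in for port::AtomicPointer, and ListNode, InsertAfter, and the free function FindGreaterOrEqual are names chosen for this sketch. As in the text, only one writer at a time is assumed.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical single-level list node.
struct ListNode {
    explicit ListNode(int k) : key(k), next(nullptr) {}
    const int key;
    std::atomic<ListNode*> next;
};

// Links x after prev. One writer at a time; readers may run concurrently.
void InsertAfter(ListNode* prev, ListNode* x) {
    // x is not reachable by readers yet, so a relaxed store is enough
    // (the role played by NoBarrier_SetNext).
    x->next.store(prev->next.load(std::memory_order_relaxed),
                  std::memory_order_relaxed);
    // Publish x. The release store makes x's fields visible to any reader
    // that later loads this pointer with acquire (the role of SetNext/Next).
    prev->next.store(x, std::memory_order_release);
}

// Reader side: returns the first node with key >= target, or nullptr.
ListNode* FindGreaterOrEqual(ListNode* head, int target) {
    ListNode* x = head->next.load(std::memory_order_acquire);
    while (x != nullptr && x->key < target) {
        x = x->next.load(std::memory_order_acquire);
    }
    return x;
}
```

A reader either sees the old pointer and skips the new node, or sees the new pointer and with it a fully initialized node; it never sees a half-built one.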

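LevelDB Source Code, Part 2: MemTable

MemTable is the in-memory table. In LevelDB the most recently inserted data is stored in the MemTable, whose size is configurable (4 MB by default). When the data in the MemTable exceeds the limit, a new MemTable is created and the old one is compacted into an SSTable on disk.

MemTable* mem_; // the new (active) memtable

MemTable* imm_; // the memtable waiting to be compacted

Internally, MemTable uses the SkipList described above for storage. The main purposes of wrapping it are:

To present itself as a business-level component, i.e. a domain abstraction.
LevelDB is a key-value store while SkipList stores single values, so user data has to be encoded into, and decoded from, SkipList entries.
LevelDB supports insert and delete, and in the MemTable a delete is turned into an add of type Deletion.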

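Business-level interface

As an in-memory table, MemTable stores data as key-value pairs, returns the value for a given key, and also needs to support iteration over the table.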

    class MemTable {
    public:
        ......

        // Returns an estimate of the number of bytes of data in use by this
        // data structure.
        //
        // REQUIRES: external synchronization to prevent simultaneous
        // operations on the same MemTable.
        size_t ApproximateMemoryUsage();    // current size of the memtable

        // Return an iterator that yields the contents of the memtable.
        //
        // The caller must ensure that the underlying MemTable remains live
        // while the returned iterator is live.  The keys returned by this
        // iterator are internal keys encoded by AppendInternalKey in the
        // db/format.{h,cc} module.
        Iterator* NewIterator();        // memtable iterator

        // Add an entry into memtable that maps key to value at the
        // specified sequence number and with the specified type.
        // Typically value will be empty if type==kTypeDeletion.
        void Add(SequenceNumber seq, ValueType type, const Slice& key, const Slice& value);

        // If memtable contains a value for key, store it in *value and return true.
        // If memtable contains a deletion for key, store a NotFound() error
        // in *status and return true.
        // Else, return false.
        // Returns the correct data for the given key
        bool Get(const LookupKey& key, std::string* value, Status* s);

    private:
        ~MemTable();  // Private since only Unref() should be used to delete it

        ......
    };

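This is the so-called business-level form: MemTable appears as something new in which the SkipList is invisible, and it represents one functional module of LevelDB.

Key-value encoding

LevelDB is a key-value store and MemTable exposes a KV interface, whereas SkipList stores single values, so inserting and reading data each require an encoding or decoding step.

How is the data encoded? See the Add method: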

    void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key, const Slice& value)
    {
        // Format of an entry is concatenation of:
        //  key_size     : varint32 of internal_key.size()
        //  key bytes    : char[internal_key.size()]
        //  value_size   : varint32 of value.size()
        //  value bytes  : char[value.size()]
        size_t key_size = key.size();
        size_t val_size = value.size();
        size_t internal_key_size = key_size + 8;
        // Total length
        const size_t encoded_len =
            VarintLength(internal_key_size) + internal_key_size +
            VarintLength(val_size) + val_size;
        char* buf = arena_.Allocate(encoded_len);
        // Internal Key Size
        char* p = EncodeVarint32(buf, internal_key_size);
        // User Key
        memcpy(p, key.data(), key_size);
        p += key_size;
        // Seq Number + Value Type
        EncodeFixed64(p, (s << 8) | type);
        p += 8;
        // User Value Size
        p = EncodeVarint32(p, val_size);
        // User Value
        memcpy(p, value.data(), val_size);

        assert((p + val_size) - buf == encoded_len);

        table_.Insert(buf);
    }

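The key and value passed in are the pair to be recorded; this article calls them User Key and User Value.

What is finally inserted into the SkipList is buf. buf relates to User Key and User Value as follows: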

Part 1	Part 2	Part 3	Part 4	Part 5
User Key Size + 8 (varint32)	User Key	(Seq Number << 8) | Value Type (fixed64)	User Value Size (varint32)	User Value

Table 1 User Key/User Value -> SkipList Data Item
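
The five parts of Table 1 can be reproduced with a short standalone encoder. The sketch below writes into a std::string instead of arena memory, and AppendVarint32, AppendFixed64, and EncodeEntry are helper names made up for this illustration; it assumes the usual varint layout (7 payload bits per byte, least significant group first) and a little-endian fixed64.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends v as a varint32: 7 payload bits per byte, least significant group
// first, with the high bit set on every byte except the last.
void AppendVarint32(std::string* dst, std::uint32_t v) {
    while (v >= 0x80) {
        dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
        v >>= 7;
    }
    dst->push_back(static_cast<char>(v));
}

// Appends v as 8 little-endian bytes.
void AppendFixed64(std::string* dst, std::uint64_t v) {
    for (int i = 0; i < 8; i++) {
        dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
    }
}

// Hypothetical helper that lays out Part 1 to Part 5 of Table 1.
std::string EncodeEntry(std::uint64_t seq, std::uint8_t type,
                        const std::string& user_key,
                        const std::string& user_value) {
    std::string buf;
    AppendVarint32(&buf, static_cast<std::uint32_t>(user_key.size() + 8));  // Part 1
    buf.append(user_key);                                                   // Part 2
    AppendFixed64(&buf, (seq << 8) | type);                                 // Part 3
    AppendVarint32(&buf, static_cast<std::uint32_t>(user_value.size()));    // Part 4
    buf.append(user_value);                                                 // Part 5
    return buf;
}
```

For a 2-byte key and a 3-byte value, the entry is 1 + 2 + 8 + 1 + 3 = 15 bytes long, since both lengths fit in a single varint byte.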

How is it decoded? See Get:

    bool MemTable::Get(const LookupKey& key, std::string* value, Status* s)
    {
        Slice memkey = key.memtable_key();

        Table::Iterator iter(&table_);
        iter.Seek(memkey.data());

        if (iter.Valid()) {
            // entry format is:
            //    klength  varint32
            //    userkey  char[klength - 8]
            //    tag      uint64
            //    vlength  varint32
            //    value    char[vlength]
            // Check that it belongs to same user key.  We do not check the
            // sequence number since the Seek() call above should have skipped
            // all entries with overly large sequence numbers.
            const char* entry = iter.key();
            uint32_t key_length;
            const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);
            if (comparator_.comparator.user_comparator()->Compare(
                Slice(key_ptr, key_length - 8), key.user_key()) == 0)
            {
                // Correct user key
                const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
                switch (static_cast<ValueType>(tag & 0xff)) {
                case kTypeValue: {
                    Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
                    value->assign(v.data(), v.size());
                    return true;
                }
                case kTypeDeletion:
                    *s = Status::NotFound(Slice());
                    return true;
                }
            }
        }
        return false;
    }

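Given the memtable_key, the Seek interface of Table::Iterator locates the entry, and the User Value is then extracted by reversing the encoding and returned. There is a new concept here, memtable_key: the key as stored in the memtable, which consists of Part 1 through Part 3 of Table 1.

To make this more concrete, follow the typedef of Table: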

typedef SkipList<const char*, KeyComparator> Table;

---->

struct KeyComparator
{
    const InternalKeyComparator comparator;
    explicit KeyComparator(const InternalKeyComparator& c) : comparator(c) { }
    int operator()(const char* a, const char* b) const;
};

SkipList compares keys through operator():

int MemTable::KeyComparator::operator()(const char* aptr, const char* bptr) const {
    // Internal keys are encoded as length-prefixed strings.
    Slice a = GetLengthPrefixedSlice(aptr);
    Slice b = GetLengthPrefixedSlice(bptr);
    return comparator.Compare(a, b);    // InternalKeyComparator comparator
}

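The a and b extracted here are the internal keys, i.e. Part 2 and Part 3 of Table 1; GetLengthPrefixedSlice strips the Part 1 length prefix from the entries the SkipList stores. The real comparison is done by InternalKeyComparator: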

int InternalKeyComparator::Compare(const Slice& akey, const Slice& bkey) const
{
    // Order by:
    //    increasing user key (according to user-supplied comparator)
    //    decreasing sequence number
    //    decreasing type (though sequence# should be enough to disambiguate)
    int r = user_comparator_->Compare(ExtractUserKey(akey), ExtractUserKey(bkey));
    if (r == 0) {
        const uint64_t anum = DecodeFixed64(akey.data() + akey.size() - 8);
        const uint64_t bnum = DecodeFixed64(bkey.data() + bkey.size() - 8);
        if (anum > bnum) {
            r = -1;
        }
        else if (anum < bnum) {
            r = +1;
        }
    }
    return r;
}

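The core comparison has two parts: comparing the User Key, and comparing the Seq Number and Value Type.

The User Key comparison is done by the user comparator. If the user does not specify one, the system uses the default bytewise comparator (BytewiseComparatorImpl).

Seq Number is the version number; every update increments it. When a user asks for the data at a given version, what they want is the data at that version or earlier, so the comparison here is in descending order.

Value Type is either kTypeDeletion or kTypeValue. Because every operation has a unique sequence number, comparing the type is not actually necessary. The type is compared together with the sequence number for performance reasons: it saves splitting the sequence number and the type apart.

Figure 2.1 Comparator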
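
The ordering rule can be tried out on already-decoded keys. In the sketch below, the InternalKey struct, PackTag, and CompareInternal are hypothetical names; the user keys are compared bytewise via std::string::compare, and the 64-bit tag is compared as a whole, so the type never has to be separated from the sequence number.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical decoded form of an internal key: user key plus the 64-bit tag.
struct InternalKey {
    std::string user_key;
    std::uint64_t tag;  // (sequence << 8) | type
};

// Packs a sequence number and a value type into one tag.
std::uint64_t PackTag(std::uint64_t seq, std::uint8_t type) {
    return (seq << 8) | type;
}

// Ascending by user key, then descending by tag, so newer entries sort first.
int CompareInternal(const InternalKey& a, const InternalKey& b) {
    int r = a.user_key.compare(b.user_key);
    if (r == 0) {
        if (a.tag > b.tag) {
            r = -1;
        } else if (a.tag < b.tag) {
            r = +1;
        }
    }
    return r;
}
```

With this order, a deletion written at sequence 9 sorts before a value written at sequence 7 for the same user key, so a forward seek meets the deletion first.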

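Notes:

The Add/Get pair takes its key in inconsistent forms, which is poor interface design: Add takes a Slice key while Get takes a LookupKey key. These should be unified.
In Add, some fields use the variable-length EncodeVarint32 while others use the fixed-length EncodeFixed64. I have not yet worked out the author's rule for choosing; it may be driven by extreme performance tuning, or it may be partly arbitrary. One plausible reason, judging from the code above, is that a fixed 8-byte tag can be decoded at a known offset from the end of the key (akey.data() + akey.size() - 8), whereas lengths are usually small and benefit from varint encoding.

Deleting records

A client's delete is turned into an add whose ValueType is Deletion; the real deletion is carried out by compaction: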

    void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key, const Slice& value)

--->

    // Value types encoded as the last component of internal keys.
    // DO NOT CHANGE THESE ENUM VALUES: they are embedded in the on-disk
    // data structures.
    enum ValueType {
        kTypeDeletion = 0x0,    // Deletion must be smaller than Value; lookups rely on this ordering
        kTypeValue = 0x1
    };

When Get finds that the matching entry is a deletion record, the lookup fails:

    bool MemTable::Get(const LookupKey& key, std::string* value, Status* s)
    {
        Slice memkey = key.memtable_key();

        Table::Iterator iter(&table_);
        iter.Seek(memkey.data());

        if (iter.Valid()) {
            const char* entry = iter.key();
            uint32_t key_length;
            const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);
            if (comparator_.comparator.user_comparator()->Compare(
                Slice(key_ptr, key_length - 8), key.user_key()) == 0)
            {
                // Correct user key
                const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
                switch (static_cast<ValueType>(tag & 0xff)) {
                case kTypeValue: {
                    Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
                    value->assign(v.data(), v.size());
                    return true;
                }
                case kTypeDeletion:
                    *s = Status::NotFound(Slice());
                    return true;
                }
            }
        }
        return false;
    }

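LevelDB Source Code, Part 3: SSTable

The MemTable described in the previous section lives in memory. When it grows past a threshold (memtable.size > Options::write_buffer_size), compaction persists the current MemTable; the persisted file (an sst file) is called an SSTable. SSTables in LevelDB are organized into levels, which is why LevelDB is called "Level DB". The current version has 7 levels (0-6); level 0 holds the newest data and level 6 the oldest. In addition, compaction merges multiple SSTables into a smaller number of SSTables, discarding obsolete data, keeping reads efficient, and reducing disk usage.

SSTable physical layout

On the storage device an SSTable is divided into multiple Blocks. A Block may hold user data, index data, or any other data. Besides the Block itself, each Block is followed by some extra information, laid out as follows: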

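Block (data block)	Compression Type (compressed or not)	CRC (checksum)
Block (data block)	Compression Type (compressed or not)	CRC (checksum)

Table 3.1 Units inside an SSTable

Compression Type indicates whether the data in the Block is compressed and which compression algorithm was used. CRC is the Block's checksum, used to verify the integrity of the data.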

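Block is the key to the SSTable's physical layout. Here is the Block structure:

Figure 3.1 Physical layout of a Block

A Block consists of the following two parts:

Data records: each Record represents one user record (a key-value pair). Strictly speaking it is not the complete user record, because Block optimizes how keys are stored.

Restart point information: this is index information, used to locate Records quickly. For example, Restart[0] always points to relative offset 0 in the block, and Restart[1] points to the relative offset of the restart point Record4. The author optimized key storage: the first Record at each restart point stores the complete key, while the other keys within the same restart interval store only what differs from the previous key.

Let's understand the structure above through how a Block is built: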

void BlockBuilder::Add(const Slice& key, const Slice& value) {
    Slice last_key_piece(last_key_);

    assert(!finished_);
    assert(counter_ <= options_->block_restart_interval);
    assert(buffer_.empty() || options_->comparator->Compare(key, last_key_piece) > 0);

    // 1. Build the restart point
    size_t shared = 0;
    if (counter_ < options_->block_restart_interval)    // configuration parameter, default 16
    {
        // Restart interval not reached yet: keep using the current restart point
        // See how much sharing to do with previous string
        const size_t min_length = std::min(last_key_piece.size(), key.size());
        while ((shared < min_length) && (last_key_piece[shared] == key[shared]))
        {
            shared++;
        }
    }
    else            // trigger and create a new restart point
    {
        // Here shared == 0; the restart point stores the complete key
        // Restart compression
        restarts_.push_back(buffer_.size());    // buffer_.size() is the current offset within the block
        counter_ = 0;
    }
    const size_t non_shared = key.size() - shared;

    // 2. Record the data
    // shared size | non-shared size | value size | non-shared key data | value data
    // Add "<shared><non_shared><value_size>" to buffer_
    PutVarint32(&buffer_, shared);
    PutVarint32(&buffer_, non_shared);
    PutVarint32(&buffer_, value.size());
    // Add string delta to buffer_ followed by value
    buffer_.append(key.data() + shared, non_shared);
    buffer_.append(value.data(), value.size());

    // Update state
    last_key_.resize(shared);
    last_key_.append(key.data() + shared, non_shared);
    assert(Slice(last_key_) == key);
    counter_++;
}

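Code 3.1 BlockBuilder::Add

buffer_ holds the current data block, and restarts_ holds the restart point information. When a record is added to the block, the restart point information is settled first: whether to create a new restart point, and how many bytes the current key shares with the last key. After that the Record is appended, with the following structure:

Record: shared size | non-shared size | value size | non-shared key data | value data

Table 3.2 Record structure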
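
The key part of the Record structure can be illustrated in isolation. The sketch below keeps the sizes as plain integers rather than varints and ignores values; DeltaRecord, MakeDelta, and ApplyDelta are hypothetical names used only here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Number of leading bytes that key shares with the previous key.
std::size_t SharedPrefixLength(const std::string& last_key, const std::string& key) {
    const std::size_t min_length = std::min(last_key.size(), key.size());
    std::size_t shared = 0;
    while (shared < min_length && last_key[shared] == key[shared]) {
        shared++;
    }
    return shared;
}

// Hypothetical in-memory record: only the key part of Table 3.2.
struct DeltaRecord {
    std::size_t shared;      // bytes reused from the previous key
    std::string non_shared;  // remaining bytes of this key
};

// At a restart point nothing is shared, so the full key is stored.
DeltaRecord MakeDelta(const std::string& last_key, const std::string& key,
                      bool at_restart) {
    const std::size_t shared = at_restart ? 0 : SharedPrefixLength(last_key, key);
    return DeltaRecord{shared, key.substr(shared)};
}

// Rebuilds the full key from the previous key and a delta record.
std::string ApplyDelta(const std::string& last_key, const DeltaRecord& rec) {
    return last_key.substr(0, rec.shared) + rec.non_shared;
}
```

Decoding needs the previous key, which is why a reader has to start from a restart point, where the stored key is complete.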

Now the Finish method, called when the Block is complete:

    Slice BlockBuilder::Finish() {
        // Append restart array
        for (size_t i = 0; i < restarts_.size(); i++) {
            PutFixed32(&buffer_, restarts_[i]);
        }
        PutFixed32(&buffer_, restarts_.size());
        finished_ = true;
        return Slice(buffer_);
    }

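Code 3.2 BlockBuilder::Finish

This matches Figure 3.1: the restart point information is recorded after all Records, consisting of each restart point (a relative offset within the block) followed by the number of restart points.

The restart point mechanism has two main benefits:

Indexing: it enables fast lookup. A read first binary-searches the restart points to find the restart interval the key belongs to and then scans within that interval, so the cost is O(log r) for r restart points plus a scan of at most block_restart_interval records.
Space compression: sorted keys make adjacent keys overlap heavily, so storing only the differences effectively reduces the space used on persistent storage.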
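
The two-phase lookup (binary search over restart points, then a linear scan) can be sketched on an already-decoded block. DecodedBlock and Seek are hypothetical names for this illustration; a real block would decode delta-encoded records during the scan instead of indexing a vector of full keys.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical decoded block: all keys in sorted order, plus the indices of
// the records that start a restart interval (those store complete keys).
struct DecodedBlock {
    std::vector<std::string> keys;
    std::vector<std::size_t> restarts;
};

// Returns the index of the first key >= target, or keys.size() if none.
std::size_t Seek(const DecodedBlock& b, const std::string& target) {
    if (b.restarts.empty()) return b.keys.size();
    // Phase 1: binary search for the last restart point whose key is < target
    // (or restart 0 if there is no such restart point).
    std::size_t left = 0;
    std::size_t right = b.restarts.size() - 1;
    while (left < right) {
        const std::size_t mid = (left + right + 1) / 2;
        if (b.keys[b.restarts[mid]] < target) {
            left = mid;
        } else {
            right = mid - 1;
        }
    }
    // Phase 2: linear scan forward from that restart point.
    std::size_t i = b.restarts[left];
    while (i < b.keys.size() && b.keys[i] < target) {
        i++;
    }
    return i;
}
```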

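At this point the physical layout of an SSTable is clear; from top to bottom it is: Table 3.1 -> Figure 3.1 -> Table 3.2.

SSTable logical layout

Having just looked at the Block structure, we now turn to the SSTable's logical layout, this time starting from the implementation: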

    void TableBuilder::Add(const Slice& key, const Slice& value) {
        Rep* r = rep_;
        assert(!r->closed);
        if (!ok()) return;
        if (r->num_entries > 0) {
            assert(r->options.comparator->Compare(key, Slice(r->last_key)) > 0);
        }

        // 1. Build the index
        if (r->pending_index_entry) {
            assert(r->data_block.empty());
            r->options.comparator->FindShortestSeparator(&r->last_key, key);
            std::string handle_encoding;
            r->pending_handle.EncodeTo(&handle_encoding);
            r->index_block.Add(r->last_key, Slice(handle_encoding));
            r->pending_index_entry = false;
        }

        // 2. Record the data
        r->last_key.assign(key.data(), key.size());
        r->num_entries++;
        r->data_block.Add(key, value);

        // 3. The data block has reached its size limit: write it to the file
        const size_t estimated_block_size = r->data_block.CurrentSizeEstimate();
        if (estimated_block_size >= r->options.block_size) {
            Flush();
        }
    }

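Code 3.3 TableBuilder::Add

This code is similar to Code 3.1: it builds the index first and then inserts the data, with extra logic for data blocks: when a data block reaches the configured size limit, it is written to the file. You may have noticed that Block uses restart points as its index, preserving performance while reducing disk usage. So why isn't a similar mechanism used here?

In fact, the storage of index keys is optimized here as well. The implementation is in FindShortestSeparator, whose goal is to obtain the shortest "key" that can serve as an index entry. For example, the shortest key between "helloworld" and "hellozoomer" can be "hellox". The other method, FindShortSuccessor, is even more aggressive: it finds a short key that is greater than or equal to the given key, so for "helloworld" the returned key may be just "i". The author abstracted two dedicated interfaces for this and placed them in Comparator, which shows a certain "fastidiousness" about code (*_*).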

    // A Comparator object provides a total order across slices that are
    // used as keys in an sstable or a database.  A Comparator implementation
    // must be thread-safe since leveldb may invoke its methods concurrently
    // from multiple threads.
    class Comparator {
    public:
        ......
        // Advanced functions: these are used to reduce the space requirements
        // for internal data structures like index blocks.

        // If *start < limit, changes *start to a short string in [start,limit).
        // Simple comparator implementations may return with *start unchanged,
        // i.e., an implementation of this method that does nothing is correct.
        virtual void FindShortestSeparator(std::string* start, const Slice& limit) const = 0;

        // Changes *key to a short string >= *key.
        // Simple comparator implementations may return with *key unchanged,
        // i.e., an implementation of this method that does nothing is correct.
        virtual void FindShortSuccessor(std::string* key) const = 0;
    };

Code 3.4 Index key optimization interface
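
One way a bytewise comparator could satisfy these two contracts is sketched below. The free functions ShortestSeparator and ShortSuccessor are illustrative names operating on std::string; they are written to match the "helloworld" examples in the text and are not claimed to be LevelDB's exact code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Shortens *start to a string in [*start, limit) when bumping the first
// differing byte is enough; otherwise leaves *start unchanged.
void ShortestSeparator(std::string* start, const std::string& limit) {
    const std::size_t min_length = std::min(start->size(), limit.size());
    std::size_t diff = 0;
    while (diff < min_length && (*start)[diff] == limit[diff]) {
        diff++;
    }
    if (diff >= min_length) {
        return;  // one key is a prefix of the other: do not shorten
    }
    const std::uint8_t byte = static_cast<std::uint8_t>((*start)[diff]);
    if (byte < 0xff && byte + 1 < static_cast<std::uint8_t>(limit[diff])) {
        (*start)[diff] = static_cast<char>(byte + 1);
        start->resize(diff + 1);
    }
}

// Changes *key to a short string >= *key by bumping the first byte that is
// not 0xff and dropping everything after it.
void ShortSuccessor(std::string* key) {
    for (std::size_t i = 0; i < key->size(); i++) {
        const std::uint8_t byte = static_cast<std::uint8_t>((*key)[i]);
        if (byte != 0xff) {
            (*key)[i] = static_cast<char>(byte + 1);
            key->resize(i + 1);
            return;
        }
    }
    // *key is empty or a run of 0xff bytes: leave it alone.
}
```

Both functions may legally leave their argument unchanged, as the interface comments allow; shortening is purely a space optimization for index entries.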

Now the Finish method, called when the Table is complete:

    Status TableBuilder::Finish() {
        // 1. Data Block
        Rep* r = rep_;
        Flush();

        assert(!r->closed);
        r->closed = true;

        // 2. Meta Block
        BlockHandle metaindex_block_handle;
        BlockHandle index_block_handle;
        if (ok())
        {
            BlockBuilder meta_index_block(&r->options);
            // TODO(postrelease): Add stats and other meta blocks
            WriteBlock(&meta_index_block, &metaindex_block_handle);
        }

        // 3. Index Block
        if (ok()) {
            if (r->pending_index_entry) {
                r->options.comparator->FindShortSuccessor(&r->last_key);
                std::string handle_encoding;
                r->pending_handle.EncodeTo(&handle_encoding);
                r->index_block.Add(r->last_key, Slice(handle_encoding));
                r->pending_index_entry = false;
            }
            WriteBlock(&r->index_block, &index_block_handle);
        }

        // 4. Footer
        if (ok())
        {
            Footer footer;
            footer.set_metaindex_handle(metaindex_block_handle);
            footer.set_index_handle(index_block_handle);
            std::string footer_encoding;
            footer.EncodeTo(&footer_encoding);
            r->status = r->file->Append(footer_encoding);
            if (r->status.ok()) {
                r->offset += footer_encoding.size();
            }
        }
        return r->status;
    }

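Code 3.5 TableBuilder::Finish

Through the Finish method we get a view of the whole SSTable:

Figure 3.2 SSTable logical layout

Data Block: data blocks, where user data is stored.

Meta Block: metadata block; not used yet, just a placeholder.

Index Block: index block, used to locate user data quickly.

Footer: see Figure 3.3. "metaindex_handle gives the start position and size of the metaindex block; index_handle gives the start address and size of the index block. These two fields can be thought of as an index of the index, set up so that the index can be read correctly; they are followed by a padding area and a magic number." (quoted from "數據分析與處理之二（Leveldb 實現原理）").

Figure 3.3 Footer

On the restart point mechanism: once an SSTable is created it is only ever queried. Key lookups and SSTable traversal always start from a restart point, so lookups stay fast even though only the Record at each restart point holds a complete key and all others hold only differences.
Table and Block cannot be modified once created. TableBuilder is responsible for creating Tables, and BlockBuilder for creating Blocks. The most important interface of Table and Block is Iterator* NewIterator(...) const, used to look up and traverse data. Iterators in LevelDB are somewhat complex and will be covered together later.
Table and Block each use a similar indexing mechanism, forming a multi-level index from Table to Block. Restart points and the Table index both preserve performance while reducing storage space.
Table 3.1 and Figure 3.2 keep saying that an SSTable stores Blocks, which is not entirely accurate. As Table 3.1 notes, an SSTable stores a "Compression Type"; if the data is compressed, what the SSTable stores is not the Block data itself but its compressed form, which must be decompressed before use.

Version, Current File, Manifest, and so on are not covered yet and will be added later.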
