B-Tree Algorithm Analysis and Implementation

This article is a repost. 2015-12-09 17:28, category: Algorithms

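In a database system, or in a file system, reading data stored on disk is very different from reading data in memory. Memory gives random access to any item it holds, whereas a disk read is mechanical and fetches a whole block at a time; we cannot ask for just the one value we want, such as a single int inside a file. How a data structure is organized on disk therefore matters a great deal. To make data access efficient, many database systems store their data in a B-Tree or one of its variants, such as the B+-Tree. Here we focus on analyzing and implementing the B-Tree algorithms.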

I. Definition and Significance of the B-Tree

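A B-Tree of order m (each node has at most m children) is defined as follows:

1. The root has at least two children, unless it is itself a leaf;

2. The number of keys j in every non-root node satisfies ⌈m/2⌉ - 1 <= j <= m - 1;

3. Every internal node (a non-leaf node other than the root) has exactly one more child than it has keys, so its number of subtrees k satisfies ⌈m/2⌉ <= k <= m;

4. All leaf nodes are on the same level.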

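From this definition we can see that a B-Tree is a self-balancing tree: that follows from rule 4. Rules 1 to 3 mainly imply that a node is split only once it is full (overflow), and that a split always produces two child nodes of nearly equal size.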
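
As a rough illustration of rules 2 and 3, the sketch below computes the key-count bounds for a given order m. The names KeyBounds, key_bounds and key_count_ok are made up for this example and are not part of the article's code, and the sketch assumes m is at least 3.

```cpp
#include <cassert>

// Hypothetical helper type: key-count bounds for a B-Tree of order m
// (m = maximum number of children per node).
struct KeyBounds {
    int min_keys;  // minimum number of keys in a non-root node
    int max_keys;  // maximum number of keys in any node
};

KeyBounds key_bounds(int m) {
    assert(m >= 3);                  // assumption: the order is at least 3
    int min_children = (m + 1) / 2;  // ceil(m/2) in integer arithmetic
    return KeyBounds{min_children - 1, m - 1};
}

// True if a node holding `keys` keys is legal for order m.
bool key_count_ok(int keys, int m, bool is_root) {
    KeyBounds b = key_bounds(m);
    int lower = is_root ? 1 : b.min_keys;  // a non-empty root may hold a single key
    return keys >= lower && keys <= b.max_keys;
}
```

For order 5 this gives between 2 and 4 keys per non-root node, while the root may hold as few as one key.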

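What is the advantage of storing database data in a B-Tree? A B-Tree's fan-out (the number of children a node may have) is not fixed. A binary tree always has a fan-out of 2, whereas a B-Tree's fan-out can be arbitrarily large, say 100. The larger the fan-out, the more keys fit into a single block or page, and the fewer levels have to be searched when looking data up in the file system. The arithmetic is simple: a binary tree with 31 levels holds at most about 2.1 billion keys (2^31 - 1), while a B-Tree with a fan-out of 100 holds up to about 10 billion keys in just 5 levels. Looking up or updating data on disk therefore needs far fewer disk reads, and overall performance improves substantially.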
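
The capacity arithmetic can be checked with a small helper. This is only a sketch: max_keys is a hypothetical function, and it assumes every node is completely full, so a tree with fan-out f and L levels has 1 + f + ... + f^(L-1) nodes of f - 1 keys each, which adds up to f^L - 1 keys.

```cpp
#include <cassert>
#include <cstdint>

// Maximum number of keys in a tree whose nodes all have `fanout` children
// and `fanout - 1` keys, with `levels` completely filled levels.
std::uint64_t max_keys(std::uint64_t fanout, int levels) {
    assert(fanout >= 2 && levels >= 0);
    std::uint64_t total = 1;
    for (int i = 0; i < levels; ++i) {
        total *= fanout;  // assumption: the result fits in 64 bits
    }
    return total - 1;     // f^levels - 1
}
```

Under these assumptions, max_keys(2, 31) is 2,147,483,647 and max_keys(100, 5) is 9,999,999,999.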

II. B-Tree Insert: Analysis and Implementation

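Now that we understand the definition and significance of the B-Tree, let us look at how the insert algorithm is implemented. The B-Tree insert algorithm is described as follows: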

1. Using the SEARCH procedure for M-way trees (described above), find the leaf node to which X should be added.
2. Add X to this node in the appropriate place among the values already there. Being a leaf node, there are no subtrees to worry about.
3. If there are M-1 or fewer values in the node after adding X, then we are finished.
If there are M values after adding X, we say the node has overflowed. To repair this, we split the node into three parts:

Left:
the first (M-1)/2 values
Middle:
the middle value (position 1+((M-1)/2))
Right:
the last (M-1)/2 values

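In short, there are three steps:

1. First, find the leaf node into which the key should be inserted.

2. Then insert the key into that leaf node.

3. If the leaf node has not overflowed, we are done. If it has overflowed, that is, it now holds more keys than it has room for, split the node: move its middle key up into the parent and divide the remaining keys into a left and a right child node. If the parent overflows after receiving the promoted key, treat the parent as the current node and split it as well, repeating until the current node no longer overflows.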
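
The split in step 3 can be sketched on a plain sorted vector of keys. SplitResult and split_keys are hypothetical names used only for this illustration; the sketch ignores child pointers and assumes the input already holds one key more than a node may keep.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Result of splitting the keys of an overflowed node (hypothetical type).
struct SplitResult {
    std::vector<int> left;   // keys that stay in the left node
    int middle;              // key promoted to the parent
    std::vector<int> right;  // keys that move to the right node
};

// `keys` must be sorted and hold M keys, one more than the capacity M - 1.
SplitResult split_keys(const std::vector<int>& keys) {
    assert(keys.size() >= 3);
    std::size_t mid = (keys.size() - 1) / 2;  // 0-based index of the promoted key
    SplitResult r;
    r.left.assign(keys.begin(), keys.begin() + mid);
    r.middle = keys[mid];
    r.right.assign(keys.begin() + mid + 1, keys.end());
    return r;
}
```

For the keys 10, 11, 12, 50 (an overflowed node when a node may hold 3 keys), this yields left = [10], middle = 11 and right = [12, 50].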

The implementation is as follows. btree.h:

#ifndef BTREE_BTREE_H
#define BTREE_BTREE_H

#include <algorithm>    // std::swap
#include <cstddef>      // NULL

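// B-Tree node
struct b_node {
    int num;                // number of keys currently in this node
    int dim;                // maximum number of keys the node may hold
    int* keys;
    b_node* parent;            // parent node
    b_node** childs;        // all child nodes

    b_node() : num(0), dim(0), keys(NULL), parent(NULL), childs(NULL) {
    }

    b_node (int _dim) : num(0), parent(NULL) {
        dim = _dim;
        keys = new int[dim + 1];                // one spare slot so a full node can take one more key before it is split
        childs = new b_node*[dim + 2];            // there is always one more child than there are keys
        for (int i=0; i<dim+1; ++i) {
            keys[i] = 0;
            childs[i] = NULL;
        }
        childs[dim+1] = NULL;
    }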

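    // Inserts key in sorted order and returns the index where it ended up.
    // Child pointers to the right of that index move one slot to the right,
    // so the caller can link two children at index and index + 1.
    int insert(int key) {
        int i = 0;
        keys[num] = key;
        for (i = num; i > 0; --i)
        {
            if (keys[i-1] > keys[i])
            {
                std::swap(keys[i-1], keys[i]);
                childs[i+1] = childs[i];    // keep children aligned with their keys
                continue;
            }
            break;
        }
        ++num;                            // one more key
        return i;
    }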

    bool is_full() {
        if (num < dim) {
            return false;
        }
        return true;
    }

    // Returns the index at which key would be inserted (the first key >= key)
    int get_ins_pos(int key) {
        int i = 0;
        for (i=0; i<num; ++i) {
            if (key <= keys[i]) {
                break;
            }
        }

        return i;
    }
};

// Location of a key in the tree
struct pos {
    b_node* node;            // node that holds the key
    int index;                // index of the key inside that node
    pos() : node(NULL), index(-1) {
    }
};

class btree {
public:
    btree (int _dim) : dim(_dim), root(NULL) {
    }

    pos query(int key);            // look up a key
    void insert(int key);        // insert a key
    void print();                // print the tree, labelling each node with its level

private:
    pos _query(b_node* root, int key);

    void _print(b_node* node, int level);

    void _insert(b_node* node, int key);
    void _split_node(b_node* node);
    void _link_node(b_node* parent, int pos, b_node* left_child, b_node* right_child);

private:
    int dim;                    // dimension: maximum number of keys per node
    b_node* root;                // root node
};

#endif

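All functions whose names begin with "_" are internal and not visible from outside. Insertion into a single node and the basic checks are placed in the b_node struct to make the code more readable.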

The code of btree.cpp is as follows:

#include "btree.h"
#include <iostream>
using namespace std;

void btree::insert(int key) {
    _insert(root, key);
}

void btree::_insert(b_node* node, int key) {

    // Empty tree: create the root and store the key in it
    if (root == NULL)
    {
        root = new b_node(dim);
        root->insert(key);
        return;
    }
    
    int index = node->num;
    while (index > 0 && node->keys[index-1] > key)                // find the child subtree the key belongs to
    {
        --index;
    }
    
    // No child at this position means the current node is a leaf, so insert here.
    // A B-Tree node has either all of its children or none, so checking one is enough.
    if (!node->childs[index])
    {
        // Insert first; the node has one spare slot even when it is full
        bool was_full = node->is_full();
        node->insert(key);

        // If the node was already full it has now overflowed: split it and promote
        // the middle key to the parent. A parent that overflows is split in turn.
        if (was_full)
        {
            _split_node(node);
        }
        return;
    }

    // Internal node: descend into the chosen child
    _insert(node->childs[index], key);
}

void btree::_split_node(b_node* node) {
    if (!node || node->num <= dim) {                    // only an overflowed node (dim + 1 keys) is split
        return;
    }

    int split_pos = (node->dim-2)/2 + 1;                // index of the key to promote
    int split_value = node->keys[split_pos];
    b_node* split_left_node = new b_node(dim);
    b_node* split_right_node = new b_node(dim);
    
    // Left node: keys and children before split_pos
    int i = 0;
    int j = 0;
    for (; i<split_pos; ++i, ++j) {
        split_left_node->keys[i] = node->keys[j];
        split_left_node->childs[i] = node->childs[j];
    }
    split_left_node->childs[i] = node->childs[j];
    split_left_node->num = split_pos;

    // Right node: keys and children after split_pos
    for (i = 0, j=split_pos+1; i < dim - split_pos; ++i, ++j) {
        split_right_node->keys[i] = node->keys[j];
        split_right_node->childs[i] = node->childs[j];
    }
    split_right_node->childs[i] = node->childs[j];
    split_right_node->num = dim - split_pos;

    // The moved children must point at their new parents
    for (i = 0; i <= split_left_node->num; ++i) {
        if (split_left_node->childs[i]) split_left_node->childs[i]->parent = split_left_node;
    }
    for (i = 0; i <= split_right_node->num; ++i) {
        if (split_right_node->childs[i]) split_right_node->childs[i]->parent = split_right_node;
    }

    // Promote the split key to the parent
    b_node* parent = node->parent;
    if (!parent) {            // no parent: node was the root
        b_node* new_parent = new b_node(dim);
        new_parent->insert(split_value);

        _link_node(new_parent, 0, split_left_node, split_right_node);

        // The new node becomes the root
        root = new_parent;    
        return;
    }

    // Add the promoted key to the parent and link both halves;
    // if the parent now overflows, split it as well
    int new_pos = parent->insert(split_value);
    _link_node(parent, new_pos, split_left_node, split_right_node);
    if (parent->num > dim) {
        _split_node(parent);
    }
}

void btree::_link_node(b_node* parent, int pos, b_node* left_child, b_node* right_child) {
    parent->childs[pos] = left_child;
    left_child->parent = parent;

    parent->childs[pos+1] = right_child;
    right_child->parent = parent;
}

void btree::print() {
    cout << "==================================" << endl;
    _print(root, 1);
    cout << "==================================" << endl;
}

void btree::_print(b_node* node, int level) {
    if (!node) {
        return;
    }

    cout << level << ":";
    for (int i=0; i<node->num; ++i)    {
        cout << node->keys[i] << ",";
    }
    cout << endl;

    for (int i=0; i<node->num+1; ++i) {
        _print(node->childs[i], level+1);
    }
    return;
}

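(1) The insert interface calls the internal _insert function.

(2) _insert first checks whether the B-Tree is empty. If it is, it creates the root node and simply inserts the key.

(3) If the tree is not empty, it checks whether the key can be inserted into the current node. If the current node is a leaf, it has no children, that is, its childs entries are empty. If it is not a leaf, the function recurses into the appropriate child until it reaches the leaf where the key belongs, and inserts there.

(4) On insertion it first checks whether the current node is already full. If not, it simply calls b_node's insert and is done. Otherwise it inserts the key first and then calls _split_node to split the node.

(5) _split_node first finds the key to promote to the parent, then turns all keys to its left into the left subtree and all keys to its right into the right subtree, copying the keys and child pointers. It then adds split_value to the parent, creating a parent first if there is none. If the parent overflows as well, _split_node is applied recursively until the current node no longer overflows.

In the code, dim means dimension. A dim of 3 means a fan-out of 4: a node can hold 3 keys and have at most 4 children. This term may be defined slightly differently elsewhere, so check the definition each source uses.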

Test code:

#include "btree.h"

int main() {
    btree btr(3);
    
    btr.insert(10);
    btr.insert(12);
    btr.insert(50);
    btr.insert(11);
    btr.print();

    btr.insert(20);
    btr.insert(22);
    btr.print();

    btr.insert(33);
    btr.insert(35);
    btr.print();

    btr.insert(40);
    btr.print();

    btr.insert(42);
    btr.print();

    btr.insert(13);
    btr.insert(1);
    btr.insert(23);
    btr.print();
    return 0;
}

III. B-Tree Deletion

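The B-Tree deletion algorithm is a little more complex than insertion. The usual approach is this: when the key to delete is not in a leaf node, replace it with the largest key in its left subtree (its in-order predecessor) by swapping the two values, and then delete the key from that leaf.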

In the original figure (not reproduced here), deleting 10 means swapping 7 and 10, so the node [6,7] becomes [6,10], and then deleting 10 from it.
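
A minimal sketch of this swap, using a simplified Node type that is not the article's b_node: predecessor_leaf and move_key_to_leaf are hypothetical names, and the sketch assumes the key sits in an internal node whose children are all present.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Minimal node type for illustration only (not the article's b_node).
struct Node {
    std::vector<int> keys;        // sorted keys
    std::vector<Node*> children;  // empty for a leaf, else keys.size() + 1 entries
    bool is_leaf() const { return children.empty(); }
};

// Returns the leaf holding the in-order predecessor of node->keys[index],
// i.e. the rightmost leaf of the subtree to the left of that key.
Node* predecessor_leaf(Node* node, std::size_t index) {
    assert(node != nullptr && !node->is_leaf() && index < node->keys.size());
    Node* cur = node->children[index];  // subtree to the left of the key
    while (!cur->is_leaf()) {
        cur = cur->children.back();     // keep following the rightmost child
    }
    return cur;
}

// Swaps an internal key with its predecessor so that the key to delete
// ends up in a leaf. Returns the leaf that now holds the key.
Node* move_key_to_leaf(Node* node, std::size_t index) {
    Node* leaf = predecessor_leaf(node, index);
    std::swap(node->keys[index], leaf->keys.back());
    return leaf;
}
```

With a parent key 10 and a left leaf [6,7], the call leaves 7 in the parent and turns the leaf into [6,10], after which 10 can be removed from the leaf.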

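This guarantees that the actual removal from the B-Tree always happens in a leaf node. Deletion then consists of two steps:

1. Remove the key from the current node. Because this is always a leaf, there are no left or right subtrees to consider.

2. Removing the key reduces the number of keys in the node. If the node now holds fewer than (M-1)/2 keys, we say it has underflowed. If no underflow occurred, the deletion is finished. If it did, the underflow has to be repaired (this follows from the B-Tree's self-balancing property; see the definition at the start).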

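The complicated part of B-Tree deletion is repairing the underflow. The fix is to "borrow" keys from a neighbor of the node the key was deleted from. A node can have two neighbors, and we choose the one with more keys. We then form a new node, the "combined node", from the keys of the underflowed node, its neighbor, and the parent key that separates them. What happens next depends on whether the combined node holds more than (M-1) keys or not, which gives two cases.

(1) If it holds more than (M-1) keys, the handling is fairly simple: split the combined node into three parts, Left, Middle and Right. Middle is the key in the exact middle of the combined node and replaces the original parent key; Left and Right become the new left and right subtrees. Because the combined node held more than (M-1) keys, Left and Right are both guaranteed to meet the B-Tree requirements.

(2) If it holds at most (M-1) keys, things are more complicated. The combined node fits in a single node and cannot be split as in case (1), so in effect the new node has "borrowed" a key from the parent. If the parent still has at least (M-1)/2 keys afterwards, the repair is finished. If the parent now has fewer than (M-1)/2 keys, the parent must be repaired in turn, and this step repeats all the way up to the root.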
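
The two cases can be sketched on plain key vectors. This is an illustration under assumptions, not the article's implementation: Repair and repair_underflow are hypothetical names, child pointers are ignored, and the decision is made purely on whether the combined keys fit into one node.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Outcome of repairing an underflowed node with a sibling (hypothetical type).
struct Repair {
    bool merged;             // true: one node remains and the parent lost a key
    std::vector<int> left;   // resulting left node (the only node when merged)
    int separator;           // new parent key; meaningful only when !merged
    std::vector<int> right;  // resulting right node; empty when merged
};

// `left` and `right` are adjacent siblings, `separator` is the parent key
// between them, and `max_keys` is M - 1.
Repair repair_underflow(const std::vector<int>& left, int separator,
                        const std::vector<int>& right, std::size_t max_keys) {
    assert(max_keys >= 2);
    // Combined node: left keys, parent key, right keys (still sorted).
    std::vector<int> combined(left);
    combined.push_back(separator);
    combined.insert(combined.end(), right.begin(), right.end());

    Repair r;
    r.separator = 0;
    if (combined.size() > max_keys) {
        // Case 1: too many keys for one node, so split around the middle key.
        std::size_t mid = combined.size() / 2;
        r.merged = false;
        r.left.assign(combined.begin(), combined.begin() + mid);
        r.separator = combined[mid];
        r.right.assign(combined.begin() + mid + 1, combined.end());
    } else {
        // Case 2: everything fits in one node; the parent gives up its key.
        r.merged = true;
        r.left = combined;
    }
    return r;
}
```

With M = 5, repairing [1] and [7,8,9] around the parent key 5 redistributes into [1,5] and [8,9] with 7 as the new parent key, while repairing [1] and [7,8] merges into [1,5,7,8], and the caller must then check the parent for underflow.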

For example, deleting key=3 from the tree above gives the following tree (the figures are not reproduced here).

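The root of a B-Tree is special: it only needs to hold at least one key. If the repair reaches the root and the root still holds at least one key, the repair is finished. Otherwise the current root is deleted and replaced by its left subtree. That subtree may be empty, in which case the whole B-Tree is now empty.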

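The code is as follows. It uses members that the btree.h listing above does not declare (pos_in_parent, del, is_underflowed, is_root, get_pop_ngb, _get_left_max_key, _del, _merge_node and _realloc), so it shows the logic rather than a complete program: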

void btree::del(int key) {
    _del(root, key);
}

void btree::_del(b_node* node, int key) {
    // Locate the node that holds the key
    pos p = query(key);
    if (p.node == NULL) {            // key not found (assumes query returns a default pos in that case)
        return;
    }

    // Find the largest key in its left subtree
    pos left_max_p = _get_left_max_key(key);

    b_node* del_node = p.node;
    if (left_max_p.node != NULL)
    {
        del_node = left_max_p.node;
        std::swap(p.node->keys[p.index], left_max_p.node->keys[left_max_p.index]);    // swap the key with the largest key of its left subtree
    }

    // Now remove the key from the leaf
    del_node->del(key);    

    // If the leaf has not underflowed we are done
    if (!del_node->is_underflowed()) {
        return;
    }

    _merge_node(del_node);
}

void btree::_merge_node(b_node* del_node) {
    // The root is special: it only needs at least one key, while every other node needs at least (M-1)/2 keys
    if (del_node->is_root())
    {
        if (del_node->num == 0)                // the root has no keys left
        {
            root = del_node->childs[0];        // its first child (possibly NULL) becomes the root
            if (root) {
                root->parent = NULL;
            }
        }
        return;
    }

    // A non-root node has underflowed, so it must "borrow" from a neighbor
    b_node* ngb_node = del_node->get_pop_ngb();
    if (ngb_node == NULL)
    {
        return;
    }

    int p_key_pos = (del_node->pos_in_parent + ngb_node->pos_in_parent) / 2;    // index of the parent key between the two siblings
    int parent_key = del_node->parent->keys[p_key_pos];

    // Build the combined node
    b_node* combined_node = new b_node(del_node->num + 1 + ngb_node->num);

    if (del_node->pos_in_parent < ngb_node->pos_in_parent)
    {
        int combined_n = 0;
        _realloc(combined_node, del_node, del_node->num);
        combined_n += del_node->num;

        combined_node->insert(parent_key); ++combined_n;

        _realloc(combined_node, ngb_node, ngb_node->num, combined_n);
    }
    else
    {
        int combined_n = 0;
        _realloc(combined_node, ngb_node, ngb_node->num);
        combined_n += ngb_node->num;

        combined_node->insert(parent_key); ++combined_n;

        _realloc(combined_node, del_node, del_node->num, combined_n);
    }


    // Case 1: the neighbor has more than (M-1)/2 keys. Split the combined node in two and replace the parent key with its middle key
    if (ngb_node->num > dim/2)
    {
        int split_pos = (del_node->num + ngb_node->num + 1) / 2;
        b_node* combined_left = new b_node(dim);
        b_node* combined_right = new b_node(dim);

        _realloc(combined_left, combined_node, split_pos);
        _realloc(combined_right, combined_node, combined_node->num - split_pos - 1, 0, split_pos + 1);

        combined_left->parent = del_node->parent;
        combined_right->parent = del_node->parent;

        b_node* parent = del_node->parent;
        parent->keys[p_key_pos] = combined_node->keys[split_pos];    // the separator sits at p_key_pos, whichever sibling underflowed
        parent->childs[p_key_pos] = combined_left;
        combined_left->pos_in_parent = p_key_pos;
        parent->childs[p_key_pos + 1] = combined_right;
        combined_right->pos_in_parent = p_key_pos + 1;
        
        return;
    }

    // Case 2: the neighbor has exactly (M-1)/2 keys, so the parent may underflow after the merge.
    // The neighbor cannot have fewer than (M-1)/2 keys, because that would already have been repaired.
    del_node->parent->del(parent_key);
    del_node->parent->childs[del_node->pos_in_parent] = combined_node;
    combined_node->parent = del_node->parent;
    combined_node->pos_in_parent = del_node->pos_in_parent;

    // If the parent has not underflowed after giving up a key, we are done
    if (!del_node->parent->is_underflowed())
    {
        return;
    }

    // Otherwise keep repairing the parent, up to the root
    _merge_node(del_node->parent);
    return;
}

void btree::_realloc(b_node* new_node, b_node* old_node, int num, int new_offset, int old_offset) {
    int i = old_offset;
    int n = new_offset;
    for (; i<old_offset + num; ++i, ++n)
    {
        new_node->keys[n] = old_node->keys[i];
        new_node->childs[n] = old_node->childs[i];

        if (new_node->childs[n]) {
            new_node->childs[n]->parent = new_node;
            new_node->childs[n]->pos_in_parent = n;
        }
    }
    new_node->childs[n] = old_node->childs[i];
    if (new_node->childs[n]) {
        new_node->childs[n]->parent = new_node;
        new_node->childs[n]->pos_in_parent = n;
    }
    new_node->num += num;
    return;
}

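The test code inserts values one at a time. The values are deliberately arranged so that the B-Tree grows from 1 level to 3 levels, and the print interface makes it easy to inspect the keys on each level.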

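To check whether your own implementation is correct, or to understand how a B-Tree inserts keys, see https://www.cs.usfca.edu/~galles/visualization/BTree.html, which animates B-Tree insertion and splitting. It is vivid and easy to follow.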
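
Another way to check an implementation is to verify the B-Tree rules programmatically. The sketch below is one possible checker, written against a simplified Node type rather than the article's b_node; check_subtree and is_valid_btree are hypothetical names, keys are assumed to be distinct, and max_keys corresponds to dim in the article's code.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <vector>

// Minimal node type for illustration only (not the article's b_node).
struct Node {
    std::vector<int> keys;              // sorted keys
    std::vector<const Node*> children;  // empty for a leaf
};

// Returns the height of the subtree (1 for a leaf) if it satisfies the
// B-Tree rules, or -1 otherwise. Keys must lie strictly between lo and hi.
int check_subtree(const Node* n, std::size_t max_keys, bool is_root,
                  long long lo, long long hi) {
    std::size_t min_keys = is_root ? 1 : max_keys / 2;  // floor((M-1)/2) for non-root
    if (n->keys.size() < min_keys || n->keys.size() > max_keys) return -1;

    long long prev = lo;
    for (std::size_t i = 0; i < n->keys.size(); ++i) {
        if (n->keys[i] <= prev || n->keys[i] >= hi) return -1;  // order or range broken
        prev = n->keys[i];
    }
    if (n->children.empty()) return 1;                          // leaf
    if (n->children.size() != n->keys.size() + 1) return -1;

    int height = 0;
    for (std::size_t i = 0; i < n->children.size(); ++i) {
        long long child_lo = (i == 0) ? lo : n->keys[i - 1];
        long long child_hi = (i == n->keys.size()) ? hi : n->keys[i];
        int h = check_subtree(n->children[i], max_keys, false, child_lo, child_hi);
        if (h < 0) return -1;
        if (height != 0 && h != height) return -1;  // leaves on different levels
        height = h;
    }
    return height + 1;
}

bool is_valid_btree(const Node* root, std::size_t max_keys) {
    assert(max_keys >= 2);
    if (root == nullptr) return true;  // an empty tree is trivially valid
    return check_subtree(root, max_keys, true, LLONG_MIN, LLONG_MAX) > 0;
}
```

Running such a check after every insert or delete in a test catches broken key order, wrong key counts and leaves that end up on different levels.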
