Python Dictionary Implementation

Reposted 2019-08-08 09:08, from the collection Python知識錦集.

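This article describes how dictionaries are implemented in Python.

A dictionary is indexed by keys and can be viewed as an associative array. Let's add 3 key/value pairs to a dictionary:

>>> d = {'a': 1, 'b': 2}
>>> d['c'] = 3
>>> d
{'a': 1, 'b': 2, 'c': 3}

The values can be accessed like this:

>>> d['a']
1
>>> d['b']
2
>>> d['c']
3
>>> d['d']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'd'

The key 'd' does not exist, so a KeyError exception is raised.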

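Hash tables

Python dictionaries are implemented using hash tables. A hash table is an array whose indexes are obtained by applying a hash function to the keys. The goal of a hash function is to distribute the keys evenly in the array, and a good hash function minimizes the number of collisions (translator's note: a collision occurs when different keys hash to the same index). Python's hash functions are not of this kind: the most important ones (for strings and integers) are very regular in common cases:

>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]

For the rest of this article we assume that strings are used as dictionary keys. The hash function for strings in Python is defined as follows: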

arguments: string object
return: hash
function string_hash:
    if hash cached:
        return it
    set len to string's length
    initialize var p pointing to 1st char of string object
    set x to value pointed by p left shifted by 7 bits
    while len > 0:
        set var x to (1000003 * x) xor value pointed by p
        increment pointer p
        decrement len
    set x to x xor length of string object
    cache x as the hash so we don't need to calculate it again
    return x as the hash

If you run hash('a') in Python, string_hash() is executed and returns 12416037344. Here we assume a 64-bit machine. (Translator's note: this article describes the dictionary implementation in Python 2.)
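The pseudocode can be checked with a small emulation. This is a minimal sketch in modern Python, not the CPython source: the function name, the use of a bytes argument and the explicit 64-bit wrap-around are assumptions made for illustration, and hash caching is left out.

```python
def string_hash(s: bytes) -> int:
    """Emulate the string_hash pseudocode for a byte string (64-bit build assumed)."""
    if not s:
        return 0  # empty string: x starts at 0 and is xor-ed with length 0
    mask64 = (1 << 64) - 1            # emulate C integer wrap-around
    x = (s[0] << 7) & mask64          # first character shifted left by 7 bits
    for byte in s:                    # one round per character
        x = ((1000003 * x) ^ byte) & mask64
    x ^= len(s)                       # mix in the length
    if x >= 1 << 63:                  # reinterpret as a signed 64-bit value
        x -= 1 << 64
    return x


print(string_hash(b"a"))  # 12416037344, the value quoted in the text
```

The same function reproduces the hashes used in the walkthrough further down, for example 12544037731 for b"b".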

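If an array of size x is used to store the key/value pairs, then x - 1 is used as a mask to compute the slot index in the array (the size is always a power of two), which makes the computation of the slot index very fast. The probability of finding an empty slot is high because of the resizing mechanism described below, so a simple computation makes sense most of the time. If the size of the array is 8, the index for 'a' is hash('a') & 7 = 0. The index for 'b' is 3 and the index for 'c' is 2. The index for 'z' is 3, the same as 'b', so we have a collision.

[Figure: hash table]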
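The index arithmetic can be reproduced directly. The sketch below hard-codes the hash values quoted in this article (64-bit Python 2 assumed); the variable names are illustrative only.

```python
# Hash values as quoted in this article (64-bit Python 2 assumed).
hashes = {
    "a": 12416037344,
    "b": 12544037731,
    "c": 12672038114,
    "z": 15616046971,
}

size = 8          # table size, a power of two
mask = size - 1   # 7, i.e. binary 111

indexes = {}
for key, h in hashes.items():
    # For a power-of-two size, masking equals the modulo but is cheaper.
    assert h & mask == h % size
    indexes[key] = h & mask

print(indexes)  # {'a': 0, 'b': 3, 'c': 2, 'z': 3}
```

'b' and 'z' land on the same index, which is the collision described above.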

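We can see that the Python hash function does a good job when the keys are consecutive, which is good because this kind of data is very common. However, once we add the key 'z' there is a collision, because it (translator's note: the indexes computed from these keys) is not regular enough.

We could use a linked list to store the key/value pairs that have the same hash, but that would increase the lookup time (no longer O(1) on average). The next section describes the collision resolution method used by Python dictionaries.

Open addressing

Open addressing is a collision resolution method that uses probing. In the 'z' example above, slot 3 is already in use, so we need to probe for an index that is not taken yet. Because of the probing, adding a key/value pair may take more time, but lookups remain O(1) on average, which is the behavior we want.

The following probing sequence (often called quadratic probing, although it is really a recurrence perturbed by the upper bits of the hash) is used to find a free slot: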

j = (5 * j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2 ** i as the next table index;

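You can find more about the probing sequence in the source file dictobject.c. A detailed explanation of the probing mechanism is given at the top of that file.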
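To see what the recurrence produces, here is a small sketch that lists the slot indexes visited for a given hash. It follows the three lines quoted above in the same order; the function name is hypothetical and PERTURB_SHIFT = 5 is an assumption, since the value is not given in this article.

```python
PERTURB_SHIFT = 5  # assumed value, not stated in the article


def probe_sequence(h, size, count):
    """Return the first `count` slot indexes visited for hash `h`."""
    mask = size - 1
    perturb = h & ((1 << 64) - 1)   # treat the hash as unsigned
    j = h & mask                    # initial slot
    indexes = [j]
    for _ in range(count - 1):
        j = (5 * j) + 1 + perturb   # recurrence from the article
        perturb >>= PERTURB_SHIFT   # upper hash bits fade in over time
        indexes.append(j & mask)    # j % 2**i for a table of size 2**i
    return indexes


print(probe_sequence(15616046971, 8, 4))  # hash('z'): [3, 3, 3, 5]
print(probe_sequence(0, 8, 8))            # [0, 1, 6, 7, 4, 5, 2, 3]
```

The second call shows that once perturb has dropped to zero, j = 5 * j + 1 visits every slot of the table exactly once, so a free slot is always found eventually.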

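[Figure: probing sequence]

Now let's walk through the Python implementation source code with an example.

Dictionary C structures

The following C structure is used to store a dictionary entry: a key/value pair. The hash, the key and the value are all stored in it. PyObject is the base class of Python objects (translator's note: "class" is only conceptual here; CPython uses C structs to model Python classes).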

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

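The following structure represents a dictionary. ma_fill is the number of used slots plus dummy slots. A slot is marked dummy when its key/value pair is removed (translator's note: a slot is simply an abstract storage position). ma_used is the number of used (active) slots. ma_mask is equal to the array's size minus 1 and is used to calculate the slot index. ma_table is the array that holds the key/value pairs, and ma_smalltable is the initial array of size 8.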

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

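Dictionary initialization

When you first create a dictionary, the function PyDict_New() is called. Below is pseudocode condensed from the source; it concentrates on the key points of the original implementation.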

returns new dictionary object
function PyDict_New:
    allocate new dictionary object
    clear dictionary's table
    set dictionary's number of used slots + dummy slots (ma_fill) to 0
    set dictionary's number of active slots (ma_used) to 0
    set dictionary's mask (ma_mask) to dictionary size - 1 = 7
    set dictionary's lookup function to lookdict_string
    return allocated dictionary object
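As a rough mental model, the two C structures and the initialization steps can be mirrored in Python. This is only an illustrative sketch: the class names DictEntry and DictObject and the function new_dict are hypothetical, None stands in for a NULL key, and the lookup function pointer and ma_smalltable are omitted.

```python
from dataclasses import dataclass, field
from typing import Any, List

PyDict_MINSIZE = 8  # initial table size mentioned in the text


@dataclass
class DictEntry:
    me_hash: int = 0
    me_key: Any = None    # None plays the role of a NULL key (unused slot)
    me_value: Any = None


def _empty_table() -> List[DictEntry]:
    return [DictEntry() for _ in range(PyDict_MINSIZE)]


@dataclass
class DictObject:
    ma_fill: int = 0                   # used slots + dummy slots
    ma_used: int = 0                   # active slots only
    ma_mask: int = PyDict_MINSIZE - 1  # table size - 1 = 7
    ma_table: List[DictEntry] = field(default_factory=_empty_table)


def new_dict() -> DictObject:
    """Mirror of the PyDict_New pseudocode: an empty table of 8 slots."""
    return DictObject()


d = new_dict()
print(d.ma_fill, d.ma_used, d.ma_mask, len(d.ma_table))  # 0 0 7 8
```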

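Adding items

When a new key/value pair is added, PyDict_SetItem() is called. This function takes a pointer to the dictionary object and the key/value pair as arguments. It checks whether the key is a string and calculates its hash (or reuses the cached hash if there is one). insertdict() is then called to add the new key/value pair, and the dictionary is resized if the number of used slots plus dummy slots is greater than 2/3 of the array's size.

Why 2/3? It keeps the probing sequence able to find a free slot quickly (translator's note: my understanding is that with more free slots, each probe is less likely to collide, so fewer probes are needed). We will look at the resize function later.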

arguments: dictionary, key, value
return: 0 if OK or -1
function PyDict_SetItem:
    if key's hash cached:
        use hash
    else:
        calculate hash
    call insertdict with dictionary object, key, hash and value
    if key/value pair added successfully and fill over 2/3 of capacity:
        call dictresize to resize dictionary's table
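The 2/3 test can be written with integer arithmetic only, which avoids floating point. The sketch below is an assumption about how such a check might look, with a hypothetical function name; it is not copied from the source.

```python
def needs_resize(ma_fill, ma_mask):
    """True when used + dummy slots reach 2/3 of the table size."""
    table_size = ma_mask + 1
    # fill / size >= 2/3, rewritten without division.
    return ma_fill * 3 >= table_size * 2


for fill in range(4, 8):
    print(fill, needs_resize(fill, 7))
# 4 False
# 5 False
# 6 True
# 7 True
```

Because the table size is a power of two, fill * 3 can never equal size * 2 exactly, so "at least 2/3" and "more than 2/3" describe the same condition here.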

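insertdict() uses the lookup function lookdict_string to find a free slot; it is the same function that is used to find a key. lookdict_string() calculates the slot index from the hash and the mask values. If it cannot find the key at the slot index (= hash & mask), it starts probing, using the loop described above, until it finds a free slot. While probing, when it reaches a slot whose key is null, it returns the first dummy slot found along the way, if any. This gives priority to reusing previously deleted slots.

Suppose we want to add the following key/value pairs: {'a': 1, 'b': 2, 'z': 26, 'y': 25, 'c': 3, 'x': 24}. This is what happens, with a dictionary structure whose table size is 8: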

PyDict_SetItem: key='a', value = 1
    hash = hash('a') = 12416037344
    insertdict
        lookdict_string
            slot index = hash & mask = 12416037344 & 7 = 0
            slot 0 is not used so return it
        init entry at index 0 with key, value and hash
        ma_used = 1, ma_fill = 1
PyDict_SetItem: key='b', value = 2
    hash = hash('b') = 12544037731
    insertdict
        lookdict_string
            slot index = hash & mask = 12544037731 & 7 = 3
            slot 3 is not used so return it
        init entry at index 3 with key, value and hash
        ma_used = 2, ma_fill = 2
PyDict_SetItem: key='z', value = 26
    hash = hash('z') = 15616046971
    insertdict
        lookdict_string
            slot index = hash & mask = 15616046971 & 7 = 3
            slot 3 is used so probe for a different slot: 5 is free
        init entry at index 5 with key, value and hash
        ma_used = 3, ma_fill = 3
PyDict_SetItem: key='y', value = 25
    hash = hash('y') = 15488046584
    insertdict
        lookdict_string
            slot index = hash & mask = 15488046584 & 7 = 0
            slot 0 is used so probe for a different slot: 1 is free
        init entry at index 1 with key, value and hash
        ma_used = 4, ma_fill = 4
PyDict_SetItem: key='c', value = 3
    hash = hash('c') = 12672038114
    insertdict
        lookdict_string
            slot index = hash & mask = 12672038114 & 7 = 2
            slot 2 is not used so return it
        init entry at index 2 with key, value and hash
        ma_used = 5, ma_fill = 5
PyDict_SetItem: key='x', value = 24
    hash = hash('x') = 15360046201
    insertdict
        lookdict_string
            slot index = hash & mask = 15360046201 & 7 = 1
            slot 1 is used so probe for a different slot: 7 is free
        init entry at index 7 with key, value and hash
        ma_used = 6, ma_fill = 6
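The trace above can be replayed with a short simulation. This is a simplified sketch, not the real lookdict_string: the function name find_slot is hypothetical, PERTURB_SHIFT = 5 is assumed, dummy slots are ignored, and the hashes are the ones listed in the trace.

```python
PERTURB_SHIFT = 5  # assumed value


def find_slot(table, key, h):
    """Return the index of `key`, or of the free slot where it would go."""
    mask = len(table) - 1
    j = h & mask
    perturb = h  # all hashes in this example are non-negative
    i = j
    while table[i] is not None and table[i][0] != key:
        j = (5 * j) + 1 + perturb
        perturb >>= PERTURB_SHIFT
        i = j & mask
    return i


items = [
    ("a", 12416037344, 1),
    ("b", 12544037731, 2),
    ("z", 15616046971, 26),
    ("y", 15488046584, 25),
    ("c", 12672038114, 3),
    ("x", 15360046201, 24),
]

table = [None] * 8
for key, h, value in items:
    table[find_slot(table, key, h)] = (key, h, value)

layout = [entry[0] if entry else None for entry in table]
print(layout)  # ['a', 'y', 'c', 'b', None, 'z', None, 'x']
```

The resulting layout matches the slot numbers in the trace: 'z' ends up in slot 5, 'y' in slot 1 and 'x' in slot 7.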

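So far, this is what we have:

[Figure: insert items]

6 of the 8 slots are now used, which is more than 2/3 of the array's capacity, so dictresize() is called to allocate a larger array. This function also takes care of copying the entries of the old table to the new table.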

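In this case dictresize() is called with minused = 24, which is 4 * ma_used. When the number of used slots is very large (over 50000), minused = 2 * ma_used instead. Why 4 times the number of used slots? It reduces the number of resize steps and makes the dictionary sparser (translator's note: which lowers the chance of collisions while probing).

The new table size needs to be greater than 24. It is calculated by shifting the size left by one bit at a time, starting from 8, until it is greater than 24. The result is 32 (8 -> 16 -> 32).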
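The size calculation is easy to sketch. The function name below is hypothetical, and starting the shift from the minimum size of 8 is an assumption; in this example the current table size is also 8, so both readings give the same result.

```python
PyDict_MINSIZE = 8  # assumed starting point for the shift


def new_table_size(ma_used):
    """Smallest power-of-two size strictly greater than minused."""
    factor = 2 if ma_used > 50000 else 4   # grow less aggressively when huge
    minused = factor * ma_used
    newsize = PyDict_MINSIZE
    while newsize <= minused:
        newsize <<= 1                      # 8 -> 16 -> 32 -> ...
    return newsize


print(new_table_size(6))  # 32
```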

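This is what happens when the table is resized: a new table of size 32 is allocated, and the entries of the old table are inserted into the new one using slot indexes computed with the new mask, 31 (32 - 1). The result looks like this:

[Figure: resize table]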
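The new positions follow from the stored hashes and the new mask. The sketch below recomputes them from the hash values used earlier in this article; since the six indexes happen to be distinct, no probing is needed in this particular case.

```python
# Entries in old-table slot order, with the hashes quoted in the article.
old_entries = [
    ("a", 12416037344),
    ("y", 15488046584),
    ("c", 12672038114),
    ("b", 12544037731),
    ("z", 15616046971),
    ("x", 15360046201),
]

new_size = 32
new_mask = new_size - 1  # 31

new_slots = {key: h & new_mask for key, h in old_entries}
print(new_slots)
# {'a': 0, 'y': 24, 'c': 2, 'b': 3, 'z': 27, 'x': 25}

# All indexes are distinct, so every entry lands on its first slot.
assert len(set(new_slots.values())) == len(old_entries)
```

Note that the stored hashes are reused; the keys do not need to be hashed again during a resize.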

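Removing items

PyDict_DelItem() is called to remove an entry. The hash of the key is calculated and passed to the lookup function, which returns the entry; after the deletion, the slot becomes a dummy slot.

Suppose we want to remove the key 'c' from our dictionary. We end up with the following array:

[Figure: delete item]

Note that deleting an item does not trigger an array resize, even if the number of used slots is much smaller than the total number of slots.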
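The effect of a deletion on the counters can be sketched as follows. The names DUMMY and delete_item are hypothetical, and the slot positions assume the 32-slot layout obtained after the resize, where 'c' sits at index 2.

```python
DUMMY = "<dummy>"  # marker for a deleted slot (illustrative stand-in)


def delete_item(table, counters, index):
    """Mark a slot as dummy; the table itself is never shrunk here."""
    table[index] = DUMMY
    counters["ma_used"] -= 1   # one fewer active entry
    # ma_fill is left alone: it counts used slots plus dummy slots.


table = [None] * 32
for key, index in {"a": 0, "c": 2, "b": 3, "y": 24, "x": 25, "z": 27}.items():
    table[index] = key
counters = {"ma_used": 6, "ma_fill": 6}

delete_item(table, counters, 2)  # remove 'c'
print(table[2], counters, len(table))
# <dummy> {'ma_used': 5, 'ma_fill': 6} 32
```

Keeping the dummy marker instead of clearing the slot matters for open addressing: a lookup that probes through this slot must keep going rather than stop as it would at an empty slot.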
