TensorFlow's GPU Memory Manager: the BFC Allocator

Reposted article, 2019-05-04 23:00.

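Background

Author: DeepLearningStack, algorithm engineer at Alibaba and open-source TensorFlow contributor

Feel free to follow my WeChat official account, "互聯網西門二少", where I will keep publishing technical write-ups.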

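Training on a GPU consumes a large amount of device memory, both for model parameters and for intermediate results. To avoid paying allocation costs on every training step, frameworks commonly compute the shapes of every node's inputs and outputs and of the model parameters before training starts, then allocate everything once, globally. Caffe works this way. As deep learning models have evolved, however, the shape of the training data may change, and even the model itself may change during training, so allocating once for fixed shapes no longer suffices. TensorFlow therefore designed a more flexible GPU memory management mechanism: it uses an allocation algorithm called BFC, and the BFC Allocator gives each Tensor the memory it needs. This article looks at the design ideas behind the BFC Allocator.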

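Starting from Tensor creation

When storage is allocated for a Tensor

Before getting to the main topic, consider one question: when exactly does a Tensor in TensorFlow obtain the storage it needs? The answer is that storage is allocated immediately, when the Tensor object is created. After one training step finishes, the Tensors created during it have been released; when the next step begins, Tensors are created again on demand and given new storage. The code below shows where the Tensor constructor uses an Allocator to obtain storage.

Creating a Tensor object requires passing in an Allocator. It can be any implementation; on GPU it is BFCAllocator.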

 1 Tensor::Tensor(Allocator* a, DataType type, const TensorShape& shape)
 2     : shape_(shape), buf_(nullptr) {
 3   set_dtype(type);
 4   CHECK_NOTNULL(a);
 5   if (shape_.num_elements() > 0 || a->ShouldAllocateEmptyTensors()) {
 6     CASES(type, buf_ = new Buffer<T>(a, shape.num_elements()));
 7   }
 8   if (buf_ != nullptr && buf_->data() != nullptr && LogMemory::IsEnabled()) {
 9     LogMemory::RecordTensorAllocation("Unknown", LogMemory::UNKNOWN_STEP_ID,
10                                       *this);
11   }
12 }

Line 6 of the code above creates a Buffer object, which is the Tensor's actual storage. Let us look at how its constructor is implemented.

1 template <typename T>
2 Buffer<T>::Buffer(Allocator* a, int64 n,
3                   const AllocationAttributes& allocation_attr)
4     : BufferBase(a, a->Allocate<T>(n, allocation_attr)), elem_(n) {}

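The key part of this snippet is line 4, which calls Allocate: at that point the Buffer actually obtains a piece of storage. This confirms that storage is allocated immediately when a Tensor object is created.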

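The problem: performance requirements for allocating and reclaiming GPU memory

A Tensor obtains storage every time it is created, and every training step creates new Tensors. This raises a question: how can such frequent allocation and release be made efficient? On a GPU, if Allocate simply wrapped CUDA's expensive cudaMalloc, and releasing a Tensor called cudaFree directly, this overhead would slow training considerably.

The basic idea: a memory pool

If you are familiar with operating systems coursework, a solution comes to mind readily: allocate GPU memory up front in blocks of various sizes and organize them into a pool. Each Allocate call takes a block from the pool, and when a Tensor is released its memory is returned to the pool. This does meet the performance requirement, but it calls for a fairly sophisticated memory manager. The BFC Allocator is the memory manager TensorFlow uses for GPU memory.

With the requirements and background covered, we can turn to the main topic, starting with the principles.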
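As a rough illustration of the pooling idea only, and not of how BFC itself works, the sketch below caches released blocks by size and reuses them before asking the backing allocator again. SimplePool and its methods are hypothetical names, and std::malloc stands in for cudaMalloc so that the example runs without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <map>
#include <vector>

// Toy pool (hypothetical, for illustration): released blocks are cached per
// size and reused, so the expensive backing allocator is called less often.
class SimplePool {
 public:
  ~SimplePool() {
    // Return every cached block to the backing allocator.
    for (auto& entry : free_blocks_) {
      for (void* p : entry.second) std::free(p);
    }
  }

  void* Allocate(std::size_t bytes) {
    assert(bytes > 0);
    auto it = free_blocks_.find(bytes);
    if (it != free_blocks_.end() && !it->second.empty()) {
      void* p = it->second.back();  // reuse a cached block of the same size
      it->second.pop_back();
      return p;
    }
    ++backing_calls_;
    return std::malloc(bytes);  // stand-in for cudaMalloc
  }

  // The caller passes the size back; a real allocator would look it up.
  void Deallocate(void* p, std::size_t bytes) {
    free_blocks_[bytes].push_back(p);
  }

  int backing_calls() const { return backing_calls_; }

 private:
  std::map<std::size_t, std::vector<void*>> free_blocks_;
  int backing_calls_ = 0;
};
```

Allocating 1024 bytes, releasing them, and allocating 1024 bytes again returns the same pointer and calls the backing allocator only once. A pool keyed by exact size wastes memory when sizes vary, which is the gap that splitting and coalescing in BFC are meant to close.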

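Best-Fit with Coalescing and dlmalloc

BFC stands for Best-Fit with Coalescing. According to the comments in the TensorFlow source, the BFC algorithm is not entirely original to TensorFlow; it is a simplified version of dlmalloc. dlmalloc is a well-regarded memory allocator named after Doug Lea, and the dlmalloc site linked from the original post describes it in detail for interested readers. TensorFlow adopts a simplified dlmalloc because the algorithm allocates and reclaims storage on demand very efficiently while keeping fragmentation as low as possible.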

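Basic principles of the BFC Allocator

The core idea is to divide storage into chunks and manage them in a pool. The division must satisfy the following requirements.

1. Addresses within a chunk are contiguous

2. Chunks in the pool are sorted by base address in ascending order and organized as a doubly linked list

3. A chunk at a higher address has a larger size than a chunk at a lower address

TensorFlow abstracts a storage block together with its bookkeeping information into a data structure called Chunk.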

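Core data structures

Chunk

Chunk is one of the most central data structures in BFC; in the TensorFlow source it is a struct. A Chunk represents one contiguous piece of storage. BFC requires Chunks to be sorted by base address in ascending order and organized as a doubly linked list. The figure below shows the structure of a Chunk and how Chunks are linked. Initially each Chunk has its own size, and every size is a multiple of 256 bytes. Note that a Chunk is either entirely in use or entirely free; a Chunk is never partially used.

prev, next: these act as pointers to the predecessor and successor Chunk. Because the BFC Allocator stores its Chunks in a vector, they are actually the indices of the predecessor and successor

ptr: the start address of the Chunk's storage, also called its base address

size: the total size of the storage the Chunk describes. Sizes differ between Chunks, but each is a multiple of 256 bytes

requested_size: the size the user asked for, always less than or equal to size. Because a Chunk cannot be partially used, the whole Chunk of size bytes is handed out even if the user needs only requested_size, which can waste some space as internal fragmentation

allocation_id: a value other than -1 marks the Chunk as in use, and -1 marks it as free (FreeAndMaybeCoalesce, shown later, sets allocation_id to -1 on release)

bin_num: the index of the Bin this Chunk belongs to. Bin is another core data structure, described in detail below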
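The fields above can be pictured roughly as follows. This is a simplified sketch rather than the TensorFlow definition: the field types, the use of a std::vector index as ChunkHandle, and the FreeBytes helper are assumptions made for illustration. The in_use() test against -1 follows the FreeAndMaybeCoalesce code later in this article, which marks a released Chunk with allocation_id = -1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: a handle is an index into a std::vector<Chunk>.
using ChunkHandle = std::size_t;
constexpr ChunkHandle kInvalidChunkHandle = static_cast<ChunkHandle>(-1);
constexpr int kInvalidBinNum = -1;

struct Chunk {
  std::size_t size = 0;            // total bytes, a multiple of 256
  std::size_t requested_size = 0;  // bytes the caller asked for (<= size)
  std::int64_t allocation_id = -1; // -1 means free
  void* ptr = nullptr;             // base address of the storage
  ChunkHandle prev = kInvalidChunkHandle;  // neighbour at the lower address
  ChunkHandle next = kInvalidChunkHandle;  // neighbour at the higher address
  int bin_num = kInvalidBinNum;    // bin holding this chunk while it is free

  bool in_use() const { return allocation_id != -1; }
};

// Hypothetical helper: walk the list from `head` in address order and add up
// the sizes of all free chunks.
inline std::size_t FreeBytes(const std::vector<Chunk>& chunks,
                             ChunkHandle head) {
  std::size_t total = 0;
  for (ChunkHandle h = head; h != kInvalidChunkHandle; h = chunks[h].next) {
    assert(h < chunks.size());
    if (!chunks[h].in_use()) total += chunks[h].size;
  }
  return total;
}
```

Storing indices instead of raw pointers keeps the links valid when the vector reallocates, which is presumably why the handles are indices.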

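Bin

To find and take a free Chunk that meets a request, we would otherwise have to traverse the doubly linked list, which is clearly inefficient. To speed up the lookup, Chunks can be kept in a defined order and the ordered list logically divided into segments, each recording the range of Chunks it contains. This structure is the Bin; it acts as an index and exists to make Chunk lookup convenient. In the BFC Allocator, the Chunks in each segment are sorted by size and then by base address, ascending. Each Bin has its own bin_size, which is the smallest Chunk size that Bin holds. A caller can therefore go straight to the Bin matching the requested size and then traverse that Bin for a suitable Chunk. To locate a Bin directly from bin_size, the two are related by bin_size = 256 * 2^bin_num. When memory is requested, the actual size is mapped to the best-fitting bin_size, the Bin is found through this relation, and the search proceeds within that segment.

Chunks in a Bin are organized in a Set. To reflect the doubly linked list logic in the Set, it suffices to keep the Chunks in the Set in ascending order under the rule above and to fix up the predecessor and successor pointers. The Comparator that defines the Chunk order is declared inside the Bin structure, as follows.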
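To make the relation bin_size = 256 * 2^bin_num concrete, here is a small sketch of rounding a request to a multiple of 256 bytes and mapping it to a bin index. The function names RoundedBytes and BinNumForSize follow the code shown later in this article, but the bodies, the Log2Floor and BinSize helpers, and the bin count kNumBins = 21 are assumptions for illustration, not a copy of the TensorFlow source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed constants for this sketch.
constexpr int kMinAllocationBits = 8;  // 2^8 = 256 bytes
constexpr std::size_t kMinAllocationSize = std::size_t{1} << kMinAllocationBits;
constexpr int kNumBins = 21;           // hypothetical number of bins

// Round a request up to the next multiple of 256 bytes.
inline std::size_t RoundedBytes(std::size_t bytes) {
  return kMinAllocationSize *
         ((bytes + kMinAllocationSize - 1) / kMinAllocationSize);
}

// floor(log2(v)) for v > 0.
inline int Log2Floor(std::uint64_t v) {
  assert(v > 0);
  int r = 0;
  while (v >>= 1) ++r;
  return r;
}

// Index of the bin with the largest bin_size = 256 * 2^bin_num that does not
// exceed `bytes`, clamped to the last bin.
inline int BinNumForSize(std::size_t bytes) {
  const std::uint64_t v =
      std::max<std::size_t>(bytes, kMinAllocationSize) >> kMinAllocationBits;
  return std::min(kNumBins - 1, Log2Floor(v));
}

// Smallest chunk size held by bin `bin_num`.
inline std::size_t BinSize(int bin_num) {
  return kMinAllocationSize << bin_num;
}
```

Under these assumptions a 1000-byte request is rounded to 1024 bytes and lands in bin 2, whose bin_size is 1024.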

// Sort first by size and then use pointer address as a tie breaker.
bool operator()(const ChunkHandle ha,
                const ChunkHandle hb) const NO_THREAD_SAFETY_ANALYSIS {
  const Chunk* a = allocator_->ChunkFromHandle(ha);
  const Chunk* b = allocator_->ChunkFromHandle(hb);
  if (a->size != b->size) {
    return a->size < b->size;
  }
  return a->ptr < b->ptr;
}
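The effect of this comparator can be tried in isolation. The sketch below is a standalone approximation: the reduced Chunk struct, the ChunkComparator that reads from a std::vector instead of going through allocator_, and the SmallestFree helper are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using ChunkHandle = std::size_t;  // assumption: index into a vector of chunks

// Reduced Chunk for this sketch: only the two fields the comparator reads.
struct Chunk {
  std::size_t size;
  std::uintptr_t ptr;
};

// Orders handles by chunk size first, then by base address.
struct ChunkComparator {
  const std::vector<Chunk>* chunks;  // hypothetical stand-in for allocator_
  bool operator()(ChunkHandle ha, ChunkHandle hb) const {
    const Chunk& a = (*chunks)[ha];
    const Chunk& b = (*chunks)[hb];
    if (a.size != b.size) return a.size < b.size;
    return a.ptr < b.ptr;
  }
};

using FreeChunkSet = std::set<ChunkHandle, ChunkComparator>;

// The smallest free chunk is the first element in iteration order.
inline ChunkHandle SmallestFree(const FreeChunkSet& free_chunks) {
  assert(!free_chunks.empty());
  return *free_chunks.begin();
}
```

With three chunks of sizes 1024, 256 and 256 at addresses 0x3000, 0x2000 and 0x1000, iterating the set visits the 256-byte chunk at 0x1000 first, then the one at 0x2000, then the 1024-byte chunk.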

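Helper classes

AllocationRegion and RegionManager

These two classes play a supporting role. The BFC Allocator always allocates in units of Chunks, and a Chunk is referred to by a ChunkHandle (actually an array index), yet what the requester ultimately receives is ptr, the pointer to the start of the Chunk's storage. Likewise, when storage is reclaimed the system sees only that start pointer; if the Chunk and Bin cannot be found from it, reclamation fails. So a set of interfaces is needed that records every allocated Chunk and keeps the mapping between a storage address ptr and its Chunk. AllocationRegion and RegionManager provide this.

Specifically, an AllocationRegion records one storage allocation. The information for one allocation includes the start address ptr and the size memory_size, and the region may contain multiple Chunks, so the structure records all Chunks that belong to that allocation. RegionManager manages the AllocationRegions and maintains an array of them, which must be kept sorted by end_ptr.

The sequence diagram below shows how RegionManager is used to look up the ChunkHandle for a given ptr.

This functionality is simple, so the code is not walked through here; interested readers will understand it readily from the definitions of the two classes.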
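For readers who want a compact picture of the lookup, here is a minimal sketch of the idea described above: regions kept sorted by end_ptr, a binary search to find the region containing an address, and one handle slot per 256-byte unit inside the region. The class and method names echo the article, but the signatures, the use of std::uintptr_t for addresses, and the 256-byte slot granularity are assumptions, not the TensorFlow implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using ChunkHandle = std::size_t;
constexpr ChunkHandle kInvalidChunkHandle = static_cast<ChunkHandle>(-1);
constexpr int kMinAllocationBits = 8;  // assumed: one slot per 256 bytes

// One contiguous region obtained from the device.
class AllocationRegion {
 public:
  AllocationRegion(std::uintptr_t ptr, std::size_t memory_size)
      : ptr_(ptr), end_ptr_(ptr + memory_size),
        handles_(memory_size >> kMinAllocationBits, kInvalidChunkHandle) {}
  std::uintptr_t ptr() const { return ptr_; }
  std::uintptr_t end_ptr() const { return end_ptr_; }
  ChunkHandle get_handle(std::uintptr_t p) const { return handles_[IndexFor(p)]; }
  void set_handle(std::uintptr_t p, ChunkHandle h) { handles_[IndexFor(p)] = h; }

 private:
  std::size_t IndexFor(std::uintptr_t p) const {
    return (p - ptr_) >> kMinAllocationBits;
  }
  std::uintptr_t ptr_, end_ptr_;
  std::vector<ChunkHandle> handles_;
};

class RegionManager {
 public:
  // Insert the new region so that regions_ stays sorted by end_ptr.
  void AddAllocationRegion(std::uintptr_t ptr, std::size_t memory_size) {
    auto pos = std::upper_bound(regions_.begin(), regions_.end(),
                                ptr + memory_size, &Less);
    regions_.insert(pos, AllocationRegion(ptr, memory_size));
  }
  void set_handle(std::uintptr_t p, ChunkHandle h) {
    AllocationRegion* r = RegionFor(p);
    assert(r != nullptr && p >= r->ptr());
    r->set_handle(p, h);
  }
  ChunkHandle get_handle(std::uintptr_t p) {
    AllocationRegion* r = RegionFor(p);
    if (r == nullptr || p < r->ptr()) return kInvalidChunkHandle;
    return r->get_handle(p);
  }

 private:
  static bool Less(std::uintptr_t p, const AllocationRegion& r) {
    return p < r.end_ptr();
  }
  // First region whose end_ptr is greater than p, found by binary search.
  AllocationRegion* RegionFor(std::uintptr_t p) {
    auto it = std::upper_bound(regions_.begin(), regions_.end(), p, &Less);
    return it == regions_.end() ? nullptr : &*it;
  }
  std::vector<AllocationRegion> regions_;
};
```

Sorting by end_ptr lets a single std::upper_bound call find the only region that can contain the address, so the lookup cost grows with the logarithm of the number of regions.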

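BFC allocation and reclamation strategy

With the basic structures and the design of BFC introduced, we can work through the actual allocation and reclamation processes.

Allocate flow

AllocateRawInternal

This is the overall flow by which BFCAllocator hands a Chunk to the user. Free storage on the physical device has already been allocated in advance and organized as Chunks in a doubly linked list, so the BFC Allocator serves a request directly from those Chunks. When no suitable Chunk is found in the list, it has to request more storage from the physical device, create a new Chunk, place it in the list, and hang it in the corresponding Bin. The flowchart below shows the process, which involves several important sub-procedures: FindChunkPtr, which searches for the best Chunk; Extend, which requests new storage from the device when the Chunk list has no suitable Chunk; and SplitChunk, which mitigates fragmentation when a Chunk is allocated.

The code for the overall flow is as follows.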

void* BFCAllocator::AllocateRawInternal(size_t unused_alignment,
                                        size_t num_bytes,
                                        bool dump_log_on_failure,
                                        uint64 freed_before) {
  if (num_bytes == 0) {
    VLOG(2) << "tried to allocate 0 bytes";
    return nullptr;
  }
  // First, always allocate memory of at least kMinAllocationSize
  // bytes, and always allocate multiples of kMinAllocationSize bytes
  // so all memory addresses are nicely byte aligned.
  size_t rounded_bytes = RoundedBytes(num_bytes);

  // The BFC allocator tries to find the best fit first.
  BinNum bin_num = BinNumForSize(rounded_bytes);

  mutex_lock l(lock_);
  void* ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
  if (ptr != nullptr) {
    return ptr;
  }

  // Try to extend
  if (Extend(unused_alignment, rounded_bytes)) {
    ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
    if (ptr != nullptr) {
      return ptr;
    }
  }

  // We searched all bins for an existing free chunk to use and
  // couldn't find one.  This means we must have run out of memory,
  // Dump the memory log for analysis.
  if (dump_log_on_failure) {
    LOG(WARNING) << "Allocator (" << Name() << ") ran out of memory trying "
                 << "to allocate " << strings::HumanReadableNumBytes(num_bytes)
                 << ".  Current allocation summary follows.";
    DumpMemoryLog(rounded_bytes);
    LOG(WARNING) << RenderOccupancy();
  }
  return nullptr;
}

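The FindChunkPtr procedure

Because the Chunks in each Bin are sorted by size and then by base address, ascending, the search simply iterates over free_chunks in order; the first Chunk that satisfies the request is the one wanted. The procedure is simple enough that no figure is given, only the code.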

void* BFCAllocator::FindChunkPtr(BinNum bin_num, size_t rounded_bytes,
                                 size_t num_bytes, uint64 freed_before) {
  // First identify the first bin that could satisfy rounded_bytes.
  for (; bin_num < kNumBins; bin_num++) {
    // Start searching from the first bin for the smallest chunk that fits
    // rounded_bytes.
    Bin* b = BinFromIndex(bin_num);
    for (auto citer = b->free_chunks.begin(); citer != b->free_chunks.end();
         ++citer) {
      const BFCAllocator::ChunkHandle h = (*citer);
      BFCAllocator::Chunk* chunk = ChunkFromHandle(h);
      DCHECK(!chunk->in_use());
      if (freed_before > 0 && freed_before < chunk->freed_count) {
        continue;
      }
      if (chunk->size >= rounded_bytes) {
        // We found an existing chunk that fits us that wasn't in use, so remove
        // it from the free bin structure prior to using.
        RemoveFreeChunkIterFromBin(&b->free_chunks, citer);

        // If we can break the size of the chunk into two reasonably large
        // pieces, do so.  In any case don't waste more than
        // kMaxInternalFragmentation bytes on padding this alloc.
        const int64 kMaxInternalFragmentation = 128 << 20;  // 128mb
        if (chunk->size >= rounded_bytes * 2 ||
            static_cast<int64>(chunk->size) - rounded_bytes >=
                kMaxInternalFragmentation) {
          SplitChunk(h, rounded_bytes);
          chunk = ChunkFromHandle(h);  // Update chunk pointer in case it moved
        }

        // The requested size of the returned chunk is what the user
        // has allocated.
        chunk->requested_size = num_bytes;
        // Assign a unique id and increment the id counter, marking the
        // chunk as being in use.
        chunk->allocation_id = next_allocation_id_++;

        // Update stats.
        ++stats_.num_allocs;
        stats_.bytes_in_use += chunk->size;
        stats_.peak_bytes_in_use =
            std::max(stats_.peak_bytes_in_use, stats_.bytes_in_use);
        stats_.largest_alloc_size =
            std::max<std::size_t>(stats_.largest_alloc_size, chunk->size);

        VLOG(4) << "Returning: " << chunk->ptr;
        if (VLOG_IS_ON(4)) {
          LOG(INFO) << "A: " << RenderOccupancy();
        }
        return chunk->ptr;
      }
    }
  }

  return nullptr;
}
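Stripped of statistics, logging and splitting, the search above reduces to a scan over bins starting at bin_num. The sketch below isolates that best-fit scan so it can be run on its own. FindFreeChunk, the reduced Chunk struct, and the BySizeThenAddress comparator are hypothetical names for this illustration, and the freed_before check is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using ChunkHandle = std::size_t;
constexpr ChunkHandle kInvalidChunkHandle = static_cast<ChunkHandle>(-1);

struct Chunk {
  std::size_t size;
  std::uintptr_t ptr;
};

// Same ordering rule as the article's comparator: size, then address.
struct BySizeThenAddress {
  const std::vector<Chunk>* chunks;
  bool operator()(ChunkHandle a, ChunkHandle b) const {
    const Chunk& ca = (*chunks)[a];
    const Chunk& cb = (*chunks)[b];
    if (ca.size != cb.size) return ca.size < cb.size;
    return ca.ptr < cb.ptr;
  }
};
using Bin = std::set<ChunkHandle, BySizeThenAddress>;

// Starting at bin `bin_num`, return the first free chunk whose size is at
// least `rounded_bytes` and remove it from its bin. Because each bin is sorted
// by size, the first hit is the smallest chunk in that bin that fits.
inline ChunkHandle FindFreeChunk(const std::vector<Chunk>& chunks,
                                 std::vector<Bin>& bins, std::size_t bin_num,
                                 std::size_t rounded_bytes) {
  for (; bin_num < bins.size(); ++bin_num) {
    Bin& bin = bins[bin_num];
    for (auto it = bin.begin(); it != bin.end(); ++it) {
      assert(*it < chunks.size());
      if (chunks[*it].size >= rounded_bytes) {
        const ChunkHandle h = *it;
        bin.erase(it);  // the chunk is about to be handed out
        return h;
      }
    }
  }
  return kInvalidChunkHandle;  // caller would try Extend next
}
```

Falling through to larger bins when the starting bin has nothing suitable is what lets a small request be served from a large free Chunk, which is then a candidate for splitting.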

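The SplitChunk procedure

The flowchart above does not show where SplitChunk happens; it happens inside FindChunkPtr. When a Chunk is selected, the requested size may turn out to be much smaller than the Chunk's total size. Since a Chunk is either in use or free, handing out a free Chunk much larger than the request would leave it poorly utilized, which is wasteful. The BFC Allocator mitigates this by calling SplitChunk, which, as the name suggests, splits a large Chunk into two parts. Note the condition that triggers it, shown in the code below.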

// If we can break the size of the chunk into two reasonably large
// pieces, do so.  In any case don't waste more than
// kMaxInternalFragmentation bytes on padding this alloc.
const int64 kMaxInternalFragmentation = 128 << 20;  // 128mb
if (chunk->size >= rounded_bytes * 2 ||
    static_cast<int64>(chunk->size) - rounded_bytes >=
        kMaxInternalFragmentation) {
  SplitChunk(h, rounded_bytes);
  chunk = ChunkFromHandle(h);  // Update chunk pointer in case it moved
}

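As the code shows, SplitChunk is triggered when either of the following conditions holds.

1. The Chunk's size is at least twice the rounded request size (the requested size is rounded up to the minimum allocation unit)

2. The Chunk's size minus the rounded request size is still at least the maximum internal fragmentation limit (128 MB)

SplitChunk has to adjust the predecessor and successor pointers of the Chunks involved, which is ordinary linked-list manipulation. It also produces a new free Chunk, which must be inserted into the Bin that matches its size.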
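The pointer adjustments can be sketched as follows. This is an illustrative approximation, not the TensorFlow function: chunks live in a std::vector, addresses are std::uintptr_t values, and bin insertion and region bookkeeping are omitted. ShouldSplit is a hypothetical helper that restates the two trigger conditions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using ChunkHandle = std::size_t;
constexpr ChunkHandle kInvalidChunkHandle = static_cast<ChunkHandle>(-1);

struct Chunk {
  std::uintptr_t ptr = 0;
  std::size_t size = 0;
  std::int64_t allocation_id = -1;  // -1 means free
  ChunkHandle prev = kInvalidChunkHandle;
  ChunkHandle next = kInvalidChunkHandle;
};

// The two trigger conditions listed above.
inline bool ShouldSplit(std::size_t chunk_size, std::size_t rounded_bytes) {
  const std::int64_t kMaxInternalFragmentation = 128 << 20;  // 128 MB
  return chunk_size >= rounded_bytes * 2 ||
         static_cast<std::int64_t>(chunk_size) -
                 static_cast<std::int64_t>(rounded_bytes) >=
             kMaxInternalFragmentation;
}

// Keep the first `num_bytes` of chunk `h` and turn the remainder into a new
// free chunk linked directly after it. Returns the new chunk's handle.
inline ChunkHandle SplitChunk(std::vector<Chunk>& chunks, ChunkHandle h,
                              std::size_t num_bytes) {
  assert(num_bytes < chunks[h].size);
  chunks.emplace_back();  // may reallocate, so take references afterwards
  const ChunkHandle h_new = chunks.size() - 1;
  Chunk& c = chunks[h];
  Chunk& n = chunks[h_new];

  n.ptr = c.ptr + num_bytes;    // remainder starts right after the kept part
  n.size = c.size - num_bytes;
  c.size = num_bytes;

  n.prev = h;                   // splice the new chunk in after h
  n.next = c.next;
  c.next = h_new;
  if (n.next != kInvalidChunkHandle) chunks[n.next].prev = h_new;
  return h_new;                 // caller would insert it into the right bin
}
```

Splitting a 1024-byte Chunk at 0x1000 for a 256-byte request leaves a 256-byte Chunk at 0x1000 followed by a free 768-byte Chunk at 0x1100.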

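The Extend procedure

As the flowchart above shows, Extend is called only when no suitable Chunk is found in the doubly linked list. A call means that the existing pool can no longer satisfy the request: storage must be requested from the physical device, and a new Chunk created and placed in a Bin. If the request to the device fails because it is too large, the requested size is backed off by a factor of 0.9 and retried, as the snippet below shows. Once the request succeeds, it must be recorded in region_manager.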

// Try allocating.
size_t bytes = std::min(curr_region_allocation_bytes_, available_bytes);
void* mem_addr = sub_allocator_->Alloc(alignment, bytes);
if (mem_addr == nullptr && !started_backpedal_) {
  // Only backpedal once.
  started_backpedal_ = true;

  static constexpr float kBackpedalFactor = 0.9;

  // Try allocating less memory.
  while (mem_addr == nullptr) {
    bytes = RoundedBytes(bytes * kBackpedalFactor);
    if (bytes < rounded_bytes) break;
    mem_addr = sub_allocator_->Alloc(alignment, bytes);
  }
}
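The back-off loop can be exercised without a device by replacing the sub-allocator with a callback. In the sketch below, AllocWithBackpedal and try_alloc are hypothetical names, the factor is held in a double rather than a float, and the extra guard that stops when rounding prevents the size from shrinking is an addition of this sketch, not something shown in the snippet above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

constexpr std::size_t kMinAllocationSize = 256;

// Round up to a multiple of 256 bytes.
inline std::size_t RoundedBytes(std::size_t bytes) {
  return kMinAllocationSize *
         ((bytes + kMinAllocationSize - 1) / kMinAllocationSize);
}

// Try to obtain `bytes`; on failure shrink the request by 10% per attempt
// until it succeeds or drops below `rounded_bytes`, the size actually needed.
// `try_alloc` is a hypothetical stand-in for sub_allocator_->Alloc and
// returns true when a request of that size would succeed.
// Returns the size finally obtained, or 0 if every attempt failed.
inline std::size_t AllocWithBackpedal(
    std::size_t bytes, std::size_t rounded_bytes,
    const std::function<bool(std::size_t)>& try_alloc) {
  assert(rounded_bytes > 0);
  if (try_alloc(bytes)) return bytes;

  constexpr double kBackpedalFactor = 0.9;
  while (true) {
    const std::size_t next =
        RoundedBytes(static_cast<std::size_t>(bytes * kBackpedalFactor));
    if (next >= bytes) return 0;        // guard added in this sketch: rounding
                                        // stopped the size from shrinking
    bytes = next;
    if (bytes < rounded_bytes) return 0;  // too small to be useful
    if (try_alloc(bytes)) return bytes;
  }
}
```

With a callback that accepts at most 3400 bytes, a 4096-byte request is retried at 3840, 3584 and then 3328 bytes, where it succeeds.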

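Deallocate flow

On release only the start address of the storage is known, not the Chunk it belongs to, so the Chunk handle is first obtained through helpers such as region_manager, and then the predecessor and successor are checked to see whether they can be merged. The overall flow is shown below. Merge is simply a linked-list merge, so it is not elaborated here.

The corresponding code is shown below.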

void BFCAllocator::FreeAndMaybeCoalesce(BFCAllocator::ChunkHandle h) {
  Chunk* c = ChunkFromHandle(h);
  CHECK(c->in_use() && (c->bin_num == kInvalidBinNum));

  // Mark the chunk as no longer in use.
  c->allocation_id = -1;

  // Optionally record the free time.
  if (timing_counter_) {
    c->freed_count = timing_counter_->next();
  }

  // Updates the stats.
  stats_.bytes_in_use -= c->size;

  ChunkHandle coalesced_chunk = h;

  // If the next chunk is free, merge it into c and delete it.
  if (c->next != kInvalidChunkHandle && !ChunkFromHandle(c->next)->in_use()) {
    // VLOG(8) << "Merging c->next " << ChunkFromHandle(c->next)->ptr
    //         << " with c " << c->ptr;
    RemoveFreeChunkFromBin(c->next);
    Merge(h, c->next);
  }

  // If the previous chunk is free, merge c into it and delete c.
  if (c->prev != kInvalidChunkHandle && !ChunkFromHandle(c->prev)->in_use()) {
    // VLOG(8) << "Merging c " << c->ptr << " into c->prev "
    //         << ChunkFromHandle(c->prev)->ptr;

    coalesced_chunk = c->prev;
    RemoveFreeChunkFromBin(c->prev);
    Merge(c->prev, h);
  }

  InsertFreeChunkIntoBin(coalesced_chunk);
}
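Since Merge itself is not shown, here is one way the merge step and the surrounding coalescing logic might look on a vector-backed list. It is a sketch under assumptions: bins, statistics and handle recycling are omitted, the Chunk struct is reduced, and resetting the absorbed slot stands in for whatever the real allocator does to delete a Chunk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using ChunkHandle = std::size_t;
constexpr ChunkHandle kInvalidChunkHandle = static_cast<ChunkHandle>(-1);

struct Chunk {
  std::uintptr_t ptr = 0;
  std::size_t size = 0;
  std::int64_t allocation_id = -1;  // -1 means free
  ChunkHandle prev = kInvalidChunkHandle;
  ChunkHandle next = kInvalidChunkHandle;
  bool in_use() const { return allocation_id != -1; }
};

// Absorb chunk h2 into its immediate predecessor h1. Both must be free.
inline void Merge(std::vector<Chunk>& chunks, ChunkHandle h1, ChunkHandle h2) {
  Chunk& c1 = chunks[h1];
  Chunk& c2 = chunks[h2];
  assert(!c1.in_use() && !c2.in_use());
  assert(c1.next == h2 && c2.prev == h1);

  c1.next = c2.next;  // unlink c2 from the list
  if (c2.next != kInvalidChunkHandle) chunks[c2.next].prev = h1;
  c1.size += c2.size; // c1 now covers both address ranges
  c2 = Chunk();       // sketch only: clear the absorbed slot
}

// Mark h as free and merge it with free neighbours, mirroring the control
// flow of the article's FreeAndMaybeCoalesce. Returns the surviving handle.
inline ChunkHandle FreeAndMaybeCoalesce(std::vector<Chunk>& chunks,
                                        ChunkHandle h) {
  chunks[h].allocation_id = -1;
  ChunkHandle coalesced = h;

  const ChunkHandle next = chunks[h].next;
  if (next != kInvalidChunkHandle && !chunks[next].in_use()) {
    Merge(chunks, h, next);
  }
  const ChunkHandle prev = chunks[h].prev;
  if (prev != kInvalidChunkHandle && !chunks[prev].in_use()) {
    coalesced = prev;
    Merge(chunks, prev, h);
  }
  return coalesced;  // caller would insert this chunk into the right bin
}
```

Releasing a 512-byte Chunk that sits between two free 256-byte Chunks leaves a single free 1024-byte Chunk at the lowest of the three addresses.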

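Allow Growth

This is an option that controls the Allocator. It defaults to False, in which case the Allocator allocates as much storage on the device as it can, and does so once, globally. Since all device storage has already been allocated, failing to find a suitable Chunk in the doubly linked list results directly in an OOM error and exit. When the option is True, storage is allocated in multiple rounds, depending entirely on whether the pool still has a Chunk of the required size. If it does not, the Allocator keeps trying allocations whose base size is a power of two until the request is satisfied. What is this option for? It matters when the same Device may be shared by multiple programs. On cloud infrastructure, for example, enabling Device sharing together with space-division multiplexing of the Device can greatly improve cluster resource utilization.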
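One plausible reading of the growth behaviour described above is a region size that doubles from a small starting value. The sketch below models only that sizing rule. RegionSizer, its method, and the initial size passed by the caller are hypothetical, and clamping to the memory actually available on the device is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper that tracks how large the next region request should be.
class RegionSizer {
 public:
  // With allow_growth the first region is small; without it the first (and
  // only) region covers all of `total_memory`.
  RegionSizer(std::size_t total_memory, bool allow_growth,
              std::size_t initial_bytes)
      : next_region_bytes_(allow_growth ? std::min(total_memory, initial_bytes)
                                        : total_memory) {
    assert(next_region_bytes_ > 0);
  }

  // Size of the region to request for an allocation of `rounded_bytes`.
  // The size doubles until it covers the request; if no doubling was needed,
  // it doubles once afterwards so the following region is larger.
  std::size_t NextRegionBytes(std::size_t rounded_bytes) {
    bool increased = false;
    while (rounded_bytes > next_region_bytes_) {
      next_region_bytes_ *= 2;
      increased = true;
    }
    const std::size_t bytes = next_region_bytes_;
    if (!increased) next_region_bytes_ *= 2;
    return bytes;
  }

 private:
  std::size_t next_region_bytes_;
};
```

Starting from 1 MiB with growth enabled, two small requests would obtain regions of 1 MiB and 2 MiB, and a following 5 MiB request would obtain an 8 MiB region. With growth disabled the first request obtains the whole device size.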

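Summary

This article covered TensorFlow's memory manager, the BFC Allocator. Its design comes from the classic dlmalloc allocation algorithm and is a simplified implementation of best-fit with coalescing. The BFC Allocator exists to handle TensorFlow's frequent allocation and release of storage: it allocates storage from the physical device in advance, wraps the free storage in Chunks organized as an ordered doubly linked list, and uses Bins as an index to speed up Chunk lookup, which yields an efficient allocation algorithm. During allocation the Chunk list may contain no free Chunk that fits, in which case new storage may need to be requested from the device; this extends the Chunk list and corresponds to Extend. Allocating by Chunk inevitably risks fragmentation, and the BFC Allocator addresses it with SplitChunk and Merge. The BFC Allocator is one of the more compact parts of the TensorFlow codebase: the code is not difficult, the module is fairly self-contained, and the amount of code is small, yet the design and functionality are comprehensive, which makes it well suited for beginners to read and learn from.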
