ARM cache coherency

This article is a repost. 2018-02-13 14:24, category: SoC/architecture

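There are two ways to raise the performance of a system:

　　1) Keep raising the performance of a single core, by increasing the clock frequency and lowering Vt. Both increase power consumption (dynamic and leakage).

　　2) Increase the number of processors.

ARM's big.LITTLE processor clusters take the second approach, and use power gating and DVFS to keep power consumption as low as possible.

The price of a multiprocessor design, however, is the cache coherence problem.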

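Within a cluster, ARM uses its MPCore multi-core coherency technology:

　　1) It implements a MESI-based cache coherency protocol with some added features:

　　　　direct cache-to-cache copy of clean data, and direct cache-to-cache move of dirty data within the cluster,

　　　　neither of which requires a write-back to main memory.

　　2) It includes the SCU (Snoop Control Unit), which keeps a copy of all the L1 data cache tags and acts as a directory, cutting the bus bandwidth that broadcasts would otherwise waste.

　　3) MPCore technology supports an optional ACP (Accelerator Coherency Port), through which an accelerator can read and write the caches inside the processor cluster.

　　　　The processors, however, cannot access the accelerator's own cache, and coherency with that cache is not guaranteed.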
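The directory role of the SCU described in item 2) can be sketched as a small duplicate-tag table. This is only a toy model under stated assumptions: the class `SnoopControlUnit`, its method names and the 64-byte line size are invented for illustration and do not describe the real SCU interface.

```python
from collections import defaultdict


class SnoopControlUnit:
    """Toy model of an SCU that mirrors the L1 data cache tags of each core.

    Hypothetical illustration only; names and structure are assumptions.
    """

    def __init__(self, num_cores, line_bytes=64):
        self.num_cores = num_cores
        self.line_bytes = line_bytes
        # line address -> set of core ids whose L1 currently holds that line
        self.tags = defaultdict(set)

    def line_of(self, addr):
        """Map a byte address to its cache line number."""
        return addr // self.line_bytes

    def record_fill(self, core, addr):
        """Note that `core` now holds the line containing `addr`."""
        self.tags[self.line_of(addr)].add(core)

    def record_evict(self, core, addr):
        """Note that `core` no longer holds the line containing `addr`."""
        self.tags[self.line_of(addr)].discard(core)

    def cores_to_snoop(self, requester, addr):
        """Cores that must be snooped; an empty list means go to L2 or memory."""
        holders = self.tags.get(self.line_of(addr), set())
        return sorted(holders - {requester})


scu = SnoopControlUnit(num_cores=4)
scu.record_fill(core=0, addr=0x1000)
scu.record_fill(core=2, addr=0x1020)  # same 64-byte line as 0x1000

# With the tag copy, core 1's request only disturbs the cores that hold the line.
targets = scu.cores_to_snoop(requester=1, addr=0x1008)
# Without it, every other core would have to be snooped.
broadcast = [c for c in range(scu.num_cores) if c != 1]
print(targets, broadcast)
```

In this sketch core 3 is never snooped, which is the bandwidth saving the text attributes to keeping the L1 tags in one place.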

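Between clusters, coherency is provided by the AMBA 4 ACE protocol (AXI Coherency Extensions).

　　1) ACE and ACE-Lite introduce system-level coherency, cache maintenance, DVM (Distributed Virtual Memory) and barrier transaction support.

　　2) ACE itself supports the five-state MOESI cache coherency model. A master may implement MESI, MOESI, MEI and so on; all of them are compatible.

　　3) ACE has to be used with a system interconnect designed for it, which handles all shared transactions.

　　　　On receiving a transaction from a master, the interconnect may issue a speculative read, or it may wait for the snoop response first.

　　　　The interconnect may contain a directory or a snoop filter, or it may broadcast snoops to all masters.

　　4) The system-level coherency that ACE supports covers all masters, including GPUs, DMA engines and dissimilar CPUs.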

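The AMBA roadmap:

　　1) AXI4 adds support for long bursts and drops support for write interleaving.

　　2) AXI4-Stream targets bulk data transfer. It is a point-to-point protocol with no address channel, only a data channel.

　　3) AXI4-Lite is a simplified AXI4, mainly used to upgrade peripherals that would otherwise sit on APB.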

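Software-Based Coherency Approach:

Cache coherency can also be handled in software, as it was in earlier systems with a single processor and just a small L1 cache.

　　Today's SoCs, however, are multiprocessor designs with L2 and L3 caches and further caching masters such as GPUs.

　　　　A software-only solution is now rarely feasible: it is too hard to get right and performs poorly.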

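Hardware-Based Coherency Approaches:

　　1) Snooping cache coherency protocols: every master is “listening in” on all shared-data transactions.

　　　　　On a read, the address is presented to all processors and each checks whether its own cache holds it. A processor that does responds directly with an ack, and main memory is not accessed.

　　　　　On a write, the address is presented to all processors and each checks whether its own cache holds a copy of it. Any copy found must be invalidated.

　　　　Coherency traffic is high with this scheme, on the order of N(N-1) for N processors, because every transaction is broadcast to all the others. As the number of processors grows,

　　　　　　efficiency keeps dropping.

　　2) Directory-based cache coherency protocols: the system has a single directory that holds a list of the cache lines present in the system.

　　　　When a master issues a transaction, the directory is looked up first, and coherency traffic is then directed only to the masters concerned, which reduces coherency traffic.

　　　　In the best case the traffic is 2N; in the worst case it is N*((N-1)+1) = N^2, because the directory still has to be checked first.

　　　　This scheme needs a large on-chip RAM. Placing the directory off-chip instead adds to system latency.

　　In practice some optimizations are possible. A snoop-based system, for example, can add snoop filters to reduce coherency traffic.

　　ACE supports snooping, directory-based and even hybrid protocols.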
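The traffic figures quoted above, N(N-1) for broadcast snooping and 2N to N((N-1)+1) for a directory, can be tabulated with a few lines of arithmetic. The function names are made up for this sketch, and the counts assume that each of the N masters issues one shared transaction.

```python
def snoop_broadcast_traffic(n):
    """Each of n masters broadcasts to the other n - 1 masters: N(N-1)."""
    return n * (n - 1)


def directory_traffic_best(n):
    """One directory lookup plus one directed message per master: 2N."""
    return 2 * n


def directory_traffic_worst(n):
    """Directory lookup plus snoops to all n - 1 other masters: N((N-1)+1)."""
    return n * ((n - 1) + 1)


# Message counts for a few system sizes, keyed by the number of masters.
table = {
    n: (
        snoop_broadcast_traffic(n),
        directory_traffic_best(n),
        directory_traffic_worst(n),
    )
    for n in (2, 4, 8, 16)
}

for n, (snoop, best, worst) in table.items():
    print(f"N={n:2d}  snoop={snoop:3d}  directory best={best:3d}  worst={worst:3d}")
```

For two masters the directory lookup costs more than it saves; the best-case advantage only appears as N grows, which matches the scaling argument in the text.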

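ACE adds three channels to AXI for sending and receiving coherency transactions (the snoop address, snoop response and snoop data channels: AC, CR and CD),

　　and adds new signals to the existing channels:

　　　　ARSNOOP and AWSNOOP, which indicate the type of snoop transaction for shareable transactions;

　　　　ARBAR and AWBAR, which are used for barrier signaling.

ACE-Lite adds the new signals to AXI but not the new channels.

　　An ACE-Lite master can snoop other ACE-compliant masters,

　　but cannot itself be snooped.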

Take the CCI-400 interconnect as an example: it supports two CPU clusters and three ACE-Lite I/O-coherent masters.

ACE introduces many new transactions, which are usually grouped by memory attribute.

ACE-Lite I/O coherency: an ACE-Lite master can issue three groups of transactions, Non-shared, Non-cached and Cache Maintenance,

　　　　which let uncached masters snoop ACE coherent masters.

　　　　For example, a Gigabit Ethernet controller can directly read and write cached data that it shares with the CPU.

DVM (Distributed Virtual Memory) keeps the TLBs inside the MMUs consistent. It supports TLB invalidation, branch predictor invalidation and instruction cache invalidation.

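Cache coherence basics:

　　The main goal of cache coherence is to make the multiple caches of a multicore system behave as they would in a single-core system.

　　Cache coherence can be defined by the single-writer-multiple-reader (SWMR) invariant: although a memory location A may have several copies, at any given

　　　　logical time either at most one core is writing A, or any number of cores are reading A.

　　The coherence granularity is normally defined as the size of a cache line.

　　After a write, reads of the same address must be held off until every cache has returned an acknowledgement (ack) indicating that its copy has been invalidated or updated.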
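The SWMR definition above can be stated as a small predicate over one address in one logical time slot. The representation is an assumption made for this sketch: an epoch is a dict mapping a core id to `"R"` or `"W"`.

```python
def swmr_holds(epoch):
    """Check the single-writer-multiple-reader rule for one address.

    `epoch` maps core id -> "R" (reading) or "W" (writing) during one
    logical time slot. This encoding is a hypothetical choice for the sketch.
    """
    writers = [core for core, mode in epoch.items() if mode == "W"]
    readers = [core for core, mode in epoch.items() if mode == "R"]
    if writers:
        # A writer must be alone: no second writer and no concurrent reader.
        return len(writers) == 1 and not readers
    # With no writer, any number of readers is allowed.
    return True


timeline = [
    {0: "R", 1: "R", 2: "R"},  # read-only epoch with several readers
    {1: "W"},                  # read-write epoch with a single writer
    {0: "R", 1: "W"},          # a reader overlapping a writer
]
results = [swmr_holds(epoch) for epoch in timeline]
print(results)  # [True, True, False]
```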

In the memory system, the cache controller issues coherence requests and receives coherence responses,

　　while the memory controller receives coherence requests and issues coherence responses.

　　The two are connected through the interconnect.

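There are two kinds of coherence protocol, snooping and directory. Their transactions and actions differ, but their stable states are the same.

　　1) Stable states: many coherence protocols use a subset of the MOESI model.

　　　　M (Modified): the cache line is valid, exclusive and owned, and may be dirty.

　　　　S (Shared): the cache line is valid, but not exclusive, not dirty and not owned.

　　　　I (Invalid): the cache line is invalid, that is, it can be neither read nor written.

　　　　MSI is the most basic set of protocol states. Two further states, O and E, extend it.

　　　　O (Owned): the cache line is valid and owned, but not exclusive, and it may be dirty; the data in main memory is likely to be stale.

　　　　E (Exclusive): the cache line is valid, exclusive and clean.

　　2) Common transactions issued by the cache controller:

　　3) Common requests from the core to the cache controller:

　　4) A snooping protocol broadcasts a request message to all coherence controllers. The order in which these requests reach each core is not necessarily fixed; it depends on the interconnect implementation.

　　　　A directory protocol unicasts the request to a specific cache controller or memory controller.

　　　　Snooping is structurally simple, but does not scale easily to large numbers of cores.

　　　　A directory scales to large numbers of cores, but adds latency to each coherence request.

　　5) What the coherence protocol does when a core writes a cache line falls into two categories, invalidation and update. This choice is independent of

　　　　whether the protocol is snooping or directory.

　　　　Invalidation: when a core writes a cache line, all other cached copies are marked invalid.

　　　　Update: when a core writes a cache line, all other cached copies are updated to the new value.

　　　　　　Update is rarely used in practice, because update operations take up considerable bus bandwidth, and because the approach

　　　　　　complicates the memory consistency model: with atomic operations, the case where several caches update the same cached data

　　　　　　becomes very complicated.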
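To make the stable states and the invalidation policy concrete, here is a minimal sketch of an invalidation-based MSI protocol for a single cache line. It is a toy model, not any ARM implementation: the class `MsiLine` and its methods are invented for illustration, the line holds one integer, and the O and E states are left out.

```python
class MsiLine:
    """Toy invalidation-based MSI model of one cache line shared by N cores."""

    def __init__(self, num_cores, initial=0):
        self.memory = initial              # value held in main memory
        self.state = ["I"] * num_cores     # per-core stable state: M, S or I
        self.data = [None] * num_cores     # per-core cached copy

    def _write_back(self):
        """If some core holds the line in M, copy its data to memory."""
        owner = next((c for c, s in enumerate(self.state) if s == "M"), None)
        if owner is not None:
            self.memory = self.data[owner]
        return owner

    def read(self, core):
        if self.state[core] == "I":
            owner = self._write_back()
            if owner is not None:
                self.state[owner] = "S"    # M -> S when another core reads
            self.data[core] = self.memory
            self.state[core] = "S"
        return self.data[core]

    def write(self, core, value):
        if self.state[core] != "M":
            self._write_back()
            # Invalidation policy: every other copy becomes Invalid.
            for other in range(len(self.state)):
                if other != core:
                    self.state[other] = "I"
                    self.data[other] = None
            self.state[core] = "M"
        self.data[core] = value


line = MsiLine(num_cores=3, initial=7)
line.read(0)
line.read(1)                       # cores 0 and 1 now share the line
line.write(1, 42)                  # core 1 takes M, core 0 is invalidated
after_write = list(line.state)
value_seen = line.read(2)          # core 1 writes back and drops to S
print(after_write, value_seen, line.state)
```

An update-based variant would replace the invalidation loop with a loop that copies the new value into every other valid copy, which is where the extra bus traffic mentioned above comes from.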

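The structure between cache and MMU:

　　By operating principle, caches come in several kinds: physically indexed physically tagged (PIPT), virtually indexed virtually tagged (VIVT), virtually indexed physically tagged (VIPT), and so on.

　　1) Physically indexed, physically tagged: the cache works on physical addresses only. This is simple and blunt, and free of ambiguity.

　　　　Drawback: in a multi-process operating system, each process's instructions and data live at virtual addresses, and every memory access the CPU issues carries a virtual address,

　　　　　　　so each memory access has to wait for the MMU to translate the virtual address into a physical one, which adds latency.

　　2) Virtually indexed, virtually tagged: the cache is addressed purely by virtual address. Because the same virtual address in different processes can map to different physical addresses, every line

　　　　has to carry a process identifier on top of its tag to tell the processes apart. Shared memory is a further problem: several virtual addresses map to one physical address, since the shared region sits at a different virtual address in each process, so keeping the copies in sync is hard.

　　　　　　The resulting structure is too complex.

　　3) Virtually indexed, physically tagged is the most widely used today. "Virtually indexed" means that when the CPU issues an address, the low-order bits are matched against the cache index

　　　　(the low-order bits are normally the page offset, which is identical in the virtual and the physical address).

　　　　"Physically tagged" means that the high-order bits of the virtual address are translated by the MMU through the page tables to obtain the physical page address, which is then compared with the tag.

　　　　The index lookup and the MMU translation can therefore run in parallel.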
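The parallel lookup in a VIPT cache relies on the index bits lying inside the page offset. A minimal sketch of that check follows; the function names, the 4 KB page size and the cache geometries are assumptions picked for illustration, not figures for a particular ARM core.

```python
def vipt_layout(cache_bytes, ways, line_bytes, page_bytes=4096):
    """Bit layout of a set-associative cache (all sizes are powers of two)."""
    sets = cache_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1       # log2(line size)
    index_bits = sets.bit_length() - 1              # log2(number of sets)
    page_offset_bits = page_bytes.bit_length() - 1  # log2(page size)
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        # True when every index bit lies inside the page offset, so the
        # virtual and the physical address select the same set.
        "index_within_page_offset": offset_bits + index_bits <= page_offset_bits,
    }


def set_index(addr, layout):
    """Extract the set index from an address (virtual or physical)."""
    return (addr >> layout["offset_bits"]) & (layout["sets"] - 1)


small = vipt_layout(cache_bytes=16 * 1024, ways=4, line_bytes=64)
large = vipt_layout(cache_bytes=32 * 1024, ways=2, line_bytes=64)

# Hypothetical virtual/physical pair that shares the low 12 bits (0xABC).
va, pa = 0x12345ABC, 0x0009BABC
same_set = set_index(va, small) == set_index(pa, small)
print(small, large, same_set)
```

In the 16 KB example the six index bits plus six offset bits fit exactly in the 12-bit page offset; the 32 KB two-way example needs 14 bits, so part of its index would come from translated address bits.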

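ARM MPCore cache structure: the L1 cache normally sits inside the processor and is split into an L1 data cache and an L1 instruction cache (8 KB to 64 KB).

　　The L1 instruction cache provides dynamic branch prediction in addition to instruction caching.

　　　　Some operations that use the PC as the destination register, the BXJ instruction and return-from-exception instructions are not predicted.

　　　　It is usually 2-way set associative with 64-byte cache lines.

　　The L1 data cache is a physically indexed, physically tagged cache.

　　　　It contains an internal exclusive monitor that holds the list of currently valid exclusive accesses, so it can return EXOKAY directly.

　　　　It can generate ACE transactions and CHI transactions.

　　　　It is usually 4-way set associative with 64-byte cache lines.

　　The L2 memory system consists of an integrated SCU (connected to the 4 cores of a cluster) and an L2 cache (128 KB to 2 MB).

　　　　The SCU holds the L1 data cache tags to maintain coherency among the 4 cores.

　　　　The L2 cache does not support snoop hardware operations for keeping caches coherent; it can be configured to connect to main memory through either ACE or CHI.

　　　　It is a physically indexed, physically tagged cache, 8-way to 16-way.

　　　　The SCU supports direct cache-to-cache transfer, lets dirty cache lines be moved between cores, and has a built-in tag filter so that coherent requests are sent only

　　　　　　to the cores concerned.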
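The internal exclusive monitor mentioned for the L1 data cache can be pictured with a heavily simplified model. Everything here is an assumption made for illustration: the class `ExclusiveMonitor`, one reservation per core, and exact-address matching. Real monitors track reservation granules and have more clearing conditions.

```python
class ExclusiveMonitor:
    """Toy exclusive monitor that tracks one reserved address per core."""

    def __init__(self):
        self.reservation = {}  # core id -> reserved address

    def load_exclusive(self, core, addr):
        """An exclusive load records a reservation for `core`."""
        self.reservation[core] = addr

    def store(self, core, addr):
        """A store by `core` clears other cores' reservations on `addr`."""
        for other, reserved in list(self.reservation.items()):
            if other != core and reserved == addr:
                del self.reservation[other]

    def store_exclusive(self, core, addr):
        """Return "EXOKAY" if the reservation survived, otherwise "OKAY"."""
        if self.reservation.get(core) != addr:
            return "OKAY"      # exclusive store failed, nothing is written
        del self.reservation[core]
        self.store(core, addr) # the successful store clears the other cores
        return "EXOKAY"


monitor = ExclusiveMonitor()
monitor.load_exclusive(core=0, addr=0x80)
monitor.load_exclusive(core=1, addr=0x80)
first = monitor.store_exclusive(core=0, addr=0x80)
second = monitor.store_exclusive(core=1, addr=0x80)
print(first, second)
```

Core 1 loses its reservation when core 0's exclusive store succeeds, so it has to retry its load-exclusive and store-exclusive sequence.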
