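Author's note:
Given my limited ability and time, this article surely contains errors. Corrections are welcome at yixiangrong@hotmail.com, and I look forward to the discussion. Most of the content is original, and anything copied is attributed to its source (please point out any omissions), so please credit the source when reposting: http://www.cnblogs.com/e-shannon/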
http://www.cnblogs.com/e-shannon/p/7495618.html
Related material: http://bbs.eetop.cn/thread-636542-1-1.html
Contents
1 Preface
1.1 Purpose
1.2 References
1.3 Glossary
2 CAPI overview
2.1 Background
2.1.1 Industry background
2.1.2 Technical background and open bus interfaces
2.2 Cache
2.2.1 A brief look at caches
2.2.2 Cache access modes
2.2.3 Cache mapping schemes and the cache line
2.3 Cache Coherency
2.4 The cache coherency system of the Power CPU
3 CAPI structure and flow in detail
3.1 CAPI hardware structure
3.1.1 CAPP
3.1.2 PSL
3.1.3 AFU
3.2 PSL accelerator interface
3.3 How CAPI works
3.3.1 The CAPI flow
3.3.2 CAPI application flow
3.4 Setting up a CAPI simulation platform
3.4.1 Simulation principle and model
3.4.2 Simulation steps
3.5 Advantages of CAPI
3.5.1 Advantages over PCIe I/O acceleration
3.5.2 Advantages over CPU+GPU
3.5.3 Disadvantages
4 Open coherent accelerator interfaces
4.1 OpenCAPI
4.1.1 DL
4.1.2 TL (to be continued)
4.2 Comparison of OpenCAPI and CAPI
4.3 Self Q&A
4.4 Further reading (optional)
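1 Preface
1.1 Purpose
This is a preliminary study of how CAPI acceleration works: it aims to understand cache coherency and to compare the advantages and disadvantages of CAPI against ordinary PCIe accelerator devices. It partially summarizes the use of CAPI 1.0 and briefly covers the current state of CAPI, related websites, and a comparison with 2.0. It also briefly introduces three new open high-speed coherent CPU interfaces (CCIX, Gen-Z, OpenCAPI).
The principles of CAPI covered here include the CAPI 2.0 bus interface, the flow, and the simulation steps (possibly noting the history and my own detours).
To serve accelerators, the industry is defining open standards for a high-performance coherent CPU interface (high performance coherence interface). Three open standards, OpenCAPI, Gen-Z, and CCIX, appeared in 2016, and this article touches on them briefly.
I call this a preliminary study because it lacks a software analysis of CAPI, for example how exactly the I/O overhead is reduced, and it offers no performance comparison to back up the advantage over I/O acceleration. In particular, I have no measurements of my own for the benefit of cache coherency, although I do quote Power's own data.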
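CAPI stands for Coherent Accelerator Processor Interface. As an important acceleration feature of the Power processor architecture, it offers users a customizable, efficient, and easy-to-use hardware acceleration solution that offloads the CPU; it is implemented on an FPGA. In the POWER8 era, the CAPI PSL (an encrypted IP core) was implemented on Altera FPGAs; after Intel acquired Altera, it became an IP core on Xilinx devices. Readers need to look up the PSL resource usage themselves; the material I have covers the resource usage of CAPI 1.0 on Altera. Because CAPI 2.0 shares its basic principles with 1.0, and I have mainly worked with 1.0, CAPI in this article means 1.0 unless stated otherwise[dream1].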
Limited by time and resources, I have not studied OpenCAPI in depth either.
1.2 References
1) <OpenPOWER_CAPI_Education_Intro_Latest.ppt>
2) <CCIX,Gen-Z,OpenCAPI_Overview&Comparison.pdf>
3) <OpenPOWER and the Roadmap Ahead.pdf>
4) Web sources
https://openpowerfoundation.org/?resource_lib=psl-afu-interface-capi-2-0
http://www.csdn.net/article/2015-06-17/2824990
http://www.openhw.org/module/forum/thread-597651-1-1.html
www-304.ibm.com/webapp/set2/sas/f/capi/CAPI_POWER8.pdf
5) <POWER9-VUG.pdf>
https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/61ad9cf2-c6a3-4d2c-b779-61ff0266d32a/page/1cb956e8-4160-4bea-a956-e51490c2b920/attachment/56cea2a9-a574-4fbb-8b2c-675432367250/media/POWER9-VUG.pdf
1.3 Glossary
CAPI : Coherent Accelerator Processor Interface
POWER: Performance Optimization With Enhanced RISC
HDK: Hardware development kit
SDK: Software development kit
CCIX: Cache Coherent Interconnect for Accelerators. www.ccixconsortium.com
OpenCAPI: Open Coherent Accelerator Processor Interface. opencapi.org
Gen-Z: genzconsortium.org
LRU: Least Recently Used
HPC: High Performance Computing
DMI: Durable Memory interface (OpenPOWER and the Roadmap Ahead.pdf)
QPI: The Intel QuickPath Interconnect (QPI) is a point-to-point processor interconnect developed by Intel which replaced the front-side bus (FSB) in Xeon, Itanium, and certain desktop platforms starting in 2008 (Wikipedia). It competes with AMD's HyperTransport (HT).
https://jingyan.baidu.com/article/6525d4b11f2c2bac7d2e943e.html
SMP: Symmetric Multi-Processor, a UMA architecture in which multiple CPUs share all resources. The POWER architecture uses SMP[dream2].
NUMA: Non-Uniform Memory Access. In contrast to SMP, the CPUs are divided into groups, and access to local memory is faster than access to remote memory, hence "Non-Uniform". The trend in hardware has been towards more than one system bus, each serving a small set of processors. Each group of processors has its own memory and possibly its own I/O channels. However, each CPU can access memory associated with the other groups in a coherent way. Each group is called a NUMA node. The number of CPUs within a NUMA node depends on the hardware vendor. It is faster to access local memory than the memory associated with other NUMA nodes. This is the reason for the name, non-uniform memory access architecture.
http://www.cnblogs.com/yubo/archive/2010/04/23/1718810.html
https://technet.microsoft.com/en-us/library/ms178144(v=sql.105).aspx
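MPP: Massively Parallel Processing. The system consists of multiple groups of SMP CPUs; one group cannot access another group's memory, the groups are interconnected through network nodes, and the system can be expanded without limit[dream3].
Differences between NUMA and MPP
http://www.cnblogs.com/yubo/archive/2010/04/23/1718810.html
Architecturally, NUMA and MPP have much in common: both consist of multiple nodes, each node has its own CPUs, memory, and I/O, and the nodes exchange information through a node interconnect. So where do they differ? The differences become clear once the internal architecture and operation of NUMA and MPP servers are examined.
The first difference is the node interconnect. In NUMA, the interconnect is implemented inside a single physical server, and a CPU that needs to access remote memory must wait; this is the main reason a NUMA server cannot scale performance linearly as CPUs are added. In MPP, the interconnect is implemented outside the individual SMP servers, through I/O. Each node accesses only its local memory and storage, and the exchange of information between nodes proceeds in parallel with the processing inside each node. As a result, MPP performance can scale almost linearly as nodes are added.
The second difference is the memory access mechanism. Inside a NUMA server, any CPU can access the memory of the entire system, but remote accesses are far slower than local ones, so applications should avoid remote memory accesses as much as possible. In an MPP server, each node accesses only its local memory, so the problem of remote memory access does not arise.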
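The cost of remote accesses on NUMA described above can be made concrete with a simple weighted-average model. The following is only a minimal sketch: the function name and the 100 ns / 300 ns latencies are placeholders chosen for illustration, not measurements from any real machine.

```python
def average_access_ns(local_fraction, local_ns=100.0, remote_ns=300.0):
    """Expected latency of one memory access on a NUMA node.

    local_fraction is the share of accesses served by the node's own memory;
    the remaining accesses go to another node. The default latencies are
    made-up placeholder values, not measured data.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    # Show how the average latency grows as fewer accesses stay local.
    for fraction in (1.0, 0.9, 0.5, 0.0):
        print(f"local fraction {fraction:.1f}: {average_access_ns(fraction):.1f} ns")
```

Under these assumed numbers, the average latency rises steadily as the local fraction falls, which is why NUMA-aware applications try to keep data on the node that uses it. In the MPP case there is no remote term at all, because a node never addresses another node's memory directly.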
ISA: Instruction Set Architecture
CAIA : Coherent Accelerator Interface Architecture defines a coherent accelerator interface structure for coherently attaching accelerators to the POWER systems using a standard PCIe bus. The intent is to allow implementation of a wide range of accelerators in order to optimally address many different market segments.
CAPP : Coherent Accelerator Processor Proxy
Design unit that snoops the PowerBus commands and provides coherency responses reflecting the state of the caches in PSL. Issues commands to PSL so that it can provide data responses.
PSL : Power Service Layer
The PSL provides the address translation and system memory cache for the AFUs. In addition, the PSL provides miscellaneous facilities for the host processor to manage the virtualization of the AFUs, interrupts, and memory management.
AFU : Accelerator Function Unit
Effective Address (EA) / Real Address (RA): see Power ISA Book III
The AFU uses Effective Addressing, which is the process’s address space (industry calls this “virtual”). The PSL translates the Effective Address into a Real Address (industry calls this “physical”) for accessing memory within the PowerPC system.
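The idea of translating an effective address into a real address can be sketched with a toy page lookup. This is a deliberately simplified illustration: the class name, the flat dictionary, and the 4 KiB page size are all assumptions, and the actual translation mechanism defined in Power ISA Book III and used by the PSL is considerably more involved.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class ToyTranslator:
    """Hypothetical stand-in for EA -> RA translation (not the real PSL logic)."""

    def __init__(self, page_table):
        # Maps an effective page number to a real page number.
        self.page_table = dict(page_table)

    def translate(self, effective_address):
        # Split the address into a page number and an offset within the page.
        page, offset = divmod(effective_address, PAGE_SIZE)
        if page not in self.page_table:
            raise LookupError(f"no mapping for effective page {page:#x}")
        # The offset is unchanged; only the page number is replaced.
        return self.page_table[page] * PAGE_SIZE + offset


if __name__ == "__main__":
    translator = ToyTranslator({0x10: 0x2A})
    ea = 0x10 * PAGE_SIZE + 0x123
    print(f"EA {ea:#x} -> RA {translator.translate(ea):#x}")
```

The point of the sketch is only that the AFU can work with the same addresses the application sees, while something else (here the toy table, in CAPI the PSL) maps them to physical memory.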
MMIO: Memory-mapped input/output.
WED: Work Element Descriptor. When an application requests use of an AFU, a process element is added to the process-element linked list that describes the application’s process state. The process element also contains a work element descriptor (WED) provided by the application. The WED can contain the full description of the job to be performed or a pointer to other main memory structures in the application’s memory space. Several programming models are described providing for an AFU to be used by any application or for an AFU to be dedicated to a single application.
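The "WED as a pointer" case can be illustrated with a small, self-contained sketch. Everything here is hypothetical: the `JobDescriptor` layout, the `make_wed` helper, and the `toy_afu` function are invented for illustration, and a real AFU defines its own job format and runs in hardware rather than in Python.

```python
import ctypes


class JobDescriptor(ctypes.Structure):
    """Hypothetical job layout in application memory; a real AFU defines its own."""
    _fields_ = [
        ("src", ctypes.c_uint64),     # address of the input buffer
        ("dst", ctypes.c_uint64),     # address of the output buffer
        ("length", ctypes.c_uint64),  # number of bytes to process
        ("done", ctypes.c_uint64),    # completion flag set by the accelerator
    ]


def make_wed(job):
    """Return a 64-bit WED that simply points at the job descriptor."""
    return ctypes.addressof(job)


def toy_afu(wed):
    """Stand-in for the accelerator: follow the WED pointer and copy src to dst."""
    job = JobDescriptor.from_address(wed)
    ctypes.memmove(job.dst, job.src, job.length)
    job.done = 1


# The application prepares its buffers and the job description ...
src = ctypes.create_string_buffer(b"hello CAPI")
dst = ctypes.create_string_buffer(len(src))
job = JobDescriptor(ctypes.addressof(src), ctypes.addressof(dst), len(src), 0)

# ... and hands over only the WED; the "AFU" finds everything else through it.
wed = make_wed(job)
toy_afu(wed)
```

Because the accelerator works in the application's address space, passing a single pointer is enough to describe an arbitrarily rich job, which is the design choice the WED definition above describes.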
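[dream1]Other CAPI 2.0 features are described later (for example, support for PCIe Gen4 at 16 GT/s per lane, supported on POWER9).
[dream2]http://www.cnblogs.com/yubo/archive/2010/04/23/1718810.html points out that the drawback of SMP is its shared memory: as CPUs are added, memory access conflicts increase sharply, which wastes CPU resources and degrades performance, so 2 to 4 CPUs is a reasonable range.
The question is whether POWER9 is an SMP design. It has 8 cores, so how does it stay efficient? Zhihu also says that SMP scales well; how do these statements fit together?
[dream3]Are today's supercomputers, such as Yinhe (Galaxy), built on an MPP architecture?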
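As a worked example for the 16 GT/s per lane figure in note [dream1], the raw per-lane bandwidth can be estimated from the line encoding. This sketch assumes the 128b/130b encoding used by PCIe Gen3 and Gen4 and ignores all packet and protocol overhead, so real throughput is lower; the function name is made up for illustration.

```python
def pcie_lane_gbytes_per_s(gt_per_s, payload_bits=128, total_bits=130):
    """Approximate usable bandwidth of one PCIe lane in GB/s, one direction.

    Assumes 128b/130b line encoding (PCIe Gen3/Gen4) and ignores
    packet headers and other protocol overhead.
    """
    return gt_per_s * payload_bits / total_bits / 8


if __name__ == "__main__":
    per_lane = pcie_lane_gbytes_per_s(16.0)
    print(f"Gen4 x1 : {per_lane:.2f} GB/s")
    print(f"Gen4 x16: {16 * per_lane:.2f} GB/s")
```

With these assumptions a Gen4 lane carries a little under 2 GB/s in each direction, and a x16 link a little under 32 GB/s.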
