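Tools/Plugins -- CACTI: A Cache/Memory Analysis Tool
@(Tools/Plugins)
I recently came across a tool that can estimate DRAM access power. It is useful for designs that need to analyze the access power and latency of off-chip memory (DRAM), for example deep learning accelerators.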
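1. Introduction
CACTI is an analysis tool that takes a set of cache/memory parameters as input and computes access time, power, cycle time, and area. The current release is version 7.0, and it supports analysis of the following memory types: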
- direct mapped caches
- set-associative caches
- fully associative caches
- Embedded DRAM memories
- Commodity DRAM memories
In addition, it offers the following features:
- Support for multi-ported uniform cache access (UCA) and multi-banked, multi-ported non-uniform cache access (NUCA).
- Leakage power calculation that also takes temperature into account.
- Router power model.
- Interconnect model with different delay, power, and area properties, including a low-swing wire model.
- An interface to perform trade-off analysis involving power, delay, area, and bandwidth.
- All process-specific values used by the tool are obtained from ITRS; currently the tool supports 90nm, 65nm, 45nm, and 32nm technology nodes.
- Chip IO model to calculate latency and energy for the DDR bus. Users can model different loads (fan-outs) and evaluate the impact on frequency and energy. This model can be used to study LR-DIMMs, R-DIMMs, etc.
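2. Usage
Source code: https://github.com/HewlettPackard/cacti
Technical report: http://www.hpl.hp.com/techreports/2013/HPL-2013-79.pdf
I could not get it to build on Windows (the C++ toolchain there lacks pthread, and I did not find a simple workaround), so I tested on CentOS instead. The basic usage is: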
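- Download the C++ source from the repository above and copy it to the CentOS machine.
- Enter the source directory and run `make` on the command line.
- Once the executable named `cacti` has been built, run `./cacti -infile ***.cfg`.

The .cfg file configures the memory properties and should be edited to match the DRAM being used. Here I simply ran one of the bundled sample configuration files: `./cacti -infile sample_config_files/ddr3_cache.cfg`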
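To compare several configurations it can be handy to script the command above. The following is a minimal sketch, not part of CACTI itself: the helper names `build_cacti_command` and `run_cacti`, the `cacti_dir` argument, and the report file name `cacti_report.txt` are all assumptions made for illustration. It only relies on the `-infile` option used in this post.

```python
import subprocess
from pathlib import Path


def build_cacti_command(cfg_path, cacti_bin="./cacti"):
    """Return the argument list for `./cacti -infile <cfg>`."""
    return [cacti_bin, "-infile", str(cfg_path)]


def run_cacti(cfg_path, cacti_dir=".", report_path="cacti_report.txt"):
    """Run CACTI inside its source directory and save whatever it prints.

    `cacti_dir` and `report_path` are hypothetical names chosen for this
    sketch. The report file is written relative to the caller's working
    directory, not relative to `cacti_dir`.
    """
    result = subprocess.run(
        build_cacti_command(cfg_path),
        cwd=cacti_dir,        # "./cacti" and the cfg path resolve from here
        capture_output=True,
        text=True,
        check=False,          # do not raise on a non-zero exit status
    )
    Path(report_path).write_text(result.stdout)
    return result.returncode, result.stdout
```

For example, `run_cacti("sample_config_files/ddr3_cache.cfg")` called from the source directory would return the exit status and the captured text. One caveat: if the process dies abnormally, output that was still buffered at that moment may be missing from the capture, so it is worth comparing against a run in the terminal.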
This produces a detailed analysis report, pasted below:
Cache size : 8388608
Block size : 64
Associativity : 8
Read only ports : 0
Write only ports : 0
Read write ports : 1
Single ended read ports : 0
Cache banks (UCA) : 1
Technology : 0.022
Temperature : 360
Tag size : 42
array type : Cache
Model as memory : 0
Model as 3D memory : 0
Access mode : 0
Data array cell type : 0
Data array peripheral type : 0
Tag array cell type : 0
Tag array peripheral type : 0
Optimization target : 2
Design objective (UCA wt) : 0 0 0 100 0
Design objective (UCA dev) : 20 100000 100000 100000 100000
Cache model : 0
Nuca bank : 0
Wire inside mat : 1
Wire outside mat : 1
Interconnect projection : 1
Wire signaling : 1
Print level : 1
ECC overhead : 1
Page size : 8192
Burst length : 8
Internal prefetch width : 8
Force cache config : 0
Subarray Driver direction : 1
iostate : READ
dram_ecc : NO_ECC
io_type : DDR3
dram_dimm : UDIMM
IO Area (sq.mm) = inf
IO Timing Margin (ps) = 35.8333
IO Votlage Margin (V) = 0.155
IO Dynamic Power (mW) = 1282.42 PHY Power (mW) = 232.752 PHY Wakeup Time (us) = 27.503
IO Termination and Bias Power (mW) = 3136.7
---------- CACTI (version 7.0.3DD Prerelease of Aug, 2012), Uniform Cache Access SRAM Model ----------
Cache Parameters:
Total cache size (bytes): 8388608
Number of banks: 1
Associativity: 8
Block size (bytes): 64
Read/write Ports: 1
Read ports: 0
Write ports: 0
Technology size (nm): 22
Access time (ns): 3.03414
Cycle time (ns): 1.84197
Total dynamic read energy per access (nJ): 0.381869
Total dynamic write energy per access (nJ): 0.446873
Total leakage power of a bank (mW): 2520.29
Total gate leakage power of a bank (mW): 4.71441
Cache height x width (mm): 3.07383 x 2.89775
Best Ndwl : 8
Best Ndbl : 8
Best Nspd : 2
Best Ndcm : 1
Best Ndsam L1 : 8
Best Ndsam L2 : 1
Best Ntwl : 16
Best Ntbl : 8
Best Ntspd : 8
Best Ntcm : 1
Best Ntsam L1 : 8
Best Ntsam L2 : 2
Data array, H-tree wire type: Global wires with 30% delay penalty
Tag array, H-tree wire type: Global wires with 30% delay penalty
Time Components:
Data side (with Output driver) (ns): 3.03414
H-tree input delay (ns): 0.860695
Decoder + wordline delay (ns): 0.607741
Bitline delay (ns): 0.473783
Sense Amplifier delay (ns): 0.00189739
H-tree output delay (ns): 1.09002
Tag side (with Output driver) (ns): 0.866708
H-tree input delay (ns): 0.250295
Decoder + wordline delay (ns): 0.0962495
Bitline delay (ns): 0.078
Sense Amplifier delay (ns): 0.00189739
Comparator delay (ns): 0.0162774
H-tree output delay (ns): 0.440265
Power Components:
Data array: Total dynamic read energy/access (nJ): 0.360657
Total energy in H-tree (that includes both address and data transfer) (nJ): 0.270396
Output Htree inside bank Energy (nJ): 0.263979
Decoder (nJ): 0.000237668
Wordline (nJ): 0.000275334
Bitline mux & associated drivers (nJ): 0
Sense amp mux & associated drivers (nJ): 0
Bitlines precharge and equalization circuit (nJ): 0.00163006
Bitlines (nJ): 0.0612354
Sense amplifier energy (nJ): 0.0018371
Sub-array output driver (nJ): 0.0249178
Total leakage power of a bank (mW): 2357.99
Total leakage power in H-tree (that includes both address and data network) ((mW)): 18.9776
Total leakage power in cells (mW): 0
Total leakage power in row logic(mW): 0
Total leakage power in column logic(mW): 0
Total gate leakage power in H-tree (that includes both address and data network) ((mW)): 0.0916133
Tag array: Total dynamic read energy/access (nJ): 0.0212128
Total leakage read/write power of a bank (mW): 162.298
Total energy in H-tree (that includes both address and data transfer) (nJ): 0.00268136
Output Htree inside a bank Energy (nJ): 0.00104879
Decoder (nJ): 0.000585105
Wordline (nJ): 0.000356972
Bitline mux & associated drivers (nJ): 0
Sense amp mux & associated drivers (nJ): 0.000288214
Bitlines precharge and equalization circuit (nJ): 0.00153419
Bitlines (nJ): 0.0132631
Sense amplifier energy (nJ): 0.00155643
Sub-array output driver (nJ): 8.13397e-05
Total leakage power of a bank (mW): 162.298
Total leakage power in H-tree (that includes both address and data network) ((mW)): 0.23223
Total leakage power in cells (mW): 0
Total leakage power in row logic(mW): 0
Total leakage power in column logic(mW): 0
Total gate leakage power in H-tree (that includes both address and data network) ((mW)): 0.00146699
Area Components:
Data array: Area (mm2): 7.28836
Height (mm): 3.07383
Width (mm): 2.3711
Area efficiency (Memory cell area/Total area) - 73.1983 %
MAT Height (mm): 0.716448
MAT Length (mm): 0.540768
Subarray Height (mm): 0.328909
Subarray Length (mm): 0.26532
Tag array: Area (mm2): 0.377107
Height (mm): 0.716051
Width (mm): 0.526648
Area efficiency (Memory cell area/Total area) - 74.9106 %
MAT Height (mm): 0.173381
MAT Length (mm): 0.063873
Subarray Height (mm): 0.0822272
Subarray Length (mm): 0.027995
Wire Properties:
Delay Optimal
Repeater size - 42.0297
Repeater spacing - 0.0329013 (mm)
Delay - 0.216837 (ns/mm)
PowerD - 0.000279845 (nJ/mm)
PowerL - 0.0215298 (mW/mm)
PowerLgate - 9.15623e-05 (mW/mm)
Wire width - 0.022 microns
Wire spacing - 0.022 microns
5% Overhead
Repeater size - 17.0297
Repeater spacing - 0.0329013 (mm)
Delay - 0.226875 (ns/mm)
PowerD - 0.0001818 (nJ/mm)
PowerL - 0.00872349 (mW/mm)
PowerLgate - 3.70994e-05 (mW/mm)
Wire width - 0.022 microns
Wire spacing - 0.022 microns
10% Overhead
Repeater size - 15.0297
Repeater spacing - 0.0329013 (mm)
Delay - 0.235988 (ns/mm)
PowerD - 0.000174237 (nJ/mm)
PowerL - 0.00769899 (mW/mm)
PowerLgate - 3.27424e-05 (mW/mm)
Wire width - 0.022 microns
Wire spacing - 0.022 microns
20% Overhead
Repeater size - 12.0297
Repeater spacing - 0.0329013 (mm)
Delay - 0.257722 (ns/mm)
PowerD - 0.00016297 (nJ/mm)
PowerL - 0.00616223 (mW/mm)
PowerLgate - 2.62069e-05 (mW/mm)
Wire width - 0.022 microns
Wire spacing - 0.022 microns
30% Overhead
Repeater size - 10.0297
Repeater spacing - 0.0329013 (mm)
Delay - 0.28134 (ns/mm)
PowerD - 0.000155511 (nJ/mm)
PowerL - 0.00513773 (mW/mm)
PowerLgate - 2.18498e-05 (mW/mm)
Wire width - 0.022 microns
Wire spacing - 0.022 microns
Low-swing wire (1 mm) - Note: Unlike repeated wires,
delay and power values of low-swing wires do not
have a linear relationship with length.
delay - 0.0902442 (ns)
powerD - 2.8399e-06 (nJ)
PowerL - 1.71796e-07 (mW)
PowerLgate - 1.29017e-09 (mW)
Wire width - 4.4e-08 microns
Wire spacing - 4.4e-08 microns
Segmentation fault
In this report, the lines
Cache Parameters:
Total dynamic read energy per access (nJ): 0.381869
Total dynamic write energy per access (nJ): 0.446873
give the dynamic energy of a single read access and a single write access.
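As a rough illustration of how these two numbers can be used, the sketch below pulls them out of the report text and multiplies them by access counts. The function names and the access counts (1,000,000 reads and 500,000 writes) are hypothetical, and the estimate covers only the per-access dynamic energy; the leakage and IO power that the report lists separately are not included.

```python
import re

# Matches the two summary lines, for example
# "Total dynamic read energy per access (nJ): 0.381869"
ENERGY_LINE = re.compile(
    r"Total dynamic (read|write) energy per access \(nJ\):\s*([0-9.eE+-]+)"
)


def parse_access_energy(report_text):
    """Return {'read': nJ, 'write': nJ} parsed from a CACTI report string."""
    energies = {}
    for kind, value in ENERGY_LINE.findall(report_text):
        # Keep the first value seen for each kind.
        energies.setdefault(kind, float(value))
    return energies


def total_dynamic_energy_mj(energies, n_reads, n_writes):
    """Dynamic energy in millijoules for the given access counts."""
    total_nj = n_reads * energies["read"] + n_writes * energies["write"]
    return total_nj * 1e-6  # 1 mJ = 1e6 nJ


# The two lines quoted above, copied from the report.
sample = """\
Total dynamic read energy per access (nJ): 0.381869
Total dynamic write energy per access (nJ): 0.446873
"""
energies = parse_access_energy(sample)
print(energies)
# Hypothetical workload: 1,000,000 reads and 500,000 writes.
print(total_dynamic_energy_mj(energies, n_reads=1_000_000, n_writes=500_000))
```

The regular expression deliberately targets the "per access" summary lines, so the per-array breakdown lines that use "energy/access" are not picked up.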
For what each entry in the configuration file means, see the technical report linked above; I will look into it further when I have time.