Mali GPU

Reposted article, 2020-01-06 14:56.

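The vertex shading stage first runs three culling tests (facing, frustum, and sample coverage). Varyings are computed only for what survives culling.

Then the polygon list is generated and written to system memory.

Transaction elimination (TE) is a Mali GPU optimization technique.

TE exists to save bandwidth.

The comparison is done per tile: a tile whose content has not changed is not written out, and the old one is reused.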

https://developer.arm.com/architectures/media-architectures/transaction-elimination

The comparison is made between frames, and all formats are supported.

Tiles are 16x16 pixels.
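To make the per-tile comparison concrete, here is a minimal sketch. It assumes an 8-bit single-channel frame and uses CRC32 as the tile signature; the signature function and the helper names `tile_signatures` and `tiles_to_write` are illustrative assumptions, not the hardware's actual scheme.

```python
import zlib

TILE = 16  # tile edge in pixels, as stated in the notes


def tile_signatures(frame, width, height):
    """Map (tile_x, tile_y) -> CRC32 of that tile's pixels.

    `frame` is a bytes-like object holding width*height bytes, row-major.
    CRC32 is an assumed stand-in for whatever signature the GPU uses.
    """
    sigs = {}
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            crc = 0
            for y in range(ty, min(ty + TILE, height)):
                start = y * width + tx
                end = y * width + min(tx + TILE, width)
                crc = zlib.crc32(frame[start:end], crc)
            sigs[(tx // TILE, ty // TILE)] = crc
    return sigs


def tiles_to_write(prev_sigs, new_sigs):
    """Tiles whose signature changed; only these need to go to memory."""
    return sorted(t for t, s in new_sigs.items() if prev_sigs.get(t) != s)


W, H = 32, 32
frame0 = bytearray(W * H)
frame1 = bytearray(frame0)
frame1[20 * W + 5] = 255  # change pixel (x=5, y=20), which lies in tile (0, 1)

changed = tiles_to_write(tile_signatures(frame0, W, H),
                         tile_signatures(frame1, W, H))
print(changed)  # [(0, 1)]
```

Only one of the four tiles is written back; the other three reuse what is already in memory.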

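Mali

Bifrost: the G series.

Midgard: the T series. The low-end devices are here, so this is where to start looking. Supports OpenGL ES 3.0.

Utgard: no phones use it any more, so it can be ignored.

Good, OpenGL ES 2.0 can be dropped.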

Index-Driven Vertex Shading (IDVS)

Store position separately from the other attributes.
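A small sketch of what the split layout can look like on the CPU side. The vertex format (position, normal, UV as 32-bit floats) and the helper name `pack_split_streams` are assumptions for illustration only; the point is just that positions end up in their own tightly packed buffer.

```python
import struct


def pack_split_streams(vertices):
    """Pack vertices into two buffers: positions only, and everything else.

    `vertices` is a list of (position, normal, uv) tuples, a hypothetical
    format chosen for this example.
    """
    positions = bytearray()
    others = bytearray()
    for position, normal, uv in vertices:
        positions += struct.pack("<3f", *position)        # 12 bytes per vertex
        others += struct.pack("<3f2f", *normal, *uv)      # 20 bytes per vertex
    return bytes(positions), bytes(others)


verts = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0)),
    ((1.0, 0.5, 0.0), (0.0, 0.0, 1.0), (1.0, 0.5)),
]
pos_buf, other_buf = pack_split_streams(verts)
print(len(pos_buf), len(other_buf))  # 24 40
```

With this layout, position shading reads only the 12-byte-per-vertex stream, and the second stream is touched only for vertices that survive culling.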

https://developer.arm.com/solutions/graphics/developer-guides/the-bifrost-shader-core/index-driven-geometry-pipeline

https://community.arm.com/developer/tools-software/graphics/b/blog/posts/eats-shoots-and-interleaves

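This feature is of limited practical value: on Mali, Bifrost devices mostly have enough performance already, while Midgard devices are the ones that need optimizing, yet low-end hardware does not support IDVS.

High-end devices get richer features and low-end devices still have to be cut down. On second thought, one set of data should be enough: decide how to bind and pack it per platform at run time.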

Mali shader cores are unified.

One kind of core handles three kinds of work: vertex shaders, fragment shaders, and compute kernels.

MIDGARD

This is the weaker one.

The Mali T series.

BIFROST

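The high-end Mali G series. (Bifrost: the trembling road to heaven.)

The two are the same at the GPU block model level (how work is scheduled onto shader cores). The difference is that Bifrost scales to 32 cores in parallel and Midgard to 16.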

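Vertex queue: vertex, tiling, and compute work (tiling is fixed-function).

Fragment queue: fragment work.

The work for one render target enters a queue as a task.

Work from one queue can be spread across multiple shader cores in parallel.

Work from multiple queues can run in parallel on one shader core.

Vertex queue and fragment queue work for different render targets can also run in parallel.

Within one render target, the vertex queue and fragment queue work is necessarily serial, because the fragment work depends on the vertex work.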
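The overlap rules above can be sketched as a toy timeline. This assumes each queue runs its tasks one at a time in submission order and that costs are abstract time units; both are simplifications of mine, and `schedule` is a hypothetical helper, not a driver API.

```python
def schedule(passes):
    """Compute start/end times for each render target's vertex and fragment work.

    `passes` is a list of (vertex_cost, fragment_cost), one per render target.
    Returns a list of (v_start, v_end, f_start, f_end).
    """
    timeline = []
    vertex_free = 0    # time at which the vertex queue is next idle
    fragment_free = 0  # time at which the fragment queue is next idle
    for vertex_cost, fragment_cost in passes:
        v_start = vertex_free
        v_end = v_start + vertex_cost
        # Fragment work waits for its own vertex work and for the queue.
        f_start = max(v_end, fragment_free)
        f_end = f_start + fragment_cost
        timeline.append((v_start, v_end, f_start, f_end))
        vertex_free, fragment_free = v_end, f_end
    return timeline


timeline = schedule([(2, 5), (3, 4)])
print(timeline)  # [(0, 2, 2, 7), (2, 5, 7, 11)]
```

The vertex work of the second render target (time 2 to 5) overlaps the fragment work of the first (time 2 to 7), so both finish at time 11 instead of the 14 units a fully serial run would take.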

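The L2 cache saves bandwidth by avoiding repeated fetches of the same data: the data is fetched from memory once, and that is where the saving comes from.

L2 cache size: 32-64 KB per shader core.

The number and width of the ports from the L2 cache to the bus (system memory) are configurable. The figure is around one 32-bit pixel per core per clock.

So an 8-core design would have a total of 256 bits of memory bandwidth (for both read and write) per clock cycle.

Bifrost scales this to configurations above 12 cores, so its bandwidth here is higher than Midgard's.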

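Midgard shader core

The tripipe is the programmable part; everything else is fixed-function.

The tripipe has three functions:

Arithmetic: A-pipe

Memory load/store and varying access: LS-pipe

Texture access (texture unit): T-pipe. A bilinear filter takes one clock. Trilinear takes two clock cycles because it samples two mipmap levels.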

=================

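Bifrost shader core

Bifrost's fixed-function parts are similar to Midgard's.

The execution core replaces the tripipe, because there are now more than three parts.

A core contains one or more execution engines, which handle arithmetic and thread state.

Load/store unit

Varying unit

ZS/blend unit: depth, stencil, and blending

Texture unit: a bilinear filter takes one clock. Trilinear takes two clock cycles because it samples two mipmap levels.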

https://community.arm.com/developer/tools-software/graphics/b/blog/posts/the-mali-gpu-an-abstract-machine-part-4---the-bifrost-shader-core

https://community.arm.com/developer/tools-software/graphics/b/blog/posts/the-mali-gpu-an-abstract-machine-part-3---the-midgard-shader-core

https://admin.jlb.kr/upload/abstract/190007/259.pdf

Unite 2019

http://fileadmin.cs.lth.se/cs/Education/EDAN35/guestLectures/ARM-Mali.pdf

Midgard

https://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.10-GPU-HPC-Epub/HC28.22.110-Bifrost-JemDavies-ARM-v04-9.pdf

Bifrost

G76 Streamline raw data

MIDGARD

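Memory system

Each shader core has two 16 KB L1 caches: one for texture fetches and one for geometry (load/store) data.

All shader cores share one L2 cache, typically 32-64 KB per shader core.

L1 and L2 use 64-byte cache lines.

This information helps when optimizing compute shaders for better cache locality. I don't understand this part well yet, so I am marking it to come back to.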
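As a first step toward the cache locality point, this sketch counts how many 64-byte cache lines a group of threads touches for a given access stride. The thread count and element sizes are arbitrary example values, and `cache_lines_touched` is a hypothetical helper.

```python
CACHE_LINE_BYTES = 64  # L1/L2 line size from the notes


def cache_lines_touched(num_threads, element_bytes, stride_bytes, base=0):
    """Count distinct cache lines hit when thread t reads element_bytes
    starting at byte address base + t * stride_bytes."""
    lines = set()
    for t in range(num_threads):
        first_byte = base + t * stride_bytes
        last_byte = first_byte + element_bytes - 1
        lines.update(range(first_byte // CACHE_LINE_BYTES,
                           last_byte // CACHE_LINE_BYTES + 1))
    return len(lines)


contiguous = cache_lines_touched(16, 4, 4)   # 16 threads read adjacent floats
strided = cache_lines_touched(16, 4, 64)     # 16 threads read one float per line
print(contiguous, strided)  # 1 16
```

The same 64 bytes of useful data cost one line fetch when neighbouring threads read neighbouring addresses, and sixteen when each thread lands on its own line.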

DDR stands for Double Data Rate random-access memory. Data can be transferred on both the rising and the falling edge of the clock, hence "double rate".

GPU Limits

If we scale this to a Mali-T760 MP8 running at 600MHz we can calculate the theoretical peak performance as:

- Fillrate:
  - 8 pixels per clock = 4.8 GPix/s
  - That's 2314 complete 1080p frames per second!
- Texture rate:
  - 8 bilinear texels per clock = 4.8 GTex/s
  - That's 38 bilinear filtered texture lookups per pixel for 1080p @ 60 FPS!
- Arithmetic rate:
  - 17 FP32 FLOPS per pipe per core = 163 FP32 GFLOPS
  - That's 1311 FLOPS per pixel for 1080p @ 60 FPS!
- Bandwidth:
  - 256-bits of memory access per clock = 19.2GB/s read and write bandwidth.
  - That's 154 bytes per pixel for 1080p @ 60 FPS!
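The figures in the list above can be reproduced with a few lines of arithmetic. One value is an assumption of mine: two arithmetic pipes per core, which the notes do not state but which is what makes 17 FLOPS per pipe come out at 163 GFLOPS.

```python
CORES = 8
CLOCK_HZ = 600_000_000
ARITH_PIPES_PER_CORE = 2        # assumption, implied by the 163 GFLOPS figure
FLOPS_PER_PIPE_PER_CLOCK = 17
BUS_BITS_PER_CLOCK = 256
PIXELS_1080P = 1920 * 1080
PIXELS_PER_SECOND = PIXELS_1080P * 60  # 1080p at 60 FPS

fill_rate = CORES * CLOCK_HZ           # 1 pixel per core per clock
texture_rate = CORES * CLOCK_HZ        # 1 bilinear texel per core per clock
flops = FLOPS_PER_PIPE_PER_CLOCK * ARITH_PIPES_PER_CORE * CORES * CLOCK_HZ
bandwidth_bytes = BUS_BITS_PER_CLOCK // 8 * CLOCK_HZ

limits = {
    "frames_per_second_1080p": fill_rate // PIXELS_1080P,
    "texels_per_pixel": texture_rate // PIXELS_PER_SECOND,
    "flops_per_pixel": flops // PIXELS_PER_SECOND,
    "bytes_per_pixel": bandwidth_bytes // PIXELS_PER_SECOND,
}
print(limits)
# {'frames_per_second_1080p': 2314, 'texels_per_pixel': 38,
#  'flops_per_pixel': 1311, 'bytes_per_pixel': 154}
```

These are theoretical peaks; the per-pixel numbers are a budget to compare a shader against, not something a real frame will reach.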

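The bandwidth looks a bit lower than Apple's.

The author also says that compute shaders and vertex shaders behave the same way. Interesting: vertices are treated as one-dimensional compute data.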

BIFROST

The Execution Engines: Arithmetic Processing

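Quad threads like this parallelize better: T0 T1 T2 T3

One quad is 128 bits per clock, which is one horizontal row.

A vector of any width works: its components simply stack up row by row.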
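A sketch of the row layout described above, assuming 32-bit lanes (4 threads x 32 bits = 128 bits per row). The comparison against a per-thread 4-wide SIMD unit is a hypothetical baseline of mine, added only to show why stacking rows keeps the lanes full.

```python
QUAD_THREADS = 4
LANE_BITS = 32  # assumption: one FP32 value per thread per row


def quad_rows(components):
    """Lay out one vector operation for a quad: one row per component,
    each row holding that component for threads T0..T3."""
    return [[f"T{t}.{c}" for t in range(QUAD_THREADS)] for c in components]


def per_thread_simd_utilization(components, simd_width=4):
    """Lane utilization if each thread instead used its own 4-wide SIMD unit
    (hypothetical baseline, not a description of any specific GPU)."""
    return len(components) / simd_width


rows = quad_rows("xyz")
for row in rows:
    print(row)
# ['T0.x', 'T1.x', 'T2.x', 'T3.x']
# ['T0.y', 'T1.y', 'T2.y', 'T3.y']
# ['T0.z', 'T1.z', 'T2.z', 'T3.z']
print(per_thread_simd_utilization("xyz"))  # 0.75
```

A vec3 takes three full rows in the quad layout, while the per-thread SIMD baseline would leave a quarter of its lanes idle.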

=========

A quad corresponds to a warp.

It is a group of threads. Here the group is four threads and is called a quad.

=======

The Execution Engines: Thread State

ZS/Blend unit

　　All tile memory accesses happen here, including programmatic tile-buffer access such as:

https://www.khronos.org/registry/gles/extensions/EXT/EXT_shader_pixel_local_storage.txt
https://www.khronos.org/registry/gles/extensions/ARM/ARM_shader_framebuffer_fetch.txt
and the merged sub-pass functionality in Vulkan.

Varying unit

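　　The varying interpolator.

　　It does the interpolation. Access to the varying data itself goes through the load/store unit (LS pipe), which has a cache.

　　128 bits per quad per clock.

　　An FP16 vec4 takes 2 clocks per quad (16 x 4 = 64 bits per thread, x 4 threads = 256 bits).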
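The same arithmetic as a small function. Rounding up to whole clocks is my assumption, and `varying_clocks` is a hypothetical helper name.

```python
import math

VARYING_BITS_PER_QUAD_PER_CLOCK = 128  # from the notes
QUAD_THREADS = 4


def varying_clocks(bits_per_component, components):
    """Clocks needed to interpolate one varying for a whole quad."""
    bits_per_quad = bits_per_component * components * QUAD_THREADS
    return math.ceil(bits_per_quad / VARYING_BITS_PER_QUAD_PER_CLOCK)


print(varying_clocks(16, 4))  # FP16 vec4 -> 2
print(varying_clocks(32, 4))  # FP32 vec4 -> 4
print(varying_clocks(16, 2))  # FP16 vec2 -> 1
```

Under this model, halving precision or trimming unused components directly reduces interpolation cost.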

Load/store unit

　　vertex attribute fetch,

　　varying fetch,

　　buffer accesses,

　　and thread stack accesses.

　　It has a 16 KB L1 cache and uses the shared L2 cache.

　　One 64-byte cache line per clock.

　　A 128-bit access by a quad (4 threads) is optimized down to one clock.

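Bifrost splits memory access into three units:

load/store cache access, varying interpolation, tile-buffer accesses

Midgard has a single unit for all of this.

The change lets the three run in parallel, reduces resource contention, and relieves pressure on the LS pipe.

Texture unit: 16 KB L1 cache per core, shared L2, bilinear filtering at 1 clock per texel.

　　Trilinear: 2 clocks per texel, because it needs bilinear filtering on two mipmap levels.

　　Volumetric 3D textures: twice the cycles of 2D.

　　16 bits per color channel: needs more cycles per pixel.

　　But Bifrost is optimized for sampling depth16 and depth24: 1 cycle per texel, twice as fast as Midgard.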
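A rough cost model for the texture figures above. It covers only the cases with stated numbers (bilinear, trilinear, 3D); treating the 3D penalty as a multiplier on top of the filter cost is my assumption, and the 16-bit-per-channel case is left out because the notes give no figure for it.

```python
FILTER_CYCLES = {"bilinear": 1, "trilinear": 2}  # cycles per texel, from the notes


def texel_cycles(filtering="bilinear", volumetric=False):
    """Estimated texture unit cycles for one texture sample."""
    cycles = FILTER_CYCLES[filtering]
    if volumetric:
        cycles *= 2  # 3D textures cost twice the 2D cycles
    return cycles


def samples_cost(samples):
    """Total cycles for a list of (filtering, volumetric) samples in a shader."""
    return sum(texel_cycles(f, v) for f, v in samples)


print(texel_cycles("bilinear"))                  # 1
print(texel_cycles("trilinear"))                 # 2
print(texel_cycles("trilinear", volumetric=True))  # 4
print(samples_cost([("bilinear", False), ("trilinear", False)]))  # 3
```

Comparing `samples_cost` for a fragment shader against the texels-per-pixel budget gives a quick first check on whether it is texture bound.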
