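Recently a colleague posted in our company chat group about a UE4 optimization for Masked materials that can improve performance considerably when a scene contains large areas of grass and trees. The optimization relies on a GPU feature called Early Z, but the way it works differs somewhat from what I originally understood, because Early Z is implemented in hardware and each vendor implements it differently. After consulting some references and running an experiment, let's lift the veil on Early Z. First we explain what Early Z is, then how UE4 uses it to solve the overdraw problem of grass and trees, then how Early Z has evolved, and finally we use experimental data to verify how Early Z actually behaves.
What is Early Z
In the traditional rendering pipeline, the depth test happens after the Pixel/Fragment Shader, as shown in the figure below: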

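However, if we think about it, the depth of every fragment is already known at rasterization time. If the depth test could be performed at that point, the expensive Pixel/Fragment Shader work that follows could be skipped for occluded fragments. Hardware vendors of course thought of this too, and each implemented an Early Z feature in its own hardware. I found some of their material online, so let's take a brief look.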
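To make the saving concrete, here is a minimal sketch in Python that models a one-dimensional framebuffer on the CPU. It is not GPU code, and the names `draw` and `render_scene` are made up for illustration. It draws a near surface and then a far surface behind it, and counts how many times the (imaginary) pixel shader would run when the depth test is placed after or before shading.

```python
def draw(fragments, depth_buffer, early_z):
    """Draw fragments with a LESS depth test and depth writes enabled.

    fragments is a list of (pixel_index, depth) pairs.
    Returns how many times the pixel shader would have run.
    """
    shader_runs = 0
    for pixel, depth in fragments:
        if early_z:
            # Depth test first: occluded fragments never reach the shader.
            if not depth < depth_buffer[pixel]:
                continue
            shader_runs += 1
            depth_buffer[pixel] = depth
        else:
            # Traditional order: shade first, then test and write depth.
            shader_runs += 1
            if depth < depth_buffer[pixel]:
                depth_buffer[pixel] = depth
    return shader_runs


def render_scene(early_z):
    depth_buffer = [1.0] * 4                 # cleared to the far plane
    near = [(p, 0.2) for p in range(4)]      # drawn first, in front
    far = [(p, 0.8) for p in range(4)]       # drawn second, fully hidden
    runs = draw(near, depth_buffer, early_z)
    runs += draw(far, depth_buffer, early_z)
    return runs, depth_buffer


late_runs, late_depth = render_scene(early_z=False)
early_runs, early_depth = render_scene(early_z=True)
print(late_runs, early_runs)  # prints: 8 4
```

Both orderings end with the same depth buffer. Only the number of shader runs differs, which is the whole point of Early Z.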
NVIDIA
NVIDIA's GPU Programming Guide has an optimization section on Early Z that gives some details on how to use it.
From the GPU Programming Guide Version 2.5.0 (GeForce 7 and earlier GPUs):
Early-z optimization (sometimes called "z-cull") improves performance by avoiding the rendering of occluded surfaces. If the occluded surfaces have expensive shaders applied to them, z-cull can save a large amount of computation time. To take advantage of z-cull, follow these guidelines:
- Don't create triangles with holes in them (that is, avoid alpha test or texkill)
- Don't modify depth (that is, allow the GPU to use the interpolated depth value)
Violating these rules can invalidate the data the GPU uses for early optimization, and can disable z-cull until the depth buffer is cleared again.
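As you can see, the guide says not to use alpha test or texkill (clip, discard) and not to modify depth, so that only the depth interpolated by the rasterizer is used. Violating these rules disables the GPU's Early Z optimization until the depth buffer is next cleared. These restrictions reflect the hardware of the time. Do GPUs still have them today? The experiment later in this article addresses that question.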
ZCULL and EarlyZ: Coarse and Fine-grained Z and Stencil Culling
NVIDIA GeForce 6 series and later GPUs can perform a coarse level Z and Stencil culling. Thanks to this optimization large blocks of pixels will not be scheduled for pixel shading if they are determined to be definitely occluded. In addition, GeForce 8 series and later GPUs can also perform fine-grained Z and Stencil culling, which allow the GPU to skip the shading of occluded pixels. These hardware optimizations are automatically enabled when possible, so they are mostly transparent to developers. However, it is good to know when they cannot be enabled or when they can underperform to ensure that you are taking advantage of them.
Coarse Z/Stencil culling (also known as ZCULL) will not be able to cull any pixels in the following cases:
1. If you don't use Clears (instead of fullscreen quads that write depth) to clear the depth-stencil buffer.
2. If the pixel shader writes depth.
3. If you change the direction of the depth test while writing depth. ZCULL will not cull any pixels until the next depth buffer Clear.
4. If stencil writes are enabled while doing stencil testing (no stencil culling)
5. On GeForce 8 series, if the DepthStencilView has Texture2D[MS]Array dimension
Also note that ZCULL will perform less efficiently in the following circumstances
1. If the depth buffer was written using a different depth test direction than that used for testing
2. If the depth of the scene contains a lot of high frequency information (i.e.: the depth varies a lot within a few pixels)
3. If you allocate too many large depth buffers.
4. If using DXGI_FORMAT_D32_FLOAT format
Similarly, fine-grained Z/Stencil culling (also known as EarlyZ) is disabled in the following cases:
1. If the pixel shader outputs depth
2. If the pixel shader uses the .z component of an input attribute with the SV_Position semantic (only on GeForce 8 series in D3D10)
3. If Depth or Stencil writes are enabled, or Occlusion Queries are enabled, and one of the following is true:
• Alpha-test is enabled
• Pixel Shader kills pixels (clip(), texkil, discard)
• Alpha To Coverage is enabled
• SampleMask is not 0xFFFFFFFF (SampleMask is set in D3D10 using OMSetBlendState and in D3D9 setting the D3DRS_MULTISAMPLEMASK renderstate)
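This is from the GPU Programming Guide for the GeForce 8 and 9 Series. It adds ZCULL (that is, Hierarchical Z), which comes with its own caveats, but it does not say clearly whether enabling alpha test disables Early Z for all subsequent draws.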
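The coarse culling described in the quoted guide can be pictured with a small CPU model. The sketch below is only an assumption about the general idea (one maximum depth stored per tile, LESS depth test), not a description of NVIDIA's actual hardware; `TILE_SIZE`, `build_coarse_buffer` and `rasterize` are hypothetical names.

```python
TILE_SIZE = 4  # pixels per tile in this toy one-dimensional model


def build_coarse_buffer(depth_buffer):
    """Store the farthest (maximum) depth of each tile."""
    return [
        max(depth_buffer[i:i + TILE_SIZE])
        for i in range(0, len(depth_buffer), TILE_SIZE)
    ]


def rasterize(depth_buffer, incoming_depth):
    """Test a surface of constant depth that covers every pixel (LESS test).

    Returns (tiles_culled, pixels_tested, pixels_shaded).
    """
    coarse = build_coarse_buffer(depth_buffer)
    tiles_culled = pixels_tested = pixels_shaded = 0
    for tile, tile_max in enumerate(coarse):
        # Coarse test: if the surface is not nearer than the farthest
        # stored depth, every pixel in the tile is definitely occluded.
        if incoming_depth >= tile_max:
            tiles_culled += 1
            continue
        # Fine-grained test: per-pixel check for the surviving tile.
        start = tile * TILE_SIZE
        for stored in depth_buffer[start:start + TILE_SIZE]:
            pixels_tested += 1
            if incoming_depth < stored:
                pixels_shaded += 1
    return tiles_culled, pixels_tested, pixels_shaded


depth_buffer = [0.2, 0.3, 0.25, 0.2, 0.2, 1.0, 1.0, 0.9]
result = rasterize(depth_buffer, 0.5)
print(result)  # prints: (1, 4, 3)
```

The first tile is rejected with a single comparison, while the second tile falls through to the per-pixel test, where one of its four pixels is still rejected.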
AMD
Emil Persson's Depth in Depth gives a fairly thorough explanation of Early Z.
Hierarchical Z, or HiZ for short, allows tiles of pixels to be rejected in a hierarchical fashion. This allows for faster rejection of occluded pixels and offers some bandwidth saving by doing a rough depth test using lower resolution buffers first instead of reading individual depth samples. Tiles that can safely be discarded are eliminated and thus the fragment shader will not be executed for those pixels. Tiles that cannot safely be discarded are passed on to the Early Z stage, which will be discussed later on.
The Early Z component operates on a pixel level and allows fragments to be rejected before executing the fragment shader. This means that if a certain fragment is found to be occluded by the current contents of the depth buffer, the fragment shader doesn't have to run for that pixel. Early Z can also reject fragments before shading based on the stencil test. On hardware prior to the Radeon HD 2000 series, early Z was a monolithic top-of-the-pipe operation, which means that the entire read-modify-write cycle is executed before the fragment shader. As a result this impacts other functionality that kills fragments such as alpha test and texkill (called "clip" in HLSL and "discard" in GLSL). If Early Z would be left on and the alpha test kills a fragment, the depth- and/or stencil-buffer would have been incorrectly updated for the killed fragments. Therefore, Early Z is disabled for these cases. However, if depth and stencil writes are disabled there are no updates to the depth-stencil buffer anyway, so in this case Early Z will be enabled. On the Radeon HD 2000 series, Early Z works in all cases.
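The incorrect update mentioned in the quote is easy to reproduce in a toy model. The following Python sketch is an illustration under simplifying assumptions (a four-pixel framebuffer, constant depth per draw, hypothetical `draw` and `render` helpers). It draws a masked leaf with holes in front of an opaque wall, once with the depth test after the kill and once with a monolithic test-and-write before the shader.

```python
def draw(color, depth, alpha, frame, zbuf, monolithic_early_z):
    """Draw one surface covering all pixels; alpha[p] == 0 marks a hole."""
    for p in range(len(frame)):
        if monolithic_early_z:
            # Whole read-modify-write cycle happens before the shader.
            if not depth < zbuf[p]:
                continue
            zbuf[p] = depth          # depth is written here ...
            if alpha[p] == 0:
                continue             # ... even though the shader kills it
            frame[p] = color
        else:
            # Late Z: the kill happens first, then the depth test and write.
            if alpha[p] == 0:
                continue
            if depth < zbuf[p]:
                zbuf[p] = depth
                frame[p] = color


def render(monolithic_early_z):
    frame = ["background"] * 4
    zbuf = [1.0] * 4
    draw("leaf", 0.3, [1, 0, 1, 0], frame, zbuf, monolithic_early_z)
    draw("wall", 0.6, [1, 1, 1, 1], frame, zbuf, monolithic_early_z)
    return frame


correct = render(monolithic_early_z=False)
broken = render(monolithic_early_z=True)
print(correct)  # prints: ['leaf', 'wall', 'leaf', 'wall']
print(broken)   # prints: ['leaf', 'background', 'leaf', 'background']
```

With the monolithic path, the holes in the leaf have already written their depth, so the wall behind them is rejected and the background shows through. This is why such hardware had to turn Early Z off whenever depth writes and fragment kills were combined.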
At the end, the author also provides a reference table listing the cases in which Early Z is disabled, as shown in the figure below:

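Summary
The two documents above are fairly old, so they may leave us unsure about exactly when Early Z is disabled, and these restrictions also change as hardware evolves. Later on we run an experiment to check.
UE4's Early Z optimization for Masked materials
Having briefly covered what Early Z is, let's look at how UE4 deals with the overdraw caused by Masked materials.
It requires turning on a switch called Mask Material Only in Early-Z pass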

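That is just the setting. How is it implemented in code? We won't paste the UE4 code here and will only describe the steps it takes. For the details, see the Pre Pass code in UE4.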
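- First, UE4 renders all Opaque and Masked materials in the scene in a Pre-Pass that writes only depth and no color, which makes the writes fast. Opaque objects are rendered first, then Masked objects, with clip enabled while the Masked objects are rendered.
- After the Pre-Pass, the depth test is changed to Equal and depth writes are turned off, and the Opaque objects are rendered. The Masked objects are rendered next, again with depth writes off and the depth test set to Equal, but this time without clip: the Pre-Pass has already written the depth, so only the pixels that pass the Equal test need to be written. This is where the name Mask Material Only in Early-Z pass comes from.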
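The two steps above can be sketched on the CPU as follows. This is not UE4 code. It is a toy Python model with made-up names (`depth_prepass`, `base_pass_equal`, `base_pass_no_prepass`) and an assumed scene of one opaque ground plane plus three overlapping masked grass layers, each at a distinct constant depth. It counts how many times the expensive base-pass shader would run with and without the Pre-Pass, assuming that a masked draw with depth writes on cannot use Early Z.

```python
FAR = 1.0

# (name, depth, coverage mask); a 0 in the mask is a hole removed by clip.
DRAWS = [
    ("ground",     0.9, [1, 1, 1, 1, 1, 1, 1, 1]),   # opaque, drawn first
    ("grass_far",  0.7, [1, 0, 1, 0, 1, 0, 1, 0]),
    ("grass_mid",  0.5, [0, 1, 1, 0, 0, 1, 1, 0]),
    ("grass_near", 0.3, [1, 1, 0, 0, 0, 0, 1, 1]),
]


def depth_prepass(draws, width):
    """Depth-only pass: clip on holes, LESS test, depth write, no color."""
    zbuf = [FAR] * width
    for _name, depth, mask in draws:
        for p in range(width):
            if mask[p] and depth < zbuf[p]:
                zbuf[p] = depth
    return zbuf


def base_pass_equal(draws, zbuf):
    """Base pass after the Pre-Pass: EQUAL test, no depth write, no clip."""
    frame = [None] * len(zbuf)
    shader_runs = 0
    for name, depth, _mask in draws:
        for p in range(len(zbuf)):
            if depth == zbuf[p]:      # early test; everything else is skipped
                shader_runs += 1
                frame[p] = name
    return shader_runs, frame


def base_pass_no_prepass(draws, width):
    """Single pass: LESS test with depth write; masked draws use late Z."""
    zbuf = [FAR] * width
    frame = [None] * width
    shader_runs = 0
    for name, depth, mask in draws:
        masked = not all(mask)
        for p in range(width):
            if masked:
                shader_runs += 1      # shader must run to evaluate clip
                if mask[p] and depth < zbuf[p]:
                    zbuf[p] = depth
                    frame[p] = name
            elif depth < zbuf[p]:     # opaque draw can use Early Z
                shader_runs += 1
                zbuf[p] = depth
                frame[p] = name
    return shader_runs, frame


runs_without, frame_without = base_pass_no_prepass(DRAWS, 8)
runs_with, frame_with = base_pass_equal(DRAWS, depth_prepass(DRAWS, 8))
print(runs_without, runs_with)  # prints: 32 8
```

Both paths produce the same image, but with the Pre-Pass the expensive shader runs exactly once per pixel. The Pre-Pass itself is not free, since it still evaluates the mask for every grass fragment, which matches the remark that the gain is large only when the Masked materials are expensive to shade.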
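This is how UE4 improves the rendering efficiency of Masked materials, with one precondition: the gain is large only when the Masked materials in your scene are fairly expensive. But wait, this implementation contradicts some of the articles we just read, and other documents are not clear on the point. Since UE4 has implemented this feature and it does improve performance, the earlier articles presumably applied only to the GPUs of their time, and hardware has since become smarter and handles more cases. To verify this, let's run an experiment.
Lifting the veil on Early Z
To verify the assumptions above, I ran a simple experiment. The demo is based on the Texturing tutorial from rastertek's Direct3D 11 series, which renders a single textured triangle on screen, as shown below: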

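I modified the code to draw four triangles at the same position. The first triangle is rendered as Masked. The second modifies depth in the PS. The third is rendered as Masked. The fourth is also rendered as Masked but, as in UE4, with depth writes off, the depth test set to Equal, and clip disabled. The test GPU was an NVIDIA GTX 570. I used GPA (Intel Graphics Performance Analyzers) to measure the number of PS invocations and the number of pixels actually rendered, shown in the table below: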
| Draw | Depth | Clip | PS Invocations | Pixels Rendered |
| --- | --- | --- | --- | --- |
| 1 Mask | Less, Write | Yes | 10.4k | 6548 |
| 2 Modify depth | Less, Write | No | 10.4k | 3820 |
| 3 Mask | Less, Write | Yes | 10.4k | 0 |
| 4 Mask | Equal, do not Write (writing depth would not change the result because the test is Equal, but it is turned off to save bandwidth) | No | 6548 | 6548 |
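As the table shows, both Modify depth and Clip affect the Early Z optimization only for the current draw call and do not affect Early Z for later draw calls. Draws 1 to 3 run the PS for every covered fragment (10.4k invocations), while draw 4 runs it only for the 6548 pixels that pass the Equal test. So as hardware has evolved, Early Z (including Hierarchical Z) has become smarter and handles more cases.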
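The per-draw behaviour in the table lines up with the fine-grained EarlyZ conditions quoted earlier from the GeForce 8 and 9 guide. As a rough cross-check, the sketch below encodes those quoted conditions as a Python predicate and applies it to the four draws. The `DrawState` class and `early_z_possible` function are hypothetical helpers written for this article, the GeForce 8 specific `SV_Position` case is left out, and real drivers may apply other rules.

```python
from dataclasses import dataclass


@dataclass
class DrawState:
    name: str
    depth_write: bool = True
    stencil_write: bool = False
    occlusion_query: bool = False
    shader_outputs_depth: bool = False
    shader_kills_pixels: bool = False   # clip(), texkill, discard
    alpha_test: bool = False
    alpha_to_coverage: bool = False
    sample_mask: int = 0xFFFFFFFF


def early_z_possible(s):
    """Per-draw check following the quoted fine-grained EarlyZ conditions."""
    if s.shader_outputs_depth:
        return False
    has_side_effects = s.depth_write or s.stencil_write or s.occlusion_query
    coverage_may_change = (
        s.alpha_test
        or s.shader_kills_pixels
        or s.alpha_to_coverage
        or s.sample_mask != 0xFFFFFFFF
    )
    return not (has_side_effects and coverage_may_change)


draws = [
    DrawState("1 Mask", shader_kills_pixels=True),
    DrawState("2 Modify depth", shader_outputs_depth=True),
    DrawState("3 Mask", shader_kills_pixels=True),
    DrawState("4 Mask (Equal, no write, no clip)", depth_write=False),
]

# The decision is made per draw, so earlier draws do not affect later ones.
results = {d.name: early_z_possible(d) for d in draws}
for name, ok in results.items():
    print(name, "->", "Early Z" if ok else "late Z")
```

Under these assumptions, only the fourth draw qualifies for Early Z, which agrees with it being the only draw whose PS invocation count equals its rendered pixel count.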
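Summary
From this brief analysis of Early Z and the experiment, we can draw some useful conclusions:
- Early Z is implemented in hardware, and as hardware evolves its capabilities keep improving and it handles more cases.
- Alpha test or depth modification disables Early Z for the draw that uses it, but later draws can still benefit from Early Z (and Hierarchical Z).
- Early Z can be forced from the shader with the earlydepthstencil attribute (D3D) or layout(early_fragment_tests) in; (OpenGL).
As hardware evolves, many of the old hardware restrictions are lifted, so we need to keep learning in order to optimize our engines and games correctly.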
