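This post continues from the previous one, part three of the ComputeShader basic usage series, which ended by saying we would do GPU culling with a compute shader.
Why do we need GPU culling, and what do we gain from it?
Traditionally, culling is done by the camera. The cost of Camera.Cull grows with scene complexity and becomes more and more serious. So can we move culling to the GPU and use its highly parallel processing to take the load off the CPU?
The answer is of course yes. But just like camera culling, GPU culling needs bounding box data, which means that data has to be uploaded to GPU memory. That leads to the following approach: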
1. Upload the bounding box data to the GPU through a ComputeBuffer.
2. Do the culling in a compute shader.
3. Draw the objects with DrawIndirect.
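Why DrawIndirect, and what is it? Consider the obvious alternative for the third step first: reading the culling result back to the CPU.
The first two steps are fine, but reading back to the CPU is a big problem. Transfer bandwidth between CPU and GPU is very limited on mobile devices, and a phone cannot cope with large amounts of GPU data being read back. There is a second problem: the readback only tells us which renderers outside the frustum can be disabled. The objects inside the frustum still go through camera culling again, so we pay for two culling passes, and the result is about as bad as it gets. Profiling on PC shows the problem with reading the culling result straight back to the CPU: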

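Camera culling runs, GPU culling runs as well, and note the step that waits for the GPU to return data: it is very expensive.
I won't paste the readback code here. It has no reference value and was only used to see how costly the readback really is. That brings us to the star of this post: DrawIndirect.
Graphics.DrawMeshInstancedIndirect draws data that already lives in GPU memory straight into the render pipeline, instead of sending it from the CPU in the traditional way. With this API we can feed the GPU culling result directly into the pipeline without any readback to the CPU, and we also bypass the camera culling mechanism.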
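To make "without reading back" concrete, here is a minimal sketch of the five-uint argument buffer that Graphics.DrawMeshInstancedIndirect reads, and of how the element counter of an append buffer can be copied into it on the GPU. The class and method names (IndirectArgsSketch, Create, CopyVisibleCount) are hypothetical helpers made up for illustration; only the Unity calls inside them are real API.

```csharp
using UnityEngine;

// Hypothetical helper for illustration; not part of the original script.
public static class IndirectArgsSketch
{
    // Argument layout read by Graphics.DrawMeshInstancedIndirect (five uints):
    // [0] index count per instance
    // [1] instance count
    // [2] start index location
    // [3] base vertex location
    // [4] start instance location
    public static ComputeBuffer Create(Mesh mesh, int subMeshIndex)
    {
        var args = new uint[5];
        args[0] = (uint)mesh.GetIndexCount(subMeshIndex);
        args[1] = 0; // left at zero here; written on the GPU by CopyVisibleCount
        args[2] = (uint)mesh.GetIndexStart(subMeshIndex);
        args[3] = (uint)mesh.GetBaseVertex(subMeshIndex);
        args[4] = 0;

        var buffer = new ComputeBuffer(1, args.Length * sizeof(uint),
                                       ComputeBufferType.IndirectArguments);
        buffer.SetData(args);
        return buffer;
    }

    // Copies the counter of an append buffer into args[1] (byte offset 4),
    // so the number of visible instances never travels through the CPU.
    public static void CopyVisibleCount(ComputeBuffer appendBuffer, ComputeBuffer argsBuffer)
    {
        ComputeBuffer.CopyCount(appendBuffer, argsBuffer, sizeof(uint));
    }
}
```

The idea is to call CopyVisibleCount after the culling dispatch and before the draw call, so the draw uses exactly as many instances as the kernel appended.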
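The official documentation for this API is here: https://docs.unity3d.com/ScriptReference/Graphics.DrawMeshInstancedIndirect.html
You can copy the code from that page straight into a Unity project to see the result, a screen full of small cubes: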

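The official example only shows how to use the API. It does no culling, so a lot of instances that do not need to be drawn are still pushed into the pipeline.
Once you have worked through the official example and know how to use the API, we can go straight to the code.
The code is the official example with a few small modifications: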
```csharp
using System.Collections.Generic;
using UnityEngine;

public class DrawIndirectCulled : MonoBehaviour
{
    // Per-instance input of the culling kernel: 38 floats = 152 bytes.
    public struct ObjInfo
    {
        public Vector3 boundMin;
        public Vector3 boundMax;
        public Matrix4x4 localToWorldMatrix;
        public Matrix4x4 worldToLocalMatrix;
    }

    // Per-instance output of the culling kernel: 32 floats = 128 bytes.
    public struct MatrixInfo
    {
        public Matrix4x4 localToWorldMatrix;
        public Matrix4x4 worldToLocalMatrix;
    }

    public int instanceCount = 100000;
    public Mesh instanceMesh;
    public Material instanceMaterial;
    public int subMeshIndex = 0;
    public ComputeShader compute;

    private int cachedInstanceCount = -1;
    private int cachedSubMeshIndex = -1;
    private ComputeBuffer positionBuffer;
    private ComputeBuffer argsBuffer;
    private ComputeBuffer cullResult;
    private readonly List<ObjInfo> infos = new List<ObjInfo>();
    private uint[] args = new uint[5] { 0, 0, 0, 0, 0 };
    private int kernel;

    void Start()
    {
        kernel = compute.FindKernel("CSMain");
        argsBuffer = new ComputeBuffer(1, args.Length * sizeof(uint), ComputeBufferType.IndirectArguments);
        UpdateBuffers();
    }

    void Update()
    {
        // Rebuild the buffers when the instance count or the submesh changes
        if (cachedInstanceCount != instanceCount || cachedSubMeshIndex != subMeshIndex)
            UpdateBuffers();

        // Indirect args; args[1] (instance count) is filled on the GPU after culling
        if (instanceMesh != null)
        {
            args[0] = (uint)instanceMesh.GetIndexCount(subMeshIndex);
            args[1] = 0;
            args[2] = (uint)instanceMesh.GetIndexStart(subMeshIndex);
            args[3] = (uint)instanceMesh.GetBaseVertex(subMeshIndex);
        }
        else
        {
            args[0] = args[1] = args[2] = args[3] = 0;
        }
        argsBuffer.SetData(args);

        // Culling pass
        var camera = Camera.main;
        var vpMatrix = GL.GetGPUProjectionMatrix(camera.projectionMatrix, false) * camera.worldToCameraMatrix;
        compute.SetMatrix("vpMatrix", vpMatrix);
        positionBuffer.SetData(infos); // uploaded every frame; this CPU-to-GPU copy shows up in the profiler
        compute.SetBuffer(kernel, "input", positionBuffer);
        cullResult.SetCounterValue(0);
        compute.SetBuffer(kernel, "cullresult", cullResult);
        compute.SetInt("instanceCount", instanceCount);
        // Round up so a partially filled last thread group is still dispatched
        compute.Dispatch(kernel, (instanceCount + 63) / 64, 1, 1);

        // Draw only the instances that survived culling: copy the append
        // counter into args[1] without reading it back to the CPU
        ComputeBuffer.CopyCount(cullResult, argsBuffer, sizeof(uint));
        instanceMaterial.SetBuffer("positionBuffer", cullResult);

        // Pad input
        if (Input.GetAxisRaw("Horizontal") != 0.0f)
            instanceCount = (int)Mathf.Clamp(instanceCount + Input.GetAxis("Horizontal") * 40000, 1.0f, 5000000.0f);

        // Render
        Graphics.DrawMeshInstancedIndirect(instanceMesh, subMeshIndex, instanceMaterial,
            new Bounds(Vector3.zero, new Vector3(100.0f, 100.0f, 100.0f)), argsBuffer);
    }

    void OnGUI()
    {
        GUI.Label(new Rect(265, 25, 200, 30), "Instance Count: " + instanceCount.ToString());
        instanceCount = (int)GUI.HorizontalSlider(new Rect(25, 20, 200, 30), (float)instanceCount, 1.0f, 5000000.0f);
    }

    void UpdateBuffers()
    {
        // Ensure submesh index is in range
        if (instanceMesh != null)
            subMeshIndex = Mathf.Clamp(subMeshIndex, 0, instanceMesh.subMeshCount - 1);

        // Input and output buffers are both sized by the instance count
        if (positionBuffer != null)
            positionBuffer.Release();
        positionBuffer = new ComputeBuffer(instanceCount, 152);
        if (cullResult != null)
            cullResult.Release();
        cullResult = new ComputeBuffer(instanceCount, sizeof(float) * 32, ComputeBufferType.Append);

        // Positions
        infos.Clear();
        for (int i = 0; i < instanceCount; i++)
        {
            ObjInfo info = default;
            float angle = Random.Range(0.0f, Mathf.PI * 2.0f);
            float distance = Random.Range(20.0f, 100.0f);
            float height = Random.Range(-2.0f, 2.0f);
            var position = new Vector3(Mathf.Sin(angle) * distance, height, Mathf.Cos(angle) * distance);
            // World-space bounds of a unit cube centred on the instance
            info.boundMin = position - new Vector3(0.5f, 0.5f, 0.5f);
            info.boundMax = position + new Vector3(0.5f, 0.5f, 0.5f);
            info.localToWorldMatrix = Matrix4x4.TRS(position, Quaternion.identity, Vector3.one);
            info.worldToLocalMatrix = Matrix4x4.Inverse(info.localToWorldMatrix);
            infos.Add(info);
        }

        cachedInstanceCount = instanceCount;
        cachedSubMeshIndex = subMeshIndex;
    }

    void OnDestroy()
    {
        if (positionBuffer != null)
            positionBuffer.Release();
        positionBuffer = null;

        if (argsBuffer != null)
            argsBuffer.Release();
        argsBuffer = null;

        if (cullResult != null)
            cullResult.Release();
        cullResult = null;
    }
}
```
The compute shader code is as follows:
```hlsl
// Each #kernel tells which function to compile; you can have many kernels
#pragma kernel CSMain

struct ObjInfo
{
    float3 boundMin;
    float3 boundMax;
    float4x4 localToWorldMatrix;
    float4x4 worldToLocalMatrix;
};

struct MatrixInfo
{
    float4x4 localToWorldMatrix;
    float4x4 worldToLocalMatrix;
};

uint instanceCount;
float4x4 vpMatrix;
StructuredBuffer<ObjInfo> input;
AppendStructuredBuffer<MatrixInfo> cullresult;

[numthreads(64,1,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    // Threads past the last instance have nothing to do
    if (instanceCount <= id.x)
        return;

    ObjInfo info = input[id.x];
    float3 boundMax = info.boundMax;
    float3 boundMin = info.boundMin;

    // The script stores the bounds in world space, so only the view-projection
    // matrix is applied here; multiplying by localToWorldMatrix as well would
    // translate the box twice.
    float4 boundVerts[8];
    boundVerts[0] = mul(vpMatrix, float4(boundMin, 1));
    boundVerts[1] = mul(vpMatrix, float4(boundMax, 1));
    boundVerts[2] = mul(vpMatrix, float4(boundMax.x, boundMax.y, boundMin.z, 1));
    boundVerts[3] = mul(vpMatrix, float4(boundMax.x, boundMin.y, boundMax.z, 1));
    boundVerts[4] = mul(vpMatrix, float4(boundMin.x, boundMax.y, boundMax.z, 1));
    boundVerts[5] = mul(vpMatrix, float4(boundMin.x, boundMax.y, boundMin.z, 1));
    boundVerts[6] = mul(vpMatrix, float4(boundMax.x, boundMin.y, boundMin.z, 1));
    boundVerts[7] = mul(vpMatrix, float4(boundMin.x, boundMin.y, boundMax.z, 1));

    // The instance is kept if at least one corner lies inside the clip volume
    bool isInside = false;
    for (int i = 0; i < 8; i++)
    {
        float4 boundVert = boundVerts[i];
        bool inside = boundVert.x <= boundVert.w && boundVert.x >= -boundVert.w &&
                      boundVert.y <= boundVert.w && boundVert.y >= -boundVert.w &&
                      boundVert.z <= boundVert.w && boundVert.z >= -boundVert.w;
        isInside = isInside || inside;
    }

    if (isInside)
    {
        MatrixInfo matrixInfo;
        matrixInfo.localToWorldMatrix = info.localToWorldMatrix;
        matrixInfo.worldToLocalMatrix = info.worldToLocalMatrix;
        cullresult.Append(matrixInfo);
    }
}
```
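The per-corner test in the kernel above checks that a clip-space point satisfies -w <= x, y, z <= w. A minimal CPU mirror of that test, which could be handy for sanity-checking a few instances against the GPU result, might look like the sketch below. ClipSpaceTest and its method names are hypothetical and do not appear in the original project.

```csharp
using System;

// Hypothetical CPU mirror of the kernel's per-corner test; for illustration only.
public static class ClipSpaceTest
{
    // A clip-space point (x, y, z, w) passes when -w <= x, y, z <= w.
    public static bool IsCornerInside(float x, float y, float z, float w)
    {
        return x >= -w && x <= w
            && y >= -w && y <= w
            && z >= -w && z <= w;
    }

    // corners: clip-space points, each stored as { x, y, z, w }.
    // Returns true as soon as one corner passes, like the kernel's OR loop.
    public static bool IsAnyCornerInside(float[][] corners)
    {
        foreach (var c in corners)
        {
            if (IsCornerInside(c[0], c[1], c[2], c[3]))
                return true;
        }
        return false;
    }
}
```

Note that an "any corner inside" test rejects a box that is larger than the frustum and has all eight corners outside it. For small boxes such as the unit cubes used here that case is rare. A more conservative variant would cull only when all corners fall outside the same clip plane.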
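As you can see, all eight vertices of the bounding box passed in from the script are transformed into projection (clip) space and tested against the clip volume. Once culling is done, the result buffer is handed to the shader. The shader code is as follows (for convenience I simply used a surface shader from the built-in pipeline):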
```shaderlab
Shader "Unlit/IndirectShader"
{
    Properties
    {
        _MainTex ("Albedo (RGB)", 2D) = "white" {}
        _Glossiness ("Smoothness", Range(0,1)) = 0.5
        _Metallic ("Metallic", Range(0,1)) = 0.0
    }
    SubShader
    {
        Tags { "RenderType"="Opaque" }
        LOD 200

        CGPROGRAM
        // Physically based Standard lighting model
        #pragma surface surf Standard addshadow fullforwardshadows
        #pragma multi_compile_instancing
        #pragma instancing_options procedural:setup

        sampler2D _MainTex;

        struct Input
        {
            float2 uv_MainTex;
        };

    #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        struct MatrixInfo
        {
            float4x4 localToWorldMatrix;
            float4x4 worldToLocalMatrix;
        };
        StructuredBuffer<MatrixInfo> positionBuffer;
    #endif

        void rotate2D(inout float2 v, float r)
        {
            float s, c;
            sincos(r, s, c);
            v = float2(v.x * c - v.y * s, v.x * s + v.y * c);
        }

        void setup()
        {
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
            MatrixInfo data = positionBuffer[unity_InstanceID];
            unity_ObjectToWorld = data.localToWorldMatrix;
            unity_WorldToObject = data.worldToLocalMatrix;
        #endif
        }

        half _Glossiness;
        half _Metallic;

        void surf (Input IN, inout SurfaceOutputStandard o)
        {
            fixed4 c = tex2D (_MainTex, IN.uv_MainTex);
            o.Albedo = c.rgb;
            o.Metallic = _Metallic;
            o.Smoothness = _Glossiness;
            o.Alpha = c.a;
        }
        ENDCG
    }
}
```
The result looks like this:

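Accurate frustum culling.
And with that, GPU culling is done. The core is understanding the DrawIndirect API and GPU instancing. Both are fairly basic, so I won't go into them here (if you don't know the API, read the official documentation; for how GPU instancing works you can search for it yourself, or I may find time to write a primer later). The code is not difficult, but running it reveals a problem: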

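You can see how slow setting the compute buffer is. Setting a compute buffer is really a data transfer from the CPU to the GPU, and the limited bandwidth is what makes it slow. We can move the SetData call so that it only runs when the instance count changes, but a hitch of this size is still unacceptable in a real game. So for now, draw indirect plus GPU culling is better suited to objects whose position, rotation and scale never change and whose meshes are heavily repeated. We can pre-bake the transform data of all models, upload it to the GPU once, and never touch it again. The most common example is rendering large amounts of grass, which is optimized very well by this approach.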
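As a rough illustration of "upload once and leave it on the GPU", the sketch below keeps a buffer of pre-baked matrices and only calls SetData when the instance count changes. StaticInstanceUploader and Upload are hypothetical names, and the sketch assumes a non-empty array whose contents do not change while the count stays the same.

```csharp
using UnityEngine;

// Hypothetical component for illustration; not part of the original script.
public class StaticInstanceUploader : MonoBehaviour
{
    private ComputeBuffer instanceBuffer;
    private int uploadedCount = -1;

    // Pass in pre-baked matrices (assumed non-empty). The CPU-to-GPU copy
    // happens only when the count differs from what is already uploaded;
    // changes to the contents alone are NOT detected by this sketch.
    public ComputeBuffer Upload(Matrix4x4[] localToWorld)
    {
        if (instanceBuffer != null && uploadedCount == localToWorld.Length)
            return instanceBuffer; // data is already resident on the GPU

        instanceBuffer?.Release();
        instanceBuffer = new ComputeBuffer(localToWorld.Length, sizeof(float) * 16);
        instanceBuffer.SetData(localToWorld);
        uploadedCount = localToWorld.Length;
        return instanceBuffer;
    }

    void OnDestroy()
    {
        instanceBuffer?.Release();
        instanceBuffer = null;
    }
}
```

In the script from this post, the equivalent change would be to move the per-frame SetData call into the function that rebuilds the buffers.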
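Is that it? That's all?
Yes, that's all. I had planned to write up GPU-based Hi-Z as well, but I'm lazy. So here is just a short description of the principle: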
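What we just did on the GPU is frustum culling. Occlusion culling is still missing, and Hi-Z culling on the GPU is a common way to do it. In short, we sample a depth texture at different mip levels, compare that depth with the depth of each object, and decide which objects are occluded; those are simply not appended to the result. You could sample the default mipmaps of the depth texture directly, but that is too aggressive: for occlusion culling to be correct, each texel of a coarser level must hold the largest depth of the pixels it covers, and the default mipmap generation, which averages them, does not do that.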
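To illustrate the "largest depth of several pixels" rule, here is a small CPU sketch of one reduction step of a Hi-Z pyramid. It is not taken from any particular implementation: HiZSketch, DownsampleMax and IsOccluded are hypothetical names, and it assumes even texture dimensions and a depth convention where larger values are farther away. With a reversed depth buffer the max and the comparison would flip.

```csharp
using System;

// Hypothetical CPU illustration of one Hi-Z reduction step; for illustration only.
public static class HiZSketch
{
    // Builds the next (half-resolution) level by keeping the largest depth
    // of every 2x2 block. Assumes even dimensions and "larger = farther".
    public static float[,] DownsampleMax(float[,] depth)
    {
        int height = depth.GetLength(0) / 2;
        int width = depth.GetLength(1) / 2;
        var mip = new float[height, width];

        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                float a = depth[2 * y, 2 * x];
                float b = depth[2 * y, 2 * x + 1];
                float c = depth[2 * y + 1, 2 * x];
                float d = depth[2 * y + 1, 2 * x + 1];
                mip[y, x] = Math.Max(Math.Max(a, b), Math.Max(c, d));
            }
        }
        return mip;
    }

    // An object is treated as occluded only when its nearest depth is
    // farther than the farthest occluder depth stored in the Hi-Z texel.
    public static bool IsOccluded(float objectNearestDepth, float hiZDepth)
    {
        return objectNearestDepth > hiZDepth;
    }
}
```

An averaged mip would store a depth nearer than the farthest pixel of the block, so an object visible through that pixel could be culled by mistake. Keeping the maximum errs on the side of drawing too much rather than too little.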
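There are already plenty of concrete Hi-Z implementations. Here is one link: https://zhuanlan.zhihu.com/p/47615677 The article is by MAXWELL, a well-known author on Zhihu.
I rub my sleepy eyes and check the time: it is already 1:20 in the morning. If anything here is wrong, it is probably because I was too tired. Corrections are welcome.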
