ComputeShader Basics Series, Part 4

This article is a repost. 2020-12-16 01:26. Tags: shader / Unity / ComputeShader

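This post continues from Part 3 of the ComputeShader basics series, which ended with the plan to do GPU culling through a compute shader.

Why do we need GPU culling, and what does it buy us?

Traditionally, culling is done on the CPU by the camera (Camera.Cull), and its cost grows steadily as the scene gets more complex. Can we move culling to the GPU and use its massive parallelism to take that load off the CPU?

Yes, we can. Like camera culling, though, GPU culling needs bounding-box data, which means that data has to be uploaded to GPU memory. That suggests the following approach: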

1. Upload the bounding-box data to the GPU through a ComputeBuffer.

2. Run the culling in a ComputeShader.

3. Draw the surviving objects with DrawIndirect.

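Why DrawIndirect, and what is it? Consider the obvious alternative for step 3 first: read the culling result back to the CPU and disable the Renderers that fall outside the frustum.

The first two steps are fine either way, but reading the result back to the CPU is a serious problem. Bandwidth between the CPU and GPU is very limited on mobile devices, so reading back large amounts of GPU data is more than a phone can handle. Worse, the readback only tells us which Renderers outside the frustum can be disabled. The objects inside the frustum still go through camera culling, so culling runs twice and costs performance on both sides. Profiling on PC shows the problem with reading the GPU culling result back to the CPU: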

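Camera culling and GPU culling both run, and the step that waits for the GPU to return its data is especially expensive.

I won't paste the readback code here. It has little reference value and was only used to see how expensive readback really is. This is where our main character, DrawIndirect, comes in.

Graphics.DrawMeshInstancedIndirect draws directly from data that already lives in GPU memory, instead of having the CPU send it as in a regular draw call. With this API we can feed the GPU culling result straight into the rendering pipeline without reading it back to the CPU, and we also bypass per-object camera culling.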
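To make the indirect call concrete: according to the Unity documentation, the argument buffer read by Graphics.DrawMeshInstancedIndirect holds five uint values. The helper below is a minimal sketch in plain C# that only names those five slots. `IndirectArgs` and `Build` are hypothetical names for illustration, not part of Unity.

```csharp
using System;

// Hypothetical helper (not a Unity API): names the five uint slots that
// Graphics.DrawMeshInstancedIndirect reads from its argument buffer.
public static class IndirectArgs
{
    // Byte offset of the instance count inside the argument buffer.
    public const int InstanceCountByteOffset = sizeof(uint);

    public static uint[] Build(uint indexCountPerInstance, uint instanceCount,
                               uint startIndex, uint baseVertex, uint startInstance = 0)
    {
        return new uint[]
        {
            indexCountPerInstance, // args[0]: indices per instance
            instanceCount,         // args[1]: how many instances to draw
            startIndex,            // args[2]: first index of the submesh
            baseVertex,            // args[3]: base vertex of the submesh
            startInstance          // args[4]: first instance
        };
    }
}
```

For a mesh with 36 indices and 100000 instances, `IndirectArgs.Build(36, 100000, 0, 0)` yields `{36, 100000, 0, 0, 0}`. Because the instance count sits at byte offset 4, a GPU-side counter can be written into that slot without the CPU ever seeing the value. Unity's ComputeBuffer.CopyCount takes exactly such a destination byte offset.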

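First, see the official documentation for this API: https://docs.unity3d.com/ScriptReference/Graphics.DrawMeshInstancedIndirect.html

You can copy the sample code straight into a Unity project to see it in action. The screen fills with small cubes:

The official example only shows how to call the API. It does no culling, so many instances that do not need to be drawn are still submitted to the pipeline.

Once you know how to use the API from the official example, we can go straight to the code.

The code is the official example with a few small modifications: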

using System.Collections;
using System.Collections.Generic;
using UnityEngine;

public class DrawIndirectCulled : MonoBehaviour
{
    public struct ObjInfo
    {
        public Vector3 boundMin;
        public Vector3 boundMax;
        public Matrix4x4 localToWorldMatrix;
        public Matrix4x4 worldToLocalMatrix;
    }
    public struct MatrixInfo
    {
        public Matrix4x4 localToWorldMatrix;
        public Matrix4x4 worldToLocalMatrix;
    }
    public int instanceCount = 100000;
    public Mesh instanceMesh;
    public Material instanceMaterial;
    public int subMeshIndex = 0;
    public ComputeShader compute;

    private int cachedInstanceCount = -1;
    private int cachedSubMeshIndex = -1;
    private ComputeBuffer positionBuffer; // per-instance bounds and matrices (culling input)
    private ComputeBuffer argsBuffer;     // indirect draw arguments
    private ComputeBuffer cullResult;     // append buffer holding the visible instances
    List<ObjInfo> infos = new List<ObjInfo>();
    private uint[] args = new uint[5] { 0, 0, 0, 0, 0 };
    private int kernel;

    void Start()
    {
        kernel = compute.FindKernel("CSMain");
        argsBuffer = new ComputeBuffer(1, args.Length * sizeof(uint), ComputeBufferType.IndirectArguments);
        cullResult = new ComputeBuffer(instanceCount, sizeof(float)*32, ComputeBufferType.Append);
        UpdateBuffers();
    }

    void Update()
    {
        // Rebuild the instance data when the count or the submesh changes
        if (cachedInstanceCount != instanceCount || cachedSubMeshIndex != subMeshIndex)
            UpdateBuffers();

        var camera = Camera.main;
        var vpMatrix = GL.GetGPUProjectionMatrix(camera.projectionMatrix, false) * camera.worldToCameraMatrix;
        compute.SetMatrix("vpMatrix", vpMatrix);
        // Copies every instance from CPU to GPU each frame
        positionBuffer.SetData(infos);
        compute.SetBuffer(kernel, "input", positionBuffer);
        // Reset the append counter before culling
        cullResult.SetCounterValue(0);
        compute.SetBuffer(kernel, "cullresult", cullResult);
        compute.SetInt("instanceCount", instanceCount);
        // Round up so a count that is not a multiple of 64 is still fully processed
        compute.Dispatch(kernel, (instanceCount + 63) / 64, 1, 1);
        instanceMaterial.SetBuffer("positionBuffer", cullResult);
        // Indirect args
        if (instanceMesh != null)
        {
            args[0] = (uint)instanceMesh.GetIndexCount(subMeshIndex);
            args[1] = (uint)instanceCount;
            args[2] = (uint)instanceMesh.GetIndexStart(subMeshIndex);
            args[3] = (uint)instanceMesh.GetBaseVertex(subMeshIndex);
        }
        else
        {
            args[0] = args[1] = args[2] = args[3] = 0;
        }
        argsBuffer.SetData(args);
        // Overwrite args[1] on the GPU with the number of instances that survived culling,
        // so only the entries actually appended to cullResult are drawn
        ComputeBuffer.CopyCount(cullResult, argsBuffer, sizeof(uint));
        // Pad input
        if (Input.GetAxisRaw("Horizontal") != 0.0f)
            instanceCount = (int)Mathf.Clamp(instanceCount + Input.GetAxis("Horizontal") * 40000, 1.0f, 5000000.0f);

        // Render
        Graphics.DrawMeshInstancedIndirect(instanceMesh, subMeshIndex, instanceMaterial, new Bounds(Vector3.zero, new Vector3(100.0f, 100.0f, 100.0f)), argsBuffer);
    }

    void OnGUI()
    {
        GUI.Label(new Rect(265, 25, 200, 30), "Instance Count: " + instanceCount.ToString());
        instanceCount = (int)GUI.HorizontalSlider(new Rect(25, 20, 200, 30), (float)instanceCount, 1.0f, 5000000.0f);
    }

    void UpdateBuffers()
    {
        // Ensure submesh index is in range
        if (instanceMesh != null)
            subMeshIndex = Mathf.Clamp(subMeshIndex, 0, instanceMesh.subMeshCount - 1);

        // Input buffer. Stride 152 = sizeof(float) * (3 + 3 + 16 + 16), matching ObjInfo
        if (positionBuffer != null)
            positionBuffer.Release();
        positionBuffer = new ComputeBuffer(instanceCount, 152);

        // Resize the cull result so it can hold every instance when all of them are visible
        if (cullResult != null && cullResult.count != instanceCount)
        {
            cullResult.Release();
            cullResult = new ComputeBuffer(instanceCount, sizeof(float) * 32, ComputeBufferType.Append);
        }

        infos.Clear();
        for (int i = 0; i < instanceCount; i++)
        {
            ObjInfo info = default;
            float angle = Random.Range(0.0f, Mathf.PI * 2.0f);
            float distance = Random.Range(20.0f, 100.0f);
            float height = Random.Range(-2.0f, 2.0f);
            var position = new Vector3(Mathf.Sin(angle) * distance, height, Mathf.Cos(angle) * distance);
            // Bounds are in local space (a unit cube is assumed);
            // the compute shader applies localToWorldMatrix to them
            info.boundMin = new Vector3(-0.5f, -0.5f, -0.5f);
            info.boundMax = new Vector3(0.5f, 0.5f, 0.5f);
            info.localToWorldMatrix = Matrix4x4.TRS(position, Quaternion.identity, Vector3.one);
            info.worldToLocalMatrix = Matrix4x4.Inverse(info.localToWorldMatrix);
            infos.Add(info);
        }

        cachedInstanceCount = instanceCount;
        cachedSubMeshIndex = subMeshIndex;
    }

    void OnDestroy()
    {
        if (positionBuffer != null)
            positionBuffer.Release();
        positionBuffer = null;

        if (argsBuffer != null)
            argsBuffer.Release();
        argsBuffer = null;

        if (cullResult != null)
            cullResult.Release();
        cullResult = null;
    }
}
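
The script uses two magic numbers for buffer strides, 152 and sizeof(float) * 32, and the kernel runs 64 threads per group. The sketch below shows where those numbers come from, in plain C# with no Unity dependency. `BufferMath` and its members are made-up names for illustration.

```csharp
using System;

// Illustrative arithmetic only; BufferMath is a hypothetical name.
public static class BufferMath
{
    const int Float3 = 3;     // floats in a Vector3 / float3
    const int Float4x4 = 16;  // floats in a Matrix4x4 / float4x4

    // ObjInfo: boundMin + boundMax + localToWorld + worldToLocal
    public const int ObjInfoStride = sizeof(float) * (Float3 + Float3 + Float4x4 + Float4x4);

    // MatrixInfo: localToWorld + worldToLocal
    public const int MatrixInfoStride = sizeof(float) * (Float4x4 + Float4x4);

    // Number of thread groups needed to cover `count` items, rounded up.
    public static int ThreadGroups(int count, int groupSize)
    {
        return (count + groupSize - 1) / groupSize;
    }
}
```

ObjInfoStride evaluates to 152 and MatrixInfoStride to 128. ThreadGroups(100000, 64) is 1563, whereas the truncating division 100000 / 64 gives 1562 and would leave the last 32 instances unprocessed.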

The compute shader is as follows:

// Each #kernel tells which function to compile; you can have many kernels
#pragma kernel CSMain
struct ObjInfo {
    float3 boundMin;
    float3 boundMax;
    float4x4 localToWorldMatrix;
    float4x4 worldToLocalMatrix;
};
struct MatrixInfo
{
    float4x4 localToWorldMatrix;
    float4x4 worldToLocalMatrix;
};

uint instanceCount;   // number of valid entries in input
float4x4 vpMatrix;    // view-projection matrix set from the script
StructuredBuffer<ObjInfo> input;

AppendStructuredBuffer<MatrixInfo> cullresult;
[numthreads(64,1,1)]
void CSMain (uint3 id : SV_DispatchThreadID)
{
    if(instanceCount<=id.x)
        return;
    ObjInfo info = input[id.x];
    float3 boundMax = info.boundMax;
    float3 boundMin = info.boundMin;
    float4 boundVerts[8];
    float4x4 mvpMatrix = mul(vpMatrix,info.localToWorldMatrix);
    boundVerts[0] = mul(mvpMatrix, float4(boundMin, 1));
    boundVerts[1] = mul(mvpMatrix, float4(boundMax, 1));
    boundVerts[2] = mul(mvpMatrix, float4(boundMax.x, boundMax.y, boundMin.z, 1));
    boundVerts[3] = mul(mvpMatrix, float4(boundMax.x, boundMin.y, boundMax.z, 1));
    boundVerts[4] = mul(mvpMatrix, float4(boundMin.x, boundMax.y, boundMax.z, 1));
    boundVerts[5] = mul(mvpMatrix, float4(boundMin.x, boundMax.y, boundMin.z, 1));
    boundVerts[6] = mul(mvpMatrix, float4(boundMax.x, boundMin.y, boundMin.z, 1));
    boundVerts[7] = mul(mvpMatrix, float4(boundMin.x, boundMin.y, boundMax.z, 1));

    bool isInside = false;
    for (int i = 0; i < 8; i++)
    {
        float4 boundVert = boundVerts[i];
        bool inside = boundVert.x <= boundVert.w && boundVert.x >= -boundVert.w &&
            boundVert.y <= boundVert.w && boundVert.y >= -boundVert.w &&
            boundVert.z <= boundVert.w && boundVert.z >= -boundVert.w;
        isInside = isInside || inside;
    }
    if (isInside)
    {
        MatrixInfo matrixInfo;
        matrixInfo.localToWorldMatrix = info.localToWorldMatrix;
        matrixInfo.worldToLocalMatrix = info.worldToLocalMatrix;
        cullresult.Append(matrixInfo);
    }
}
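
The kernel keeps an instance when at least one corner of its box lies inside the frustum. That works for small boxes, but a box large enough to enclose part of the frustum can have every corner outside while still being visible. A common conservative alternative rejects a box only when all eight corners are outside the same clip plane. The sketch below is a CPU-side reference in plain C#, assuming the same -w..w clip range the kernel tests against. `ClipCull` and its methods are illustrative names, not part of the original project.

```csharp
using System;

// CPU reference for two clip-space visibility tests. Each corner is {x, y, z, w}.
public static class ClipCull
{
    // Same rule as the kernel: visible if any corner is inside all six planes.
    public static bool AnyCornerInside(float[][] corners)
    {
        foreach (var c in corners)
        {
            float w = c[3];
            if (c[0] <= w && c[0] >= -w &&
                c[1] <= w && c[1] >= -w &&
                c[2] <= w && c[2] >= -w)
                return true;
        }
        return false;
    }

    // Conservative rule: culled only if all corners are outside one common plane.
    public static bool Conservative(float[][] corners)
    {
        for (int axis = 0; axis < 3; axis++)
        {
            bool allAbove = true, allBelow = true;
            foreach (var c in corners)
            {
                if (c[axis] <= c[3]) allAbove = false;   // this corner is not beyond the +w plane
                if (c[axis] >= -c[3]) allBelow = false;  // this corner is not beyond the -w plane
            }
            if (allAbove || allBelow) return false;      // every corner is outside the same plane
        }
        return true;
    }
}
```

For a box whose clip-space corners are (±2, ±2, ±2, 1), AnyCornerInside returns false while Conservative returns true. The conservative rule can keep a few boxes that are actually outside near the frustum's edges and corners, but it never drops a visible one.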

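As the code shows, all eight corners of each bounding box passed in from the script are transformed into clip space and tested there. When culling finishes, the result buffer is handed to the shader, shown below (for convenience it is a built-in pipeline surface shader):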

Shader "Unlit/IndirectShader"
{
Properties {
        _MainTex ("Albedo (RGB)", 2D) = "white" {}
        _Glossiness ("Smoothness", Range(0,1)) = 0.5
        _Metallic ("Metallic", Range(0,1)) = 0.0
    }
    SubShader {
        Tags { "RenderType"="Opaque" }
        LOD 200

        CGPROGRAM
        // Physically based Standard lighting model
        #pragma surface surf Standard addshadow fullforwardshadows
        #pragma multi_compile_instancing
        #pragma instancing_options procedural:setup

        sampler2D _MainTex;

        struct Input {
            float2 uv_MainTex;
        };

        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
        struct MatrixInfo
        {
            float4x4 localToWorldMatrix;
            float4x4 worldToLocalMatrix;
        };
        StructuredBuffer<MatrixInfo> positionBuffer;
        #endif

        void rotate2D(inout float2 v, float r)
        {
            float s, c;
            sincos(r, s, c);
            v = float2(v.x * c - v.y * s, v.x * s + v.y * c);
        }

        void setup()
        {
        #ifdef UNITY_PROCEDURAL_INSTANCING_ENABLED
            MatrixInfo data = positionBuffer[unity_InstanceID];

            unity_ObjectToWorld = data.localToWorldMatrix;
            unity_WorldToObject = data.worldToLocalMatrix;
        #endif
        }

        half _Glossiness;
        half _Metallic;

        void surf (Input IN, inout SurfaceOutputStandard o) {
            fixed4 c = tex2D (_MainTex, IN.uv_MainTex);
            o.Albedo = c.rgb;
            o.Metallic = _Metallic;
            o.Smoothness = _Glossiness;
            o.Alpha = c.a;
        }
        ENDCG
    }

}

The result:

Accurate frustum culling.

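That completes GPU culling. The key is understanding the DrawIndirect API and GPU instancing. Both are fairly basic, so I won't cover them here (for the API, see the official documentation; for how GPU instancing works, search online, or I may write a primer some other time). The code is not difficult, but running it reveals a problem:

The profiler shows how slow setting the compute buffer is. ComputeBuffer.SetData copies data from the CPU to the GPU, so it is limited by bandwidth. We could call SetData only when the instance count changes, but a hitch of that size is still unacceptable in a real game. For now, then, draw indirect with GPU culling is best suited to objects whose position, rotation, and scale never change and that share a heavily repeated mesh. We can pre-bake the transforms of all the models and upload the data to the GPU once, after which it never moves. The most common example is rendering large amounts of grass, which benefits greatly from this approach.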
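
A rough back-of-the-envelope estimate shows why the per-frame upload hurts. The inputs come from the script above (152 bytes per ObjInfo, 100000 instances by default, up to 5000000 through the slider). The 60 frames per second target is an assumption for illustration, and `UploadCost` is a hypothetical name.

```csharp
using System;

// Illustrative estimate only; UploadCost is a hypothetical name.
public static class UploadCost
{
    // Bytes copied from CPU to GPU by one SetData call over the whole buffer.
    public static long BytesPerFrame(int instanceCount, int strideBytes)
    {
        return (long)instanceCount * strideBytes;
    }

    // Bytes copied per second if the upload happens every frame.
    public static long BytesPerSecond(int instanceCount, int strideBytes, int framesPerSecond)
    {
        return BytesPerFrame(instanceCount, strideBytes) * framesPerSecond;
    }
}
```

BytesPerFrame(100000, 152) is 15200000, about 15 MB per frame, and BytesPerSecond(100000, 152, 60) is 912000000, roughly 0.9 GB per second. At the slider maximum of 5000000 instances a single upload is 760000000 bytes. Uploading once and leaving the data on the GPU removes this cost entirely for static objects.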

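Is that it? Really?

Yes, that's it. I had planned to write up GPU-based Hi-Z as well, but I'm lazy, so here is just a brief outline of the idea:

What we did above is frustum culling. Occlusion culling is still missing, and GPU Hi-Z culling is a common way to do it. In short, we sample a depth texture at different mip levels, compare that depth with each object's depth, and decide which objects are occluded. Occluded objects are simply not appended to the result. You could sample the lower-resolution levels of an ordinary mipmap chain directly, but that culls too aggressively: for occlusion culling to be correct, each texel of a coarser level must hold the farthest (maximum) depth of the pixels it covers, and default mipmap generation does not work that way.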
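
The difference between an ordinary mip and a Hi-Z mip fits in a few lines. This is a CPU sketch in plain C#, not a GPU implementation and not taken from any particular Hi-Z project. It assumes a conventional depth buffer where larger values are farther away (with a reversed-Z buffer the reduction would use the minimum instead) and a depth map with even width and height. `HiZ` and its methods are hypothetical names.

```csharp
using System;

// CPU illustration of a Hi-Z style reduction; HiZ is a hypothetical name.
public static class HiZ
{
    // Builds the next coarser level: each texel keeps the farthest depth of its 2x2 block.
    public static float[,] DownsampleMax(float[,] depth)
    {
        int h = depth.GetLength(0) / 2;
        int w = depth.GetLength(1) / 2;
        var result = new float[h, w];
        for (int y = 0; y < h; y++)
        {
            for (int x = 0; x < w; x++)
            {
                float top = Math.Max(depth[2 * y, 2 * x], depth[2 * y, 2 * x + 1]);
                float bottom = Math.Max(depth[2 * y + 1, 2 * x], depth[2 * y + 1, 2 * x + 1]);
                result[y, x] = Math.Max(top, bottom);
            }
        }
        return result;
    }

    // An object is occluded only if its nearest depth lies behind the stored occluder depth.
    public static bool IsOccluded(float objectNearestDepth, float hizDepth)
    {
        return objectNearestDepth > hizDepth;
    }
}
```

Take a 2x2 block with depths 0.2, 0.2, 0.2 and 0.9. Its average is 0.375, so an object at depth 0.5 would be treated as occluded even though the pixel at 0.9 leaves it visible. DownsampleMax stores 0.9 instead, and IsOccluded(0.5f, 0.9f) returns false, so the object is kept.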

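Many Hi-Z implementations already exist. Here is one link: https://zhuanlan.zhihu.com/p/47615677 , an article by the well-known Zhihu author MAXWELL.

I rubbed my sleepy eyes and checked the time: it is already 1:20 a.m. If anything here is wrong, it is probably because I was too tired. Corrections are welcome.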
