Linux DRM GPU scheduler notes

This article is a repost; the original was published 2021-01-20 22:24. Tags: GPU scheduler, dma fence, linux, DRM/KMS, GPU, fence

Kernel documentation:

Overview

The GPU scheduler provides entities which allow userspace to push jobs into software queues which are then scheduled on a hardware run queue. The software queues have a priority among them. The scheduler selects the entities from the run queue using a FIFO. The scheduler provides dependency handling features among jobs. The driver is supposed to provide callback functions for backend operations to the scheduler like submitting a job to hardware run queue, returning the dependencies of a job etc.

The organisation of the scheduler is the following:

1. Each hw run queue has one scheduler

2. Each scheduler has multiple run queues with different priorities (e.g., HIGH_HW,HIGH_SW, KERNEL, NORMAL)

3. Each scheduler run queue has a queue of entities to schedule

4. Entities themselves maintain a queue of jobs that will be scheduled on the hardware.

The jobs in a entity are always scheduled in the order that they were pushed.

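How it works:

Modern GPUs expose a command stream interface to the CPU. These command streams control the GPU hardware, upload shader programs, and pass the state values that OpenGL or Vulkan need.

The Linux GPU scheduler schedules these GPU command streams. Its code was split out of the AMD GPU driver.

As the kernel documentation describes, the GPU scheduler provides entities through which userspace programs submit jobs. The jobs are first added to a software queue and are then scheduled onto the hardware (the GPU) by the scheduler.

One command stream channel on the GPU corresponds to one GPU scheduler.

The GPU scheduler applies two scheduling rules: higher priority runs first, and within the same priority whatever was queued first runs first (FIFO). Hardware-specific job submission is implemented through callback functions supplied by the driver.

Before a job is submitted to the hardware, the GPU scheduler checks its dependencies. A job is submitted only once all of its dependencies are available.

Each GPU scheduler contains multiple run queues, and these run queues represent different priorities.

When a new job needs to run on the GPU, it is first pushed to an entity. A load-balancing algorithm decides which GPU scheduler that entity will be scheduled on, and the entity is added to the run queue list of the selected GPU scheduler, where it waits to be scheduled.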

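Basic usage:

1. Before a GPU scheduler can do any work, it must be initialized and given the callback functions for the hardware operations. One command stream channel on the GPU corresponds to one scheduler.

2. A scheduler contains software run queues that correspond to different priorities. Higher-priority run queues are scheduled first.

3. A new job is first pushed to an entity, and the entity is then added to one of the scheduler's run queues. Within the same priority, jobs and entities are scheduled first in, first out (FIFO).

4. When the scheduler runs, it first picks the entity that entered the highest-priority run queue earliest, and then picks the job that was pushed to that entity earliest.

5. Before a job can be submitted to the GPU hardware, its dependencies must be checked, for example whether the framebuffer about to be rendered to is available. (The dependency check is also implemented through a callback.)

6. The job the scheduler selects is finally submitted to the GPU's hardware run queue through the callback supplied at initialization.

7. When the GPU finishes a job, a dma_fence callback notifies the GPU scheduler, which then signals the job's finished fence.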
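
The selection order in the steps above can be modelled in plain user-space C. The sketch below is only an illustration under simplifying assumptions: the toy_* names, the fixed-size arrays, and the rule that an entity with remaining jobs goes to the back of its run queue are all made up for the example. They are not the kernel's data structures or its exact policy.

```c
#include <assert.h>
#include <stddef.h>

/* Toy user-space model. None of these names are kernel APIs. */
enum { PRIO_MIN, PRIO_NORMAL, PRIO_HIGH, PRIO_KERNEL, PRIO_COUNT };
#define MAX_ENTITIES 8
#define MAX_JOBS 8

struct toy_entity {
    int prio;            /* which run queue this entity belongs to */
    int jobs[MAX_JOBS];  /* job ids in push order */
    size_t head, tail;   /* jobs[head..tail) are still pending */
};

struct toy_rq {
    struct toy_entity *entities[MAX_ENTITIES]; /* arrival order */
    size_t count;
};

struct toy_sched {
    struct toy_rq rq[PRIO_COUNT];
};

static void rq_append(struct toy_rq *rq, struct toy_entity *e)
{
    assert(rq->count < MAX_ENTITIES);
    rq->entities[rq->count++] = e;
}

static struct toy_entity *rq_pop_front(struct toy_rq *rq)
{
    struct toy_entity *e = rq->entities[0];

    for (size_t i = 1; i < rq->count; i++)
        rq->entities[i - 1] = rq->entities[i];
    rq->count--;
    return e;
}

/* Queue a job. An entity joins its run queue when it gets its first job. */
static void toy_push_job(struct toy_sched *s, struct toy_entity *e, int job_id)
{
    int was_empty = (e->head == e->tail);

    assert(e->tail < MAX_JOBS);
    e->jobs[e->tail++] = job_id;
    if (was_empty)
        rq_append(&s->rq[e->prio], e);
}

/* Return the next job id to run, or -1 if nothing is queued. */
static int toy_select_job(struct toy_sched *s)
{
    for (int prio = PRIO_COUNT - 1; prio >= 0; prio--) {
        struct toy_rq *rq = &s->rq[prio];
        struct toy_entity *e;
        int job;

        if (rq->count == 0)
            continue;               /* nothing at this priority */
        e = rq_pop_front(rq);       /* oldest entity first */
        job = e->jobs[e->head++];   /* oldest job of that entity */
        if (e->head != e->tail)
            rq_append(rq, e);       /* more work: back of the queue */
        else
            e->head = e->tail = 0;  /* drained: reset the FIFO */
        return job;
    }
    return -1;
}
```

With jobs 1 and 2 pushed to one NORMAL entity, job 3 to a second NORMAL entity, and job 4 to a KERNEL entity, repeated calls to toy_select_job() return 4, 1, 3, 2: the higher priority first, then entities in arrival order, and jobs within an entity in push order.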

1. Registering the scheduler

struct drm_gpu_scheduler;
 
int drm_sched_init(struct drm_gpu_scheduler *sched,          // sched: scheduler instance
                   const struct drm_sched_backend_ops *ops,  // ops: backend operations for this scheduler
                   unsigned hw_submission,                   // hw_submission: number of hw submissions that can be in flight
                   unsigned hang_limit,                      // hang_limit: number of times to allow a job to hang before dropping it
                   long timeout,                             // timeout: timeout value in jiffies for the scheduler
                   const char *name)                         // name: name used for debugging

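drm_sched_init() initializes a struct drm_gpu_scheduler *sched. (The signatures in these notes are those of the kernel at the time of writing, early 2021; later kernels have changed some of them.)

On success it starts a kernel thread that implements the main scheduling logic.

The kernel thread sleeps until a new job is pushed, which wakes it up to schedule.

hw_submission: the number of jobs that may be in flight at once on a single GPU command channel.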
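
The effect of hw_submission can be pictured as a simple in-flight counter. The following is a minimal user-space sketch of that idea only; struct toy_channel and its helpers are hypothetical names for this example and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one hardware command channel; not a kernel structure. */
struct toy_channel {
    unsigned int hw_submission; /* max jobs in flight, as given at init time */
    unsigned int in_flight;     /* jobs handed to hardware, not yet finished */
};

/* The channel can take another job only while it is below its limit. */
static bool toy_channel_ready(const struct toy_channel *ch)
{
    return ch->in_flight < ch->hw_submission;
}

/* Try to hand one job to the hardware; false means "keep it queued". */
static bool toy_channel_submit(struct toy_channel *ch)
{
    if (!toy_channel_ready(ch))
        return false;
    ch->in_flight++;
    return true;
}

/* Called when the hardware reports that a job has completed. */
static void toy_channel_job_finished(struct toy_channel *ch)
{
    assert(ch->in_flight > 0);
    ch->in_flight--;
}
```

With hw_submission set to 1, as in the VC4 test code later in these notes, a second submit attempt is refused until the first job has finished.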

The const struct drm_sched_backend_ops *ops parameter of drm_sched_init() supplies the platform-specific callbacks:

struct drm_sched_backend_ops {
    struct dma_fence *(*dependency)(struct drm_sched_job *sched_job,
                    struct drm_sched_entity *s_entity);
    struct dma_fence *(*run_job)(struct drm_sched_job *sched_job);
    void (*timedout_job)(struct drm_sched_job *sched_job);
    void (*free_job)(struct drm_sched_job *sched_job);
};

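dependency: called when a job is being considered as the next one to schedule. If the job has a dependency, the callback returns a pointer to a dma_fence. The GPU scheduler adds a wake-up callback to that fence's callback list, so the scheduler is woken again as soon as the fence is signaled. If the job has no dependencies, the callback returns NULL.

run_job: called once all of the job's dependencies are available. This callback implements the hardware-specific command submission. After the commands have been submitted to the GPU, it returns a dma_fence. The GPU scheduler adds a callback to this fence's callback list that signals the job's finished fence and wakes the scheduler. The fence is normally signaled when the GPU has finished processing the job.

timedout_job: called when a job submitted to the GPU has been executing for too long, to trigger the GPU recovery procedure.

free_job: releases the job's resources after the job has been processed.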
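
How dependency and run_job interact can be sketched as a small user-space model. This is an illustration under assumptions, not kernel code: the toy_* types and toy_try_schedule() are hypothetical, and the real scheduler installs a callback on the returned fence and sleeps instead of being polled as it is here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for dma_fence and drm_sched_job; not kernel types. */
struct toy_fence {
    bool signaled;
};

struct toy_job {
    struct toy_fence *deps[4]; /* fences this job must wait for */
    size_t num_deps;
    size_t next_dep;           /* first dependency not yet seen signaled */
    bool submitted;            /* set once run_job has been called */
};

struct toy_ops {
    struct toy_fence *(*dependency)(struct toy_job *job);
    void (*run_job)(struct toy_job *job);
};

/* Return the next unsignaled dependency, or NULL when none is left. */
static struct toy_fence *toy_dependency(struct toy_job *job)
{
    while (job->next_dep < job->num_deps) {
        struct toy_fence *f = job->deps[job->next_dep];

        if (!f->signaled)
            return f;
        job->next_dep++; /* already signaled: skip it */
    }
    return NULL;
}

/* Stand-in for the hardware submission. */
static void toy_run_job(struct toy_job *job)
{
    job->submitted = true;
}

/* One scheduling attempt: run the job only if no dependency is pending. */
static bool toy_try_schedule(const struct toy_ops *ops, struct toy_job *job)
{
    if (ops->dependency(job) != NULL)
        return false; /* still blocked on a fence */
    ops->run_job(job);
    return true;
}
```

A job with one unsignaled dependency is refused on the first attempt and runs on the next attempt after that fence has been signaled.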

2. Initializing entities

int drm_sched_entity_init(struct drm_sched_entity *entity,
              enum drm_sched_priority priority,
              struct drm_gpu_scheduler **sched_list,
              unsigned int num_sched_list,
              atomic_t *guilty)

Initializes a struct drm_sched_entity *entity.

priority: the priority of the entity. The currently supported priorities are, from low to high:

    DRM_SCHED_PRIORITY_MIN
    DRM_SCHED_PRIORITY_NORMAL
    DRM_SCHED_PRIORITY_HIGH
    DRM_SCHED_PRIORITY_KERNEL

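sched_list: the list of GPU schedulers to which jobs from this entity may be submitted. When the GPU has several command stream channels, the same job may have several channels it could run on (this is hardware dependent and must be supported by the GPU). sched_list holds those candidate channels.

When an entity has more than one GPU scheduler, the DRM scheduler load-balances between them.

num_sched_list: the number of GPU schedulers in sched_list.

drm_sched_entity_init() can be called from the driver's open() callback, so that the driver creates one entity for each application that opens the device.

As described earlier, the entity is where jobs are submitted, and one GPU command stream channel corresponds to one GPU scheduler. When several applications submit jobs to the same GPU command stream channel at once, each job is first added to its application's own entity and then waits for the GPU scheduler to schedule them all.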
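
The load balancing across sched_list can be pictured as "pick the least busy candidate". The sketch below is a simplified user-space model under that assumption; struct toy_ring, its fields, and toy_pick_best() are hypothetical names, and the heuristic the kernel actually uses may differ between versions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for one GPU scheduler in an entity's sched_list. */
struct toy_ring {
    const char *name;
    unsigned int queued_jobs; /* assumed load metric for this sketch */
    int ready;                /* non-zero if the ring can accept work */
};

/* Return the ready ring with the fewest queued jobs, or NULL if none. */
static struct toy_ring *toy_pick_best(struct toy_ring **list, size_t n)
{
    struct toy_ring *best = NULL;

    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        struct toy_ring *r = list[i];

        if (!r->ready)
            continue; /* skip rings that cannot take work */
        if (!best || r->queued_jobs < best->queued_jobs)
            best = r;
    }
    return best;
}
```

With a single entry in the list, as in the VC4 test code later in these notes (num_sched_list is 1), there is nothing to balance and that scheduler is always chosen.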

3. Initializing a job

int drm_sched_job_init(struct drm_sched_job *job,
               struct drm_sched_entity *entity,
               void *owner)

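entity: the entity the job will be pushed to. If the entity's sched_list holds more than one scheduler, the load-balancing algorithm picks the best GPU scheduler from the sched_list to schedule the job.

drm_sched_job_init() initializes two dma_fences for the job: scheduled and finished. When the scheduled fence is signaled, the job is about to be sent to the GPU. When the finished fence is signaled, the GPU has finished processing the job.

These two fences therefore tell the outside world the current state of the job.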
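
Read together, the two fences give a three-state view of a job. A minimal sketch of that mapping, with hypothetical names (enum toy_job_state, struct toy_job_fences) and plain booleans standing in for the signaled state of the real fences:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_job_state {
    TOY_JOB_QUEUED, /* neither fence signaled: still in the software queue */
    TOY_JOB_ON_HW,  /* scheduled signaled: handed to the GPU */
    TOY_JOB_DONE    /* finished signaled: the GPU has completed it */
};

/* Booleans standing in for the two fences of one job. */
struct toy_job_fences {
    bool scheduled;
    bool finished;
};

/* Derive the externally visible job state from the two fences. */
static enum toy_job_state toy_job_get_state(const struct toy_job_fences *f)
{
    assert(f != NULL);
    if (f->finished)
        return TOY_JOB_DONE;
    if (f->scheduled)
        return TOY_JOB_ON_HW;
    return TOY_JOB_QUEUED;
}
```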

4. Submitting a job

void drm_sched_entity_push_job(struct drm_sched_job *sched_job,
                   struct drm_sched_entity *entity)

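Once a job has been initialized with drm_sched_job_init(), it can be pushed onto the entity's job_queue with drm_sched_entity_push_job().

If this is the first job pushed onto the entity's job_queue, the entity is added to the GPU scheduler's run queue and the scheduler's scheduling thread is woken up.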

5. The role of dma_fence

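A DMA fence is the Linux primitive for synchronizing DMA operations between different kernel modules. It is commonly used to synchronize GPU rendering with buffer display, among other things.

DMA fences reduce waiting in userspace by letting the synchronization happen inside the kernel. Take GPU rendering and buffer display: GPU rendering writes rendered data into the framebuffer, and the display side shows the framebuffer contents on the screen. The display must wait until rendering has finished before it reads the framebuffer (and the reverse also holds: the GPU must wait until the display is done with the framebuffer before drawing the next frame into it). The application could synchronize the two itself, calling the display module only after rendering has ended. But while it waits, the application usually sleeps and cannot do anything else, such as preparing the rendering of the next framebuffer. Likewise, invoking GPU rendering only after the display has finished leaves the GPU underused and idle.

With DMA fences, the synchronization between GPU rendering and display moves into the kernel. The application's call into GPU rendering returns an out fence. Without waiting for rendering to complete, the application can immediately call the display side and hand it that out fence as an in fence. Inside the kernel, the display module waits for the in fence to be signaled before it outputs the frame, and the fence is signaled as soon as GPU rendering completes. The application is no longer involved.

For a more detailed explanation, see https://www.collabora.com/news-and-blog/blog/2016/09/13/mainline-explicit-fencing-part-1/

In the GPU scheduler, a job's dependencies must be resolved before it is scheduled, and after it is scheduled the job must report its current state to the outside world. Both are implemented with fences.

A job's dependency is called an in_fence and its status report an out_fence. Underneath, both are the same data structure.

When an in fence exists and is not yet signaled, the job has to keep waiting until all of its in fences have been signaled. (The wait is not a call to dma_fence_wait().) Before waiting, the GPU scheduler registers a new dma_fence_cb callback on the in fence; the callback contains the code that wakes the GPU scheduler. When the in fence is signaled, the callback runs and wakes the scheduling of that job again.

A job has two out fences, scheduled and finished. When the scheduled fence is signaled, the job is about to be sent to the GPU. When the finished fence is signaled, the GPU has finished processing the job.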
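
The callback-based wait described above can be modelled in a few lines of user-space C. This is a sketch of the mechanism only: toy_fence, toy_fence_add_callback() and toy_fence_signal() are hypothetical stand-ins that mimic the general shape of the dma_fence callback interface, without locking or reference counting.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_fence;
typedef void (*toy_fence_func)(struct toy_fence *f, void *data);

/* One registered callback; the caller owns the storage. */
struct toy_fence_cb {
    toy_fence_func func;
    void *data;
    struct toy_fence_cb *next;
};

struct toy_fence {
    bool signaled;
    struct toy_fence_cb *cbs; /* callbacks to run on signal */
};

/* Register a callback. Returns false if the fence has already signaled,
 * in which case nothing is installed and the caller proceeds at once. */
static bool toy_fence_add_callback(struct toy_fence *f, struct toy_fence_cb *cb,
                                   toy_fence_func func, void *data)
{
    if (f->signaled)
        return false;
    cb->func = func;
    cb->data = data;
    cb->next = f->cbs;
    f->cbs = cb;
    return true;
}

/* Signal the fence once and run every registered callback. */
static void toy_fence_signal(struct toy_fence *f)
{
    struct toy_fence_cb *cb = f->cbs;

    if (f->signaled)
        return; /* signaling twice has no further effect */
    f->signaled = true;
    f->cbs = NULL;
    while (cb) {
        struct toy_fence_cb *next = cb->next;

        cb->func(f, cb->data);
        cb = next;
    }
}

/* Stand-in for "wake the scheduler thread". */
struct toy_waiter {
    int wakeups;
};

static void toy_wake_scheduler(struct toy_fence *f, void *data)
{
    (void)f;
    ((struct toy_waiter *)data)->wakeups++;
}
```

The point of the design is that nobody blocks on the in fence: the waiter registers toy_wake_scheduler and returns, and is woken exactly once when the fence is signaled.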

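The following is test code on VC4:

The vc4 driver does not use the DRM GPU scheduler as its scheduler. Its handling of in fences is blocking, that is, the application blocks there, which seems to defeat the purpose of fences. The current driver also supports only a single in fence, whereas Vulkan may pass several dependencies.

In actual testing, using the GPU scheduler made no noticeable performance difference compared with the original driver, but it can solve the fence problem. Of course the main goal was to get some practice; the code is largely based on the v3d driver.

1. First, call drm_sched_init() to create the schedulers. VC4 has two command entry points, bin and render, so two schedulers are created here.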

/*
 * Backend callbacks for the bin queue. The callbacks referenced here are
 * not shown in these notes; note that .free_job must take a
 * struct drm_sched_job *.
 */
static const struct drm_sched_backend_ops vc4_bin_sched_ops = {
    .dependency = vc4_job_dependency,
    .run_job = vc4_bin_job_run,
    .timedout_job = NULL,
    .free_job = vc4_job_free,
};

/* Backend callbacks for the render queue. */
static const struct drm_sched_backend_ops vc4_render_sched_ops = {
    .dependency = vc4_job_dependency,
    .run_job = vc4_render_job_run,
    .timedout_job = NULL,
    .free_job = vc4_job_free,
};

/* Create one scheduler per hardware queue (bin and render). */
int vc4_sched_init(struct vc4_dev *vc4)
{
    int hw_jobs_limit = 1;      /* the hardware runs one job per queue at a time */
    int job_hang_limit = 0;
    int hang_limit_ms = 500;
    int ret;

    ret = drm_sched_init(&vc4->queue[VC4_BIN].sched,
                         &vc4_bin_sched_ops,
                         hw_jobs_limit,
                         job_hang_limit,
                         msecs_to_jiffies(hang_limit_ms),
                         "vc4_bin");
    if (ret) {
        dev_err(vc4->base.dev, "Failed to create bin scheduler: %d.\n", ret);
        return ret;
    }

    ret = drm_sched_init(&vc4->queue[VC4_RENDER].sched,
                         &vc4_render_sched_ops,
                         hw_jobs_limit,
                         job_hang_limit,
                         msecs_to_jiffies(hang_limit_ms),
                         "vc4_render");
    if (ret) {
        dev_err(vc4->base.dev, "Failed to create render scheduler: %d.\n", ret);
        /* Tear down the bin scheduler that was already created. */
        vc4_sched_fini(vc4);
        return ret;
    }

    return 0;
}

2. Add the entity initialization code to the DRM driver's open callback.

static int vc4_open(struct drm_device *dev, struct drm_file *file)
{
    struct vc4_dev *vc4 = to_vc4_dev(dev);
    struct vc4_file *vc4file;
    struct drm_gpu_scheduler *sched;
    int i;

    vc4file = kzalloc(sizeof(*vc4file), GFP_KERNEL);
    if (!vc4file)
        return -ENOMEM;

    vc4_perfmon_open_file(vc4file);

    /* One entity per queue for this client; each entity has exactly one scheduler. */
    for (i = 0; i < VC4_MAX_QUEUES; i++) {
        sched = &vc4->queue[i].sched;
        /* The return value is not checked here. */
        drm_sched_entity_init(&vc4file->sched_entity[i],
                              DRM_SCHED_PRIORITY_NORMAL,
                              &sched, 1,
                              NULL);
    }

    file->driver_priv = vc4file;

    return 0;
}

3. Once the driver has packaged a job, it can push the job to an entity.

/* kref release function: drops everything the job holds, then frees it. */
static void vc4_job_free(struct kref *ref)
{
    struct vc4_job *job = container_of(ref, struct vc4_job, refcount);
    struct vc4_dev *vc4 = job->dev;
    struct vc4_exec_info *exec = job->exec;
    struct vc4_seqno_cb *cb, *cb_temp;
    struct dma_fence *fence;
    unsigned long index;
    unsigned long irqflags;

    /* Drop the references held on any remaining dependency fences. */
    xa_for_each(&job->deps, index, fence) {
        dma_fence_put(fence);
    }
    xa_destroy(&job->deps);

    dma_fence_put(job->irq_fence);
    dma_fence_put(job->done_fence);

    if (exec)
        vc4_complete_exec(&job->dev->base, exec);

    /* Kick the seqno callbacks whose sequence number has been reached. */
    spin_lock_irqsave(&vc4->job_lock, irqflags);
    list_for_each_entry_safe(cb, cb_temp, &vc4->seqno_cb_list, work.entry) {
        if (cb->seqno <= vc4->finished_seqno) {
            list_del_init(&cb->work.entry);
            schedule_work(&cb->work);
        }
    }
    spin_unlock_irqrestore(&vc4->job_lock, irqflags);

    kfree(job);
}

void vc4_job_put(struct vc4_job *job)
{
    kref_put(&job->refcount, job->free);
}

static int vc4_job_init(struct vc4_dev *vc4, struct drm_file *file_priv,
                        struct vc4_job *job, void (*free)(struct kref *ref), u32 in_sync)
{
    struct dma_fence *in_fence = NULL;
    int ret;

    /* vc4_job_free() reads job->dev, so it has to be set here. */
    job->dev = vc4;

    xa_init_flags(&job->deps, XA_FLAGS_ALLOC);

    /* Record the optional in fence (a syncobj handle) as a dependency. */
    if (in_sync) {
        ret = drm_syncobj_find_fence(file_priv, in_sync, 0, 0, &in_fence);
        if (ret == -EINVAL)
            goto fail;

        ret = drm_gem_fence_array_add(&job->deps, in_fence);
        if (ret) {
            dma_fence_put(in_fence);
            goto fail;
        }
    }

    kref_init(&job->refcount);
    job->free = free;

    return 0;

fail:
    xa_destroy(&job->deps);
    return ret;
}

static int vc4_push_job(struct drm_file *file_priv, struct vc4_job *job, enum vc4_queue queue)
{
    struct vc4_file *vc4file = file_priv->driver_priv;
    int ret;

    ret = drm_sched_job_init(&job->base, &vc4file->sched_entity[queue], vc4file);
    if (ret)
        return ret;

    /* Keep the finished fence so that later jobs and userspace can wait on it. */
    job->done_fence = dma_fence_get(&job->base.s_fence->finished);

    /* The scheduler now holds its own reference to the job. */
    kref_get(&job->refcount);

    drm_sched_entity_push_job(&job->base, &vc4file->sched_entity[queue]);

    return 0;
}

/* Queues a struct vc4_exec_info for execution.  If no job is
 * currently executing, then submits it.
 *
 * Unlike most GPUs, our hardware only handles one command list at a
 * time.  To queue multiple jobs at once, we'd need to edit the
 * previous command list to have a jump to the new one at the end, and
 * then bump the end address.  That's a change for a later date,
 * though.
 */
static int
vc4_queue_submit_to_scheduler(struct drm_device *dev,
                              struct drm_file *file_priv,
                              struct vc4_exec_info *exec,
                              struct ww_acquire_ctx *acquire_ctx)
{
    struct vc4_dev *vc4 = to_vc4_dev(dev);
    struct drm_vc4_submit_cl *args = exec->args;
    struct vc4_job *bin = NULL;
    struct vc4_job *render = NULL;
    struct drm_syncobj *out_sync;
    uint64_t seqno;
    unsigned long irqflags;
    int ret;

    spin_lock_irqsave(&vc4->job_lock, irqflags);

    seqno = ++vc4->emit_seqno;
    exec->seqno = seqno;

    spin_unlock_irqrestore(&vc4->job_lock, irqflags);

    render = kcalloc(1, sizeof(*render), GFP_KERNEL);
    if (!render)
        return -ENOMEM;

    render->exec = exec;

    ret = vc4_job_init(vc4, file_priv, render, vc4_job_free, args->in_sync);
    if (ret) {
        kfree(render);
        return ret;
    }

    /* A bin job exists only when the submission carries a binner command list. */
    if (args->bin_cl_size != 0) {
        bin = kcalloc(1, sizeof(*bin), GFP_KERNEL);
        if (!bin) {
            vc4_job_put(render);
            return -ENOMEM;
        }

        bin->exec = exec;

        ret = vc4_job_init(vc4, file_priv, bin, vc4_job_free, args->in_sync);
        if (ret) {
            vc4_job_put(render);
            kfree(bin);
            return ret;
        }
    }

    mutex_lock(&vc4->sched_lock);

    if (bin) {
        ret = vc4_push_job(file_priv, bin, VC4_BIN);
        if (ret)
            goto fail_unlock;

        /* The render job must wait until the bin job has finished. */
        ret = drm_gem_fence_array_add(&render->deps, dma_fence_get(bin->done_fence));
        if (ret)
            goto fail_unlock;
    }

    ret = vc4_push_job(file_priv, render, VC4_RENDER);
    if (ret)
        goto fail_unlock;

    mutex_unlock(&vc4->sched_lock);

    if (args->out_sync) {
        out_sync = drm_syncobj_find(file_priv, args->out_sync);
        if (!out_sync) {
            ret = -EINVAL;
            goto fail;
        }

        /*
         * Expose the render job's finished fence as the out fence.
         * bin is NULL when there is no binner command list, so its
         * fences cannot be used here.
         */
        drm_syncobj_replace_fence(out_sync, render->done_fence);
        exec->fence = render->done_fence;

        drm_syncobj_put(out_sync);
    }

    vc4_update_bo_seqnos(exec, seqno);

    vc4_unlock_bo_reservations(dev, exec, acquire_ctx);

    /* Drop the submission path's references; the scheduler keeps its own. */
    if (bin)
        vc4_job_put(bin);
    vc4_job_put(render);

    return 0;

fail_unlock:
    /* Do not return with sched_lock still held. */
    mutex_unlock(&vc4->sched_lock);
fail:
    return ret;
}

References:

https://dri.freedesktop.org/docs/drm/gpu/drm-mm.html#gpu-scheduler

https://rosenzweig.io/blog/from-bifrost-to-panfrost.html

https://www.collabora.com/news-and-blog/blog/2017/01/26/mainline-explicit-fencing-part-3/
