GPU Video Stream Decoding with FFmpeg (Part 1): Basic Concepts


I have been implementing GPU video stream decoding recently and ran into a lot of problems along the way.

I received guidance from Cai Ding, a video processing expert at Alibaba, and from Ji Guang, a developer at NVIDIA; my thanks to both of them!

 

Basic Commands (Linux)

1. List the physical graphics cards

lspci  | grep -i vga
root@g1060server:/home/user# lspci  | grep -i vga
09:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
81:00.0 VGA compatible controller: NVIDIA Corporation Device 1c03 (rev a1)
82:00.0 VGA compatible controller: NVIDIA Corporation Device 1c03 (rev a1)


2. Inspect the NVIDIA cards directly
Sometimes, because of incompatibilities between the server model and the GPU model, the motherboard fails to recognize an installed card.
Use the following command to check whether the motherboard has detected the card:

root@g1060server:/home/user# lspci | grep -i nvidia
81:00.0 VGA compatible controller: NVIDIA Corporation Device 1c03 (rev a1)
81:00.1 Audio device: NVIDIA Corporation Device 10f1 (rev a1)
82:00.0 VGA compatible controller: NVIDIA Corporation Device 1c03 (rev a1)
82:00.1 Audio device: NVIDIA Corporation Device 10f1 (rev a1)

If you see output like the above, the motherboard has recognized the cards.


CUDA toolkit version

root@g1060server:/home/user# nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Wed_Jul_17_18:36:13_PDT_2013
Cuda compilation tools, release 5.5, V5.5.0


NVIDIA GPU status

root@g1060server:/home/user# nvidia-smi
modprobe: ERROR: could not insert 'nvidia_340': No such device
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

If this fails, the driver is usually not installed (or not loaded).

 

user@g1060server:~$ nvidia-smi
Fri Jan  5 21:50:34 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:81:00.0  On |                  N/A |
| 32%   35C    P8    10W / 120W |   3083MiB /  6071MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 106...  Off  | 00000000:82:00.0 Off |                  N/A |
| 32%   37C    P8    10W / 120W |   2542MiB /  6072MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

This is what a successful query looks like.


Check whether the CUDA driver works (deviceQuery sample)

root@g1060server:/home/user# cd /usr/local/cuda-8.0/samples/1_Utilities/deviceQuery
root@g1060server:/usr/local/cuda-8.0/samples/1_Utilities/deviceQuery# ls
deviceQuery  deviceQuery.cpp  deviceQuery.o  Makefile  NsightEclipse.xml  readme.txt
root@g1060server:/usr/local/cuda-8.0/samples/1_Utilities/deviceQuery# make
make: Nothing to be done for 'all'.
root@g1060server:/usr/local/cuda-8.0/samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL

This confirms again that the CUDA driver setup failed: the installed driver is older than what the CUDA runtime requires.

Check whether CUDA itself installed successfully
/usr/local/cuda/extras/demo_suite/deviceQuery

root@g1060server:/home/user/mjl/test# /usr/local/cuda/extras/demo_suite/deviceQuery
/usr/local/cuda/extras/demo_suite/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "GeForce GTX 1060 6GB"
  CUDA Driver Version / Runtime Version          9.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6071 MBytes (6366363648 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 129 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "GeForce GTX 1060 6GB"
  CUDA Driver Version / Runtime Version          9.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6073 MBytes (6367739904 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1709 MHz (1.71 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 130 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
> Peer access from GeForce GTX 1060 6GB (GPU0) -> GeForce GTX 1060 6GB (GPU1) : Yes
> Peer access from GeForce GTX 1060 6GB (GPU1) -> GeForce GTX 1060 6GB (GPU0) : Yes

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 8.0, NumDevs = 2, Device0 = GeForce GTX 1060 6GB, Device1 = GeForce GTX 1060 6GB
Result = PASS

Success: both devices are detected.

 

Main Workflow

To move FFmpeg decoding onto the GPU, you first need a basic grasp of FFmpeg's decoding workflow, because the GPU path follows the same pipeline, with part of the work offloaded to the GPU.

I previously posted the CPU decoding workflow at http://www.cnblogs.com/baldermurphy/p/7828337.html

The main steps are as follows:

    avformat_network_init();
    av_register_all(); // 1. Register all demuxers and codecs; in FFmpeg 3.3+ this includes the GPU decoders

    std::string tempfile = "xxxx"; // video stream URL

    avformat_find_stream_info(format_context_, nullptr); // 2. Probe a short stretch of the stream to work out its format
    if (AVMEDIA_TYPE_VIDEO == enc->codec_type && video_stream_index_ < 0) // 3. Pick out the video stream
    codec_ = avcodec_find_decoder(enc->codec_id); // 4. Find the matching decoder
    codec_context_ = avcodec_alloc_context3(codec_); // 5. Allocate the decoder context

    av_read_frame(format_context_, &packet_); // 6. Read a packet

    avcodec_send_packet(codec_context_, &packet_); // 7. Submit the packet for decoding
    avcodec_receive_frame(codec_context_, yuv_frame_); // 8. Receive a decoded frame

    sws_scale(y2r_sws_context_, yuv_frame_->data, yuv_frame_->linesize, 0, codec_context_->height, rgb_data_, rgb_line_size_); // 9. Convert the pixel format

GPU decoding changes steps 4, 7, 8 and 9, that is:

find the GPU decoder,

feed the packets to the GPU decoder,

retrieve the decoded frames,

and convert the pixel format on the GPU if needed (e.g. NV12 to BGRA).

The final format depends on the concrete requirements. With many OpenGL interop setups, for example, you convert to a given format (BGRA) and share a single block of memory, so the data refreshes in place without even a copy;

producing image files is a different requirement again (BGR).

 

Matching the Use Case

One point that has to be made: GPU computing is a great capability, but it has to fit your requirements and scenario; ignore that and it can cost more than it gains.

An extreme example: OpenCV also has a GPU implementation of image decoding, but nobody chasing efficiency uses it,

because for a single image, uploading the data to the GPU (not parallelizable, slow), decoding it (very fast), and copying it back from GPU memory to host memory (not parallelizable, slow)

can burn hundreds of milliseconds on the upload and copy-out alone. When image operations are frequent, CPU load is not meaningfully relieved and sometimes even rises: the decode itself is fast, but the user experiences it as slow, and both the CPU and the GPU end up occupied.

 

Key Links

The FFmpeg GPU-decoding SDK recommended by NVIDIA

https://developer.nvidia.com/nvidia-video-codec-sdk

A tool for checking for GPU memory leaks

http://docs.nvidia.com/cuda/cuda-memcheck/index.html#device-side-allocation-checking

 

