Summary of NVIDIA Hardware Acceleration in ffmpeg


0. Overview

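FFmpeg can be accelerated with NVIDIA GPUs. The high-level interface to the GPU resources is the Video Codec SDK, which includes a complete set of high-performance tools, source code, and documentation, and runs on Windows and Linux. On the software side, the SDK exposes two hardware-acceleration interfaces: the NVENCODE API for encoding and the NVDECODE API (formerly called the NVCUVID API) for decoding. On the hardware side, an NVIDIA GPU contains one or more codec engines (also called hardware acceleration engines) that are independent of the CUDA cores. In terms of video formats, encoding supports H.264, H.265, and lossless compression, bit depths of 8 and 10 bits, YUV 4:4:4 and 4:2:0 chroma formats, and resolutions up to 8K. Decoding supports MPEG-2, VC1, VP8, VP9, H.264, H.265, and lossless compression, bit depths of 8, 10, and 12 bits, the YUV 4:2:0 chroma format, and resolutions up to 8K. The Video Codec SDK is already integrated into the ffmpeg project, but ffmpeg exposes relatively few codec configuration parameters; to make full use of the codec features you still need to program against the SDK directly.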
[Figure: image missing from the source]
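The figure below compares the NVIDIA encoder with x264 running on a CPU in terms of performance and quality. Performance is measured in encoded frames per second and quality in PSNR.
[Figure: NVENC vs. x264 performance and quality comparison; image missing from the source]
In performance, the NVIDIA encoder is 2 to 5 times faster than x264. In quality, it is better than x264 in the fast-stream scenario and worse in the high-quality scenario. However, the comparison does not state which NVIDIA product was used, nor the CPU model and platform capabilities on which x264 was run. The next figure shows that, for 1080p at 30 fps, NVENC can encode 21 streams, or 9 streams in high-quality mode.
[Figure: number of 1080p@30fps streams supported by NVENC; image missing from the source]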
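The encoding capabilities of different GPU models are listed in the following table:
[Table: encoding capabilities by GPU model; image missing from the source]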
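The NVIDIA decoder performance figures are shown below, although only for two Tesla products.
[Figure: NVDEC decoding performance for two Tesla products; image missing from the source]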
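The decoding capabilities are listed in the following table:
[Table: decoding capabilities by GPU model; image missing from the source]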

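1. Installing the Driver and SDK

1.1 Preparation

All open-source display drivers must be disabled. Edit the modprobe blacklist:
vi /etc/modprobe.d/blacklist.conf
and add the following lines: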
blacklist amd76x_edac 
blacklist vga16fb 
blacklist nouveau 
blacklist nvidiafb 
blacklist rivatv
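
After editing the file, it can be worth confirming that every entry really made it in. The following is only a minimal sketch: `missing_blacklist_entries` is a hypothetical helper name (not an NVIDIA or modprobe tool), and it assumes each entry sits on its own line in the form shown above.

```shell
#!/bin/sh
# Print every module from the list above that is NOT yet blacklisted in the
# given modprobe configuration file (hypothetical helper for illustration).
missing_blacklist_entries() {
    conf=$1
    for mod in amd76x_edac vga16fb nouveau nvidiafb rivatv; do
        # Match "blacklist <module>" at line start, tolerating extra whitespace.
        if ! grep -Eq "^[[:space:]]*blacklist[[:space:]]+${mod}[[:space:]]*$" "$conf" 2>/dev/null; then
            echo "$mod"
        fi
    done
}

# Report what is still missing from the file edited above.
missing_blacklist_entries /etc/modprobe.d/blacklist.conf
```

An empty output means all five modules are blacklisted; a missing file is treated as if nothing were blacklisted.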

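1.2 Driver Installation

(1). Remove the existing driver
apt-get remove --purge 'nvidia*'
(2). Download the driver .run file from the official site and install it
service lightdm stop
chmod +x NVIDIA-Linux-x86_64-367.44.run
./NVIDIA-Linux-x86_64-367.44.run
service lightdm start
reboot
(3). Verify the driver installation
Run nvidia-smi; output like the following indicates a successful installation.
[Figure: nvidia-smi output; image missing from the source]
Problem 1: If the graphical interface cannot be entered after rebooting and the login screen loops, the video driver was not installed completely and must be reinstalled. The safe approach is to install it over the network.
Run the following in a console:
apt-get remove --purge 'nvidia-*'
add-apt-repository ppa:graphics-drivers/ppa
apt-get update
service lightdm stop
apt-get install nvidia-375 nvidia-settings nvidia-prime
nvidia-xconfig
apt-get install mesa-common-dev   # install the missing libraries
apt-get install freeglut3-dev
update-initramfs -u
reboot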

1.3 SDK Installation

(1). Download the CUDA toolkit .run file from the official site and install it
sh cuda_8.0.44_linux.run --no-opengl-libs   # OpenGL support is not needed
apt-get install freeglut3-dev build-essential libx11-dev
apt-get install libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa
apt-get install libglu1-mesa-dev
gedit ~/.bashrc
Add:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
gedit /etc/ld.so.conf.d/cuda.conf
Add:
/usr/local/cuda/lib64
/lib
/lib32
/lib64
/usr/lib
/usr/lib32
sudo ldconfig
(2). Verify the SDK installation
Run nvcc -V; output like the following indicates a successful installation.
[Figure: nvcc -V output; image missing from the source]
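
The two export lines added to ~/.bashrc in step (1) grow PATH and LD_LIBRARY_PATH a little more every time the file is sourced again. A minimal sketch of an idempotent variant follows; `prepend_once` is a hypothetical helper name, and the CUDA paths are the ones used above.

```shell
#!/bin/sh
# Prepend a directory to a colon-separated list only if it is not already
# present (hypothetical helper for illustration).
prepend_once() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;        # already present: unchanged
        "::")       printf '%s\n' "$dir" ;;         # list was empty
        *)          printf '%s\n' "$dir:$list" ;;   # normal case: prepend
    esac
}

PATH=$(prepend_once /usr/local/cuda/bin "$PATH")
LD_LIBRARY_PATH=$(prepend_once /usr/local/cuda/lib64 "${LD_LIBRARY_PATH:-}")
export PATH LD_LIBRARY_PATH
```

Unlike a plain `dir:$VAR` export, this also avoids a trailing colon when the variable was empty to begin with.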

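2. Sample Tests

2.1 Building the Samples

Go to the Samples directory and run make. If the OpenGL libraries are not installed, NvDecodeGL fails to compile.
The purpose of each project is described in the NVIDIA_Video_Codec_SDK_Samples_Guide:
NvEncoder: basic encoding
NvEncoderCudaInterop: encoding of CUDA surfaces
NvEncoderD3D9Interop: encoding of D3D9 surfaces; not available on Linux
NvEncoderLowLatency: use of low-latency features such as intra refresh and reference picture invalidation (RPI)
NvEncoderPerf: encoding at maximum performance
NvTranscoder: transcoding with NVENC
NvDecodeD3D9: video decoding with D3D9 display; not available on Linux
NvDecodeD3D11: video decoding with D3D11 display; not available on Linux
NvDecodeGL: video decoding with OpenGL display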

2.2 Running the Samples

See the NVIDIA_Video_Codec_SDK_Samples_Guide.
Problem 2: If running a sample reports "libcuda.so failed!",
create a symbolic link named libcuda.so in /usr/lib/x86_64-linux-gnu that points to libcuda.so.375.26.
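
The fix for Problem 2 can be sketched as a small function. This is only an illustration: `link_libcuda` is a hypothetical helper name, and it assumes the versioned driver library (libcuda.so.375.26 in this setup) is already in the directory passed in.

```shell
#!/bin/sh
# Create libcuda.so in a directory as a symlink to the versioned driver
# library found there (hypothetical helper for illustration).
link_libcuda() {
    dir=$1
    # Pick the versioned library, e.g. libcuda.so.375.26 (last match if several).
    target=$(ls "$dir"/libcuda.so.*.* 2>/dev/null | tail -n 1)
    if [ -z "$target" ]; then
        echo "no versioned libcuda.so.* found in $dir" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link; the target is relative.
    ln -sf "$(basename "$target")" "$dir/libcuda.so"
}

# Example (needs root for the system directory):
# link_libcuda /usr/lib/x86_64-linux-gnu
```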

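3. Integration with ffmpeg

3.1 Building ffmpeg

3.1.1 Prerequisites

Make sure the Video_Codec_SDK_7.1.9/Samples/common/inc directory contains the basic header files.
Make sure the Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 directory contains libGLEW.a.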
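
Both prerequisites can be checked before configure is run. The sketch below is illustrative only: `check_sdk_layout` is a hypothetical helper name, and treating nvEncodeAPI.h as the "basic header" that matters for NVENC is an assumption, since the text does not name the files.

```shell
#!/bin/sh
# Verify the two prerequisites above under a given SDK root directory.
# Hypothetical helper; nvEncodeAPI.h is ASSUMED to be the required header.
check_sdk_layout() {
    root=$1
    status=0
    if [ ! -f "$root/Samples/common/inc/nvEncodeAPI.h" ]; then
        echo "missing header: Samples/common/inc/nvEncodeAPI.h"
        status=1
    fi
    if [ ! -f "$root/Samples/common/lib/linux/x86_64/libGLEW.a" ]; then
        echo "missing library: Samples/common/lib/linux/x86_64/libGLEW.a"
        status=1
    fi
    return $status
}

# Example: check_sdk_layout /root/workspace/Video_Codec_SDK_7.1.9
```

The function prints one line per missing file and returns non-zero if anything is missing, so it can gate a build script.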

3.1.2 configure command

./configure \
  --enable-version3 \
  --enable-libfdk-aac \
  --enable-libmp3lame \
  --enable-libx264 \
  --enable-nvenc \
  --extra-cflags=-I/root/workspace/Video_Codec_SDK_7.1.9/Samples/common/inc \
  --extra-ldflags=-L/root/workspace/Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 \
  --enable-shared \
  --enable-gpl \
  --enable-postproc \
  --enable-nonfree \
  --enable-avfilter \
  --enable-pthreads

3.1.3 make

Run make && make install

3.2 Testing ffmpeg

Run ffmpeg -codecs | grep nvenc
Output like the following indicates that the NVENC encoders were built in:

ffmpeg version 3.0.git Copyright (c) 2000-2016 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.1) 20160609
  configuration: --enable-version3 --enable-libfdk-aac --enable-libmp3lame --enable-libx264 --enable-nvenc --extra-cflags=-I/workspace/Video_Codec_SDK_7.1.9/Samples/common/inc --extra-ldflags=-L/workspace/Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 --enable-shared --enable-gpl --enable-postproc --enable-nonfree --enable-avfilter --enable-pthreads
  libavutil      55. 29.100 / 55. 29.100
  libavcodec     57. 54.100 / 57. 54.100
  libavformat    57. 48.100 / 57. 48.100
  libavdevice    57.  0.102 / 57.  0.102
  libavfilter     6. 57.100 /  6. 57.100
  libswscale      4.  1.100 /  4.  1.100
  libswresample   2.  1.100 /  2.  1.100
  libpostproc    54.  0.100 / 54.  0.100
 DEV.LS h264                 H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10 (encoders: libx264 libx264rgb h264_nvenc nvenc nvenc_h264 )
 DEV.L. hevc                 H.265 / HEVC (High Efficiency Video Coding) (encoders: nvenc_hevc hevc_nvenc )

The flag prefixes have the following meanings:
D..... = Decoding supported
.E.... = Encoding supported
..V... = Video codec
..A... = Audio codec
..S... = Subtitle codec
...I.. = Intra frame-only codec
....L. = Lossy compression
.....S = Lossless compression
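
The position-based flags above can be decoded mechanically. The following is a minimal sketch, with `describe_codec_flags` as a hypothetical helper name; it expects the six-character prefix exactly as ffmpeg -codecs prints it.

```shell
#!/bin/sh
# Decode the six-character capability prefix printed by `ffmpeg -codecs`.
# Each position has a fixed meaning; '.' means "not set".
describe_codec_flags() {
    case $1 in D?????) echo "decoding supported" ;; esac
    case $1 in ?E????) echo "encoding supported" ;; esac
    case $1 in
        ??V???) echo "video codec" ;;
        ??A???) echo "audio codec" ;;
        ??S???) echo "subtitle codec" ;;
    esac
    case $1 in ???I??) echo "intra frame-only codec" ;; esac
    case $1 in ????L?) echo "lossy compression" ;; esac
    case $1 in ?????S) echo "lossless compression" ;; esac
    return 0
}

# The h264 entry from the output above:
describe_codec_flags DEV.LS
```

For the hevc entry (DEV.L.) the same call simply omits the "lossless compression" line.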

3.3 Using the Codecs

H.265 encoding test
(1). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec hevc_nvenc -r 60 -y 2_60.265 
(2). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec hevc_nvenc -r 30 -y 2_30.265

H.264 encoding test
(3). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec h264_nvenc -r 60 -y 2_60.264 
(4). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec h264_nvenc -r 30 -y 2_30.264

H.264 to H.265 transcoding
(5). ffmpeg -i 1_60.264 -vcodec hevc_nvenc -r 60 -y 2_60_264to265.265 
(6). ffmpeg -i 1_30.264 -vcodec hevc_nvenc -r 30 -y 2_30_264to265.265

H.265 to H.264 transcoding
(7). ffmpeg -i 1_60.265 -vcodec h264_nvenc -r 60 -y 2_60_265to264.264 
(8). ffmpeg -i 1_30.265 -vcodec h264_nvenc -r 30 -y 2_30_265to264.264
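
The four raw-YUV encoding commands (1) to (4) differ only in the encoder and the frame rate, so they can be generated from a loop. The sketch below only prints the commands rather than running them; `nvenc_encode_cmd` is a hypothetical helper name, and the file names are the ones used above.

```shell
#!/bin/sh
# Print (do not run) the raw-YUV encode command used above for a given
# encoder and output frame rate (hypothetical helper for illustration).
nvenc_encode_cmd() {
    encoder=$1   # h264_nvenc or hevc_nvenc
    fps=$2       # value passed to -r
    case $encoder in
        hevc_nvenc) ext=265 ;;
        h264_nvenc) ext=264 ;;
        *) echo "unsupported encoder: $encoder" >&2; return 1 ;;
    esac
    echo "ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec $encoder -r $fps -y 2_${fps}.${ext}"
}

# Print commands (1) to (4) in order.
for enc in hevc_nvenc h264_nvenc; do
    for rate in 60 30; do
        nvenc_encode_cmd "$enc" "$rate"
    done
done
```

Piping the output to sh would execute the commands, provided the input file is in the current directory.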

3.4 Usage in Programs

avcodec_find_encoder_by_name("h264_nvenc");
avcodec_find_encoder_by_name("hevc_nvenc");

4. Auxiliary Tools

watch -n 1 nvidia-smi
Shows GPU resource usage at 1-second intervals.
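
When the readings need to be kept for later comparison, nvidia-smi can also emit them as CSV at a fixed interval. A minimal sketch, where `gpu_util_csv` is a hypothetical wrapper name and the selection of queried fields is just one reasonable choice:

```shell
#!/bin/sh
# Log GPU utilisation as CSV at a fixed interval instead of redrawing the
# full nvidia-smi screen (hypothetical wrapper for illustration).
gpu_util_csv() {
    interval=${1:-1}   # seconds between samples, default 1
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found in PATH" >&2
        return 127
    fi
    nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used \
               --format=csv -l "$interval"
}

# Example: gpu_util_csv 1 > gpu.log   (stop with Ctrl-C)
```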

5. Measured Results

5.1 Hardware

I tested with a GeForce GTX1070 and a Tesla P4, both of which use the Pascal architecture.
(1). Hardware information for the GTX1070 (as reported by deviceQuery):

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1070"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8110 MBytes (8504279040 bytes)
  (15) Multiprocessors, (128) CUDA Cores/MP:     1920 CUDA Cores
  GPU Max Clock rate:                            1683 MHz (1.68 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1070
Result = PASS

(2). Hardware information for the P4:

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla P4"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 7606 MBytes (7975862272 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1114 MHz (1.11 GHz)
  Memory Clock rate:                             3003 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla P4
Result = PASS

5.2 Experimental Results

(1). GTX1070

| Frame rate | HEVC encoding | H.264 encoding | H.264 to H.265 | H.265 to H.264 |
| 60fps | 387fps(6.45x) | 430fps(7.17x) | 348fps(5.79x) | 170fps(2.84x) |
| 30fps | 345fps(11.5x) | 429fps(14.3x) | 318fps(10.6x) | 94fps(3.13x) |

(2). P4

| Frame rate | HEVC encoding | H.264 encoding | H.264 to H.265 | H.265 to H.264 |
| 60fps | 235fps(3.91x) | 334fps(5.57x) | 217fps(3.63x) | 171fps(2.85x) |
| 30fps | 212fps(7.07x) | 322fps(10.7x) | 198fps(6.59x) | 94fps(3.14x) |
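
The multiplier in parentheses is the measured speed divided by the target frame rate, i.e. how many times faster than real time the job ran. A minimal sketch of that arithmetic, with `realtime_factor` as a hypothetical helper name:

```shell
#!/bin/sh
# Real-time factor = measured speed (fps) / target frame rate (fps),
# printed with two decimals.
realtime_factor() {
    awk -v fps="$1" -v target="$2" 'BEGIN { printf "%.2f\n", fps / target }'
}

realtime_factor 430 60   # GTX1070, H.264 encoding at 60 fps
realtime_factor 429 30   # GTX1070, H.264 encoding at 30 fps
```

A few table entries differ from this calculation in the last digit (for example 348/60 gives 5.80 against the listed 5.79), presumably because the fps values in the table are themselves rounded.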

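5.3 Analysis

In terms of hardware, the P4 has slightly less memory than the GTX1070 (7606 MB vs. 8110 MB), a lower clock rate (1114 MHz vs. 1683 MHz), and 33% more CUDA cores (2560 vs. 1920). Yet in the experiments the P4 trails the GTX1070 in every case except H.265 to H.264 transcoding, where the two are on par. This is consistent with the official statement that the codec engines are independent of the CUDA cores.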
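
The core counts behind the 33% figure follow directly from the deviceQuery output in 5.1. A minimal sketch of the arithmetic, with `cuda_cores` and `percent_more` as hypothetical helper names:

```shell
#!/bin/sh
# CUDA cores = multiprocessors x cores per multiprocessor (from deviceQuery).
cuda_cores() {
    echo $(( $1 * $2 ))
}

# Percentage by which a exceeds b, rounded to the nearest integer.
percent_more() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a - b) / b * 100 }'
}

gtx1070=$(cuda_cores 15 128)
p4=$(cuda_cores 20 128)
echo "GTX1070: $gtx1070 cores, P4: $p4 cores"
echo "P4 has $(percent_more "$p4" "$gtx1070")% more CUDA cores"
```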

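6. Source Code Analysis

A video codec integrated into the ffmpeg framework must define an AVCodec structure (containing a private AVClass structure, three functions, and so on).

6.1 H.264

(1). Structures (nvenc_h264.c)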

static const AVClass h264_nvenc_class = {
    .class_name = "h264_nvenc",
    .item_name  = av_default_item_name,
    .option     = options,               // encoder options are in this AVOption array
    .version    = LIBAVUTIL_VERSION_INT,
};

AVCodec ff_h264_nvenc_encoder = {
    .name           = "h264_nvenc",
    .long_name      = NULL_IF_CONFIG_SMALL("NVIDIA NVENC H.264 encoder"),
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_H264,
    .init           = ff_nvenc_encode_init,   // initialization function
    .encode2        = ff_nvenc_encode_frame,  // encoding function
    .close          = ff_nvenc_encode_close,  // close function
    .priv_data_size = sizeof(NvencContext),   // internal data structure, see nvenc.h
    .priv_class     = &h264_nvenc_class,      // private class, defined above
    .defaults       = defaults,
    .capabilities   = AV_CODEC_CAP_DELAY,
    .caps_internal  = FF_CODEC_CAP_INIT_CLEANUP,
    .pix_fmts       = ff_nvenc_pix_fmts,
};

Note that there are two more AVCodec definitions, named nvenc and nvenc_h264; their three functions are the same as those of h264_nvenc.
(2). Handler functions (nvenc.c)

av_cold int ff_nvenc_encode_init(AVCodecContext *avctx)
{
    NvencContext *ctx = avctx->priv_data; // read the private context
    ...
    // the NVENC setup steps follow, in this order:
    nvenc_load_libraries
    nvenc_setup_device
    nvenc_setup_encoder
    nvenc_setup_surfaces
    nvenc_setup_extradata
    ...
}

int ff_nvenc_encode_frame(AVCodecContext *avctx, AVPacket *pkt,
                          const AVFrame *frame, int *got_packet)
{
    ...
    if (frame) {
        inSurf = get_free_frame(ctx);                    // take a free input surface for the new frame
        ...
        res = nvenc_upload_frame(avctx, frame, inSurf);  // hand the frame over for encoding
        ...
    }
}

av_cold int ff_nvenc_encode_close(AVCodecContext *avctx)
{
    ... // free and destroy resources
}

6.2 H.265

(1). Structures (nvenc_hevc.c)

static const AVClass hevc_nvenc_class = {
    .class_name = "hevc_nvenc",
    .item_name  = av_default_item_name,
    .option     = options,               // encoder options are in this AVOption array
    .version    = LIBAVUTIL_VERSION_INT,
};

AVCodec ff_hevc_nvenc_encoder = {
    .name           = "hevc_nvenc",
    .long_name      = NULL_IF_CONFIG_SMALL("NVIDIA NVENC hevc encoder"),
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_HEVC,
    .init           = ff_nvenc_encode_init,   // initialization function
    .encode2        = ff_nvenc_encode_frame,  // encoding function
    .close          = ff_nvenc_encode_close,  // close function
    .priv_data_size = sizeof(NvencContext),   // internal data structure, see nvenc.h
    .priv_class     = &hevc_nvenc_class,      // private class, defined above
    .defaults       = defaults,
    .pix_fmts       = ff_nvenc_pix_fmts,
    .capabilities   = AV_CODEC_CAP_DELAY,
    .caps_internal  = FF_CODEC_CAP_INIT_CLEANUP,
};

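Note that there is one more AVCodec, named nvenc_hevc; its three functions are the same as those of h264_nvenc.
(2). Handler functions (nvenc.c)
Same as the H.264 handler functions.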
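
The options that each encoder exposes through its AVOption table can also be listed from the command line with ffmpeg's built-in help, which is a quick way to compare the two encoders without reading the source. A minimal sketch, with `show_encoder_options` as a hypothetical wrapper name:

```shell
#!/bin/sh
# List the private options of an encoder through ffmpeg's built-in help
# (hypothetical wrapper for illustration).
show_encoder_options() {
    name=$1
    if ! command -v ffmpeg >/dev/null 2>&1; then
        echo "ffmpeg not found in PATH" >&2
        return 127
    fi
    ffmpeg -hide_banner -h encoder="$name"
}

# Example:
# show_encoder_options h264_nvenc
# show_encoder_options hevc_nvenc
```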

