Summary of NVIDIA Hardware Acceleration in ffmpeg

Reposted article. 2017-09-01 15:03. Category: audio and video processing.

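0. Overview

FFmpeg can use NVIDIA GPUs for hardware acceleration. The high-level interface to the GPU is the Video Codec SDK, which ships a complete set of high-performance tools, source code, and documentation and runs on Windows and Linux. On the software side, the SDK exposes two hardware acceleration APIs: the NVENCODE API for encoding and the NVDECODE API (formerly the NVCUVID API) for decoding. On the hardware side, an NVIDIA GPU has one or more encoder and decoder engines (the decoder is also called the hardware acceleration engine), which are independent of the CUDA cores. In terms of video formats, encoding supports H.264, H.265, and lossless compression at 8-bit and 10-bit depth, with YUV 4:4:4 and 4:2:0 chroma sampling and resolutions up to 8K; decoding supports MPEG-2, VC1, VP8, VP9, H.264, H.265, and lossless compression at 8-bit, 10-bit, and 12-bit depth, with YUV 4:2:0 chroma sampling and resolutions up to 8K. The Video Codec SDK is already integrated into the ffmpeg project, but ffmpeg exposes relatively few of the codec configuration parameters; to take full advantage of the codecs, you still need to program against the SDK directly.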
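A figure in the original post compares the NVIDIA encoder with x264 running on the CPU. Performance is measured in frames encoded per second, and quality in PSNR.

In performance, the NVIDIA encoder is 2 to 5 times as fast as x264. In quality, it is better than x264 in the fast stream scenario and worse in the high-quality scenario. However, the comparison does not say which NVIDIA product was used, nor the CPU model and platform capabilities of the machine running x264. A second figure shows that at 1080p@30fps, NVENC can encode 21 streams, or 9 streams in high-quality mode.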
[Table: encoding capability by GPU model]

[Figure: NVIDIA decoder performance, covering only two Tesla products]

[Table: decoding capability by GPU model]

1. Installing the Driver and SDK

1.1 Preparation

Disable all open-source display drivers first. Edit the blacklist file:
vi /etc/modprobe.d/blacklist.conf
and add:
blacklist amd76x_edac
blacklist vga16fb
blacklist nouveau
blacklist nvidiafb
blacklist rivatv
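
The manual edit above can also be scripted. The following is a minimal sketch, assuming a POSIX shell; the function names add_blacklist and blacklist_all are hypothetical helpers introduced here, not part of any NVIDIA or distribution tooling. Each entry is appended only if the exact line is not already present, so the script can be run repeatedly without creating duplicates.

```shell
#!/bin/sh
# Append "blacklist <module>" to a modprobe config file unless that exact
# line is already there. Usage: add_blacklist FILE MODULE
add_blacklist() {
    # -q: quiet, -x: match the whole line, -F: fixed string, not a regex
    if ! grep -qxF "blacklist $2" "$1" 2>/dev/null; then
        printf 'blacklist %s\n' "$2" >> "$1"
    fi
}

# Blacklist every module listed in this section. Usage: blacklist_all FILE
blacklist_all() {
    for module in amd76x_edac vga16fb nouveau nvidiafb rivatv; do
        add_blacklist "$1" "$module"
    done
}
```

To apply it for real, call blacklist_all /etc/modprobe.d/blacklist.conf as root; pointing it at a scratch file first is a safe way to inspect the result.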

1.2 Driver Installation

(1). Remove the existing driver
apt-get remove --purge 'nvidia*'
(2). Download the driver .run installer from the official site and install it
service lightdm stop
chmod 777 NVIDIA-Linux-x86_64-367.44.run
./NVIDIA-Linux-x86_64-367.44.run
service lightdm start
reboot
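(3). Verify the driver installation
Run nvidia-smi. If it prints the driver version and the list of detected GPUs, the installation succeeded.
Problem 1: If the graphical desktop cannot be entered after rebooting and the login screen loops, the display driver was not installed completely and must be reinstalled. The safer method is to install it over the network.
Run the following in a console:
apt-get remove --purge 'nvidia-*'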
add-apt-repository ppa:graphics-drivers/ppa
apt-get update
service lightdm stop
apt-get install nvidia-375 nvidia-settings nvidia-prime
nvidia-xconfig
apt-get install mesa-common-dev   # install the missing libraries
apt-get install freeglut3-dev
update-initramfs -u
reboot

1.3 SDK Installation

(1). Download the CUDA .run installer from the official site and install it
sh cuda_8.0.44_linux.run --no-opengl-libs   # OpenGL support is not needed
apt-get install freeglut3-dev build-essential libx11-dev
apt-get install libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa
apt-get install libglu1-mesa-dev
gedit ~/.bashrc
and add:
export PATH=/usr/local/cuda/bin:$PATH

export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

gedit /etc/ld.so.conf.d/cuda.conf
and add:
/usr/local/cuda/lib64
/lib
/lib32
/lib64
/usr/lib
/usr/lib32
sudo ldconfig
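
After editing ~/.bashrc it is easy to forget to open a new shell or to mistype a path. The sketch below, assuming a POSIX shell, checks whether the two CUDA directories from this section are visible in the current environment. The function names path_contains and check_cuda_env are hypothetical helpers introduced for illustration.

```shell
#!/bin/sh
# Succeed if the colon-separated LIST contains DIR as a whole entry.
# Usage: path_contains LIST DIR
path_contains() {
    case ":$1:" in
        *":$2:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report whether the CUDA directories are present in PATH and
# LD_LIBRARY_PATH of the current shell.
check_cuda_env() {
    if path_contains "$PATH" /usr/local/cuda/bin; then
        echo "PATH: ok"
    else
        echo "PATH: /usr/local/cuda/bin missing"
    fi
    if path_contains "${LD_LIBRARY_PATH:-}" /usr/local/cuda/lib64; then
        echo "LD_LIBRARY_PATH: ok"
    else
        echo "LD_LIBRARY_PATH: /usr/local/cuda/lib64 missing"
    fi
}
```

Wrapping the list in colons before matching means /usr/local/cuda/bin2 is not mistaken for /usr/local/cuda/bin.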
(2). Verify the SDK installation
Run nvcc -V. If it prints the CUDA compiler version, the installation succeeded.

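2. Sample Tests

2.1 Building the Samples

Go to the Samples directory and run make. If the OpenGL libraries are not installed, NvDecodeGL will fail to build.
For what each project does, see NVIDIA_Video_Codec_SDK_Samples_Guide:
NvEncoder: basic encoding
NvEncoderCudaInterop: encoding from CUDA surfaces
NvEncoderD3D9Interop: encoding from D3D9 surfaces; not available on Linux
NvEncoderLowLatency: use of low-latency features, such as intra refresh and reference picture invalidation (RPI)
NvEncoderPerf: encoding at maximum performance
NvTranscoder: transcoding with NVENC
NvDecodeD3D9: video decoding with D3D9 display; not available on Linux
NvDecodeD3D11: video decoding with D3D11 display; not available on Linux
NvDecodeGL: video decoding with OpenGL display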

2.2 Running the Samples

See NVIDIA_Video_Codec_SDK_Samples_Guide.
Problem 2: If running a sample prints "libcuda.so failed!",
create a symbolic link named libcuda.so in /usr/lib/x86_64-linux-gnu that points to libcuda.so.375.26.
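
The fix for Problem 2 can be sketched as a small helper, assuming a POSIX shell. The function name link_libcuda is hypothetical; the version suffix 375.26 comes from the driver used in this article and will differ on other systems.

```shell
#!/bin/sh
# Create DIR/libcuda.so as a symlink to the versioned library in the same
# directory. Usage: link_libcuda DIR VERSIONED_NAME
link_libcuda() {
    if [ ! -e "$1/$2" ]; then
        echo "missing $1/$2" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing libcuda.so link.
    # The target is relative, so it resolves inside DIR.
    ln -sf "$2" "$1/libcuda.so"
}
```

On the machine described here the call would be link_libcuda /usr/lib/x86_64-linux-gnu libcuda.so.375.26, run as root.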

3. Integrating with ffmpeg

3.1 Building ffmpeg

3.1.1 Preparation

Make sure the basic header files are present in Video_Codec_SDK_7.1.9/Samples/common/inc.
Make sure libGLEW.a is present in Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64.

3.1.2 configure Command

./configure \
  --enable-version3 \
  --enable-libfdk-aac \
  --enable-libmp3lame \
  --enable-libx264 \
  --enable-nvenc \
  --extra-cflags=-I/root/workspace/Video_Codec_SDK_7.1.9/Samples/common/inc \
  --extra-ldflags=-L/root/workspace/Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 \
  --enable-shared \
  --enable-gpl \
  --enable-postproc \
  --enable-nonfree \
  --enable-avfilter \
  --enable-pthreads

3.1.3 make

Run make && make install

3.2 Testing ffmpeg

Run ffmpeg -codecs | grep nvenc
Output like the following indicates that NVENC support was built in:

ffmpeg version 3.0.git Copyright (c) 2000-2016 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.1) 20160609
  configuration: --enable-version3 --enable-libfdk-aac --enable-libmp3lame --enable-libx264 --enable-nvenc --extra-cflags=-I/workspace/Video_Codec_SDK_7.1.9/Samples/common/inc --extra-ldflags=-L/workspace/Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 --enable-shared --enable-gpl --enable-postproc --enable-nonfree --enable-avfilter --enable-pthreads
  libavutil 55. 29.100 / 55. 29.100
  libavcodec 57. 54.100 / 57. 54.100
  libavformat 57. 48.100 / 57. 48.100
  libavdevice 57. 0.102 / 57. 0.102
  libavfilter 6. 57.100 / 6. 57.100
  libswscale 4. 1.100 / 4. 1.100
  libswresample 2. 1.100 / 2. 1.100
  libpostproc 54. 0.100 / 54. 0.100
 DEV.LS h264 H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10 (encoders: libx264 libx264rgb h264_nvenc nvenc nvenc_h264 )
 DEV.L. hevc H.265 / HEVC (High Efficiency Video Coding) (encoders: nvenc_hevc hevc_nvenc )

The prefix flags have the following meanings:
D..... = Decoding supported
.E.... = Encoding supported
..V... = Video codec
..A... = Audio codec
..S... = Subtitle codec
...I.. = Intra frame-only codec
....L. = Lossy compression
.....S = Lossless compression
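
The flag field is positional, so it can be decoded mechanically. The following is a minimal sketch, assuming a POSIX shell and the six-character layout listed above; describe_codec_flags is a hypothetical helper name, not an ffmpeg tool.

```shell
#!/bin/sh
# Print one line per capability encoded in a 6-character flag field taken
# from `ffmpeg -codecs` output. Usage: describe_codec_flags FLAGS
describe_codec_flags() {
    # In a case pattern, "?" matches any single character, so each
    # pattern tests exactly one position of the field.
    case $1 in D?????) echo "decoding supported" ;; esac
    case $1 in ?E????) echo "encoding supported" ;; esac
    case $1 in
        ??V???) echo "video codec" ;;
        ??A???) echo "audio codec" ;;
        ??S???) echo "subtitle codec" ;;
    esac
    case $1 in ???I??) echo "intra frame-only codec" ;; esac
    case $1 in ????L?) echo "lossy compression" ;; esac
    case $1 in ?????S) echo "lossless compression" ;; esac
    return 0
}
```

For the h264 line shown earlier, describe_codec_flags DEV.LS reports decoding, encoding, video, lossy, and lossless support; the hevc line (DEV.L.) lacks only the lossless entry.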

3.3 Using the Codecs

H.265 encoding tests
(1). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec hevc_nvenc -r 60 -y 2_60.265
(2). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec hevc_nvenc -r 30 -y 2_30.265

H.264 encoding tests
(3). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec h264_nvenc -r 60 -y 2_60.264
(4). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec h264_nvenc -r 30 -y 2_30.264

H.264 to H.265 transcoding
(5). ffmpeg -i 1_60.264 -vcodec hevc_nvenc -r 60 -y 2_60_264to265.265
(6). ffmpeg -i 1_30.264 -vcodec hevc_nvenc -r 30 -y 2_30_264to265.265

H.265 to H.264 transcoding
(7). ffmpeg -i 1_60.265 -vcodec h264_nvenc -r 60 -y 2_60_265to264.264
(8). ffmpeg -i 1_30.265 -vcodec h264_nvenc -r 30 -y 2_30_265to264.264
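
Commands (1) to (4) differ only in the encoder, the frame rate, and the output extension, so they can be generated rather than typed. The sketch below, assuming a POSIX shell, only prints the commands; nvenc_encode_cmd and nvenc_encode_matrix are hypothetical helper names, and the input file name is the one used in this article.

```shell
#!/bin/sh
# Print (without running) the raw-YUV encode command used in this section.
# Usage: nvenc_encode_cmd ENCODER FPS EXT
nvenc_encode_cmd() {
    printf 'ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec %s -r %s -y 2_%s.%s\n' \
        "$1" "$2" "$2" "$3"
}

# Print the four encode commands of tests (1) to (4), in that order.
nvenc_encode_matrix() {
    for pair in hevc_nvenc:265 h264_nvenc:264; do
        for fps in 60 30; do
            # ${pair%%:*} is the encoder name, ${pair##*:} the file extension
            nvenc_encode_cmd "${pair%%:*}" "$fps" "${pair##*:}"
        done
    done
}
```

Printing first makes it possible to review the commands; piping the output to sh (nvenc_encode_matrix | sh) would then execute them on a machine with the NVENC-enabled build.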

3.4 Using the Encoders in Program Code

avcodec_find_encoder_by_name("h264_nvenc");
avcodec_find_encoder_by_name("hevc_nvenc");

4. Auxiliary Tools

watch -n 1 nvidia-smi
Shows GPU resource usage, refreshed every second.

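5. Measured Results

5.1 Hardware Specifications

I tested with a GeForce GTX 1070 and a Tesla P4; both are Pascal-architecture GPUs.
(1). Hardware information for the GTX 1070 (as reported by deviceQuery):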

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1070"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8110 MBytes (8504279040 bytes)
  (15) Multiprocessors, (128) CUDA Cores/MP:     1920 CUDA Cores
  GPU Max Clock rate:                            1683 MHz (1.68 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1070
Result = PASS

(2). Hardware information for the P4:

CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla P4"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 7606 MBytes (7975862272 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1114 MHz (1.11 GHz)
  Memory Clock rate:                             3003 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla P4
Result = PASS

5.2 Experimental Results

(1). GTX1070
| | HEVC encoding | H.264 encoding | H.264 to H.265 | H.265 to H.264 |
| 60fps | 387fps(6.45x) | 430fps(7.17x) | 348fps(5.79x) | 170fps(2.84x) |
| 30fps | 345fps(11.5x) | 429fps(14.3x) | 318fps(10.6x) | 94fps(3.13x) |
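
The factor in parentheses appears to be the measured rate divided by the real-time frame rate of the row (for example 387 / 60 = 6.45). That reading is an assumption, but it reproduces the table. A minimal sketch of the arithmetic, using awk for the floating-point division; speedup is a hypothetical helper name:

```shell
#!/bin/sh
# Speed-up over real time = measured rate / playback frame rate,
# rounded to two decimals. Usage: speedup MEASURED_FPS TARGET_FPS
speedup() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.2f\n", m / t }'
}
```

Two cells differ from this calculation in the last digit (348 / 60 gives 5.80 against 5.79 in the table, and 170 / 60 gives 2.83 against 2.84), presumably because the table was computed from frame rates before they were rounded to whole numbers.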
(2). P4

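5.3 Analysis

In hardware terms, the P4 has slightly less memory and a lower clock rate than the GTX 1070, but 33% more CUDA cores. Even so, apart from H.265-to-H.264 transcoding, where the two are level, the P4 performed worse than the GTX 1070 in every test. This is consistent with the official statement that the codec engines are independent of the CUDA cores.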

6. Source Code Analysis

A video codec integrated into the ffmpeg framework must define an AVCodec structure, which includes a private AVClass structure, three functions, and other fields.

6.1 H.264

(1). Structures (nvenc_h264.c)

static const AVClass h264_nvenc_class = {
    .class_name = "h264_nvenc",
    .item_name  = av_default_item_name,
    .option     = options,                  // the encoder options are in this AVOption array
    .version    = LIBAVUTIL_VERSION_INT,
};

AVCodec ff_h264_nvenc_encoder = {
    .name           = "h264_nvenc",
    .long_name      = NULL_IF_CONFIG_SMALL("NVIDIA NVENC H.264 encoder"),
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_H264,
    .init           = ff_nvenc_encode_init,   // initialization function
    .encode2        = ff_nvenc_encode_frame,  // encoding function
    .close          = ff_nvenc_encode_close,  // close function
    .priv_data_size = sizeof(NvencContext),   // internal data structure, see nvenc.h
    .priv_class     = &h264_nvenc_class,      // private class structure
    .defaults       = defaults,
    .capabilities   = AV_CODEC_CAP_DELAY,
    .caps_internal  = FF_CODEC_CAP_INIT_CLEANUP,
    .pix_fmts       = ff_nvenc_pix_fmts,
};

Note that there are two more AVCodec definitions, named nvenc and nvenc_h264; their three functions are the same as those of h264_nvenc.
(2). Handler functions (nvenc.c)

av_cold int ff_nvenc_encode_init(AVCodecContext *avctx)
{
    NvencContext *ctx = avctx->priv_data;   // fetch the private context
    ...
    // The following nvenc helper functions are then called:
    //   nvenc_load_libraries
    //   nvenc_setup_device
    //   nvenc_setup_encoder
    //   nvenc_setup_surfaces
    //   nvenc_setup_extradata
    ...
}

int ff_nvenc_encode_frame(AVCodecContext *avctx, AVPacket *pkt,
                          const AVFrame *frame, int *got_packet)
{
    ...
    if (frame) {
        inSurf = get_free_frame(ctx);                     // take a free input surface for the incoming frame
        ...
        res = nvenc_upload_frame(avctx, frame, inSurf);   // upload the frame for encoding
        ...
    }
}

av_cold int ff_nvenc_encode_close(AVCodecContext *avctx)
{
    ...   // free and destroy resources
}

6.2 H.265

(1). Structures (nvenc_hevc.c)

static const AVClass hevc_nvenc_class = {
    .class_name = "hevc_nvenc",
    .item_name  = av_default_item_name,
    .option     = options,                  // the encoder options are in this AVOption array
    .version    = LIBAVUTIL_VERSION_INT,
};

AVCodec ff_hevc_nvenc_encoder = {
    .name           = "hevc_nvenc",
    .long_name      = NULL_IF_CONFIG_SMALL("NVIDIA NVENC hevc encoder"),
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_HEVC,
    .init           = ff_nvenc_encode_init,   // initialization function
    .encode2        = ff_nvenc_encode_frame,  // encoding function
    .close          = ff_nvenc_encode_close,  // close function
    .priv_data_size = sizeof(NvencContext),   // internal data structure, see nvenc.h
    .priv_class     = &hevc_nvenc_class,      // private class structure
    .defaults       = defaults,
    .pix_fmts       = ff_nvenc_pix_fmts,
    .capabilities   = AV_CODEC_CAP_DELAY,
    .caps_internal  = FF_CODEC_CAP_INIT_CLEANUP,
};

Note that there is one more AVCodec definition, named nvenc_hevc; its three functions are the same as those of h264_nvenc.
(2). Handler functions (nvenc.c)
Same as the H.264 handler functions.
