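0. Overview
FFmpeg can use NVIDIA GPUs for hardware acceleration; the high-level interface to the GPU is the Video Codec SDK. The SDK ships a complete set of high-performance tools, sample code, and documentation, and runs on both Windows and Linux. On the software side, the SDK exposes two hardware-acceleration APIs: the NVENCODE API for encoding and the NVDECODE API for decoding (formerly called the NVCUVID API). On the hardware side, an NVIDIA GPU contains one or more hardware encoders and decoders (also called hardware acceleration engines), which are independent of the CUDA cores. In terms of video formats, encoding supports H.264, H.265, and lossless compression at 8-bit and 10-bit depth, with YUV 4:4:4 and 4:2:0 chroma formats, at resolutions up to 8K; decoding supports MPEG-2, VC1, VP8, VP9, H.264, H.265, and lossless compression at 8-bit, 10-bit, and 12-bit depth, with YUV 4:2:0, at resolutions up to 8K. The Video Codec SDK is already integrated into FFmpeg, but FFmpeg exposes only a limited set of codec parameters; to take full advantage of the hardware codecs you still need to program against the SDK directly.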
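The figures below compare the NVIDIA encoder with x264 running on a CPU, using frames encoded per second as the performance metric and PSNR as the quality metric.
In performance, the NVIDIA encoder is 2 to 5 times as fast as x264. In quality, it beats x264 in the fast stream scenario and falls below x264 in the high-quality scenario. The comparison does not say which NVIDIA product was measured, nor which CPU model and platform x264 ran on. The next figure shows that at 1080p@30fps NVENC can encode 21 streams, or 9 streams in high-quality mode.
The encoding capabilities of the different GPU models are tabulated below:
The NVIDIA decoder performance figures are shown below, although only two Tesla products are covered.
The decoding capabilities are tabulated below: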
1. Installing the Driver and SDK
1.1 Preparation
All open-source display drivers must be disabled first:
vi /etc/modprobe.d/blacklist.conf
Add:
blacklist amd76x_edac
blacklist vga16fb
blacklist nouveau
blacklist nvidiafb
blacklist rivatv
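Editing the file by hand works, but when the step has to be repeated on several machines the same lines can be appended from a script. The following is a minimal sketch, assuming a POSIX shell; the helper name `add_blacklist` is hypothetical, and the demo writes to a temporary file rather than `/etc/modprobe.d/blacklist.conf` so it can be tried without root.

```shell
#!/bin/sh
# Hypothetical helper: append "blacklist <module>" to a modprobe config
# file only if that exact line is not already present.
add_blacklist() {
    conf_file=$1
    module=$2
    # -q: quiet, -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF "blacklist $module" "$conf_file" 2>/dev/null; then
        printf 'blacklist %s\n' "$module" >> "$conf_file"
    fi
}

# Demo on a scratch file; on a real system the target would be
# /etc/modprobe.d/blacklist.conf and the script would need root.
demo_conf=$(mktemp)
for m in amd76x_edac vga16fb nouveau nvidiafb rivatv; do
    add_blacklist "$demo_conf" "$m"
done
add_blacklist "$demo_conf" nouveau   # second call for the same module is a no-op
cat "$demo_conf"
```

Because the helper checks for an existing line first, running it again does not duplicate entries.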
1.2 Driver installation
(1). Remove the existing driver
apt-get remove --purge 'nvidia*'
(2). Download the .run driver installer from the official site and install it
service lightdm stop
chmod 777 NVIDIA-Linux-x86_64-367.44.run
./NVIDIA-Linux-x86_64-367.44.run
service lightdm start
reboot
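(3). Verify the driver installation
Run nvidia-smi; if it prints output like the following, the installation succeeded.
Problem 1: If the graphical interface cannot be entered after rebooting and the login screen loops, the display driver was not installed completely and must be reinstalled. The safer approach is to install it over the network.
Run the following in a console:
apt-get remove --purge 'nvidia-*'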
add-apt-repository ppa:graphics-drivers/ppa
apt-get update
service lightdm stop
apt-get install nvidia-375 nvidia-settings nvidia-prime
nvidia-xconfig
apt-get install mesa-common-dev   # install the missing libraries
apt-get install freeglut3-dev
update-initramfs -u
reboot
1.3 SDK installation
(1). Download the CUDA .run installer from the official site and install it
sh cuda_8.0.44_linux.run --no-opengl-libs   # OpenGL libraries are not needed
apt-get install freeglut3-dev build-essential libx11-dev
apt-get install libxmu-dev libxi-dev libgl1-mesa-glx libglu1-mesa
apt-get install libglu1-mesa-dev
gedit ~/.bashrc
Add:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
gedit /etc/ld.so.conf.d/cuda.conf
Add:
/usr/local/cuda/lib64
/lib
/lib32
/lib64
/usr/lib
/usr/lib32
sudo ldconfig
(2). Verify the SDK installation
Run nvcc -V; if it prints output like the following, the installation succeeded.
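2. Sample Tests
2.1 Building the samples
Enter the Samples directory and run make. If the OpenGL libraries are not installed, NvDecodeGL will fail to build.
The purpose of each project is described in "NVIDIA_Video_Codec_SDK_Samples_Guide":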
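NvEncoder: basic encoding
NvEncoderCudaInterop: encoding from CUDA surfaces
NvEncoderD3D9Interop: encoding from D3D9 surfaces (not available on Linux)
NvEncoderLowLatency: low-latency features such as intra refresh and reference picture invalidation (RPI)
NvEncoderPerf: encoding at maximum performance
NvTranscoder: transcoding with NVENC
NvDecodeD3D9: video decoding with D3D9 display (not available on Linux)
NvDecodeD3D11: video decoding with D3D11 display (not available on Linux)
NvDecodeGL: video decoding with OpenGL display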
2.2 Running the samples
See "NVIDIA_Video_Codec_SDK_Samples_Guide".
Problem 2: If running a sample prints "libcuda.so failed!"
Create a symbolic link named libcuda.so in /usr/lib/x86_64-linux-gnu that points to libcuda.so.375.26.
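The fix for Problem 2 amounts to a single `ln -s`. Below is a minimal sketch, assuming the versioned library sits in the same directory as the link. The helper name `link_libcuda` is made up for illustration, the version suffix 375.26 is the one from this setup and will differ with other driver versions, and the demo runs in a temporary directory instead of `/usr/lib/x86_64-linux-gnu`.

```shell
#!/bin/sh
# Hypothetical helper: create libcuda.so -> libcuda.so.<version> in a directory.
link_libcuda() {
    lib_dir=$1
    version=$2
    target="libcuda.so.$version"
    if [ ! -e "$lib_dir/$target" ]; then
        echo "missing $lib_dir/$target" >&2
        return 1
    fi
    # -s: symbolic link, -f: replace an existing link of the same name.
    # The target is relative, so the link resolves inside lib_dir.
    ln -sf "$target" "$lib_dir/libcuda.so"
}

# Demo in a scratch directory with an empty stand-in for the real library.
demo_dir=$(mktemp -d)
: > "$demo_dir/libcuda.so.375.26"
link_libcuda "$demo_dir" 375.26
ls -l "$demo_dir"
```

On the real system the equivalent call would be `link_libcuda /usr/lib/x86_64-linux-gnu 375.26`, run as root.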
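3. FFmpeg Integration
3.1 Building ffmpeg
3.1.1 Preparation
Make sure the Video_Codec_SDK_7.1.9/Samples/common/inc directory contains the required header files.
Make sure the Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 directory contains libGLEW.a.
3.1.2 The configure command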
./configure \
  --enable-version3 \
  --enable-libfdk-aac \
  --enable-libmp3lame \
  --enable-libx264 \
  --enable-nvenc \
  --extra-cflags=-I/root/workspace/Video_Codec_SDK_7.1.9/Samples/common/inc \
  --extra-ldflags=-L/root/workspace/Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 \
  --enable-shared \
  --enable-gpl \
  --enable-postproc \
  --enable-nonfree \
  --enable-avfilter \
  --enable-pthreads
3.1.3 make
Run make && make install
3.2 Testing ffmpeg
Run ffmpeg -codecs | grep nvenc
Output like the following indicates that the NVENC encoders are available:
ffmpeg version 3.0.git Copyright (c) 2000-2016 the FFmpeg developers
  built with gcc 5.4.0 (Ubuntu 5.4.0-6ubuntu1~16.04.1) 20160609
  configuration: --enable-version3 --enable-libfdk-aac --enable-libmp3lame --enable-libx264 --enable-nvenc --extra-cflags=-I/workspace/Video_Codec_SDK_7.1.9/Samples/common/inc --extra-ldflags=-L/workspace/Video_Codec_SDK_7.1.9/Samples/common/lib/linux/x86_64 --enable-shared --enable-gpl --enable-postproc --enable-nonfree --enable-avfilter --enable-pthreads
  libavutil      55. 29.100 / 55. 29.100
  libavcodec     57. 54.100 / 57. 54.100
  libavformat    57. 48.100 / 57. 48.100
  libavdevice    57.  0.102 / 57.  0.102
  libavfilter     6. 57.100 /  6. 57.100
  libswscale      4.  1.100 /  4.  1.100
  libswresample   2.  1.100 /  2.  1.100
  libpostproc    54.  0.100 / 54.  0.100
 DEV.LS h264 H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10 (encoders: libx264 libx264rgb h264_nvenc nvenc nvenc_h264 )
 DEV.L. hevc H.265 / HEVC (High Efficiency Video Coding) (encoders: nvenc_hevc hevc_nvenc )
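When this check is scripted, it can be handy to pull just the encoder names out of the codec lines. The sketch below is one possible way to do that with `sed`, `tr`, and `grep`; the function name `list_nvenc_encoders` is hypothetical, and the demo feeds it the two codec lines from the output above rather than calling ffmpeg.

```shell
#!/bin/sh
# Hypothetical helper: read `ffmpeg -codecs` lines on stdin and print every
# name listed inside "(encoders: ...)" that contains "nvenc", one per line.
list_nvenc_encoders() {
    sed -n 's/.*(encoders:\([^)]*\)).*/\1/p' | tr ' ' '\n' | grep nvenc
}

# Demo on the two codec lines from the output above instead of a live ffmpeg call.
sample='DEV.LS h264 H.264 / AVC / MPEG-4 AVC / MPEG-4 part 10 (encoders: libx264 libx264rgb h264_nvenc nvenc nvenc_h264 )
DEV.L. hevc H.265 / HEVC (High Efficiency Video Coding) (encoders: nvenc_hevc hevc_nvenc )'
printf '%s\n' "$sample" | list_nvenc_encoders
```

Against a real build, the input would come from `ffmpeg -hide_banner -codecs 2>/dev/null | list_nvenc_encoders`.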
The prefix flags mean:
D..... = Decoding supported
.E.... = Encoding supported
..V... = Video codec
..A... = Audio codec
..S... = Subtitle codec
...I.. = Intra frame-only codec
....L. = Lossy compression
.....S = Lossless compression
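The six flag positions can be decoded mechanically. The following is a small sketch that follows the legend above and nothing more; the function name `describe_codec_flags` and the output words are chosen here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: expand a six-character capability string from
# `ffmpeg -codecs` (for example "DEV.LS") into words, per the legend above.
describe_codec_flags() {
    flags=$1
    out=""
    # Position 1: decoding, position 2: encoding
    case $flags in D?????) out="$out decode" ;; esac
    case $flags in ?E????) out="$out encode" ;; esac
    # Position 3: media type
    case $flags in
        ??V???) out="$out video" ;;
        ??A???) out="$out audio" ;;
        ??S???) out="$out subtitle" ;;
    esac
    # Position 4: intra-only; position 5: lossy; position 6: lossless
    case $flags in ???I??) out="$out intra-only" ;; esac
    case $flags in ????L?) out="$out lossy" ;; esac
    case $flags in ?????S) out="$out lossless" ;; esac
    # Strip the leading space added by the first append.
    echo "${out# }"
}

describe_codec_flags DEV.LS   # the h264 line from the output above
describe_codec_flags DEV.L.   # the hevc line from the output above
```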
3.3 Using the codecs
H.265 encoding tests
(1). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec hevc_nvenc -r 60 -y 2_60.265
(2). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec hevc_nvenc -r 30 -y 2_30.265
H.264 encoding tests
(3). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec h264_nvenc -r 60 -y 2_60.264
(4). ffmpeg -s 1920x1080 -pix_fmt yuv420p -i BQTerrace_1920x1080_60.yuv -vcodec h264_nvenc -r 30 -y 2_30.264
H.264 to H.265 transcoding
(5). ffmpeg -i 1_60.264 -vcodec hevc_nvenc -r 60 -y 2_60_264to265.265
(6). ffmpeg -i 1_30.264 -vcodec hevc_nvenc -r 30 -y 2_30_264to265.265
H.265 to H.264 transcoding
(7). ffmpeg -i 1_60.265 -vcodec h264_nvenc -r 60 -y 2_60_265to264.264
(8). ffmpeg -i 1_30.265 -vcodec h264_nvenc -r 30 -y 2_30_265to264.264
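Raw .yuv files carry no header, which is why commands (1) to (4) pass -s and -pix_fmt explicitly. The same two values also determine how many bytes one frame occupies, so the frame count of a raw file can be checked from its size. A minimal sketch, assuming 8-bit yuv420p with even width and height; the function names are made up, and the file size in the demo is a hypothetical value chosen to be exactly 10 frames.

```shell
#!/bin/sh
# Bytes per frame for 8-bit yuv420p: one full-resolution luma plane plus
# two chroma planes subsampled by 2 in each direction, i.e. w*h*3/2.
yuv420p_frame_bytes() {
    width=$1
    height=$2
    echo $(( width * height * 3 / 2 ))
}

# Number of whole frames in a raw file of the given size in bytes.
yuv420p_frame_count() {
    file_bytes=$1
    frame_bytes=$(yuv420p_frame_bytes "$2" "$3")
    echo $(( file_bytes / frame_bytes ))
}

yuv420p_frame_bytes 1920 1080
# Hypothetical file size, chosen here to be exactly 10 frames.
yuv420p_frame_count 31104000 1920 1080
```

For a real file the size can be obtained with `wc -c < BQTerrace_1920x1080_60.yuv`.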
3.4 Using the encoders in a program
avcodec_find_encoder_by_name("h264_nvenc");
avcodec_find_encoder_by_name("hevc_nvenc");
4. Auxiliary Tools
watch -n 1 nvidia-smi
Shows GPU resource usage, refreshed every second.
5. Measured Results
5.1 Hardware
I tested with a GeForce GTX 1070 and a Tesla P4; both are Pascal-architecture GPUs.
(1). GTX 1070 hardware information (as reported by deviceQuery):
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1070"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8110 MBytes (8504279040 bytes)
  (15) Multiprocessors, (128) CUDA Cores/MP:     1920 CUDA Cores
  GPU Max Clock rate:                            1683 MHz (1.68 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1070
Result = PASS
(2). Tesla P4 hardware information:
CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla P4"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 7606 MBytes (7975862272 bytes)
  (20) Multiprocessors, (128) CUDA Cores/MP:     2560 CUDA Cores
  GPU Max Clock rate:                            1114 MHz (1.11 GHz)
  Memory Clock rate:                             3003 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 5 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla P4
Result = PASS
5.2 Results
(1). GTX1070
| Frame rate (-r) | HEVC encode | H.264 encode | H.264 to H.265 | H.265 to H.264 |
| --- | --- | --- | --- | --- |
| 60fps | 387fps(6.45x) | 430fps(7.17x) | 348fps(5.79x) | 170fps(2.84x) |
| 30fps | 345fps(11.5x) | 429fps(14.3x) | 318fps(10.6x) | 94fps(3.13x) |
(2). P4
| Frame rate (-r) | HEVC encode | H.264 encode | H.264 to H.265 | H.265 to H.264 |
| --- | --- | --- | --- | --- |
| 60fps | 235fps(3.91x) | 334fps(5.57x) | 217fps(3.63x) | 171fps(2.85x) |
| 30fps | 212fps(7.07x) | 322fps(10.7x) | 198fps(6.59x) | 94fps(3.14x) |
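The multiplier in parentheses appears to be the measured throughput divided by the frame rate in the first column: 387 fps at 60 fps gives 6.45x. This reading is an inference from the numbers, not something stated with the measurements, and a few entries differ in the last digit, presumably because the fps figures shown are rounded. A minimal sketch of the arithmetic, with a hypothetical helper named `speedup`:

```shell
#!/bin/sh
# Hypothetical helper: throughput (fps) divided by the nominal frame rate,
# printed with two decimals.
speedup() {
    awk -v fps="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", fps / rate }'
}

speedup 387 60   # GTX 1070, HEVC encode, 60 fps row
speedup 345 30   # GTX 1070, HEVC encode, 30 fps row
speedup 334 60   # P4, H.264 encode, 60 fps row
```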
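5.3 Analysis
On paper, the P4 has slightly less memory than the GTX 1070 (7606 MB vs 8110 MB), a lower maximum clock (1114 MHz vs 1683 MHz), and 33% more CUDA cores (2560 vs 1920). In the measurements, however, the P4 trails the GTX 1070 in every test except H.265 to H.264 transcoding, where the two are level. This is consistent with NVIDIA's statement that the hardware codecs are independent of the CUDA cores.
6. Source Code Analysis
A video codec integrated into the FFmpeg framework must define an AVCodec structure, which includes a private AVClass structure, three handler functions (init, encode, close), and so on.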
6.1 H.264
(1). Structures (nvenc_h264.c)
static const AVClass h264_nvenc_class = {
    .class_name = "h264_nvenc",
    .item_name  = av_default_item_name,
    .option     = options,                  // encoder options live in this AVOption array
    .version    = LIBAVUTIL_VERSION_INT,
};

AVCodec ff_h264_nvenc_encoder = {
    .name           = "h264_nvenc",
    .long_name      = NULL_IF_CONFIG_SMALL("NVIDIA NVENC H.264 encoder"),
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_H264,
    .init           = ff_nvenc_encode_init,   // initialization function
    .encode2        = ff_nvenc_encode_frame,  // encoding function
    .close          = ff_nvenc_encode_close,  // close function
    .priv_data_size = sizeof(NvencContext),   // internal data structure, see nvenc.h
    .priv_class     = &h264_nvenc_class,      // private class structure
    .defaults       = defaults,
    .capabilities   = AV_CODEC_CAP_DELAY,
    .caps_internal  = FF_CODEC_CAP_INIT_CLEANUP,
    .pix_fmts       = ff_nvenc_pix_fmts,
};
Note that there are two more AVCodec definitions, named nvenc and nvenc_h264; they use the same three functions as h264_nvenc.
(2). Handler functions (nvenc.c)
av_cold int ff_nvenc_encode_init(AVCodecContext *avctx)
{
    NvencContext *ctx = avctx->priv_data;   // fetch the encoder's private context
    ...
    // the steps below wrap the NVENC API
    nvenc_load_libraries
    nvenc_setup_device
    nvenc_setup_encoder
    nvenc_setup_surfaces
    nvenc_setup_extradata
    ...
}

int ff_nvenc_encode_frame(AVCodecContext *avctx, AVPacket *pkt,
                          const AVFrame *frame, int *got_packet)
{
    ...
    if (frame) {
        inSurf = get_free_frame(ctx);                     // take a free input surface for the incoming frame
        ...
        res = nvenc_upload_frame(avctx, frame, inSurf);   // upload the frame to that surface for encoding
        ...
    }
}

av_cold int ff_nvenc_encode_close(AVCodecContext *avctx)
{
    ...
    // free and destroy resources
}
6.2 H.265
(1). Structures (nvenc_hevc.c)
static const AVClass hevc_nvenc_class = {
    .class_name = "hevc_nvenc",
    .item_name  = av_default_item_name,
    .option     = options,                  // encoder options live in this AVOption array
    .version    = LIBAVUTIL_VERSION_INT,
};

AVCodec ff_hevc_nvenc_encoder = {
    .name           = "hevc_nvenc",
    .long_name      = NULL_IF_CONFIG_SMALL("NVIDIA NVENC hevc encoder"),
    .type           = AVMEDIA_TYPE_VIDEO,
    .id             = AV_CODEC_ID_HEVC,
    .init           = ff_nvenc_encode_init,   // initialization function
    .encode2        = ff_nvenc_encode_frame,  // encoding function
    .close          = ff_nvenc_encode_close,  // close function
    .priv_data_size = sizeof(NvencContext),   // internal data structure, see nvenc.h
    .priv_class     = &hevc_nvenc_class,      // private class structure
    .defaults       = defaults,
    .pix_fmts       = ff_nvenc_pix_fmts,
    .capabilities   = AV_CODEC_CAP_DELAY,
    .caps_internal  = FF_CODEC_CAP_INIT_CLEANUP,
};
Note that there is one more AVCodec, named nvenc_hevc; it uses the same three functions as h264_nvenc.
(2). Handler functions (nvenc.c)
Same as the H.264 handler functions.