Background
The pixel format output by GPU decoding is NV12, and converting NV12 to BGR24 takes about 4 times as long as converting YUV420 to BGR24, so we use scale_npp to convert the pixel format to YUV420 on the GPU before downloading.
We also need an fps filter to set the frame rate.
We do this with the FFmpeg API; the equivalent command line is:
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i ~/video/test.mp4 -vf "fps=15,scale_npp=format=yuv420p,hwdownload,format=yuv420p" -f null /dev/null
Error symptoms
The log below is printed first when the error occurs. It looks as if some decoder resource was exhausted, causing send packet to fail; FFmpeg then keeps re-initializing internally until GPU memory is used up.
2021-06-09 12:14:42,473 FATAL 140468490848000 xxxx.cpp ffmpeg_log_callback No decoder surfaces left
After running for a while, the log reports the errors below; nvidia-smi showed the GPU memory fully occupied:
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback decoder->cvdl->cuvidCreateDecoder(&decoder->decoder, params) failed
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback Failed setup for format cuda: hwaccel initialisation returned error.
2021-06-09 12:51:30,353 NOTICE 140464455923456 xxxx.cpp get_hw_format Failed to get HW surface format.
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback decode_slice_header error
Cause
Testing showed that fps=12.5 only works when placed after scale_npp; placing it first causes the GPU memory problem. A possible explanation: both decoding and NPP scaling work on GPU memory, and when the frame-rate filter sits before the NPP scaler, the frames it drops never actually release their GPU memory.
fps, as a filter, needs to be inserted in a filtergraph. It offers five rounding modes that affect which source frames are dropped or duplicated in order to achieve the target framerate.
Update 2021-06-23
The analysis above turned out to be wrong. With the fps filter placed after the NPP scaler, a 100-stream concurrency test still showed a memory leak that eventually triggered an OOM.
The root cause was finally identified as incorrect use of av_buffersink_get_frame: it must be called in a loop until the return value is EAGAIN or an error. Before the fps filter was added, each av_buffersrc_add_frame_flags call corresponded to essentially one av_buffersink_get_frame call, so the bug never surfaced.
After the fps filter was added, the missing loop left frames stranded inside the filtergraph; their resources were never released, so av_buffer_pool_get eventually failed and the decoder reported No decoder surfaces left.
The code below follows ffmpeg/doc/examples/filtering_video.c (initialization code omitted):
/* read all packets */
while (1) {
    if ((ret = av_read_frame(fmt_ctx, &packet)) < 0)
        break;

    if (packet.stream_index == video_stream_index) {
        ret = avcodec_send_packet(dec_ctx, &packet);
        if (ret < 0) {
            av_log(NULL, AV_LOG_ERROR, "Error while sending a packet to the decoder\n");
            break;
        }

        while (ret >= 0) {
            ret = avcodec_receive_frame(dec_ctx, frame);
            if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF) {
                break;
            } else if (ret < 0) {
                av_log(NULL, AV_LOG_ERROR, "Error while receiving a frame from the decoder\n");
                goto end;
            }

            frame->pts = frame->best_effort_timestamp;

            /* push the decoded frame into the filtergraph */
            if (av_buffersrc_add_frame_flags(buffersrc_ctx, frame, AV_BUFFERSRC_FLAG_KEEP_REF) < 0) {
                av_log(NULL, AV_LOG_ERROR, "Error while feeding the filtergraph\n");
                break;
            }

            /* pull filtered frames from the filtergraph */
            while (1) {
                ret = av_buffersink_get_frame(buffersink_ctx, filt_frame);
                if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
                    break;
                if (ret < 0)
                    goto end;
                display_frame(filt_frame, buffersink_ctx->inputs[0]->time_base);
                av_frame_unref(filt_frame);
            }
            av_frame_unref(frame);
        }
    }
    av_packet_unref(&packet);
}
Solution
First (incorrect) attempt
In init_filters, change the filters_descr string passed to avfilter_graph_parse_ptr from
fps=12.5,scale_npp=format=yuv420p,hwdownload,format=yuv420p
to
scale_npp=format=yuv420p,hwdownload,format=yuv420p,fps=12.5
Note: with fps moved to the end of filters_descr, the frame-rate filter now runs after scaling and download, which may cost some efficiency.
Second fix
Following the example code, call avcodec_receive_frame and av_buffersink_get_frame in loops driven by their return values, so that internally buffered frames are drained.
Troubleshooting steps
Reproducing the problem
After repeated tests: start three processes and use postman to send each process 25 concurrent rtmp streams in bulk; the problem reproduces within 3 to 5 minutes.
Narrowing down the error
1. Collected and summarized the log errors; the first abnormal message is No decoder surfaces left, which should not appear in normal operation.
2. Added debug logging.
3. Temporarily replaced the ffmpeg filter code with a direct av_hwframe_transfer_data call to copy the decoded frames back to system memory; the problem did not occur.
4. Restored the ffmpeg filter for pixel format conversion; the problem reproduced.
5. Removed the fps part from filters_descr and tested; the result was normal, so the error is related to the fps filter.
6. Tried a new fps filtering approach, and also moved fps= to the end of filters_descr; both tested normal. Combined with the earlier results, it looked as if inserting the fps filter before scale_npp drops frames when reducing the frame rate without correctly releasing their GPU memory.
TODO: fix the GPU memory leak when fps= precedes scale_npp; this requires digging into the FFmpeg fps filter code.
Other notes:
With one stream, the decoding process uses 205MB of GPU memory.
With 75 concurrent streams, each of the three GPUs uses 5128MB (25 streams per GPU at 205MB each is about 5125MB, consistent with the observation).
Second round of analysis
Because moving the fps filter after the scaler in the first fix still produced a memory problem, and the root cause had not been found, the investigation continued.
Added logging to libavutil/buffer.c, libavcodec/nvdec.c, libavcodec/nvdec_h264.c and other sources.
Repeated tests showed the failure comes from the check in nvdec_decoder_frame_alloc: if (pool->nb_allocated >= pool->dpb_size) return NULL;
Why would nb_allocated reach dpb_size?
The logs show that nvdec_decoder_frame_alloc is called too many times; after the failure a new NVDECFramePool *pool is created, but each newly printed pool address soon hits nb_allocated >= dpb_size again. A normally running decoder thread, by contrast, allocates only 3 times, ending with nb_allocated = 3. (In the actual 75-stream test, some threads did decode normally.)
What causes the difference?
Comparing against ffmpeg/doc/examples/filtering_video.c and other demo sources showed that avcodec_receive_frame and av_buffersink_get_frame were not used as intended, and the memory problem only appeared with the fps filter. Changing the get-frame calls to run inside while loops fixed the memory problem in testing.
[ffmpeg]$ git status libav*
On branch master
Changes not staged for commit:
modified: libavcodec/decode.c
modified: libavcodec/h264_slice.c
modified: libavcodec/h264dec.c
modified: libavcodec/nvdec.c
modified: libavcodec/nvdec_h264.c
modified: libavutil/buffer.c
modified: libavutil/mem.c
Functions involved:
static int decode_simple_internal(AVCodecContext *avctx, AVFrame *frame)
static AVBufferRef *nvdec_decoder_frame_alloc(void *opaque, int size) (important)
int ff_nvdec_decode_init(AVCodecContext *avctx) (important)
pool->dpb_size = frames_ctx->initial_pool_size; // dpb_size starts at 10
ctx->decoder_pool = av_buffer_pool_init2(sizeof(int), pool, nvdec_decoder_frame_alloc, av_free); // sets up the decoder pool, with nvdec_decoder_frame_alloc as the allocator
ff_nvdec_start_frame
nvdec_h264_start_frame
av_buffer_create
AVBufferRef *av_buffer_pool_get(AVBufferPool *pool)
The fps pts question
With the frame-rate filter set during decoding (fps=12.5), the pts of the filtered frames increments by 1 per frame, whereas the decoder frames' pts previously advanced in 40ms steps.
Without fps=xxx, the pts output by the npp scale pixel conversion also advances in 40ms steps.
Reference information
AVBufferPool is an API for a lock-free thread-safe pool of AVBuffers.
Frequently allocating and freeing large buffers may be slow. AVBufferPool is meant to solve this in cases when the caller needs a set of buffers of the same size (the most obvious use case being buffers for raw video or audio frames).
At the beginning, the user must call av_buffer_pool_init() to create the buffer pool. Then whenever a buffer is needed, call av_buffer_pool_get() to get a reference to a new buffer, similar to av_buffer_alloc(). This new reference works in all aspects the same way as the one created by av_buffer_alloc(). However, when the last reference to this buffer is unreferenced, it is returned to the pool instead of being freed and will be reused for subsequent av_buffer_pool_get() calls.
When the caller is done with the pool and no longer needs to allocate any new buffers, av_buffer_pool_uninit() must be called to mark the pool as freeable. Once all the buffers are released, it will automatically be freed.
Allocating and releasing buffers with this API is thread-safe as long as either the default alloc callback is used, or the user-supplied one is thread-safe.
How do I reduce frames with blending in ffmpeg
Using ffmpeg to change framerate