Background
The pixel format output by GPU decoding is NV12, and converting NV12 to BGR24 takes about 4 times as long as converting YUV420 to BGR24, so we use scale_npp to convert the pixel format to YUV420 on the GPU before downloading.
We also need an fps filter to set the frame rate.
We do this with the FFmpeg API; the equivalent command line is:
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i ~/video/test.mp4 -vf "fps=15,scale_npp=format=yuv420p,hwdownload,format=yuv420p" -f null /dev/null
Error symptoms
The log below is printed first when the error occurs. It looks as if some decoder resource was exhausted, causing send packet to fail; FFmpeg then keeps re-initializing internally until GPU memory is used up.
2021-06-09 12:14:42,473 FATAL 140468490848000 xxxx.cpp ffmpeg_log_callback No decoder surfaces left
After running for a while, the log reports the errors below; nvidia-smi showed the GPU memory fully occupied:
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback decoder->cvdl->cuvidCreateDecoder(&decoder->decoder, params) failed
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback -> CUDA_ERROR_OUT_OF_MEMORY: out of memory
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback Failed setup for format cuda: hwaccel initialisation returned error.
2021-06-09 12:51:30,353 NOTICE 140464455923456 xxxx.cpp get_hw_format Failed to get HW surface format.
2021-06-09 12:51:30,353 FATAL 140464455923456 xxxx.cpp ffmpeg_log_callback decode_slice_header error
Cause
Testing showed that fps=12.5 only works when placed after scale_npp; placing it first causes the GPU memory problem. A possible explanation: both decoding and NPP scaling work on GPU memory, and when the frame-rate filter sits before the NPP scaler, the frames it drops never actually release their GPU memory.
fps, as a filter, needs to be inserted in a filtergraph. It offers five rounding modes that affect which source frames are dropped or duplicated in order to achieve the target framerate.
Update 2021-06-23
The analysis above turned out to be wrong. With the fps filter placed after the NPP scaler, a 100-stream concurrency test still showed a memory leak that eventually triggered an OOM.
The root cause was finally identified as incorrect use of av_buffersink_get_frame: it must be called in a loop until the return value is EAGAIN or an error. Before the fps filter was added, each av_buffersrc_add_frame_flags call corresponded to essentially one av_buffersink_get_frame call, so the bug never surfaced.
After the fps filter was added, the missing loop left frames stranded inside the filtergraph; their resources were never released, so av_buffer_pool_get eventually failed and the decoder reported No decoder surfaces left.
The code below follows ffmpeg/doc/examples/filtering_video.c (initialization code omitted):
/* read all packets */
while (1) {
    if ((ret = av_read_frame(fmt_ctx, &packet)) < 0)
        break;

    if (packet.stream_index == video_stream_index) {
        ret = avcodec_send_packet(dec_ctx, &packet);
        if (ret < 0) {
            av_log(NULL, AV_LOG_ERROR, "Error while sending a packet to the decoder\n");
            break;
        }

        while (ret >= 0) {
            ret = avcodec_receive_frame(dec_ctx, frame);
            if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF) {
                break;
            } else if (ret < 0) {
                av_log(NULL, AV_LOG_ERROR, "Error while receiving a frame from the decoder\n");
                goto end;
            }

            frame->pts = frame->best_effort_timestamp;

            /* push the decoded frame into the filtergraph */
            if (av_buffersrc_add_frame_flags(buffersrc_ctx, frame, AV_BUFFERSRC_FLAG_KEEP_REF) < 0) {
                av_log(NULL, AV_LOG_ERROR, "Error while feeding the filtergraph\n");
                break;
            }

            /* pull filtered frames from the filtergraph */
            while (1) {
                ret = av_buffersink_get_frame(buffersink_ctx, filt_frame);
                if (ret == AVERROR(EAGAIN) || ret == AVERROR_EOF)
                    break;
                if (ret < 0)
                    goto end;
                display_frame(filt_frame, buffersink_ctx->inputs[0]->time_base);
                av_frame_unref(filt_frame);
            }
            av_frame_unref(frame);
        }
    }
    av_packet_unref(&packet);
}
Solution
First (incorrect) attempt
In init_filters, change the filters_descr string passed to avfilter_graph_parse_ptr from
fps=12.5,scale_npp=format=yuv420p,hwdownload,format=yuv420p
to
scale_npp=format=yuv420p,hwdownload,format=yuv420p,fps=12.5
Note: with fps moved to the end of filters_descr, the frame-rate filter now runs after scaling and download, which may cost some efficiency.
Second fix
Following the example code, call avcodec_receive_frame and av_buffersink_get_frame in loops driven by their return values, so that internally buffered frames are drained.
Troubleshooting steps
Reproducing the problem
After repeated tests: start three processes and use postman to send each process 25 concurrent rtmp streams in bulk; the problem reproduces within 3 to 5 minutes.
Narrowing down the error
1. Collected and summarized the log errors; the first abnormal message is No decoder surfaces left, which should not appear in normal operation.
2. Added debug logging.
3. Temporarily replaced the ffmpeg filter code with a direct av_hwframe_transfer_data call to copy the decoded frames back to system memory; the problem did not occur.
4. Restored the ffmpeg filter for pixel format conversion; the problem reproduced.
5. Removed the fps part from filters_descr and tested; the result was normal, so the error is related to the fps filter.
6. Tried a new fps filtering approach, and also moved fps= to the end of filters_descr; both tested normal. Combined with the earlier results, it looked as if inserting the fps filter before scale_npp drops frames when reducing the frame rate without correctly releasing their GPU memory.
TODO: fix the GPU memory leak when fps= precedes scale_npp; this requires digging into the FFmpeg fps filter code.
Other notes:
With one stream, the decoding process uses 205MB of GPU memory.
With 75 concurrent streams, each of the three GPUs uses 5128MB (25 streams per GPU at 205MB each is about 5125MB, consistent with the observation).
Second round of analysis
Because moving the fps filter after the scaler in the first fix still produced a memory problem, and the root cause had not been found, the investigation continued.
Added logging to libavutil/buffer.c, libavcodec/nvdec.c, libavcodec/nvdec_h264.c and other sources.
Repeated tests showed the failure comes from the check in nvdec_decoder_frame_alloc: if (pool->nb_allocated >= pool->dpb_size) return NULL;
Why would nb_allocated reach dpb_size?
The logs show that nvdec_decoder_frame_alloc is called too many times; after the failure a new NVDECFramePool *pool is created, but each newly printed pool address soon hits nb_allocated >= dpb_size again. A normally running decoder thread, by contrast, allocates only 3 times, ending with nb_allocated = 3. (In the actual 75-stream test, some threads did decode normally.)
What causes the difference?
Comparing against ffmpeg/doc/examples/filtering_video.c and other demo sources showed that avcodec_receive_frame and av_buffersink_get_frame were not used as intended, and the memory problem only appeared with the fps filter. Changing the get-frame calls to run inside while loops fixed the memory problem in testing.
[ffmpeg]$ git status libav*
On branch master
Changes not staged for commit:
modified: libavcodec/decode.c
modified: libavcodec/h264_slice.c
modified: libavcodec/h264dec.c
modified: libavcodec/nvdec.c
modified: libavcodec/nvdec_h264.c
modified: libavutil/buffer.c
modified: libavutil/mem.c
Functions involved:
static int decode_simple_internal(AVCodecContext *avctx, AVFrame *frame)
static AVBufferRef *nvdec_decoder_frame_alloc(void *opaque, int size) (important)
int ff_nvdec_decode_init(AVCodecContext *avctx) (important)
pool->dpb_size = frames_ctx->initial_pool_size; // dpb_size starts at 10
ctx->decoder_pool = av_buffer_pool_init2(sizeof(int), pool, nvdec_decoder_frame_alloc, av_free); // sets up the decoder pool, with nvdec_decoder_frame_alloc as the allocator
ff_nvdec_start_frame
nvdec_h264_start_frame
av_buffer_create
AVBufferRef *av_buffer_pool_get(AVBufferPool *pool)
The fps pts question
With the frame-rate filter set during decoding (fps=12.5), the pts of the filtered frames increments by 1 per frame, whereas the decoder frames' pts previously advanced in 40ms steps.
Without fps=xxx, the pts output by the npp scale pixel conversion also advances in 40ms steps.
Reference information
AVBufferPool is an API for a lock-free thread-safe pool of AVBuffers.
Frequently allocating and freeing large buffers may be slow. AVBufferPool is meant to solve this in cases when the caller needs a set of buffers of the same size (the most obvious use case being buffers for raw video or audio frames).
At the beginning, the user must call av_buffer_pool_init() to create the buffer pool. Then whenever a buffer is needed, call av_buffer_pool_get() to get a reference to a new buffer, similar to av_buffer_alloc(). This new reference works in all aspects the same way as the one created by av_buffer_alloc(). However, when the last reference to this buffer is unreferenced, it is returned to the pool instead of being freed and will be reused for subsequent av_buffer_pool_get() calls.
When the caller is done with the pool and no longer needs to allocate any new buffers, av_buffer_pool_uninit() must be called to mark the pool as freeable. Once all the buffers are released, it will automatically be freed.
Allocating and releasing buffers with this API is thread-safe as long as either the default alloc callback is used, or the user-supplied one is thread-safe.
How do I reduce frames with blending in ffmpeg
Using ffmpeg to change framerate