Leveraging the Hardware JPEG Decoder and NVIDIA nvJPEG Library on NVIDIA A100 GPUs

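According to surveys, people capture on the order of 1.2 trillion images with phones or digital cameras. Storing these images, especially in high-resolution raw formats, takes up a great deal of storage space.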

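JPEG refers to the Joint Photographic Experts Group, which celebrated its 25th birthday in 2017. The JPEG standard specifies the codec, which defines how an image is compressed into a bitstream of bytes and decompressed back into an image.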

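The main purpose of the JPEG codec is to minimize the file size of photographic image files. JPEG is a lossy compression format, which means that it does not store the full pixel data of the original image. One advantage of JPEG is that it lets you fine-tune the amount of compression used. Used correctly, this yields good image quality at the smallest reasonable file size.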

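The key components of JPEG compression are as follows:

  • Color space conversion lets you separate the luma (Y) and chroma (Cb, Cr) components. Downsampling Cb and Cr lets you reduce the file size with almost no noticeable loss of quality, because human perception is less sensitive to these image components. This is not part of the core standard but is defined as part of the JFIF format.
  • The block-based discrete cosine transform (DCT) compacts the data into the lower frequencies.
  • Quantization rounds the coefficients of high-frequency details. Losing these details is typically acceptable, because the human eye usually cannot easily distinguish high-frequency content.
  • Progressive encoding lets you preview a low-quality version of the whole image after partially decoding its bitstream.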

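The following photos (Figure 1) demonstrate the loss of image quality caused by JPEG compression. The original butterfly image is in BMP format (512×512, 24-bit, 769 KB, uncompressed). The same image is then shown in JPEG format with a quality compression coefficient of 50%, 4:2:0 subsampling, 24-bit, and a size of 33 KB.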

 

Figure 1a. Original butterfly image (no compression, Size 512×512, 24-bit), 769 KB.

 

Figure 1b. Compressed butterfly image (quality compression coefficient 50%, subsampling 4:2:0, 24-bit), 33 KB.

How JPEG works

Figure 2 shows one common configuration of a JPEG encoder.

 

Figure 2. Diagram of the JPEG encoding process employing a parallel utilization of GPU CUDA software and CPU.

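First, JPEG encoding starts with an RGB color image.

The second step converts the image to the YCbCr color space, where Y represents luma (brightness) and the Cb and Cr channels represent chroma (the blue-difference and red-difference projections). The Cb and Cr channels are then downsampled by a predetermined factor, usually two or three. This downsampling gives you the first stage of compression.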
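The color conversion and chroma downsampling step can be sketched on the CPU as follows. This is a minimal illustration, assuming the full-range JFIF (BT.601) conversion formulas and a simple 2×2 averaging filter for 4:2:0; it is not how nvJPEG implements the step, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct YCbCr { std::uint8_t y, cb, cr; };

// Round to nearest and clamp to the 8-bit range.
inline std::uint8_t clampToByte(double v) {
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, std::lround(v))));
}

// Full-range RGB -> YCbCr conversion as defined by JFIF.
inline YCbCr rgbToYCbCr(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const double y  =         0.299    * r + 0.587    * g + 0.114    * b;
    const double cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5      * b;
    const double cr = 128.0 + 0.5      * r - 0.418688 * g - 0.081312 * b;
    return { clampToByte(y), clampToByte(cb), clampToByte(cr) };
}

// Downsample one chroma plane by 2 in each direction (4:2:0) by averaging
// each 2x2 neighborhood. Edge pixels are replicated for odd dimensions.
std::vector<std::uint8_t> downsample2x2(const std::vector<std::uint8_t>& plane,
                                        int width, int height) {
    const int outW = (width + 1) / 2;
    const int outH = (height + 1) / 2;
    std::vector<std::uint8_t> out(static_cast<std::size_t>(outW) * outH);
    for (int oy = 0; oy < outH; ++oy) {
        for (int ox = 0; ox < outW; ++ox) {
            int sum = 0;
            for (int dy = 0; dy < 2; ++dy) {
                for (int dx = 0; dx < 2; ++dx) {
                    const int x = std::min(2 * ox + dx, width - 1);
                    const int y = std::min(2 * oy + dy, height - 1);
                    sum += plane[static_cast<std::size_t>(y) * width + x];
                }
            }
            out[static_cast<std::size_t>(oy) * outW + ox] =
                static_cast<std::uint8_t>((sum + 2) / 4);  // rounded average
        }
    }
    return out;
}
```

With 4:2:0 downsampling, each chroma plane keeps a quarter of its samples, so the three planes together hold half as many samples as the RGB input before any further compression.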

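In the next stage, each channel is split into 8×8 blocks and the DCT is computed for each block. The DCT is a transform into frequency space, similar to a Fourier transform. The DCT itself is lossless and reversible; it turns an 8×8 spatial block into 64 frequency coefficients.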
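To make the transform concrete, here is a naive 8×8 forward DCT written directly from the textbook definition. It is a sketch for illustration only: production encoders use fast factorizations, and JPEG also subtracts 128 from each 8-bit sample before the transform, which this function leaves to the caller. The `Block` alias and function name are assumptions of this example.

```cpp
#include <array>
#include <cmath>

using Block = std::array<std::array<double, 8>, 8>;

// Naive 2-D DCT-II of an 8x8 block:
// F(u,v) = 1/4 C(u) C(v) sum_x sum_y f(x,y)
//          cos((2x+1) u pi / 16) cos((2y+1) v pi / 16)
// with C(0) = 1/sqrt(2) and C(k) = 1 for k > 0.
Block forwardDct8x8(const Block& f) {
    const double pi = std::acos(-1.0);
    Block F{};
    for (int u = 0; u < 8; ++u) {
        for (int v = 0; v < 8; ++v) {
            double sum = 0.0;
            for (int x = 0; x < 8; ++x) {
                for (int y = 0; y < 8; ++y) {
                    sum += f[x][y]
                         * std::cos((2 * x + 1) * u * pi / 16.0)
                         * std::cos((2 * y + 1) * v * pi / 16.0);
                }
            }
            const double cu = (u == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
            const double cv = (v == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
            F[u][v] = 0.25 * cu * cv * sum;
        }
    }
    return F;
}
```

For a flat block in which every sample equals 100, only the DC coefficient `F[0][0]` is nonzero (it equals 800) and the other 63 coefficients are zero up to rounding. This energy compaction is what makes the later quantization step effective.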

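The DCT coefficients are then quantized. This is a lossy process and forms the second compression stage. Quantization is controlled by the JPEG quality parameter: lower quality settings correspond to more severe compression and result in smaller files.

The quantization thresholds are specific to each spatial frequency and have been carefully designed. Low frequencies are compressed less than high frequencies, because the human eye is more sensitive to subtle errors over broad areas than to changes in the magnitude of high-frequency signals.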
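The sketch below shows how a quality setting can turn into per-frequency step sizes. It uses the example luminance table from Annex K of the JPEG standard and, as an assumption, the quality scaling rule popularized by libjpeg. That rule is a convention rather than part of the standard, and nvJPEG may derive its tables differently.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

using QuantTable = std::array<int, 64>;

// Example luminance quantization table from Annex K of the JPEG standard
// (row-major, lowest frequencies at the top left).
const QuantTable kBaseLuminance = {{
    16, 11, 10, 16,  24,  40,  51,  61,
    12, 12, 14, 19,  26,  58,  60,  55,
    14, 13, 16, 24,  40,  57,  69,  56,
    14, 17, 22, 29,  51,  87,  80,  62,
    18, 22, 37, 56,  68, 109, 103,  77,
    24, 35, 55, 64,  81, 104, 113,  92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103,  99
}};

// Scale the base table for a quality setting in [1, 100].
// Assumption: libjpeg-style scaling, where quality 50 leaves the table as is.
QuantTable scaleTable(const QuantTable& base, int quality) {
    quality = std::min(100, std::max(1, quality));
    const int scale = (quality < 50) ? 5000 / quality : 200 - 2 * quality;
    QuantTable out{};
    for (std::size_t i = 0; i < base.size(); ++i) {
        const int q = (base[i] * scale + 50) / 100;
        out[i] = std::min(255, std::max(1, q));  // keep steps in 8-bit range
    }
    return out;
}

// Quantize one DCT coefficient: divide by the step size, round to nearest.
inline int quantize(double coefficient, int step) {
    return static_cast<int>(std::lround(coefficient / step));
}
```

Note how the step sizes grow toward the bottom right of the table, where the high frequencies live. At quality 50 a DC coefficient of 800 quantizes to 50, while a high-frequency coefficient of 5 with step 99 quantizes to 0 and is discarded.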

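In the final step, the quantized DCT coefficients are compressed losslessly with Huffman encoding and stored in a JPEG file, such as the image.jpg shown in Figure 2.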

Figure 3 shows the JPEG decoding process on NVIDIA GPUs.

 

Figure 3. The JPEG decoding process employs a parallel utilization of GPU CUDA and software. A hybrid (CPU/GPU) approach for Huffman decoding overcomes the serial process stall.

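The JPEG decoding process starts with the compressed JPEG bitstream and extracts the header information.

Then, Huffman decoding runs as a serial process, because the DCT coefficients are decoded from the bitstream one at a time.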
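A toy prefix-code decoder illustrates why this stage is hard to parallelize. The code table here is made up for the example; a real JPEG decoder builds its tables from the file's DHT segments and decodes run-length/size symbols rather than plain integers.

```cpp
#include <map>
#include <string>
#include <vector>

// Decode a string of '0'/'1' characters with a prefix code table.
// Codewords have variable length, so the start of each codeword is only
// known once the previous codeword has been fully decoded. That dependency
// is what forces the decoder to walk the bitstream serially.
std::vector<int> decodePrefixCode(const std::string& bits,
                                  const std::map<std::string, int>& table) {
    std::vector<int> symbols;
    std::string current;
    for (char bit : bits) {
        current.push_back(bit);
        const auto it = table.find(current);
        if (it != table.end()) {       // a complete codeword was read
            symbols.push_back(it->second);
            current.clear();           // next codeword starts at the next bit
        }
    }
    return symbols;
}
```

With the hypothetical table `{"0" -> 0, "10" -> 1, "110" -> 2, "111" -> 3}`, the bit string `0101100111` decodes to the symbols 0, 1, 2, 0, 3. A thread that started in the middle of the string could not tell where a codeword begins, which is why Figure 3 describes a hybrid CPU/GPU approach for this stage.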

The next step performs de-quantization and the inverse DCT on each 8×8 block.
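Unlike Huffman decoding, this step treats every 8×8 block independently, which is why it maps well to a GPU. The following CPU sketch shows the per-block math under the textbook DCT definition; the type aliases and function names are assumptions of this example, not nvJPEG API.

```cpp
#include <array>
#include <cmath>

using Block = std::array<std::array<double, 8>, 8>;
using IntBlock = std::array<std::array<int, 8>, 8>;

// De-quantization: multiply each quantized coefficient by its step size.
Block dequantize(const IntBlock& quantized, const IntBlock& steps) {
    Block F{};
    for (int u = 0; u < 8; ++u)
        for (int v = 0; v < 8; ++v)
            F[u][v] = static_cast<double>(quantized[u][v]) * steps[u][v];
    return F;
}

// Naive inverse 2-D DCT of an 8x8 block:
// f(x,y) = 1/4 sum_u sum_v C(u) C(v) F(u,v)
//          cos((2x+1) u pi / 16) cos((2y+1) v pi / 16)
// with C(0) = 1/sqrt(2) and C(k) = 1 for k > 0.
Block inverseDct8x8(const Block& F) {
    const double pi = std::acos(-1.0);
    Block f{};
    for (int x = 0; x < 8; ++x) {
        for (int y = 0; y < 8; ++y) {
            double sum = 0.0;
            for (int u = 0; u < 8; ++u) {
                for (int v = 0; v < 8; ++v) {
                    const double cu = (u == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
                    const double cv = (v == 0) ? 1.0 / std::sqrt(2.0) : 1.0;
                    sum += cu * cv * F[u][v]
                         * std::cos((2 * x + 1) * u * pi / 16.0)
                         * std::cos((2 * y + 1) * v * pi / 16.0);
                }
            }
            f[x][y] = 0.25 * sum;
        }
    }
    return f;
}
```

A block whose only nonzero quantized coefficient is a DC value of 50 with step size 16 reconstructs to a flat block of 100. A full decoder would then add back the 128 level shift and clamp to the 8-bit range.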

The upsampling step handles the YCbCr conversion and produces the decoded RGB image.
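As a rough illustration of this last stage, the sketch below upsamples a chroma plane with nearest-neighbor replication and converts a YCbCr sample back to RGB with the full-range JFIF formulas. The choice of nearest-neighbor filtering is an assumption made for brevity; real decoders may interpolate, and the names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Rgb { std::uint8_t r, g, b; };

// Round to nearest and clamp to the 8-bit range.
inline std::uint8_t clampToByte(double v) {
    return static_cast<std::uint8_t>(std::min(255L, std::max(0L, std::lround(v))));
}

// Full-range YCbCr -> RGB conversion (inverse of the JFIF forward formulas).
inline Rgb yCbCrToRgb(std::uint8_t y, std::uint8_t cb, std::uint8_t cr) {
    const double r = y + 1.402    * (cr - 128);
    const double g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128);
    const double b = y + 1.772    * (cb - 128);
    return { clampToByte(r), clampToByte(g), clampToByte(b) };
}

// Nearest-neighbor 2x upsampling of a chroma plane back to luma resolution.
// outWidth/outHeight are the luma dimensions; odd sizes reuse the last sample.
std::vector<std::uint8_t> upsample2x(const std::vector<std::uint8_t>& plane,
                                     int width, int height,
                                     int outWidth, int outHeight) {
    std::vector<std::uint8_t> out(static_cast<std::size_t>(outWidth) * outHeight);
    for (int y = 0; y < outHeight; ++y) {
        for (int x = 0; x < outWidth; ++x) {
            const int sx = std::min(x / 2, width - 1);
            const int sy = std::min(y / 2, height - 1);
            out[static_cast<std::size_t>(y) * outWidth + x] =
                plane[static_cast<std::size_t>(sy) * width + sx];
        }
    }
    return out;
}
```

A neutral sample such as (Y, Cb, Cr) = (255, 128, 128) maps back to white, (255, 255, 255), because chroma values of 128 carry no color offset.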

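NVIDIA has accelerated the JPEG codec with the nvJPEG library, which is built on CUDA technology. We developed a complete parallel implementation of the JPEG algorithm. The parts of the JPEG codec workflow that are typically GPU-accelerated are shown in Figures 2 and 3.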

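New JPEG hardware decoder

Recently, we introduced the NVIDIA A100 GPU, which has a dedicated hardware JPEG decoder. Previously, there was no such hardware unit on datacenter GPUs, and JPEG decoding was a pure software CUDA solution that used both the CPU and the GPU.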

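Now, the hardware decoder runs concurrently with the rest of the GPU, which can carry out various compute tasks such as image classification, object detection, and image segmentation. With a 4-8x JPEG decode speedup compared to the NVIDIA Tesla V100, it substantially increases throughput in more than one way.

It is exposed through the nvJPEG library, which is part of the CUDA toolkit.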

nvJPEG library overview

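nvJPEG is a GPU-accelerated library for JPEG encoding and decoding. Together with NVIDIA DALI, a data augmentation and image loading library, it can accelerate deep learning training on image classification models by speeding up the decoding and augmentation of data. The A100 includes a 5-core hardware JPEG decode engine. nvJPEG uses the hardware backend for batched processing of JPEG images.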

 

Figure 4. The JPEG hardware decoding process employs a parallel utilization of hardware decoder and GPU CUDA software. The HW decoder is independent of the CUDA SMs so that software GPU decoders can be used simultaneously.

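When you select the hardware decoder with the nvjpegCreateEx init function, nvJPEG accelerates baseline JPEG decoding and various color conversion formats (for example, YUV 420, 422, and 444). As shown in Figure 4, this makes image decoding up to 20x faster than CPU-only processing. DALI users benefit from this hardware acceleration directly, because nvJPEG is abstracted away behind DALI.

The nvJPEG library supports the following operations: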

  • nvJPEG Encoding
  • nvJPEG Transcoding
  • nvJPEG Decoding (includes HW (A100) support)

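The library supports the following JPEG options:

  • Baseline and progressive JPEG encoding and decoding (baseline decoding only on the A100 hardware decoder)
  • 8 bits per pixel
  • Huffman bitstream decoding
  • Up to four-channel JPEG bitstreams
  • 8-bit and 16-bit quantization tables

The following chroma subsampling modes are supported for the three color channels Y, Cb, Cr (Y, U, V):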

  • 4:4:4
  • 4:2:2
  • 4:2:0
  • 4:4:0
  • 4:1:1
  • 4:1:0
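Each subsampling mode determines how large the chroma planes are relative to the luma plane. The following sketch maps the modes listed above to their horizontal and vertical factors and computes the chroma plane size. The helper names are hypothetical, and rounding up for odd dimensions is an assumption of this example.

```cpp
#include <stdexcept>
#include <string>
#include <utility>

// Horizontal and vertical chroma subsampling factors for each mode.
std::pair<int, int> chromaFactors(const std::string& mode) {
    if (mode == "4:4:4") return {1, 1};  // full-resolution chroma
    if (mode == "4:2:2") return {2, 1};  // half width
    if (mode == "4:2:0") return {2, 2};  // half width, half height
    if (mode == "4:4:0") return {1, 2};  // half height
    if (mode == "4:1:1") return {4, 1};  // quarter width
    if (mode == "4:1:0") return {4, 2};  // quarter width, half height
    throw std::invalid_argument("unknown subsampling mode: " + mode);
}

// Size (width, height) of one chroma plane, rounded up for odd dimensions.
std::pair<int, int> chromaPlaneSize(int lumaWidth, int lumaHeight,
                                    const std::string& mode) {
    const std::pair<int, int> f = chromaFactors(mode);
    return {(lumaWidth + f.first - 1) / f.first,
            (lumaHeight + f.second - 1) / f.second};
}
```

For example, a 640 x 426 image with 4:2:0 subsampling has 320 x 213 chroma planes, and a 64 x 64 image with 4:4:4 keeps 64 x 64 chroma planes.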

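The library has the following features:

  • Hybrid decoding using both the CPU and the GPU.
  • Input to the library is in host memory, and the output is in GPU memory.
  • Single-image and batched-image decoding.
  • User-provided memory manager for device and pinned host memory allocations.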

Performance numbers

For the performance graphs in this section, we used the following test setup and GPU/CPU hardware:

  • NVIDIA V100 GPU: CPU: E5-2698 v4 @ 2 GHz, 3.6 GHz Turbo (Broadwell), HT on; GPU: Tesla V100-SXM2-16GB (GV100), 1 × 16160 MiB, 1 × 80 SMs; GPU video clock 1312; batch 128 and single thread
  • NVIDIA A100 GPU: CPU: Platinum 8168 @ 2 GHz, 3.7 GHz Turbo (Skylake), HT on; GPU: A100-SXM4-40GB (GA100), 1 × 40557 MiB, 1 × 108 SMs; GPU video clock 1095; batch 128 and single thread
  • CPU: Platinum 8168 @ 2 GHz, 3.7 GHz Turbo (Skylake), HT on; TurboJPEG decode for CPU testing
  • Image dataset: 2K FHD = 1920 x 1080; 4K UHD = 3840 x 2160
  • Software: CUDA Toolkit 11.0; CUDA driver r450.24

The next two charts show the decode speedup of the hardware JPEG decoder.

 

Figure 5. Graph showing the speed up achieved by hardware decode on A100 over the CUDA hybrid decode on V100. 

 

Figure 6. The number of CPU threads required by the hybrid decoder on V100 to keep up with hardware decoder throughput on A100.

By offloading decoding to the hardware, you free up valuable CPU cycles that can be put to better use.

Figure 7 shows the encoding speedup.

 

Figure 7a. JPEG baseline encoding throughput comparison between CPU, CUDA (V100, A100) for an image size of 1920×1080 (2K FHD), 3840×2160 (4K UHD).

 

Figure 7b. JPEG progressive encoding throughput comparison between CPU, CUDA (V100, A100) for an image size of 1920×1080 (2K FHD), 3840×2160 (4K UHD).

Image decoding example

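The following is an image decoding example that uses the nvJPEG library. It shows the use of the hardware decoder on the A100 GPU and the backend fallback for other NVIDIA GPUs.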

//
// The following code example shows how to use the nvJPEG library for JPEG image decoding.
//
// Libraries used
// nvJPEG decoding

int main()
{
    ...
    // create nvJPEG decoder and decoder state
    nvjpegDevAllocator_t dev_allocator = {&dev_malloc, &dev_free};
    nvjpegPinnedAllocator_t pinned_allocator = {&host_malloc, &host_free};

    // Selecting A100 Hardware decoder
    nvjpegStatus_t status = nvjpegCreateEx(NVJPEG_BACKEND_HARDWARE,
                                           &dev_allocator,
                                           &pinned_allocator,
                                           NVJPEG_FLAGS_DEFAULT,
                                           &params.nvjpeg_handle);

    params.hw_decode_available = true;
    if (status == NVJPEG_STATUS_ARCH_MISMATCH) {
        std::cout << "Hardware Decoder not supported. Falling back to default backend" << std::endl;
        // GPU SW decoder selected
        nvjpegCreateEx(NVJPEG_BACKEND_DEFAULT, &dev_allocator,
                       &pinned_allocator, NVJPEG_FLAGS_DEFAULT,
                       &params.nvjpeg_handle);
        params.hw_decode_available = false;
    }

    // create JPEG decoder state
    nvjpegJpegStateCreate(params.nvjpeg_handle, &params.nvjpeg_state);

    // extract bitstream metadata to figure out whether a bitstream can be decoded
    nvjpegJpegStreamParseHeader(params.nvjpeg_handle, (const unsigned char *)img_data[i].data(), img_len[i], params.jpeg_streams[0]);

    // decode Batch images
    nvjpegDecodeBatched(params.nvjpeg_handle, params.nvjpeg_state,
                        batched_bitstreams.data(),
                        batched_bitstreams_size.data(),
                        batched_output.data(), params.stream);
    ...
}

 

$ git clone https://github.com/NVIDIA/CUDALibrarySamples.git
$ cd CUDALibrarySamples/nvJPEG/nvJPEG-Decoder/
$ mkdir build
$ cd build
$ cmake ..
$ make

# Running nvJPEG decoder
$ ./nvjpegDecoder -i ../input_images/ -o ~/tmp

 

Decoding images in directory: ../input_images/, total 12, batchsize 1
Processing: ../input_images/cat_baseline.jpg
Image is 3 channels.
Channel #0 size: 64 x 64
Channel #1 size: 64 x 64
Channel #2 size: 64 x 64
YUV 4:4:4 chroma subsampling
Done writing decoded image to file:/tmp/cat_baseline.bmp
Processing: ../input_images/img8.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img8.bmp
Processing: ../input_images/img5.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img5.bmp
Processing: ../input_images/img7.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img7.bmp
Processing: ../input_images/img2.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: /tmp/img2.bmp
Processing: ../input_images/img4.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img4.bmp
Processing: ../input_images/cat.jpg
Image is 3 channels.
Channel #0 size: 64 x 64
Channel #1 size: 64 x 64
Channel #2 size: 64 x 64
YUV 4:4:4 chroma subsampling
Done writing decoded image to file:/tmp/cat.bmp
Processing: ../input_images/cat_grayscale.jpg
Image is 1 channels.
Channel #0 size: 64 x 64
Grayscale JPEG
Done writing decoded image to file:/tmp/cat_grayscale.bmp
Processing: ../input_images/img1.jpg
Image is 3 channels.
Channel #0 size: 480 x 640
Channel #1 size: 240 x 320
Channel #2 size: 240 x 320
YUV 4:2:0 chroma subsampling
Done writing decoded image to file: /tmp/img1.bmp
Processing: ../input_images/img3.jpg
Image is 3 channels.
Channel #0 size: 640 x 426
Channel #1 size: 320 x 213
Channel #2 size: 320 x 213
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img3.bmp
Processing: ../input_images/img9.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img9.bmp
Processing: ../input_images/img6.jpg
Image is 3 channels.
Channel #0 size: 640 x 480
Channel #1 size: 320 x 240
Channel #2 size: 320 x 240
YUV 4:2:0 chroma subsampling
Done writing decoded image to file:/tmp/img6.bmp
Total decoding time: 14.8286
Avg decoding time per image: 1.23571
Avg images per sec: 0.809248
Avg decoding time per batch: 1.23571

Image resizing example

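This image resizing and watermarking example generates scaled versions of an image based on client requests. Figure 8 shows the typical workflow for image resizing and watermarking.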

 

Figure 8. Image resizing and watermarking pipeline employing a parallel utilization of GPU software and CUDA.

The following code example shows how to resize an image and watermark it with a logo image.

//
// The following code example shows how to resize images and watermark them with a logo image.
//
// Libraries used 
// nvJPEG decoding, NPP Resize, NPP watermarking, nvJPEG encoding
 
int main()
{
    ...
    // nvJPEG decoder 
    nReturnCode = nvjpegDecode(nvjpeg_handle, nvjpeg_decoder_state, dpImage, nSize, oformat, &imgDesc, NULL);
    // NPP image resize
    st = nppiResize_8u_C3R_Ctx(imgDesc.channel[0], imgDesc.pitch[0], srcSize,   
         srcRoi, imgResize.channel[0], imgResize.pitch[0], dstSize, dstRoi,  
         NPPI_INTER_LANCZOS, nppStreamCtx);
 
    st = nppiResize_8u_C3R_Ctx(imgDescW.channel[0], imgDescW.pitch[0], srcSizeW, 
         srcRoiW,imgResizeW.channel[0], imgResizeW.pitch[0], dstSize, dstRoi,   
         NPPI_INTER_LANCZOS, nppStreamCtx);
 
    // Alpha Blending watermarking
    st = nppiAlphaCompC_8u_C3R_Ctx(imgResize.channel[0], imgResize.pitch[0], 
         255, imgResizeW.channel[0], imgResizeW.pitch[0], ALPHA_BLEND, 
         imgResize.channel[0], imgResize.pitch[0], dstSize, NPPI_OP_ALPHA_PLUS,  
         nppStreamCtx);
 
    // nvJPEG encoding
    nvjpegEncodeImage(nvjpeg_handle, nvjpeg_encoder_state, nvjpeg_encode_params,
         &imgResize, iformat, dstSize.width, dstSize.height, NULL);
    ...
}

$ git clone https://github.com/NVIDIA/CUDALibrarySamples.git
$ cd CUDALibrarySamples/nvJPEG/Image-Resize-WaterMark/
$ mkdir build
$ cd build
$ cmake ..
$ make

# Running Image resizer and watermarking
$ ./imageResizeWatermark -i ../input_images/ -o resize_images -q 85 -rw 512 -rh 512
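To give an intuition for the alpha blending step in the example, the sketch below blends one 8-bit channel value of the image with the matching value of the logo using two constant alpha weights, in the spirit of a "plus" composition. This is an illustrative CPU version only: the exact rounding and normalization used by NPP may differ, and the function name is hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Blend one 8-bit channel value of the image with the matching value of the
// logo, using constant alpha weights in [0, 255]:
//   result = image * alphaImage / 255 + logo * alphaLogo / 255
// The sum is rounded to nearest and saturated to the 8-bit range.
inline std::uint8_t blendPlus(std::uint8_t image, std::uint8_t alphaImage,
                              std::uint8_t logo, std::uint8_t alphaLogo) {
    const int sum = (image * alphaImage + logo * alphaLogo + 127) / 255;
    return static_cast<std::uint8_t>(std::min(sum, 255));
}
```

With the image weight fixed at 255, as in the example above, the image passes through at full strength and the logo is added on top in proportion to its alpha weight. Bright regions can saturate at 255, which is why a small logo weight is typically chosen for a watermark.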

Summary

Download the latest version of prebuilt DALI binaries with NVIDIA Ampere architecture support. For a detailed list of new features and enhancements, see the nvJPEG Library documentation and the latest release notes.

To learn more about how DALI uses nvJPEG for accelerating a deep learning data pipeline, see Loading Data Fast with DALI and the New Hardware JPEG Decoder in NVIDIA A100 GPUs.

 

