Understanding CUDA nvprof output and where to optimize


This content comes from http://stackoverflow.com/questions/30371030/understanding-cuda-profiler-output-nvprof

It is reposted here as a reminder and a summary.

Question

I'm just looking at the following output and trying to wrap my mind around the numbers:

==2906== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 23.04%  10.9573s     16436  666.67us  64.996us  1.5927ms  sgemm_sm35_ldg_tn_32x16x64x8x16
 22.28%  10.5968s     14088  752.18us  612.13us  1.6235ms  sgemm_sm_heavy_nt_ldg
 18.09%  8.60573s     14088  610.86us  513.05us  1.2504ms  sgemm_sm35_ldg_nn_128x8x128x16x16
 16.48%  7.84050s     68092  115.15us  1.8240us  503.00us  void axpy_kernel_val<float, int=0>(cublasAxpyParamsVal<float>)
...
  0.25%  117.53ms      4744  24.773us     896ns  11.803ms  [CUDA memcpy HtoD]
  0.23%  107.32ms     37582  2.8550us  1.8880us  8.9556ms  [CUDA memcpy DtoH]
...
==2906== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 83.47%  41.8256s     42326  988.18us  16.923us  13.332ms  cudaMemcpy
  9.27%  4.64747s    326372  14.239us  10.846us  11.601ms  cudaLaunch
  1.49%  745.12ms   1502720     495ns     379ns  1.7092ms  cudaSetupArgument
  1.37%  688.09ms      4702  146.34us     879ns  615.09ms  cudaFree
...

When it comes to optimizing memory access, what are the numbers I really need to look at when comparing different implementations? At first it looks like memcpy only takes 117.53 + 107.32 ms (in both directions), but then there is this API call cudaMemcpy at 41.8256 s, which is much more. Also, the min/avg/max columns don't add up between the upper and the lower output block.

Why is there a difference and what is the "true" number that is important for me to optimize the memory transfer?

EDIT: second question is: is there a way to figure out who is calling e.g. axpy_kernel_val (and how many times)?

Answer

The difference in total time is due to the fact that work is launched to the GPU asynchronously. If you have a long running kernel or set of kernels with no explicit synchronisation to the host, and follow them with a call to cudaMemcpy, the cudaMemcpy call will be launched well before the kernel(s) have finished executing. The total time of the API call is from the moment it is launched to the moment it completes, so will overlap with executing kernels. You can see this very clearly if you run the output through the NVIDIA Visual Profiler (nvprof -o xxx ./myApp, then import xxx into nvvp).

The difference in min time is due to launch overhead. While the API profiling takes all of the launch overhead into account, the kernel timing only contains a small part of it. Launch overhead can be ~10-20us, as you can see here.

In general, the API calls section lets you know what the CPU is doing, while the profiling results tell you what the GPU is doing. In this case, I'd argue you're underusing the CPU, as arguably the cudaMemcpy is launched too early and CPU cycles are wasted. In practice, however, it's often hard or impossible to get anything useful out of these spare cycles.
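The effect described above can be reproduced with a minimal program (a sketch, not from the original post; the `busy` kernel and its iteration count are illustrative assumptions). The kernel launch returns immediately, so nearly all of the kernel's execution time gets absorbed into the wall-clock time of the following cudaMemcpy:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// A deliberately slow kernel that spins to burn GPU time.
__global__ void busy(float *out, int iters) {
    float v = 0.0f;
    for (int i = 0; i < iters; ++i)
        v += sinf((float)i);
    out[threadIdx.x] = v;
}

int main() {
    float *d_buf, h_buf[256];
    cudaMalloc(&d_buf, sizeof(h_buf));

    // Returns almost immediately: kernel launches are asynchronous.
    busy<<<1, 256>>>(d_buf, 1 << 20);

    // Blocks until the kernel finishes AND the copy completes, so the
    // API-call time nvprof reports for cudaMemcpy includes the kernel's
    // execution time, not just the transfer itself.
    cudaMemcpy(h_buf, d_buf, sizeof(h_buf), cudaMemcpyDeviceToHost);

    printf("first element: %f\n", h_buf[0]);
    cudaFree(d_buf);
    return 0;
}
```

Profiling this with nvprof should show a cudaMemcpy time in the API calls section far larger than the [CUDA memcpy DtoH] time in the profiling result section.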

 

Summary

The first block (==...== Profiling result) reports times measured on the GPU.

The second block (==...== API calls) reports times measured on the CPU.

 

The reason is that kernel<<<>>> launches are asynchronous, while cudaMemcpy is synchronous (i.e. blocking). Judging from the profile, the GPU is fully utilized, and cudaMemcpy itself consumes little device time. But considering the other API calls and the total CPU time, the CPU-side cost is large. There is therefore substantial room for optimization: reduce the work the CPU has to do, or, while the CPU is doing the same work, make further use of the GPU's capacity.
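One common way to reclaim those wasted CPU cycles (a sketch, not from the original post; the `scale` kernel and buffer sizes are illustrative assumptions) is to issue transfers with cudaMemcpyAsync on a stream, using pinned host memory, so the host stays free while the copy and kernel proceed:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, n * sizeof(float));  // pinned memory: required for truly async copies
    cudaMalloc(&d_buf, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // All three calls return immediately; the CPU is free below.
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
    cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);

    // ... do useful CPU work here instead of blocking in cudaMemcpy ...

    cudaStreamSynchronize(stream);  // wait only when the result is actually needed
    printf("h_buf[0] = %f\n", h_buf[0]);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

With this pattern the time the CPU would have spent blocked inside cudaMemcpy can be spent on other work, which is exactly the spare capacity the answer above points out.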

 

The program's total runtime is 935 s.

