The PyTorch Profiler and Layer-by-Layer Analysis of PyTorch Models


PyTorch's autograd module includes a profiler that lets you inspect the cost of the different operators inside your model, on both the CPU and the GPU.

Two modes are currently available: a CPU-only mode using profile, and an nvprof-based mode (which registers both CPU and GPU activity) using emit_nvtx.

torch.autograd.profiler.profile(enabled=True, use_cuda=False, record_shapes=False)

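A context manager that manages the autograd profiler state and holds a summary of the results. Under the hood, it only records events for functions executed in C++ and exposes those events to Python. You can wrap any code in it, and it will report only the runtime of PyTorch functions.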

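Parameters:

enabled (bool, optional) – Setting this to False makes the context manager a no-op. Default: True.

use_cuda (bool, optional) – Enables timing of CUDA events using the cudaEvent API. This adds roughly 4 us of overhead to each tensor operation. Default: False.

record_shapes (bool, optional) – If shape recording is enabled, information about input dimensions is collected. This lets you see which dimensions were used under the hood and group the results by them with prof.key_averages(group_by_input_shape=True). Note that shape recording may skew the profiling data. For the bottom-most events (in the case of nested function calls) the skew is most likely negligible, but for higher-level functions the total self CPU time may be artificially inflated by the shape collection.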

Example

import torch

x = torch.randn((1, 1), requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    for _ in range(100):  # any normal python code, really!
        y = x ** 2
        y.backward()
# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

 

Result (without the GPU):

------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes                         
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
pow                                         64.76%           3.096ms          64.76%           3.096ms          3.096ms          1                []                                   
struct torch::autograd::GraphRoot           0.37%            17.700us         0.37%            17.700us         17.700us         1                []                                   
PowBackward0                                23.10%           1.104ms          23.10%           1.104ms          1.104ms          1                []                                   
pow                                         1.37%            65.700us         1.37%            65.700us         65.700us         1                []                                   
mul                                         10.11%           483.100us        10.11%           483.100us        483.100us        1                []                                   
mul                                         0.13%            6.200us          0.13%            6.200us          6.200us          1                []                                   
struct torch::autograd::AccumulateGrad      0.14%            6.500us          0.14%            6.500us          6.500us          1                []                                   
detach                                      0.03%            1.500us          0.03%            1.500us          1.500us          1                []                                   
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 4.780ms

 

Result (with the GPU):

------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                         
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
pow                                         29.13%           3.246ms          29.13%           3.246ms          3.246ms          31.62%           2.866ms          2.866ms          1                []                                   
struct torch::autograd::GraphRoot           0.09%            9.600us          0.09%            9.600us          9.600us          0.02%            2.048us          2.048us          1                []                                   
PowBackward0                                34.12%           3.803ms          34.12%           3.803ms          3.803ms          32.89%           2.982ms          2.982ms          1                []                                   
pow                                         8.53%            950.500us        8.53%            950.500us        950.500us        2.63%            238.592us        238.592us        1                []                                   
mul                                         16.06%           1.789ms          16.06%           1.789ms          1.789ms          19.44%           1.762ms          1.762ms          1                []                                   
mul                                         8.94%            996.700us        8.94%            996.700us        996.700us        10.73%           972.864us        972.864us        1                []                                   
struct torch::autograd::CopyBackwards       1.47%            163.900us        1.47%            163.900us        163.900us        1.31%            118.688us        118.688us        1                []                                   
to                                          1.40%            155.900us        1.40%            155.900us        155.900us        1.27%            114.944us        114.944us        1                []                                   
empty_strided                               0.09%            10.300us         0.09%            10.300us         10.300us         0.01%            1.023us          1.023us          1                []                                   
struct torch::autograd::AccumulateGrad      0.13%            15.000us         0.13%            15.000us         15.000us         0.06%            5.281us          5.281us          1                []                                   
detach                                      0.04%            4.700us          0.04%            4.700us          4.700us          0.02%            1.760us          1.760us          1                []                                   
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 11.144ms
CUDA time total: 9.066ms
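
The record_shapes option described in the parameter list can be tried on the CPU with a sketch like the one below. The tensor sizes and the use of torch.mm are arbitrary illustration choices, not part of the original example. With shape recording enabled, calls to the same operator with different input shapes should appear as separate rows.

```python
import torch
from torch.autograd import profiler

# Arbitrary illustration shapes (assumption): two matmuls of different sizes.
a_small, b_small = torch.randn(16, 32), torch.randn(32, 8)
a_large, b_large = torch.randn(64, 128), torch.randn(128, 32)

with profiler.profile(record_shapes=True) as prof:
    for _ in range(5):
        torch.mm(a_small, b_small)
        torch.mm(a_large, b_large)

# Group by (operator name, input shapes) instead of operator name alone.
averages = prof.key_averages(group_by_input_shape=True)
print(averages.table(sort_by="self_cpu_time_total", row_limit=10))

# Collect the distinct shape signatures recorded for the matmul operator.
# Recent PyTorch versions prefix operator names with "aten::".
mm_shapes = {str(evt.input_shapes) for evt in averages
             if evt.key in ("aten::mm", "mm")}
print(mm_shapes)
```

Grouping by shape is mostly useful when one operator dominates the table and you want to know which call site (that is, which tensor size) is responsible.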

 

torch.autograd.profiler.record_function(name)

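A context manager / function decorator that attaches a label to a block of Python code (or a function) while the autograd profiler is running. It is useful when tracing through a code profile.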

>>> x = torch.randn((1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     with torch.autograd.profiler.record_function("label-z"): # label the block
...         z = y ** 3
...     y.backward()
...
>>> # NOTE: some columns were removed for brevity
>>> print(prof.key_averages().table(sort_by="self_cpu_time_total"))
-----------------------------------  ---------------  ---------------  ---------------
Name                                 Self CPU total %  CPU time avg     Number of Calls
-----------------------------------  ---------------  ---------------  ---------------
pow                                  60.77%           47.470us         3
mul                                  21.73%           25.465us         2
PowBackward0                         12.03%           121.891us        1
torch::autograd::AccumulateGrad      2.70%            6.324us          1
label-z                              2.13%            12.421us         1
torch::autograd::GraphRoot           0.64%            1.503us          1
-----------------------------------  ---------------  ---------------  ---------------
Self CPU time total: 234.344us
CUDA time total: 0.000us
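
The example above uses record_function as a context manager. Since the description also mentions use as a function decorator, here is a minimal sketch of that form. The helper cube and the label "cube-step" are hypothetical names chosen for illustration.

```python
import torch
from torch.autograd import profiler

# Hypothetical helper; "cube-step" is an arbitrary label for this sketch.
@profiler.record_function("cube-step")
def cube(t):
    return t ** 3

x = torch.randn((1, 1), requires_grad=True)
with profiler.profile() as prof:
    y = x ** 2
    z = cube(y)   # this call is recorded under the "cube-step" label
    z.backward()

# The label shows up as its own row next to the operator rows.
labels = [evt.key for evt in prof.key_averages()]
print("cube-step" in labels)
```

The decorator form keeps the label attached to the function itself, so every call site is tagged without repeating the with block.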

 

torch.autograd.profiler.emit_nvtx(enabled=True, record_shapes=False)

A context manager that makes every autograd operation emit an NVTX range.

It is useful when running the program under nvprof:

nvprof --profile-from-start off -o trace_name.prof -- <regular command here>

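Unfortunately, there is no way to force nvprof to flush the data it has collected to disk, so for CUDA profiling you have to use this context manager to annotate the nvprof trace and wait for the process to exit before inspecting it. After that, you can either visualize the timeline with the NVIDIA Visual Profiler (nvvp), or load the results with torch.autograd.profiler.load_nvprof() for inspection, for example in a Python REPL.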

>>> with torch.cuda.profiler.profile():
...     model(x) # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)

torch.autograd.profiler.load_nvprof(path)

Opens an nvprof trace file and parses the autograd annotations.
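
To tie the pieces above together, here is a sketch of a complete script that could be launched under nvprof. The nn.Linear model, the input size, and the function name run_for_nvprof are hypothetical stand-ins. The script simply skips the work when no CUDA device is available.

```python
import torch
import torch.nn as nn
from torch.autograd import profiler


def run_for_nvprof():
    """Run one NVTX-annotated forward pass; return False if no GPU is present."""
    if not torch.cuda.is_available():
        return False

    # Hypothetical stand-in model and input (assumption, not from the article).
    model = nn.Linear(128, 64).cuda()
    x = torch.randn(32, 128, device="cuda")

    with torch.cuda.profiler.profile():
        model(x)  # warm up the CUDA memory allocator and the profiler
        with profiler.emit_nvtx():
            model(x)  # only this pass is annotated with NVTX ranges

    torch.cuda.synchronize()
    return True


if __name__ == "__main__":
    run_for_nvprof()
```

Assuming the script is saved as profile_me.py (a hypothetical file name), it would be run as nvprof --profile-from-start off -o trace_name.prof -- python profile_me.py, and once the process has exited the trace could be loaded with torch.autograd.profiler.load_nvprof("trace_name.prof").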

 

Layer-by-Layer Analysis of PyTorch Models

The torchprof library can be used to profile a PyTorch model layer by layer.

pip install torchprof

import torch
import torchvision
import torchprof

model = torchvision.models.alexnet(pretrained=False).cuda()
x = torch.rand([1, 3, 224, 224]).cuda()

with torchprof.Profile(model, use_cuda=True) as prof:
    model(x)

print(prof.display(show_events=False)) # equivalent to `print(prof)` and `print(prof.display())`

 

Module         | Self CPU total | CPU total | CUDA total | Occurrences
---------------|----------------|-----------|------------|------------
AlexNet        |                |           |            |
├── features   |                |           |            |
│├── 0         |        1.671ms |   6.589ms |    6.701ms |           1
│├── 1         |       62.430us |  62.430us |   63.264us |           1
│├── 2         |       62.909us | 109.948us |  112.640us |           1
│├── 3         |      225.389us | 858.376us |    1.814ms |           1
│├── 4         |       18.999us |  18.999us |   19.456us |           1
│├── 5         |       29.560us |  52.720us |   54.272us |           1
│├── 6         |      136.959us | 511.216us |  707.360us |           1
│├── 7         |       18.480us |  18.480us |   18.624us |           1
│├── 8         |       84.380us | 300.700us |  590.688us |           1
│├── 9         |       18.249us |  18.249us |   17.632us |           1
│├── 10        |       81.289us | 289.946us |  470.016us |           1
│├── 11        |       17.850us |  17.850us |   18.432us |           1
│└── 12        |       29.350us |  52.260us |   52.288us |           1
├── avgpool    |       41.840us |  70.840us |   76.832us |           1
└── classifier |                |           |            |
 ├── 0         |       66.400us | 122.110us |  125.920us |           1
 ├── 1         |      293.658us | 293.658us |  664.704us |           1
 ├── 2         |       17.600us |  17.600us |   18.432us |           1
 ├── 3         |       27.920us |  49.030us |   51.168us |           1
 ├── 4         |       40.590us |  40.590us |  208.672us |           1
 ├── 5         |       17.570us |  17.570us |   18.432us |           1
 └── 6         |       40.489us |  40.489us |   81.920us |           1

To see the low-level operations that take place inside each layer, use prof.display(show_events=True):

Module                        | Self CPU total | CPU total | CUDA total | Occurrences
------------------------------|----------------|-----------|------------|------------
AlexNet                       |                |           |            |
├── features                  |                |           |            |
│├── 0                        |                |           |            |
││├── conv2d                  |       13.370us |   1.671ms |    1.698ms |           1
││├── convolution             |       12.730us |   1.658ms |    1.685ms |           1
││├── _convolution            |       30.660us |   1.645ms |    1.673ms |           1
││├── contiguous              |        6.970us |   6.970us |    7.136us |           1
││└── cudnn_convolution       |        1.608ms |   1.608ms |    1.638ms |           1
│├── 1                        |                |           |            |
││└── relu_                   |       62.430us |  62.430us |   63.264us |           1
│├── 2                        |                |           |            |
││├── max_pool2d              |       15.870us |  62.909us |   63.488us |           1
││└── max_pool2d_with_indices |       47.039us |  47.039us |   49.152us |           1
...
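
The per-layer grouping shown in these tables can be approximated with the built-in profiler alone, which may help in understanding what a layer-level profile measures. The sketch below is not torchprof's implementation. It wraps the forward method of each leaf module in a record_function label; the helper name label_leaf_layers, the "layer:" prefix, and the small stand-in network are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torch.autograd import profiler


def label_leaf_layers(model, prefix="layer:"):
    """Wrap each leaf module's forward in a record_function label (hypothetical helper)."""
    for name, module in model.named_modules():
        is_leaf = len(list(module.children())) == 0
        if name and is_leaf:
            # Setting an instance attribute overrides the class-level forward.
            module.forward = profiler.record_function(prefix + name)(module.forward)
    return model


# Small stand-in network (assumption) so the sketch runs without torchvision or a GPU.
model = label_leaf_layers(
    nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
)

with profiler.profile() as prof:
    model(torch.rand(2, 8))

# Keep only the rows created by the labels, keyed by layer path.
layer_rows = {evt.key: evt.cpu_time_total
              for evt in prof.key_averages()
              if evt.key.startswith("layer:")}
for label, cpu_time_us in sorted(layer_rows.items()):
    print(f"{label:10s} {cpu_time_us:10.1f}us")
```

Each label row covers everything executed inside that layer's forward call, which corresponds roughly to the "CPU total" column in the tables above.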

The raw PyTorch event lists can be obtained by calling raw() on the profile instance.

trace, event_lists_dict = prof.raw()
print(trace[2])
# Trace(path=('AlexNet', 'features', '0'), leaf=True, module=Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2)))

print(event_lists_dict[trace[2].path][0])
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Name                   Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
conv2d                 0.80%            13.370us         100.00%          1.671ms          1.671ms          25.34%           1.698ms          1.698ms          1                []
convolution            0.76%            12.730us         99.20%           1.658ms          1.658ms          25.15%           1.685ms          1.685ms          1                []
_convolution           1.83%            30.660us         98.44%           1.645ms          1.645ms          24.97%           1.673ms          1.673ms          1                []
contiguous             0.42%            6.970us          0.42%            6.970us          6.970us          0.11%            7.136us          7.136us          1                []
cudnn_convolution      96.19%           1.608ms          96.19%           1.608ms          1.608ms          24.44%           1.638ms          1.638ms          1                []
---------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------
Self CPU time total: 1.671ms
CUDA time total: 6.701ms
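
The table above is the printed form of an autograd profiler event list. Assuming the lists returned by raw() behave like the event lists produced by the plain profiler (the printed output suggests this, but it is an assumption), they can also be processed in code rather than read as text. The sketch below uses only torch.autograd.profiler, with arbitrary tensor sizes, to show the kind of per-operator fields that are available.

```python
import torch
from torch.autograd import profiler

# Arbitrary illustration workload (assumption).
x = torch.randn(32, 64)
w = torch.randn(64, 16)

with profiler.profile() as prof:
    for _ in range(10):
        torch.relu(torch.mm(x, w))

# Averaged events expose key (name), count, and timing fields in microseconds.
rows = sorted(prof.key_averages(),
              key=lambda evt: evt.self_cpu_time_total,
              reverse=True)
total_self_cpu_us = sum(evt.self_cpu_time_total for evt in rows)

for evt in rows:
    share = 100.0 * evt.self_cpu_time_total / max(total_self_cpu_us, 1e-9)
    print(f"{evt.key:20s} calls={evt.count:3d} "
          f"self_cpu={evt.self_cpu_time_total:9.1f}us ({share:5.1f}%)")
```

Working with the fields directly makes it straightforward to compare runs or to export the numbers, instead of parsing the formatted table.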

Individual layers can be selected with the optional paths kwarg. Profiling is skipped for all other layers.

model = torchvision.models.alexnet(pretrained=False)
x = torch.rand([1, 3, 224, 224])

# Layer does not have to be a leaf layer
paths = [("AlexNet", "features", "3"), ("AlexNet", "classifier")]

with torchprof.Profile(model, paths=paths) as prof:
    model(x)

print(prof)

 

Module         | Self CPU total | CPU total | CUDA total | Occurrences
---------------|----------------|-----------|------------|------------
AlexNet        |                |           |            |
├── features   |                |           |            |
│├── 0         |                |           |            |
│├── 1         |                |           |            |
│├── 2         |                |           |            |
│├── 3         |        3.189ms |  12.717ms |    0.000us |           1
│├── 4         |                |           |            |
│├── 5         |                |           |            |
│├── 6         |                |           |            |
│├── 7         |                |           |            |
│├── 8         |                |           |            |
│├── 9         |                |           |            |
│├── 10        |                |           |            |
│├── 11        |                |           |            |
│└── 12        |                |           |            |
├── avgpool    |                |           |            |
└── classifier |       13.403ms |  14.011ms |    0.000us |           1
 ├── 0         |                |           |            |
 ├── 1         |                |           |            |
 ├── 2         |                |           |            |
 ├── 3         |                |           |            |
 ├── 4         |                |           |            |
 ├── 5         |                |           |            |
 └── 6         |                |           |            |

References:

https://pytorch.org/docs/stable/autograd.html#profiler

https://github.com/awwong1/torchprof

 

