Does converting FP32 to FP16 speed up libtorch?
### 1. FP16 speedups in PyTorch

PyTorch can convert a model from FP32 to FP16 with a single call to `half()`, but whether FP16 actually runs faster depends on the GPU. Take the following code as an example:
```python
import time
import torch
from torch.autograd import Variable
import torchvision.models as models
import torch.backends.cudnn as cudnn

cudnn.benchmark = True

net = models.resnet18().cuda()
inp = torch.randn(64, 3, 224, 224).cuda()

# warm-up iterations so cudnn.benchmark can select algorithms
for i in range(5):
    net.zero_grad()
    out = net.forward(Variable(inp, requires_grad=True))
    loss = out.sum()
    loss.backward()
torch.cuda.synchronize()

# timed FP32 forward + backward passes
start = time.time()
for i in range(100):
    net.zero_grad()
    out = net.forward(Variable(inp, requires_grad=True))
    loss = out.sum()
    loss.backward()
torch.cuda.synchronize()
end = time.time()
print("FP32 Iterations per second: ", 100 / (end - start))

# same benchmark with the model and input converted to FP16
net = models.resnet18().cuda().half()
inp = torch.randn(64, 3, 224, 224).cuda().half()
torch.cuda.synchronize()
start = time.time()
for i in range(100):
    net.zero_grad()
    out = net.forward(Variable(inp, requires_grad=True))
    loss = out.sum()
    loss.backward()
torch.cuda.synchronize()
end = time.time()
print("FP16 Iterations per second: ", 100 / (end - start))
```
Results on a 1080Ti:

- FP32 Iterations per second: 10.37743206218922
- FP16 Iterations per second: 9.855269155760238
- FP32 Memory: 2497M
- FP16 Memory: 1611M

FP16 clearly reduces GPU memory usage, but throughput does not improve; it even drops slightly.
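The memory saving follows directly from element size: FP16 stores each value in 2 bytes versus 4 for FP32, so tensors shrink by roughly half (the observed 2497M → 1611M is less than a full halving because CUDA context and workspace memory do not shrink). A quick stdlib-only illustration, using the benchmark's (64, 3, 224, 224) input shape:

```python
import struct

# struct format 'e' is IEEE 754 half precision, 'f' is single precision
fp16_bytes = struct.calcsize('e')  # 2
fp32_bytes = struct.calcsize('f')  # 4

# the benchmark input above: a (64, 3, 224, 224) batch
numel = 64 * 3 * 224 * 224
print(f"FP32 input: {numel * fp32_bytes / 2**20:.1f} MiB")  # 36.8 MiB
print(f"FP16 input: {numel * fp16_bytes / 2**20:.1f} MiB")  # 18.4 MiB
```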
The same comparison on a V100:

- FP32 Iterations per second: 16.325794715481173
- FP16 Iterations per second: 24.853492643300903
- FP32 Memory: 3202M
- FP16 Memory: 2272M

Here memory drops significantly and throughput improves markedly as well.
For discussion of why FP16 sometimes brings no speedup in PyTorch, see https://discuss.pytorch.org/t/cnn-fp16-slower-than-fp32-on-tesla-p100/12146
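The pattern above tracks the GPU's compute capability: Tensor Cores arrived with Volta (sm_70, e.g. the V100), while Pascal consumer parts like the 1080Ti (sm_61) lack dedicated FP16 hardware. A small helper (hypothetical, not part of PyTorch) sketching that check; on a live system you would feed it the tuple returned by `torch.cuda.get_device_capability()`:

```python
# Hypothetical helper: does compute capability (major, minor) imply
# Tensor-Core FP16? Volta (sm_70) and newer qualify.
def has_fast_fp16(major: int, minor: int) -> bool:
    return (major, minor) >= (7, 0)

print(has_fast_fp16(6, 1))  # False -> 1080Ti: little or no FP16 speedup expected
print(has_fast_fp16(7, 0))  # True  -> V100: a real FP16 speedup expected
# with torch: has_fast_fp16(*torch.cuda.get_device_capability())
```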
### 2. FP16 speedups in libtorch

We test on a V100 whether FP16 can speed up libtorch inference.
#### 2.1 Download libtorch

```bash
wget https://download.pytorch.org/libtorch/cu101/libtorch-cxx11-abi-shared-with-deps-1.6.0%2Bcu101.zip
unzip libtorch-cxx11-abi-shared-with-deps-1.6.0+cu101.zip
```

Find the matching build on the PyTorch website. libtorch is generally backward compatible: here the libtorch version is 1.6.0, while the PyTorch that exports the trace is 1.1.0.
#### 2.2 Generate trace.pt from PyTorch
```python
import torch
import torchvision.models as models

net = models.resnet18().cuda()
net.eval()
inp = torch.randn(64, 3, 224, 224).cuda()
traced_script_module = torch.jit.trace(net, inp)
traced_script_module.save("RESNET18_trace.pt")
print("trace has been saved!")
```
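Before handing the trace to libtorch, it is worth checking in Python that a saved-and-reloaded trace reproduces the original model's outputs. A CPU-only sketch using a tiny stand-in model (the article traces resnet18 on CUDA instead):

```python
import torch

# tiny stand-in model; the article traces resnet18 on CUDA instead
net = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Linear(8, 2),
).eval()
inp = torch.randn(2, 4)

traced = torch.jit.trace(net, inp)
traced.save("tiny_trace.pt")
reloaded = torch.jit.load("tiny_trace.pt")

with torch.no_grad():
    assert torch.allclose(net(inp), reloaded(inp), atol=1e-6)
print("reloaded trace matches the eager model")
```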
#### 2.3 Call the trace from libtorch
```cpp
#include <torch/script.h>
#include <cuda_runtime.h>
#include <chrono>
#include <iostream>
#include <string>

using namespace std;

int main()
{
    at::globalContext().setBenchmarkCuDNN(true);
    std::string model_file = "/home/zwzhou/Code/test_libtorch/RESNET18_trace.pt";
    torch::Tensor inputs = torch::rand({64, 3, 224, 224}).to(at::kCUDA);
    torch::jit::script::Module net = torch::jit::load(model_file);  // load model
    net.to(at::kCUDA);

    // warm-up run, then synchronize before timing
    auto outputs = net.forward({inputs});
    cudaDeviceSynchronize();

    auto before = std::chrono::system_clock::now();
    for (int i = 0; i < 100; ++i)
    {
        outputs = net.forward({inputs});
    }
    cudaDeviceSynchronize();
    auto after = std::chrono::system_clock::now();
    std::chrono::duration<double> all_time = after - before;
    std::cout << "FP32 iteration per second: " << (100 / all_time.count()) << "\n";

    // convert the model to FP16 and repeat with FP16 inputs
    net.to(torch::kHalf);
    cudaDeviceSynchronize();
    before = std::chrono::system_clock::now();
    for (int i = 0; i < 100; ++i)
    {
        outputs = net.forward({inputs.to(torch::kHalf)});
    }
    cudaDeviceSynchronize();
    after = std::chrono::system_clock::now();
    std::chrono::duration<double> all_time2 = after - before;
    std::cout << "FP16 iteration per second: " << (100 / all_time2.count()) << "\n";
    return 0;
}
```
#### 2.4 Write CMakeLists.txt

```cmake
cmake_minimum_required(VERSION 3.0 FATAL_ERROR)
project(FP_TEST)
set(CMAKE_PREFIX_PATH "/home/zwzhou/packages/libtorch/share/cmake/Torch")
find_package(Torch REQUIRED)
add_executable(mtest ./libtorch_test.cpp)
target_link_libraries(mtest ${TORCH_LIBRARIES})
set_property(TARGET mtest PROPERTY CXX_STANDARD 14)
```

#### 2.5 Build and run

```bash
mkdir -p build
cd build
cmake ..
make
./mtest
```
#### 2.6 Output

- FP32 iteration per second: 60.6978
- FP16 iteration per second: 91.5507
As the numbers show, FP16 also speeds up libtorch inference on the V100. The libtorch throughput is much higher than the PyTorch numbers from section 1, but note that the PyTorch benchmark timed a full forward and backward pass while the libtorch benchmark times the forward pass only, so the two are not directly comparable.
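Comparing like with like, the FP16-over-FP32 speedup within each framework can be computed from the throughputs reported above:

```python
# iterations/sec reported above, both on a V100
pytorch_fp32, pytorch_fp16 = 16.325794715481173, 24.853492643300903   # section 1 (forward + backward)
libtorch_fp32, libtorch_fp16 = 60.6978, 91.5507                       # section 2 (forward only)

print(f"PyTorch  FP16 speedup: {pytorch_fp16 / pytorch_fp32:.2f}x")   # 1.52x
print(f"libtorch FP16 speedup: {libtorch_fp16 / libtorch_fp32:.2f}x") # 1.51x
```

So within each framework, FP16 buys roughly a 1.5x throughput improvement on this GPU.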
#### 2.7 Caveats

CPU tensors do not support most FP16 operations, so after running inference on CUDA, convert the result back to FP32 (e.g. `outputs.toTensor().to(torch::kFloat)`) before or immediately after moving it to the CPU. See https://discuss.pytorch.org/t/runtimeerror-add-cpu-sub-cpu-not-implemented-for-half-when-using-float16-half/66229