A Summary of PyTorch Community Gotchas
The code in this post may not run as-is; it mainly records the ideas.
The difference between model.eval() and with torch.no_grad()
- model.eval() notifies all your layers that you are in eval mode; that way, batchnorm and dropout layers will work in eval mode instead of training mode.
- torch.no_grad() impacts the autograd engine and deactivates it. It reduces memory usage and speeds up computation, but you won't be able to backprop (which you don't want in an eval script).
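A minimal sketch of how the two are usually combined in an evaluation loop (model and loader are assumed to already exist):

import torch

model.eval()                        # switch batchnorm/dropout to eval behavior
with torch.no_grad():               # stop autograd from recording operations
    for inputs, targets in loader:
        outputs = model(inputs)     # no graph is built, so memory stays low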
Converting tensor dtypes
- tensor_one.float(): converts tensor_one to torch.float32
- tensor_one.double(): converts tensor_one to torch.float64
- tensor_one.int(): converts tensor_one to torch.int32
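A quick check of the resulting dtypes:

import torch

t = torch.tensor([1, 2, 3])   # dtype is torch.int64 by default
print(t.float().dtype)        # torch.float32
print(t.double().dtype)       # torch.float64
print(t.int().dtype)          # torch.int32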
Concatenating tensors along a given dimension
third_tensor = torch.cat((first_tensor, second_tensor), 0)
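For example, concatenating two (2, 3) tensors along dim 0 gives a (4, 3) tensor; all dimensions other than the chosen one must match:

import torch

first_tensor = torch.zeros(2, 3)
second_tensor = torch.ones(2, 3)
third_tensor = torch.cat((first_tensor, second_tensor), 0)
print(third_tensor.shape)   # torch.Size([4, 3])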
Serializing a model and loading it to resume training
@Bixqu You can check the ImageNet example, line 139:
save_checkpoint({
    'epoch': epoch + 1,
    'arch': args.arch,
    'state_dict': model.state_dict(),
    'best_prec1': best_prec1,
    'optimizer': optimizer.state_dict(),
}, is_best)
With
def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, 'model_best.pth.tar')
Loading/resuming from the dictionary looks like this:
if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        best_prec1 = checkpoint['best_prec1']
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        print("=> loaded checkpoint '{}' (epoch {})"
              .format(args.resume, checkpoint['epoch']))
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))
Releasing GPU resources
Clear the CUDA cache directly: torch.cuda.empty_cache()
Move the optimizer to the CPU:
https://discuss.pytorch.org/t/how-can-we-release-gpu-memory-cache/14530/27
I was about to ask a question, but I found my issue. Maybe it will help others.
I was on Google Colab and found that I could train my model several times, but on the 3rd or 4th run I'd hit a memory error. Using torch.cuda.empty_cache() between runs did not help. All I could do was restart my kernel.
I had a setup of this sort:
class Fitter:
    def __init__(self, model):
        self.model = model
        self.optimizer = ...  # init optimizer here (e.g. Adam over model.parameters())
The point is that I was carrying the model over in between runs but making a new optimizer (in my case I was making new instances of Fitter). And in my case, the (Adam) optimizer state actually took up more memory than my model!
So to fix it I tried some things.
This did not work:
def wipe_memory(self):  # DOES NOT WORK
    self.optimizer = None
    torch.cuda.empty_cache()
Neither did this:
def wipe_memory(self):  # DOES NOT WORK
    del self.optimizer
    self.optimizer = None
    gc.collect()
    torch.cuda.empty_cache()
This did work!
def wipe_memory(self):  # DOES WORK
    self._optimizer_to(torch.device('cpu'))
    del self.optimizer
    gc.collect()
    torch.cuda.empty_cache()

def _optimizer_to(self, device):
    for param in self.optimizer.state.values():
        # Not sure there are any global tensors in the state dict
        if isinstance(param, torch.Tensor):
            param.data = param.data.to(device)
            if param._grad is not None:
                param._grad.data = param._grad.data.to(device)
        elif isinstance(param, dict):
            for subparam in param.values():
                if isinstance(subparam, torch.Tensor):
                    subparam.data = subparam.data.to(device)
                    if subparam._grad is not None:
                        subparam._grad.data = subparam._grad.data.to(device)
I got that optimizer_to function from here
Swapping axes
a = torch.rand(1, 2, 3, 4)
print(a.transpose(0, 3).transpose(1, 2).size())   # chained transposes: torch.Size([4, 3, 2, 1])
print(a.permute(3, 2, 1, 0).size())               # one permute does the same: torch.Size([4, 3, 2, 1])
Converting Variables to numpy
Variables can't be transformed to numpy, because they're wrappers around tensors that save the operation history, and numpy doesn't have such objects. You can retrieve the tensor held by a Variable using the .data attribute. Then this should work: var.data.numpy().
(Variable(x).data).cpu().numpy()
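Note that Variable was merged into Tensor in PyTorch 0.4; the modern, autograd-safe equivalent is x.detach().cpu().numpy().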
Why transforms.Normalize()?
Normalize does the following for each channel:
image = (image - mean) / std
The parameters mean and std are passed as 0.5, 0.5 in your case. This will normalize the image into the range [-1, 1]. For example, the minimum value 0 will be converted to (0 - 0.5) / 0.5 = -1, and the maximum value 1 will be converted to (1 - 0.5) / 0.5 = 1.
If you would like to get your image back into the [0, 1] range, you can use:
image = ((image * std) + mean)
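A quick round trip to confirm the arithmetic (a sketch; any image tensor in [0, 1] works):

import torch
from torchvision import transforms as T

img = torch.rand(3, 8, 8)                          # fake image in [0, 1]
norm = T.Normalize(mean=[0.5] * 3, std=[0.5] * 3)
normed = norm(img)                                 # now in [-1, 1]
restored = normed * 0.5 + 0.5                      # image = (image * std) + mean
print(torch.allclose(restored, img))               # True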
Denormalize
pip install kornia
(a differentiable computer-vision library that works with PyTorch)
pip install DatasetsHelper==0.0.3
(optional; only used here to fetch the normalization values)
Recommended usage:
mean, std = [torch.tensor(i) for i in NormalizeValues('cifar10')()]
# note: denormalize expects a 4-D batch as input
kornia_img = kornia.enhance.denormalize(t(img).unsqueeze_(0), mean, std)
plt.imshow(kornia_img.squeeze_(0).permute(1, 2, 0))
A simple example
from matplotlib import pyplot as plt
from torchvision.utils import make_grid
from torchvision.transforms import transforms as T
import torch
# NormalizeValues / UnNormalize are assumed to come from the DatasetsHelper
# package installed above; the exact import path may differ.
from DatasetsHelperQ import get_dataset_mean_std
from DatasetsHelperQ import tensor_to_rgb_image_without_normalization
import kornia
from torchvision import datasets
from torch.utils.data import Dataset

train_transform = T.Compose([T.ToTensor()])
train_set = datasets.CIFAR10(root="./cifar10", train=True, download=True, transform=train_transform)
train_iter = iter(train_set)
img, _ = next(train_iter)
# torch.manual_seed(1234)
# img = torch.randn(3, 4, 4).abs_()
# print(img)
# print(img.type)

t = T.Compose([
    # T.ToPILImage(),
    # T.ToTensor(),
    T.Normalize(*NormalizeValues('cifar10')()),
])
# batch = torch.cat([torch.unsqueeze(img, 0), torch.unsqueeze(t(img), 0)], 0)
# new_img = make_grid([img, t(img)])
# print(new_img.shape)
pil = T.ToPILImage()

plt.figure(figsize=(16, 16))

plt.subplot(141)
plt.title("ORIG")
plt.xticks([])
plt.yticks([])
# print(img.permute(1, 2, 0))
plt.imshow(img.permute(1, 2, 0).numpy())
# plt.imshow(pil(img))

plt.subplot(143)
plt.title("kornia_denorm")
plt.xticks([])
plt.yticks([])
# temp = T.ToPILImage(img.double().div_(255))()
# print(img.double().div_(255))
mean, std = [torch.tensor(i) for i in NormalizeValues('cifar10')()]
kornia_img = kornia.enhance.denormalize(t(img).unsqueeze_(0), mean, std)
plt.imshow(kornia_img.squeeze_(0).permute(1, 2, 0))
# plt.imshow(pil(kornia_img.squeeze_(0)))

plt.subplot(142)
plt.title("Norm")
plt.xticks([])
plt.yticks([])
unorm = UnNormalize(*NormalizeValues('cifar10')())
# plt.imshow(t(img).permute(1, 2, 0))
plt.imshow(pil(t(img)))

plt.subplot(144)
plt.title("PIL")
plt.xticks([])
plt.yticks([])
plt.imshow(pil(img))
plt.show()
Which dataset are the pretrained models based on?
ImageNet-12 (the ILSVRC 2012 classification dataset).
Loading part of a pretrained model
After model_dict.update(pretrained_dict), the model_dict may still have keys that pretrained_dict doesn't have, which will cause an error.
Assume the following situation:
pretrained_dict: ['A', 'B', 'C', 'D']
model_dict: ['A', 'B', 'C', 'E']
After pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict} and model_dict.update(pretrained_dict), they are:
pretrained_dict: ['A', 'B', 'C']
model_dict: ['A', 'B', 'C', 'E']
So when performing model.load_state_dict(pretrained_dict), model_dict still has the key E that pretrained_dict doesn't have. So how about using model.load_state_dict(model_dict) instead of model.load_state_dict(pretrained_dict)?
The complete snippet is therefore as follows:
pretrained_dict = ...
model_dict = model.state_dict()
# 1. filter out unnecessary keys
pretrained_dict = {k: v for k, v in pretrained_dict.items() if k in model_dict}
# 2. overwrite entries in the existing state dict
model_dict.update(pretrained_dict)
# 3. load the new state dict
model.load_state_dict(model_dict)
The best way to convert numpy arrays to tensors
When you are on GPU, torch.Tensor() will convert your data type to Float. Actually, torch.Tensor and torch.FloatTensor both do the same thing.
But I think the better way is to use torch.tensor() (note the lowercase 't'). It converts your data to a tensor but retains the data type, which is crucial for some methods. You may know that PyTorch and numpy are interchangeable, so if your array is int, your tensor should be int too, unless you explicitly change the type.
On top of all this, torch.tensor() is the convention, because you can also pass device, dtype, requires_grad, etc.
Note: torch.tensor() allocates new memory and copies the data. If you want to avoid the copy, use torch.as_tensor(numpy_ndarray).
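A small demonstration of the dtype and copy behavior described above:

import numpy as np
import torch

arr = np.array([1, 2, 3])            # int64 array
print(torch.Tensor(arr).dtype)       # torch.float32 -- dtype is changed
print(torch.tensor(arr).dtype)       # torch.int64   -- dtype is kept

shared = torch.as_tensor(arr)        # no copy: shares memory with arr
arr[0] = 99
print(shared[0])                     # tensor(99)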
PIL ↔ Tensor
from PIL import Image
from torchvision import transforms

pil_img = Image.open(img)  # img is a path or file object
print(pil_img.size)
pil_to_tensor = transforms.ToTensor()(pil_img).unsqueeze_(0)
print(pil_to_tensor.shape)
tensor_to_pil = transforms.ToPILImage()(pil_to_tensor.squeeze_(0))
print(tensor_to_pil.size)
Using only specific GPUs
CUDA_VISIBLE_DEVICES=1,2 python myscript.py
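Inside the script the visible devices are renumbered from zero, so cuda:0 then refers to physical GPU 1. A quick check (assuming the two GPUs were exposed as above):

import torch

print(torch.cuda.device_count())   # 2: only the GPUs listed in CUDA_VISIBLE_DEVICES
device = torch.device('cuda:0')    # actually physical GPU 1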
How to extract feature maps
Once you have a trained model, if you want to extract the result of an intermediate layer (say fc7 after the ReLU), you have a couple of possibilities.
You can either reconstruct the classifier once the model has been instantiated, as in the following example:
import torch
import torch.nn as nn
from torchvision import models
model = models.alexnet(pretrained=True)
# remove last fully-connected layer
new_classifier = nn.Sequential(*list(model.classifier.children())[:-1])
model.classifier = new_classifier
Or, if you want to extract other parts of the model instead, you might need to recreate the model structure, reusing parts of the pre-trained model in the new model:
import torch
import torch.nn as nn
from torchvision import models

original_model = models.alexnet(pretrained=True)

class AlexNetConv4(nn.Module):
    def __init__(self):
        super(AlexNetConv4, self).__init__()
        self.features = nn.Sequential(
            # stop at conv4
            *list(original_model.features.children())[:-3]
        )
    def forward(self, x):
        x = self.features(x)
        return x

model = AlexNetConv4()
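A third option, not mentioned in the quoted answer, is a forward hook, which captures an intermediate output without rebuilding the model. A sketch (the layer index 8 is an assumption; inspect model.features to find conv4):

import torch
from torchvision import models

model = models.alexnet(pretrained=True)
features = {}

def hook(module, inputs, output):
    features['conv4'] = output.detach()   # stash the intermediate activation

handle = model.features[8].register_forward_hook(hook)
_ = model(torch.randn(1, 3, 224, 224))
print(features['conv4'].shape)
handle.remove()                           # remove the hook when done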
Training with Half Precision
See the forum thread directly:
https://discuss.pytorch.org/t/training-with-half-precision/11815/2
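The thread predates automatic mixed precision; for reference, a minimal sketch with the newer torch.cuda.amp API (model, criterion, optimizer, and loader are assumed to exist):

import torch

scaler = torch.cuda.amp.GradScaler()        # scales the loss to avoid fp16 underflow
for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():         # run the forward pass in mixed precision
        preds = model(inputs)
        loss = criterion(preds, targets)
    scaler.scale(loss).backward()           # backward on the scaled loss
    scaler.step(optimizer)                  # unscales gradients, then optimizer.step()
    scaler.update()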
PyTorch data augmentation
imgaug
log_softmax or softmax?
The log version is recommended; it is more numerically stable (it uses the log-sum-exp trick internally instead of exponentiating and then taking the log).
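A quick illustration with extreme logits: the naive log-of-softmax hits -inf, while log_softmax stays finite:

import torch
import torch.nn.functional as F

logits = torch.tensor([1000.0, 0.0])
print(torch.log(F.softmax(logits, dim=0)))   # tensor([0., -inf])
print(F.log_softmax(logits, dim=0))          # tensor([0., -1000.])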
How are optimizer.step() and loss.backward() related?
In short: loss.backward() computes the gradients of the loss w.r.t. the parameters and accumulates them into each parameter's .grad, and optimizer.step() then updates the parameters using those gradients. See the optim.zero_grad() section below.
Convert int into one-hot format
https://discuss.pytorch.org/t/convert-int-into-one-hot-format/507/4
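The linked thread uses scatter_; newer PyTorch also ships torch.nn.functional.one_hot. Both approaches, sketched:

import torch
import torch.nn.functional as F

labels = torch.tensor([0, 2, 1])
num_classes = 3

# scatter_-based approach, as in the thread
one_hot = torch.zeros(labels.size(0), num_classes)
one_hot.scatter_(1, labels.unsqueeze(1), 1.0)
print(one_hot)

# built-in helper (returns int64)
print(F.one_hot(labels, num_classes=num_classes))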
nn.ModuleList & nn.Sequential
nn.ModuleList behaves like a Python list for storing nn.Modules; a usage example follows:
class LinearNet(nn.Module):
    def __init__(self, input_size, num_layers, layers_size, output_size):
        super(LinearNet, self).__init__()
        self.num_layers = num_layers
        self.linears = nn.ModuleList([nn.Linear(input_size, layers_size)])
        self.linears.extend([nn.Linear(layers_size, layers_size) for i in range(1, self.num_layers - 1)])
        self.linears.append(nn.Linear(layers_size, output_size))
nn.Sequential builds a neural network by chaining modules in order:
class Flatten(nn.Module):
    def forward(self, x):
        N, C, H, W = x.size()  # read in N, C, H, W
        return x.view(N, -1)

simple_cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=7, stride=2),
    nn.ReLU(inplace=True),
    Flatten(),
    nn.Linear(5408, 10),
)
Not really. Maybe there are some situations where you could use both, but the main idea is the following:
In nn.Sequential, the nn.Module's stored inside are connected in a cascaded way. For instance, in the example that I gave, I define a neural network that receives as input an image with 3 channels and outputs 10 neurons. That network is composed of the following blocks, in the following order: Conv2d -> ReLU -> Linear layer. Moreover, an object of type nn.Sequential has a forward() method, so if I have an input image x I can directly call y = simple_cnn(x) to obtain the scores for x. When you define an nn.Sequential you must be careful to make sure that the output size of a block matches the input size of the following block. Basically, it behaves just like an nn.Module.
On the other hand, nn.ModuleList does not have a forward() method, because it does not define any neural network; that is, there is no connection between the nn.Module's that it stores. You may use it to store nn.Module's, just like you use Python lists to store other types of objects (integers, strings, etc.). The advantage of using nn.ModuleList instead of a conventional Python list to store nn.Module's is that PyTorch is "aware" of the existence of the nn.Module's inside an nn.ModuleList, which is not the case for Python lists. If you want to understand exactly what I mean, just try to redefine my class LinearNet using a Python list instead of an nn.ModuleList and train it. When defining the optimizer for that net, you'll get an error saying that your model has no parameters, because PyTorch does not see the parameters of the layers stored in a Python list. If you use an nn.ModuleList instead, you'll get no error.
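A minimal sketch of that last point, using a hypothetical toy module:

import torch.nn as nn

class BadNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = [nn.Linear(10, 10) for _ in range(3)]   # plain list: invisible to PyTorch

class GoodNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])

print(len(list(BadNet().parameters())))    # 0 -- an optimizer would see no parameters
print(len(list(GoodNet().parameters())))   # 6 -- 3 weights + 3 biases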
optim.zero_grad()
optimizer.zero_grad() zeroes the gradients, i.e. it resets the derivative of the loss with respect to the weights to 0. While learning PyTorch you will notice that nearly every batch goes through steps like these:
optimizer.zero_grad()               ## zero the gradients
preds = model(inputs)               ## inference
loss = criterion(preds, targets)    ## compute the loss
loss.backward()                     ## backpropagate to compute the gradients
optimizer.step()                    ## update the weight parameters
- Because of PyTorch's dynamic computation graph, the gradients are not automatically zeroed when we call loss.backward() and optimizer.step() to perform a gradient-descent update; the two are also independent operations.
- backward(): backpropagates to compute the gradients.
- step(): updates the weight parameters.
The points above show that every step in PyTorch is an independent operation, which is why the gradients must be cleared explicitly: if you do not call optimizer.zero_grad(), backward() will accumulate the gradients, as demonstrated below.
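A tiny demonstration of the accumulation: calling backward() twice without zeroing doubles the gradient:

import torch

w = torch.ones(1, requires_grad=True)
loss = (w * 2).sum()
loss.backward()
print(w.grad)          # tensor([2.])

loss = (w * 2).sum()   # rebuild the graph (it is freed after backward)
loss.backward()
print(w.grad)          # tensor([4.]) -- accumulated, not replaced

w.grad.zero_()         # what optimizer.zero_grad() does for every parameter
print(w.grad)          # tensor([0.])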
tensor.cuda()
This method does not move the original tensor to the GPU in place; it returns a GPU copy:
In [1]: import torch
In [2]: torch.cuda.is_available()
Out[2]: True
In [3]: a = torch.zeros(1,2,3,4)
In [4]: a
Out[4]:
tensor([[[[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]],
[[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]]]])
In [5]: a.cuda()
Out[5]:
tensor([[[[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]],
[[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]]]], device='cuda:0')
In [6]: a
Out[6]:
tensor([[[[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]],
[[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]]]])
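To actually move the tensor, bind the result: a = a.cuda() (or, equivalently, a = a.to('cuda')).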