1. Setting an instantiated model to train/eval mode for training and testing in PyTorch
e.g.:

class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        ...

    def reparameterize(self, mu, logvar):
        if self.training:
            std = logvar.mul(0.5).exp_()
            eps = Variable(std.data.new(std.size()).normal_())
            return eps.mul(std).add_(mu)
        else:
            return mu

model = VAE()
...

def train(epoch):
    model.train()
    ...

def test(epoch):
    model.eval()
eval means evaluation mode and train means training mode. The mode only matters when the model contains Dropout or BatchNorm layers: during training both dropout and BN are active, whereas at test time dropout is usually disabled and BN uses the statistics accumulated during training, so the model should be switched to evaluation mode for testing.
(During training, μ and σ² are computed over the entire mini-batch, i.e. over 64 or 128 or however many examples it contains, but at test time you may need to process examples one at a time, so you estimate μ and σ² from the training set. There are many ways to do this: in theory you could run the whole training set through the final network to obtain μ and σ², but in practice we usually keep an exponentially weighted average, sometimes called a running average, of the μ and σ² values seen during training, and then use those values at test time to make whatever adjustment to the hidden-unit values z you need. In practice this procedure is quite robust no matter how you estimate μ and σ², so I would not worry too much about exactly how you do it; and if you use a deep learning framework, it usually has a default way of estimating μ and σ² that should work just as well.) -- Deeplearning.ai
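A minimal sketch of the difference (using the plain tensor API rather than the old Variable wrapper): the same BatchNorm layer normalizes with batch statistics in train() mode and with the stored running statistics in eval() mode.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)
x = torch.randn(8, 3)

bn.train()                  # training mode: use batch statistics and update the running estimates
_ = bn(x)
print(bn.running_mean)      # exponentially weighted average accumulated so far

bn.eval()                   # evaluation mode: use the stored running_mean / running_var
y = bn(x)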
2. Several ways to initialize weights in PyTorch

class discriminator(nn.Module):
    def __init__(self, dataset='mnist'):
        super(discriminator, self).__init__()
        ...
        self.conv = nn.Sequential(
            nn.Conv2d(self.input_dim, 64, 4, 2, 1),
            nn.ReLU(),
        )
        ...
        self.fc = nn.Sequential(
            nn.Linear(32, 64 * (self.input_height // 2) * (self.input_width // 2)),
            nn.BatchNorm1d(64 * (self.input_height // 2) * (self.input_width // 2)),
            nn.ReLU(),
        )
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, self.output_dim, 4, 2, 1),
            #nn.Sigmoid(),  # EBGAN does not work well when using Sigmoid().
        )
        utils.initialize_weights(self)

    def forward(self, input):
        ...

def initialize_weights(net):
    for m in net.modules():
        if isinstance(m, nn.Conv2d):
            m.weight.data.normal_(0, 0.02)
            m.bias.data.zero_()
        elif isinstance(m, nn.ConvTranspose2d):
            m.weight.data.normal_(0, 0.02)
            m.bias.data.zero_()
        elif isinstance(m, nn.Linear):
            m.weight.data.normal_(0, 0.02)
            m.bias.data.zero_()

def init_weights(m):
    print(m)
    if type(m) == nn.Linear:
        m.weight.data.fill_(1.0)
        print(m.weight)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)

net.apply(weights_init)
class torch.nn.Module is the base class of all neural network modules.
modules() returns an iterator over all modules in the network.
add_module(name, module) adds a child module to the current module; the child can then be accessed as an attribute under the given name.
apply(fn) applies fn recursively to every submodule (as returned by .children()) as well as to the module itself.
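A small sketch of these three methods (the Toy module is made up for illustration):

import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super(Toy, self).__init__()
        self.add_module('fc', nn.Linear(4, 2))   # afterwards accessible as self.fc

def print_name(m):
    print('visited', type(m).__name__)

toy = Toy()
for m in toy.modules():      # yields Toy itself, then its nn.Linear child
    print(type(m).__name__)
toy.apply(print_name)        # print_name is applied to every submodule and to toy itself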
3. Important attributes of Variable in PyTorch
class torch.autograd.Variable
Why introduce Variable? First, why Tensor: forward and backward passes can be written with plain numpy, but numpy does not run on the GPU, whereas PyTorch provides a rich set of Tensor operations and Tensors support the GPU. The next problem is that while a two- or three-layer network's backward pass can be derived by hand, complex networks need automation. That is what autograd is for: the forward pass defines a computation graph whose nodes are Tensors and whose edges are functions. When a Tensor is wrapped in a Variable, the Variable becomes a node in that graph. If x is a Variable, then x.data is the underlying Tensor, x.grad is itself a Variable, and x.grad.data holds the gradient values. In short, PyTorch Variables share the same API as PyTorch Tensors and almost every Tensor operation also works on a Variable; the difference is that a Variable builds a computation graph, which enables automatic differentiation.
The important attributes are as follows:
requires_grad
Specifies whether this variable should be updated; for variables that do not need updating it can be set to False, which speeds up computation.
A Variable does not require gradients by default, i.e. requires_grad defaults to False. If requires_grad is set to True on some node, then every node that depends on it also has requires_grad True.
When a Variable is created by hand, requires_grad defaults to False; for the Variables inside layers defined in a Module, requires_grad defaults to True.
In the computation graph, if any input has requires_grad=True, the output has requires_grad=True as well; the output has requires_grad=False only when all inputs have requires_grad=False.
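A minimal sketch of this propagation rule (using the tensor API of PyTorch ≥ 0.4, where Variable was merged into Tensor; the rule is the same):

import torch

a = torch.ones(2)                    # requires_grad defaults to False
b = torch.ones(2, requires_grad=True)

c = a + b
print(c.requires_grad)               # True: at least one input requires grad
d = a * 2
print(d.requires_grad)               # False: no input requires grad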
volatile
Specifies whether the bookkeeping needed for backpropagation should be kept. Setting it to True means the operation history is not recorded, which speeds up computation. If a variable's volatile is True, its requires_grad is necessarily False.
In short, remember to set requires_grad to True for Variables that need to be updated; for Variables where you only need the result and no update, set volatile to True to speed up computation. Reference: PyTorch 基礎篇
A variable's volatile attribute defaults to False; if any variable has volatile set to True, every node that depends on it also has volatile True. Nodes with volatile=True never have gradients computed, and volatile takes priority over requires_grad.
When any input has volatile=True, the output has volatile=True as well. volatile=True is recommended for inference (testing): setting it on the input is enough to guarantee that inference runs with minimal memory, since no intermediate state is saved. With volatile=True the variable also does not store a creator attribute, which further reduces memory use.
References: 自動求導機制, 『PyTorch』第五彈_深入理解autograd_上:Variable屬性方法,
PyTorch學習系列(十)——如何在訓練時固定一些層?, Pytorch筆記01-Variable和Function(自動梯度計算)
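Note that volatile was removed in PyTorch 0.4; the modern equivalent for inference is torch.no_grad(). A minimal sketch with a placeholder model:

import torch
import torch.nn as nn

net = nn.Linear(10, 2)        # placeholder model
x = torch.randn(4, 10)

with torch.no_grad():         # plays the role volatile=True used to play
    out = net(x)
print(out.requires_grad)      # False: no graph was recorded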
detach()
Returns a new Variable, detached from the current graph. The result will never require gradients. If the input is volatile, the output will be volatile as well.
Looking at GAN code:
Method 1: use detach to cut off the gradient flow (code snippet: DCGAN)

# train with real
netD.zero_grad()
real_cpu, _ = data
batch_size = real_cpu.size(0)
if opt.cuda:
    real_cpu = real_cpu.cuda()
input.resize_as_(real_cpu).copy_(real_cpu)
label.resize_(batch_size).fill_(real_label)
inputv = Variable(input)
labelv = Variable(label)

output = netD(inputv)
errD_real = criterion(output, labelv)
errD_real.backward()
D_x = output.data.mean()

# train with fake
noise.resize_(batch_size, nz, 1, 1).normal_(0, 1)
noisev = Variable(noise)
fake = netG(noisev)
labelv = Variable(label.fill_(fake_label))
output = netD(fake.detach())
errD_fake = criterion(output, labelv)
errD_fake.backward()
D_G_z1 = output.data.mean()
errD = errD_real + errD_fake
optimizerD.step()

############################
# (2) Update G network: maximize log(D(G(z)))
###########################
netG.zero_grad()
labelv = Variable(label.fill_(real_label))  # fake labels are real for generator cost
output = netD(fake)
errG = criterion(output, labelv)
errG.backward()
D_G_z2 = output.data.mean()
optimizerG.step()
When D is updated with fake samples, detach is applied to G's output because we only want to update D's parameters and have no need to keep gradients for G's parameters. Strictly speaking this detach could be omitted: optimizerD.step() only updates D's parameters, and any gradients accumulated in G are never used before netG.zero_grad() is called; the detach simply avoids the wasted backward pass through G.
Then, when G is updated with the fake samples, G's output is not detached, because now it is exactly G's parameters that need updating, so the graph must not be cut there.
References: stackoverflow, github_issue (why is detach necessary)
Method 2: use volatile = True to freeze G's gradients (code snippet: WGAN)

# train with real
real_cpu, _ = data
netD.zero_grad()
batch_size = real_cpu.size(0)

if opt.cuda:
    real_cpu = real_cpu.cuda()
input.resize_as_(real_cpu).copy_(real_cpu)
inputv = Variable(input)

errD_real = netD(inputv)
errD_real.backward(one)

# train with fake
noise.resize_(opt.batchSize, nz, 1, 1).normal_(0, 1)
noisev = Variable(noise, volatile=True)  # totally freeze netG
fake = Variable(netG(noisev).data)
inputv = fake
errD_fake = netD(inputv)
errD_fake.backward(mone)
errD = errD_real - errD_fake
optimizerD.step()

############################
# (2) Update G network
###########################
for p in netD.parameters():
    p.requires_grad = False  # to avoid computation
netG.zero_grad()
# in case our last batch was the tail batch of the dataloader,
# make sure we feed a full batch of noise
noise.resize_(opt.batchSize, nz, 1, 1).normal_(0, 1)
noisev = Variable(noise)
fake = netG(noisev)
errG = netD(fake)
errG.backward(one)
optimizerG.step()
gen_iterations += 1
This freezes G: while D is being updated, backpropagation does not compute gradients for G's parameters at all. The effect is the same as method 1.
e.g.:
Suppose we have two networks A and B related by y = A(x), z = B(y). We want z.backward() to compute gradients for B's parameters but not for A's. We can do either of the following:

# y = A(x), z = B(y): compute gradients for B's parameters but not for A's

# Method 1
y = A(x)
z = B(y.detach())
z.backward()

# Method 2
y = A(x)
y.detach_()
z = B(y)
z.backward()
References: pytorch: Variable detach 與 detach_, Pytorch入門學習(九)---detach()的作用(從GAN代碼分析)
Another github issue demo that briefly illustrates how detach is used:

fc1 = nn.Linear(1, 2)
fc2 = nn.Linear(2, 1)
opt1 = optim.Adam(fc1.parameters(), lr=1e-1)
opt2 = optim.Adam(fc2.parameters(), lr=1e-1)

x = Variable(torch.FloatTensor([5]))
z = fc1(x)
x_p = fc2(z)
cost = (x_p - x) ** 2
'''
print (z)
print (x_p)
print (cost)
'''
opt1.zero_grad()
opt2.zero_grad()
cost.backward()
for n, p in fc1.named_parameters():
    print(n, p.grad.data)
for n, p in fc2.named_parameters():
    print(n, p.grad.data)

opt1.zero_grad()
opt2.zero_grad()
z = fc1(x)
x_p = fc2(z.detach())
cost = (x_p - x) ** 2
cost.backward()
for n, p in fc1.named_parameters():
    print(n, p.grad.data)
for n, p in fc2.named_parameters():
    print(n, p.grad.data)

Output:

weight
 12.0559
 -8.3572
[torch.FloatTensor of size 2x1]
bias
 2.4112
-1.6714
[torch.FloatTensor of size 2]
weight
-33.5588 -19.4411
[torch.FloatTensor of size 1x2]
bias
-9.9940
[torch.FloatTensor of size 1]
================================================
weight
 0
 0
[torch.FloatTensor of size 2x1]
bias
 0
 0
[torch.FloatTensor of size 2]
weight
-33.5588 -19.4411
[torch.FloatTensor of size 1x2]
bias
-9.9940
[torch.FloatTensor of size 1]
pytorch學習經驗(一) detach, requires_grad和volatile
grad_fn
Tracks the function in the graph that produced this variable; each variable's position in the graph can be inferred from its grad_fn attribute.
is_leaf
Checks whether the variable is a leaf node, i.e. one created directly by the user.

import torch as t
from torch.autograd import Variable as V

x = V(t.ones(1))
b = V(t.rand(1), requires_grad=True)
w = V(t.rand(1), requires_grad=True)
y = w * x    # equivalent to y = w.mul(x)
z = y + b    # equivalent to z = y.add(b)

x.requires_grad, b.requires_grad, w.requires_grad
# (False, True, True)

x.is_leaf, w.is_leaf, b.is_leaf
# (True, True, True)

z.grad_fn
# <AddBackward1 object at 0x7f615e1d9cf8>

z.grad_fn.next_functions
# ((<MulBackward1 object at 0x7f615e1d9780>, 0), (<AccumulateGrad object at 0x7f615e1d9390>, 0))
# next_functions stores the inputs of grad_fn as a tuple whose elements are also Functions:
# the first is y, the output of the multiplication (mul), so its backward function y.grad_fn is MulBackward
# the second is b, a leaf node created by the user, whose grad_fn is None
autograd.grad, register_hook
During backpropagation, the gradients of non-leaf nodes are freed as soon as they have been computed. To inspect these gradients there are two options:
- use the autograd.grad function
- use register_hook

x = V(t.ones(3), requires_grad=True)
w = V(t.rand(3), requires_grad=True)
y = x * w      # y depends on w, and w.requires_grad = True
z = y.sum()
x.requires_grad, w.requires_grad, y.requires_grad
# (True, True, True)

# the grad of a non-leaf node is cleared automatically once it has been computed, so y.grad is None
z.backward()
(x.grad, w.grad, y.grad)
# (Variable containing:
#  0.1636
#  0.3563
#  0.6623
# [torch.FloatTensor of size 3], Variable containing:
#  1
#  1
#  1
# [torch.FloatTensor of size 3], None)
Here y.grad is None, because backward() only retains gradients for leaf variables (those with no parents). To obtain the gradient of y, use autograd.grad or register_hook.
Using autograd.grad:

# Method 1: use autograd.grad to obtain the gradient of an intermediate variable
x = V(t.ones(3), requires_grad=True)
w = V(t.rand(3), requires_grad=True)
y = x * w
z = y.sum()
# gradient of z w.r.t. y; this implicitly calls backward()
t.autograd.grad(z, y)
# (Variable containing:
#  1
#  1
#  1
# [torch.FloatTensor of size 3],)
Using a hook:

# Method 2: use a hook
# a hook is a function that takes the gradient as input and should not return anything
def variable_hook(grad):
    print('gradient of y: \r\n', grad)

x = V(t.ones(3), requires_grad=True)
w = V(t.rand(3), requires_grad=True)
y = x * w
# register the hook
hook_handle = y.register_hook(variable_hook)
z = y.sum()
z.backward()
# unless the hook is needed every time, remember to remove it after use
hook_handle.remove()

# output:
# gradient of y:
# Variable containing:
#  1
#  1
#  1
# [torch.FloatTensor of size 3]
Reference: pytorch-book/chapter3-Tensor和autograd/
On freezing gradients and configuring the optimizer:
model = nn.Sequential(*list(model.children()))
for p in model[0].parameters():
    p.requires_grad = False

for i in m.parameters():
    i.requires_grad = False

optimizer = torch.optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
A freeze can also be inserted in the middle of the module definition; only the layers defined before it are frozen, and later layers are unaffected:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        for p in self.parameters():
            p.requires_grad = False
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

count = 0
para_optim = []
for k in model.children():   # or: model.modules()
    count += 1
    # 6 should be changed properly
    if count > 6:
        for param in k.parameters():
            para_optim.append(param)
    else:
        for param in k.parameters():
            param.requires_grad = False

optimizer = optim.RMSprop(para_optim, lr)

################
# another way
for idx, m in enumerate(model.modules()):
    if idx > 50:
        for param in m.parameters():
            param.requires_grad = True
    else:
        for param in m.parameters():
            param.requires_grad = False
Constraining the weights of specific layers:
def clamp_weights(self):
    for module in self.net.modules():
        if hasattr(module, 'weight') and module.kernel_size == (1, 1):
            module.weight.data = torch.clamp(module.weight.data, min=0)
Reference: github
If the error or accuracy looks wrong after loading saved weights, the cause may be that the learning rate had been changed during training but the optimizer state was not saved and restored along with the model. So save the optimizer as well:

save_checkpoint({
    'epoch': epoch + 1,
    'arch': args.arch,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'prec1': prec1,
}, save_name)   # save

if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        print("=> loaded checkpoint '{}' (epoch {})"
              .format(args.resume, checkpoint['epoch']))
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))   # load
Setting the learning rate for specific layers:

params = []
for name, value in model.named_parameters():
    if 'bias' in name:
        if 'fc2' in name:
            params += [{'params': value, 'lr': 20 * args.lr, 'weight_decay': 0}]
        else:
            params += [{'params': value, 'lr': 2 * args.lr, 'weight_decay': 0}]
    else:
        if 'fc2' in name:
            params += [{'params': value, 'lr': 10 * args.lr}]
        else:
            params += [{'params': value, 'lr': 1 * args.lr}]

optimizer = torch.optim.SGD(params, args.lr,
                            momentum=args.momentum,
                            weight_decay=args.weight_decay)
Or:

class net(nn.Module):
    def __init__(self):
        super(net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 1)
        self.conv2 = nn.Conv2d(64, 64, 1)
        self.conv3 = nn.Conv2d(64, 64, 1)
        self.conv4 = nn.Conv2d(64, 64, 1)
        self.conv5 = nn.Conv2d(64, 64, 1)

    def forward(self, x):
        out = self.conv5(self.conv4(self.conv3(self.conv2(self.conv1(x)))))
        return out

If we want conv5 to have a learning rate 100 times that of the other layers, we can do:

net = net()
lr = 0.001
conv5_params = list(map(id, net.conv5.parameters()))
base_params = filter(lambda p: id(p) not in conv5_params, net.parameters())
optimizer = torch.optim.SGD([
    {'params': base_params},
    {'params': net.conv5.parameters(), 'lr': lr * 100},
], lr=lr, momentum=0.9)

For several such layers:

conv5_params = list(map(id, net.conv5.parameters()))
conv4_params = list(map(id, net.conv4.parameters()))
base_params = filter(lambda p: id(p) not in conv5_params + conv4_params, net.parameters())
optimizer = torch.optim.SGD([
    {'params': base_params},
    {'params': net.conv5.parameters(), 'lr': lr * 100},
    {'params': net.conv4.parameters(), 'lr': lr * 100},
], lr=lr, momentum=0.9)
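To double-check what ended up in each group, the optimizer's param_groups can be inspected (a small sanity check, assuming the optimizer built above):

# each param_group carries its own lr (and other hyperparameters)
for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'], len(group['params']))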
Some clean ways to organize a network:

import re
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils import model_zoo
from torchvision import transforms
from PIL import Image

# model_urls maps model names to pretrained-weight URLs (see torchvision.models.densenet)


class _DenseLayer(nn.Sequential):
    """Basic unit of DenseBlock (using bottleneck layer)"""
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
        super(_DenseLayer, self).__init__()
        self.add_module("norm1", nn.BatchNorm2d(num_input_features))
        self.add_module("relu1", nn.ReLU(inplace=True))
        self.add_module("conv1", nn.Conv2d(num_input_features, bn_size*growth_rate,
                                           kernel_size=1, stride=1, bias=False))
        self.add_module("norm2", nn.BatchNorm2d(bn_size*growth_rate))
        self.add_module("relu2", nn.ReLU(inplace=True))
        self.add_module("conv2", nn.Conv2d(bn_size*growth_rate, growth_rate,
                                           kernel_size=3, stride=1, padding=1, bias=False))
        self.drop_rate = drop_rate

    def forward(self, x):
        new_features = super(_DenseLayer, self).forward(x)
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
        return torch.cat([x, new_features], 1)


class _DenseBlock(nn.Sequential):
    """DenseBlock"""
    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(num_input_features + i*growth_rate, growth_rate, bn_size, drop_rate)
            self.add_module("denselayer%d" % (i+1,), layer)


class _Transition(nn.Sequential):
    """Transition layer between two adjacent DenseBlock"""
    def __init__(self, num_input_feature, num_output_features):
        super(_Transition, self).__init__()
        self.add_module("norm", nn.BatchNorm2d(num_input_feature))
        self.add_module("relu", nn.ReLU(inplace=True))
        self.add_module("conv", nn.Conv2d(num_input_feature, num_output_features,
                                          kernel_size=1, stride=1, bias=False))
        self.add_module("pool", nn.AvgPool2d(2, stride=2))


class DenseNet(nn.Module):
    "DenseNet-BC model"
    def __init__(self, growth_rate=32, block_config=(6, 12, 24, 16), num_init_features=64,
                 bn_size=4, compression_rate=0.5, drop_rate=0, num_classes=1000):
        """
        :param growth_rate: (int) number of filters used in DenseLayer, `k` in the paper
        :param block_config: (list of 4 ints) number of layers in each DenseBlock
        :param num_init_features: (int) number of filters in the first Conv2d
        :param bn_size: (int) the factor using in the bottleneck layer
        :param compression_rate: (float) the compression rate used in Transition Layer
        :param drop_rate: (float) the drop rate after each DenseLayer
        :param num_classes: (int) number of classes for classification
        """
        super(DenseNet, self).__init__()
        # first Conv2d
        self.features = nn.Sequential(OrderedDict([
            ("conv0", nn.Conv2d(3, num_init_features, kernel_size=7, stride=2, padding=3, bias=False)),
            ("norm0", nn.BatchNorm2d(num_init_features)),
            ("relu0", nn.ReLU(inplace=True)),
            ("pool0", nn.MaxPool2d(3, stride=2, padding=1))
        ]))

        # DenseBlock
        num_features = num_init_features
        for i, num_layers in enumerate(block_config):
            block = _DenseBlock(num_layers, num_features, bn_size, growth_rate, drop_rate)
            self.features.add_module("denseblock%d" % (i + 1), block)
            num_features += num_layers*growth_rate
            if i != len(block_config) - 1:
                transition = _Transition(num_features, int(num_features*compression_rate))
                self.features.add_module("transition%d" % (i + 1), transition)
                num_features = int(num_features * compression_rate)

        # final bn+ReLU
        self.features.add_module("norm5", nn.BatchNorm2d(num_features))
        self.features.add_module("relu5", nn.ReLU(inplace=True))

        # classification layer
        self.classifier = nn.Linear(num_features, num_classes)

        # params initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.bias, 0)
                nn.init.constant_(m.weight, 1)
            elif isinstance(m, nn.Linear):
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        features = self.features(x)
        out = F.avg_pool2d(features, 7, stride=1).view(features.size(0), -1)
        out = self.classifier(out)
        return out


def densenet121(pretrained=False, **kwargs):
    """DenseNet121"""
    model = DenseNet(num_init_features=64, growth_rate=32, block_config=(6, 12, 24, 16), **kwargs)

    if pretrained:
        # '.'s are no longer allowed in module names, but previous _DenseLayer
        # has keys 'norm.1', 'relu.1', 'conv.1', 'norm.2', 'relu.2', 'conv.2'.
        # They are also in the checkpoints in model_urls. This pattern is used
        # to find such keys.
        pattern = re.compile(
            r'^(.*denselayer\d+\.(?:norm|relu|conv))\.((?:[12])\.(?:weight|bias|running_mean|running_var))$')
        state_dict = model_zoo.load_url(model_urls['densenet121'])
        for key in list(state_dict.keys()):
            res = pattern.match(key)
            if res:
                new_key = res.group(1) + res.group(2)
                state_dict[new_key] = state_dict[key]
                del state_dict[key]
        model.load_state_dict(state_dict)
    return model


densenet = densenet121(pretrained=True)
densenet.eval()

img = Image.open("./images/cat.jpg")
trans_ops = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

images = trans_ops(img).view(-1, 3, 224, 224)
outputs = densenet(images)

_, predictions = outputs.topk(5, dim=1)

labels = list(map(lambda s: s.strip(), open("./data/imagenet/synset_words.txt").readlines()))
for idx in predictions.numpy()[0]:
    print("Predicted labels:", labels[idx])
DenseNet:比ResNet更優的CNN模型

'''
@author: wujiyang
@contact: wujiyang@hust.edu.cn
@file: spherenet.py
@time: 2018/12/26 10:14
@desc: A 64 layer residual network struture used in sphereface and cosface, for fast convergence, I add BN after every Conv layer.
'''
import torch
import torch.nn as nn


class Block(nn.Module):
    def __init__(self, channels):
        super(Block, self).__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, 1, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.prelu1 = nn.PReLU(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.prelu2 = nn.PReLU(channels)

    def forward(self, x):
        short_cut = x
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.prelu1(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.prelu2(x)
        return x + short_cut


class SphereNet(nn.Module):
    def __init__(self, num_layers=20, feature_dim=512):
        super(SphereNet, self).__init__()
        assert num_layers in [20, 64], 'SphereNet num_layers should be 20 or 64'
        if num_layers == 20:
            layers = [1, 2, 4, 1]
        elif num_layers == 64:
            layers = [3, 7, 16, 3]
        else:
            raise ValueError('sphere' + str(num_layers) + " IS NOT SUPPORTED! (sphere20 or sphere64)")
        filter_list = [3, 64, 128, 256, 512]
        block = Block
        self.layer1 = self._make_layer(block, filter_list[0], filter_list[1], layers[0], stride=2)
        self.layer2 = self._make_layer(block, filter_list[1], filter_list[2], layers[1], stride=2)
        self.layer3 = self._make_layer(block, filter_list[2], filter_list[3], layers[2], stride=2)
        self.layer4 = self._make_layer(block, filter_list[3], filter_list[4], layers[3], stride=2)
        self.fc = nn.Linear(512 * 7 * 7, feature_dim)
        self.last_bn = nn.BatchNorm1d(feature_dim)

        for m in self.modules():
            if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
                if m.bias is not None:
                    nn.init.xavier_uniform_(m.weight)
                    nn.init.constant_(m.bias, 0)
                else:
                    nn.init.normal_(m.weight, 0, 0.01)

    def _make_layer(self, block, inplanes, planes, num_units, stride):
        layers = []
        layers.append(nn.Conv2d(inplanes, planes, 3, stride, 1))
        layers.append(nn.BatchNorm2d(planes))
        layers.append(nn.PReLU(planes))
        for i in range(num_units):
            layers.append(block(planes))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        x = self.last_bn(x)
        return x


if __name__ == '__main__':
    input = torch.Tensor(2, 3, 112, 112)
    net = SphereNet(num_layers=64, feature_dim=512)
    out = net(input)
    print(out.shape)
Face_Pytorch/backbone/spherenet.py
Separating out the parameters that belong to BN layers:

def separate_bn_paras(modules):
    if not isinstance(modules, list):
        modules = [*modules.modules()]
    paras_only_bn = []
    paras_wo_bn = []
    for layer in modules:
        if 'model' in str(layer.__class__):
            continue
        if 'container' in str(layer.__class__):
            continue
        else:
            if 'batchnorm' in str(layer.__class__):
                paras_only_bn.extend([*layer.parameters()])
            else:
                paras_wo_bn.extend([*layer.parameters()])
    return paras_only_bn, paras_wo_bn
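A common use of this split is to exempt the BN parameters from weight decay; a sketch, assuming `model` is the network being separated:

paras_only_bn, paras_wo_bn = separate_bn_paras(model)
optimizer = torch.optim.SGD([
    {'params': paras_wo_bn, 'weight_decay': 5e-4},
    {'params': paras_only_bn},               # no weight decay on the BN weight/bias
], lr=0.1, momentum=0.9)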
Putting BN layers into eval mode (this freezes the running mean/variance; note that it does not by itself freeze beta and gamma, i.e. the BN weight and bias, which the unified form further below handles):

def set_bn_eval(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.eval()

model.apply(set_bn_eval)
A unified form that freezes the BN running mean/variance and, optionally, beta and gamma can be written as:

def train(self, mode=True):
    """
    Override the default train() to freeze the BN parameters
    """
    super(MyNet, self).train(mode)
    if self.freeze_bn:
        print("Freezing Mean/Var of BatchNorm2D.")
        if self.freeze_bn_affine:
            print("Freezing Weight/Bias of BatchNorm2D.")
    if self.freeze_bn:
        for m in self.backbone.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()
                if self.freeze_bn_affine:
                    m.weight.requires_grad = False
                    m.bias.requires_grad = False
A strong ReID baseline with many tricks, plus careful design of the learning rate, sampling, and other hyperparameters:
Zhihu: 一個更加強力的ReID Baseline
Code: reid-strong-baseline
Usage of register_buffer(name, tensor) in PyTorch's Module class:
- you want a stateful part of your model that is not a parameter, but you want it in your state_dict
That is, the tensor should be part of the network and be saved in (and loaded from) the state_dict, but should not be a parameter: it gets no gradient and is not updated by backpropagation.
References: Use and Abuse of .register_buffer( ), Pytorch模型中的parameter與buffer
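A minimal sketch (the Scaler module and its buffer name are made up for illustration):

import torch
import torch.nn as nn

class Scaler(nn.Module):
    def __init__(self):
        super(Scaler, self).__init__()
        self.weight = nn.Parameter(torch.ones(1))              # trainable, returned by parameters()
        self.register_buffer('running_scale', torch.ones(1))   # saved in state_dict, but no gradient

    def forward(self, x):
        return x * self.weight * self.running_scale

m = Scaler()
print([n for n, _ in m.named_parameters()])   # ['weight']
print(list(m.state_dict().keys()))            # ['weight', 'running_scale']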
The difference between ‘model.eval()’ and ‘with torch.no_grad()’ at test time:
model.eval()
for batch in val_loader:
    # some code
or:
model.eval()
with torch.no_grad():
    for batch in val_loader:
        # some code
Both work. The latter saves more memory because no intermediate buffers need to be stored. eval() changes the behaviour of BN and dropout, whereas torch.no_grad() belongs to the autograd machinery and prevents gradients from being computed.
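A small check of the difference, using a placeholder model: eval() alone still records the graph, while no_grad() does not.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(0.5))   # placeholder model
x = torch.randn(4, 10)

model.eval()                 # only changes the behaviour of Dropout/BN layers
y1 = model(x)
print(y1.requires_grad)      # True: the graph is still built

with torch.no_grad():        # additionally disables graph construction
    y2 = model(x)
print(y2.requires_grad)      # False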
Taking only the convolutional backbone of a torchvision model (dropping the final avgpool and fc layers):

import torch
import torch.nn as nn
from torchvision import models

res = models.resnet50(False)
f = nn.Sequential(*list(res.children())[:-2])
s = torch.randn(16, 3, 256, 256)
f(s).shape    # torch.Size([16, 2048, 8, 8])
Usage of torch.utils.data.TensorDataset(): reference

class TensorDataset(Dataset):
    """Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Arguments:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """

    def __init__(self, *tensors):
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)

Compared with the old version, the data_tensor and target_tensor arguments are gone; the constructor now takes *tensors, so data and target can be passed in directly. An example:

import torch
import torch.utils.data as Data

BATCH_SIZE = 5

x = torch.linspace(1, 10, 10)
y = torch.linspace(10, 1, 10)

torch_dataset = Data.TensorDataset(x, y)
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=2,
)

for epoch in range(3):
    for step, (batch_x, batch_y) in enumerate(loader):
        print('Epoch: ', epoch, '| Step: ', step,
              '| batch x: ', batch_x.numpy(), '| batch y: ', batch_y.numpy())

Source: https://blog.csdn.net/l770796776/article/details/81261981
Splitting the training data into training and validation sets in PyTorch:
using torch.utils.data.random_split

import torch
from torchvision import datasets, transforms

batch_size = 200

# load the training and test sets
train_db = datasets.MNIST('../data', train=True, download=True,
                          transform=transforms.Compose([
                              transforms.ToTensor(),
                              transforms.Normalize((0.1307,), (0.3081,))
                          ]))

test_db = datasets.MNIST('../data', train=False,
                         transform=transforms.Compose([
                             transforms.ToTensor(),
                             transforms.Normalize((0.1307,), (0.3081,))
                         ]))

print('train:', len(train_db), 'test:', len(test_db))

# split the training set into a training set and a validation set
train_db, val_db = torch.utils.data.random_split(train_db, [50000, 10000])
print('train:', len(train_db), 'validation:', len(val_db))

# training set
train_loader = torch.utils.data.DataLoader(
    train_db,
    batch_size=batch_size, shuffle=True)
# validation set
val_loader = torch.utils.data.DataLoader(
    val_db,
    batch_size=batch_size, shuffle=True)
# test set
test_loader = torch.utils.data.DataLoader(
    test_db,
    batch_size=batch_size, shuffle=True)
ref: How do I split a custom dataset into training and test datasets?
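Besides random_split, a custom Dataset can also be split by explicit indices with torch.utils.data.Subset; a minimal self-contained sketch (the TensorDataset here is just a stand-in for any dataset):

import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# stand-in dataset; any Dataset with __len__ and __getitem__ works the same way
dataset = TensorDataset(torch.arange(100).float(), torch.arange(100))

indices = torch.randperm(len(dataset)).tolist()
split = int(0.9 * len(dataset))
train_set = Subset(dataset, indices[:split])   # first 90% of the shuffled indices
val_set = Subset(dataset, indices[split:])     # remaining 10%

train_loader = DataLoader(train_set, batch_size=10, shuffle=True)
val_loader = DataLoader(val_set, batch_size=10)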
Distributed training in PyTorch with DDP:

Usage with 4 GPUs on a single machine: python3 train.py -g 4
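Since the code block for this section did not survive, here is a minimal sketch of the usual mp.spawn + DistributedDataParallel pattern (the model, data, and argument handling are placeholders, not the code from the references below):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # one process per GPU; NCCL is the usual backend for multi-GPU training
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    device = torch.device('cuda', rank)

    model = DDP(nn.Linear(10, 1).to(device), device_ids=[rank])   # placeholder model
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # a real script would use a DataLoader with a DistributedSampler here
    for _ in range(10):
        x = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()        # DDP all-reduces the gradients across processes
        optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = 4             # e.g. what `-g 4` would set in the referenced script
    mp.spawn(worker, args=(world_size,), nprocs=world_size)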
ref:
1) https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html (corresponding code: https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-mixed.py)
2)https://pytorch.apachecn.org/docs/1.0/dist_tuto.html
3)https://zhuanlan.zhihu.com/p/98535650
4)https://github.com/narumiruna/pytorch-distributed-example/blob/master/mnist/main.py
Horovod PyTorch distributed training
Its speed seems comparable to the native DDP above, so in general the native DDP can be used directly:
The official Horovod MNIST example runs out of the box: https://github.com/horovod/horovod/blob/master/examples/pytorch_mnist.py
Usage:
# run training with 4 GPUs on a single machine
$ horovodrun -np 4 python train.py

# run training with 8 GPUs on two machines (4 GPUs each)
$ horovodrun -np 8 -H hostname1:4,hostname2:4 python train.py
ref:
1)https://horovod.readthedocs.io/en/stable/pytorch.html
2)https://github.com/horovod/horovod
NVIDIA DALI acceleration library