1. Setting an instantiated model to train/eval mode for training and testing in PyTorch
e.g.:

class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()
        ...

    def reparameterize(self, mu, logvar):
        if self.training:
            std = logvar.mul(0.5).exp_()
            eps = Variable(std.data.new(std.size()).normal_())
            return eps.mul(std).add_(mu)
        else:
            return mu

model = VAE()
...

def train(epoch):
    model.train()
    ...

def test(epoch):
    model.eval()
eval means evaluation mode and train means training mode. The mode only matters when the model contains Dropout or BatchNorm layers: during training both dropout and BN are active, whereas at test time dropout is usually disabled and BN uses the statistics accumulated during training, so the model should be switched to evaluation mode for testing.
(During training, μ and σ² are computed over the entire mini-batch, i.e. over 64 or 128 or however many examples it contains, but at test time you may need to process examples one at a time, so you estimate μ and σ² from the training set. There are many ways to do this: in theory you could run the whole training set through the final network to obtain μ and σ², but in practice we usually keep an exponentially weighted average, sometimes called a running average, of the μ and σ² values seen during training, and then use those values at test time to make whatever adjustment to the hidden-unit values z you need. In practice this procedure is quite robust no matter how you estimate μ and σ², so I would not worry too much about exactly how you do it; and if you use a deep learning framework, it usually has a default way of estimating μ and σ² that should work just as well.) -- Deeplearning.ai
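A minimal sketch of the difference (using the plain tensor API rather than the old Variable wrapper): the same BatchNorm layer normalizes with batch statistics in train() mode and with the stored running statistics in eval() mode.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)
x = torch.randn(8, 3)

bn.train()                  # training mode: use batch statistics and update the running estimates
_ = bn(x)
print(bn.running_mean)      # exponentially weighted average accumulated so far

bn.eval()                   # evaluation mode: use the stored running_mean / running_var
y = bn(x)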
2. Several ways to initialize weights in PyTorch

class discriminator(nn.Module):
    def __init__(self, dataset='mnist'):
        super(discriminator, self).__init__()
        ...
        self.conv = nn.Sequential(
            nn.Conv2d(self.input_dim, 64, 4, 2, 1),
            nn.ReLU(),
        )
        ...
        self.fc = nn.Sequential(
            nn.Linear(32, 64 * (self.input_height // 2) * (self.input_width // 2)),
            nn.BatchNorm1d(64 * (self.input_height // 2) * (self.input_width // 2)),
            nn.ReLU(),
        )
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, self.output_dim, 4, 2, 1),
            #nn.Sigmoid(),  # EBGAN does not work well when using Sigmoid().
        )
        utils.initialize_weights(self)

    def forward(self, input):
        ...

def initialize_weights(net):
    for m in net.modules():
        if isinstance(m, nn.Conv2d):
            m.weight.data.normal_(0, 0.02)
            m.bias.data.zero_()
        elif isinstance(m, nn.ConvTranspose2d):
            m.weight.data.normal_(0, 0.02)
            m.bias.data.zero_()
        elif isinstance(m, nn.Linear):
            m.weight.data.normal_(0, 0.02)
            m.bias.data.zero_()

def init_weights(m):
    print(m)
    if type(m) == nn.Linear:
        m.weight.data.fill_(1.0)
        print(m.weight)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)

def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)

net.apply(weights_init)
class torch.nn.Module is the base class of all neural network modules.
modules() returns an iterator over all modules in the network.
add_module(name, module) adds a child module to the current module; the child can then be accessed as an attribute under the given name.
apply(fn) applies fn recursively to every submodule (as returned by .children()) as well as to the module itself.
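A small sketch of these three methods (the Toy module is made up for illustration):

import torch.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super(Toy, self).__init__()
        self.add_module('fc', nn.Linear(4, 2))   # afterwards accessible as self.fc

def print_name(m):
    print('visited', type(m).__name__)

toy = Toy()
for m in toy.modules():      # yields Toy itself, then its nn.Linear child
    print(type(m).__name__)
toy.apply(print_name)        # print_name is applied to every submodule and to toy itself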
3. Important attributes of Variable in PyTorch
class torch.autograd.Variable
Why introduce Variable? First, why Tensor: forward and backward passes can be written with plain numpy, but numpy does not run on the GPU, whereas PyTorch provides a rich set of Tensor operations and Tensors support the GPU. The next problem is that while a two- or three-layer network's backward pass can be derived by hand, complex networks need automation. That is what autograd is for: the forward pass defines a computation graph whose nodes are Tensors and whose edges are functions. When a Tensor is wrapped in a Variable, the Variable becomes a node in that graph. If x is a Variable, then x.data is the underlying Tensor, x.grad is itself a Variable, and x.grad.data holds the gradient values. In short, PyTorch Variables share the same API as PyTorch Tensors and almost every Tensor operation also works on a Variable; the difference is that a Variable builds a computation graph, which enables automatic differentiation.
The important attributes are as follows:
requires_grad
Specifies whether this variable should be updated; for variables that do not need updating it can be set to False, which speeds up computation.
A Variable does not require gradients by default, i.e. requires_grad defaults to False. If requires_grad is set to True on some node, then every node that depends on it also has requires_grad True.
When a Variable is created by hand, requires_grad defaults to False; for the Variables inside layers defined in a Module, requires_grad defaults to True.
In the computation graph, if any input has requires_grad=True, the output has requires_grad=True as well; the output has requires_grad=False only when all inputs have requires_grad=False.
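A minimal sketch of this propagation rule (using the tensor API of PyTorch ≥ 0.4, where Variable was merged into Tensor; the rule is the same):

import torch

a = torch.ones(2)                    # requires_grad defaults to False
b = torch.ones(2, requires_grad=True)

c = a + b
print(c.requires_grad)               # True: at least one input requires grad
d = a * 2
print(d.requires_grad)               # False: no input requires grad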
volatile
Specifies whether the bookkeeping needed for backpropagation should be kept. Setting it to True means the operation history is not recorded, which speeds up computation. If a variable's volatile is True, its requires_grad is necessarily False.
In short, remember to set requires_grad to True for Variables that need to be updated; for Variables where you only need the result and no update, set volatile to True to speed up computation. Reference: PyTorch 基礎篇
A variable's volatile attribute defaults to False; if any variable has volatile set to True, every node that depends on it also has volatile True. Nodes with volatile=True never have gradients computed, and volatile takes priority over requires_grad.
When any input has volatile=True, the output has volatile=True as well. volatile=True is recommended for inference (testing): setting it on the input is enough to guarantee that inference runs with minimal memory, since no intermediate state is saved. With volatile=True the variable also does not store a creator attribute, which further reduces memory use.
References: 自動求導機制, 『PyTorch』第五彈_深入理解autograd_上:Variable屬性方法,
PyTorch學習系列(十)——如何在訓練時固定一些層?, Pytorch筆記01-Variable和Function(自動梯度計算)
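Note that volatile was removed in PyTorch 0.4; the modern equivalent for inference is torch.no_grad(). A minimal sketch with a placeholder model:

import torch
import torch.nn as nn

net = nn.Linear(10, 2)        # placeholder model
x = torch.randn(4, 10)

with torch.no_grad():         # plays the role volatile=True used to play
    out = net(x)
print(out.requires_grad)      # False: no graph was recorded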
detach()
Returns a new Variable, detached from the current graph. The result will never require gradients. If the input is volatile, the output will be volatile as well.
Looking at GAN code:
Method 1: use detach to cut off the gradient flow (code snippet: DCGAN)

# train with real
netD.zero_grad()
real_cpu, _ = data
batch_size = real_cpu.size(0)
if opt.cuda:
    real_cpu = real_cpu.cuda()
input.resize_as_(real_cpu).copy_(real_cpu)
label.resize_(batch_size).fill_(real_label)
inputv = Variable(input)
labelv = Variable(label)

output = netD(inputv)
errD_real = criterion(output, labelv)
errD_real.backward()
D_x = output.data.mean()

# train with fake
noise.resize_(batch_size, nz, 1, 1).normal_(0, 1)
noisev = Variable(noise)
fake = netG(noisev)
labelv = Variable(label.fill_(fake_label))
output = netD(fake.detach())
errD_fake = criterion(output, labelv)
errD_fake.backward()
D_G_z1 = output.data.mean()
errD = errD_real + errD_fake
optimizerD.step()

############################
# (2) Update G network: maximize log(D(G(z)))
###########################
netG.zero_grad()
labelv = Variable(label.fill_(real_label))  # fake labels are real for generator cost
output = netD(fake)
errG = criterion(output, labelv)
errG.backward()
D_G_z2 = output.data.mean()
optimizerG.step()
When D is updated with fake samples, detach is applied to G's output because we only want to update D's parameters and have no need to keep gradients for G's parameters. Strictly speaking this detach could be omitted: optimizerD.step() only updates D's parameters, and any gradients accumulated in G are never used before netG.zero_grad() is called; the detach simply avoids the wasted backward pass through G.
Then, when G is updated with the fake samples, G's output is not detached, because now it is exactly G's parameters that need updating, so the graph must not be cut there.
References: stackoverflow, github_issue (why is detach necessary)
Method 2: use volatile = True to freeze G's gradients (code snippet: WGAN)

# train with real
real_cpu, _ = data
netD.zero_grad()
batch_size = real_cpu.size(0)

if opt.cuda:
    real_cpu = real_cpu.cuda()
input.resize_as_(real_cpu).copy_(real_cpu)
inputv = Variable(input)

errD_real = netD(inputv)
errD_real.backward(one)

# train with fake
noise.resize_(opt.batchSize, nz, 1, 1).normal_(0, 1)
noisev = Variable(noise, volatile=True)  # totally freeze netG
fake = Variable(netG(noisev).data)
inputv = fake
errD_fake = netD(inputv)
errD_fake.backward(mone)
errD = errD_real - errD_fake
optimizerD.step()

############################
# (2) Update G network
###########################
for p in netD.parameters():
    p.requires_grad = False  # to avoid computation
netG.zero_grad()
# in case our last batch was the tail batch of the dataloader,
# make sure we feed a full batch of noise
noise.resize_(opt.batchSize, nz, 1, 1).normal_(0, 1)
noisev = Variable(noise)
fake = netG(noisev)
errG = netD(fake)
errG.backward(one)
optimizerG.step()
gen_iterations += 1
This freezes G: while D is being updated, backpropagation does not compute gradients for G's parameters at all. The effect is the same as method 1.
e.g.:
Suppose we have two networks A and B related by y = A(x), z = B(y). We want z.backward() to compute gradients for B's parameters but not for A's. We can do either of the following:

# y = A(x), z = B(y): compute gradients for B's parameters but not for A's

# Method 1
y = A(x)
z = B(y.detach())
z.backward()

# Method 2
y = A(x)
y.detach_()
z = B(y)
z.backward()
References: pytorch: Variable detach 與 detach_, Pytorch入門學習(九)---detach()的作用(從GAN代碼分析)
Another github issue demo that briefly illustrates how detach is used:

fc1 = nn.Linear(1, 2)
fc2 = nn.Linear(2, 1)
opt1 = optim.Adam(fc1.parameters(), lr=1e-1)
opt2 = optim.Adam(fc2.parameters(), lr=1e-1)

x = Variable(torch.FloatTensor([5]))
z = fc1(x)
x_p = fc2(z)
cost = (x_p - x) ** 2
'''
print (z)
print (x_p)
print (cost)
'''
opt1.zero_grad()
opt2.zero_grad()
cost.backward()
for n, p in fc1.named_parameters():
    print(n, p.grad.data)
for n, p in fc2.named_parameters():
    print(n, p.grad.data)

opt1.zero_grad()
opt2.zero_grad()
z = fc1(x)
x_p = fc2(z.detach())
cost = (x_p - x) ** 2
cost.backward()
for n, p in fc1.named_parameters():
    print(n, p.grad.data)
for n, p in fc2.named_parameters():
    print(n, p.grad.data)

Output:

weight
 12.0559
 -8.3572
[torch.FloatTensor of size 2x1]
bias
 2.4112
-1.6714
[torch.FloatTensor of size 2]
weight
-33.5588 -19.4411
[torch.FloatTensor of size 1x2]
bias
-9.9940
[torch.FloatTensor of size 1]
================================================
weight
 0
 0
[torch.FloatTensor of size 2x1]
bias
 0
 0
[torch.FloatTensor of size 2]
weight
-33.5588 -19.4411
[torch.FloatTensor of size 1x2]
bias
-9.9940
[torch.FloatTensor of size 1]
pytorch學習經驗(一) detach, requires_grad和volatile
grad_fn
Tracks the function in the graph that produced this variable; each variable's position in the graph can be inferred from its grad_fn attribute.
is_leaf
Checks whether the variable is a leaf node, i.e. one created directly by the user.

import torch as t
from torch.autograd import Variable as V

x = V(t.ones(1))
b = V(t.rand(1), requires_grad=True)
w = V(t.rand(1), requires_grad=True)
y = w * x    # equivalent to y = w.mul(x)
z = y + b    # equivalent to z = y.add(b)

x.requires_grad, b.requires_grad, w.requires_grad
# (False, True, True)

x.is_leaf, w.is_leaf, b.is_leaf
# (True, True, True)

z.grad_fn
# <AddBackward1 object at 0x7f615e1d9cf8>

z.grad_fn.next_functions
# ((<MulBackward1 object at 0x7f615e1d9780>, 0), (<AccumulateGrad object at 0x7f615e1d9390>, 0))
# next_functions stores the inputs of grad_fn as a tuple whose elements are also Functions:
# the first is y, the output of the multiplication (mul), so its backward function y.grad_fn is MulBackward
# the second is b, a leaf node created by the user, whose grad_fn is None
autograd.grad, register_hook
During backpropagation, the gradients of non-leaf nodes are freed as soon as they have been computed. To inspect these gradients there are two options:
- use the autograd.grad function
- use register_hook

x = V(t.ones(3), requires_grad=True)
w = V(t.rand(3), requires_grad=True)
y = x * w      # y depends on w, and w.requires_grad = True
z = y.sum()
x.requires_grad, w.requires_grad, y.requires_grad
# (True, True, True)

# the grad of a non-leaf node is cleared automatically once it has been computed, so y.grad is None
z.backward()
(x.grad, w.grad, y.grad)
# (Variable containing:
#  0.1636
#  0.3563
#  0.6623
# [torch.FloatTensor of size 3], Variable containing:
#  1
#  1
#  1
# [torch.FloatTensor of size 3], None)
Here y.grad is None, because backward() only retains gradients for leaf variables (those with no parents). To obtain the gradient of y, use autograd.grad or register_hook.
Using autograd.grad:

# Method 1: use autograd.grad to obtain the gradient of an intermediate variable
x = V(t.ones(3), requires_grad=True)
w = V(t.rand(3), requires_grad=True)
y = x * w
z = y.sum()
# gradient of z w.r.t. y; this implicitly calls backward()
t.autograd.grad(z, y)
# (Variable containing:
#  1
#  1
#  1
# [torch.FloatTensor of size 3],)
Using a hook:

# Method 2: use a hook
# a hook is a function that takes the gradient as input and should not return anything
def variable_hook(grad):
    print('gradient of y: \r\n', grad)

x = V(t.ones(3), requires_grad=True)
w = V(t.rand(3), requires_grad=True)
y = x * w
# register the hook
hook_handle = y.register_hook(variable_hook)
z = y.sum()
z.backward()
# unless the hook is needed every time, remember to remove it after use
hook_handle.remove()

# output:
# gradient of y:
# Variable containing:
#  1
#  1
#  1
# [torch.FloatTensor of size 3]
Reference: pytorch-book/chapter3-Tensor和autograd/
On freezing gradients and configuring the optimizer:
model = nn.Sequential(*list(model.children()))
for p in model[0].parameters():
    p.requires_grad = False

for i in m.parameters():
    i.requires_grad = False

optimizer = torch.optim.SGD(filter(lambda p: p.requires_grad, model.parameters()), lr=1e-3)
A freeze can also be inserted in the middle of the module definition; only the layers defined before it are frozen, and later layers are unaffected:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        for p in self.parameters():
            p.requires_grad = False
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

count = 0
para_optim = []
for k in model.children():   # or: model.modules()
    count += 1
    # 6 should be changed properly
    if count > 6:
        for param in k.parameters():
            para_optim.append(param)
    else:
        for param in k.parameters():
            param.requires_grad = False

optimizer = optim.RMSprop(para_optim, lr)

################
# another way
for idx, m in enumerate(model.modules()):
    if idx > 50:
        for param in m.parameters():
            param.requires_grad = True
    else:
        for param in m.parameters():
            param.requires_grad = False
Constraining the weights of specific layers:
def clamp_weights(self):
    for module in self.net.modules():
        if hasattr(module, 'weight') and module.kernel_size == (1, 1):
            module.weight.data = torch.clamp(module.weight.data, min=0)
Reference: github
If the error or accuracy looks wrong after loading saved weights, the cause may be that the learning rate had been changed during training but the optimizer state was not saved and restored along with the model. So save the optimizer as well:

save_checkpoint({
    'epoch': epoch + 1,
    'arch': args.arch,
    'state_dict': model.state_dict(),
    'optimizer': optimizer.state_dict(),
    'prec1': prec1,
}, save_name)   # save

if args.resume:
    if os.path.isfile(args.resume):
        print("=> loading checkpoint '{}'".format(args.resume))
        checkpoint = torch.load(args.resume)
        args.start_epoch = checkpoint['epoch']
        model.load_state_dict(checkpoint['state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        print("=> loaded checkpoint '{}' (epoch {})"
              .format(args.resume, checkpoint['epoch']))
    else:
        print("=> no checkpoint found at '{}'".format(args.resume))   # load
Setting the learning rate for specific layers:

params = []
for name, value in model.named_parameters():
    if 'bias' in name:
        if 'fc2' in name:
            params += [{'params': value, 'lr': 20 * args.lr, 'weight_decay': 0}]
        else:
            params += [{'params': value, 'lr': 2 * args.lr, 'weight_decay': 0}]
    else:
        if 'fc2' in name:
            params += [{'params': value, 'lr': 10 * args.lr}]
        else:
            params += [{'params': value, 'lr': 1 * args.lr}]

optimizer = torch.optim.SGD(params, args.lr,
                            momentum=args.momentum,
                            weight_decay=args.weight_decay)
Or:

class net(nn.Module):
    def __init__(self):
        super(net, self).__init__()
        self.conv1 = nn.Conv2d(3, 64, 1)
        self.conv2 = nn.Conv2d(64, 64, 1)
        self.conv3 = nn.Conv2d(64, 64, 1)
        self.conv4 = nn.Conv2d(64, 64, 1)
        self.conv5 = nn.Conv2d(64, 64, 1)

    def forward(self, x):
        out = self.conv5(self.conv4(self.conv3(self.conv2(self.conv1(x)))))
        return out

If we want conv5 to have a learning rate 100 times that of the other layers, we can do:

net = net()
lr = 0.001
conv5_params = list(map(id, net.conv5.parameters()))
base_params = filter(lambda p: id(p) not in conv5_params, net.parameters())
optimizer = torch.optim.SGD([
    {'params': base_params},
    {'params': net.conv5.parameters(), 'lr': lr * 100},
], lr=lr, momentum=0.9)

For several such layers:

conv5_params = list(map(id, net.conv5.parameters()))
conv4_params = list(map(id, net.conv4.parameters()))
base_params = filter(lambda p: id(p) not in conv5_params + conv4_params, net.parameters())
optimizer = torch.optim.SGD([
    {'params': base_params},
    {'params': net.conv5.parameters(), 'lr': lr * 100},
    {'params': net.conv4.parameters(), 'lr': lr * 100},
], lr=lr, momentum=0.9)
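To double-check what ended up in each group, the optimizer's param_groups can be inspected (a small sanity check, assuming the optimizer built above):

# each param_group carries its own lr (and other hyperparameters)
for i, group in enumerate(optimizer.param_groups):
    print(i, group['lr'], len(group['params']))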
Some clean ways to organize a network:

import re
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils import model_zoo
from torchvision import transforms
from PIL import Image

# model_urls maps model names to pretrained-weight URLs (see torchvision.models.densenet)


class _DenseLayer(nn.Sequential):
    """Basic unit of DenseBlock (using bottleneck layer)"""
    def __init__(self, num_input_features, growth_rate, bn_size, drop_rate):
        super(_DenseLayer, self).__init__()
        self.add_module("norm1", nn.BatchNorm2d(num_input_features))
        self.add_module("relu1", nn.ReLU(inplace=True))
        self.add_module("conv1", nn.Conv2d(num_input_features, bn_size*growth_rate,
                                           kernel_size=1, stride=1, bias=False))
        self.add_module("norm2", nn.BatchNorm2d(bn_size*growth_rate))
        self.add_module("relu2", nn.ReLU(inplace=True))
        self.add_module("conv2", nn.Conv2d(bn_size*growth_rate, growth_rate,
                                           kernel_size=3, stride=1, padding=1, bias=False))
        self.drop_rate = drop_rate

    def forward(self, x):
        new_features = super(_DenseLayer, self).forward(x)
        if self.drop_rate > 0:
            new_features = F.dropout(new_features, p=self.drop_rate, training=self.training)
        return torch.cat([x, new_features], 1)


class _DenseBlock(nn.Sequential):
    """DenseBlock"""
    def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate):
        super(_DenseBlock, self).__init__()
        for i in range(num_layers):
            layer = _DenseLayer(num_input_features + i*growth_rate, growth_rate, bn_size, drop_rate)
            self.add_module("denselayer%d" % (i+1,), layer)


class _Transition(nn.Sequential):
    """Transition layer between two adjacent DenseBlock"""
    def __init__(self, num_input_feature, num_output_features):
        super(_Transition, self).__init__()
        self.add_module("norm", nn.BatchNorm2d(num_input_feature))
        self.add_module("relu", nn.ReLU(inplace=True))
        self.add_module("conv", nn.Conv2d(num_input_feature, num_output_features,
                                          kernel_size=1, stride=1, bias=False))
        self.add_module("pool", nn.AvgPool2d(2, stride=2))


class DenseNet(nn.Module):
    "DenseNet-BC model"
    def __init__(self, growth_rate=32, block_config=(6, 12, 24, 16), num_init_features=64,
                 bn_size=4, compression_rate=0.5, drop_rate=0, num_classes=1000):
        """
        :param growth_rate: (int) number of filters used in DenseLayer, `k` in the paper
        :param block_config: (list of 4 ints) number of layers in each DenseBlock
        :param num_init_features: (int) number of filters in the first Conv2d
        :param bn_size: (int) the factor using in the bottleneck layer
        :param compression_rate: (float) the compression rate used in Transition Layer
        :param drop_rate: (float) the drop rate after each DenseLayer
        :param num_classes: (int) number of classes for classification
        """
        super(DenseNet, self).__init__()
        # first Conv2d
        self.features = nn.Sequential(OrderedDict([
            ("conv0", nn.Conv2d(3, num_init_features, kernel_size=7, stride=2, padding=3, bias=False)),
            ("norm0", nn.BatchNorm2d(num_init_features)),
            ("relu0", nn.ReLU(inplace=True)),
            ("pool0", nn.MaxPool2d(3, stride=2, padding=1))
        ]))

        # DenseBlock
        num_features = num_init_features
        for i, num_layers in enumerate(block_config):
            block = _DenseBlock(num_layers, num_features, bn_size, growth_rate, drop_rate)
            self.features.add_module("denseblock%d" % (i + 1), block)
            num_features += num_layers*growth_rate
            if i != len(block_config) - 1:
                transition = _Transition(num_features, int(num_features*compression_rate))
                self.features.add_module("transition%d" % (i + 1), transition)
                num_features = int(num_features * compression_rate)

        # final bn+ReLU
        self.features.add_module("norm5", nn.BatchNorm2d(num_features))
        self.features.add_module("relu5", nn.ReLU(inplace=True))

        # classification layer
        self.classifier = nn.Linear(num_features, num_classes)

        # params initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.bias, 0)
                nn.init.constant_(m.weight, 1)
            elif isinstance(m, nn.Linear):
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        features = self.features(x)
        out = F.avg_pool2d(features, 7, stride=1).view(features.size(0), -1)
        out = self.classifier(out)
        return out


def densenet121(pretrained=False, **kwargs):
    """DenseNet121"""
    model = DenseNet(num_init_features=64, growth_rate=32, block_config=(6, 12, 24, 16), **kwargs)

    if pretrained:
        # '.'s are no longer allowed in module names, but previous _DenseLayer
        # has keys 'norm.1', 'relu.1', 'conv.1', 'norm.2', 'relu.2', 'conv.2'.
        # They are also in the checkpoints in model_urls. This pattern is used
        # to find such keys.
        pattern = re.compile(
            r'^(.*denselayer\d+\.(?:norm|relu|conv))\.((?:[12])\.(?:weight|bias|running_mean|running_var))$')
        state_dict = model_zoo.load_url(model_urls['densenet121'])
        for key in list(state_dict.keys()):
            res = pattern.match(key)
            if res:
                new_key = res.group(1) + res.group(2)
                state_dict[new_key] = state_dict[key]
                del state_dict[key]
        model.load_state_dict(state_dict)
    return model


densenet = densenet121(pretrained=True)
densenet.eval()

img = Image.open("./images/cat.jpg")
trans_ops = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

images = trans_ops(img).view(-1, 3, 224, 224)
outputs = densenet(images)

_, predictions = outputs.topk(5, dim=1)

labels = list(map(lambda s: s.strip(), open("./data/imagenet/synset_words.txt").readlines()))
for idx in predictions.numpy()[0]:
    print("Predicted labels:", labels[idx])
DenseNet:比ResNet更優的CNN模型

'''
@author: wujiyang
@contact: wujiyang@hust.edu.cn
@file: spherenet.py
@time: 2018/12/26 10:14
@desc: A 64 layer residual network struture used in sphereface and cosface, for fast convergence, I add BN after every Conv layer.
'''
import torch
import torch.nn as nn


class Block(nn.Module):
    def __init__(self, channels):
        super(Block, self).__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, 1, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.prelu1 = nn.PReLU(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.prelu2 = nn.PReLU(channels)

    def forward(self, x):
        short_cut = x
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.prelu1(x)
        x = self.conv2(x)
        x = self.bn2(x)
        x = self.prelu2(x)
        return x + short_cut


class SphereNet(nn.Module):
    def __init__(self, num_layers=20, feature_dim=512):
        super(SphereNet, self).__init__()
        assert num_layers in [20, 64], 'SphereNet num_layers should be 20 or 64'
        if num_layers == 20:
            layers = [1, 2, 4, 1]
        elif num_layers == 64:
            layers = [3, 7, 16, 3]
        else:
            raise ValueError('sphere' + str(num_layers) + " IS NOT SUPPORTED! (sphere20 or sphere64)")
        filter_list = [3, 64, 128, 256, 512]
        block = Block
        self.layer1 = self._make_layer(block, filter_list[0], filter_list[1], layers[0], stride=2)
        self.layer2 = self._make_layer(block, filter_list[1], filter_list[2], layers[1], stride=2)
        self.layer3 = self._make_layer(block, filter_list[2], filter_list[3], layers[2], stride=2)
        self.layer4 = self._make_layer(block, filter_list[3], filter_list[4], layers[3], stride=2)
        self.fc = nn.Linear(512 * 7 * 7, feature_dim)
        self.last_bn = nn.BatchNorm1d(feature_dim)

        for m in self.modules():
            if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
                if m.bias is not None:
                    nn.init.xavier_uniform_(m.weight)
                    nn.init.constant_(m.bias, 0)
                else:
                    nn.init.normal_(m.weight, 0, 0.01)

    def _make_layer(self, block, inplanes, planes, num_units, stride):
        layers = []
        layers.append(nn.Conv2d(inplanes, planes, 3, stride, 1))
        layers.append(nn.BatchNorm2d(planes))
        layers.append(nn.PReLU(planes))
        for i in range(num_units):
            layers.append(block(planes))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        x = self.last_bn(x)
        return x


if __name__ == '__main__':
    input = torch.Tensor(2, 3, 112, 112)
    net = SphereNet(num_layers=64, feature_dim=512)
    out = net(input)
    print(out.shape)
Face_Pytorch/backbone/spherenet.py
Separating out the parameters that belong to BN layers:

def separate_bn_paras(modules):
    if not isinstance(modules, list):
        modules = [*modules.modules()]
    paras_only_bn = []
    paras_wo_bn = []
    for layer in modules:
        if 'model' in str(layer.__class__):
            continue
        if 'container' in str(layer.__class__):
            continue
        else:
            if 'batchnorm' in str(layer.__class__):
                paras_only_bn.extend([*layer.parameters()])
            else:
                paras_wo_bn.extend([*layer.parameters()])
    return paras_only_bn, paras_wo_bn
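A common use of this split is to exempt the BN parameters from weight decay; a sketch, assuming `model` is the network being separated:

paras_only_bn, paras_wo_bn = separate_bn_paras(model)
optimizer = torch.optim.SGD([
    {'params': paras_wo_bn, 'weight_decay': 5e-4},
    {'params': paras_only_bn},               # no weight decay on the BN weight/bias
], lr=0.1, momentum=0.9)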
Putting BN layers into eval mode (this freezes the running mean/variance; note that it does not by itself freeze beta and gamma, i.e. the BN weight and bias, which the unified form further below handles):

def set_bn_eval(m):
    classname = m.__class__.__name__
    if classname.find('BatchNorm') != -1:
        m.eval()

model.apply(set_bn_eval)
A unified form that freezes the BN running mean/variance and, optionally, beta and gamma can be written as:

def train(self, mode=True):
    """
    Override the default train() to freeze the BN parameters
    """
    super(MyNet, self).train(mode)
    if self.freeze_bn:
        print("Freezing Mean/Var of BatchNorm2D.")
        if self.freeze_bn_affine:
            print("Freezing Weight/Bias of BatchNorm2D.")
    if self.freeze_bn:
        for m in self.backbone.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()
                if self.freeze_bn_affine:
                    m.weight.requires_grad = False
                    m.bias.requires_grad = False
A strong ReID baseline with many tricks, plus careful design of the learning rate, sampling, and other hyperparameters:
Zhihu: 一個更加強力的ReID Baseline
Code: reid-strong-baseline
Usage of register_buffer(name, tensor) in PyTorch's Module class:
- you want a stateful part of your model that is not a parameter, but you want it in your state_dict
That is, the tensor should be part of the network and be saved in (and loaded from) the state_dict, but should not be a parameter: it gets no gradient and is not updated by backpropagation.
References: Use and Abuse of .register_buffer( ), Pytorch模型中的parameter與buffer
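A minimal sketch (the Scaler module and its buffer name are made up for illustration):

import torch
import torch.nn as nn

class Scaler(nn.Module):
    def __init__(self):
        super(Scaler, self).__init__()
        self.weight = nn.Parameter(torch.ones(1))              # trainable, returned by parameters()
        self.register_buffer('running_scale', torch.ones(1))   # saved in state_dict, but no gradient

    def forward(self, x):
        return x * self.weight * self.running_scale

m = Scaler()
print([n for n, _ in m.named_parameters()])   # ['weight']
print(list(m.state_dict().keys()))            # ['weight', 'running_scale']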
The difference between ‘model.eval()’ and ‘with torch.no_grad()’ at test time:
model.eval()
for batch in val_loader:
    # some code
or:
model.eval()
with torch.no_grad():
    for batch in val_loader:
        # some code
Both work. The latter saves more memory because no intermediate buffers need to be stored. eval() changes the behaviour of BN and dropout, whereas torch.no_grad() belongs to the autograd machinery and prevents gradients from being computed.
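A small check of the difference, using a placeholder model: eval() alone still records the graph, while no_grad() does not.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(0.5))   # placeholder model
x = torch.randn(4, 10)

model.eval()                 # only changes the behaviour of Dropout/BN layers
y1 = model(x)
print(y1.requires_grad)      # True: the graph is still built

with torch.no_grad():        # additionally disables graph construction
    y2 = model(x)
print(y2.requires_grad)      # False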
Taking only the convolutional backbone of a torchvision model (dropping the final avgpool and fc layers):

import torch
import torch.nn as nn
from torchvision import models

res = models.resnet50(False)
f = nn.Sequential(*list(res.children())[:-2])
s = torch.randn(16, 3, 256, 256)
f(s).shape    # torch.Size([16, 2048, 8, 8])
Usage of torch.utils.data.TensorDataset(): reference

class TensorDataset(Dataset):
    """Dataset wrapping tensors.

    Each sample will be retrieved by indexing tensors along the first dimension.

    Arguments:
        *tensors (Tensor): tensors that have the same size of the first dimension.
    """

    def __init__(self, *tensors):
        assert all(tensors[0].size(0) == tensor.size(0) for tensor in tensors)
        self.tensors = tensors

    def __getitem__(self, index):
        return tuple(tensor[index] for tensor in self.tensors)

    def __len__(self):
        return self.tensors[0].size(0)

Compared with the old version, the data_tensor and target_tensor arguments are gone; the constructor now takes *tensors, so data and target can be passed in directly. An example:

import torch
import torch.utils.data as Data

BATCH_SIZE = 5

x = torch.linspace(1, 10, 10)
y = torch.linspace(10, 1, 10)

torch_dataset = Data.TensorDataset(x, y)
loader = Data.DataLoader(
    dataset=torch_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=2,
)

for epoch in range(3):
    for step, (batch_x, batch_y) in enumerate(loader):
        print('Epoch: ', epoch, '| Step: ', step,
              '| batch x: ', batch_x.numpy(), '| batch y: ', batch_y.numpy())

Source: https://blog.csdn.net/l770796776/article/details/81261981
Splitting the training data into training and validation sets in PyTorch:
using torch.utils.data.random_split

import torch
from torchvision import datasets, transforms

batch_size = 200

# load the training and test sets
train_db = datasets.MNIST('../data', train=True, download=True,
                          transform=transforms.Compose([
                              transforms.ToTensor(),
                              transforms.Normalize((0.1307,), (0.3081,))
                          ]))

test_db = datasets.MNIST('../data', train=False,
                         transform=transforms.Compose([
                             transforms.ToTensor(),
                             transforms.Normalize((0.1307,), (0.3081,))
                         ]))

print('train:', len(train_db), 'test:', len(test_db))

# split the training set into a training set and a validation set
train_db, val_db = torch.utils.data.random_split(train_db, [50000, 10000])
print('train:', len(train_db), 'validation:', len(val_db))

# training set
train_loader = torch.utils.data.DataLoader(
    train_db,
    batch_size=batch_size, shuffle=True)
# validation set
val_loader = torch.utils.data.DataLoader(
    val_db,
    batch_size=batch_size, shuffle=True)
# test set
test_loader = torch.utils.data.DataLoader(
    test_db,
    batch_size=batch_size, shuffle=True)
ref: How do I split a custom dataset into training and test datasets?
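Besides random_split, a custom Dataset can also be split by explicit indices with torch.utils.data.Subset; a minimal self-contained sketch (the TensorDataset here is just a stand-in for any dataset):

import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# stand-in dataset; any Dataset with __len__ and __getitem__ works the same way
dataset = TensorDataset(torch.arange(100).float(), torch.arange(100))

indices = torch.randperm(len(dataset)).tolist()
split = int(0.9 * len(dataset))
train_set = Subset(dataset, indices[:split])   # first 90% of the shuffled indices
val_set = Subset(dataset, indices[split:])     # remaining 10%

train_loader = DataLoader(train_set, batch_size=10, shuffle=True)
val_loader = DataLoader(val_set, batch_size=10)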
Distributed training in PyTorch with DDP:

Usage with 4 GPUs on a single machine: python3 train.py -g 4
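Since the code block for this section did not survive, here is a minimal sketch of the usual mp.spawn + DistributedDataParallel pattern (the model, data, and argument handling are placeholders, not the code from the references below):

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def worker(rank, world_size):
    # one process per GPU; NCCL is the usual backend for multi-GPU training
    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    device = torch.device('cuda', rank)

    model = DDP(nn.Linear(10, 1).to(device), device_ids=[rank])   # placeholder model
    criterion = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    # a real script would use a DataLoader with a DistributedSampler here
    for _ in range(10):
        x = torch.randn(32, 10, device=device)
        y = torch.randn(32, 1, device=device)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()        # DDP all-reduces the gradients across processes
        optimizer.step()

    dist.destroy_process_group()


if __name__ == '__main__':
    world_size = 4             # e.g. what `-g 4` would set in the referenced script
    mp.spawn(worker, args=(world_size,), nprocs=world_size)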
ref:
1) https://yangkky.github.io/2019/07/08/distributed-pytorch-tutorial.html (corresponding code: https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-mixed.py)
2)https://pytorch.apachecn.org/docs/1.0/dist_tuto.html
3)https://zhuanlan.zhihu.com/p/98535650
4)https://github.com/narumiruna/pytorch-distributed-example/blob/master/mnist/main.py
Horovod PyTorch distributed training
Its speed seems comparable to the native DDP above, so in general the native DDP can be used directly:
The official Horovod MNIST example runs out of the box: https://github.com/horovod/horovod/blob/master/examples/pytorch_mnist.py
Usage:
# run training with 4 GPUs on a single machine
$ horovodrun -np 4 python train.py

# run training with 8 GPUs on two machines (4 GPUs each)
$ horovodrun -np 8 -H hostname1:4,hostname2:4 python train.py
ref:
1)https://horovod.readthedocs.io/en/stable/pytorch.html
2)https://github.com/horovod/horovod
NVIDIA DALI acceleration library