PyTorch source code analysis: the Python layer

Reposted article, 2019-08-30 11:30

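I have been trying out PyTorch. Compared with other deep learning frameworks it comes across as concise and easy to follow. I spent some time reading part of its source, working from a simple example with a few questions in mind. The implementation of the C extension libraries is not covered.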

A simple example

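The example implements a single-layer softmax binary classifier: the input has 4 features, the output has 2 units, and a softmax turns the output into class probabilities. The code defines the network, sets up SGD as the optimizer, and runs one iteration: it randomly initializes three samples with four features each and targets 1, 0, 1, runs the forward pass, computes the loss with cross-entropy, runs the backward pass, and finally lets the optimizer update the weights.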

import torch
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.linear = nn.Linear(4, 2)  # 4 input features, 2 classes
    def forward(self, input):
        out = F.softmax(self.linear(input))
        return out
net = Net()
sgd = torch.optim.SGD(net.parameters(), lr=0.001)
for epoch in range(1):
    # three random samples with four features each, targets 1, 0, 1
    features = torch.autograd.Variable(torch.randn(3, 4), requires_grad=True)
    target = torch.autograd.Variable(torch.LongTensor([1, 0, 1]))
    sgd.zero_grad()                      # clear gradients left from a previous iteration
    out = net(features)                  # forward pass
    # note: F.cross_entropy applies log-softmax itself, so softmax is applied twice here
    loss = F.cross_entropy(out, target)
    loss.backward()                      # backward pass
    sgd.step()                           # parameter update

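With this example in hand, I read the source with the following questions in mind:

The main PyTorch concepts: Tensor, autograd, Variable, Function, Parameter, Module (Layers), Optimizer;
How a custom Module organizes its network structure and parameters;
How the forward and backward passes are carried out;
How the optimizer classes are implemented, and how they connect to a custom Module to update its parameters.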

Main PyTorch concepts

The official site has a very approachable tutorial on the main concepts, Deep Learning with PyTorch: A 60 Minute Blitz. Here is a brief summary:

Tensor

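A Tensor is similar to a numpy ndarray, with the added ability to run computations on the GPU, and is implemented in a C extension module. For example, torch.randn(3, 4) above returns a 3x4 Tensor. As in numpy, there is a family of operations, for example: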

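x = torch.rand(5, 3)
y = torch.rand(5, 3)
print(x + y)            # operator form
print(torch.add(x, y))  # function form, same result
print(x.add_(y))        # in-place: the trailing underscore means x itself is modified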

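Variable and autograd

A Variable wraps a Tensor and exposes almost all of the operations a Tensor supports. It is mainly used for automatic differentiation (autograd). The Variable class inherits from _C._VariableBase, which is defined in the C extension.
Variable is the unit of computation in autograd; Functions link Variables into an expression (the computation graph). A Variable has these attributes:

data is the tensor it wraps
grad is the gradient computed for it by differentiation
creator is the Function that created the Variable; in the implementation, the grad_fn attribute points to that Function.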
For example:
import torch
from torch.autograd import Variable
x = Variable(torch.ones(1, 1), requires_grad=True)
y = x + 2
print(y.grad_fn)
print("before backward:", x.grad)
y.backward()
print("after backward:", x.grad)
Output:
<torch.autograd.function.AddConstantBackward object at 0x7faa6f3bdd68>
before backward: None
after backward: Variable containing:
1
[torch.FloatTensor of size 1x1]
Calling y.backward() differentiates y with respect to every Variable with requires_grad=True in the computation graph that produced y (here, x). In this example dy/dx = 1, as expected.
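
The relationship between a Variable, its grad_fn and backward() can be sketched without PyTorch. The toy classes below are illustrative only: Var and AddConstant are hypothetical names, not PyTorch API, and the real implementation lives in the C extension. The sketch only mirrors the x + 2 example for a single float.

```python
class AddConstant:
    """Toy Function (hypothetical): remembers its input so backward can route the gradient."""

    def __init__(self, inp, constant):
        self.inp = inp
        self.constant = constant

    def backward(self, grad_output):
        # d(x + c)/dx = 1, so the incoming gradient passes through unchanged.
        self.inp.accumulate(grad_output)


class Var:
    """Toy Variable (hypothetical): wraps a float and remembers the Function that created it."""

    def __init__(self, data, requires_grad=False, grad_fn=None):
        self.data = data
        self.requires_grad = requires_grad
        self.grad_fn = grad_fn   # None for leaf variables created by the user
        self.grad = None

    def __add__(self, constant):
        fn = AddConstant(self, constant)
        return Var(self.data + constant, self.requires_grad, grad_fn=fn)

    def accumulate(self, grad):
        if self.grad_fn is not None:
            # Non-leaf: keep walking back through the graph.
            self.grad_fn.backward(grad)
        elif self.requires_grad:
            # Leaf: store the gradient, adding to any value already present.
            self.grad = grad if self.grad is None else self.grad + grad

    def backward(self):
        # Seed the walk with d(self)/d(self) = 1.
        self.accumulate(1.0)


x = Var(1.0, requires_grad=True)
y = x + 2
print(type(y.grad_fn).__name__)      # AddConstant
print("before backward:", x.grad)    # None
y.backward()
print("after backward:", x.grad)     # 1.0
```

The design point carried over from the real library is that the result of an operation holds a reference to the Function that produced it, and that Function holds references to its inputs, so backward() only has to follow those references.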

Parameter

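Parameter is a subclass of Variable. It comes up again later; for now, two differences matter:

When assigned as an attribute of a Module, it is automatically added to that Module's parameter list;
It cannot be volatile, and it requires gradient by default.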

Module

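Module is the base class of all neural network modules. In the opening example, Net inherits from it, declares the modules that make up the network in __init__, and overrides forward to compute the output for a given input. The loss computation and backward pass build on that output.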

Optimizer

Optimizer is the base class of all optimization algorithms (SGD, Adam, ...). Its __init__ receives the network's parameters, and each subclass implements the step method to update them.

Custom Modules

This section explains how a custom Module keeps track of the submodules defined in its constructor, and how its parameters are stored. For example:

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.linear = nn.Linear(4, 2)
    def forward(self, input):
        out = F.softmax(self.linear(input))
        return out

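Start with the constructor. Module's constructor initializes the module's basic attributes. The two of interest here are _parameters and _modules, both initialized to OrderedDict(), the ordered dictionary type from Python's collections module. _parameters holds the parameters registered directly on this Module, and _modules holds its child Modules.
module.py: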

class Module(object):
    def __init__(self):
        self._parameters = OrderedDict()
        self._modules = OrderedDict()
        ...

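Next, how does the statement self.linear = nn.Linear(4, 2) in the custom Net class connect to _modules and _parameters? In other words, how do self.linear and its parameters end up in those dictionaries? The answer is in Module's __setattr__ method, the Python special method that is invoked whenever an attribute is assigned on an instance.
module.py: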

def __setattr__(self, name, value):
    def remove_from(*dicts):
        for d in dicts:
            if name in d:
                del d[name]
    params = self.__dict__.get('_parameters')
    if isinstance(value, Parameter):  # ----------- <1>
        if params is None:
            raise AttributeError(
                "cannot assign parameters before Module.__init__() call")
        remove_from(self.__dict__, self._buffers, self._modules)
        self.register_parameter(name, value)
    elif params is not None and name in params:
        if value is not None:
            raise TypeError("cannot assign '{}' as parameter '{}' "
                            "(torch.nn.Parameter or None expected)"
                            .format(torch.typename(value), name))
        self.register_parameter(name, value)
    else:
        modules = self.__dict__.get('_modules')
        if isinstance(value, Module):  # ----------- <2>
            if modules is None:
                raise AttributeError(
                    "cannot assign module before Module.__init__() call")
            remove_from(self.__dict__, self._parameters, self._buffers)
            modules[name] = value
        elif modules is not None and name in modules:
            if value is not None:
                raise TypeError("cannot assign '{}' as child module '{}' "
                                "(torch.nn.Module or None expected)"
                                .format(torch.typename(value), name))
            modules[name] = value
        ......

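When self.linear = nn.Linear(4, 2) runs, the inherited __setattr__ is called with name set to "linear" and value set to the nn.Linear(4, 2) instance. The built-in Linear class is also a subclass of Module, so the check at <2> is true, modules[name] = value executes, and linear is added to the _modules dictionary.
The parameters of the custom Net class are simply the parameters of its Linear submodule, so the next step is Linear's implementation:
linear.py: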

class Linear(Module):
    def __init__(self, in_features, out_features, bias=True):
        super(Linear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.weight = Parameter(torch.Tensor(out_features, in_features))
        if bias:
            self.bias = Parameter(torch.Tensor(out_features))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()
    def reset_parameters(self):
        stdv = 1. / math.sqrt(self.weight.size(1))
        self.weight.data.uniform_(-stdv, stdv)
        if self.bias is not None:
            self.bias.data.uniform_(-stdv, stdv)
    def forward(self, input):
        return F.linear(input, self.weight, self.bias)

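Linear also inherits from Module. Its __init__ takes the input and output dimensions and a flag for whether to use a bias. The assignment self.weight = Parameter(torch.Tensor(out_features, in_features)) again goes through Module's __setattr__, with name "weight" and a Parameter as value. This time the check at <1> is true, so self.register_parameter(name, value) is called, which validates the parameter and stores it in the self._parameters dictionary.

Linear initializes its weights in the reset_parameters method: weight and bias are drawn uniformly from [-stdv, stdv], where stdv = 1 / sqrt(in_features).

The conclusion: a custom Module organizes its child Modules as a tree, and each Module stores its children and its own parameters in dictionaries.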
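
The registration mechanism can be reproduced in a few lines of plain Python. The sketch below is a simplified stand-in, not the library's code: the Parameter, Module, Linear and Net classes here are toy versions written for illustration, buffers and validation are left out, and named_parameters is a minimal recursive walk written under the assumption that a module's full parameter set is its own entries plus those of its children.

```python
from collections import OrderedDict


class Parameter:
    """Toy stand-in for torch.nn.Parameter (hypothetical, holds plain lists)."""

    def __init__(self, data):
        self.data = data


class Module:
    def __init__(self):
        # Neither value is a Parameter or a Module, so both assignments
        # fall through to the default branch of __setattr__ below.
        self._parameters = OrderedDict()
        self._modules = OrderedDict()

    def __setattr__(self, name, value):
        if isinstance(value, Parameter):      # corresponds to branch <1>
            self._parameters[name] = value
        elif isinstance(value, Module):       # corresponds to branch <2>
            self._modules[name] = value
        else:
            object.__setattr__(self, name, value)

    def __getattr__(self, name):
        # Only reached when normal lookup fails, i.e. for registered entries.
        for registry in ("_parameters", "_modules"):
            entries = self.__dict__.get(registry, {})
            if name in entries:
                return entries[name]
        raise AttributeError(name)

    def named_parameters(self, prefix=""):
        # Depth-first walk over the module tree.
        for name, param in self._parameters.items():
            yield prefix + name, param
        for name, child in self._modules.items():
            for item in child.named_parameters(prefix + name + "."):
                yield item


class Linear(Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = Parameter([[0.0] * in_features for _ in range(out_features)])
        self.bias = Parameter([0.0] * out_features)


class Net(Module):
    def __init__(self):
        super().__init__()
        self.linear = Linear(4, 2)


net = Net()
print(list(net._modules))                              # ['linear']
print([name for name, _ in net.named_parameters()])    # ['linear.weight', 'linear.bias']
```

Note that net.linear never lands in the instance __dict__; it is only reachable through the _modules registry, which is why the toy class also needs __getattr__.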

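Forward and backward passes

Forward pass

In the example, out = net(features) runs the network's forward pass. The statement ends up calling forward, which Module declares and each subclass implements. net(features) uses an object as a function, which invokes Python's __call__ special method; Module overrides it.

module.py: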

def __call__(self, *input, **kwargs):
    for hook in self._forward_pre_hooks.values():
        hook(self, input)
    result = self.forward(*input, **kwargs)
    for hook in self._forward_hooks.values():
        hook_result = hook(self, input, result)
        if hook_result is not None:
            raise RuntimeError(
                "forward hooks should never return any values, but '{}'"
                "didn't return None".format(hook))
    if len(self._backward_hooks) > 0:
        var = result
        while not isinstance(var, Variable):
            var = var[0]
        grad_fn = var.grad_fn
        if grad_fn is not None:
            for hook in self._backward_hooks.values():
                wrapper = functools.partial(hook, self)
                functools.update_wrapper(wrapper, hook)
                grad_fn.register_hook(wrapper)
    return result

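__call__ checks for hook functions to run before and after result = self.forward(*input, **kwargs) (pre-processing and post-processing). If backward hooks are registered, it also attaches them to the grad_fn of the result.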
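
The call-then-hooks pattern is easy to try in isolation. The following is a minimal sketch in plain Python: HookedModule and Double are hypothetical names, backward hooks are omitted, and hooks are written straight into the dictionaries rather than through a registration method.

```python
from collections import OrderedDict


class HookedModule:
    """Toy stand-in for Module (hypothetical): models only __call__ and forward hooks."""

    def __init__(self):
        self._forward_pre_hooks = OrderedDict()
        self._forward_hooks = OrderedDict()

    def forward(self, x):
        raise NotImplementedError

    def __call__(self, *inputs):
        for hook in self._forward_pre_hooks.values():
            hook(self, inputs)
        result = self.forward(*inputs)
        for hook in self._forward_hooks.values():
            if hook(self, inputs, result) is not None:
                raise RuntimeError("forward hooks should return None")
        return result


class Double(HookedModule):
    def forward(self, x):
        return 2 * x


log = []
layer = Double()
layer._forward_pre_hooks["pre"] = lambda module, inputs: log.append(("pre", inputs))
layer._forward_hooks["post"] = lambda module, inputs, result: log.append(("post", result))

out = layer(21)   # goes through __call__, which runs forward between the hooks
print(out)        # 42
print(log)        # [('pre', (21,)), ('post', 42)]
```

This is also why user code calls net(features) rather than net.forward(features): calling forward directly would skip the hooks.
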
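In the example, Net's forward runs out = F.softmax(self.linear(input)). The call self.linear(input) in turn invokes Linear's forward, which calls F.linear(input, self.weight, self.bias) to perform the matrix computation (an affine transformation).
functional.py: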

def linear(input, weight, bias=None):
    if input.dim() == 2 and bias is not None:
        # fused op is marginally faster
        return torch.addmm(bias, input, weight.t())
    output = input.matmul(weight.t())
    if bias is not None:
        output += bias
    return output

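Finally the result passes through F.softmax, giving the forward output. F.softmax and F.linear play the role of the Functions described earlier: they build the expression (computation graph) over the Parameters.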
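
To make the shapes concrete, here is the same computation, input times the transposed weight plus bias followed by a row-wise softmax, written with nested lists. It is a sketch for illustration only: the functions are plain-Python stand-ins rather than the library's implementation, and the numbers are made up.

```python
import math


def linear(inputs, weight, bias):
    """inputs: N x in, weight: out x in, bias: out. Returns N x out (x @ W^T + b)."""
    return [
        [sum(x_k * w_k for x_k, w_k in zip(row, w_row)) + b
         for w_row, b in zip(weight, bias)]
        for row in inputs
    ]


def softmax(rows):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    result = []
    for row in rows:
        shift = max(row)
        exps = [math.exp(v - shift) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


features = [[1.0, 2.0, 3.0, 4.0],
            [0.0, 0.0, 0.0, 0.0],
            [1.0, 0.0, -1.0, 0.0]]     # 3 samples, 4 features
weight = [[0.1, 0.2, 0.3, 0.4],
          [0.0, 0.0, 0.0, 0.0]]        # out_features=2, in_features=4
bias = [0.5, -0.5]

scores = linear(features, weight, bias)
probs = softmax(scores)
print(len(scores), len(scores[0]))     # 3 2
```

The weight is stored as out_features x in_features, which is why both the library code and this sketch multiply by its transpose.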

Backward pass

With the forward result in hand, the loss is computed as loss = F.cross_entropy(out, target). The backward pass then computes the derivatives d(loss)/d(weight) and d(loss)/d(bias):

loss.backward()

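backward() is likewise implemented in the C extension, so it is not explored further here. After the call, grad has been computed for every Variable in the loss's computation graph that requires gradient: here the weight and bias of linear, and also features, which the example creates with requires_grad=True.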
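
As a rough illustration of what those gradients are, the sketch below works out d(loss)/d(weight) and d(loss)/d(bias) by hand for one sample and checks one entry against a finite difference. It rests on assumptions that differ from the example: a single softmax inside the cross-entropy (scores fed in directly), one sample instead of three, and made-up numbers. The function names are hypothetical and nothing here reflects how the C extension computes gradients.

```python
import math


def scores_of(weight, bias, x):
    """Affine scores W x + b for a single sample."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def loss_fn(weight, bias, x, target):
    """Cross-entropy of softmax(W x + b) against the target class."""
    scores = scores_of(weight, bias, x)
    shift = max(scores)
    log_sum = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return log_sum - scores[target]


def analytic_grads(weight, bias, x, target):
    """d(loss)/d(weight) and d(loss)/d(bias) via the chain rule."""
    scores = scores_of(weight, bias, x)
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # d(loss)/d(score_j) = softmax_j - 1[j == target]
    d_scores = [p - (1.0 if j == target else 0.0) for j, p in enumerate(probs)]
    d_weight = [[d * v for v in x] for d in d_scores]   # outer product with the input
    d_bias = d_scores
    return d_weight, d_bias


weight = [[0.1, -0.2, 0.3, 0.0], [0.05, 0.4, -0.1, 0.2]]
bias = [0.0, 0.1]
x = [1.0, 2.0, -1.0, 0.5]
target = 1

d_weight, d_bias = analytic_grads(weight, bias, x, target)

# Central finite difference on weight[0][1] as an independent check.
eps = 1e-6
weight[0][1] += eps
loss_plus = loss_fn(weight, bias, x, target)
weight[0][1] -= 2 * eps
loss_minus = loss_fn(weight, bias, x, target)
weight[0][1] += eps                      # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
print(abs(numeric - d_weight[0][1]) < 1e-6)   # True
```

After loss.backward(), the grad attribute of each parameter holds the corresponding quantity for the real graph, which is what the optimizer reads in the next section.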

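Parameter updates in Optimizer

Once the gradients of the parameters are available, the parameters are updated according to the optimization algorithm; different algorithms use different update rules.
optimizer.py: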

class Optimizer(object):
    def __init__(self, params, defaults):
        if isinstance(params, Variable) or torch.is_tensor(params):
            raise TypeError("params argument given to the optimizer should be "
                            "an iterable of Variables or dicts, but got " +
                            torch.typename(params))
        self.state = defaultdict(dict)
        self.param_groups = list(params)
        ......
    def zero_grad(self):
        """Clears the gradients of all optimized :class:`Variable` s."""
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is not None:
                    if p.grad.volatile:
                        p.grad.data.zero_()
                    else:
                        data = p.grad.data
                        p.grad = Variable(data.new().resize_as_(data).zero_())
    def step(self, closure):
        """Performs a single optimization step (parameter update).
        Arguments:
            closure (callable): A closure that reevaluates the model and
                returns the loss. Optional for most optimizers.
        """
        raise NotImplementedError

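In __init__, Optimizer stores the incoming params in self.param_groups. Two other methods matter: zero_grad resets the parameters' grad to zero so that gradients from the previous iteration do not accumulate into the next one, and step performs the parameter update and is implemented by subclasses.
Take sgd = torch.optim.SGD(net.parameters(), lr=0.001) from the example: net.parameters() returns an iterator over Net's parameters, which are the parameters to optimize, and lr sets the learning rate.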
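
Both zero_grad and step index each entry of param_groups as group['params'], so a flat iterator of parameters has to be turned into a list of dicts somewhere in the part of the constructor that the excerpt elides. The sketch below shows one way such a normalization could look. It is an assumption made for illustration, not the library's code: make_param_groups is a hypothetical helper, and the placeholder objects stand in for real parameters.

```python
def make_param_groups(params, defaults):
    """Normalize `params` into a list of dicts, one dict per parameter group."""
    groups = list(params)                      # consumes a generator exactly once
    if not groups:
        raise ValueError("optimizer got an empty parameter list")
    if not isinstance(groups[0], dict):
        # A flat iterable of parameters becomes a single group.
        groups = [{"params": groups}]
    for group in groups:
        group["params"] = list(group["params"])
        for key, value in defaults.items():
            group.setdefault(key, value)       # per-group options win over defaults
    return groups


weight, bias = object(), object()              # placeholders for two parameters

flat = make_param_groups(iter([weight, bias]), {"lr": 0.001, "momentum": 0})
print(len(flat), flat[0]["lr"])                # 1 0.001

custom = make_param_groups(
    [{"params": [weight]}, {"params": [bias], "lr": 0.1}],
    {"lr": 0.001, "momentum": 0},
)
print(custom[0]["lr"], custom[1]["lr"])        # 0.001 0.1
```

Keeping options such as lr inside each group is what lets step read group['lr'] and group['momentum'] per group rather than from the optimizer as a whole.
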
sgd.py:

class SGD(Optimizer):
    def __init__(self, params, lr=required, momentum=0, dampening=0,
                 weight_decay=0, nesterov=False):
        defaults = dict(lr=lr, momentum=momentum, dampening=dampening,
                        weight_decay=weight_decay, nesterov=nesterov)
        if nesterov and (momentum <= 0 or dampening != 0):
            raise ValueError("Nesterov momentum requires a momentum and zero dampening")
        super(SGD, self).__init__(params, defaults)
    def __setstate__(self, state):
        super(SGD, self).__setstate__(state)
        for group in self.param_groups:
            group.setdefault('nesterov', False)
    def step(self, closure=None):
        """Performs a single optimization step.
        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()
        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']
            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad.data
                if weight_decay != 0:
                    d_p.add_(weight_decay, p.data)
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        buf = param_state['momentum_buffer'] = d_p.clone()
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(1 - dampening, d_p)
                    if nesterov:
                        d_p = d_p.add(momentum, buf)
                    else:
                        d_p = buf
                p.data.add_(-group['lr'], d_p)
        return loss

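In SGD's step method, the code checks whether weight decay and momentum are in use. If neither is, the weights are updated directly: param := param - lr * d(param). In the example, calling sgd.step() completes one iteration. Because the parameter objects passed to the Optimizer are mutable and shared with the network, updating them in step also updates the parameters inside Net.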
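
The update rule and the sharing of parameter objects can both be seen in a scalar sketch. The code below is a simplified illustration under stated assumptions: Param and sgd_step are hypothetical names, parameters are single floats, the Nesterov branch is left out, and the momentum buffer is kept in a plain dict keyed by the parameter object.

```python
class Param:
    """Toy parameter (hypothetical): a mutable holder for a value and its gradient."""

    def __init__(self, data):
        self.data = data
        self.grad = None


def sgd_step(params, state, lr, momentum=0.0, dampening=0.0, weight_decay=0.0):
    """One SGD update over `params`; `state` maps each Param to its momentum buffer."""
    for p in params:
        if p.grad is None:
            continue
        d_p = p.grad
        if weight_decay != 0:
            d_p = d_p + weight_decay * p.data
        if momentum != 0:
            if p not in state:
                state[p] = d_p      # first step: the buffer starts as the gradient
            else:
                state[p] = momentum * state[p] + (1 - dampening) * d_p
            d_p = state[p]
        p.data = p.data - lr * d_p  # changes the object that the model also holds


# Plain SGD: the model and the optimizer hold the same Param object.
w = Param(1.0)
model_params = [w]
optimizer_params = list(model_params)   # a separate list, but the same objects

w.grad = 0.5
sgd_step(optimizer_params, {}, lr=0.5)
print(model_params[0].data)             # 0.75

# Momentum: the buffer carries over between steps.
v = Param(1.0)
state = {}
for _ in range(2):
    v.grad = 1.0
    sgd_step([v], state, lr=0.5, momentum=0.5)
print(v.data)                           # -0.25
```

In the second case the first step moves v from 1.0 to 0.5 with a buffer of 1.0, and the second step uses a buffer of 0.5 * 1.0 + 1.0 = 1.5, giving 0.5 - 0.5 * 1.5 = -0.25.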

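Summary

That covers the parts of PyTorch's Python source that a simple example touches. The C extension implementations of Tensor, autograd and the other low-level pieces were not explored; I plan to read that code later.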

Reposted from: https://www.jianshu.com/p/f5eb8c2e671c
