I. Regularization: weight_decay
1. Regularization: a strategy for reducing variance
The error can be decomposed into the sum of bias, variance, and noise: Error = Bias + Variance + Noise.
Bias measures how far the learning algorithm's expected prediction deviates from the ground truth, i.e., it characterizes the fitting ability of the algorithm itself.
Variance measures how much the learned performance changes when a training set of the same size is perturbed, i.e., it characterizes the effect of data perturbation.
Noise expresses the lower bound on the expected generalization error that any learning algorithm can achieve on the current task.

Strictly speaking, variance here refers to the variance of the predictions of models trained on different datasets, not to the gap between training loss and test loss.
2. Loss function: measures the discrepancy between the model output and the ground-truth label
Loss Function: \(Loss = f(\hat y , y)\)
Cost Function: \(Cost = \frac{1}{N}\sum_i f(\hat y_i, y_i)\)
Objective Function: \(Obj = Cost + Regularization\)

L1 Regularization: \(\sum_i |w_i|\). Because the optimum is often attained on a coordinate axis (at a corner of the constraint region), it tends to produce sparse parameters.
L2 Regularization: \(\sum_i w_i^2\). The gradient-descent update becomes \(w_{i+1} = w_i - \frac{\partial Obj}{\partial w_i} = w_i - (\frac{\partial Loss}{\partial w_i}+\lambda w_i) = w_i(1-\lambda) - \frac{\partial Loss}{\partial w_i}\), which is why it is commonly called weight decay.
3. Example: a simple three-layer MLP
Build one model with weight_decay and one without, and train them side by side. As the number of training iterations grows, the loss of the model without the regularization term approaches 0 (it overfits the training data).
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

# hyper-parameters (values chosen for illustration; not specified in the excerpt)
n_hidden = 200
lr_init = 0.01
max_iter = 2000

# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):
    w = 1.5   # ground-truth slope
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w * train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w * test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y

train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))
# ============================ step 2/5 model ============================
class MLP(nn.Module):
    def __init__(self, neural_num):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

net_normal = MLP(neural_num=n_hidden)
net_weight_decay = MLP(neural_num=n_hidden)
# ============================ step 3/5 optimizers ============================
optim_normal = torch.optim.SGD(net_normal.parameters(), lr=lr_init, momentum=0.9)
optim_wdecay = torch.optim.SGD(net_weight_decay.parameters(), lr=lr_init, momentum=0.9,
                               weight_decay=1e-2)  # this optimizer includes weight_decay

# ============================ step 4/5 loss function ============================
loss_func = torch.nn.MSELoss()

# ============================ step 5/5 training loop ============================
writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")
for epoch in range(max_iter):
    # forward
    pred_normal, pred_wdecay = net_normal(train_x), net_weight_decay(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)
    optim_normal.zero_grad()
    optim_wdecay.zero_grad()
    loss_normal.backward()
    loss_wdecay.backward()
    optim_normal.step()
    optim_wdecay.step()
    ...
Viewing the weight histograms in TensorBoard, it is clear that the parameters of the model trained with L2 regularization are much more concentrated:
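The logging code itself is elided in the script above. A minimal sketch of how the weight histograms could be written with the SummaryWriter created earlier (the helper name and histogram tags are made up for illustration, not taken from the original script):

def log_weight_histograms(writer, net_normal, net_weight_decay, epoch):
    # one histogram per parameter tensor, for each of the two models
    for name, param in net_normal.named_parameters():
        writer.add_histogram(name + '_normal', param, epoch)
    for name, param in net_weight_decay.named_parameters():
        writer.add_histogram(name + '_weight_decay', param, epoch)

# called inside the training loop, e.g. every few hundred iterations:
# log_weight_histograms(writer, net_normal, net_weight_decay, epoch)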
II. Regularization: Dropout
1. Random: the dropout probability
Deactivation: weight = 0
This means that every neuron in the layer is deactivated with probability prob, not that a fixed fraction prob of the neurons is deactivated.
2. This brings three effects:
- Features become less dependent on each other
- Weight values are averaged out
- The data scale is reduced
Suppose prob = 0.3. Dropout is not applied at test time, so to compensate for this change of scale, the weights are divided by \((1-p)\) during training.
Test: \(100 = \sum_{i=1}^{100} w_i x_i\)
Train: \(70 = \sum_{i \in \text{kept}} w_i x_i \Longrightarrow 100 = \frac{1}{1-p}\sum_{i \in \text{kept}} w_i x_i\)
Therefore, passing data through the layer in the two modes yields approximately the same result:
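The snippet below uses a Net class, an input_num, and an input x that are not defined in the excerpt. A minimal sketch consistent with the printed outputs (10,000 inputs of 1 passed through dropout, a single bias-free linear layer whose weights are set to 1, and a ReLU); the exact definition is an assumption:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(Net, self).__init__()
        self.linears = nn.Sequential(
            nn.Dropout(d_prob),                    # index 0: dropout on the inputs
            nn.Linear(neural_num, 1, bias=False),  # index 1: the layer whose weights are set to 1 below
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.linears(x)

input_num = 10000
x = torch.ones((input_num,), dtype=torch.float32)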
net = Net(input_num, d_prob=0.5)
net.linears[1].weight.detach().fill_(1.)   # set every weight of the linear layer to 1

net.train()   # training mode (switch back to this once testing is finished)
y = net(x)
print("output in training mode", y)

net.eval()    # evaluation mode (switch to this when testing starts)
y = net(x)
print("output in eval mode", y)
output in training mode tensor([9942.], grad_fn=<ReluBackward1>)
output in eval mode tensor([10000.], grad_fn=<ReluBackward1>)
3. Again using the linear-regression example:
Build one model with dropout and one without: as the number of training iterations grows, the model with dropout produces a smoother fit.
class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)
Looking at the weight distributions of the linear layers, it is clear that the model with dropout has more concentrated parameters with a higher peak.
III. Batch Normalization
1. Batch Normalization
Batch: a batch of data, usually a mini-batch
Normalization: zero mean, unit variance
Advantages:
- A larger learning rate can be used, speeding up convergence
- Carefully designed weight initialization is no longer required
- Dropout can be removed or used with a smaller rate
- L2 regularization can be removed or a smaller weight decay used
- LRN (local response normalization) is no longer needed
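For reference, the batch normalization transform over a mini-batch \(\mathcal{B} = \{x_1, \dots, x_m\}\), as given in the original paper, is:

\[
\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta
\]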
Note that \(\gamma, \beta\) are learnable parameters: if a layer does not benefit from BN, it can simply learn \(\gamma = \sigma_{\mathcal{B}}, \beta = \mu_{\mathcal{B}}\), i.e., the identity transform.
For details, see the reading notes and implementation of "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".
2. Internal Covariate Shift (ICS)
It prevents uneven data scales/distributions from causing vanishing or exploding gradients, which would make training difficult.
The other normalization layers introduced in Section IV are likewise designed to avoid ICS.
3. _BatchNorm
In PyTorch, nn.BatchNorm1d, nn.BatchNorm2d, and nn.BatchNorm3d all inherit from _BatchNorm and take the following constructor arguments:
__init__(self, num_features,          # number of features of one sample (the most important argument)
         eps=1e-5,                    # small term added to the denominator for numerical stability
         momentum=0.1,                # exponential weighted average factor for the running mean/var
         affine=True,                 # whether to apply the learnable affine transform
         track_running_stats=True)    # training mode (update running stats) vs. evaluation mode
Main attributes of a BatchNorm layer:
- running_mean: the running mean
- running_var: the running variance
- weight: gamma in the affine transform
- bias: beta in the affine transform
Training: the mean and variance are estimated with an exponential weighted average
running_mean = (1 - momentum) * running_mean + momentum * mean_t
running_var = (1 - momentum) * running_var + momentum * var_t
Testing: the stored running statistics are used
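A small sketch (shapes and values chosen for illustration) of how the running statistics behave in the two modes:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(num_features=3, momentum=0.1)
x = torch.randn(8, 3) + 2.0   # batch whose mean is roughly 2

bn.train()
_ = bn(x)                      # running_mean moves from 0 toward the batch mean by factor momentum
print(bn.running_mean)         # roughly 0.9 * 0 + 0.1 * batch_mean

bn.eval()
_ = bn(x)                      # eval mode: the stored statistics are used, nothing is updated
print(bn.running_mean)         # unchanged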
In the 1D, 2D, and 3D cases, num_features refers to the number of features, feature maps, and feature volumes, respectively. BN computes one mean and one variance per feature, over all samples that share that feature; in the figure's three example cases this gives 5, 3, and 3 statistics, so the corresponding \(\gamma, \beta\) have dimensions 5, 3, and 3.
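A minimal sketch (shapes are illustrative) of what num_features refers to, and of the per-feature \(\gamma, \beta\):

import torch
import torch.nn as nn

bn1d = nn.BatchNorm1d(num_features=5)        # expects input of shape (batch, 5) or (batch, 5, L)
bn2d = nn.BatchNorm2d(num_features=3)        # expects input of shape (batch, 3, H, W)

print(bn1d(torch.randn(4, 5)).shape)         # torch.Size([4, 5])
print(bn2d(torch.randn(4, 3, 8, 8)).shape)   # torch.Size([4, 3, 8, 8])
print(bn2d.weight.shape, bn2d.bias.shape)    # gamma and beta: one per feature map, torch.Size([3])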
4. Again using the RMB binary-classification example:
We insert BN layers after the convolutional and linear layers of the original network; note that the conv layers are followed by 2D BN and the linear layer by 1D BN.
import torch.nn as nn
import torch.nn.functional as F   # needed for F.relu and F.max_pool2d

class LeNet_bn(nn.Module):
    def __init__(self, classes):
        super(LeNet_bn, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(num_features=6)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(num_features=16)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(num_features=120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, classes)

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        out = F.max_pool2d(out, 2)
        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)
        out = F.max_pool2d(out, 2)
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.bn3(out)
        out = F.relu(out)
        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

    def initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_normal_(m.weight.data)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, 0, 1)
                m.bias.data.zero_()
- Using net = LeNet(classes=2) without weight initialization:
- Using net = LeNet(classes=2) with the carefully designed net.initialize_weights():
- Using net = LeNet_bn(classes=2), the results are as follows: even though the loss still has unstable stretches, its maximum never exceeds 1.5 as it does in the previous two settings.
IV. Normalization layers
1. Layer Normalization
Motivation: BN is not applicable to variable-length networks such as RNNs
Idea: compute the mean and variance layer by layer
Notes:
- There is no longer a running_mean or running_var
- gamma and beta are per element
nn.LayerNorm(normalized_shape,         # shape of the features to be normalized in this layer
             eps=1e-05,
             elementwise_affine=True)  # whether to apply the learnable affine transform
Note that normalized_shape can be any number of trailing dimensions of the input. For example, for a batched input of shape [8, 6, 3, 4], it can be [6, 3, 4], [3, 4], or [4], but not [6, 3].
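A short sketch (shapes are illustrative) of the trailing-dimensions rule and of the per-element affine parameters:

import torch
import torch.nn as nn

x = torch.randn(8, 6, 3, 4)    # a batch of 8 samples

ln = nn.LayerNorm([3, 4])      # normalize over the last two dimensions
print(ln(x).shape)             # torch.Size([8, 6, 3, 4])
print(ln.weight.shape)         # per-element gamma: torch.Size([3, 4])

# nn.LayerNorm([6, 3]) would fail at forward time, because [6, 3]
# does not match the trailing dimensions of x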
2. Instance Normalization
Motivation: BN is unsuitable for image generation, where the images within a batch are all different from one another, so normalizing across the batch makes no sense
Idea: compute the mean and variance per instance (per channel)
nn.InstanceNorm2d(num_features,
                  eps=1e-05,
                  momentum=0.1,
                  affine=False,
                  track_running_stats=False)
# 1d and 3d versions also exist
Image style transfer is a typical application where BN cannot be used: the input images are all different, so the mean and variance can only be computed per channel.
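A minimal sketch (shapes are illustrative) showing that InstanceNorm normalizes each sample and channel independently of the rest of the batch:

import torch
import torch.nn as nn

x = torch.randn(4, 3, 32, 32)               # 4 images, 3 channels each
inorm = nn.InstanceNorm2d(num_features=3)
y = inorm(x)

# each (sample, channel) slice now has roughly zero mean and unit variance
print(y[0, 0].mean().item(), y[0, 0].std().item())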
3. Group Normalization
Motivation: with small batches, the statistics estimated by BN are inaccurate
Idea: when there are not enough samples, make up for it with the channels
nn.GroupNorm(num_groups,      # number of groups; must be a divisor of num_channels
             num_channels,
             eps=1e-05,
             affine=True)
Notes:
- There is no longer a running_mean or running_var
- gamma and beta are per channel
Use case: large-model tasks (where the batch size is small)
When num_groups = 1 (all channels in one group), GroupNorm is equivalent to LN.
When num_groups = num_channels (one channel per group), it is equivalent to IN.
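A small check of these two limiting cases (not from the original post; default eps values match, so the outputs should agree up to floating-point tolerance):

import torch
import torch.nn as nn

x = torch.randn(2, 6, 4, 4)

gn_as_ln = nn.GroupNorm(num_groups=1, num_channels=6)   # one group over all channels
gn_as_in = nn.GroupNorm(num_groups=6, num_channels=6)   # one channel per group

ln = nn.LayerNorm([6, 4, 4], elementwise_affine=False)
inorm = nn.InstanceNorm2d(6)

# GroupNorm's affine parameters are initialized to gamma=1, beta=0, so they do not change the result
print(torch.allclose(gn_as_ln(x), ln(x), atol=1e-5))     # expected: True
print(torch.allclose(gn_as_in(x), inorm(x), atol=1e-5))  # expected: True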