Network in Network (NIN)
The LeNet, AlexNet and VGG designs introduced earlier share a common idea: make the network wider (more output channels per convolutional layer) and deeper (more convolutional layers), then attach fully connected layers for classification.
NIN takes a different route: it builds a deep network by chaining together several small networks, each made of a convolutional layer followed by 'fully connected layers' (1x1 convolutions).
Paper: https://arxiv.org/pdf/1312.4400.pdf
In my view NIN comes down to two key points:
- The mlpconv block (which we implement with 1x1 convolutions): it combines the features of multiple feature maps and adds extra non-linearity.
- Global average pooling in place of the fully connected layers.
A blog post I found helpful: https://blog.csdn.net/hjimce/article/details/50458190
1x1 Convolution
A 1x1 convolution performs a multiply-accumulate over the elements along the channel dimension.
As shown in the figure above, a 1x1 convolution does not mix elements across the spatial dimensions, so the spatial (h, w) information passes through unchanged to the following layers.
For example, using the [h, w, c] layout, a 1x1 convolution only multiply-accumulates over [0,0,0], [0,0,1], [0,0,2], and so on.
An element at [0,0,x] never interacts with an element at [0,1,x].
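A quick way to convince yourself of this is to run a 1x1 convolution in PyTorch and check that only the channel dimension changes while h and w stay put. The short sketch below (tensor sizes chosen here purely for illustration) also shows that a 1x1 convolution is the same as applying one small fully connected layer at every spatial position:

import torch
import torch.nn as nn

x = torch.rand(1, 3, 4, 4)                 # [batch, channels, h, w]
conv1x1 = nn.Conv2d(3, 8, kernel_size=1)

y = conv1x1(x)
print(y.shape)                             # torch.Size([1, 8, 4, 4]) -- h and w unchanged

# Same computation viewed as a per-pixel linear layer over the channel dimension
fc = nn.Linear(3, 8)
fc.weight.data = conv1x1.weight.data.view(8, 3)   # reuse the conv weights
fc.bias.data = conv1x1.bias.data
y2 = fc(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(y, y2, atol=1e-6))    # True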
NIN Architecture
NIN was proposed on top of AlexNet; their structures are shown below.
The AlexNet structure:
Note that in this figure the max pooling layers come after the first, second and fifth convolutional layers, so the figure is slightly misleading: after the 11x11 convolution there is a max pool and then the next convolution, not conv-conv-pool.
The NIN structure:
This is a diagram found online; the NIN paper itself does not contain a complete structure diagram.
One detail in this diagram is wrong: the last convolution should use kernels of shape 3x3x384, 1000 of them, so the part circled in red should read 3x3x384x1000, 1000, 1000. In our implementation this becomes 3x3x384x10, 10, 10, because our dataset has only 10 classes.
Let's implement the convolutional part first.
We start by defining NIN's 'small network' block, i.e. the 'regular conv - 1x1 conv - 1x1 conv' part.
import time
import torch
import torch.nn as nn

def make_layers(in_channels, out_channels, kernel_size, stride, padding):
    conv = nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=1, stride=1, padding=0),  # 1x1 conv: combines the features of multiple feature maps
        nn.ReLU(inplace=True),
        nn.Conv2d(out_channels, out_channels, kernel_size=1, stride=1, padding=0),  # 1x1 conv: combines the features of multiple feature maps
        nn.ReLU(inplace=True)
    )
    return conv
For the convolutional part of the network we can then write:
conv1 = make_layers(1,96,11,4,2)
pool1 = nn.MaxPool2d(kernel_size=3,stride=2)
conv2 = make_layers(96,256,kernel_size=5,stride=1,padding=2)
pool2 = nn.MaxPool2d(kernel_size=3,stride=2)
conv3 = make_layers(256,384,kernel_size=3,stride=1,padding=1)
pool3 = nn.MaxPool2d(kernel_size=3,stride=2)
conv4 = make_layers(384,10,kernel_size=3,stride=1,padding=1)
Let's check the model:
X = torch.rand(1, 1, 224, 224)
o1 = conv1(X)
print(o1.shape) #[1,96,55,55]
o1_1 = pool1(o1)
print(o1_1.shape) #[1,96,27,27]
o2 = conv2(o1_1)
print(o2.shape) #[1,256,27,27]
o2_1 = pool2(o2)
print(o2_1.shape) #[1,256,13,13]
o3 = conv3(o2_1)
print(o3.shape) #[1,384,13,13]
o3_1 = pool3(o3)
print(o3_1.shape) #[1,384,6,6]
o4 = conv4(o3_1)
print(o4.shape) #[1,10,6,6]
Every layer's output shape matches, so the model is wired up correctly. If some shape is off, we go back and adjust the arguments of make_layers(), mostly the padding.
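Rather than tuning the padding by trial and error, the expected sizes can also be computed by hand: for both convolution and pooling, out = floor((in + 2*padding - kernel) / stride) + 1, and the 1x1 convolutions inside make_layers leave the spatial size unchanged. A small helper, written here just for illustration, reproduces the sizes printed above:

def out_size(in_size, kernel, stride, padding=0):
    # floor((in + 2*padding - kernel) / stride) + 1
    return (in_size + 2 * padding - kernel) // stride + 1

s = out_size(224, 11, 4, 2)   # conv1 -> 55
s = out_size(s, 3, 2)         # pool1 -> 27
s = out_size(s, 5, 1, 2)      # conv2 -> 27
s = out_size(s, 3, 2)         # pool2 -> 13
s = out_size(s, 3, 1, 1)      # conv3 -> 13
s = out_size(s, 3, 2)         # pool3 -> 6
s = out_size(s, 3, 1, 1)      # conv4 -> 6
print(s)                      # 6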
After the convolutional part produces a [1, 10, 6, 6] output, we apply global average pooling. What does global average pooling mean?
Consider ordinary pooling first: given, say, a 10x10 input and a 2x2 window, the window slides over the input, and over each 2x2 region we take the average (average pooling) or the maximum (max pooling). This is 'local' pooling, since the 2x2 window covers only part of the 10x10 input.
Global pooling, accordingly, uses a window the same size as the input, i.e. it pools over the entire input at once.
So the global average pooling part can be implemented as:
ap = nn.AvgPool2d(kernel_size=6,stride=1)
o5 = ap(o4)
print(o5.shape) #[1,10,1,1]
The nn module in torch already provides the average pooling operator; all we have to do is set kernel_size to the size of the incoming feature map, and we get global average pooling.
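Two equivalent alternatives, mentioned only as a side note: nn.AdaptiveAvgPool2d(1) always pools down to 1x1 regardless of the input size, and averaging over the spatial dimensions by hand gives the same numbers:

gap = nn.AdaptiveAvgPool2d(1)                 # output spatial size fixed at 1x1, works for any input size
o5_b = gap(o4)                                # [1, 10, 1, 1]
o5_c = o4.mean(dim=(2, 3), keepdim=True)      # average over h and w
print(torch.allclose(o5, o5_b), torch.allclose(o5, o5_c))  # True True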
The Significance of Global Average Pooling
An obvious benefit of replacing the fully connected layers with global average pooling is that the parameter count drops dramatically, which also helps prevent overfitting.
Seen from another angle, it regularizes the network through its structure. Take the [1, 10, 6, 6] input, i.e. 10 feature maps of size 6x6: global average pooling turns it into a [1, 10, 1, 1] output, which flattens to a vector of 10 scalars. We treat these 10 scalars as the 10 classes, and training pushes each of them towards the one representing the true class. This also makes the model easier to interpret.
Reference: https://zhuanlan.zhihu.com/p/46235425
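To put a rough number on the first point: if we instead flattened the [384, 6, 6] feature maps (the input to the last mlpconv block) and fed them into an AlexNet-style fully connected layer of width 4096, that single layer alone would need about 56.6 million parameters, whereas global average pooling has none at all; the only classification parameters NIN needs are those of the final 384-to-10 mlpconv block. A quick, purely illustrative count:

# Hypothetical FC head (flatten -> 4096), for comparison only
fc_head = nn.Linear(384 * 6 * 6, 4096)
print(sum(p.numel() for p in fc_head.parameters()))          # 56627200

# NIN's final mlpconv block (384 -> 10) plus the parameter-free global average pooling
nin_head = make_layers(384, 10, kernel_size=3, stride=1, padding=1)
print(sum(p.numel() for p in nin_head.parameters()))         # 34790
print(sum(p.numel() for p in nn.AvgPool2d(6).parameters()))  # 0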
Based on the discussion above, we can define NinNet as follows:
class NinNet(nn.Module):
    def __init__(self):
        super(NinNet, self).__init__()
        self.conv = nn.Sequential(
            make_layers(1, 96, 11, 4, 2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            make_layers(96, 256, kernel_size=5, stride=1, padding=2),
            nn.MaxPool2d(kernel_size=3, stride=2),
            make_layers(256, 384, kernel_size=3, stride=1, padding=1),
            nn.MaxPool2d(kernel_size=3, stride=2),
            make_layers(384, 10, kernel_size=3, stride=1, padding=1)
        )
        self.gap = nn.Sequential(
            nn.AvgPool2d(kernel_size=6, stride=1)
        )

    def forward(self, img):
        feature = self.conv(img)
        output = self.gap(feature)
        output = output.view(img.shape[0], -1)  # [batch,10,1,1] --> [batch,10]
        return output
Let's give it a quick test:
X = torch.rand(1, 1, 224, 224)
net = NinNet()
for name, module in net.named_children():
    X = module(X)
    print(name, X.shape)
Output:
conv torch.Size([1, 10, 6, 6])
gap torch.Size([1, 10, 1, 1])
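As a side check (not in the original), counting the total number of parameters shows just how light the network is without fully connected layers:

total = sum(p.numel() for p in net.parameters())
print(total)   # 1992166 -- about 2M parameters, versus roughly 60M for AlexNet, most of which sit in its FC layers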
What follows is the familiar routine.
Load the data
batch_size,num_workers=16,4
train_iter,test_iter = learntorch_utils.load_data(batch_size,num_workers,resize=224)
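learntorch_utils.load_data is a small helper defined in earlier chapters and not shown here. Assuming it simply wraps torchvision's FashionMNIST with an optional Resize transform (an assumption on my part, since the helper lives elsewhere), it would look roughly like this:

import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

def load_data(batch_size, num_workers, resize=None):
    # hypothetical re-implementation of learntorch_utils.load_data
    trans = []
    if resize:
        trans.append(transforms.Resize(resize))
    trans.append(transforms.ToTensor())
    transform = transforms.Compose(trans)

    train_set = torchvision.datasets.FashionMNIST(root='./data', train=True, download=True, transform=transform)
    test_set = torchvision.datasets.FashionMNIST(root='./data', train=False, download=True, transform=transform)

    train_iter = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=num_workers)
    test_iter = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=num_workers)
    return train_iter, test_iter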
Define the model
net = NinNet().cuda()
print(net)
Define the loss function
loss = nn.CrossEntropyLoss()
Define the optimizer
opt = torch.optim.Adam(net.parameters(),lr=0.001)
Define the evaluation function
def test():
    start = time.time()
    acc_sum = 0
    batch = 0
    for X, y in test_iter:
        X, y = X.cuda(), y.cuda()
        y_hat = net(X)
        acc_sum += (y_hat.argmax(dim=1) == y).float().sum().item()
        batch += 1
        # print('acc_sum %d,batch %d' % (acc_sum,batch))
    acc = 1.0 * acc_sum / (batch * batch_size)
    end = time.time()
    print('acc %3f,test for test dataset:time %d' % (acc, end - start))
    return acc
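Two small refinements worth noting (the code above works as-is): wrapping evaluation in torch.no_grad() avoids building the autograd graph and saves memory, and dividing by batch*batch_size is only exact when the test set size is a multiple of batch_size; counting the actual samples handles a smaller last batch. A variant with both changes:

def test():
    start = time.time()
    acc_sum, n = 0, 0
    with torch.no_grad():                     # no gradients needed during evaluation
        for X, y in test_iter:
            X, y = X.cuda(), y.cuda()
            y_hat = net(X)
            acc_sum += (y_hat.argmax(dim=1) == y).float().sum().item()
            n += y.shape[0]                   # count actual samples; handles a smaller final batch
    acc = acc_sum / n
    print('acc %3f,test for test dataset:time %d' % (acc, time.time() - start))
    return acc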
Training
num_epochs = 3
def train():
    for epoch in range(num_epochs):
        train_l_sum, batch, acc_sum = 0, 0, 0
        start = time.time()
        for X, y in train_iter:
            # start_batch_begin = time.time()
            X, y = X.cuda(), y.cuda()
            y_hat = net(X)
            acc_sum += (y_hat.argmax(dim=1) == y).float().sum().item()
            l = loss(y_hat, y)
            opt.zero_grad()
            l.backward()
            opt.step()
            train_l_sum += l.item()
            batch += 1
            # note: CrossEntropyLoss already averages over the batch, so this divides by batch_size once more
            mean_loss = train_l_sum / (batch * batch_size)
            start_batch_end = time.time()
            time_batch = start_batch_end - start
            train_acc = acc_sum / (batch * batch_size)
            if batch % 100 == 0:
                print('epoch %d,batch %d,train_loss %.3f,train_acc:%.3f,time %.3f' %
                      (epoch, batch, mean_loss, train_acc, time_batch))
            if batch % 1000 == 0:
                model_state = net.state_dict()
                model_name = 'nin_epoch_%d_batch_%d_acc_%.2f.pt' % (epoch, batch, train_acc)
                torch.save(model_state, model_name)
        print('***************************************')
        mean_loss = train_l_sum / (batch * batch_size)   # loss averaged over the epoch (see note above)
        train_acc = acc_sum / (batch * batch_size)       # training accuracy
        test_acc = test()                                # accuracy on the test set
        end = time.time()
        time_per_epoch = end - start
        print('epoch %d,train_loss %f,train_acc %f,test_acc %f,time %f' %
              (epoch + 1, mean_loss, train_acc, test_acc, time_per_epoch))
train()
Partial output:
epoch 0,batch 3600,train_loss 0.070,train_acc:0.603,time 176.200
epoch 0,batch 3700,train_loss 0.069,train_acc:0.606,time 181.160
***************************************
acc 0.701800,test for test dataset:time 11
epoch 1,train_loss 0.069109,train_acc 0.607550,test_acc 0.701800,time 195.619591
epoch 1,batch 100,train_loss 0.044,train_acc:0.736,time 5.053
epoch 1,batch 200,train_loss 0.047,train_acc:0.727,time 10.011
epoch 1,batch 300,train_loss 0.048,train_acc:0.735,time 15.210
As you can see, without the fully connected layers the training time is noticeably shorter.
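Finally, since train() saves a checkpoint every 1000 batches with torch.save, loading one back for inference looks like this (the file name below only illustrates the naming pattern used above):

net = NinNet().cuda()
net.load_state_dict(torch.load('nin_epoch_1_batch_1000_acc_0.74.pt'))  # example file name
net.eval()                      # switch to evaluation mode
with torch.no_grad():
    X, y = next(iter(test_iter))
    pred = net(X.cuda()).argmax(dim=1)
    print(pred[:10].cpu(), y[:10])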