Saining Xie et al. - [arXiv 2017] Aggregated Residual Transformations for Deep Neural Networks
Table of Contents
- Authors and related links
- Main idea
- Comparison of ResNet and ResNeXt
Authors and related links
- Authors: Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He (UC San Diego and Facebook AI Research)

Main idea
- What problem does it solve?
Networks such as VGG, ResNet, and Inception are built by stacking repeated building blocks, but the number and size of the filters inside those blocks cannot be chosen arbitrarily and must be tuned by hand (Inception-style modules in particular need careful per-stage design). Because there are many such hyper-parameters, and because they cannot be transferred directly across vision tasks or even across datasets but have to be customized each time, these task- or dataset-specific modules work well yet generalize poorly. This paper proposes a new building block that can replace ResNet's; the resulting model is called ResNeXt. Its biggest advantage is that the whole network uses one and the same building block, so there is no need to re-tune the block's hyper-parameters at every stage: the network is formed simply by stacking copies of a single block. Experiments show that, at the same model size, ResNeXt outperforms ResNet.
- How is it solved?
Replace the ResNet block (Figure 1, left) with the ResNeXt block (Figure 1, right). In effect, the single path of 64 filters on the left is replaced by 32 parallel paths on the right, each with only 4 filters; the outputs of the 32 paths are summed element-wise (values at corresponding positions across all channels are added), and the result is then added to the shortcut. A minimal code sketch of this block is given after the figure caption below.
Figure 1. Left: A block of ResNet [13]. Right: A block of ResNeXt with cardinality = 32, with roughly the same complexity. A layer is shown as (# in channels, filter size, # out channels)
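To make the block concrete, here is a minimal PyTorch sketch of the ResNeXt bottleneck block in Figure 1 (right). It is my own illustration, not the authors' released code: the 32 paths are realized with a single grouped 3×3 convolution, and all class/parameter names and default values are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """ResNeXt bottleneck block: 32 paths of width 4, realized as a grouped conv."""

    def __init__(self, in_channels=256, cardinality=32, bottleneck_width=4,
                 out_channels=256):
        super().__init__()
        mid = cardinality * bottleneck_width  # 32 * 4 = 128 bottleneck channels in total
        self.reduce = nn.Conv2d(in_channels, mid, kernel_size=1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        # groups=cardinality makes this one conv equivalent to 32 independent 3x3 paths
        self.group_conv = nn.Conv2d(mid, mid, kernel_size=3, padding=1,
                                    groups=cardinality, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)
        self.expand = nn.Conv2d(mid, out_channels, kernel_size=1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.bn1(self.reduce(x)))
        out = self.relu(self.bn2(self.group_conv(out)))
        out = self.bn3(self.expand(out))
        return self.relu(out + identity)  # aggregate the paths, then add the shortcut

x = torch.randn(1, 256, 56, 56)
print(ResNeXtBlock()(x).shape)  # torch.Size([1, 256, 56, 56])
```

With these defaults the block matches the 256-in/256-out, C=32, d=4 configuration of Figure 1 (right); when the input and output widths differ, a projection shortcut would be needed instead of the identity.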
- Cardinality and Bottleneck
This paper introduces a new dimension for measuring model capacity (capacity meaning a model's ability to fit a wide range of functions). Previously, capacity was adjusted through width and depth; the "cardinality" proposed here is the size of the set of transformations inside a building block. As shown in Figure 2, structures (a), (b), and (c) are equivalent, and the paper uses (c). Concretely, cardinality is the number of paths in Figure 2(a)/(b) or the number of groups in Figure 2(c): each path or group is one transformation, so the number of paths or groups is the cardinality. The bottleneck width is the number of channels (i.e. the number of filters) of the intermediate feature map inside each path or group. In Figure 2(a), for example, each path applies 4 filters of size 1×1×256 to the 256-channel input, producing a feature map with 4 channels (the spatial size is unchanged), so the bottleneck width is 4. A numerical check of the equivalence between forms (a) and (c) follows the figure caption below.

Figure 2. Equivalent building blocks of ResNeXt. (a): Aggregated residual transformations, the same as Fig. 1 right. (b): A block equivalent to (a), implemented as early concatenation. (c): A block equivalent to (a,b), implemented as grouped convolutions [23]. Notations in bold text highlight the reformulation changes. A layer is denoted as (# input channels, filter size, # output channels).
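The equivalence between form (a) and form (c) can be checked numerically. The sketch below (assuming PyTorch; the random weights are purely illustrative and not from the paper) runs the same weights once as a grouped convolution and once as 32 explicit paths whose outputs are summed, and verifies the two results agree.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, d, in_ch = 32, 4, 256                      # cardinality, bottleneck width, input channels
x = torch.randn(1, in_ch, 8, 8)

w_reduce = torch.randn(C * d, in_ch, 1, 1)    # 1x1, 256 -> 128
w_group  = torch.randn(C * d, d, 3, 3)        # 3x3 grouped-conv weight (groups = C)
w_expand = torch.randn(in_ch, C * d, 1, 1)    # 1x1, 128 -> 256

# Form (c): grouped convolution.
h = F.conv2d(x, w_reduce)
h = F.conv2d(h, w_group, padding=1, groups=C)
y_grouped = F.conv2d(h, w_expand)

# Form (a): 32 explicit paths whose 256-channel outputs are summed.
y_paths = torch.zeros_like(y_grouped)
for i in range(C):
    s = slice(i * d, (i + 1) * d)
    hi = F.conv2d(x, w_reduce[s])             # 1x1, 256 -> 4
    hi = F.conv2d(hi, w_group[s], padding=1)  # 3x3, 4 -> 4
    y_paths += F.conv2d(hi, w_expand[:, s])   # 1x1, 4 -> 256

print(torch.allclose(y_grouped, y_paths, rtol=1e-3, atol=1e-2))  # True (up to float32 rounding)
```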
Comparison of ResNet and ResNeXt
- Network structure comparison
Figure 1 compares the building blocks of ResNet and ResNeXt for a block depth of 3 (three layers per block).
- Detailed configuration comparison
The block configurations of ResNet-50 and ResNeXt-50 are compared in Table 1; "C=32" in the table means cardinality = 32, with bottleneck width = 4, exactly as in Figure 2. A short sketch that expands this 32×4d template stage by stage follows the table caption.
Table 1. (Left) ResNet-50. (Right) ResNeXt-50 with a 32×4d template (using the reformulation in Fig. 2(c)). Inside the brackets are the shape of a residual block, and outside the brackets is the number of stacked blocks on a stage. "C=32" suggests grouped convolutions [23] with 32 groups. The numbers of parameters and FLOPs are similar between these two models.
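To make the "32×4d template" concrete, the short sketch below (my own illustration of the right column of Table 1, not code from the paper) prints the block shape of each stage: the width of the grouped 3×3 convolution starts at C · d = 128 and doubles whenever the spatial map is downsampled, while the block counts 3/4/6/3 follow ResNet-50.

```python
# Expanding the 32x4d template of ResNeXt-50 (right column of Table 1).
cardinality, base_width = 32, 4
stages = [(256, 3), (512, 4), (1024, 6), (2048, 3)]    # (block output channels, #blocks)
for i, (out_ch, n_blocks) in enumerate(stages):
    group_width = cardinality * base_width * (2 ** i)  # 128, 256, 512, 1024
    print(f"conv{i + 2}: [1x1,{group_width} | 3x3,{group_width},C={cardinality} | "
          f"1x1,{out_ch}] x {n_blocks}")
```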
- Model size calculation
Taking Figure 3 as an example, the ResNet block has 256 · 64 + 3 · 3 · 64 · 64 + 64 · 256 ≈ 70k parameters.
The ResNeXt block has C · (256 · d + 3 · 3 · d · d + d · 256) parameters, where C is the cardinality (32) and d is the bottleneck width (4), which also gives ≈ 70k; the quick check below reproduces both counts.
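A few lines suffice to reproduce the two estimates above (convolution weights only, ignoring biases and batch norm, as in the paper's estimate):

```python
# Reproducing the two parameter counts above (convolution weights only).
resnet_params = 256 * 64 + 3 * 3 * 64 * 64 + 64 * 256
C, d = 32, 4                                             # cardinality, bottleneck width
resnext_params = C * (256 * d + 3 * 3 * d * d + d * 256)
print(resnet_params)   # 69632  (~70k)
print(resnext_params)  # 70144  (~70k)
```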

Figure 3. Left: A block of ResNet [13]. Right: A block of ResNeXt with cardinality = 32, with roughly the same complexity. A layer is shown as (# in channels, filter size, # out channels)
- Experimental results comparison
- Shows that ResNeXt is better than ResNet, and that the larger the cardinality, the better the results
Table 2. Ablation experiments on ImageNet-1K. (Top): ResNet-50 with preserved complexity (∼4.1 billion FLOPs); (Bottom): ResNet-101 with preserved complexity (∼7.8 billion FLOPs). The error rate is evaluated on the single crop of 224×224 pixels.

- Shows that increasing cardinality is more effective than increasing the model's width or depth
Table 3. Comparisons on ImageNet-1K when the number of FLOPs is increased to 2× of ResNet-101’s. The error rate is evaluated on the single crop of 224×224 pixels. The highlighted factors are the factors that increase complexity. 
