Fully Convolutional Networks for Semantic Segmentation 譯文
Abstract
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [19], the VGG net [31], and GoogLeNet [32]) into fully convolutional networks and transfer their learned representations by fine-tuning [4] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes less than one fifth of a second for a typical image.
卷積網絡是強大的視覺模型,能夠產生層次化的特征。我們證明,僅靠卷積網絡本身,經過端到端、像素到像素的訓練,就能超過目前語義分割中最先進的技術。我們的核心觀點是建立“全卷積”網絡:它接受任意尺寸的輸入,經過高效的推理和學習產生相應尺寸的輸出。我們定義並詳細描述了全卷積網絡的空間,解釋它們在空間上的dense prediction任務(dense prediction:預測每個像素所屬的類別)中的應用,並闡明與先前模型的聯系。我們將當前的分類網絡(AlexNet、the VGG net和GoogLeNet)改編成全卷積網絡,並通過微調(fine-tune)把它們學習到的表徵遷移到分割任務中。然后我們定義了一種新穎的跳躍式架構(skip architecture),結合來自深、粗層的語義信息(深層次存儲圖片的全局信息,相對來說比較粗糙,即整體部分)和來自淺、細層的表觀信息(淺層次存儲圖片的局部信息,相對來說比較注重細節,即邊緣部分)來產生准確而精細的分割。我們的全卷積網絡在PASCAL VOC上取得了最先進的分割效果(在2012年數據上相對提升20%,達到62.2%的平均IU),在NYUDv2和SIFT Flow上同樣如此,而對一幅典型圖像的推理只需要不到0.2秒。(PASCAL VOC、NYUDv2和SIFT Flow均為數據集)
1. Introduction
Convolutional networks are driving advances in recognition. Convnets are not only improving for whole-image classification [19, 31, 32], but also making progress on local tasks with structured output. These include advances in bounding box object detection [29, 12, 17], part and key-point prediction [39, 24], and local correspondence [24, 9].
卷積網絡正在推動識別領域的進步。卷積網絡不僅在整圖分類上不斷提高,也在具有結構化輸出的局部任務上取得了進展,包括目標檢測邊界框、部件和關鍵點預測以及局部對應(local correspondence)等方面的進步。
The natural next step in the progression from coarse to fine inference is to make a prediction at every pixel. Prior approaches have used convnets for semantic segmentation [27, 2, 8, 28, 16, 14, 11], in which each pixel is labeled with the class of its enclosing object or region, but with shortcomings that this work addresses.
在從粗糙到精細推理的進展中,下一步自然是對每一個像素進行預測。早前的方法已經將卷積網絡用於語義分割,其中每個像素被標記為其所屬對象或區域的類別,但這些方法存在一些缺點,而本文的工作正是要解決這些缺點。

We show that a fully convolutional network (FCN), trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery. To our knowledge, this is the first work to train FCNs end-to-end (1) for pixelwise prediction and (2) from supervised pre-training. Fully convolutional versions of existing networks predict dense outputs from arbitrary-sized inputs. Both learning and inference are performed whole-image-at-a-time by dense feedforward computation and backpropagation. In-network upsampling layers enable pixelwise prediction and learning in nets with subsampled pooling.
我們證明了經過端到端、像素到像素訓練的全卷積網絡(FCN),在不需要額外機制的情況下超過了語義分割中最先進的水平。據我們所知,這是第一個(1)針對像素級預測、(2)從有監督預訓練出發,對FCN進行端到端訓練的工作。現有網絡的全卷積版本可以從任意尺寸的輸入預測密集輸出。學習和推理都通過密集的前饋計算和反向傳播在整幅圖像上一次完成。網絡內的上采樣層使得帶有下采樣池化的網絡也能進行像素級的預測和學習。
This method is efficient, both asymptotically and absolutely, and precludes the need for the complications in other works. Patchwise training is common [27, 2, 8, 28, 11], but lacks the efficiency of fully convolutional training. Our approach does not make use of pre- and post-processing complications, including superpixels [8, 16], proposals [16, 14], or post-hoc refinement by random fields or local classifiers [8, 16]. Our model transfers recent success in classification [19, 31, 32] to dense prediction by reinterpreting classification nets as fully convolutional and fine-tuning from their learned representations. In contrast, previous works have applied small convnets without supervised pre-training [8, 28, 27].
這種方法無論在漸近意義上還是在絕對意義上都是高效的,並且避免了其他工作中那些復雜的處理步驟。Patchwise訓練(可以理解為傳入神經網絡的數據並非整張圖片,而是圖片中感興趣的局部,這樣做的目的是避免完整圖像訓練的冗余)是常見的,但是缺少全卷積訓練的高效性。我們的方法不使用復雜的預處理和后期處理,包括超像素、proposals(需要看下面的引用),或者通過隨機場或局部分類器進行的事后細化。我們的模型通過把分類網絡重新解釋為全卷積網絡並從它們學習到的表徵進行微調,將最近在分類上的成功遷移到dense prediction。與此相反,先前的工作應用的是小規模、沒有經過有監督預訓練的卷積網絡。
Semantic segmentation faces an inherent tension between semantics and location: global information resolves what while local information resolves where. Deep feature hierarchies jointly encode location and semantics in a local-to-global pyramid. We define a novel “skip” architecture to combine deep, coarse, semantic information and shallow, fine, appearance information in Section 4.2 (see Figure 3).
語義分割面臨語義和位置之間的內在矛盾:全局信息解決的是“是什么”,而局部信息解決的是“在哪里”。深層特征層級在一個由局部到全局的金字塔中同時編碼了位置和語義信息。我們在4.2節(見圖3)定義了一種新穎的“跳躍”(skip)架構,來結合深、粗層的語義信息和淺、細層的表觀信息。
In the next section, we review related work on deep classification nets, FCNs, and recent approaches to semantic segmentation using convnets. The following sections explain FCN design and dense prediction tradeoffs, introduce our architecture with in-network upsampling and multilayer combinations, and describe our experimental framework. Finally, we demonstrate state-of-the-art results on PASCAL VOC 2011-2, NYUDv2, and SIFT Flow.
在下一節,我們回顧深層分類網、FCNs和最近一些利用卷積網解決語義分割的相關工作。接下來的章節將解釋FCN設計和密集預測(dense prediction)權衡,介紹我們的網內上采樣和多層結合架構,描述我們的實驗框架。最后,我們展示了最先進技術在PASCAL VOC 2011-2, NYUDv2, 和SIFT Flow上的實驗結果。
2. Related work
Our approach draws on recent successes of deep nets for image classification [19, 31, 32] and transfer learning [4, 38]. Transfer was first demonstrated on various visual recognition tasks [4, 38], then on detection, and on both instance and semantic segmentation in hybrid proposal classifier models [12, 16, 14]. We now re-architect and fine-tune classification nets to direct, dense prediction of semantic segmentation. We chart the space of FCNs and situate prior models, both historical and recent, in this framework.
我們的方法借鑒了最近深層網絡在圖像分類和遷移學習上的成功。遷移首先在各種視覺識別任務上得到證明,然后是檢測,再然后是混合proposal-classifier模型中的實例分割和語義分割。我們現在重新構建並微調分類網絡,來直接地、密集地預測語義分割。在這個框架里,我們繪制出FCN的空間,並將過去和最近的先前模型置於其中。
Fully convolutional networks To our knowledge, the idea of extending a convnet to arbitrary-sized inputs first appeared in Matan et al. [25], which extended the classic LeNet [21] to recognize strings of digits. Because their net was limited to one-dimensional input strings, Matan et al. used Viterbi decoding to obtain their outputs. Wolf and Platt [37] expand convnet outputs to 2-dimensional maps of detection scores for the four corners of postal address blocks. Both of these historical works do inference and learning fully convolutionally for detection. Ning et al. [27] define a convnet for coarse multiclass segmentation of C. elegans tissues with fully convolutional inference.
全卷積網絡 據我們所知,將卷積網絡擴展到任意尺寸輸入的想法最早出現在Matan等人的工作中,它將經典的LeNet擴展到識別數字串。因為他們的網絡被限制在一維的輸入串上,Matan等人利用維特比(Viterbi)譯碼來獲得輸出。Wolf和Platt [37] 將卷積網絡的輸出擴展為郵政地址塊四個角點檢測得分的二維圖。這兩項早期工作都以全卷積的方式進行推理和學習,用於檢測。Ning等人定義了一種卷積網絡,利用全卷積推理對秀麗線蟲的組織做粗糙的多類別分割。
Fully convolutional computation has also been exploited in the present era of many-layered nets. Sliding window detection by Sermanet et al. [29], semantic segmentation by Pinheiro and Collobert [28], and image restoration by Eigen et al. [5] do fully convolutional inference. Fully convolutional training is rare, but used effectively by Tompson et al. [35] to learn an end-to-end part detector and spatial model for pose estimation, although they do not exposit on or analyze this method.
全卷積計算也被用在當今許多多層網絡中。Sermanet等人的滑動窗口檢測,Pinheiro和Collobert的語義分割,以及Eigen等人的圖像修復都做了全卷積式推理。全卷積訓練很少見,但被Tompson等人有效地用於學習端到端的部件檢測器和用於姿態估計的空間模型,盡管他們沒有闡述或分析這種方法。
Alternatively, He et al. [17] discard the non-convolutional portion of classification nets to make a feature extractor. They combine proposals and spatial pyramid pooling to yield a localized, fixed-length feature for classification. While fast and effective, this hybrid model cannot be learned end-to-end.
此外,He等人丟棄了分類網絡的非卷積部分來構造特征提取器。他們結合proposals和空間金字塔池化來產生局部的、固定長度的特征用於分類。這種混合模型雖然快速且有效,但不能進行端到端的學習。
Dense prediction with convnets Several recent works have applied convnets to dense prediction problems, including semantic segmentation by Ning et al. [27], Farabet et al. [8], and Pinheiro and Collobert [28]; boundary prediction for electron microscopy by Ciresan et al. [2] and for natural images by a hybrid neural net/nearest neighbor model by Ganin and Lempitsky [11]; and image restoration and depth estimation by Eigen et al. [5, 6]. Common elements of these approaches include
- small models restricting capacity and receptive fields;
- patchwise training [27, 2, 8, 28, 11];
- post-processing by superpixel projection, random field regularization, filtering, or local classification [8, 2, 11];
- input shifting and output interlacing for dense output [28, 11] as introduced by OverFeat [29];
- multi-scale pyramid processing [8, 28, 11];
- saturating tanh nonlinearities [8, 5, 28]; and
- ensembles [2, 11],
whereas our method does without this machinery. However, we do study patchwise training 3.4 and “shift-and-stitch” dense output 3.2 from the perspective of FCNs. We also discuss in-network upsampling 3.3, of which the fully connected prediction by Eigen et al. [6] is a special case.
基於卷積網的dense prediction 近期的一些工作已經將卷積網絡應用於dense prediction問題,包括Ning等人、Farabet等人以及Pinheiro和Collobert的語義分割;Ciresan等人針對電子顯微鏡圖像以及Ganin和Lempitsky用混合卷積網/最近鄰模型針對自然圖像的邊界預測;還有Eigen等人的圖像修復和深度估計。這些方法的共同點包括如下:
- 限制容量和接收域的小模型
- patchwise訓練
- 超像素投影、隨機場正則化、濾波或局部分類的后期處理
- OverFeat引入的、用於dense輸出的輸入移位和輸出交錯
- 多尺度金字塔處理
- 飽和雙曲線正切非線性
- 集成
而我們的方法不需要這些機制。不過我們從FCN的角度研究了patchwise訓練(3.4節)和“shift-and-stitch”dense輸出(3.2節)。我們也討論了網絡內上采樣(3.3節),其中Eigen等人[6]的全連接預測是一個特例。
Unlike these existing methods, we adapt and extend deep classification architectures, using image classification as supervised pre-training, and fine-tune fully convolutionally to learn simply and efficiently from whole image inputs and whole image ground truths.
和這些現有的方法不同,我們改編並擴展了深度分類架構,使用圖像分類作為有監督的預訓練,並以全卷積方式進行微調,從整幅圖像的輸入和整幅圖像的ground truth(用於有監督訓練的已標注好的分割圖片)中簡單且高效地學習。
Hariharan et al. [16] and Gupta et al. [14] likewise adapt deep classification nets to semantic segmentation, but do so in hybrid proposal-classifier models. These approaches fine-tune an R-CNN system [12] by sampling bounding boxes and/or region proposals for detection, semantic segmentation, and instance segmentation. Neither method is learned end-to-end.
Hariharan等人和Gupta等人同樣將深度分類網絡改編用於語義分割,但都是在混合proposal-classifier模型中進行的。這些方法通過對邊界框和/或region proposals進行采樣來微調一個R-CNN系統,用於檢測、語義分割和實例分割。這兩種方法都不能進行端到端的學習。
They achieve state-of-the-art results on PASCAL VOC segmentation and NYUDv2 segmentation respectively, so we directly compare our standalone, end-to-end FCN to their semantic segmentation results in Section 5.
他們分別在PASCAL VOC分割和NYUDv2分割上取得了最先進的結果,所以在第5節中我們直接將我們獨立的、端到端的FCN和他們的語義分割結果進行比較。我們跨層融合特征,定義了一種非線性的由局部到全局的表述,並對其進行端到端的調優。在同期的工作中,Hariharan等人也在他們用於語義分割的混合模型中使用了多層特征。
3. Fully convolutional networks
Each layer of data in a convnet is a three-dimensional array of size h×w×d, where h and w are spatial dimensions, and d is the feature or channel dimension. The first layer is the image, with pixel size h×w, and d color channels. Locations in higher layers correspond to the locations in the image they are path-connected to, which are called their receptive fields.
卷積網絡的每層數據是一個h×w×d的三維數組,其中h和w是空間維度,d是特征或通道維數。第一層是圖像,其像素尺寸為h×w,顏色通道數為d。較高層中的位置和圖像中與它們通路相連的位置相對應,這些位置被稱為它們的感受野。
Convnets are built on translation invariance. Their basic components (convolution, pooling, and activation functions) operate on local input regions, and depend only on relative spatial coordinates. Writing $x_{ij}$ for the data vector at location $(i, j)$ in a particular layer, and $y_{ij}$ for the following layer, these functions compute outputs $y_{ij}$ by

$$y_{ij} = f_{ks}\left(\{x_{si+\delta i,\, sj+\delta j}\}_{0 \le \delta i,\, \delta j \le k}\right)$$

where $k$ is called the kernel size, $s$ is the stride or subsampling factor, and $f_{ks}$ determines the layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an elementwise nonlinearity for an activation function, and so on for other types of layers.
卷積網絡以平移不變性作為基礎。其基本組成部分(卷積、池化和激活函數)作用在局部輸入區域上,並且只依賴相對的空間坐標。在特定層中,記 $x_{ij}$ 為坐標(i, j)處的數據向量,其下一層對應 $y_{ij}$,則這些函數按上面的公式計算輸出 $y_{ij}$。
其中k稱為卷積核尺寸,s是步長或下采樣因子,\(f_{ks}\)決定了層的類型:對卷積或平均池化而言是矩陣乘法,對最大池化而言是取空間最大值,對激活函數而言是逐元素的非線性,對其他類型的層依此類推。
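As a concrete illustration of the layer equation above, the following NumPy sketch applies a generic local operation over k×k windows with stride s (using the conventional window indexing 0 ≤ δi, δj < k). The helper name `layer_output` and the max-pooling example are illustrative assumptions, not code from the paper.

```python
import numpy as np

def layer_output(x, f_ks, k, s):
    """Apply a generic local function f_ks over k x k windows with stride s.
    x: (H, W, d) input array; f_ks maps a (k, k, d) window to an output vector."""
    H, W = x.shape[:2]
    out_h = (H - k) // s + 1
    out_w = (W - k) // s + 1
    rows = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = x[s * i: s * i + k, s * j: s * j + k]
            row.append(f_ks(window))
        rows.append(row)
    return np.array(rows)

# Example: 2x2 max pooling with stride 2 on a random 6x6x3 feature map.
x = np.random.rand(6, 6, 3)
y = layer_output(x, lambda w: w.max(axis=(0, 1)), k=2, s=2)
print(y.shape)  # (3, 3, 3)
```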
This functional form is maintained under composition, with kernel size and stride obeying the transformation rule

$$f_{ks} \circ g_{k's'} = (f \circ g)_{k' + (k-1)s',\; ss'}$$
While a general deep net computes a general nonlinear function, a net with only layers of this form computes a nonlinear filter, which we call a deep filter or fully convolutional network. An FCN naturally operates on an input of any size, and produces an output of corresponding (possibly resampled) spatial dimensions.
這種函數形式在復合(composition)下保持不變,卷積核尺寸和步長遵從上面的轉換規則。
一個一般的深度網絡計算的是一般的非線性函數,而一個只由這種形式的層組成的網絡計算的是非線性濾波器,我們稱之為深度濾波器或全卷積網絡。FCN自然地可以處理任意尺寸的輸入,並產生相應(可能經過重采樣的)空間維度的輸出。
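A small sketch of the transformation rule above: composing two layers multiplies their strides and grows the effective kernel, which is the usual receptive-field arithmetic. The helper `compose` is a hypothetical name used only for illustration.

```python
def compose(layer1, layer2):
    """Effective (kernel, stride) of layer2 applied on top of layer1,
    following f_ks ∘ g_k's' = (f ∘ g)_{k' + (k-1)s', ss'}."""
    k1, s1 = layer1   # g: applied first (kernel k', stride s')
    k2, s2 = layer2   # f: applied second (kernel k, stride s)
    return (k1 + (k2 - 1) * s1, s1 * s2)

# Two 3x3 stride-1 convs behave like one 5x5 stride-1 filter;
# adding a 2x2 stride-2 pool on top gives an effective kernel 6, stride 2.
print(compose((3, 1), (3, 1)))                    # (5, 1)
print(compose(compose((3, 1), (3, 1)), (2, 2)))   # (6, 2)
```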
A real-valued loss function composed with an FCN defines a task. If the loss function is a sum over the spatial dimensions of the final layer, $\ell(x;\theta)=\Sigma_{ij}\ell'(x_{ij}; \theta)$, its gradient will be a sum over the gradients of each of its spatial components. Thus stochastic gradient descent on $\ell$ computed on whole images will be the same as stochastic gradient descent on $\ell'$, taking all of the final layer receptive fields as a minibatch.
一個與FCN復合的實值損失函數定義了一個任務。如果損失函數是對最后一層空間維度的求和,即 $\ell(x;\theta)=\Sigma_{ij}\ell'(x_{ij}; \theta)$,那么它的梯度就是各個空間分量梯度的和。因此,在整幅圖像上計算的關於\(\ell\)的隨機梯度下降,與把最后一層的所有感受野作為一個minibatch(分批處理)、關於\(\ell'\)的隨機梯度下降是相同的。
When these receptive fields overlap significantly, both feedforward computation and backpropagation are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.
在這些感受野重疊很大的情況下,逐層在整幅圖像上計算的前饋計算和反向傳播都比獨立地逐patch計算高效得多。
We next explain how to convert classification nets into fully convolutional nets that produce coarse output maps. For pixelwise prediction, we need to connect these coarse outputs back to the pixels. Section 3.2 describes a trick that OverFeat [29] introduced for this purpose. We gain insight into this trick by reinterpreting it as an equivalent network modification. As an efficient, effective alternative, we introduce deconvolution layers for upsampling in Section 3.3. In Section 3.4 we consider training by patchwise sampling, and give evidence in Section 4.3 that our whole image training is faster and equally effective.
我們接下來將解釋怎么把分類網絡轉換成能產生粗糙輸出圖的全卷積網絡。對於像素級預測,我們需要把這些粗糙的輸出連接回像素。3.2節描述了OverFeat為此引入的一種技巧。我們通過將它重新解釋為一種等價的網絡修改而獲得了對這個技巧的一些認識。作為一種高效且有效的替代方案,我們在3.3節引入了用於上采樣的去卷積層。在3.4節,我們考慮通過patchwise采樣進行訓練,並在4.3節給出證據表明我們的整圖訓練更快且同樣有效。
3.1. Adapting classifiers for dense prediction
Typical recognition nets, including LeNet [21], AlexNet [19], and its deeper successors [31, 32], ostensibly take fixed-sized inputs and produce nonspatial outputs. The fully connected layers of these nets have fixed dimensions and throw away spatial coordinates. However, these fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions. Doing so casts them into fully convolutional networks that take input of any size and output classification maps. This transformation is illustrated in Figure 2. (By contrast, nonconvolutional nets, such as the one by Le et al. [20], lack this capability.)

Figure 2. Transforming fully connected layers into convolution layers enables a classification net to output a heatmap. Adding layers and a spatial loss (as in Figure 1) produces an efficient machine for end-to-end dense learning.
典型的識別網絡,包括LeNet、AlexNet及其更深的后繼者,表面上只接受固定尺寸的輸入並產生非空間的輸出。這些網絡的全連接層有固定的維數並且丟棄了空間坐標。然而,這些全連接層也可以被看作是核覆蓋其整個輸入區域的卷積。這樣做就把它們轉化成了可以接受任意尺寸輸入並輸出分類圖的全卷積網絡。這種轉換如圖2所示。(相比之下,諸如Le等人提出的非卷積網絡則缺乏這種能力。)
圖2 將全連接層轉化到卷積層能使一個分類網絡輸出heatmap(熱圖)。添加層和一個空間損失(如圖一所示)產生了一個高效的端到端的dense學習機制
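A minimal PyTorch sketch of the fully-connected-to-convolution conversion described above, assuming a VGG-style fc6 layer over a 7×7×512 window. The layer names and sizes are illustrative assumptions rather than the paper's exact Caffe implementation.

```python
import torch
import torch.nn as nn

# Hypothetical fc6 of a VGG-style net: 4096 outputs over a flattened 512x7x7 window.
fc6 = nn.Linear(512 * 7 * 7, 4096)

# Equivalent convolution whose kernel covers the entire 7x7 input region.
conv6 = nn.Conv2d(512, 4096, kernel_size=7)
conv6.weight.data.copy_(fc6.weight.data.view(4096, 512, 7, 7))
conv6.bias.data.copy_(fc6.bias.data)

# On a 7x7 feature map both give the same 4096-d output ...
feat = torch.randn(1, 512, 7, 7)
assert torch.allclose(conv6(feat).flatten(), fc6(feat.flatten(1)).flatten(), atol=1e-4)

# ... but conv6 also accepts larger inputs and then yields a spatial grid of scores.
bigger = torch.randn(1, 512, 16, 16)
print(conv6(bigger).shape)  # torch.Size([1, 4096, 10, 10]) -- a coarse output map
```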
Furthermore, while the resulting maps are equivalent to the evaluation of the original net on particular input patches, the computation is highly amortized over the overlapping regions of those patches. For example, while AlexNet takes 1.2 ms (on a typical GPU) to produce the classification scores of a 227 × 227 image, the fully convolutional version takes 22 ms to produce a 10×10 grid of outputs from a 500×500 image, which is more than 5 times faster than the naive approach [1].
此外,雖然得到的輸出圖等價於原始網絡在特定輸入patches上的評估,但計算在這些patches的重疊區域上被高度攤銷。例如,AlexNet花費1.2ms(在典型的GPU上)產生一幅227×227圖像的分類得分,而全卷積版本花費22ms從一幅500×500的圖像上產生一個10×10的輸出網格,比朴素方法快5倍多。
The spatial output maps of these convolutionalized models make them a natural choice for dense problems like semantic segmentation. With ground truth available at every output cell, both the forward and backward passes are straightforward, and both take advantage of the inherent computational efficiency (and aggressive optimization) of convolution.
這些卷積化模型的空間輸出圖使它們成為語義分割這類dense問題的自然選擇。由於每個輸出單元都有ground truth可用,前向和反向傳播都很直接,並且都利用了卷積固有的計算效率(以及積極的優化)。
The corresponding backward times for the AlexNet example are 2.4 ms for a single image and 37 ms for a fully convolutional 10 × 10 output map, resulting in a speedup similar to that of the forward pass. This dense backpropagation is illustrated in Figure 1.
對於AlexNet的例子,相應的反向傳播時間為:單張圖像2.4ms,全卷積的10×10輸出圖為37ms,得到了和前向傳播相近的加速。這種密集的反向傳播如圖1所示。
While our reinterpretation of classification nets as fully convolutional yields output maps for inputs of any size, the output dimensions are typically reduced by subsampling. The classification nets subsample to keep filters small and computational requirements reasonable. This coarsens the output of a fully convolutional version of these nets, reducing it from the size of the input by a factor equal to the pixel stride of the receptive fields of the output units.
雖然我們把分類網絡重新解釋為全卷積網絡后,可以對任意尺寸的輸入產生輸出圖,但輸出的維度通常會因為下采樣而減小。分類網絡使用下采樣來保持濾波器較小並使計算需求合理。這使這些網絡的全卷積版本的輸出變得粗糙:輸出相對輸入的尺寸縮小了一個因子,它等於輸出單元感受野的像素步長。
3.2. Shift-and-stitch is filter rarefaction
Input shifting and output interlacing is a trick that yields dense predictions from coarse outputs without interpolation, introduced by OverFeat [29]. If the outputs are downsampled by a factor of f, the input is shifted (by left and top padding) x pixels to the right and y pixels down, once for every value of (x, y) ∈ {0,...,f −1}×{0,...,f −1}. These $f^2$ inputs are each run through the convnet, and the outputs are interlaced so that the predictions correspond to the pixels at the centers of their receptive fields.
輸入移位和輸出交錯是OverFeat引入的一種技巧,它不需要插值就能從粗糙輸出產生dense prediction。如果輸出以因子f被下采樣,則將輸入(通過在左側和頂部填充)向右平移x個像素、向下平移y個像素,對每一個(x, y) ∈ {0, ..., f−1}×{0, ..., f−1}各做一次。這f^2個輸入分別經過卷積網絡前向計算,並把輸出交錯排列,使得預測與其感受野中心的像素相對應。
Changing only the filters and layer strides of a convnet can produce the same output as this shift-and-stitch trick. Consider a layer (convolution or pooling) with input stride s, and a following convolution layer with filter weights $f_{ij}$ (eliding the feature dimensions, irrelevant here). Setting the lower layer’s input stride to 1 upsamples its output by a factor of s, just like shift-and-stitch. However, convolving the original filter with the upsampled output does not produce the same result as the trick, because the original filter only sees a reduced portion of its (now upsampled) input. To reproduce the trick, rarefy the filter by enlarging it as

$$f'_{ij} = \begin{cases} f_{i/s,\, j/s} & \text{if } s \text{ divides both } i \text{ and } j, \\ 0 & \text{otherwise} \end{cases}$$
(with i and j zero-based). Reproducing the full net output of the trick involves repeating this filter enlargement layer-by-layer until all subsampling is removed.
只改變卷積網絡的濾波器和層的步長,就能產生和這種shift-and-stitch技巧相同的輸出。考慮一個輸入步長為s的層(卷積或池化),以及其后一個濾波權重為f_ij的卷積層(忽略此處無關的特征維數)。把較低層的輸入步長設為1,會把它的輸出上采樣s倍,就像shift-and-stitch一樣。然而,將原始濾波器與上采樣后的輸出做卷積並不能產生和該技巧相同的結果,因為原始濾波器只能看到其(現已上采樣的)輸入中縮減了的一部分。為了重現這種技巧,將濾波器按上式放大來進行稀疏化。
(其中i和j從0開始計數;當s能整除i和j時取f_{i/s, j/s},否則為0。)重現該技巧的完整網絡輸出需要逐層重復這種濾波器放大,直到所有的下采樣都被移除。(在實踐中,處理上采樣輸入的下采樣版本可能會更高效。)
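A NumPy sketch of the filter rarefaction (à trous style) rule stated above, under the assumption of a square 2-D filter; `rarefy_filter` is an illustrative helper name, not code from the paper.

```python
import numpy as np

def rarefy_filter(f, s):
    """Enlarge a (k x k) filter f for an input upsampled by factor s:
    keep f[i, j] at positions (s*i, s*j) and fill zeros elsewhere."""
    k = f.shape[0]
    k_new = s * (k - 1) + 1          # enlarged kernel size
    f_new = np.zeros((k_new, k_new), dtype=f.dtype)
    f_new[::s, ::s] = f
    return f_new

f = np.arange(9, dtype=float).reshape(3, 3)
print(rarefy_filter(f, s=2))  # 5x5 filter with the original weights spaced out by zeros
```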
Simply decreasing subsampling within a net is a tradeoff: the filters see finer information, but have smaller receptive fields and take longer to compute. We have seen that the shift-and-stitch trick is another kind of tradeoff: the output is made denser without decreasing the receptive field sizes of the filters, but the filters are prohibited from accessing information at a finer scale than their original design.
在網絡內減少下采樣是一種折衷:濾波器能看到更精細的信息,但感受野更小而且計算耗時更長。shift-and-stitch技巧是另外一種折衷:輸出更加密集且沒有減小濾波器的感受野,但是濾波器被禁止以比其原始設計更精細的尺度獲取信息。
Although we have done preliminary experiments with shift-and-stitch, we do not use it in our model. We find learning through upsampling, as described in the next section, to be more effective and efficient, especially when combined with the skip layer fusion described later on.
盡管我們已經利用這個技巧做了初步的實驗,但是我們沒有在我們的模型中使用它。正如在下一節中描述的,我們發現從上采樣中學習更有效和高效,特別是接下來要描述的結合了跨層融合。
3.3. Upsampling is backwards strided convolution
Another way to connect coarse outputs to dense pixels is interpolation. For instance, simple bilinear interpolation computes each output \(y_{ij}\) from the nearest four inputs by a linear map that depends only on the relative positions of the input and output cells.
另一種將粗糙輸出連接到dense像素的方法是插值。比如,簡單的雙線性插值通過一個只依賴於輸入和輸出單元相對位置的線性映射,從最近的四個輸入計算每個輸出\(y_{ij}\)。
In a sense, upsampling with factor f is convolution with a fractional input stride of 1/f. So long as f is integral, a natural way to upsample is therefore backwards convolution (sometimes called deconvolution) with an output stride of f. Such an operation is trivial to implement, since it simply reverses the forward and backward passes of convolution.Thus upsampling is performed in-network for end-to-end learning by backpropagation from the pixelwise loss.
從某種意義上說,因子為f的上采樣就是輸入步長為1/f的分數步長卷積。只要f是整數,一種自然的上采樣方法就是輸出步長為f的反向卷積(有時稱為去卷積)。這樣的操作實現起來很簡單,因為它只是把卷積的前向和反向傳播顛倒過來。因此,上采樣可以在網絡內,通過基於像素級損失的反向傳播進行端到端的學習。
Note that the deconvolution filter in such a layer need not be fixed (e.g., to bilinear upsampling), but can be learned. A stack of deconvolution layers and activation functions can even learn a nonlinear upsampling.
需要注意的是,這種層中的去卷積濾波器不必是固定的(比如固定為雙線性上采樣),而是可以學習的。一組去卷積層和激活函數疊加起來甚至能學習一種非線性的上采樣。
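A sketch of initializing a learnable transposed convolution to bilinear upsampling, written as an assumed PyTorch equivalent in the spirit of common FCN re-implementations (channel-wise identity weights, kernel size 2f − f mod 2 for factor f); the helper name is illustrative.

```python
import numpy as np
import torch
import torch.nn as nn

def bilinear_kernel(channels, factor):
    """Weights that make a transposed convolution perform bilinear upsampling."""
    k = 2 * factor - factor % 2
    center = (k - 1) / 2 if k % 2 == 1 else factor - 0.5
    og = np.ogrid[:k, :k]
    filt = (1 - abs(og[0] - center) / factor) * (1 - abs(og[1] - center) / factor)
    weight = np.zeros((channels, channels, k, k), dtype=np.float32)
    weight[range(channels), range(channels)] = filt   # one bilinear filter per channel
    return torch.from_numpy(weight)

# 2x upsampling of 21 score channels; the weights may stay fixed or be learned further.
up2 = nn.ConvTranspose2d(21, 21, kernel_size=4, stride=2, bias=False)
up2.weight.data.copy_(bilinear_kernel(21, factor=2))
```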
In our experiments, we find that in-network upsampling is fast and effective for learning dense prediction. Our best segmentation architecture uses these layers to learn to upsample for refined prediction in Section 4.2.
在我們的實驗中,我們發現網絡內的上采樣對於學習dense prediction是快速且有效的。我們最好的分割架構利用這些層來學習上采樣,以精細化預測,見4.2節。
3.4. Patchwise training is loss sampling
In stochastic optimization, gradient computation is driven by the training distribution. Both patchwise training and fully-convolutional training can be made to produce any distribution, although their relative computational efficiency depends on overlap and minibatch size. Whole image fully convolutional training is identical to patchwise training where each batch consists of all the receptive fields of the units below the loss for an image (or collection of images). While this is more efficient than uniform sampling of patches,it reduces the number of possible batches. However, random selection of patches within an image may be recovered simply. Restricting the loss to a randomly sampled subset of its spatial terms (or, equivalently applying a DropConnect mask [36] between the output and the loss) excludes patches from the gradient computation.
在隨機優化中,梯度計算是由訓練分布驅動的。patchwise訓練和全卷積訓練都能產生任意分布,盡管它們的相對計算效率取決於重疊程度和minibatch的大小。當每個批次由一幅圖像(或一組圖像)損失之下所有單元的感受野組成時,整幅圖像的全卷積訓練等同於patchwise訓練。雖然這比對patches均勻采樣更高效,但它減少了可能的批次數量。不過,在一張圖片中隨機選擇patches可以簡單地恢復。把損失限制在其空間項的一個隨機采樣子集上(或者等價地,在輸出和損失之間應用DropConnect mask [36]),就把一些patches排除在梯度計算之外。
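A hedged PyTorch sketch of the loss-sampling idea just described (a DropConnect-style mask between the output and the loss), where each final-layer cell is kept with probability p; the function name, tensor shapes, and the batch-size handling are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def sampled_pixel_loss(scores, labels, p=0.5, ignore_index=255):
    """Per-pixel cross-entropy where each final-layer cell is ignored with
    probability 1 - p; elsewhere the batch would be scaled by 1/p."""
    loss = F.cross_entropy(scores, labels, ignore_index=ignore_index, reduction="none")
    keep = (torch.rand_like(loss) < p).float()          # random spatial mask
    return (loss * keep).sum() / keep.sum().clamp(min=1)

# scores: (N, 21, H, W) coarse class scores, labels: (N, H, W) ground truth
scores = torch.randn(2, 21, 32, 32, requires_grad=True)
labels = torch.randint(0, 21, (2, 32, 32))
print(sampled_pixel_loss(scores, labels, p=0.5))
```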
If the kept patches still have significant overlap, fully convolutional computation will still speed up training. If gradients are accumulated over multiple backward passes, batches can include patches from several images.2
如果保留下來的patches依然有顯著的重疊,全卷積計算依然會加速訓練。如果梯度在多次反向傳播中累積,批次就能包含來自多張圖像的patches。
Sampling in patchwise training can correct class imbalance [27, 8, 2] and mitigate the spatial correlation of dense patches [28, 16]. In fully convolutional training, class balance can also be achieved by weighting the loss, and loss sampling can be used to address spatial correlation.
patchwise訓練中的采樣能糾正類別不平衡,並減輕密集patches的空間相關性。在全卷積訓練中,類別平衡也能通過給損失加權實現,而對損失采樣能被用來處理空間相關性。
We explore training with sampling in Section 4.3, and do not find that it yields faster or better convergence for dense prediction. Whole image training is effective and efficient.
我們研究了4.3節中的伴有采樣的訓練,沒有發現對於dense prediction它有更快或是更好的收斂效果。全圖式訓練是有效且高效的。
4. Segmentation Architecture
We cast ILSVRC classifiers into FCNs and augment them for dense prediction with in-network upsampling and a pixelwise loss. We train for segmentation by fine-tuning. Next, we build a novel skip architecture that combines coarse, semantic and local, appearance information to refine prediction.
我們把ILSVRC分類器改造成FCN,並通過網絡內上采樣和像素級損失對它們進行擴展,以用於dense prediction。我們通過微調來訓練分割。接下來,我們構建了一種新穎的跳躍式(skip)架構,結合粗糙的語義信息和局部的表觀信息來精細化預測。
For this investigation, we train and validate on the PASCAL VOC 2011 segmentation challenge [7]. We train with a per-pixel multinomial logistic loss and validate with the standard metric of mean pixel intersection over union, with the mean taken over all classes, including background. The training ignores pixels that are masked out (as ambiguous or difficult) in the ground truth.
為此,我們在PASCAL VOC 2011分割挑戰賽上進行訓練和驗證。我們用逐像素的多項式邏輯損失進行訓練,並用平均像素交並比(mean IU)這一標准度量進行驗證,其均值取自包括背景在內的所有類別。訓練忽略了在ground truth中被掩蓋(標注為模糊或困難)的像素。
4.1. From classifier to dense FCN
We begin by convolutionalizing proven classification architectures as in Section 3. We consider the AlexNet3 architecture [19] that won ILSVRC12, as well as the VGG nets [31] and the GoogLeNet4 [32] which did exceptionally well in ILSVRC14. We pick the VGG 16-layer net5, which we found to be equivalent to the 19-layer net on this task. For GoogLeNet, we use only the final loss layer, and improve performance by discarding the final average pooling layer. We decapitate each net by discarding the final classifier layer, and convert all fully connected layers to convolutions. We append a 1 × 1 convolution with channel dimension 21 to predict scores for each of the PASCAL classes (including background) at each of the coarse output locations, followed by a deconvolution layer to bilinearly upsample the coarse outputs to pixel-dense outputs as described in Section 3.3. Table 1 compares the preliminary validation results along with the basic characteristics of each net. We report the best results achieved after convergence at a fixed learning rate (at least 175 epochs).

我們首先按第3節的做法,把經過驗證的分類架構卷積化。我們考慮贏得了ILSVRC12的AlexNet架構,以及在ILSVRC14中表現格外出色的VGG網絡和GoogLeNet。我們選擇VGG的16層網絡,發現它和19層網絡在這項任務上相當。對於GoogLeNet,我們僅使用最后的損失層,並通過丟棄最后的平均池化層提高了表現。我們通過丟棄最后的分類器層來截斷每個網絡,然后把所有全連接層轉化成卷積層。我們附加一個1×1的、通道維數為21的卷積,在每個粗糙輸出位置上預測每個PASCAL類別(包括背景)的得分,后面緊跟一個去卷積層,把粗糙輸出雙線性上采樣成像素密集的輸出,如3.3節所述。表1比較了初步的驗證結果以及每個網絡的基本特性。我們報告在固定學習率下收斂后取得的最好結果(至少175個epoch)。
(表1 我們改造並擴展了分類卷積網絡,比較它們在PASCAL VOC 2011驗證集子集上的平均IU和推理時間(在NVIDIA Tesla K40c上對500×500輸入取20次試驗的平均)。我們還列出了改造后用於dense prediction的網絡結構:參數層的數量、輸出單元感受野的大小以及網絡內最粗糙的步長。(這些數字是在固定學習率下取得的最好表現,而非可能的最好表現。))
Fine-tuning from classification to segmentation gave reasonable predictions for each net. Even the worst model achieved ∼ 75% of state-of-the-art performance. The segmentation-equipped VGG net (FCN-VGG16) already appears to be state-of-the-art at 56.0 mean IU on val, compared to 52.6 on test [16]. Training on extra data raises performance to 59.4 mean IU on a subset of val7. Training details are given in Section 4.3.
從分類到分割的微調給每個網絡都帶來了合理的預測。即使最差的模型也達到了最先進性能的約75%。用於分割的VGG網絡(FCN-VGG16)在val上已經以56.0的平均IU達到了最先進水平,相比之下此前方法在test上為52.6 [16]。在額外數據上訓練將在val子集上的表現提高到59.4平均IU。訓練細節見4.3節。
Despite similar classification accuracy, our implementation of GoogLeNet did not match this segmentation result.
盡管分類准確率相近,我們實現的GoogLeNet並沒有達到這樣的分割結果。
4.2. Combining what and where
We define a new fully convolutional net (FCN) for segmentation that combines layers of the feature hierarchy and refines the spatial precision of the output. See Figure 3.

我們為分割定義了一種新的全卷積網絡(FCN),它結合特征層級中的多個層,並精細化輸出的空間精度,見圖3。
(圖3 我們的DAG(有向無環圖)網絡學習將粗的高層信息和細的底層信息結合。池化和預測層顯示為網格,以表現其相對的空間粗糙程度,而中間層顯示為豎線。第一行(FCN-32s):我們的單流網絡,如4.1節所述,在一步之內把步長為32的預測上采樣回像素。第二行(FCN-16s):結合最后一層和pool4層的預測,步長為16,讓我們的網絡預測出更精細的細節,同時保留高層的語義信息。第三行(FCN-8s):來自pool3的附加預測,步長為8,精度進一步提高。)
While fully convolutionalized classifiers can be fine-tuned to segmentation as shown in 4.1, and even score highly on the standard metric, their output is dissatisfyingly coarse (see Figure 4). The 32 pixel stride at the final prediction layer limits the scale of detail in the upsampled output.

雖然全卷積化的分類器可以如4.1節所示被微調用於分割,甚至在標准度量上得分很高,但它們的輸出卻粗糙得令人不滿意(見圖4)。最后預測層的32像素步長限制了上采樣輸出中細節的尺度。
We address this by adding links that combine the final prediction layer with lower layers with finer strides. This turns a line topology into a DAG, with edges that skip ahead from lower layers to higher ones (Figure 3). As they see fewer pixels, the finer scale predictions should need fewer layers, so it makes sense to make them from shallower net outputs. Combining fine layers and coarse layers lets the model make local predictions that respect global structure. By analogy to the multiscale local jet of Florack et al. [10], we call our nonlinear local feature hierarchy the deep jet.
我們通過增加跨層連接來解決這個問題,把最后的預測層和步長更精細的較低層結合起來。這把線狀拓撲變成了DAG(有向無環圖),其邊從較低層向前跳躍到較高層(圖3)。因為更精細尺度的預測只看到更少的像素,它們應該需要更少的層,所以從較淺的網絡輸出來產生它們是合理的。結合精細層和粗糙層讓模型能做出遵從全局結構的局部預測。類比於Florack等人的多尺度局部jet,我們把這種非線性的局部特征層級稱為deep jet。
We first divide the output stride in half by predicting from a 16 pixel stride layer. We add a 1 × 1 convolution layer on top of pool4 to produce additional class predictions. We fuse this output with the predictions computed on top of conv7 (convolutionalized fc7) at stride 32 by adding a 2× upsampling layer and summing6 both predictions. (See Figure 3). We initialize the 2× upsampling to bilinear interpolation, but allow the parameters to be learned as described in Section 3.3. Finally, the stride 16 predictions are upsampled back to the image. We call this net FCN-16s. FCN-16s is learned end-to-end, initialized with the parameters of the last, coarser net, which we now call FCN-32s. The new parameters acting on pool4 are zero-initialized so that the net starts with unmodified predictions. The learning rate is decreased by a factor of 100.
我們首先通過從一個步長為16像素的層進行預測,把輸出步長減半。我們在pool4之上增加一個1×1的卷積層來產生附加的類別預測。我們通過增加一個2×上采樣層並將兩個預測相加,把這個輸出和在conv7(卷積化的fc7)之上以步長32計算出的預測融合起來(見圖3)。我們把這個2×上采樣初始化為雙線性插值,但允許參數被學習,如3.3節所述。最后,步長為16的預測被上采樣回圖像,我們把這種網絡稱為FCN-16s。FCN-16s進行端到端學習,用上一個更粗糙的網絡(我們現在稱之為FCN-32s)的參數初始化。作用在pool4上的新參數被初始化為0,所以網絡是從未經修改的預測開始的。學習率降低了100倍。
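A schematic PyTorch sketch of the FCN-16s fusion just described; the layer names, padding, and crude cropping are assumptions (the original Caffe implementation handles offsets with explicit crop layers), so this is only an illustration of the skip-and-sum idea.

```python
import torch
import torch.nn as nn

score_conv7 = nn.Conv2d(4096, 21, 1)            # scores from conv7, stride 32
score_pool4 = nn.Conv2d(512, 21, 1)             # extra 1x1 scores on pool4
nn.init.zeros_(score_pool4.weight); nn.init.zeros_(score_pool4.bias)  # zero-initialized
up2 = nn.ConvTranspose2d(21, 21, 4, stride=2, bias=False)    # 2x, bilinear-initialized
up16 = nn.ConvTranspose2d(21, 21, 32, stride=16, bias=False) # back toward image resolution

def fcn16s_head(conv7_feat, pool4_feat):
    s32 = up2(score_conv7(conv7_feat))          # upsample stride-32 scores by 2
    s16 = score_pool4(pool4_feat)               # stride-16 scores
    h = min(s32.shape[2], s16.shape[2])
    w = min(s32.shape[3], s16.shape[3])
    fused = s32[..., :h, :w] + s16[..., :h, :w] # crude crop + sum to fuse predictions
    return up16(fused)
```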
Learning this skip net improves performance on the validation set by 3.0 mean IU to 62.4. Figure 4 shows improvement in the fine structure of the output. We compared this fusion with learning only from the pool4 layer (which resulted in poor performance), and simply decreasing the learning rate without adding the extra link (which results in an insignificant performance improvement, without improving the quality of the output).
學習這種跳躍網絡使驗證集上的表現提高了3.0平均IU,達到62.4。圖4展示了輸出在精細結構上的提高。我們把這種融合與僅從pool4層學習(結果表現糟糕)進行了比較,也與只降低學習率而不增加額外跨層連接(結果表現提升不顯著,且輸出質量沒有改善)進行了比較。
We continue in this fashion by fusing predictions from pool3 with a 2× upsampling of predictions fused from pool4 and conv7, building the net FCN-8s. We obtain a minor additional improvement to 62.7 mean IU, and find a slight improvement in the smoothness and detail of our output. At this point our fusion improvements have met diminishing returns, both with respect to the IU metric which emphasizes large-scale correctness, and also in terms of the improvement visible e.g. in Figure 4, so we do not continue fusing even lower layers.
我們以同樣的方式繼續,把來自pool3的預測和對pool4與conv7融合后的預測做2×上采樣再融合,構建了FCN-8s網絡。我們在平均IU上獲得了一個較小的額外提升,達到62.7,並發現輸出在平滑度和細節上有輕微的改善。此時我們的融合提升已經出現收益遞減,無論是在強調大尺度正確性的IU度量方面,還是在可見的改善方面(如圖4所示),所以我們沒有繼續融合更低的層。
Refinement by other means Decreasing the stride of pooling layers is the most straightforward way to obtain finer predictions. However, doing so is problematic for our VGG16-based net. Setting the pool5 layer to have stride 1 requires our convolutionalized fc6 to have a kernel size of 14×14 in order to maintain its receptive field size. In addition to their computational cost, we had difficulty learning such large filters. We made an attempt to re-architect the layers above pool5 with smaller filters, but were not successful in achieving comparable performance; one possible explanation is that the initialization from ImageNet-trained weights in the upper layers is important.
通過其他方式精細化 減少池化層的步長是得到更精細預測的最直接方法。然而,這么做對我們基於VGG16的網絡是有問題的。把pool5層的步長設為1,要求我們卷積化的fc6的卷積核尺寸為14×14,才能維持它的感受野大小。除了計算代價之外,學習如此大的濾波器也很困難。我們嘗試用更小的濾波器重新構造pool5之上的層,但沒有取得可比的表現;一個可能的解釋是,上層來自ImageNet訓練權重的初始化非常重要。
Another way to obtain finer predictions is to use the shift-and-stitch trick described in Section 3.2. In limited experiments, we found the cost to improvement ratio from this method to be worse than layer fusion.
另一種獲得更精細預測的方法是使用3.2節中描述的shift-and-stitch技巧。在有限的實驗中,我們發現這種方法的代價與提升之比要比層融合更差。
4.3. Experimental framework
Optimization We train by SGD with momentum. We use a minibatch size of 20 images and fixed learning rates of \(10^{-3}\), \(10^{-4}\) and \(5 \times 10^{-5}\) for FCN-AlexNet, FCN-VGG16, and FCN-GoogLeNet, respectively, chosen by line search. We use momentum 0.9, weight decay of \(5 \times 10^{-4}\) or \(2 \times 10^{-4}\), and doubled the learning rate for biases, although we found training to be insensitive to these parameters (but sensitive to the learning rate). We zero-initialize the class scoring convolution layer, finding random initialization to yield neither better performance nor faster convergence. Dropout was included where used in the original classifier nets.
**優化** 我們用帶動量(momentum)的SGD進行訓練。我們使用包含20張圖片的minibatch,並為FCN-AlexNet、FCN-VGG16和FCN-GoogLeNet分別固定學習率為\(10^{-3}\)、\(10^{-4}\)和\(5 \times 10^{-5}\),通過線搜索選擇。我們使用0.9的動量,\(5 \times 10^{-4}\)或\(2 \times 10^{-4}\)的權值衰減,並把偏置的學習率加倍,盡管我們發現訓練對這些參數不敏感(但對學習率敏感)。我們把類別得分卷積層初始化為零,發現隨機初始化既不能產生更好的表現也不能帶來更快的收斂。原始分類網絡中使用Dropout的地方我們也保留了Dropout。
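As a sketch of this schedule (assuming a PyTorch re-implementation rather than the original Caffe solver), parameter groups can double the bias learning rate for the FCN-VGG16 setting; dropping weight decay on biases is an extra assumption borrowed from common practice, not stated in the paper.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 3), nn.Conv2d(64, 21, 1))  # stand-in for an FCN

# FCN-VGG16 setting: lr 1e-4, momentum 0.9, weight decay 5e-4,
# with a doubled learning rate for biases.
biases = [p for n, p in model.named_parameters() if n.endswith("bias")]
weights = [p for n, p in model.named_parameters() if not n.endswith("bias")]
optimizer = torch.optim.SGD(
    [{"params": weights, "lr": 1e-4, "weight_decay": 5e-4},
     {"params": biases, "lr": 2e-4, "weight_decay": 0}],  # no decay on biases: assumption
    momentum=0.9)
```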
Fine-tuning We fine-tune all layers by backpropagation through the whole net. Fine-tuning the output classifier alone yields only 70% of the full finetuning performance as compared in Table 2. Training from scratch is not feasible considering the time required to learn the base classification nets. (Note that the VGG net is trained in stages, while we initialize from the full 16-layer version.) Fine-tuning takes three days on a single GPU for the coarse FCN-32s version, and about one day each to upgrade to the FCN-16s and FCN-8s versions.
微調 我們通過整個網絡的反向傳播來微調所有層。如表2所示,僅微調輸出分類器只能達到全微調表現的70%。考慮到學習基礎分類網絡所需的時間,從頭開始訓練是不可行的。(注意VGG網絡是分階段訓練的,而我們從完整的16層版本初始化。)對於粗糙的FCN-32s版本,在單個GPU上微調需要三天,升級到FCN-16s和FCN-8s版本各需要大約一天。
Patch Sampling As explained in Section 3.4, our full image training effectively batches each image into a regular grid of large, overlapping patches. By contrast, prior work randomly samples patches over a full dataset [27, 2, 8, 28, 11], potentially resulting in higher variance batches that may accelerate convergence [22]. We study this tradeoff by spatially sampling the loss in the manner described earlier, making an independent choice to ignore each final layer cell with some probability 1−p. To avoid changing the effective batch size, we simultaneously increase the number of images per batch by a factor 1/p. Note that due to the efficiency of convolution, this form of rejection sampling is still faster than patchwise training for large enough values of p (e.g., at least for p > 0.2 according to the numbers in Section 3.1). Figure 5 shows the effect of this form of sampling on convergence. We find that sampling does not have a significant effect on convergence rate compared to whole image training, but takes significantly more time due to the larger number of images that need to be considered per batch. We therefore choose unsampled, whole image training in our other experiments.
patch取樣 正如3.4節中解釋的,我們的整圖訓練實際上把每張圖片分批成由大的、重疊的patches組成的規則網格。相反,先前的工作在整個數據集上隨機采樣patches,這可能產生方差更高的批次,從而可能加速收斂。我們按前面描述的方式對損失進行空間采樣來研究這種折中:以1−p的概率獨立地選擇忽略每個最后層單元。為了避免改變有效的批次大小,我們同時以因子1/p增加每個批次的圖像數量。注意,由於卷積的高效性,在p足夠大時(比如根據3.1節的數字,至少p>0.2),這種拒絕采樣的形式依然比patchwise訓練要快。圖5展示了這種采樣形式對收斂的影響。我們發現,與整圖訓練相比,采樣對收斂速率沒有顯著影響,但由於每個批次需要考慮更多的圖像,明顯需要花費更多時間。因此我們在其他實驗中選擇不采樣的整圖訓練。
Class Balancing Fully convolutional training can balance classes by weighting or sampling the loss. Although our labels are mildly unbalanced (about 3/4 are background), we find class balancing unnecessary.
分類平衡 全卷積訓練能通過按權重或對損失采樣平衡類別。盡管我們的標簽有輕微的不平衡(大約3/4是背景),我們發現類別平衡不是必要的。
Dense Prediction The scores are upsampled to the input dimensions by deconvolution layers within the net. Final layer deconvolutional filters are fixed to bilinear interpolation, while intermediate upsampling layers are initialized to bilinear upsampling, and then learned. Shift-and-stitch (Section 3.2), or the filter rarefaction equivalent, are not used.
Augmentation We tried augmenting the training data by randomly mirroring and “jittering” the images by translating them up to 32 pixels(the coarsest scale of prediction) in each direction. This yielded no noticeable improvement.
數據增強 我們嘗試通過隨機鏡像和對圖像進行“抖動”(在每個方向上平移最多32個像素,即最粗糙的預測尺度)來增強訓練數據。這沒有帶來明顯的改善。
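A rough NumPy sketch of this augmentation (mirroring plus a translation of up to 32 pixels); the wrap-around `np.roll` stands in for proper padding/cropping and is only illustrative.

```python
import numpy as np

def augment(image, label, max_shift=32):
    """Random horizontal mirroring plus 'jittering' by a translation of up to
    max_shift pixels in each direction (label shifted identically)."""
    if np.random.rand() < 0.5:                          # mirror
        image, label = image[:, ::-1], label[:, ::-1]
    dy, dx = np.random.randint(-max_shift, max_shift + 1, size=2)
    # np.roll wraps around; a faithful version would pad and crop instead.
    image = np.roll(np.roll(image, dy, axis=0), dx, axis=1)
    label = np.roll(np.roll(label, dy, axis=0), dx, axis=1)
    return image, label
```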
More Training Data The PASCAL VOC 2011 segmentation challenge training set, which we used for Table 1, labels 1112 images. Hariharan et al. [15] have collected labels for a much larger set of 8498 PASCAL training images, which was used to train the previous state-of-the-art system, SDS. This training data improves the FCN-VGG16 validation score7 by 3.4 points to 59.4 mean IU.
**更多訓練數據** 我們用於表1的PASCAL VOC 2011分割挑戰賽訓練集標注了1112張圖像。Hariharan等人收集了一組更大的8498張PASCAL訓練圖像的標簽,曾用於訓練先前的最先進系統SDS。此訓練數據把FCN-VGG16的驗證分數7提高了3.4個點,達到59.4平均IU。
Implementation All models are trained and tested with Caffe [18] on a single NVIDIA Tesla K40c. The models and code will be released open-source on publication.
實現 所有的模型都是在一台NVIDIA Tesla K40c上用Caffe進行訓練和測試的。模型和代碼將在論文發表時開源。
5. Results
We test our FCN on semantic segmentation and scene parsing, exploring PASCAL VOC, NYUDv2, and SIFT Flow. Although these tasks have historically distinguished between objects and regions, we treat both uniformly as pixel prediction. We evaluate our FCN skip architecture8 on each of these datasets, and then extend it to multi-modal input for NYUDv2 and multi-task prediction for the semantic and geometric labels of SIFT Flow.
我們在語義分割和場景解析上測試我們的FCN,研究了PASCAL VOC、NYUDv2和SIFT Flow。盡管這些任務在歷史上區分物體和區域,我們把二者一律視為像素預測。我們在這些數據集上逐一評估我們的FCN跨層架構,然后針對NYUDv2把它擴展成多模態輸入,針對SIFT Flow的語義和幾何標簽把它擴展成多任務預測。
Metrics We report four metrics from common semantic segmentation and scene parsing evaluations that are variations on pixel accuracy and region intersection over union (IU). Let \(n_{ij}\) be the number of pixels of class \(i\) predicted to belong to class \(j\) , where there are \(n_{cl}\) different classes, and let $t_i = \Sigma_j n_{ij} $ be the total number of pixels of class \(i\). We compute:
- pixel accuracy: \(\Sigma_i n_{ii}/\Sigma_i t_i\)
- mean accuracy: \((1/n_{cl})\Sigma_i n_{ii}/t_i\)
- mean IU:\((1/n_{cl})\Sigma_in_{ii}/(t_i+\Sigma_j n_{ji}-n_{ii})\)
- frequency weighted IU:\((\Sigma_k t_k)^{-1}\Sigma_it_in_{ii}/(t_i+\Sigma_jn_{ji}-n_{ii})\)
度量 我們報告了來自常見語義分割和場景解析評估的四種度量,它們是像素准確率和區域交並比(IU)的不同變體。令\(n_{ij}\)為類別i中被預測為類別j的像素數量,共有\(n_{cl}\)個不同的類別,令 $t_i = \Sigma_j n_{ij} $ 為類別i的像素總數。我們計算(按這些定義計算度量的示意代碼見下面的代碼塊):
- 像素准確率: \(\Sigma_i n_{ii}/\Sigma_i t_i\)
- 平均准確率:\((1/n_{cl})\Sigma_i n_{ii}/t_i\)
- 平均 IU:\((1/n_{cl})\Sigma_in_{ii}/(t_i+\Sigma_j n_{ji}-n_{ii})\)
- 頻率加權 IU:\((\Sigma_k t_k)^{-1}\Sigma_it_in_{ii}/(t_i+\Sigma_jn_{ji}-n_{ii})\)
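A small NumPy sketch computing the four metrics above from a confusion matrix n (with n[i, j] as defined); the helper name and the toy matrix are illustrative.

```python
import numpy as np

def segmentation_metrics(n):
    """n[i, j] = number of pixels of class i predicted as class j (n_cl x n_cl)."""
    t = n.sum(axis=1)                       # t_i: total pixels of class i
    diag = np.diag(n)                       # n_ii: correctly predicted pixels
    iu = diag / (t + n.sum(axis=0) - diag)  # per-class intersection over union
    return {
        "pixel accuracy": diag.sum() / t.sum(),
        "mean accuracy": np.mean(diag / t),
        "mean IU": np.mean(iu),
        "frequency weighted IU": (t * iu).sum() / t.sum(),
    }

n = np.array([[50, 2, 3], [4, 40, 6], [1, 2, 30]], dtype=float)
print(segmentation_metrics(n))
```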
PASCAL VOC Table 3 gives the performance of our FCN-8s on the test sets of PASCAL VOC 2011 and 2012, and compares it to the previous state-of-the-art, SDS [16], and the well-known R-CNN [12]. We achieve the best results on mean IU9 by a relative margin of 20%. Inference time is reduced 114× (convnet only, ignoring proposals and refinement) or 286× (overall).
PASCAL VOC 表3給出了我們的FCN-8s在PASCAL VOC 2011和2012測試集上的表現,並把它和之前的最先進方法SDS [16]以及著名的R-CNN進行比較。我們在平均IU上以20%的相對優勢取得了最好的結果。推理時間降低了114×(只計卷積網絡,忽略proposals和細化)或286×(整體)。
NYUDv2 [30] is an RGB-D dataset collected using the Microsoft Kinect. It has 1449 RGB-D images, with pixelwise labels that have been coalesced into a 40 class semantic segmentation task by Gupta et al. [13]. We report results on the standard split of 795 training images and 654 testing images. (Note: all model selection is performed on PASCAL 2011 val.) Table 4 gives the performance of our model in several variations. First we train our unmodified coarse model (FCN-32s) on RGB images. To add depth information, we train on a model upgraded to take four-channel RGB-D input (early fusion). This provides little benefit,perhaps due to the difficultly of propagating meaningful gradients all the way through the model. Following the success of Gupta et al. [14], we try the three-dimensional HHA encoding of depth, training nets on just this information, as well as a “late fusion” of RGB and HHA where the predictions from both nets are summed at the final layer, and the resulting two-stream net is learned end-to-end. Finally we upgrade this late fusion net to a 16-stride version.
NYUDv2 是一個利用Microsoft Kinect收集的RGB-D數據集,含有1449張RGB-D圖像,其像素級標簽已被Gupta等人合並成一個40類別的語義分割任務。我們報告基於標准劃分的795張訓練圖片和654張測試圖片的結果。(注意:所有的模型選擇都在PASCAL 2011 val上進行。)表4給出了我們模型的幾種變體的表現。首先我們在RGB圖片上訓練未經修改的粗糙模型(FCN-32s)。為了加入深度信息,我們訓練升級后可以接受4通道RGB-D輸入的模型(早期融合)。這幾乎沒有帶來改善,也許是由於在整個模型中傳播有意義的梯度比較困難。沿襲Gupta等人的成功,我們嘗試深度的三維HHA編碼,只在這種信息上訓練網絡,同時也嘗試RGB和HHA的“后期融合”:兩個網絡的預測在最后一層相加,由此得到的雙流網絡進行端到端學習。最后我們把這個后期融合網絡升級到16步長的版本。
SIFT Flow is a dataset of 2,688 images with pixel labels for 33 semantic categories (“bridge”, “mountain”, “sun”),as well as three geometric categories (“horizontal”, “vertical”, and “sky”). An FCN can naturally learn a joint representation that simultaneously predicts both types of labels. We learn a two-headed version of FCN-16s with semantic and geometric prediction layers and losses. The learned model performs as well on both tasks as two independently trained models, while learning and inference are essentially as fast as each independent model by itself. The results in Table 5, computed on the standard split into 2,488 training
and 200 test images,10 show state-of-the-art performance on both tasks.
SIFT Flow是一個包含2688張圖片的數據集,帶有33個語義類別(“橋”、“山”、“太陽”)以及3個幾何類別(“水平”、“垂直”和“天空”)的像素標簽。FCN可以自然地學習一種同時預測兩類標簽的聯合表徵。我們學習了一種帶有語義和幾何預測層及損失的雙頭(two-headed)FCN-16s。學習到的模型在兩項任務上的表現和兩個獨立訓練的模型一樣好,而學習和推理基本上和單個獨立模型一樣快。表5中的結果在標准劃分的2488張訓練圖片和200張測試圖片上計算,顯示出在兩項任務上都達到了最先進的表現。


6. Conclusion
Fully convolutional networks are a rich class of models,of which modern classification convnets are a special case. Recognizing this, extending these classification nets to segmentation, and improving the architecture with multi-resolution layer combinations dramatically improves the state-of-the-art, while simultaneously simplifying and speeding up learning and inference.
全卷積網絡是一類豐富的模型,現代分類卷積網絡是其中的一個特例。認識到這一點,把這些分類網絡擴展到分割任務,並用多分辨率層的組合改進架構,顯著提升了最先進水平,同時簡化並加速了學習和推理。
Acknowledgements This work was supported in part by DARPA’s MSEE and SMISC programs, NSF awards IIS-1427425, IIS-1212798, IIS-1116411, and the NSF GRFP, Toyota, and the Berkeley Vision and Learning Center. We gratefully acknowledge NVIDIA for GPU donation. We thank Bharath Hariharan and Saurabh Gupta for their advice and dataset tools. We thank Sergio Guadarrama for reproducing GoogLeNet in Caffe. We thank Jitendra Malik for his helpful comments. Thanks to Wei Liu for pointing out an issue with our SIFT Flow mean IU computation and an error in our frequency weighted mean IU formula.
鳴謝 這項工作有以下部分支持DARPA's MSEE和SMISC項目,NSF awards IIS-1427425, IIS-1212798, IIS-1116411, 還有NSF GRFP,Toyota, 還有 Berkeley Vision和Learning Center。我們非常感謝NVIDIA捐贈的GPU。我們感謝Bharath Hariharan 和Saurabh Gupta的建議和數據集工具;我們感謝Sergio Guadarrama 重構了Caffe里的GoogLeNet;我們感謝Jitendra Malik的有幫助性評論;感謝Wei Liu指出了我們SIFT Flow平均IU計算上的一個問題和頻率權重平均IU公式的錯誤。
A. Upper Bounds on IU
In this paper, we have achieved good performance on the mean IU segmentation metric even with coarse semantic prediction. To better understand this metric and the limits of this approach with respect to it, we compute approximate upper bounds on performance with prediction at various scales. We do this by downsampling ground truth images and then upsampling them again to simulate the best results obtainable with a particular downsampling factor. The following table gives the mean IU on a subset of PASCAL 2011 val for various downsampling factors.
在這篇論文中,即使采用粗糙的語義預測,我們也在平均IU分割度量上取得了很好的表現。為了更好地理解這種度量以及該方法在這種度量下的極限,我們計算了在不同尺度上進行預測時性能的近似上界。我們通過對ground truth圖像下采樣、再把它們上采樣回來,來模擬在特定下采樣因子下可以獲得的最好結果。下表給出了不同下采樣因子在PASCAL 2011 val的一個子集上的平均IU。pixel-perfect的預測顯然不是取得遠超最先進水平的平均IU所必需的;反過來說,平均IU也不是衡量精細尺度准確度的好指標。
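A sketch of the upper-bound simulation described above, using nearest-neighbor downsampling and upsampling of a label map; the interpolation actually used in the paper is not specified here, so this is only an approximation.

```python
import numpy as np

def iu_upper_bound(gt, factor, num_classes):
    """Approximate the best mean IU achievable at a given prediction stride by
    subsampling the ground truth and upsampling it back to full resolution."""
    coarse = gt[::factor, ::factor]                         # subsample labels
    pred = np.repeat(np.repeat(coarse, factor, 0), factor, 1)
    pred = pred[:gt.shape[0], :gt.shape[1]]                 # crop back to original size
    ius = []
    for c in range(num_classes):
        inter = np.logical_and(gt == c, pred == c).sum()
        union = np.logical_or(gt == c, pred == c).sum()
        if union > 0:
            ius.append(inter / union)
    return np.mean(ius)

gt = np.random.randint(0, 21, (256, 256))
print(iu_upper_bound(gt, factor=8, num_classes=21))
```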
B. More Results
We further evaluate our FCN for semantic segmentation.
PASCAL-Context [26] provides whole scene annotations of PASCAL VOC 2010. While there are over 400 distinct classes, we follow the 59 class task defined by [26] that picks the most frequent classes. We train and evaluate on the training and val sets respectively. In Table 6, we compare to the joint object + stuff variation of Convolutional Feature Masking [3] which is the previous state-of-the-art on this task. FCN-8s scores 35.1 mean IU for an 11% relative improvement.
我們將我們的FCN用於語義分割進行了更進一步的評估。
PASCAL-Context [26] 提供了PASCAL VOC 2010的整體場景注釋。雖然有超過400種不同的類別,我們遵循 [26] 定義的、選取最常見類別的59類任務。我們分別在訓練集和val集上訓練和評估。在表6中,我們與Convolutional Feature Masking的聯合object + stuff變體進行比較,它是此前該任務上最先進的方法。FCN-8s在平均IU上得分為35.1,相對提高了11%。
Changelog
The arXiv version of this paper is kept up-to-date with corrections and additional relevant material. The following gives a brief history of changes.v2 Add Appendix A giving upper bounds on mean IU and Appendix B with PASCAL-Context results. Correct PASCAL validation numbers (previously, some val images were included in train), SIFT Flow mean IU (which used an inappropriately strict metric), and an error in the frequency weighted mean IU formula. Add link to models and update timing numbers to reflect improved implementation (which is publicly available).
論文的arXiv版本保持最新的修正和其他相關材料。以下給出簡短的變更歷史。v2 添加了給出平均IU上界的附錄A和包含PASCAL-Context結果的附錄B。修正了PASCAL驗證集的數字(之前一些val圖像被包含在訓練集中)、SIFT Flow平均IU(此前使用了不恰當的過於嚴格的度量),以及頻率加權平均IU公式中的一個錯誤。添加了模型的鏈接,並更新了時間數字以反映改進后的實現(已公開可用)。
References
略



