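Copyright: The copyright of this article is shared by the author and cnblogs (博客園).
Reposting: Reposts are welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be given in the post; otherwise the author reserves the right to pursue legal liability.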




Object detection algorithms such as R-CNN take an image as input and determine the locations and classes of the main objects in it.
- Inputs: Image.
- Outputs: Bounding boxes + labels for each object in the image.

Selective Search looks through windows of multiple scales and searches for adjacent pixels that share texture, color, or intensity. Image source: https://www.koen.me/research/pub/uijlings-ijcv2013-draft.pdf

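- Inputs: Sub-regions of the image corresponding to objects.
- Outputs: New bounding box coordinates for the object in the sub-region.
- Generate a set of candidate bounding boxes (region proposals).
- Run the image regions inside the candidate boxes through a pre-trained AlexNet, and finally use an SVM to decide what object, if any, each box contains.
- Once the object has been classified (the box really does contain an object), run the box through a linear regression model that outputs tighter coordinates for it.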
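The box-tightening step can be sketched as follows. This is a minimal illustration, assuming the offset parameterization (tx, ty, tw, th) described in the R-CNN paper; the function name `refine_box` and the sample numbers are hypothetical, and in practice the offsets would come from the trained regressor.

```python
import math

def refine_box(proposal, deltas):
    """Apply regression offsets to a proposal box given as (x1, y1, x2, y2).

    Assumption: deltas = (tx, ty, tw, th), where tx and ty shift the box
    center in units of the box size, and tw and th rescale the width and
    height in log space.
    """
    x1, y1, x2, y2 = proposal
    tx, ty, tw, th = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center, then rescale the size.
    new_cx, new_cy = cx + tx * w, cy + ty * h
    new_w, new_h = w * math.exp(tw), h * math.exp(th)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)

# Hypothetical offsets that keep the center and halve the width of a
# 100x100 proposal.
tight = refine_box((0.0, 0.0, 100.0, 100.0), (0.0, 0.0, math.log(0.5), 0.0))
```

All-zero offsets leave the proposal unchanged, so the regressor only has to learn a small correction on top of each candidate box.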
2015: Fast R-CNN - Speeding up and simplifying R-CNN

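- In R-CNN, every region proposal of every image has to be passed through the CNN (AlexNet) separately, which is about 2000 forward passes per image.
- Three different models have to be trained separately: the CNN that generates image features, the classifier (SVM) that predicts the class, and the regression model that tightens the bounding boxes.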

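This is exactly what Fast R-CNN does, using a technique known as RoIPool (Region of Interest Pooling). In the figure above, the CNN features for each region are obtained by selecting the corresponding region from the CNN's feature map. The features in each region are then pooled (usually with max pooling). So instead of roughly 2000 passes through the CNN per image, each image now goes through the CNN only once.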
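A minimal sketch of the pooling step, in plain Python. It assumes the feature map is a 2D list of numbers (one channel) and that each region is already expressed in feature-map coordinates as a half-open box (r0, c0, r1, c1); the names `roi_max_pool` and `ceil_div` are hypothetical helpers, not part of any library.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)

def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of a 2D feature map into an out_h x out_w grid.

    roi = (r0, c0, r1, c1) in feature-map coordinates, end-exclusive.
    """
    r0, c0, r1, c1 = roi
    h, w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Row range of bin i: floor at the start, ceiling at the end,
        # so every bin covers at least one cell.
        rs = r0 + (i * h) // out_h
        re = r0 + ceil_div((i + 1) * h, out_h)
        row = []
        for j in range(out_w):
            cs = c0 + (j * w) // out_w
            ce = c0 + ceil_div((j + 1) * w, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled

# The feature map is computed once; every region reuses it.
fmap = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15]]
whole = roi_max_pool(fmap, (0, 0, 4, 4), 2, 2)   # [[5, 7], [13, 15]]
corner = roi_max_pool(fmap, (1, 1, 4, 4), 2, 2)  # [[10, 11], [14, 15]]
```

Both regions produce a fixed 2x2 output regardless of their size, which is what lets the layers after the pooling step accept regions of any shape.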

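Fast R-CNN's second insight is to jointly train the CNN, the classifier, and the bounding box regressor in a single model. R-CNN, by contrast, needed separate models to extract image features (CNN), classify (SVM), and tighten bounding boxes (regressor).
- Inputs: Images with region proposals.
- Outputs: Object classifications of each region along with tighter bounding boxes.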
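One way to picture joint training is a single loss that adds a classification term and a box-regression term, so one backward pass updates everything. The sketch below assumes the multi-task form described in the Fast R-CNN paper (log loss plus a smooth L1 term that is switched off for background, taken here to be class 0); the function names and the weight `lam` are illustrative, not taken from this article.

```python
import math

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def multitask_loss(class_probs, true_class, pred_offsets, target_offsets, lam=1.0):
    """Classification loss plus box-regression loss for one region.

    Assumption: class 0 is background, so the box term only counts
    when true_class >= 1.
    """
    cls_loss = -math.log(class_probs[true_class])
    loc_loss = sum(smooth_l1(p - t)
                   for p, t in zip(pred_offsets, target_offsets))
    is_object = 1.0 if true_class >= 1 else 0.0
    return cls_loss + lam * is_object * loc_loss

# Hypothetical region: 70% confident in the correct class 1, with box
# offsets that are slightly off from their targets.
loss = multitask_loss([0.2, 0.7, 0.1], 1,
                      (0.1, 0.0, 0.2, 0.0), (0.0, 0.0, 0.0, 0.0))
```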
2016: Faster R-CNN - Speeding up region proposal

- Inputs: Images (notice that region proposals are not needed).
- Outputs: Classifications and bounding box coordinates of objects in the images.

The Region Proposal Network slides a window over the features of the CNN. At each window location, the network outputs a score and a bounding box per anchor (hence 4k box coordinates where k is the number of anchors). Source: https://arxiv.org/abs/1506.01497.

We know that the bounding boxes for people tend to be rectangular and vertical. We can use this intuition to guide our Region Proposal Network by creating an anchor of such dimensions. Image Source: http://vlm1.uta.edu/~athitsos/courses/cse6367_spring2011/assignments/assignment1/bbox0062.jpg.
- Inputs: CNN feature map.
- Outputs: A bounding box per anchor, plus a score representing how likely the image in that bounding box is to be an object.
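To make the anchor idea concrete, here is a small sketch that builds the k reference boxes for one sliding-window position. The choice of 3 scales and 3 aspect ratios (so k = 9) follows the default described in the Faster R-CNN paper; the function name `make_anchors` and the center coordinates are hypothetical.

```python
import math

def make_anchors(cx, cy, scales, ratios):
    """Return anchor boxes (x1, y1, x2, y2) centered at (cx, cy).

    Each anchor has area scale**2 and aspect ratio height / width = ratio,
    so ratio 2.0 gives a tall box of the kind that suits a standing person.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(300, 200, scales=[128, 256, 512], ratios=[0.5, 1.0, 2.0])
k = len(anchors)        # 9 anchors at this location
box_outputs = 4 * k     # 4 coordinates per anchor, i.e. 4k values
```

The network does not predict boxes from scratch; it scores each of these k anchors and regresses a correction to its 4 coordinates.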
2017: Mask R-CNN - Extending Faster R-CNN for pixel-level segmentation

The goal of image instance segmentation is to identify, at a pixel level, what the different objects in a scene are. Source: https://arxiv.org/abs/1703.06870.

Kaiming He, a researcher at Facebook AI, is lead author of Mask R-CNN and also a coauthor of Faster R-CNN.

- Inputs: CNN feature map.
- Outputs: Matrix with 1s on all locations where the pixel belongs to the object and 0s elsewhere (this is known as a binary mask).
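A binary mask of this kind can be sketched as a thresholding step over per-pixel probabilities. The 0.5 threshold is the value the Mask R-CNN paper reports for binarizing its predicted masks; the function name and the sample probabilities are hypothetical.

```python
def binarize_mask(probabilities, threshold=0.5):
    """Turn per-pixel object probabilities into a 0/1 mask."""
    return [[1 if p >= threshold else 0 for p in row]
            for row in probabilities]

# Hypothetical 3x3 output of the mask branch for one region.
probs = [[0.1, 0.8, 0.2],
         [0.7, 0.9, 0.6],
         [0.1, 0.4, 0.3]]
mask = binarize_mask(probs)
# mask is [[0, 1, 0], [1, 1, 1], [0, 0, 0]]
```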

Instead of RoIPool, the image gets passed through RoIAlign so that the regions of the feature map selected by RoIPool correspond more precisely to the regions of the original image. This is needed because pixel level segmentation requires more fine-grained alignment than bounding boxes. Source: https://arxiv.org/abs/1703.06870.

How do we accurately map a region of interest from the original image onto the feature map?
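In RoIPool, we would round this down and select roughly 2 pixels, which causes a slight misalignment. In RoIAlign, we avoid such rounding and instead use bilinear interpolation to get a precise value at 2.93. This is what avoids the misalignment caused by RoIPool.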
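The difference between rounding and interpolating can be shown with a short sketch. It assumes a single-channel feature map stored as a 2D list and samples it at a fractional position; `bilinear_sample` is a hypothetical helper, and the feature values are made up for illustration.

```python
import math

def bilinear_sample(feature_map, y, x):
    """Sample a 2D feature map at fractional (y, x) by bilinear interpolation."""
    height, width = len(feature_map), len(feature_map[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    # Clamp the far neighbors so positions on the last row/column stay valid.
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

# One row of made-up feature values.
row = [[0.0, 10.0, 20.0, 30.0]]
rounded = row[0][2]                         # RoIPool-style: index 2 -> 20.0
aligned = bilinear_sample(row, 0.0, 2.93)   # RoIAlign-style: about 29.3
```

Rounding 2.93 down to 2 reads a value from almost a full cell away, while the interpolated sample blends the two neighboring cells in proportion to their distance.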
Once these masks are generated, Mask R-CNN combines them with the classifications and bounding boxes from Faster R-CNN to produce precise segmentations.
Mask R-CNN is able to segment as well as classify the objects in an image. Source: https://arxiv.org/abs/1703.06870.