Summarized from the paper Faster_RCNN and the PyTorch code:
This post covers the second part of the code, model/utils. It first analyzes the main conceptual operations, then walks through how they are implemented in the code.
I. Main Operations
1. Bounding box regression:
Its purpose is to improve localization accuracy; it is used in both DPM and R-CNN.
1) The R-CNN version:
R-CNN uses class-specific bounding box regressors: one regressor is learned per class, and it further refines the bbox predictions for that class. Note that before regressing, the bbox coordinates (top-left, bottom-right) are converted to a center point (x, y) plus width and height (w, h). Given a predicted bbox P and a ground-truth bbox G, we want to learn a transformation that maps P to a new location as close to G as possible. Unlike Faster R-CNN, the parameter group of this transformation is $(d_x(P),\ d_y(P),\ d_w(P),\ d_h(P))$.
All four parameters are functions of the features: the first two encode a scale-invariant translation of the bbox center, and the last two encode a log-space transformation of the bbox width and height. Once these four parameters (functions) are learned, P can be mapped to a prediction G' that approximates the ground truth G as closely as possible:
$\hat{G}_x = P_w d_x(P) + P_x$, $\hat{G}_y = P_h d_y(P) + P_y$, $\hat{G}_w = P_w \exp(d_w(P))$, $\hat{G}_h = P_h \exp(d_h(P))$  (1)
So how is this parameter group obtained? It is a function of the pool5 features of the proposal P. Denoting the pool5 feature of proposal P as $\phi_5(P)$, we have:
$d_*(P) = w_*^{\top}\,\phi_5(P)$  (2)
Here W (the $w_*$) are the learnable parameters; in other words, the parameter group we learn is just the product of W with the features. What, then, is the regression target? It is the inverse of the four equations above:
$t_x = (G_x - P_x)/P_w$, $t_y = (G_y - P_y)/P_h$, $t_w = \log(G_w/P_w)$, $t_h = \log(G_h/P_h)$  (3)
This $(t_x, t_y, t_w, t_h)$ is the regression target. That is, for each class we learn a separate W such that, for proposals of that class, the product of W and the pool5 features approximates $t_*$ as closely as possible. This is exactly a least-squares ridge regression objective:
$w_* = \arg\min_{\hat{w}_*} \sum_i \left(t_*^i - \hat{w}_*^{\top}\phi_5(P^i)\right)^2 + \lambda \lVert \hat{w}_* \rVert^2$  (4)
Because this is ridge regression, there is an L2 penalty on W; the R-CNN paper sets the penalty factor lambda = 1000. Also, the training pairs (P, G) are not chosen arbitrarily: the proposal P must be close to at least one ground-truth box G for the learned parameters to be meaningful, where "close" means IoU(P, G) > 0.6.
Note that in R-CNN every proposal goes through its own feature-extraction pass, which is very inefficient; the later Fast/Faster R-CNN instead perform bounding box regression on regions of a single feature map computed once for the whole image.
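To make the encoding of Eq. (3) concrete, here is a minimal numeric sketch (not from the repo), using a hypothetical proposal P and ground-truth box G already converted to (center x, center y, w, h) format:

import numpy as np

P = np.array([100., 100., 80., 60.])   # proposal: center x, center y, width, height
G = np.array([110., 104., 90., 66.])   # ground truth in the same format

t_x = (G[0] - P[0]) / P[2]             # (G_x - P_x) / P_w
t_y = (G[1] - P[1]) / P[3]             # (G_y - P_y) / P_h
t_w = np.log(G[2] / P[2])              # log(G_w / P_w)
t_h = np.log(G[3] / P[3])              # log(G_h / P_h)

print(t_x, t_y, t_w, t_h)              # the targets the ridge regressor should output for this pair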
2) The Faster R-CNN version:
Compared with R-CNN, there are two main differences.
First, the features are different. In R-CNN, regression uses each proposal's pool5 features, whereas Faster R-CNN slides a 3*3 convolution over the feature map of the whole image, and each 3*3 window corresponds to 9 anchors. This is followed by two parallel 1*1 convolutions that reduce the channels to 4*9 (the four coordinates of the 9 anchors) and 2*9 (the binary foreground/background class of the 9 anchors), used for regression and classification respectively. This is part of the RPN's job, and the RPN is also Faster R-CNN's main advantage.
Second, the number of regressors and the regression targets differ. Faster R-CNN is no longer class-specific; instead there are 9 regressors, because every point of the feature map has 9 anchors covering 9 combinations of scale and aspect ratio, and each regressor handles only one such combination. So although the candidate boxes Faster R-CNN starts from are just the 9 kinds of anchors, after repeated regression it can predict bounding boxes of all sizes and shapes; this is also thanks to the anchor design. As for the regression loss, first look at the prediction and target formulas:
$t_x = (x - x_a)/w_a,\ \ t_y = (y - y_a)/h_a,\ \ t_w = \log(w/w_a),\ \ t_h = \log(h/h_a)$
$t_x^* = (x^* - x_a)/w_a,\ \ t_y^* = (y^* - y_a)/h_a,\ \ t_w^* = \log(w^*/w_a),\ \ t_h^* = \log(h^*/h_a)$
Here x, y, w, h are the bbox center coordinates, width and height; $x$, $x_a$ and $x^*$ denote the predicted box, the anchor box and the ground-truth box respectively. The computation is analogous to R-CNN: the first set of equations gives the offsets and scales of the predicted box relative to the anchor, and the second set gives those of the ground-truth box relative to the anchor. The goal of the regression is then obvious: make $t$ and $t^*$ as close as possible. The regression loss uses the smooth L1 function defined in Fast R-CNN, which is less sensitive to outliers:
$\mathrm{smooth}_{L_1}(x) = 0.5\,x^2$ if $|x| < 1$, and $|x| - 0.5$ otherwise.
The loss optimizes the weights W so that at test time the bbox features, after passing through W, yield good offsets and scales; applying these offsets and scales fine-tunes the originally predicted bbox and gives a better prediction.
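A minimal sketch of the smooth L1 function described above (the basic form, not the repo's exact weighted implementation):

import torch

def smooth_l1(pred_loc, gt_loc):
    # 0.5 * x^2 where |x| < 1, otherwise |x| - 0.5, summed over the box coordinates
    diff = (pred_loc - gt_loc).abs()
    loss = torch.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5)
    return loss.sum()

pred = torch.tensor([0.2, -0.1, 0.3, 0.05])
target = torch.zeros(4)
print(smooth_l1(pred, target))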
2. Anatomy of the RPN:
(Figure: the RPN network and the RoIHead network)
The RPN is the biggest improvement in Faster R-CNN. Its input is the image features, and it is a fully convolutional network. The RPN has two jobs: training itself, and providing RoIs.
Training itself: binary classification and bounding box regression (implemented via AnchorTargetCreator)
Providing RoIs: supplying the RoIs that Fast R-CNN needs for training (implemented via ProposalCreator)
1) RPN overview:
As mentioned before, the whole training process uses batchsize = 1, i.e. one image per iteration, so the feature map has shape (1, 512, hh, ww); this is also the RPN's input. It first passes through a padded 3*3 convolution with 512 output channels, keeping the shape (1, 512, hh, ww); the shape does not change, and the point is presumably to transform the semantic space. The network then splits into two branches: the left branch is a 1*1 convolution with 18 output channels, the right branch a 1*1 convolution with 36 output channels (the 1*1 convolutions serve to change the feature dimensionality). After these convolutions the left branch has shape (1, 18, hh, ww) and the right branch (1, 36, hh, ww). The left branch has 18 channels because each point's 9 anchors each get a 2-way classification score, so 9*2 = 18; the right branch has 36 channels because each point's 9 anchors each get 4 coordinate predictions, so 9*4 = 36.
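A minimal sketch of this head (assumed feature-map size 37 x 50; this is not the repo's RegionProposalNetwork class, just the 3*3 conv followed by the two parallel 1*1 convs described above):

import torch
from torch import nn

feature = torch.randn(1, 512, 37, 50)            # (1, 512, hh, ww) feature map from the backbone

conv3x3 = nn.Conv2d(512, 512, kernel_size=3, padding=1)
score_conv = nn.Conv2d(512, 18, kernel_size=1)   # 9 anchors * 2 classes (fg/bg)
loc_conv = nn.Conv2d(512, 36, kernel_size=1)     # 9 anchors * 4 box offsets

h = torch.relu(conv3x3(feature))                 # still (1, 512, hh, ww)
rpn_scores = score_conv(h)                       # (1, 18, hh, ww)
rpn_locs = loc_conv(h)                           # (1, 36, hh, ww)
print(rpn_scores.shape, rpn_locs.shape)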
2) AnchorTargetCreator in the RPN:
From the 20000-odd candidate anchors, 256 are selected for binary classification, while all anchors receive regression targets; this provides the ground truth for the predictions above. The selection rule is as follows:
- For each ground truth bounding box (gt_bbox), select the anchor with the highest IoU with it as a positive sample.
- From the remaining anchors, select those whose IoU with any gt_bbox exceeds 0.7 as positive samples; the number of positive samples must not exceed 128.
- Randomly select anchors whose IoU with every gt_bbox is below 0.3 as negative samples, so that negatives and positives total 256.
For each sampled anchor, gt_label is either 1 (foreground) or 0 (background), which gives the binary classification. When computing the regression loss, only the loss of positive (foreground) samples is computed; the location loss of negative samples is ignored.
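A minimal numpy sketch of the labelling rules above (the real logic, including subsampling down to 128 positives and 128 negatives, lives in AnchorTargetCreator._create_label, shown in creator_tool.py below):

import numpy as np

def assign_labels(ious, pos_thresh=0.7, neg_thresh=0.3):
    # ious: (num_anchors, num_gt) IoU matrix; returns a 1/0/-1 label per anchor
    labels = -np.ones(ious.shape[0], dtype=np.int32)    # -1 = ignore
    max_iou = ious.max(axis=1)
    labels[max_iou < neg_thresh] = 0                    # background
    labels[ious.argmax(axis=0)] = 1                     # best anchor for each gt -> foreground
    labels[max_iou >= pos_thresh] = 1                   # high-IoU anchors -> foreground
    return labels

ious = np.array([[0.8, 0.1], [0.2, 0.05], [0.4, 0.75], [0.1, 0.4]])
print(assign_labels(ious))   # [1 0 1 -1]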
3) ProposalCreator in the RPN:
While the RPN trains itself with AnchorTargetCreator, it also provides RoIs (regions of interest) to Fast R-CNN (RoIHead) as training samples. The process by which the RPN generates RoIs (ProposalCreator) is as follows:
- For each image, use its feature map to compute the foreground probability and the corresponding location parameters for the (H/16) × (W/16) × 9 (roughly 20000) anchors.
- Keep the 12000 anchors with the highest foreground probability.
- Use the regressed location parameters to adjust these 12000 anchors, yielding RoIs.
- Apply non-maximum suppression (NMS) and keep the 2000 RoIs with the highest probability.
Note: at inference time, to speed things up, 12000 and 2000 become 6000 and 300 respectively.
Note: these operations do not need backpropagation, so they can be implemented with numpy arrays / plain tensors.
RPN output: RoIs (a tensor of shape 2000×4 or 300×4); a compact sketch of these steps follows.
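The sketch below strings together the steps listed above (decode, clip, size-filter, sort by score, NMS). It assumes loc2bbox is importable from model.utils.bbox_tools as in this repo, and uses torchvision.ops.nms in place of the repo's non_maximum_suppression; the thresholds follow the text.

import numpy as np
import torch
from torchvision.ops import nms
from model.utils.bbox_tools import loc2bbox

def generate_proposals(anchor, loc, score, img_size,
                       n_pre_nms=12000, n_post_nms=2000, nms_thresh=0.7, min_size=16):
    roi = loc2bbox(anchor, loc)                               # decode offsets into (ymin, xmin, ymax, xmax)
    roi[:, 0::2] = np.clip(roi[:, 0::2], 0, img_size[0])      # clip y coordinates to the image height
    roi[:, 1::2] = np.clip(roi[:, 1::2], 0, img_size[1])      # clip x coordinates to the image width
    keep = np.where((roi[:, 2] - roi[:, 0] >= min_size) &
                    (roi[:, 3] - roi[:, 1] >= min_size))[0]   # drop boxes smaller than min_size
    roi, score = roi[keep], score[keep]
    order = score.argsort()[::-1][:n_pre_nms]                 # keep the top-12000 by foreground score
    roi, score = roi[order], score[order]
    boxes_xyxy = torch.from_numpy(roi[:, [1, 0, 3, 2]]).float()   # torchvision nms expects (xmin, ymin, xmax, ymax)
    keep = nms(boxes_xyxy, torch.from_numpy(score).float(), nms_thresh)
    return roi[keep.numpy()[:n_post_nms]]                     # the final ~2000 RoIs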
3. From the RPN to the RoIHead network
ProposalTargetCreator analysis:
ProposalTargetCreator is the bridge between the RPN and the RoIHead network. As described above, the RPN produces about 2000 RoIs, but not all of them are used for training; instead, ProposalTargetCreator selects 128 RoIs for training. The selection rules are:
- RoIs whose IoU with a gt_bbox is greater than 0.5: select some of them (e.g. 32) as positive samples.
- RoIs whose IoU with all gt_bboxes falls in [0 (or 0.1), 0.5): select some (e.g. 128 - 32 = 96) as negative samples.
To make training easier, the gt_roi_loc of the selected 128 RoIs is also normalized (subtract the mean, divide by the standard deviation); a small sketch follows.
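A minimal sketch of this normalization and of the inverse that must be applied to the predictions at test time; the mean/std values are the defaults of ProposalTargetCreator.__call__ shown in the code below.

import numpy as np

loc_mean = np.array([0., 0., 0., 0.], dtype=np.float32)
loc_std = np.array([0.1, 0.1, 0.2, 0.2], dtype=np.float32)

gt_roi_loc = np.array([[0.05, -0.02, 0.1, 0.2]], dtype=np.float32)
normalized = (gt_roi_loc - loc_mean) / loc_std        # used as the regression target during training
recovered = normalized * loc_std + loc_mean           # predictions must be de-normalized at test time
print(normalized, recovered)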
For classification, cross-entropy loss is used directly. For the location regression loss, smooth L1 loss is used again, except that it is computed only for positive samples, and only for the 4 parameters belonging to that sample's own class. For example:
- A RoI, after passing through FC 84, outputs an 84-dimensional loc vector. If the RoI is a negative sample, none of these 84 values take part in the L1 loss.
- If the RoI is a positive sample with label K, then only the 4 values at positions K×4, K×4+1, K×4+2, K×4+3 take part in the loss; the rest do not (see the sketch below).
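A minimal PyTorch sketch (hypothetical tensors, not the repo's trainer code) of picking out the 4 loc values belonging to each RoI's class K from the 84-dim FC output (21 classes × 4):

import torch

roi_cls_loc = torch.randn(128, 84)            # RoIHead loc output for 128 sampled RoIs
gt_roi_label = torch.randint(0, 21, (128,))   # 0 = background, 1..20 = foreground classes

roi_cls_loc = roi_cls_loc.view(128, 21, 4)
roi_loc = roi_cls_loc[torch.arange(128), gt_roi_label]   # (128, 4): the class-K slice of each RoI
# only the rows where gt_roi_label > 0 (positive samples) would enter the smooth L1 loss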
II. Code Analysis
1. bbox_tools.py
Operations for generating and fine-tuning bounding boxes.

import numpy as np import numpy as xp import six from six import __init__ def loc2bbox(src_bbox, loc): """Decode bounding boxes from bounding box offsets and scales. Given bounding box offsets and scales computed by :meth:`bbox2loc`, this function decodes the representation to coordinates in 2D image coordinates. Given scales and offsets :math:`t_y, t_x, t_h, t_w` and a bounding box whose center is :math:`(y, x) = p_y, p_x` and size :math:`p_h, p_w`, the decoded bounding box's center :math:`\\hat{g}_y`, :math:`\\hat{g}_x` and size :math:`\\hat{g}_h`, :math:`\\hat{g}_w` are calculated by the following formulas. * :math:`\\hat{g}_y = p_h t_y + p_y` * :math:`\\hat{g}_x = p_w t_x + p_x` * :math:`\\hat{g}_h = p_h \\exp(t_h)` * :math:`\\hat{g}_w = p_w \\exp(t_w)` The decoding formulas are used in works such as R-CNN [#]_. The output is same type as the type of the inputs. .. [#] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. \ Rich feature hierarchies for accurate object detection and semantic \ segmentation. CVPR 2014. Args: src_bbox (array): A coordinates of bounding boxes. Its shape is :math:`(R, 4)`. These coordinates are :math:`p_{ymin}, p_{xmin}, p_{ymax}, p_{xmax}`. loc (array): An array with offsets and scales. The shapes of :obj:`src_bbox` and :obj:`loc` should be same. This contains values :math:`t_y, t_x, t_h, t_w`. Returns: array: Decoded bounding box coordinates. Its shape is :math:`(R, 4)`. \ The second axis contains four values \ :math:`\\hat{g}_{ymin}, \\hat{g}_{xmin}, \\hat{g}_{ymax}, \\hat{g}_{xmax}`. """ if src_bbox.shape[0] == 0: return xp.zeros((0, 4), dtype=loc.dtype) src_bbox = src_bbox.astype(src_bbox.dtype, copy=False) src_height = src_bbox[:, 2] - src_bbox[:, 0] src_width = src_bbox[:, 3] - src_bbox[:, 1] src_ctr_y = src_bbox[:, 0] + 0.5 * src_height src_ctr_x = src_bbox[:, 1] + 0.5 * src_width dy = loc[:, 0::4] dx = loc[:, 1::4] dh = loc[:, 2::4] dw = loc[:, 3::4] ctr_y = dy * src_height[:, xp.newaxis] + src_ctr_y[:, xp.newaxis] ctr_x = dx * src_width[:, xp.newaxis] + src_ctr_x[:, xp.newaxis] h = xp.exp(dh) * src_height[:, xp.newaxis] w = xp.exp(dw) * src_width[:, xp.newaxis] dst_bbox = xp.zeros(loc.shape, dtype=loc.dtype) dst_bbox[:, 0::4] = ctr_y - 0.5 * h dst_bbox[:, 1::4] = ctr_x - 0.5 * w dst_bbox[:, 2::4] = ctr_y + 0.5 * h dst_bbox[:, 3::4] = ctr_x + 0.5 * w return dst_bbox def bbox2loc(src_bbox, dst_bbox): """Encodes the source and the destination bounding boxes to "loc". Given bounding boxes, this function computes offsets and scales to match the source bounding boxes to the target bounding boxes. Mathematcially, given a bounding box whose center is :math:`(y, x) = p_y, p_x` and size :math:`p_h, p_w` and the target bounding box whose center is :math:`g_y, g_x` and size :math:`g_h, g_w`, the offsets and scales :math:`t_y, t_x, t_h, t_w` can be computed by the following formulas. * :math:`t_y = \\frac{(g_y - p_y)} {p_h}` * :math:`t_x = \\frac{(g_x - p_x)} {p_w}` * :math:`t_h = \\log(\\frac{g_h} {p_h})` * :math:`t_w = \\log(\\frac{g_w} {p_w})` The output is same type as the type of the inputs. The encoding formulas are used in works such as R-CNN [#]_. .. [#] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. \ Rich feature hierarchies for accurate object detection and semantic \ segmentation. CVPR 2014. Args: src_bbox (array): An image coordinate array whose shape is :math:`(R, 4)`. :math:`R` is the number of bounding boxes. These coordinates are :math:`p_{ymin}, p_{xmin}, p_{ymax}, p_{xmax}`. 
dst_bbox (array): An image coordinate array whose shape is :math:`(R, 4)`. These coordinates are :math:`g_{ymin}, g_{xmin}, g_{ymax}, g_{xmax}`. Returns: array: Bounding box offsets and scales from :obj:`src_bbox` \ to :obj:`dst_bbox`. \ This has shape :math:`(R, 4)`. The second axis contains four values :math:`t_y, t_x, t_h, t_w`. """ height = src_bbox[:, 2] - src_bbox[:, 0] width = src_bbox[:, 3] - src_bbox[:, 1] ctr_y = src_bbox[:, 0] + 0.5 * height ctr_x = src_bbox[:, 1] + 0.5 * width base_height = dst_bbox[:, 2] - dst_bbox[:, 0] base_width = dst_bbox[:, 3] - dst_bbox[:, 1] base_ctr_y = dst_bbox[:, 0] + 0.5 * base_height base_ctr_x = dst_bbox[:, 1] + 0.5 * base_width eps = xp.finfo(height.dtype).eps height = xp.maximum(height, eps) width = xp.maximum(width, eps) dy = (base_ctr_y - ctr_y) / height dx = (base_ctr_x - ctr_x) / width dh = xp.log(base_height / height) dw = xp.log(base_width / width) loc = xp.vstack((dy, dx, dh, dw)).transpose() return loc def bbox_iou(bbox_a, bbox_b): """Calculate the Intersection of Unions (IoUs) between bounding boxes. IoU is calculated as a ratio of area of the intersection and area of the union. This function accepts both :obj:`numpy.ndarray` and :obj:`cupy.ndarray` as inputs. Please note that both :obj:`bbox_a` and :obj:`bbox_b` need to be same type. The output is same type as the type of the inputs. Args: bbox_a (array): An array whose shape is :math:`(N, 4)`. :math:`N` is the number of bounding boxes. The dtype should be :obj:`numpy.float32`. bbox_b (array): An array similar to :obj:`bbox_a`, whose shape is :math:`(K, 4)`. The dtype should be :obj:`numpy.float32`. Returns: array: An array whose shape is :math:`(N, K)`. \ An element at index :math:`(n, k)` contains IoUs between \ :math:`n` th bounding box in :obj:`bbox_a` and :math:`k` th bounding \ box in :obj:`bbox_b`. """ if bbox_a.shape[1] != 4 or bbox_b.shape[1] != 4: raise IndexError # top left tl = xp.maximum(bbox_a[:, None, :2], bbox_b[:, :2]) # bottom right br = xp.minimum(bbox_a[:, None, 2:], bbox_b[:, 2:]) area_i = xp.prod(br - tl, axis=2) * (tl < br).all(axis=2) area_a = xp.prod(bbox_a[:, 2:] - bbox_a[:, :2], axis=1) area_b = xp.prod(bbox_b[:, 2:] - bbox_b[:, :2], axis=1) return area_i / (area_a[:, None] + area_b - area_i) def __test(): pass if __name__ == '__main__': __test() def generate_anchor_base(base_size=16, ratios=[0.5, 1, 2], anchor_scales=[8, 16, 32]): """Generate anchor base windows by enumerating aspect ratio and scales. Generate anchors that are scaled and modified to the given aspect ratios. Area of a scaled anchor is preserved when modifying to the given aspect ratio. :obj:`R = len(ratios) * len(anchor_scales)` anchors are generated by this function. The :obj:`i * len(anchor_scales) + j` th anchor corresponds to an anchor generated by :obj:`ratios[i]` and :obj:`anchor_scales[j]`. For example, if the scale is :math:`8` and the ratio is :math:`0.25`, the width and the height of the base window will be stretched by :math:`8`. For modifying the anchor to the given aspect ratio, the height is halved and the width is doubled. Args: base_size (number): The width and the height of the reference window. ratios (list of floats): This is ratios of width to height of the anchors. anchor_scales (list of numbers): This is areas of anchors. Those areas will be the product of the square of an element in :obj:`anchor_scales` and the original area of the reference window. Returns: ~numpy.ndarray: An array of shape :math:`(R, 4)`. Each element is a set of coordinates of a bounding box. 
The second axis corresponds to :math:`(y_{min}, x_{min}, y_{max}, x_{max})` of a bounding box. """ py = base_size / 2. px = base_size / 2. anchor_base = np.zeros((len(ratios) * len(anchor_scales), 4), dtype=np.float32) for i in six.moves.range(len(ratios)): for j in six.moves.range(len(anchor_scales)): h = base_size * anchor_scales[j] * np.sqrt(ratios[i]) w = base_size * anchor_scales[j] * np.sqrt(1. / ratios[i]) index = i * len(anchor_scales) + j anchor_base[index, 0] = py - h / 2. anchor_base[index, 1] = px - w / 2. anchor_base[index, 2] = py + h / 2. anchor_base[index, 3] = px + w / 2. return anchor_base
The function bbox2loc takes a source bbox and a target bbox and outputs the parameter group, i.e. the offsets and scales that map the source bbox onto the target bbox; it implements formula (3) above. Note the coordinate conversion (corner coordinates to center, width and height).
The function loc2bbox takes a source bbox and a parameter group and outputs the target bbox, exactly the inverse process; it implements formula (1) above. If bbox2loc is the encoding step, then loc2bbox is the decoding step.
The function bbox_iou computes the intersection over union (IoU): given two sets of bboxes with shapes (N, 4) and (K, 4), it outputs an array of shape (N, K) containing the pairwise IoU between the two sets.
The function generate_anchor_base generates the 9 base anchors. Why "base"? Because for every point in the feature map plane, 9 anchors will later be generated with that point as the center. The figure below shows the anchors centered at (0, 0):
As described in the paper, the 9 anchors correspond to 3 scales (areas of 128², 256², 512²) and 3 aspect ratios (1:1, 1:2, 2:1). The shapes of these 9 anchors are:
90.50967 × 181.01933 = 128²
181.01933 × 362.03867 = 256²
362.03867 × 724.07733 = 512²
128.0 × 128.0 = 128²
256.0 × 256.0 = 256²
512.0 × 512.0 = 512²
181.01933 × 90.50967 = 128²
362.03867 × 181.01933 = 256²
724.07733 × 362.03867 = 512²
The function returns anchor_base, of shape 9*4, holding the top-left and bottom-right coordinates of the 9 anchors (a short usage sketch follows the table):
-37.2548 -82.5097 53.2548 98.5097
-82.5097 -173.019 98.5097 189.019
-173.019 -354.039 189.019 370.039
-56 -56 72 72
-120 -120 136 136
-248 -248 264 264
-82.5097 -37.2548 98.5097 53.2548
-173.019 -82.5097 189.019 98.5097
-354.039 -173.019 370.039 189.019
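A short usage sketch: calling generate_anchor_base with its defaults (assuming the import path model.utils.bbox_tools of this repo) reproduces the table above, and the areas come out to roughly 128², 256² and 512² for each ratio.

import numpy as np
from model.utils.bbox_tools import generate_anchor_base

anchor_base = generate_anchor_base()          # defaults: base_size=16, ratios=[0.5, 1, 2], anchor_scales=[8, 16, 32]
print(anchor_base.shape)                      # (9, 4), each row is (y_min, x_min, y_max, x_max)
hs = anchor_base[:, 2] - anchor_base[:, 0]
ws = anchor_base[:, 3] - anchor_base[:, 1]
print(np.round(hs * ws))                      # areas: ~128^2, 256^2, 512^2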
Now the question: the above only produces anchors centered at the top-left corner (0, 0). How do we produce anchors centered at every point of the feature map?
A function in model/region_proposal_network implements this:
self.anchor_base = generate_anchor_base(anchor_scales=anchor_scales, ratios=ratios)  # first generate the 9 base anchors centered at (0, 0) described above
...
n, _, hh, ww = x.shape  # x is the feature map; n is the batch size (1 in this code); hh, ww are its height and width
anchor = _enumerate_shifted_anchor(  # call the function below
    np.array(self.anchor_base), self.feat_stride, hh, ww)  # feat_stride=16: after four poolings the feature map is 1/16 of the original image
...

def _enumerate_shifted_anchor(anchor_base, feat_stride, height, width):
    # Generate all anchors of the feature map from the base anchors.
    # Enumerate all shifted anchors:
    # anchor_base: (9, 4) coordinates, here A = 9
    #
    # add A anchors (1, A, 4) to
    # cell K shifts (K, 1, 4) to get
    # shift anchors (K, A, 4)
    # reshape to (K*A, 4) shifted anchors
    # return (K*A, 4)

    # !TODO: add support for torch.CudaTensor
    # xp = cuda.get_array_module(anchor_base)
    # it seems that it can't be boosted using GPU
    import numpy as xp
    shift_y = xp.arange(0, height * feat_stride, feat_stride)  # vertical offsets (0, 16, 32, ...)
    shift_x = xp.arange(0, width * feat_stride, feat_stride)   # horizontal offsets (0, 16, 32, ...)
    shift_x, shift_y = xp.meshgrid(shift_x, shift_y)
    shift = xp.stack((shift_y.ravel(), shift_x.ravel(),
                      shift_y.ravel(), shift_x.ravel()), axis=1)

    A = anchor_base.shape[0]  # 9
    K = shift.shape[0]        # K = hh * ww
    anchor = anchor_base.reshape((1, A, 4)) + \
        shift.reshape((1, K, 4)).transpose((1, 0, 2))
    anchor = anchor.reshape((K * A, 4)).astype(np.float32)
    return anchor             # (K*A, 4): coordinates of all (about 20000) anchors
Analyzing the code above: _enumerate_shifted_anchor first generates the horizontal and vertical offsets; every point of the feature map has to be mapped back to the original image by a factor of 16:
(Figure: anchor centers on the original image, spaced 16 pixels apart. Source: 機器之心)
With these offsets, each offset is added to the base anchor coordinates, giving the top-left and bottom-right coordinates of all anchors. Each image generates roughly hh × ww × 9 ≈ 20000 anchors (a small sketch of calling this function follows the figure caption below).
(Figure: left, the anchor centers; middle, the anchors of a single feature-map location drawn on the original image; right, all anchors drawn on the original image. Source: 機器之心)
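A short usage sketch, assuming the function can be imported from model.region_proposal_network as in this repo: with a 37 × 50 feature map and feat_stride = 16 we get 37 × 50 × 9 = 16650 anchors, the "about 20000" mentioned above.

from model.utils.bbox_tools import generate_anchor_base
from model.region_proposal_network import _enumerate_shifted_anchor

anchor_base = generate_anchor_base()
anchor = _enumerate_shifted_anchor(anchor_base, feat_stride=16, height=37, width=50)
print(anchor.shape)   # (16650, 4)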
2. creator_tool.py

import numpy as np import cupy as cp from model.utils.bbox_tools import bbox2loc, bbox_iou, loc2bbox from model.utils.nms import non_maximum_suppression class ProposalTargetCreator(object): """Assign ground truth bounding boxes to given RoIs. The :meth:`__call__` of this class generates training targets for each object proposal. This is used to train Faster RCNN [#]_. .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \ Faster R-CNN: Towards Real-Time Object Detection with \ Region Proposal Networks. NIPS 2015. Args: n_sample (int): The number of sampled regions. pos_ratio (float): Fraction of regions that is labeled as a foreground. pos_iou_thresh (float): IoU threshold for a RoI to be considered as a foreground. neg_iou_thresh_hi (float): RoI is considered to be the background if IoU is in [:obj:`neg_iou_thresh_hi`, :obj:`neg_iou_thresh_hi`). neg_iou_thresh_lo (float): See above. """ def __init__(self, n_sample=128, pos_ratio=0.25, pos_iou_thresh=0.5, neg_iou_thresh_hi=0.5, neg_iou_thresh_lo=0.0 ): self.n_sample = n_sample self.pos_ratio = pos_ratio self.pos_iou_thresh = pos_iou_thresh self.neg_iou_thresh_hi = neg_iou_thresh_hi self.neg_iou_thresh_lo = neg_iou_thresh_lo # NOTE: py-faster-rcnn默認的值是0.1 def __call__(self, roi, bbox, label, loc_normalize_mean=(0., 0., 0., 0.), loc_normalize_std=(0.1, 0.1, 0.2, 0.2)): """Assigns ground truth to sampled proposals. This function samples total of :obj:`self.n_sample` RoIs from the combination of :obj:`roi` and :obj:`bbox`. The RoIs are assigned with the ground truth class labels as well as bounding box offsets and scales to match the ground truth bounding boxes. As many as :obj:`pos_ratio * self.n_sample` RoIs are sampled as foregrounds. Offsets and scales of bounding boxes are calculated using :func:`model.utils.bbox_tools.bbox2loc`. Also, types of input arrays and output arrays are same. Here are notations. * :math:`S` is the total number of sampled RoIs, which equals \ :obj:`self.n_sample`. * :math:`L` is number of object classes possibly including the \ background. Args: roi (array): Region of Interests (RoIs) from which we sample. Its shape is :math:`(R, 4)` bbox (array): The coordinates of ground truth bounding boxes. Its shape is :math:`(R', 4)`. label (array): Ground truth bounding box labels. Its shape is :math:`(R',)`. Its range is :math:`[0, L - 1]`, where :math:`L` is the number of foreground classes. loc_normalize_mean (tuple of four floats): Mean values to normalize coordinates of bouding boxes. loc_normalize_std (tupler of four floats): Standard deviation of the coordinates of bounding boxes. Returns: (array, array, array): * **sample_roi**: Regions of interests that are sampled. \ Its shape is :math:`(S, 4)`. * **gt_roi_loc**: Offsets and scales to match \ the sampled RoIs to the ground truth bounding boxes. \ Its shape is :math:`(S, 4)`. * **gt_roi_label**: Labels assigned to sampled RoIs. Its shape is \ :math:`(S,)`. Its range is :math:`[0, L]`. The label with \ value 0 is the background. """ n_bbox, _ = bbox.shape roi = np.concatenate((roi, bbox), axis=0) pos_roi_per_image = np.round(self.n_sample * self.pos_ratio) iou = bbox_iou(roi, bbox) gt_assignment = iou.argmax(axis=1) max_iou = iou.max(axis=1) # Offset range of classes from [0, n_fg_class - 1] to [1, n_fg_class]. # The label with value 0 is the background. gt_roi_label = label[gt_assignment] + 1 # Select foreground RoIs as those with >= pos_iou_thresh IoU. 
pos_index = np.where(max_iou >= self.pos_iou_thresh)[0] pos_roi_per_this_image = int(min(pos_roi_per_image, pos_index.size)) if pos_index.size > 0: pos_index = np.random.choice( pos_index, size=pos_roi_per_this_image, replace=False) # Select background RoIs as those within # [neg_iou_thresh_lo, neg_iou_thresh_hi). neg_index = np.where((max_iou < self.neg_iou_thresh_hi) & (max_iou >= self.neg_iou_thresh_lo))[0] neg_roi_per_this_image = self.n_sample - pos_roi_per_this_image neg_roi_per_this_image = int(min(neg_roi_per_this_image, neg_index.size)) if neg_index.size > 0: neg_index = np.random.choice( neg_index, size=neg_roi_per_this_image, replace=False) # The indices that we're selecting (both positive and negative). keep_index = np.append(pos_index, neg_index) gt_roi_label = gt_roi_label[keep_index] gt_roi_label[pos_roi_per_this_image:] = 0 # negative labels --> 0 sample_roi = roi[keep_index] # Compute offsets and scales to match sampled RoIs to the GTs. gt_roi_loc = bbox2loc(sample_roi, bbox[gt_assignment[keep_index]]) gt_roi_loc = ((gt_roi_loc - np.array(loc_normalize_mean, np.float32) ) / np.array(loc_normalize_std, np.float32)) return sample_roi, gt_roi_loc, gt_roi_label class AnchorTargetCreator(object): """Assign the ground truth bounding boxes to anchors. Assigns the ground truth bounding boxes to anchors for training Region Proposal Networks introduced in Faster R-CNN [#]_. Offsets and scales to match anchors to the ground truth are calculated using the encoding scheme of :func:`model.utils.bbox_tools.bbox2loc`. .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \ Faster R-CNN: Towards Real-Time Object Detection with \ Region Proposal Networks. NIPS 2015. Args: n_sample (int): The number of regions to produce. pos_iou_thresh (float): Anchors with IoU above this threshold will be assigned as positive. neg_iou_thresh (float): Anchors with IoU below this threshold will be assigned as negative. pos_ratio (float): Ratio of positive regions in the sampled regions. """ def __init__(self, n_sample=256, pos_iou_thresh=0.7, neg_iou_thresh=0.3, pos_ratio=0.5): self.n_sample = n_sample self.pos_iou_thresh = pos_iou_thresh self.neg_iou_thresh = neg_iou_thresh self.pos_ratio = pos_ratio def __call__(self, bbox, anchor, img_size): """Assign ground truth supervision to sampled subset of anchors. Types of input arrays and output arrays are same. Here are notations. * :math:`S` is the number of anchors. * :math:`R` is the number of bounding boxes. Args: bbox (array): Coordinates of bounding boxes. Its shape is :math:`(R, 4)`. anchor (array): Coordinates of anchors. Its shape is :math:`(S, 4)`. img_size (tuple of ints): A tuple :obj:`H, W`, which is a tuple of height and width of an image. Returns: (array, array): #NOTE: it's scale not only offset * **loc**: Offsets and scales to match the anchors to \ the ground truth bounding boxes. Its shape is :math:`(S, 4)`. * **label**: Labels of anchors with values \ :obj:`(1=positive, 0=negative, -1=ignore)`. Its shape \ is :math:`(S,)`. 
""" img_H, img_W = img_size n_anchor = len(anchor) inside_index = _get_inside_index(anchor, img_H, img_W) anchor = anchor[inside_index] argmax_ious, label = self._create_label( inside_index, anchor, bbox) # compute bounding box regression targets loc = bbox2loc(anchor, bbox[argmax_ious]) # map up to original set of anchors label = _unmap(label, n_anchor, inside_index, fill=-1) loc = _unmap(loc, n_anchor, inside_index, fill=0) return loc, label def _create_label(self, inside_index, anchor, bbox): # label: 1 is positive, 0 is negative, -1 is dont care label = np.empty((len(inside_index),), dtype=np.int32) label.fill(-1) argmax_ious, max_ious, gt_argmax_ious = \ self._calc_ious(anchor, bbox, inside_index) # assign negative labels first so that positive labels can clobber them label[max_ious < self.neg_iou_thresh] = 0 # positive label: for each gt, anchor with highest iou label[gt_argmax_ious] = 1 # positive label: above threshold IOU label[max_ious >= self.pos_iou_thresh] = 1 # subsample positive labels if we have too many n_pos = int(self.pos_ratio * self.n_sample) pos_index = np.where(label == 1)[0] if len(pos_index) > n_pos: disable_index = np.random.choice( pos_index, size=(len(pos_index) - n_pos), replace=False) label[disable_index] = -1 # subsample negative labels if we have too many n_neg = self.n_sample - np.sum(label == 1) neg_index = np.where(label == 0)[0] if len(neg_index) > n_neg: disable_index = np.random.choice( neg_index, size=(len(neg_index) - n_neg), replace=False) label[disable_index] = -1 return argmax_ious, label def _calc_ious(self, anchor, bbox, inside_index): # ious between the anchors and the gt boxes ious = bbox_iou(anchor, bbox) argmax_ious = ious.argmax(axis=1) max_ious = ious[np.arange(len(inside_index)), argmax_ious] gt_argmax_ious = ious.argmax(axis=0) gt_max_ious = ious[gt_argmax_ious, np.arange(ious.shape[1])] gt_argmax_ious = np.where(ious == gt_max_ious)[0] return argmax_ious, max_ious, gt_argmax_ious def _unmap(data, count, index, fill=0): # Unmap a subset of item (data) back to the original set of items (of # size count) if len(data.shape) == 1: ret = np.empty((count,), dtype=data.dtype) ret.fill(fill) ret[index] = data else: ret = np.empty((count,) + data.shape[1:], dtype=data.dtype) ret.fill(fill) ret[index, :] = data return ret def _get_inside_index(anchor, H, W): # Calc indicies of anchors which are located completely inside of the image # whose size is speficied. index_inside = np.where( (anchor[:, 0] >= 0) & (anchor[:, 1] >= 0) & (anchor[:, 2] <= H) & (anchor[:, 3] <= W) )[0] return index_inside class ProposalCreator: # unNOTE: I'll make it undifferential # unTODO: make sure it's ok # It's ok """Proposal regions are generated by calling this object. The :meth:`__call__` of this object outputs object detection proposals by applying estimated bounding box offsets to a set of anchors. This class takes parameters to control number of bounding boxes to pass to NMS and keep after NMS. If the paramters are negative, it uses all the bounding boxes supplied or keep all the bounding boxes returned by NMS. This class is used for Region Proposal Networks introduced in Faster R-CNN [#]_. .. [#] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. \ Faster R-CNN: Towards Real-Time Object Detection with \ Region Proposal Networks. NIPS 2015. Args: nms_thresh (float): Threshold value used when calling NMS. n_train_pre_nms (int): Number of top scored bounding boxes to keep before passing to NMS in train mode. 
n_train_post_nms (int): Number of top scored bounding boxes to keep after passing to NMS in train mode. n_test_pre_nms (int): Number of top scored bounding boxes to keep before passing to NMS in test mode. n_test_post_nms (int): Number of top scored bounding boxes to keep after passing to NMS in test mode. force_cpu_nms (bool): If this is :obj:`True`, always use NMS in CPU mode. If :obj:`False`, the NMS mode is selected based on the type of inputs. min_size (int): A paramter to determine the threshold on discarding bounding boxes based on their sizes. """ def __init__(self, parent_model, nms_thresh=0.7, n_train_pre_nms=12000, n_train_post_nms=2000, n_test_pre_nms=6000, n_test_post_nms=300, min_size=16 ): self.parent_model = parent_model self.nms_thresh = nms_thresh self.n_train_pre_nms = n_train_pre_nms self.n_train_post_nms = n_train_post_nms self.n_test_pre_nms = n_test_pre_nms self.n_test_post_nms = n_test_post_nms self.min_size = min_size def __call__(self, loc, score, anchor, img_size, scale=1.): """input should be ndarray Propose RoIs. Inputs :obj:`loc, score, anchor` refer to the same anchor when indexed by the same index. On notations, :math:`R` is the total number of anchors. This is equal to product of the height and the width of an image and the number of anchor bases per pixel. Type of the output is same as the inputs. Args: loc (array): Predicted offsets and scaling to anchors. Its shape is :math:`(R, 4)`. score (array): Predicted foreground probability for anchors. Its shape is :math:`(R,)`. anchor (array): Coordinates of anchors. Its shape is :math:`(R, 4)`. img_size (tuple of ints): A tuple :obj:`height, width`, which contains image size after scaling. scale (float): The scaling factor used to scale an image after reading it from a file. Returns: array: An array of coordinates of proposal boxes. Its shape is :math:`(S, 4)`. :math:`S` is less than :obj:`self.n_test_post_nms` in test time and less than :obj:`self.n_train_post_nms` in train time. :math:`S` depends on the size of the predicted bounding boxes and the number of bounding boxes discarded by NMS. """ # NOTE: when test, remember # faster_rcnn.eval() # to set self.traing = False if self.parent_model.training: n_pre_nms = self.n_train_pre_nms n_post_nms = self.n_train_post_nms else: n_pre_nms = self.n_test_pre_nms n_post_nms = self.n_test_post_nms # Convert anchors into proposal via bbox transformations. # roi = loc2bbox(anchor, loc) roi = loc2bbox(anchor, loc) # Clip predicted boxes to image. roi[:, slice(0, 4, 2)] = np.clip( roi[:, slice(0, 4, 2)], 0, img_size[0]) roi[:, slice(1, 4, 2)] = np.clip( roi[:, slice(1, 4, 2)], 0, img_size[1]) # Remove predicted boxes with either height or width < threshold. min_size = self.min_size * scale hs = roi[:, 2] - roi[:, 0] ws = roi[:, 3] - roi[:, 1] keep = np.where((hs >= min_size) & (ws >= min_size))[0] roi = roi[keep, :] score = score[keep] # Sort all (proposal, score) pairs by score from highest to lowest. # Take top pre_nms_topN (e.g. 6000). order = score.ravel().argsort()[::-1] if n_pre_nms > 0: order = order[:n_pre_nms] roi = roi[order, :] # Apply nms (e.g. threshold = 0.7). # Take after_nms_topN (e.g. 300). # unNOTE: somthing is wrong here! # TODO: remove cuda.to_gpu keep = non_maximum_suppression( cp.ascontiguousarray(cp.asarray(roi)), thresh=self.nms_thresh) if n_post_nms > 0: keep = keep[:n_post_nms] roi = roi[keep] return roi
This script implements three Creator classes: ProposalCreator, AnchorTargetCreator and ProposalTargetCreator.
The first two are used inside the RPN; the third is used inside the RoIHead network:
1) AnchorTargetCreator:
Purpose: use the ground-truth bboxes of each image to assign ground truth to all of the RPN's tasks.
Input: the coordinates of the ~20000 initially generated anchors, and the ground-truth coordinates of all bboxes in the image.
Output: positive/negative labels of size (20000, 1) (only 128 are 1 and 128 are 0; all the rest are -1), and regression targets of size (20000, 4) (one for every anchor).
As described earlier, about 20000 anchors are generated for every image, and we have already analyzed how they are generated. Now the question: the RPN has three jobs — classification, regression, and providing RoIs. Where does the ground truth for classification and regression come from? How do we assign a positive/negative label gt_rpn_label to each of the 20000 anchors for classification, and a regression target gt_rpn_loc for the regression? That is exactly what this creator does: it uses the image's ground-truth bboxes to assign ground truth for all of these tasks.
Note that although ground truth is assigned to all 20000 anchors, only 128 positives and 128 negatives (256 samples in total) are picked for training. We do not train on all samples because negatives vastly outnumber positives in an image. Likewise, the regression loss is only computed over the sampled anchors (in fact only over the positive ones, as noted above).
The function first filters, from the image's ~20000 anchors, those that lie completely inside the image; suppose 15000 anchors survive, and their indices are recorded.
Then bbox_iou computes the IoU between the 15000 anchors and the ground-truth bboxes. _create_label finds, over the rows and columns of this matrix, which bbox each anchor overlaps most and the corresponding maximum value, and returns argmax_ious (the index of the gt bbox with which each anchor has its maximum IoU) and label (-1 = ignore, 0 = negative, 1 = positive). Note that although only 256 anchors are sampled, the returned label still covers all of them; it is just that 128 entries are 0, 128 are 1, and the rest are -1. bbox2loc then uses argmax_ious to compute the regression target parameters loc. Finally, using the indices recorded earlier, the 15000 entries are mapped back to a length-20000 label (the rest set to -1) and loc (the rest set to (0, 0, 0, 0)). The RPN's two 1*1 convolutions output the predicted label and loc, and AnchorTargetCreator has now produced the corresponding ground truth, so the RPN loss rpn_loss follows directly:
$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$
Here cls denotes the binary classification term and reg the regression term. Why map the 256 sampled labels back to 20000? Because the network's predictions (the 1*1 convolutions) cover all 20000 anchors, and setting the labels we want to ignore to -1 lets us filter them out; the same reasoning applies to loc. So in the loss, $N_{cls}$ counts the 256 sampled anchors, while $N_{reg}$ ranges over the anchors with valid regression targets (in practice only the positives contribute).
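To see how the -1 labels make this masking work in practice, here is a minimal, hypothetical PyTorch sketch (not the repo's trainer code): cross-entropy with ignore_index=-1 counts only the 256 sampled anchors, and the smooth L1 term is masked to positive anchors.

import torch
import torch.nn.functional as F

n_anchor = 20000
rpn_score = torch.randn(n_anchor, 2)        # 1*1 conv output, reshaped to (N, 2)
rpn_loc = torch.randn(n_anchor, 4)          # 1*1 conv output, reshaped to (N, 4)
gt_rpn_label = torch.full((n_anchor,), -1, dtype=torch.long)
gt_rpn_label[:128] = 1                      # pretend AnchorTargetCreator sampled these
gt_rpn_label[128:256] = 0
gt_rpn_loc = torch.zeros(n_anchor, 4)

cls_loss = F.cross_entropy(rpn_score, gt_rpn_label, ignore_index=-1)   # only 256 samples count
pos = gt_rpn_label == 1
loc_loss = F.smooth_l1_loss(rpn_loc[pos], gt_rpn_loc[pos])             # only positive anchors count
rpn_loss = cls_loss + loc_loss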
2) ProposalCreator:
Purpose: provide the 2000 training samples for Fast R-CNN, i.e. the detection network.
Input: the loc and score output by the RPN's 1*1 convolutions, the ~20000 anchor coordinates, the image size, and scale (the factor by which this training image was resized from its original size).
Output: 2000 training RoIs (just 2000*4 coordinates, no ground truth!).
While the RPN trains itself with AnchorTargetCreator, it also provides RoIs (regions of interest) to Fast R-CNN (RoIHead) as training samples.
Note that NMS (non-maximum suppression) is used here for the first time. NMS iteratively filters bboxes until a desired number remains, and is the main mechanism for controlling how many bboxes are kept. First, loc2bbox fine-tunes the ~20000 anchors with the loc predicted by the RPN; the adjusted anchors are now called RoIs. These RoIs are then clipped to lie within the original image. Next, the indices of all RoIs whose width and height are both at least 16 are recorded (suppose 18000 RoIs qualify). These RoIs are sorted by the predicted score from high to low and only the top 12000 are kept. Finally, NMS reduces them further to 2000 RoIs.
Relationship with AnchorTargetCreator: ProposalCreator simply takes the two 1*1 convolution outputs loc and score plus the ~20000 anchors and selects the final 2000 RoIs; it has essentially no direct interaction with AnchorTargetCreator. AnchorTargetCreator's job is training: it produces the ground truth against which the 1*1 convolution outputs loc and score are trained, so the network improves and the 2000 RoIs picked by ProposalCreator become better. Their only common ground is that both use the predicted loc and score together with the ~20000 original anchor coordinates.
3) ProposalTargetCreator
Purpose: assign ground truth to the 2000 RoIs (strictly speaking, pick out 128 of them and assign ground truth to those).
Input: the 2000 RoIs; all ground-truth bboxes of the batch (one image), shape (R, 4); and the labels of those bboxes, shape (R, 1) (for VOC2007, 20 classes, 0-19).
Output: 128 sample_roi (128, 4), 128 gt_roi_loc (128, 4), and 128 gt_roi_label (128, 1).
The 2000 RoIs output by ProposalCreator are the input to ProposalTargetCreator, together with the ground-truth bboxes and labels of the image. If the input image contains 5 objects, there are 5 bboxes and 5 labels. The three inputs then look like the following; we will use this example with R = 5 in the analysis below:
(bbox ground truth: 5*4; label: 5*1; RoIs: 2000*4)
The code first concatenates the 2000 RoIs with the 5 bboxes into a new roi array of shape (2005, 4). (This is debatable; I do not think the concatenation is strictly necessary.) From these 2005 we only need to pick 128 RoIs to provide training samples for Fast R-CNN. First bbox_iou computes the IoU matrix between the RoIs and the bboxes, of shape (2005, 5). Then the per-row maximum and argmax are recorded: for each of the 2005 RoIs, the bbox with which it overlaps most determines which label the RoI belongs to. Next 128 RoIs are selected and their indices recorded, the first 32 as positives and the remaining 96 as negatives. Using these 128 indices keep_index we obtain 128 sample_roi and 128 gt_roi_label, and passing sample_roi with its assigned bbox through bbox2loc gives the 128 gt_roi_loc.
(IoU matrix: 2005*5; max_iou: 2005*1; gt_assignment: 2005*1; gt_roi_label: 2005*1 — strictly it is only called gt_roi_label after sampling)
The sampling process is illustrated in the figure below:
The resulting 128*4 sample_roi can then be fed into the RoIHead network for classification and regression. The RoIHead takes sample_roi plus the feature map as input and outputs classification predictions (21 classes) and regression predictions (further fine-tuning of the bbox). And what is the ground truth for this classification and regression? Exactly the gt_roi_label and gt_roi_loc output by ProposalTargetCreator. With predictions and ground truth, the roi_loss can be trained. Note that these 128 RoIs are guaranteed to lie inside the image, because ProposalCreator already clipped all RoIs to the image.
Relationship with AnchorTargetCreator and ProposalCreator:
The output of ProposalCreator is the input of this Creator. ProposalTargetCreator is very similar to AnchorTargetCreator (even the names are similar): both create ground truth for the classification and regression losses, and both take the image's gt_bbox as input. ProposalTargetCreator is the first place where the real 21-class labels are used, and it also normalizes loc at the end, so predictions must be de-normalized (multiplied by the std and shifted by the mean) at test time.
III. Summary
1. What the three Creators have in common:
Each defines a __call__ method in its class so that its objects can be called like functions, and all of them take part in training.
2. Similarities and differences between rpn_loss and roi_loss:
Both are multi-task losses combining classification and regression, so Faster R-CNN has 4 sub-losses in total.
In rpn_loss, the classification is binary, with 256 samples (half positive, half negative); the classification predictions come from one of the RPN's 1*1 convolutions, and the classification ground truth is generated by AnchorTargetCreator. The regression in rpn_loss nominally covers all ~20000 bboxes (strictly, those that lie completely inside the image); the regression predictions come from the RPN's other 1*1 convolution, and the regression targets are generated by AnchorTargetCreator.
In roi_loss, the classification is 21-way, with 128 samples at a positive-to-negative ratio of 1:3; the classification predictions come from the RoIHead's FC 21 output, and the classification ground truth is generated by ProposalTargetCreator. The regression in roi_loss uses the 128 samples; the regression predictions come from the RoIHead's FC 84 output, and the regression targets are generated by ProposalTargetCreator.
Reference: