Please credit the source when reposting:
https://www.cnblogs.com/darkknightzh/p/10043864.html
References:
Paper: https://arxiv.org/abs/1506.01497
Third-party TensorFlow implementation of Faster R-CNN: https://github.com/endernewton/tf-faster-rcnn
IoU: https://www.cnblogs.com/darkknightzh/p/9043395.html
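Faster R-CNN consists of two parts: the RPN and the R-CNN head. The RPN keeps the anchors that lie inside the image and labels each of them as positive, negative, or "don't care". After NMS, at most 2000 anchors are kept during training and 300 at test time. During training, the RPN also passes 256 of these anchors on to the R-CNN head, which classifies them and regresses their box coordinates.
Network is the base class and vgg16 is the derived class, which overrides _image_to_head and _head_to_tail from Network.
Only vgg16 is analyzed below.
The overall structure of the Faster R-CNN network is shown in the figure below.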
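1. Training phase:
SolverWrapper builds the network, train_op, and so on through construct_graph.
construct_graph builds the network through Network.create_architecture.
1.1 create_architecture
create_architecture calls _build_network to build the model, then adds the losses and other related ops. It obtains rois, cls_prob and bbox_pred, and is defined as follows: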

def create_architecture(self, mode, num_classes, tag=None, anchor_scales=(8, 16, 32), anchor_ratios=(0.5, 1, 2)):
    self._image = tf.placeholder(tf.float32, shape=[1, None, None, 3])  # image height and width vary, so dims 1 and 2 are None
    self._im_info = tf.placeholder(tf.float32, shape=[3])  # image info: height, width, resize scale (shorter side to 600, longer side capped at 1000)
    self._gt_boxes = tf.placeholder(tf.float32, shape=[None, 5])  # ground-truth boxes: 4 coordinates, then the class label (see get_minibatch in roi_data_layer/minibatch.py)
    self._tag = tag

    self._num_classes = num_classes
    self._mode = mode
    self._anchor_scales = anchor_scales
    self._num_scales = len(anchor_scales)

    self._anchor_ratios = anchor_ratios
    self._num_ratios = len(anchor_ratios)

    self._num_anchors = self._num_scales * self._num_ratios  # self._num_anchors = 9

    training = mode == 'TRAIN'
    testing = mode == 'TEST'

    weights_regularizer = tf.contrib.layers.l2_regularizer(cfg.TRAIN.WEIGHT_DECAY)  # handle most of the regularizers here
    if cfg.TRAIN.BIAS_DECAY:
        biases_regularizer = weights_regularizer
    else:
        biases_regularizer = tf.no_regularizer

    # list as many types of layers as possible, even if they are not used now
    with arg_scope([slim.conv2d, slim.conv2d_in_plane, slim.conv2d_transpose, slim.separable_conv2d, slim.fully_connected],
                   weights_regularizer=weights_regularizer, biases_regularizer=biases_regularizer,
                   biases_initializer=tf.constant_initializer(0.0)):
        # rois: the 256 sampled anchors in training (column 0 is the all-zero batch index, columns 1-4 the box)
        # cls_prob: per-class probabilities of those 256 anchors
        # bbox_pred: predicted box offsets
        rois, cls_prob, bbox_pred = self._build_network(training)

    layers_to_output = {'rois': rois}

    for var in tf.trainable_variables():
        self._train_summaries.append(var)

    if testing:
        stds = np.tile(np.array(cfg.TRAIN.BBOX_NORMALIZE_STDS), (self._num_classes))
        means = np.tile(np.array(cfg.TRAIN.BBOX_NORMALIZE_MEANS), (self._num_classes))
        # In training the regression targets have the mean subtracted and are divided by the std,
        # so at test time the predictions are mapped back.
        self._predictions["bbox_pred"] *= stds
        self._predictions["bbox_pred"] += means
    else:
        self._add_losses()
        layers_to_output.update(self._losses)

        val_summaries = []
        with tf.device("/cpu:0"):
            val_summaries.append(self._add_gt_image_summary())
            for key, var in self._event_summaries.items():
                val_summaries.append(tf.summary.scalar(key, var))
            for key, var in self._score_summaries.items():
                self._add_score_summary(key, var)
            for var in self._act_summaries:
                self._add_act_summary(var)
            for var in self._train_summaries:
                self._add_train_summary(var)

        self._summary_op = tf.summary.merge_all()
        self._summary_op_val = tf.summary.merge(val_summaries)

    layers_to_output.update(self._predictions)

    return layers_to_output
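The test-time branch of the listing above undoes the normalization of the regression targets. A minimal plain-Python sketch of that step follows. The helper name denormalize is hypothetical, and the means (0, 0, 0, 0) and stds (0.1, 0.1, 0.2, 0.2) are assumed values for BBOX_NORMALIZE_MEANS and BBOX_NORMALIZE_STDS; check them against your own config.

```python
def denormalize(bbox_pred_row, num_classes,
                means=(0.0, 0.0, 0.0, 0.0), stds=(0.1, 0.1, 0.2, 0.2)):
    """Map one row of normalized predictions (4 values per class) back to raw offsets."""
    tiled_means = list(means) * num_classes  # same effect as np.tile(means, num_classes)
    tiled_stds = list(stds) * num_classes
    return [p * s + m for p, s, m in zip(bbox_pred_row, tiled_stds, tiled_means)]


# A fake prediction row for 3 classes, every normalized offset equal to 1.0.
row = [1.0, 1.0, 1.0, 1.0] * 3
out = denormalize(row, num_classes=3)
```

With these assumed stds, the dx and dy entries of every class come back as 0.1 and the dw and dh entries as 0.2.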
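1.2 _build_network
_build_network builds the network:
_build_network = _image_to_head + // extracts the features of the input image
_anchor_component + // computes the coordinates in the original image of all candidate anchors (they may extend beyond the image border) and the anchor count
_region_proposal + // processes the input features and ends up with 2000 anchors (training) or 300 anchors (test)
_crop_pool_layer + // crops the 256 anchors out of the feature map and resizes them to a fixed 7*7 size
_head_to_tail + // applies fc and dropout to the features of the 256 anchors, giving 4096-dimensional features
_region_classification // adds the fc layers used for R-CNN classification and regression
Overall flow: conv1 to conv5 of VGG16 produce the features net_conv, which are fed into the RPN to obtain candidate anchors. Anchors that extend beyond the image border are dropped, and 2000 anchors are selected for training the RPN (300 at test time). From these, 256 anchors are further selected for R-CNN classification. The features of these 256 anchors are then cropped according to rois, resized and pooled into pool5, which has the same 7*7 size for every anchor. pool5 goes through two fc layers to give the 4096-dimensional feature fc7, and fc7 is fed into _region_classification (two parallel fc layers) to give the 21-dimensional cls_score and the 21*4-dimensional bbox_pred.
_build_network is defined as follows: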

def _build_network(self, is_training=True):
    if cfg.TRAIN.TRUNCATED:  # select initializers
        initializer = tf.truncated_normal_initializer(mean=0.0, stddev=0.01)
        initializer_bbox = tf.truncated_normal_initializer(mean=0.0, stddev=0.001)
    else:
        initializer = tf.random_normal_initializer(mean=0.0, stddev=0.01)
        initializer_bbox = tf.random_normal_initializer(mean=0.0, stddev=0.001)

    net_conv = self._image_to_head(is_training)  # conv5_3 features of VGG16
    with tf.variable_scope(self._scope, self._scope):
        # from the feature map and its stride _feat_stride relative to the original image,
        # compute the start and end coordinates of all anchors
        self._anchor_component()
        # run the RPN; in training this returns 256 rois (column 0 is the all-zero batch index, columns 1-4 the box)
        rois = self._region_proposal(net_conv, is_training, initializer)
        # crop each roi from the feature map, resize it to a fixed 14*14 and max-pool it to 7*7
        pool5 = self._crop_pool_layer(net_conv, rois, "pool5")

    fc7 = self._head_to_tail(pool5, is_training)  # fc + dropout on the fixed-size rois, giving 4096-d features for classification and regression
    with tf.variable_scope(self._scope, self._scope):
        # classify the rois (object detection) and regress the predicted coordinates
        cls_prob, bbox_pred = self._region_classification(fc7, is_training, initializer, initializer_bbox)

    self._score_summaries.update(self._predictions)

    # rois: the 256 sampled anchors in training (column 0 is the all-zero batch index, columns 1-4 the box)
    # cls_prob: per-class probabilities of those 256 anchors
    # bbox_pred: predicted box offsets
    return rois, cls_prob, bbox_pred
1.3 _image_to_head
_image_to_head extracts the features of the input image.
The function lives in vgg16.py and is defined as follows:

def _image_to_head(self, is_training, reuse=None):
    with tf.variable_scope(self._scope, self._scope, reuse=reuse):
        net = slim.repeat(self._image, 2, slim.conv2d, 64, [3, 3], trainable=False, scope='conv1')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool1')
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], trainable=False, scope='conv2')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool2')
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], trainable=is_training, scope='conv3')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool3')
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], trainable=is_training, scope='conv4')
        net = slim.max_pool2d(net, [2, 2], padding='SAME', scope='pool4')
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], trainable=is_training, scope='conv5')

    self._act_summaries.append(net)
    self._layers['head'] = net

    return net
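1.4 _anchor_component
_anchor_component computes the coordinates in the original image of all candidate anchors (they may extend beyond the image border) and the number of anchors (feature-map width * feature-map height * 9). It uses self._im_info, a 3-vector: [0] is the image height, [1] is the image width (thanks to carrot359 for pointing out that I previously had these swapped), and [2] is the scale factor applied to the image (the shorter side is scaled to 600 unless that would push the longer side past 1000, in which case the longer side is scaled to 1000; the result is for example 600*900 or 850*1000). The function calls generate_anchors_pre_tf, which in turn calls generate_anchors, to obtain the coordinates in the original image of all candidate anchors and the number of anchors (since image sizes differ, the final anchor count differs too).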
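To make the relation between im_info and the anchor count concrete, the sketch below recomputes the scale, the feature-map size and the number of anchors for a given input size. The helper name compute_im_info and its defaults (target size 600, maximum size 1000, stride 16, 9 anchors per position) are assumptions chosen to match the numbers quoted above. The real pipeline resizes the image itself, so its rounding may differ by a pixel.

```python
import math


def compute_im_info(height, width, target_size=600, max_size=1000,
                    feat_stride=16, anchors_per_pos=9):
    """Return (im_info, feature-map size, anchor count) for one image."""
    short_side, long_side = min(height, width), max(height, width)
    scale = target_size / short_side          # scale the shorter side to target_size
    if round(scale * long_side) > max_size:   # unless the longer side would get too large
        scale = max_size / long_side
    new_h, new_w = round(height * scale), round(width * scale)
    feat_h = math.ceil(new_h / feat_stride)   # same ceil as in _anchor_component
    feat_w = math.ceil(new_w / feat_stride)
    return (new_h, new_w, scale), (feat_h, feat_w), feat_h * feat_w * anchors_per_pos


info_a, feat_a, count_a = compute_im_info(375, 500)    # a typical landscape image
info_b, feat_b, count_b = compute_im_info(300, 1000)   # a very wide image
```

For the 375*500 image the shorter side decides the scale; for the 300*1000 image the cap on the longer side does.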
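generate_anchors_pre_tf works as follows:
1. _ratio_enum starts from the base window (0, 0, 15, 15) and applies ratio=[0.5, 1, 2]. A ratio here is the height-to-width ratio h/w of the new anchor, applied while keeping the total pixel count (w*h) roughly equal to that of the base window; it is not a factor applied to the width or the height alone. This gives three anchors (each given by its top-left and bottom-right corners): (-3.5, 2, 18.5, 13), (0, 0, 15, 15) and (2.5, -3, 12.5, 18).
2. Each of these three anchors is then enlarged about its center by scales=(8, 16, 32), which gives 9 anchors in total (again given by their top-left and bottom-right corners). Since this is just a scaling of the three anchors above, no figure is given for it here.
Finally, tf.add(anchor_constant, shifts) gives, for every point of the downsampled feature map, the rectangles of its 9 anchors in the original image. anchor_constant has shape 1*9*4 and shifts has shape N*1*4, where N is the number of feature-map positions. The broadcast sum has shape N*9*4 and is reshaped to (N*9)*4, which holds the anchors in the original image for every feature-map position.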
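The steps above can be reproduced in a few lines of NumPy. This is a compact sketch, not the library code: base_anchors and all_anchors are hypothetical names, and the ratio and scale enumeration is folded into one loop.

```python
import numpy as np


def base_anchors(base_size=16, ratios=(0.5, 1, 2), scales=(8, 16, 32)):
    """9 reference anchors (x1, y1, x2, y2) around the window (0, 0, 15, 15)."""
    ctr = (base_size - 1) / 2.0             # center of the base window: 7.5
    area = float(base_size * base_size)
    out = []
    for r in ratios:
        w = np.round(np.sqrt(area / r))     # width for this ratio, area kept roughly constant
        h = np.round(w * r)                 # height = width * ratio
        for s in scales:
            ws, hs = w * s, h * s           # enlarge about the same center
            out.append([ctr - 0.5 * (ws - 1), ctr - 0.5 * (hs - 1),
                        ctr + 0.5 * (ws - 1), ctr + 0.5 * (hs - 1)])
    return np.array(out)


def all_anchors(feat_h, feat_w, feat_stride=16):
    """Shift the 9 base anchors to every feature-map position: (feat_h*feat_w*9, 4)."""
    sx, sy = np.meshgrid(np.arange(feat_w) * feat_stride,
                         np.arange(feat_h) * feat_stride)
    shifts = np.stack([sx.ravel(), sy.ravel(), sx.ravel(), sy.ravel()], axis=1)  # (K, 4)
    # (1, 9, 4) + (K, 1, 4) broadcasts to (K, 9, 4)
    anchors = base_anchors()[np.newaxis, :, :] + shifts[:, np.newaxis, :]
    return anchors.reshape(-1, 4)


base = base_anchors()
grid = all_anchors(feat_h=2, feat_w=3)
```

Row 9 of grid is the first base anchor moved one stride to the right, because the x shift varies fastest.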
_anchor_component is as follows:

def _anchor_component(self):
    with tf.variable_scope('ANCHOR_' + self._tag) as scope:
        # height and width of the feature map that VGG16 produces for the image
        height = tf.to_int32(tf.ceil(self._im_info[0] / np.float32(self._feat_stride[0])))
        width = tf.to_int32(tf.ceil(self._im_info[1] / np.float32(self._feat_stride[0])))
        if cfg.USE_E2E_TF:
            # from the feature-map size, _feat_stride (downsampling factor of the feature map relative to the image),
            # _anchor_scales and _anchor_ratios, compute all candidate anchors in the original image
            # (coordinates may extend beyond the image border) and the number of anchors
            anchors, anchor_length = generate_anchors_pre_tf(height, width, self._feat_stride, self._anchor_scales, self._anchor_ratios)
        else:
            anchors, anchor_length = tf.py_func(generate_anchors_pre,
                                                [height, width, self._feat_stride, self._anchor_scales, self._anchor_ratios],
                                                [tf.float32, tf.int32], name="generate_anchors")
        anchors.set_shape([None, 4])  # start and end coordinates, 4 values in total
        anchor_length.set_shape([])
        self._anchors = anchors
        self._anchor_length = anchor_length


def generate_anchors_pre_tf(height, width, feat_stride=16, anchor_scales=(8, 16, 32), anchor_ratios=(0.5, 1, 2)):
    shift_x = tf.range(width) * feat_stride  # starting x of all anchors in the original image: (0, feat_stride, 2*feat_stride, ...)
    shift_y = tf.range(height) * feat_stride  # starting y of all anchors in the original image: (0, feat_stride, 2*feat_stride, ...)
    shift_x, shift_y = tf.meshgrid(shift_x, shift_y)  # both have shape height*width: shift_x repeats the x offsets in every row, shift_y the y offsets in every column
    sx = tf.reshape(shift_x, shape=(-1,))  # 0, feat_stride, 2*feat_stride, ..., 0, feat_stride, 2*feat_stride, ...
    sy = tf.reshape(shift_y, shape=(-1,))  # 0, 0, 0, ..., feat_stride, feat_stride, ..., 2*feat_stride, ...
    shifts = tf.transpose(tf.stack([sx, sy, sx, sy]))  # (width*height)*4 matrix
    K = tf.multiply(width, height)  # number of feature-map positions
    shifts = tf.transpose(tf.reshape(shifts, shape=[1, K, 4]), perm=(1, 0, 2))  # add a dim to get 1*(width*height)*4, then permute to (width*height)*1*4

    anchors = generate_anchors(ratios=np.array(anchor_ratios), scales=np.array(anchor_scales))  # 4 coordinates of the 9 base anchors in the original image (base_size defaults to 16)
    A = anchors.shape[0]  # A = 9
    anchor_constant = tf.constant(anchors.reshape((1, A, 4)), dtype=tf.int32)  # reshape anchors to 1*9*4

    length = K * A  # total number of anchors (A = 9 per position, K = height*width positions)
    # broadcast-add the 1*9*4 base anchors and the (width*height)*1*4 shifts to get (width*height)*9*4,
    # then reshape to (width*height*9)*4: the four coordinates of every anchor
    anchors_tf = tf.reshape(tf.add(anchor_constant, shifts), shape=(length, 4))

    return tf.cast(anchors_tf, dtype=tf.float32), length


def generate_anchors(base_size=16, ratios=[0.5, 1, 2], scales=2 ** np.arange(3, 6)):
    """Generate anchor (reference) windows by enumerating aspect ratios X scales wrt a reference (0, 0, 15, 15) window."""
    base_anchor = np.array([1, 1, base_size, base_size]) - 1  # four coordinates of the base anchor
    ratio_anchors = _ratio_enum(base_anchor, ratios)  # coordinates of the 3 ratio anchors (3*4 matrix)
    anchors = np.vstack([_scale_enum(ratio_anchors[i, :], scales) for i in range(ratio_anchors.shape[0])])  # 3*4 becomes 9*4: coordinates of the 9 anchors
    return anchors


def _whctrs(anchor):
    """ Return width, height, x center, and y center for an anchor (window). """
    w = anchor[2] - anchor[0] + 1  # width
    h = anchor[3] - anchor[1] + 1  # height
    x_ctr = anchor[0] + 0.5 * (w - 1)  # center x
    y_ctr = anchor[1] + 0.5 * (h - 1)  # center y
    return w, h, x_ctr, y_ctr


def _mkanchors(ws, hs, x_ctr, y_ctr):
    """ Given a vector of widths (ws) and heights (hs) around a center (x_ctr, y_ctr), output a set of anchors (windows)."""
    ws = ws[:, np.newaxis]  # 3-vector becomes a 3*1 matrix
    hs = hs[:, np.newaxis]  # 3-vector becomes a 3*1 matrix
    anchors = np.hstack((x_ctr - 0.5 * (ws - 1), y_ctr - 0.5 * (hs - 1), x_ctr + 0.5 * (ws - 1), y_ctr + 0.5 * (hs - 1)))  # 3*4 matrix
    return anchors


def _ratio_enum(anchor, ratios):  # ratio is h/w; the area (pixel count) is kept roughly constant
    """ Enumerate a set of anchors for each aspect ratio wrt an anchor. """
    w, h, x_ctr, y_ctr = _whctrs(anchor)  # width, height and center
    size = w * h  # total pixel count
    size_ratios = size / ratios  # area divided by each ratio
    ws = np.round(np.sqrt(size_ratios))  # new widths, 3-vector (decreasing)
    hs = np.round(ws * ratios)  # new heights, element-wise product of two 3-vectors (increasing)
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)  # four coordinates of the 3 anchors from center, width and height
    return anchors


def _scale_enum(anchor, scales):
    """ Enumerate a set of anchors for each scale wrt an anchor. """
    w, h, x_ctr, y_ctr = _whctrs(anchor)  # width, height and center
    ws = w * scales  # scaled widths
    hs = h * scales  # scaled heights
    anchors = _mkanchors(ws, hs, x_ctr, y_ctr)  # four coordinates of the 3 scaled anchors from center, width and height
    return anchors
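1.5 _region_proposal
_region_proposal passes the conv5 features of VGG16 through a 3*3 sliding window to obtain the RPN features, which then feed two parallel branches: the cls network and the reg network. The cls network uses a 1*1 convolution to decide whether each anchor is a positive or a negative sample (because there are so many anchors, some are also marked "don't care"; only positive and negative samples are used), which gives the two-class score rpn_cls_score. The reg network uses a 1*1 convolution to regress the coordinate offsets of the anchors, rpn_bbox_pred. Both networks share the 3*3 conv (rpn). Since every position has k anchors, every position has 2k scores and 4k coordinates.
cls (reduces the 512-dimensional input to 2k dimensions): 3*3 conv + 1*1 conv (2k scores, where k is the number of anchors per position, e.g. 9)
The first call to _reshape_layer gets an input bottom of shape 1*?*?*2k. It first converts to the Caffe data layout (TF uses batchsize*height*width*channels, Caffe uses batchsize*channels*height*width), giving to_caffe of shape 1*2k*?*?. It then reshapes this to reshaped of shape 1*2*(k*?)*?, and finally converts back to the TF layout, giving to_tf of shape 1*(k*?)*?*2, which is rpn_cls_score_reshape. _softmax_layer then computes the softmax over the last dimension only (size 2), giving the probabilities rpn_cls_prob_reshape (the class with the larger value is the prediction rpn_cls_pred). A second _reshape_layer call gives rpn_cls_prob of shape 1*?*?*2k, the probabilities in the original layout.
reg (reduces the 512-dimensional input to 4k dimensions): 3*3 conv + 1*1 conv (4k coordinates, where k is the number of anchors per position, e.g. 9).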
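The reshape, softmax, reshape sequence of the cls branch is easy to get lost in, so here is a NumPy sketch of it under the assumption of a single image (batch size 1). reshape_layer mirrors the transpose and reshape steps described above; rpn_softmax is a hypothetical helper that chains them. One consequence worth seeing is that channel a and channel a+k of the 2k-channel output form one (background, foreground) pair.

```python
import numpy as np


def reshape_layer(bottom, num_dim):
    """NHWC -> NCHW, regroup channels into num_dim groups, back to channel-last."""
    _, _, w, _ = bottom.shape
    to_caffe = bottom.transpose(0, 3, 1, 2)          # 1 x C x H x W
    reshaped = to_caffe.reshape(1, num_dim, -1, w)   # 1 x num_dim x (C/num_dim * H) x W
    return reshaped.transpose(0, 2, 3, 1)            # 1 x (C/num_dim * H) x W x num_dim


def rpn_softmax(rpn_cls_score, num_anchors):
    """Two-class softmax per anchor, returned in the original 1 x H x W x 2k layout."""
    r = reshape_layer(rpn_cls_score, 2)              # 1 x (k*H) x W x 2
    e = np.exp(r - r.max(axis=-1, keepdims=True))    # numerically stable softmax over the last dim
    prob = e / e.sum(axis=-1, keepdims=True)
    return reshape_layer(prob, num_anchors * 2)      # 1 x H x W x 2k


rng = np.random.RandomState(0)
scores = rng.randn(1, 3, 4, 18)                      # H=3, W=4, k=9
prob = rpn_softmax(scores, num_anchors=9)
```

This pairing is why proposal_layer_tf can take the last k channels of rpn_cls_prob as the foreground probabilities.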
_region_proposal is defined as follows:

def _region_proposal(self, net_conv, is_training, initializer):  # process the input feature map
    # 3*3 conv that serves as the RPN
    rpn = slim.conv2d(net_conv, cfg.RPN_CHANNELS, [3, 3], trainable=is_training, weights_initializer=initializer, scope="rpn_conv/3x3")
    self._act_summaries.append(rpn)
    # 1*1 conv: two-class scores of the 9 anchors at each position, 1*?*?*(9*2); decides whether an anchor is positive or negative
    rpn_cls_score = slim.conv2d(rpn, self._num_anchors * 2, [1, 1], trainable=is_training, weights_initializer=initializer,  # _num_anchors is 9
                                padding='VALID', activation_fn=None, scope='rpn_cls_score')
    rpn_cls_score_reshape = self._reshape_layer(rpn_cls_score, 2, 'rpn_cls_score_reshape')  # 1*?*?*18 ==> 1*(?*9)*?*2
    rpn_cls_prob_reshape = self._softmax_layer(rpn_cls_score_reshape, "rpn_cls_prob_reshape")  # softmax over the last dim: probabilities of shape 1*(?*9)*?*2
    rpn_cls_pred = tf.argmax(tf.reshape(rpn_cls_score_reshape, [-1, 2]), axis=1, name="rpn_cls_pred")  # predicted class of the 9 anchors at each position, a vector of length 1*?*9*?
    rpn_cls_prob = self._reshape_layer(rpn_cls_prob_reshape, self._num_anchors * 2, "rpn_cls_prob")  # back to the original layout: 1*(?*9)*?*2 ==> 1*?*?*(9*2)
    # 1*1 conv: regressed offsets of the 9 anchors at each position, 1*?*?*(9*4)
    rpn_bbox_pred = slim.conv2d(rpn, self._num_anchors * 4, [1, 1], trainable=is_training, weights_initializer=initializer,
                                padding='VALID', activation_fn=None, scope='rpn_bbox_pred')
    if is_training:
        # from the class probabilities and regressed offsets of the 9 anchors at each position, get the positions of
        # post_nms_topN=2000 anchors (including the all-zero batch_inds) and their probabilities of being 1 (positive)
        rois, roi_scores = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
        rpn_labels = self._anchor_target_layer(rpn_cls_score, "anchor")  # rpn_labels: whether each anchor is positive, negative or don't care
        with tf.control_dependencies([rpn_labels]):  # Try to have a deterministic order for the computing graph, for reproducibility
            # from the post_nms_topN anchors and their positive probabilities, sample 256 rois
            # (column 0 stays the all-zero batch index) and the related targets
            rois, _ = self._proposal_target_layer(rois, roi_scores, "rpn_rois")
    else:
        if cfg.TEST.MODE == 'nms':
            # same as above, with post_nms_topN=300 anchors
            rois, _ = self._proposal_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
        elif cfg.TEST.MODE == 'top':
            rois, _ = self._proposal_top_layer(rpn_cls_prob, rpn_bbox_pred, "rois")
        else:
            raise NotImplementedError

    self._predictions["rpn_cls_score"] = rpn_cls_score  # positive/negative scores of the 9 anchors at each position
    self._predictions["rpn_cls_score_reshape"] = rpn_cls_score_reshape  # positive/negative scores per anchor
    self._predictions["rpn_cls_prob"] = rpn_cls_prob  # positive/negative probabilities of the 9 anchors at each position
    self._predictions["rpn_cls_pred"] = rpn_cls_pred  # predicted class of the 9 anchors at each position, a vector of length 1*?*9*?
    self._predictions["rpn_bbox_pred"] = rpn_bbox_pred  # regressed offsets of the 9 anchors at each position
    self._predictions["rois"] = rois  # 256 rois: batch index (column 0) and position (last four columns)

    return rois  # 256 rois: all-zero batch index (column 0) and position (last four columns)


def _reshape_layer(self, bottom, num_dim, name):
    input_shape = tf.shape(bottom)
    with tf.variable_scope(name) as scope:
        to_caffe = tf.transpose(bottom, [0, 3, 1, 2])  # NHWC (TF layout) to NCHW (Caffe layout)
        # 1*(num_dim*9)*?*? ==> 1*num_dim*(9*?)*?  or  1*num_dim*(9*?)*? ==> 1*(num_dim*9)*?*?
        reshaped = tf.reshape(to_caffe, tf.concat(axis=0, values=[[1, num_dim, -1], [input_shape[2]]]))
        to_tf = tf.transpose(reshaped, [0, 2, 3, 1])
        return to_tf


def _softmax_layer(self, bottom, name):
    if name.startswith('rpn_cls_prob_reshape'):  # bottom: 1*(?*9)*?*2
        input_shape = tf.shape(bottom)
        bottom_reshaped = tf.reshape(bottom, [-1, input_shape[-1]])  # keep only the last dim for the softmax, merge the rest: 1*(?*9)*?*2 ==> (1*?*9*?)*2
        reshaped_score = tf.nn.softmax(bottom_reshaped, name=name)  # probabilities for every anchor
        return tf.reshape(reshaped_score, input_shape)  # (1*?*9*?)*2 ==> 1*(?*9)*?*2
    return tf.nn.softmax(bottom, name=name)
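1.6 _proposal_layer
_proposal_layer calls proposal_layer_tf. Starting from the (N*9)*4 anchors, it computes the predicted coordinates (bbox_transform_inv_tf), clips them to the image (clip_boxes_tf) and applies non-maximum suppression (tf.image.non_max_suppression, which returns the indices of the boxes that are kept). The result is the surviving anchors, rois, and their probabilities of being positive samples, rpn_scores. rois is m*5 and rpn_scores is m*1, where m is the number of candidate regions left after non-maximum suppression (at most 2000 in training, 300 at test time). The first column of the m*5 matrix is the all-zero batch_inds and the last 4 columns are the coordinates (top-left + bottom-right).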
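The decode, clip and NMS chain described above can be sketched in NumPy as follows. This is an illustration under assumptions, not the library code: decode follows the arithmetic of bbox_transform_inv_tf, but nms is a simple greedy stand-in for tf.image.non_max_suppression, and propose is a hypothetical wrapper. Boxes are assumed to have non-zero area.

```python
import numpy as np


def decode(anchors, deltas):
    """Apply predicted (dx, dy, dw, dh) to anchors given as (x1, y1, x2, y2)."""
    w = anchors[:, 2] - anchors[:, 0] + 1.0
    h = anchors[:, 3] - anchors[:, 1] + 1.0
    cx = anchors[:, 0] + 0.5 * w
    cy = anchors[:, 1] + 0.5 * h
    pcx, pcy = deltas[:, 0] * w + cx, deltas[:, 1] * h + cy
    pw, ph = np.exp(deltas[:, 2]) * w, np.exp(deltas[:, 3]) * h
    return np.stack([pcx - 0.5 * pw, pcy - 0.5 * ph, pcx + 0.5 * pw, pcy + 0.5 * ph], axis=1)


def clip_boxes(boxes, im_h, im_w):
    """Clamp x to [0, im_w - 1] and y to [0, im_h - 1]."""
    out = boxes.copy()
    out[:, 0::2] = np.clip(out[:, 0::2], 0, im_w - 1)
    out[:, 1::2] = np.clip(out[:, 1::2], 0, im_h - 1)
    return out


def nms(boxes, scores, iou_thresh, top_n):
    """Greedy NMS: keep the best box, drop boxes overlapping it by more than iou_thresh."""
    order = np.argsort(-scores)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0 and len(keep) < top_n:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        iw = np.minimum(boxes[i, 2], boxes[rest, 2]) - np.maximum(boxes[i, 0], boxes[rest, 0])
        ih = np.minimum(boxes[i, 3], boxes[rest, 3]) - np.maximum(boxes[i, 1], boxes[rest, 1])
        inter = np.maximum(iw, 0.0) * np.maximum(ih, 0.0)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]
    return keep


def propose(anchors, deltas, fg_scores, im_h, im_w, post_nms_top_n=300, nms_thresh=0.7):
    boxes = clip_boxes(decode(anchors, deltas), im_h, im_w)
    keep = nms(boxes, fg_scores, nms_thresh, post_nms_top_n)
    rois = np.hstack([np.zeros((len(keep), 1)), boxes[keep]])  # column 0: all-zero batch index
    return rois, fg_scores[keep].reshape(-1, 1)


anchors = np.array([[0, 0, 15, 15], [1, 1, 16, 16], [100, 100, 131, 131]], dtype=np.float64)
deltas = np.zeros((3, 4))
fg_scores = np.array([0.9, 0.8, 0.7])
rois, roi_scores = propose(anchors, deltas, fg_scores, im_h=600, im_w=800)
```

In this toy input the second anchor overlaps the first by more than 0.7, so only the first and third survive.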
_proposal_layer is as follows:

def _proposal_layer(self, rpn_cls_prob, rpn_bbox_pred, name):  # class probabilities and regressed offsets of the 9 anchors at each position ==> positions of post_nms_topN anchors and their probabilities of being 1
    with tf.variable_scope(name) as scope:
        if cfg.USE_E2E_TF:
            # rois: post_nms_topN*5 (column 0 is the all-zero batch_inds, last 4 columns are coordinates)
            # rpn_scores: post_nms_topN*1 matching probabilities of being 1
            rois, rpn_scores = proposal_layer_tf(rpn_cls_prob, rpn_bbox_pred, self._im_info, self._mode, self._feat_stride, self._anchors, self._num_anchors)
        else:
            rois, rpn_scores = tf.py_func(proposal_layer, [rpn_cls_prob, rpn_bbox_pred, self._im_info, self._mode,
                                                           self._feat_stride, self._anchors, self._num_anchors],
                                          [tf.float32, tf.float32], name="proposal")

        rois.set_shape([None, 5])
        rpn_scores.set_shape([None, 1])

    return rois, rpn_scores


def proposal_layer_tf(rpn_cls_prob, rpn_bbox_pred, im_info, cfg_key, _feat_stride, anchors, num_anchors):  # class probabilities and regressed offsets of the 9 anchors at each position
    if type(cfg_key) == bytes:
        cfg_key = cfg_key.decode('utf-8')
    pre_nms_topN = cfg[cfg_key].RPN_PRE_NMS_TOP_N
    post_nms_topN = cfg[cfg_key].RPN_POST_NMS_TOP_N  # 2000 in training, 300 at test time
    nms_thresh = cfg[cfg_key].RPN_NMS_THRESH  # NMS threshold, 0.7

    # 1*?*?*(9*2), take the last 9 channels: 1*?*?*9. Presumably the first 9 channels are the background
    # probabilities of the 9 anchors and the last 9 the foreground probabilities (two classes only)
    scores = rpn_cls_prob[:, :, :, num_anchors:]
    scores = tf.reshape(scores, shape=(-1,))  # probability of being 1 for every anchor
    rpn_bbox_pred = tf.reshape(rpn_bbox_pred, shape=(-1, 4))  # four offsets for every anchor

    proposals = bbox_transform_inv_tf(anchors, rpn_bbox_pred)  # predicted coordinates from anchors and offsets
    proposals = clip_boxes_tf(proposals, im_info[:2])  # restrict the predicted coordinates to the original image

    indices = tf.image.non_max_suppression(proposals, scores, max_output_size=post_nms_topN, iou_threshold=nms_thresh)  # NMS: indices of the post_nms_topN highest-scoring boxes

    boxes = tf.gather(proposals, indices)  # the post_nms_topN boxes
    boxes = tf.to_float(boxes)
    scores = tf.gather(scores, indices)  # their probabilities of being 1
    scores = tf.reshape(scores, shape=(-1, 1))

    batch_inds = tf.zeros((tf.shape(indices)[0], 1), dtype=tf.float32)  # Only support single image as input
    blob = tf.concat([batch_inds, boxes], 1)  # concat post_nms_topN*1 batch_inds and post_nms_topN*4 coordinates into a post_nms_topN*5 blob

    return blob, scores


def bbox_transform_inv_tf(boxes, deltas):  # predicted coordinates from anchors and offsets
    boxes = tf.cast(boxes, deltas.dtype)
    widths = tf.subtract(boxes[:, 2], boxes[:, 0]) + 1.0  # width
    heights = tf.subtract(boxes[:, 3], boxes[:, 1]) + 1.0  # height
    ctr_x = tf.add(boxes[:, 0], widths * 0.5)  # center x
    ctr_y = tf.add(boxes[:, 1], heights * 0.5)  # center y

    dx = deltas[:, 0]  # predicted dx
    dy = deltas[:, 1]  # predicted dy
    dw = deltas[:, 2]  # predicted dw
    dh = deltas[:, 3]  # predicted dh

    pred_ctr_x = tf.add(tf.multiply(dx, widths), ctr_x)  # Eq. 2 inverted: given xa, wa, tx, recover the predicted center x
    pred_ctr_y = tf.add(tf.multiply(dy, heights), ctr_y)  # Eq. 2 inverted: given ya, ha, ty, recover the predicted center y
    pred_w = tf.multiply(tf.exp(dw), widths)  # Eq. 2 inverted: given wa, tw, recover the predicted w
    pred_h = tf.multiply(tf.exp(dh), heights)  # Eq. 2 inverted: given ha, th, recover the predicted h

    pred_boxes0 = tf.subtract(pred_ctr_x, pred_w * 0.5)  # start and end coordinates of the predicted box
    pred_boxes1 = tf.subtract(pred_ctr_y, pred_h * 0.5)
    pred_boxes2 = tf.add(pred_ctr_x, pred_w * 0.5)
    pred_boxes3 = tf.add(pred_ctr_y, pred_h * 0.5)

    return tf.stack([pred_boxes0, pred_boxes1, pred_boxes2, pred_boxes3], axis=1)


def clip_boxes_tf(boxes, im_info):  # restrict the predicted coordinates to the original image
    b0 = tf.maximum(tf.minimum(boxes[:, 0], im_info[1] - 1), 0)
    b1 = tf.maximum(tf.minimum(boxes[:, 1], im_info[0] - 1), 0)
    b2 = tf.maximum(tf.minimum(boxes[:, 2], im_info[1] - 1), 0)
    b3 = tf.maximum(tf.minimum(boxes[:, 3], im_info[0] - 1), 0)
    return tf.stack([b0, b1, b2, b3], axis=1)
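1.7 _anchor_target_layer
_anchor_target_layer first removes the anchors whose borders extend beyond the image. It then uses bbox_overlaps to compute the overlap (IoU) matrix overlaps (N*M) between the anchors (N*4) and gt_boxes (M*4). From it, it takes, for every anchor, the largest overlap with any ground-truth box, max_overlaps (1*N), and, for every ground-truth box, the largest overlap with any anchor, gt_max_overlaps (1*M), together with the positions of the anchors that reach it, gt_argmax_overlaps. Next, _compute_targets computes the transformed coordinates bbox_targets between the anchors and their best-overlapping gt_boxes (see the last four terms of Eq. 2). Finally, _unmap maps everything back to the size of the original anchor set, giving rpn_labels (whether each anchor is positive, negative or don't care), rpn_bbox_targets, rpn_bbox_inside_weights and rpn_bbox_outside_weights.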
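The labeling rule described above can be tried on a tiny overlap matrix. The sketch below covers only the label assignment (no subsampling, no regression targets). label_anchors is a hypothetical name, and the thresholds 0.3 and 0.7 are assumed values for RPN_NEGATIVE_OVERLAP and RPN_POSITIVE_OVERLAP.

```python
import numpy as np


def label_anchors(overlaps, neg_thresh=0.3, pos_thresh=0.7):
    """overlaps: (N anchors, M gt boxes) IoU matrix. Returns labels (1, 0, -1) and best gt per anchor."""
    n = overlaps.shape[0]
    labels = np.full(n, -1, dtype=np.float32)         # start as don't care
    argmax_overlaps = overlaps.argmax(axis=1)         # best gt box for each anchor
    max_overlaps = overlaps[np.arange(n), argmax_overlaps]
    gt_max_overlaps = overlaps.max(axis=0)            # best IoU reached for each gt box
    gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]  # anchors reaching it (ties included)

    labels[max_overlaps < neg_thresh] = 0             # background first
    labels[gt_argmax_overlaps] = 1                    # best anchor of every gt box is positive
    labels[max_overlaps >= pos_thresh] = 1            # high-IoU anchors are positive
    return labels, argmax_overlaps


overlaps = np.array([[0.80, 0.10],    # above the positive threshold
                     [0.20, 0.50],    # best anchor for gt box 1
                     [0.10, 0.05],    # below the negative threshold
                     [0.40, 0.45]])   # in between: don't care
labels, best_gt = label_anchors(overlaps)
```

Anchor 1 never reaches 0.7, but it is still positive because it is the best match for the second ground-truth box.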
_anchor_target_layer is defined as follows:

def _anchor_target_layer(self, rpn_cls_score, name):  # rpn_cls_score: classification scores of the 9 anchors at each position, 1*?*?*(9*2)
    with tf.variable_scope(name) as scope:
        # rpn_labels: whether each anchor is positive, negative or don't care (anchors crossing the image border are don't care)
        # rpn_bbox_targets: coordinate offsets between each anchor and its matching ground-truth box (many are 0)
        # rpn_bbox_inside_weights: 1 for positive samples, 0 for negative and don't-care samples
        # rpn_bbox_outside_weights: normalized weights of positive and negative samples (don't-care samples excluded)
        rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights = tf.py_func(
            anchor_target_layer, [rpn_cls_score, self._gt_boxes, self._im_info, self._feat_stride, self._anchors, self._num_anchors],
            [tf.float32, tf.float32, tf.float32, tf.float32], name="anchor_target")

        rpn_labels.set_shape([1, 1, None, None])
        rpn_bbox_targets.set_shape([1, None, None, self._num_anchors * 4])
        rpn_bbox_inside_weights.set_shape([1, None, None, self._num_anchors * 4])
        rpn_bbox_outside_weights.set_shape([1, None, None, self._num_anchors * 4])

        rpn_labels = tf.to_int32(rpn_labels, name="to_int32")
        self._anchor_targets['rpn_labels'] = rpn_labels
        self._anchor_targets['rpn_bbox_targets'] = rpn_bbox_targets
        self._anchor_targets['rpn_bbox_inside_weights'] = rpn_bbox_inside_weights
        self._anchor_targets['rpn_bbox_outside_weights'] = rpn_bbox_outside_weights

        self._score_summaries.update(self._anchor_targets)

    return rpn_labels


def anchor_target_layer(rpn_cls_score, gt_boxes, im_info, _feat_stride, all_anchors, num_anchors):  # 1*?*?*(9*2); ?*5; 3; [16]; ?*4; [9]
    """Same as the anchor target layer in original Fast/er RCNN """
    A = num_anchors  # [9]
    total_anchors = all_anchors.shape[0]  # number of all anchors: 9 * feature-map width * feature-map height
    K = total_anchors / num_anchors

    _allowed_border = 0  # allow boxes to sit over the edge by a small amount
    height, width = rpn_cls_score.shape[1:3]  # height and width of the RPN feature map

    inds_inside = np.where(  # anchors may cross the image border; take the indices of those inside the image
        (all_anchors[:, 0] >= -_allowed_border) & (all_anchors[:, 1] >= -_allowed_border) &
        (all_anchors[:, 2] < im_info[1] + _allowed_border) &  # width
        (all_anchors[:, 3] < im_info[0] + _allowed_border)  # height
    )[0]

    anchors = all_anchors[inds_inside, :]  # coordinates of the anchors inside the image

    labels = np.empty((len(inds_inside),), dtype=np.float32)  # label: 1 positive, 0 negative, -1 don't care
    labels.fill(-1)

    # overlap (IoU) matrix n*m between the anchors (n*4) and the ground-truth boxes gt_boxes (m*4)
    overlaps = bbox_overlaps(np.ascontiguousarray(anchors, dtype=np.float), np.ascontiguousarray(gt_boxes, dtype=np.float))
    argmax_overlaps = overlaps.argmax(axis=1)  # position of the row maximum: the best gt box for each anchor, n-vector
    max_overlaps = overlaps[np.arange(len(inds_inside)), argmax_overlaps]  # overlap of each anchor with its best gt box, n-vector
    gt_argmax_overlaps = overlaps.argmax(axis=0)  # position of the column maximum: the best anchor for each gt box, m-vector
    gt_max_overlaps = overlaps[gt_argmax_overlaps, np.arange(overlaps.shape[1])]  # overlap of each gt box with its best anchor, m-vector
    gt_argmax_overlaps = np.where(overlaps == gt_max_overlaps)[0]  # indices (in ascending order) of all anchors that reach the best overlap of some gt box

    if not cfg.TRAIN.RPN_CLOBBER_POSITIVES:  # assign bg labels first so that positive labels can clobber them first set the negatives
        labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0  # anchors whose best overlap is below the threshold become 0

    labels[gt_argmax_overlaps] = 1  # fg label: for each gt, anchor with highest overlap
    labels[max_overlaps >= cfg.TRAIN.RPN_POSITIVE_OVERLAP] = 1  # fg label: above threshold IOU

    if cfg.TRAIN.RPN_CLOBBER_POSITIVES:  # assign bg labels last so that negative labels can clobber positives
        labels[max_overlaps < cfg.TRAIN.RPN_NEGATIVE_OVERLAP] = 0

    # if there are too many positives, randomly keep only num_fg = 0.5*256 = 128 of them
    num_fg = int(cfg.TRAIN.RPN_FG_FRACTION * cfg.TRAIN.RPN_BATCHSIZE)  # subsample positive labels if we have too many
    fg_inds = np.where(labels == 1)[0]
    if len(fg_inds) > num_fg:
        disable_inds = npr.choice(fg_inds, size=(len(fg_inds) - num_fg), replace=False)
        labels[disable_inds] = -1  # surplus positives become don't care

    # if there are too many negatives, randomly keep only num_bg = 256 - (number of positives) of them
    num_bg = cfg.TRAIN.RPN_BATCHSIZE - np.sum(labels == 1)  # subsample negative labels if we have too many
    bg_inds = np.where(labels == 0)[0]
    if len(bg_inds) > num_bg:
        disable_inds = npr.choice(bg_inds, size=(len(bg_inds) - num_bg), replace=False)
        labels[disable_inds] = -1  # surplus negatives become don't care

    bbox_targets = np.zeros((len(inds_inside), 4), dtype=np.float32)
    bbox_targets = _compute_targets(anchors, gt_boxes[argmax_overlaps, :])  # coordinate offsets between each anchor and its best gt box

    bbox_inside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    bbox_inside_weights[labels == 1, :] = np.array(cfg.TRAIN.RPN_BBOX_INSIDE_WEIGHTS)  # all four coordinate weights of a positive sample are set to 1

    bbox_outside_weights = np.zeros((len(inds_inside), 4), dtype=np.float32)
    if cfg.TRAIN.RPN_POSITIVE_WEIGHT < 0:  # uniform weighting of examples (given non-uniform sampling)
        num_examples = np.sum(labels >= 0)  # number of positives plus negatives (don't care excluded)
        positive_weights = np.ones((1, 4)) * 1.0 / num_examples  # normalized weight
        negative_weights = np.ones((1, 4)) * 1.0 / num_examples  # normalized weight
    else:
        assert ((cfg.TRAIN.RPN_POSITIVE_WEIGHT > 0) & (cfg.TRAIN.RPN_POSITIVE_WEIGHT < 1))
        positive_weights = (cfg.TRAIN.RPN_POSITIVE_WEIGHT / np.sum(labels == 1))
        negative_weights = ((1.0 - cfg.TRAIN.RPN_POSITIVE_WEIGHT) / np.sum(labels == 0))
    bbox_outside_weights[labels == 1, :] = positive_weights  # normalized weight
    bbox_outside_weights[labels == 0, :] = negative_weights  # normalized weight

    # inds_inside was used above, so map labels, bbox_targets, bbox_inside_weights and bbox_outside_weights back to
    # the original anchor set (including anchors that cross the image border), filling the unused entries with fill
    labels = _unmap(labels, total_anchors, inds_inside, fill=-1)
    bbox_targets = _unmap(bbox_targets, total_anchors, inds_inside, fill=0)
    bbox_inside_weights = _unmap(bbox_inside_weights, total_anchors, inds_inside, fill=0)  # 1 for the four coordinates of positive anchors, 0 elsewhere
    bbox_outside_weights = _unmap(bbox_outside_weights, total_anchors, inds_inside, fill=0)

    labels = labels.reshape((1, height, width, A)).transpose(0, 3, 1, 2)  # (1*?*?)*9 ==> 1*?*?*9 ==> 1*9*?*?
    labels = labels.reshape((1, 1, A * height, width))  # 1*9*?*? ==> 1*1*(9*?)*?
    rpn_labels = labels  # whether each anchor is positive, negative or don't care

    bbox_targets = bbox_targets.reshape((1, height, width, A * 4))  # 1*(9*?)*?*4 ==> 1*?*?*(9*4)

    rpn_bbox_targets = bbox_targets  # coordinate offsets between each anchor and its matching gt box (many are 0)
    bbox_inside_weights = bbox_inside_weights.reshape((1, height, width, A * 4))  # 1*(9*?)*?*4 ==> 1*?*?*(9*4)
    rpn_bbox_inside_weights = bbox_inside_weights
    bbox_outside_weights = bbox_outside_weights.reshape((1, height, width, A * 4))  # 1*(9*?)*?*4 ==> 1*?*?*(9*4)
    rpn_bbox_outside_weights = bbox_outside_weights  # normalized weights
    return rpn_labels, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights


def _unmap(data, count, inds, fill=0):
    """ Unmap a subset of item (data) back to the original set of items (of size count) """
    if len(data.shape) == 1:
        ret = np.empty((count,), dtype=np.float32)  # 1-D array
        ret.fill(fill)  # fill with the default value
        ret[inds] = data  # write the actual data at the valid positions
    else:
        ret = np.empty((count,) + data.shape[1:], dtype=np.float32)  # array with the matching number of dims
        ret.fill(fill)  # fill with the default value
        ret[inds, :] = data  # write the actual data at the valid positions
    return ret


def _compute_targets(ex_rois, gt_rois):
    """Compute bounding-box regression targets for an image."""
    assert ex_rois.shape[0] == gt_rois.shape[0]
    assert ex_rois.shape[1] == 4
    assert gt_rois.shape[1] == 5

    # last four terms of Eq. 2: coordinate offsets between each anchor and its matching gt box
    return bbox_transform(ex_rois, gt_rois[:, :4]).astype(np.float32, copy=False)  # gt_rois has 5 columns; drop the last one (the class label)


def bbox_transform(ex_rois, gt_rois):
    ex_widths = ex_rois[:, 2] - ex_rois[:, 0] + 1.0  # anchor width
    ex_heights = ex_rois[:, 3] - ex_rois[:, 1] + 1.0  # anchor height
    ex_ctr_x = ex_rois[:, 0] + 0.5 * ex_widths  # anchor center x
    ex_ctr_y = ex_rois[:, 1] + 0.5 * ex_heights  # anchor center y

    gt_widths = gt_rois[:, 2] - gt_rois[:, 0] + 1.0  # gt box width
    gt_heights = gt_rois[:, 3] - gt_rois[:, 1] + 1.0  # gt box height
    gt_ctr_x = gt_rois[:, 0] + 0.5 * gt_widths  # gt box center x
    gt_ctr_y = gt_rois[:, 1] + 0.5 * gt_heights  # gt box center y

    targets_dx = (gt_ctr_x - ex_ctr_x) / ex_widths  # dx from x*, xa, wa (last four terms of Eq. 2)
    targets_dy = (gt_ctr_y - ex_ctr_y) / ex_heights  # dy from y*, ya, ha
    targets_dw = np.log(gt_widths / ex_widths)  # dw from w*, wa
    targets_dh = np.log(gt_heights / ex_heights)  # dh from h*, ha

    targets = np.vstack((targets_dx, targets_dy, targets_dw, targets_dh)).transpose()
    return targets
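1.8 bbox_overlaps
bbox_overlaps computes the overlap ratio (intersection over union, IoU) between the anchors and the ground-truth boxes. For details see the reference https://www.cnblogs.com/darkknightzh/p/9043395.html. The code in the program is as follows: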

def bbox_overlaps(
        np.ndarray[DTYPE_t, ndim=2] boxes,
        np.ndarray[DTYPE_t, ndim=2] query_boxes):
    """
    Parameters
    ----------
    boxes: (N, 4) ndarray of float
    query_boxes: (K, 4) ndarray of float
    Returns
    -------
    overlaps: (N, K) ndarray of overlap between boxes and query_boxes
    """
    cdef unsigned int N = boxes.shape[0]
    cdef unsigned int K = query_boxes.shape[0]
    cdef np.ndarray[DTYPE_t, ndim=2] overlaps = np.zeros((N, K), dtype=DTYPE)
    cdef DTYPE_t iw, ih, box_area
    cdef DTYPE_t ua
    cdef unsigned int k, n
    for k in range(K):
        box_area = (
            (query_boxes[k, 2] - query_boxes[k, 0] + 1) *
            (query_boxes[k, 3] - query_boxes[k, 1] + 1)
        )
        for n in range(N):
            iw = (
                min(boxes[n, 2], query_boxes[k, 2]) -
                max(boxes[n, 0], query_boxes[k, 0]) + 1
            )
            if iw > 0:
                ih = (
                    min(boxes[n, 3], query_boxes[k, 3]) -
                    max(boxes[n, 1], query_boxes[k, 1]) + 1
                )
                if ih > 0:
                    ua = float(
                        (boxes[n, 2] - boxes[n, 0] + 1) *
                        (boxes[n, 3] - boxes[n, 1] + 1) +
                        box_area - iw * ih
                    )
                    overlaps[n, k] = iw * ih / ua
    return overlaps
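For experimenting without compiling the Cython extension, the same quantity can be computed with NumPy broadcasting. The function name bbox_overlaps_np is hypothetical; the arithmetic follows the listing above, including the +1 pixel convention.

```python
import numpy as np


def bbox_overlaps_np(boxes, query_boxes):
    """IoU between every box (N, 4) and every query box (K, 4); returns an (N, K) array."""
    area_b = (boxes[:, 2] - boxes[:, 0] + 1) * (boxes[:, 3] - boxes[:, 1] + 1)
    area_q = (query_boxes[:, 2] - query_boxes[:, 0] + 1) * (query_boxes[:, 3] - query_boxes[:, 1] + 1)
    # (N, 1) against (1, K) broadcasts to (N, K)
    iw = (np.minimum(boxes[:, None, 2], query_boxes[None, :, 2]) -
          np.maximum(boxes[:, None, 0], query_boxes[None, :, 0]) + 1)
    ih = (np.minimum(boxes[:, None, 3], query_boxes[None, :, 3]) -
          np.maximum(boxes[:, None, 1], query_boxes[None, :, 1]) + 1)
    inter = np.clip(iw, 0, None) * np.clip(ih, 0, None)   # zero when the boxes do not intersect
    union = area_b[:, None] + area_q[None, :] - inter
    return inter / union


boxes = np.array([[0, 0, 9, 9]], dtype=np.float64)
query = np.array([[0, 0, 9, 9], [5, 5, 14, 14], [20, 20, 29, 29]], dtype=np.float64)
iou = bbox_overlaps_np(boxes, query)
```

The three query boxes cover the identical, partially overlapping and disjoint cases.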
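1.9 _proposal_target_layer
_proposal_target_layer calls proposal_target_layer, which in turn calls _sample_rois to pick 256 anchors out of the 2000 anchors selected earlier by _proposal_layer. _sample_rois caps the number of positive samples at 64 (when there are fewer, the rest is filled with negative samples), normalizes the coordinates according to Eq. 2, and obtains bbox_targets through _get_bbox_regression_labels. The results are used for R-CNN classification and regression. This layer is used only in training; at test time the 300 anchors are used directly, so the layer is not needed.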
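The foreground/background sampling described above can be sketched in plain Python. This is an illustration only: sample_rois is a hypothetical name, and the overlap thresholds (foreground at IoU >= 0.5, background in [0.0, 0.5)) are assumed values that should be checked against the actual config. Only the counting logic (at most 64 positives out of 256, the rest negatives) comes from the text.

```python
import random


def sample_rois(max_overlaps, batch_size=256, fg_fraction=0.25,
                fg_thresh=0.5, bg_thresh_hi=0.5, bg_thresh_lo=0.0, seed=0):
    """Pick indices of foreground and background rois from their best IoU with any gt box."""
    rng = random.Random(seed)
    fg = [i for i, ov in enumerate(max_overlaps) if ov >= fg_thresh]
    bg = [i for i, ov in enumerate(max_overlaps) if bg_thresh_lo <= ov < bg_thresh_hi]

    n_fg = min(int(round(fg_fraction * batch_size)), len(fg))   # at most 64 positives
    fg_keep = rng.sample(fg, n_fg)

    n_bg = batch_size - n_fg                                    # fill the rest with negatives
    if len(bg) >= n_bg:
        bg_keep = rng.sample(bg, n_bg)                          # without replacement
    else:
        bg_keep = [rng.choice(bg) for _ in range(n_bg)] if bg else []  # with replacement
    return fg_keep, bg_keep


# 100 high-overlap proposals followed by 1900 low-overlap ones.
fg_a, bg_a = sample_rois([0.9] * 100 + [0.2] * 1900)
# Only 10 high-overlap proposals: the batch is topped up with negatives.
fg_b, bg_b = sample_rois([0.9] * 10 + [0.2] * 1990)
```

In both cases the batch holds 256 rois; only the split between positives and negatives changes.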
=============================================================
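190901 update:
Note: thanks to @pytf for the explanation (see comments #19 and #20). The annotation here is wrong, namely the comment on line 146:
# rois: 256 anchors selected from the post_nms_topN anchors (the all-zero first column is updated to the class of each anchor)
The explanation of the first column of rois is wrong. Since only one image is fed in at a time, the first column of rois is all zeros. The first column of rois is not updated to the class of each anchor here.
Separately, line 139 updates the first column of bbox_target_data to the class of each anchor. The explanation of that line is not very clear.
End of 190901 update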
=============================================================
_proposal_target_layer is defined as follows:

  1 def _proposal_target_layer(self, rois, roi_scores, name):  # positions of the post_nms_topN anchors and their probabilities of being 1 (positive)
  2     # used only in training: selects 256 anchors from the post_nms_topN anchors
  3     with tf.variable_scope(name) as scope:
  4         # labels: true class of each positive and negative sample
  5         # rois: 256 anchors selected from the post_nms_topN anchors (the first column stays the all-zero batch index, see the 190901 update)
  6         # roi_scores: probabilities of being positive for the 256 anchors
  7         # bbox_targets: 256*(4*21) matrix; only for positive samples are the coordinates of the matching class non-zero, all other classes are 0
  8         # bbox_inside_weights: 256*(4*21) matrix; for positive samples the four weights of the matching class are 1, everything else is 0
  9         # bbox_outside_weights: 256*(4*21) matrix; for positive samples the four weights of the matching class are 1, everything else is 0
 10         rois, roi_scores, labels, bbox_targets, bbox_inside_weights, bbox_outside_weights = tf.py_func(
 11             proposal_target_layer, [rois, roi_scores, self._gt_boxes, self._num_classes],
 12             [tf.float32, tf.float32, tf.float32, tf.float32, tf.float32, tf.float32], name="proposal_target")
 13
 14         rois.set_shape([cfg.TRAIN.BATCH_SIZE, 5])
 15         roi_scores.set_shape([cfg.TRAIN.BATCH_SIZE])
 16         labels.set_shape([cfg.TRAIN.BATCH_SIZE, 1])
 17         bbox_targets.set_shape([cfg.TRAIN.BATCH_SIZE, self._num_classes * 4])
 18         bbox_inside_weights.set_shape([cfg.TRAIN.BATCH_SIZE, self._num_classes * 4])
 19         bbox_outside_weights.set_shape([cfg.TRAIN.BATCH_SIZE, self._num_classes * 4])
 20
 21         self._proposal_targets['rois'] = rois
 22         self._proposal_targets['labels'] = tf.to_int32(labels, name="to_int32")
 23         self._proposal_targets['bbox_targets'] = bbox_targets
 24         self._proposal_targets['bbox_inside_weights'] = bbox_inside_weights
 25         self._proposal_targets['bbox_outside_weights'] = bbox_outside_weights
 26
 27         self._score_summaries.update(self._proposal_targets)
 28
 29         return rois, roi_scores
 30
 31 def proposal_target_layer(rpn_rois, rpn_scores, gt_boxes, _num_classes):
 32     """Assign object detection proposals to ground-truth targets. Produces proposal classification labels and bounding-box regression targets."""
 33     # Proposal ROIs (0, x1, y1, x2, y2) coming from RPN (i.e., rpn.proposal_layer.ProposalLayer), or any other source
 34     all_rois = rpn_rois  # rpn_rois is a post_nms_topN*5 matrix
 35     all_scores = rpn_scores  # rpn_scores has post_nms_topN entries: the probability of each anchor being positive
 36
 37     if cfg.TRAIN.USE_GT:  # Include ground-truth boxes in the set of candidate rois; USE_GT=False, so this code is not used
 38         zeros = np.zeros((gt_boxes.shape[0], 1), dtype=gt_boxes.dtype)
 39         all_rois = np.vstack((all_rois, np.hstack((zeros, gt_boxes[:, :-1]))))
 40         all_scores = np.vstack((all_scores, zeros))  # not sure if it a wise appending, but anyway i am not using it
 41
 42     num_images = 1  # the program can only process one image at a time
 43     rois_per_image = cfg.TRAIN.BATCH_SIZE / num_images  # number of rois finally selected per image
 44     fg_rois_per_image = np.round(cfg.TRAIN.FG_FRACTION * rois_per_image)  # number of positive samples: 0.25*rois_per_image
 45
 46     # Sample rois with classification labels and bounding box regression targets
 47     # labels: true class of each positive and negative sample
 48     # rois: 256 anchors selected from the post_nms_topN anchors (the first column stays the all-zero batch index, see the 190901 update)
 49     # roi_scores: probabilities of being positive for the 256 anchors
 50     # bbox_targets: 256*(4*21) matrix; only for positive samples are the coordinates of the matching class non-zero, all other classes are 0
 51     # bbox_inside_weights: 256*(4*21) matrix; for positive samples the four weights of the matching class are 1, everything else is 0
 52     labels, rois, roi_scores, bbox_targets, bbox_inside_weights = _sample_rois(all_rois, all_scores, gt_boxes, fg_rois_per_image, rois_per_image, _num_classes)  # select 256 anchors
 53
 54     rois = rois.reshape(-1, 5)
 55     roi_scores = roi_scores.reshape(-1)
 56     labels = labels.reshape(-1, 1)
 57     bbox_targets = bbox_targets.reshape(-1, _num_classes * 4)
 58     bbox_inside_weights = bbox_inside_weights.reshape(-1, _num_classes * 4)
 59     bbox_outside_weights = np.array(bbox_inside_weights > 0).astype(np.float32)  # 256*(4*21) matrix; for positive samples the four weights of the matching class are 1, everything else is 0
 60
 61     return rois, roi_scores, labels, bbox_targets, bbox_inside_weights, bbox_outside_weights
 62
 63
 64 def _get_bbox_regression_labels(bbox_target_data, num_classes):
 65     """Bounding-box regression targets (bbox_target_data) are stored in a compact form N x (class, tx, ty, tw, th)
 66     This function expands those targets into the 4-of-4*K representation used by the network (i.e. only one class has non-zero targets).
 67     Returns:
 68         bbox_target (ndarray): N x 4K blob of regression targets
 69         bbox_inside_weights (ndarray): N x 4K blob of loss weights
 70     """
 71     clss = bbox_target_data[:, 0]  # first column: the class
 72     bbox_targets = np.zeros((clss.size, 4 * num_classes), dtype=np.float32)  # 256*(4*21) matrix
 73     bbox_inside_weights = np.zeros(bbox_targets.shape, dtype=np.float32)
 74     inds = np.where(clss > 0)[0]  # indices of the positive samples
 75     for ind in inds:
 76         cls = clss[ind]  # class of the positive sample
 77         start = int(4 * cls)  # first column of this class's 4 coordinates
 78         end = start + 4  # one past the last column (4 coordinates per class)
 79         bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]  # write the coordinate offsets into the slot of the matching class
 80         bbox_inside_weights[ind, start:end] = cfg.TRAIN.BBOX_INSIDE_WEIGHTS  # write the weights (1.0, 1.0, 1.0, 1.0) into the slot of the matching class
 81
 82     # bbox_targets: 256*(4*21) matrix; only for positive samples are the coordinates of the matching class non-zero, all other classes are 0
 83     # bbox_inside_weights: 256*(4*21) matrix; for positive samples the four weights of the matching class are 1, everything else is 0
 84     return bbox_targets, bbox_inside_weights
 85
 86
 87 def _compute_targets(ex_rois, gt_rois, labels):
 88     """Compute bounding-box regression targets for an image."""
 89     assert ex_rois.shape[0] == gt_rois.shape[0]
 90     assert ex_rois.shape[1] == 4
 91     assert gt_rois.shape[1] == 4
 92
 93     targets = bbox_transform(ex_rois, gt_rois)  # last four terms of Eq. 2: coordinate offsets between the 256 anchors and their matching gt boxes
 94     if cfg.TRAIN.BBOX_NORMALIZE_TARGETS_PRECOMPUTED:  # Optionally normalize targets by a precomputed mean and stdev
 95         targets = ((targets - np.array(cfg.TRAIN.BBOX_NORMALIZE_MEANS)) / np.array(cfg.TRAIN.BBOX_NORMALIZE_STDS))  # subtract the mean and divide by the std to normalize
 96     return np.hstack((labels[:, np.newaxis], targets)).astype(np.float32, copy=False)  # the first column of the result is the class label, followed by the 4 targets
 97
 98
 99 def _sample_rois(all_rois, all_scores, gt_boxes, fg_rois_per_image, rois_per_image, num_classes):  # all_rois: first column all 0, last 4 columns coordinates; gt_boxes: first 4 columns coordinates, last column the class
100     """Generate a random sample of RoIs comprising foreground and background
examples.""" 101 # 計算archors和gt_boxes重疊區域面積的比值 102 overlaps = bbox_overlaps(np.ascontiguousarray(all_rois[:, 1:5], dtype=np.float), np.ascontiguousarray(gt_boxes[:, :4], dtype=np.float)) # overlaps: (rois x gt_boxes) 103 gt_assignment = overlaps.argmax(axis=1) # 得到每個archors對應的gt_boxes的索引 104 max_overlaps = overlaps.max(axis=1) # 得到每個archors對應的gt_boxes的重疊區域的值 105 labels = gt_boxes[gt_assignment, 4] # 得到每個archors對應的gt_boxes的類別 106 107 # 每個archors對應的gt_boxes的重疊區域的值大於閾值的作為正樣本,得到正樣本的索引 108 fg_inds = np.where(max_overlaps >= cfg.TRAIN.FG_THRESH)[0] # Select foreground RoIs as those with >= FG_THRESH overlap 109 # Guard against the case when an image has fewer than fg_rois_per_image. Select background RoIs as those within [BG_THRESH_LO, BG_THRESH_HI) 110 # 每個archors對應的gt_boxes的重疊區域的值在給定閾值內的作為負樣本,得到負樣本的索引 111 bg_inds = np.where((max_overlaps < cfg.TRAIN.BG_THRESH_HI) & (max_overlaps >= cfg.TRAIN.BG_THRESH_LO))[0] 112 113 # Small modification to the original version where we ensure a fixed number of regions are sampled 114 # 最終選擇256個archors 115 if fg_inds.size > 0 and bg_inds.size > 0: # 正負樣本均存在,則選擇最多fg_rois_per_image個正樣本,不夠的話,補充負樣本 116 fg_rois_per_image = min(fg_rois_per_image, fg_inds.size) 117 fg_inds = npr.choice(fg_inds, size=int(fg_rois_per_image), replace=False) 118 bg_rois_per_image = rois_per_image - fg_rois_per_image 119 to_replace = bg_inds.size < bg_rois_per_image 120 bg_inds = npr.choice(bg_inds, size=int(bg_rois_per_image), replace=to_replace) 121 elif fg_inds.size > 0: # 只有正樣本,選擇rois_per_image個正樣本 122 to_replace = fg_inds.size < rois_per_image 123 fg_inds = npr.choice(fg_inds, size=int(rois_per_image), replace=to_replace) 124 fg_rois_per_image = rois_per_image 125 elif bg_inds.size > 0: # 只有負樣本,選擇rois_per_image個負樣本 126 to_replace = bg_inds.size < rois_per_image 127 bg_inds = npr.choice(bg_inds, size=int(rois_per_image), replace=to_replace) 128 fg_rois_per_image = 0 129 else: 130 import pdb 131 pdb.set_trace() 132 133 keep_inds = np.append(fg_inds, bg_inds) # 
正樣本和負樣本的索引 134 labels = labels[keep_inds] # 正樣本和負樣本對應的真實的類別 135 labels[int(fg_rois_per_image):] = 0 # 負樣本對應的類別設置為0 136 rois = all_rois[keep_inds] # 從post_nms_topN個archors中選擇256個archors 137 roi_scores = all_scores[keep_inds] # 256個archors對應的為正樣本的概率 138 139 # 通過256個archors的坐標和每個archors對應的gt_boxes的坐標及這些archors的真實類別得到坐標偏移(將rois第一列的全0更新為每個archors對應的類別) 140 bbox_target_data = _compute_targets(rois[:, 1:5], gt_boxes[gt_assignment[keep_inds], :4], labels) 141 # bbox_targets:256*(4*21)的矩陣,只有為正樣本時,對應類別的坐標才不為0,其他類別的坐標全為0 142 # bbox_inside_weights:256*(4*21)的矩陣,正樣本時,對應類別四個坐標的權重為1,其他全為0 143 bbox_targets, bbox_inside_weights = _get_bbox_regression_labels(bbox_target_data, num_classes) 144 145 # labels:正樣本和負樣本對應的真實的類別 146 # rois:從post_nms_topN個archors中選擇256個archors(第一列的全0更新為每個archors對應的類別) 147 # roi_scores:256個archors對應的為正樣本的概率 148 # bbox_targets:256*(4*21)的矩陣,只有為正樣本時,對應類別的坐標才不為0,其他類別的坐標全為0 149 # bbox_inside_weights:256*(4*21)的矩陣,正樣本時,對應類別四個坐標的權重為1,其他全為0 150 return labels, rois, roi_scores, bbox_targets, bbox_inside_weights
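The expansion from the compact N x (class, tx, ty, tw, th) form into the 4-of-4*K layout is easy to check on a toy input. The following is only a sketch: it assumes 3 classes instead of 21 and inside weights of (1, 1, 1, 1), and `expand_targets` is a hypothetical stand-alone name, not a function of the repository.

```python
import numpy as np

def expand_targets(bbox_target_data, num_classes, inside_weights=(1.0, 1.0, 1.0, 1.0)):
    """Expand N x (class, tx, ty, tw, th) into two N x 4K arrays (hypothetical helper)."""
    clss = bbox_target_data[:, 0]
    bbox_targets = np.zeros((clss.size, 4 * num_classes), dtype=np.float32)
    bbox_inside_weights = np.zeros_like(bbox_targets)
    for ind in np.where(clss > 0)[0]:       # background rows (class 0) stay all zero
        start = int(4 * clss[ind])          # first column of the slot of this class
        bbox_targets[ind, start:start + 4] = bbox_target_data[ind, 1:]
        bbox_inside_weights[ind, start:start + 4] = inside_weights
    return bbox_targets, bbox_inside_weights

# Two made-up RoIs: a positive of class 2 and a background RoI (class 0); 3 classes in total.
compact = np.array([[2, 0.1, -0.2, 0.3, 0.4],
                    [0, 0.5, 0.5, 0.5, 0.5]], dtype=np.float32)
targets, inside = expand_targets(compact, num_classes=3)
outside = (inside > 0).astype(np.float32)   # same rule as for bbox_outside_weights
print(targets.shape)
print(targets[0])
```

Only columns 8 to 11 of the first row are filled, because class 2 owns the third group of four columns. The background row contributes nothing to the regression loss.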
1.10 _crop_pool_layer
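_crop_pool_layer crops the 256 anchors out of the feature map, resizes each crop to 14*14, and then max-pools it to a fixed size of 7*7. The resulting features are what the rcnn network classifies and regresses coordinates from.
The function first computes the width and height of the original image that the feature map corresponds to, then normalizes the rois (given in original-image coordinates) by them, and uses tf.image.crop_and_resize (which requires normalized coordinates) to resize each crop to [cfg.POOLING_SIZE * 2, cfg.POOLING_SIZE * 2]. Finally slim.max_pool2d pools the crops, so the output always has the same size (256*7*7*512).
tf.slice(rois, [0, 0], [-1, 1]) slices its input. The second argument is the start coordinate and the third is the size of the slice. Note that for a 2-D input both arguments are in (y, x) order, and for a 3-D input both are in (z, y, x) order. A size of -1 takes everything from the start index to the end of that dimension. The call above therefore takes the first column of rois starting from (0, 0): -1 for y means all rows, 1 for x means a single column.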
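The begin/size semantics of tf.slice can be mirrored with plain NumPy indexing. This is a sketch for illustration only: `slice_like_tf` is a hypothetical helper, and the rois values are made up.

```python
import numpy as np

def slice_like_tf(x, begin, size):
    """Mimic tf.slice(x, begin, size); a size of -1 means 'up to the end of that dimension'."""
    index = tuple(
        slice(b, None if s == -1 else b + s)
        for b, s in zip(begin, size)
    )
    return x[index]

# Made-up rois in the (batch index, x1, y1, x2, y2) layout
rois = np.array([[0., 10., 20., 110., 220.],
                 [0., 30., 40., 130., 240.]])
batch_col = slice_like_tf(rois, [0, 0], [-1, 1])   # all rows, first column
x1 = slice_like_tf(rois, [0, 1], [-1, 1])          # all rows, second column
print(batch_col.shape, x1.ravel())
```

The slice keeps both dimensions, so the result has shape (N, 1). That is why the first column is squeezed afterwards in _crop_pool_layer.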
_crop_pool_layer is defined as follows:

def _crop_pool_layer(self, bottom, rois, name):
    with tf.variable_scope(name) as scope:
        batch_ids = tf.squeeze(tf.slice(rois, [0, 0], [-1, 1], name="batch_id"), [1])  # first column: index of the image in the batch (all 0 here)
        bottom_shape = tf.shape(bottom)  # Get the normalized coordinates of bounding boxes
        height = (tf.to_float(bottom_shape[1]) - 1.) * np.float32(self._feat_stride[0])
        width = (tf.to_float(bottom_shape[2]) - 1.) * np.float32(self._feat_stride[0])
        x1 = tf.slice(rois, [0, 1], [-1, 1], name="x1") / width  # crop_and_resize expects boxes in the range 0-1, so normalize the coordinates
        y1 = tf.slice(rois, [0, 2], [-1, 1], name="y1") / height
        x2 = tf.slice(rois, [0, 3], [-1, 1], name="x2") / width
        y2 = tf.slice(rois, [0, 4], [-1, 1], name="y2") / height
        bboxes = tf.stop_gradient(tf.concat([y1, x1, y2, x2], axis=1))  # Won't be back-propagated to rois anyway, but to save time
        pre_pool_size = cfg.POOLING_SIZE * 2

        # crop 256 features according to bboxes and resize them to 14*14 (same number of channels as bottom); the batch size is 256
        crops = tf.image.crop_and_resize(bottom, bboxes, tf.to_int32(batch_ids), [pre_pool_size, pre_pool_size], name="crops")

    return slim.max_pool2d(crops, [2, 2], padding='SAME')  # 7*7 features after max pooling
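The normalization step is plain arithmetic and can be checked by hand. A minimal sketch, assuming a 38*50 feature map and a stride of 16 (both values chosen only for illustration) and made-up rois:

```python
import numpy as np

feat_h, feat_w, feat_stride = 38, 50, 16     # assumed feature-map size and stride
height = (feat_h - 1.0) * feat_stride         # 592.0
width = (feat_w - 1.0) * feat_stride          # 784.0

# Made-up rois in the (batch index, x1, y1, x2, y2) layout, in image coordinates
rois = np.array([[0., 0., 0., 784., 592.],
                 [0., 196., 148., 392., 296.]])
x1, y1, x2, y2 = (rois[:, i] for i in range(1, 5))

# crop_and_resize takes boxes as (y1, x1, y2, x2), each normalized to [0, 1]
bboxes = np.stack([y1 / height, x1 / width, y2 / height, x2 / width], axis=1)
print(bboxes)
```

The first roi covers the whole map and becomes (0, 0, 1, 1). Note the column order changes from (x, y) in rois to (y, x) in bboxes.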
1.11 _head_to_tail
_head_to_tail appends two fc layers (with ReLU) and two dropout layers (present in training, absent in testing) to the features of the 256 anchors obtained above, reducing them to 4096 dimensions for the classification and regression in _region_classification.
_head_to_tail is located in vgg16.py and is defined as follows:

def _head_to_tail(self, pool5, is_training, reuse=None):
    with tf.variable_scope(self._scope, self._scope, reuse=reuse):
        pool5_flat = slim.flatten(pool5, scope='flatten')
        fc6 = slim.fully_connected(pool5_flat, 4096, scope='fc6')
        if is_training:
            fc6 = slim.dropout(fc6, keep_prob=0.5, is_training=True, scope='dropout6')
        fc7 = slim.fully_connected(fc6, 4096, scope='fc7')
        if is_training:
            fc7 = slim.dropout(fc7, keep_prob=0.5, is_training=True, scope='dropout7')

    return fc7
1.12 _region_classification
fc7 is classified and regressed by _region_classification. On one side, fc7 passes through an fc layer (no ReLU) that reduces it to 21 dimensions (the number of classes), giving cls_score, from which the probability cls_prob and the predicted class cls_pred are derived (used for rcnn classification). On the other side, fc7 passes through another fc layer (no ReLU) that reduces it to 21*4 dimensions, giving bbox_pred (used for rcnn regression).
_region_classification is defined as follows:

def _region_classification(self, fc7, is_training, initializer, initializer_bbox):
    # add an fc layer whose output size is the total number of classes, for classification
    cls_score = slim.fully_connected(fc7, self._num_classes, weights_initializer=initializer, trainable=is_training, activation_fn=None, scope='cls_score')
    cls_prob = self._softmax_layer(cls_score, "cls_prob")  # probability of each class
    cls_pred = tf.argmax(cls_score, axis=1, name="cls_pred")  # predicted class
    # add an fc layer that predicts the position offsets
    bbox_pred = slim.fully_connected(fc7, self._num_classes * 4, weights_initializer=initializer_bbox, trainable=is_training, activation_fn=None, scope='bbox_pred')

    self._predictions["cls_score"] = cls_score  # class scores of the 256 anchors, used for rcnn classification
    self._predictions["cls_pred"] = cls_pred
    self._predictions["cls_prob"] = cls_prob
    self._predictions["bbox_pred"] = bbox_pred

    return cls_prob, bbox_pred
The steps above complete the creation of the network: rois, cls_prob, bbox_pred = self._build_network(training).
rois: 256*5
cls_prob: 256*21 (number of classes)
bbox_pred: 256*84 (number of classes*4)
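How these three outputs relate can be illustrated with random stand-in values. This is a sketch in NumPy rather than TensorFlow; the random arrays only imitate the shapes listed above and carry no meaning.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
num_rois, num_classes = 256, 21
cls_score = rng.normal(size=(num_rois, num_classes))      # stand-in for the cls_score fc output
bbox_pred = rng.normal(size=(num_rois, num_classes * 4))  # stand-in for the bbox_pred fc output

cls_prob = softmax(cls_score)           # 256*21, every row sums to 1
cls_pred = cls_score.argmax(axis=1)     # 256 predicted class indices

# The four offsets that belong to the predicted class of the first RoI
k = cls_pred[0]
first_box = bbox_pred[0, 4 * k: 4 * k + 4]
print(cls_prob.shape, cls_pred.shape, first_box.shape)
```

Because softmax is monotonic, taking the argmax of cls_score or of cls_prob gives the same class.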
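2. Loss function _add_losses
faster rcnn has two losses: the loss of the rpn network plus the loss of the rcnn network. Each of them consists of a classification loss and a regression loss. The classification loss is cross entropy and the regression loss is smooth L1 loss.
The program adds the corresponding losses through _add_losses. rpn_cross_entropy and rpn_loss_box are the two losses of the RPN network; cross_entropy and loss_box, computed from cls_score and bbox_pred, are the two losses of the rcnn network. The first two concern whether an anchor matches a ground truth box (binary classification); the last two have a batch size of 256.
rpn_cls_score (1,?,?,2) and rpn_label are reshaped to [-1, 2] and [-1]. The indices where rpn_label is not -1 are taken, the entries of rpn_cls_score and rpn_label at those indices are gathered, and sparse_softmax_cross_entropy_with_logits is computed on them, giving rpn_cross_entropy.
The _smooth_l1_loss of rpn_bbox_pred (1,?,?,36) and rpn_bbox_targets (1,?,?,36) gives rpn_loss_box.
The sparse_softmax_cross_entropy_with_logits of cls_score (256*21) and label (256) gives cross_entropy.
The _smooth_l1_loss of bbox_pred (256*84) and bbox_targets (256*84) gives loss_box.
Finally the four losses above are added to obtain the total loss (regularization_loss still has to be added on top).
At this point the losses are fully constructed.
The program adds the losses in _add_losses: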

def _add_losses(self, sigma_rpn=3.0):
    with tf.variable_scope('LOSS_' + self._tag) as scope:
        rpn_cls_score = tf.reshape(self._predictions['rpn_cls_score_reshape'], [-1, 2])  # positive/negative scores of every anchor
        rpn_label = tf.reshape(self._anchor_targets['rpn_labels'], [-1])  # whether each position of the feature map is positive, negative or ignored (anchors crossing the image border have been removed)
        rpn_select = tf.where(tf.not_equal(rpn_label, -1))  # indices of the anchors that are not ignored
        rpn_cls_score = tf.reshape(tf.gather(rpn_cls_score, rpn_select), [-1, 2])  # drop the ignored anchors
        rpn_label = tf.reshape(tf.gather(rpn_label, rpn_select), [-1])  # drop the ignored labels
        rpn_cross_entropy = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=rpn_cls_score, labels=rpn_label))  # binary classification loss of the rpn

        rpn_bbox_pred = self._predictions['rpn_bbox_pred']  # regressed position offsets of the 9 anchors at each position
        rpn_bbox_targets = self._anchor_targets['rpn_bbox_targets']  # coordinate offsets between each position and its matched positive (many are 0)
        rpn_bbox_inside_weights = self._anchor_targets['rpn_bbox_inside_weights']  # weight 1 for positives (negatives and ignored samples are 0)
        rpn_bbox_outside_weights = self._anchor_targets['rpn_bbox_outside_weights']  # normalized weights for positives and negatives (ignored samples excluded)
        rpn_loss_box = self._smooth_l1_loss(rpn_bbox_pred, rpn_bbox_targets, rpn_bbox_inside_weights, rpn_bbox_outside_weights, sigma=sigma_rpn, dim=[1, 2, 3])

        cls_score = self._predictions["cls_score"]  # class scores of the 256 anchors, used for rcnn classification
        label = tf.reshape(self._proposal_targets["labels"], [-1])  # true class of each positive and negative sample
        cross_entropy = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(logits=cls_score, labels=label))  # classification loss of the rcnn

        bbox_pred = self._predictions['bbox_pred']  # RCNN, bbox loss
        bbox_targets = self._proposal_targets['bbox_targets']  # 256*(4*21) matrix; only for positives are the coordinates of the matching class non-zero, all other classes are 0
        bbox_inside_weights = self._proposal_targets['bbox_inside_weights']  # 256*(4*21) matrix; for positives the four coordinates of the matching class have weight 1, everything else is 0
        bbox_outside_weights = self._proposal_targets['bbox_outside_weights']  # 256*(4*21) matrix; for positives the four coordinates of the matching class have weight 1, everything else is 0
        loss_box = self._smooth_l1_loss(bbox_pred, bbox_targets, bbox_inside_weights, bbox_outside_weights)

        self._losses['cross_entropy'] = cross_entropy
        self._losses['loss_box'] = loss_box
        self._losses['rpn_cross_entropy'] = rpn_cross_entropy
        self._losses['rpn_loss_box'] = rpn_loss_box

        loss = cross_entropy + loss_box + rpn_cross_entropy + rpn_loss_box  # total loss
        regularization_loss = tf.add_n(tf.losses.get_regularization_losses(), 'regu')
        self._losses['total_loss'] = loss + regularization_loss

        self._event_summaries.update(self._losses)

    return loss
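The way ignored anchors (label -1) are filtered out before the RPN cross entropy is computed can be reproduced on a handful of values. This is a NumPy sketch with made-up scores and labels; `sparse_softmax_cross_entropy` is a hypothetical helper that imitates the TensorFlow op of similar name for this small case.

```python
import numpy as np

def sparse_softmax_cross_entropy(logits, labels):
    """Per-row cross entropy between softmax(logits) and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(labels.size), labels]

# Four made-up anchors with (negative, positive) scores
rpn_cls_score = np.array([[2.0, 0.0],
                          [0.0, 2.0],
                          [1.0, 1.0],
                          [0.5, 1.5]])
rpn_label = np.array([0, 1, -1, 1])          # -1 marks an ignored anchor

keep = np.where(rpn_label != -1)[0]          # same role as rpn_select
rpn_cross_entropy = sparse_softmax_cross_entropy(rpn_cls_score[keep], rpn_label[keep]).mean()
print(keep, rpn_cross_entropy)
```

The third anchor never reaches the loss, so its score has no effect on the gradient.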
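smooth L1 loss is defined as follows (see the fast rcnn paper):
$L_{loc}(t^{u},v)=\sum\limits_{i\in \{x,y,w,h\}}\mathrm{smooth}_{L_{1}}(t_{i}^{u}-v_{i})\text{ (2)}$
in which
$\mathrm{smooth}_{L_{1}}(x)=\begin{cases}0.5x^{2} & \text{if } |x|<1\\ |x|-0.5 & \text{otherwise}\end{cases}\text{ (3)}$
The program first computes the difference box_diff between pred and target, then the difference for the positives, in_box_diff (multiplying by the weights bbox_inside_weights sets the negatives to 0), and its absolute value abs_in_box_diff. It then computes smoothL1_sign, the indicator of which case of equation (3) applies, and from it the smooth L1 loss in_loss_box, which is multiplied by the weights bbox_outside_weights to give the final loss, loss_box. The code generalizes equation (3) with a parameter sigma: the quadratic case applies when the absolute difference is below 1/sigma^2, and sigma=1 recovers equation (3).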
_smooth_l1_loss is defined as follows:

def _smooth_l1_loss(self, bbox_pred, bbox_targets, bbox_inside_weights, bbox_outside_weights, sigma=1.0, dim=[1]):
    sigma_2 = sigma ** 2
    box_diff = bbox_pred - bbox_targets  # prediction minus target
    in_box_diff = bbox_inside_weights * box_diff  # multiply by the weight 1 of the positives (rpn: drops negatives and ignored samples; rcnn: drops negatives)
    abs_in_box_diff = tf.abs(in_box_diff)  # absolute value
    smoothL1_sign = tf.stop_gradient(tf.to_float(tf.less(abs_in_box_diff, 1. / sigma_2)))  # flag: 1 where the difference is below the threshold
    in_loss_box = tf.pow(in_box_diff, 2) * (sigma_2 / 2.) * smoothL1_sign + (abs_in_box_diff - (0.5 / sigma_2)) * (1. - smoothL1_sign)  # smooth l1 loss
    out_loss_box = bbox_outside_weights * in_loss_box  # rpn: divide by the number of valid samples (ignored samples excluded) to normalize; rcnn: weight 1 for the four coordinates of positives, 0 for negatives
    loss_box = tf.reduce_mean(tf.reduce_sum(out_loss_box, axis=dim))
    return loss_box
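The element-wise part of this loss is easy to verify numerically. Below is a minimal NumPy sketch that leaves out the inside/outside weights and the final reduction; `smooth_l1` is a hypothetical name and the input values are made up.

```python
import numpy as np

def smooth_l1(diff, sigma=1.0):
    """Element-wise smooth L1 with the sigma scaling used by _smooth_l1_loss."""
    sigma_2 = sigma ** 2
    abs_diff = np.abs(diff)
    quadratic = abs_diff < 1.0 / sigma_2             # plays the role of smoothL1_sign
    return np.where(quadratic,
                    0.5 * sigma_2 * diff ** 2,       # quadratic branch near zero
                    abs_diff - 0.5 / sigma_2)        # linear branch elsewhere

diff = np.array([-2.0, -0.5, 0.0, 0.05, 1.0])
loss_sigma1 = smooth_l1(diff, sigma=1.0)   # threshold at |x| < 1
loss_sigma3 = smooth_l1(diff, sigma=3.0)   # sigma_rpn: threshold at |x| < 1/9
print(loss_sigma1)
print(loss_sigma3)
```

With sigma=3 the quadratic region shrinks to |x| < 1/9, so the RPN loss behaves like an L1 loss for all but very small differences.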
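3. Test phase:
At test time the predicted bbox_pred has to be multiplied by (0.1, 0.1, 0.2, 0.2) (and then have (0.0, 0.0, 0.0, 0.0) added to it). In create_architecture:
if testing:
    stds = np.tile(np.array(cfg.TRAIN.BBOX_NORMALIZE_STDS), (self._num_classes))
    means = np.tile(np.array(cfg.TRAIN.BBOX_NORMALIZE_MEANS), (self._num_classes))
    self._predictions["bbox_pred"] *= stds  # in training the regression targets have the mean subtracted and are divided by the standard deviation (see _compute_targets), so testing has to reverse this
    self._predictions["bbox_pred"] += means
For details see the function demo in demo.py (which calls im_detect in test.py). When that function is called directly in Python, there is no need to multiply and add manually. After the model was frozen, however, the values fetched from self._predictions["bbox_pred"] were wrong; debugging showed that the results match once the multiply-then-add step is applied.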
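The multiply-then-add step can also be applied outside the graph, for example to the raw output of a frozen model. A sketch under stated assumptions: 21 classes, the std and mean values quoted above, and a made-up array in place of the real network output.

```python
import numpy as np

num_classes = 21
stds = np.tile(np.array((0.1, 0.1, 0.2, 0.2)), num_classes)     # shape (84,)
means = np.tile(np.array((0.0, 0.0, 0.0, 0.0)), num_classes)    # shape (84,)

# Made-up stand-in for the raw bbox_pred of 3 RoIs (normalized offsets)
raw_bbox_pred = np.ones((3, num_classes * 4), dtype=np.float32)

# Undo the training-time normalization: multiply by the stds first, then add the means
bbox_pred = raw_bbox_pred * stds + means
print(bbox_pred[0, :8])
```

np.tile repeats the four values once per class, so the (84,) vectors broadcast over every row of the 256*84 (here 3*84) prediction matrix.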
_im_info