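SSD Object Detection Network
I have been using the SSD detection network for a while, have read through the code, and have hit my share of pitfalls, so I feel ready to write up a summary of it.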
1. SSD300_Vgg16
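The most basic SSD network uses Vgg16 as its backbone with a 300x300 input image. This section uses it as the running example for a detailed look at the SSD detection network.
SSD (Single Shot MultiBox Detector) can be summarized by three characteristics: it is a one-stage detector, it detects on feature maps at multiple scales (MultiBox), and it uses a large number of prior boxes (Prior Boxes). Compared with YOLO and Faster R-CNN, it strikes a compromise between accuracy and speed. Understanding SSD comes down to three things: the overall model structure, the assignment of positive and negative samples (the matching strategy between prior boxes and gt boxes), and the Multibox_loss loss function.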
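1.1 SSD model structure
The SSD model consists of three parts: VGG-Base, Extra-Layer and Pred-Layer. The overall SSD300 architecture is shown in the figure below; four Extra_layer blocks are added on top of VGG16. A 300*300 input image passes through the VGG-base and Extra_layer layers in turn for convolutional feature extraction. Two VGG-16 convolutional outputs (Feature1 and Feature2) and the four Extra_layer outputs (Feature3, Feature4, Feature5, Feature6) are taken, and these six feature maps at different scales are each fed into class_predictors and box_predictors to predict classes and coordinates. The predictions from all scales are then concatenated. (Note the part of VGG_Base circled in red, which is where SSD modifies the original VGG.)
VGG-Base serves as the backbone that extracts image features. The Extra-layers process the VGG features further, enlarging the receptive field so that the feature maps reaching the Pred-Layer carry more abstract information. Predictions are made on six feature maps, which pass through the Pred-Layers (loc-layer and conf-layer) to produce the box coordinates (loc-layer) and the confidence and class information (conf-layer).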
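For SSD300, the six feature maps have the following sizes:
- Two come from the VGG part (38, 19). Their spatial size is large, so less information is lost and smaller objects can be detected; but they come from shallow layers with small receptive fields, so they carry less abstract information and are less accurate.
- Four come from the Extra part (10, 5, 3, 1). Their characteristics are the opposite of the VGG maps, which is exactly the point of the Extra layers: they make up for the weaknesses of the VGG part.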
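Model computation flow:
- The Base part produces two feature maps: (1, 512, 38, 38) and (1, 1024, 19, 19).
- The extra part produces four feature maps: (1, 512, 10, 10), (1, 256, 5, 5), (1, 256, 3, 3) and (1, 256, 1, 1).
- Each of the six feature maps then goes through box_predictors and class_predictors, giving box coordinates (4 values per anchor) and class confidences (21 values per anchor) respectively:
  (1, 512, 38, 38)  => (1, 4x4, 38, 38) and (1, 4x21, 38, 38)
  (1, 1024, 19, 19) => (1, 6x4, 19, 19) and (1, 6x21, 19, 19)
  (1, 512, 10, 10)  => (1, 6x4, 10, 10) and (1, 6x21, 10, 10)
  (1, 256, 5, 5)    => (1, 6x4, 5, 5) and (1, 6x21, 5, 5)
  (1, 256, 3, 3)    => (1, 4x4, 3, 3) and (1, 4x21, 3, 3)
  (1, 256, 1, 1)    => (1, 4x4, 1, 1) and (1, 4x21, 1, 1)
- Concatenating across all scales gives the final predictions for every box, loc: (1, 8732, 4) and conf: (1, 8732, 21).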
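The 8732 figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are my own, and it assumes 21 classes (20 VOC classes plus background), as implied by the shapes listed above.

```python
# Feature-map sizes and anchors per location for SSD300, as listed above.
feature_sizes = [38, 19, 10, 5, 3, 1]
anchors_per_cell = [4, 6, 6, 6, 4, 4]
num_classes = 21  # assumption: 20 VOC classes + background

# Output channels of the two prediction heads on each feature map.
loc_channels = [k * 4 for k in anchors_per_cell]
conf_channels = [k * num_classes for k in anchors_per_cell]

# Anchors contributed by each feature map, and the grand total.
anchors_per_map = [s * s * k for s, k in zip(feature_sizes, anchors_per_cell)]
total_anchors = sum(anchors_per_map)

print(loc_channels)     # [16, 24, 24, 24, 16, 16]
print(conf_channels)    # [84, 126, 126, 126, 84, 84]
print(anchors_per_map)  # [5776, 2166, 600, 150, 36, 4]
print(total_anchors)    # 8732
```

The first feature map alone accounts for about two thirds of all anchors, which is why most anchors are small ones.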
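Training stage
- The network outputs the predicted loc_p: (1, 8732, 4) and con_p: (1, 8732, 21), together with the prior boxes anchors: (8732, 4); these are combined with gt_box and gt_id to compute the loss.
Inference stage
- The predicted box offsets are decoded against the prior boxes (anchors) to obtain actual box coordinates and class confidences; after NMS, the final box coordinates and confidences are output.
Reference: https://zhuanlan.zhihu.com/p/70415140?utm_source=wechat_session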
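The decoding step at inference can be sketched in NumPy as follows. This is a minimal sketch, not the code of any particular library: it assumes the usual SSD parameterization (center offsets scaled by the anchor size, log-scale width and height), the function name `decode_boxes` and the `stds` values are my own choices, and NMS is left out.

```python
import numpy as np

def decode_boxes(box_preds, anchors, stds=(0.1, 0.1, 0.2, 0.2)):
    """Turn predicted offsets into corner boxes (xmin, ymin, xmax, ymax).

    box_preds: (N, 4) offsets (tx, ty, tw, th) predicted by the network.
    anchors:   (N, 4) anchors in center form (cx, cy, w, h).
    stds:      assumed scaling factors (the usual SSD variances).
    """
    box_preds = np.asarray(box_preds, dtype=np.float64)
    anchors = np.asarray(anchors, dtype=np.float64)
    stds = np.asarray(stds, dtype=np.float64)

    # Undo the center encoding: the offset is relative to the anchor size.
    cxcy = box_preds[:, :2] * stds[:2] * anchors[:, 2:] + anchors[:, :2]
    # Undo the log-scale encoding of width and height.
    wh = np.exp(box_preds[:, 2:] * stds[2:]) * anchors[:, 2:]
    return np.concatenate([cxcy - wh / 2, cxcy + wh / 2], axis=1)

# A zero offset reproduces the anchor itself, converted to corner form:
# the anchor (4, 4, 30, 30) becomes corners (-11, -11, 19, 19).
example_anchor = np.array([[4.0, 4.0, 30.0, 30.0]])
print(decode_boxes(np.zeros((1, 4)), example_anchor))
```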
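1.2 Matching strategy between anchors and gt_boxes
For SSD300, the figure below shows the typical anchor configuration (size, ratio) for the VOC dataset. Anchors are placed at every location of each feature map, for a total of 8732 anchors across the six feature maps.
Example code for generating anchors: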

import numpy as np

# Anchor generation for the first feature map:
feature_size = (38, 38)
offsets = (0.5, 0.5)
step = 8
size = (30, 60)
ratios = [1, 2, 0.5]

anchors = []
for i in range(feature_size[0]):
    for j in range(feature_size[1]):
        # Anchor center in input-image pixels
        cy = (i + offsets[0]) * step
        cx = (j + offsets[1]) * step
        min_size = size[0]
        max_size = np.sqrt(size[0] * size[1])
        # Two square anchors: one at min_size, one at sqrt(min * max)
        anchors.append([cx, cy, min_size, min_size])
        anchors.append([cx, cy, max_size, max_size])
        # One anchor per remaining aspect ratio, with the same area as min_size**2
        for r in ratios[1:]:
            sr = np.sqrt(r)
            w = min_size * sr
            h = min_size / sr
            anchors.append([cx, cy, w, h])

# anchors = np.array(anchors).reshape(-1, 4)
anchors = np.array(anchors).reshape(1, 1, feature_size[0], feature_size[1], -1)
print(anchors.shape)
# Print the four anchors at location (0, 0)
print(anchors.reshape(-1, 4)[:4])

# Feature1: 38*38, 4 anchors per location, 5776 anchors in total (step 8)
# The 4 anchors at location (0, 0) are:
# w/h=1:   [4, 4, 30, 30]
#          [4, 4, 42.42640687, 42.42640687]
# w/h=2:   [4, 4, 42.42640687, 21.21320344]
# w/h=0.5: [4, 4, 21.21320344, 42.42640687]
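A training image may contain only a few gt_boxes, so we need to decide which anchors are responsible for predicting them.
The SSD paper uses two matching rules:
- Each gt_box is matched to the anchor with which it has the highest IoU.
- Each remaining anchor is matched to the gt_box with which it has the highest IoU, but only if that IoU exceeds the threshold (0.5).
After matching, a gt_box can be matched to several anchors and always to at least one, whereas an anchor is matched to at most one gt_box; an anchor with no matching gt_box becomes a negative sample. The matching logic is clearest and easiest to follow in code. Example code: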

import mxnet as mx


class SSDTargetGenerator(mx.gluon.Block):
    def __init__(self, iou_thresh=0.5, neg_thresh=0.5, negative_mining_ratio=3,
                 stds=(0.1, 0.1, 0.2, 0.2), **kwargs):
        super(SSDTargetGenerator, self).__init__()
        self.iou_thresh = iou_thresh
        self.stds = mx.nd.array(stds)

    def forward(self, anchors, cls_preds, gt_boxes, gt_ids):
        '''
        Parameters
        ----------
        anchors: num_anchor*4, center form (cx, cy, w, h)
        cls_preds: None
        gt_boxes: 1*num_gt*4, corner form (xmin, ymin, xmax, ymax)
        gt_ids: 1*num_gt*1

        Returns
        -------
        cls_targets: 1*num_anchor
        box_targets: 1*num_anchor*4
        '''
        anchors_corners = self._center_to_corner(anchors)
        ious = self.box_iou(gt_boxes, anchors_corners)  # [num_gt, num_anchor]
        # [num_gt,]: for each gt_box, the anchor with the highest IoU
        best_prior_overlap, best_prior_idx = ious.max(axis=1), ious.argmax(axis=1)
        # [num_anchor,]: each anchor is matched to the gt_box with the highest IoU
        best_truth_overlap, best_truth_idx = ious.max(axis=0), ious.argmax(axis=0)
        # Set the IoU of each gt_box's best anchor to 2, so that every gt_box
        # keeps its best anchor regardless of the threshold
        best_truth_overlap[best_prior_idx] = 2
        # ensure every gt matches with its prior of max overlap
        for j in range(best_prior_idx.shape[0]):
            # Point the best anchor of gt_box j back at gt_box j
            best_truth_idx[best_prior_idx[j]] = j
        matches = gt_boxes[0, best_truth_idx]  # num_anchor*4
        cls_targets = (gt_ids[0, best_truth_idx] + 1).reshape(-1)  # num_anchor
        # Anchors whose IoU is below the threshold are unmatched (class 0).
        # nd.where is used because mx.nd comparisons return 0/1 arrays,
        # not boolean masks.
        cls_targets = mx.nd.where(best_truth_overlap < self.iou_thresh,
                                  mx.nd.zeros_like(cls_targets), cls_targets)
        box_targets = self.encode(matches, anchors, self.stds)
        return cls_targets.reshape(1, -1), box_targets.reshape(1, -1, 4)

    def _center_to_corner(self, anchors):
        # (cx, cy, w, h) -> (xmin, ymin, xmax, ymax)
        boxes = mx.nd.concat(anchors[:, :2] - anchors[:, 2:] / 2,
                             anchors[:, :2] + anchors[:, 2:] / 2, dim=1)
        return boxes

    def box_iou(self, box_a, box_b):
        assert box_a.shape[0] == 1
        box_a = box_a[0]
        A = box_a.shape[0]
        B = box_b.shape[0]
        # Pairwise intersection corners, both expanded to shape (A, B, 2)
        max_xy = mx.nd.minimum(
            box_a[:, 2:].repeat(repeats=B, axis=0).reshape(A, B, 2),
            box_b[:, 2:].repeat(repeats=A, axis=0).reshape(B, A, 2).transpose(axes=(1, 0, 2)))
        min_xy = mx.nd.maximum(
            box_a[:, :2].repeat(repeats=B, axis=0).reshape(A, B, 2),
            box_b[:, :2].repeat(repeats=A, axis=0).reshape(B, A, 2).transpose(axes=(1, 0, 2)))
        # Clamp negative extents (non-overlapping boxes) to zero
        inter = mx.nd.maximum(max_xy - min_xy, 0)
        inter_area = inter[:, :, 0] * inter[:, :, 1]
        area_a = (box_a[:, 2] - box_a[:, 0]) * (box_a[:, 3] - box_a[:, 1])
        area_a = area_a.repeat(repeats=B, axis=0).reshape(A, B)
        area_b = (box_b[:, 2] - box_b[:, 0]) * (box_b[:, 3] - box_b[:, 1])
        area_b = area_b.repeat(repeats=A, axis=0).reshape(B, A).transpose(axes=(1, 0))
        return inter_area / (area_a + area_b - inter_area)

    def encode(self, matches, priors, stds):
        # Center offset, normalized by the anchor size (as in the SSD paper)
        # and then by the stds (variances)
        g_cxcy = (matches[:, :2] + matches[:, 2:]) / 2 - priors[:, :2]
        g_cxcy = g_cxcy / priors[:, 2:]
        g_cxcy = g_cxcy / stds[:2]
        # Log of the gt-to-anchor size ratio, normalized by the stds
        g_wh = (matches[:, 2:] - matches[:, :2]) / priors[:, 2:]
        g_wh = mx.nd.log(g_wh) / stds[2:]
        return mx.nd.concat(g_cxcy, g_wh, dim=1)
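To see the two matching rules in isolation, they can be traced on a small hand-made IoU matrix. The NumPy sketch below is only an illustration: `match_anchors` is a hypothetical helper of my own, the IoU values are invented, and an unmatched anchor is marked with -1 rather than class 0.

```python
import numpy as np

def match_anchors(ious, iou_thresh=0.5):
    """Return, for each anchor, the index of its matched gt box, or -1.

    ious: (num_gt, num_anchor) IoU matrix.
    """
    ious = np.asarray(ious, dtype=np.float64)
    # Each anchor takes its best gt box if the IoU clears the threshold.
    best_gt = ious.argmax(axis=0)
    best_gt_iou = ious.max(axis=0)
    matches = np.where(best_gt_iou >= iou_thresh, best_gt, -1)
    # Every gt box then claims its best anchor, whatever the IoU.
    best_anchor = ious.argmax(axis=1)
    for gt_idx, anchor_idx in enumerate(best_anchor):
        matches[anchor_idx] = gt_idx
    return matches

# Toy example: 2 gt boxes, 5 anchors (values made up for illustration).
toy_ious = np.array([
    [0.10, 0.60, 0.70, 0.20, 0.00],
    [0.00, 0.10, 0.20, 0.30, 0.05],
])
toy_matches = match_anchors(toy_ious)
print(toy_matches.tolist())  # [-1, 0, 0, 1, -1]
```

Anchor 3 has an IoU of only 0.30 with gt box 1, yet it is still matched because it is that box's best anchor; anchors 1 and 2 both match gt box 0 through the threshold rule.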
References:
https://zhuanlan.zhihu.com/p/53182444
https://blog.csdn.net/yuanlunxi/article/details/84746729
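1.3 The SSD loss function Multibox_loss
The SSD loss consists of a classification loss cls_losses and a localization loss box_losses. cls_losses uses cross-entropy, and box_losses uses smooth_L1. Two points are worth noting:
- box_losses is computed over positive samples only, and the predicted values are offsets relative to the anchor, not the final box.
- Because negatives far outnumber positives, SSD uses hard negative mining to keep the two balanced, keeping only the negatives with the largest loss.
Hard negative mining here means that when computing cls_losses, only the positives and the subset of negatives with the largest loss take part. For example, if an image has 100 samples, 10 positive and 90 negative, the losses of the 90 negatives are sorted, the 30 with the largest loss are selected, and together with the 10 positives a total of 40 samples contribute to the backpropagated loss. (This keeps the positive-to-negative ratio at 1:3 while keeping the harder negatives.)
Reference code for Multibox_loss: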
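The 100-sample example can be reproduced with a small NumPy sketch. This is an illustration under my own assumptions, not the library code: `hard_negative_mask` is a hypothetical helper, the losses are random, and it uses the double-argsort ranking trick.

```python
import numpy as np

def hard_negative_mask(cls_loss, pos, ratio=3):
    """Select the positives plus the ratio * num_pos highest-loss negatives.

    cls_loss: (N,) per-anchor classification loss.
    pos:      (N,) boolean mask of positive anchors.
    """
    cls_loss = np.asarray(cls_loss, dtype=np.float64)
    pos = np.asarray(pos, dtype=bool)
    # Positives score 0 and negatives score -loss, so an ascending sort
    # puts the highest-loss negatives first.
    score = np.where(pos, 0.0, -cls_loss)
    rank = score.argsort().argsort()  # position of each anchor in that order
    num_neg = min(int(pos.sum()) * ratio, int((~pos).sum()))
    hard_neg = (rank < num_neg) & ~pos
    return pos | hard_neg

# 100 anchors: 10 positives, 90 negatives with random positive losses.
rng = np.random.default_rng(0)
cls_loss = rng.random(100) + 0.01
pos = np.zeros(100, dtype=bool)
pos[:10] = True

mask = hard_negative_mask(cls_loss, pos)
print(int(mask.sum()))  # 40 = 10 positives + 30 hard negatives
```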

from mxnet import gluon, nd
from mxnet.gluon.loss import _reshape_like


def _as_list(arr):
    """Make sure the input is a list of NDArrays."""
    if not isinstance(arr, (list, tuple)):
        return [arr]
    return arr


class SSDMultiBoxLoss(gluon.Block):
    r"""Single-Shot Multibox Object Detection Loss.

    .. note::

        Since cross device synchronization is required to compute batch-wise statistics,
        it is slightly sub-optimal compared with non-sync version. However, we find this
        is better for converged model performance.

    Parameters
    ----------
    negative_mining_ratio : float, default is 3
        Ratio of negative vs. positive samples.
    rho : float, default is 1.0
        Threshold for trimmed mean estimator. This is the smooth parameter for the
        L1-L2 transition.
    lambd : float, default is 1.0
        Relative weight between classification and box regression loss.
        The overall loss is computed as :math:`L = loss_{class} + \lambda \times loss_{loc}`.
    min_hard_negatives : int, default is 0
        Minimum number of negatives samples.
    """

    def __init__(self, negative_mining_ratio=3, rho=1.0, lambd=1.0,
                 min_hard_negatives=0, **kwargs):
        super(SSDMultiBoxLoss, self).__init__(**kwargs)
        self._negative_mining_ratio = max(0, negative_mining_ratio)
        self._rho = rho
        self._lambd = lambd
        self._min_hard_negatives = max(0, min_hard_negatives)

    def forward(self, cls_pred, box_pred, cls_target, box_target):
        """Compute loss in entire batch across devices."""
        # require results across different devices at this time
        cls_pred, box_pred, cls_target, box_target = [_as_list(x) \
            for x in (cls_pred, box_pred, cls_target, box_target)]
        # cross device reduction to obtain positive samples in entire batch
        num_pos = []
        for cp, bp, ct, bt in zip(*[cls_pred, box_pred, cls_target, box_target]):
            pos_samples = (ct > 0)
            num_pos.append(pos_samples.sum())
        num_pos_all = sum([p.asscalar() for p in num_pos])
        if num_pos_all < 1 and self._min_hard_negatives < 1:
            # no positive samples and no hard negatives, return dummy losses
            cls_losses = [nd.sum(cp * 0) for cp in cls_pred]
            box_losses = [nd.sum(bp * 0) for bp in box_pred]
            sum_losses = [nd.sum(cp * 0) + nd.sum(bp * 0) for cp, bp in zip(cls_pred, box_pred)]
            return sum_losses, cls_losses, box_losses

        # compute element-wise cross entropy loss and sort, then perform negative mining
        cls_losses = []
        box_losses = []
        sum_losses = []
        for cp, bp, ct, bt in zip(*[cls_pred, box_pred, cls_target, box_target]):
            pred = nd.log_softmax(cp, axis=-1)
            pos = ct > 0
            cls_loss = -nd.pick(pred, ct, axis=-1, keepdims=False)
            # rank the losses: negatives with the largest loss get the smallest rank
            rank = (cls_loss * (pos - 1)).argsort(axis=1).argsort(axis=1)
            # number of negatives to keep as hard negatives
            hard_negative = rank < nd.maximum(self._min_hard_negatives, pos.sum(axis=1)
                                              * self._negative_mining_ratio).expand_dims(-1)
            # mask out if not positive or negative:
            # only positives and the selected negatives contribute to the loss
            cls_loss = nd.where((pos + hard_negative) > 0, cls_loss, nd.zeros_like(cls_loss))
            cls_losses.append(nd.sum(cls_loss, axis=0, exclude=True) / max(1., num_pos_all))

            bp = _reshape_like(nd, bp, bt)
            box_loss = nd.abs(bp - bt)
            box_loss = nd.where(box_loss > self._rho, box_loss - 0.5 * self._rho,
                                (0.5 / self._rho) * nd.square(box_loss))
            # box loss only apply to positive samples; it is zeroed for negatives
            box_loss = box_loss * pos.expand_dims(axis=-1)
            box_losses.append(nd.sum(box_loss, axis=0, exclude=True) / max(1., num_pos_all))
            sum_losses.append(cls_losses[-1] + self._lambd * box_losses[-1])

        return sum_losses, cls_losses, box_losses
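The smooth_L1 term used for box_loss above is easy to check by hand. Here is a minimal NumPy sketch of the same piecewise formula, with `rho` as the L1-L2 transition point; the function name `smooth_l1` and the sample values are my own.

```python
import numpy as np

def smooth_l1(pred, target, rho=1.0):
    """Quadratic below rho, linear above it, continuous at the transition."""
    diff = np.abs(np.asarray(pred, dtype=np.float64)
                  - np.asarray(target, dtype=np.float64))
    return np.where(diff > rho, diff - 0.5 * rho, (0.5 / rho) * diff ** 2)

# An error of 0.5 falls in the quadratic region: 0.5 * 0.5**2 = 0.125.
# An error of 2.0 falls in the linear region:    2.0 - 0.5   = 1.5.
sample_loss = smooth_l1([0.5, 2.0], [0.0, 0.0])
print(sample_loss)
```

The quadratic region keeps gradients small near a good fit, while the linear region limits the influence of badly localized boxes.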
2. SSD512_Vgg16
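SSD300 takes 300x300 input images, while SSD512 takes 512x512. The main difference from SSD300 is that SSD512 adds one more feature extraction layer to extra_layer, giving 7 feature scales in total, one more than SSD300. The overall SSD512 network structure is shown below: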
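As a rough check of what the extra scale adds, the sketch below repeats the anchor count for seven feature maps. The feature-map sizes (64 down to 1) and the per-location anchor counts are assumptions taken from the commonly used SSD512 VGG16 configuration; they are not stated in this post, so verify them against the implementation you use.

```python
# Assumed SSD512 configuration (not given in the text above).
ssd512_sizes = [64, 32, 16, 8, 4, 2, 1]
ssd512_anchors_per_cell = [4, 6, 6, 6, 6, 4, 4]

ssd512_total = sum(
    s * s * k for s, k in zip(ssd512_sizes, ssd512_anchors_per_cell)
)
print(ssd512_total)  # 24564 under the assumed configuration
```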
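3. Improving SSD
In real projects, SSD can be adapted to the task at hand. The simpler modifications fall into two groups. The first is changing the network structure, for example replacing the vgg16 backbone with resnet50 or adding an attention mechanism. The second is adjusting anchor sizes and ratios: when detecting signal towers, for example, the objects all have a small width-to-height ratio, whereas the default anchor ratios are [1:1, 1:2, 1:3, 2:1, 3:1]; these can be changed to [1:2, 1:4, 1:6, 1:8, 1:10].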
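To get a feel for what such ratios mean in pixels, the sketch below converts a base size and a list of width:height ratios into anchor shapes, using the same w = s * sqrt(r), h = s / sqrt(r) rule as the anchor-generation code earlier. The helper name `anchor_shapes` and the base size of 30 are my own choices for illustration.

```python
import math

def anchor_shapes(size, ratios):
    """Return (w, h) pairs with area size**2 and aspect ratio w / h = r."""
    shapes = []
    for r in ratios:
        sr = math.sqrt(r)
        shapes.append((size * sr, size / sr))
    return shapes

# Tall-object ratios (w:h) from the text: 1:2, 1:4, 1:6, 1:8, 1:10.
tall_ratios = [1 / 2, 1 / 4, 1 / 6, 1 / 8, 1 / 10]
tall_shapes = anchor_shapes(30, tall_ratios)
for w, h in tall_shapes:
    print(f"w={w:.1f}, h={h:.1f}")
```

Every shape keeps the same area as a 30x30 square, so only the aspect ratio changes; very thin anchors such as 1:10 may also need a lower IoU threshold to get enough positive matches, which is worth checking on your own data.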