Forked project repository: SSD
Adapted from the Jizhi (集智) column
1. SSD basics
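Building object detection on top of a classifier essentially means scanning the whole image with the classifier to locate where the features are. The key question is which scanning algorithm to use. For example, the image can be divided into a grid and the classifier run on one cell at a time. This approach has several problems:
Problem 1: the target sits exactly on the boundary between two cells, so the classifier's response is not strong enough in either cell, which causes a missed detection (false negative).
Problem 2: the target is too large or too small, so the response within a cell is not strong enough, which again causes a missed detection.
For the first problem, overlapping cells can be used. For example, with a 32x32-pixel cell, the window moves down only 8 pixels at a time, so it takes four steps to move completely off the previous cell. For the second problem, cells of different sizes can be combined: after scanning with the 32x32 cell, scan again with a 64x64 cell and once more with a 16x16 cell.
This creates another problem, though: to maintain accuracy we scan the same image too many times, which severely hurts computation speed and makes real-time annotation impossible with this strategy.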
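To get a feel for the cost, the sketch below counts how many classifier evaluations the overlapping, multi-scale scan needs. The 300x300 image size and the use of the same 8-pixel stride for all three window sizes are assumptions made for illustration; `count_windows` is a hypothetical helper, not part of the project.

```python
def count_windows(image_size, window, stride):
    """Number of window positions in a square image for one window size."""
    if window > image_size:
        return 0
    per_axis = (image_size - window) // stride + 1
    return per_axis * per_axis


IMAGE_SIZE = 300  # assumed square input
STRIDE = 8        # step from the text, assumed here for every window size

# One pass with non-overlapping 32x32 cells.
single_pass = count_windows(IMAGE_SIZE, 32, 32)

# Overlapping passes at the three scales mentioned above.
multi_scale = {w: count_windows(IMAGE_SIZE, w, STRIDE) for w in (16, 32, 64)}
total = sum(multi_scale.values())

print(single_pass, multi_scale, total)
```

Under these assumptions the classifier runs 81 times for the plain grid but 3352 times for the overlapping multi-scale scan, roughly a 40-fold increase.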
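To annotate image features quickly and in real time, many improvements to the overall recognition-and-localization algorithm have been proposed.
The most basic idea is to make good use of the internal structure of a convolutional neural network and avoid repeated computation. When a CNN scans an image, the convolution outputs already hold information about cells of different sizes, so this process effectively carries out the improvements proposed in the previous part. As the figure below shows, the outputs of the first few convolutional layers focus on detail, while the outputs of later convolutional layers focus more on the whole: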

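For problem 1, if an object lies between two cells, the response on either side may not be strong enough, but if the basic features from both sides can be combined sensibly, there is no need to scan again. The later layers pay more and more attention to the whole, so for problem 2, where the target may be too large or too small, the features are still retained. In other words, because a deep network has several convolutional layers and has in effect already scanned the image many times, both problems can be solved by making good use of the CNN's intermediate results.
Before SSD, methods such as MultiBox and Fast R-CNN used a two-step strategy: the first step uses a deep neural network to localize potential target objects, that is, to produce boxes first; how to classify the object inside each box is left to a second computation step. The first-generation YOLO algorithm can do computation and localization in a single step, but its structure uses fully connected layers, which have many problems and are gradually being abandoned by deep neural network architectures.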

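2. Network structure in the TF_SSD project
Back in the project, take VGG300 (/nets/ssd_vgg_300.py) as an example. The general idea is to use the first five blocks of the VGG network (conv1 to conv5), append several extra layers, and then take the convolution outputs of a few of those layers and search them on a grid for target features. In the function this maps to three main parts: the original network structure, the added network structure, and the SSD processing structure: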
def ssd_net(inputs,
            num_classes=SSDNet.default_params.num_classes,
            feat_layers=SSDNet.default_params.feat_layers,
            anchor_sizes=SSDNet.default_params.anchor_sizes,
            anchor_ratios=SSDNet.default_params.anchor_ratios,
            normalizations=SSDNet.default_params.normalizations,
            is_training=True,
            dropout_keep_prob=0.5,
            prediction_fn=slim.softmax,
            reuse=None,
            scope='ssd_300_vgg'):
    """SSD net definition.
    """
    # if data_format == 'NCHW':
    #     inputs = tf.transpose(inputs, perm=(0, 3, 1, 2))
    # End_points collect relevant activations for external use.
    # For comparison, the first five blocks of slim's vgg_16 (see the appendix):
    """
    net = layers_lib.repeat(
        inputs, 2, layers.conv2d, 64, [3, 3], scope='conv1')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool1')
    net = layers_lib.repeat(net, 2, layers.conv2d, 128, [3, 3], scope='conv2')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool2')
    net = layers_lib.repeat(net, 3, layers.conv2d, 256, [3, 3], scope='conv3')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool3')
    net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv4')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool4')
    net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv5')
    net = layers_lib.max_pool2d(net, [2, 2], scope='pool5')
    """
    end_points = {}
    with tf.variable_scope(scope, 'ssd_300_vgg', [inputs], reuse=reuse):
        ############################################################
        # First five blocks: copy the VGG16 architecture as-is.    #
        # Note that end_points records the intermediate results.   #
        ############################################################
        # ---------------- Original VGG-16 blocks. ----------------
        net = slim.repeat(inputs, 2, slim.conv2d, 64, [3, 3], scope='conv1')
        end_points['block1'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool1')
        # Block 2.
        net = slim.repeat(net, 2, slim.conv2d, 128, [3, 3], scope='conv2')
        end_points['block2'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool2')
        # Block 3.
        net = slim.repeat(net, 3, slim.conv2d, 256, [3, 3], scope='conv3')
        end_points['block3'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool3')
        # Block 4.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv4')
        end_points['block4'] = net
        net = slim.max_pool2d(net, [2, 2], scope='pool4')
        # Block 5.
        net = slim.repeat(net, 3, slim.conv2d, 512, [3, 3], scope='conv5')
        end_points['block5'] = net
        # pool5 differs from VGG16: the window goes from 2x2 to 3x3 and the
        # stride from 2 to 1, so the spatial size is kept.
        net = slim.max_pool2d(net, [3, 3], stride=1, scope='pool5')
        ############################################################
        # Last six blocks: extra convolutional layers.             #
        ############################################################
        # ---------------- Additional SSD blocks. -----------------
        # Block 6: dilated (atrous) 3x3 convolution, rate 6.
        net = slim.conv2d(net, 1024, [3, 3], rate=6, scope='conv6')
        end_points['block6'] = net
        # NOTE: `rate` in tf.layers.dropout is the fraction of units dropped,
        # not kept; passing the keep probability is only equivalent at 0.5.
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)
        # Block 7: 1x1 convolution.
        net = slim.conv2d(net, 1024, [1, 1], scope='conv7')
        end_points['block7'] = net
        net = tf.layers.dropout(net, rate=dropout_keep_prob, training=is_training)
        # Block 8/9/10/11: 1x1 and 3x3 convolutions stride 2 (except lasts).
        end_point = 'block8'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 256, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 512, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block9'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = custom_layers.pad2d(net, pad=(1, 1))
            net = slim.conv2d(net, 256, [3, 3], stride=2, scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block10'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        end_point = 'block11'
        with tf.variable_scope(end_point):
            net = slim.conv2d(net, 128, [1, 1], scope='conv1x1')
            net = slim.conv2d(net, 256, [3, 3], scope='conv3x3', padding='VALID')
        end_points[end_point] = net
        ############################################################
        # Each selected end_points entry yields its predictions.   #
        # The per-layer predictions are collected in lists and     #
        # returned to the optimization (loss) function.            #
        ############################################################
        # Prediction and localisations layers.
        predictions = []
        logits = []
        localisations = []
        # feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11']
        for i, layer in enumerate(feat_layers):
            with tf.variable_scope(layer + '_box'):
                p, l = ssd_multibox_layer(end_points[layer],
                                          num_classes,
                                          anchor_sizes[i],
                                          anchor_ratios[i],
                                          normalizations[i])
                """
                The number of boxes per location equals
                len(anchor_sizes[i]) + len(anchor_ratios[i]).
                anchor_sizes=[(21., 45.),
                              (45., 99.),
                              (99., 153.),
                              (153., 207.),
                              (207., 261.),
                              (261., 315.)]
                anchor_ratios=[[2, .5],
                               [2, .5, 3, 1./3],
                               [2, .5, 3, 1./3],
                               [2, .5, 3, 1./3],
                               [2, .5],
                               [2, .5]]
                normalizations=[20, -1, -1, -1, -1, -1]
                """
            predictions.append(prediction_fn(p))  # prediction_fn=slim.softmax
            logits.append(p)
            localisations.append(l)
        return predictions, localisations, logits, end_points
ssd_net.default_image_size = 300
At the end of the script, the ssd_arg_scope function is defined; it constrains the hyperparameter settings of the layers in the network. Its usage is given in the script header (the names there are written loosely; in this file the arg-scope function is ssd_arg_scope and the network function is ssd_net):
Usage:
    with slim.arg_scope(ssd_vgg.ssd_vgg()):
        outputs, end_points = ssd_vgg.ssd_vgg(inputs)
def ssd_arg_scope(weight_decay=0.0005, data_format='NHWC'):
    """Defines the VGG arg scope.
    Args:
      weight_decay: The l2 regularization coefficient.
    Returns:
      An arg_scope.
    """
    with slim.arg_scope([slim.conv2d, slim.fully_connected],
                        activation_fn=tf.nn.relu,
                        weights_regularizer=slim.l2_regularizer(weight_decay),
                        weights_initializer=tf.contrib.layers.xavier_initializer(),
                        biases_initializer=tf.zeros_initializer()):
        with slim.arg_scope([slim.conv2d, slim.max_pool2d],
                            padding='SAME',
                            data_format=data_format):
            with slim.arg_scope([custom_layers.pad2d,
                                 custom_layers.l2_normalization,
                                 custom_layers.channel_to_last],
                                data_format=data_format) as sc:
                return sc
a. Hyperparameter settings
In the original program the hyperparameters are given as a class attribute. We are not concerned with the rest of this class for now; we extract only the part that holds the hyperparameter settings, to help understand the network above:
SSDParams = namedtuple('SSDParameters', ['img_shape',
                                         'num_classes',
                                         'no_annotation_label',
                                         'feat_layers',
                                         'feat_shapes',
                                         'anchor_size_bounds',
                                         'anchor_sizes',
                                         'anchor_ratios',
                                         'anchor_steps',
                                         'anchor_offset',
                                         'normalizations',
                                         'prior_scaling'
                                         ])
class SSDNet(object):
    default_params = SSDParams(
        img_shape=(300, 300),
        num_classes=21,
        no_annotation_label=21,
        feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11'],
        feat_shapes=[(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)],
        anchor_size_bounds=[0.15, 0.90],
        # anchor_size_bounds=[0.20, 0.90],
        anchor_sizes=[(21., 45.),
                      (45., 99.),
                      (99., 153.),
                      (153., 207.),
                      (207., 261.),
                      (261., 315.)],
        anchor_ratios=[[2, .5],
                       [2, .5, 3, 1./3],
                       [2, .5, 3, 1./3],
                       [2, .5, 3, 1./3],
                       [2, .5],
                       [2, .5]],
        anchor_steps=[8, 16, 32, 64, 100, 300],
        anchor_offset=0.5,
        # A positive value makes the SSD head L2-normalize that feature layer
        # along the channel axis C first; -1 disables it.
        normalizations=[1, -1, -1, -1, -1, -1],
        prior_scaling=[0.1, 0.1, 0.2, 0.2]
    )
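The feat_shapes values can be reproduced from the layer definitions in ssd_net. The following is a minimal sketch using the standard output-size formulas for 'SAME' and 'VALID' padding; the helper names (`same_out`, `valid_out`, `ssd300_feat_sizes`) are hypothetical and only trace spatial sizes, they do not build any network.

```python
import math


def same_out(size, stride):
    """Output size along one axis for 'SAME' padding."""
    return math.ceil(size / stride)


def valid_out(size, kernel, stride=1, pad=0):
    """Output size along one axis for 'VALID' padding after explicit padding."""
    return (size + 2 * pad - kernel) // stride + 1


def ssd300_feat_sizes(input_size=300):
    """Trace the spatial size through the blocks that feed the SSD heads."""
    sizes = {}
    s = input_size
    for _ in range(3):           # pool1, pool2, pool3: 300 -> 150 -> 75 -> 38
        s = same_out(s, 2)
    sizes['block4'] = s          # conv4 is stride 1 with SAME padding
    s = same_out(s, 2)           # pool4
    s = same_out(s, 1)           # pool5: 3x3 window, stride 1 keeps the size
    sizes['block7'] = s          # conv6 and conv7 keep the size
    s = valid_out(s, 3, stride=2, pad=1)   # block8: pad2d(1) + 3x3, stride 2
    sizes['block8'] = s
    s = valid_out(s, 3, stride=2, pad=1)   # block9: pad2d(1) + 3x3, stride 2
    sizes['block9'] = s
    s = valid_out(s, 3)          # block10: 3x3 VALID, stride 1
    sizes['block10'] = s
    s = valid_out(s, 3)          # block11: 3x3 VALID, stride 1
    sizes['block11'] = s
    return sizes


feat_sizes = ssd300_feat_sizes()
print(feat_sizes)
```

For a 300x300 input this gives 38, 19, 10, 5, 3 and 1, matching feat_shapes.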
b. SSD processing structure
# Prediction and localisations layers.
predictions = []
logits = []
localisations = []
# feat_layers=['block4', 'block7', 'block8', 'block9', 'block10', 'block11']
for i, layer in enumerate(feat_layers):
    with tf.variable_scope(layer + '_box'):
        p, l = ssd_multibox_layer(end_points[layer],  # <----- SSD processing
                                  num_classes,
                                  anchor_sizes[i],
                                  anchor_ratios[i],
                                  normalizations[i])
    predictions.append(prediction_fn(p))  # prediction_fn=slim.softmax
    logits.append(p)
    localisations.append(l)
return predictions, localisations, logits, end_points
At the end of the network architecture, each selected feature layer is followed by new convolutions (the code above). The processing functions are as follows:
def tensor_shape(x, rank=3):
    """Returns the dimensions of a tensor.
    Args:
      x: A N-D Tensor of shape.
    Returns:
      A list of dimensions. Dimensions that are statically known are python
      integers, otherwise they are integer scalar tensors.
    """
    if x.get_shape().is_fully_defined():
        return x.get_shape().as_list()
    else:
        # with_rank on the get_shape() result works like an assert that the
        # rank equals the given value.
        static_shape = x.get_shape().with_rank(rank).as_list()
        # tf.shape returns a tensor; `num` is documented as
        # "The length of the dimension `axis`.", and axis defaults to 0.
        dynamic_shape = tf.unstack(tf.shape(x), num=rank)
        # List: a number where the dimension is defined, a tensor otherwise.
        return [s if s is not None else d
                for s, d in zip(static_shape, dynamic_shape)]
def ssd_multibox_layer(inputs,
                       num_classes,
                       sizes,
                       ratios=[1],
                       normalization=-1,
                       bn_normalization=False):
    """Construct a multibox layer, return a class and localization predictions.
    """
    net = inputs
    if normalization > 0:
        net = custom_layers.l2_normalization(net, scaling=True)
    # Number of anchors.
    num_anchors = len(sizes) + len(ratios)
    # Location.
    num_loc_pred = num_anchors * 4  # each box has four coordinates
    loc_pred = slim.conv2d(net, num_loc_pred, [3, 3], activation_fn=None,
                           scope='conv_loc')  # each output channel is one coordinate of one box
    # Force the layout to NHWC.
    loc_pred = custom_layers.channel_to_last(loc_pred)
    # [N, H, W, num_anchors*4] -> [N, H, W, num_anchors, 4]
    loc_pred = tf.reshape(loc_pred,
                          tensor_shape(loc_pred, 4)[:-1]+[num_anchors, 4])
    # Class prediction.
    num_cls_pred = num_anchors * num_classes  # every box is scored for every class
    cls_pred = slim.conv2d(net, num_cls_pred, [3, 3], activation_fn=None,
                           scope='conv_cls')  # each output channel is one box's score for one class
    # Force the layout to NHWC.
    cls_pred = custom_layers.channel_to_last(cls_pred)
    # [N, H, W, num_anchors*num_classes] -> [N, H, W, num_anchors, num_classes]
    cls_pred = tf.reshape(cls_pred,
                          tensor_shape(cls_pred, 4)[:-1]+[num_anchors, num_classes])
    return cls_pred, loc_pred
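Depending on the normalization argument, the feature layer is L2-normalized (along the channel axis C); see the next subsection for the details.
Then two convolutions are attached in parallel to the selected feature layer, one with num_anchors×4 output channels and one with num_anchors×num_classes output channels.
The last dimension of each convolution's output is split into two, giving the shapes [N, H, W, num_anchors, 4] and [N, H, W, num_anchors, num_classes].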
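The reshape in the last step can be illustrated for a single spatial position. tf.reshape is row-major, so channel c of the convolution output ends up at anchor c // 4, coordinate c % 4. The sketch below mimics that split in plain Python; `split_channels` is a hypothetical helper and the channel values are just indices used for illustration.

```python
def split_channels(channel_vector, num_anchors, group_size):
    """Row-major split of a flat channel vector into [num_anchors, group_size]."""
    assert len(channel_vector) == num_anchors * group_size
    return [channel_vector[a * group_size:(a + 1) * group_size]
            for a in range(num_anchors)]


num_anchors = 4                                 # e.g. 2 sizes + 2 ratios
loc_channels = list(range(num_anchors * 4))     # channels 0..15 at one pixel
boxes = split_channels(loc_channels, num_anchors, 4)

for anchor_index, coords in enumerate(boxes):
    print(anchor_index, coords)
```

Each row holds the four localisation outputs of one anchor box at that position; the class branch is split the same way with num_classes in place of 4.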

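We can now see what the network function returns: for each selected layer, the class probabilities of the boxes after SSD processing; for each selected layer, the box coordinate corrections; for each selected layer, the raw class outputs (logits) of the boxes; and the end_points of all intermediate layers.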
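As a rough check on the size of these outputs, the total number of boxes predicted per image can be counted from the default parameters listed earlier. This is a small sketch; `anchors_per_cell` is a hypothetical helper that mirrors the num_anchors line in ssd_multibox_layer.

```python
feat_shapes = [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)]
anchor_sizes = [(21., 45.), (45., 99.), (99., 153.),
                (153., 207.), (207., 261.), (261., 315.)]
anchor_ratios = [[2, .5],
                 [2, .5, 3, 1. / 3],
                 [2, .5, 3, 1. / 3],
                 [2, .5, 3, 1. / 3],
                 [2, .5],
                 [2, .5]]


def anchors_per_cell(sizes, ratios):
    """Same rule as ssd_multibox_layer: len(sizes) + len(ratios)."""
    return len(sizes) + len(ratios)


per_layer = [h * w * anchors_per_cell(s, r)
             for (h, w), s, r in zip(feat_shapes, anchor_sizes, anchor_ratios)]
total_boxes = sum(per_layer)

print(per_layer, total_boxes)
```

With these defaults the six layers contribute 5776, 2166, 600, 150, 36 and 4 boxes, 8732 in total, each with num_classes scores and 4 coordinate corrections.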
c. custom_layers.l2_normalization: L2 normalization of the feature layer
The features are first normalized along the channel dimension (see nn.l2_normalize). Then each channel gets its own learnable scale factor, which rescales that channel. The rescaled features are returned.
@add_arg_scope
def l2_normalization(
        inputs,
        scaling=False,
        scale_initializer=init_ops.ones_initializer(),
        reuse=None,
        variables_collections=None,
        outputs_collections=None,
        data_format='NHWC',
        trainable=True,
        scope=None):
    """Implement L2 normalization on every feature (i.e. spatial normalization).
    Should be extended in some near future to other dimensions, providing a more
    flexible normalization framework.
    Args:
      inputs: a 4-D tensor with dimensions [batch_size, height, width, channels].
      scaling: whether or not to add a post scaling operation along the dimensions
        which have been normalized.
      scale_initializer: An initializer for the weights.
      reuse: whether or not the layer and its variables should be reused. To be
        able to reuse the layer scope must be given.
      variables_collections: optional list of collections for all the variables or
        a dictionary containing a different list of collection per variable.
      outputs_collections: collection to add the outputs.
      data_format: NHWC or NCHW data format.
      trainable: If `True` also add variables to the graph collection
        `GraphKeys.TRAINABLE_VARIABLES` (see tf.Variable).
      scope: Optional scope for `variable_scope`.
    Returns:
      A `Tensor` representing the output of the operation.
    """
    with variable_scope.variable_scope(
            scope, 'L2Normalization', [inputs], reuse=reuse) as sc:
        inputs_shape = inputs.get_shape()
        inputs_rank = inputs_shape.ndims
        dtype = inputs.dtype.base_dtype
        # L2-normalize along the channel axis C.
        if data_format == 'NHWC':
            # norm_dim = tf.range(1, inputs_rank-1)
            norm_dim = tf.range(inputs_rank-1, inputs_rank)
            params_shape = inputs_shape[-1:]
        elif data_format == 'NCHW':
            # norm_dim = tf.range(2, inputs_rank)
            norm_dim = tf.range(1, 2)
            params_shape = (inputs_shape[1])
        # Normalize along the axis selected above (the channel axis).
        outputs = nn.l2_normalize(inputs, norm_dim, epsilon=1e-12)
        # Additional scaling.
        if scaling:
            # Get the variable collections.
            scale_collections = utils.get_variable_collections(
                variables_collections, 'scale')
            # Create the variable; its shape is the number of channels C.
            scale = variables.model_variable('gamma',
                                             shape=params_shape,
                                             dtype=dtype,
                                             initializer=scale_initializer,
                                             collections=scale_collections,
                                             trainable=trainable)
            if data_format == 'NHWC':
                outputs = tf.multiply(outputs, scale)
            elif data_format == 'NCHW':
                scale = tf.expand_dims(scale, axis=-1)
                scale = tf.expand_dims(scale, axis=-1)
                outputs = tf.multiply(outputs, scale)
                # outputs = tf.transpose(outputs, perm=(0, 2, 3, 1))
        # Give outputs an alias, add it to the collection, and return the original node.
        return utils.collect_named_outputs(outputs_collections,
                                           sc.original_name_scope, outputs)
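The arithmetic of this layer at a single spatial position can be sketched in plain Python. The normalization follows the formula documented for tf.nn.l2_normalize, x / sqrt(max(sum(x**2), epsilon)); `l2_normalize_channels` is a hypothetical helper, and the gamma values below are illustrative rather than learned.

```python
import math


def l2_normalize_channels(pixel, gamma, epsilon=1e-12):
    """Normalize one pixel's channel vector to unit L2 norm, then scale per channel."""
    square_sum = sum(v * v for v in pixel)
    inv_norm = 1.0 / math.sqrt(max(square_sum, epsilon))
    return [g * v * inv_norm for v, g in zip(pixel, gamma)]


pixel = [3.0, 4.0]                       # two channels at one (h, w) position
unit = l2_normalize_channels(pixel, gamma=[1.0, 1.0])
scaled = l2_normalize_channels(pixel, gamma=[20.0, 20.0])

unit_norm = math.sqrt(sum(v * v for v in unit))
scaled_norm = math.sqrt(sum(v * v for v in scaled))
print(unit, unit_norm)
print(scaled, scaled_norm)
```

With gamma equal to one the channel vector has unit length; a shared gamma of 20 stretches it to length 20, which is how the scale variable restores a useful magnitude after normalization.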
This completes the introduction to the network structure. The next section turns to one of the key techniques of object detection models, the generation of anchor boxes, and ties it back to this section to give a full picture of how the SSD network is built.
Appendix: related implementations
custom_layers.channel_to_last: conversion to NHWC
@add_arg_scope  # lets slim.arg_scope set this layer's arguments
def channel_to_last(inputs,
                    data_format='NHWC',
                    scope=None):
    """Move the channel axis to the last dimension. Allows to
    provide a single output format whatever the input data format.
    Args:
      inputs: Input Tensor;
      data_format: NHWC or NCHW.
    Return:
      Input in NHWC format.
    """
    with tf.name_scope(scope, 'channel_to_last', [inputs]):
        if data_format == 'NHWC':
            net = inputs
        elif data_format == 'NCHW':
            net = tf.transpose(inputs, perm=(0, 2, 3, 1))
        return net
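What perm=(0, 2, 3, 1) does to the data can be shown on a tiny nested list: the output element at [n][h][w][c] is the input element at [n][c][h][w]. This is a plain-Python sketch for illustration only; `nchw_to_nhwc` is a hypothetical helper, not the project's implementation.

```python
def nchw_to_nhwc(x):
    """Reorder a nested list from [N][C][H][W] to [N][H][W][C]."""
    n_dim, c_dim = len(x), len(x[0])
    h_dim, w_dim = len(x[0][0]), len(x[0][0][0])
    return [[[[x[n][c][h][w] for c in range(c_dim)]
              for w in range(w_dim)]
             for h in range(h_dim)]
            for n in range(n_dim)]


# One image, two channels, height 1, width 3.
nchw = [[[[1, 2, 3]],
         [[10, 20, 30]]]]
nhwc = nchw_to_nhwc(nchw)
print(nhwc)
```

After the conversion each pixel carries its own channel vector, which is the layout the reshape in ssd_multibox_layer relies on.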
custom_layers.pad2d: padding the H and W dimensions
@add_arg_scope  # lets slim.arg_scope set this layer's arguments
def pad2d(inputs,
          pad=(0, 0),
          mode='CONSTANT',
          data_format='NHWC',
          trainable=True,
          scope=None):
    """2D Padding layer, adding a symmetric padding to H and W dimensions.
    Aims to mimic padding in Caffe and MXNet, helping the port of models to
    TensorFlow. Tries to follow the naming convention of `tf.contrib.layers`.
    Args:
      inputs: 4D input Tensor;
      pad: 2-Tuple with padding values for H and W dimensions (the padding width);
      mode: Padding mode. C.f. `tf.pad`
      data_format: NHWC or NCHW data format.
    """
    with tf.name_scope(scope, 'pad2d', [inputs]):
        # Padding shape.
        if data_format == 'NHWC':
            paddings = [[0, 0], [pad[0], pad[0]], [pad[1], pad[1]], [0, 0]]
        elif data_format == 'NCHW':
            paddings = [[0, 0], [0, 0], [pad[0], pad[0]], [pad[1], pad[1]]]
        net = tf.pad(inputs, paddings, mode=mode)
        return net
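One way to see why an explicit symmetric pad followed by a 'VALID' convolution is used in blocks 8 and 9, rather than 'SAME', is to compare the padding each applies along one axis. The sketch below uses TensorFlow's documented 'SAME' rule (total padding split with the extra pixel at the end); `same_padding` and `valid_out` are hypothetical helpers.

```python
import math


def same_padding(size, kernel, stride):
    """(before, after) padding that 'SAME' mode applies on one axis."""
    out = math.ceil(size / stride)
    total = max((out - 1) * stride + kernel - size, 0)
    return total // 2, total - total // 2


def valid_out(size, kernel, stride, pad):
    """Output size of a 'VALID' convolution after symmetric padding `pad`."""
    return (size + 2 * pad - kernel) // stride + 1


for size in (19, 10):   # input sizes of block8 and block9
    print(size,
          'SAME pads', same_padding(size, 3, 2),
          '| pad2d(1)+VALID output', valid_out(size, 3, 2, 1))

pad_19 = same_padding(19, 3, 2)
pad_10 = same_padding(10, 3, 2)
```

For the 19-wide input both schemes pad one pixel on each side, but for the 10-wide input 'SAME' pads (0, 1) while pad2d pads (1, 1). The output size is 5 either way, yet the sampling positions differ, which matters when reproducing weights trained with Caffe-style padding.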
slim's vgg_16
def vgg_16(inputs,
           num_classes=1000,
           is_training=True,
           dropout_keep_prob=0.5,
           spatial_squeeze=True,
           scope='vgg_16'):
    """Oxford Net VGG 16-Layers version D Example.
    Note: All the fully_connected layers have been transformed to conv2d layers.
          To use in classification mode, resize input to 224x224.
    Args:
      inputs: a tensor of size [batch_size, height, width, channels].
      num_classes: number of predicted classes.
      is_training: whether or not the model is being trained.
      dropout_keep_prob: the probability that activations are kept in the dropout
        layers during training.
      spatial_squeeze: whether or not should squeeze the spatial dimensions of the
        outputs. Useful to remove unnecessary dimensions for classification.
      scope: Optional scope for the variables.
    Returns:
      the last op containing the log predictions and end_points dict.
    """
    with variable_scope.variable_scope(scope, 'vgg_16', [inputs]) as sc:
        end_points_collection = sc.original_name_scope + '_end_points'
        # Collect outputs for conv2d, fully_connected and max_pool2d.
        with arg_scope(
                [layers.conv2d, layers_lib.fully_connected, layers_lib.max_pool2d],
                outputs_collections=end_points_collection):
            net = layers_lib.repeat(
                inputs, 2, layers.conv2d, 64, [3, 3], scope='conv1')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool1')
            net = layers_lib.repeat(net, 2, layers.conv2d, 128, [3, 3], scope='conv2')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool2')
            net = layers_lib.repeat(net, 3, layers.conv2d, 256, [3, 3], scope='conv3')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool3')
            net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv4')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool4')
            net = layers_lib.repeat(net, 3, layers.conv2d, 512, [3, 3], scope='conv5')
            net = layers_lib.max_pool2d(net, [2, 2], scope='pool5')
            # Use conv2d instead of fully_connected layers.
            net = layers.conv2d(net, 4096, [7, 7], padding='VALID', scope='fc6')
            net = layers_lib.dropout(
                net, dropout_keep_prob, is_training=is_training, scope='dropout6')
            net = layers.conv2d(net, 4096, [1, 1], scope='fc7')
            net = layers_lib.dropout(
                net, dropout_keep_prob, is_training=is_training, scope='dropout7')
            net = layers.conv2d(
                net,
                num_classes, [1, 1],
                activation_fn=None,
                normalizer_fn=None,
                scope='fc8')
            # Convert end_points_collection into a end_point dict.
            end_points = utils.convert_collection_to_dict(end_points_collection)
            if spatial_squeeze:
                net = array_ops.squeeze(net, [1, 2], name='fc8/squeezed')
                end_points[sc.name + '/fc8'] = net
            return net, end_points
vgg_16.default_image_size = 224
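Notes on less common APIs
nn.l2_normalize: L2 normalization layer
slim.repeat: quickly builds a stack of repeated layers
Tensor.get_shape().with_rank(rank).as_list(): gets the shape with an assert-like check on the rank
tensorflow.contrib.layers.python.layers.utils.collect_named_outputs: adds the output to collections and gives it an alias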
