I. The Detections Network

After the ROI head network we already have, for every proposal region:
the proposal features (from ROIAlign)
the proposal class predictions
the proposal box refinement terms (deltas)
Together with the original proposal coordinates [IMAGES_PER_GPU, num_rois, (y1, x1, y2, x2)], we now perform the final detection refinement.
# Detections
# output is [batch, num_detections, (y1, x1, y2, x2, class_id, score)] in
# normalized coordinates
detections = DetectionLayer(config, name="mrcnn_detection")(
    [rpn_rois, mrcnn_class, mrcnn_bbox, input_image_meta])
1. The original image resize parameter "window"
Note the input_image_meta item among the inputs. It records the original information of every image as a [batch, n] matrix, where n is fixed; its length is defined in config.py:
# Image meta data length
# See compose_image_meta() for details
self.IMAGE_META_SIZE = 1 + 3 + 3 + 4 + 1 + self.NUM_CLASSES
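For example, with the COCO configuration (NUM_CLASSES = 1 + 80 = 81) this evaluates to 1 + 3 + 3 + 4 + 1 + 81 = 93, so each image_meta vector has 93 entries.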
Its full contents will be introduced later (if at all) when image preprocessing is discussed; this section only uses the image size and the per-image "window" recorded in it. The image size is 3 integers giving the height, width, and depth of the input image (i.e., the image after preprocessing). The "window" is 4 integers, (top_pad, left_pad, h + top_pad, w + left_pad), and is tied to how images are resized; see the resize_image function in utils.py below:
if mode == "square":
    # Get new height and width
    h, w = image.shape[:2]
    top_pad = (max_dim - h) // 2
    bottom_pad = max_dim - h - top_pad
    left_pad = (max_dim - w) // 2
    right_pad = max_dim - w - left_pad
    padding = [(top_pad, bottom_pad), (left_pad, right_pad), (0, 0)]
    image = np.pad(image, padding, mode='constant', constant_values=0)
    window = (top_pad, left_pad, h + top_pad, w + left_pad)
In other words, the original image (whose width need not equal its height) is padded into a larger square image that is fed to the network. "window" records, in a coordinate system whose origin is the top-left corner of the new image, the coordinates of the original image's top-left and bottom-right corners. Since this coordinate system uses pixel coordinates, "window" effectively records the extent of the original image, i.e., the part of the input image that carries real information.
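To make the window concrete, here is a minimal NumPy sketch of the same padding arithmetic; the sizes (an image already scaled to 600x800, max_dim = 1024) are made up purely for illustration:

import numpy as np

# Toy example: an image already scaled to 600x800 is padded to a
# 1024x1024 square, mirroring the resize_image logic above.
h, w, max_dim = 600, 800, 1024
image = np.zeros((h, w, 3), dtype=np.uint8)

top_pad = (max_dim - h) // 2          # 212
bottom_pad = max_dim - h - top_pad    # 212
left_pad = (max_dim - w) // 2         # 112
right_pad = max_dim - w - left_pad    # 112
padding = [(top_pad, bottom_pad), (left_pad, right_pad), (0, 0)]
padded = np.pad(image, padding, mode='constant', constant_values=0)
window = (top_pad, left_pad, h + top_pad, w + left_pad)

print(padded.shape)  # (1024, 1024, 3)
print(window)        # (212, 112, 812, 912)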

2. Recovering the original image extent from "window"
One subtlety: suppose top_pad = 5, i.e., 5 rows of padding are added at the top. Rows 0, 1, 2, 3 and 4 are then non-image rows, and the image starts at row 5. Suppose the image has 3 rows (an extreme case), so rows 5, 6 and 7 are image rows; yet
top_pad + h = 5 + 3 = 8
so the real image occupies rows top_pad through top_pad + h - 1 (the bottom coordinate h + top_pad stored in "window" is exclusive), and columns behave the same way.
In addition, the function that parses the image_meta structure is as follows:
def parse_image_meta_graph(meta):
    """Parses a tensor that contains image attributes to its components.
    See compose_image_meta() for more details.
    meta: [batch, meta length] where meta length depends on NUM_CLASSES
    Returns a dict of the parsed tensors.
    """
    image_id = meta[:, 0]
    original_image_shape = meta[:, 1:4]
    image_shape = meta[:, 4:7]
    window = meta[:, 7:11]  # (y1, x1, y2, x2) window of image in pixels
    scale = meta[:, 11]
    active_class_ids = meta[:, 12:]
    return {
        "image_id": image_id,
        "original_image_shape": original_image_shape,
        "image_shape": image_shape,
        "window": window,
        "scale": scale,
        "active_class_ids": active_class_ids,
    }
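For reference, the compose_image_meta function that builds this vector (in the same repository's model.py) packs the fields in exactly this order; it is quoted here as a sketch, so check the source for the authoritative version:

def compose_image_meta(image_id, original_image_shape, image_shape,
                       window, scale, active_class_ids):
    """Takes attributes of an image and puts them in one 1D array."""
    meta = np.array(
        [image_id] +                  # size=1
        list(original_image_shape) +  # size=3
        list(image_shape) +           # size=3
        list(window) +                # size=4 (y1, x1, y2, x2) in image coordinates
        [scale] +                     # size=1
        list(active_class_ids)        # size=num_classes
    )
    return meta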
II. Source Code Walkthrough
First, the layer receives its parameters and is initialized:
class DetectionLayer(KE.Layer):
    """Takes classified proposal boxes and their bounding box deltas and
    returns the final detection boxes.
    Returns:
    [batch, num_detections, (y1, x1, y2, x2, class_id, class_score)] where
    coordinates are normalized.
    """

    def __init__(self, config=None, **kwargs):
        super(DetectionLayer, self).__init__(**kwargs)
        self.config = config

    def call(self, inputs):
        rois = inputs[0]         # [batch, num_rois, (y1, x1, y2, x2)]
        mrcnn_class = inputs[1]  # [batch, num_rois, NUM_CLASSES]
        mrcnn_bbox = inputs[2]   # [batch, num_rois, NUM_CLASSES, (dy, dx, log(dh), log(dw))]
        image_meta = inputs[3]
1. Getting the original image extent
Next we obtain the "window" parameter, i.e., the extent of the original image, together with the input image_shape, i.e., its [h, w, channels] size:
        # Get windows of images in normalized coordinates. Windows are the area
        # in the image that excludes the padding.
        # Use the shape of the first image in the batch to normalize the window
        # because we know that all images get resized to the same size.
        m = parse_image_meta_graph(image_meta)
        image_shape = m['image_shape'][0]
        window = norm_boxes_graph(m['window'], image_shape[:2])  # (y1, x1, y2, x2)
The parse_image_meta_graph call above (already shown in Section I) parses image_meta and yields the input image shape and the original-image "window". The norm_boxes_graph function called on the last line is as follows:
def norm_boxes_graph(boxes, shape):
    """Converts boxes from pixel coordinates to normalized coordinates.
    boxes: [..., (y1, x1, y2, x2)] in pixel coordinates
    shape: [..., (height, width)] in pixels
    Note: In pixel coordinates (y2, x2) is outside the box. But in normalized
    coordinates it's inside the box.
    Returns:
        [..., (y1, x1, y2, x2)] in normalized coordinates
    """
    h, w = tf.split(tf.cast(shape, tf.float32), 2)
    scale = tf.concat([h, w, h, w], axis=-1) - tf.constant(1.0)
    shift = tf.constant([0., 0., 1., 1.])
    return tf.divide(boxes - shift, scale)
Via "window" we obtained the coordinates of the original image relative to the input image (in pixel space); dividing by the input image's height and width (more precisely, by height - 1 and width - 1, after shifting the exclusive bottom-right corner by 1) then gives the normalized coordinates of the original image relative to the input image, which lie in [0, 1].
In fact, since the four coordinates of every generated anchor lie in [0, 1], all coordinates inside the network live in [0, 1]; the original-image window is a newly introduced quantity, so it too must be mapped into this normalized space.
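As a quick sanity check, here is the same conversion written in NumPy and applied to the toy window from the padding example above (values illustrative only):

import numpy as np

# NumPy equivalent of norm_boxes_graph for a (212, 112, 812, 912) window
# inside a 1024x1024 input image (toy numbers from the padding sketch above).
window = np.array([212., 112., 812., 912.])
h, w = 1024., 1024.
scale = np.array([h - 1, w - 1, h - 1, w - 1])
shift = np.array([0., 0., 1., 1.])
print((window - shift) / scale)  # approximately [0.207 0.109 0.793 0.891], all in [0, 1]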
For each image we now have:
the coordinates of each proposal region
the coarse classification of each proposal region
the coarse box refinement of each proposal region
the coordinates of the truly meaningful region of the image
Based on this information, we now perform the final refinement.
2. Refining the classification and regression results
        # Run detection refinement graph on each item in the batch
        detections_batch = utils.batch_slice(
            [rois, mrcnn_class, mrcnn_bbox, window],
            lambda x, y, w, z: refine_detections_graph(x, y, w, z, self.config),
            self.config.IMAGES_PER_GPU)
Note that the function called below processes one image at a time.
Its logic is as follows:
a. get the highest-scoring class and its score for each proposal region
b. clip each proposal's coarsely refined coordinates to the "window" (take the intersection)
c. drop proposals whose top-scoring class is the background
d. drop proposals whose top score does not reach the confidence threshold
e. apply non-maximum suppression within each class
f. for the post-NMS box indices: remove the -1 placeholders and keep the top k (100)
Finally, (y1, x1, y2, x2, class_id, score) is returned for each box.
Step 1
The first half of the called function is as follows:
def refine_detections_graph(rois, probs, deltas, window, config):
    """Refine classified proposals and filter overlaps and return final
    detections.
    Inputs:
        rois: [N, (y1, x1, y2, x2)] in normalized coordinates
        probs: [N, num_classes]. Class probabilities.
        deltas: [N, num_classes, (dy, dx, log(dh), log(dw))]. Class-specific
            bounding box deltas.
        window: (y1, x1, y2, x2) in normalized coordinates. The part of the image
            that contains the image excluding the padding.
    Returns detections shaped: [num_detections, (y1, x1, y2, x2, class_id, score)] where
        coordinates are normalized.
    """
    # Class IDs per ROI
    class_ids = tf.argmax(probs, axis=1, output_type=tf.int32)  # [N], top-scoring class per ROI
    # Class probability of the top class of each ROI
    indices = tf.stack([tf.range(probs.shape[0]), class_ids], axis=1)  # [N, (ROI index, top class index)]
    class_scores = tf.gather_nd(probs, indices)  # [N], score of the top class per ROI
    # Class-specific bounding box deltas
    deltas_specific = tf.gather_nd(deltas, indices)  # [N, 4]
    # Apply bounding box deltas
    # Shape: [boxes, (y1, x1, y2, x2)] in normalized coordinates
    refined_rois = apply_box_deltas_graph(
        rois, deltas_specific * config.BBOX_STD_DEV)  # [N, 4]
    # Clip boxes to image window
    refined_rois = clip_boxes_graph(refined_rois, window)

    # TODO: Filter out boxes with zero area

    # Filter out background boxes
    # class_ids: [N]; tf.where(class_ids > 0): [M, 1], i.e. tf.where adds a dimension
    keep = tf.where(class_ids > 0)[:, 0]
    # Filter out low confidence boxes
    if config.DETECTION_MIN_CONFIDENCE:  # 0.7
        conf_keep = tf.where(class_scores >= config.DETECTION_MIN_CONFIDENCE)[:, 0]
        # Set intersection; returns a SparseTensor. a and b must have the same
        # shape in all but the last dimension, and intersections are taken per
        # sub-list along that last dimension, e.g.:
        # a = np.array([[{1, 2}, {3}], [{4}, {5, 6}]])
        # b = np.array([[{1} , {}] , [{4}, {5, 6, 7, 8}]])
        # res = np.array([[{1} , {}] , [{4}, {5, 6}]])
        keep = tf.sets.set_intersection(tf.expand_dims(keep, 0),
                                        tf.expand_dims(conf_keep, 0))
        keep = tf.sparse_tensor_to_dense(keep)[0]

    # Apply per-class NMS
    # 1. Prepare variables
    pre_nms_class_ids = tf.gather(class_ids, keep)   # [n]
    pre_nms_scores = tf.gather(class_scores, keep)   # [n]
    pre_nms_rois = tf.gather(refined_rois, keep)     # [n, 4]
    unique_pre_nms_class_ids = tf.unique(pre_nms_class_ids)[0]  # class IDs after deduplication
    '''
    # tensor 'x' is [1, 1, 2, 4, 4, 4, 7, 8, 8]
    y, idx = unique(x)
    y ==> [1, 2, 4, 7, 8]
    idx ==> [0, 0, 1, 2, 2, 2, 3, 4, 4]
    '''
This part of the code mainly organizes the current information in preparation for the refinement; the flow is straightforward:
a. get the highest-scoring class and its score for each proposal region
b. clip each proposal's coarsely refined coordinates to the "window"
c. drop proposals whose top-scoring class is the background
d. drop proposals whose top score does not reach the confidence threshold
At this point the tensor keep holds the indices of the surviving proposals, i.e., a 1-D array whose values are box indices; these indices are filtered further in the steps below.
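The same filtering, written as a tiny NumPy sketch with made-up scores, shows what ends up in keep:

import numpy as np

# Made-up example: 5 ROIs, class 0 is background, confidence threshold 0.7.
class_ids    = np.array([0, 2, 3, 0, 5])             # argmax class per ROI
class_scores = np.array([0.9, 0.8, 0.6, 0.95, 0.75])

keep      = np.where(class_ids > 0)[0]           # [1, 2, 4]: drop background ROIs
conf_keep = np.where(class_scores >= 0.7)[0]     # [0, 1, 3, 4]: drop low-confidence ROIs
keep      = np.intersect1d(keep, conf_keep)      # [1, 4]: ROIs surviving both filters
print(keep)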
Step 2
e. Apply non-maximum suppression to candidate boxes of the same class.
Note that the nested function below closes over several outer variables: keep (the box indices kept in Step 1), pre_nms_class_ids (the classes of those boxes), and pre_nms_scores (their scores):
    def nms_keep_map(class_id):
        """Apply Non-Maximum Suppression on ROIs of the given class."""
        # Closes over the outer variables pre_nms_class_ids, pre_nms_rois,
        # pre_nms_scores and keep
        # Indices of ROIs of the given class
        # class_id is the class currently being suppressed; pre_nms_class_ids
        # holds the predicted classes of all kept ROIs
        ixs = tf.where(tf.equal(pre_nms_class_ids, class_id))[:, 0]
        # Apply NMS
        class_keep = tf.image.non_max_suppression(
                tf.gather(pre_nms_rois, ixs),    # coordinates of all ROIs of this class
                tf.gather(pre_nms_scores, ixs),  # scores of all ROIs of this class
                max_output_size=config.DETECTION_MAX_INSTANCES,  # 100
                iou_threshold=config.DETECTION_NMS_THRESHOLD)    # 0.3
        # Map indices
        # class_keep indexes into ixs, and ixs indexes into keep
        class_keep = tf.gather(keep, tf.gather(ixs, class_keep))  # an index of an index
        # Pad with -1 so returned tensors have the same shape
        gap = config.DETECTION_MAX_INSTANCES - tf.shape(class_keep)[0]
        class_keep = tf.pad(class_keep, [(0, gap)],
                            mode='CONSTANT', constant_values=-1)
        # Set shape so map_fn() can infer result shape
        class_keep.set_shape([config.DETECTION_MAX_INSTANCES])
        # The returned length must be fixed, otherwise tf.map_fn cannot run
        return class_keep

    # 2. Map over class IDs
    nms_keep = tf.map_fn(nms_keep_map, unique_pre_nms_class_ids,
                         dtype=tf.int64)  # [?, 100]: one row per class, kept box indices per row
This step outputs nms_keep in [?, 100] format, where ? is the number of classes kept in this image (note: classes, not instances).
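The index chaining in nms_keep_map (class_keep indexes into ixs, which indexes into keep) is easy to confuse; a small NumPy sketch with made-up values may help:

import numpy as np

# Made-up values: keep holds indices into the original N ROIs.
keep              = np.array([1, 4, 6, 9])
pre_nms_class_ids = np.array([2, 2, 5, 2])         # class of each kept ROI
class_id = 2

ixs = np.where(pre_nms_class_ids == class_id)[0]   # [0, 1, 3]: positions within keep
nms_out = np.array([0, 2])                          # pretend NMS kept positions 0 and 2 of ixs
class_keep = keep[ixs[nms_out]]                     # keep[[0, 3]] -> [1, 9]: original ROI indices
print(class_keep)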
Step 3
f. For the post-NMS box indices: remove the -1 placeholders, keep the top k (100), and return (y1, x1, y2, x2, class_id, score) for each box.
    # 3. Merge results into one list, and remove -1 padding
    nms_keep = tf.reshape(nms_keep, [-1])  # all kept box indices
    nms_keep = tf.gather(nms_keep, tf.where(nms_keep > -1)[:, 0])  # drop the -1 padding
    # 4. Compute intersection between keep and nms_keep
    # nms_keep was itself taken from keep, so this step looks redundant
    keep = tf.sets.set_intersection(tf.expand_dims(keep, 0),
                                    tf.expand_dims(nms_keep, 0))
    keep = tf.sparse_tensor_to_dense(keep)[0]
    # Keep top detections
    roi_count = config.DETECTION_MAX_INSTANCES
    class_scores_keep = tf.gather(class_scores, keep)  # gather the scores
    num_keep = tf.minimum(tf.shape(class_scores_keep)[0], roi_count)
    top_ids = tf.nn.top_k(class_scores_keep, k=num_keep, sorted=True)[1]
    keep = tf.gather(keep, top_ids)  # an index of an index

    # Arrange output as [N, (y1, x1, y2, x2, class_id, score)]
    # Coordinates are normalized.
    detections = tf.concat([
        tf.gather(refined_rois, keep),                             # box coordinates [?, 4]
        tf.to_float(tf.gather(class_ids, keep))[..., tf.newaxis],  # class IDs, new axis -> [?, 1]
        tf.gather(class_scores, keep)[..., tf.newaxis]             # scores, new axis -> [?, 1]
        ], axis=1)

    # Pad with zeros if detections < DETECTION_MAX_INSTANCES
    gap = config.DETECTION_MAX_INSTANCES - tf.shape(detections)[0]
    detections = tf.pad(detections, [(0, gap), (0, 0)], "CONSTANT")
    return detections
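Back in DetectionLayer.call, the per-image results stacked by batch_slice are reshaped into a fixed-size batch tensor before being returned; in the Matterport implementation this final step looks like the following:

        # Reshape output
        # [batch, num_detections, (y1, x1, y2, x2, class_id, class_score)] in
        # normalized coordinates
        return tf.reshape(
            detections_batch,
            [self.config.BATCH_SIZE, self.config.DETECTION_MAX_INSTANCES, 6])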
At this point we have detection results ready for output; the next step is to generate the mask information.
