一、detect and build
The previous several posts spent a great deal of ink on the inference branch of the build method; in this one we look at how it actually gets called.
Let's briefly summarize the model-related operations in demo.ipynb. First, the graph is created and the pretrained weights are loaded:
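The corresponding cells in the repo's demo.ipynb look roughly like this (a sketch: ROOT_DIR, MODEL_DIR, and COCO_MODEL_PATH are paths defined earlier in the notebook):

import os
import sys
import mrcnn.model as modellib
sys.path.append(os.path.join(ROOT_DIR, "samples/coco/"))  # repo's COCO sample
import coco

class InferenceConfig(coco.CocoConfig):
    # Run one image at a time through the graph.
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1

config = InferenceConfig()

# Build the inference graph and load the MS COCO pretrained weights.
model = modellib.MaskRCNN(mode="inference", model_dir=MODEL_DIR, config=config)
model.load_weights(COCO_MODEL_PATH, by_name=True)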

Then the class-name list is set up:
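In demo.ipynb this is a hard-coded list of the COCO class names (abbreviated here; the full list has 81 entries):

# COCO class names; a class's index in this list is its class ID,
# with 'BG' (background) at index 0.
class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
               'bus', 'train', 'truck', 'boat', 'traffic light',
               # ... 81 names in total in demo.ipynb ...
               'teddy bear', 'hair drier', 'toothbrush']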

The code block that actually runs detection is the following:
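Roughly, as in demo.ipynb (IMAGE_DIR is a folder of test images defined earlier in the notebook):

import random
import skimage.io
from mrcnn import visualize

# Pick a random image from IMAGE_DIR.
file_names = next(os.walk(IMAGE_DIR))[2]
image = skimage.io.imread(os.path.join(IMAGE_DIR, random.choice(file_names)))

# detect takes a list of BATCH_SIZE images and returns one dict per image.
results = model.detect([image], verbose=1)

r = results[0]
visualize.display_instances(image, r['rois'], r['masks'], r['class_ids'],
                            class_names, r['scores'])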


Detection goes through model.detect, which runs the graph constructed by model.build (the method the previous posts dissected, invoked in __init__) to produce predictions.
二、The detect method
First, the opening lines of detect (found, like build, in model.py):
def detect(self, images, verbose=0):
    """Runs the detection pipeline.

    images: List of images, potentially of different sizes.

    Returns a list of dicts, one dict per image. The dict contains:
    rois: [N, (y1, x1, y2, x2)] detection bounding boxes
    class_ids: [N] int class IDs
    scores: [N] float probability scores for the class IDs
    masks: [H, W, N] instance binary masks
    """
    assert self.mode == "inference", "Create model in inference mode."
    assert len(images) == self.config.BATCH_SIZE,\
        "len(images) must be equal to BATCH_SIZE"
    # Logging
    if verbose:
        log("Processing {} images".format(len(images)))
        for image in images:
            log("image", image)
1、Preprocessing the images to be detected
    # Mold inputs to format expected by the neural network
    molded_images, image_metas, windows = self.mold_inputs(images)

    # Validate image sizes
    # All images in a batch MUST be of the same size
    image_shape = molded_images[0].shape
    for g in molded_images[1:]:
        assert g.shape == image_shape,\
            "After resizing, all images must have the same size. Check IMAGE_RESIZE_MODE and image sizes."
After this simple error checking and logging, the mold_inputs function is called to adjust the input images and record their metadata.
The self.mold_inputs method is as follows:
def mold_inputs(self, images):
    """Takes a list of images and modifies them to the format expected
    as an input to the neural network.
    images: List of image matrices [height,width,depth]. Images can have
        different sizes.

    Returns 3 Numpy matrices:
    molded_images: [N, h, w, 3]. Images resized and normalized.
    image_metas: [N, length of meta data]. Details about each image.
    windows: [N, (y1, x1, y2, x2)]. The portion of the image that has the
        original image (padding excluded).
    """
    molded_images = []
    image_metas = []
    windows = []
    for image in images:
        # Resize image
        # TODO: move resizing to mold_image()
        molded_image, window, scale, padding, crop = utils.resize_image(
            image,
            min_dim=self.config.IMAGE_MIN_DIM,      # 800
            min_scale=self.config.IMAGE_MIN_SCALE,  # 0
            max_dim=self.config.IMAGE_MAX_DIM,      # 1024
            mode=self.config.IMAGE_RESIZE_MODE)     # square
        molded_image = mold_image(molded_image, self.config)  # subtract the mean pixel
        # Build image_meta as a 1D NumPy array
        image_meta = compose_image_meta(
            0, image.shape, molded_image.shape, window, scale,
            np.zeros([self.config.NUM_CLASSES], dtype=np.int32))
        # Append
        molded_images.append(molded_image)
        windows.append(window)
        image_metas.append(image_meta)
    # Pack into arrays
    molded_images = np.stack(molded_images)
    image_metas = np.stack(image_metas)
    windows = np.stack(windows)
    return molded_images, image_metas, windows
The utils.resize_image function rescales the original image. It computes a scale factor such that the returned image size equals the input size times scale, guaranteeing that
- the shorter side equals min_dim and the longer side does not exceed max_dim;
- if the longer side would exceed max_dim, scale is recomputed so that the longer side equals max_dim exactly, and the shorter-side constraint is dropped.
Finally, the image is padded out to max_dim × max_dim (so the size of molded_images is in fact fixed). Its return value is:
image.astype(image_dtype), window, scale, padding, crop
That is: the resized image; the window, locating the original image within the resized one (see 『計算機視覺』Mask-RCNN_推斷網絡其五:目標檢測結果精煉 for details); the scale factor; the padding information (four integers); and the crop information (four integers, or None).
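As a concrete illustration with hypothetical numbers: a 480×640 image would first be scaled by 800/480 ≈ 1.667, which would make the long side 1067 > 1024, so the scale is clamped to 1024/640 = 1.6:

import numpy as np
from mrcnn import utils

# A hypothetical 480x640 input image.
image = np.zeros((480, 640, 3), dtype=np.uint8)
molded, window, scale, padding, crop = utils.resize_image(
    image, min_dim=800, min_scale=0, max_dim=1024, mode="square")

print(molded.shape)  # (1024, 1024, 3): always padded to max_dim x max_dim
print(scale)         # 1.6: clamped by the long side, 1024 / 640
print(window)        # (128, 0, 896, 1024): rows 128:896 hold the real image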
The mold_image function is simpler still: it merely subtracts the mean pixel value from the image, MEAN_PIXEL = [123.7, 116.8, 103.9].
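For reference, mold_image in model.py is essentially a one-liner:

def mold_image(images, config):
    """Subtracts the mean pixel and casts to float. Expects RGB order."""
    return images.astype(np.float32) - config.MEAN_PIXEL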
compose_image_meta records all of the original information; note that crop is not included:
def compose_image_meta(image_id, original_image_shape, image_shape,
                       window, scale, active_class_ids):
    """Takes attributes of an image and puts them in one 1D array.

    image_id: An int ID of the image. Useful for debugging.
    original_image_shape: [H, W, C] before resizing or padding.
    image_shape: [H, W, C] after resizing and padding
    window: (y1, x1, y2, x2) in pixels. The area of the image where the real
        image is (excluding the padding)
    scale: The scaling factor applied to the original image (float32)
    active_class_ids: List of class_ids available in the dataset from which
        the image came. Useful if training on images from multiple datasets
        where not all classes are present in all datasets.
    """
    meta = np.array(
        [image_id] +                  # size=1
        list(original_image_shape) +  # size=3
        list(image_shape) +           # size=3
        list(window) +                # size=4 (y1, x1, y2, x2) in image coordinates
        [scale] +                     # size=1
        list(active_class_ids)        # size=num_classes
    )
    return meta
The pieces are concatenated and returned.
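Continuing the hypothetical 480×640 example (with 81 classes, as for COCO), the meta vector's length is fixed at 1 + 3 + 3 + 4 + 1 + NUM_CLASSES:

image_meta = compose_image_meta(
    0,                          # image_id
    (480, 640, 3),              # original_image_shape
    (1024, 1024, 3),            # image_shape after molding
    (128, 0, 896, 1024),        # window
    1.6,                        # scale
    np.zeros([81], dtype=np.int32))  # active_class_ids, 81 classes for COCO
print(image_meta.shape)  # (93,) = 1 + 3 + 3 + 4 + 1 + 81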
2、Anchor generation
First, the get_anchors method generates the anchor boxes (see 『計算機視覺』Mask-RCNN_錨框生成), with shape [anchor_count, (y1, x1, y2, x2)]:
    # Anchors
    anchors = self.get_anchors(image_shape)
    # Duplicate across the batch dimension because Keras requires it
    # TODO: can this be optimized to avoid duplicating the anchors?
    # [anchor_count, (y1, x1, y2, x2)] --> [batch, anchor_count, (y1, x1, y2, x2)]
    anchors = np.broadcast_to(anchors, (self.config.BATCH_SIZE,) + anchors.shape)
A batch dimension is then broadcast on, giving the final shape [batch, anchor_count, (y1, x1, y2, x2)].
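A quick sanity check under the default 1024×1024 config (hypothetical, reusing the model object from the demo.ipynb sketch above):

# 3 ratios at every cell of the P2-P6 feature maps:
# (256^2 + 128^2 + 64^2 + 32^2 + 16^2) * 3 = 261888 anchors
anchors = model.get_anchors((1024, 1024, 3))
print(anchors.shape)                                          # (261888, 4)
print(np.broadcast_to(anchors, (1,) + anchors.shape).shape)   # (1, 261888, 4)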
3、Inference network prediction
Keras's predict method is called to run the forward pass; in the prediction task we only care about two of the outputs, detections and mrcnn_mask.
    # Run object detection
    # Defined in __init__: self.keras_model = self.build(mode=mode, config=config)
    # Returns the list: [detections, mrcnn_class, mrcnn_bbox,
    #                    mrcnn_mask, rpn_rois, rpn_class, rpn_bbox]
    # detections: [batch, num_detections, (y1, x1, y2, x2, class_id, score)]
    # mrcnn_mask: [batch, num_detections, MASK_POOL_SIZE, MASK_POOL_SIZE, NUM_CLASSES]
    detections, _, _, mrcnn_mask, _, _, _ =\
        self.keras_model.predict([molded_images, image_metas, anchors], verbose=0)
4、Remapping the box coordinates
All of our coordinate operations so far have used positions relative to the molded input image, with its width and height as the unit lengths, so at the end we need to convert them back to pixel-space coordinates. Moreover, since the input image list does not require all images to share the same size, we must take care to unmold them one image at a time.
    # Process detections
    results = []
    for i, image in enumerate(images):
        # Must be done per image: the original images are not guaranteed
        # to all have the same size
        final_rois, final_class_ids, final_scores, final_masks =\
            self.unmold_detections(detections[i], mrcnn_mask[i],
                                   image.shape, molded_images[i].shape,
                                   windows[i])
Detection-box remapping: the unmold_detections function
def unmold_detections(self, detections, mrcnn_mask, original_image_shape,
                      image_shape, window):
    """Reformats the detections of one image from the format of the neural
    network output to a format suitable for use in the rest of the
    application.

    detections: [N, (y1, x1, y2, x2, class_id, score)] in normalized coordinates
    mrcnn_mask: [N, height, width, num_classes]
    original_image_shape: [H, W, C] Original image shape before resizing
    image_shape: [H, W, C] Shape of the image after resizing and padding
    window: [y1, x1, y2, x2] Pixel coordinates of box in the image where the real
        image is excluding the padding.

    Returns:
    boxes: [N, (y1, x1, y2, x2)] Bounding boxes in pixels
    class_ids: [N] Integer class IDs for each bounding box
    scores: [N] Float probability scores of the class_id
    masks: [height, width, num_instances] Instance masks
    """
    # How many detections do we have?
    # Detections array is padded with zeros. Find the first class_id == 0.
    zero_ix = np.where(detections[:, 4] == 0)[0]  # DetectionLayer zero-pads the tail of its output
    N = zero_ix[0] if zero_ix.shape[0] > 0 else detections.shape[0]  # number N of meaningful detections

    # Extract boxes, class_ids, scores, and class-specific masks
    boxes = detections[:N, :4]                         # [N, (y1, x1, y2, x2)]
    class_ids = detections[:N, 4].astype(np.int32)     # [N] class ids
    scores = detections[:N, 5]                         # [N] scores
    masks = mrcnn_mask[np.arange(N), :, :, class_ids]  # [N, height, width]

    # Translate normalized coordinates in the resized image to pixel
    # coordinates in the original image before resizing
    window = utils.norm_boxes(window, image_shape[:2])  # normalize window w.r.t. the molded image
    wy1, wx1, wy2, wx2 = window
    shift = np.array([wy1, wx1, wy1, wx1])
    wh = wy2 - wy1  # window height
    ww = wx2 - wx1  # window width
    scale = np.array([wh, ww, wh, ww])
    # Convert boxes to normalized coordinates on the window
    boxes = np.divide(boxes - shift, scale)  # normalize boxes w.r.t. the window
    # Convert boxes to pixel coordinates on the original image
    boxes = utils.denorm_boxes(boxes, original_image_shape[:2])  # denormalize boxes onto the original image

    # Filter out detections with zero area. Happens in early training when
    # network weights are still random
    exclude_ix = np.where(
        (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]) <= 0)[0]
    if exclude_ix.shape[0] > 0:
        boxes = np.delete(boxes, exclude_ix, axis=0)
        class_ids = np.delete(class_ids, exclude_ix, axis=0)
        scores = np.delete(scores, exclude_ix, axis=0)
        masks = np.delete(masks, exclude_ix, axis=0)
        N = class_ids.shape[0]

    # Resize masks to original image size and set boundary threshold.
    full_masks = []
    for i in range(N):  # one box at a time
        # Convert neural network mask to full size mask
        full_mask = utils.unmold_mask(masks[i], boxes[i], original_image_shape)
        full_masks.append(full_mask)
    full_masks = np.stack(full_masks, axis=-1)\
        if full_masks else np.empty(original_image_shape[:2] + (0,))

    # boxes:      [n, (y1, x1, y2, x2)]
    # class_ids:  [n]
    # scores:     [n]
    # full_masks: [h, w, n]
    return boxes, class_ids, scores, full_masks
To restore the output to its final format, the function performs the following steps:
- drop the all-zero detections that were padded in to fill DETECTION_MAX_INSTANCES;
- rescale the boxes back to the original image's size;
- drop boxes with zero area;
- restore the masks to their original size.
The boxes manipulated inside the network are in normalized coordinates relative to the molded input image, while window is in pixel coordinates. So we first normalize window against the molded image, which puts window and the boxes in the same coordinate system; the two can then interact directly. Normalizing the boxes against the window expresses each box as a fraction of the window, and since the window maps directly onto the original image, the boxes inherit that mapping: denormalizing them against the original image yields pixel coordinates.
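A small numeric sketch of this two-step conversion, reusing the hypothetical 480×640 image padded into 1024×1024 (and ignoring the one-pixel offsets that utils.norm_boxes / denorm_boxes actually handle):

import numpy as np

# Normalize the window against the 1024x1024 molded image.
wy1, wx1, wy2, wx2 = np.array([128, 0, 896, 1024]) / 1024.0  # (0.125, 0, 0.875, 1)
shift = np.array([wy1, wx1, wy1, wx1])
scale = np.array([wy2 - wy1, wx2 - wx1, wy2 - wy1, wx2 - wx1])

box = np.array([0.25, 0.1, 0.5, 0.3])       # normalized on the molded image
box = (box - shift) / scale                 # -> (0.1667, 0.1, 0.5, 0.3) on the window
box = box * np.array([480, 640, 480, 640])  # -> pixels on the original image
print(box)  # approx. [ 80.  64. 240. 192.]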
With the boxes mapped back, we move on to the masks.
Mask remapping: the utils.unmold_mask function
utils.unmold_mask is invoked at the tail of unmold_detections:
    # Resize masks to original image size and set boundary threshold.
    full_masks = []
    for i in range(N):  # one box at a time
        # Convert neural network mask to full size mask
        full_mask = utils.unmold_mask(masks[i], boxes[i], original_image_shape)
        full_masks.append(full_mask)
    full_masks = np.stack(full_masks, axis=-1)\
        if full_masks else np.empty(original_image_shape[:2] + (0,))
To restate: unmold_detections processes a single image, and the mask processing goes one level deeper, handling each detection box individually:
def unmold_mask(mask, bbox, image_shape):
    """Converts a mask generated by the neural network to a format similar
    to its original shape.
    mask: [height, width] of type float. A small, typically 28x28 mask.
    bbox: [y1, x1, y2, x2]. The box to fit the mask in.

    Returns a binary mask with the same size as the original image.
    """
    threshold = 0.5
    y1, x1, y2, x2 = bbox
    mask = resize(mask, (y2 - y1, x2 - x1))
    mask = np.where(mask >= threshold, 1, 0).astype(np.bool)

    # Put the mask in the right location.
    full_mask = np.zeros(image_shape[:2], dtype=np.bool)
    full_mask[y1:y2, x1:x2] = mask
    return full_mask
The mask emitted at inference time is just the raw output of the mask head, so to obtain a binary mask we need a threshold. With that concept clear, the next step is simple: we resize the mask to the size of its box (the box at this point has already been rescaled to pixel coordinates on the original image), then paste the resized mask, at the box's position, onto a blank canvas the size of the original image.
Doing this for every detected object yields as many original-size mask images as there are detections (each containing a single object's mask); stacking them along axis=-1 produces the final [h, w, n] output, where h and w are the original image size and n is the number of final detections.
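A hypothetical call for a single detection on the 480×640 example image:

import numpy as np
from mrcnn import utils

# One 28x28 float mask from the mask head, for a box at (80, 64, 240, 192).
mask28 = np.random.rand(28, 28).astype(np.float32)
full = utils.unmold_mask(mask28, np.array([80, 64, 240, 192]), (480, 640, 3))
print(full.shape, full.dtype)  # (480, 640) bool; True only inside the box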
Finally, the results are returned and the function exits:
    # boxes:      [n, (y1, x1, y2, x2)]
    # class_ids:  [n]
    # scores:     [n]
    # full_masks: [h, w, n]
    return boxes, class_ids, scores, full_masks
The actual call site, back in detect, looks like this:
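Continuing the per-image loop quoted earlier, the tail of detect in model.py packs the unmolded outputs into one dict per image and returns the list:

        results.append({
            "rois": final_rois,
            "class_ids": final_class_ids,
            "scores": final_scores,
            "masks": final_masks,
        })
    return results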

