Required files: local download
Autonomous Driving - Car Detection
This article uses the very powerful YOLO model for object detection. The ideas presented here come from two papers: Redmon et al., 2016 and Redmon and Farhadi, 2016.
Import the needed packages
import argparse
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import scipy.io
import scipy.misc
import numpy as np
import pandas as pd
import PIL
import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Lambda, Conv2D
from keras.models import load_model, Model
from yolo_utils import read_classes, read_anchors, generate_colors, preprocess_image, draw_boxes, scale_boxes
from yad2k.models.keras_yolo import yolo_head, yolo_boxes_to_corners, preprocess_true_boxes, yolo_loss, yolo_body
%matplotlib inline
1 - Problem Statement
To build a self-driving car, a crucial component is the car detection system. To collect data, a camera can be mounted on the hood of the car, taking pictures of the road ahead every few seconds while the car is driven.
Pictures taken from a car-mounted camera while driving around Silicon Valley.
Dataset provided by drive.ai.
All of the collected images have been gathered into a folder, and every car found has been labeled with a rectangular bounding box. Here is an example of what the bounding boxes look like:
Definition of a box
There are 80 classes that you would like the object detector to recognize. The class label \(c\) can be represented in two ways: either as an integer from 1 to 80, or as an 80-dimensional vector in which one component is 1 and all others are 0. This article uses both representations, depending on which is more convenient.
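A minimal sketch of the two representations (the class numbering here is only an example):
import numpy as np
num_classes = 80
c_int = 3                        # integer label; classes are numbered 1 to 80
c_onehot = np.zeros(num_classes)
c_onehot[c_int - 1] = 1          # 80-dimensional vector: one component is 1, the rest are 0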
In this article, you will learn how "You Only Look Once" (YOLO) performs object detection, and apply it to car detection. Because training a YOLO model is very computationally expensive, pre-trained weights will be loaded and used.
2 - YOLO
"You Only Look Once" (YOLO) is a popular algorithm because it achieves high accuracy while also being able to run in real time. The algorithm "only looks once" in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it outputs the recognized objects together with their bounding boxes.
2.1 - Model Details
Inputs and outputs
- The input is a batch of images of shape \((m, 608, 608, 3)\).
- The output is a list of recognized classes along with their bounding boxes. Each bounding box is represented by 6 numbers \((p_c, b_x, b_y, b_h, b_w, c)\). If you expand \(c\) into an 80-dimensional vector, each bounding box is represented by 85 numbers.
Anchor Boxes
- Anchor boxes are chosen by exploring the training data to pick reasonable height/width ratios that represent the different classes. This article uses 5 anchor boxes, stored in the file ./model_data/yolo_anchors.txt.
- The dimension for anchor boxes is the second-to-last dimension in the encoding: \((m, n_H, n_W, \text{anchors}, \text{classes})\).
- The YOLO architecture is: \(\text{Image } (m, 608, 608, 3) \rightarrow \text{Deep CNN} \rightarrow \text{Encoding } (m, 19, 19, 5, 85)\).
Encoding
Here is a more detailed view of what this encoding represents.
Encoding architecture for YOLO
If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Since 5 anchor boxes are used, each of the \(19\times 19\) cells encodes information about 5 anchor boxes. Anchor boxes are defined only by their width and height.
For simplicity, the last two dimensions of the \((19, 19, 5, 85)\) encoding are flattened, so the output of the Deep CNN is \((19, 19, 425)\).
Flattening the last two dimensions
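A minimal NumPy sketch of this flattening (illustrative only; it just shows that the two shapes hold the same numbers):
import numpy as np
encoding = np.random.randn(19, 19, 5, 85)  # (grid_h, grid_w, anchors, 5 + 80)
flattened = encoding.reshape(19, 19, 425)  # merge the last two dimensions: 5 * 85 = 425
assert np.array_equal(flattened[0, 0, :85], encoding[0, 0, 0, :])  # same numbers, new shape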
Class score
Now, for each box (of each cell), the following element-wise product is computed to extract the probability that the box contains a certain class.
The class score is \(score_{c, i} = p_c \times c_i\): the probability \(p_c\) that there is an object, times the probability \(c_i\) that the object is a certain class.
Find the class detected by each box
- In the figure above, for box 1 of the first cell, the probability that an object exists is \(p_1 = 0.60\). So there is a \(60\%\) chance that an object exists in box 1.
- The probability that the object is "class 3 (a car)" is \(c_3 = 0.73\).
- The score for box 1 and class 3 is \(score_{1, 3} = 0.60 \times 0.73 = 0.44\).
- Compute the scores for all 80 classes in box 1 and find the maximum: it is the 0.44 for class 3 (the car class). So the score 0.44 and class 3 are assigned to box 1.
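A minimal NumPy sketch of this score computation for a single box, using the example numbers above:
import numpy as np
p_c = 0.60                       # probability that box 1 contains some object
c = np.zeros(80)
c[2] = 0.73                      # probability that the object is class 3 (car); 0-indexed
scores = p_c * c                 # element-wise product: score_{c,i} = p_c * c_i
best_class = np.argmax(scores)   # -> 2, i.e. class 3
best_score = scores[best_class]  # -> 0.438, which rounds to the 0.44 above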
Visualizing classes
Here is one way to visualize YOLO's predictions on an image:
- For each of the \(19\times 19\) grid cells, find the maximum of the probability scores (taking the max over the 80 classes, one maximum for each of the 5 anchor boxes).
- Color each grid cell according to the object that the grid cell considers most likely.
Each of the \(19\times 19\) grid cells is colored according to its most probable class
Note that this visualization is not a core part of the YOLO algorithm itself for making predictions; it is just a nice way of visualizing an intermediate result.
Visualizing bounding boxes
Another way to visualize YOLO's output is to plot the bounding boxes it produces, as the figure below shows:
Each cell gives 5 boxes. In total, the model predicts \(19\times 19\times 5 = 1805\) boxes just by looking once at the image (one forward pass through the network); different classes are marked with different colors.
Non-max suppression
In the figure above, only the boxes to which the model assigned a high probability were plotted, but this is still far too many boxes. You would like to reduce the algorithm's output to a much smaller number of detected objects.
To do so, you will use non-max suppression. Specifically, you will carry out these steps:
- Get rid of boxes with a low score (meaning the box is not very confident about detecting a class: either a low probability that any object is present, or a low probability of this particular class).
- When several boxes overlap with each other and detect the same object, select only one of them.
2.2 - Filtering with a threshold on class scores
You are first going to apply a filter by thresholding: discard any box whose class score is below the chosen threshold.
The model gives you a total of \(19\times 19\times 5\times 85\) numbers, with each box described by 85 numbers. It is convenient to rearrange the \((19, 19, 5, 85)\) (or \((19, 19, 425)\)) dimensional tensor into the following variables:
- box_confidence: a tensor of shape \((19, 19, 5, 1)\) containing \(p_c\) (the confidence that there is some object) for each of the 5 boxes in each of the \(19\times 19\) cells.
- boxes: a tensor of shape \((19, 19, 5, 4)\) containing the midpoint and dimensions \((b_x, b_y, b_h, b_w)\) for each of the 5 boxes in each cell.
- box_class_probs: a tensor of shape \((19, 19, 5, 80)\) containing the probabilities \((c_1, c_2, \dots, c_{80})\) of the 80 classes for each of the 5 boxes in each cell.
Programming exercise: Implement yolo_filter_boxes().
- Compute box scores by doing the element-wise product \((p\times c)\) described in Figure 4. The following code may help you choose the right operator:
a = np.random.randn(19, 19, 5, 1)
b = np.random.randn(19, 19, 5, 80)
c = a * b # shape of c will be (19, 19, 5, 80)
This is an example of broadcasting (multiplying vectors of different shapes).
- For each box, find:
- the index of the class with the maximum box score
- the corresponding box score
Reference documentation: keras.backend.argmax, keras.backend.max
Notes:
- For the axis parameter of argmax and max, one way to select the last axis is to set axis=-1. This is similar to Python array indexing.
- Applying max normally collapses the axis along which it is applied. keepdims=False is the default option, which allows that dimension to be removed; there is no need to keep the last dimension after taking the maximum here.
- Create a mask by using a threshold. As an example: ([0.9, 0.3, 0.4, 0.5, 0.1] < 0.4) returns [False, True, False, False, True]. The mask should be True for the boxes you want to keep.
- Use TensorFlow to apply the mask to box_class_scores, boxes and box_classes to filter out the boxes you don't want. You should be left with just the subset of boxes you want to keep.
Reference documentation: tf.boolean_mask
Notes:
- For the tf.boolean_mask function, you can keep the default axis=None.
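The following toy snippet (a sketch, not part of the exercise) demonstrates the argmax/max/mask operations above on a tiny tensor:
import tensorflow as tf
from keras import backend as K
toy_scores = tf.constant([[0.9, 0.1], [0.2, 0.7], [0.3, 0.4]])
toy_classes = K.argmax(toy_scores, axis=-1)     # [0, 1, 1]: index of the best class per box
toy_best = K.max(toy_scores, axis=-1)           # [0.9, 0.7, 0.4]: the last axis is collapsed
toy_mask = toy_best >= 0.5                      # [True, True, False]
toy_kept = tf.boolean_mask(toy_best, toy_mask)  # [0.9, 0.7]: only the boxes to keep remain
with tf.Session() as s:
    print(s.run([toy_classes, toy_best, toy_mask, toy_kept]))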
def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
"""
Filters YOLO boxes by thresholding on object and class confidence.
Arguments:
box_confidence -- tensor of shape (19, 19, 5, 1)
boxes -- tensor of shape (19, 19, 5, 4)
box_class_probs -- tensor of shape (19, 19, 5, 80)
threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
Returns:
scores -- tensor of shape (None,), containing the class probability score for selected boxes
boxes -- tensor of shape (None, 4), containing (b_x, b_y, b_h, b_w) coordinates of selected boxes
classes -- tensor of shape (None,), containing the index of the class detected by the selected boxes
Note: "None" is here because you don't know the exact number of selected boxes, as it depends on the threshold.
For example, the actual output size of scores would be (10,) if there are 10 boxes.
"""
# Step 1: Compute box scores
### START CODE HERE ### (≈ 1 line)
box_scores = box_confidence * box_class_probs
### END CODE HERE ###
# Step 2: Find the box_classes using the max box_scores, keep track of the corresponding score
### START CODE HERE ### (≈ 2 lines)
box_classes = K.argmax(box_scores, axis=-1)
box_class_scores = K.max(box_scores, axis=-1)
### END CODE HERE ###
# Step 3: Create a filtering mask based on "box_class_scores" by using "threshold". The mask should have the
# same dimension as box_class_scores, and be True for the boxes you want to keep (with probability >= threshold)
### START CODE HERE ### (≈ 1 line)
filtering_mask = box_class_scores >= threshold
### END CODE HERE ###
# Step 4: Apply the mask to box_class_scores, boxes and box_classes
### START CODE HERE ### (≈ 3 lines)
scores = tf.boolean_mask(box_class_scores, filtering_mask)
boxes = tf.boolean_mask(boxes, filtering_mask)
classes = tf.boolean_mask(box_classes, filtering_mask)
### END CODE HERE ###
return scores, boxes, classes
Test case:
with tf.Session() as test_a:
box_confidence = tf.random_normal([19, 19, 5, 1], mean=1, stddev=4, seed = 1)
boxes = tf.random_normal([19, 19, 5, 4], mean=1, stddev=4, seed = 1)
box_class_probs = tf.random_normal([19, 19, 5, 80], mean=1, stddev=4, seed = 1)
scores, boxes, classes = yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = 0.5)
print("scores[2] = " + str(scores[2].eval()))
print("boxes[2] = " + str(boxes[2].eval()))
print("classes[2] = " + str(classes[2].eval()))
print("scores.shape = " + str(scores.shape))
print("boxes.shape = " + str(boxes.shape))
print("classes.shape = " + str(classes.shape))
Output:
scores[2] = 10.7506
boxes[2] = [ 8.42653275 3.27136683 -0.5313437 -4.94137383]
classes[2] = 7
scores.shape = (?,)
boxes.shape = (?, 4)
classes.shape = (?,)
Note that the test case uses random numbers to test the yolo_filter_boxes function. In real data, box_class_probs would contain only non-negative numbers between 0 and 1, representing probabilities, and the box coordinates in boxes would have non-negative heights and widths.
2.3 - Non-max suppression
Even after filtering by thresholding over the class scores, you still end up with a lot of overlapping boxes. A second filter, called non-max suppression (NMS), selects the correct box.
In this example, the model has predicted 3 cars, but all 3 predictions are actually of the same car. Running non-max suppression (NMS) will keep only the most accurate (highest-probability) of the 3 boxes.
Non-max suppression relies on a very important function called Intersection over Union (IoU).
Definition of Intersection over Union (IoU)
Programming exercise: Implement iou()
- In this exercise, \((0, 0)\) denotes the top-left corner of the image, \((1, 0)\) the top-right corner, and \((1, 1)\) the bottom-right corner. In other words, the origin \((0, 0)\) is the top left of the image; as \(x\) increases you move right, and as \(y\) increases you move down.
- A box is defined by its two corners: upper left \((x_1, y_1)\) and lower right \((x_2, y_2)\), rather than by its midpoint, height and width. This makes it a bit easier to compute the intersection.
- To compute the area of a rectangle, multiply its height \((y_2 - y_1)\) by its width \((x_2 - x_1)\). (Since \((x_1, y_1)\) is the top left and \((x_2, y_2)\) the bottom right, these differences should be non-negative.)
- To define the intersection \((xi_1, yi_1, xi_2, yi_2)\) of two boxes:
- The top left of the intersection \((xi_1, yi_1)\) is found by comparing the top-left corners \((x_1, y_1)\) of the two boxes and taking the x coordinate that is further right and the y coordinate that is further down.
- The bottom right of the intersection \((xi_2, yi_2)\) is found by comparing the bottom-right corners \((x_2, y_2)\) of the two boxes and taking the x coordinate that is further left and the y coordinate that is further up.
- The two boxes may have no intersection at all. You can detect this when a computed intersection corner turns out to be a top-right (or bottom-left) corner; equivalently, when the computed height \((y_2 - y_1)\) or width \((x_2 - x_1)\) is negative, there is no intersection (the intersection area is 0).
- The two boxes may intersect only at an edge or a vertex, in which case the intersection area is still 0. This happens when the computed height or width (or both) of the intersection is 0.
Additional hints:
- xi1 = the maximum of the two boxes' x1 coordinates
- yi1 = the maximum of the two boxes' y1 coordinates
- xi2 = the minimum of the two boxes' x2 coordinates
- yi2 = the minimum of the two boxes' y2 coordinates
- inter_area = you can use max(height, 0) and max(width, 0)
def iou(box1, box2):
"""
Implement the intersection over union (IoU) between box1 and box2
Arguments:
box1 -- first box, list object with coordinates (box1_x1, box1_y1, box1_x2, box1_y2)
box2 -- second box, list object with coordinates (box2_x1, box2_y1, box2_x2, box2_y2)
"""
# Assign variable names to coordinates for clarity
(box1_x1, box1_y1, box1_x2, box1_y2) = box1
(box2_x1, box2_y1, box2_x2, box2_y2) = box2
# Calculate the (yi1, xi1, yi2, xi2) coordinates of the intersection of box1 and box2. Calculate its Area.
### START CODE HERE ### (≈ 7 lines)
xi1 = max(box1_x1, box2_x1)
yi1 = max(box1_y1, box2_y1)
xi2 = min(box1_x2, box2_x2)
yi2 = min(box1_y2, box2_y2)
inter_width = xi2 - xi1
inter_height = yi2 - yi1
inter_area = inter_width * inter_height if(inter_width > 0 and inter_height > 0) else 0
### END CODE HERE ###
# Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)
### START CODE HERE ### (≈ 3 lines)
box1_area = (box1_x2 - box1_x1) * (box1_y2 - box1_y1)
box2_area = (box2_x2 - box2_x1) * (box2_y2 - box2_y1)
union_area = box1_area + box2_area - inter_area
### END CODE HERE ###
# compute the IoU
### START CODE HERE ### (≈ 1 line)
iou = inter_area / union_area
### END CODE HERE ###
return iou
Test cases:
## Test case 1: boxes intersect
box1 = (2, 1, 4, 3)
box2 = (1, 2, 3, 4)
print("iou for intersecting boxes = " + str(iou(box1, box2)))
## Test case 2: boxes do not intersect
box1 = (1,2,3,4)
box2 = (5,6,7,8)
print("iou for non-intersecting boxes = " + str(iou(box1,box2)))
## Test case 3: boxes intersect at vertices only
box1 = (1,1,2,2)
box2 = (2,2,3,3)
print("iou for boxes that only touch at vertices = " + str(iou(box1,box2)))
## Test case 4: boxes intersect at edge only
box1 = (1,1,3,3)
box2 = (2,3,3,4)
print("iou for boxes that only touch at edges = " + str(iou(box1,box2)))
Output:
iou for intersecting boxes = 0.14285714285714285
iou for non-intersecting boxes = 0.0
iou for boxes that only touch at vertices = 0.0
iou for boxes that only touch at edges = 0.0
YOLO non-max suppression
The key steps for implementing non-max suppression are:
- Select the box with the highest score.
- Compute the overlap of this box with all other boxes, and remove any box that overlaps it significantly (\(iou \ge iou\_threshold\)).
- Go back to step 1 and iterate until no remaining box has a lower score than the currently selected box.
This removes every box that has a large overlap with a selected box; only the "best" boxes remain.
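To make this loop concrete, here is a minimal pure-Python sketch of non-max suppression built on the iou() function above (illustrative only; the exercise below uses TensorFlow's built-in operator instead):
def nms_sketch(boxes, scores, iou_threshold=0.5, max_boxes=10):
    """Greedy NMS over corner-format boxes; returns the indices of kept boxes."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order and len(keep) < max_boxes:
        best = order.pop(0)  # highest-scoring remaining box
        keep.append(best)
        # drop any remaining box that overlaps the chosen one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(2, 1, 4, 3), (2.1, 1.1, 4.1, 3.1), (10, 10, 12, 12)]  # two near-duplicates, one separate box
scores = [0.9, 0.8, 0.7]
print(nms_sketch(boxes, scores))  # -> [0, 2]; the near-duplicate box 1 is suppressed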
Programming exercise: Implement yolo_non_max_suppression() using TensorFlow. TensorFlow has two built-in functions that are used to implement non-max suppression (so you don't actually need to use the iou() implementation above).
Reference documentation:
tf.image.non_max_suppression(
boxes,
scores,
max_output_size,
iou_threshold=0.5,
name=None
)
Note that in the TensorFlow version used here, there is no score_threshold parameter (which appears in the documentation for the latest version).
tf.keras.backend.gather(
reference, indices
)
def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
"""
Applies Non-max suppression (NMS) to set of boxes
Arguments:
scores -- tensor of shape (None,), output of yolo_filter_boxes()
boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
classes -- tensor of shape (None,), output of yolo_filter_boxes()
max_boxes -- integer, maximum number of predicted boxes you'd like
iou_threshold -- real value, "intersection over union" threshold used for NMS filtering
Returns:
scores -- tensor of shape (None,), predicted score for each box
boxes -- tensor of shape (None, 4), predicted box coordinates
classes -- tensor of shape (None,), predicted class for each box
Note: The "None" dimension of the output tensors has obviously to be less than max_boxes.
"""
max_boxes_tensor = K.variable(max_boxes, dtype='int32') # tensor to be used in tf.image.non_max_suppression()
K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor
# Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep
### START CODE HERE ### (≈ 1 line)
nms_indices = tf.image.non_max_suppression(boxes=boxes, scores=scores, max_output_size=max_boxes_tensor, iou_threshold=iou_threshold)
### END CODE HERE ###
# Use K.gather() to select only nms_indices from scores, boxes and classes
### START CODE HERE ### (≈ 3 lines)
scores = K.gather(reference=scores, indices=nms_indices)
boxes = K.gather(reference=boxes, indices=nms_indices)
classes = K.gather(reference=classes, indices=nms_indices)
### END CODE HERE ###
return scores, boxes, classes
Test case:
with tf.Session() as test_b:
scores = tf.random_normal([54,], mean=1, stddev=4, seed = 1)
boxes = tf.random_normal([54, 4], mean=1, stddev=4, seed = 1)
classes = tf.random_normal([54,], mean=1, stddev=4, seed = 1)
scores, boxes, classes = yolo_non_max_suppression(scores, boxes, classes)
print("scores[2] = " + str(scores[2].eval()))
print("boxes[2] = " + str(boxes[2].eval()))
print("classes[2] = " + str(classes[2].eval()))
print("scores.shape = " + str(scores.eval().shape))
print("boxes.shape = " + str(boxes.eval().shape))
print("classes.shape = " + str(classes.eval().shape))
Output:
scores[2] = 6.9384
boxes[2] = [-5.299932 3.13798141 4.45036697 0.95942086]
classes[2] = -2.24527
scores.shape = (10,)
boxes.shape = (10, 4)
classes.shape = (10,)
2.4 - Wrapping up the filtering
It is time to implement a function that takes the output of the deep CNN (the \(19\times 19\times 5\times 85\) dimensional encoding) and filters all the boxes using the functions you just implemented.
Programming exercise: Implement yolo_eval(), which takes the output of the YOLO encoding and filters the boxes using a score threshold and NMS (non-max suppression). There are a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between several such formats at different times, using the following functions (provided):
boxes = yolo_boxes_to_corners(box_xy, box_wh)
which converts the YOLO box coordinates \((x, y, w, h)\) to corner coordinates \((x_1, y_1, x_2, y_2)\), to fit the input of yolo_filter_boxes.
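The provided converter is plausibly equivalent to the following midpoint-to-corner arithmetic (a sketch; see ./yad2k/models/keras_yolo.py for the exact version, which also reorders the output for the drawing utilities):
from keras import backend as K
def boxes_to_corners_sketch(box_xy, box_wh):
    """Convert (x, y) midpoints and (w, h) sizes to corner coordinates."""
    box_mins = box_xy - box_wh / 2.0   # upper-left corners
    box_maxes = box_xy + box_wh / 2.0  # lower-right corners
    return K.concatenate([box_mins, box_maxes], axis=-1)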
boxes = scale_boxes(boxes, image_shape)
YOLO's network was trained to run on \(608\times 608\) images. If you are testing on images of a different size, for example the car detection dataset with \(720\times 1280\) images, this step rescales the boxes so that they can be plotted on top of the original \(720\times 1280\) image.
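Assuming the boxes coming out of the model are expressed as fractions of the image, scale_boxes plausibly amounts to multiplying each corner coordinate by the target image dimensions (a sketch; the provided yolo_utils version is authoritative):
import numpy as np
def scale_boxes_sketch(boxes, image_shape):
    """Rescale fractional corner boxes to pixel coordinates of the original image."""
    height, width = image_shape
    image_dims = np.array([height, width, height, width], dtype="float32")  # (y1, x1, y2, x2) order assumed
    return boxes * image_dims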
def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
"""
Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.
Arguments:
yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
box_confidence: tensor of shape (None, 19, 19, 5, 1)
box_xy: tensor of shape (None, 19, 19, 5, 2)
box_wh: tensor of shape (None, 19, 19, 5, 2)
box_class_probs: tensor of shape (None, 19, 19, 5, 80)
image_shape -- tensor of shape (2,) containing the input shape; this notebook uses (720., 1280.) (has to be float32 dtype)
max_boxes -- integer, maximum number of predicted boxes you'd like
score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
iou_threshold -- real value, "intersection over union" threshold used for NMS filtering
Returns:
scores -- tensor of shape (None, ), predicted score for each box
boxes -- tensor of shape (None, 4), predicted box coordinates
classes -- tensor of shape (None,), predicted class for each box
"""
### START CODE HERE ###
# Retrieve outputs of the YOLO model (≈1 line)
box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs
# Convert boxes to be ready for filtering functions (convert boxes box_xy and box_wh to corner coordinates)
boxes = yolo_boxes_to_corners(box_xy, box_wh)
# Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold (≈1 line)
scores, boxes, classes = yolo_filter_boxes(box_confidence=box_confidence, boxes=boxes, box_class_probs=box_class_probs, threshold=score_threshold)
# Scale boxes back to original image shape.
boxes = scale_boxes(boxes, image_shape)
# Use one of the functions you've implemented to perform Non-max suppression with
# maximum number of boxes set to max_boxes and a threshold of iou_threshold (≈1 line)
scores, boxes, classes = yolo_non_max_suppression(scores=scores, boxes=boxes, classes=classes, max_boxes=max_boxes, iou_threshold=iou_threshold)
### END CODE HERE ###
return scores, boxes, classes
Test case:
with tf.Session() as test_b:
yolo_outputs = (tf.random_normal([19, 19, 5, 1], mean=1, stddev=4, seed = 1),
tf.random_normal([19, 19, 5, 2], mean=1, stddev=4, seed = 1),
tf.random_normal([19, 19, 5, 2], mean=1, stddev=4, seed = 1),
tf.random_normal([19, 19, 5, 80], mean=1, stddev=4, seed = 1))
scores, boxes, classes = yolo_eval(yolo_outputs)
print("scores[2] = " + str(scores[2].eval()))
print("boxes[2] = " + str(boxes[2].eval()))
print("classes[2] = " + str(classes[2].eval()))
print("scores.shape = " + str(scores.eval().shape))
print("boxes.shape = " + str(boxes.eval().shape))
print("classes.shape = " + str(classes.eval().shape))
Output:
scores[2] = 138.791
boxes[2] = [ 1292.32971191 -278.52166748 3876.98925781 -835.56494141]
classes[2] = 54
scores.shape = (10,)
boxes.shape = (10, 4)
classes.shape = (10,)
Summary for YOLO
- The input image is of shape \((608, 608, 3)\).
- The input image goes through a CNN, resulting in an output of shape \((19, 19, 5, 85)\).
- After flattening the last two dimensions, the output becomes a volume of shape \((19, 19, 425)\):
- Each cell of the \(19\times 19\) grid over the input image gives 425 numbers.
- \(425 = 5\times 85\) because each cell contains predictions for 5 boxes, corresponding to the 5 anchor boxes.
- \(85 = 5 + 80\), where 5 is because \((p_c, b_x, b_y, b_h, b_w)\) has 5 numbers, and 80 is the number of classes to detect.
- You then select only a few boxes based on:
- Score-thresholding: throw away boxes whose detected class has a score below the threshold.
- Non-max suppression: compute the Intersection over Union (IoU) to avoid selecting overlapping boxes.
- This gives you YOLO's final output.
3 - Test YOLO pre-trained model on images
In this part, you will use a pre-trained model and test it on the car detection dataset. First, you need a session to execute the computation graph and evaluate the tensors.
sess = K.get_session()
3.1 - Defining classes, anchors and image shape
- We are detecting 80 classes and using 5 anchor boxes.
- The information on the 80 classes and the 5 anchor boxes is gathered in the files coco_classes.txt and yolo_anchors.txt.
- The cell below reads the class names and anchors from these text files.
- The car detection dataset has \(720\times 1280\) images, which are pre-processed into \(608\times 608\) images.
class_names = read_classes("./model_data/coco_classes.txt")
anchors = read_anchors("./model_data/yolo_anchors.txt")
image_shape = (720., 1280.)
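For reference, read_anchors plausibly parses a single line of comma-separated width,height pairs (a sketch under that assumption about the file format):
import numpy as np
def read_anchors_sketch(anchors_path):
    """Assumed file format: one line 'w1, h1, w2, h2, ...'."""
    with open(anchors_path) as f:
        values = [float(x) for x in f.readline().split(",")]
    return np.array(values).reshape(-1, 2)  # one (width, height) row per anchor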
3.2 - Loading a pre-trained model
- Training a YOLO model takes a very long time and requires a fairly large dataset of labeled bounding boxes for a large range of target classes.
- You are going to load an existing pre-trained Keras YOLO model stored in yolo.h5.
- These weights come from the official YOLO website, and were converted using a function written by Allan Zelener. Technically, these are the parameters of the "YOLOv2" model.
Load the model from the file:
yolo_model = load_model("./model_data/yolo.h5")
Print a summary of every layer the model contains.
yolo_model.summary()
____________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
====================================================================================================
input_1 (InputLayer) (None, 608, 608, 3) 0
____________________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 608, 608, 32) 864 input_1[0][0]
____________________________________________________________________________________________________
batch_normalization_1 (BatchNorm (None, 608, 608, 32) 128 conv2d_1[0][0]
____________________________________________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 608, 608, 32) 0 batch_normalization_1[0][0]
____________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 304, 304, 32) 0 leaky_re_lu_1[0][0]
____________________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 304, 304, 64) 18432 max_pooling2d_1[0][0]
____________________________________________________________________________________________________
batch_normalization_2 (BatchNorm (None, 304, 304, 64) 256 conv2d_2[0][0]
____________________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 304, 304, 64) 0 batch_normalization_2[0][0]
____________________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 152, 152, 64) 0 leaky_re_lu_2[0][0]
____________________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 152, 152, 128) 73728 max_pooling2d_2[0][0]
____________________________________________________________________________________________________
batch_normalization_3 (BatchNorm (None, 152, 152, 128) 512 conv2d_3[0][0]
____________________________________________________________________________________________________
leaky_re_lu_3 (LeakyReLU) (None, 152, 152, 128) 0 batch_normalization_3[0][0]
____________________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 152, 152, 64) 8192 leaky_re_lu_3[0][0]
____________________________________________________________________________________________________
batch_normalization_4 (BatchNorm (None, 152, 152, 64) 256 conv2d_4[0][0]
____________________________________________________________________________________________________
leaky_re_lu_4 (LeakyReLU) (None, 152, 152, 64) 0 batch_normalization_4[0][0]
____________________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 152, 152, 128) 73728 leaky_re_lu_4[0][0]
____________________________________________________________________________________________________
batch_normalization_5 (BatchNorm (None, 152, 152, 128) 512 conv2d_5[0][0]
____________________________________________________________________________________________________
leaky_re_lu_5 (LeakyReLU) (None, 152, 152, 128) 0 batch_normalization_5[0][0]
____________________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 76, 76, 128) 0 leaky_re_lu_5[0][0]
____________________________________________________________________________________________________
conv2d_6 (Conv2D) (None, 76, 76, 256) 294912 max_pooling2d_3[0][0]
____________________________________________________________________________________________________
batch_normalization_6 (BatchNorm (None, 76, 76, 256) 1024 conv2d_6[0][0]
____________________________________________________________________________________________________
leaky_re_lu_6 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_6[0][0]
____________________________________________________________________________________________________
conv2d_7 (Conv2D) (None, 76, 76, 128) 32768 leaky_re_lu_6[0][0]
____________________________________________________________________________________________________
batch_normalization_7 (BatchNorm (None, 76, 76, 128) 512 conv2d_7[0][0]
____________________________________________________________________________________________________
leaky_re_lu_7 (LeakyReLU) (None, 76, 76, 128) 0 batch_normalization_7[0][0]
____________________________________________________________________________________________________
conv2d_8 (Conv2D) (None, 76, 76, 256) 294912 leaky_re_lu_7[0][0]
____________________________________________________________________________________________________
batch_normalization_8 (BatchNorm (None, 76, 76, 256) 1024 conv2d_8[0][0]
____________________________________________________________________________________________________
leaky_re_lu_8 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_8[0][0]
____________________________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 38, 38, 256) 0 leaky_re_lu_8[0][0]
____________________________________________________________________________________________________
conv2d_9 (Conv2D) (None, 38, 38, 512) 1179648 max_pooling2d_4[0][0]
____________________________________________________________________________________________________
batch_normalization_9 (BatchNorm (None, 38, 38, 512) 2048 conv2d_9[0][0]
____________________________________________________________________________________________________
leaky_re_lu_9 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_9[0][0]
____________________________________________________________________________________________________
conv2d_10 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_9[0][0]
____________________________________________________________________________________________________
batch_normalization_10 (BatchNor (None, 38, 38, 256) 1024 conv2d_10[0][0]
____________________________________________________________________________________________________
leaky_re_lu_10 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_10[0][0]
____________________________________________________________________________________________________
conv2d_11 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_10[0][0]
____________________________________________________________________________________________________
batch_normalization_11 (BatchNor (None, 38, 38, 512) 2048 conv2d_11[0][0]
____________________________________________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_11[0][0]
____________________________________________________________________________________________________
conv2d_12 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_11[0][0]
____________________________________________________________________________________________________
batch_normalization_12 (BatchNor (None, 38, 38, 256) 1024 conv2d_12[0][0]
____________________________________________________________________________________________________
leaky_re_lu_12 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_12[0][0]
____________________________________________________________________________________________________
conv2d_13 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_12[0][0]
____________________________________________________________________________________________________
batch_normalization_13 (BatchNor (None, 38, 38, 512) 2048 conv2d_13[0][0]
____________________________________________________________________________________________________
leaky_re_lu_13 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_13[0][0]
____________________________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D) (None, 19, 19, 512) 0 leaky_re_lu_13[0][0]
____________________________________________________________________________________________________
conv2d_14 (Conv2D) (None, 19, 19, 1024) 4718592 max_pooling2d_5[0][0]
____________________________________________________________________________________________________
batch_normalization_14 (BatchNor (None, 19, 19, 1024) 4096 conv2d_14[0][0]
____________________________________________________________________________________________________
leaky_re_lu_14 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_14[0][0]
____________________________________________________________________________________________________
conv2d_15 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_14[0][0]
____________________________________________________________________________________________________
batch_normalization_15 (BatchNor (None, 19, 19, 512) 2048 conv2d_15[0][0]
____________________________________________________________________________________________________
leaky_re_lu_15 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_15[0][0]
____________________________________________________________________________________________________
conv2d_16 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_15[0][0]
____________________________________________________________________________________________________
batch_normalization_16 (BatchNor (None, 19, 19, 1024) 4096 conv2d_16[0][0]
____________________________________________________________________________________________________
leaky_re_lu_16 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_16[0][0]
____________________________________________________________________________________________________
conv2d_17 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_16[0][0]
____________________________________________________________________________________________________
batch_normalization_17 (BatchNor (None, 19, 19, 512) 2048 conv2d_17[0][0]
____________________________________________________________________________________________________
leaky_re_lu_17 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_17[0][0]
____________________________________________________________________________________________________
conv2d_18 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_17[0][0]
____________________________________________________________________________________________________
batch_normalization_18 (BatchNor (None, 19, 19, 1024) 4096 conv2d_18[0][0]
____________________________________________________________________________________________________
leaky_re_lu_18 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_18[0][0]
____________________________________________________________________________________________________
conv2d_19 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_18[0][0]
____________________________________________________________________________________________________
batch_normalization_19 (BatchNor (None, 19, 19, 1024) 4096 conv2d_19[0][0]
____________________________________________________________________________________________________
conv2d_21 (Conv2D) (None, 38, 38, 64) 32768 leaky_re_lu_13[0][0]
____________________________________________________________________________________________________
leaky_re_lu_19 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_19[0][0]
____________________________________________________________________________________________________
batch_normalization_21 (BatchNor (None, 38, 38, 64) 256 conv2d_21[0][0]
____________________________________________________________________________________________________
conv2d_20 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_19[0][0]
____________________________________________________________________________________________________
leaky_re_lu_21 (LeakyReLU) (None, 38, 38, 64) 0 batch_normalization_21[0][0]
____________________________________________________________________________________________________
batch_normalization_20 (BatchNor (None, 19, 19, 1024) 4096 conv2d_20[0][0]
____________________________________________________________________________________________________
space_to_depth_x2 (Lambda) (None, 19, 19, 256) 0 leaky_re_lu_21[0][0]
____________________________________________________________________________________________________
leaky_re_lu_20 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_20[0][0]
____________________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 19, 19, 1280) 0 space_to_depth_x2[0][0]
leaky_re_lu_20[0][0]
____________________________________________________________________________________________________
conv2d_22 (Conv2D) (None, 19, 19, 1024) 11796480 concatenate_1[0][0]
____________________________________________________________________________________________________
batch_normalization_22 (BatchNor (None, 19, 19, 1024) 4096 conv2d_22[0][0]
____________________________________________________________________________________________________
leaky_re_lu_22 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_22[0][0]
____________________________________________________________________________________________________
conv2d_23 (Conv2D) (None, 19, 19, 425) 435625 leaky_re_lu_22[0][0]
====================================================================================================
Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672
____________________________________________________________________________________________________
This model converts a preprocessed batch of input images of shape \((m, 608, 608, 3)\) into a tensor of shape \((m, 19, 19, 5, 85)\).
3.3 - Convert output of the model to usable bounding box tensors
The output of yolo_model is a \((m, 19, 19, 5, 85)\) tensor that needs to pass through non-trivial processing and conversion.
yolo_head is implemented in the file ./yad2k/models/keras_yolo.py.
yolo_outputs = yolo_head(yolo_model.output, anchors, len(class_names))
This adds yolo_outputs to your graph. This set of 4 tensors is ready to be used as input to the yolo_eval function.
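Conceptually, yolo_head decodes the raw CNN output into those four tensors; here is a simplified sketch of the idea (the real version in ./yad2k/models/keras_yolo.py also adds per-cell grid offsets so that the coordinates are relative to the whole image):
from keras import backend as K
def yolo_head_sketch(feats, anchors, num_classes):
    """Simplified decoding of raw YOLO features; omits the grid-offset bookkeeping."""
    num_anchors = len(anchors)
    feats = K.reshape(feats, (-1, 19, 19, num_anchors, num_classes + 5))  # (m, 19, 19, 425) -> (m, 19, 19, 5, 85)
    box_xy = K.sigmoid(feats[..., 0:2])          # midpoint, squashed into its cell
    box_wh = K.exp(feats[..., 2:4])              # width/height, later scaled by the anchors
    box_confidence = K.sigmoid(feats[..., 4:5])  # p_c
    box_class_probs = K.softmax(feats[..., 5:])  # (c_1, ..., c_80)
    return box_confidence, box_xy, box_wh, box_class_probs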
3.4 - Filtering boxes
yolo_outputs gives you all the predicted boxes of yolo_model in the correct format. To perform filtering and select only the best boxes, call yolo_eval, which you implemented earlier.
scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)
3.5 - Run the graph on an image
You have now built the graph, which can be summarized as follows:
- yolo_model.input is given to yolo_model, which computes the output yolo_model.output.
- yolo_model.output is processed by yolo_head, which outputs yolo_outputs.
- yolo_outputs goes through the filtering function yolo_eval, which outputs the predictions: scores, boxes, classes.
Programming exercise: Implement predict(), which runs this graph to test YOLO on an image. You will need to run a TensorFlow session and have it compute scores, boxes, classes. The code below also uses the following function:
image, image_data = preprocess_image("./images/" + image_file, model_image_size=(608, 608))
which outputs:
- image: a Python (PIL) representation of the image, used for drawing boxes. You won't need to use it yourself.
- image_data: a NumPy array representing the image, which is fed as input to the CNN.
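preprocess_image plausibly resizes the image to the model's input size, scales pixel values to \([0, 1]\), and adds a batch dimension (a sketch under those assumptions; the provided yolo_utils version is authoritative):
import numpy as np
from PIL import Image
def preprocess_image_sketch(img_path, model_image_size=(608, 608)):
    image = Image.open(img_path)
    resized = image.resize(model_image_size, Image.BICUBIC)
    image_data = np.array(resized, dtype="float32") / 255.0  # normalize to [0, 1]
    image_data = np.expand_dims(image_data, 0)               # (1, 608, 608, 3) batch for the CNN
    return image, image_data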
Hint: When a model uses BatchNorm (as YOLO does), you need to pass an extra placeholder in the feed_dict: {K.learning_phase(): 0}.
- K.get_session() was used earlier to obtain a TensorFlow Session object, stored in the sess variable.
- To evaluate a list of tensors, call sess.run() like this:
sess.run(fetches=[tensor1,tensor2,tensor3],
feed_dict={yolo_model.input: the_input_variable,
K.learning_phase():0
}
)
- The variables scores, boxes, classes are not passed into the predict function, because these are global variables and can be used directly inside the function.
def predict(sess, image_file):
"""
Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the predictions.
Arguments:
sess -- your tensorflow/Keras session containing the YOLO graph
image_file -- name of an image stored in the "images" folder.
Returns:
out_scores -- tensor of shape (None, ), scores of the predicted boxes
out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
out_classes -- tensor of shape (None, ), class index of the predicted boxes
Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes.
"""
# Preprocess your image
image, image_data = preprocess_image("images/" + image_file, model_image_size = (608, 608))
# Run the session with the correct tensors and choose the correct placeholders in the feed_dict.
# You'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})
### START CODE HERE ### (≈ 1 line)
out_scores, out_boxes, out_classes = sess.run([scores, boxes, classes], feed_dict={yolo_model.input:image_data, K.learning_phase():0})
### END CODE HERE ###
# Print predictions info
print('Found {} boxes for {}'.format(len(out_boxes), image_file))
# Generate colors for drawing bounding boxes.
colors = generate_colors(class_names)
# Draw bounding boxes on the image file
draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
# Save the predicted bounding box on the image
image.save(os.path.join("out", image_file), quality=90)
# Display the results in the notebook
output_image = scipy.misc.imread(os.path.join("out", image_file))
imshow(output_image)
return out_scores, out_boxes, out_classes
Run the predict function on the image test.jpg.
out_scores, out_boxes, out_classes = predict(sess, "test.jpg")
Output:
Found 7 boxes for test.jpg
car 0.60 (925, 285) (1045, 374)
car 0.66 (706, 279) (786, 350)
bus 0.67 (5, 266) (220, 407)
car 0.70 (947, 324) (1280, 705)
car 0.74 (159, 303) (346, 440)
car 0.80 (761, 282) (942, 412)
car 0.89 (367, 300) (745, 648)
The model you have just run can actually detect 80 different classes, all of which are listed in coco_classes.txt.
Predictions of the YOLO model on pictures taken from a car-mounted camera while driving around Silicon Valley.
Dataset provided by drive.ai.
Summary
- YOLO is a state-of-the-art object detection model that is fast and accurate.
- It runs an input image through a CNN, which outputs a \(19\times 19\times 5\times 85\) dimensional volume.
- The encoding can be seen as a grid where each of the \(19\times 19\) cells contains information about 5 anchor boxes.
- You filter through all the boxes using non-max suppression. Specifically:
- Score-thresholding on the probability of detecting a class keeps only accurate (high-probability) boxes.
- An Intersection over Union (IoU) threshold eliminates overlapping boxes.
- This gives YOLO's final output.
- Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as a lot of computation, this article used previously trained model parameters. You could also try fine-tuning the YOLO model with your own dataset.
References
The ideas presented in this article came primarily from the two YOLO papers. The implementation here also took significant inspiration and many components from Allan Zelener's GitHub repository. The pre-trained weights used in this exercise came from the official YOLO website. This article is a personal translation of the corresponding Coursera course assignment.
- Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi - You Only Look Once: Unified, Real-Time Object Detection (2015)
- Joseph Redmon, Ali Farhadi - YOLO9000: Better, Faster, Stronger (2016)
- Allan Zelener - YAD2K: Yet Another Darknet 2 Keras
- YOLO官方網站: https://pjreddie.com/darknet/yolo/