Machine Learning for Programmers (9): Object Detection with RCNN and Fast-RCNN

Reposted article. 2020-11-27 16:25. Category: Introduction to Machine Learning

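Business at the restaurant has picked up over the last few months, and studying Faster-RCNN ate up a lot of time, so the blog went without updates for a while 🐶. Starting with this post I will introduce object detection models and how to implement them. This post covers the simplest ones, RCNN and Fast-RCNN; the next post covers Faster-RCNN, and the one after that covers YOLO.

Image classification and object detection

In the previous posts we saw how to use a CNN model to recognize what kind of object is in an image, or to recognize a fixed amount of text in an image (a captcha). Because the model takes the whole image as input and produces an output of fixed size, the image can only contain one main object or a fixed number of characters.

If an image contains several objects and we want to know which objects are there and where each of them is, a classification CNN alone cannot do the job. We need a model that finds which regions of the image contain objects and decides what each region contains. Such a model is called an object detection model. One of the earliest is RCNN, which was followed by Fast-RCNN (it shares its core idea with SPPnet), Faster-RCNN and YOLO. Object detection has to process a lot of data, so it tends to be slow (RCNN may need tens of seconds to detect the objects in a single image), yet it usually has to run in real time (for example on video coming from a camera). Improving detection speed is therefore a central problem, and the later Faster-RCNN and YOLO models are far faster, reaching real-time or near real-time speeds on a GPU.

Object detection has a wide range of applications: face detection, license plate recognition, autonomous driving and so on all rely on it. It is also an active frontier of machine learning; the Mask-RCNN model published in 2017 can even detect the outline of each object.

The more magical something looks, the harder it is to implement. Object detection models are considerably harder than the models introduced so far, so brace yourself 😱.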

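Training data required by object detection models

Before introducing concrete models, let us look at what kind of training data an object detection model needs:

For every image, the training data must mark which regions contain objects and which label belongs to each region, so the annotation of one image is a list. Regions usually come in one of two formats: (x, y, w, h), the top-left corner plus width and height, or (x1, y1, x2, y2), the top-left and bottom-right corners. The two formats can be converted into each other; you only need to keep track of which one you are handling. Besides the classes to be recognized, the labels also need a special non-object (background) label, meaning that a region contains no recognizable object. Non-object regions can usually be generated automatically, so the training data does not need to include them.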
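As a small illustration of the two region formats, the sketch below converts between them. The helper names `xywh_to_xyxy` and `xyxy_to_xywh` are made up for this example and do not come from any library.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, w, h) to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """Convert (x1, y1, x2, y2) to (x, y, w, h)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


corner_box = (192, 199, 230, 235)      # top-left and bottom-right corners
size_box = xyxy_to_xywh(corner_box)    # top-left corner plus width and height
print(size_box)                        # (192, 199, 38, 36)
print(xywh_to_xyxy(size_box))          # (192, 199, 230, 235)
```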

RCNN

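RCNN (Region Based Convolutional Neural Network) is one of the earliest object detection models. It is fairly simple to implement and consists of the following steps:

Use some algorithm to select 2000 regions of the image that may contain objects
Crop these 2000 regions into 2000 sub-images and scale them to a fixed size
Classify each of the 2000 sub-images with an ordinary CNN model
Discard the regions classified as "non-object"
Output the remaining regions as the result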
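The steps above can be sketched as a single loop. This is only an outline under stated assumptions: `propose_regions`, `crop` and `classify` are hypothetical callables supplied by the caller (in a real implementation they would be the region selection algorithm, the crop-and-resize step, and the CNN), and the toy "image" is a nested list rather than real pixel data.

```python
def rcnn_detect(image, propose_regions, crop, classify, background_label=0, max_regions=2000):
    """Run the RCNN steps: propose regions, crop them, classify, drop background."""
    results = []
    for box in propose_regions(image)[:max_regions]:
        sub_image = crop(image, box)          # a real model would also resize here
        label = classify(sub_image)           # one classifier call per region
        if label != background_label:         # discard "non-object" regions
            results.append((box, label))
    return results


def crop(image, box):
    """Crop a nested-list image; box is (x, y, w, h)."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


# Toy data: a 4x4 "image" with a bright 2x2 object in the middle
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]
boxes = [(0, 0, 1, 1), (1, 1, 2, 2)]
found = rcnn_detect(
    image,
    propose_regions=lambda img: boxes,
    crop=crop,
    classify=lambda sub: 1 if sum(map(sum, sub)) > 0 else 0,
)
print(found)  # [((1, 1, 2, 2), 1)]
```

The loop makes the cost visible: the classifier runs once per candidate region, which is where most of the time goes.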

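You may already have noticed from these steps that RCNN has a few big problems 😠:

The precision of the result depends heavily on the algorithm used to select regions
The region selection algorithm is fixed and does not take part in learning; if it fails to select a region that contains an object, no amount of training will make the model find that region
It is slow, painfully slow 🐢: detecting objects in 1 image effectively means classifying 2000 images

The models introduced later address these problems, but first we need to understand the simplest one, RCNN. Let us take a closer look at the important parts of an RCNN implementation.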

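Selecting regions that may contain objects

There are many algorithms for selecting regions that may contain objects, for example the sliding window method and selective search. The sliding window method is very simple: pick a region of fixed size, then slide it by a fixed distance to get the next region. It is easy to implement, but it produces a huge number of regions with low precision, so it is rarely used unless the objects have a fixed size and appear at predictable positions.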
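A minimal sketch of the sliding window method is shown below. The function name `sliding_windows` and the window size and stride are assumptions chosen for illustration; the image size matches the sample image used later in this post.

```python
def sliding_windows(image_w, image_h, win_w, win_h, stride):
    """Yield (x, y, w, h) windows of a fixed size that fit entirely inside the image."""
    for y in range(0, image_h - win_h + 1, stride):
        for x in range(0, image_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)


# One window size only: 64x64 windows moved 32 pixels at a time over a 612x408 image
windows = list(sliding_windows(612, 408, 64, 64, 32))
print(len(windows))   # 198
print(windows[:2])    # [(0, 0, 64, 64), (32, 0, 64, 64)]
```

Covering objects of different sizes and aspect ratios requires repeating this for every window shape, which is why the number of regions grows so quickly.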

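Selective search is more advanced. In short, it first splits the image into many small segments based on similarity (such as color and texture), then repeatedly merges similar neighboring segments, and treats the bounding box of every segment produced along the way as a candidate region. The original post illustrates this with an explanation taken from an OpenCV article.

You can also refer to this article or the original paper for the exact computation.

If you find it hard to follow, feel free to skip it, because we will use the selective search function provided by the opencv library directly. Selective search is not very precise either, and the models introduced later use better methods.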

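# Example of using the selective search function provided by the opencv library
# (cv2.ximgproc is part of the opencv-contrib-python package)
import cv2

img = cv2.imread("path/to/image")
s = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
s.setBaseImage(img)
s.switchToSelectiveSearchFast()
boxes = s.process() # all regions that may contain an object as (x, y, w, h), ordered by likelihood
candidate_boxes = boxes[:2000] # take the first 2000 regions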

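Deciding whether a region contains an object by its overlap ratio (IOU)

Regions selected by the algorithm usually do not coincide exactly with the true regions; they only overlap partially. During training we need to decide in advance, based on the true regions at hand, whether each selected region contains an object, and then tell the model whether its prediction is correct. This decision is based on the overlap ratio (IOU, Intersection Over Union): the area where the two regions overlap divided by the area of their union.

We can treat candidate regions with an overlap ratio above 70% as containing an object and regions with an overlap ratio below 30% as not containing one, while regions between 30% and 70% should not take part in training. This gives the model clear-cut data, which improves learning.

The code for computing the overlap ratio is shown below. If the two regions do not overlap, the ratio is 0: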

def calc_iou(rect1, rect2):
    """Compute overlapping area / union area of two regions (intersection over union)"""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1+w1, x2+w2) - xi
    hi = min(y1+h1, y2+h2) - yi
    if wi > 0 and hi > 0: # the regions overlap
        area_overlap = wi*hi
        area_all = w1*h1 + w2*h2 - area_overlap
        iou = area_overlap / area_all
    else: # the regions do not overlap
        iou = 0
    return iou
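To make the thresholds concrete, here is a small self-contained sketch that repeats the IOU function and uses it to label candidate regions. The helper `assign_label` and its return convention (1 for positive, 0 for negative, None for "ignore") are assumptions made for this example.

```python
def calc_iou(rect1, rect2):
    """Intersection over union of two (x, y, w, h) regions."""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1 + w1, x2 + w2) - xi
    hi = min(y1 + h1, y2 + h2) - yi
    if wi > 0 and hi > 0:
        area_overlap = wi * hi
        return area_overlap / (w1 * h1 + w2 * h2 - area_overlap)
    return 0


def assign_label(candidate_box, true_boxes, positive_iou=0.70, negative_iou=0.30):
    """Return 1 (object), 0 (background) or None (ambiguous, skip during training)."""
    iou_list = [calc_iou(candidate_box, true_box) for true_box in true_boxes]
    if any(iou > positive_iou for iou in iou_list):
        return 1
    if all(iou < negative_iou for iou in iou_list):
        return 0
    return None


# Worked example: overlap is 5*5 = 25, union is 100 + 100 - 25 = 175
print(calc_iou((0, 0, 10, 10), (5, 5, 10, 10)))   # 25 / 175, about 0.143

true_boxes = [(10, 10, 20, 20)]
print(assign_label((12, 10, 20, 20), true_boxes))     # 1    (IOU about 0.82)
print(assign_label((20, 10, 20, 20), true_boxes))     # None (IOU about 0.33)
print(assign_label((100, 100, 20, 20), true_boxes))   # 0    (no overlap)
```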

Original paper

If you would like to read the original RCNN paper, it is available at the following address:

https://arxiv.org/pdf/1311.2524.pdf

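Using RCNN to detect faces in images

By now you should have a rough idea of how RCNN works, so let us try using RCNN to learn to detect something in images.

Collecting and annotating images is exhausting 🤕, so to save effort I am again using a ready-made dataset. The dataset below contains images of people together with the region of every face in them, in the format (x1, y1, x2, y2). Downloading requires registering an account, but it is free 🤒.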

https://www.kaggle.com/vin1234/count-the-number-of-faces-present-in-an-image

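After downloading and extracting it, the images are under train/image_data and the annotations are in bbox_train.csv.

For example, the image 10001.jpg has the following annotations in the csv: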

Name,width,height,xmin,ymin,xmax,ymax
10001.jpg,612,408,192,199,230,235
10001.jpg,612,408,247,168,291,211
10001.jpg,612,408,321,176,366,222
10001.jpg,612,408,355,183,387,214

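The columns have the following meaning:

Name: file name
width: width of the whole image
height: height of the whole image
xmin: x coordinate of the top-left corner of the face region
ymin: y coordinate of the top-left corner of the face region
xmax: x coordinate of the bottom-right corner of the face region
ymax: y coordinate of the bottom-right corner of the face region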
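The sketch below shows one way to turn such annotations into a mapping from file name to a list of (x, y, w, h) regions. It is a stand-alone illustration that uses the standard `csv` module and an in-memory copy of two sample rows; the function name `load_box_map` is an assumption made for this example.

```python
import csv
import io
from collections import defaultdict

# Two of the sample rows shown above, kept in memory so the example is self-contained
CSV_TEXT = """Name,width,height,xmin,ymin,xmax,ymax
10001.jpg,612,408,192,199,230,235
10001.jpg,612,408,247,168,291,211
"""


def load_box_map(file_obj):
    """Build { file name: [ (x, y, w, h), ... ] } from the annotation csv."""
    box_map = defaultdict(list)
    for row in csv.DictReader(file_obj):
        x1, y1 = int(row["xmin"]), int(row["ymin"])
        x2, y2 = int(row["xmax"]), int(row["ymax"])
        # Convert corner format (x1, y1, x2, y2) to (x, y, w, h)
        box_map[row["Name"]].append((x1, y1, x2 - x1, y2 - y1))
    return box_map


box_map = load_box_map(io.StringIO(CSV_TEXT))
print(box_map["10001.jpg"])  # [(192, 199, 38, 36), (247, 168, 44, 43)]
```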

The code that uses RCNN to learn and detect the face regions in these images is as follows:

import os
import sys
import torch
import gzip
import itertools
import random
import numpy
import pandas
import torchvision
import cv2
from torch import nn
from matplotlib import pyplot
from collections import defaultdict

# Image size that each region is scaled to
REGION_IMAGE_SIZE = (32, 32)
# Folder containing the images to analyze
IMAGE_DIR = "./784145_1347673_bundle_archive/train/image_data"
# CSV file that defines the face regions of each image
BOX_CSV_PATH = "./784145_1347673_bundle_archive/train/bbox_train.csv"

# Used to enable GPU support
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class MyModel(nn.Module):
    """Decide whether a sub-image is a face (ResNet-18)"""
    def __init__(self):
        super().__init__()
        # ResNet implementation
        # Outputs two classes [non-face, face]
        self.resnet = torchvision.models.resnet18(num_classes=2)

    def forward(self, x):
        # Apply ResNet
        y = self.resnet(x)
        return y

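def save_tensor(tensor, path):
    """Save a tensor object to a gzip-compressed file"""
    # The with block makes sure the file is flushed and closed
    with gzip.GzipFile(path, "wb") as f:
        torch.save(tensor, f)

def load_tensor(path):
    """Load a tensor object from a gzip-compressed file"""
    with gzip.GzipFile(path, "rb") as f:
        return torch.load(f)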

def image_to_tensor(img):
    """Convert an opencv image object to a tensor object"""
    # Note that opencv uses BGR, but this does not affect training so there is no need to convert to RGB
    img = cv2.resize(img, dsize=REGION_IMAGE_SIZE)
    arr = numpy.asarray(img)
    t = torch.from_numpy(arr)
    t = t.transpose(0, 2) # change dimensions from H,W,C to C,W,H
    t = t / 255.0 # normalize values to the range 0 ~ 1
    return t

def calc_iou(rect1, rect2):
    """Compute overlapping area / union area of two regions (intersection over union)"""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1+w1, x2+w2) - xi
    hi = min(y1+h1, y2+h2) - yi
    if wi > 0 and hi > 0: # the regions overlap
        area_overlap = wi*hi
        area_all = w1*h1 + w2*h2 - area_overlap
        iou = area_overlap / area_all
    else: # the regions do not overlap
        iou = 0
    return iou

def selective_search(img):
    """Compute regions of an opencv image that may contain objects; only the first 2000 regions are returned"""
    # For the algorithm see https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
    s = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    s.setBaseImage(img)
    s.switchToSelectiveSearchFast()
    boxes = s.process()
    return boxes[:2000]

def prepare_save_batch(batch, image_tensors, image_labels):
    """Prepare for training - save the data of a single batch"""
    # Build the input and output tensor objects
    tensor_in = torch.stack(image_tensors) # shape: B,C,W,H
    tensor_out = torch.tensor(image_labels, dtype=torch.long) # shape: B

    # Split into training set (80%), validation set (10%) and test set (10%)
    random_indices = torch.randperm(tensor_in.shape[0])
    training_indices = random_indices[:int(len(random_indices)*0.8)]
    validating_indices = random_indices[int(len(random_indices)*0.8):int(len(random_indices)*0.9)]
    testing_indices = random_indices[int(len(random_indices)*0.9):]
    training_set = (tensor_in[training_indices], tensor_out[training_indices])
    validating_set = (tensor_in[validating_indices], tensor_out[validating_indices])
    testing_set = (tensor_in[testing_indices], tensor_out[testing_indices])

    # Save to disk
    save_tensor(training_set, f"data/training_set.{batch}.pt")
    save_tensor(validating_set, f"data/validating_set.{batch}.pt")
    save_tensor(testing_set, f"data/testing_set.{batch}.pt")
    print(f"batch {batch} saved")

def prepare():
    """Prepare for training"""
    # The dataset is converted to tensors and saved under the data folder
    if not os.path.isdir("data"):
        os.makedirs("data")

    # Load the csv file and build an index from image to region list { image name: [ region, region, .. ] }
    box_map = defaultdict(list)
    df = pandas.read_csv(BOX_CSV_PATH)
    for row in df.values:
        filename, width, height, x1, y1, x2, y2 = row[:7]
        box_map[filename].append((x1, y1, x2-x1, y2-y1))

    # Extract face (positive) and non-face (negative) sub-images from the images
    batch_size = 1000
    batch = 0
    image_tensors = []
    image_labels = []
    for filename, true_boxes in box_map.items():
        path = os.path.join(IMAGE_DIR, filename)
        img = cv2.imread(path) # load the original image
        candidate_boxes = selective_search(img) # find candidate regions
        positive_samples = 0
        negative_samples = 0
        for candidate_box in candidate_boxes:
            # A candidate whose overlap ratio with any true region is above 70% is a positive sample
            # A candidate whose overlap ratio with every true region is below 30% is a negative sample
            # Each image adds at most (number of positive samples + 10) negative samples;
            # enough negative samples are needed to avoid false positives
            iou_list = [ calc_iou(candidate_box, true_box) for true_box in true_boxes ]
            positive_index = next((index for index, iou in enumerate(iou_list) if iou > 0.70), None)
            is_negative = all(iou < 0.30 for iou in iou_list)
            result = None
            if positive_index is not None:
                result = True
                positive_samples += 1
            elif is_negative and negative_samples < positive_samples + 10:
                result = False
                negative_samples += 1
            if result is not None:
                x, y, w, h = candidate_box
                child_img = img[y:y+h, x:x+w].copy()
                # Uncomment to check that the computation is correct
                # cv2.imwrite(f"{filename}_{x}_{y}_{w}_{h}_{int(result)}.png", child_img)
                image_tensors.append(image_to_tensor(child_img))
                image_labels.append(int(result))
                if len(image_tensors) >= batch_size:
                    # Save the batch
                    prepare_save_batch(batch, image_tensors, image_labels)
                    image_tensors.clear()
                    image_labels.clear()
                    batch += 1
    # Save the remaining batch
    if len(image_tensors) > 10:
        prepare_save_batch(batch, image_tensors, image_labels)

def train():
    """Start training"""
    # Create the model instance
    model = MyModel().to(device)

    # Create the loss function
    loss_function = torch.nn.CrossEntropyLoss()

    # Create the optimizer
    optimizer = torch.optim.Adam(model.parameters())

    # Record how the accuracy of the training and validation sets changes
    training_accuracy_history = []
    validating_accuracy_history = []

    # Record the highest validation accuracy
    validating_accuracy_highest = -1
    validating_accuracy_highest_epoch = 0

    # Helper that reads batches
    def read_batches(base_path):
        for batch in itertools.count():
            path = f"{base_path}.{batch}.pt"
            if not os.path.isfile(path):
                break
            yield [ t.to(device) for t in load_tensor(path) ]

    # Helper that computes accuracy; positive and negative samples are scored separately and then averaged
    def calc_accuracy(actual, predicted):
        predicted = torch.max(predicted, 1).indices
        acc_positive = ((actual > 0.5) & (predicted > 0.5)).sum().item() / ((actual > 0.5).sum().item() + 0.00001)
        acc_negative = ((actual <= 0.5) & (predicted <= 0.5)).sum().item() / ((actual <= 0.5).sum().item() + 0.00001)
        acc = (acc_positive + acc_negative) / 2
        return acc

    # Helper that splits a batch into input and output
    def split_batch_xy(batch, begin=None, end=None):
        # shape = batch_size, channels, width, height
        batch_x = batch[0][begin:end]
        # shape = batch_size (one class index per sample)
        batch_y = batch[1][begin:end]
        return batch_x, batch_y

    # Start the training loop
    for epoch in range(1, 10000):
        print(f"epoch: {epoch}")

        # Train on the training set and update the parameters
        model.train()
        training_accuracy_list = []
        for batch_index, batch in enumerate(read_batches("data/training_set")):
            # Split into mini-batches, which helps the model generalize
            training_batch_accuracy_list = []
            for index in range(0, batch[0].shape[0], 100):
                # Split input and output
                batch_x, batch_y = split_batch_xy(batch, index, index+100)
                # Compute the predictions
                predicted = model(batch_x)
                # Compute the loss
                loss = loss_function(predicted, batch_y)
                # Compute gradients from the loss via automatic differentiation
                loss.backward()
                # Update the parameters with the optimizer
                optimizer.step()
                # Clear the gradients
                optimizer.zero_grad()
                # Record the accuracy of this mini-batch; torch.no_grad temporarily disables automatic differentiation
                with torch.no_grad():
                    training_batch_accuracy_list.append(calc_accuracy(batch_y, predicted))
            # Print the batch accuracy
            training_batch_accuracy = sum(training_batch_accuracy_list) / len(training_batch_accuracy_list)
            training_accuracy_list.append(training_batch_accuracy)
            print(f"epoch: {epoch}, batch: {batch_index}: batch accuracy: {training_batch_accuracy}")
        training_accuracy = sum(training_accuracy_list) / len(training_accuracy_list)
        training_accuracy_history.append(training_accuracy)
        print(f"training accuracy: {training_accuracy}")

        # Check the validation set; gradients are not needed here, so disable automatic differentiation
        model.eval()
        validating_accuracy_list = []
        with torch.no_grad():
            for batch in read_batches("data/validating_set"):
                batch_x, batch_y = split_batch_xy(batch)
                predicted = model(batch_x)
                validating_accuracy_list.append(calc_accuracy(batch_y, predicted))
        validating_accuracy = sum(validating_accuracy_list) / len(validating_accuracy_list)
        validating_accuracy_history.append(validating_accuracy)
        print(f"validating accuracy: {validating_accuracy}")

        # Record the highest validation accuracy and the model state at that time,
        # and check whether the record has not been beaten for 20 epochs
        if validating_accuracy > validating_accuracy_highest:
            validating_accuracy_highest = validating_accuracy
            validating_accuracy_highest_epoch = epoch
            save_tensor(model.state_dict(), "model.pt")
            print("highest validating accuracy updated")
        elif epoch - validating_accuracy_highest_epoch > 20:
            # The record has not been beaten for 20 epochs, stop training
            print("stop training because highest validating accuracy not updated in 20 epoches")
            break

    # Use the model state that reached the highest accuracy
    print(f"highest validating accuracy: {validating_accuracy_highest}",
        f"from epoch {validating_accuracy_highest_epoch}")
    model.load_state_dict(load_tensor("model.pt"))

    # Check the test set (the model is still in evaluation mode here)
    testing_accuracy_list = []
    with torch.no_grad():
        for batch in read_batches("data/testing_set"):
            batch_x, batch_y = split_batch_xy(batch)
            predicted = model(batch_x)
            testing_accuracy_list.append(calc_accuracy(batch_y, predicted))
    testing_accuracy = sum(testing_accuracy_list) / len(testing_accuracy_list)
    print(f"testing accuracy: {testing_accuracy}")

    # Show how the accuracy of the training and validation sets changed
    pyplot.plot(training_accuracy_history, label="training")
    pyplot.plot(validating_accuracy_history, label="validating")
    pyplot.ylim(0, 1)
    pyplot.legend()
    pyplot.show()

def eval_model():
    """Use the trained model"""
    # Create the model instance, load the trained state, then switch to evaluation mode
    model = MyModel().to(device)
    model.load_state_dict(load_tensor("model.pt"))
    model.eval()

    # Ask for an image path and show all regions that may be faces
    while True:
        try:
            # Select the regions that may contain objects
            image_path = input("Image path: ")
            if not image_path:
                continue
            img = cv2.imread(image_path)
            candidate_boxes = selective_search(img)
            # Build the input
            image_tensors = []
            for candidate_box in candidate_boxes:
                x, y, w, h = candidate_box
                child_img = img[y:y+h, x:x+w].copy()
                image_tensors.append(image_to_tensor(child_img))
            tensor_in = torch.stack(image_tensors).to(device)
            # Predict the output; gradients are not needed for inference
            with torch.no_grad():
                tensor_out = model(tensor_in)
            # Use softmax to compute the probability of being a face
            tensor_out = nn.functional.softmax(tensor_out, dim=1)
            tensor_out = tensor_out[:, 1]
            # Treat regions with a probability above 99% as faces, draw their borders on the image and save it
            img_output = img.copy()
            indices = torch.where(tensor_out > 0.99)[0]
            result_boxes = []
            result_boxes_all = []
            for index in indices:
                box = candidate_boxes[index]
                for exists_box in result_boxes_all:
                    # Skip the region if it overlaps an already found region by more than 30%
                    if calc_iou(exists_box, box) > 0.30:
                        break
                else:
                    result_boxes.append(box)
                result_boxes_all.append(box)
            for box in result_boxes:
                x, y, w, h = box
                print(x, y, w, h)
                cv2.rectangle(img_output, (x, y), (x+w, y+h), (0, 0, 0xff), 1)
            cv2.imwrite("img_output.png", img_output)
            print("saved to img_output.png")
            print()
        except Exception as e:
            print("error:", e)

def main():
    """Entry point"""
    if len(sys.argv) < 2:
        print(f"Please run: {sys.argv[0]} prepare|train|eval")
        sys.exit(1)

    # Seed the random number generators so that every run produces the same random numbers
    # This makes the process reproducible; you may choose not to do it
    random.seed(0)
    torch.random.manual_seed(0)

    # Choose the operation based on the command line argument
    operation = sys.argv[1]
    if operation == "prepare":
        prepare()
    elif operation == "train":
        train()
    elif operation == "eval":
        eval_model()
    else:
        raise ValueError(f"Unsupported operation: {operation}")

if __name__ == "__main__":
    main()

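Like the code examples in the previous posts, this code is divided into three parts: prepare, train and eval. The prepare part selects regions and extracts sub-images for positive samples (regions containing a face) and negative samples (regions not containing a face); train uses an ordinary resnet model to learn from the sub-images; eval selects regions in a given image and decides for each region whether it contains a face.

Apart from selecting regions and extracting sub-images, it is basically the same as the CNN models introduced before 🥳.

After running the following commands: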

python3 example.py prepare
python3 example.py train

the final output is as follows:

epoch: 101, batch: 106: batch accuracy: 0.9999996838862198
epoch: 101, batch: 107: batch accuracy: 0.999218446914751
epoch: 101, batch: 108: batch accuracy: 0.9999996211125055
training accuracy: 0.999441394076678
validating accuracy: 0.9687856357743619
stop training because highest validating accuracy not updated in 20 epoches
highest validating accuracy: 0.9766918253771755 from epoch 80
testing accuracy: 0.9729761086851993

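[Figure: accuracy of the training set and the validation set over the epochs]

The accuracy looks high, but it only measures how well the already selected regions are classified. Because the region selection algorithm is mediocre and the number of samples is small, the final result is not really satisfying 😕.

Run the following command and enter an image path to detect faces with the trained model: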

python3 example.py eval

[Figure: some of the detection results]

The precision is mediocre 😕.

Fast-RCNN

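RCNN is slow mainly because classifying several thousand sub-images takes an enormous amount of computation, and many of these sub-images overlap, which leads to a lot of repeated work. Fast-RCNN focuses on improving this part. It first generates feature data for the whole image, with the same width and height as the image (or scaled proportionally), then crops the feature data according to the regions that may contain objects, and finally classifies each region from its cropped features.

[Figure: the difference between RCNN and Fast-RCNN]

Unfortunately, Fast-RCNN mainly improves speed and does little for accuracy. However, the example below introduces an important additional step, region adjustment, which brings the regions reported by the model closer to the true regions.

The following are some details of how the Fast-RCNN model works.

Resizing the source image

In RCNN the images passed to the CNN model are resized sub-images, whereas in Fast-RCNN we pass the original image to the CNN model, so the original image needs to be resized as well. The resizing pads the image to the target aspect ratio before scaling it.

[Figure: resizing an image with padding]

The code for resizing an image is as follows (opencv version):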

import cv2

IMAGE_SIZE = (128, 88)

def calc_resize_parameters(sw, sh):
    """Compute the parameters for resizing an image"""
    sw_new, sh_new = sw, sh
    dw, dh = IMAGE_SIZE
    pad_w, pad_h = 0, 0
    if sw / sh < dw / dh:
        sw_new = int(dw / dh * sh)
        pad_w = (sw_new - sw) // 2 # pad left and right
    else:
        sh_new = int(dh / dw * sw)
        pad_h = (sh_new - sh) // 2 # pad top and bottom
    return sw_new, sh_new, pad_w, pad_h

def resize_image(img):
    """Resize an opencv image, padding it when the aspect ratio differs"""
    sh, sw, _ = img.shape
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    # Pass the fill color by keyword; the next positional parameter after borderType is dst, not value
    img = cv2.copyMakeBorder(img, pad_h, pad_h, pad_w, pad_w, cv2.BORDER_CONSTANT, value=(0, 0, 0))
    img = cv2.resize(img, dsize=IMAGE_SIZE)
    return img

After the image is resized, the coordinates of the regions have to be converted as well. The conversion code is as follows (all of it rather dull 🤒):

IMAGE_SIZE = (128, 88)

def map_box_to_resized_image(box, sw, sh):
    """Map a region of the original image to the corresponding region of the resized image"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int((x + pad_w) * scale)
    y = int((y + pad_h) * scale)
    w = int(w * scale)
    h = int(h * scale)
    if x + w > IMAGE_SIZE[0] or y + h > IMAGE_SIZE[1] or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h

def map_box_to_original_image(box, sw, sh):
    """Map a region of the resized image back to the region of the original image"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int(x / scale - pad_w)
    y = int(y / scale - pad_h)
    w = int(w / scale)
    h = int(h / scale)
    if x + w > sw or y + h > sh or x < 0 or y < 0 or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h
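To see what these conversions do with real numbers, the following self-contained sketch repeats the functions (without the opencv part) and runs them on the first face region of the sample image 10001.jpg (612x408). Note that the truncation done by int() means a round trip does not return exactly the original region.

```python
IMAGE_SIZE = (128, 88)


def calc_resize_parameters(sw, sh):
    """Compute the padded size and the padding needed to reach the target aspect ratio."""
    sw_new, sh_new = sw, sh
    dw, dh = IMAGE_SIZE
    pad_w, pad_h = 0, 0
    if sw / sh < dw / dh:
        sw_new = int(dw / dh * sh)
        pad_w = (sw_new - sw) // 2
    else:
        sh_new = int(dh / dw * sw)
        pad_h = (sh_new - sh) // 2
    return sw_new, sh_new, pad_w, pad_h


def map_box_to_resized_image(box, sw, sh):
    """Map an (x, y, w, h) region of the original image to the resized image."""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x, y = int((x + pad_w) * scale), int((y + pad_h) * scale)
    w, h = int(w * scale), int(h * scale)
    if x + w > IMAGE_SIZE[0] or y + h > IMAGE_SIZE[1] or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h


def map_box_to_original_image(box, sw, sh):
    """Map an (x, y, w, h) region of the resized image back to the original image."""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x, y = int(x / scale - pad_w), int(y / scale - pad_h)
    w, h = int(w / scale), int(h / scale)
    if x + w > sw or y + h > sh or x < 0 or y < 0 or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h


# 612x408 is slightly wider than 128:88, so 6 pixels are padded at the top and bottom
print(calc_resize_parameters(612, 408))                    # (612, 420, 0, 6)
resized = map_box_to_resized_image((192, 199, 38, 36), 612, 408)
print(resized)                                             # (40, 42, 7, 7)
print(map_box_to_original_image(resized, 612, 408))        # (191, 194, 33, 33)
```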

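Computing region features

As we learned in the previous posts, a CNN model consists of convolution layers, pooling layers and fully connected layers. The convolution and pooling layers extract the features of each area of the image, and the fully connected layers flatten the features and hand them to a linear model. In Fast-RCNN we do not need the features of the whole image, only the features of some regions, so the CNN model used by Fast-RCNN only needs convolution and pooling layers (some models can omit the pooling layers). The convolution layers usually output more channels than the image has, and the width and height shrink in proportion to those of the original image. For example, when the original image has the size 3,256,256, the output may be 512,32,32, which means every 8x8 pixel area corresponds to 512 features.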
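The 256 to 32 reduction in this example can be checked with the usual output size formula for a convolution, (size - kernel + 2 * padding) // stride + 1. The sketch below assumes three 3x3 convolutions with stride 2 and padding 1, which is just one of many layer layouts that shrink the image by a factor of 8.

```python
def conv_output_size(size, kernel_size, stride, padding):
    """Output width or height of a convolution layer for a given input size."""
    return (size - kernel_size + 2 * padding) // stride + 1


size = 256
for _ in range(3):
    # Each stride-2 convolution halves the width and height
    size = conv_output_size(size, kernel_size=3, stride=2, padding=1)
    print(size)  # 128, then 64, then 32

# Each feature position then covers this many pixels along each axis
print(256 // size)  # 8
```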

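To keep things easy to understand, the Fast-RCNN code in this post makes the CNN model output features with exactly the same width and height as the input image, so extracting region features only requires the [] operator.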

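Extracting region features (ROI Pooling)

After Fast-RCNN has generated the features of the whole image, the next step is to extract the features of each region (Region of Interest Pooling). Put simply, this crops the feature data at the position that the region occupies in the image and then scales the cropped data to a common size.

[Figure: extracting region features from the feature map]

The layer that extracts region features is also called the ROI layer.

If the features have the same width and height as the image, cropping them only takes a simple [] operation. If the features are smaller than the image, cropping requires nearest neighbor interpolation or bilinear interpolation; an ROI layer that crops with bilinear interpolation is also called ROI Align. The scaling after cropping can use MaxPool, nearest neighbor interpolation, bilinear interpolation or similar algorithms.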
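The following pure Python sketch illustrates the simplest case: the features have the same size as the image, the region is cropped by slicing, and the crop is reduced to a fixed size with max pooling. It handles a single channel stored as nested lists indexed as feature[y][x]; the function names and the bin boundaries (floor for the start, ceiling for the end of each bin) are assumptions made for this illustration.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def adaptive_max_pool_2d(feature, out_h, out_w):
    """Reduce a 2D list to out_h x out_w by taking the maximum of each bin."""
    in_h, in_w = len(feature), len(feature[0])
    pooled = []
    for i in range(out_h):
        r0, r1 = (i * in_h) // out_h, ceil_div((i + 1) * in_h, out_h)
        row = []
        for j in range(out_w):
            c0, c1 = (j * in_w) // out_w, ceil_div((j + 1) * in_w, out_w)
            row.append(max(feature[r][c] for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


def roi_pool(feature, box, out_size=(2, 2)):
    """Crop the (x, y, w, h) region from the feature map and pool it to a fixed size."""
    x, y, w, h = box
    region = [row[x:x + w] for row in feature[y:y + h]]
    return adaptive_max_pool_2d(region, out_size[0], out_size[1])


# A 6x6 single channel feature map whose values are 0..35 in row-major order
feature = [[r * 6 + c for c in range(6)] for r in range(6)]

# Regions of different sizes all end up as 2x2 outputs
print(roi_pool(feature, (1, 2, 4, 3)))  # [[20, 22], [26, 28]]
print(roi_pool(feature, (0, 0, 6, 6)))  # [[14, 17], [32, 35]]
```

Because every region is reduced to the same size, the results can be stacked and passed to the same linear layers regardless of how large the original regions were.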

For a better understanding of ROI Align and bilinear interpolation you can refer to this article.

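Adjusting the region

As mentioned earlier, the regions selected by algorithms such as selective search may deviate from the regions where the objects really are, and this deviation can be corrected by the model. As a simple example, if a region contains the left half of a face, a trained model should be able to tell that the region needs to be extended to the right.

The adjustment of a region consists of four parameters:

Adjustment of the x coordinate of the top-left corner
Adjustment of the y coordinate of the top-left corner
Adjustment of the width
Adjustment of the height

The coordinates, widths and heights can take very different values. For instance, the left half of a face at the top-left of the image has different x and y coordinates than the same thing at the bottom-right, and the width and height differ with the distance to the camera. We therefore need to normalize the adjustments with the following formulas:

x1, y1, w1, h1 = candidate region
x2, y2, w2, h2 = true region
x offset = (x2 - x1) / w1
y offset = (y2 - y1) / h1
w offset = log(w2 / w1)
h offset = log(h2 / h1)

After normalization the offsets are ratios instead of absolute values and no longer depend on the concrete coordinates and sizes. The log in the formulas dampens the growth of the offsets, so the model can still learn well when the offsets are large.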

The code for computing the offsets and for adjusting a region by its offsets is as follows:

import math

def calc_box_offset(candidate_box, true_box):
    """Compute the offsets between a candidate region and the true region"""
    x1, y1, w1, h1 = candidate_box
    x2, y2, w2, h2 = true_box
    x_offset = (x2 - x1) / w1
    y_offset = (y2 - y1) / h1
    w_offset = math.log(w2 / w1)
    h_offset = math.log(h2 / h1)
    return (x_offset, y_offset, w_offset, h_offset)

def adjust_box_by_offset(candidate_box, offset):
    """Adjust a candidate region by the given offsets"""
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    x2 = w1 * x_offset + x1
    y2 = h1 * y_offset + y1
    w2 = math.exp(w_offset) * w1
    h2 = math.exp(h_offset) * h1
    return (x2, y2, w2, h2)
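A quick self-contained check of these two functions: applying the computed offsets to the candidate region should give back the true region, and the same relative displacement at a different position and scale should produce the same offsets. The sample boxes are made up for illustration.

```python
import math


def calc_box_offset(candidate_box, true_box):
    """Normalized offsets from a candidate (x, y, w, h) region to the true region."""
    x1, y1, w1, h1 = candidate_box
    x2, y2, w2, h2 = true_box
    return ((x2 - x1) / w1, (y2 - y1) / h1, math.log(w2 / w1), math.log(h2 / h1))


def adjust_box_by_offset(candidate_box, offset):
    """Apply normalized offsets to a candidate region."""
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    return (w1 * x_offset + x1, h1 * y_offset + y1,
            math.exp(w_offset) * w1, math.exp(h_offset) * h1)


candidate = (100, 100, 50, 50)
true_box = (110, 95, 60, 40)
offset = calc_box_offset(candidate, true_box)
print(offset)  # x: 10/50 = 0.2, y: -5/50 = -0.1, w: log(1.2), h: log(0.8)

# The round trip recovers the true region up to floating point error
restored = adjust_box_by_offset(candidate, offset)
print([round(v, 6) for v in restored])  # [110.0, 95.0, 60.0, 40.0]

# The same situation ten times smaller and at another position gives the same offsets
small_offset = calc_box_offset((10, 10, 5, 5), (11, 9.5, 6, 4))
print(all(math.isclose(a, b) for a, b in zip(offset, small_offset)))  # True
```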

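Computing the loss

For every region the Fast-RCNN model outputs two results: the label of the region (face, non-face) and the region offsets described above. The parameters have to be adjusted according to both results at the same time. To do this, we can add the losses together and then compute the gradient of every parameter:

region features = ROI layer(CNN model(image data))

linear model for labels(region features) - true labels = label loss
linear model for offsets(region features) - true offsets = offset loss

loss = label loss + offset loss

One thing to note is that in this example the label loss has to be computed separately for positive and negative samples, otherwise the adjusted model will only ever output negative results. The linear model may output positive (face) or negative (non-face) for the extracted features, but many of the features extracted by the ROI layer overlap, that is, they come from the same source. When there are more negative samples than positive ones, the combined result leans towards negative, so every parameter update pushes the model towards outputting negative. If the losses are computed separately, the features that do not overlap can be pushed towards positive and negative respectively, and the model can actually learn.

In addition, the offset loss should only be computed from positive samples; there is no need to learn offsets for negative samples.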

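The final loss computation looks like this:

region features = ROI layer(CNN model(image data))

linear model for labels(region features)[positive samples] - true labels[positive samples] = positive label loss
linear model for labels(region features)[negative samples] - true labels[negative samples] = negative label loss
linear model for offsets(region features)[positive samples] - true offsets[positive samples] = positive offset loss

loss = positive label loss + negative label loss + positive offset loss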
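The sketch below spells this computation out in plain Python for a handful of regions. It is only an illustration under assumptions: the text above does not say which loss functions are used, so cross-entropy for the labels and mean squared error for the offsets are choices made for this example, and all names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one sample given raw scores for [non-face, face]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


def mean(values):
    """Average of a list, or 0.0 when the list is empty."""
    return sum(values) / len(values) if values else 0.0


def combined_loss(pred_logits, pred_offsets, true_labels, true_offsets):
    """Positive label loss + negative label loss + positive offset loss."""
    pos = [i for i, label in enumerate(true_labels) if label == 1]
    neg = [i for i, label in enumerate(true_labels) if label == 0]
    # Label losses are averaged separately, so many negatives cannot drown out few positives
    loss_pos = mean([cross_entropy(pred_logits[i], 1) for i in pos])
    loss_neg = mean([cross_entropy(pred_logits[i], 0) for i in neg])
    # Offsets are only learned from positive samples
    loss_offset = mean([(p - t) ** 2
                        for i in pos
                        for p, t in zip(pred_offsets[i], true_offsets[i])])
    return loss_pos + loss_neg + loss_offset


# One positive and three negative regions; the model is undecided on all of them
logits = [[0.0, 0.0]] * 4
labels = [1, 0, 0, 0]
zero_offsets = [[0.0, 0.0, 0.0, 0.0]] * 4
print(combined_loss(logits, zero_offsets, labels, zero_offsets))  # 2 * log(2), about 1.386

# An offset error on the positive region adds to the loss, one on a negative region does not
pred_offsets = [[0.1, 0.0, 0.0, 0.0], [5.0, 5.0, 5.0, 5.0]] + [[0.0] * 4] * 2
print(combined_loss(logits, pred_offsets, labels, zero_offsets))  # about 1.386 + 0.0025
```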

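Merging result regions

The region selection algorithm returns many overlapping regions to begin with, so several regions may each overlap the true region by more than the threshold (70%), and all of them will be treated as regions containing the object.

[Figure: several candidate regions overlapping the same true region]

After training, the model may also return such overlapping regions when it predicts the result for an image. There are several ways to merge them:

Use the leftmost, rightmost, topmost or bottommost region
Use the first region (the region selection algorithm orders regions by how likely they are to contain an object)
Combine all overlapping regions (if the region adjustment does not work well, the resulting region may be much larger than the true region)

The RCNN code above already uses the second method to merge result regions, and the example below does the same. The Faster-RCNN model in the next post uses the third method instead, because its region adjustment works comparatively well.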
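As an illustration of the second method, the following self-contained sketch keeps a region only if it does not overlap any earlier region by more than a threshold, mirroring the loop in the eval part of the RCNN code above. The function name `merge_keep_first` and the sample boxes are made up for this example.

```python
def calc_iou(rect1, rect2):
    """Intersection over union of two (x, y, w, h) regions."""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1 + w1, x2 + w2) - xi
    hi = min(y1 + h1, y2 + h2) - yi
    if wi > 0 and hi > 0:
        area_overlap = wi * hi
        return area_overlap / (w1 * h1 + w2 * h2 - area_overlap)
    return 0


def merge_keep_first(boxes, iou_threshold=0.30):
    """Keep a box only if it overlaps none of the boxes seen before it."""
    kept = []
    seen = []
    for box in boxes:
        # Compare against every earlier box, including ones that were skipped
        if all(calc_iou(previous, box) <= iou_threshold for previous in seen):
            kept.append(box)
        seen.append(box)
    return kept


# The second box overlaps the first one heavily and is dropped; the third is far away
boxes = [(10, 10, 20, 20), (12, 10, 20, 20), (100, 100, 20, 20)]
print(merge_keep_first(boxes))  # [(10, 10, 20, 20), (100, 100, 20, 20)]
```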

Original paper

If you would like to read the original Fast-RCNN paper, it is available at the following address:

https://arxiv.org/pdf/1504.08083.pdf

Using Fast-RCNN to detect faces in images

Time for code 😱. This code uses the Fast-RCNN model to detect the faces in images, with the same dataset as the previous example.

import os
import sys
import torch
import gzip
import itertools
import random
import numpy
import math
import pandas
import cv2
from torch import nn
from matplotlib import pyplot
from collections import defaultdict

# Size that images are resized to
IMAGE_SIZE = (256, 256)
# Folder containing the images to analyze
IMAGE_DIR = "./784145_1347673_bundle_archive/train/image_data"
# CSV file that defines the face regions of each image
BOX_CSV_PATH = "./784145_1347673_bundle_archive/train/bbox_train.csv"

# Used to enable GPU support
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class BasicBlock(nn.Module):
    """Basic block used by ResNet"""
    expansion = 1 # how many times channels_out the actual output channels of this block are; fixed to 1 here
    def __init__(self, channels_in, channels_out, stride):
        super().__init__()
        # Create a 3x3 convolution layer
        # With stride = 1 the output size equals the input size, e.g. (32-3+2)//1+1 == 32
        # With stride = 2 the output size is half the input size, e.g. (32-3+2)//2+1 == 16
        # Also, the 3x3 convolution layers of resnet do not use a bias
        self.conv1 = nn.Sequential(
            nn.Conv2d(channels_in, channels_out, kernel_size=3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(channels_out))
        # Define another 3x3 convolution layer whose output has the same dimensions as its input
        self.conv2 = nn.Sequential(
            nn.Conv2d(channels_out, channels_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(channels_out))
        # Adding the original input to the output requires matching dimensions; if they differ they must be aligned
        self.identity = nn.Sequential()
        if stride != 1 or channels_in != channels_out * self.expansion:
            self.identity = nn.Sequential(
                nn.Conv2d(channels_in, channels_out * self.expansion, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(channels_out * self.expansion))

    def forward(self, x):
        # x => conv1 => relu => conv2 => + => relu
        # |                              ^
        # |==============================|
        tmp = self.conv1(x)
        tmp = nn.functional.relu(tmp, inplace=True)
        tmp = self.conv2(tmp)
        tmp += self.identity(x)
        y = nn.functional.relu(tmp, inplace=True)
        return y

class MyModel(nn.Module):
    """Fast-RCNN (a variant based on ResNet-18)"""
    def __init__(self):
        super().__init__()
        # Number of output channels of the previous layer
        self.previous_channels_out = 4
        # Convert 3 channels to 4 channels, keeping width and height
        self.conv1 = nn.Sequential(
            nn.Conv2d(3, self.previous_channels_out, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(self.previous_channels_out))
        # ResNet that extracts the features of each area of the image (without AvgPool and the fully connected layer)
        # Unlike the original ResNet, the output has the same width and height as the input,
        # so that the ROI layer can extract features by region
        # Also, the number of channels is reduced so that the model can run with 4GB of GPU memory
        self.layer1 = self._make_layer(BasicBlock, channels_out=4, num_blocks=2, stride=1)
        self.layer2 = self._make_layer(BasicBlock, channels_out=4, num_blocks=2, stride=1)
        self.layer3 = self._make_layer(BasicBlock, channels_out=8, num_blocks=2, stride=1)
        self.layer4 = self._make_layer(BasicBlock, channels_out=8, num_blocks=2, stride=1)
        # The ROI layer extracts the features of each sub-region and converts them to a fixed size
        self.roi_pool = nn.AdaptiveMaxPool2d((5, 5))
        # Outputs two classes [non-face, face]
        self.fc_labels_model = nn.Sequential(
            nn.Linear(8*5*5, 32),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(32, 2))
        # Computes the region offsets, one output each for x, y, w, h
        self.fc_offsets_model = nn.Sequential(
            nn.Linear(8*5*5, 128),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(128, 4))

    def _make_layer(self, block_type, channels_out, num_blocks, stride):
        blocks = []
        # Add the first block
        blocks.append(block_type(self.previous_channels_out, channels_out, stride))
        self.previous_channels_out = channels_out * block_type.expansion
        # Add the remaining blocks, which always use stride 1 and do not change width and height
        for _ in range(num_blocks-1):
            blocks.append(block_type(self.previous_channels_out, self.previous_channels_out, 1))
            self.previous_channels_out *= block_type.expansion
        return nn.Sequential(*blocks)

    def _roi_pooling(self, feature_mapping, roi_boxes):
        result = []
        for box in roi_boxes:
            image_index, x, y, w, h = map(int, box.tolist())
            feature_sub_region = feature_mapping[image_index][:,x:x+w,y:y+h]
            fixed_features = self.roi_pool(feature_sub_region).reshape(-1) # flatten along the way
            result.append(fixed_features)
        return torch.stack(result)

    def forward(self, x):
        images_tensor = x[0]
        candidate_boxes_tensor = x[1]
        # Convert the channels
        tmp = self.conv1(images_tensor)
        tmp = nn.functional.relu(tmp)
        # Apply the layers of ResNet
        # The resulting shape is B,8,W,H
        tmp = self.layer1(tmp)
        tmp = self.layer2(tmp)
        tmp = self.layer3(tmp)
        tmp = self.layer4(tmp)
        # Use the ROI layer to extract the features of each sub-region and convert them to a fixed size
        # The resulting shape is N,8*5*5 where N is the number of candidate regions
        tmp = self._roi_pooling(tmp, candidate_boxes_tensor)
        # Compute the class (face or not) and the region offsets from the extracted sub-region features
        labels = self.fc_labels_model(tmp)
        offsets = self.fc_offsets_model(tmp)
        y = (labels, offsets)
        return y

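def save_tensor(tensor, path):
    """Save a tensor object to a gzip-compressed file"""
    # The with block makes sure the file is flushed and closed
    with gzip.GzipFile(path, "wb") as f:
        torch.save(tensor, f)

def load_tensor(path):
    """Load a tensor object from a gzip-compressed file"""
    with gzip.GzipFile(path, "rb") as f:
        return torch.load(f)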

def calc_resize_parameters(sw, sh):
    """Compute the parameters for resizing an image"""
    sw_new, sh_new = sw, sh
    dw, dh = IMAGE_SIZE
    pad_w, pad_h = 0, 0
    if sw / sh < dw / dh:
        sw_new = int(dw / dh * sh)
        pad_w = (sw_new - sw) // 2 # pad left and right
    else:
        sh_new = int(dh / dw * sw)
        pad_h = (sh_new - sh) // 2 # pad top and bottom
    return sw_new, sh_new, pad_w, pad_h

def resize_image(img):
    """Resize an opencv image, padding it when the aspect ratio differs"""
    sh, sw, _ = img.shape
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    # Pass the fill color by keyword; the next positional parameter after borderType is dst, not value
    img = cv2.copyMakeBorder(img, pad_h, pad_h, pad_w, pad_w, cv2.BORDER_CONSTANT, value=(0, 0, 0))
    img = cv2.resize(img, dsize=IMAGE_SIZE)
    return img

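def image_to_tensor(img):
    """Convert an opencv image object to a tensor object"""
    # Note that opencv uses BGR, but this does not affect training so there is no need to convert to RGB
    arr = numpy.asarray(img)
    t = torch.from_numpy(arr)
    t = t.transpose(0, 2) # change dimensions from H,W,C to C,W,H
    t = t / 255.0 # normalize values to the range 0 ~ 1
    return t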

def map_box_to_resized_image(box, sw, sh):
    """Map a region of the original image to the corresponding region of the resized image"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int((x + pad_w) * scale)
    y = int((y + pad_h) * scale)
    w = int(w * scale)
    h = int(h * scale)
    if x + w > IMAGE_SIZE[0] or y + h > IMAGE_SIZE[1] or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h

def map_box_to_original_image(box, sw, sh):
    """Map a region of the resized image back to the region of the original image"""
    x, y, w, h = box
    sw_new, sh_new, pad_w, pad_h = calc_resize_parameters(sw, sh)
    scale = IMAGE_SIZE[0] / sw_new
    x = int(x / scale - pad_w)
    y = int(y / scale - pad_h)
    w = int(w / scale)
    h = int(h / scale)
    if x + w > sw or y + h > sh or x < 0 or y < 0 or w == 0 or h == 0:
        return 0, 0, 0, 0
    return x, y, w, h

def calc_iou(rect1, rect2):
    """Compute overlapping area / union area of two regions (intersection over union)"""
    x1, y1, w1, h1 = rect1
    x2, y2, w2, h2 = rect2
    xi = max(x1, x2)
    yi = max(y1, y2)
    wi = min(x1+w1, x2+w2) - xi
    hi = min(y1+h1, y2+h2) - yi
    if wi > 0 and hi > 0: # the regions overlap
        area_overlap = wi*hi
        area_all = w1*h1 + w2*h2 - area_overlap
        iou = area_overlap / area_all
    else: # the regions do not overlap
        iou = 0
    return iou

def calc_box_offset(candidate_box, true_box):
    """Compute the offsets between a candidate region and the true region"""
    # The offsets computed here are ratios and do not depend on the concrete position and size
    # log is used for w and h to reduce the influence of very large values
    x1, y1, w1, h1 = candidate_box
    x2, y2, w2, h2 = true_box
    x_offset = (x2 - x1) / w1
    y_offset = (y2 - y1) / h1
    w_offset = math.log(w2 / w1)
    h_offset = math.log(h2 / h1)
    return (x_offset, y_offset, w_offset, h_offset)

def adjust_box_by_offset(candidate_box, offset):
    """Adjust a candidate region by the given offsets"""
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    x2 = w1 * x_offset + x1
    y2 = h1 * y_offset + y1
    w2 = math.exp(w_offset) * w1
    h2 = math.exp(h_offset) * h1
    return (x2, y2, w2, h2)

def selective_search(img):
    """Compute regions of an opencv image that may contain objects; only the first 2000 regions are returned"""
    # For the algorithm see https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
    s = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    s.setBaseImage(img)
    s.switchToSelectiveSearchFast()
    boxes = s.process()
    return boxes[:2000]

def prepare_save_batch(batch, image_tensors, image_candidate_boxes, image_labels, image_box_offsets):
    """准備訓練 - 保存單個批次的數據"""
    # 按索引值列表生成輸入和輸出 tensor 對象的函數
    def split_dataset(indices):
        image_in = []
        candidate_boxes_in = []
        labels_out = []
        offsets_out = []
        for new_image_index, original_image_index in enumerate(indices):
            image_in.append(image_tensors[original_image_index])
            for box, label, offset in zip(image_candidate_boxes, image_labels, image_box_offsets):
                box_image_index, x, y, w, h = box
                if box_image_index == original_image_index:
                    candidate_boxes_in.append((new_image_index, x, y, w, h))
                    labels_out.append(label)
                    offsets_out.append(offset)
        # 檢查計算是否有問題
        # for box, label in zip(candidate_boxes_in, labels_out):
        #    image_index, x, y, w, h = box
        #    child_img = image_in[image_index][:, x:x+w, y:y+h].transpose(0, 2) * 255
        #    cv2.imwrite(f"{image_index}_{x}_{y}_{w}_{h}_{label}.png", child_img.numpy())
        tensor_image_in = torch.stack(image_in) # shape: B,C,W,H
        tensor_candidate_boxes_in = torch.tensor(candidate_boxes_in, dtype=torch.float) # shape: N,5 (index, x, y, w, h)
        tensor_labels_out = torch.tensor(labels_out, dtype=torch.long) # shape: N
        tensor_box_offsets_out = torch.tensor(offsets_out, dtype=torch.float) # shape: N,4 (x_offset, y_offset, ..)
        return (tensor_image_in, tensor_candidate_boxes_in), (tensor_labels_out, tensor_box_offsets_out)

    # Split into training set (80%), validation set (10%) and testing set (10%)
    random_indices = torch.randperm(len(image_tensors))
    training_indices = random_indices[:int(len(random_indices)*0.8)]
    validating_indices = random_indices[int(len(random_indices)*0.8):int(len(random_indices)*0.9)]
    testing_indices = random_indices[int(len(random_indices)*0.9):]
    training_set = split_dataset(training_indices)
    validating_set = split_dataset(validating_indices)
    testing_set = split_dataset(testing_indices)

    # Save to disk
    save_tensor(training_set, f"data/training_set.{batch}.pt")
    save_tensor(validating_set, f"data/validating_set.{batch}.pt")
    save_tensor(testing_set, f"data/testing_set.{batch}.pt")
    print(f"batch {batch} saved")

def prepare():
    """Prepare for training"""
    # After conversion to tensors, the dataset is saved under the data folder
    if not os.path.isdir("data"):
        os.makedirs("data")

    # Load the csv file and build an index from image to box list { filename: [ box, box, .. ] }
    box_map = defaultdict(lambda: [])
    df = pandas.read_csv(BOX_CSV_PATH)
    for row in df.values:
        filename, width, height, x1, y1, x2, y2 = row[:7]
        box_map[filename].append((x1, y1, x2-x1, y2-y1))

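    # Extract face (positive) and non-face (negative) samples from the images
    batch_size = 50
    max_samples = 10
    batch = 0
    image_tensors = [] # list of images
    image_candidate_boxes = [] # candidate boxes of each image
    image_labels = [] # label of each candidate box (1 face, 0 non-face)
    image_box_offsets = [] # offset between each candidate box and its true box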
    for filename, true_boxes in box_map.items():
        path = os.path.join(IMAGE_DIR, filename)
        img_original = cv2.imread(path) # load the original image
        sh, sw, _ = img_original.shape # size of the original image
        img = resize_image(img_original) # resize the image
        candidate_boxes = selective_search(img) # find candidate boxes
        true_boxes = [ map_box_to_resized_image(b, sw, sh) for b in true_boxes ] # rescale the true boxes
        image_index = len(image_tensors) # index of the image within the batch
        image_tensors.append(image_to_tensor(img.copy()))
        positive_samples = 0
        negative_samples = 0
        for candidate_box in candidate_boxes:
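            # A candidate box whose IOU with any true box is above 70% is a positive sample
            # A candidate box whose IOU with every true box is below 30% is a negative sample
            # Each image adds at most (positive sample count + 10) negative samples; enough negatives are needed to avoid false positives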
            iou_list = [ calc_iou(candidate_box, true_box) for true_box in true_boxes ]
            positive_index = next((index for index, iou in enumerate(iou_list) if iou > 0.70), None)
            is_negative = all(iou < 0.30 for iou in iou_list)
            result = None
            if positive_index is not None:
                result = True
                positive_samples += 1
            elif is_negative and negative_samples < positive_samples + 10:
                result = False
                negative_samples += 1
            if result is not None:
                x, y, w, h = candidate_box
                # Uncomment the lines below to check the computed regions visually
                # child_img = img[y:y+h, x:x+w].copy()
                # cv2.imwrite(f"{filename}_{x}_{y}_{w}_{h}_{int(result)}.png", child_img)
                image_candidate_boxes.append((image_index, x, y, w, h))
                image_labels.append(int(result))
                if positive_index is not None:
                    image_box_offsets.append(calc_box_offset(
                        candidate_box, true_boxes[positive_index])) # positive samples store their offset
                else:
                    image_box_offsets.append((0, 0, 0, 0)) # negative samples have no offset
            if positive_samples >= max_samples:
                break
        # Save the batch
        if len(image_tensors) >= batch_size:
            prepare_save_batch(batch, image_tensors, image_candidate_boxes, image_labels, image_box_offsets)
            image_tensors.clear()
            image_candidate_boxes.clear()
            image_labels.clear()
            image_box_offsets.clear()
            batch += 1
    # Save the remaining batch (skipped when 10 or fewer images are left)
    if len(image_tensors) > 10:
        prepare_save_batch(batch, image_tensors, image_candidate_boxes, image_labels, image_box_offsets)

def train():
    """Start training"""
    # Create the model instance
    model = MyModel().to(device)

    # Create the multi-task loss function
    celoss = torch.nn.CrossEntropyLoss()
    mseloss = torch.nn.MSELoss()
    def loss_function(predicted, actual):
        # The label loss must be computed separately for positive and negative samples, otherwise predictions end up always negative
        positive_indices = actual[0].nonzero(as_tuple=True)[0] # indices of positive samples
        negative_indices = (actual[0] == 0).nonzero(as_tuple=True)[0] # indices of negative samples
        loss1 = celoss(predicted[0][positive_indices], actual[0][positive_indices]) # label loss of positive samples
        loss2 = celoss(predicted[0][negative_indices], actual[0][negative_indices]) # label loss of negative samples
        loss3 = mseloss(predicted[1][positive_indices], actual[1][positive_indices]) # offset loss, computed for positive samples only
        return loss1 + loss2 + loss3

    # Create the optimizer
    optimizer = torch.optim.Adam(model.parameters())

    # Record how the training and validation accuracy change
    training_label_accuracy_history = []
    training_offset_accuracy_history = []
    validating_label_accuracy_history = []
    validating_offset_accuracy_history = []

    # Record the highest validation accuracy
    validating_label_accuracy_highest = -1
    validating_label_accuracy_highest_epoch = 0
    validating_offset_accuracy_highest = -1
    validating_offset_accuracy_highest_epoch = 0

    # Helper that reads the saved batches one by one
    def read_batches(base_path):
        for batch in itertools.count():
            path = f"{base_path}.{batch}.pt"
            if not os.path.isfile(path):
                break
            yield [ [ tt.to(device) for tt in t ] for t in load_tensor(path) ]

    # Helper that computes accuracy
    def calc_accuracy(actual, predicted):
        # Label accuracy: computed separately for positive and negative samples, then averaged
        predicted_i = torch.max(predicted[0], 1).indices
        acc_positive = ((actual[0] > 0.5) & (predicted_i > 0.5)).sum().item() / ((actual[0] > 0.5).sum().item() + 0.00001)
        acc_negative = ((actual[0] <= 0.5) & (predicted_i <= 0.5)).sum().item() / ((actual[0] <= 0.5).sum().item() + 0.00001)
        acc_label = (acc_positive + acc_negative) / 2
        # print(acc_positive, acc_negative)
        # Offset accuracy (1 - mean absolute error, over samples with a non-zero offset)
        valid_indices = actual[1].nonzero(as_tuple=True)[0]
        if valid_indices.shape[0] == 0:
            acc_offset = 1
        else:
            acc_offset = (1 - (predicted[1][valid_indices] - actual[1][valid_indices]).abs().mean()).item()
            acc_offset = max(acc_offset, 0)
        return acc_label, acc_offset

    # Start the training loop
    for epoch in range(1, 10000):
        print(f"epoch: {epoch}")

        # Train on the training set and update the parameters
        model.train()
        training_label_accuracy_list = []
        training_offset_accuracy_list = []
        for batch_index, batch in enumerate(read_batches("data/training_set")):
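            # Split into input and output
            batch_x, batch_y = batch
            # Compute the predicted values
            predicted = model(batch_x)
            # Compute the loss
            loss = loss_function(predicted, batch_y)
            # Compute gradients from the loss by automatic differentiation
            loss.backward()
            # Update the parameters with the optimizer
            optimizer.step()
            # Clear the gradients
            optimizer.zero_grad()
            # Record the accuracy of this batch; torch.no_grad temporarily disables automatic differentiation
            with torch.no_grad():
                training_batch_label_accuracy, training_batch_offset_accuracy = calc_accuracy(batch_y, predicted)
            # Print the batch accuracy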
            training_label_accuracy_list.append(training_batch_label_accuracy)
            training_offset_accuracy_list.append(training_batch_offset_accuracy)
            print(f"epoch: {epoch}, batch: {batch_index}: " +
                f"batch label accuracy: {training_batch_label_accuracy}, offset accuracy: {training_batch_offset_accuracy}")
        training_label_accuracy = sum(training_label_accuracy_list) / len(training_label_accuracy_list)
        training_offset_accuracy = sum(training_offset_accuracy_list) / len(training_offset_accuracy_list)
        training_label_accuracy_history.append(training_label_accuracy)
        training_offset_accuracy_history.append(training_offset_accuracy)
        print(f"training label accuracy: {training_label_accuracy}, offset accuracy: {training_offset_accuracy}")

        # Check the validation set
        model.eval()
        validating_label_accuracy_list = []
        validating_offset_accuracy_list = []
        for batch in read_batches("data/validating_set"):
            batch_x, batch_y = batch
            predicted = model(batch_x)
            validating_batch_label_accuracy, validating_batch_offset_accuracy = calc_accuracy(batch_y, predicted)
            validating_label_accuracy_list.append(validating_batch_label_accuracy)
            validating_offset_accuracy_list.append(validating_batch_offset_accuracy)
        validating_label_accuracy = sum(validating_label_accuracy_list) / len(validating_label_accuracy_list)
        validating_offset_accuracy = sum(validating_offset_accuracy_list) / len(validating_offset_accuracy_list)
        validating_label_accuracy_history.append(validating_label_accuracy)
        validating_offset_accuracy_history.append(validating_offset_accuracy)
        print(f"validating label accuracy: {validating_label_accuracy}, offset accuracy: {validating_offset_accuracy}")

        # Record the highest validation accuracy and save the model state at that point; stop if no record is set for 20 epochs
        if validating_label_accuracy > validating_label_accuracy_highest:
            validating_label_accuracy_highest = validating_label_accuracy
            validating_label_accuracy_highest_epoch = epoch
            save_tensor(model.state_dict(), "model.pt")
            print("highest label validating accuracy updated")
        elif validating_offset_accuracy > validating_offset_accuracy_highest:
            validating_offset_accuracy_highest = validating_offset_accuracy
            validating_offset_accuracy_highest_epoch = epoch
            save_tensor(model.state_dict(), "model.pt")
            print("highest offset validating accuracy updated")
        elif (epoch - validating_label_accuracy_highest_epoch > 20 and
            epoch - validating_offset_accuracy_highest_epoch > 20):
            # No new record after 20 epochs, stop training
            print("stop training because highest validating accuracy not updated in 20 epochs")
            break

    # Use the model state saved when the highest accuracy was reached
    print(f"highest label validating accuracy: {validating_label_accuracy_highest}",
        f"from epoch {validating_label_accuracy_highest_epoch}")
    print(f"highest offset validating accuracy: {validating_offset_accuracy_highest}",
        f"from epoch {validating_offset_accuracy_highest_epoch}")
    model.load_state_dict(load_tensor("model.pt"))

    # Check the testing set
    testing_label_accuracy_list = []
    testing_offset_accuracy_list = []
    for batch in read_batches("data/testing_set"):
        batch_x, batch_y = batch
        predicted = model(batch_x)
        testing_batch_label_accuracy, testing_batch_offset_accuracy = calc_accuracy(batch_y, predicted)
        testing_label_accuracy_list.append(testing_batch_label_accuracy)
        testing_offset_accuracy_list.append(testing_batch_offset_accuracy)
    testing_label_accuracy = sum(testing_label_accuracy_list) / len(testing_label_accuracy_list)
    testing_offset_accuracy = sum(testing_offset_accuracy_list) / len(testing_offset_accuracy_list)
    print(f"testing label accuracy: {testing_label_accuracy}, offset accuracy: {testing_offset_accuracy}")

    # Plot how the training and validation accuracy change
    pyplot.plot(training_label_accuracy_history, label="training_label_accuracy")
    pyplot.plot(training_offset_accuracy_history, label="training_offset_accuracy")
    pyplot.plot(validating_label_accuracy_history, label="validating_label_accuracy")
    pyplot.plot(validating_offset_accuracy_history, label="validating_offset_accuracy")
    pyplot.ylim(0, 1)
    pyplot.legend()
    pyplot.show()

def eval_model():
    """Use the trained model"""
    # Create the model instance, load the trained state, then switch to evaluation mode
    model = MyModel().to(device)
    model.load_state_dict(load_tensor("model.pt"))
    model.eval()

    # Ask for an image path and mark every region that may be a face
    while True:
        try:
            # Select the regions that may contain objects
            image_path = input("Image path: ")
            if not image_path:
                continue
            img_original = cv2.imread(image_path) # load the original image
            sh, sw, _ = img_original.shape # size of the original image
            img = resize_image(img_original) # resize the image
            candidate_boxes = selective_search(img) # find candidate boxes
            # Build the input
            image_tensor = image_to_tensor(img).unsqueeze(dim=0).to(device) # shape: 1,C,W,H
            candidate_boxes_tensor = torch.tensor(
                [ (0, x, y, w, h) for x, y, w, h in candidate_boxes ],
                dtype=torch.float).to(device) # shape: N,5
            tensor_in = (image_tensor, candidate_boxes_tensor)
            # Predict the output
            labels, offsets = model(tensor_in)
            labels = nn.functional.softmax(labels, dim=1)
            labels = labels[:, 1] # probability of the face class, shape: N
            # Regions with a face probability above 99% are treated as faces: adjust each by its offset, draw its border on the image and save
            img_output = img_original.copy()
            for box, label, offset in zip(candidate_boxes, labels, offsets):
                if label.item() <= 0.99:
                    continue
                box = adjust_box_by_offset(box, offset.tolist())
                x, y, w, h = map_box_to_original_image(box, sw, sh)
                if w == 0 or h == 0:
                    continue
                print(x, y, w, h)
                cv2.rectangle(img_output, (x, y), (x+w, y+h), (0, 0, 0xff), 1)
            cv2.imwrite("img_output.png", img_output)
            print("saved to img_output.png")
            print()
        except Exception as e:
            print("error:", e)

def main():
    """Entry point"""
    if len(sys.argv) < 2:
        print(f"Please run: {sys.argv[0]} prepare|train|eval")
        exit()

    # Seed the random number generators so that every run produces the same random numbers
    # This makes the process reproducible; you can skip it if you prefer
    random.seed(0)
    torch.random.manual_seed(0)

    # Choose the operation from the command line argument
    operation = sys.argv[1]
    if operation == "prepare":
        prepare()
    elif operation == "train":
        train()
    elif operation == "eval":
        eval_model()
    else:
        raise ValueError(f"Unsupported operation: {operation}")

if __name__ == "__main__":
    main()
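
To make the two offset helpers in the listing above easier to follow, here is a small round-trip sketch. It re-declares `calc_box_offset` and `adjust_box_by_offset` in a standalone form; the unpacking at the top of `calc_box_offset` is an assumption (those lines fall outside this part of the listing), and the example boxes are made-up values in `(x, y, w, h)` format.

```python
import math

def calc_box_offset(candidate_box, true_box):
    """Encode true_box relative to candidate_box, both given as (x, y, w, h)."""
    x1, y1, w1, h1 = candidate_box  # assumed unpacking, mirrors adjust_box_by_offset
    x2, y2, w2, h2 = true_box
    x_offset = (x2 - x1) / w1       # horizontal shift, in units of candidate width
    y_offset = (y2 - y1) / h1       # vertical shift, in units of candidate height
    w_offset = math.log(w2 / w1)    # log of the width ratio
    h_offset = math.log(h2 / h1)    # log of the height ratio
    return (x_offset, y_offset, w_offset, h_offset)

def adjust_box_by_offset(candidate_box, offset):
    """Decode an offset back into a box; inverse of calc_box_offset."""
    x1, y1, w1, h1 = candidate_box
    x_offset, y_offset, w_offset, h_offset = offset
    return (w1 * x_offset + x1, h1 * y_offset + y1,
            math.exp(w_offset) * w1, math.exp(h_offset) * h1)

# Hypothetical boxes, chosen only for illustration
candidate = (50, 40, 100, 80)
truth = (60, 36, 120, 72)

offset = calc_box_offset(candidate, truth)
restored = adjust_box_by_offset(candidate, offset)
print(offset)    # shifts are fractions of the candidate size, scales are log ratios
print(restored)  # matches truth up to floating point error
```

Dividing the shift by the candidate size and taking the log of the size ratio keeps the regression targets in a similar range regardless of how large the box is, which is presumably why the model learns these values instead of raw pixel differences.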

After running the following commands:

python3 example.py prepare
python3 example.py train

the output after 31 epochs of training looks like this (training takes a really long time, so I cut it short here 🥺):

epoch: 31, batch: 112: batch label accuracy: 0.9805490565092065, offset accuracy: 0.9293316006660461
epoch: 31, batch: 113: batch label accuracy: 0.9776784565994586, offset accuracy: 0.9191392660140991
epoch: 31, batch: 114: batch label accuracy: 0.9469732184008024, offset accuracy: 0.9101274609565735
training label accuracy: 0.9707166603858259, offset accuracy: 0.9191886570142663
validating label accuracy: 0.9306134214845806, offset accuracy: 0.9205827381299889
highest offset validating accuracy updated
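
The two numbers in this log are not plain accuracies. Going by `calc_accuracy` in the listing, the label accuracy is the average of the positive-sample and negative-sample accuracy, and the offset accuracy is one minus the mean absolute error of the offsets. The sketch below restates both with plain Python lists instead of tensors; the function names and sample values are hypothetical.

```python
def balanced_label_accuracy(actual, predicted):
    """Average of per-class accuracy for labels in {0, 1}."""
    positives = [p for a, p in zip(actual, predicted) if a == 1]
    negatives = [p for a, p in zip(actual, predicted) if a == 0]
    # The small constant avoids division by zero, as in the listing
    acc_positive = sum(1 for p in positives if p == 1) / (len(positives) + 0.00001)
    acc_negative = sum(1 for p in negatives if p == 0) / (len(negatives) + 0.00001)
    return (acc_positive + acc_negative) / 2

def offset_accuracy(actual_offsets, predicted_offsets):
    """1 - mean absolute error over all offset components, floored at 0."""
    diffs = [abs(p - a)
             for a_row, p_row in zip(actual_offsets, predicted_offsets)
             for a, p in zip(a_row, p_row)]
    if not diffs:
        return 1.0
    return max(1 - sum(diffs) / len(diffs), 0.0)

# A model that always answers "not a face" on an imbalanced batch
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [0] * 10
plain = sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)
print(plain)                                       # plain accuracy looks decent
print(balanced_label_accuracy(actual, predicted))  # balanced accuracy is about 0.5
print(offset_accuracy([(0.1, -0.05, 0.2, -0.1)], [(0.2, -0.05, 0.1, -0.1)]))
```

Averaging the two classes separately means a model cannot score well simply by predicting the majority (non-face) class.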

Run the following command, then enter an image path, to detect faces in an image with the trained model:

python3 example.py eval
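
As a rough illustration of how `eval_model` decides which candidate regions to keep: the model's two raw scores per region go through softmax, and only regions whose face probability exceeds 0.99 are drawn. The sketch below reproduces that filter in plain Python; `face_probability`, `filter_boxes`, and the sample scores are hypothetical names and values, not part of the original script.

```python
import math

def face_probability(logits):
    """Softmax probability of class 1 (face) given (non_face_score, face_score)."""
    m = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    return exps[1] / sum(exps)

def filter_boxes(boxes, logits_list, threshold=0.99):
    """Keep only the boxes whose face probability is above the threshold."""
    return [box for box, logits in zip(boxes, logits_list)
            if face_probability(logits) > threshold]

# Made-up candidate boxes (x, y, w, h) and made-up raw model scores
boxes = [(10, 10, 40, 40), (50, 20, 30, 30), (5, 60, 20, 20)]
logits_list = [(-3.0, 3.0), (0.0, 2.0), (2.0, -1.0)]

for box, logits in zip(boxes, logits_list):
    print(box, round(face_probability(logits), 4))
print(filter_boxes(boxes, logits_list))  # only the first box passes the 0.99 threshold
```

A threshold this strict trades recall for fewer false positives; lowering it would draw more boxes, including more mistaken ones.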

Below are some of the detection results:

Before box adjustment

After box adjustment

Before box adjustment

After box adjustment

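Accuracy is about the same as RCNN, and in some cases slightly lower (the images were downscaled so the model fits in 4 GB of GPU memory). Detection speed, however, improves considerably: in the same environment, Fast-RCNN needs only 0.4 to 0.5 seconds per image, whereas RCNN needs about 2 seconds.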

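Final words

The RCNN and Fast-RCNN models covered in this post are only meant as an introduction to object detection and have little practical value (they are slow and not very accurate). The Faster-RCNN model covered in the next post is usable in production, but it is also a level more complex 🤒.

Also, the code in this post and the next differs from the implementations in the papers and from other implementations found online. That is because my machine has little GPU memory, and because I want to implement the same principles with as little code as possible so that the code is easier to follow (many implementations online are split across a pile of files and even move part of the logic into c/c++ extensions, which helps performance but gives beginners a headache).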
