YOLO v1, YOLO v2, and YOLO v3 Series


      Object detection models fall into two main categories, two-stage and one-stage; the main one-stage representatives are the YOLO series and SSD. Below are some brief notes from studying the YOLO series.

1. YOLO v1

   YOLO v1 was proposed in the 2015 paper You Only Look Once: Unified, Real-Time Object Detection and is the pioneering work of one-stage object detection. Its network architecture consists of 24 convolutional layers and two fully connected layers; note that the last fully connected layer can be understood as a linear map from 1×4096 to 1×1470 (7×7×30).

  Understanding YOLO v1 comes down to three main points:

  1.1 Grid division: the input image is 448×448 and YOLO divides it into 49 (7×7) cells. Each cell is responsible for predicting only one object: if an object's center point falls inside a cell, that cell is responsible for predicting that object, as in the sketch below.
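  As a minimal illustration (not code from the original post), the responsible cell for a ground-truth box can be found by scaling the normalized box center to the 7×7 grid and truncating:

# Illustrative sketch: which cell is responsible for a ground-truth box,
# assuming the box center (cx, cy) is normalized to [0, 1] by the image size.
S = 7

def responsible_cell(cx, cy):
    col = min(int(cx * S), S - 1)   # grid column index
    row = min(int(cy * S), S - 1)   # grid row index
    return row, col

print(responsible_cell(0.52, 0.31))  # -> (2, 3): this cell predicts the object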

  

  1.2 Prediction output: the final network output is 7×7×30, which can be viewed as 49 vectors of size 1×30. Each vector is laid out as (x, y, w, h, confidence)×2 + 20; that is, each vector predicts two bounding boxes with their confidences, plus the probabilities of the object belonging to each of the 20 classes (the VOC dataset has 20 classes).
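  As a small sketch of how such an output tensor is usually sliced (illustrative only, assuming the (x, y, w, h, confidence)×2 + 20 layout described above):

import torch

S, B, C = 7, 2, 20
pred = torch.randn(S, S, B * 5 + C)            # one image's raw 7x7x30 output (random values here)

boxes = pred[..., :B * 5].view(S, S, B, 5)     # (7, 7, 2, 5): x, y, w, h, confidence per box
class_probs = pred[..., B * 5:]                # (7, 7, 20): conditional class probabilities

# class-specific confidence used at test time: P(class | object) * box confidence
scores = class_probs.unsqueeze(2) * boxes[..., 4:5]   # (7, 7, 2, 20)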

 

      1.3 Understanding the loss function: the loss function is written out below; a few concepts need to be clear first:

       S²: the final network output is 7×7×30, so there are S² = 49 cells;

       B: each cell (a 1×30 vector) predicts two bboxes, so B = 2; only the bbox with the largest IoU against the ground truth takes part in the coordinate and object-confidence terms.

       The 7×7 positive mask 𝟙_ij^obj: when the grid is laid out, a cell is set to 1 if a ground-truth center falls inside it; only cells with value 1 take part in the object terms.

       The 7×7 negative mask 𝟙_ij^noobj: the complement of the positive mask.
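     Since the loss figure is not reproduced here, the full YOLO v1 loss from the paper is written out in LaTeX for reference:

\begin{aligned}
\mathcal{L} ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\Big] \\
 &+ \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\Big[(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\Big] \\
 &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}(C_i-\hat{C}_i)^2
   + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}(C_i-\hat{C}_i)^2 \\
 &+ \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\,\in\,\text{classes}}\big(p_i(c)-\hat{p}_i(c)\big)^2
\end{aligned}

     with \lambda_{coord}=5 and \lambda_{noobj}=0.5.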

     (1) Coordinate loss: the first part of the loss (the first two terms above) penalizes the coordinates of the responsible predicted bboxes. Two points are worth noting: first, the square roots of the width and height are used, which damps the loss for large objects and balances the loss between small and large objects (for the same absolute error, a small box incurs a larger penalty after the square root than a large one); second, a weight λ_coord = 5 is applied because very few positive samples take part in this term (only the handful of cells that actually contain object centers), so their contribution is boosted.

    (2) Confidence loss: the second part (the third and fourth terms above) is the confidence loss for positive and negative bboxes. Note the ground-truth confidence: for a positive sample the target is IoU(predicted box, ground truth) × 1, and for a negative sample it is IoU × 0 = 0. Because negatives far outnumber positives, their weight is set to λ_noobj = 0.5.

     

    (3) Classification loss: the third part (the last term above) is the loss on the predicted class. The prediction is the 20-dimensional class-probability vector output by the network, and the target is the one-hot encoding of the annotated class (viewed as a 20-way classification task: class 5 is encoded as 00001000000000000000).

   Main characteristics of YOLO v1:

   Advantages: (1) one-stage design, fast inference

   Disadvantages:

    (1) Poor at detecting crowded objects (each grid cell predicts only one object)

     (2) Weak on small objects, and generalizes poorly to objects with unusual aspect ratios

    (3) The network does not use batch normalization

Below is a PyTorch implementation of the YOLO v1 network and loss computation (not tested, for understanding only):

import torch
import torch.nn as nn
from torch.nn import functional
from torch.autograd import Variable
import torchvision.models as models


class YoloLoss(nn.Module):
    def __init__(self, n_batch, B, C, lambda_coord, lambda_noobj, use_gpu=False):
        """

        :param n_batch: number of batches
        :param B: number of bounding boxes
        :param C: number of bounding classes
        :param lambda_coord: factor for loss which contain objects
        :param lambda_noobj: factor for loss which do not contain objects
        """
        super(YoloLoss, self).__init__()
        self.n_batch = n_batch
        self.B = B # assume there are two bounding boxes
        self.C = C
        self.lambda_coord = lambda_coord
        self.lambda_noobj = lambda_noobj
        self.use_gpu = use_gpu

    def compute_iou(self, bbox1, bbox2):
        """
        Compute the intersection over union of two set of boxes, each box is [x1,y1,w,h]
        :param bbox1: (tensor) bounding boxes, size [N,4]
        :param bbox2: (tensor) bounding boxes, size [M,4]
        :return:
        """
        # compute [x1,y1,x2,y2] w.r.t. top left and bottom right coordinates separately
        b1x1y1 = bbox1[:,:2]-bbox1[:,2:]**2 # [N, (x1,y1)=2]
        b1x2y2 = bbox1[:,:2]+bbox1[:,2:]**2 # [N, (x2,y2)=2]
        b2x1y1 = bbox2[:,:2]-bbox2[:,2:]**2 # [M, (x1,y1)=2]
        b2x2y2 = bbox2[:,:2]+bbox2[:,2:]**2 # [M, (x1,y1)=2]
        box1 = torch.cat((b1x1y1.view(-1,2), b1x2y2.view(-1, 2)), dim=1) # [N,4], 4=[x1,y1,x2,y2]
        box2 = torch.cat((b2x1y1.view(-1,2), b2x2y2.view(-1, 2)), dim=1) # [M,4], 4=[x1,y1,x2,y2]
        N = box1.size(0)
        M = box2.size(0)

        tl = torch.max(
            box1[:,:2].unsqueeze(1).expand(N,M,2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:,:2].unsqueeze(0).expand(N,M,2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )
        br = torch.min(
            box1[:,2:].unsqueeze(1).expand(N,M,2),  # [N,2] -> [N,1,2] -> [N,M,2]
            box2[:,2:].unsqueeze(0).expand(N,M,2),  # [M,2] -> [1,M,2] -> [N,M,2]
        )

        wh = br - tl  # [N,M,2]
        wh[(wh<0).detach()] = 0
        #wh[wh<0] = 0
        inter = wh[:, :, 0] * wh[:, :, 1]  # [N,M]

        area1 = (box1[:,2]-box1[:,0]) * (box1[:,3]-box1[:,1])  # [N,]
        area2 = (box2[:,2]-box2[:,0]) * (box2[:,3]-box2[:,1])  # [M,]
        area1 = area1.unsqueeze(1).expand_as(inter)  # [N,] -> [N,1] -> [N,M]
        area2 = area2.unsqueeze(0).expand_as(inter)  # [M,] -> [1,M] -> [N,M]

        iou = inter / (area1 + area2 - inter)
        return iou

    def forward(self, pred_tensor, target_tensor):
        """

        :param pred_tensor: [batch,SxSx(Bx5+20))]
        :param target_tensor: [batch,S,S,Bx5+20]
        :return: total loss
        """
        n_elements = self.B * 5 + self.C
        batch = target_tensor.size(0)
        target_tensor = target_tensor.view(batch,-1,n_elements)
        #print(target_tensor.size())
        #print(pred_tensor.size())
        pred_tensor = pred_tensor.view(batch,-1,n_elements)
        coord_mask = target_tensor[:,:,5] > 0
        noobj_mask = target_tensor[:,:,5] == 0
        coord_mask = coord_mask.unsqueeze(-1).expand_as(target_tensor)
        noobj_mask = noobj_mask.unsqueeze(-1).expand_as(target_tensor)

        coord_target = target_tensor[coord_mask].view(-1,n_elements)
        coord_pred = pred_tensor[coord_mask].view(-1,n_elements)
        class_pred = coord_pred[:,self.B*5:]
        class_target = coord_target[:,self.B*5:]
        box_pred = coord_pred[:,:self.B*5].contiguous().view(-1,5)
        box_target = coord_target[:,:self.B*5].contiguous().view(-1,5)

        noobj_target = target_tensor[noobj_mask].view(-1,n_elements)
        noobj_pred = pred_tensor[noobj_mask].view(-1,n_elements)

        # compute loss which do not contain objects
        if self.use_gpu:
            noobj_target_mask = torch.cuda.ByteTensor(noobj_target.size())
        else:
            noobj_target_mask = torch.ByteTensor(noobj_target.size())
        noobj_target_mask.zero_()
        for i in range(self.B):
            noobj_target_mask[:,i*5+4] = 1
        noobj_target_c = noobj_target[noobj_target_mask] # only compute loss of c size [2*B*noobj_target.size(0)]
        noobj_pred_c = noobj_pred[noobj_target_mask]
        noobj_loss = functional.mse_loss(noobj_pred_c, noobj_target_c, size_average=False)

        # compute loss which contain objects
        if self.use_gpu:
            coord_response_mask = torch.cuda.ByteTensor(box_target.size())
            coord_not_response_mask = torch.cuda.ByteTensor(box_target.size())
        else:
            coord_response_mask = torch.ByteTensor(box_target.size())
            coord_not_response_mask = torch.ByteTensor(box_target.size())
        coord_response_mask.zero_()
        coord_not_response_mask = ~coord_not_response_mask.zero_()
        for i in range(0,box_target.size()[0],self.B):
            box1 = box_pred[i:i+self.B]
            box2 = box_target[i:i+self.B]
            iou = self.compute_iou(box1[:, :4], box2[:, :4])
            max_iou, max_index = iou.max(0)
            if self.use_gpu:
                max_index = max_index.data.cuda()
            else:
                max_index = max_index.data
            coord_response_mask[i+max_index]=1
            coord_not_response_mask[i+max_index]=0

        # 1. response loss
        box_pred_response = box_pred[coord_response_mask].view(-1, 5)
        box_target_response = box_target[coord_response_mask].view(-1, 5)
        contain_loss = functional.mse_loss(box_pred_response[:, 4], box_target_response[:, 4], size_average=False)
        loc_loss = functional.mse_loss(box_pred_response[:, :2], box_target_response[:, :2], size_average=False) +\
                   functional.mse_loss(box_pred_response[:, 2:4], box_target_response[:, 2:4], size_average=False)
        # 2. not response loss
        box_pred_not_response = box_pred[coord_not_response_mask].view(-1, 5)
        box_target_not_response = box_target[coord_not_response_mask].view(-1, 5)

        # compute class prediction loss
        class_loss = functional.mse_loss(class_pred, class_target, size_average=False)

        # compute total loss
        total_loss = self.lambda_coord * loc_loss + contain_loss + self.lambda_noobj * noobj_loss + class_loss
        return total_loss



def test():
    voc = False
    vot = 1-voc
    if voc:
        img_folder = '../codedata/voc2012train/JPEGImages'
        file = '../voc2012.txt'
        img_size = 448
        train_dataset = YoloDataset(img_folder=img_folder, file=file, img_size=img_size, S=7, B=2, C=20, transforms=[transforms.ToTensor()])
        train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, num_workers=0)
        train_iter = iter(train_loader)
        img, target = next(train_iter)
        print(target.size())
        target = Variable(target)
        img = Variable(img)
        net = YOLO_V1()
        pred = net(img)
        yololoss = YoloLoss(n_batch=2, B=2, C=20, lambda_coord=5, lambda_noobj=0.5)
        print(pred.size())
        print(target.size())
        loss = yololoss(pred, target)
        print(loss)

    if vot:
        img_folder = './small_train_dataset'
        bboxes = dd.io.load('girl_bbox_4dim.h5')
        learning_rate = 0.0005
        img_size = 224
        num_epochs = 2
        lambda_coord = 5
        lambda_noobj = .5
        n_batch = 5
        S = 7
        B = 2
        C = 1
        train_dataset = VotDataset(img_folder=img_folder, bboxes=bboxes, img_size=img_size, S=S, B=B, C=C,
                                   transforms=[transforms.ToTensor()])
        train_loader = DataLoader(train_dataset, batch_size=n_batch, shuffle=False, num_workers=2)
        yololoss = YoloLoss(n_batch=n_batch, B=B, C=C, lambda_coord=5, lambda_noobj=0.5)
        train_iter = iter(train_loader)
        img, target = next(train_iter)
        target = Variable(target)
        img = Variable(img)

        model = models.vgg16(pretrained=True)
        model.classifier = nn.Sequential(
            nn.Linear(512 * 7 * 7, 4096),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Dropout(),
            nn.Linear(4096, 11 * 7 * 7),
            nn.Sigmoid(),
        )
        model.train()

        loss_fn = YoloLoss(n_batch, B, C, lambda_coord, lambda_noobj)
        optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-4)

        use_gpu = False
        for epoch in range(num_epochs):
            for i, (images, target) in enumerate(train_loader):
                images = Variable(images)
                target = Variable(target)
                if use_gpu:
                    images, target = images.cuda(), target.cuda()

                pred = model(images)
                print(pred.size())
                print(target.size())
                loss = loss_fn(pred, target)
                print(i + 1, loss)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                if i == 10:
                    break
            break



if __name__=='__main__':
    from own_yolo_v1.network import *
    from own_yolo_v1.load_dataset import *
    test()
(Code above: YOLO v1 loss)
import torch.nn as nn


class Flatten(nn.Module):
    def __init__(self):
        super(Flatten, self).__init__()
    def forward(self, x):
        return x.view(x.size(0), -1)


class YOLO_V1(nn.Module):
    def __init__(self):
        super(YOLO_V1, self).__init__()
        C = 20  # number of classes
        print("\n------Initiating YOLO v1------\n")
        self.conv_layer1 = nn.Sequential(
            nn.Conv2d(in_channels=3, out_channels=64, kernel_size=7, stride=2, padding=7//2),
            nn.BatchNorm2d(64),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer2 = nn.Sequential(
            nn.Conv2d(in_channels=64, out_channels=192, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(192),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer3 = nn.Sequential(
            nn.Conv2d(in_channels=192, out_channels=128, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=256, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(512),
            nn.LeakyReLU(0.1),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer4 = nn.Sequential(
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=256, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=512, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.MaxPool2d(kernel_size=2, stride=2)
        )
        self.conv_layer5 = nn.Sequential(
            nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=512, kernel_size=1, stride=1, padding=1//2),
            nn.Conv2d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=2, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1),
        )
        self.conv_layer6 = nn.Sequential(
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.Conv2d(in_channels=1024, out_channels=1024, kernel_size=3, stride=1, padding=3//2),
            nn.BatchNorm2d(1024),
            nn.LeakyReLU(0.1)
        )
        self.flatten = Flatten()
        self.conn_layer1 = nn.Sequential(
            nn.Linear(in_features=7*7*1024, out_features=4096),
            nn.Dropout(),
            nn.LeakyReLU(0.1)
        )
        self.conn_layer2 = nn.Sequential(nn.Linear(in_features=4096, out_features=7 * 7 * (2 * 5 + C)))

    def forward(self, input):
        conv_layer1 = self.conv_layer1(input)
        conv_layer2 = self.conv_layer2(conv_layer1)
        conv_layer3 = self.conv_layer3(conv_layer2)
        conv_layer4 = self.conv_layer4(conv_layer3)
        conv_layer5 = self.conv_layer5(conv_layer4)
        conv_layer6 = self.conv_layer6(conv_layer5)
        flatten = self.flatten(conv_layer6)
        conn_layer1 = self.conn_layer1(flatten)
        output = self.conn_layer2(conn_layer1)
        return output


'''
def test():
    from own_yolo_v1.load_dataset import *
    from torch.autograd import Variable
    img_folder = '../codedata/voc2012train/JPEGImages'
    file = '../voc2012.txt'
    img_size = 448
    train_dataset = YoloDataset(img_folder=img_folder, file=file, img_size=img_size, transforms=[transforms.ToTensor()])
    train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, num_workers=0)
    train_iter = iter(train_loader)
    img, target = next(train_iter)
    img = Variable(img)
    net = YOLO_V1()
    output = net(img)
    print(output.size())


if __name__ == '__main__':
    test()
'''
(Code above: YOLO v1 network)

 

 2. YOLO v2

    YOLO v2 was proposed in the 2016 paper YOLO9000: Better, Faster, Stronger. It adopts a new backbone called Darknet-19, with 19 convolutional layers and 5 max-pooling layers, reducing computation by roughly 33% compared with YOLO v1. Its structure is as follows:

Structure pretrained on ImageNet:

 

 Model structure when training for the detection task (it introduces fusion of features at different scales):

 

 

YOLO v2 makes five main improvements over YOLO v1:

      (1) Adds Batch Normalization and removes dropout

      (2) High-resolution classifier

      (3) Introduces anchors

      (4) Fine-grained features (fusing low-level and high-level features)

      (5) Multi-scale training (training on images of different scales)

   2.1 High-resolution classifier (about +4% mAP)

    In YOLO v1 the classifier is pretrained on ImageNet at 224×224, while detection uses 448×448 images, so the network has to adapt to the new resolution. YOLO v2 therefore adds an extra fine-tuning step, as follows:

    a. Pretrain the classifier on ImageNet at 224×224, for about 160 epochs.

    b. Resize the ImageNet images to 448×448 and fine-tune for another 10 epochs so the model adapts to larger images.

    c. Using the pretrained weights above, fine-tune on the actual detection dataset at 416×416, giving a final 13×13 output.

  2.2 Anchors

   Borrowing the anchor idea from Faster R-CNN, the authors run k-means on the widths and heights of the objects in the VOC (and COCO) datasets and obtain 5 cluster centers, so 5 anchor sizes are chosen (the clustering distance metric is distance = 1 - IOU(bbox, cluster)):

COCO: (0.57273, 0.677385), (1.87446, 2.06253), (3.33843, 5.47434), (7.88282, 3.52778), (9.77052, 9.16828)
VOC: (1.3221, 1.73145), (3.19275, 4.00944), (5.05587, 8.09892), (9.47112, 4.84053), (11.2364, 10.0071)

  Each grid cell is therefore paired with 5 anchors of different widths and heights, as shown in the figure (the sizes above are relative to a grid cell; multiply by 32 to get the actual sizes). A small sketch of this IoU-based k-means follows.
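  A minimal sketch of the IoU-based k-means (not the authors' original script; boxes_wh below is hypothetical data, with each box given as a (width, height) pair in grid-cell units):

import numpy as np

def wh_iou(wh, centroids):
    # IoU between boxes given only (w, h), assuming they share a corner
    inter = np.minimum(wh[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(wh[:, None, 1], centroids[None, :, 1])
    union = wh[:, 0:1] * wh[:, 1:2] + (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union                                     # (N, k)

def anchor_kmeans(wh, k=5, iters=100):
    centroids = wh[np.random.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(wh_iou(wh, centroids), axis=1)    # min (1 - IoU) == max IoU
        new = []
        for i in range(k):
            members = wh[assign == i]
            new.append(members.mean(axis=0) if len(members) else centroids[i])
        centroids = np.array(new)
    return centroids

boxes_wh = np.random.rand(500, 2) * 13                       # hypothetical (w, h) pairs
print(anchor_kmeans(boxes_wh, k=5))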

 

  How the predicted bbox is computed (using 416×416 -> 13×13 as an example):

    (1) The input image is 416×416 and the final output is 13×13×125, where 125 = 5×(5 + 20): 5 is the number of anchors, and the 25 values per anchor are [x, y, w, h, confidence] plus 20 class probabilities; that is, each anchor predicts one such set of values.

    (2) Of the 25 values predicted per anchor, x and y are offsets relative to the top-left corner of the grid cell, squashed into 0-1 by a sigmoid. For example, on a 13×13 grid, for the cell with index (6, 6), the predicted x and y pass through the sigmoid to give xoffset and yoffset, and the actual center is x = 6 + xoffset, y = 6 + yoffset; since 0 < xoffset < 1 and 0 < yoffset < 1, the predicted center always stays inside cell (6, 6). The predicted w and h are relative to the anchor: the actual width and height are obtained by multiplying the anchor's (w, h) by the exponentiated predictions, i.e. b_w = anchor_w·e^(t_w), b_h = anchor_h·e^(t_h).

     (3) Since the above values are in 13×13 grid units, multiply by the downsampling factor 32 to map them back to the original image size.

  The actual computation in code:
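  (The post's original snippet is not included here; below is a minimal sketch of the decoding described above, with the raw outputs t_x, t_y, t_w, t_h chosen as arbitrary example values.)

import torch

stride = 32                           # 416 / 13
cx, cy = 6, 6                         # grid cell index used in the example above
p_w, p_h = 3.19275, 4.00944           # one of the VOC anchors, in grid-cell units

t_x, t_y, t_w, t_h = torch.tensor([0.2, -0.1, 0.3, 0.1])   # raw network outputs (example values)

b_x = cx + torch.sigmoid(t_x)         # center stays inside cell (6, 6)
b_y = cy + torch.sigmoid(t_y)
b_w = p_w * torch.exp(t_w)            # width/height relative to the anchor
b_h = p_h * torch.exp(t_h)

box_in_pixels = torch.stack([b_x, b_y, b_w, b_h]) * stride  # back to 416x416 coordinates
print(box_in_pixels)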

  2.3 Fine-Grained Features

    In the network architecture above there is a shortcut that concatenates a lower-level feature map (26×26×512) with the final feature map (13×13×1024), fusing low-level positional features with high-level semantic features. Because the 26×26 map is larger, the network uses a Reorg layer to reshape it down to 13×13, as sketched below:
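  A minimal sketch of the passthrough idea (a space-to-depth rearrangement; the exact channel ordering of Darknet's Reorg layer may differ):

import torch

x = torch.randn(1, 512, 26, 26)                   # low-level feature map
n, c, h, w = x.shape
s = 2
x = x.view(n, c, h // s, s, w // s, s)            # split each 2x2 spatial block
x = x.permute(0, 3, 5, 1, 2, 4).contiguous()      # move the block offsets into channels
x = x.view(n, c * s * s, h // s, w // s)          # -> (1, 2048, 13, 13)

deep = torch.randn(1, 1024, 13, 13)               # high-level feature map
fused = torch.cat([x, deep], dim=1)               # -> (1, 3072, 13, 13)
print(fused.shape)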

 

   2.4 Multi-scale Training

    In the architecture above, the last layer (Conv22) is a 1×1×125 convolution used in place of a fully connected layer, so the network can handle inputs of any size. During training, every 10 batches the authors pick a new input size from [320×320, 352×352, ..., 608×608] (all multiples of 32, since the output is downsampled 32×) to make the model more robust. (With a 416×416 input the output is 13×13×125; with a 320×320 input the output is 10×10×125.)
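A small sketch of the scale-switching logic (an assumption about how it can be written, not the authors' training loop):

import random

scales = [320 + 32 * i for i in range(10)]        # 320, 352, ..., 608
size = random.choice(scales)                      # a new size is picked periodically during training
grid = size // 32                                 # output is grid x grid x 125
print(size, grid)                                 # e.g. 416 -> 13, 320 -> 10, 608 -> 19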

Characteristics of YOLO v2:

   (1) Uses the Darknet-19 backbone, with fewer layers than YOLO v1 and no fully connected layers, so it needs less computation and runs faster;

  (2) Replacing the fully connected layers with convolutions removes the constraint on input size, and multi-scale training makes detection more robust across image scales;

   (3) Each cell predicts with 5 anchor boxes, which is more effective for crowded and small objects.

3. YOLO9000

  YOLO9000 was proposed in the same paper as YOLO v2. Built on top of YOLOv2, it can detect more than 9000 categories; its main contribution is a joint training strategy for classification and detection. For details see: https://zhuanlan.zhihu.com/p/35325884

4. YOLO v3

   YOLO v3 was proposed in the 2018 paper YOLOv3: An Incremental Improvement. Its backbone is DarkNet-53, shown below (it borrows ideas from ResNet and FPN). Each grid cell in YOLO v3 predicts 3 anchor boxes; each box has the 5 basic values (x, y, w, h, confidence) plus 80 class probabilities (COCO dataset), so the output depth is 3×(5 + 80) = 255 (the depth of y1, y2, and y3 is 255).
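  The per-scale output shapes can be checked with a couple of lines (for a 416×416 COCO input):

num_anchors, num_classes = 3, 80
depth = num_anchors * (5 + num_classes)           # 3 * (5 + 80) = 255
for stride in (32, 16, 8):                        # the three detection scales y1, y2, y3
    g = 416 // stride
    print((g, g, depth))                          # (13, 13, 255), (26, 26, 255), (52, 52, 255)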

  Compared with ResNet, the residual block used in Darknet looks like this:

 

  Following the FPN idea, feature maps of different scales are fused and predictions are made at each scale, as shown below:

  Like v2, the backbone of yolo_v3 downsamples the feature map to 1/32 of the input, so the input image is usually required to be a multiple of 32. A comparison of DarkNet-53 in YOLO v3 and DarkNet-19 in YOLO v2 is shown below:

 A PyTorch implementation of YOLO v3 for reference:

from __future__ import division

from models import *
# from utils.logger import *
from utils.utils import *
from utils.datasets import *
from utils.parse_config import *
from test import evaluate

# from terminaltables import AsciiTable

import os
import sys
import time
import datetime
import argparse

import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision import transforms
from torch.autograd import Variable
import torch.optim as optim

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=100, help="number of epochs")
    parser.add_argument("--batch_size", type=int, default=8, help="size of each image batch")
    parser.add_argument("--gradient_accumulations", type=int, default=2, help="number of gradient accums before step")
    parser.add_argument("--model_def", type=str, default="config/yolov3.cfg", help="path to model definition file")
    parser.add_argument("--data_config", type=str, default="config/coco.data", help="path to data config file")
    parser.add_argument("--pretrained_weights", type=str, help="if specified starts from checkpoint model")
    parser.add_argument("--n_cpu", type=int, default=8, help="number of cpu threads to use during batch generation")
    parser.add_argument("--img_size", type=int, default=416, help="size of each image dimension")
    parser.add_argument("--checkpoint_interval", type=int, default=1, help="interval between saving model weights")
    parser.add_argument("--evaluation_interval", type=int, default=1, help="interval evaluations on validation set")
    parser.add_argument("--compute_map", default=False, help="if True computes mAP every tenth batch")
    parser.add_argument("--multiscale_training", default=True, help="allow for multi-scale training")
    opt = parser.parse_args()
    print(opt)

    # logger = Logger("logs")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # os.makedirs("output", exist_ok=True)
    # os.makedirs("checkpoints", exist_ok=True)

    # Get data configuration
    data_config = parse_data_config(opt.data_config)
    train_path = data_config["train"]
    valid_path = data_config["valid"]
    class_names = load_classes(data_config["names"])

    # Initiate model
    model = Darknet(opt.model_def).to(device)
    model.apply(weights_init_normal)

    # If specified we start from checkpoint
    if opt.pretrained_weights:
        if opt.pretrained_weights.endswith(".pth"):
            model.load_state_dict(torch.load(opt.pretrained_weights))
        else:
            model.load_darknet_weights(opt.pretrained_weights)

    # model = torch.nn.DataParallel(model).cuda()

    # Get dataloader
    dataset = ListDataset(train_path, augment=True, multiscale=opt.multiscale_training)
    dataloader = torch.utils.data.DataLoader(
        dataset,
        batch_size=opt.batch_size,
        shuffle=True,
        num_workers=opt.n_cpu,
        pin_memory=True,
        collate_fn=dataset.collate_fn,
    )

    optimizer = torch.optim.Adam(model.parameters())

    metrics = [
        "grid_size",
        "loss",
        "x",
        "y",
        "w",
        "h",
        "conf",
        "cls",
        "cls_acc",
        "recall50",
        "recall75",
        "precision",
        "conf_obj",
        "conf_noobj",
    ]

    for epoch in range(opt.epochs):
        model.train()
        start_time = time.time()
        for batch_i, (img_pth, imgs, targets) in enumerate(dataloader):
            batches_done = len(dataloader) * epoch + batch_i
            # img: (batch_size, channel, height, width)
            # target: (num, 6)  6=>(batch_index, cls, center_x, center_y, widht, height)
            imgs = Variable(imgs.to(device))
            targets = Variable(targets.to(device), requires_grad=False)

            loss, outputs = model(imgs, targets)
            loss.backward()

            if batches_done % opt.gradient_accumulations:
                # Accumulates gradient before each step
                optimizer.step()
                optimizer.zero_grad()

            # ----------------
            #   Log progress
            # ----------------

            log_str = "\n---- [Epoch %d/%d, Batch %d/%d] ----\n" % (epoch, opt.epochs, batch_i, len(dataloader))

            # metric_table = [["Metrics", *["YOLO Layer {}".format(i) for i in range(len(model.yolo_layers))]]]

            # Log metrics at each YOLO layer
            for i, metric in enumerate(metrics):
                formats = {m: "%.6f" for m in metrics}
                formats["grid_size"] = "%2d"
                formats["cls_acc"] = "%.2f%%"
                row_metrics = [formats[metric] % yolo.metrics.get(metric, 0) for yolo in model.yolo_layers]
                # metric_table += [[metric, *row_metrics]]

                # Tensorboard logging
                tensorboard_log = []
                for j, yolo in enumerate(model.yolo_layers):
                    for name, metric in yolo.metrics.items():
                        if name != "grid_size":
                            tensorboard_log += [("{}_{}".format(name, j+1), metric)]
                tensorboard_log += [("loss", loss.item())]
                # logger.list_of_scalars_summary(tensorboard_log, batches_done)

            # log_str += AsciiTable(metric_table).table
            log_str += "\nTotal loss {}".format(loss.item())

            # Determine approximate time left for epoch
            epoch_batches_left = len(dataloader) - (batch_i + 1)
            time_left = datetime.timedelta(seconds=epoch_batches_left * (time.time() - start_time) / (batch_i + 1))
            log_str += "\n---- ETA {}".format(time_left)

            print(log_str)

            model.seen += imgs.size(0)

            # if batch_i > 10:
            #     break

        if epoch % opt.evaluation_interval == 0:
            print("\n---- Evaluating Model ----")
            # Evaluate the model on the validation set
            precision, recall, AP, f1, ap_class = evaluate(
                model,
                path=valid_path,
                iou_thres=0.5,
                conf_thres=0.5,
                nms_thres=0.5,
                img_size=opt.img_size,
                batch_size=8,
            )
            evaluation_metrics = [
                ("val_precision", precision.mean()),
                ("val_recall", recall.mean()),
                ("val_mAP", AP.mean()),
                ("val_f1", f1.mean()),
            ]
            # logger.list_of_scalars_summary(evaluation_metrics, epoch)

            # Print class APs and mAP
            ap_table = [["Index", "Class name", "AP"]]
            for i, c in enumerate(ap_class):
                ap_table += [[c, class_names[c], "%.5f" % AP[i]]]
            # print(AsciiTable(ap_table).table)
            print("---- mAP {}".format(AP.mean()))

        if epoch % opt.checkpoint_interval == 0:
            torch.save(model.state_dict(), "checkpoints/yolov3_ckpt_%d.pth" % epoch)
(Code above: YOLO v3 training script, train.py)
# -*- coding: utf-8 -*-
from __future__ import division

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np

from utils.parse_config import *
from utils.utils import build_targets, to_cpu, non_max_suppression

# import matplotlib.pyplot as plt
# import matplotlib.patches as patches


def create_modules(module_defs):
    """
    Constructs module list of layer blocks from module configuration in module_defs
    """
    hyperparams = module_defs.pop(0)
    output_filters = [int(hyperparams["channels"])]
    module_list = nn.ModuleList()
    for module_i, module_def in enumerate(module_defs):
        modules = nn.Sequential()

        if module_def["type"] == "convolutional":
            bn = int(module_def["batch_normalize"])
            filters = int(module_def["filters"])
            kernel_size = int(module_def["size"])
            pad = (kernel_size - 1) // 2
            modules.add_module(
                "conv_{}".format(module_i),
                nn.Conv2d(
                    in_channels=output_filters[-1],
                    out_channels=filters,
                    kernel_size=kernel_size,
                    stride=int(module_def["stride"]),
                    padding=pad,
                    bias=not bn,
                ),
            )
            if bn:
                modules.add_module("batch_norm_{}".format(module_i), nn.BatchNorm2d(filters, momentum=0.9, eps=1e-5))
            if module_def["activation"] == "leaky":
                modules.add_module("leaky_{}".format(module_i), nn.LeakyReLU(0.1))

        elif module_def["type"] == "maxpool":
            kernel_size = int(module_def["size"])
            stride = int(module_def["stride"])
            if kernel_size == 2 and stride == 1:
                modules.add_module("_debug_padding_{}".format(module_i), nn.ZeroPad2d((0, 1, 0, 1)))
            maxpool = nn.MaxPool2d(kernel_size=kernel_size, stride=stride, padding=int((kernel_size - 1) // 2))
            modules.add_module("maxpool_{}".format(module_i), maxpool)

        elif module_def["type"] == "upsample":
            upsample = Upsample(scale_factor=int(module_def["stride"]), mode="nearest")
            modules.add_module("upsample_{}".format(module_i), upsample)

        elif module_def["type"] == "route":
            layers = [int(x) for x in module_def["layers"].split(",")]
            filters = sum([output_filters[1:][i] for i in layers])
            modules.add_module("route_{}".format(module_i), EmptyLayer())

        elif module_def["type"] == "shortcut":
            filters = output_filters[1:][int(module_def["from"])]
            modules.add_module("shortcut_{}".format(module_i), EmptyLayer())

        elif module_def["type"] == "yolo":
            anchor_idxs = [int(x) for x in module_def["mask"].split(",")]
            # Extract anchors
            anchors = [int(x) for x in module_def["anchors"].split(",")]
            anchors = [(anchors[i], anchors[i + 1]) for i in range(0, len(anchors), 2)]
            anchors = [anchors[i] for i in anchor_idxs]
            num_classes = int(module_def["classes"])
            img_size = int(hyperparams["height"])
            # Define detection layer
            yolo_layer = YOLOLayer(anchors, num_classes, img_size)
            modules.add_module("yolo_{}".format(module_i), yolo_layer)
        # Register module list and number of output filters
        module_list.append(modules)
        output_filters.append(filters)

    return hyperparams, module_list


class Upsample(nn.Module):
    """ nn.Upsample is deprecated """

    def __init__(self, scale_factor, mode="nearest"):
        super(Upsample, self).__init__()
        self.scale_factor = scale_factor
        self.mode = mode

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale_factor, mode=self.mode)
        return x


class EmptyLayer(nn.Module):
    """Placeholder for 'route' and 'shortcut' layers"""

    def __init__(self):
        super(EmptyLayer, self).__init__()


class YOLOLayer(nn.Module):
    """Detection layer"""

    def __init__(self, anchors, num_classes, img_dim=416):
        super(YOLOLayer, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.ignore_thres = 0.5
        self.mse_loss = nn.MSELoss()
        self.bce_loss = nn.BCELoss()
        self.obj_scale = 1
        self.noobj_scale = 100
        self.metrics = {}
        self.img_dim = img_dim
        self.grid_size = 0  # grid size

    def compute_grid_offsets(self, grid_size, cuda=True):
        self.grid_size = grid_size
        g = self.grid_size
        FloatTensor = torch.cuda.FloatTensor if cuda else torch.FloatTensor
        self.stride = self.img_dim / self.grid_size  # downsampling factor of this feature map
        # Calculate offsets for each grid
        # grid_x, grid_y(1, 1, gride, gride)
        self.grid_x = torch.arange(g).repeat(g, 1).view([1, 1, g, g]).type(FloatTensor)
        self.grid_y = torch.arange(g).repeat(g, 1).t().view([1, 1, g, g]).type(FloatTensor)

        # the anchors are scaled down by the same factor as the image
        self.scaled_anchors = FloatTensor([(a_w / self.stride, a_h / self.stride) for a_w, a_h in self.anchors])

        # scaled_anchors has shape (3, 2): 3 anchors, each with a w and an h. The steps below split the two apart.
        self.anchor_w = self.scaled_anchors[:, 0:1].view((1, self.num_anchors, 1, 1))  # (1, num_anchors, 1, 1)
        self.anchor_h = self.scaled_anchors[:, 1:2].view((1, self.num_anchors, 1, 1))  # (1, num_anchors, 1, 1)

    def forward(self, x, targets=None, img_dim=None):

        # Tensors for cuda support
        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor
        ByteTensor = torch.cuda.ByteTensor if x.is_cuda else torch.ByteTensor

        self.img_dim = img_dim  # (img_size)
        num_samples = x.size(0)  # (img_batch)
        grid_size = x.size(2)  # (feature_map_size)
        # print x.shape  # (batch_size, 255, grid_size, grid_size)

        prediction = (
            x.view(num_samples, self.num_anchors, 5 + self.num_classes, grid_size, grid_size)
            .permute(0, 1, 3, 4, 2)
            .contiguous()
        )
        # print prediction.shape (batch_size, num_anchors, grid_size, grid_size, 85)

        # Get outputs
        # prediction holds the raw predictions: every cell of the grid_size x grid_size grid has num_anchors (3) anchor boxes
        # x, y, w, h and pred_conf all share the shape (batch_size, num_anchors, grid_size, grid_size)
        x = torch.sigmoid(prediction[..., 0])  # Center x
        y = torch.sigmoid(prediction[..., 1])  # Center y
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height
        pred_conf = torch.sigmoid(prediction[..., 4])  # Conf
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred. (batch_size, num_anchor, gride_size, grid_size, cls)

        # If grid size does not match current we compute new offsets
        # print grid_size, self.grid_size
        if grid_size != self.grid_size:
            self.compute_grid_offsets(grid_size, cuda=x.is_cuda)

        # print self.grid_x, self.grid_y, self.anchor_w, self.anchor_h
        # Add offset and scale with anchors
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        # per-cell offsets: each cell has unit length 1 and the predicted center (x, y) lies in (0, 1), so they can be added directly
        pred_boxes[..., 0] = x.data + self.grid_x  # (1, 1, gride, gride)
        pred_boxes[..., 1] = y.data + self.grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * self.anchor_w  # # (1311)
        pred_boxes[..., 3] = torch.exp(h.data) * self.anchor_h

        # (batch_size, num_anchors*grid_size*grid_size, 85)
        output = torch.cat(
            (
                # (batch_size, num_anchors*grid_size*grid_size, 4)
                pred_boxes.view(num_samples, -1, 4) * self.stride,  # scale back up to the original input size
                # (batch_size, num_anchors*grid_size*grid_size, 1)
                pred_conf.view(num_samples, -1, 1),
                # (batch_size, num_anchors*grid_size*grid_size, 80)
                pred_cls.view(num_samples, -1, self.num_classes),
            ),
            -1,
        )

        if targets is None:
            return output, 0
        else:
            # pred_boxes => (batch_size, anchor_num, gride, gride, 4)
            # pred_cls => (batch_size, anchor_num, gride, gride, 80)
            # targets => (num, 6)  6=>(batch_index, cls, center_x, center_y, widht, height)
            # scaled_anchors => (3, 2)
            # print pred_boxes.shape, pred_cls.shape, targets.shape, self.scaled_anchors.shape
            iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
                pred_boxes=pred_boxes,
                pred_cls=pred_cls,
                target=targets,
                anchors=self.scaled_anchors,
                ignore_thres=self.ignore_thres,
            )
            # iou_scores: IoU between the responsible predicted boxes in pred_boxes and the ground-truth boxes target_boxes; higher IoU means a higher score
            # class_mask: 1 where the prediction is correct (right cell for the object center, best-matching anchor, and right class)
            # obj_mask: 1 at the anchor assigned to each ground-truth box (one anchor per ground-truth box)
            # noobj_mask: 1 at all anchors whose IoU with every ground-truth box is below the threshold
            # tx, ty, tw, th: the box coordinates and sizes the model has to regress
            # tcls: the class of each ground-truth box
            # tconf: target confidence for every anchor

            # iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th and tconf all have shape (batch, num_anchors, grid, grid)
            # the predicted x, y, w, h and pred_conf are also (batch, num_anchors, grid, grid)

            # tcls and pred_cls are both (batch, num_anchors, grid, grid, num_classes)


            # Loss : Mask outputs to ignore non-existing objects (except with conf. loss)
            # coordinate and size losses:
            loss_x = self.mse_loss(x[obj_mask], tx[obj_mask])
            loss_y = self.mse_loss(y[obj_mask], ty[obj_mask])
            loss_w = self.mse_loss(w[obj_mask], tw[obj_mask])
            loss_h = self.mse_loss(h[obj_mask], th[obj_mask])
            # objectness (confidence) losses:
            loss_conf_obj = self.bce_loss(pred_conf[obj_mask], tconf[obj_mask])  # tconf[obj_mask] is all ones
            loss_conf_noobj = self.bce_loss(pred_conf[noobj_mask], tconf[noobj_mask])  # tconf[noobj_mask] is all zeros
            loss_conf = self.obj_scale * loss_conf_obj + self.noobj_scale * loss_conf_noobj
            # classification loss
            loss_cls = self.bce_loss(pred_cls[obj_mask], tcls[obj_mask])

            # total loss
            total_loss = loss_x + loss_y + loss_w + loss_h + loss_conf + loss_cls

            # Metrics
            cls_acc = 100 * class_mask[obj_mask].mean()
            conf_obj = pred_conf[obj_mask].mean()
            conf_noobj = pred_conf[noobj_mask].mean()
            conf50 = (pred_conf > 0.5).float()
            iou50 = (iou_scores > 0.5).float()
            iou75 = (iou_scores > 0.75).float()
            detected_mask = conf50 * class_mask * tconf

            obj_mask = obj_mask.float()

            # print type(iou50), type(detected_mask), type(conf50.sum()), type(iou75), type(obj_mask)
            #
            # print iou50.dtype, detected_mask.dtype, conf50.sum().dtype, iou75.dtype, obj_mask.dtype
            precision = torch.sum(iou50 * detected_mask) / (conf50.sum() + 1e-16)
            recall50 = torch.sum(iou50 * detected_mask) / (obj_mask.sum() + 1e-16)
            recall75 = torch.sum(iou75 * detected_mask) / (obj_mask.sum() + 1e-16)

            self.metrics = {
                "loss": to_cpu(total_loss).item(),
                "x": to_cpu(loss_x).item(),
                "y": to_cpu(loss_y).item(),
                "w": to_cpu(loss_w).item(),
                "h": to_cpu(loss_h).item(),
                "conf": to_cpu(loss_conf).item(),
                "cls": to_cpu(loss_cls).item(),
                "cls_acc": to_cpu(cls_acc).item(),
                "recall50": to_cpu(recall50).item(),
                "recall75": to_cpu(recall75).item(),
                "precision": to_cpu(precision).item(),
                "conf_obj": to_cpu(conf_obj).item(),
                "conf_noobj": to_cpu(conf_noobj).item(),
                "grid_size": grid_size,
            }

            return output, total_loss


class Darknet(nn.Module):
    """YOLOv3 object detection model"""

    def __init__(self, config_path, img_size=416):
        super(Darknet, self).__init__()
        self.module_defs = parse_model_config(config_path)
        self.hyperparams, self.module_list = create_modules(self.module_defs)
        self.yolo_layers = [layer[0] for layer in self.module_list if hasattr(layer[0], "metrics")]
        self.img_size = img_size
        self.seen = 0
        self.header_info = np.array([0, 0, 0, self.seen, 0], dtype=np.int32)

    def forward(self, x, targets=None):
        img_dim = x.shape[2]
        loss = 0
        layer_outputs, yolo_outputs = [], []
        for i, (module_def, module) in enumerate(zip(self.module_defs, self.module_list)):
            if module_def["type"] in ["convolutional", "upsample", "maxpool"]:
                x = module(x)
            elif module_def["type"] == "route":
                x = torch.cat([layer_outputs[int(layer_i)] for layer_i in module_def["layers"].split(",")], 1)
            elif module_def["type"] == "shortcut":
                layer_i = int(module_def["from"])
                x = layer_outputs[-1] + layer_outputs[layer_i]
            elif module_def["type"] == "yolo":
                x, layer_loss = module[0](x, targets, img_dim)
                loss += layer_loss
                yolo_outputs.append(x)
            layer_outputs.append(x)
        yolo_outputs = to_cpu(torch.cat(yolo_outputs, 1))
        return yolo_outputs if targets is None else (loss, yolo_outputs)

    def load_darknet_weights(self, weights_path):
        """Parses and loads the weights stored in 'weights_path'"""

        # Open the weights file
        with open(weights_path, "rb") as f:
            header = np.fromfile(f, dtype=np.int32, count=5)  # First five are header values
            self.header_info = header  # Needed to write header when saving weights
            self.seen = header[3]  # number of images seen during training
            weights = np.fromfile(f, dtype=np.float32)  # The rest are weights

        # Establish cutoff for loading backbone weights
        cutoff = None
        if "darknet53.conv.74" in weights_path:
            cutoff = 75

        ptr = 0
        for i, (module_def, module) in enumerate(zip(self.module_defs, self.module_list)):
            if i == cutoff:
                break
            if module_def["type"] == "convolutional":
                conv_layer = module[0]
                if module_def["batch_normalize"]:
                    # Load BN bias, weights, running mean and running variance
                    bn_layer = module[1]
                    num_b = bn_layer.bias.numel()  # Number of biases
                    # Bias
                    bn_b = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.bias)
                    bn_layer.bias.data.copy_(bn_b)
                    ptr += num_b
                    # Weight
                    bn_w = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.weight)
                    bn_layer.weight.data.copy_(bn_w)
                    ptr += num_b
                    # Running Mean
                    bn_rm = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.running_mean)
                    bn_layer.running_mean.data.copy_(bn_rm)
                    ptr += num_b
                    # Running Var
                    bn_rv = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(bn_layer.running_var)
                    bn_layer.running_var.data.copy_(bn_rv)
                    ptr += num_b
                else:
                    # Load conv. bias
                    num_b = conv_layer.bias.numel()
                    conv_b = torch.from_numpy(weights[ptr : ptr + num_b]).view_as(conv_layer.bias)
                    conv_layer.bias.data.copy_(conv_b)
                    ptr += num_b
                # Load conv. weights
                num_w = conv_layer.weight.numel()
                conv_w = torch.from_numpy(weights[ptr : ptr + num_w]).view_as(conv_layer.weight)
                conv_layer.weight.data.copy_(conv_w)
                ptr += num_w

    def save_darknet_weights(self, path, cutoff=-1):
        """
            @:param path    - path of the new weights file
            @:param cutoff  - save layers between 0 and cutoff (cutoff = -1 -> all are saved)
        """
        fp = open(path, "wb")
        self.header_info[3] = self.seen
        self.header_info.tofile(fp)

        # Iterate through layers
        for i, (module_def, module) in enumerate(zip(self.module_defs[:cutoff], self.module_list[:cutoff])):
            if module_def["type"] == "convolutional":
                conv_layer = module[0]
                # If batch norm, load bn first
                if module_def["batch_normalize"]:
                    bn_layer = module[1]
                    bn_layer.bias.data.cpu().numpy().tofile(fp)
                    bn_layer.weight.data.cpu().numpy().tofile(fp)
                    bn_layer.running_mean.data.cpu().numpy().tofile(fp)
                    bn_layer.running_var.data.cpu().numpy().tofile(fp)
                # Load conv bias
                else:
                    conv_layer.bias.data.cpu().numpy().tofile(fp)
                # Load conv weights
                conv_layer.weight.data.cpu().numpy().tofile(fp)

        fp.close()
(Code above: YOLO v3 model definition, models.py)
# coding:utf8
from __future__ import division
import math
import time
import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import numpy as np
# import matplotlib.pyplot as plt
# import matplotlib.patches as patches


def to_cpu(tensor):
    return tensor.detach().cpu()


def load_classes(path):
    """
    Loads class labels at 'path'
    """
    fp = open(path, "r")
    names = fp.read().split("\n")[:-1]
    return names


def weights_init_normal(m):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        torch.nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find("BatchNorm2d") != -1:
        torch.nn.init.normal_(m.weight.data, 1.0, 0.02)
        torch.nn.init.constant_(m.bias.data, 0.0)


def rescale_boxes(boxes, current_dim, original_shape):
    """ Rescales bounding boxes to the original shape """
    orig_h, orig_w = original_shape
    # The amount of padding that was added
    pad_x = max(orig_h - orig_w, 0) * (current_dim / max(original_shape))
    pad_y = max(orig_w - orig_h, 0) * (current_dim / max(original_shape))
    # Image height and width after padding is removed
    unpad_h = current_dim - pad_y
    unpad_w = current_dim - pad_x
    # Rescale bounding boxes to dimension of original image
    boxes[:, 0] = ((boxes[:, 0] - pad_x // 2) / unpad_w) * orig_w
    boxes[:, 1] = ((boxes[:, 1] - pad_y // 2) / unpad_h) * orig_h
    boxes[:, 2] = ((boxes[:, 2] - pad_x // 2) / unpad_w) * orig_w
    boxes[:, 3] = ((boxes[:, 3] - pad_y // 2) / unpad_h) * orig_h
    return boxes


def xywh2xyxy(x):
    y = x.new(x.shape)
    y[..., 0] = x[..., 0] - x[..., 2] / 2
    y[..., 1] = x[..., 1] - x[..., 3] / 2
    y[..., 2] = x[..., 0] + x[..., 2] / 2
    y[..., 3] = x[..., 1] + x[..., 3] / 2
    return y


def ap_per_class(tp, conf, pred_cls, target_cls):
    """ Compute the average precision, given the recall and precision curves.
    Source: https://github.com/rafaelpadilla/Object-Detection-Metrics.
    # Arguments
        tp:    True positives (list).
        conf:  Objectness value from 0-1 (list).
        pred_cls: Predicted object classes (list).
        target_cls: True object classes (list).
    # Returns
        The average precision as computed in py-faster-rcnn.
    """

    # Sort by objectness
    i = np.argsort(-conf)
    tp, conf, pred_cls = tp[i], conf[i], pred_cls[i]

    # Find unique classes
    unique_classes = np.unique(target_cls)

    # Create Precision-Recall curve and compute AP for each class
    ap, p, r = [], [], []
    for c in tqdm.tqdm(unique_classes, desc="Computing AP"):
        i = pred_cls == c
        n_gt = (target_cls == c).sum()  # Number of ground truth objects
        n_p = i.sum()  # Number of predicted objects

        if n_p == 0 and n_gt == 0:
            continue
        elif n_p == 0 or n_gt == 0:
            ap.append(0)
            r.append(0)
            p.append(0)
        else:
            # Accumulate FPs and TPs
            fpc = (1 - tp[i]).cumsum()
            tpc = (tp[i]).cumsum()

            # Recall
            recall_curve = tpc / (n_gt + 1e-16)
            r.append(recall_curve[-1])

            # Precision
            precision_curve = tpc / (tpc + fpc)
            p.append(precision_curve[-1])

            # AP from recall-precision curve
            ap.append(compute_ap(recall_curve, precision_curve))

    # Compute F1 score (harmonic mean of precision and recall)
    p, r, ap = np.array(p), np.array(r), np.array(ap)
    f1 = 2 * p * r / (p + r + 1e-16)

    return p, r, ap, f1, unique_classes.astype("int32")


def compute_ap(recall, precision):
    """ Compute the average precision, given the recall and precision curves.
    Code originally from https://github.com/rbgirshick/py-faster-rcnn.

    # Arguments
        recall:    The recall curve (list).
        precision: The precision curve (list).
    # Returns
        The average precision as computed in py-faster-rcnn.
    """
    # correct AP calculation
    # first append sentinel values at the end
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))

    # compute the precision envelope
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

    # to calculate area under PR curve, look for points
    # where X axis (recall) changes value
    i = np.where(mrec[1:] != mrec[:-1])[0]

    # and sum (\Delta recall) * prec
    ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap


def get_batch_statistics(outputs, targets, iou_threshold):
    """ Compute true positives, predicted scores and predicted labels per sample """
    batch_metrics = []

    # print "outputs len: {}".format(len(outputs))
    # print "targets shape: {}".format(targets.shape)
    # outputs: (batch_size, pred_boxes_num, 7) 7 =》x,y,w,h,conf,class_conf,class_pred
    # target:  (num, 6)  6=>(batch_index, cls, center_x, center_y, widht, height)
    for sample_i in range(len(outputs)):

        if outputs[sample_i] is None:
            continue

        # output: (pred_boxes_num, 7) 7 =》x,y,w,h,conf,class_conf,class_pred
        output = outputs[sample_i]
        # print "output: {}".format(output.shape)

        pred_boxes = output[:, :4]  # predicted boxes' x, y, w, h
        pred_scores = output[:, 4]  # predicted boxes' confidence
        pred_labels = output[:, -1]  # predicted boxes' class label

        # array of length pred_boxes_num, initialised to 0; set to 1 where a prediction matches a ground-truth box
        true_positives = np.zeros(pred_boxes.shape[0])

        # get the class labels of the ground-truth boxes
        # annotations = targets[targets[:, 0] == sample_i][:, 1:]
        annotations = targets[targets[:, 0] == sample_i]
        annotations = annotations[:, 1:] if len(annotations) else []
        target_labels = annotations[:, 0] if len(annotations) else []

        if len(annotations):  # len(annotations) > 0 means this image has ground-truth boxes
            detected_boxes = []
            target_boxes = annotations[:, 1:]  # ground-truth boxes' x, y, w, h

            for pred_i, (pred_box, pred_label) in enumerate(zip(pred_boxes, pred_labels)):

                # If targets are found break
                if len(detected_boxes) == len(annotations):
                    break

                # Ignore if label is not one of the target labels
                # if the predicted label is not among the ground-truth labels, the prediction must be wrong
                if pred_label not in target_labels:
                    continue

                # compute IoU between this predicted box and every ground-truth box; take the largest IoU (iou) and the index of the matching ground-truth box (box_index)
                iou, box_index = bbox_iou(pred_box.unsqueeze(0), target_boxes).max(0)
                # if the largest IoU exceeds the threshold, that ground-truth box counts as detected; avoid counting it twice
                if iou >= iou_threshold and box_index not in detected_boxes:
                    true_positives[pred_i] = 1  # mark this prediction as a true positive
                    detected_boxes += [box_index]  # record detected ground-truth indices so each ground-truth box is matched by at most one prediction
        # save the detection statistics for the current image
        # true_positives: 1 if the predicted box is correct, 0 otherwise
        # pred_scores: confidence of each predicted box
        # pred_labels: class label of each predicted box
        batch_metrics.append([true_positives, pred_scores, pred_labels])
    return batch_metrics


def bbox_wh_iou(wh1, wh2):
    wh2 = wh2.t()
    w1, h1 = wh1[0], wh1[1]
    w2, h2 = wh2[0], wh2[1]
    # print w1, w2, h1, h2

    inter_area = torch.min(w1, w2) * torch.min(h1, h2)
    union_area = (w1 * h1 + 1e-16) + w2 * h2 - inter_area
    # print inter_area, union_area
    return inter_area / union_area


def bbox_iou(box1, box2, x1y1x2y2=True):
    """
    Returns the IoU of two bounding boxes
    """
    if not x1y1x2y2:
        # Transform from center and width to exact coordinates
        b1_x1, b1_x2 = box1[:, 0] - box1[:, 2] / 2, box1[:, 0] + box1[:, 2] / 2
        b1_y1, b1_y2 = box1[:, 1] - box1[:, 3] / 2, box1[:, 1] + box1[:, 3] / 2
        b2_x1, b2_x2 = box2[:, 0] - box2[:, 2] / 2, box2[:, 0] + box2[:, 2] / 2
        b2_y1, b2_y2 = box2[:, 1] - box2[:, 3] / 2, box2[:, 1] + box2[:, 3] / 2
    else:
        # Get the coordinates of bounding boxes
        b1_x1, b1_y1, b1_x2, b1_y2 = box1[:, 0], box1[:, 1], box1[:, 2], box1[:, 3]
        b2_x1, b2_y1, b2_x2, b2_y2 = box2[:, 0], box2[:, 1], box2[:, 2], box2[:, 3]

    # get the corrdinates of the intersection rectangle
    inter_rect_x1 = torch.max(b1_x1, b2_x1)
    inter_rect_y1 = torch.max(b1_y1, b2_y1)
    inter_rect_x2 = torch.min(b1_x2, b2_x2)
    inter_rect_y2 = torch.min(b1_y2, b2_y2)
    # Intersection area
    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * torch.clamp(
        inter_rect_y2 - inter_rect_y1 + 1, min=0
    )
    # Union Area
    b1_area = (b1_x2 - b1_x1 + 1) * (b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1) * (b2_y2 - b2_y1 + 1)

    iou = inter_area / (b1_area + b2_area - inter_area + 1e-16)

    return iou


def non_max_suppression(prediction, conf_thres=0.5, nms_thres=0.4):
    """
    Removes detections with lower object confidence score than 'conf_thres' and performs
    Non-Maximum Suppression to further filter detections.
    Returns detections with shape:
        (x1, y1, x2, y2, object_conf, class_score, class_pred)
    """
    # prediction: (batch_size, num_anchors*grid_size*grid_size*3, 85) 85 => (x,y,w,h, conf, cls)
    # From (center x, center y, width, height) to (x1, y1, x2, y2)
    prediction[..., :4] = xywh2xyxy(prediction[..., :4])
    output = [None for _ in range(len(prediction))]

    for image_i, image_pred in enumerate(prediction):
        # Filter out confidence scores below threshold
        # keep confident predictions: filter out boxes whose objectness is below the threshold
        # print image_pred.shape (num_anchors*grid_size*grid_size*3, 85) 85 => (x,y,w,h, conf, cls)
        image_pred = image_pred[image_pred[:, 4] >= conf_thres]
        # print image_pred.shape  (more_than_conf_thres_num, 85) 85 => (x,y,w,h, conf, cls)

        # If none are remaining => process next image
        # after confidence filtering, check whether any boxes remain; if none, no objects are detected in this image
        if not image_pred.size(0):
            continue

        # Object confidence times class confidence
        # take each box's largest class probability and multiply it by the objectness confidence, so both class accuracy and confidence are taken into account
        # every remaining box gets one score
        score = image_pred[:, 4] * image_pred[:, 5:].max(1)[0]
        # Sort by it
        # sort the remaining boxes by score in descending order
        # image_pred = image_pred[(-score).argsort()]
        # image_pred ==> (num_above_conf_thres, 85), 85 => (x, y, w, h, conf, cls)
        image_pred = image_pred[torch.sort(-score, dim=0)[1]]
        # image_pred[:, 5:] ==> (num_above_conf_thres, cls)
        # get each box's class score (class_confs) and class index (class_preds)
        class_confs, class_preds = image_pred[:, 5:].max(1, keepdim=True)
        # concatenate x, y, w, h, conf with the class score and class index
        # detections ==> (num_above_conf_thres, 7), 7 => x, y, w, h, conf, class_conf, class_pred
        detections = torch.cat((image_pred[:, :5], class_confs.float(), class_preds.float()), 1)

        # Perform non-maximum suppression
        keep_boxes = []
        while detections.size(0):
            # detections[0, :4] is the first box, i.e. the highest-scoring box remaining
            # compute IoU between the highest-scoring box and every box in the list; mark IoU > nms_thres as True
            large_overlap = bbox_iou(detections[0, :4].unsqueeze(0), detections[:, :4]) > nms_thres
            # mark boxes whose class label matches that of the highest-scoring box
            label_match = detections[0, -1] == detections[:, -1]

            # Indices of boxes with lower confidence scores, large IOUs and matching labels
            # boxes with a large IoU against the highest-scoring box overlap it heavily;
            # if they also share its class label they are treated as predictions of the same object,
            # so they are marked invalid (this includes the highest-scoring box itself).
            invalid = large_overlap & label_match
            # take the confidences of those boxes and use them as weights
            weights = detections[invalid, 4:5]

            # Merge overlapping bboxes by order of confidence
            # merge the boxes that predict the same object into a single confidence-weighted box:
            detections[0, :4] = (weights * detections[invalid, :4]).sum(0) / weights.sum()
            # keep the merged box as a final detection
            keep_boxes += [detections[0]]
            # ~invalid inverts the mask, keeping the remaining boxes for the next round
            detections = detections[~invalid]
        if keep_boxes:
            # each image ends up with pred_boxes_num final boxes; output[image_i] has shape
            # (pred_boxes_num, 7), 7 => x, y, w, h, conf, class_conf, class_pred
            output[image_i] = torch.stack(keep_boxes)

    # (batch_size, pred_boxes_num, 7) 7 =》x,y,w,h,conf,class_conf,class_pred
    return output


def build_targets(pred_boxes, pred_cls, target, anchors, ignore_thres):
    # pred_boxes => (batch_size, anchor_num, gride, gride, 4)
    # pred_cls => (batch_size, anchor_num, gride, gride, 80)
    # targets => (num, 6)  6=>(batch_index, cls, center_x, center_y, widht, height)
    # anchors => (3, 2)

    ByteTensor = torch.cuda.ByteTensor if pred_boxes.is_cuda else torch.ByteTensor
    FloatTensor = torch.cuda.FloatTensor if pred_boxes.is_cuda else torch.FloatTensor

    nB = pred_boxes.size(0)  # batch num
    nA = pred_boxes.size(1)  # anchor num
    nC = pred_cls.size(-1)  # class num => 80
    nG = pred_boxes.size(2)  # gride

    # Output tensors
    obj_mask = ByteTensor(nB, nA, nG, nG).fill_(0)  # (batch_size, anchor_num, gride, gride)
    noobj_mask = ByteTensor(nB, nA, nG, nG).fill_(1)
    class_mask = FloatTensor(nB, nA, nG, nG).fill_(0)
    iou_scores = FloatTensor(nB, nA, nG, nG).fill_(0)
    tx = FloatTensor(nB, nA, nG, nG).fill_(0)
    ty = FloatTensor(nB, nA, nG, nG).fill_(0)
    tw = FloatTensor(nB, nA, nG, nG).fill_(0)
    th = FloatTensor(nB, nA, nG, nG).fill_(0)
    tcls = FloatTensor(nB, nA, nG, nG, nC).fill_(0)  # (batch_size, anchor_num, gride, gride, class_num)

    # Convert to position relative to box
    # convert the normalized x, y, w, h into grid units: the current feature-map size is nG, so multiply by nG
    # print target[:, 2:6].shape  # (num, 4)
    target_boxes = target[:, 2:6] * nG  # (num, 4)  4=>(center_x, center_y, widht, height)
    gxy = target_boxes[:, :2]  # (num, 2)
    gwh = target_boxes[:, 2:]  # (num, 2)
    # print target_boxes.shape, gxy.shape, gwh.shape

    # Get anchors with best iou
    # assign each ground-truth box the best of the preset anchor shapes
    # anchors are the preset anchor shapes and gwh the ground-truth widths/heights; only the shapes are compared, so the center coordinates do not matter here
    ious = torch.stack([bbox_wh_iou(anchor, gwh) for anchor in anchors])  # (3, num)
    # ious is (3, num); for each ground-truth box pick the anchor with the largest IoU: best_ious is the IoU value, best_n the anchor index
    best_ious, best_n = ious.max(0)  # both have length num; best_n is the anchor index assigned to each ground-truth box

    # Separate target values
    # .t() transposes: (num, 2) => (2, num)
    # 2 => (batch_index, cls): b holds the num batch indices, target_labels the num class labels
    b, target_labels = target[:, :2].long().t()
    gx, gy = gxy.t()  # gx holds the num x values, gy the num y values
    gw, gh = gwh.t()
    gi, gj = gxy.long().t()  # .long() truncates the floats to integers, giving the grid cell that contains each box center


    # --------------------- build obj_mask (object anchors) and noobj_mask (no-object anchors): start ---------------------
    # Set masks
    # for image b in the batch, the cell at (gj, gi) contains a ground-truth center whose best-matching anchor index is best_n
    obj_mask[b, best_n, gj, gi] = 1  # mark the best anchor of the cell containing each ground-truth center
    noobj_mask[b, best_n, gj, gi] = 0  # the opposite of obj_mask for those anchors

    # Set noobj mask to zero where iou exceeds ignore threshold
    # ious.t(): (3, num) => (num, 3)
    # unlike the previous step, which picked a single best anchor per ground-truth box,
    # here any anchor whose IoU with a ground-truth box exceeds the threshold is also removed from noobj_mask (set to 0)
    for i, anchor_ious in enumerate(ious.t()):
        noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0

    # obj_mask now marks exactly one best anchor per ground-truth box;
    # noobj_mask marks anchors that are neither the best match nor above the IoU threshold, which makes the positive/negative split cleaner.
    # --------------------- build obj_mask and noobj_mask: end ---------------------

    # --------------------- build the regression targets (tx, ty, tw, th): start ---------------------
    # Coordinates
    # re-normalize x, y, w, h. Note this differs from the normalization of the incoming targets,
    # which is the real x, y, w, h divided by img_size (their proportion of the image);
    # here the center x, y are relative to the grid cell and w, h are relative to the anchor box,
    # and these re-normalized values are exactly what the model has to regress.
    tx[b, best_n, gj, gi] = gx - gx.floor()
    ty[b, best_n, gj, gi] = gy - gy.floor()
    # Width and height
    tw[b, best_n, gj, gi] = torch.log(gw / anchors[best_n][:, 0] + 1e-16)
    th[b, best_n, gj, gi] = torch.log(gh / anchors[best_n][:, 1] + 1e-16)
    # --------------------- build the regression targets (tx, ty, tw, th): end ---------------------


    # One-hot encoding of label
    # for image b, the cell at (gj, gi) with best anchor best_n gets a 1 at its ground-truth class target_labels
    tcls[b, best_n, gj, gi, target_labels] = 1

    # Compute label correctness and iou at best anchor
    # class_mask: 1 where the class is predicted correctly at the responsible cell/anchor
    class_mask[b, best_n, gj, gi] = (pred_cls[b, best_n, gj, gi].argmax(-1) == target_labels).float()
    # iou_scores: IoU between each responsible predicted box and its ground-truth box; higher IoU means a higher score
    iou_scores[b, best_n, gj, gi] = bbox_iou(pred_boxes[b, best_n, gj, gi], target_boxes, x1y1x2y2=False)
    # tconf: confidence targets, 1 at the responsible anchors; cast to float so it can be compared with the predicted confidences in the loss
    tconf = obj_mask.float()


    # iou_scores: IoU between the responsible predicted boxes and the ground-truth boxes (higher IoU, higher score)
    # class_mask: 1 where the prediction is correct (right cell, best anchor, and right class)
    # obj_mask: 1 at the anchor assigned to each ground-truth box (one anchor per ground-truth box)
    # noobj_mask: 1 at all anchors whose IoU with every ground-truth box is below the threshold
    # tx, ty, tw, th: box coordinates and sizes the model has to regress
    # tcls: class of each ground-truth box
    # tconf: target confidence for every anchor
    return iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf
(Code above: YOLO v3 utilities, utils.py)

 

 References:

  https://www.zybuluo.com/rianusr/note/1417734

  https://www.cnblogs.com/hellcat/p/10375310.html

  https://blog.csdn.net/leviopku/article/details/82660381

