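A 10,000-word walkthrough of how the YOLOv3 algorithm works and how it is trained, explained from the code's point of view, with hands-on training on the VisDrone dataset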

This article is a repost. 2020-09-09 11:12. Category: Hands-on Projects / Object Detection

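This article explains, from the code's point of view, how YOLOv3 works and how it is trained, and shows two ways to train your own PyTorch YOLOv3 model: with the VisDrone2019 dataset and with a dataset you build yourself. It is a long read that took a lot of effort to compile, and it is all practical material.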

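Outline

Step 1: Building the YOLOv3 object detection framework in PyTorch

YOLOv3 model and pretrained weights download

How YOLOv3 works and how it is implemented

I. Prediction

1. YOLOv3 network architecture and implementation

2. The Darknet-53 backbone and its outputs (the 3 initial feature layers)

3. From the initial features to the prediction results (the final 3 effective feature layers)

4. Decoding the predictions (decoding the final 3 effective feature layers)

5. Drawing on the original image (rendering the decoded results on the original image)

II. Training

1. Parameters needed to compute the loss

2. What prediction is

3. What target is

4. How the loss is computed

5. Starting the training

Step 2: Training your own YOLOv3 model with VisDrone2019

Overall folder structure of YOLOv3

I. Dataset preparation

1. Training with the VisDrone dataset

2. Training with a dataset you build yourself

II. Training and results

3. Starting the training

4. Training results

III. Predicting with the trained model

How the YOLO algorithm works and how it is implemented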

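The YOLOv3 model, the pretrained weights and the dataset can be downloaded by following the WeChat official account 碼農的后花園.

I. Prediction

1. YOLOv3 network architecture

As shown in Figure 1: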

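An input image of any size is first resized to 416*416*3 and fed to the YOLOv3 model. The Darknet-53 backbone extracts three initial feature layers for object detection. They sit at different depths of the backbone: the middle layer P3, the lower-middle layer P4 and the bottom layer P5 (P3 is the output of the 3rd module of Darknet counting from the top, starting at 0), marked by the red boxes in the figure above. Their shapes are (52,52,256), (26,26,512) and (13,13,1024).

The sizes 52*52, 26*26 and 13*13 can be pictured as dividing the 416*416 model input into grids of 52*52, 26*26 and 13*13 cells. The three feature maps are used to detect small, medium and large objects respectively, because the anchor boxes (priors) preset on each map differ in size. The 13*13 feature map has the largest receptive field, so the largest anchors are applied to it to detect large objects, as the blue boxes in the figure below show.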
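As a quick check of these grid sizes, here is a minimal sketch. It assumes the three scales downsample the input by factors of 8, 16 and 32, which is what the stride-2 convolutions of the backbone code later in this article add up to; the function name is made up for illustration.

```python
# Sketch: grid size of each detection scale for a square input.
# STRIDES is an assumption derived from the five stride-2 convolutions
# in the backbone (P3 after three of them, P4 after four, P5 after five).
STRIDES = (8, 16, 32)


def grid_sizes(input_size, strides=STRIDES):
    """Return the side length of the feature map at each scale."""
    return [input_size // s for s in strides]


print(grid_sizes(416))  # [52, 26, 13]
```

The same arithmetic shows why the input side should be a multiple of 32: otherwise the three grids no longer line up after upsampling.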

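After the Darknet-53 backbone produces these three initial feature layers, they go through further processing to obtain the final three effective feature layers of the YOLOv3 model, out0 (P5), out1 (P4) and out2 (P3). These are the network's predictions, shown by the green boxes in Figure 1. The processing is as follows:

1. P5 first passes through 5 convolutions, and the result is used in two ways. First, one 3*3 Conv2D and one 1*1 Conv2D produce out0, the effective feature layer for P5, which is used to detect large objects. Second, the result of the 5 convolutions goes through one more Conv2D and an UpSampling2D to give (batchsize,26,26,256), which is concatenated with P4.

2. P4 is concatenated (Concat) with the upsampled P5 branch, giving (batchsize,26,26,768). Five convolutions (Conv2D Block 256) reduce this to (batchsize,26,26,256). As above, the result is used in two ways. First, one 3*3 Conv2D and one 1*1 Conv2D produce out1, the effective feature layer for P4, which is used to detect medium-sized objects. Second, a Conv2D and an UpSampling2D prepare it for concatenation with P3.

3. P3 is concatenated with the upsampled P4 branch and passes through 5 convolutions (Conv2D Block 128). Here there is only one use: one 3*3 Conv2D and one 1*1 Conv2D produce out2, the effective feature layer for P3, which is used to detect small objects.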
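The channel counts in these three steps can be traced with plain tuples. This is only a bookkeeping sketch: shapes are written channel-first as (channels, height, width), the helper name is invented, and the 512-channel and 256-channel branch outputs are taken from the make_last_layers calls in the network code later in this article.

```python
def upsample_then_concat(deep, shallow, reduced_channels):
    """Shape after a 1x1 conv to `reduced_channels`, 2x upsampling of `deep`,
    and channel-wise concatenation with `shallow`."""
    _, h, w = deep
    up = (reduced_channels, h * 2, w * 2)
    assert up[1:] == shallow[1:], "spatial sizes must match before concatenation"
    return (up[0] + shallow[0], shallow[1], shallow[2])


p5_branch = (512, 13, 13)   # P5 after its five convolutions
p4 = (512, 26, 26)          # initial feature layer P4
p4_in = upsample_then_concat(p5_branch, p4, 256)

p4_branch = (256, 26, 26)   # P4 path after its five convolutions
p3 = (256, 52, 52)          # initial feature layer P3
p3_in = upsample_then_concat(p4_branch, p3, 128)

print(p4_in, p3_in)  # (768, 26, 26) (384, 52, 52)
```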

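2. The Darknet-53 backbone

Compared with YOLOv1 and YOLOv2, YOLOv3 changes a lot. The main improvements are:

1. The backbone is changed to Darknet-53, whose key feature is the use of residual connections (Residual). A residual stage in Darknet-53 first applies one 3x3 convolution with stride 2 and keeps the result as layer. It then applies a 1x1 convolution (to reduce the number of channels) and a 3x3 convolution (to increase it again), and adds layer to that result to form the output. Residual networks are easy to optimize and can gain accuracy from considerable extra depth. The skip connections inside the residual blocks ease the vanishing-gradient problem that comes with deeper networks.

Residual block diagram: the output of an earlier layer skips several layers and is fed directly into the input of a later layer. This means part of a later feature layer is contributed linearly by an earlier one. Deep residual networks were designed to overcome the loss of learning efficiency and the stalled accuracy that appear as networks get deeper.

2. Every convolution in Darknet-53 uses the DarknetConv2D structure: L2 regularization is applied to each convolution, and the convolution is followed by BatchNormalization and LeakyReLU. (In the PyTorch code shown in this article, each such unit consists of Conv2d, BatchNorm2d and LeakyReLU.) A plain ReLU sets all negative values to zero, whereas Leaky ReLU gives negative values a small non-zero slope. Mathematically: f(x) = x for x > 0 and f(x) = a*x for x <= 0, where a is a small positive constant (0.1 in the code below).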
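To make the difference from a plain ReLU concrete, here is a scalar sketch of the activation. The default slope of 0.1 mirrors the nn.LeakyReLU(0.1) calls in the code of this article; the function itself is only an illustration, not part of the project.

```python
def leaky_relu(x, negative_slope=0.1):
    """Leaky ReLU on a single number: identity for positive inputs,
    a small linear slope for the rest."""
    return x if x > 0 else negative_slope * x


for value in (2.0, 0.0, -2.0):
    print(value, leaky_relu(value))
```

A plain ReLU would map -2.0 to 0, so no gradient would flow back through that unit; the leaky version keeps a small gradient alive.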

The Darknet-53 implementation is shown below. For details see darknet53.py, which defines the structure of the Darknet-53 backbone.

import math
from collections import OrderedDict

import torch
import torch.nn as nn


# Residual block: 1x1 conv reduces the channels, 3x3 conv restores them,
# and the block input is added back to the result.
class BasicBlock(nn.Module):
    def __init__(self, inplanes, planes):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes[0], kernel_size=1,
                               stride=1, padding=0, bias=False)
        self.bn1 = nn.BatchNorm2d(planes[0])
        self.relu1 = nn.LeakyReLU(0.1)

        self.conv2 = nn.Conv2d(planes[0], planes[1], kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes[1])
        self.relu2 = nn.LeakyReLU(0.1)

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu2(out)

        # Skip connection
        out += residual
        return out


# Darknet-53 network structure
class DarkNet(nn.Module):
    def __init__(self, layers):
        super(DarkNet, self).__init__()
        self.inplanes = 32
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(self.inplanes)
        self.relu1 = nn.LeakyReLU(0.1)

        self.layer1 = self._make_layer([32, 64], layers[0])
        self.layer2 = self._make_layer([64, 128], layers[1])
        self.layer3 = self._make_layer([128, 256], layers[2])
        self.layer4 = self._make_layer([256, 512], layers[3])
        self.layer5 = self._make_layer([512, 1024], layers[4])

        self.layers_out_filters = [64, 128, 256, 512, 1024]

        # Weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
                m.weight.data.normal_(0, math.sqrt(2. / n))
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()

    def _make_layer(self, planes, blocks):
        layers = []
        # Downsampling: 3x3 convolution with stride 2
        layers.append(("ds_conv", nn.Conv2d(self.inplanes, planes[1], kernel_size=3,
                                            stride=2, padding=1, bias=False)))
        layers.append(("ds_bn", nn.BatchNorm2d(planes[1])))
        layers.append(("ds_relu", nn.LeakyReLU(0.1)))
        # Stack the residual blocks
        self.inplanes = planes[1]
        for i in range(0, blocks):
            layers.append(("residual_{}".format(i), BasicBlock(self.inplanes, planes)))
        return nn.Sequential(OrderedDict(layers))

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu1(x)

        x = self.layer1(x)
        x = self.layer2(x)
        # The three initial feature layers P3, P4, P5
        out3 = self.layer3(x)
        out4 = self.layer4(out3)
        out5 = self.layer5(out4)

        return out3, out4, out5


def darknet53(pretrained, **kwargs):
    model = DarkNet([1, 2, 8, 8, 4])
    if pretrained:
        # pretrained must be the path of a weights file
        if isinstance(pretrained, str):
            model.load_state_dict(torch.load(pretrained))
        else:
            raise Exception("darknet request a pretrained path. got [{}]".format(pretrained))
    return model

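3. From the initial features to the prediction results

1. In the feature-extraction part, YOLOv3 uses Darknet-53 to extract several feature layers for object detection. Three initial feature layers, P3, P4 and P5, are taken from the middle, lower-middle and bottom of the backbone. Their shapes are (52,52,256), (26,26,512) and (13,13,1024).

2. Each of these three initial feature layers goes through 5 convolutions and related processing. Part of the result is used to output that layer's prediction (out0, out1, out2), and part is upsampled (UpSampling2D) and merged with another initial feature layer.

3. The output layers (the final 3 effective feature layers) have shapes (13,13,75), (26,26,75) and (52,52,75). The last dimension is 75 because the figure is based on the VOC dataset, which has 20 classes, and YOLOv3 has 3 anchor boxes for each cell of each feature layer, so the last dimension is 3x25.

With the COCO training set there are 80 classes, so the last dimension is 255 = 3x85 and the three feature layers have shapes (13,13,255), (26,26,255) and (52,52,255).

In practice, because we use PyTorch, which places the channel dimension right after the batch dimension, feeding N images of 416x416 through the network produces three tensors of shape (N,255,13,13), (N,255,26,26) and (N,255,52,52). They hold the positions of the 3 anchor boxes for every cell of the 13x13, 26x26 and 52x52 grids.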
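The channel count of each output layer follows directly from the number of classes. A small sketch of that arithmetic, which is the same expression as final_out_filter0 in the network code of this article (the function name is illustrative):

```python
def head_channels(num_classes, anchors_per_cell=3):
    """Channels of one output layer: per anchor, 4 box values,
    1 confidence value and one score per class."""
    return anchors_per_cell * (5 + num_classes)


print(head_channels(20))  # VOC: 75
print(head_channels(80))  # COCO: 255
```

When training on your own dataset, only num_classes changes, so the output width has to be recomputed the same way.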

The implementation is shown below. For details see yolo3.py, which defines the whole YOLOv3 network.

from collections import OrderedDict

import torch
import torch.nn as nn

from nets.darknet import darknet53


# Convolution + BatchNorm + LeakyReLU unit
def conv2d(filter_in, filter_out, kernel_size):
    pad = (kernel_size - 1) // 2 if kernel_size else 0
    return nn.Sequential(OrderedDict([
        ("conv", nn.Conv2d(filter_in, filter_out, kernel_size=kernel_size, stride=1, padding=pad, bias=False)),
        ("bn", nn.BatchNorm2d(filter_out)),
        ("relu", nn.LeakyReLU(0.1)),
    ]))


# Five convolutions, then a 3x3 conv and a final 1x1 conv that outputs the prediction
def make_last_layers(filters_list, in_filters, out_filter):
    m = nn.ModuleList([
        conv2d(in_filters, filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        conv2d(filters_list[1], filters_list[0], 1),
        conv2d(filters_list[0], filters_list[1], 3),
        nn.Conv2d(filters_list[1], out_filter, kernel_size=1,
                  stride=1, padding=0, bias=True)
    ])
    return m


class YoloBody(nn.Module):
    def __init__(self, config):
        super(YoloBody, self).__init__()
        self.config = config
        # backbone: Darknet-53 extracts the initial features
        self.backbone = darknet53(None)

        out_filters = self.backbone.layers_out_filters
        # last_layer0
        final_out_filter0 = len(config["yolo"]["anchors"][0]) * (5 + config["yolo"]["classes"])
        self.last_layer0 = make_last_layers([512, 1024], out_filters[-1], final_out_filter0)

        # embedding1
        final_out_filter1 = len(config["yolo"]["anchors"][1]) * (5 + config["yolo"]["classes"])
        self.last_layer1_conv = conv2d(512, 256, 1)
        self.last_layer1_upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.last_layer1 = make_last_layers([256, 512], out_filters[-2] + 256, final_out_filter1)

        # embedding2
        final_out_filter2 = len(config["yolo"]["anchors"][2]) * (5 + config["yolo"]["classes"])
        self.last_layer2_conv = conv2d(256, 128, 1)
        self.last_layer2_upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.last_layer2 = make_last_layers([128, 256], out_filters[-3] + 128, final_out_filter2)

    def forward(self, x):
        def _branch(last_layer, layer_in):
            for i, e in enumerate(last_layer):
                layer_in = e(layer_in)
                # Keep the result of the fifth convolution for the upsampling path
                if i == 4:
                    out_branch = layer_in
            return layer_in, out_branch

        # backbone
        x2, x1, x0 = self.backbone(x)
        # yolo branch 0
        out0, out0_branch = _branch(self.last_layer0, x0)

        # yolo branch 1
        x1_in = self.last_layer1_conv(out0_branch)
        x1_in = self.last_layer1_upsample(x1_in)
        x1_in = torch.cat([x1_in, x1], 1)
        out1, out1_branch = _branch(self.last_layer1, x1_in)

        # yolo branch 2
        x2_in = self.last_layer2_conv(out1_branch)
        x2_in = self.last_layer2_upsample(x2_in)
        x2_in = torch.cat([x2_in, x2], 1)
        out2, _ = _branch(self.last_layer2, x2_in)
        return out0, out1, out2

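4. Decoding the predictions and filtering the final boxes

Step 3 gives us the predictions of the final three effective feature layers, with shapes (N,255,13,13), (N,255,26,26) and (N,255,52,52). They hold the positions of the 3 predicted boxes for every cell of the 13x13, 26x26 and 52x52 grids.

These predictions do not yet correspond to box positions on the image; they have to be decoded first. The network output is used to adjust the preset anchor boxes, and the result of the adjustment is the final predicted box. We call this adjustment of the anchors decoding.

In short: decoding means using the YOLOv3 predictions (the 3 effective feature layers) to adjust the anchor boxes; the adjusted anchors are the predicted boxes.

A word on how YOLOv3 predicts: its 3 feature layers divide the whole image into 13x13, 26x26 and 52x52 grids, and each grid point is responsible for detecting one region.

As we know, the prediction of a feature layer corresponds to the positions of three predicted boxes. For the COCO dataset we first reshape it to (N,3,85,13,13), (N,3,85,26,26) and (N,3,85,52,52).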
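Two small calculations help when reading the reshape. The sketch below counts how many candidate boxes the three layers produce for one image, and shows which of the 255 channels holds a given attribute of a given anchor before the reshape. Both helper names are invented for this illustration; the channel layout follows from viewing 255 channels as 3 groups of 85.

```python
def total_boxes(grids=(13, 26, 52), anchors_per_cell=3):
    """Number of candidate boxes predicted for one image."""
    return sum(g * g * anchors_per_cell for g in grids)


def channel_index(anchor, attr, attrs_per_box=85):
    """Channel that stores attribute `attr` of anchor `anchor`
    in the raw (N, 3 * attrs_per_box, H, W) output."""
    return anchor * attrs_per_box + attr


print(total_boxes())         # 10647
print(channel_index(1, 4))   # confidence of the second anchor: channel 89
```

Most of these candidates are background, which is why score thresholding and non-maximum suppression are needed afterwards.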

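The 85 in this dimension is 4+1+80: x_offset, y_offset, w and h, the confidence, and the class scores. For VOC data it is 25.

In code, YOLOv3 decoding works as follows. First, a grid the size of the feature layer is generated. Next, the preset anchor boxes, which are defined on the 416*416 input, are rescaled to the size of the effective feature layer. Then the center adjustments x_offset and y_offset and the size adjustments w and h are taken from the network output and applied to the anchors on the feature layer: each grid point plus its x_offset and y_offset gives the center of the adjusted anchor, which is the center of the predicted box, and the anchor size combined with w and h gives its width and height. That fixes the whole predicted box on the feature layer. Finally, the box is scaled back from the feature layer to the 416*416 input.

Taking the 13*13 effective feature layer as an example: the left image visualizes the anchors being adjusted on the feature layer, and the right image shows the adjusted anchors, that is, the actual predicted boxes, drawn on the original image.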
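The decoding steps can be written out for a single box with nothing but the math module. This is a scalar sketch of the procedure described above, not the project's DecodeBox class; the function name and argument order are chosen for illustration, and the anchor is given in input-image pixels.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(tx, ty, tw, th, grid_x, grid_y, anchor_w, anchor_h, stride):
    """Turn raw network outputs for one anchor at one cell into
    (center_x, center_y, width, height) in input-image pixels."""
    # Rescale the anchor from input pixels to feature-map cells
    aw, ah = anchor_w / stride, anchor_h / stride
    # Center: cell index plus an offset squashed into (0, 1)
    bx = grid_x + sigmoid(tx)
    by = grid_y + sigmoid(ty)
    # Size: anchor size scaled by exp of the raw output
    bw = aw * math.exp(tw)
    bh = ah * math.exp(th)
    # Back to input-image pixels
    return bx * stride, by * stride, bw * stride, bh * stride


# All-zero outputs at cell (6, 6) of the 13x13 layer with a 116x90 anchor
print(decode_box(0, 0, 0, 0, 6, 6, 116, 90, 32))  # (208.0, 208.0, 116.0, 90.0)
```

With all raw outputs at zero, the box sits at the middle of its cell and keeps the anchor's size, which is why well-chosen anchors make training easier.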

The decoding code is shown below. For details see utils.py, which decodes the YOLOv3 network predictions for display.

import torch
import torch.nn as nn


class DecodeBox(nn.Module):
    def __init__(self, anchors, num_classes, img_size):
        super(DecodeBox, self).__init__()
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.img_size = img_size

    def forward(self, input):
        batch_size = input.size(0)
        input_height = input.size(2)
        input_width = input.size(3)

        # Stride: how many input pixels one feature-map cell covers
        stride_h = self.img_size[1] / input_height
        stride_w = self.img_size[0] / input_width
        # Rescale the anchors to feature-map units
        scaled_anchors = [(anchor_width / stride_w, anchor_height / stride_h) for anchor_width, anchor_height in self.anchors]

        # Reshape the prediction: (N, A*attrs, H, W) -> (N, A, H, W, attrs)
        prediction = input.view(batch_size, self.num_anchors,
                                self.bbox_attrs, input_height, input_width).permute(0, 1, 3, 4, 2).contiguous()

        # Adjustment parameters for the anchor center
        x = torch.sigmoid(prediction[..., 0])
        y = torch.sigmoid(prediction[..., 1])
        # Adjustment parameters for the anchor width and height
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height

        # Objectness confidence: is there an object?
        conf = torch.sigmoid(prediction[..., 4])
        # Class confidences
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

        FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
        LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

        # Build the grid: the top-left corner of every cell
        grid_x = torch.linspace(0, input_width - 1, input_width).repeat(input_height, 1).repeat(
            batch_size * self.num_anchors, 1, 1).view(x.shape).type(FloatTensor)
        grid_y = torch.linspace(0, input_height - 1, input_height).repeat(input_width, 1).t().repeat(
            batch_size * self.num_anchors, 1, 1).view(y.shape).type(FloatTensor)

        # Anchor widths and heights, broadcast to the prediction shape
        anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
        anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))
        anchor_w = anchor_w.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(w.shape)
        anchor_h = anchor_h.repeat(batch_size, 1).repeat(1, 1, input_height * input_width).view(h.shape)

        # Adjusted anchor center, width and height (in feature-map units)
        pred_boxes = FloatTensor(prediction[..., :4].shape)
        pred_boxes[..., 0] = x.data + grid_x
        pred_boxes[..., 1] = y.data + grid_y
        pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
        pred_boxes[..., 3] = torch.exp(h.data) * anchor_h

        # Scale the boxes back to the 416x416 input
        _scale = torch.Tensor([stride_w, stride_h] * 2).type(FloatTensor)
        output = torch.cat((pred_boxes.view(batch_size, -1, 4) * _scale,
                            conf.view(batch_size, -1, 1), pred_cls.view(batch_size, -1, self.num_classes)), -1)
        return output.data

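5. Drawing on the original image

Step 4 gives the positions of the predicted boxes on the original image. The final predictions still have to be sorted by score and filtered with non-maximum suppression. As the right-hand image shows, each grid point has 3 anchor boxes, so after adjustment it has 3 predicted boxes, and the same object ends up with 3 boxes when drawn on the original image. We have to filter them to find the most suitable one. In the example below, assume the 3 blue boxes are the predicted boxes, the yellow box is the ground-truth box and the red cell is the grid cell responsible for the object. The 3 adjusted anchors (the predicted boxes) at that grid point, all detecting the same object, are what we need to filter.

This part is common to essentially all object detectors. This project handles it a little differently from others, though: it filters each class separately.

1. For each class, take the boxes and scores whose score is greater than self.obj_threshold.

2. Run non-maximum suppression using the box positions and scores.

For details see yolo.py and utils.py.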
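The two filtering steps can be sketched in pure Python as a greedy per-class non-maximum suppression. This is an illustration under stated assumptions, not the code in utils.py: boxes are given as corner coordinates, and the function names and both threshold defaults are made up for the example.

```python
def iou(a, b):
    """IoU of two boxes whose first four values are (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, score_threshold=0.5, iou_threshold=0.3):
    """detections: list of (x1, y1, x2, y2, score, class_id)."""
    kept = []
    for class_id in sorted({d[5] for d in detections}):
        # Step 1: keep this class's boxes above the score threshold, best first
        candidates = sorted(
            (d for d in detections if d[5] == class_id and d[4] > score_threshold),
            key=lambda d: d[4], reverse=True)
        # Step 2: repeatedly keep the best box and drop boxes overlapping it
        while candidates:
            best = candidates.pop(0)
            kept.append(best)
            candidates = [d for d in candidates if iou(best, d) <= iou_threshold]
    return kept


boxes = [
    (10, 10, 50, 50, 0.9, 0),
    (12, 12, 52, 52, 0.8, 0),      # overlaps the first box, same class
    (100, 100, 140, 140, 0.7, 0),
    (10, 10, 50, 50, 0.6, 1),      # same place, different class
    (0, 0, 5, 5, 0.1, 0),          # below the score threshold
]
for box in nms(boxes):
    print(box)
```

Because suppression runs per class, the class-1 box survives even though it coincides with the best class-0 box.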

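II. Training

1. Parameters needed to compute the loss

Computing the loss is really a comparison between the network's prediction and the target. prediction is the final output of the YOLOv3 model for an input image, that is, the 3 effective feature layers; every image ends up with 3 effective feature layers. target is the annotation data of the labelled images in your training set, that is, the ground-truth boxes.

2. What prediction is

For the YOLOv3 model, the final network output is the three effective feature layers. Every grid point (feature point) of these layers corresponds to predicted boxes and their classes. In other words, the three feature layers correspond to the image divided into grids of different sizes, and at each grid point they hold the position, confidence and class of three anchor boxes.

The output layers have shapes (13,13,75), (26,26,75) and (52,52,75). The last dimension is 75 because this is based on the VOC dataset, which has 20 classes. Every feature point (grid point) of every feature layer has 3 preset anchor boxes, and each anchor carries 1+4+20=25 values: 1 says whether the anchor contains an object, 4 are the box's x, y, w, h values, and 20 are the class scores. Each feature point therefore corresponds to 3*25 values, so the last dimension is 3x25. With the COCO training set there are 80 classes, the last dimension is 255 = 3x85, and the three feature layers have shapes (13,13,255), (26,26,255) and (52,52,255).

Note: the YOLOv3 predictions obtained here (the 3 effective feature layers), y_prediction, have not been decoded yet. They are the output of the YoloBody class in yolo3.py. Only after decoding do the effective feature layers describe positions on the real image.

3. What target is

target describes the ground-truth boxes of a real image. Its first dimension is batch_size, the second is the number of ground-truth boxes in each image, and the third holds the information of one ground-truth box: its position and its class.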
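As an illustration of one target row: the get_target function later in this article multiplies the first four values by the feature-map size and reads the class from the fifth, which suggests rows of the form (center x, center y, width, height, class) with coordinates normalized to 0..1. The conversion below is a sketch under that assumption; the helper name and the corner-format input are hypothetical.

```python
def to_target_row(xmin, ymin, xmax, ymax, class_id, img_w, img_h):
    """Convert a pixel corner box into a normalized
    (cx, cy, w, h, class_id) row."""
    cx = (xmin + xmax) / 2.0 / img_w
    cy = (ymin + ymax) / 2.0 / img_h
    w = (xmax - xmin) / float(img_w)
    h = (ymax - ymin) / float(img_h)
    return [cx, cy, w, h, class_id]


# A 208x208 box centered in a 416x416 image, class 3
print(to_target_row(104, 104, 312, 312, 3, 416, 416))  # [0.5, 0.5, 0.5, 0.5, 3]
```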

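4. How the loss is computed

Once we have pred and target, we cannot simply subtract one from the other. The following steps are needed.

Step 1: decode the YOLOv3 network prediction to obtain the adjustments the network predicts for the anchor boxes.

Step 2: process the ground-truth boxes to obtain the adjustments the network should have predicted for the anchor boxes, that is, the prediction the network ought to produce, and compare it with the prediction the network actually produced. This is the get_target function in the code:

Find where the ground-truth box lies in the image and which grid point is responsible for detecting it.
Find which preset anchor box overlaps the ground-truth box the most.
Compute what that grid point should predict in order to produce the ground-truth box (use the ground-truth box to adjust the preset anchor, which gives the anchor adjustments that grid point should predict).
Do this for every ground-truth box.
This gives the prediction the network should produce, which is then compared with the actual YOLOv3 prediction.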
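The sub-steps above can be sketched for a single ground-truth box and a single feature layer. This is a simplified stand-alone version of the idea behind get_target, not the function itself: the helper names and the returned dictionary are invented, and ANCHORS is assumed to be the commonly used set of nine COCO anchors, ordered large to small.

```python
import math

# Assumed anchors (width, height) in input pixels, largest first
ANCHORS = [(116, 90), (156, 198), (373, 326),
           (30, 61), (62, 45), (59, 119),
           (10, 13), (16, 30), (33, 23)]


def shape_iou(w1, h1, w2, h2):
    """IoU of two boxes that share the same top-left corner."""
    inter = min(w1, w2) * min(h1, h2)
    return inter / (w1 * h1 + w2 * h2 - inter)


def encode_target(box, anchors, layer_anchor_ids, grid, input_size=416):
    """box is a normalized (cx, cy, w, h). Returns the regression targets
    for this layer, or None if another layer owns the best anchor."""
    cx, cy, w, h = box
    stride = input_size / grid
    gx, gy, gw, gh = cx * grid, cy * grid, w * grid, h * grid
    gi, gj = int(gx), int(gy)  # responsible cell
    scaled = [(aw / stride, ah / stride) for aw, ah in anchors]
    best = max(range(len(scaled)), key=lambda k: shape_iou(gw, gh, *scaled[k]))
    if best not in layer_anchor_ids:
        return None
    aw, ah = scaled[best]
    return {"anchor": best, "cell": (gi, gj),
            "tx": gx - gi, "ty": gy - gj,
            "tw": math.log(gw / aw), "th": math.log(gh / ah)}


box = (0.5, 0.5, 116 / 416, 90 / 416)
print(encode_target(box, ANCHORS, [0, 1, 2], 13))
print(encode_target(box, ANCHORS, [6, 7, 8], 52))  # None
```

The box matches the largest anchor almost exactly, so the 13x13 layer is responsible for it with near-zero size targets, while the 52x52 layer skips it.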

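Step 3: ignore the predicted boxes that were not assigned an object but still overlap a ground-truth box heavily (IoU above ignore_threshold). Such a box is not responsible for any object, so its position and class outputs carry no useful training signal; at the same time it overlaps a real object too much to be treated as a clean negative. These adjusted anchors are therefore left out of the no-object confidence loss. This is the get_ignore function in the code.

Step 4: with the adjustments derived from the ground-truth boxes and the adjustments predicted by the network both available, the two are compared to compute the loss: binary cross-entropy for the center offsets, the confidence and the class scores, and squared error for the width and height terms.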
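The ignore rule of step three can be sketched on plain lists. This is an illustration of the rule only, not the get_ignore function: boxes are (cx, cy, w, h) in feature-map units, the mask is a flat list with one entry per predicted box, and the helper names are made up.

```python
def iou_center(a, b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1, bx2, by2 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(ay2, by2) - max(ay1, by1))
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def update_noobj_mask(noobj_mask, pred_boxes, gt_boxes, ignore_threshold=0.5):
    """Clear the no-object flag of every predicted box that overlaps
    some ground-truth box by more than ignore_threshold."""
    mask = list(noobj_mask)
    for k, pred in enumerate(pred_boxes):
        if any(iou_center(gt, pred) > ignore_threshold for gt in gt_boxes):
            mask[k] = 0
    return mask


gt = [(5, 5, 4, 4)]
preds = [(5, 5, 4, 4), (5.5, 5, 4, 4), (10, 10, 2, 2)]
print(update_noobj_mask([1, 1, 1], preds, gt))  # [0, 0, 1]
```

Only the third prediction, which does not touch the ground-truth box, is still counted as background in the confidence loss.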

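Note that the procedure above is carried out for each of the 3 effective feature layers in turn, because YOLOv3 predicts on 3 effective feature layers. The losses of the 3 layers are added to give the final loss of the model, which is then used for backpropagation and gradient descent. The code that implements this is in yolo_training.py.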

from random import shuffle
import math

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from matplotlib.colors import rgb_to_hsv, hsv_to_rgb
from PIL import Image

from utils.utils import bbox_iou


def clip_by_tensor(t, t_min, t_max):
    # Clamp every element of t into [t_min, t_max]
    t = t.float()
    result = (t >= t_min).float() * t + (t < t_min).float() * t_min
    result = (result <= t_max).float() * result + (result > t_max).float() * t_max
    return result


def MSELoss(pred, target):
    return (pred - target) ** 2


def BCELoss(pred, target):
    # Clip the prediction away from 0 and 1 so the logarithms stay finite
    epsilon = 1e-7
    pred = clip_by_tensor(pred, epsilon, 1.0 - epsilon)
    output = -target * torch.log(pred) - (1.0 - target) * torch.log(1.0 - pred)
    return output


class YOLOLoss(nn.Module):
    def __init__(self, anchors, num_classes, img_size):
        super(YOLOLoss, self).__init__()
        # All 9 anchors are passed in; each feature layer uses 3 of them
        self.anchors = anchors
        self.num_anchors = len(anchors)
        self.num_classes = num_classes
        self.bbox_attrs = 5 + num_classes
        self.img_size = img_size

        self.ignore_threshold = 0.5
        self.lambda_xy = 1.0
        self.lambda_wh = 1.0
        self.lambda_conf = 1.0
        self.lambda_cls = 1.0

    def forward(self, input, targets=None):
        # Number of images in the batch
        bs = input.size(0)
        # Height of the feature layer
        in_h = input.size(2)
        # Width of the feature layer
        in_w = input.size(3)

        # Stride
        stride_h = self.img_size[1] / in_h
        stride_w = self.img_size[0] / in_w

        # Rescale the anchors to feature-layer units
        scaled_anchors = [(a_w / stride_w, a_h / stride_h) for a_w, a_h in self.anchors]
        # reshape: (N, 3*attrs, H, W) -> (N, 3, H, W, attrs)
        prediction = input.view(bs, int(self.num_anchors/3),
                                self.bbox_attrs, in_h, in_w).permute(0, 1, 3, 4, 2).contiguous()

        # Split the prediction into its components
        x = torch.sigmoid(prediction[..., 0])  # Center x
        y = torch.sigmoid(prediction[..., 1])  # Center y
        w = prediction[..., 2]  # Width
        h = prediction[..., 3]  # Height
        conf = torch.sigmoid(prediction[..., 4])  # Conf
        pred_cls = torch.sigmoid(prediction[..., 5:])  # Cls pred.

        # Find which anchors contain an object
        mask, noobj_mask, tx, ty, tw, th, tconf, tcls, box_loss_scale_x, box_loss_scale_y =\
            self.get_target(targets, scaled_anchors,
                            in_w, in_h,
                            self.ignore_threshold)

        noobj_mask = self.get_ignore(prediction, targets, scaled_anchors, in_w, in_h, noobj_mask)

        # Move the targets to the device of the predictions
        device = input.device
        box_loss_scale_x = (2 - box_loss_scale_x).to(device)
        box_loss_scale_y = (2 - box_loss_scale_y).to(device)
        box_loss_scale = box_loss_scale_x * box_loss_scale_y
        mask, noobj_mask = mask.to(device), noobj_mask.to(device)
        tx, ty, tw, th = tx.to(device), ty.to(device), tw.to(device), th.to(device)
        tconf, tcls = tconf.to(device), tcls.to(device)

        # losses.
        loss_x = torch.sum(BCELoss(x, tx) / bs * box_loss_scale * mask)
        loss_y = torch.sum(BCELoss(y, ty) / bs * box_loss_scale * mask)
        loss_w = torch.sum(MSELoss(w, tw) / bs * 0.5 * box_loss_scale * mask)
        loss_h = torch.sum(MSELoss(h, th) / bs * 0.5 * box_loss_scale * mask)

        loss_conf = torch.sum(BCELoss(conf, mask) * mask / bs) + \
                    torch.sum(BCELoss(conf, mask) * noobj_mask / bs)

        loss_cls = torch.sum(BCELoss(pred_cls[mask == 1], tcls[mask == 1])/bs)

        loss = loss_x * self.lambda_xy + loss_y * self.lambda_xy + \
               loss_w * self.lambda_wh + loss_h * self.lambda_wh + \
               loss_conf * self.lambda_conf + loss_cls * self.lambda_cls
        # print(loss, loss_x.item() + loss_y.item(), loss_w.item() + loss_h.item(),
        #       loss_conf.item(), loss_cls.item(), \
        #       torch.sum(mask),torch.sum(noobj_mask))
        return loss, loss_x.item(), loss_y.item(), loss_w.item(), \
               loss_h.item(), loss_conf.item(), loss_cls.item()

    def get_target(self, target, anchors, in_w, in_h, ignore_threshold):
        # Number of images in the batch
        bs = len(target)
        # Anchors that belong to this feature layer (assumes a 416x416 input)
        anchor_index = [[0,1,2],[3,4,5],[6,7,8]][[13,26,52].index(in_w)]
        subtract_index = [0,3,6][[13,26,52].index(in_w)]
        # Create arrays of all zeros or all ones
        mask = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        noobj_mask = torch.ones(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)

        tx = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        ty = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        tw = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        th = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        tconf = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        tcls = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, self.num_classes, requires_grad=False)

        box_loss_scale_x = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        box_loss_scale_y = torch.zeros(bs, int(self.num_anchors/3), in_h, in_w, requires_grad=False)
        for b in range(bs):
            for t in range(target[b].shape[0]):
                # Position of the ground-truth box on the feature layer
                gx = target[b][t, 0] * in_w
                gy = target[b][t, 1] * in_h
                gw = target[b][t, 2] * in_w
                gh = target[b][t, 3] * in_h

                # Grid cell the box center falls into
                gi = int(gx)
                gj = int(gy)

                # Ground-truth box, moved to the origin so only its shape matters
                gt_box = torch.FloatTensor(np.array([0, 0, gw, gh])).unsqueeze(0)

                # All anchor boxes, also at the origin
                anchor_shapes = torch.FloatTensor(np.concatenate((np.zeros((self.num_anchors, 2)),
                                                                  np.array(anchors)), 1))
                # Overlap between the ground-truth box and every anchor
                anch_ious = bbox_iou(gt_box, anchor_shapes)

                # Find the best matching anchor box
                best_n = np.argmax(anch_ious)
                if best_n not in anchor_index:
                    continue
                # Masks
                if (gj < in_h) and (gi < in_w):
                    best_n = best_n - subtract_index
                    # Mark the anchor that really contains an object
                    noobj_mask[b, best_n, gj, gi] = 0
                    mask[b, best_n, gj, gi] = 1
                    # Target adjustments for the anchor center
                    tx[b, best_n, gj, gi] = gx - gi
                    ty[b, best_n, gj, gi] = gy - gj
                    # Target adjustments for the anchor width and height
                    tw[b, best_n, gj, gi] = math.log(gw / anchors[best_n+subtract_index][0])
                    th[b, best_n, gj, gi] = math.log(gh / anchors[best_n+subtract_index][1])
                    # Normalized box size, used to weight the box loss
                    box_loss_scale_x[b, best_n, gj, gi] = target[b][t, 2]
                    box_loss_scale_y[b, best_n, gj, gi] = target[b][t, 3]
                    # Objectness confidence
                    tconf[b, best_n, gj, gi] = 1
                    # Class
                    tcls[b, best_n, gj, gi, int(target[b][t, 4])] = 1
                else:
                    print('Step {0} out of bound'.format(b))
                    print('gj: {0}, height: {1} | gi: {2}, width: {3}'.format(gj, in_h, gi, in_w))
                    continue

        return mask, noobj_mask, tx, ty, tw, th, tconf, tcls, box_loss_scale_x, box_loss_scale_y

    def get_ignore(self, prediction, target, scaled_anchors, in_w, in_h, noobj_mask):
        bs = len(target)
        anchor_index = [[0,1,2],[3,4,5],[6,7,8]][[13,26,52].index(in_w)]
        scaled_anchors = np.array(scaled_anchors)[anchor_index]
        # print(scaled_anchors)
        # Adjustment parameters for the anchor center
        x_all = torch.sigmoid(prediction[..., 0])
        y_all = torch.sigmoid(prediction[..., 1])
        # Adjustment parameters for the anchor width and height
        w_all = prediction[..., 2]  # Width
        h_all = prediction[..., 3]  # Height
        for i in range(bs):
            x = x_all[i]
            y = y_all[i]
            w = w_all[i]
            h = h_all[i]

            FloatTensor = torch.cuda.FloatTensor if x.is_cuda else torch.FloatTensor
            LongTensor = torch.cuda.LongTensor if x.is_cuda else torch.LongTensor

            # Build the grid: the top-left corner of every cell
            grid_x = torch.linspace(0, in_w - 1, in_w).repeat(in_h, 1).repeat(
                int(self.num_anchors/3), 1, 1).view(x.shape).type(FloatTensor)
            grid_y = torch.linspace(0, in_h - 1, in_h).repeat(in_w, 1).t().repeat(
                int(self.num_anchors/3), 1, 1).view(y.shape).type(FloatTensor)

            # Anchor widths and heights, broadcast to the prediction shape
            anchor_w = FloatTensor(scaled_anchors).index_select(1, LongTensor([0]))
            anchor_h = FloatTensor(scaled_anchors).index_select(1, LongTensor([1]))

            anchor_w = anchor_w.repeat(1, 1, in_h * in_w).view(w.shape)
            anchor_h = anchor_h.repeat(1, 1, in_h * in_w).view(h.shape)

            # Adjusted anchor center, width and height
            pred_boxes = torch.FloatTensor(prediction[0][..., :4].shape)
            pred_boxes[..., 0] = x.data + grid_x
            pred_boxes[..., 1] = y.data + grid_y
            pred_boxes[..., 2] = torch.exp(w.data) * anchor_w
            pred_boxes[..., 3] = torch.exp(h.data) * anchor_h

            pred_boxes = pred_boxes.view(-1, 4)

            for t in range(target[i].shape[0]):
                gx = target[i][t, 0] * in_w
                gy = target[i][t, 1] * in_h
                gw = target[i][t, 2] * in_w
                gh = target[i][t, 3] * in_h
                gt_box = torch.FloatTensor(np.array([gx, gy, gw, gh])).unsqueeze(0)

                # Predictions that overlap a ground-truth box heavily are not treated as background
                anch_ious = bbox_iou(gt_box, pred_boxes, x1y1x2y2=False)
                anch_ious = anch_ious.view(x.size())
                noobj_mask[i][anch_ious>self.ignore_threshold] = 0
                # print(torch.max(anch_ious))
        return noobj_mask

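5. Starting the training

Training proper covers loading and preprocessing the dataset (image normalization, conversion of the box coordinate format, reordering of the image channels, data augmentation and so on); see the Generator class in yolo_training.py. For loading the pretrained weights and for the forward pass, backpropagation and gradient descent of the network, see train.py.

To get the YOLOv3 pretrained weights and the VisDrone2019 dataset, follow my account.

Training your own YOLOv3 model

Overall folder structure of YOLOv3:

This article trains with the VOC format.

I. Dataset preparation

1. Downloading the VisDrone2019 dataset and training with it:

After downloading, put the dataset in the VOCdevkit folder and use det_to_voc.py, which I placed under VOCdevkit, to convert the VisDrone dataset to VOC format. The generated xml files can be stored in Annotations, or in a separate xml folder that you create yourself, as long as you set the xmlfilepath path accordingly when converting with voc2yolo3.py.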
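To give an idea of what such a conversion has to do, here is a hedged sketch that parses one VisDrone detection annotation line into the corner coordinates a VOC xml file stores. It is not the det_to_voc.py script, whose contents are not shown in this article. The field order (left, top, width, height, score, category, truncation, occlusion) is assumed from the layout described by the VisDrone toolkit, and the function name and returned keys are made up.

```python
def parse_visdrone_line(line):
    """Parse one annotation line, assumed to be
    left,top,width,height,score,category,truncation,occlusion."""
    # Some files end lines with a trailing comma, so drop empty fields
    fields = [int(v) for v in line.strip().split(",") if v.strip() != ""]
    left, top, width, height, score, category, truncation, occlusion = fields[:8]
    return {
        "xmin": left,
        "ymin": top,
        "xmax": left + width,    # VOC stores corners, not width and height
        "ymax": top + height,
        "category": category,
        "truncation": truncation,
        "occlusion": occlusion,
    }


print(parse_visdrone_line("684,8,273,116,0,0,0,0"))
```

A full converter would also map the numeric category to a class name and write the xml file, which depends on the class list you train with.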