YOLOX: Exceeding YOLO Series in 2021
1. Introduction
Because YOLOv4 and YOLOv5 are somewhat over-optimized for the anchor-based pipeline, YOLOv3 is used as the base network.
Considering YOLOv4 and YOLOv5 may be a little over-optimized for the anchor-based pipeline, we choose YOLOv3 [25] as our start point (we set YOLOv3-SPP as the default YOLOv3).
The large YOLOX model outperforms YOLOv5-L by 1.8% AP; in short, it beats the most widely used detectors today.
YOLOX-L achieves 50.0% AP on COCO with 640 × 640 resolution, outperforming the counterpart YOLOv5-L by 1.8% AP.
2. YOLOX
2.1. YOLOX-DarkNet53
Training details: train for 300 epochs on COCO with the SGD optimizer; learning rate is lr × BatchSize / 64 with an initial lr of 0.01 and a cosine schedule; L2 weight decay is 0.0005; SGD momentum is 0.9; batch size is 128; the input image size is drawn from 448 to 832.
Implementation details We train the models for a total of 300 epochs with 5 epochs warmup on COCO train2017 [17]. We use stochastic gradient descent (SGD) for training. We use a learning rate of lr×BatchSize/64 (linear scaling [8]), with an initial lr = 0.01 and the cosine lr schedule. The weight decay is 0.0005 and the SGD momentum is 0.9. The batch size is 128 by default to typical 8-GPU devices. Other batch sizes including single GPU training also work well. The input size is evenly drawn from 448 to 832 with 32 strides.
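The two scheduling rules above can be sketched in a few lines. This is my own minimal sketch (the function names are mine, not from the YOLOX code): linear lr scaling with the batch size, and drawing the multi-scale training resolution from 448 to 832 in steps of 32.

```python
import random

def scaled_lr(base_lr=0.01, batch_size=128):
    """Linear scaling rule from the paper: lr = base_lr * BatchSize / 64."""
    return base_lr * batch_size / 64

def sample_input_size(low=448, high=832, stride=32):
    """Draw the training input resolution uniformly from low..high in
    multiples of `stride`, as described above."""
    return random.choice(range(low, high + 1, stride))
```

With the default batch size of 128 this gives lr = 0.02, matching the settings quoted above.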
YOLOv3 baseline: the baseline adopts a DarkNet53 backbone plus an SPP layer. Compared with the original implementation, the training strategy is changed slightly: EMA weights updating, a cosine lr schedule, IoU loss and an IoU-aware branch are added. BCE loss is used for training the classification and objectness branches, and IoU loss for the regression branch. These general training tricks are orthogonal to the key improvements, so they go into the baseline. For augmentation, only horizontal flip, color jitter, and multi-scale are used; RandomResizedCrop is dropped because it overlaps with the planned mosaic augmentation. With these changes, the YOLOv3 baseline reaches 38.5% AP.
YOLOv3 baseline Our baseline adopts the architecture of DarkNet53 backbone and an SPP layer, referred to YOLOv3-SPP in some papers [1, 7]. We slightly change some training strategies compared to the original implementation [25], adding EMA weights updating, cosine lr schedule, IoU loss and IoU-aware branch. We use BCE Loss for training cls and obj branch, and IoU Loss for training reg branch. These general training tricks are orthogonal to the key improvement of YOLOX, we thus put them on the baseline. Moreover, we only conduct RandomHorizontalFlip, ColorJitter and multi-scale for data augmentation and discard the RandomResizedCrop strategy, because we found the RandomResizedCrop is kind of overlapped with the planned mosaic augmentation. With those enhancements, our baseline achieves 38.5% AP on COCO val, as shown in Tab. 2.
Concept notes:
SPP layer: spatial pyramid pooling. After the convolutional layers, max pooling over a few fixed grids compresses a feature map of any spatial size into a fixed-size output, so the network can accept different input image sizes.

EMA weights updating: an exponential moving average of the weights,

    v_t = β · v_{t-1} + (1 − β) · θ_t

where v_t is the running average over the previous steps (with v_0 = 0) and β is the decay weight (usually set to 0.9–0.999). In deep-learning optimization, θ_t is the model weights at step t and v_t is the "shadow weights" at step t: the weights obtained at each step are blended with the weighted average of all preceding steps.
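One EMA update step, written out as a sketch (weights are a plain dict of floats here; real frameworks apply this tensor-wise to every parameter):

```python
def ema_update(shadow, weights, beta=0.999):
    """One EMA step: v_t = beta * v_{t-1} + (1 - beta) * theta_t,
    applied parameter-wise. `shadow` and `weights` map names to values."""
    return {k: beta * shadow[k] + (1.0 - beta) * weights[k] for k in shadow}
```

With a large beta the shadow weights move slowly, smoothing out noisy per-step updates; evaluation is then done with the shadow weights.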
cosine lr schedule: cosine learning-rate decay. Cosine is a periodic function, and T_max here is half of that period: with T_max = 10, the decay cycle is 20 epochs, where the first 10 epochs decay from the initial (maximum) learning rate down to the minimum, and the next 10 epochs rise from the minimum back to the maximum.
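The T_max behavior described above follows directly from the cosine annealing formula (this matches the common CosineAnnealingLR-style definition; the function name is mine):

```python
import math

def cosine_lr(t, lr_max, lr_min=0.0, t_max=10):
    """Cosine annealing: lr decays from lr_max at t=0 to lr_min at
    t=t_max; since cosine is periodic, t=2*t_max returns to lr_max."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / t_max))
```

So with t_max = 10, epochs 0→10 decay to the minimum and epochs 10→20 climb back, exactly the 20-epoch cycle described above.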

IoU loss: a box-regression loss based on intersection over union, typically 1 − IoU (not a cross-entropy loss); it optimizes the overlap between the predicted and ground-truth boxes directly.
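A minimal sketch of the 1 − IoU form for two axis-aligned boxes in (x1, y1, x2, y2) format (this is the generic IoU loss, not the exact YOLOX implementation):

```python
def iou_loss(box_a, box_b):
    """IoU loss = 1 - IoU for two axis-aligned (x1, y1, x2, y2) boxes.
    Identical boxes give 0; disjoint boxes give 1."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection width/height, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return 1.0 - inter / union
```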

RandomHorizontalFlip: random horizontal flip. Based on a drawn random number, the image and its labels are either flipped left-right or left unchanged.
RandomResizedCrop: random crop and resize. A random region of the image is first cropped out, and the cropped part is then resized to a fixed size.

Strong data augmentation: Mosaic and MixUp are used to boost YOLOX's performance. Mosaic is an effective augmentation strategy proposed in the ultralytics YOLOv3 variant and widely used in YOLOv4, YOLOv5 and other detectors. MixUp was originally designed for image classification but was adapted for detection training in BoF. Both MixUp and Mosaic are adopted in the model and switched off for the last 15 epochs, achieving 42.0% AP.
Strong data augmentation We add Mosaic and MixUp into our augmentation strategies to boost YOLOX's performance. Mosaic is an efficient augmentation strategy proposed by ultralytics-YOLOv3. It is then widely used in YOLOv4 [1], YOLOv5 [7] and other detectors [3]. MixUp [10] is originally designed for image classification task but then modified in BoF [38] for object detection training. We adopt the MixUp and Mosaic implementation in our model and close it for the last 15 epochs, achieving 42.0% AP in Tab.
Mosaic: mosaic augmentation. Four images are sampled; each is augmented (color jitter, flipping, scaling, etc.), and the four are then stitched together into one image.
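A bare-bones sketch of the stitching step only, under my own simplifying assumptions: four equal-sized images (plain nested lists standing in for pixel arrays) are placed into the four quadrants of a 2×2 canvas. The real Mosaic also jitters the stitch point and rescales each tile, which is omitted here.

```python
def mosaic_2x2(imgs, size):
    """Stitch four size x size images into one (2*size) x (2*size)
    canvas: img 0 top-left, 1 top-right, 2 bottom-left, 3 bottom-right."""
    h = w = size
    canvas = [[0] * (2 * w) for _ in range(2 * h)]
    for k, img in enumerate(imgs):           # k = 0..3 selects a quadrant
        oy, ox = (k // 2) * h, (k % 2) * w   # top-left corner of the tile
        for y in range(h):
            for x in range(w):
                canvas[oy + y][ox + x] = img[y][x]
    return canvas
```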

(Figure: image selection → position placement → image stitching)
MixUp: image mixing. Two random samples are blended by a ratio, and the classification labels are mixed by the same ratio. This mainly describes the classification setting; for detection, details will be added later based on the code.
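The classification-style MixUp described above, as a sketch (images are flat lists of floats and labels are one-hot lists; in practice the ratio lam is drawn from a Beta distribution):

```python
def mixup(x1, y1, x2, y2, lam=0.5):
    """MixUp: blend two samples and their one-hot labels by the same
    ratio lam: x = lam*x1 + (1-lam)*x2, y = lam*y1 + (1-lam)*y2."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```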

Anchor-free: YOLOv4 and YOLOv5 both build on the traditional anchor mechanism, which has several problems. First, to get good detection performance in practice, clustering must be run to determine a set of suitable anchor boxes; the clustered anchors are domain-specific and not general. Second, anchors increase the complexity of the detection head and the number of predictions per image. On some edge AI systems, moving this large amount of data around causes latency and becomes a potential bottleneck. Going anchor-free raises AP to 42.9%.
For each location, the 3 predictions are reduced to 1, directly predicting four values: two offsets relative to the top-left corner of the grid cell, and the height and width of the predicted box.
Anchor-free Both YOLOv4 [1] and YOLOv5 [7] follow the original anchor-based pipeline of YOLOv3 [25]. However, the anchor mechanism has many known problems. First, to achieve optimal detection performance, one needs to conduct clustering analysis to determine a set of optimal anchors before training. Those clustered anchors are domain-specific and less generalized. Second, anchor mechanism increases the complexity of detection heads, as well as the number of predictions for each image. On some edge AI systems, moving such large amount of predictions between devices (e.g., from NPU to CPU) may become a potential bottleneck in terms of the overall latency.
We reduce the predictions for each location from 3 to 1 and make them directly predict four values, i.e., two offsets in terms of the left-top corner of the grid, and the height and width of the predicted box.
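Decoding the four predicted values back into a box can be sketched as follows. The offset handling (cell top-left plus offsets, scaled by the stride) follows the description above; using exp on the raw width/height outputs is a common convention I assume here, not a detail quoted from the paper.

```python
import math

def decode_prediction(grid_x, grid_y, dx, dy, dw, dh, stride):
    """Decode one anchor-free prediction into a (cx, cy, w, h) box:
    (dx, dy) are offsets from the top-left corner of cell
    (grid_x, grid_y); (dw, dh) are log-space size predictions."""
    cx = (grid_x + dx) * stride
    cy = (grid_y + dy) * stride
    w = math.exp(dw) * stride
    h = math.exp(dh) * stride
    return cx, cy, w, h
```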

Multiple positives: to stay consistent with YOLOv3's assigning rule, the anchor-free version above selects only one positive sample per object and ignores other high-quality predictions. However, optimizing those high-quality predictions can also bring beneficial gradients, which alleviates the extreme positive/negative imbalance during training. We simply take the 3×3 area around the center as positives, also called "center sampling" in FCOS. This raises detection accuracy to 45.0% AP, already surpassing the current best YOLOv3 variant.
Multi positives To be consistent with the assigning rule of YOLOv3, the above anchor-free version selects only ONE positive sample (the center location) for each object meanwhile ignores other high quality predictions. However, optimizing those high quality predictions may also bring beneficial gradients, which may alleviate the extreme imbalance of positive/negative sampling during training. We simply assign the center 3×3 area as positives, also named "center sampling" in FCOS [29]. The performance of the detector improves to 45.0% AP as in Tab. 2, already surpassing the current best practice of ultralytics-YOLOv3 (44.3% AP).
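Center sampling as described above can be sketched like this (my own minimal version: given a ground-truth center, return the 3×3 block of grid cells around the cell that contains it, clipped to the grid):

```python
def center_3x3_positives(gt_cx, gt_cy, stride, grid_w, grid_h):
    """Return the grid cells in the 3x3 area around the cell containing
    the ground-truth center -- the 'center sampling' positives."""
    cx, cy = int(gt_cx // stride), int(gt_cy // stride)
    cells = []
    for gy in range(max(0, cy - 1), min(grid_h, cy + 2)):
        for gx in range(max(0, cx - 1), min(grid_w, cx + 2)):
            cells.append((gx, gy))
    return cells
```

So instead of the single center cell, up to 9 cells contribute positive gradients for each object.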
SimOTA: advanced label assignment is another important recent advance in detection. Based on our study OTA, four key points for label assignment can be summarized: 1) loss/quality awareness, 2) a center prior, 3) a dynamic number of positive boxes for each ground-truth label, 4) a global view. OTA satisfies all four rules, so it is chosen as the candidate label-assignment strategy.
SimOTA Advanced label assignment is another important progress of object detection in recent years. Based on our own study OTA [4], we conclude four key insights for an advanced label assignment: 1). loss/quality aware, 2). center prior, 3). dynamic number of positive anchors for each ground-truth (abbreviated as dynamic top-k), 4). global view. OTA meets all four rules above, hence we choose it as a candidate label assigning strategy.
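The "dynamic top-k" idea can be sketched very roughly. This is a heavily simplified, single-ground-truth illustration of my own, not the SimOTA implementation: k is estimated from the sum of the top-q candidate IoUs, and the k candidates with the lowest assignment cost become positives. The real SimOTA handles many ground truths at once and resolves anchors claimed by multiple of them.

```python
def simota_assign(cost, ious, q=10):
    """Pick positives for ONE ground truth: dynamic k = sum of the top-q
    IoUs (at least 1), then take the k lowest-cost candidate indices.
    `cost` and `ious` are parallel lists over candidate anchors."""
    k = max(1, int(sum(sorted(ious, reverse=True)[:q])))
    order = sorted(range(len(cost)), key=lambda i: cost[i])
    return sorted(order[:k])
```

The cost entries would combine classification and IoU losses (the "loss/quality aware" insight), while restricting candidates to the center area supplies the "center prior".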
Finally, a figure relating the strategies above to their effects should be added here.

