Siam R-CNN: Visual Tracking by Re-Detection

2019-12-02 22:21:48

Paper: https://128.84.21.199/abs/1911.12836

Code: not yet released

1. Background and Motivation:

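This paper approaches tracking from the perspective of tracking by re-detection and proposes a novel re-detector that integrates Faster R-CNN into a Siamese architecture. It re-detects the template object anywhere in an image by deciding whether a given region proposal shows the same object as the template, and then regresses a bounding box (BBox) for that object. The proposed two-stage re-detection architecture is robust to changes in the object's appearance and aspect ratio. Tracking by re-detection has a long history, but it remains limited because the object is hard to localize when distractor objects look very similar to the template object. To cope with similar-looking objects, earlier methods rely either on strong spatial priors or on online adaptation, and both can lead to model drift.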

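Building on the Siam R-CNN re-detector, the paper proposes two improvements to address the distractor problem:

1). It proposes a novel hard example mining procedure that trains specifically on difficult distractors;

2). It proposes a novel Tracklet Dynamic Programming Algorithm (TDPA) that simultaneously tracks all potential target objects, including distractor objects, by re-detecting all object candidate boxes from the previous frame and grouping those boxes over time into tracklets (short object tracks). Dynamic programming is then used to select the best object at the current time step. By explicitly modeling the motion and interaction of all potential objects, and by pooling similar-looking objects from the detections into tracklets, Siam R-CNN can track effectively over the long term, resists tracker drift, and can re-detect the object after it has disappeared.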

In terms of efficiency, the method runs at 4.7 FPS with a ResNet-101 backbone and at 15 FPS with ResNet-50.

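2. The Proposed Method:

An overview of the proposed Siam R-CNN is shown in the figure below:

As the figure shows, the method consists of several modules: a CNN plus RPN generates proposals, the authors also crop the first-frame object and combine it with the extracted proposals, and the result is fed into the re-detection module.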
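To make the data flow concrete, here is a minimal sketch of the re-detection step, assuming that "combining" means concatenating the template feature with each proposal feature. The function names and the cosine-similarity scorer are hypothetical stand-ins for illustration; the actual re-detection head is a learned network.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def redetect(
    template_feat: Sequence[float],
    proposals: Sequence[Tuple[Box, Sequence[float]]],
) -> List[Tuple[float, Box]]:
    """Score every proposal against the first-frame template.

    Returns (score, box) pairs sorted from most to least similar.
    """
    half = len(template_feat)
    scored = []
    for box, feat in proposals:
        # Assumed combination step: concatenate template and proposal features.
        pair = list(template_feat) + list(feat)
        # Stand-in for the learned head: compare the two halves of the pair.
        score = cosine_similarity(pair[:half], pair[half:])
        scored.append((score, box))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored
```

In a real tracker the features would come from RoI Align on the backbone feature map, and the head would also regress a refined box for each proposal.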

2.1. Siam R-CNN:

This subsection mainly explains how the Faster R-CNN pipeline is used for proposal generation to obtain multiple candidates.

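2.2 Video Hard Example Mining:

In conventional Faster R-CNN training, negative examples are sampled from the target image using the RPN. In many images, however, only a few negative examples are available. To maximize the discriminative power of the re-detection head, the authors argue that it must be trained on hard negative examples. Similar ideas have been widely used in both object detection and tracking.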

Embedding Network.

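A straightforward way to pick relevant videos as a source of hard negative examples is to look for objects of the same category as the current object. However, category labels are not always reliable: some objects of the same category are easy to tell apart, while objects of different categories may make ideal hard negatives. Inspired by person re-identification, the paper therefore uses an embedding network that maps the object inside each ground-truth BBox to an embedding vector representing that object. It uses the network from PReMVOS, which was trained on COCO with a batch-hard triplet loss. The training objective, in re-identification terms: two distinct persons should be far apart in the embedding space, while two crops of the same person in different frames should be close.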
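For readers unfamiliar with the batch-hard triplet loss, the following is a minimal pure-Python sketch of the general technique. The hard-margin form and the margin value of 0.2 are assumptions made for illustration; the post does not state which variant or hyperparameters the PReMVOS network uses.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[int],
    margin: float = 0.2,  # assumed value, for illustration only
) -> float:
    """For each anchor, use its farthest positive and closest negative.

    Anchors without any positive or any negative in the batch are skipped.
    Returns the mean loss over the remaining anchors (0.0 if none remain).
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [
            euclidean(anchor, other)
            for j, (other, other_label) in enumerate(zip(embeddings, labels))
            if j != i and other_label == label
        ]
        negatives = [
            euclidean(anchor, other)
            for other, other_label in zip(embeddings, labels)
            if other_label != label
        ]
        if not positives or not negatives:
            continue
        # Hardest positive is the farthest; hardest negative is the closest.
        losses.append(max(0.0, margin + max(positives) - min(negatives)))
    return sum(losses) / len(losses) if losses else 0.0
```

The loss is zero once every identity's crops sit closer to each other than to any other identity by at least the margin, which is what makes distances in the embedding space usable for retrieving look-alike objects.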

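Index Structure:

Next, we build an efficient index structure for approximate nearest-neighbor queries and use it to find the nearest neighbors of the tracked object in the embedding space. Figure 3 shows some of the retrieved negative examples.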
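The lookup itself can be illustrated with a small sketch. Note that this is an exact brute-force search used as a stand-in, not an approximate index, and the class name, method names, and the option to exclude the query's own video are all hypothetical choices made for the example.

```python
import heapq
import math
from typing import List, Optional, Sequence, Tuple

Entry = Tuple[str, str, Tuple[float, ...]]  # (video_id, object_id, embedding)


class EmbeddingIndex:
    """Brute-force nearest-neighbor lookup over object embeddings."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def add(self, video_id: str, object_id: str, embedding: Sequence[float]) -> None:
        """Store one object's embedding together with its identifiers."""
        self._entries.append((video_id, object_id, tuple(embedding)))

    def query(
        self,
        embedding: Sequence[float],
        k: int,
        exclude_video: Optional[str] = None,
    ) -> List[Entry]:
        """Return the k stored entries closest to `embedding`.

        Entries from `exclude_video` are skipped so that an object is not
        returned as its own negative (an assumption of this sketch).
        """
        candidates = [e for e in self._entries if e[0] != exclude_video]
        return heapq.nsmallest(
            k, candidates, key=lambda e: math.dist(embedding, e[2])
        )
```

A brute-force scan costs time linear in the number of stored objects per query, which is why a dedicated approximate index is preferable at dataset scale.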

Training Procedure.

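For every ground-truth BBox in the training data, the RoI-aligned features are extracted. At each training step, a video and an object in it are chosen at random, followed by a random reference frame and target frame. The authors then use the index structure from the previous section to retrieve the 10,000 nearest neighbors of the reference box and select 100 of them as negative training examples.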
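The sampling procedure can be sketched roughly as follows. The dataset layout, the `retrieve_neighbors` callable, the choice of two distinct frames, and uniform sampling of the 100 negatives are assumptions made for this example; only the counts 10,000 and 100 come from the text above.

```python
import random
from typing import Any, Callable, Dict, List

# Hypothetical dataset layout: video_id -> object_id -> annotated frame indices.
Videos = Dict[str, Dict[str, List[int]]]


def sample_training_step(
    videos: Videos,
    retrieve_neighbors: Callable[[str, str, int, int], List[Any]],
    rng: random.Random,
    num_retrieved: int = 10000,
    num_negatives: int = 100,
) -> Dict[str, Any]:
    """Draw one training example plus its mined hard negatives."""
    video_id = rng.choice(sorted(videos))
    object_id = rng.choice(sorted(videos[video_id]))
    frames = videos[video_id][object_id]
    # Assumes the object is annotated in at least two frames.
    reference, target = rng.sample(frames, 2)
    # Look up the nearest neighbors of the reference box in embedding space.
    neighbors = retrieve_neighbors(video_id, object_id, reference, num_retrieved)
    negatives = rng.sample(neighbors, min(num_negatives, len(neighbors)))
    return {
        "video": video_id,
        "object": object_id,
        "reference": reference,
        "target": target,
        "negatives": negatives,
    }
```

Passing an explicit `random.Random` instance keeps the sampling reproducible when a seed is fixed.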

2.3 Tracklet Dynamic Programming Algorithm:

The proposed Tracklet Dynamic Programming Algorithm (TDPA) explicitly tracks both the object of interest and potential distractors, so that distractor objects can be suppressed. To this end, TDPA maintains a set of tracklets, i.e. short sequences of detections, and uses dynamic programming with a scoring algorithm to select the best result. Each detection consists of a bounding box, a re-detection score, and its RoI-aligned features. In addition, each detection is part of a tracklet. Each tracklet has a start time and an end time and is defined by a set of detections.
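The data structures and the selection step might be sketched as below. The `Detection` and `Tracklet` fields follow the description above, but the scoring terms (sum of re-detection scores as the unary term, a center-distance penalty with weight 0.01 as the pairwise term) are hypothetical placeholders and not the paper's scoring function.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Detection:
    box: Box
    score: float  # re-detection score
    features: List[float] = field(default_factory=list)  # RoI-aligned features


@dataclass
class Tracklet:
    start: int  # frame index of the first detection
    detections: List[Detection]

    @property
    def end(self) -> int:
        return self.start + len(self.detections) - 1


def unary(tracklet: Tracklet) -> float:
    """Placeholder unary score: sum of the re-detection scores."""
    return sum(d.score for d in tracklet.detections)


def pairwise(a: Tracklet, b: Tracklet, weight: float = 0.01) -> float:
    """Placeholder transition score: penalize jumps between box centers."""
    ax1, ay1, ax2, ay2 = a.detections[-1].box
    bx1, by1, bx2, by2 = b.detections[0].box
    jump = abs((ax1 + ax2) - (bx1 + bx2)) / 2 + abs((ay1 + ay2) - (by1 + by2)) / 2
    return -weight * jump


def select_tracklets(tracklets: List[Tracklet]) -> List[int]:
    """Return indices of the best-scoring chain of non-overlapping tracklets.

    tracklets[0] must be the first-frame tracklet; every chain starts there.
    """
    order = sorted(range(len(tracklets)), key=lambda i: tracklets[i].start)
    best: Dict[int, Tuple[float, Optional[int]]] = {}  # index -> (score, predecessor)
    for i in order:
        current = tracklets[i]
        if i == 0:
            best[i] = (unary(current), None)
            continue
        options = [
            (best[j][0] + pairwise(tracklets[j], current), j)
            for j in best
            if tracklets[j].end < current.start
        ]
        if not options:
            continue  # not reachable from the first-frame tracklet
        score, prev = max(options)
        best[i] = (score + unary(current), prev)
    idx: Optional[int] = max(best, key=lambda i: best[i][0])
    chain = []
    while idx is not None:
        chain.append(idx)
        idx = best[idx][1]
    return chain[::-1]
```

Because every chain is anchored at the first-frame tracklet and distractors are kept as separate tracklets, a distractor with a high score but an implausible jump loses to a spatially consistent continuation.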

Tracklet Building.

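First, the features of the first-frame ground-truth BBox are extracted and used to initialize a tracklet. For each new video frame, we update the tracklets as follows (see Algorithm 1):

1. We extract the backbone features of the current frame and evaluate the RPN on them.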
