

P2B: Point-to-Box Network for 3D Object Tracking in Point Clouds


Haozhe Qi, Chen Feng, Zhiguo Cao, Feng Zhao, and Yang Xiao


National Key Laboratory of Science and Technology on Multi-Spectral Information Processing, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, Wuhan 430074, China


{qihaozhe, chen feng, zgcao}@hust.edu.cn, fzhao@alumni.hust.edu.cn, Yang_Xiao@hust.edu.cn

Figure 1. Illustration of how P2B works, from seed sampling to 3D target proposal and verification. (Figure labels: target template; seed points with target-specific features; cluster of potential target centers; s_p: proposal-wise targetness score; final predicted 3D target box.)

Abstract


Towards 3D object tracking in point clouds, we propose a novel end-to-end-learned P2B network. Our main idea is to first localize potential target centers in the 3D search area embedded with target information, and then execute point-driven 3D target proposal and verification jointly. In this way, time-consuming 3D exhaustive search can be avoided. Specifically, we first sample seeds from the point clouds of the template and the search area respectively. Then we execute permutation-invariant feature augmentation to embed target clues from the template into the search area seeds, and represent them with target-specific features. Accordingly, the augmented search area seeds regress potential target centers via Hough voting. These centers are further strengthened with seed-wise targetness scores. Finally, each center clusters its neighbors to leverage the ensemble power for joint 3D target proposal and verification. We apply PointNet++ as our backbone. Experiments on the KITTI tracking dataset demonstrate P2B's superiority (~10% improvement over the state-of-the-art). Note that P2B can run at 40 FPS on a single NVIDIA 1080Ti GPU. Our code and model are available at https://github.com/HaozheQi/P2B.

1. Introduction

3D object tracking in point clouds is essential for applications in autonomous driving and robotics vision [25, 26, 7]. However, point clouds' sparsity and disorder impose great challenges on this task, and lead to the fact that well-established 2D object tracking approaches (e.g., Siamese networks [3]) cannot be directly applied. Most existing 3D object tracking methods [1, 4, 24, 16, 15] inherit 2D's experience and rely heavily on RGB-D information. But they may fail when the RGB visual information is degraded by illumination change or even inaccessible. We hence focus on 3D object tracking using only point clouds. The first pioneer effort on this topic appears in [11]. It mainly executes 3D template matching using Kalman filtering [12] to generate bunches of 3D target proposals. Meanwhile, it uses shape completion to regularize feature learning on point sets. Nevertheless, it tends to suffer from four main defects: 1) its tracking network cannot be end-to-end trained; 2) 3D search with Kalman filtering consumes much time; 3) each target proposal is represented with only a one-dimensional global feature, which may lose fine local geometric information; 4) the shape completion network brings a strong class prior which weakens generality.


Towards the above concerns, we propose a novel point-to-box network termed P2B for 3D object tracking, which can be end-to-end trained. Differing from the intuitive 3D search with boxes in [11], we address 3D object tracking by first localizing potential target centers and then executing point-driven target proposal and verification jointly. Our intuition is twofold. First, the point-wise tracking paradigm may help better exploit 3D local geometric information to characterize the target in point clouds. Second, formulating the 3D object tracking task in an end-to-end manner gives stronger ability to fit the target's 3D appearance variation during tracking.


We exemplify how P2B works in Fig. 1. We first feed the template and search area into the backbone respectively and obtain their seeds. The search area seeds will consequently predict potential target centers for joint target proposal and verification. The search area seeds are augmented with target-specific features, yielding three main components: 1) their 3D position coordinates to retain spatial geometric information, 2) their point-wise similarity with template seeds to mine resembling patterns and reveal the local tracking clue, and 3) the encoded global feature of the target from the template. This augmentation is invariant to seeds' permutation and yields consistent target-specific features. After that, the augmented seeds are projected to the potential target centers via Hough voting [28]. Meanwhile, each seed is assessed with its targetness to regularize earlier feature learning; the resulting targetness score further strengthens its predicted target center's representation. Finally, each potential target center clusters the neighbors to leverage the ensemble power for joint target proposal and verification.


Experiments on the KITTI tracking dataset [10] demonstrate that P2B significantly outperforms the state-of-the-art method [11] by a large margin (~10% on both Success and Precision). Note that P2B can run at about 40 FPS on a single NVIDIA 1080Ti GPU.

Overall, the main contributions of this paper include:

• P2B: a novel point-to-box network for 3D object tracking in point clouds, which can be end-to-end trained;

• Target-specific feature augmentation to include global and local 3D visual clues for 3D object tracking;

• Integration of 3D target proposal and verification.


2. Related Works

We briefly introduce the works most related to our P2B: 3D object tracking, 2D Siamese tracking, deep learning on point set, target proposal and Hough voting.


3D object tracking. To the best of our knowledge, 3D object tracking using only point clouds had seldom been studied before the recent pioneer attempt [11]. Earlier related tracking methods [24, 16, 15, 27, 1, 4] generally resort to RGB-D information. Though with paid efforts from different theoretical aspects, they may suffer from two main defects: 1) they rely on the RGB visual clue and may fail if it is degraded or even inaccessible, which limits some real applications; 2) they have no networks designed for 3D tracking, which may limit the representative power. Besides, some of them [24, 16, 15] focus on generating 2D boxes. The above concerns are addressed in [11]. Leveraging deep learning on point sets and 3D target proposal, it achieves the state-of-the-art result on 3D object tracking using only point clouds. However, it still suffers from the drawbacks listed in Sec. 1, which motivates our research.


2D Siamese tracking. Numerous state-of-the-art 2D tracking methods [33, 3, 34, 13, 42, 35, 20, 8, 40, 36, 21] are built upon Siamese networks. Generally, a Siamese network has two branches for template and search area with shared weights to measure their similarity in an implicitly embedded space. Recently, [21] united the region proposal network and Siamese network to boost performance. Hence, time-consuming multi-scale search and online fine-tuning are both avoided. Afterwards, many efforts [42, 20, 40, 36, 8] follow this paradigm. However, the above methods are all driven by 2D CNNs, which are inapplicable to point clouds. We hence aim to extend the Siamese tracking paradigm to 3D object tracking with effective 3D target proposal.


Deep learning on point sets. Recently, deep learning on point sets has drawn increasing research interest [5, 30]. To address point clouds' disorder, sparsity and rotation variance, the paid efforts have facilitated research in 3D object recognition [18, 23], 3D object detection [28, 29, 32, 39], 3D pose estimation [22, 9, 6], and 3D object tracking [11]. However, the 3D tracking network in [11] cannot execute end-to-end 3D target proposal and verification jointly, which constitutes P2B's focus.


Figure 2 legend: s_s denotes the seed-wise targetness score; MLP denotes a multi-layer perceptron with fully-connected layers, batch normalization and ReLU.

Figure 3. The concept of permutation invariance. To represent rj, we first compute the point-wise similarity Sim_j,: between rj and all template seeds Q = {qi}. However, Sim_j,: keeps changing due to Q's disorder (the order of Q can vary irregularly). This motivates our feature augmentation towards a consistent (i.e., permutation-invariant) target-specific feature f_t_rj.

Target proposal. In 2D tracking tasks, many tracking-by-detection methods [41, 37, 14] exploit the target clue contained in the template to obtain high-quality target-specific proposals. They operate on (2D) area-based pixels with either edge features [41], a region-proposal network [37] or an attention map [14] in a target-aware manner. Comparatively, P2B regards each point as a regressor towards the potential target center, which directly relates to 3D target proposal.


Hough voting. The seminal work on Hough voting [19] proposes a highly flexible learned representation for object shape, which can combine the information observed on different training examples in a probabilistic extension of the Generalized Hough Transform [2]. Recently, [28] embedded Hough voting into an end-to-end trainable deep network for 3D object detection in point clouds, which further aggregates local context and yields promising results. But how to effectively apply it to 3D object tracking remained unexplored.


3. P2B: A Novel Network on Point Set for 3D Object Tracking

3.1. Overview

In 3D object tracking, we focus on localizing the target (defined by a template) in the search area frame by frame. We aim to embed the template's target clue into the search area to predict potential target centers, and execute joint target proposal and verification in an end-to-end manner. P2B has two main parts (Fig. 2): 1) target-specific feature augmentation, and 2) 3D target proposal and verification. We first feed the template and search area respectively into the backbone and obtain their seeds. Then the template seeds help augment the search area seeds with target-specific features. After that, these augmented search area seeds are projected to potential target centers via Hough voting. Seed-wise targetness scores are also calculated to regularize feature learning and strengthen the discriminative power of these potential target centers. Then each potential target center clusters its neighbors for 3D target proposal. The proposal with the maximal proposal-wise targetness score is verified as the final result. We detail these steps as follows. Main symbols within P2B are defined in Table 1. For easy comprehension, we also sketch the detailed technical flow in Algorithm 1.


 

Algorithm 1: P2B's detailed technical flow. Φ and Θ denote MLP-Maxpool-MLP networks operating on feature channels.

Input: points in the template P_tmp (of size N1) and the search area P_sea (of size N2).

Output: the proposal with the highest score s_p.

1: Feature extraction. Feed P_tmp and P_sea into the backbone and obtain seeds Q = {qi}, i = 1..M1, and R = {rj}, j = 1..M2, with features f ∈ R^d1. Each seed is represented with its 3D position and f, yielding a dimension of 3 + d1.

2: Point-wise similarity. Compute the point-wise similarity Sim_j,: between each search area seed rj and Q. Over all search area seeds, we obtain Sim ∈ R^(M2×M1) against all template seeds.

3: Feature augmentation. Augment each Sim_j,: with Q to the size M1 × (1 + 3 + d1). Feed the result into Φ to obtain rj's target-specific feature f_t_rj ∈ R^d1. rj is represented with its 3D position and f_t_rj, yielding a dimension of 3 + d1.

4: Potential target center generation. Each seed rj predicts a potential target center cj with feature f_cj ∈ R^d2, 1) via Hough voting, and 2) is assessed with a seed-wise targetness score s_s_j ∈ R. cj is represented by concatenating s_s_j, its 3D position and f_cj, yielding a dimension of 1 + 3 + d2.

5: Clustering. Sample a subset of size K among the potential target centers C. For each sampled cj, generate a cluster Tj with ball query, where Tj contains nj potential target centers.

6: 3D target proposal. Feed each Tj into Θ to generate one 3D target proposal p_t_j with proposal-wise targetness score s_p_j. K proposals are predicted in total.

3.2. Target-specific feature augmentation

Here we aim to merge the template's target information into the search area seeds to include both the global target clue and the local tracking clue. We first feed the template and search area respectively into the feature backbone and obtain their seeds. With the target information embedded in the template, we then augment the search area seeds with target-specific features in the spirit of pattern matching, which also satisfies permutation-invariance to address point clouds' disorder.


Feature encoding on point clouds. We feed the points in the template P_tmp (of size N1) and search area P_sea (of size N2) to a feature backbone and obtain M1 template seeds Q = {qi}, i = 1..M1, and M2 search area seeds R = {rj}, j = 1..M2, with features f ∈ R^d1. We applied the hierarchical feature learning architecture of PointNet++ [30] as the backbone (but are not restricted to it), so that Q and R could preserve local context within P_tmp and P_sea. Each seed is finally represented with [x; f] ∈ R^(3+d1) (x denotes the seed's 3D position).


Permutation-invariant target-specific feature augmentation. To embed Q's target information into R, a natural idea is to compute the point-wise similarity Sim (of size M2 × M1) between Q and R, e.g., using the cosine distance:

Sim_{j,i} = (f_qi · f_rj) / (‖f_qi‖2 · ‖f_rj‖2).   (1)

Note that Sim_j,: (row j in Sim) denotes the similarity between rj and all seeds in Q. We may first consider Sim_j,: as rj's target-specific feature. However, as in Fig. 3, Sim_j,: keeps unstable due to Q's disorder. This contradicts our need for a consistent feature, i.e., a feature invariant to Q's inside permutation. We accordingly apply symmetric functions (specifically, Maxpool) to ensure permutation-invariance. As in Fig. 4, we first augment each Sim_j,: (local tracking clue) with Q's spatial coordinates and features (global target clue), yielding a tensor of size M1 × (1 + 3 + d1). Then we feed the tensor into the network Φ (MLP-Maxpool-MLP) to obtain rj's target-specific feature f_t_rj ∈ R^d1; rj is finally represented with [x_rj; f_t_rj] ∈ R^(3+d1) (x_rj denotes rj's 3D position).

There are other selections to extract f_t: leaving out Q's features, leaving out Sim, or adding R's features. All of them turn out inferior in Sec. 4.3.1.
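The similarity computation and the role of the symmetric function can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: random vectors stand in for real seed features, and only the Maxpool step of Φ is shown; the shapes M1 = 64, M2 = 128, d1 = 256 follow the implementation details in Sec. 4.1.3.

```python
import numpy as np

def cosine_similarity(q_feat, r_feat):
    """Sim (M2 x M1): cosine similarity between every search-area seed
    feature in r_feat (M2 x d) and every template seed feature in q_feat (M1 x d)."""
    qn = q_feat / np.linalg.norm(q_feat, axis=1, keepdims=True)
    rn = r_feat / np.linalg.norm(r_feat, axis=1, keepdims=True)
    return rn @ qn.T

rng = np.random.default_rng(0)
q = rng.normal(size=(64, 256))    # M1 = 64 template seed features
r = rng.normal(size=(128, 256))   # M2 = 128 search-area seed features

sim = cosine_similarity(q, r)     # each row j is Sim_j,:
pooled = sim.max(axis=1)          # symmetric function (Maxpool over template seeds)

# Permuting the template seeds reorders each row of Sim,
# but leaves the max-pooled result unchanged.
perm = rng.permutation(64)
assert np.allclose(pooled, cosine_similarity(q[perm], r).max(axis=1))
```

The final assertion is the point of the sketch: any reordering of Q changes Sim_j,: element-wise, yet the output of the symmetric function stays identical, which is exactly the permutation-invariance the augmentation needs.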


3.3. Target proposal based on potential target centers

Embedded with the target clue, each rj can directly predict one target proposal. But our intuition is that an individual seed can only capture a limited local clue, which may not suffice for the final prediction. We follow the idea within VoteNet [28] to 1) regress the search area seeds into potential target centers via Hough voting, and 2) cluster neighboring centers to leverage the ensemble power and obtain target proposals.


Potential target center generation. Each seed rj with target-specific feature f_t_rj can coarsely predict a potential target center cj via Hough voting. Following VoteNet [28], the voting module applies an MLP to predict the coordinate offset Δxj between rj and the ground-truth target center, as well as the feature residual Δf_t_rj. Hence cj is represented with x_cj = x_rj + Δxj and f_cj = f_t_rj + Δf_t_rj. The loss on Δxj is defined as

L_reg = (1 / M_ts) Σ_j ‖Δxj − Δgt_j‖ · 1[rj on target],   (2)

where Δgt_j denotes the ground-truth offset from rj to the target center; 1[rj on target] indicates that we only train those seeds located on the surface of the ground-truth target; and M_ts denotes the number of trained seeds.
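The masked offset loss of Equation (2) can be sketched with numpy as follows. This is an illustrative stand-in, not the trained network: random arrays replace the MLP's predicted offsets, and an L1 norm is assumed for ‖·‖.

```python
import numpy as np

def vote_reg_loss(pred_offset, gt_offset, on_target):
    """Equation (2): L1 penalty between predicted and ground-truth offsets,
    averaged only over the M_ts seeds lying on the ground-truth target surface."""
    mask = on_target.astype(float)
    per_seed = np.abs(pred_offset - gt_offset).sum(axis=1)  # ||Δx_j − Δgt_j||_1
    return (per_seed * mask).sum() / max(mask.sum(), 1.0)

rng = np.random.default_rng(1)
pred = rng.normal(size=(128, 3))          # hypothetical predicted offsets, one per seed
gt = rng.normal(size=(128, 3))            # ground-truth offsets to the target center
on_target = rng.random(128) < 0.3         # which seeds lie on the target surface
loss = vote_reg_loss(pred, gt, on_target)
```

The mask is the key design choice: background seeds have no meaningful "offset to the target center", so they contribute nothing to the regression loss and are supervised only by the targetness classification described in Sec. 3.4.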

 

Clustering and target proposal. For each cj, we generate a cluster Tj with ball query of radius R: Tj = {ck | ‖ck − cj‖2 < R}. Since neighboring clusters may capture similar region-level context, for efficiency we sample a subset of size K among all potential target centers as cluster centroids. As shown in Sec. 4.3.3, P2B turns out robust to a wide range of K. Finally, we feed each Tj into Θ (MLP-Maxpool-MLP) and obtain one target proposal p_t_j with proposal-wise targetness score s_p_j (K proposals are generated in total):

(p_t_j, s_p_j) = Θ(Tj).   (3)

p_t_j is parameterized by the offsets of the 3D position and the rotation in the X-Y plane. We will detail how to learn Θ in Sec. 3.5.
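The ball-query step above can be sketched in numpy. This is a brute-force illustration (real implementations use spatial indexing and cap the cluster size); random points stand in for the voted potential target centers, with K = 64 and R = 0.3 as in Sec. 4.1.3.

```python
import numpy as np

def ball_query_clusters(centers, centroid_idx, radius=0.3):
    """For each sampled centroid c_j, gather the cluster
    T_j = {c_k : ||c_k - c_j||_2 < radius} of potential target centers."""
    clusters = []
    for j in centroid_idx:
        d = np.linalg.norm(centers - centers[j], axis=1)
        clusters.append(np.flatnonzero(d < radius))
    return clusters

rng = np.random.default_rng(2)
centers = rng.uniform(-1, 1, size=(128, 3))              # voted potential target centers
centroid_idx = rng.choice(128, size=64, replace=False)   # K = 64 sampled centroids
clusters = ball_query_clusters(centers, centroid_idx)
```

Each cluster would then be fed through Θ (an MLP, a Maxpool over the cluster, and a second MLP) to produce one proposal, so the Maxpool again makes the proposal invariant to the ordering of the centers inside the cluster.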

3.4. Improved target proposal with seed-wise targetness score

We consider that each seed with a target-specific feature can be directly assessed with its targetness to 1) regularize earlier feature learning and 2) strengthen the representation of the potential target center it predicts. Therefore, we can obtain target proposals with higher quality.


Seed-wise targetness score s_s. We learn an MLP to generate s_s_j for each rj. Those search area seeds located on the surface of the ground-truth target are regarded as positives, and the rest as negatives. We use a standard binary cross entropy loss L_cla for s_s. Since s_s_j tightly relates to f_t_rj, L_cla can explicitly constrain the point feature learning and the consequent target-specific feature augmentation.
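A minimal numpy sketch of this binary cross entropy, assuming the scores are already probabilities in (0, 1) (the actual MLP would emit logits and use a numerically stable logits-based formulation):

```python
import numpy as np

def bce_loss(scores, labels):
    """Standard binary cross entropy L_cla over seed-wise targetness scores."""
    eps = 1e-7
    p = np.clip(scores, eps, 1 - eps)  # guard against log(0)
    return -(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean()

labels = np.array([1.0, 1.0, 0.0, 0.0])          # on-target seeds vs. background
good = bce_loss(np.array([0.9, 0.8, 0.1, 0.2]), labels)  # confident, correct
bad = bce_loss(np.array([0.1, 0.2, 0.9, 0.8]), labels)   # confident, wrong
```

Scores that match the on-target/background labels produce a much smaller loss than inverted ones, which is the supervision signal that constrains f_t_rj.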


Improved target proposal. Inheriting more discriminative power from s_s_j, we update cj's representation by concatenating s_s_j with cj's 3D position and feature. Sequentially, we update the clusters with ball query and the target proposals with Equation (3). We consider that s_s can implicitly help pick out representative potential target centers to benefit the final target proposal.


3.5. Final target verification

With the K proposals generated from above (refer to Θ in Equation (3)), the proposal with the highest proposal-wise targetness score is verified as the final tracking result.


We follow VoteNet [28] to learn Θ. Specifically, we consider proposals whose centers are near the target center (within 0.3 meters) as positives and those faraway (by more than 0.6 meters) as negatives. Other proposals are left unpenalized. We use a standard binary cross entropy loss L_prop for s_p. As for p_t_j, only the positives' box parameters are supervised via a Huber (smooth-L1 [31]) loss L_box. We aggregate all the mentioned losses as our final loss L:

L = L_reg + γ1·L_cla + γ2·L_box + γ3·L_prop.   (4)
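The positive/negative/ignored assignment described above can be sketched as a small labeling function (an illustration with hypothetical proposal centers; only the distance thresholds come from the text):

```python
import numpy as np

def proposal_labels(proposal_centers, gt_center, pos_thr=0.3, neg_thr=0.6):
    """Training labels for proposal-wise targetness: centers within 0.3 m of
    the GT target center are positives (1), beyond 0.6 m negatives (0), and
    those in between are ignored (-1) and left unpenalized."""
    d = np.linalg.norm(proposal_centers - gt_center, axis=1)
    labels = np.full(len(d), -1)
    labels[d < pos_thr] = 1
    labels[d > neg_thr] = 0
    return labels

gt = np.zeros(3)
centers = np.array([[0.1, 0.0, 0.0], [0.45, 0.0, 0.0], [1.0, 0.0, 0.0]])
labels = proposal_labels(centers, gt)  # → [1, -1, 0]
```

The ignored band between 0.3 m and 0.6 m keeps ambiguous proposals from pushing the classifier in either direction, a common trick inherited from detection-style training.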


Here γ1 (= 0.2), γ2 (= 1.5) and γ3 (= 0.2) are used to normalize all the component losses to be of the same scale.


4. Experiments

We applied the KITTI tracking dataset [10] (with point clouds scanned using lidar) as the benchmark. We followed the settings in [11] (shortened as SC3D by us for simplicity) in data split, tracklet generation and evaluation metric for fair comparisons. Since cars in KITTI appear in the largest quantity and diversity, we mainly focused on car tracking and performed the ablation study on it, as in SC3D. We also did extensive experiments with three other target types (Pedestrian, Van, Cyclist) for better comparisons.


4.1. Experimental setting

4.1.1 Dataset


Since the ground truth for the test set in KITTI is inaccessible offline, we used its training set to train and test our P2B. This tailored dataset had 21 outdoor scenes and 8 types of targets. We generated tracklets for target instances within all videos and split the dataset as follows: scenes 0-16 for training, 17-18 for validation, and 19-20 for testing.


Point clouds' sparsity. Though each frame reports an average of 120k points, we suppose the points on the target might be quite sparse due to general occlusion and lidar's defects on distant objects. To validate our idea, we counted the number of points on KITTI's cars in Fig. 5. We can observe that about 34% of cars held fewer than 50 points. The situation may be worse on smaller-size pedestrians and cyclists. This sparsity imposes a great challenge onto point cloud based 3D object tracking.


Frames containing the same target instance, e.g., a car, are concatenated by time order to form a tracklet.


 

Figure 5. Histogram of the number of points on KITTI's cars, to exemplify the sparsity of points on targets.

4.1.2 Evaluation metric

We used One Pass Evaluation (OPE) [38] to measure Success and Precision of different methods. "Success" is defined as the IoU between the predicted box and the ground-truth (GT) box. "Precision" is defined as the AUC for errors (distance between the two boxes' centers) from 0 to 2 m.
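The Precision metric can be sketched as follows: sweep a distance threshold from 0 to 2 m, take the fraction of frames whose center error falls within each threshold, and integrate. This is a hedged illustration of the metric's shape (the exact discretization used by the OPE toolkit may differ):

```python
import numpy as np

def precision_auc(center_errors, max_err=2.0, steps=101):
    """OPE 'Precision': area under the curve of the fraction of frames whose
    center error is within a threshold, swept from 0 to max_err meters,
    normalized to a percentage."""
    errors = np.asarray(center_errors)
    thresholds = np.linspace(0.0, max_err, steps)
    fractions = np.array([(errors <= t).mean() for t in thresholds])
    return 100.0 * np.trapz(fractions, thresholds) / max_err
```

A perfect tracker (all center errors zero) scores 100, and a tracker whose errors all exceed 2 m scores 0; anything in between rewards errors that are small relative to the 2 m cap.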


4.1.3 Implementation details

Template and search area. For the template, we collected and normalized its points to N1 = 512 ones by randomly abandoning or duplicating points. For the search area, we similarly collected and normalized the points to N2 = 1024 ones. The ways to generate the template and search area differ in training and testing, as detailed below.
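The random abandon/duplicate normalization can be sketched in numpy (a minimal illustration; the real pipeline operates on lidar scans rather than random points):

```python
import numpy as np

def normalize_points(points, n_out, rng):
    """Randomly duplicate (when too few) or abandon (when too many) points
    so that exactly n_out remain, as done for the template (512) and the
    search area (1024) point clouds."""
    n = len(points)
    # sample with replacement only when we must duplicate
    idx = rng.choice(n, n_out, replace=n < n_out)
    return points[idx]

rng = np.random.default_rng(3)
sparse = rng.normal(size=(200, 3))    # fewer points than needed -> duplicate
dense = rng.normal(size=(5000, 3))    # more points than needed -> subsample
template = normalize_points(sparse, 512, rng)
search = normalize_points(dense, 1024, rng)
```

Fixing the point count this way lets the backbone run on uniformly shaped batches regardless of how sparse or dense the raw scan is.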


Network architecture. We adopted PointNet++ [30] as our backbone. We tailored it to contain three set-abstraction (SA) layers, with receptive radii of 0.3, 0.5, 0.7 meters, and three half-size down-samplings. This yielded M1 = 64 (= N1/2^3) template seeds and M2 = 128 (= N2/2^3) search area seeds. We applied random sampling, and removed the up-sampling layers in PointNet++ due to the points' sparsity. The output feature was of d1 = 256 dimensions.


Throughout our method, all used MLPs had three layers. The size of these layers was 256 (hence d2 = 256) except that of the last layer (size_last) in the following MLPs:

• For the MLP to predict s_s, size_last = 1.

• For Θ to predict s_p and p_t, size_last = 5.


Clustering. K = 64 randomly sampled potential target centers clustered the neighbors within R = 0.3 meters.


Training. 1) Data augmentation: we applied random offsets to the previous GT and fused the point clouds within the resulting box and the first GT for more template samples; we enlarged the current GT by 2 meters to include background (negative seeds), applied similar random offsets, and collected the inside point cloud for more search area samples. 2) We trained P2B from scratch with the augmented samples. We applied the Adam optimizer [17]. The learning rate was initially 0.001 and was decreased by a factor of 5 after 10 epochs. Batch size was 32. In practice, we observed that P2B converged to a satisfying result after about 40 epochs.


 

Table 2. Comprehensive comparison with SC3D. The right three columns differ in how the search area is generated.

Table 3. Extensive comparison with SC3D. The right five columns show results on different target types and their mean (Car, Pedestrian, Van, Cyclist, Mean).

The template and search area are in the form of point clouds; the GT and results are in the form of 3D boxes.

Testing. We used the trained P2B to infer 3D bounding boxes within tracklets frame by frame. For the current frame, the template initially adopted the first GT's point cloud, and afterwards the fusion of the first GT's and the previous result's point clouds. We enlarged the previous result by 2 meters in the current frame and collected the inside point cloud to obtain the search area.
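The search-area collection step can be sketched as a point-in-box test. This is a simplified, axis-aligned illustration (the paper's boxes are oriented in the X-Y plane, so a real implementation would first rotate points into the box frame); the example points and box size are hypothetical.

```python
import numpy as np

def collect_search_area(points, prev_center, prev_size, enlarge=2.0):
    """Keep the points inside the previous result's box enlarged by
    `enlarge` meters per dimension (axis-aligned simplification)."""
    half = (np.asarray(prev_size) + enlarge) / 2.0
    inside = np.all(np.abs(points - np.asarray(prev_center)) <= half, axis=1)
    return points[inside]

points = np.array([[0.0, 0.0, 0.0],   # near the previous box center
                   [1.5, 0.0, 0.0],   # inside the enlarged box
                   [10.0, 0.0, 0.0]]) # far away, excluded
area = collect_search_area(points, prev_center=[0, 0, 0],
                           prev_size=[4.0, 1.8, 1.6])  # a car-sized box
```

The enlargement margin trades off robustness and cost: a larger margin tolerates faster target motion between frames at the price of more background points fed into the network.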


4.2. Comprehensive comparisons

We only compared our P2B with SC3D [11], the first and only prior work on point cloud based 3D object tracking. We report the results for 3D car tracking in Table 2.


We generated the search area centered on the previous result, the previous GT or the current GT. Using the previous result as the search center meets the requirement of real scenarios, while using the previous GT helps approximately assess short-term tracking performance. For these two situations, SC3D applies Kalman filtering to generate proposals. Using the current GT is unreasonable, but is considered in SC3D to approximate exhaustive search and assess SC3D's discriminative power. Specifically, SC3D conducts a grid search around the target center to include the GT box in the generated proposals. However, P2B clusters potential target centers to generate proposals without explicit dependence on the GT box. That is, P2B may adapt to various scenarios, while SC3D could degrade when the GT boxes are removed, as demonstrated in Table 2. Comprehensively, P2B outperformed SC3D by a large margin. All later experiments adopted the more realistic setting of using the previous result ("Testing" in Sec. 4.1.3).


Extensive comparisons. We further compared P2B with SC3D on Pedestrian, Van, and Cyclist (Table 3). P2B outperformed SC3D by ~10% on average. P2B's advantage turned significant on the data-rich Car and Pedestrian. But P2B degraded when training data decreased, as was the case for Van and Cyclist. We conjecture that P2B may rely on more data to learn better networks, especially when regressing potential target centers. Comparatively, SC3D needs relatively less data to suffice for similarity measuring between two regions. To validate this, we used the model trained on the data-rich Car to test Van, with the belief that cars resemble vans and contain potentially transferable information. As expected, the Success/Precision result of P2B showed an improved 49.9/59.9 (original: 40.8/48.4), while SC3D reported a declined 37.2/45.9 (original: 40.4/47.0).


 

Table 4. Different ways for target-specific feature augmentation (TSFA). The ways to obtain search area features A and B are shown in Fig. 6.

Figure 6. Two ways to include search area features in target-specific feature augmentation. For A, we duplicate the features of the search area seeds and append them after the template features are duplicated along each column of the similarity map; for B, we concatenate the search area features with the features after Maxpool (Fig. 4).

4.3. Ablation study

4.3.1 Ways for target-specific feature augmentation

Besides our default setting in P2B (Sec. 3.2), there are four other possible ways for feature augmentation: removing (the duplication of) template features, removing the similarity map, and using search area feature A or B (Fig. 6).


We compared the five settings in Table 4. Here, removing the template features or the similarity map degraded performance by about 1% or 3% respectively, which validates the contributions of these two parts in our default setting. Search area features A and B did not improve, or even harmed, the performance. Note that we already combined template features in both conditions. This may reveal that search area features only capture spatial context rather than the target clue, and hence turn useless for target-specific feature augmentation. In comparison, our default setting brings richer target clues from the template seeds to yield a more "directed" proposal generation.


 

 

Table 6. Different ways for template generation. "The first and previous" denotes "the first GT and the previous result".

Figure 7. Illustration of seed-wise targetness scores and potential target centers. Green lines show the projection from seeds (colored points in the first row) to potential target centers (colored points in the second row). We mark the informative points, i.e., those with higher targetness scores, in red, and the opposite in yellow. Paired seeds and potential centers are marked in the same color to show the correlation.

Figure 8. Different numbers of proposals. Our method is compatible with a wide range of this parameter.

4.3.2 Effectiveness of seed-wise targetness

In Sec. 3.4, we obtain seed-wise targetness scores s_s and concatenate them with the potential target centers to guide the proposal and verification. Here we tested P2B without this concatenation, or even without the whole branch of s_s (Table 5). We can observe that leaving out the concatenation dropped the performance by ~1%, while removing the whole branch dropped it by ~3%. This verifies that s_s offers good supervision on learning the whole network for improved target proposal and verification.


4.3.3 Robustness with different number of proposals

We tested P2B (without re-training) and SC3D with different numbers of proposals. From the results in Fig. 8, P2B obtained satisfying results even with only 20 proposals. But SC3D degraded dramatically when using fewer than 40 proposals. To conclude, P2B turns out more robust to fewer proposals, showing that P2B can generate proposals with both higher quality and efficiency.


4.3.4 Ways for template generation

For template generation, SC3D concatenates the points in all previous results, while P2B concatenates the points within the first GT and the previous result to update the template for efficiency. Here we report results with four settings for template generation: the first GT, the previous result, the fusion of the first GT and the previous result, and all previous results. Results in Table 6 show P2B's consistent advantage over SC3D in all settings, even in "All previous results", where P2B reported a degraded result. We attribute the degradation to the facts that 1) we did not include shape completion [11] and 2) we did not train P2B with all previous results, while SC3D considered both.


4.4. Qualitative analysis

4.4.1 Advantageous cases

We first exemplify our target-specific features' discriminative power in Fig. 7. The first row visualizes seeds' targetness scores to demonstrate their possibility of belonging to the target (Car). We can observe that P2B had learnt to discriminate the target seeds from the background ones. The second row visualizes how P2B projects seeds to potential target centers. We can observe that the potential centers with more target information gathered tightly around the GT target center, which further validates our discriminative target-specific features. Besides, P2B can address occlusion because it can generate groups of informative potential target centers for the final prediction.


We then visualize P2B's advantage over SC3D in addressing point clouds' sparsity in Fig. 9. We can observe that in the sparse scenarios where SC3D tracked off course or even failed, our predicted box held tight to the target center.



 

4.4.2 Failure cases

Here we searched for tracklets where P2B failed and found that most failure cases arose when the initial template in the first frame was too sparse and hence yielded little target information. As exemplified in Fig. 10, when P2B faced such a case and tracked off course with a cluttered background, points from the initial template could not modify the current erroneous predictions and re-obtain an informative template. This failure may also reveal that P2B inherits target information from the template instead of the search area.

We believe that when fed with more points containing potentially rich target information, P2B could generate proposals with higher quality to yield better results. Our intuition is validated in Fig. 11.


4.5. Running speed

Here we averaged the running time over all test frames for Car to measure P2B's speed. P2B achieved 45.5 FPS, including 7.0 ms for processing the point cloud, 14.3 ms for network forward propagation and 0.9 ms for post-processing, on a single NVIDIA 1080Ti GPU. SC3D in its default setting ran at 1.8 FPS on the same platform.


5. Conclusions

In this work we propose a novel point-to-box (P2B) network for 3D object tracking. We focus on embedding the target information within the template into the search space and formulate an end-to-end method for joint point-driven target proposal and verification. P2B operates on sampled seeds instead of 3D boxes to reduce the search space by a large margin. Experiments justify our proposition's superiority.


 

Figure 11. The influence of the number of points on the car in the first frame. We computed the average Success for each interval (horizontal axis) on the test set.

The experiments also show that P2B needs relatively more data to achieve satisfying results. We may hence look forward to a less data-dependent P2B, while also collecting more data in this big-data era. Besides, we may explore better ways of feature augmentation in the search area, and test our method in more challenging scenarios.

Acknowledgment. This work is jointly supported by the National Natural Science Foundation of China (Grant No. U1913602, 61876211 and 61502187), the Equipment Pre-Research Field Fund of China (Grant No. 61403120405), the Open Fund of the State Key Laboratory (Grant No. 6142113180211), and the Fundamental Research Funds for the Central Universities (Grant No. 2019kfyXKJC024).
