[Paper Reading] ICLR 2022: Scene Transformer: A unified architecture for predicting future trajectories of multiple agents

Reposted from the original post (published 2022-04-05 22:08); category: paper reading


Type: ICLR
Year: 2022
Organization: Waymo

References and preface
- OpenReview
https://openreview.net/forum?id=Wm3EA5OlHsG
- pdf
Scene Transformer: A unified architecture for predicting multiple agent trajectories

1. Motivation

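Mainly inspired by language modeling approaches.

Problem setting

Task: trajectory prediction for multiple agents.

Difficulty: each agent's behavior is diverse, and agents also influence each other's trajectories.

Prior work mostly predicts each agent's future trajectory separately from its past motion and then plans on top of those per-agent predictions. Independent predictions, however, represent the interactions between agents in future states poorly, so the trajectories planned from them end up sub-optimal as well.

marginal prediction: the trajectories predicted for different agents may conflict at a future time step, i.e. they may intersect
joint prediction: at any given future time step the predicted trajectories of different agents do not conflict; they respect each other's predictions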

Contribution

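Formulate a model that predicts the behavior of all agents jointly, producing consistent futures that account for the interactions between agents.

The contributions below are quoted from the paper. The format looks a lot like the T-RO style jjh mentioned: each item has a noun phrase (the method) as its subject.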

A novel, scene-centric approach that allows us to gracefully switch training the model to produce either marginal (independent) and joint agent predictions in a single feed-forward pass.

Switching between marginal and joint prediction happens within a single feed-forward pass.
A permutation equivariant Transformer-based architecture factored over agents, time, and road graph elements that exploits the inherent symmetries of the problem.

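A Transformer-based architecture that is permutation equivariant (reordering the input agents reorders the outputs in the same way) and that takes agents, time, and the road graph into account as separate factors.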
A masked sequence modeling approach that enables us to condition on hypothetical agent futures at inference time, enabling conditional motion prediction or goal conditioned prediction.

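Masked sequence modeling lets the model take hypothetical futures into account, along the time axis.

Questions:

I did not understand the method as summarized in the abstract, not a single one of the three parts... TBD: come back and answer this after reading further.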

Through combining a scene-centric approach, agent permutation equivariant model, and a sequence masking strategy
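- The introduction brings in the scene-centric approach for scaling to large numbers of agents, but the contribution talks about switching? Hmm, does it switch when there are many agents and go joint when there are few?
The evaluation covers marginal and joint motion predictions. The latter makes sense, but what kind of prediction is marginal? Comparing each agent's prediction against ground truth individually?

The introduction explains this later on; see the explanation above.
Why switch at all? Wouldn't doing joint prediction for everything be better?

The method section clarifies that one network can serve several tasks, mainly motion prediction (MP), conditional motion prediction (CMP), and goal-conditioned prediction (GCP).
Transformer? Attention mechanism? Are the inputs represented as vectors?

The method section covers this in detail: the static road graph is represented as feature vectors, and dynamic elements such as traffic lights use one feature vector per object.
I did not understand the last contribution: the second one already says a Transformer-like mechanism takes time into account, so doesn't the motivation for masked sequence modeling overlap with it?
- Does it directly hypothesize the future of the agents? I am probably missing a lot of background here and will have to chase several layers of references.
The mask is really what enables the switching... The approach is flexible, enabling us to simultaneously train a single model for MP, CMP, GCP.
If there is a leaderboard and you are not first on it, can you still call yourself state-of-the-art? After all, this paper ranks fairly low on the Waymo online leaderboard.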

2. Method

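The related work centers on the topics below. This is only a brief summary, mostly background that may answer the questions above.

Motion prediction architectures: successful models mostly take into account agent motion history and road structure (lanes, stop lines, traffic lights, and so on).

Related approaches:
- Render the inputs directly as a multi-channel top-down (bird's-eye-view) image and apply convolutions; the limited receptive field, however, makes it hard to capture spatially distant interactions
- Entity-centric approach: agent state history is encoded with sequence modeling such as an RNN, and road elements are encoded together with their pose information and semantic type (for example as piecewise-linear segments); the information is then aggregated with pooling, soft-attention, or graph neural networks
Scene-centric versus agent-centric representation: this is about the frame in which the representation is encoded
- Scene-level coordinate frame: a rasterized top-down image represents the world state efficiently in a common coordinate frame, at the potential cost of pose invariance
- Agent-centric coordinate frame: as the number of agents grows, the number of interactions grows quadratically.
The paper then notes that LaneGCN is agent-centric but works in a global frame, and that it does not need to render the scene as an image either
Representing multi-agent futures: how to represent the future states of multiple agents; a common choice is to attach weights directly to each agent's candidate trajectories

Questions:

Isn't the representation in the second point the same thing as the related approaches in the first? Many parts of this paper feel coupled, with very similar reasoning from one method to the next; why not merge them into one?

One is about the representation; the other is about what the representation is centered on.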

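2.1 Inputs and outputs

Input

a feature for every agent at every time step

In the model this is a 3D tensor: A agents, T time steps, and D feature dimensions per agent per step. Every layer is meant to preserve this size: \([A,T,D]\)

Note that the decoder has one extra dimension: F potential futures

Output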

an output for every agent at every time step

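2.2 Architecture

The overall model is called Scene Transformer and has three stages:

Embed the agents and the road graph into a high-dimensional space
Employ an attention-based network to encode the interactions between agents and the road graph
Use an attention-based network to decode multiple futures

Mask

Switching between tasks is implemented mainly through the mask. As the paper's masking figure shows, for MP the future time steps are masked out; for CMP the autonomous vehicle's (AV's) own motion over the future time steps is provided as well; for GCP only the AV's motion at the final time step T is provided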
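The masking scheme described above can be sketched as a boolean visibility grid over agents and time steps. This is only a minimal illustration under my own assumptions: the function name `build_visibility_mask`, the `num_history` parameter, and the convention that agent index 0 is the AV are all hypothetical and not taken from the paper.

```python
def build_visibility_mask(task, num_agents, num_steps, num_history, av_index=0):
    """Return visible[a][t]: True where the model is allowed to see agent a at step t.

    Hypothetical helper: the first `num_history` steps count as observed history,
    and `av_index` marks the autonomous vehicle.
    """
    if task not in ("MP", "CMP", "GCP"):
        raise ValueError(f"unknown task: {task}")
    # History is visible for every agent in every task; the future starts hidden.
    visible = [[t < num_history for t in range(num_steps)] for _ in range(num_agents)]
    if task == "CMP":
        # Conditional motion prediction: the AV's whole future is revealed.
        for t in range(num_history, num_steps):
            visible[av_index][t] = True
    elif task == "GCP":
        # Goal-conditioned prediction: only the AV's final state is revealed.
        visible[av_index][num_steps - 1] = True
    return visible


for task in ("MP", "CMP", "GCP"):
    print(task, build_visibility_mask(task, num_agents=2, num_steps=5, num_history=2)[0])
```

The three tasks differ only in which cells are flipped to visible, which is why a single network can be trained on all of them.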

A. Scene-Centric Representation

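This part is about which frame the surrounding scene information is gathered in. As mentioned in the related work, scene-centric here means using an agent of interest's position as the origin and encoding all road graph elements and agents relative to it; agent-centric would instead repeat the computation once per agent, each time with that agent as the origin

The detailed steps are:

Generate a feature for each agent at each time step, if that time step is visible
Use a PointNet to learn one feature vector per polyline for the static road graph and the remaining elements; traffic signs are polylines of length 1
For the dynamic road graph, such as traffic lights that are static in space but change over time, generate one feature vector per object

All of the above categories carry xyz position information. They are centered and rotated around the chosen agent, and sinusoidal position embeddings are then applied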
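A rough sketch of the centering, rotation, and sinusoidal embedding just described, restricted to 2D (x, y) for brevity even though the notes mention xyz. The helper names, the heading convention (radians, counter-clockwise from the x axis), and the `max_wavelength` default are assumptions of mine; the embedding follows the standard Transformer sin/cos form, which may differ in detail from what the paper uses.

```python
import math


def to_scene_frame(points, origin_xy, origin_heading):
    """Translate and rotate (x, y) points into the frame of the agent of interest."""
    cos_h, sin_h = math.cos(-origin_heading), math.sin(-origin_heading)
    transformed = []
    for x, y in points:
        # Centre on the agent of interest, then rotate by minus its heading.
        dx, dy = x - origin_xy[0], y - origin_xy[1]
        transformed.append((dx * cos_h - dy * sin_h, dx * sin_h + dy * cos_h))
    return transformed


def sinusoidal_embedding(value, dim, max_wavelength=10000.0):
    """Embed one scalar coordinate as `dim` sin/cos features (dim must be even)."""
    half = dim // 2
    freqs = [max_wavelength ** (-i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]


# A lane point 2 m ahead of an agent at (10, 5) that faces along +y.
local = to_scene_frame([(10.0, 7.0)], origin_xy=(10.0, 5.0), origin_heading=math.pi / 2)
print(local)
print(sinusoidal_embedding(local[0][0], dim=4))
```

Because every agent and road element goes through the same transform once, the cost does not grow with the number of agents the way a per-agent frame would.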

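B. Encoding Transformer

It differs little from standard attention: query, key, and value come from learned linear layers, each applied to the input x, for example \(Q=W_qx\). See the encoder and decoder block diagrams in the paper. The decoder ends with a two-layer MLP that predicts 7 outputs. The first six are: three for the agent's 3D position at the given time step, in absolute coordinates relative to the agent of interest, and three for the uncertainty, as parameters of a Laplace distribution. The last one is the heading

For more efficient self-attention, the attention is factorized. Attention along the time axis only, with each agent handled independently, learns smooth trajectories; attention along the agent axis only, with each time step handled independently, learns the interactions. This is a kind of decoupling; as the lower part of the decoder diagram shows, the two alternate, twice

Attention to the road graph is cross-attention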
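The factorization can be illustrated with a dependency-free sketch on nested lists of shape [A][T][D]. To keep it short, the learned projections \(W_q, W_k, W_v\) are left out (queries, keys, and values are the raw features), there is a single head, and masking is ignored; all function names are my own. A real implementation would use a tensor library.

```python
import math


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w / total * v[d] for w, v in zip(weights, values))
                        for d in range(len(values[0]))])
    return outputs


def attend_over_time(x):
    """Each agent attends over its own time steps; agents do not mix."""
    return [attend(row, row, row) for row in x]


def attend_over_agents(x):
    """Each time step attends over all agents at that step; time steps do not mix."""
    num_agents, num_steps = len(x), len(x[0])
    out = [[None] * num_steps for _ in range(num_agents)]
    for t in range(num_steps):
        column = [x[a][t] for a in range(num_agents)]
        for a, vec in enumerate(attend(column, column, column)):
            out[a][t] = vec
    return out


def cross_attend_road_graph(x, road):
    """Agent features act as queries; road-graph features are keys and values."""
    return [attend(row, road, road) for row in x]
```

With A agents and T steps, the two factorized passes compare A·T² plus T·A² pairs instead of the (A·T)² pairs that full attention over the flattened sequence would need. Reordering the agents in the input simply reorders the output of `attend_over_agents`, which is the permutation equivariance mentioned in the contributions.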

C. Predicting Probabilities for each Futures

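What gets predicted is a probability score, whether it is a score for each future scenario in the joint model or a score for each trajectory in the marginal model. So a feature representation is needed that summarizes the scene and each agent.

The agent feature tensor is averaged over the agent axis and over the time axis separately, and the results are appended as an additional artificial agent and time step, so the internal representation becomes \([A+1,T+1,D]\)

This then goes through the decoder, and a two-layer MLP plus softmax yields the corresponding probabilities for each future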
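As a shape-level illustration of the extra agent and time slots, here is a small sketch on nested lists of shape [A][T][D]. The function names are hypothetical, a mean is assumed as the reduction, and filling the corner cell with the mean over time of the artificial agent is my own choice for the sketch.

```python
def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def add_summary_slots(x):
    """Map features of shape [A, T, D] to [A+1, T+1, D]."""
    num_steps = len(x[0])
    # Artificial agent: for each time step, the mean over all real agents.
    artificial_agent = [mean_vectors([row[t] for row in x]) for t in range(num_steps)]
    rows = [list(row) for row in x] + [artificial_agent]
    # Artificial time step: for each row (including the new one), the mean over time.
    return [row + [mean_vectors(row)] for row in rows]


features = [[[1.0], [3.0]],
            [[5.0], [7.0]]]          # A=2, T=2, D=1
summary = add_summary_slots(features)
print(len(summary), len(summary[0]), len(summary[0][0]))
```

The extra column gives one summary vector per agent, and the corner cell gives one for the whole scene; those are the features a scoring MLP could read.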

D. Joint and Marginal Loss Formulation

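For the joint model, the displacement loss is aggregated over all agents and time steps to build a loss tensor of shape \([F]\), and only the future closest to the ground truth is back-propagated. For marginal prediction each agent is treated separately, so the displacement loss has shape \([F,A]\); it is not aggregated across agents, and instead the minimum loss is selected for each agent and back-propagated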
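The difference between the two selection rules is easy to see on a toy example. The sketch below assumes the per-step displacements are already computed and stored as `disp[f][a][t]`; the function name is hypothetical, and summing the per-agent minima in the marginal case is my assumption about how they are combined.

```python
def joint_and_marginal_loss(disp):
    """disp[f][a][t]: displacement of future f from ground truth for agent a at step t."""
    # Sum over time first: a table of shape [F, A].
    per_future_agent = [[sum(steps) for steps in agents] for agents in disp]
    # Joint: aggregate over agents to get shape [F], then keep the single best future.
    per_future = [sum(agent_losses) for agent_losses in per_future_agent]
    joint = min(per_future)
    # Marginal: pick the best future independently for every agent.
    num_agents = len(per_future_agent[0])
    marginal = sum(min(row[a] for row in per_future_agent) for a in range(num_agents))
    return joint, marginal


# F=2 futures, A=2 agents, T=1 step: future 0 fits agent 0, future 1 fits agent 1.
disp = [[[1.0], [4.0]],
        [[3.0], [2.0]]]
print(joint_and_marginal_loss(disp))
```

The joint rule has to commit to one future index for the whole scene, so its loss can never be lower than the marginal one, which is free to mix futures across agents.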

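Questions:

The encoder and the decoder are both attention-based networks... so

The block diagrams explain how the two are designed
Is the expected motion here obtained from a planner? Is the planning deterministic? Or does it come straight from the dataset?

Probably the dataset, so the future motion can be taken directly from the dataset for this task
"an agent of interest's position" just means the position of the agent we care about, right... why phrase it in such a roundabout way.. wouldn't "select an interest agent's position" do...
- What is the selection criterion?

  A footnote covers this, and a reviewer on OpenReview asked the same thing, haha: for Waymo it is the ego vehicle, and for Argoverse it is the vehicle whose trajectory needs to be predicted
What does "all" refer to here? Everything? The road structure of the whole map? Or a box drawn around the chosen agent?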

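3. Experiments

The metrics are minADE, minFDE, miss rate, and mAP over the predicted scenes. They basically all measure how close the top k trajectories are to ground truth observation, that is, how close the predicted trajectories are to the ground truth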

L2: A simple and common distance-based metric is to measure the L2 norm between a given trajectory and the ground truth
minADE: reports the L2 norm of the trajectory with the minimal distance
minFDE: reports the L2 norm of the trajectory with the smallest distance only evaluated at the final location of the trajectory.
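A minimal sketch of the two distance metrics quoted above, for one agent with several candidate trajectories given as lists of (x, y) points. The function names are mine, and `math.dist` needs Python 3.8 or newer.

```python
import math


def ade(pred, truth):
    """Average displacement error: mean L2 distance over all time steps."""
    return sum(math.dist(p, g) for p, g in zip(pred, truth)) / len(truth)


def fde(pred, truth):
    """Final displacement error: L2 distance at the last time step only."""
    return math.dist(pred[-1], truth[-1])


def min_ade_fde(candidates, truth):
    """Best ADE and best FDE over the k candidate trajectories."""
    return (min(ade(c, truth) for c in candidates),
            min(fde(c, truth) for c in candidates))


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constant 1 m lateral offset
    [(0.0, 0.0), (1.0, 0.0), (2.0, 2.0)],  # perfect until the final step
]
print(min_ade_fde(candidates, truth))
```

In this example the best ADE and the best FDE come from different candidates, since each minimum is taken independently.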

This paper also reports MR and mAP; for joint futures it uses the scene-level counterparts minSADE, minSFDE, and SMR

miss rate (MR) and mean average precision (mAP) to capture how well a model predicts all of the future trajectories of agents probabilistically
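For the scene-level variants, the key change is that one future index is chosen for the whole scene rather than per agent. Below is a small sketch of minSADE under that reading; the function name and data layout (`scene_futures[f][a]` is a list of (x, y) points) are assumptions for illustration.

```python
import math


def min_sade(scene_futures, truth):
    """Scene-level minADE: a single future index is shared by all agents."""
    def scene_ade(future):
        # Average displacement over every agent and every time step of this future.
        errors = [math.dist(p, g)
                  for traj, gt in zip(future, truth)
                  for p, g in zip(traj, gt)]
        return sum(errors) / len(errors)
    return min(scene_ade(future) for future in scene_futures)


truth = [[(0.0, 0.0)], [(0.0, 0.0)]]       # 2 agents, 1 time step
scene_futures = [
    [[(1.0, 0.0)], [(3.0, 0.0)]],          # future 0: good for agent 0 only
    [[(2.0, 0.0)], [(1.0, 0.0)]],          # future 1: good for agent 1 only
]
print(min_sade(scene_futures, truth))
```

Picking the best future per agent would give an average error of 1.0 here, while the scene-level metric is 1.5, so joint metrics are at least as strict as their marginal counterparts.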

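The rest is mostly the experiment tables from the paper

Scenario analysis figure:

Specifying a different goal point changes the prediction accordingly, which matches the GCP task switch described earlier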

4. Conclusion

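Ramblings

As CJ put it: Waymo will certainly not open-source this. Still, the appendix of every one of their papers is so detailed that a newbie like me thinks, wow, hmm, maybe this can be reproduced. For this one, though, several points still seem doubtful to me, possibly because I did not read the appendix closely, hahaha. So the main thing is to look at how their architecture is put together. The three Waymo papers all use networks of their own design rather than a ResNet or RegNet with pretrained weights. If you want more detail, read the appendix of the paper, which describes the network parameters and so on quite thoroughly

This paper is not as stunning as MP3, but it seems to establish that prediction should be done with vector representations, similar to what CJ noted in his MultiPath++ notes: VectorNet looks set to become the unifying approach. PointNet and the like were in fact proposed back in 2017; from PointNet → VectorNet → the current line of work, it is basically all variations on attention

The open review is worth a look; open reviewing really is fun. One reviewer raised questions about how the GCP results were presented, suggesting roughly that the authors run it in CARLA, since prediction conditioned on a goal point is already very close to planning and basically just needs a controller added. The authors replied: thanks for the reminder, we know (inner voice: but we are not doing it, hahaha)

I am also attaching what I mentioned earlier: the ranking on the online leaderboard is indeed not high, though going by submission time is a different story

A like costs nothing and is much appreciated 😆; positive feedback keeps these open notes going, haha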
