2. Human-Centric Visual Understanding (Cewu Lu, SJTU)
2.1 Video-Based Temporal Modeling and Action Recognition Methods (Limin Wang, NJU)
-
dataset
- two figures:
- note the distinction between trimmed and untrimmed videos
-
outline
- action recognition
- action temporal localization
- action spatial detection
- action spatial-temporal detection
-
opportunities and challenges
- opportunities:
- videos provide huge and rich data for visual learning
- action is core in motion perception and has many applications in video understanding
- challenges:
- complex dynamics and temporal variations
- action vocabulary is not well defined
- noisy and weak labels (dense labeling is expensive)
- High computational and memory cost
-
temporal structure: actions need to be decomposed (temporal decomposition)
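As a minimal sketch of segment-based temporal decomposition (in the spirit of TSN, which appears later in these notes): split the video into equal segments, sample one snippet per segment, and average the snippet scores as the video-level prediction. Function names and the toy data are mine, not from the talk.

```python
import numpy as np

def tsn_segment_consensus(frame_scores, num_segments=3, rng=None):
    """Segment-based temporal decomposition sketch: split a video into
    num_segments equal segments, sample one snippet score from each,
    and average them as the video-level prediction."""
    rng = np.random.default_rng(0) if rng is None else rng
    segments = np.array_split(np.arange(len(frame_scores)), num_segments)
    sampled = [frame_scores[rng.choice(seg)] for seg in segments]
    return np.mean(sampled, axis=0)

# toy example: 12 frames, 4 classes
scores = np.random.default_rng(1).random((12, 4))
video_pred = tsn_segment_consensus(scores)  # one score per class
```

The averaging ("consensus") step is what lets each sampled snippet contribute a gradient to the whole video's prediction.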
-
commonly used deep networks
- Large-scale video classification with CNNs (Fei-Fei Li's group, CVPR 2014)
- Two-Stream CNN for action recognition in videos (NIPS 2014)
- Learning spatiotemporal features with 3D CNNs (Du Tran, ICCV 2015)
- TDD (Limin Wang, CVPR 2015)
- Real-time action recognition with enhanced motion vector CNNs (CVPR 2016)
- Two-Stream I3D (CVPR 2017)
- R(2+1)D (CVPR 2018)
- SlowFast Networks (Kaiming He, CVPR 2019)
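A small arithmetic sketch of the factorization idea behind R(2+1)D from the list above: a full 3D convolution is replaced by a 2D spatial convolution followed by a 1D temporal one, with the intermediate width M chosen so the parameter counts roughly match. The numbers below are illustrative, not from the paper.

```python
# Parameter-count comparison motivating R(2+1)D (illustrative sketch).
def conv3d_params(n_in, n_out, t, d):
    """Full 3D conv with a t x d x d kernel."""
    return n_in * n_out * t * d * d

def r2plus1d_params(n_in, n_out, t, d, m):
    """Factorized version: 2D spatial conv (1 x d x d) into m channels,
    then 1D temporal conv (t x 1 x 1) into n_out channels."""
    return n_in * m * d * d + m * n_out * t

n_in, n_out, t, d = 64, 64, 3, 3
full = conv3d_params(n_in, n_out, t, d)            # 64*64*3*3*3 = 110592
# choose m so the factorized block matches the full 3D conv in parameters
m = (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)  # = 144
factored = r2plus1d_params(n_in, n_out, t, d, m)   # also 110592
```

At equal parameter budget, the factorized block doubles the number of nonlinearities, which is the paper's stated benefit.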
-
three of Limin Wang's own works
- short-term -> mid-term -> long-term modeling; the corresponding papers are ARTNet -> TSN -> UntrimmedNet
- for more detail, go through his slides directly and write up notes
- by Limin Wang's own account, video action recognition/detection is of little help to my VAD work
-
some good Zhihu links seen earlier: action recognition 1, action recognition 2, temporal action detection 1, temporal action detection 2, temporal action detection 3, temporal action detection 4
-
all slide images
2.2 Deep and Efficient Analysis and Understanding Methods for Complex Videos (Yu Qiao, CAS)
-
introduction of some empirical DL tricks
-
the open-set nature of face recognition (open-set recognition is somewhat similar to novelty detection; see TODO)
-
Center Loss (ECCV 2016)
idea: maintain a center for each class and minimize the distance between every mini-batch sample and its class center, which shrinks intra-class distances.
it builds on the softmax loss: a class center is maintained in feature space for each training class, and during training a constraint on the distance between a sample's mapped feature and its class center is added, balancing intra-class compactness and inter-class separation.
-
an improvement on Center Loss (IJCV 2019): replace the class center with a projection direction
-
losses designed with the large-margin idea:
- L-softmax (Liu, ICML 2016)
- A-softmax (Liu, CVPR2017)
- Additive Margin Softmax (ICLR 2018 workshop)
- CosLoss (wang, CVPR2018)
- ArcFace (CVPR2019)
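The shared mechanism of the losses listed above can be sketched with the additive-margin (CosFace/AM-Softmax style) variant, which is the simplest: compute cosine logits, subtract a margin m from the target class only, and scale by s. The scale, margin, and toy data are illustrative defaults, not the papers' exact settings.

```python
import numpy as np

def am_softmax_loss(feature, weight, label, s=30.0, m=0.35):
    """Additive-margin softmax sketch: cosine logits, margin m subtracted
    from the target class, scaled by s before the softmax."""
    f = feature / np.linalg.norm(feature)
    w = weight / np.linalg.norm(weight, axis=1, keepdims=True)
    cos = w @ f                        # cosine similarity to each class vector
    logits = s * cos
    logits[label] = s * (cos[label] - m)
    logits = logits - logits.max()     # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(-np.log(probs[label]))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))            # 5 classes, 8-dim features
x = rng.normal(size=8)
loss_margin = am_softmax_loss(x, W, label=2)
loss_plain = am_softmax_loss(x, W, label=2, m=0.0)   # plain normalized softmax
```

L-Softmax/A-Softmax use a multiplicative angular margin and ArcFace an additive angular margin instead, but all penalize the target logit to enforce inter-class separation.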
-
Range Loss: effectively handles the long-tail problem caused by imbalanced sample counts across classes
- motivation: a few people (celebrities) have many images while most people have only a few. This long-tail distribution motivates two things: (1) a theoretical analysis of how the long tail affects model performance; (2) a new loss designed to address it
- there was a figure here, and the Range Loss slides are missing
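Since the Range Loss slides are missing, here is a hedged sketch of the loss as I understand it (simplified; the exact formulation should be checked against the paper): an intra-class term using the harmonic mean of the k largest within-class distances, plus a hinge on the smallest distance between class centers.

```python
import numpy as np

def range_loss_intra(features, k=2):
    """Intra-class term (sketch): harmonic mean of the k largest pairwise
    distances within one class, penalizing the widest 'range' of the class."""
    n = len(features)
    dists = [np.linalg.norm(features[i] - features[j])
             for i in range(n) for j in range(i + 1, n)]
    top_k = sorted(dists, reverse=True)[:k]
    return k / sum(1.0 / d for d in top_k)

def range_loss_inter(centers, margin=4.0):
    """Inter-class term (sketch): hinge on the smallest distance between
    class centers, pushing the two closest classes apart."""
    n = len(centers)
    d_min = min(np.linalg.norm(centers[i] - centers[j])
                for i in range(n) for j in range(i + 1, n))
    return max(0.0, margin - d_min)

feats = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])   # one toy class
intra = range_loss_intra(feats, k=2)
centers = np.array([[0.0, 0.0], [1.0, 0.0]])             # two toy centers
inter = range_loss_inter(centers, margin=4.0)
```

Because both terms depend on extreme distances rather than sample counts, they behave the same for head and tail classes, which is the point for long-tailed data.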
-
video action recognition
- pose-attention mechanism: RPAN (ICCV 2017, Oral)
- combines the two tasks of action recognition and pose estimation
- uses pose changes to guide an RNN in modeling the dynamics of an action
-
one paper
- Temporal Hallucinating for Action Recognition with Few Still Images (CVPR 2018)
-
some figures
2.3 Understanding Emotions in Videos (Yanwei Fu, FDU)
- personal take: a freshly opened research direction; interesting and worth a look
-
applications
- web video search
- video recommendation system
- avoid inappropriate advertisement
-
Tasks of Emotions in videos
- Emotion recognition
- emotion attribution
- emotion-oriented summarization
-
Challenges
- Sparsely expressed in videos
- Diverse content and variable quality
-
Knowledge Transfer
- Zero-shot emotion learning (with a figure)
- A multi-task neural approach for emotion attribution, classification and summarization (TMM)
- Frame-Transformer emotion classification network (ICMR 2017)
-
Emotion-oriented summarization
- essentially key-frame selection plus fusion of frame information
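The "key-frame selection plus fusion" idea above can be sketched as: score each frame for emotion, keep the top-k frames, and fuse their features by score-weighted averaging. The scoring source and function names are my own simplification, not the paper's pipeline.

```python
import numpy as np

def emotion_summary(frame_feats, emotion_scores, k=2):
    """Summarization sketch: pick the k frames with the highest emotion
    scores (key-frame selection), then fuse them into one summary feature
    by score-weighted averaging (frame information fusion)."""
    idx = np.argsort(emotion_scores)[-k:][::-1]        # top-k key frames
    w = emotion_scores[idx] / emotion_scores[idx].sum()
    summary_feat = (w[:, None] * frame_feats[idx]).sum(axis=0)
    return idx, summary_feat

# toy video: 4 frames with 2-d features and per-frame emotion scores
feats = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]])
scores = np.array([0.1, 0.9, 0.3, 0.2])
keys, fused = emotion_summary(feats, scores, k=2)
```

This also makes the "sparsely expressed" challenge concrete: most frames carry near-zero emotion score and contribute nothing to the summary.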
-
Face emotion
- Posture, Expression, Identity in faces
-
some figures:
2.4 Exploring Structured Deep Learning Methods for Human-Centric Visual Recognition and Localization (Wanli Ouyang, University of Sydney)
-
outline
- introduction
- structured feature learning
- backbone model design
- conclusion
-
introduction
- object detection
- human pose estimation
- action recognition
-
structured feature learning
- structure in neurons
- motivation: in conventional networks, neurons within a layer are not connected to each other, while adjacent layers have local or full connections, so information within local regions is not preserved; this motivates giving the neurons of each layer structured information. Taking human pose estimation as the example, fully connected approaches have two problems: modeling the distances between body joints requires large convolution kernels, and the relationships between some joints are unstable. This leads to a structured feature learning model for pose estimation (Bidirectional Tree).
- Bidirectional Tree
- 對應的papers
- end-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation (CVPR2016)
- Structured feature learning for pose estimation (CVPR 2016)
- CRF-CNN: modeling structured information in human pose estimation (NIPS 2016)
- learning deep structured multi-scale features using attention-gated CRFs for contour prediction (NIPS 2017)
- application of structured feature learning
- there is a figure here
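As a toy sketch of the bidirectional-tree idea above: per-joint features exchange messages down (root to leaves) and back up a kinematic tree, so each joint's feature is refined by its neighbors. The tree, the scalar gain, and the additive update are simplified stand-ins, not the paper's exact model.

```python
import numpy as np

# hypothetical mini kinematic tree: head -> neck -> shoulders
TREE = {"head": ["neck"], "neck": ["l_shoulder", "r_shoulder"],
        "l_shoulder": [], "r_shoulder": []}

def pass_messages(feats, tree, root="head", gain=0.5):
    """Bidirectional message passing sketch: propagate features from the
    root toward the leaves, then back from the leaves toward the root."""
    out = {k: v.copy() for k, v in feats.items()}
    def down(node):                      # root -> leaves
        for child in tree[node]:
            out[child] += gain * out[node]
            down(child)
    down(root)
    def up(node):                        # leaves -> root
        for child in tree[node]:
            up(child)
            out[node] += gain * out[child]
    up(root)
    return out

feats = {joint: np.ones(2) for joint in TREE}   # toy per-joint features
refined = pass_messages(feats, TREE)
```

After both passes, every joint's feature contains information from every other joint along the tree, which is what a fully connected layer would need large kernels to approximate.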
-
backbone model design
- Hourglass for classification (encoder-decoder structures such as U-Net are typically used for image segmentation, not classification)
- ideal: features with both high-level semantics and high resolution
- reality: features with high-level semantics come at low resolution
- why hourglass-for-classification performs poorly: different tasks require features at different resolutions, which motivates FishNet
- FishNet
- motivation: to jointly exploit the strengths of pixel-level, region-level, and image-level tasks, Wanli Ouyang's group proposed FishNet. Its advantages: gradients propagate better to shallow layers, and the extracted features contain rich low-level and high-level semantics that are preserved and refined across stages.
- pros.
- better gradient flow to shallow layers
- features:
- contain rich low-level and high-level semantics
- are preserved and refined from each other (stages exchange information)
- code: https://github.com/kevin-ssy/FishNet
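A toy numeric illustration of the "better gradient flow to shallow layers" claim above: through a plain stack of layers the gradient is a product of per-layer factors and can vanish, whereas identity (residual-style) connections turn each factor into (1 + w), so shallow layers still receive signal. The numbers are illustrative and are not from FishNet itself.

```python
import numpy as np

def plain_gradient(ws):
    """Gradient magnitude reaching the shallowest layer through a plain
    stack: the product of per-layer factors."""
    return float(np.prod(ws))

def residual_gradient(ws):
    """Same, but with an identity path per layer: each factor becomes
    (1 + w), so the product never collapses toward zero for small w."""
    return float(np.prod([1.0 + w for w in ws]))

ws = [0.5] * 10                     # ten layers, each attenuating by 0.5
g_plain = plain_gradient(ws)        # 0.5**10, vanishing
g_res = residual_gradient(ws)       # 1.5**10, preserved
```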
-
conclusion
- structured deep learning is (1) effective and (2) derived from observation
- end-to-end joint training bridges the gap between structure modeling and feature learning
-
some figures
2.5 Action Recognition and Understanding for Surveillance Video (Weiyao Lin, SJTU)
- tasks in the action recognition field
- trajectory-based behavior analysis
- action recognition for arbitrary videos (Limin Wang)
- action recognition for surveillance video
- several points on object detection
- Tiny DSOD (BMVC 2018)
- Toward accurate one-stage object detection with AP-Loss (CVPR 2019)
- kill two birds with one stone: boosting both object detection accuracy and speed with adaptive patch-of-interest composition (2017)
- several applications
- 3D object detection and pose estimation
- multi-object tracking
- real-time scene statistics based on detection and tracking
- multi-camera tracking
- (correspondence structure Re-ID)
- learning correspondence structure for person re-identification (TIP2017)
- Person re-identification with correspondence structure learning (ICCV 2015)
- (Group Re-ID)
- Group re-identification: leveraging and integrating multi-grain information, (MM2018)
- vehicle-mounted cross-camera localization
- unmanned stores
- wild Siberian tiger Re-ID
- action recognition
- multi-scale features
- Action recognition with coarse-to-fine deep feature integration and asynchronous fusion (AAAI 2018)
- spatio-temporal asynchronous correlation
- Cross-stream selective networks for action recognition (CVPR workshop 2019)
- spatio-temporal action localization
- Finding action tubes with an integrated sparse-to-dense framework (arXiv 2019)
- surveillance action recognition
- there is a figure here
- other applications
- real-time action/event detection
- trajectory-based behavior analysis: individual behavior
- A tube-and-droplet-based approach for representing and analyzing motion trajectories (TPAMI 2017); not deep learning, not to my taste
- trajectory-based behavior analysis: trajectory clustering and mining
- Unsupervised trajectory clustering via adaptive multi-kernel-based shrinkage (ICCV 2015); fairly old, but it can serve as a base; look up recent high-quality papers that cite it
- crowd scene behavior analysis
- A diffusion and clustering-based approach for finding coherent motions and understanding crowd scenes (TIP 2016)
- Finding coherent motions and semantic regions in crowd scenes: a diffusion and clustering approach (ECCV 2014)
- homepage: link-1