Feature Distillation With Guided Adversarial Contrastive Learning

Reposted from the original post, 2020-10-05 23:29. Tags: CNN / defense / contrastive / adversarial


Bai T., Chen J., Zhao J., Wen B., Jiang X., Kot A. Feature Distillation With Guided Adversarial Contrastive Learning. arXiv preprint arXiv:2009.09922, 2020.

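Overview

The paper fixes a robust teacher network and has the student network learn the teacher's robust features. Compared with ordinary distillation methods, it adds a reweighting mechanism, and its loss is not the usual cross-entropy but the recently popular contrastive loss.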

Main content


The idea is to use a robust teacher network \(f^t\) to help train a student network \(f^s\). Given an input \((x, y)\), the two networks produce the features

\[t^+:= f^t(x), s^+:=f^s(x), \]

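so \((t^+, s^+)\) forms a positive pair, and naturally we want the student feature \(s^+\) to approach \(t^+\). To build negative pairs, we additionally sample \(\{x_1^-, x_2^-, \ldots, x_k^- \}\) and obtain the negative pairs \((t^+,s_i^-)\), where \(s_i^-=f^s(x_i^-)\). The full set of pairs is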

\[\mathcal{S}_{pair} := \{(t^+, s^+), (t^+, s_1^-), \ldots, (t^+, s_k^-)\}. \]
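As a rough illustration of how \(\mathcal{S}_{pair}\) is assembled for a single anchor, here is a minimal sketch. The helper `build_pairs` and the toy stand-ins `f_t`, `f_s` are hypothetical names introduced only for this example; real networks would replace them.

```python
import random

def build_pairs(x, negatives, f_t, f_s):
    """Return [(t+, s+), (t+, s_1^-), ..., (t+, s_k^-)] for one anchor x."""
    t_pos = f_t(x)            # teacher feature of the anchor
    s_pos = f_s(x)            # student feature of the same input
    pairs = [(t_pos, s_pos)]  # the single positive pair
    for x_neg in negatives:   # every negative pair reuses the same teacher feature
        pairs.append((t_pos, f_s(x_neg)))
    return pairs

# Toy stand-ins for the two networks (assumptions, for illustration only).
f_t = lambda x: [2.0 * v for v in x]
f_s = lambda x: [v + 1.0 for v in x]

random.seed(0)
x = [0.5, -1.0]
negatives = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(4)]  # k = 4
pairs = build_pairs(x, negatives, f_t, f_s)
```

The result has \(k+1\) entries, and the teacher feature is identical in all of them; only the student side changes.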

Following the negative-sampling loss, we maximize

\[J(\theta):= \mathbb{E}_{(t,s)\sim p(t,s)} \log P(1|t,s;\theta) + \mathbb{E}_{(t,s)\sim q(t,s)} \log P(0|t,s;\theta). \]

This has to be specialized to the present problem: since the priors are \(P(C=1)=\frac{1}{k+1}\) and \(P(C=0)=\frac{k}{k+1}\), the objective becomes

\[J(\theta):= \mathbb{E}_{(t,s)\sim p(t,s)} \log P(1|t,s;\theta) + k\cdot \mathbb{E}_{(t,s)\sim q(t,s)} \log P(0|t,s;\theta). \]

Here \(q(t,s)\) is a distribution different from \(p(t,s)\); the paper takes it to be the product of marginals \(p(t)p(s)\).

The authors further interpret the first term:

\[\begin{array}{ll} P(1|t,s;\theta) &= \frac{P(t,s)P(C=1)}{P(t,s)P(C=1) + P(t)P(s)P(C=0)} \\ &= \frac{P(t,s)}{P(t,s) + k\cdot P(t)P(s)} \\ &\le \frac{P(t,s)}{k\cdot P(t)P(s)}, \end{array} \]

Taking logarithms and the expectation over \(p(t,s)\) gives

\[\mathbb{E}_{(t,s)\sim p(t,s)} \log P(1|t,s;\theta) + \log k\le I(t,s). \]

The second term of \(J(\theta)\) is non-positive (it is an expectation of a log-probability) and \(\log k \ge 0\), so

\[J(\theta) \le I(t,s), \]

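so maximizing \(J(\theta)\) pushes up a lower bound on the mutual information between \(t\) and \(s\), and in that sense maximizes it to some extent.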
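The bound can be checked numerically on a small discrete example. The sketch below assumes that \(P(1|t,s)\) is the exact posterior under the priors \(\frac{1}{k+1}\) and \(\frac{k}{k+1}\) (a learned \(\theta\) only approximates it); the \(2\times 2\) joint table and the function name `bound_terms` are made up for illustration.

```python
import math

def bound_terms(joint, k):
    """Return (E_p[log P(1|t,s)] + log k, I(t,s)) for a discrete joint p(t,s)."""
    p_t = [sum(row) for row in joint]          # marginal over t (rows)
    p_s = [sum(col) for col in zip(*joint)]    # marginal over s (columns)
    lhs, mi = math.log(k), 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p == 0.0:
                continue
            prod = p_t[i] * p_s[j]
            posterior = p / (p + k * prod)     # exact P(C=1 | t, s)
            lhs += p * math.log(posterior)
            mi += p * math.log(p / prod)
    return lhs, mi

# Toy joint distribution with positively correlated t and s.
joint = [[0.4, 0.1],
         [0.1, 0.4]]

results = {k: bound_terms(joint, k) for k in (1, 16, 16384)}
gaps = {k: mi - lhs for k, (lhs, mi) in results.items()}
```

In this toy case the left-hand side stays below \(I(t,s)\) for every \(k\), and the gap shrinks as \(k\) grows, which is one way to read the large choice of \(k\).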

reweight

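A teacher network is usually expected to have high accuracy (on the clean dataset), but teachers obtained through adversarial training often do not. The authors therefore weight each feature \(t\) by a weight \(w\) based on the teacher's confidence, and the final loss is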

\[\mathcal{L}(\theta) := \mathbb{E}_{(t,s)\sim p(t,s)} w_t \log P(1|t,s;\theta) + k\cdot \mathbb{E}_{(t,s)\sim p(t)p(s)} w_t \log P(0|t,s;\theta), \]

where

\[w_t \leftarrow p_{ypred=y}(f^t,t^+) \in [0, 1]. \]

That is, \(w_t\) is the probability with which the teacher network predicts the true class \(y\) for \(t^+\).
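A minimal sketch of the reweighting, assuming the teacher's probability comes from a softmax over its logits (the note only says it is the predicted probability of the true class). The names `teacher_weight` and `reweighted_objective` are hypothetical, and the \(k\)-weighted expectation is estimated by a plain sum over \(k\) negatives.

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_weight(teacher_logits, y):
    """w_t: probability the teacher assigns to the true class y (softmax is an assumption)."""
    return softmax(teacher_logits)[y]

def reweighted_objective(w_t, log_p_pos, log_p_negs):
    """Per-anchor estimate: w_t * (log P(1|t,s+) + sum_i log P(0|t,s_i^-))."""
    return w_t * (log_p_pos + sum(log_p_negs))

w_confident = teacher_weight([2.0, 0.0, 0.0], y=0)   # teacher favours the true class
w_wrong = teacher_weight([2.0, 0.0, 0.0], y=1)       # teacher favours another class
value = reweighted_objective(0.5, -1.0, [-0.5, -0.5])
```

Anchors on which the teacher is unsure or wrong get a small \(w_t\), so they contribute less to the objective.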

Fitting the probability \(P(1|t,s;\theta)\)

In negative sampling this probability is usually modeled directly with logistic regression; the paper instead uses

\[P(1|t,s;\theta) = h(t,s) = \frac{e^{t^Ts/\tau}}{e^{t^Ts/\tau}+\frac{k}{M}}, \]

where \(M\) is the number of samples in the dataset.
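To make \(h(t,s)\) concrete, here is a small sketch that evaluates it and plugs it into a per-anchor estimate of the objective, \(\log h(t,s^+) + \sum_i \log(1-h(t,s_i^-))\). The function names and the toy unit-norm features are assumptions for illustration, and the value \(M=100\) is arbitrary.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def critic(t, s, tau, k, M):
    """h(t, s) = exp(t.s / tau) / (exp(t.s / tau) + k / M)."""
    e = math.exp(dot(t, s) / tau)
    return e / (e + k / M)

def contrastive_objective(t_pos, s_pos, s_negs, tau, M):
    """Per-anchor estimate: log h(t, s+) + sum_i log(1 - h(t, s_i^-))."""
    k = len(s_negs)
    j = math.log(critic(t_pos, s_pos, tau, k, M))
    for s_neg in s_negs:
        j += math.log(1.0 - critic(t_pos, s_neg, tau, k, M))
    return j

t = [1.0, 0.0]
s_negs = [[0.0, 1.0], [-1.0, 0.0]]
j_aligned = contrastive_objective(t, [1.0, 0.0], s_negs, tau=0.1, M=100)
j_misaligned = contrastive_objective(t, [0.0, 1.0], s_negs, tau=0.1, M=100)
```

A student feature aligned with the teacher's gives the larger objective value, which is the behaviour the loss is meant to reward.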
Would something like

\[\frac{e^{t^Ts/\tau}}{e^{t^Ts/\tau}+\gamma \cdot \frac{k}{M^2}}, \]

with \(\gamma\) trained as an additional parameter, be closer to NCE?

Experimental details

The paper contains the following passage:

we sample negatives from different classes rather than different instances, when picking up a positive sample from the same class.

That is, in the actual experiments \(t^+,s^+\) come from the same class, while \(t^+, s^-\) come from different classes.
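One way to read the class-based negative sampling is sketched below: negatives are drawn only from samples whose label differs from the anchor's. Sampling with replacement and the helper name `sample_negative_indices` are assumptions of this sketch, not details taken from the paper.

```python
import random

def sample_negative_indices(labels, anchor_index, k, rng):
    """Draw k dataset indices whose label differs from the anchor's (with replacement)."""
    anchor_label = labels[anchor_index]
    candidates = [i for i, y in enumerate(labels) if y != anchor_label]
    if not candidates:
        raise ValueError("no sample from a different class is available")
    return [rng.choice(candidates) for _ in range(k)]

labels = [0, 0, 1, 2, 1, 0]
rng = random.Random(0)
neg_idx = sample_negative_indices(labels, anchor_index=0, k=4, rng=rng)
```

With instance-level sampling, index 1 or 5 (same class as the anchor) could end up as a negative; the class-level rule excludes them.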

In our view, adversarial examples are like hard examples supporting the decision boundaries. Without hard examples, the distilled models would certainly make mistakes. Thus, we adopt a self-supervised way to generate adversarial examples using Projected Gradient Descent (PGD).

Does that mean \(t\) and \(s\) both come from adversarial examples?
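For reference, the basic \(L_\infty\) PGD update can be sketched on a toy linear classifier with a logistic loss. This is plain label-based PGD without a random start, not the self-supervised variant the quote refers to; the model, loss, and all names here are assumptions chosen so the example runs on its own.

```python
import math

def loss_and_grad(w, x, y):
    """Logistic loss log(1 + exp(-y * w.x)) and its gradient with respect to x (y in {-1, +1})."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = math.log1p(math.exp(-margin))
    coeff = -y / (1.0 + math.exp(margin))        # derivative of the loss w.r.t. w.x
    return loss, [coeff * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(w, x, y, eps, alpha, steps):
    """Ascend the loss with signed-gradient steps, projecting back into the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        _, g = loss_and_grad(w, x_adv, y)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

w, x, y = [1.0, -2.0], [0.5, 0.5], 1
x_adv = pgd_linf(w, x, y, eps=0.1, alpha=0.05, steps=5)
clean_loss, _ = loss_and_grad(w, x, y)
adv_loss, _ = loss_and_grad(w, x_adv, y)
```

The perturbed input stays within `eps` of the original in every coordinate while the loss goes up, which is what makes such points "hard examples" near the decision boundary.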

Hyperparameters: \(k=16384\), \(\tau=0.1\).

Questions

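The sampling in the algorithm is described per sample, but I assume actual training is still done in batches, since anything else would be too slow. With batches, though, how is the sampling done?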
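One possible answer, which is only a guess and not confirmed by the paper, is a feature memory bank: keep one stored student feature per dataset index, look up \(k\) negatives per anchor by index, and refresh the entries of the current batch after each step. The class `FeatureBank`, its methods, and the momentum update are all hypothetical.

```python
import random

class FeatureBank:
    """Hypothetical memory bank holding one student feature per dataset index."""

    def __init__(self, num_samples, dim, rng):
        self.features = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                         for _ in range(num_samples)]

    def negatives_for_batch(self, batch_indices, k, rng):
        """For each anchor in the batch, return k stored features from other indices."""
        n = len(self.features)
        out = []
        for anchor in batch_indices:
            picks = []
            while len(picks) < k:
                j = rng.randrange(n)
                if j != anchor:                  # never use the anchor as its own negative
                    picks.append(j)
            out.append([self.features[j] for j in picks])
        return out

    def update(self, batch_indices, new_features, momentum=0.5):
        """Blend freshly computed features into the stored ones."""
        for i, f in zip(batch_indices, new_features):
            old = self.features[i]
            self.features[i] = [momentum * o + (1.0 - momentum) * v
                                for o, v in zip(old, f)]

rng = random.Random(0)
bank = FeatureBank(num_samples=10, dim=3, rng=rng)
negs = bank.negatives_for_batch([0, 1], k=4, rng=rng)
bank.update([0], [[1.0, 1.0, 1.0]], momentum=0.0)
```

With such a bank, a large \(k\) costs only index lookups rather than \(k\) extra forward passes per anchor, at the price of slightly stale negative features.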
