Adversarial Examples Are Not Bugs, They Are Features

Reposted article. 2020-05-15 16:07. Tags: CNN / robust / adversarial

Overview
Main content

Ilyas A, Santurkar S, Tsipras D, et al. Adversarial Examples Are Not Bugs, They Are Features[C]. Advances in Neural Information Processing Systems, 2019: 125-136.

@article{ilyas2019adversarial,
title={Adversarial Examples Are Not Bugs, They Are Features},
author={Ilyas, Andrew and Santurkar, Shibani and Tsipras, Dimitris and Engstrom, Logan and Tran, Brandon and Madry, Aleksander},
pages={125--136},
year={2019}}

Overview

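The authors argue that standard training produces non-robust models because it learns both robust and non-robust features. They test this hypothesis by splitting a dataset into a robust version and a non-robust version, and they work through a Gaussian distribution as a concrete special case.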

Main content

The paper starts from the binary classification setting.

Notation and definitions

\((x,y) \in \mathcal{X} \times \{\pm 1\}\): sample and label;
\(C:\mathcal{X} \rightarrow \{\pm 1\}\): classifier;
\(f:\mathcal{X} \rightarrow \mathbb{R}\): feature;
\(\mathcal{F}=\{f\}\): set of features;

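Note: features are assumed to be normalized, \(\mathbb{E}_{(x,y) \sim \mathcal{D}}[f(x)]=0\) and \(\mathbb{E}_{(x,y) \sim \mathcal{D}}[f(x)^2]=1\).
Note: in deep learning, \(C\) can be viewed as a linear classifier over the set of features \(F_C \subseteq \mathcal{F}\) that it uses, with weights \(w_f\) and bias \(b\):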

\[C(x) = \mathrm{sgn} \big( b+ \sum_{f \in F_C} w_f \cdot f(x) \big ). \]

\(\rho\)-useful features

A feature \(f\) is \(\rho\)-useful if it satisfies

\[\tag{1} \mathbb{E}_{(x,y) \sim \mathcal{D}}[y \cdot f(x)] \ge \rho >0, \]

and \(\rho_{\mathcal{D}}(f)\) denotes the largest such \(\rho\).
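
To make definition (1) concrete, here is a minimal sketch that estimates \(\mathbb{E}[y \cdot f(x)]\) from samples. The toy distribution, the two coordinate features, and the helper name `usefulness` are illustrative assumptions, not part of the paper.

```python
import random

def usefulness(feature, samples):
    """Empirical estimate of E[y * f(x)] over (x, y) pairs, as in (1)."""
    return sum(y * feature(x) for x, y in samples) / len(samples)

# Hypothetical toy data: x = (x0, x1), y in {-1, +1}.
# x0 is correlated with y (mean 0, second moment 0.64 + 0.36 = 1);
# x1 is standard Gaussian noise, independent of y.
rng = random.Random(0)
samples = []
for _ in range(20000):
    y = rng.choice([-1, 1])
    x = (0.8 * y + rng.gauss(0.0, 0.6), rng.gauss(0.0, 1.0))
    samples.append((x, y))

rho_signal = usefulness(lambda x: x[0], samples)  # estimates rho_D(f) for f(x) = x0
rho_noise = usefulness(lambda x: x[1], samples)   # estimates rho_D(f) for f(x) = x1
print(rho_signal, rho_noise)
```

Under these assumptions the first estimate should be close to 0.8, so \(x_0\) is \(\rho\)-useful for any \(\rho\) up to about 0.8, while the second should be close to 0, so \(x_1\) is not useful.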

\(\gamma\)-robustly useful features

If \(f\) is \(\rho\)-useful and, for a given perturbation set \(\Delta\),

\[\tag{2} \mathbb{E}_{(x, y) \sim \mathcal{D}} [\inf_{\delta \in \Delta(x)} y \cdot f(x+ \delta)] \ge \gamma > 0, \]

then \(f\) is a \(\gamma\)-robustly useful feature.

Useful, non-robust features

These are features \(f\) with \(\rho_{\mathcal{D}}(f) >0\) for which no \(\gamma >0\) satisfies (2).
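
The difference between (1) and (2) can be checked numerically. The sketch below assumes linear features \(f(x) = w \cdot x\) and an \(\ell_\infty\) perturbation set \(\Delta(x) = \{\delta : \|\delta\|_\infty \le \epsilon\}\); under that assumption the infimum in (2) has the closed form \(y \, w \cdot x - \epsilon \|w\|_1\). The data and the name `robust_usefulness` are hypothetical.

```python
import random

def robust_usefulness(w, samples, eps):
    """Empirical E[ inf_{||d||_inf <= eps} y * w.(x + d) ] for f(x) = w.x.

    For a linear feature the infimum equals y * w.x - eps * ||w||_1.
    """
    l1 = sum(abs(wi) for wi in w)
    total = 0.0
    for x, y in samples:
        total += y * sum(wi * xi for wi, xi in zip(w, x)) - eps * l1
    return total / len(samples)

# Hypothetical toy data: x0 carries a strong signal, x1 a weak one.
rng = random.Random(0)
samples = []
for _ in range(20000):
    y = rng.choice([-1, 1])
    x = (0.8 * y + rng.gauss(0.0, 0.6), 0.1 * y + rng.gauss(0.0, 1.0))
    samples.append((x, y))

eps = 0.3
rho_weak = robust_usefulness([0.0, 1.0], samples, 0.0)      # eps = 0 recovers (1)
gamma_strong = robust_usefulness([1.0, 0.0], samples, eps)  # robust usefulness of x0
gamma_weak = robust_usefulness([0.0, 1.0], samples, eps)    # robust usefulness of x1
print(rho_weak, gamma_strong, gamma_weak)
```

With this budget the strong coordinate stays robustly useful (about 0.8 - 0.3 in expectation), while the weak coordinate is useful (about 0.1) but its robust usefulness is negative, which is the useful, non-robust case for this \(\Delta\).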

Standard training

Minimize the expected loss (in practice, the empirical risk):

\[\tag{3} \mathbb{E}_{(x,y) \sim \mathcal{D}} [\mathcal{L}_{\theta} (x, y)], \]

\(\mathcal{L}_{\theta}\) can take many forms, for example

\[\mathcal{L}_{\theta}(x, y) = - [y \cdot \big( b+ \sum_{f \in F_C} w_f \cdot f(x) \big )]. \]

Robust training

Minimize the expected worst-case loss over the perturbation set \(\Delta(x)\):

\[\tag{4} \mathbb{E}_{(x, y) \sim \mathcal{D}} [\max_{\delta \in \Delta(x)} \mathcal{L}_{\theta} (x+\delta, y)]. \]
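
As a rough illustration of how objectives (3) and (4) can rank models differently, the sketch below evaluates the linear example loss \(-y(b + w \cdot x)\) for two fixed weight vectors. It assumes an \(\ell_\infty\) ball of radius \(\epsilon\), for which the inner maximizer of a linear model is \(\delta_i = -\epsilon \, y \, \mathrm{sgn}(w_i)\); the data, weights, and function names are hypothetical.

```python
import random

def linear_loss(w, b, x, y):
    """The example loss from the text: -y * (b + w.x)."""
    return -y * (b + sum(wi * xi for wi, xi in zip(w, x)))

def sign(v):
    return (v > 0) - (v < 0)

def worst_case_loss(w, b, x, y, eps):
    """Inner maximization of (4) over ||delta||_inf <= eps (linear model only)."""
    x_adv = [xi - eps * y * sign(wi) for xi, wi in zip(x, w)]
    return linear_loss(w, b, x_adv, y)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical toy data: strong signal in x0, weak signal in x1.
rng = random.Random(0)
data = []
for _ in range(20000):
    y = rng.choice([-1, 1])
    data.append(((0.8 * y + rng.gauss(0.0, 0.6), 0.1 * y + rng.gauss(0.0, 1.0)), y))

eps = 0.3
w_a, w_b = [1.0, 0.0], [1.0, 1.0]  # w_a ignores the weak feature, w_b uses it
std_a = mean(linear_loss(w_a, 0.0, x, y) for x, y in data)
std_b = mean(linear_loss(w_b, 0.0, x, y) for x, y in data)
rob_a = mean(worst_case_loss(w_a, 0.0, x, y, eps) for x, y in data)
rob_b = mean(worst_case_loss(w_b, 0.0, x, y, eps) for x, y in data)
print(std_a, std_b, rob_a, rob_b)
```

In this toy setting the standard objective prefers `w_b`, which also uses the weak feature, whereas the robust objective prefers `w_a`, because each extra unit of weight costs \(\epsilon\) under the worst-case perturbation.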

Extracting the robust dataset

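What is robust data? It is data on which a model trained in the standard way is, to some degree, resistant to attacks. If ordinary data can be separated into a robust dataset and a non-robust dataset, this supports the existence of the robust and non-robust features defined above.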

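First, assume \(C\) is a robust model (it can be obtained approximately by PGD adversarial training). The robust dataset \(\hat{D}_{R}\) should then satisfy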

\[\tag{5} \mathbb{E}_{(x, y) \sim \hat{D}_{R}}[f(x) \cdot y] = \left \{ \begin{array}{ll} \mathbb{E}_{(x, y) \sim D}[f(x) \cdot y] & \text{if } f \in F_C, \\ 0 & \text{otherwise}. \end{array} \right. \]

To satisfy the first case, for each sample \(x\) we solve

\[\tag{6} \min_{x_r} \quad \|g(x_r) - g(x)\|_2, \]

where \(g\) is the mapping from the input \(x\) to the representation layer of the robust model \(C\).

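To satisfy the second case, the optimization over \(x_r\) is initialized at an \(x'\) sampled at random from \(\mathcal{D}\), so that \(x'\) is independent of \(y\). Then \(\mathbb{E}_{(x, y) \sim D}[f(x') \cdot y] = \mathbb{E}_{(x, y) \sim D}[f(x')] \cdot \mathbb{E}_{(x, y) \sim D}[y] = 0\), where the product vanishes because features are normalized to zero mean.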
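
The optimization in (6) can be sketched with plain gradient descent. The example below is only an illustration under strong assumptions: \(g\) is replaced by a hypothetical linear map \(g(x) = Wx\) standing in for the robust model's representation layer, and the squared norm is minimized instead of the norm (both share the same minimizers). The names `matvec` and `robustify` are made up for this sketch.

```python
def matvec(W, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def robustify(x, x_start, W, lr=0.1, steps=500):
    """Gradient descent on ||W x_r - W x||_2^2, starting from x_start."""
    x_r = list(x_start)
    target = matvec(W, x)
    for _ in range(steps):
        residual = [a - b for a, b in zip(matvec(W, x_r), target)]
        # Gradient of the squared norm with respect to x_r is 2 * W^T residual.
        grad = [2.0 * sum(W[k][j] * residual[k] for k in range(len(W)))
                for j in range(len(x_r))]
        x_r = [xj - lr * gj for xj, gj in zip(x_r, grad)]
    return x_r

# Hypothetical representation that ignores the third input coordinate.
W = [[1.0, 0.5, 0.0],
     [0.0, 1.0, 0.0]]
x = [1.0, -2.0, 3.0]         # sample whose label y is kept
x_prime = [0.5, 0.5, -1.0]   # random starting point, independent of y
x_r = robustify(x, x_prime, W)
print(x_r)
```

Here `x_r` ends up matching `x` in the coordinates that \(g\) depends on, while the coordinate that \(g\) ignores keeps the value of the random start `x_prime`. That is the mechanism behind the two cases of (5): features used by the robust model follow \(x\), and everything else stays uncorrelated with \(y\).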

Extracting the non-robust dataset

Extracting the non-robust data requires a standard (non-robust) model \(C\), and

\[\tag{7} x_{adv} = \arg \min_{\|x'-x\| \le \epsilon} L_C(x', t), \]

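where \(L_C\) is a manually chosen loss function (for example, cross-entropy), \(t\) is a target label assigned in some way, and \(C(x) = y\), \(C(x_{adv})=t\).
Because the perturbation is small, the robust features of \(x_{adv}\) still correspond to the original label \(y\), while \(x_{adv}\) is labeled \(t\) in the new dataset. Here \(F_C\), the feature set of the standard model, contains both robust and non-robust features.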
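
A minimal sketch of the targeted attack in (7) follows, using projected gradient descent. Several choices here are assumptions rather than details from the text: the classifier is a hypothetical linear model \(\mathrm{sgn}(w \cdot x + b)\), \(L_C\) is the logistic loss (binary cross-entropy), and the constraint is an \(\ell_2\) ball.

```python
import math

def logistic_loss_grad(w, b, x, t):
    """Gradient with respect to x of log(1 + exp(-t * (w.x + b)))."""
    margin = t * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    coeff = -t / (1.0 + math.exp(margin))
    return [coeff * wi for wi in w]

def targeted_pgd(w, b, x, t, eps, step=0.05, iters=100):
    """Approximately minimize the loss toward target t within ||x' - x||_2 <= eps."""
    x_adv = list(x)
    for _ in range(iters):
        grad = logistic_loss_grad(w, b, x_adv, t)
        norm = math.sqrt(sum(g * g for g in grad)) or 1.0
        # Normalized gradient descent step on the targeted loss.
        x_adv = [xi - step * g / norm for xi, g in zip(x_adv, grad)]
        # Project back onto the L2 ball of radius eps around x.
        delta = [a - o for a, o in zip(x_adv, x)]
        dnorm = math.sqrt(sum(d * d for d in delta))
        if dnorm > eps:
            x_adv = [o + d * eps / dnorm for o, d in zip(x, delta)]
    return x_adv

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical standard model that leans heavily on the small second coordinate.
w, b = [1.0, 5.0], 0.0
x, y, t = [1.0, -0.1], 1, -1
x_adv = targeted_pgd(w, b, x, t, eps=0.3)
print(predict(w, b, x), predict(w, b, x_adv), x_adv)
```

In this example the prediction flips from \(y\) to \(t\), yet the first coordinate of `x_adv` barely moves and keeps its original sign. Most of the change happens in the heavily weighted small coordinate, which plays the role of a non-robust feature.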

\(t\) chosen at random

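In this case the robust features are uncorrelated with \(t\), so their usefulness should be 0, while the non-robust features have usefulness greater than 0. Writing \((x, y)\) for a relabeled pair \((x_{adv}, t)\), we get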

\[\tag{8} \mathbb{E}_{(x, y) \sim \hat{D}_{rand}}[f(x) \cdot y] \left \{ \begin{array}{ll} > 0 & \text{if } f \text{ is non-robustly useful}, \\ \approx 0 & \text{otherwise}. \end{array} \right. \]

\(t\) chosen as a function of \(y\)

\[\tag{9} \mathbb{E}_{(x, y) \sim \hat{D}_{det}}[f(x) \cdot y] \left \{ \begin{array}{ll} > 0 & \text{if } f \text{ is non-robustly useful}, \\ < 0 & \text{if } f \text{ is robustly useful}, \\ \in \mathbb{R} & \text{otherwise}. \end{array} \right. \]
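
The sign patterns in (8) and (9) can be reproduced on a toy problem. The sketch below is a simplified stand-in, not the paper's experiment: it assumes a hypothetical linear standard classifier \(\mathrm{sgn}(w \cdot x)\), for which an \(\ell_2\)-bounded targeted attack moves \(x\) by \(\epsilon\) along \(t \, w / \|w\|_2\), and it uses raw coordinates as features without renormalizing them.

```python
import math
import random

def attack(x, t, w, eps):
    """L2-bounded targeted attack on sgn(w.x): shift x by eps along t * w / ||w||_2."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [xi + eps * t * wi / norm for xi, wi in zip(x, w)]

def correlation(samples, index):
    """Empirical E[label * x[index]] over (x, label) pairs."""
    return sum(label * x[index] for x, label in samples) / len(samples)

rng = random.Random(0)
w, eps = [0.5, 5.0], 0.3  # hypothetical standard classifier and attack budget
d_rand, d_det = [], []
for _ in range(20000):
    y = rng.choice([-1, 1])
    x = [0.8 * y + rng.gauss(0.0, 0.6),   # large-margin coordinate (robust)
         0.1 * y + rng.gauss(0.0, 0.05)]  # small-margin coordinate (non-robust)
    t_rand = rng.choice([-1, 1])          # target label independent of y
    t_det = -y                            # target label determined by y
    d_rand.append((attack(x, t_rand, w, eps), t_rand))
    d_det.append((attack(x, t_det, w, eps), t_det))

rand_robust, rand_nonrobust = correlation(d_rand, 0), correlation(d_rand, 1)
det_robust, det_nonrobust = correlation(d_det, 0), correlation(d_det, 1)
print(rand_robust, rand_nonrobust, det_robust, det_nonrobust)
```

Under these assumptions the non-robust coordinate is positively correlated with the new label in both datasets, while the robust coordinate is close to uncorrelated in the random-target dataset and negatively correlated in the deterministic-target dataset.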
