Ho J., Jain A., and Abbeel P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Page E. Approximating to the cumulative normal function and its inverse for use on a pocket calculator. Applied Statistics, vol. 26, pp. 75-76, 1977.
Yerukala R. and Boiroju N. K. Approximations to standard normal distribution function. Journal of Scientific and Engineering Research, vol. 6, pp. 515-518, 2015.
Overview
A combination of the diffusion model and the variational bound.
In adversarial robustness, several papers have already used DDPM-generated data for training, which speaks to its strength.
Main content
Diffusion models
reverse process
Starting from \(p(x_T) = \mathcal{N}(x_T; 0, I)\):
\[p_{\theta}(x_{0:T}) := p(x_T) \prod_{t=1}^T p_{\theta}(x_{t-1}|x_t), \quad p_{\theta}(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_{t}, t), \Sigma_{\theta}(x_t, t)), \]
Note that in this process we fit the mean \(\mu_{\theta}\) and the covariance matrix \(\Sigma_{\theta}\).
This process gradually "recovers" the image (signal) \(x_0\) from noise.
forward process
\[q(x_{1:T}|x_0) := \prod_{t=1}^{T}q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}):= \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I). \]
where the \(\beta_t\) are either trainable parameters or hand-chosen hyperparameters.
This process gradually adds noise to the image (signal).
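A minimal sketch of one such forward transition (the function name and arguments are mine, not from the official code), assuming \(x_{t-1}\) is stored in a tensor `x_prev`:

```python
import torch

def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    # Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).
    noise = torch.randn_like(x_prev)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise
```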
Variational bound
For the parameters \(\theta\), it is natural to optimize by minimizing the negative log-likelihood:
\[\begin{array}{ll} \mathbb{E}_{p_{data}(x_0)} \bigg[-\log p_{\theta}(x_0) \bigg] &=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int p_{\theta}(x_{0:T}) \mathrm{d}x_{0:T} \bigg] \\ &=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int q(x_{1:T}|x_0)\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \mathrm{d}x_{0:T} \bigg] \\ &=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \mathbb{E}_{q(x_{1:T}|x_0)} \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\ &\le -\mathbb{E}_{p_{data}(x_0)}\mathbb{E}_{q(x_{1:T}|x_0)} \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=1}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \bigg] \\ &= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log \frac{p(x_T)}{q(x_T|x_0)} + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} + \log p_{\theta}(x_0|x_1) \bigg] \\ \end{array} \]
Note: \(q=q(x_{1:T}|x_0)p_{data}(x_0)\); below we let \(q(x_0) := p_{data}(x_0)\).
Moreover,
\[\begin{array}{ll} \mathbb{E}_q [\log \frac{q(x_T|x_0)}{p(x_T)}] &= \int q(x_0, x_T) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\ &= \int q(x_0) q(x_T|x_0) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\ &= \int q(x_0) \mathrm{D_{KL}}(q(x_T|x_0) \| p(x_T)) \mathrm{d}x_0 \\ &= \int q(x_{0:T}) \mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \mathrm{d}x_{0:T} \\ &= \mathbb{E}_q \bigg[\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \bigg]. \end{array} \]
Similarly,
\[\begin{array}{ll} \mathbb{E}_q [\log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)}] &=\int q(x_0, x_{t-1}, x_t) \log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} \mathrm{d}x_0 \mathrm{d}x_{t-1}\mathrm{d}x_t\\ &=\int q(x_0, x_t) \mathrm{D_{KL}}(q(x_{t-1}|x_t, x_0)\| p_{\theta}(x_{t-1}|x_t)) \mathrm{d}x_0 \mathrm{d}x_t\\ &=\mathbb{E}_q\bigg[\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t)) \bigg]. \end{array} \]
Hence, finally:
\[\mathcal{L} := \mathbb{E}_q \bigg[ \underbrace{\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T))}_{L_T} + \sum_{t=2}^T \underbrace{\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t))}_{L_{t-1}} \underbrace{-\log p_{\theta}(x_0|x_1)}_{L_0}. \bigg] \]
Computing the losses
Since both the forward and the reverse process are Gaussian, the individual loss terms above can be computed explicitly.
First, for \(x_t\) in the forward process:
\[\begin{array}{ll} x_t &= \sqrt{1 - \beta_t} x_{t-1} + \sqrt{\beta_t} \epsilon, \: \epsilon \sim \mathcal{N}(0, I) \\ &= \sqrt{1 - \beta_t} (\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{\beta_{t-1}} \epsilon') + \sqrt{\beta_t} \epsilon \\ &= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{1 - \beta_t}\sqrt{\beta_{t-1}} \epsilon' + \sqrt{\beta_t} \epsilon \\ &= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{1 - (1 - \beta_t)(1 - \beta_{t-1})} \epsilon \\ &= \cdots \\ &= (\prod_{s=1}^t \sqrt{1 - \beta_s}) x_0 + \sqrt{1 - \prod_{s=1}^t (1 - \beta_s)} \epsilon, \end{array} \]
Hence
\[q(x_t|x_0) = \mathcal{N}(x_t|\sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I), \: \bar{\alpha}_t := \prod_{s=1}^t \alpha_s, \alpha_s := 1 - \beta_s. \]
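This closed form allows sampling \(x_t\) from \(x_0\) in a single step. A sketch, assuming a tensor `alphas_cumprod` of length \(T+1\) with `alphas_cumprod[t]` \(= \bar{\alpha}_t\) and \(\bar{\alpha}_0 = 1\) (a convention of mine, not the paper's):

```python
import torch

def q_sample(x0: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor,
             noise: torch.Tensor = None) -> torch.Tensor:
    # Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I).
    if noise is None:
        noise = torch.randn_like(x0)
    abar_t = alphas_cumprod[t].view(-1, 1, 1, 1)  # broadcast over a (B, C, H, W) batch
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * noise
```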
For the posterior \(q(x_{t-1}|x_t, x_0)\), we have
\[\begin{array}{ll} q(x_{t-1}|x_t, x_0) &= \frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &\propto q(x_t|x_{t-1})q(x_{t-1}|x_0) \\ &\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_{t-1}) \|x_t - \sqrt{1 - \beta_t} x_{t-1}\|^2 + \beta_t \|x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}x_0\|^2 \bigg]\Bigg\} \\ &\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_t)\|x_{t-1}\|^2 - 2(1 - \bar{\alpha}_{t-1}) \sqrt{\alpha_t} x_t^Tx_{t-1} - 2 \sqrt{\bar{\alpha}_{t-1}} \beta_t x_0^T x_{t-1} \bigg]\Bigg\} \\ \end{array} \]
So
\[q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}|\tilde{u}_t(x_t, x_0), \tilde{\beta}_t I), \]
where
\[\tilde{u}_t(x_t,x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t, \]
\[\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t. \]
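A sketch of these two quantities, under the same hypothetical convention that `betas[t]` \(= \beta_t\), `alphas[t]` \(= \alpha_t\), and `alphas_cumprod[t]` \(= \bar{\alpha}_t\) with \(\bar{\alpha}_0 = 1\):

```python
import torch

def q_posterior(x0: torch.Tensor, x_t: torch.Tensor, t: int,
                betas: torch.Tensor, alphas: torch.Tensor, alphas_cumprod: torch.Tensor):
    # q(x_{t-1} | x_t, x_0) = N(mean, var * I), cf. tilde{u}_t and tilde{beta}_t above.
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
    mean = (abar_prev.sqrt() * betas[t] * x0
            + alphas[t].sqrt() * (1.0 - abar_prev) * x_t) / (1.0 - abar_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]
    return mean, var
```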
\(L_{t}\)
\(L_T\) does not depend on \(\theta\), so it can be dropped.
The authors fix \(\Sigma_{\theta}(x_t, t) = \sigma_t^2 I\) as non-trained constants, with
\[\sigma_t^2 = \beta_t \quad \text{or} \quad \sigma_t^2 = \tilde{\beta}_t, \]
which are the optimal choices (in terms of the expected KL divergence) when \(x_0 \sim \mathcal{N}(0, I)\) and when \(x_0\) is a fixed point, respectively (the authors report the two perform similarly in experiments).
Hence
\[L_{t-1} = \frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) - \tilde{u}_t(x_t, x_0)\|^2 +C, \quad t = 2, \cdots, T. \]
Moreover,
\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \Rightarrow x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon. \]
so
\[\begin{array}{ll} \mathbb{E}_q [L_{t-1} - C] &= \mathbb{E}_{x_0, \epsilon} \bigg\{ \frac{1}{2 \sigma_t^2} \| \mu_{\theta}(x_t, t) - \tilde{u}_t\big( x_t, (\frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon) \big)\|^2 \bigg\} \\ &= \mathbb{E}_{x_0, \epsilon} \bigg\{ \frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) - \frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \big) \|^2 \bigg\} \\ \end{array} \]
Note: in the expression above, \(x_t\) is determined by \(x_0\) and \(\epsilon\), i.e. \(x_t = x_t(x_0, \epsilon)\), so the expectation is in effect taken over \(x_t\).
Given this form, we may as well parameterize \(\mu_{\theta}\) directly as
\[\mu_{\theta}(x_t, t):= \frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big), \]
i.e., we model the noise \(\epsilon\) directly.
The loss then simplifies to:
\[\mathbb{E}_{x_0, \epsilon} \bigg\{ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \|\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon, t) - \epsilon\|^2 \bigg\} \]
This is in fact denoising score matching.
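A sketch of this objective as a single training-step loss, with the leading weight dropped as in the authors' simplified objective (see the note under the final algorithm); `model(x_t, t)` stands in for \(\epsilon_{\theta}\), and all names are placeholders:

```python
import torch
import torch.nn.functional as F

def training_loss(model, x0: torch.Tensor, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    T = alphas_cumprod.shape[0] - 1
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)  # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)
    abar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps         # x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
    return F.mse_loss(model(x_t, t), eps)                          # predict the noise that was added
```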
Similarly, sampling from \(p_{\theta}(x_{t-1}|x_t)\) becomes:
\[x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big) + \sigma_t z, \: z \sim \mathcal{N}(0, I), \]
which has the form of Langevin dynamics (with slightly different step size and weighting).
Note: see here for this part.
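A sketch of one reverse step under the same placeholder conventions, with `sigma2[t]` \(= \sigma_t^2\) (either \(\beta_t\) or \(\tilde{\beta}_t\)); as in the paper, no noise is added at the final step \(t = 1\):

```python
import torch

@torch.no_grad()
def p_sample(model, x_t: torch.Tensor, t: int, betas: torch.Tensor, alphas: torch.Tensor,
             alphas_cumprod: torch.Tensor, sigma2: torch.Tensor) -> torch.Tensor:
    # x_{t-1} = (x_t - beta_t / sqrt(1 - abar_t) * eps_theta(x_t, t)) / sqrt(alpha_t) + sigma_t * z
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = model(x_t, t_batch)
    mean = (x_t - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
    z = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
    return mean + sigma2[t].sqrt() * z
```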
\(L_0\)
Finally we need to handle \(L_0\). Here the authors assume \(x_0|x_1\) follows a discrete distribution: pixel values lie in \(\{0, 1, 2, \cdots, 255\}\) and are rescaled to \([-1, 1]\). Assume
\[p_{\theta}(x_0|x_1) = \prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_+(x_0^i) } \mathcal{N}(x; \mu_{\theta}^i(x_1, 1), \sigma_1^2) \mathrm{d}x, \\ \delta_+(x) = \left \{ \begin{array}{ll} +\infty & \text{if } x = 1, \\ x + \frac{1}{255} & \text{if } x < 1, \end{array} \right . \quad \delta_- (x) = \left \{ \begin{array}{ll} -\infty & \text{if } x = -1, \\ x - \frac{1}{255} & \text{if } x > -1. \end{array} \right . \]
In effect, this partitions the real line under the ordinary Gaussian into the bins
\[(-\infty, -1 + 1/255], (-1 + 1 / 255, -1 + 3/255], \cdots, (1 - 3/255, 1 - 1/255], (1 - 1 / 255, +\infty) \]
and each pixel value falls into exactly one of them.
When writing the actual code, one runs into the problem of evaluating the Gaussian cumulative distribution function \(\Phi\) (there is no closed form); the authors use the following approximation:
\[\Phi(x) \approx \frac{1}{2} \Bigg\{1 + \tanh \bigg(\sqrt{2/\pi} \, x \, (1 + 0.044715 x^2) \bigg) \Bigg\}. \]
In this way the gradients can be backpropagated as well.
Note: this approximation is due to Page.
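A sketch of the resulting discrete log-likelihood using the tanh approximation, so that gradients flow through \(\mu_{\theta}\) (the thresholds and all names are my own; this is not the official implementation):

```python
import math
import torch

def approx_std_normal_cdf(x: torch.Tensor) -> torch.Tensor:
    # Page-style approximation: Phi(x) ~= 0.5 * (1 + tanh(sqrt(2/pi) * x * (1 + 0.044715 * x^2))).
    return 0.5 * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * x * (1.0 + 0.044715 * x ** 2)))

def discretized_log_likelihood(x0: torch.Tensor, mean: torch.Tensor, sigma: float) -> torch.Tensor:
    # x0 is rescaled to [-1, 1] with 256 levels; integrate N(mean, sigma^2) over each bin of width 2/255.
    cdf_plus = approx_std_normal_cdf((x0 + 1.0 / 255 - mean) / sigma)
    cdf_minus = approx_std_normal_cdf((x0 - 1.0 / 255 - mean) / sigma)
    cdf_plus = torch.where(x0 > 0.999, torch.ones_like(cdf_plus), cdf_plus)       # right-most bin integrates to +inf
    cdf_minus = torch.where(x0 < -0.999, torch.zeros_like(cdf_minus), cdf_minus)  # left-most bin starts at -inf
    return torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))  # per-pixel log-prob; sum over pixels gives log p(x_0|x_1)
```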
The final algorithm

Note: \(t=1\) corresponds to \(L_0\), and \(t=2,\cdots, T\) correspond to \(L_{1}, \cdots, L_{T-1}\).
Note: for \(L_t\) the authors drop the leading coefficient, which in effect acts as a reweighting.
In practice, the authors train on a randomly sampled loss term (i.e., a random timestep) rather than the full sum.
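The algorithm figure is not reproduced here: training simply repeats the sampled loss above, and sampling chains the single reverse step from \(t = T\) down to \(t = 1\). A rough sketch of the sampling loop, reusing the same placeholder arrays:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alphas_cumprod, sigma2):
    # Start from pure noise x_T ~ N(0, I) and denoise step by step down to x_0.
    T = betas.shape[0] - 1
    x = torch.randn(shape)
    for t in range(T, 0, -1):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)
        mean = (x - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + sigma2[t].sqrt() * z
    return x
```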
Details
Note that \(\epsilon_{\theta}(\cdot, t)\) depends explicitly on \(t\). In the experiments this is implemented with the positional encoding used in attention; let the positional encoding be \(P\):
- $ t = \text{Linear}(\text{ACT}(\text{Linear}(t * P)))$, i.e., the time embedding (time_steps) is obtained through a two-layer MLP (see the sketch after this list);
- the backbone is a U-Net, and in each residual block:
\[x += \text{Linear}(\text{ACT}(t)). \]
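A sketch of this time conditioning (the embedding dimension, the choice of SiLU as `ACT`, and all names are assumptions of mine):

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    # Transformer-style sinusoidal positional encoding, indexed by the diffusion step t.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeMLP(nn.Module):
    # Two-layer MLP mapping the sinusoidal embedding to the vector used inside each residual block.
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.dim = dim
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, hidden))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.net(timestep_embedding(t, self.dim))
```

Inside a residual block, this vector would then be projected and broadcast-added to the feature map, matching the \(x \mathrel{+}= \text{Linear}(\text{ACT}(t))\) line above.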
| Parameter | Value |
| :-: | :-: |
| \(T\) | 1000 |
| \(\beta_t\) | increases linearly from \(0.0001\) to \(0.02\) over \(t = 1, 2, \cdots, T\) |
| backbone | U-Net |
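A sketch of this schedule, padded at index 0 so that the arrays line up with \(t = 1, \cdots, T\) as assumed in the snippets above:

```python
import torch

T = 1000
betas = torch.cat([torch.zeros(1), torch.linspace(1e-4, 0.02, T)])  # betas[t] = beta_t for t = 1..T
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)                       # alphas_cumprod[t] = abar_t, abar_0 = 1
```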
Note: the implementation also uses tricks such as EMA.
Code
Official code
lucidrains-denoising-diffusion-pytorch