Ho J., Jain A., and Abbeel P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Page E. Approximating to the cumulative normal function and its inverse for use on a pocket calculator. Applied Statistics, vol. 26, pp. 75-76, 1977.
Yerukala R. and Boiroju N. K. Approximations to standard normal distribution function. Journal of Scientific and Engineering Research, vol. 6, pp. 515-518, 2015.
Overview
A combination of the diffusion model and the variational bound.
In adversarial robustness, several papers have already used DDPM-generated data for training, which speaks to its strength.
Main content
Diffusion models
reverse process
Starting from \(p(x_T) = \mathcal{N}(x_T; 0, I)\):
\[p_{\theta}(x_{0:T}) := p(x_T) \prod_{t=1}^T p_{\theta}(x_{t-1}|x_t), \quad p_{\theta}(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_{\theta}(x_{t}, t), \Sigma_{\theta}(x_t, t)), \]
Note that in this process we fit the mean \(\mu_{\theta}\) and the covariance matrix \(\Sigma_{\theta}\).
This process gradually "recovers" the image (signal) \(x_0\) from noise.
forward process
\[q(x_{1:T}|x_0) := \prod_{t=1}^{T}q(x_t|x_{t-1}), \quad q(x_t|x_{t-1}):= \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I). \]
where the \(\beta_t\) are either trainable parameters or hand-chosen hyperparameters.
This process gradually adds noise to the image (signal).
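A minimal sketch of one such forward transition (the function name and arguments are mine, not from the official code), assuming \(x_{t-1}\) is stored in a tensor `x_prev`:

```python
import torch

def forward_step(x_prev: torch.Tensor, beta_t: float) -> torch.Tensor:
    # Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I).
    noise = torch.randn_like(x_prev)
    return (1.0 - beta_t) ** 0.5 * x_prev + beta_t ** 0.5 * noise
```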
Variational bound
For the parameters \(\theta\), it is natural to optimize by minimizing the negative log-likelihood:
\[\begin{array}{ll} \mathbb{E}_{p_{data}(x_0)} \bigg[-\log p_{\theta}(x_0) \bigg] &=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int p_{\theta}(x_{0:T}) \mathrm{d}x_{0:T} \bigg] \\ &=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \int q(x_{1:T}|x_0)\frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \mathrm{d}x_{0:T} \bigg] \\ &=\mathbb{E}_{p_{data}(x_0)} \bigg[-\log \mathbb{E}_{q(x_{1:T}|x_0)} \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\ &\le -\mathbb{E}_{p_{data}(x_0)}\mathbb{E}_{q(x_{1:T}|x_0)} \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log \frac{p_{\theta}(x_{0:T})}{q(x_{1:T}|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=1}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} \bigg] \\ &= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_t|x_{t-1})} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log p(x_T) + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} \cdot \frac{q(x_{t-1}|x_0)}{q(x_t|x_0)} + \log \frac{p_{\theta}(x_0|x_1)}{q(x_1|x_0)} \bigg] \\ &= -\mathbb{E}_q \bigg[\log \frac{p(x_T)}{q(x_T|x_0)} + \sum_{t=2}^T \log \frac{p_{\theta}(x_{t-1}|x_t)}{q(x_{t-1}|x_t, x_0)} + \log p_{\theta}(x_0|x_1) \bigg] \\ \end{array} \]
Note: \(q=q(x_{1:T}|x_0)p_{data}(x_0)\); below we let \(q(x_0) := p_{data}(x_0)\).
Moreover,
\[\begin{array}{ll} \mathbb{E}_q [\log \frac{q(x_T|x_0)}{p(x_T)}] &= \int q(x_0, x_T) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\ &= \int q(x_0) q(x_T|x_0) \log \frac{q(x_T|x_0)}{p(x_T)} \mathrm{d}x_0 \mathrm{d}x_T \\ &= \int q(x_0) \mathrm{D_{KL}}(q(x_T|x_0) \| p(x_T)) \mathrm{d}x_0 \\ &= \int q(x_{0:T}) \mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \mathrm{d}x_{0:T} \\ &= \mathbb{E}_q \bigg[\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T)) \bigg]. \end{array} \]
Similarly,
\[\begin{array}{ll} \mathbb{E}_q [\log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)}] &=\int q(x_0, x_{t-1}, x_t) \log \frac{q(x_{t-1}|x_t, x_0)}{p_{\theta}(x_{t-1}|x_t)} \mathrm{d}x_0 \mathrm{d}x_{t-1}\mathrm{d}x_t\\ &=\int q(x_0, x_t) \mathrm{D_{KL}}(q(x_{t-1}|x_t, x_0)\| p_{\theta}(x_{t-1}|x_t)) \mathrm{d}x_0 \mathrm{d}x_t\\ &=\mathbb{E}_q\bigg[\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t)) \bigg]. \end{array} \]
Hence, finally:
\[\mathcal{L} := \mathbb{E}_q \bigg[ \underbrace{\mathrm{D_{KL}}(q(x'_T|x_0) \| p(x'_T))}_{L_T} + \sum_{t=2}^T \underbrace{\mathrm{D_{KL}}(q(x'_{t-1}|x_t, x_0)\| p_{\theta}(x'_{t-1}|x_t))}_{L_{t-1}} \underbrace{-\log p_{\theta}(x_0|x_1)}_{L_0}. \bigg] \]
Computing the losses
Since both the forward and the reverse process are Gaussian, the individual loss terms above can be computed explicitly.
First, for \(x_t\) in the forward process:
\[\begin{array}{ll} x_t &= \sqrt{1 - \beta_t} x_{t-1} + \sqrt{\beta_t} \epsilon, \: \epsilon \sim \mathcal{N}(0, I) \\ &= \sqrt{1 - \beta_t} (\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{\beta_{t-1}} \epsilon') + \sqrt{\beta_t} \epsilon \\ &= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{1 - \beta_t}\sqrt{\beta_{t-1}} \epsilon' + \sqrt{\beta_t} \epsilon \\ &= \sqrt{1 - \beta_t}\sqrt{1 - \beta_{t-1}} x_{t-2} + \sqrt{1 - (1 - \beta_t)(1 - \beta_{t-1})} \epsilon \\ &= \cdots \\ &= (\prod_{s=1}^t \sqrt{1 - \beta_s}) x_0 + \sqrt{1 - \prod_{s=1}^t (1 - \beta_s)} \epsilon, \end{array} \]
Hence
\[q(x_t|x_0) = \mathcal{N}(x_t|\sqrt{\bar{\alpha}_t}x_0, (1 - \bar{\alpha}_t)I), \: \bar{\alpha}_t := \prod_{s=1}^t \alpha_s, \alpha_s := 1 - \beta_s. \]
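This closed form allows sampling \(x_t\) from \(x_0\) in a single step. A sketch, assuming a tensor `alphas_cumprod` of length \(T+1\) with `alphas_cumprod[t]` \(= \bar{\alpha}_t\) and \(\bar{\alpha}_0 = 1\) (a convention of mine, not the paper's):

```python
import torch

def q_sample(x0: torch.Tensor, t: torch.Tensor, alphas_cumprod: torch.Tensor,
             noise: torch.Tensor = None) -> torch.Tensor:
    # Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I).
    if noise is None:
        noise = torch.randn_like(x0)
    abar_t = alphas_cumprod[t].view(-1, 1, 1, 1)  # broadcast over a (B, C, H, W) batch
    return abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * noise
```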
For the posterior \(q(x_{t-1}|x_t, x_0)\), we have
\[\begin{array}{ll} q(x_{t-1}|x_t, x_0) &= \frac{q(x_t|x_{t-1})q(x_{t-1}|x_0)}{q(x_t|x_0)} \\ &\propto q(x_t|x_{t-1})q(x_{t-1}|x_0) \\ &\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_{t-1}) \|x_t - \sqrt{1 - \beta_t} x_{t-1}\|^2 + \beta_t \|x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}x_0\|^2 \bigg]\Bigg\} \\ &\propto \exp\Bigg\{-\frac{1}{2 (1 - \bar{\alpha}_{t-1})\beta_t} \bigg[(1 - \bar{\alpha}_t)\|x_{t-1}\|^2 - 2(1 - \bar{\alpha}_{t-1}) \sqrt{\alpha_t} x_t^Tx_{t-1} - 2 \sqrt{\bar{\alpha}_{t-1}} \beta_t x_0^T x_{t-1} \bigg]\Bigg\} \\ \end{array} \]
So
\[q(x_{t-1}|x_t, x_0) = \mathcal{N}(x_{t-1}|\tilde{u}_t(x_t, x_0), \tilde{\beta}_t I), \]
where
\[\tilde{u}_t(x_t,x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t, \]
\[\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t. \]
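A sketch of these two quantities, under the same hypothetical convention that `betas[t]` \(= \beta_t\), `alphas[t]` \(= \alpha_t\), and `alphas_cumprod[t]` \(= \bar{\alpha}_t\) with \(\bar{\alpha}_0 = 1\):

```python
import torch

def q_posterior(x0: torch.Tensor, x_t: torch.Tensor, t: int,
                betas: torch.Tensor, alphas: torch.Tensor, alphas_cumprod: torch.Tensor):
    # q(x_{t-1} | x_t, x_0) = N(mean, var * I), cf. tilde{u}_t and tilde{beta}_t above.
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
    mean = (abar_prev.sqrt() * betas[t] * x0
            + alphas[t].sqrt() * (1.0 - abar_prev) * x_t) / (1.0 - abar_t)
    var = (1.0 - abar_prev) / (1.0 - abar_t) * betas[t]
    return mean, var
```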
\(L_{t}\)
\(L_T\) does not depend on \(\theta\), so it can be dropped.
The authors fix \(\Sigma_{\theta}(x_t, t) = \sigma_t^2 I\) as non-trained constants, with
\[\sigma_t^2 = \beta_t \quad \text{or} \quad \sigma_t^2 = \tilde{\beta}_t, \]
which are the optimal choices (in terms of the expected KL divergence) when \(x_0 \sim \mathcal{N}(0, I)\) and when \(x_0\) is a fixed point, respectively (the authors report the two perform similarly in experiments).
Hence
\[L_{t-1} = \frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) - \tilde{u}_t(x_t, x_0)\|^2 +C, \quad t = 2, \cdots, T. \]
Moreover,
\[x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon \Rightarrow x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon. \]
so
\[\begin{array}{ll} \mathbb{E}_q [L_{t-1} - C] &= \mathbb{E}_{x_0, \epsilon} \bigg\{ \frac{1}{2 \sigma_t^2} \| \mu_{\theta}(x_t, t) - \tilde{u}_t\big( x_t, (\frac{1}{\sqrt{\bar{\alpha}_t}}x_t - \frac{\sqrt{1 - \bar{\alpha}_t} }{\sqrt{\bar{\alpha}_t}} \epsilon) \big)\|^2 \bigg\} \\ &= \mathbb{E}_{x_0, \epsilon} \bigg\{ \frac{1}{2 \sigma^2_t} \| \mu_{\theta}(x_t, t) - \frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \big) \|^2 \bigg\} \\ \end{array} \]
Note: in the expression above, \(x_t\) is determined by \(x_0\) and \(\epsilon\), i.e. \(x_t = x_t(x_0, \epsilon)\), so the expectation is in effect taken over \(x_t\).
Given this form, we may as well parameterize \(\mu_{\theta}\) directly as
\[\mu_{\theta}(x_t, t):= \frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big), \]
i.e., we model the noise \(\epsilon\) directly.
The loss then simplifies to:
\[\mathbb{E}_{x_0, \epsilon} \bigg\{ \frac{\beta_t^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \|\epsilon_{\theta}(\sqrt{\bar{\alpha}_t}x_0 + \sqrt{1 - \bar{\alpha}_t}\epsilon, t) - \epsilon\|^2 \bigg\} \]
This is in fact denoising score matching.
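A sketch of this objective as a single training-step loss, with the leading weight dropped as in the authors' simplified objective (see the note under the final algorithm); `model(x_t, t)` stands in for \(\epsilon_{\theta}\), and all names are placeholders:

```python
import torch
import torch.nn.functional as F

def training_loss(model, x0: torch.Tensor, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    T = alphas_cumprod.shape[0] - 1
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)  # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)
    abar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = abar_t.sqrt() * x0 + (1.0 - abar_t).sqrt() * eps         # x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps
    return F.mse_loss(model(x_t, t), eps)                          # predict the noise that was added
```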
Similarly, sampling from \(p_{\theta}(x_{t-1}|x_t)\) becomes:
\[x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_{\theta}(x_t, t) \big) + \sigma_t z, \: z \sim \mathcal{N}(0, I), \]
which has the form of Langevin dynamics (with slightly different step size and weighting).
Note: see here for this part.
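A sketch of one reverse step under the same placeholder conventions, with `sigma2[t]` \(= \sigma_t^2\) (either \(\beta_t\) or \(\tilde{\beta}_t\)); as in the paper, no noise is added at the final step \(t = 1\):

```python
import torch

@torch.no_grad()
def p_sample(model, x_t: torch.Tensor, t: int, betas: torch.Tensor, alphas: torch.Tensor,
             alphas_cumprod: torch.Tensor, sigma2: torch.Tensor) -> torch.Tensor:
    # x_{t-1} = (x_t - beta_t / sqrt(1 - abar_t) * eps_theta(x_t, t)) / sqrt(alpha_t) + sigma_t * z
    t_batch = torch.full((x_t.shape[0],), t, device=x_t.device, dtype=torch.long)
    eps = model(x_t, t_batch)
    mean = (x_t - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
    z = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)
    return mean + sigma2[t].sqrt() * z
```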
\(L_0\)
Finally we need to handle \(L_0\). Here the authors assume \(x_0|x_1\) follows a discrete distribution: pixel values lie in \(\{0, 1, 2, \cdots, 255\}\) and are rescaled to \([-1, 1]\). Assume
\[p_{\theta}(x_0|x_1) = \prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_+(x_0^i) } \mathcal{N}(x; \mu_{\theta}^i(x_1, 1), \sigma_1^2) \mathrm{d}x, \\ \delta_+(x) = \left \{ \begin{array}{ll} +\infty & \text{if } x = 1, \\ x + \frac{1}{255} & \text{if } x < 1, \end{array} \right . \quad \delta_- (x) = \left \{ \begin{array}{ll} -\infty & \text{if } x = -1, \\ x - \frac{1}{255} & \text{if } x > -1. \end{array} \right . \]
In effect, this partitions the real line under the ordinary Gaussian into the bins
\[(-\infty, -1 + 1/255], (-1 + 1 / 255, -1 + 3/255], \cdots, (1 - 3/255, 1 - 1/255], (1 - 1 / 255, +\infty) \]
and each pixel value falls into exactly one of them.
When writing the actual code, one runs into the problem of evaluating the Gaussian cumulative distribution function \(\Phi\) (there is no closed form); the authors use the following approximation:
\[\Phi(x) \approx \frac{1}{2} \Bigg\{1 + \tanh \bigg(\sqrt{2/\pi} \, x \, (1 + 0.044715 x^2) \bigg) \Bigg\}. \]
In this way the gradients can be backpropagated as well.
Note: this approximation is due to Page.
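A sketch of the resulting discrete log-likelihood using the tanh approximation, so that gradients flow through \(\mu_{\theta}\) (the thresholds and all names are my own; this is not the official implementation):

```python
import math
import torch

def approx_std_normal_cdf(x: torch.Tensor) -> torch.Tensor:
    # Page-style approximation: Phi(x) ~= 0.5 * (1 + tanh(sqrt(2/pi) * x * (1 + 0.044715 * x^2))).
    return 0.5 * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * x * (1.0 + 0.044715 * x ** 2)))

def discretized_log_likelihood(x0: torch.Tensor, mean: torch.Tensor, sigma: float) -> torch.Tensor:
    # x0 is rescaled to [-1, 1] with 256 levels; integrate N(mean, sigma^2) over each bin of width 2/255.
    cdf_plus = approx_std_normal_cdf((x0 + 1.0 / 255 - mean) / sigma)
    cdf_minus = approx_std_normal_cdf((x0 - 1.0 / 255 - mean) / sigma)
    cdf_plus = torch.where(x0 > 0.999, torch.ones_like(cdf_plus), cdf_plus)       # right-most bin integrates to +inf
    cdf_minus = torch.where(x0 < -0.999, torch.zeros_like(cdf_minus), cdf_minus)  # left-most bin starts at -inf
    return torch.log((cdf_plus - cdf_minus).clamp(min=1e-12))  # per-pixel log-prob; sum over pixels gives log p(x_0|x_1)
```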
The final algorithm

Note: \(t=1\) corresponds to \(L_0\), and \(t=2,\cdots, T\) correspond to \(L_{1}, \cdots, L_{T-1}\).
Note: for \(L_t\) the authors drop the leading coefficient, which in effect acts as a reweighting.
In practice, the authors train on a randomly sampled loss term (i.e., a random timestep) rather than the full sum.
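The algorithm figure is not reproduced here: training simply repeats the sampled loss above, and sampling chains the single reverse step from \(t = T\) down to \(t = 1\). A rough sketch of the sampling loop, reusing the same placeholder arrays:

```python
import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alphas_cumprod, sigma2):
    # Start from pure noise x_T ~ N(0, I) and denoise step by step down to x_0.
    T = betas.shape[0] - 1
    x = torch.randn(shape)
    for t in range(T, 0, -1):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)
        mean = (x - betas[t] / (1.0 - alphas_cumprod[t]).sqrt() * eps) / alphas[t].sqrt()
        z = torch.randn_like(x) if t > 1 else torch.zeros_like(x)
        x = mean + sigma2[t].sqrt() * z
    return x
```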
Details
Note that \(\epsilon_{\theta}(\cdot, t)\) depends explicitly on \(t\). In the experiments this is implemented with the positional encoding used in attention; let the positional encoding be \(P\):
- $ t = \text{Linear}(\text{ACT}(\text{Linear}(t * P)))$, i.e., the time embedding (time_steps) is obtained through a two-layer MLP (see the sketch after this list);
- the backbone is a U-Net, and in each residual block:
\[x += \text{Linear}(\text{ACT}(t)). \]
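A sketch of this time conditioning (the embedding dimension, the choice of SiLU as `ACT`, and all names are assumptions of mine):

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t: torch.Tensor, dim: int) -> torch.Tensor:
    # Transformer-style sinusoidal positional encoding, indexed by the diffusion step t.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t.device) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class TimeMLP(nn.Module):
    # Two-layer MLP mapping the sinusoidal embedding to the vector used inside each residual block.
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.dim = dim
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, hidden))

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        return self.net(timestep_embedding(t, self.dim))
```

Inside a residual block, this vector would then be projected and broadcast-added to the feature map, matching the \(x \mathrel{+}= \text{Linear}(\text{ACT}(t))\) line above.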
| Parameter | Value |
| :-: | :-: |
| \(T\) | 1000 |
| \(\beta_t\) | increases linearly from \(0.0001\) to \(0.02\) over \(t = 1, 2, \cdots, T\) |
| backbone | U-Net |
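A sketch of this schedule, padded at index 0 so that the arrays line up with \(t = 1, \cdots, T\) as assumed in the snippets above:

```python
import torch

T = 1000
betas = torch.cat([torch.zeros(1), torch.linspace(1e-4, 0.02, T)])  # betas[t] = beta_t for t = 1..T
alphas = 1.0 - betas
alphas_cumprod = torch.cumprod(alphas, dim=0)                       # alphas_cumprod[t] = abar_t, abar_0 = 1
```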
Note: the implementation also uses tricks such as EMA.
Code
Official code
lucidrains-denoising-diffusion-pytorch