DAGs with NO TEARS: Continuous Optimization for Structure Learning



Zheng X., Aragam B., Ravikumar P. and Xing E. DAGs with NO TEARS: Continuous Optimization for Structure Learning. In Advances in Neural Information Processing Systems (NIPS), 2018.

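Overview

A directed graph can be represented by an adjacency matrix \(A \in \{0, 1\}^{d \times d}\), where \(A_{ij} = 1\) means that node \(i\) points to node \(j\). To represent a directed acyclic graph (DAG), \(A\) must satisfy an additional property that rules out cycles.

The problem: given observed data \(X \in \mathbb{R}^{n \times d}\), how do we infer the relationships among the \(d\) features, i.e. the corresponding \(A\)?

Main content

First, assume the features satisfy a linear relationship: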

\[X_j = w_j^T X + z_j, \]

where

\[W = [w_1|w_2|\cdots|w_d] \in \mathbb{R}^{d \times d}, \]

and \(z\) is random noise.

From \(W\) we can read off the corresponding \(A=\mathcal{A}(W)\), i.e.

\[W_{ij} \not = 0 \Leftrightarrow A_{ij} = 1, W_{ij} =0 \Leftrightarrow A_{ij} = 0. \]
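As a small illustration of the linear model and the map \(\mathcal{A}(W)\), the sketch below builds a hypothetical 3-node chain and samples data from it. The function names, the example weights, and the standard Gaussian noise are assumptions made for this example, not something fixed by the text above.

```python
import numpy as np

def adjacency(W):
    """A(W): A_ij = 1 exactly when W_ij != 0."""
    return (W != 0).astype(int)

def simulate_linear_sem(W, n, rng):
    """Sample n rows from X = X W + Z; standard Gaussian noise is an assumption."""
    d = W.shape[0]
    Z = rng.standard_normal((n, d))
    # X = X W + Z  <=>  X (I - W) = Z  <=>  X = Z (I - W)^{-1}
    return Z @ np.linalg.inv(np.eye(d) - W)

# Hypothetical 3-node chain 0 -> 1 -> 2 (weights chosen arbitrarily)
W_true = np.array([[0.0, 1.5, 0.0],
                   [0.0, 0.0, -2.0],
                   [0.0, 0.0, 0.0]])
X = simulate_linear_sem(W_true, n=1000, rng=np.random.default_rng(0))
A = adjacency(W_true)
print(A)
```

Column \(j\) of \(W\) holds the coefficients of the parents of node \(j\), which is why the data matrix satisfies \(X = XW + Z\) row by row.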

So the objective is usually:

\[\min_{W} \quad \ell(W;X) = \frac{1}{2n}\|X - XW\|_F^2, \\ \mathrm{s.t.} \quad \mathcal{A}(W) \in \mathbb{D}, \]

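where \(\mathbb{D}\) denotes the set of directed acyclic graphs.

Furthermore, since we want \(W\) to be a sparse matrix (a dense \(W\) certainly cannot be a DAG), we define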

\[F(W;X) = \ell(W;X) + \lambda \|W\|_1, \]

and solve

\[\min_W \quad F(W;X) \\ \mathrm{s.t.} \quad \mathcal{A}(W) \in \mathbb{D}. \]
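The score \(F(W;X)\) is straightforward to evaluate numerically. The following is a minimal sketch, assuming \(\|W\|_1\) means the entrywise \(\ell_1\) norm; the function names are made up for this example.

```python
import numpy as np

def least_squares_loss(W, X):
    """l(W; X) = 1/(2n) * ||X - X W||_F^2."""
    n = X.shape[0]
    R = X - X @ W
    return 0.5 / n * np.sum(R ** 2)

def least_squares_grad(W, X):
    """Gradient of l with respect to W: -(1/n) X^T (X - X W)."""
    n = X.shape[0]
    return -1.0 / n * X.T @ (X - X @ W)

def score(W, X, lam):
    """F(W; X) = l(W; X) + lam * ||W||_1 (entrywise l1 norm, an assumption)."""
    return least_squares_loss(W, X) + lam * np.abs(W).sum()
```

Note that this score says nothing about acyclicity; without the constraint, the minimizer generally contains cycles.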

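The key question is now how to handle the constraint \(\mathcal{A}(W) \in \mathbb{D}\). Earlier methods usually require complicated computations; this paper proposes an equivalent condition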

\[h(W) = 0, \]

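that satisfies

1. \(h(W)= 0\) if and only if \(\mathcal{A}(W) \in \mathbb{D}\);
2. the smaller \(h(W)\) is, the closer \(\mathcal{A}(W)\) is to an acyclic graph;
3. \(h(W)\) is a smooth function;
4. \(h(W)\) is easy to differentiate.

Property 1 is clearly what we want, property 2 lets us judge the quality of the \(W\) we obtain, and properties 3 and 4 make it convenient to solve the problem numerically.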

Derivation of the equivalent condition

\(\mathrm{tr}(I-W)^{-1} = d\)

Proposition 1: Suppose \(W \in \mathbb{R}_+^{d \times d}\) and \(\|W\| < 1\). Then \(\mathcal{A}(W)\) represents a directed acyclic graph if and only if

\[\mathrm{tr}(I - W)^{-1} = d. \]

proof:

\(A = \mathcal{A}(W)\) represents a directed acyclic graph if and only if

\[\mathrm{tr}(A^k) = 0 \Leftrightarrow \mathrm{tr} (W^k) = 0, \forall\: k=1,\cdots \]

\(\Rightarrow\)

Since \(\|W\| < 1\) (the largest singular value is less than 1), the series \(\sum_k W^k\) converges, so

\[\mathrm{tr}(I-W)^{-1} = \mathrm{tr}(\sum_{k=0}^{\infty} W^k) = \mathrm{tr}(I) = d. \]

\(\Leftarrow\)

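Since \(W\) is nonnegative, \(\mathrm{tr}(W^k) \ge 0\) for every \(k\), and \(\mathrm{tr}(I-W)^{-1} = d + \sum_{k=1}^{\infty} \mathrm{tr}(W^k)\). Hence

\[\mathrm{tr}(I-W)^{-1} = d \]

if and only if

\[\mathrm{tr}(W^k) = 0, \: \forall\: k \ge 1. \]

Note: the condition \(\|W\| < 1\) is not easy to satisfy.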
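Proposition 1 can be checked numerically on two small nonnegative matrices with largest singular value below 1. This is only a sanity check; the name `h_inverse` and both example matrices are made up for illustration.

```python
import numpy as np

def h_inverse(W):
    """tr((I - W)^{-1}) - d; valid as a cycle measure only if W >= 0 and ||W|| < 1."""
    d = W.shape[0]
    return np.trace(np.linalg.inv(np.eye(d) - W)) - d

# Strictly upper triangular, hence acyclic: 0 -> 1, 0 -> 2, 1 -> 2
W_dag = np.array([[0.0, 0.3, 0.2],
                  [0.0, 0.0, 0.4],
                  [0.0, 0.0, 0.0]])
# One directed cycle 0 -> 1 -> 2 -> 0
W_cyc = np.array([[0.0, 0.3, 0.0],
                  [0.0, 0.0, 0.4],
                  [0.5, 0.0, 0.0]])

print(h_inverse(W_dag))  # 0 up to rounding
print(h_inverse(W_cyc))  # 0.18 / 0.94, about 0.1915
```

For the cyclic example, \(W^3 = 0.06\, I\), so only the powers \(k = 3, 6, \cdots\) contribute to the trace and the sum is \(3 \cdot 0.06 / (1 - 0.06)\).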

\(\mathrm{tr}(e^W)=d\)

Note: \(e^A = I + \sum_{k=1}^{\infty} \frac{A^k}{k!}\).

Proposition 2: Suppose \(W \in \mathbb{R}_+^{d \times d}\). Then \(\mathcal{A}(W)\) represents a directed acyclic graph if and only if

\[\mathrm{tr}(e^W) = d. \]

proof:

The proof is similar.

Note: here there is no requirement on the largest singular value of \(W\).
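To see that the exponential form works without any norm restriction, the sketch below evaluates it on nonnegative matrices with large entries. It uses `scipy.linalg.expm` for the matrix exponential; the function name and the example matrices are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import expm

def h_expm_nonneg(W):
    """tr(e^W) - d; a cycle measure for entrywise nonnegative W."""
    return np.trace(expm(W)) - W.shape[0]

# Acyclic, but with largest singular value far above 1
W_dag = np.array([[0.0, 5.0, 3.0],
                  [0.0, 0.0, 7.0],
                  [0.0, 0.0, 0.0]])
# A 2-cycle 0 <-> 1
W_cyc = np.array([[0.0, 2.0],
                  [3.0, 0.0]])

print(h_expm_nonneg(W_dag))  # 0 up to rounding
print(h_expm_nonneg(W_cyc))  # 2*cosh(sqrt(6)) - 2 > 0
```

In the cyclic case \(W^2 = 6I\), which gives \(\mathrm{tr}(e^W) = 2\cosh(\sqrt{6})\).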

\(\mathrm{tr}(W^k) = 0\)

The proof in this part should probably be credited to DAG-GNN.

Proposition 3: Suppose \(W \in \mathbb{R}_+^{d \times d}\). Then \(\mathcal{A}(W)\) represents a directed acyclic graph if and only if

\[\mathrm{tr}(W^k) = 0, \: k=1,2,\cdots, d. \]

proof:

\(\Rightarrow\) is obvious. To prove \(\Leftarrow\), it suffices to show

\[\mathrm{tr}(W^k)=0, \: k=1,2,\cdots, d \Rightarrow \mathrm{tr}(W^k) = 0, \: k\ge 1. \]

Let the characteristic polynomial of \(W\) be \(p(\lambda) = \sum_{k=0}^d \beta_k \lambda^k, \beta_d=1\). By the Cayley-Hamilton theorem,

\[p(W) = \sum_{k=0}^d \beta_k W^k = 0. \]

It follows that

\[W^{d} = -\sum_{k=0}^{d-1} \beta_k W^k \Rightarrow W^{d+1} = -\sum_{k=0}^{d-1} \beta_k W^{k+1} \Rightarrow \mathrm{tr}(W^{d+1}) = -\sum_{k=0}^{d-1} \beta_k \mathrm{tr}(W^{k+1}) = 0. \]

Repeating the same argument for \(W^{d+2}, W^{d+3}, \cdots\), the claim follows by induction.

Corollary 1: Suppose \(W \in \mathbb{R}_+^{d \times d}\). Then \(\mathcal{A}(W)\) represents a directed acyclic graph if and only if

\[\mathrm{tr}(I+W)^d=d. \]
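A quick numerical check of Proposition 3 and Corollary 1 is sketched below; the helper names and example matrices are assumptions. The 3-cycle example also shows why the powers must go all the way up to \(k = d\): its first two traces vanish and only \(\mathrm{tr}(W^3)\) detects the cycle.

```python
import numpy as np

def power_traces(W):
    """Return [tr(W^1), ..., tr(W^d)]."""
    d = W.shape[0]
    P = np.eye(d)
    traces = []
    for _ in range(d):
        P = P @ W
        traces.append(np.trace(P))
    return traces

def h_poly_nonneg(W):
    """tr((I + W)^d) - d; a cycle measure for entrywise nonnegative W."""
    d = W.shape[0]
    return np.trace(np.linalg.matrix_power(np.eye(d) + W, d)) - d

W_dag = np.array([[0.0, 0.3, 0.2],
                  [0.0, 0.0, 0.4],
                  [0.0, 0.0, 0.0]])
W_cyc = np.array([[0.0, 0.3, 0.0],
                  [0.0, 0.0, 0.4],
                  [0.5, 0.0, 0.0]])

print(power_traces(W_dag), h_poly_nonneg(W_dag))  # all zeros
print(power_traces(W_cyc), h_poly_nonneg(W_cyc))  # traces 0, 0, 0.18 and h = 0.18
```

The value 0.18 follows from the binomial expansion \(\mathrm{tr}(I+W)^3 = 3 + 3\,\mathrm{tr}(W) + 3\,\mathrm{tr}(W^2) + \mathrm{tr}(W^3)\).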

\(\mathrm{tr}(e^{W \circ W}) =d\)

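Note: \(\circ\) denotes the Hadamard product, i.e. elementwise multiplication.

The results above still require the entries of \(W\) to be nonnegative. A neat way around this is:

Theorem 1: For any matrix \(W \in \mathbb{R}^{d \times d}\), \(\mathcal{A}(W)\) represents a directed acyclic graph if and only if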

\[\mathrm{tr}(e^{W \circ W}) =d. \]

proof:

\(\mathcal{A}(W)=\mathcal{A}(W \circ W)\).

\(\mathrm{tr}(I + W \circ W)^d =d\)

Theorem 2: For any matrix \(W \in \mathbb{R}^{d \times d}\), \(\mathcal{A}(W)\) represents a directed acyclic graph if and only if

\[\mathrm{tr}(I + W \circ W)^d =d. \]

Note: putting a positive coefficient in front of \(W \circ W\) makes no difference.
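The example below illustrates why the Hadamard square matters once \(W\) may have negative entries: with mixed signs the traces \(\mathrm{tr}(W^k)\) can cancel, so \(\mathrm{tr}(e^W) = d\) may hold even though the graph has a cycle. The function names, the \(1/d\) scaling in the polynomial version, and the example matrices are assumptions chosen for illustration.

```python
import numpy as np
from scipy.linalg import expm

def h_exp(W):
    """tr(e^{W o W}) - d, valid for arbitrary real W."""
    return np.trace(expm(W * W)) - W.shape[0]

def h_poly(W):
    """tr((I + (W o W)/d)^d) - d; the 1/d coefficient is an arbitrary positive scaling."""
    d = W.shape[0]
    M = np.eye(d) + W * W / d
    return np.trace(np.linalg.matrix_power(M, d)) - d

# A 2-cycle with opposite signs: e^W is a rotation by 2*pi, so tr(e^W) = 2 = d
a = 2 * np.pi
W_cyc = np.array([[0.0, a],
                  [-a, 0.0]])
print(np.trace(expm(W_cyc)) - 2)     # about 0: the unsquared test misses the cycle
print(h_exp(W_cyc), h_poly(W_cyc))   # both strictly positive

# A signed DAG is still mapped to zero
W_dag = np.array([[0.0, -3.0],
                  [0.0, 0.0]])
print(h_exp(W_dag), h_poly(W_dag))   # 0 up to rounding
```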

Derivation of the properties

Therefore, we simply set

\[h(W) = \mathrm{tr}(e^{W\circ W}) - d \]

This clearly satisfies properties 1, 2 and 3. Next we derive its gradient, writing \(M = W \circ W\):

\[\begin{array}{ll} \mathrm{d}h(W) &= \mathrm{d}\: \mathrm{tr} (e^{W\circ W}) \\ &= \mathrm{tr} (\mathrm{d}e^{W\circ W}) \\ &= \mathrm{tr} (\mathrm{d}\sum_{k=1}^{\infty} \frac{M^k}{k!}) \\ &=\sum_{k=1}^{\infty} \mathrm{tr} ( \frac{\mathrm{d}M^k}{k!}) \\ &=\sum_{k=0}^{\infty} \mathrm{tr} ( \frac{M^k \mathrm{d}M}{k!}) \\ &= \mathrm{tr}(e^{W\circ W} \cdot \mathrm{d}(W\circ W)) \\ &= \mathrm{tr}(e^{W\circ W} \cdot (2W \circ \mathrm{d} W)) \\ &= \mathrm{tr}((e^{W\circ W} \circ 2W^T) \cdot \mathrm{d} W) \\ \end{array} \]

Hence, since \(\mathrm{d}h = \mathrm{tr}(\nabla h(W)^T \mathrm{d}W)\),

\[\nabla h(W) = (e^{W\circ W})^T \circ 2W. \]

Note: here \(M =W \circ W\).
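The last line of the differential above is \(\mathrm{tr}((e^{W\circ W} \circ 2W^T)\, \mathrm{d}W)\), so the gradient carries a factor of 2: \(\nabla h(W) = (e^{W\circ W})^T \circ 2W\). A finite-difference comparison is an easy way to confirm this. The sketch below is one such check; the function names, the random test matrix, and the step size are assumptions.

```python
import numpy as np
from scipy.linalg import expm

def h_and_grad(W):
    """Return h(W) = tr(e^{W o W}) - d and its gradient (e^{W o W})^T o 2W."""
    E = expm(W * W)
    h = np.trace(E) - W.shape[0]
    G = E.T * 2 * W
    return h, G

def finite_difference_grad(f, W, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            D = np.zeros_like(W)
            D[i, j] = eps
            G[i, j] = (f(W + D) - f(W - D)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
W = 0.5 * rng.standard_normal((4, 4))
h, G = h_and_grad(W)
G_num = finite_difference_grad(lambda V: h_and_grad(V)[0], W)
print(np.max(np.abs(G - G_num)))  # should be tiny
```

Computing the gradient reuses the matrix exponential already needed for \(h(W)\), so evaluating both costs essentially one call to `expm`.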

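Solving

Using the augmented Lagrangian method, the problem is converted to (I do not fully understand this part, but it is purely a matter of numerical optimization and does not affect the understanding of the method)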

\[\min_W \max_{\alpha}\quad \ell (W;X) +\lambda \|W\|_1 + \frac{\rho}{2}|h(W)|^2 + \alpha h(W), \]

The concrete algorithm is as follows (the algorithm figure is not included in this copy):
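As a rough illustration of the augmented Lagrangian scheme, here is a minimal sketch, not the authors' implementation. It alternates between minimizing over \(W\) for fixed \(\rho, \alpha\) and the dual update \(\alpha \leftarrow \alpha + \rho\, h(W)\). The \(\ell_1\) term is handled by splitting \(W = W^+ - W^-\) with nonnegative parts so that L-BFGS-B with bounds can be used. The function name `notears_sketch`, the constants (growth factor 10, progress rate 0.25, threshold 0.3) and the toy data are all assumptions.

```python
import numpy as np
from scipy.linalg import expm
from scipy.optimize import minimize

def notears_sketch(X, lam=0.1, max_outer=20, h_tol=1e-8, rho_max=1e16, w_threshold=0.3):
    n, d = X.shape

    def h_and_grad(W):
        E = expm(W * W)
        return np.trace(E) - d, E.T * 2 * W

    def unpack(v):
        # v = [W_plus, W_minus], both nonnegative; W = W_plus - W_minus
        return (v[:d * d] - v[d * d:]).reshape(d, d)

    def objective(v, rho, alpha):
        W = unpack(v)
        R = X - X @ W
        loss = 0.5 / n * np.sum(R ** 2)
        g_loss = -1.0 / n * X.T @ R
        h, g_h = h_and_grad(W)
        # Since v >= 0, lam * sum(v) plays the role of lam * ||W||_1
        value = loss + 0.5 * rho * h * h + alpha * h + lam * v.sum()
        g = (g_loss + (rho * h + alpha) * g_h).ravel()
        return value, np.concatenate([g + lam, -g + lam])

    # Diagonal entries are fixed at zero (no self-loops)
    bounds = [(0, 0) if i == j else (0, None)
              for _ in range(2) for i in range(d) for j in range(d)]
    v = np.zeros(2 * d * d)
    rho, alpha, h = 1.0, 0.0, np.inf
    for _ in range(max_outer):
        while rho < rho_max:
            sol = minimize(objective, v, args=(rho, alpha), jac=True,
                           method="L-BFGS-B", bounds=bounds)
            h_new, _ = h_and_grad(unpack(sol.x))
            if h_new > 0.25 * h:
                rho *= 10   # not enough progress on h: penalize harder
            else:
                break
        v, h = sol.x, h_new
        alpha += rho * h    # dual ascent step
        if h <= h_tol or rho >= rho_max:
            break
    W_est = unpack(v)
    W_est[np.abs(W_est) < w_threshold] = 0.0  # drop small weights
    return W_est

# Toy data from a hypothetical chain 0 -> 1 -> 2
W_true = np.array([[0.0, 1.5, 0.0],
                   [0.0, 0.0, -2.0],
                   [0.0, 0.0, 0.0]])
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 3)) @ np.linalg.inv(np.eye(3) - W_true)
W_est = notears_sketch(X)
print(np.round(W_est, 2))
```

The final thresholding step removes the small residual weights that remain because \(h(W)\) is only driven to a tolerance rather than exactly to zero.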

Code

Original code
