Paper information

Title: Structural Deep Network Embedding
Authors: Daixin Wang; Peng Cui; Wenwu Zhu
Venue: KDD 2016
Paper: download
Code: download

1 Introduction

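　　Network representation learning learns vector representations of the nodes in a network, so that the vectors encode information such as the network structure. Almost all existing network representation learning methods are based on shallow models. Because network structure is inherently complex, shallow models often settle on sub-optimal solutions and cannot capture highly non-linear network structure.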

　　Motivated by this, the authors propose SDNE (Structural Deep Network Embedding), a model that effectively captures both the local and the global structure of a network.

　　SDNE is a semi-supervised model:

- The second-order proximity is used by the unsupervised component to capture the global network structure.
- The first-order proximity is used as the supervised information in the supervised component to preserve the local network structure.

　　Challenges in network representation learning:

- High non-linearity: the underlying structure of the network is highly non-linear.
- Structure-preserving: the underlying structure of the network is very complex. The similarity of vertexes depends on both the local and the global network structure, so preserving both simultaneously is a tough problem.
- Sparsity: many real-world networks are so sparse that using only the very limited observed links is not enough to reach satisfactory performance.

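　　Figure 1 shows that far more vertex pairs have second-order proximity than have first-order proximity. Introducing second-order proximity therefore provides much more information for characterizing the network structure.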

2 Structural deep network embedding

　　Definition 2. (First-Order Proximity) The first-order proximity describes the pairwise proximity between vertexes. For any pair of vertexes, if $s_{i,j} > 0$, there exists positive first-order proximity between $v_i$ and $v_j$. Otherwise, the first-order proximity between $v_i$ and $v_j$ is 0.

　　Definition 3. (Second-Order Proximity) The second-order proximity between a pair of vertexes describes the proximity of the pair's neighborhood structure. Let $\mathcal{N}_{u}=\left\{s_{u, 1}, \ldots, s_{u,|V|}\right\}$ denote the first-order proximity between $v_{u}$ and other vertexes. Then, second-order proximity is determined by the similarity of $\mathcal{N}_{u}$ and $\mathcal{N}_{v}$.
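
　　As a minimal sketch of Definitions 2 and 3, the snippet below builds a small adjacency matrix and compares two vertices. The toy graph, the helper names, and the use of cosine similarity between rows of $S$ are assumptions made for illustration; the definitions do not fix a particular similarity measure.

```python
import numpy as np

# Toy undirected graph (hypothetical): vertices 0 and 1 are not linked,
# but both are linked to vertices 2 and 3.
S = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
], dtype=float)

def first_order(S, i, j):
    """First-order proximity: the edge weight s_ij (0 if there is no edge)."""
    return S[i, j]

def second_order(S, u, v):
    """Similarity of the neighborhood rows N_u and N_v.

    Cosine similarity is an illustrative choice, not prescribed by the paper.
    """
    nu, nv = S[u], S[v]
    denom = np.linalg.norm(nu) * np.linalg.norm(nv)
    return float(nu @ nv / denom) if denom > 0 else 0.0

print(first_order(S, 0, 1))   # no direct edge between vertices 0 and 1
print(second_order(S, 0, 1))  # yet their neighborhoods are identical
```

　　Vertices 0 and 1 have zero first-order proximity but a maximal neighborhood similarity, which is the kind of pair that only second-order proximity can relate.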

　　Recap: for more background on the definitions above, see the paper walkthrough of LINE, "LINE: Large-scale Information Network Embedding".

2.1 Framework

　　SDNE is a semi-supervised deep model; its framework is shown in Figure 2.

2.2 Loss Functions

　　The notation used in the paper is listed in Table 1.

　　Given an input $x_i$, the hidden representation at each layer of the autoencoder is:

　　　　$\begin{array}{l}\mathbf{y}_{i}^{(1)}=\sigma\left(W^{(1)} \mathbf{x}_{i}+\mathbf{b}^{(1)}\right) \\\mathbf{y}_{i}^{(k)}=\sigma\left(W^{(k)} \mathbf{y}_{i}^{(k-1)}+\mathbf{b}^{(k)}\right), k=2, \ldots, K\end{array}\quad \quad \quad \quad (1)$
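
　　The layer-by-layer computation in Eq. 1 can be sketched as a short loop. The layer sizes, the random initialization, and the function names here are hypothetical; $\sigma$ is taken to be the sigmoid function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x, weights, biases):
    """Apply Eq. 1: y^(k) = sigma(W^(k) y^(k-1) + b^(k)), with y^(0) = x."""
    y = x
    for W, b in zip(weights, biases):
        y = sigmoid(W @ y + b)
    return y

rng = np.random.default_rng(0)
n, hidden, d = 6, 4, 2  # hypothetical layer sizes: input, hidden, embedding
weights = [rng.normal(scale=0.1, size=(hidden, n)),
           rng.normal(scale=0.1, size=(d, hidden))]
biases = [np.zeros(hidden), np.zeros(d)]

x_i = np.array([0, 1, 1, 0, 0, 1], dtype=float)  # one row of an adjacency matrix
y_i = encode(x_i, weights, biases)               # y_i^(K), here with K = 2
print(y_i.shape)
```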

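　　After obtaining $\mathbf{y}_{i}^{(K)}$, the output $\hat{\mathbf{x}}_{i}$ is obtained by reversing the computation of the encoder, and the reconstruction error is:

　　　　$\mathcal{L}=\sum \limits _{i=1}^{n}\left\|\hat{\mathbf{x}}_{i}-\mathbf{x}_{i}\right\|_{2}^{2}\quad \quad \quad \quad (2)$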

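　　The paper uses the adjacency matrix $S$ as the input of the autoencoder, i.e. $x_i = s_i$. Since each instance $s_i$ characterizes the neighborhood structure of vertex $v_i$, the reconstruction process makes vertices with similar neighborhood structures have similar latent representations.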

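　　However, the plain reconstruction loss cannot be applied to $S$ as it stands. Because the network is sparse, $S$ has far fewer non-zero elements than zero elements. If $S$ is fed directly to a traditional autoencoder, the model is more prone to reconstruct the zero elements, i.e. its output consists mostly of zeros.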

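　　To address this, a larger penalty is imposed on the reconstruction error of the non-zero elements than on that of the zero elements. The revised objective function is: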

　　　　$\begin{aligned}\mathcal{L}_{2 n d} &=\sum \limits _{i=1}^{n}\left\|\left(\hat{\mathbf{x}}_{i}-\mathbf{x}_{i}\right) \odot \mathbf{b}_{\mathbf{i}}\right\|_{2}^{2} \\&=\|(\hat{X}-X) \odot B\|_{F}^{2}\end{aligned}\quad \quad \quad \quad (3)$

　　where

- $\odot$ denotes the Hadamard product;
- $\mathbf{b}_{\mathbf{i}}=\left\{b_{i, j}\right\}_{j=1}^{n}$, with $b_{i, j}=1$ if $s_{i, j}=0$, and $b_{i, j}=\beta>1$ otherwise.
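
　　A small numerical sketch of Eq. 3 shows why the penalty matrix $B$ matters. The 3-vertex graph, the value $\beta = 5$, and the function names are assumptions for illustration.

```python
import numpy as np

def penalty_matrix(S, beta):
    """b_ij = 1 where s_ij = 0, and beta (> 1) where s_ij != 0."""
    return np.where(S != 0, beta, 1.0)

def loss_2nd(X_hat, X, B):
    """Eq. 3: squared Frobenius norm of (X_hat - X) ⊙ B."""
    return float(np.sum(((X_hat - X) * B) ** 2))

S = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)
B = penalty_matrix(S, beta=5.0)

# A degenerate reconstruction that predicts "no edges anywhere".
X_hat = np.zeros_like(S)
print(loss_2nd(X_hat, S, np.ones_like(S)))  # 2.0: unweighted, two missed entries
print(loss_2nd(X_hat, S, B))                # 50.0: each missed entry costs beta**2
```

　　Without the weighting, the all-zero output is cheap on a sparse matrix; with it, every missed non-zero entry is penalized much more heavily.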

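　　With this modified deep autoencoder taking the adjacency matrix $S$ as input, vertices with similar neighborhood structures are mapped close to each other in the representation space. The unsupervised component of SDNE thus preserves the global network structure by reconstructing the second-order proximity between vertices.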

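　　To capture the local structure, first-order proximity is used to represent the local network structure, and a supervised component is designed to exploit it. Its loss function is defined as: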

　　　　$\begin{aligned}\mathcal{L}_{1 s t} &=\sum \limits _{i, j=1}^{n} s_{i, j}\left\|\mathbf{y}_{i}^{(K)}-\mathbf{y}_{j}^{(K)}\right\|_{2}^{2} \\&=\sum \limits _{i, j=1}^{n} s_{i, j}\left\|\mathbf{y}_{i}-\mathbf{y}_{j}\right\|_{2}^{2}\end{aligned} \quad \quad \quad \quad (4)$
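
　　Eq. 4 penalizes linked vertices whose embeddings are far apart. The sketch below evaluates it on a toy graph with two hand-picked embeddings; the data and names are hypothetical.

```python
import numpy as np

def loss_1st(S, Y):
    """Eq. 4: sum over i, j of s_ij * ||y_i - y_j||^2."""
    diff = Y[:, None, :] - Y[None, :, :]   # shape (n, n, d)
    sq_dist = np.sum(diff ** 2, axis=2)    # pairwise squared distances
    return float(np.sum(S * sq_dist))

# Only vertices 0 and 1 are linked.
S = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)

Y_close = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])  # linked pair nearby
Y_far = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])    # linked pair apart

print(loss_1st(S, Y_close))  # small: about 0.02
print(loss_1st(S, Y_far))    # larger: 2.0
```

　　The position of vertex 2 does not affect either value, because it has no edges and every term involving it is multiplied by $s_{i,j} = 0$.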

　　To preserve first-order and second-order proximity simultaneously, a semi-supervised model is proposed that combines Eq. 4 and Eq. 3 and jointly minimizes the following objective function:

　　　　$\begin{aligned}\mathcal{L}_{\text {mix }} &=\mathcal{L}_{2 n d}+\alpha \mathcal{L}_{1 s t}+\nu \mathcal{L}_{\text {reg }} \\&=\|(\hat{X}-X) \odot B\|_{F}^{2}+\alpha \sum \limits _{i, j=1}^{n} s_{i, j}\left\|\mathbf{y}_{i}-\mathbf{y}_{j}\right\|_{2}^{2}+\nu \mathcal{L}_{\text {reg }}\end{aligned}\quad \quad \quad \quad (5)$

　　where $\mathcal{L}_{\text {reg }}$ is an L2-norm regularization term that prevents overfitting, defined as:

　　　　$\mathcal{L}_{r e g}=\frac{1}{2} \sum \limits_{k=1}^{K}\left(\left\|W^{(k)}\right\|_{F}^{2}+\left\|\hat{W}^{(k)}\right\|_{F}^{2}\right)$
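
　　Putting the three terms together, Eq. 5 can be evaluated as follows. This is a sketch under stated assumptions: the toy inputs, the placeholder weight matrices, and the values of $\alpha$, $\nu$ and $\beta$ are all made up for the example.

```python
import numpy as np

def loss_reg(enc_weights, dec_weights):
    """L_reg: half the sum of squared Frobenius norms of all weight matrices."""
    return 0.5 * sum(float(np.sum(W ** 2)) for W in enc_weights + dec_weights)

def loss_mix(X, X_hat, Y, S, B, enc_weights, dec_weights, alpha, nu):
    """Eq. 5: L_2nd + alpha * L_1st + nu * L_reg."""
    l_2nd = np.sum(((X_hat - X) * B) ** 2)
    diff = Y[:, None, :] - Y[None, :, :]
    l_1st = np.sum(S * np.sum(diff ** 2, axis=2))
    return float(l_2nd + alpha * l_1st + nu * loss_reg(enc_weights, dec_weights))

S = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)
B = np.where(S != 0, 5.0, 1.0)                       # beta = 5 (assumed)
X_hat = np.zeros_like(S)                             # placeholder reconstruction
Y = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])   # placeholder embeddings
enc_weights = [np.ones((2, 3))]                      # placeholder W^(1)
dec_weights = [np.ones((3, 2))]                      # placeholder W_hat^(1)

total = loss_mix(S, X_hat, Y, S, B, enc_weights, dec_weights, alpha=0.1, nu=0.01)
print(total)  # 50 + 0.1 * 2 + 0.01 * 6, up to floating-point rounding
```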

2.3 Optimization

　　To optimize the model, the goal is to minimize $\mathcal{L}_{\operatorname{mix}}$ as a function of the parameters $\theta$. The key step is computing the partial derivatives $\partial \mathcal{L}_{m i x} / \partial \hat{W}^{(k)}$ and $\partial \mathcal{L}_{\operatorname{mix}} / \partial W^{(k)}$:

　　　　　　　　$\begin{array}{l}\frac{\partial \mathcal{L}_{m i x}}{\partial \hat{W}^{(k)}}=\frac{\partial \mathcal{L}_{2 n d}}{\partial \hat{W}^{(k)}}+\nu \frac{\partial \mathcal{L}_{r e g}}{\partial \hat{W}^{(k)}} \\ \frac{\partial \mathcal{L}_{m i x}}{\partial W^{(k)}}=\frac{\partial \mathcal{L}_{2 n d}}{\partial W^{(k)}}+\alpha \frac{\partial \mathcal{L}_{1 s t}}{\partial W^{(k)}}+\nu \frac{\partial \mathcal{L}_{r e g}}{\partial W^{(k)}}, \quad k=1, \ldots, K\end{array}\quad \quad \quad \quad (6)$

　　First consider $\partial \mathcal{L}_{2 n d} / \partial \hat{W}^{(K)}$:

　　　　　　　　$ {\large \frac{\partial \mathcal{L}_{2 n d}}{\partial \hat{W}^{(K)}}=\frac{\partial \mathcal{L}_{2 n d}}{\partial \hat{X}} \cdot \frac{\partial \hat{X}}{\partial \hat{W}^{(K)}}} \quad \quad \quad \quad (7)$

　　For the first term, by Eq. 3:

　　　　　　　　${\large \frac{\partial \mathcal{L}_{2 n d}}{\partial \hat{X}}=2(\hat{X}-X) \odot B } \quad \quad \quad \quad (8)$
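
　　A finite-difference check is a cheap way to validate a hand-derived gradient such as Eq. 8. The sketch below (toy matrices, hypothetical names) differentiates Eq. 3 numerically. One caveat it brings out: differentiating Eq. 3 exactly as written gives $2(\hat{X}-X) \odot B \odot B$, which coincides with Eq. 8 only when every entry of $B$ is 0 or 1. Eq. 8 is the exact gradient if the penalty is instead read as weighting the squared error, $\sum_{i,j} b_{i,j}(\hat{x}_{i,j}-x_{i,j})^{2}$. In practice the difference amounts to a rescaling of $\beta$.

```python
import numpy as np

def loss_2nd(X_hat, X, B):
    """Eq. 3 as written: ||(X_hat - X) ⊙ B||_F^2."""
    return np.sum(((X_hat - X) * B) ** 2)

def numerical_grad(f, X, eps=1e-6):
    """Central finite differences of a scalar function f at matrix X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# Toy symmetric adjacency matrix and a constant "reconstruction".
X = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X_hat = np.full((4, 4), 0.5)
B = np.where(X != 0, 5.0, 1.0)  # beta = 5 (assumed)

grad_num = numerical_grad(lambda Z: loss_2nd(Z, X, B), X_hat)
grad_exact = 2 * (X_hat - X) * B * B   # derivative of Eq. 3 as written
grad_eq8 = 2 * (X_hat - X) * B         # Eq. 8 as stated

matches_exact = np.allclose(grad_num, grad_exact, atol=1e-5)
matches_eq8 = np.allclose(grad_num, grad_eq8, atol=1e-5)
print(matches_exact, matches_eq8)
```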

　　The second term, $\partial \hat{X} / \partial \hat{W}^{(K)}$, follows from $\hat{X}=\sigma\left(\hat{Y}^{(K-1)} \hat{W}^{(K)}+\hat{b}^{(K)}\right)$, so $\partial \mathcal{L}_{2 n d} / \partial \hat{W}^{(K)}$ can be computed. By back-propagation, $\partial \mathcal{L}_{2 n d} / \partial \hat{W}^{(k)}, k=1, \ldots, K-1$ and $\partial \mathcal{L}_{2 n d} / \partial W^{(k)}, k=1, \ldots, K$ are then obtained iteratively. This completes the partial derivatives of $\mathcal{L}_{2 n d}$.

　　Next, compute $\partial \mathcal{L}_{1 s t} / \partial W^{(k)}$. $\mathcal{L}_{1 s t}$ can be rewritten as:

　　　　$\mathcal{L}_{1 s t}=\sum_{i, j=1}^{n} s_{i, j}\left\|\mathbf{y}_{i}-\mathbf{y}_{j}\right\|_{2}^{2}=2 \operatorname{tr}\left(Y^{T} L Y\right) \quad \quad \quad \quad (9)$
　　where $L=D-S$ and $D \in \mathbb{R}^{n \times n}$ is the diagonal matrix with $D_{i, i}=\sum_{j} s_{i, j}$.
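
　　The identity in Eq. 9 follows from expanding $\|\mathbf{y}_{i}-\mathbf{y}_{j}\|_{2}^{2}=\|\mathbf{y}_{i}\|^{2}+\|\mathbf{y}_{j}\|^{2}-2 \mathbf{y}_{i}^{T} \mathbf{y}_{j}$ and collecting terms, and it relies on $S$ being symmetric. A quick numerical check on random data (an illustrative sketch with assumed sizes and names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3

# Random symmetric weights with a zero diagonal (undirected graph assumed).
A = rng.random((n, n))
S = np.triu(A, 1) + np.triu(A, 1).T
Y = rng.normal(size=(n, d))

D = np.diag(S.sum(axis=1))  # degree matrix, D_ii = sum_j s_ij
L = D - S                   # graph Laplacian

# Left-hand side of Eq. 9: explicit double sum over vertex pairs.
diff = Y[:, None, :] - Y[None, :, :]
pairwise = np.sum(S * np.sum(diff ** 2, axis=2))

# Right-hand side of Eq. 9: trace form.
trace_form = 2 * np.trace(Y.T @ L @ Y)

print(np.isclose(pairwise, trace_form))
```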
　　We first focus on computing $\partial \mathcal{L}_{1 s t} / \partial W^{(K)}$:

　　　　$\frac{\partial \mathcal{L}_{1 s t}}{\partial W^{(K)}}=\frac{\partial \mathcal{L}_{1 s t}}{\partial Y} \cdot \frac{\partial Y}{\partial W^{(K)}} \quad \quad \quad \quad (10)$

　　Since $Y=\sigma\left(Y^{(K-1)} W^{(K)}+b^{(K)}\right)$, the second term $\partial Y / \partial W^{(K)}$ is easy to compute. For the first term $\partial \mathcal{L}_{1 s t} / \partial Y$, we have:
　　　　$\frac{\partial \mathcal{L}_{1 \text { st }}}{\partial Y}=2\left(L+L^{T}\right) \cdot Y \quad \quad \quad \quad (11)$
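
　　Eq. 11 can be checked against central finite differences of the trace form in Eq. 9. The random graph, the sizes, and the helper names below are assumptions for illustration.

```python
import numpy as np

def loss_1st_trace(Y, L):
    """Eq. 9: L_1st written as 2 * tr(Y^T L Y)."""
    return 2 * np.trace(Y.T @ L @ Y)

def numerical_grad(f, Y, eps=1e-6):
    """Central finite differences of a scalar function f at matrix Y."""
    G = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        E = np.zeros_like(Y)
        E[idx] = eps
        G[idx] = (f(Y + E) - f(Y - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
n, d = 5, 3
A = rng.random((n, n))
S = np.triu(A, 1) + np.triu(A, 1).T   # symmetric weights, zero diagonal
L = np.diag(S.sum(axis=1)) - S
Y = rng.normal(size=(n, d))

grad_eq11 = 2 * (L + L.T) @ Y         # Eq. 11
grad_num = numerical_grad(lambda Z: loss_1st_trace(Z, L), Y)

matches = np.allclose(grad_num, grad_eq11, atol=1e-5)
print(matches)
```

　　For a symmetric $S$ the Laplacian $L$ is symmetric as well, so the expression reduces to $4 L Y$.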

　　Likewise, back-propagation completes the computation of the partial derivatives of $\mathcal{L}_{1 s t}$.

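　　The partial derivatives of all parameters are now available. Once the parameters are initialized, the proposed deep model can be optimized with SGD. Note that because the model is highly non-linear, the parameter space contains many local optima. To find a good region of the parameter space, the parameters are first pretrained with a Deep Belief Network, which the literature has shown to be an essential parameter initialization for deep learning.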

　　The complete SDNE algorithm is given in Alg. 1.
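
　　To make the optimization steps concrete, here is a heavily simplified training sketch. It is not the paper's Alg. 1: it assumes a single encoder layer and a single decoder layer, random initialization instead of Deep Belief Network pretraining, full-batch gradient descent instead of SGD, and the exact gradient of Eq. 3 (with $B \odot B$). The toy graph, hyperparameter values, and all function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, p):
    Y = sigmoid(X @ p["W"] + p["b"])              # embeddings, shape (n, d)
    X_hat = sigmoid(Y @ p["W_hat"] + p["b_hat"])  # reconstruction, shape (n, n)
    return Y, X_hat

def loss_and_grads(X, B, L, p, alpha, nu):
    Y, X_hat = forward(X, p)
    loss = (np.sum(((X_hat - X) * B) ** 2)
            + alpha * 2 * np.trace(Y.T @ L @ Y)
            + nu * 0.5 * (np.sum(p["W"] ** 2) + np.sum(p["W_hat"] ** 2)))
    # Back-propagate through the decoder, then the encoder.
    dZ2 = 2 * (X_hat - X) * B * B * X_hat * (1 - X_hat)
    dY = dZ2 @ p["W_hat"].T + alpha * 2 * (L + L.T) @ Y
    dZ1 = dY * Y * (1 - Y)
    grads = {
        "W_hat": Y.T @ dZ2 + nu * p["W_hat"],
        "b_hat": dZ2.sum(axis=0),
        "W": X.T @ dZ1 + nu * p["W"],
        "b": dZ1.sum(axis=0),
    }
    return float(loss), grads

def train(S, d=2, beta=5.0, alpha=0.1, nu=1e-3, lr=0.01, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    B = np.where(S != 0, beta, 1.0)
    L = np.diag(S.sum(axis=1)) - S
    p = {"W": rng.normal(scale=0.1, size=(n, d)), "b": np.zeros(d),
         "W_hat": rng.normal(scale=0.1, size=(d, n)), "b_hat": np.zeros(n)}
    history = []
    for _ in range(steps):
        loss, grads = loss_and_grads(S, B, L, p, alpha, nu)
        history.append(loss)
        for name in p:
            p[name] -= lr * grads[name]   # plain gradient-descent update
    Y, _ = forward(S, p)
    return Y, history

# Toy graph (hypothetical): two triangles joined by a single edge.
S = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    S[i, j] = S[j, i] = 1.0

Y, history = train(S)
print(history[0], history[-1])  # loss at the first and last step
print(Y.shape)                  # one d-dimensional embedding per vertex
```

　　The rows of `Y` are the learned vertex embeddings. A deeper model would repeat the same back-propagation pattern for each additional layer.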