Paper Walkthrough (GIN): "How Powerful are Graph Neural Networks?"

Reposted from the original post · 2022-03-05 16:32 · Category: paper walkthroughs

Paper information

Title: How Powerful are Graph Neural Networks?
Authors: Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka
Venue: ICLR 2019

1 Introduction

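　　The mainstream approach in GNNs is to update each node's representation by recursively aggregating the representations of its first-order neighbors, as in GCN and GraphSAGE. Most of these designs, however, are driven by empirical intuition: there is little theory explaining what GNNs actually compute or where there is room for improvement.

　　GNN variants all follow two steps: neighborhood aggregation and graph-level pooling.

　　The paper's framework is inspired by the close connection between GNNs and the Weisfeiler-Lehman (WL) graph isomorphism test: a GNN that reliably tells non-isomorphic graphs apart is considered to have strong representational power.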

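　　Contributions of the paper:

- It shows that GNNs are at most as powerful as the Weisfeiler-Lehman (WL) test at distinguishing graph structures, i.e., the WL test is an upper bound on the discriminative power of GNNs;
- It establishes conditions on the neighbor aggregation and graph readout functions under which the resulting GNN is as powerful as the WL test;
- It proposes the Graph Isomorphism Network (GIN) and shows that its discriminative and representational power equals that of the WL test.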

2 Preliminaries

2.1 GNN steps

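　　A GNN layer typically performs two steps: (1) aggregate information from the neighbors; (2) update the node representation.

　　The $k$-th layer of a GNN is: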

　　　　$a_{v}^{(k)}=\text { AGGREGATE }^{(k)}\left(\left\{h_{u}^{(k-1)}: u \in \mathcal{N}(v)\right\}\right)$

　　　　$h_{v}^{(k)}=\operatorname{COMBINE}^{(k)}\left(h_{v}^{(k-1)}, a_{v}^{(k)}\right)$

　　A typical example is GraphSAGE, whose AGGREGATE is defined as:

　　　　$a_{v}^{(k)}=\operatorname{MAX}\left(\left\{\operatorname{ReLU}\left(W \cdot h_{u}^{(k-1)}\right), \forall u \in \mathcal{N}(v)\right\}\right)$

　　Here $\operatorname{MAX}$ denotes element-wise max-pooling.

　　The COMBINE step of GraphSAGE is a concatenation followed by a linear mapping:

　　　　$W \cdot\left[h_{v}^{(k-1)}, a_{v}^{(k)}\right]$

　　In GCN, the AGGREGATE and COMBINE steps are integrated into a single update:

　　　　$h_{v}^{(k)}=\operatorname{ReLU}\left(W \cdot \operatorname{MEAN}\left\{h_{u}^{(k-1)}, \forall u \in \mathcal{N}(v) \cup\{v\}\right\}\right)$

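　　For node classification, the node representation $h_{v}^{(K)}$ of the final iteration is used as the input for prediction. For graph classification, a READOUT function aggregates the node representations $h_{v}^{(K)}$ of the final iteration into a graph representation $h_{G}$: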

　　　　$ h_{G}=\operatorname{READOUT}\left(\left\{h_{v}^{(K)} \mid v \in G\right\}\right)$

　　READOUT must be a permutation-invariant function, such as summation.
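　　The GCN update and the sum READOUT above can be sketched in NumPy as follows. This is only an illustrative sketch: the function names `gcn_layer` and `sum_readout`, the toy path graph, and the identity weight matrix are assumptions made for this example, and node features are stored as rows (so $W \cdot h$ becomes `H @ W`).

```python
import numpy as np

def gcn_layer(H, adj, W):
    """GCN-style update: ReLU(W . MEAN{h_u : u in N(v) ∪ {v}}), one row per node."""
    A_hat = adj + np.eye(adj.shape[0])                      # self-loops give N(v) ∪ {v}
    mean = (A_hat @ H) / A_hat.sum(axis=1, keepdims=True)   # mean over each neighborhood
    return np.maximum(mean @ W, 0.0)                        # linear map followed by ReLU

def sum_readout(H):
    """Permutation-invariant READOUT: sum of all node representations."""
    return H.sum(axis=0)

# Toy path graph 0 - 1 - 2 with 2-dimensional node features (hypothetical data).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
H0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
W = np.eye(2)

H1 = gcn_layer(H0, adj, W)
h_G = sum_readout(H1)

# Relabeling the nodes must not change the graph representation.
perm = [2, 0, 1]
h_G_perm = sum_readout(gcn_layer(H0[perm], adj[np.ix_(perm, perm)], W))
```

　　Because the readout is a sum, `h_G` and `h_G_perm` agree even though the node order differs.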

2.2 Weisfeiler-Lehman test

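　　The graph isomorphism problem asks whether two graphs are topologically identical.

　　The WL test distinguishes labeled graphs by repeating the following two steps [see the post "Weisfeiler-Lehman (WL) Algorithm and WL Test"]:

- iteratively aggregate the labels of each node and its neighborhood;
- hash the aggregated labels into unique new labels. If, at some iteration, the node labels of the two graphs differ, the algorithm decides that the two graphs are non-isomorphic.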
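　　The two steps above can be sketched in plain Python. This is a minimal sketch, not the reference algorithm: the function name `wl_test`, the adjacency-list input format, and the use of a per-round dictionary as the "hash" are assumptions made for this example.

```python
from collections import Counter

def wl_test(adj1, labels1, adj2, labels2, num_iters=3):
    """Return False if the WL test proves the graphs non-isomorphic,
    True if it cannot tell them apart within num_iters rounds."""
    for _ in range(num_iters):
        # Compare the multisets of node labels of the two graphs.
        if Counter(labels1) != Counter(labels2):
            return False
        table = {}  # injective relabeling shared by both graphs in this round

        def refine(adj, labels):
            new_labels = []
            for v, nbrs in enumerate(adj):
                # Signature: own label plus the sorted multiset of neighbor labels.
                sig = (labels[v], tuple(sorted(labels[u] for u in nbrs)))
                new_labels.append(table.setdefault(sig, len(table)))
            return new_labels

        labels1, labels2 = refine(adj1, labels1), refine(adj2, labels2)
    return Counter(labels1) == Counter(labels2)

# Hypothetical toy graphs given as adjacency lists, all nodes labeled 0.
path3 = [[1], [0, 2], [1]]
triangle = [[1, 2], [0, 2], [0, 1]]
six_cycle = [[1, 5], [0, 2], [1, 3], [2, 4], [3, 5], [4, 0]]
two_triangles = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]

path_vs_triangle = wl_test(path3, [0] * 3, triangle, [0] * 3)
cycle_vs_triangles = wl_test(six_cycle, [0] * 6, two_triangles, [0] * 6)
```

　　The path and the triangle are separated after one refinement round. The 6-cycle and the two disjoint triangles are not isomorphic, yet every node in both graphs has degree 2, so the WL test cannot separate them; this is the kind of limit that also bounds GNNs.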

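　　The WL subtree kernel, which measures the similarity between graphs, was proposed on top of the WL test. It is illustrated in the figure:

　　The process above records the result of two WL test iterations as a tree structure (a subtree of height 2 rooted at each node).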

3 Theoretical framework: overview

　　Definition 1 (Multiset). A multiset is a generalized concept of a set that allows multiple instances for its elements. More formally, a multiset is a 2-tuple $X=(S, m)$ where $S$ is the underlying set of $X$ that is formed from its distinct elements, and $m: S \rightarrow \mathbb{N}_{\geq 1}$ gives the multiplicity of the elements.

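　　For example, take a node $v$ with neighbor set $\mathcal{N}(v)$. Suppose $v$ has label 1 and its neighbors have labels 1, 1, 2, 3, 4. The neighbor labels then form a multiset, $\{1, 1, 2, 3, 4\}$, in which label 1 has multiplicity 2.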
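　　As a small illustration of Definition 1, a multiset $X=(S, m)$ can be modeled with Python's `collections.Counter`. The variable names are chosen for this example only.

```python
from collections import Counter

# Labels of the neighbors of node v from the example: 1, 1, 2, 3, 4.
neighbor_labels = [1, 1, 2, 3, 4]

X = Counter(neighbor_labels)  # the multiset X = (S, m)
S = set(X)                    # underlying set of distinct elements
m = dict(X)                   # multiplicity of each element

# Order does not matter, but multiplicity does.
same_multiset = Counter([4, 3, 2, 1, 1]) == X
same_as_set = Counter([1, 2, 3, 4]) == X
```

　　Here `same_multiset` is true and `same_as_set` is false: a plain set would lose the fact that label 1 occurs twice.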

4 Building powerful graph neural networks

　　The authors first state Lemma 2, which relates GNNs to the WL graph isomorphism test:

　　Lemma 2. Let $G_{1}$ and $G_{2}$ be any two non-isomorphic graphs. If a graph neural network $\mathcal{A}: \mathcal{G} \rightarrow \mathbb{R}^{d}$ maps $G_{1}$ and $G_{2}$ to different embeddings, the Weisfeiler-Lehman graph isomorphism test also decides $G_{1}$ and $G_{2}$ are not isomorphic.

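　　Proof sketch of Lemma 2:

　　Claim: if two nodes have the same WL label, they also have the same GNN representation.

　　Suppose the WL test has performed $k$ rounds of label aggregation. By induction on $k$, if two nodes end up with the same WL label, i.e., their $k$-hop neighborhoods look the same, then the GNN must also assign them the same representation, because the WL relabeling and the GNN aggregation follow the same procedure. Consequently, whenever the WL test cannot distinguish $G_{1}$ and $G_{2}$, the GNN maps them to the same embedding; Lemma 2 is the contrapositive of this statement.

　　The authors then state Theorem 3: if the Aggregate, Combine, and Readout functions of a GNN are injective, the GNN can be as powerful as the WL test.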

　　Theorem 3. Let $\mathcal{A}: \mathcal{G} \rightarrow \mathbb{R}^{d}$ be a GNN . With a sufficient number of GNN layers, $\mathcal{A}$ maps any graphs $G_{1}$ and $G_{2}$ that the Weisfeiler-Lehman test of isomorphism decides as non-isomorphic, to different embeddings if the following conditions hold:

　　a) $\mathcal{A}$ aggregates and updates node features iteratively with

　　　　$h_{v}^{(k)}=\phi\left(h_{v}^{(k-1)}, f\left(\left\{h_{u}^{(k-1)}: u \in \mathcal{N}(v)\right\}\right)\right)$

　　where the functions $f$, which operates on multisets, and $\phi$ are injective.

　　b) $\mathcal{A}$'s graph-level readout, which operates on the multiset of node features $\left\{h_{v}^{(k)}\right\}$ , is injective.

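　　The proof of Theorem 3 follows the same idea as the proof of Lemma 2: it is again an induction over the iterations, built on the same claim about WL labels and node representations.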

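4.1 Graph isomorphism network (GIN)

　　The goal is to model an injective multiset function for neighbor aggregation.

　　Lemma 5 below states that sum aggregators can represent injective functions over multisets: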

　　Lemma 5. Assume $\mathcal{X}$ is countable. There exists a function $f: \mathcal{X} \rightarrow \mathbb{R}^{n}$ so that $h(X)=\sum_{x \in X} f(x)$ is unique for each multiset $X \subset \mathcal{X}$ of bounded size. Moreover, any multiset function $g$ can be decomposed as $g(X)=\phi\left(\sum\limits _{x \in X} f(x)\right)$ for some function $\phi $.

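　　Proof sketch of Lemma 5:

　　Idea: because $\mathcal{X}$ is countable, its elements can be enumerated by an injective map $Z: \mathcal{X} \rightarrow \mathbb{N}$. Because the multisets have bounded size, there is a number $N$ with $|X|<N$ for every $X$. Choosing $f(x)=N^{-Z(x)}$, the sum $\sum_{x \in X} f(x)$ can be read as a base-$N$ expansion whose digits are the multiplicities of the elements, so different multisets give different sums and $h$ is injective.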
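　　The injectivity of the sum can be checked numerically on a toy universe. The sketch below uses the construction $f(x)=N^{-Z(x)}$ from the paper's proof of Lemma 5, where $Z$ enumerates the countable set $\mathcal{X}$ and $N$ strictly exceeds the multiset size; the universe `[1, 2, 3]`, the choice `N = 4`, and the use of exact fractions are assumptions made for this example.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

N = 4                 # strict upper bound on the multiset size (toy assumption)
universe = [1, 2, 3]  # elements identified with their index Z(x)

def f(x):
    """f(x) = N^(-Z(x)), computed exactly with fractions."""
    return Fraction(1, N ** x)

def h(X, feature=f):
    """Sum aggregator h(X) = sum of feature(x) over the multiset X."""
    return sum((feature(x) for x in X), Fraction(0))

# All non-empty multisets over the universe with fewer than N elements.
multisets = [c for size in range(1, N)
             for c in combinations_with_replacement(universe, size)]

injective_sums = {h(X) for X in multisets}
# Counter-example: with the identity as feature map, the sum is NOT injective,
# e.g. {1, 2} and {3} both sum to 3.
identity_sums = {h(X, feature=lambda x: Fraction(x)) for X in multisets}
```

　　With $f(x)=N^{-Z(x)}$ every multiset gets its own sum, while the identity feature map produces collisions. The sum is injective only for a suitable $f$, which is why GIN learns $f$ with an MLP.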

　　

　　Corollary 6. Assume $\mathcal{X}$ is countable. There exists a function $f: \mathcal{X} \rightarrow \mathbb{R}^{n}$ so that for infinitely many choices of $\epsilon$ , including all irrational numbers, $h(c, X)=(1+\epsilon) \cdot f(c)+\sum\limits _{x \in X} f(x)$ is unique for each pair $(c, X)$ , where $c \in \mathcal{X}$ and $X \subset \mathcal{X}$ is a multiset of bounded size. Moreover, any function $g$ over such pairs can be decomposed as $ g(c, X)=\varphi\left((1+\epsilon) \cdot f(c)+\sum\limits_{x \in X} f(x)\right)$ for some function $\varphi $.

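　　Proof sketch of Corollary 6:

　　Consider two different pairs $(c, X) \neq\left(c^{\prime}, X^{\prime}\right)$. In the first case, $c=c^{\prime}$ but $X \neq X^{\prime}$, and Lemma 5 already gives different sums. In the second case, $c \neq c^{\prime}$, and the property of an irrational $\epsilon$ is used: equality of the two values would force $\epsilon \cdot\left(f(c)-f\left(c^{\prime}\right)\right)$ to equal a rational number, which is impossible because $f$ takes rational values and $f(c) \neq f\left(c^{\prime}\right)$.

　　Corollary 6 shows that $h(c, X)=(1+\epsilon) \cdot f(c)+\sum\limits_{x \in X} f(x)$ is injective. The paper replaces $\varphi$ and $f$ with MLPs (an MLP is a universal approximator, so it can model such injective functions). Since the composition of injective functions is injective (if $f$ and $g$ are both injective, so is $f \circ g$), $\operatorname{MLP}(h(c, X))$ is injective as well, which gives the GIN update: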

　　　　$h_{v}^{(k)}=\operatorname{MLP}^{(k)}\left(\left(1+\epsilon^{(k)}\right) \cdot h_{v}^{(k-1)}+\sum\limits _{u \in \mathcal{N}(v)} h_{u}^{(k-1)}\right) \quad\quad\quad({\large \star } )$
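　　The update $(\star)$ can be sketched in NumPy as follows. This is only a sketch under assumptions: the names `gin_aggregate`, `mlp`, and `gin_layer`, the two-layer MLP with ReLU, the random weights, and the toy path graph are not from the paper's code, and $\epsilon$ is passed as a fixed number rather than learned.

```python
import numpy as np

def gin_aggregate(H, adj, eps):
    """(1 + eps) * h_v + sum of h_u over neighbors u, for all nodes at once."""
    return (1.0 + eps) * H + adj @ H

def mlp(X, W1, b1, W2, b2):
    """Two-layer perceptron with ReLU, applied row-wise."""
    return np.maximum(X @ W1 + b1, 0.0) @ W2 + b2

def gin_layer(H, adj, eps, params):
    """One GIN layer: MLP((1 + eps) * h_v + sum_{u in N(v)} h_u)."""
    return mlp(gin_aggregate(H, adj, eps), *params)

# Toy path graph 0 - 1 - 2 with 2-dimensional features (hypothetical data).
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
H0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])

rng = np.random.default_rng(0)
params = (rng.normal(size=(2, 8)), np.zeros(8),
          rng.normal(size=(8, 4)), np.zeros(4))

H1 = gin_layer(H0, adj, 0.1, params)

# Relabeling the nodes only permutes the rows of the output.
perm = [2, 0, 1]
H1_perm = gin_layer(H0[perm], adj[np.ix_(perm, perm)], 0.1, params)
```

　　Unlike the mean in GCN, the sum in `gin_aggregate` is not normalized by the neighborhood size, so the multiplicity of each neighbor feature is preserved before the MLP is applied.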

4.2 Graph-level readout of GIN

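　　The readout module uses sum plus concatenation: at every iteration, the features of all nodes are summed into a graph-level feature, and the graph-level features of all iterations are then concatenated.

　　　　$h_{G}=\operatorname{CONCAT}\left(\operatorname{READOUT}\left(\left\{h_{v}^{(k)} \mid v \in G\right\}\right) \mid k=0,1, \ldots, K\right)$

　　that is, with sum as the READOUT: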

　　　　$h_{G}=\operatorname{CONCAT}\left(\operatorname{sum}\left(\left\{h_{v}^{(k)} \mid v \in G\right\}\right) \mid k=0,1, \ldots, K\right)$
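　　A minimal NumPy sketch of this readout, assuming the node features of every iteration are available as a list of arrays. The function name `gin_readout` and the toy feature values are made up for illustration.

```python
import numpy as np

def gin_readout(layer_features):
    """Concatenate the per-iteration sums of node features.

    layer_features: list [H^(0), ..., H^(K)], each of shape [num_nodes, d_k].
    """
    return np.concatenate([H.sum(axis=0) for H in layer_features])

# Hypothetical node features of a 3-node graph for k = 0 and k = 1.
H0 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
H1 = np.array([[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0],
               [0.0, 0.0, 0.0]])

h_G = gin_readout([H0, H1])  # length 2 + 3 = 5
```

　　Keeping the sums of all iterations, rather than only the last one, lets the graph representation use structural information from every depth.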

5 Less powerful but still interesting GNNs

　　The paper studies GNNs that do not satisfy the conditions of Theorem 3, such as GraphSAGE and GCN, through two ablations:

- 1-layer perceptrons instead of MLPs;
- mean or max-pooling instead of the sum.

5.1 1-layer perceptrons are not sufficient

　　Many GNNs still use a 1-layer perceptron, which may be unable to distinguish certain multisets.

　　Lemma 7. There exist finite multisets $X_{1} \neq X_{2}$ so that for any linear mapping $W $, ${\small \sum\limits_{x \in X_{1}} \operatorname{ReLU}(W x)=\sum\limits_{x \in X_{2}} \operatorname{ReLU}(W x)} $.

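　　Proof sketch of Lemma 7:

　　Take two different multisets of positive numbers with the same sum, for example $X_{1}=\{1,1,1,1,1\}$ and $X_{2}=\{2,3\}$. For any $W$, each component of $W x$ has the same sign for every $x$ in both multisets, so ReLU acts either as the identity or as the zero map on that component, and the two sums coincide.

　　A 1-layer perceptron thus behaves much like a linear mapping, and the GNN layer degenerates into a simple summation of neighborhood features. The proof relies on the fact that the linear mapping has no bias term. With a bias term and a sufficiently large output dimension, a 1-layer perceptron may be able to distinguish different multisets.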
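　　Lemma 7 can be checked numerically. The sketch below uses the multisets $X_{1}=\{1,1,1,1,1\}$ and $X_{2}=\{2,3\}$ (two different multisets of positive numbers with the same sum); the helper name `sum_relu_linear`, the random weights, and the particular bias value are assumptions made for this example.

```python
import numpy as np

X1 = np.array([1.0, 1.0, 1.0, 1.0, 1.0])
X2 = np.array([2.0, 3.0])

def sum_relu_linear(X, W, b=None):
    """sum over x in X of ReLU(W x [+ b]) for scalar inputs x and weight vector W."""
    pre = np.outer(X, W)                 # shape [len(X), out_dim]
    if b is not None:
        pre = pre + b
    return np.maximum(pre, 0.0).sum(axis=0)

# Without a bias term, no linear map W separates the two multisets.
rng = np.random.default_rng(0)
no_bias_equal = all(
    np.allclose(sum_relu_linear(X1, W), sum_relu_linear(X2, W))
    for W in rng.normal(size=(100, 4))
)

# With a bias term the two multisets can be told apart.
W = np.array([1.0])
b = np.array([-1.5])
with_bias_1 = sum_relu_linear(X1, W, b)  # five times ReLU(1 - 1.5) = 0
with_bias_2 = sum_relu_linear(X2, W, b)  # ReLU(0.5) + ReLU(1.5) = 2
```

　　The random search is of course no proof, but it matches the argument: for positive scalars, $\operatorname{ReLU}(W x)=\operatorname{ReLU}(W) \cdot x$ component-wise, so only the total $\sum x$ survives.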

5.2 Structures that confuse mean and max-pooling

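　　Now consider what goes wrong when the sum in $h(X)=\sum\limits _{x \in X} f(x)$ is replaced by mean-pooling or max-pooling.

　　Mean-pooling and max-pooling aggregators are still well-defined multiset functions [they are permutation invariant], but they are not injective.

　　Figure 2 ranks the three aggregators by their representational power.

　　The three aggregators:

- sum: captures every label together with its count, so it can learn exact structural information (it preserves not only the distribution but also the multiplicities); [blue: 4; red: 2]
- mean: captures the proportion of each label (for example, two graphs with the same label proportions whose node counts differ by an integer factor look the same), so it is biased toward distribution information; [blue: $4/6=2/3$; red: $2/6=1/3$]
- max: captures only the dominant labels and ignores multiplicity, so it is biased toward representative elements; [two classes, identical within each class, so one of each]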

　　Figure 3 shows pairs of structures that mean-pooling and max-pooling aggregators cannot distinguish.

　　In Figure 3a: every node has the same feature $a$, and $f(a)=h_a$ is the same across all nodes.

- mean: left $\frac{1}{2}(h_a+h_a)=h_a$, right $\frac{1}{3}(h_a+h_a+h_a)=h_a$; cannot distinguish;
- max: left $h_a$, right $h_a$; cannot distinguish;
- sum: left $2h_a$, right $3h_a$; can distinguish.

　　In Figure 3b: let $h_{\text{color}}$ ($r$ for red, $g$ for green) denote node features transformed by $f$.

- mean: left $\frac{1}{2}(h_r+h_g)$, right $\frac{1}{3}(h_g+2h_r)$; can distinguish;
- max: left $\max(h_r, h_g)$, right $\max(h_g, h_r, h_r)$; cannot distinguish;
- sum: left $h_r+h_g$, right $2h_r+h_g$; can distinguish.

　　In Figure 3c:

- mean: left $\frac{1}{2}(h_r+h_g)$, right $\frac{1}{4}(2h_g+2h_r)$; cannot distinguish;
- max: left $\max(h_g, h_r)$, right $\max(h_g, h_g, h_r, h_r)$; cannot distinguish;
- sum: left $h_r+h_g$, right $2h_r+2h_g$; can distinguish.
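　　The three cases of Figure 3 can be replayed numerically. In this sketch the concrete feature vectors for $h_a$, $h_r$, $h_g$ and the helper name `aggregate` are assumptions made for illustration; the neighbor multisets follow the formulas listed above.

```python
import numpy as np

# Hypothetical transformed features f(x) for the node types a, r (red), g (green).
h = {"a": np.array([1.0, 1.0]),
     "r": np.array([1.0, 0.0]),
     "g": np.array([0.0, 1.0])}

def aggregate(multiset, how):
    """Aggregate the features of a multiset of node types with sum, mean, or max."""
    feats = np.stack([h[x] for x in multiset])
    return {"sum": feats.sum, "mean": feats.mean, "max": feats.max}[how](axis=0)

# (left, right) neighbor multisets for Figure 3a, 3b, 3c.
pairs = {
    "3a": (["a", "a"], ["a", "a", "a"]),
    "3b": (["r", "g"], ["g", "r", "r"]),
    "3c": (["r", "g"], ["r", "r", "g", "g"]),
}

# True means the aggregator maps the two multisets to different values.
distinguishes = {
    name: {how: not np.allclose(aggregate(left, how), aggregate(right, how))
           for how in ("sum", "mean", "max")}
    for name, (left, right) in pairs.items()
}
```

　　Only the sum separates all three pairs; the mean succeeds only in 3b, and max fails in all three.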

5.3 Mean learns distributions

　　This section shows that the mean aggregator cannot distinguish multisets in which the elements occur in the same proportions.

　　Corollary 8. Assume $\mathcal{X}$ is countable. There exists a function $f: \mathcal{X} \rightarrow \mathbb{R}^{n}$ so that for $h(X)= \frac{1}{|X|} \sum\limits _{x \in X} f(x)$, $h\left(X_{1}\right)=h\left(X_{2}\right)$ if and only if multisets $X_{1}$ and $X_{2}$ have the same distribution. That is, assuming $\left|X_{2}\right| \geq\left|X_{1}\right|$, we have $X_{1}=(S, m)$ and $X_{2}=(S, k \cdot m)$ for some $k \in \mathbb{N}_{\geq 1}$.

5.4 Max-pooling learns sets with distinct elements

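　　Max-pooling only looks at the decisive elements (the max values): as long as they agree, it does not matter what the other elements are. It therefore discards multiplicities and treats a multiset as a plain set, which clearly limits its discriminative power.

　　Corollary 9. Assume $\mathcal{X}$ is countable. Then there exists a function $f: \mathcal{X} \rightarrow \mathbb{R}^{\infty}$ so that for ${\small h(X)=\underset{x \in X}{\max} f(x), h\left(X_{1}\right)=h\left(X_{2}\right)} $ if and only if $X_{1}$ and $X_{2}$ have the same underlying set.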

6 Experiments

6.1 Training set performance of GINs

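　　During training, GIN, like the WL test, can fit the training data of all datasets, which indicates that GIN's expressive power reaches the upper bound.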

6.2 Generalization ability of GNNs

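- GIN-0 generalizes better than GIN-$\epsilon$, possibly because it is simpler;
- GIN outperforms the WL test: GIN additionally captures structural similarity, since the WL test ends with one-hot label encodings, whereas GIN maps the WL labels to low-dimensional embeddings;
- max-pooling is largely ineffective on graphs without node features (where node degrees are used as features).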

7 Conclusion

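　　The paper focuses on graph classification and shows that sum works better than mean and max in that setting, but this does not imply that the same holds for node classification. In addition, some scenarios may care more about the distribution of neighborhood features, or about representative elements, so all three aggregators are worth including in experiments.

Revision history

2021-03-15 Article created
2022-06-10 Article revised, major cleanup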
