GCN Implementation, Part 3: Code Walkthrough

Reposted from the original article, 2019-10-11 20:55

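1. Code structure
├── data // graph data
├── inits // shared initialization helper functions
├── layers // GCN layer definitions
├── metrics // evaluation metric computation
├── models // model architecture definitions
├── train // training
└── utils // utility function definitions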

2. Data

Data: Cora, CiteSeer, or PubMed, located in the data folder:

Original data:

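The Cora dataset
Consists of 2,708 scientific publications classified into 7 classes. The citation network consists of 5,429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence or presence of the corresponding word from the dictionary. The dictionary consists of 1,433 unique words. The README file in the dataset provides more details.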
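
To make the 0/1 word vector concrete, here is a minimal sketch. The five-word vocabulary and the helper name `binary_word_vector` are hypothetical and chosen only for illustration; the real Cora dictionary has 1,433 entries.

```python
# Hypothetical toy vocabulary; the real Cora dictionary has 1,433 words.
vocabulary = ["graph", "neural", "network", "kernel", "bayesian"]


def binary_word_vector(words, vocab):
    """Return a 0/1 list: 1 if the vocabulary word occurs in `words`, else 0."""
    present = set(words)
    return [1 if term in present else 0 for term in vocab]


# Repeated words still map to 1, because only presence is recorded.
paper_words = ["neural", "network", "graph", "graph"]
vector = binary_word_vector(paper_words, vocabulary)
print(vector)  # [1, 1, 1, 0, 0]
```

Stacking one such vector per publication gives a feature matrix with one row per node and one column per dictionary word.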

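CiteSeer for document classification
The CiteSeer dataset consists of 3,312 scientific publications classified into 6 classes. The citation network consists of 4,732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence or presence of the corresponding word from the dictionary. The dictionary consists of 3,703 unique words. The README file in the dataset provides more details.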

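CiteSeer for Entity Resolution
For entity resolution, the CiteSeer dataset contains 1,504 machine learning documents with 2,892 author references to 165 author entities. The only attribute information available for this dataset is the author name. The full last name is always given; in some cases the author's full first name and middle name are given, and in other cases only the initials.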

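The PubMed Diabetes dataset
Consists of 19,717 diabetes-related scientific publications from the PubMed database, classified into 3 classes. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF/IDF-weighted word vector drawn from a dictionary of 500 unique words. The README file in the dataset provides more details.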
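
For readers unfamiliar with TF/IDF weighting, the sketch below computes it for three toy documents. It assumes one common formulation (term frequency times log(N / document frequency)); the exact variant used to build the PubMed features is not stated here, and the documents and the function name `tf_idf` are hypothetical.

```python
import math


def tf_idf(docs):
    """Compute TF-IDF weights for a list of non-empty tokenized documents.

    Assumes tf = count / len(doc) and idf = log(N / df); other variants exist.
    """
    n_docs = len(docs)
    vocab = sorted({word for doc in docs for word in doc})
    # Document frequency: in how many documents each word appears.
    df = {word: sum(1 for doc in docs if word in doc) for word in vocab}
    weights = []
    for doc in docs:
        row = []
        for word in vocab:
            tf = doc.count(word) / len(doc)
            idf = math.log(n_docs / df[word])
            row.append(tf * idf)
        weights.append(row)
    return vocab, weights


docs = [
    ["insulin", "glucose", "insulin"],
    ["glucose", "diet"],
    ["glucose", "insulin"],
]
vocab, weights = tf_idf(docs)
for row in weights:
    print([round(value, 3) for value in row])
```

A word that appears in every document (here "glucose") gets weight 0 under this formulation, while a word confined to few documents is weighted up.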

The rest of this section uses the Cora dataset as the example:

Dataset preprocessing:

Reading the data:

"""
import pickle as pkl
import sys

import networkx as nx
import numpy as np
import scipy.sparse as sp

# parse_index_file and sample_mask are helper functions assumed to be defined elsewhere in utils.


def load_data(dataset_str):
    '''
    Loads input data from gcn/data directory

    ind.dataset_str.x => the feature vectors of the training instances as scipy.sparse.csr.csr_matrix object;
    ind.dataset_str.tx => the feature vectors of the test instances as scipy.sparse.csr.csr_matrix object;
    ind.dataset_str.allx => the feature vectors of both labeled and unlabeled training instances
        (a superset of ind.dataset_str.x) as scipy.sparse.csr.csr_matrix object;
    ind.dataset_str.y => the one-hot labels of the labeled training instances as numpy.ndarray object;
    ind.dataset_str.ty => the one-hot labels of the test instances as numpy.ndarray object;
    ind.dataset_str.ally => the labels for instances in ind.dataset_str.allx as numpy.ndarray object;
    ind.dataset_str.graph => a dict in the format {index: [index_of_neighbor_nodes]} as collections.defaultdict
        object;
    ind.dataset_str.test.index => the indices of test instances in graph, for the inductive setting as list object.

    All objects above must be saved using python pickle module.

    :param dataset_str: Dataset name
    :return: All data input files loaded (as well as the training/test data).
    '''
    # Unpickle the seven data files; latin1 decoding lets Python 3 read Python 2 pickles.
    names = ['x', 'y', 'tx', 'ty', 'allx', 'ally', 'graph']
    objects = []
    for i in range(len(names)):
        with open("data/ind.{}.{}".format(dataset_str, names[i]), 'rb') as f:
            if sys.version_info > (3, 0):
                objects.append(pkl.load(f, encoding='latin1'))
            else:
                objects.append(pkl.load(f))

    x, y, tx, ty, allx, ally, graph = tuple(objects)
    # Test rows are stored in shuffled order; keep both the stored and the sorted indices.
    test_idx_reorder = parse_index_file("data/ind.{}.test.index".format(dataset_str))
    test_idx_range = np.sort(test_idx_reorder)

    if dataset_str == 'citeseer':
        # Fix citeseer dataset (there are some isolated nodes in the graph)
        # Find isolated nodes, add them as zero-vecs into the right position
        test_idx_range_full = range(min(test_idx_reorder), max(test_idx_reorder)+1)
        tx_extended = sp.lil_matrix((len(test_idx_range_full), x.shape[1]))
        tx_extended[test_idx_range-min(test_idx_range), :] = tx
        tx = tx_extended
        ty_extended = np.zeros((len(test_idx_range_full), y.shape[1]))
        ty_extended[test_idx_range-min(test_idx_range), :] = ty
        ty = ty_extended

    # Stack training and test features, then move each test row to its node index.
    features = sp.vstack((allx, tx)).tolil()
    features[test_idx_reorder, :] = features[test_idx_range, :]
    # Build the adjacency matrix from the {node: [neighbors]} dictionary.
    adj = nx.adjacency_matrix(nx.from_dict_of_lists(graph))

    # Apply the same stacking and reordering to the labels.
    labels = np.vstack((ally, ty))
    labels[test_idx_reorder, :] = labels[test_idx_range, :]

    # Split: test nodes from the index file, the labeled nodes for training,
    # and the next 500 nodes for validation.
    idx_test = test_idx_range.tolist()
    idx_train = range(len(y))
    idx_val = range(len(y), len(y)+500)

    train_mask = sample_mask(idx_train, labels.shape[0])
    val_mask = sample_mask(idx_val, labels.shape[0])
    test_mask = sample_mask(idx_test, labels.shape[0])

    # Each label matrix keeps only the rows of its own split; all other rows stay zero.
    y_train = np.zeros(labels.shape)
    y_val = np.zeros(labels.shape)
    y_test = np.zeros(labels.shape)
    y_train[train_mask, :] = labels[train_mask, :]
    y_val[val_mask, :] = labels[val_mask, :]
    y_test[test_mask, :] = labels[test_mask, :]

    return adj, features, y_train, y_val, y_test, train_mask, val_mask, test_mask
"""

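Two steps in load_data are easy to misread: the boolean masks and the reordering of the test rows. The sketch below reproduces both on a toy example with NumPy only. The `sample_mask` shown here is an assumed implementation of the helper that load_data calls (its real definition is not shown in this section), and all data values are made up.

```python
import numpy as np


def sample_mask(idx, length):
    """Return a boolean vector of size `length` that is True at the positions in `idx`."""
    mask = np.zeros(length, dtype=bool)
    mask[list(idx)] = True
    return mask


# Toy graph with 6 nodes and 3 classes (hypothetical values, not real Cora data).
labels = np.eye(3)[[0, 1, 2, 0, 1, 2]]  # one-hot labels, shape (6, 3)

# Only the first two nodes are treated as labeled training instances.
train_mask = sample_mask(range(2), labels.shape[0])
y_train = np.zeros(labels.shape)
y_train[train_mask, :] = labels[train_mask, :]

# Test rows are appended in file order, which need not match their node ids.
# Here row 4 holds the features of node 5 and row 5 those of node 4.
features = np.array([[0.0], [1.0], [2.0], [3.0], [50.0], [40.0]])
test_idx_reorder = [5, 4]                   # node id of each stored test row
test_idx_range = np.sort(test_idx_reorder)  # positions the rows currently occupy
features[test_idx_reorder, :] = features[test_idx_range, :]

print(train_mask)
print(features.ravel())
```

After the assignment, every test row sits at the index of the node it describes, so the feature matrix lines up with the adjacency matrix. The masked label matrix y_train is zero everywhere except the training rows.
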
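Key point 1:
Why are serialization and deserialization needed at all? For storage. Serialization converts an in-memory object into a binary byte stream, which is easy to store on disk. When the data is needed again, it is read back from disk and deserialized to recover the original object. A running Python program produces strings, lists, dictionaries, and other data that you may want to keep for later use, rather than leaving them only in memory where they are lost when the machine powers off. This is where Python's pickle module comes in: it converts an object into a format that can be transmitted or stored. The loads() function performs the same deserialization as load(). Instead of accepting a stream object and reading the serialized data from a file, it accepts an object that already contains the serialized data (a str in Python 2, a bytes object in Python 3) and returns the reconstructed object directly.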
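import pickle  # Python 3; on Python 2 the C implementation was imported with "import cPickle as pickle"
pickle.dump(obj, f)  # serialize to a file: pickle.dump(obj, file, protocol=None, *, fix_imports=True) writes the pickled form of obj to the binary file object file; equivalent to Pickler(file, protocol).dump(obj)
data = pickle.dumps(obj)  # serialize to memory: returns the pickled form of obj as a bytes object instead of writing it to a file
pickle.load(f)  # deserialize from a file: pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict")
pickle.loads(data)  # deserialize from memory: takes the bytes object produced by dumps()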
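
The difference between the file-based and in-memory pairs can be checked with a short round trip. This is a minimal sketch for Python 3; the `record` dictionary is a made-up example, and `io.BytesIO` stands in for a real file opened in 'wb' or 'rb' mode.

```python
import io
import pickle

# Made-up object to serialize; any picklable Python object works.
record = {"name": "cora", "num_nodes": 2708, "classes": list(range(7))}

# dumps/loads operate on a bytes object held in memory.
payload = pickle.dumps(record)
restored_from_bytes = pickle.loads(payload)

# dump/load operate on a binary file object.
buffer = io.BytesIO()
pickle.dump(record, buffer)
buffer.seek(0)  # rewind so load() reads from the start
restored_from_file = pickle.load(buffer)

print(type(payload).__name__)
print(restored_from_bytes == record, restored_from_file == record)
```

Both paths return an object equal to the original, which is why load_data can rebuild the sparse matrices and the graph dictionary directly from the files on disk.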

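To recap:
The Cora dataset consists of 2,708 scientific publications classified into 7 classes. The citation network consists of 5,429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence or presence of the corresponding word from the dictionary. The dictionary consists of 1,433 unique words.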