This section mainly covers the dependency-tree component of the GCN-Tree model. The tool used in the paper is the Stanford Parser.
https://www.xfyun.cn/services/semanticDependence — iFLYTEK's Chinese word-segmentation platform
http://nlp.stanford.edu:8080/parser/ — an online demo where you can try the parser.
Toolkit: https://nlp.stanford.edu/software/stanford-dependencies.shtml — documentation on how to use the Stanford dependency parser.
Figure 2: Relation extraction with a graph convolutional network. The left side shows the overall architecture, while for clarity the right side shows the detailed graph-convolution computation only for the word "relative". The paper also provides the full unlabeled dependency parse of the sentence for reference.
Let's reproduce the parse tree for the example sentence from the paper:
Your query
Tagging
Parse
(ROOT (S (NP (PRP He)) (VP (VBD was) (RB not) (NP (NP (DT a) (NN relative)) (PP (IN of) (NP (NNP Mike) (NNP Cane)))))))
Universal dependencies
nsubj(relative-5, He-1) cop(relative-5, was-2) advmod(relative-5, not-3) det(relative-5, a-4) root(ROOT-0, relative-5) case(Cane-8, of-6) compound(Cane-8, Mike-7) nmod(relative-5, Cane-8)
Universal dependencies, enhanced
nsubj(relative-5, He-1) cop(relative-5, was-2) advmod(relative-5, not-3) det(relative-5, a-4) root(ROOT-0, relative-5) case(Cane-8, of-6) compound(Cane-8, Mike-7) nmod:of(relative-5, Cane-8)
As the output shows, the 5th word, "relative", is the root node. nsubj, cop, advmod, det, root, case, compound, and nmod:of are the abbreviated dependency relations on the edges; in the paper's dataset they are stored under the field $stanford-deprel$.
Inside each pair of parentheses, the first item is the head (the governor the relation edge starts from) and the second item is the dependent, written as the X-th word of the sentence. For every word, the index of its head is stored in the dataset field $stanford-head$.
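To make the correspondence concrete, here is a small sketch that rebuilds the relation triples of the example sentence from the two parallel arrays. The helper `to_triples` is ours, purely for illustration; the array contents follow the $stanford-head$ / $stanford-deprel$ convention described above.

```python
# Rebuild "deprel(head_word-h, word-i)" triples from the parallel
# stanford-head / stanford-deprel annotations of the example sentence
# "He was not a relative of Mike Cane".
tokens = ["He", "was", "not", "a", "relative", "of", "Mike", "Cane"]
# stanford-head: for word i (1-based), the 1-based index of its head; 0 = ROOT
head   = [5, 5, 5, 5, 0, 8, 8, 5]
deprel = ["nsubj", "cop", "advmod", "det", "root", "case", "compound", "nmod:of"]

def to_triples(tokens, head, deprel):
    triples = []
    for i, (h, rel) in enumerate(zip(head, deprel), start=1):
        gov = "ROOT-0" if h == 0 else f"{tokens[h-1]}-{h}"
        triples.append(f"{rel}({gov}, {tokens[i-1]}-{i})")
    return triples

print(to_triples(tokens, head, deprel))
# -> ['nsubj(relative-5, He-1)', 'cop(relative-5, was-2)', ...]
```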
Here is a record taken from the dataset:
From it, the universal dependencies can be written out as:
nsubj(named-2,He-1)
root(ROOT-0,named-2)
dobj(named-2,one-3)
case(Aziz-7,as-4)
compound(Aziz-7,Shah-5)
compound(Aziz-7,Abdul-6)
nmod(named-2,Aziz-7)
Let's verify with the tool:
Your query
Tagging
Parse
(ROOT (S (NP (PRP He)) (VP (VBD named) (NP (CD one)) (PP (IN as) (NP (NNP Shah) (NNP Abdul) (NNP Aziz))))))
Universal dependencies
nsubj(named-2, He-1) root(ROOT-0, named-2) obj(named-2, one-3) case(Aziz-7, as-4) compound(Aziz-7, Shah-5) compound(Aziz-7, Abdul-6) obl(named-2, Aziz-7)
Universal dependencies, enhanced
nsubj(named-2, He-1) root(ROOT-0, named-2) obj(named-2, one-3) case(Aziz-7, as-4) compound(Aziz-7, Shah-5) compound(Aziz-7, Abdul-6) obl:as(named-2, Aziz-7)
The tool confirms the same tree structure; note that its newer Universal Dependencies labels obj and obl:as correspond to the dataset's dobj and nmod.
Two basic problems
Both are simple data-structure problems (node operations on a multi-way tree):
a. Given a node, find its parent (or child) nodes.
This one is trivial: follow the node's parent pointer, or iterate over its children list.
b. Find the shortest path between two nodes.
Starting from one node, put the node itself and all of its ancestors into an array. Then start from the other node and walk up through its parents until you reach a node that also appears in the first array. The shortest path is the segment of the first array from index 0 up to that common node, followed by the chain from the common node back down to the second node.
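The procedure above can be sketched directly on the 1-based head array used by the dataset. `shortest_path` here is an illustrative helper of ours, not code from the paper's repository:

```python
# Shortest path between two nodes in a dependency tree, following the
# procedure described above: collect one node's ancestor chain, then walk
# up from the other node until hitting that chain. Nodes are 1-based word
# indices and head[i-1] gives the head of word i (0 = ROOT).
def shortest_path(head, a, b):
    # chain from a up to ROOT
    a_chain = [a]
    while head[a_chain[-1] - 1] != 0:
        a_chain.append(head[a_chain[-1] - 1])
    a_depth = {n: d for d, n in enumerate(a_chain)}
    # walk up from b until we meet a's chain
    b_chain = [b]
    while b_chain[-1] not in a_depth:
        b_chain.append(head[b_chain[-1] - 1])
    meet = b_chain[-1]
    # path: a -> ... -> common ancestor -> ... -> b
    return a_chain[:a_depth[meet]] + list(reversed(b_chain))

# head array of "He was not a relative of Mike Cane"
head = [5, 5, 5, 5, 0, 8, 8, 5]
print(shortest_path(head, 1, 8))  # -> [1, 5, 8]: He -> relative -> Cane
```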
A concrete example (opinion extraction):
The dependency tree is:
Shortest path between the aspect terms:
Note the edges traversed along the way (i.e., the relations between each pair of nodes); together they make up the path.
Shortest path between an aspect term and its opinion word:
These two shortest paths make it obvious which words are more closely related to each other, which is one practical application of shortest paths.
The following shows how the paper's model turns its input into a Tree representation; this is the part of GCNRelationModel's forward pass that uses the dependency tree:
```python
 1  def forward(self, inputs):
 2      words, masks, pos, ner, deprel, head, subj_pos, obj_pos, subj_type, obj_type = inputs  # unpack
 3      l = (masks.data.cpu().numpy() == 0).astype(np.int64).sum(1)  # turn True/False in the mask into 1/0 and count each sentence's words
 4      maxlen = max(l)
 5
 6      def inputs_to_tree_reps(head, words, l, prune, subj_pos, obj_pos):
 7          head, words, subj_pos, obj_pos = head.cpu().numpy(), words.cpu().numpy(), subj_pos.cpu().numpy(), obj_pos.cpu().numpy()
 8          trees = [head_to_tree(head[i], words[i], l[i], prune, subj_pos[i], obj_pos[i]) for i in range(len(l))]
 9          adj = [tree_to_adj(maxlen, tree, directed=False, self_loop=False).reshape(1, maxlen, maxlen) for tree in trees]
10          adj = np.concatenate(adj, axis=0)  # stack the batch's numpy adjacency matrices along axis 0, shape = [b, maxlen, maxlen]
11          adj = torch.from_numpy(adj)
12          return Variable(adj.cuda()) if self.opt['cuda'] else Variable(adj)
13
14      # .data accesses the tensor's values without autograd tracking (it does not affect backpropagation)
15      # subj_pos / obj_pos are the subject's and object's positions in the sentence, encoded as distance lists like [-3,-2,-1,0,0,0,1,2,3]
16      adj = inputs_to_tree_reps(head.data, words.data, l, self.opt['prune_k'], subj_pos.data, obj_pos.data)
17      h, pool_mask = self.gcn(adj, inputs)  # feed this batch's adjacency matrices, together with the inputs, into the GCN
```
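Line 3's mask-to-length trick is easy to check in isolation. A toy illustration (made-up values), assuming as in this code that masks are 0 on real tokens and 1 on padding:

```python
import numpy as np

# masks: 0 marks a real token, 1 marks padding, so counting zeros per row
# yields each sentence's true length.
masks = np.array([[0, 0, 0, 1, 1],
                  [0, 0, 0, 0, 0]])
l = (masks == 0).astype(np.int64).sum(1)
print(l)  # -> [3 5]
```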
On line 16, the data's head, words, sentence lengths l, the pruning distance K, and the subject/object position information are passed to the function defined on lines 6-12, which builds the dependency-tree adjacency matrices for this batch.
Inside that function, line 8 calls head_to_tree() to build a dependency tree for every sentence in the batch.
```python
class Tree(object):
    """
    Reused tree object from stanfordnlp/treelstm.
    """
    def __init__(self):
        self.dist = 0
        self.idx = 0
        self.parent = None
        self.num_children = 0
        self.children = list()

    def add_child(self, child):
        child.parent = self
        self.num_children += 1
        self.children.append(child)

    def size(self):
        # default added: a plain getattr(self, '_size') would raise
        # AttributeError before the first computation
        if getattr(self, '_size', None) is not None:
            return self._size
        count = 1
        for i in range(self.num_children):
            count += self.children[i].size()
        self._size = count
        return self._size

    def depth(self):
        if getattr(self, '_depth', None) is not None:  # same fix as in size()
            return self._depth
        count = 0
        if self.num_children > 0:
            for i in range(self.num_children):
                child_depth = self.children[i].depth()
                if child_depth > count:
                    count = child_depth
            count += 1
        self._depth = count
        return self._depth

    def __iter__(self):
        yield self
        for c in self.children:
            for x in c:
                yield x

def head_to_tree(head, tokens, len_, prune, subj_pos, obj_pos):
    """
    Convert a sequence of head indexes into a tree object.
    """
    tokens = tokens[:len_].tolist()
    head = head[:len_].tolist()
    root = None

    if prune < 0:  # no pruning
        nodes = [Tree() for _ in head]  # one node per word

        for i in range(len(nodes)):
            h = head[i]
            nodes[i].idx = i
            nodes[i].dist = -1  # just a filler
            if h == 0:
                root = nodes[i]
            else:
                nodes[h-1].add_child(nodes[i])  # edge nodes[h-1] -> current node, matching the Stanford annotation
    else:  # prune the tree
        # find dependency path
        subj_pos = [i for i in range(len_) if subj_pos[i] == 0]  # positions where subj_pos == 0 are the subject entity, e.g. indices [3,4,5]
        obj_pos = [i for i in range(len_) if obj_pos[i] == 0]

        cas = None

        subj_ancestors = set(subj_pos)
        for s in subj_pos:  # iterate over every index of the subject entity
            h = head[s]
            tmp = [s]
            while h > 0:  # while the head is not root
                tmp += [h-1]  # tmp stores node s together with all of its ancestors
                subj_ancestors.add(h-1)
                # subj_ancestors collects, for every subject-entity index, all ancestors up to (but not including) root
                h = head[h-1]

            if cas is None:
                cas = set(tmp)  # first pass: take the first node's ancestor set
            else:
                cas.intersection_update(tmp)  # later passes intersect, keeping only the common ancestors of the subject-entity nodes

        obj_ancestors = set(obj_pos)
        for o in obj_pos:
            h = head[o]
            tmp = [o]
            while h > 0:
                tmp += [h-1]
                obj_ancestors.add(h-1)
                h = head[h-1]
            cas.intersection_update(tmp)  # intersect cas with the common ancestors of the object-entity nodes

        # find lowest common ancestor
        if len(cas) == 1:  # with only one common node, the LCA is that node
            lca = list(cas)[0]
        else:
            child_count = {k: 0 for k in cas}
            for ca in cas:
                if head[ca] > 0 and head[ca] - 1 in cas:  # ca's head is not root AND ca's head is also in cas
                    child_count[head[ca] - 1] += 1  # ca's head gains a child, namely ca

            # LCA (lowest common ancestor)
            # the LCA has no child in the CA set
            for ca in cas:  # intuitively, the childless 'leaf' of the common-ancestor chain is the LCA of all entity nodes
                if child_count[ca] == 0:
                    lca = ca
                    break

        path_nodes = subj_ancestors.union(obj_ancestors).difference(cas)
        # union of the subject chain and the object chain (with ancestors), minus the common ancestors
        path_nodes.add(lca)  # then add the LCA back: the LCA path is complete

        # compute distance to path_nodes
        dist = [-1 if i not in path_nodes else 0 for i in range(len_)]  # nodes on the LCA path get 0, all others get -1

        for i in range(len_):
            if dist[i] < 0:  # node i is not on the LCA path
                stack = [i]
                while stack[-1] >= 0 and stack[-1] not in path_nodes:
                    stack.append(head[stack[-1]] - 1)  # stack stores node i and its ancestors, until one ancestor lies in path_nodes

                if stack[-1] in path_nodes:  # node i's highest collected ancestor is on the LCA path
                    for d, j in enumerate(reversed(stack)):  # reverse i<-B<-A into A->B->i
                        dist[j] = d  # dist[A]=0, dist[B]=1, dist[i]=2: dist is each node's distance to the LCA path
                else:
                    for j in stack:  # these nodes have no edge connecting them to the LCA path, so their distance is infinite
                        if j >= 0 and dist[j] < 0:
                            dist[j] = int(1e4)  # aka infinity

        highest_node = lca
        nodes = [Tree() if dist[i] <= prune else None for i in range(len_)]  # pruning: keep nodes with dist <= K, drop the rest as None

        # one pass over nodes to wire up the pruned tree
        for i in range(len(nodes)):
            if nodes[i] is None:
                continue
            h = head[i]
            nodes[i].idx = i
            nodes[i].dist = dist[i]
            if h > 0 and i != highest_node:
                assert nodes[h-1] is not None
                nodes[h-1].add_child(nodes[i])

        root = nodes[highest_node]

    assert root is not None
    return root
```
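To see what the pruning actually keeps, here is a compact, self-contained re-derivation (not the repo code; the helper names are ours) for the earlier sentence "He was not a relative of Mike Cane", with "He" as subject and "Cane" as object:

```python
# head[i] is the 1-based head of word i+1, 0 = ROOT (dataset convention)
head = [5, 5, 5, 5, 0, 8, 8, 5]  # "He was not a relative of Mike Cane"

def ancestors(i):  # 0-based index -> chain from the node up to the root
    chain = [i]
    while head[chain[-1]] != 0:
        chain.append(head[chain[-1]] - 1)
    return chain

subj, obj = 0, 7                        # "He" and "Cane" (0-based)
sa, oa = ancestors(subj), ancestors(obj)
cas = set(sa) & set(oa)                 # common ancestors of the two entities
lca = max(cas, key=lambda c: len(ancestors(c)))  # deepest one = LCA
path = ((set(sa) | set(oa)) - cas) | {lca}       # dependency path + LCA

def dist_to_path(i):
    # number of parent hops until we land on the path (assumes the node
    # can reach it, which holds here since the path contains the LCA)
    d = 0
    while i not in path:
        i, d = head[i] - 1, d + 1
    return d

K = 0  # prune_k = 0 keeps only the dependency path itself
kept = [i for i in range(len(head)) if dist_to_path(i) <= K]
print(kept)  # -> [0, 4, 7]: "He", "relative", "Cane"
```

With K = 1, every word of this sentence survives, since each off-path token ("was", "not", "a", "of", "Mike") hangs one hop off the path.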
Then tree_to_adj() is called (line 9 of the forward snippet) to convert each dependency tree into an adjacency matrix.
```python
def tree_to_adj(sent_len, tree, directed=True, self_loop=False):
    """
    Convert a tree object to a (numpy) adjacency matrix.
    """
    ret = np.zeros((sent_len, sent_len), dtype=np.float32)

    queue = [tree]  # enqueue the LCA root of the tree
    idx = []
    while len(queue) > 0:
        t, queue = queue[0], queue[1:]

        idx += [t.idx]  # record the node indices of the tree

        for c in t.children:  # node t has child c, so there is an edge t -> c
            ret[t.idx, c.idx] = 1
        queue += t.children  # enqueue the children for traversal

    if not directed:  # this flag decides between a directed and an undirected graph
        ret = ret + ret.T

    if self_loop:  # self-loop edges from each node to itself
        for i in idx:
            ret[i, i] = 1

    return ret
```
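For the unpruned case (prune < 0) with directed=False, the composition head_to_tree + tree_to_adj simply connects every word to its head, so the result can be sanity-checked with a direct construction. head_to_adj below is an illustrative helper, not part of the repo:

```python
import numpy as np

def head_to_adj(head, sent_len, directed=False, self_loop=False):
    # head[i] is the 1-based head of word i+1; 0 marks the root (no edge)
    ret = np.zeros((sent_len, sent_len), dtype=np.float32)
    for i, h in enumerate(head):
        if h > 0:
            ret[h - 1, i] = 1.0  # edge head -> dependent, 0-based indices
    if not directed:
        ret = ret + ret.T
    if self_loop:
        ret[np.arange(sent_len), np.arange(sent_len)] = 1.0
    return ret

head = [5, 5, 5, 5, 0, 8, 8, 5]  # "He was not a relative of Mike Cane"
adj = head_to_adj(head, len(head))
print(int(adj.sum()))  # -> 14: seven edges, each stored symmetrically
```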
Finally, after suitable processing, the dependency-tree adjacency matrices can be fed into the GCN.
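As a rough sketch of what the GCN then does with adj: the paper's layer update is approximately h' = ReLU((A h + h) W / (d + 1)), i.e. aggregate neighbour features, fold in a self-loop, and normalize by node degree. A toy numpy version (random values and made-up dimensions, not the repo's PyTorch implementation, and the bias term is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
b, n, d = 2, 8, 16                 # batch size, sentence length, hidden dim
adj = np.zeros((b, n, n))          # would come from tree_to_adj in practice
adj[:, 0, 1] = adj[:, 1, 0] = 1    # one toy undirected edge
h = rng.standard_normal((b, n, d))
W = rng.standard_normal((d, d)) * 0.1

denom = adj.sum(axis=2, keepdims=True) + 1   # node degree + self-loop
Ax = adj @ h                                 # aggregate neighbour features
out = np.maximum(((Ax + h) @ W) / denom, 0)  # add self node, normalize, ReLU
print(out.shape)  # -> (2, 8, 16)
```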
References:
A detailed walkthrough of dependency trees: https://blog.csdn.net/qq_27590277/article/details/88345017
A detailed explanation of Stanford dependencies: http://wenku.baidu.com/link?url=IfW-hkMfPuK29t49Wa_nO2UAMpP2oGYCUAZuY5PrHHIQHsIm5moH82DMbTA521PMhCC4svgGRSgUTaSkHktw5Ru6RQCCRjwuHfkNVB3mcum
Array concatenation with np.concatenate in numpy: https://www.cnblogs.com/shueixue/p/10953699.html
Variable in PyTorch: https://blog.csdn.net/qq_19329785/article/details/85029116