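Named entities are commonly labelled with one of two tagging schemes: 1) BIOES 2) BIO
The entity categories can be changed to suit your own needs. Raw data is usually labelled with the BIO scheme. I wrote my own labelling routine, which you can use as a reference.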
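To make the difference between the two schemes concrete, here is a minimal sketch that converts a BIO tag sequence to BIOES. The helper name `bio_to_bioes` is illustrative and not part of the original code, and it assumes the convention where a single-token entity gets `S-` (some toolkits write `U-` instead).

```python
def bio_to_bioes(tags):
    """Convert a list of BIO tags to BIOES tags (illustrative helper)."""
    bioes = []
    for i, tag in enumerate(tags):
        if tag == 'O':
            bioes.append(tag)
            continue
        prefix, label = tag.split('-', 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else 'O'
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == 'I-' + label
        if prefix == 'B':
            bioes.append(('B-' if continues else 'S-') + label)
        else:  # prefix == 'I'
            bioes.append(('I-' if continues else 'E-') + label)
    return bioes


example = ['O', 'B-Material', 'I-Material', 'I-Material', 'O', 'B-Process']
print(bio_to_bioes(example))
# ['O', 'B-Material', 'I-Material', 'E-Material', 'O', 'S-Process']
```

BIO only marks where an entity begins; BIOES additionally marks where it ends and whether it consists of a single token.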
Source text: 1.txt
Inspired by energy-fueled phenomena such as cortical cytoskeleton flows [46,45,32] during biological morphogenesis, the theory of active polar viscous gels has been developed [37,33]. The theory models the continuum, macroscopic mechanics of a collection of uniaxial active agents, embedded in a viscous bulk medium, in which internal stresses are induced due to dissipation of energy [41,58]. The energy-consuming uniaxial polar agents constituting the gel are modeled as unit vectors. The average of unit vectors in a small local volume at each point defines the macroscopic directionality of the agents and is described by a polarization field. The polarization field is governed by an equation of motion accounting for energy consumption and for the strain rate in the fluid. The relationship between the strain rate and the stress in the fluid is provided by a constitutive equation that accounts for anisotropic, polar agents and consumption of energy. These equations, along with conservation of momentum, provide a continuum hydrodynamic description modeling active polar viscous gels as an energy consuming, anisotropic, non-Newtonian fluid [37,33,32,41]. The resulting partial differential equations governing the hydrodynamics of active polar viscous gels are, however, in general analytically intractable.
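Manually annotated text: 1.ann (in the file itself, the ID, the label with its offsets, and the entity text are separated by tabs)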
T1 Task 120 155 theory of active polar viscous gels
T2 Process 195 238 models the continuum, macroscopic mechanics
T3 Material 137 155 polar viscous gels
T4 Material 258 280 uniaxial active agents
T6 Material 296 315 viscous bulk medium
T7 Material 415 436 uniaxial polar agents
T8 Material 454 457 gel
* Synonym-of T7 T8
T9 Material 1074 1092 polar viscous gels
T10 Material 1099 1149 energy consuming, anisotropic, non-Newtonian fluid
* Synonym-of T9 T10
T11 Material 1241 1266 active polar viscous gels
T12 Process 628 646 polarization field
T13 Process 652 670 polarization field
T14 Process 689 707 equation of motion
R1 Hyponym-of Arg1:T13 Arg2:T14
T15 Process 866 887 constitutive equation
T16 Process 1023 1057 continuum hydrodynamic description
T17 Process 959 1011 These equations, along with conservation of momentum
* Synonym-of T17 T16
T18 Process 44 71 cortical cytoskeleton flows
T19 Process 90 114 biological morphogenesis
T20 Material 773 778 fluid
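Each `T` line in the listing above holds an ID, a category with start and end character offsets into 1.txt, and the annotated text; lines starting with `R` or `*` describe relations between entities. A minimal sketch of reading the entity lines follows. The names `Entity` and `parse_ann` are illustrative, and the sketch assumes tab-separated fields and contiguous spans only.

```python
from collections import namedtuple

# Illustrative container for one entity line of the .ann file.
Entity = namedtuple('Entity', 'id label start end text')


def parse_ann(lines):
    """Return the entity (T...) annotations; relation lines are skipped."""
    entities = []
    for line in lines:
        fields = line.rstrip('\n').split('\t')
        if len(fields) != 3 or not fields[0].startswith('T'):
            continue
        # Assumes a contiguous span: "<label> <start> <end>"
        label, start, end = fields[1].split(' ')
        entities.append(Entity(fields[0], label, int(start), int(end), fields[2]))
    return entities


sample = [
    'T8\tMaterial 454 457\tgel\n',
    '*\tSynonym-of T7 T8\n',
    'R1\tHyponym-of Arg1:T13 Arg2:T14\n',
    'T18\tProcess 44 71\tcortical cytoskeleton flows\n',
]
for entity in parse_ann(sample):
    print(entity)
```

The end offset is exclusive, so `content[start:end]` should reproduce the annotated text exactly.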
Now the data set can be labelled automatically. Reference code follows:
import spacy


def extract_entity():
    """Print BIO tags for every sentence of ./data/1.txt, using the entities in ./data/1.ann."""
    with open('./data/1.txt', 'r', encoding='utf8') as tfr, \
            open('./data/1.ann', 'r', encoding='utf8') as afr:
        content = tfr.read()
        ann = afr.readlines()

    # Load the English pipeline. The 'en' shortcut no longer works in spaCy 3;
    # the 'en_core_web_sm' model has to be installed separately.
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(content)

    # Split into sentences and start every token with the tag 'O'
    sents = list(doc.sents)
    sent_tags = [['O'] * len(sent) for sent in sents]

    # For each annotation line
    for each_ann in ann:
        fields = each_ann.strip().split('\t')
        # Only entity lines (T...) are handled; relation lines (R..., *) are skipped
        if len(fields) < 3 or not fields[0].startswith('T'):
            continue
        task_name_start_end_index = fields[1]
        task_text = fields[2]

        task_name = task_name_start_end_index.split(' ')[0]
        task_start = int(task_name_start_end_index.split(' ')[1])
        task_end = int(task_name_start_end_index.split(' ')[2])
        # The offsets must reproduce the annotated text
        assert content[task_start:task_end] == task_text

        # Use the offsets to find the sentence that contains the entity
        for sent, tags in zip(sents, sent_tags):
            if task_start >= sent.start_char and task_end <= sent.end_char:
                # Tokens that lie entirely inside the annotated span
                entity_tokens = [token for token in sent
                                 if token.idx >= task_start
                                 and token.idx + len(token) <= task_end]
                for n, token in enumerate(entity_tokens):
                    # BIO: the first token gets B-, every later token gets I-.
                    # (BIOES would also mark the last token with E- and a
                    # single-token entity with S-.)
                    prefix = 'B-' if n == 0 else 'I-'
                    # When annotations overlap (e.g. T1 and T3), the later one wins
                    tags[token.i - sent.start] = prefix + task_name
                break

    for sent, tags in zip(sents, sent_tags):
        # Drop whitespace tokens such as a trailing '\n'
        pairs = [(token.text, tag) for token, tag in zip(sent, tags) if not token.is_space]
        print('Tokens: {}'.format([text for text, _ in pairs]))
        print('Tags:   {}'.format([tag for _, tag in pairs]))
        assert len(sent) == len(tags)


if __name__ == '__main__':
    extract_entity()
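The script above reads a single pair of files, 1.txt and 1.ann. To cover a whole directory, the pairs can be collected first. This is a minimal sketch; `paired_files` is an illustrative helper, and the `./data` layout with matching file stems is an assumption carried over from the paths used above.

```python
from pathlib import Path


def paired_files(data_dir):
    """Yield (txt_path, ann_path) for every .txt file with a matching .ann file."""
    for txt_path in sorted(Path(data_dir).glob('*.txt')):
        ann_path = txt_path.with_suffix('.ann')
        # Skip texts that have not been annotated yet
        if ann_path.exists():
            yield txt_path, ann_path


# Usage (assumed layout: ./data/1.txt, ./data/1.ann, ./data/2.txt, ...):
# for txt_path, ann_path in paired_files('./data'):
#     print(txt_path.name, ann_path.name)
```

Each pair can then be passed to a version of the labelling function that takes the two paths as parameters instead of hard-coding them.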
Note: spaCy is used here to process the English text. It is a good tool, billed as an industrial-strength NLP library.
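Because every entity line in the .ann file already carries character offsets, the BIO tags can be derived from the offsets alone, which avoids ambiguity when the same word occurs more than once in a sentence. The sketch below illustrates the idea without spaCy: `tokenize` is a hypothetical regex stand-in for a real tokenizer, and `bio_tags` is an illustrative name.

```python
import re


def tokenize(text):
    """Return (token, start, end) triples; a simple stand-in for a real tokenizer."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r'\w+|[^\w\s]', text)]


def bio_tags(text, entities):
    """Tag tokens with BIO labels; entities are (label, start, end) character spans."""
    tokens = tokenize(text)
    tags = ['O'] * len(tokens)
    for label, start, end in entities:
        # Indices of the tokens that lie entirely inside the span
        inside = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]
        for n, i in enumerate(inside):
            tags[i] = ('B-' if n == 0 else 'I-') + label
    return tokens, tags


text = 'Inspired by energy-fueled phenomena such as cortical cytoskeleton flows'
tokens, tags = bio_tags(text, [('Process', 44, 71)])
for (word, _, _), tag in zip(tokens, tags):
    print(word, tag)
```

With the offsets of T18 from 1.ann, the last three tokens come out as B-Process, I-Process, I-Process and everything else stays O.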
