Copyright notice: This is an original article by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Article link: https://blog.csdn.net/broccoli2/article/details/84025285
Requirements:
(1) Perform word segmentation and part-of-speech tagging on the text documents stored locally, and then run named entity recognition on them.
(2) Save the results of step (1) to a local txt file.
Technology choice:
This task is implemented with pyltp from HIT (Harbin Institute of Technology). If you are not yet familiar with LTP, see the HIT LTP Cloud official site for background.
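Before running the script, pyltp needs to be installed (for example via pip) and the LTP model files downloaded and unpacked locally. The snippet below is only a quick sanity check, not part of the original pipeline, that the model files the script expects are actually in place; the directory path mirrors the one used later in the article, so adjust it to your own environment:
# Quick check that the LTP model files exist before running the pipeline.
# The directory path follows the article; change it to wherever you unpacked the models.
import os

LTP_DATA_DIR = r'D:\pyprojects\LTP\ltp_data'
for name in ('cws.model', 'pos.model', 'ner.model'):
    path = os.path.join(LTP_DATA_DIR, name)
    print(name, 'found' if os.path.exists(path) else 'MISSING', '->', path)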
Complete code:
# -*- coding: utf-8 -*-
import os
import jieba

LTP_DATA_DIR = r'D:\pyprojects\LTP\ltp_data'  # path to the LTP model directory
cws_model_path = os.path.join(LTP_DATA_DIR, 'cws.model')  # word segmentation model, named `cws.model`
pos_model_path = os.path.join(LTP_DATA_DIR, 'pos.model')  # part-of-speech tagging model, named `pos.model`
ner_model_path = os.path.join(LTP_DATA_DIR, 'ner.model')  # named entity recognition model, named `ner.model`
par_model_path = os.path.join(LTP_DATA_DIR, 'parser.model')  # dependency parsing model, named `parser.model`
srl_model_path = os.path.join(LTP_DATA_DIR, 'srl')  # semantic role labelling model directory, named `srl`. Note this is a directory, not a file.

from pyltp import SentenceSplitter
from pyltp import Segmentor
from pyltp import Postagger
from pyltp import NamedEntityRecognizer
from pyltp import Parser
from pyltp import SementicRoleLabeller

# Build the stop word list
def stopwordslist(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f]
    return stopwords

# Sentence splitting: split a piece of text into individual sentences
def sentence_splitter(sentence):
    sents = SentenceSplitter.split(sentence)  # split into sentences
    print('\n'.join(sents))

# Word segmentation
def segmentor(sentence):
    segmentor = Segmentor()  # initialize an instance
    segmentor.load(cws_model_path)  # load the model
    # segmentor.load_with_lexicon(cws_model_path, r'D:\pyprojects\LTP\ltp_data\dict.txt')  # load the model with a user-defined lexicon for customized segmentation
    words = segmentor.segment(sentence)  # segment
    # the raw result can be printed directly:
    # print('/'.join(words))
    # or converted to a list before the model is released:
    words_list = list(words)
    segmentor.release()  # release the model
    return words_list

# Part-of-speech tagging
def posttagger(words):
    postagger = Postagger()  # initialize an instance
    postagger.load(pos_model_path)  # load the model
    postags = list(postagger.postag(words))  # POS tagging; convert to list before releasing the model
    # for word, tag in zip(words, postags):
    #     print(word + '/' + tag)
    postagger.release()  # release the model
    return postags

# Named entity recognition
def ner(words, postags):
    recognizer = NamedEntityRecognizer()  # initialize an instance
    recognizer.load(ner_model_path)  # load the model
    netags = list(recognizer.recognize(words, postags))  # NER; convert to list before releasing the model
    # for word, ntag in zip(words, netags):
    #     print(word + '/' + ntag)
    recognizer.release()  # release the model
    return netags

stopwords = stopwordslist('D:/2181729/stop_words.txt')
final = ''
f1 = open('D:/2181729/nerfcdata/30.txt', 'w', encoding='UTF-8')
with open('D:/2181729/data/30.txt', 'r', encoding='UTF-8') as f:
    for line in f:
        segs = jieba.cut(line, cut_all=False)
        for seg in segs:
            if seg not in stopwords:
                final += seg

words = segmentor(final)
postags = posttagger(words)
netags = ner(words, postags)
tags = []
tag_types = []
for word, ntag in zip(words, netags):
    if ntag != 'O':  # keep only tokens that belong to a named entity ('O' = outside any entity)
        tags.append(ntag)
        if ntag not in tag_types:
            tag_types.append(ntag)
        # print(word + '/' + ntag)
        f1.write(word + ':' + ntag + '\n')
for tag in tag_types:
    num = tags.count(tag)
    print(tag + ':' + str(num))
    f1.write(tag + ':' + str(num) + '\n')
f1.close()
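For a quick check before processing whole documents, the helper functions above can also be run on a single sentence. The sketch below is illustrative only: the sample sentence is made up, and the exact tags depend on the model version. In LTP's NER tag set, 'O' marks tokens outside any entity, while Nh, Ni and Ns denote person, organization and place names, usually with B/I/E/S position prefixes.
# Minimal usage sketch of the helpers defined above; the sentence is illustrative only.
sample = '哈爾濱工業大學位於哈爾濱市。'
sample_words = segmentor(sample)                    # word segmentation
sample_postags = posttagger(sample_words)           # part-of-speech tagging
sample_netags = ner(sample_words, sample_postags)   # named entity recognition
for w, t in zip(sample_words, sample_netags):
    if t != 'O':  # keep only tokens that are part of a named entity
        print(w, t)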
Results:
Reference:
https://blog.csdn.net/informationscience/article/details/76850652