Chinese Natural Language Processing with Python and Stanford CoreNLP

Reposted article, 2017-04-24 09:15

Test environment: Windows 7 / Python 3.6.1 / CoreNLP 3.7.0

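1. Download CoreNLP

Download the following files from the official Stanford NLP website:

CoreNLP full package, stanford-corenlp-full-2016-10-31.zip: download it and unzip it into your working directory.
Chinese models, stanford-chinese-corenlp-2016-10-31-models.jar: download it and copy it into the same working directory.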
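
Before moving on, it can help to confirm that both downloads ended up in the same place. The following is a minimal sketch, not part of the original setup: the helper name `missing_corenlp_files` is hypothetical, and it assumes the unzipped full package contains a main jar matching `stanford-corenlp-*.jar`.

```python
from pathlib import Path

# File name taken from the article; the helper itself is hypothetical.
CHINESE_MODELS_JAR = "stanford-chinese-corenlp-2016-10-31-models.jar"


def missing_corenlp_files(workdir):
    """Return a list of problems found in the CoreNLP working directory."""
    workdir = Path(workdir)
    if not workdir.is_dir():
        return [f"working directory not found: {workdir}"]
    problems = []
    # Assumption: the full package ships a main jar named stanford-corenlp-<version>.jar.
    if not any(workdir.glob("stanford-corenlp-*.jar")):
        problems.append("no stanford-corenlp-*.jar found")
    # The Chinese models jar must sit next to the main jar.
    if not (workdir / CHINESE_MODELS_JAR).is_file():
        problems.append(f"missing {CHINESE_MODELS_JAR}")
    return problems
```

Calling `missing_corenlp_files("stanford-corenlp-full-2016-10-31")` returns an empty list when both files are present, and a list of human-readable problems otherwise.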

2. Install stanza

stanza is the newest Python interface developed by the Stanford CoreNLP team.

According to StanfordNLPHelp's explanation on Stack Overflow, Python users are advised to use stanza rather than the nltk interface:

If you want to use our tools in Python, I would recommend using the Stanford CoreNLP 3.7.0 server and making small server requests (or using the stanza library).

If you use nltk what I believe happens is Python just calls our Java code with subprocess and this can actually be very inefficient since distinct calls reload all of the models.

Note that near the end of stanza\setup.py there is the line

packages=['stanza', 'stanza.text', 'stanza.monitoring', 'stanza.util'],

Installing with this list leaves some modules missing, so it needs to be changed by hand to

packages=['stanza', 'stanza.text', 'stanza.monitoring', 'stanza.util', 'stanza.nlp', 'stanza.ml', 'stanza.cluster', 'stanza.research'],
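
Rather than trusting a hand-written list, one can check which subpackages actually exist in the source checkout. The sketch below is an illustration using only the standard library; the helper name `list_packages` is hypothetical. setuptools also offers `find_packages()`, which serves a similar purpose inside setup.py.

```python
import os


def list_packages(root, top="stanza"):
    """List dotted names of every directory under root/top that contains __init__.py."""
    found = []
    base = os.path.join(root, top)
    for dirpath, dirnames, filenames in os.walk(base):
        if "__init__.py" not in filenames:
            dirnames[:] = []  # not a package: do not descend any further
            continue
        # Convert a path such as stanza/nlp into the dotted name stanza.nlp.
        rel = os.path.relpath(dirpath, root)
        found.append(rel.replace(os.sep, "."))
    return sorted(found)
```

Running `list_packages(".")` from the directory that contains setup.py gives the names that belong in the `packages=` argument, which can then be compared with the line in the file.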

3. Testing

Open a cmd window in the CoreNLP working directory and start the server:

To process English, run
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
To process Chinese, run
java -Xmx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties StanfordCoreNLP-chinese.properties -port 9000 -timeout 15000

Note that stanford-chinese-corenlp-2016-10-31-models.jar must be located in the working directory.
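
The same commands can also be launched from Python, which makes it easier to switch between the English and Chinese configurations. This is a sketch under assumptions: the function names and parameters are hypothetical, while the class name, flags, and properties file are the ones used in the commands above.

```python
import subprocess


def corenlp_server_command(chinese=False, port=9000, timeout_ms=15000, memory="4g"):
    """Build the java command line from the article as an argument list."""
    cmd = ["java", f"-Xmx{memory}", "-cp", "*",
           "edu.stanford.nlp.pipeline.StanfordCoreNLPServer"]
    if chinese:
        # The Chinese configuration only adds the server properties file.
        cmd += ["-serverProperties", "StanfordCoreNLP-chinese.properties"]
    cmd += ["-port", str(port), "-timeout", str(timeout_ms)]
    return cmd


def start_server(workdir, **kwargs):
    """Start the server inside workdir and return the process handle.

    The caller is responsible for stopping it, e.g. with .terminate().
    """
    return subprocess.Popen(corenlp_server_command(**kwargs), cwd=workdir)
```

For example, `start_server("stanford-corenlp-full-2016-10-31", chinese=True)` would start the Chinese server, assuming that directory name and that `java` is on the PATH.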

For a quick visual test, open http://localhost:9000/ in a browser, or use the online demo at corenlp.run.
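
The server can also be queried directly over HTTP, without any client library. The sketch below uses only the standard library and assumes the server's usual request shape: the text goes in the POST body and a JSON `properties` object goes in the query string. The function names are hypothetical, and the explicit `Content-Type` header is a precaution rather than a documented requirement.

```python
import json
import urllib.parse
import urllib.request


def build_request(text, annotators="tokenize,ssplit,pos", server="http://localhost:9000"):
    """Build a POST request asking the server for JSON output."""
    props = {"annotators": annotators, "outputFormat": "json"}
    url = server + "/?" + urllib.parse.urlencode({"properties": json.dumps(props)})
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def annotate(text, **kwargs):
    """Send text to a running server and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(text, **kwargs)) as resp:
        return json.loads(resp.read().decode("utf-8"))


def words_and_tags(response):
    """Flatten a JSON response into (word, pos) pairs."""
    return [(tok["word"], tok["pos"])
            for sentence in response["sentences"]
            for tok in sentence["tokens"]]
```

With the server running, `words_and_tags(annotate("北京的白菜運往浙江。"))` should return one pair per segmented word.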

Python example code:

from stanza.nlp.corenlp import CoreNLPClient

# Note: earlier versions used `segment` for Chinese word segmentation;
# newer versions use `tokenize`, the same name as for other languages.
client = CoreNLPClient(server='http://localhost:9000', default_annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner'])

# Word segmentation and part-of-speech tagging test
test1 = "深藍的天空中掛着一輪金黃的圓月，下面是海邊的沙地，都種着一望無際的碧綠的西瓜，其間有一個十一二歲的少年，項帶銀圈，手捏一柄鋼叉，向一匹猹盡力的刺去，那猹卻將身一扭，反從他的胯下逃走了。"
annotated = client.annotate(test1)
for sentence in annotated.sentences:
    for token in sentence:
        print(token.word, token.pos)

# Named entity recognition test
test2 = "大概是物以希為貴罷。北京的白菜運往浙江，便用紅頭繩系住菜根，倒掛在水果店頭，尊為膠菜；福建野生着的蘆薈，一到北京就請進溫室，且美其名曰龍舌蘭。我到仙台也頗受了這樣的優待……"
annotated = client.annotate(test2)
for sentence in annotated.sentences:
    for token in sentence:
        # Tokens outside any named entity are tagged 'O'; skip them
        if token.ner != 'O':
            print(token.word, token.ner)
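
The NER loop above prints one token per line, so an entity that spans several tokens comes out in pieces. A possible follow-up step, sketched here as an assumption rather than anything stanza provides, is to merge adjacent tokens that share the same tag. The helper name `group_entities` is hypothetical, and the tags in the usage note are made up for illustration.

```python
def group_entities(tagged_tokens, outside="O", joiner=""):
    """Merge runs of adjacent tokens that share the same non-O NER tag.

    tagged_tokens: iterable of (word, ner_tag) pairs.
    Returns a list of (entity_text, ner_tag) pairs.
    """
    entities = []
    current_words, current_tag = [], outside
    for word, tag in tagged_tokens:
        # A tag change closes the entity that was being collected.
        if tag != current_tag and current_words:
            entities.append((joiner.join(current_words), current_tag))
            current_words = []
        if tag != outside:
            current_words.append(word)
        current_tag = tag
    # Flush an entity that runs to the end of the input.
    if current_words:
        entities.append((joiner.join(current_words), current_tag))
    return entities
```

With the objects from the example above, it could be called as `group_entities((t.word, t.ner) for s in annotated.sentences for t in s)`. The default `joiner=""` suits Chinese; for English text `joiner=" "` would be the natural choice.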
