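Basic Usage of the spaCy Library

While working on an ABSA (aspect-based sentiment analysis) task, I found this natural language processing library used in an open-source project.

These are my notes, excerpted here for study.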

About spaCy and installation
spaCy pipeline and properties
- Tokenization
- POS Tagging
- Entity Detection
- Dependency Parsing
- Noun Phrases
Comparison with NLTK and CoreNLP

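1. About spaCy and installation

1.1 About spaCy

spaCy is written in Cython, which makes it a very fast library. It provides a concise API for accessing its methods and properties, which are governed by trained machine (and deep) learning models.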

1.2 Installation

Install spaCy:

pip install spacy

Download the data and model:

python -m spacy download en_core_web_sm		# formerly `en`; the model is now named en_core_web_sm

You can now use spaCy.
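
As a quick check after installing, something like the following can confirm that spaCy imports and whether the model package is visible to Python. This is a minimal sketch: it only reports what it finds and does not download anything.

```python
import spacy
from spacy.util import is_package

# Report the installed spaCy version.
print("spaCy version:", spacy.__version__)

# is_package() returns True if an installed package with this name exists.
model_name = "en_core_web_sm"
model_installed = is_package(model_name)
print(model_name, "installed:", model_installed)
```

If the second line prints False, rerunning the download command above is the usual fix.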

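2. spaCy pipeline and properties

To use spaCy and access its various properties, you first create a pipeline, which is done by loading a model. spaCy provides many different models, each containing information about a language: vocabulary, pretrained word vectors, syntax, and entities.

The following loads the small English model, en_core_web_sm:

import spacy
nlp = spacy.load("en_core_web_sm")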
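
To see what "pipeline" means in practice without downloading a model, the sketch below builds a blank English pipeline and adds one component by hand. It assumes spaCy v3, where components are added by their registered string name; the name blank_nlp is just for this illustration. A loaded model such as en_core_web_sm would list several trained components in pipe_names instead.

```python
import spacy

# A blank pipeline has a tokenizer but no trained components.
blank_nlp = spacy.blank("en")
print(blank_nlp.pipe_names)

# Add the rule-based sentence splitter as the only component.
blank_nlp.add_pipe("sentencizer")
print(blank_nlp.pipe_names)

# Calling the pipeline on text returns a Doc object.
doc = blank_nlp("Nice place. The rooms were a bit small but nice.")
print(type(doc).__name__, len(doc))
```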

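The nlp object is used to create documents and to access linguistic annotations and the various NLP properties. We create a document by loading a text file; the text used here is a set of hotel reviews downloaded from the TripAdvisor website.

# filename is the path to the downloaded reviews text file
with open(filename, encoding="utf-8") as f:
    document = nlp(f.read())

document is now a spaCy Doc object with a number of member attributes, which you can list with dir(document).

dir(document)
>> [..., 'user_span_hooks', 'user_token_hooks', 'vector', 'vector_norm', 'vocab']

The document holds a large amount of information, including tokens, each token's reference index, part-of-speech tags, entities, vectors, sentiment, and vocabulary. A few of these are introduced below.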
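
The "reference index" mentioned above can be inspected per token. A small sketch, using a blank English pipeline so that it runs without a model (an assumption made for brevity; tags, entities, and vectors would need a trained model). Here token.i is the token's position in the document and token.idx is its character offset in the original text.

```python
import spacy

nlp_blank = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp_blank("Nice place, great view.")

# Collect (position, character offset, text) for every token.
rows = [(token.i, token.idx, token.text) for token in doc]
for row in rows:
    print(row)
```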

2.1 Tokenization

"this is a sentence."
-> (tokenization)
>> ['this', 'is', 'a', 'sentence', '.']
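
The transformation above can be reproduced with only the tokenizer. A minimal sketch, assuming a blank English pipeline (no model download needed):

```python
import spacy

tokenizer_only = spacy.blank("en")
doc = tokenizer_only("this is a sentence.")

# Each token keeps its original text; the final period becomes its own token.
tokens = [token.text for token in doc]
print(tokens)
```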

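spaCy first tokenizes the text; sentence boundaries are then assigned by a later pipeline component (the dependency parser in this model). We can index into the document and iterate over it, token by token or sentence by sentence.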

# first token of the doc 
document[0] 
>> Nice

# fifth token from the end of the doc (document[-1] would be the last token)
document[len(document)-5]
>> boston 

# List of sentences of our doc 
list(document.sents)
>> [ Nice place Better than some reviews give it credit for.,
 Overall, the rooms were a bit small but nice.,
...
Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).]
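
Indexing and sentence iteration can also be tried on a short text. The sketch below is a stand-in for the review file, under stated assumptions: it uses a blank English pipeline with spaCy's rule-based sentencizer (spaCy v3 API) instead of the parser, so sentence boundaries are simply placed after sentence-final punctuation.

```python
import spacy

nlp_sents = spacy.blank("en")
nlp_sents.add_pipe("sentencizer")  # rule-based sentence boundaries

doc = nlp_sents("Nice place. Overall, the rooms were a bit small but nice.")

first_token = doc[0]    # first token of the doc
last_token = doc[-1]    # negative indices count from the end

# Each sentence is a Span; .text gives its string.
sentences = [sent.text for sent in doc.sents]
print(first_token, last_token)
print(sentences)
```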

2.2 Part-of-Speech Tagging

Part-of-speech tagging labels each word as a verb, noun, and so on. These tags can be used as text features in information filtering, statistical models, and rule-based parsing.

# get all tags
all_tags = {w.pos: w.pos_ for w in document}
>> {83: 'ADJ', 91: 'NOUN', 84: 'ADP', 89: 'DET', 99: 'VERB', 94: 'PRON', 96: 'PUNCT', 85: 'ADV', 88: 'CCONJ', 95: 'PROPN', 102: 'SPACE', 93: 'PART', 98: 'SYM', 92: 'NUM', 100: 'X', 90: 'INTJ'}

# all tags of first sentence of our document 
for word in list(document.sents)[0]:  
    print(word, word.tag_)
>> (Nice, 'JJ') (place, 'NN') (Better, 'JJR') (than, 'IN') (some, 'DT') (reviews, 'NNS') (give, 'VBP') (it, 'PRP') (credit, 'NN') (for, 'IN') (., '.')
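
The short tag names in the output above can be looked up with spacy.explain, which returns a brief description for a tag or label it knows. A small sketch; the tag list is just a sample taken from the output above:

```python
import spacy

# A few coarse (pos_) and fine-grained (tag_) labels from the output above.
sample_tags = ["ADJ", "NOUN", "JJ", "JJR", "NNS", "VBP"]

descriptions = {tag: spacy.explain(tag) for tag in sample_tags}
for tag, description in descriptions.items():
    print(f"{tag:5} {description}")
```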

The following code builds a text-cleaning step that removes noise words.

from collections import Counter

# define some parameters
# note: spaCy's coarse tag for proper nouns is "PROPN"; "PROP" as written matches no token
noisy_pos_tags = ["PROP"]
min_token_length = 2

# check whether a token is noise: an unwanted POS tag, a stop word, or too short
def isNoise(token):
    if token.pos_ in noisy_pos_tags:
        return True
    if token.is_stop:
        return True
    return len(token.text) <= min_token_length

# normalize a token's text: optionally lowercase it, then strip whitespace
def cleanup(token, lower=True):
    if lower:
        token = token.lower()
    return token.strip()

# top unigrams used in the reviews
cleaned_list = [cleanup(word.text) for word in document if not isNoise(word)]
Counter(cleaned_list).most_common(5)
>> [('hotel', 683), ('room', 652), ('great', 300),  ('sheraton', 285), ('location', 271)]
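
The same counting idea can be tried without a trained model by relying only on lexical attributes. This is a simplified sketch, not the filter above: it assumes a blank English pipeline, so pos_ is unavailable and the POS check is replaced by is_punct. The sample text and the name is_noise are made up for illustration.

```python
from collections import Counter

import spacy

nlp_lex = spacy.blank("en")  # is_stop and is_punct work without a model
text = "The hotel was clean. The hotel staff were friendly and the location was perfect."
doc = nlp_lex(text)

def is_noise(token, min_length=2):
    """Treat stop words, punctuation, and very short tokens as noise."""
    return token.is_stop or token.is_punct or len(token.text) <= min_length

# Count the lowercased text of every token that survives the filter.
counts = Counter(token.text.lower() for token in doc if not is_noise(token))
print(counts.most_common(3))
```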

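2.3 Entity Detection

spaCy includes a fast entity recognition model that identifies entity phrases in a document. Entities come in several types, such as person, location, organization, date, and number. They are accessible through the document's ents attribute.

The following code finds all named entities in the current document, grouped by label.

labels = set([w.label_ for w in document.ents])
for label in labels:
    entities = [cleanup(e.text, lower=False) for e in document.ents if label == e.label_]
    entities = list(set(entities))
    print(label, entities)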
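
The grouping logic can be exercised without the statistical model by letting a rule-based component set the entities. In this sketch the entity_ruler component (spaCy v3) and its three patterns are stand-ins chosen for illustration; with en_core_web_sm the entities would come from the trained recognizer instead.

```python
from collections import defaultdict

import spacy

nlp_rules = spacy.blank("en")
ruler = nlp_rules.add_pipe("entity_ruler")
# Hypothetical patterns, used only to produce some entities.
ruler.add_patterns([
    {"label": "ORG", "pattern": "Sheraton"},
    {"label": "GPE", "pattern": "Boston"},
    {"label": "FAC", "pattern": "Prudential Center"},
])

doc = nlp_rules("The Sheraton in Boston is next to the Prudential Center.")

# Group the distinct entity strings under each label.
by_label = defaultdict(set)
for ent in doc.ents:
    by_label[ent.label_].add(ent.text)

for label, entities in sorted(by_label.items()):
    print(label, sorted(entities))
```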

2.4 Dependency Parsing

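One of spaCy's most powerful features is its fast and accurate syntactic dependency parser, available through a simple API. The parser is also used for sentence boundary detection and phrase chunking. The parse can be navigated with the .children, .root, and .ancestors attributes.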

# extract all review sentences that contains the term - hotel
hotel = [sent for sent in document.sents if 'hotel' in sent.text.lower()]

# create dependency tree
sentence = hotel[2] 
for word in sentence:
    print(word, ': ', str(list(word.children)))
>> A :  []  
cab :  [A, from] 
from :  [airport, to]
the :  [] 
airport :  [the] 
to :  [hotel] 
the :  [] 
hotel :  [the] 
can :  []
be :  [cab, can, cheaper, .] 
cheaper :  [than]
than :  [shuttles] 
the :  []
shuttles :  [the, depending] 
depending :  [time] 
what :  [] 
time :  [what, of] 
of :  [day]
the :  [] 
day :  [the, go] 
you :  []
go :  [you]
. :  []
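
The navigation attributes can also be seen on a tiny hand-annotated example. This sketch builds a Doc directly from words, head indices, and dependency labels (spaCy v3 constructor arguments), so the annotations below are written by hand for illustration and are not parser output.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["The", "hotel", "was", "great"]
heads = [1, 2, 2, 2]                      # index of each token's head
deps = ["det", "nsubj", "ROOT", "acomp"]  # hand-written labels
doc = Doc(vocab, words=words, heads=heads, deps=deps)

children_of_hotel = [t.text for t in doc[1].children]   # direct dependents
ancestors_of_the = [t.text for t in doc[0].ancestors]   # heads up to the root
root_word = doc[:].root.text                            # root of the whole span

print(children_of_hotel, ancestors_of_the, root_word)
```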

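The following code parses the dependency trees of all sentences containing "hotel" to see which adjectives are used to describe it. It defines a custom function that walks the dependency tree and extracts the children with the relevant part-of-speech tag.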

# check all adjectives used with a word
def pos_words(document, token, pos_tag):
    sentences = [sent for sent in document.sents if token in sent.text]
    pwrds = []
    for sent in sentences:
        for word in sent:
            if token in word.text:
                pwrds.extend([child.text.strip() for child in word.children
                              if child.pos_ == pos_tag])
    return Counter(pwrds).most_common(10)

pos_words(document, 'hotel', "ADJ")
>> [('other', 20), ('great', 10), ('good', 7), ('better', 6), ('nice', 6), ('different', 5), ('many', 5), ('best', 4), ('my', 4), ('wonderful', 3)]

2.5 Noun Phrases

Dependency trees can also be used to generate noun phrases.

# Generate Noun Phrases 
doc = nlp(u'I love data science on analytics vidhya') 
for np in doc.noun_chunks:
    print(np.text, np.root.dep_, np.root.head.text)
>> I nsubj love
   data science dobj love
   analytics pobj on
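
To experiment with noun_chunks when no model is installed, a Doc with hand-written annotations can stand in for parser output. In this sketch the heads, dependency labels, and POS tags are typed in by hand (assumptions for illustration, passed as spaCy v3 constructor arguments); the English noun-chunk rules then run on them as they would on a parsed document.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["I", "love", "data", "science"]
heads = [1, 1, 3, 1]                           # index of each token's head
deps = ["nsubj", "ROOT", "compound", "dobj"]   # hand-written labels
pos = ["PRON", "VERB", "NOUN", "NOUN"]         # hand-written coarse tags
doc = Doc(vocab, words=words, heads=heads, deps=deps, pos=pos)

# Each chunk is a Span; its root token carries the dependency relation.
chunks = [(chunk.text, chunk.root.dep_, chunk.root.head.text)
          for chunk in doc.noun_chunks]
for chunk in chunks:
    print(*chunk)
```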