NLTK (Natural Language Toolkit)
NLTK is a well-known Python toolkit for natural language processing, although it is mainly aimed at English. NLTK comes with documentation, corpora, and an accompanying book.
- One of the most widely used Python libraries in NLP
- Open-source project
- Built-in functionality such as classification and tokenization
- Strong community support
- Corpora: language material that has actually occurred in real-world language use
- http://www.nltk.org/py-modindex.html
The NLTK homepage explains in detail how to install NLTK on Mac, Linux, and Windows: http://nltk.org/install.html . Installing Anaconda directly is recommended, since it spares you from installing most packages by hand. Once NLTK is installed, run import nltk to test it; if that works, also download the corpora officially provided by NLTK.
Installation steps:
- Download the NLTK package:
pip install nltk
- Run Python and enter the following commands:
import nltk
nltk.download()
- In the window that pops up, installing all of the packages is recommended, i.e. all
- Test the installation:
Corpus
nltk.corpus
import nltk
from nltk.corpus import brown   # the brown corpus needs to be downloaded first

# Use the Brown University corpus
# List the categories included in the corpus
print(brown.categories())

# Inspect the size of the brown corpus
print('{} sentences in total'.format(len(brown.sents())))
print('{} words in total'.format(len(brown.words())))
Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
57340 sentences in total
1161192 words in total
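Instead of downloading everything through the interactive window, the data packages used in this section can also be fetched individually by name; a minimal sketch:

import nltk

# Download only the data packages used in this section instead of "all"
nltk.download('brown')                       # Brown corpus
nltk.download('punkt')                       # sentence/word tokenization models
nltk.download('wordnet')                     # WordNet data used by the lemmatizer
nltk.download('stopwords')                   # stop word lists
nltk.download('averaged_perceptron_tagger')  # POS tagger model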
Tokenization (tokenize)
- Split a sentence into words that carry meaning in terms of linguistic semantics
- Differences between Chinese and English tokenization:
  - In English, words are naturally separated by spaces
  - Chinese has no formal delimiter, so tokenization is much more complex than in English
- Chinese tokenization tools, e.g. jieba:
pip install jieba
- Once tokenization is done, the subsequent processing of Chinese and English text does not differ much
# Import the jieba tokenizer
import jieba

seg_list = jieba.cut("歡迎來到黑馬程序員Python學科", cut_all=True)
print("Full mode: " + "/ ".join(seg_list))      # full mode

seg_list = jieba.cut("歡迎來到黑馬程序員Python學科", cut_all=False)
print("Precise mode: " + "/ ".join(seg_list))   # precise mode
Output:
Full mode: 歡迎/ 迎來/ 來到/ 黑馬/ 程序/ 程序員/ Python/ 學科
Precise mode: 歡迎/ 來到/ 黑馬/ 程序員/ Python/ 學科
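Beyond the full and precise modes, jieba also provides a search-engine mode and list-returning variants; a small sketch using the same sentence (cut_for_search and lcut are part of jieba's public API):

import jieba

text = "歡迎來到黑馬程序員Python學科"

# Search-engine mode: precise mode plus an extra pass that re-splits long words,
# useful when building a search index
seg_list = jieba.cut_for_search(text)
print("Search-engine mode: " + "/ ".join(seg_list))

# lcut returns a plain list instead of a generator
print(jieba.lcut(text))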
Word Form Issues
- look, looked, looking
- Different word forms affect the accuracy of corpus-based learning
- Word form normalization
1. Stemming
Example:
# PorterStemmer
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('looked'))
print(porter_stemmer.stem('looking'))

# Output:
# look
# look

Example:
# SnowballStemmer
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer('english')
print(snowball_stemmer.stem('looked'))
print(snowball_stemmer.stem('looking'))

# Output:
# look
# look

Example:
# LancasterStemmer
from nltk.stem.lancaster import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('looked'))
print(lancaster_stemmer.stem('looking'))

# Output:
# look
# look
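The three stemmers implement different rule sets, with LancasterStemmer generally the most aggressive, so they do not always agree; a small comparison sketch (the sample words are arbitrary):

from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

stemmers = {
    'Porter': PorterStemmer(),
    'Snowball': SnowballStemmer('english'),
    'Lancaster': LancasterStemmer(),
}

# Print every stemmer's result side by side; the outputs differ for some words
for word in ['maximum', 'presumably', 'multiply', 'owed']:
    print(word, {name: s.stem(word) for name, s in stemmers.items()})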
2. Lemmatization
- Stemming: strip affixes such as -ing and -ed, keeping only the word stem
- Lemmatization: map the various inflected forms of a word onto a single base form, e.g. am, is, are -> be; went -> go
- Stemmers in NLTK:
PorterStemmer, SnowballStemmer, LancasterStemmer
- Lemmatizer in NLTK:
WordNetLemmatizer
- The ambiguity problem:
went as a verb -> go (to walk); Went as a noun -> Went (a proper name)
- Specifying the part of speech makes lemmatization more accurate
Example:
from nltk.stem import WordNetLemmatizer   # the wordnet corpus needs to be downloaded first

wordnet_lematizer = WordNetLemmatizer()
print(wordnet_lematizer.lemmatize('cats'))
print(wordnet_lematizer.lemmatize('boxes'))
print(wordnet_lematizer.lemmatize('are'))
print(wordnet_lematizer.lemmatize('went'))

# Output:
# cat
# box
# are
# went

Example:
# Specifying the part of speech makes lemmatization more accurate
# lemmatize() treats words as nouns by default
print(wordnet_lematizer.lemmatize('are', pos='v'))
print(wordnet_lematizer.lemmatize('went', pos='v'))

# Output:
# be
# go
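In real text the part of speech is usually not known in advance. A common approach is to run NLTK's POS tagger (introduced in the next section) first and map its Penn Treebank tags onto WordNet tags before lemmatizing; a sketch, where treebank_to_wordnet is an illustrative helper rather than an NLTK function:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

def treebank_to_wordnet(tag):
    # Map a Penn Treebank tag to a WordNet POS constant (helper for illustration)
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN   # default, matching lemmatize()'s noun default

lemmatizer = WordNetLemmatizer()
words = nltk.word_tokenize('The cats went into the boxes.')
tagged = nltk.pos_tag(words)   # requires the averaged_perceptron_tagger model
print([lemmatizer.lemmatize(w, pos=treebank_to_wordnet(t)) for w, t in tagged])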
3. Part-of-Speech Tagging (Part-Of-Speech)
- POS tagging in NLTK: tokenize with nltk.word_tokenize(), then tag with nltk.pos_tag()
Example:
import nltk

words = nltk.word_tokenize('Python is a widely used programming language.')
print(nltk.pos_tag(words))   # the averaged_perceptron_tagger model needs to be downloaded first

# Output:
# [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('widely', 'RB'), ('used', 'VBN'), ('programming', 'NN'), ('language', 'NN'), ('.', '.')]
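The tags in this output come from the Penn Treebank tag set. NLTK can print a short description of any tag, which helps when reading tagged output; a small sketch (the tagsets data package is assumed to be downloaded first):

import nltk

# nltk.download('tagsets')     # tag documentation, needed once
nltk.help.upenn_tagset('NNP')  # proper noun, singular
nltk.help.upenn_tagset('VBZ')  # verb, present tense, 3rd person singular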
4. Removing Stop Words
- To save storage space and improve search efficiency, NLP pipelines automatically filter out certain characters or words
- Stop words are compiled manually rather than generated automatically, and together they form a stop word list
- Categories:
  - Function words of the language, e.g. the, is…
  - Lexical words that are used very broadly, e.g. want
- Chinese stop word lists:
  - the general Chinese stop word collection (中文停用詞庫)
  - the Harbin Institute of Technology stop word list (哈工大停用詞表)
  - the Sichuan University Machine Intelligence Laboratory stop word list (四川大學機器智能實驗室停用詞庫)
  - the Baidu stop word list (百度停用詞列表)
- Stop word lists for other languages
- Removing stop words with NLTK:
stopwords.words()
Example:
from nltk.corpus import stopwords   # the stopwords corpus needs to be downloaded first

filtered_words = [word for word in words if word not in stopwords.words('english')]
print('Original words:', words)
print('After removing stop words:', filtered_words)

# Output:
# Original words: ['Python', 'is', 'a', 'widely', 'used', 'programming', 'language', '.']
# After removing stop words: ['Python', 'widely', 'used', 'programming', 'language', '.']
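The NLTK stop word list can be extended with domain-specific entries, and converting it to a set speeds up membership tests on longer texts; a small sketch (the extra entries are arbitrary examples):

from nltk.corpus import stopwords

# Build a set for fast lookups and add punctuation as extra stop words
stop_set = set(stopwords.words('english'))
stop_set.update(['.', ',', '!', '?'])   # arbitrary additions for illustration

words = ['Python', 'is', 'a', 'widely', 'used', 'programming', 'language', '.']
print([w for w in words if w.lower() not in stop_set])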
5. A Typical Text Preprocessing Pipeline
Example:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# Raw text
raw_text = 'Life is like a box of chocolates. You never know what you\'re gonna get.'

# Tokenization
raw_words = nltk.word_tokenize(raw_text)

# Word form normalization
wordnet_lematizer = WordNetLemmatizer()
words = [wordnet_lematizer.lemmatize(raw_word) for raw_word in raw_words]

# Remove stop words
filtered_words = [word for word in words if word not in stopwords.words('english')]

print('Raw text:', raw_text)
print('Preprocessing result:', filtered_words)
Output:
Raw text: Life is like a box of chocolates. You never know what you're gonna get.
Preprocessing result: ['Life', 'like', 'box', 'chocolate', '.', 'You', 'never', 'know', "'re", 'gon', 'na', 'get', '.']
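The same steps can be wrapped in a small helper so the pipeline is reusable across documents; a sketch (the function name preprocess is an illustrative choice, and unlike the example above it also lowercases words before the stop word check):

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

def preprocess(text):
    # Tokenize, lemmatize, and remove English stop words (illustrative helper)
    lemmatizer = WordNetLemmatizer()
    stop_set = set(stopwords.words('english'))
    tokens = nltk.word_tokenize(text)
    lemmas = [lemmatizer.lemmatize(t) for t in tokens]
    return [w for w in lemmas if w.lower() not in stop_set]

print(preprocess("Life is like a box of chocolates."))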
Use case:
import nltk
from nltk.tokenize import WordPunctTokenizer

# Load the punkt sentence tokenizer for English
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
# Sentence segmentation
sentences = sent_tokenizer.tokenize(paragraph)
print(sentences)

sentence = "Are you old enough to remember Michael Jackson attending. the Grammys with Brooke Shields and Webster sat on his lap during the show?"
# Word tokenization
words = WordPunctTokenizer().tokenize(sentence.lower())
print(words)
Output:
['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']
['are', 'you', 'old', 'enough', 'to', 'remember', 'michael', 'jackson', 'attending', '.', 'the', 'grammys', 'with', 'brooke', 'shields', 'and', 'webster', 'sat', 'on', 'his', 'lap', 'during', 'the', 'show', '?']
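NLTK also exposes convenience wrappers that avoid loading the punkt pickle by hand; a small sketch of the equivalent calls (the punkt data package is still required):

import nltk

paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much!"

# sent_tokenize and word_tokenize use the punkt models under the hood
print(nltk.sent_tokenize(paragraph))
print(nltk.word_tokenize("What a fantastic song!"))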