NLTK and NLP: Principles and Fundamentals

This article is a repost (2018-07-26 00:28).

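Reference: https://blog.csdn.net/zxm1306192988/article/details/78896319

This article uses NLTK (http://www.nltk.org/) to explain the principles of natural language processing.

A well-known natural language processing library for Python

Ships with corpora and part-of-speech resources
Built-in classification, tokenization, and other features
Strong community support
Many simplified wrappers, such as TextBlob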

Installing NLTK (numpy may need to be installed first)

pip install nltk

Installing the corpora

import nltk
nltk.download()

Corpora bundled with NLTK

>>> from nltk.corpus import brown
>>> brown.categories()  # categories
['adventure', 'belles_lettres', 'editorial',
'fiction', 'government', 'hobbies', 'humor',
'learned', 'lore', 'mystery', 'news', 'religion',
'reviews', 'romance', 'science_fiction']
>>> len(brown.sents())  # total number of sentences
57340
>>> len(brown.words())  # total number of words
1161192

Text processing pipeline:

Text -> preprocessing (tokenization, stop-word removal) -> feature engineering -> machine learning algorithm -> label

Tokenization

Split a long sentence into smaller, meaningful units.

>>> import nltk
>>> sentence = "hello, world"
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['hello', ',', 'world']

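Differences between English and Chinese NLP:
English can largely be tokenized on spaces, whereas Chinese needs dedicated word segmentation methods.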
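As a rough illustration of this difference, the sketch below compares plain whitespace splitting with a simple regular expression. The regex is an assumption made for illustration only and is not how NLTK tokenizes internally.

```python
import re

sentence = "hello, world"

# Splitting on whitespace alone leaves punctuation glued to the word.
whitespace_tokens = sentence.split()

# Illustrative regex: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

# Chinese text has no spaces between words, so whitespace splitting
# returns the whole sentence as a single token.
chinese_tokens = "我來到北京清華大學".split()

print(whitespace_tokens)  # ['hello,', 'world']
print(regex_tokens)       # ['hello', ',', 'world']
print(chinese_tokens)     # ['我來到北京清華大學']
```

This is why a dedicated segmenter is needed for Chinese, while English only needs punctuation handling on top of spaces.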

Chinese word segmentation

import jieba
seg_list = jieba.cut('我來到北京清華大學', cut_all=True)
print('Full Mode:', '/'.join(seg_list))  # full mode
seg_list = jieba.cut('我來到北京清華大學', cut_all=False)
print('Default Mode:', '/'.join(seg_list))  # accurate mode
seg_list = jieba.cut('他來到了網易杭研大廈')  # accurate mode is the default
print('/'.join(seg_list))
seg_list = jieba.cut_for_search('小明碩士畢業於中國科學院計算所，后在日本京都大學深造')  # search engine mode
print('Search Engine Mode:', '/'.join(seg_list))
seg_list = jieba.cut('小明碩士畢業於中國科學院計算所，后在日本京都大學深造', cut_all=True)
print('Full Mode:', '/'.join(seg_list))

The many forms a word can take

Inflection: walk => walking => walked; the part of speech does not change
Derivation: nation (noun) => national (adjective) => nationalize (verb); the part of speech changes

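Word form normalization

Stemming: chop off the inflectional endings that do not affect the part of speech (the NLTK stemmers used in this article do this with suffix-stripping rules, not a dictionary lookup)
walking: strip "ing" => walk
walked: strip "ed" => walk
Lemmatization: reduce all the variant forms of a word to a single base form (using WordNet)
went => go
are => be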
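The contrast between the two approaches can be sketched with toy versions of each. Both functions below are hypothetical illustrations (the suffix list and the lookup table are made up for this example) and are far simpler than the NLTK stemmers or WordNet.

```python
# Toy stemmer: strip a few common inflectional suffixes by rule.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    for suffix in SUFFIXES:
        # Keep at least three letters of stem so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: look up irregular forms in a small hand-written table.
LEMMA_TABLE = {"went": "go", "are": "be"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

for w in ["walking", "walked", "went", "are"]:
    print(w, "stem:", toy_stem(w), "lemma:", toy_lemmatize(w))
```

Rule-based stripping handles regular endings such as "ing" and "ed" but leaves irregular forms such as "went" untouched, which is where a vocabulary-backed lemmatizer helps.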

Stemming with NLTK

Three stemmers are shown:

1. Lancaster

from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer=LancasterStemmer()
print(lancaster_stemmer.stem('maximum'))
print(lancaster_stemmer.stem('multiply'))
print(lancaster_stemmer.stem('provision'))
print(lancaster_stemmer.stem('went'))
print(lancaster_stemmer.stem('wenting'))
print(lancaster_stemmer.stem('walked'))
print(lancaster_stemmer.stem('national'))

2. Porter

from nltk.stem.porter import PorterStemmer
porter_stemmer=PorterStemmer()
print(porter_stemmer.stem('maximum'))
print(porter_stemmer.stem('multiply'))
print(porter_stemmer.stem('provision'))
print(porter_stemmer.stem('went'))
print(porter_stemmer.stem('wenting'))
print(porter_stemmer.stem('walked'))
print(porter_stemmer.stem('national'))

3. Snowball

from nltk.stem import SnowballStemmer
snowball_stemmer=SnowballStemmer("english")
print(snowball_stemmer.stem('maximum'))
print(snowball_stemmer.stem('multiply'))
print(snowball_stemmer.stem('provision'))
print(snowball_stemmer.stem('went'))
print(snowball_stemmer.stem('wenting'))
print(snowball_stemmer.stem('walked'))
print(snowball_stemmer.stem('national'))

Lemmatization with NLTK

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer=WordNetLemmatizer()
print(wordnet_lemmatizer.lemmatize('dogs'))
print(wordnet_lemmatizer.lemmatize('churches'))
print(wordnet_lemmatizer.lemmatize('aardwolves'))
print(wordnet_lemmatizer.lemmatize('abaci'))
print(wordnet_lemmatizer.lemmatize('hardrock'))

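Problem: as a verb, "went" is the past tense of "go"; as a noun, "Went" is an English name.
Adding part-of-speech information therefore lets NLTK lemmatize more accurately.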

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
# Without a POS tag, the default is noun
print(wordnet_lemmatizer.lemmatize('are'))
print(wordnet_lemmatizer.lemmatize('is'))
# With a POS tag
print(wordnet_lemmatizer.lemmatize('is', pos='v'))
print(wordnet_lemmatizer.lemmatize('are', pos='v'))

POS tagging with NLTK

import nltk
text=nltk.word_tokenize('what does the beautiful fox say')
print(text)
print(nltk.pos_tag(text))
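`nltk.pos_tag` returns Penn Treebank tags by default, while `WordNetLemmatizer.lemmatize` expects a WordNet POS letter such as 'n' or 'v'. A minimal sketch of a bridge between the two follows; the helper name `penn_to_wordnet` and the hand-written tagged pairs are assumptions for illustration, not actual tagger output.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (noun by default)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun

# Hand-written (word, tag) pairs in the same shape pos_tag returns.
tagged = [("does", "VBZ"), ("beautiful", "JJ"), ("fox", "NN"), ("quickly", "RB")]
converted = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(converted)
```

The resulting letter can then be passed as the `pos` argument of `lemmatize`, as in the 'is' / 'are' example earlier.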

Removing stop words

import nltk
from nltk.corpus import stopwords
word_list = nltk.word_tokenize('what does the beautiful fox say')
print(word_list)
# Build the stop-word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filter_words = [word for word in word_list if word not in stop_words]
print(filter_words)

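Whether to do this depends on the task: for duplicate-text detection or writing-style analysis, for example, stop words may not need to be removed.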

What is natural language processing?
Natural language -> data a computer can process

What does text preprocessing give us?

Classic NLP applications with NLTK (key section)

Sentiment analysis
Text similarity
Text classification

1. Sentiment analysis

The simplest approach is based on a sentiment dictionary.
It works like a keyword scoring scheme:

like 1
good 2
bad -2
terrible -3

For example: AFINN-111
http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010
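The keyword scoring idea can be tried without the AFINN file. The sketch below uses only the four toy entries listed above; the function names `score` and `label` are made up for this example, and splitting on spaces stands in for real tokenization.

```python
# Toy sentiment dictionary with the four example entries.
sentiment_dictionary = {"like": 1, "good": 2, "bad": -2, "terrible": -3}

def score(text):
    # Lowercase and split on whitespace; unknown words score 0.
    words = text.lower().split()
    return sum(sentiment_dictionary.get(word, 0) for word in words)

def label(total):
    if total > 0:
        return "positive"
    if total == 0:
        return "neutral"
    return "negative"

for text in ["I like this good book", "what a terrible bad day", "this is a book"]:
    total = score(text)
    print(text, "->", total, label(total))
```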

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer("english")

sentiment_dictionary = {}
# Each line of AFINN-111.txt has the form "word<TAB>score"
with open('AFINN-111.txt') as f:
    for line in f:
        word, score = line.split('\t')
        sentiment_dictionary[word] = int(score)

text = 'I went to Chicago yesterday, what a fucking day!'
word_list = nltk.word_tokenize(text)  # tokenize
# Stem each token; lemmatization works best with POS tags, so it is skipped here
words = [snowball_stemmer.stem(word) for word in word_list]
# Remove stop words from the stemmed tokens
stop_words = set(stopwords.words('english'))
words = [word for word in words if word not in stop_words]
print('Words after preprocessing:', words)
total_score = sum(sentiment_dictionary.get(word, 0) for word in words)
print('Sentiment score of the sentence:', total_score)
if total_score > 0:
    print('positive')
elif total_score == 0:
    print('neutral')
else:
    print('negative')

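Drawbacks: it cannot handle new words, it depends on subjective human scoring, and it cannot capture the deeper meaning of a sentence.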

Sentiment analysis with ML

from nltk.classify import NaiveBayesClassifier

# A tiny hand-made training set
s1 = 'this is a good book'
s2 = 'this is a awesome book'
s3 = 'this is a bad book'
s4 = 'this is a terrible book'

def preprocess(s):
    dic = ['this', 'is', 'a', 'good', 'book', 'awesome', 'bad', 'terrible']
    tokens = s.split()  # compare whole words rather than substrings
    return {word: word in tokens for word in dic}  # bag-of-words features for the sentence


# Put the training set into the standard form
training_data = [[preprocess(s1), 'pos'],
                 [preprocess(s2), 'pos'],
                 [preprocess(s3), 'neg'],
                 [preprocess(s4), 'neg']]

# Feed it to the model
model = NaiveBayesClassifier.train(training_data)
# Print the result
print(model.classify(preprocess('this is a terrible book')))

Text similarity

Use the frequencies of Bag of Words elements as the text features.

Use cosine similarity to judge how similar two vectors are.

import nltk
from nltk import FreqDist

corpus = 'this is my sentence ' \
         'this is my life ' \
         'this is the day'

# Preprocess as needed: tokenize, stemming, lemma, stopwords, etc.
tokens = nltk.word_tokenize(corpus)
print(tokens)

# Use NLTK's FreqDist to count how often each token occurs
fdist = FreqDist(tokens)
# It behaves like a dict: index it with a word to get its count across the whole text
print(fdist['is'])
# Take the 50 most common words
standard_freq_vector = fdist.most_common(50)
size = len(standard_freq_vector)
print(standard_freq_vector)


# Func: record the position of each word, ordered by frequency
def position_lookup(v):
    res = {}
    counter = 0
    for word in v:
        res[word[0]] = counter
        counter += 1
    return res


# Record the position of every word in the dictionary
standard_position_dict = position_lookup(standard_freq_vector)
print(standard_position_dict)

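# A new sentence
sentence = 'this is cool'
# Create a vector the same size as the dictionary
freq_vector = [0] * size
# Simple preprocessing
tokens = nltk.word_tokenize(sentence)
# For every word in the new sentence
for word in tokens:
    try:
        # If the word is in the dictionary, add 1 at its standard position
        freq_vector[standard_position_dict[word]] += 1
    except KeyError:
        continue

print(freq_vector)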

What this computes is a term-frequency vector.

Once you have it, apply the cosine similarity mentioned above.
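The formula itself did not survive in this copy of the article. Assuming it is the usual cosine similarity, cos(θ) = (A · B) / (|A| |B|), a minimal sketch in plain Python looks like this; the function name and the sample vectors are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector shares nothing with any other vector
    return dot / (norm_a * norm_b)

v1 = [1, 1, 0, 0]
v2 = [1, 1, 1, 0]
v3 = [0, 0, 0, 1]

print(cosine_similarity(v1, v2))  # close to 1: the vectors mostly overlap
print(cosine_similarity(v1, v3))  # 0.0: no words in common
```

Values near 1 mean the two texts use words in similar proportions; 0 means they share no vocabulary.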

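Application: text classification

TF-IDF is used as a single combined measure.

TF: Term Frequency, which measures how frequently a term appears in a document.
TF(t) = number of times t appears in the document / total number of terms in the document

IDF: Inverse Document Frequency, which measures how important a term is.
Some words appear very often but are clearly not very useful, such as 'is', 'the', and 'and'.
IDF(t) = log_e(total number of documents / number of documents containing t)
(The more common a word is, the larger the denominator, and the closer the inverse document frequency gets to 0. In practice 1 is usually added to the denominator to avoid division by zero when no document contains the word. log means taking the logarithm of the resulting value.)
If a word is relatively rare overall but appears many times in this article, it probably reflects what the article is about, and it is exactly the kind of keyword we want.
TF-IDF = TF * IDF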
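The formulas can be worked through by hand in plain Python. This is a sketch under stated assumptions: the three tiny documents are illustrative, and instead of adding 1 to the denominator it follows the unsmoothed IDF formula and returns 0 when no document contains the term.

```python
import math

# Three tiny tokenized documents (illustrative data).
docs = [
    "this is sentence one".split(),
    "this is sentence two".split(),
    "is sentence three".split(),
]

def tf(term, doc):
    # Occurrences of the term divided by the number of terms in the document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # term unseen in the collection
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["is", "this", "one"]:
    print(term, tf_idf(term, docs[0], docs))
```

A word that appears in every document ('is') scores 0, while a word unique to one document ('one') scores highest.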

TF-IDF with NLTK

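from nltk.text import TextCollection

# First, put all the documents into a TextCollection.
# Each document is passed as a list of tokens; with raw strings the
# counts would be taken over characters and substrings, not words.
docs = ['this is sentence one',
        'this is sentence two',
        ' is sentence three']
corpus = TextCollection([doc.split() for doc in docs])

# tf_idf can then be computed directly
# (term: a term in a sentence, text: that sentence as a token list)
print(corpus.tf_idf('this', 'this is sentence four'.split()))

# For each new sentence
new_sentence = 'this is sentence five'.split()
# Go through every word in the vocabulary:
standard_vocab = ['this', 'is', 'sentence', 'one', 'two', 'five']
for word in standard_vocab:
    print(corpus.tf_idf(word, new_sentence))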

Once you have the TF-IDF vector representation, use an ML model to do the classification.
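The article does not show which model it has in mind. As one possible sketch, the example below builds TF-IDF vectors in plain Python and classifies a new sentence with a 1-nearest-neighbour rule based on cosine similarity. The training sentences, function names, and the choice of nearest neighbour are all assumptions made for illustration.

```python
import math

# Tiny labelled training set (illustrative data).
train = [
    ("this is a good book", "pos"),
    ("this is a awesome book", "pos"),
    ("this is a bad book", "neg"),
    ("this is a terrible book", "neg"),
]
docs = [text.split() for text, _ in train]
labels = [label for _, label in train]
vocab = sorted({word for doc in docs for word in doc})

def idf(term):
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0

def vectorize(tokens):
    # One TF-IDF weight per vocabulary word.
    return [tokens.count(term) / len(tokens) * idf(term) for term in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify(text):
    query = vectorize(text.split())
    scores = [cosine(query, vectorize(doc)) for doc in docs]
    # Label of the most similar training sentence (first one wins on ties).
    return labels[scores.index(max(scores))]

print(classify("what a terrible day"))
print(classify("a good read"))
```

Words shared by every training sentence get an IDF of 0, so only the distinctive words drive the decision. A query with no distinctive word in common with the training data receives an arbitrary label, which a real classifier would need to handle.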
