I. Installing NLTK

pip install nltk
# Or in PyCharm: File --> Settings --> Project Interpreter --> click + and search --> Install Package (install matplotlib, numpy, and pandas as well; they are used later)

II. Downloading the NLTK book data

# In download_books.py

# -*- coding: utf-8 -*-
# Nola
import nltk
nltk.download()

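Note: you can choose the Download Directory yourself, but its parent directory must be named nltk_data. Here the data is downloaded to the share directory of the virtual (sandbox) environment. If you are unsure how to pick a custom directory, use one of the directories that NLTK searches for data (they are available as nltk.data.path); data placed in any of them will be found.

If a download fails, find the corresponding package under All Packages in the NLTK Downloader window and download it individually.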
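To check which directories NLTK searches, one option is to print nltk.data.path. The sketch below also wraps the download in a small helper. The helper names nltk_search_dirs and download_book_data, and the idea of passing an explicit target directory, are illustrative assumptions and not part of the original walkthrough.

```python
import nltk


def nltk_search_dirs():
    """Return the directories NLTK searches for data, in lookup order."""
    return list(nltk.data.path)


def download_book_data(target_dir):
    """Download the "book" collection (the data behind nltk.book) into target_dir.

    Hypothetical helper: target_dir should be one of the search directories,
    or be appended to nltk.data.path, so that NLTK can find the data later.
    """
    return nltk.download("book", download_dir=target_dir)


# Print the search directories; the download helper is defined but not called here.
for directory in nltk_search_dirs():
    print(directory)
```

Calling download_book_data with one of the printed directories skips the graphical downloader entirely, which can help on machines without a display.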

III. Using the NLTK book data

1.1 Importing the book dataset

# Open the Terminal in PyCharm
# Install ipython
pip install ipython

from nltk.book import *

text1

text2

1.2 Searching text

# concordance(word): shows every occurrence of word together with its surrounding context
text1.concordance("monstrous")
text2.concordance("affection")
text5.concordance("lol")

# similar(word): finds words that appear in contexts similar to those of word
text1.similar("monstrous")
text2.similar("monstrous")

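# common_contexts([word1, word2]): shows the contexts shared by all of the given words
text2.common_contexts(["monstrous", "very"])

# dispersion_plot([word1, word2, word3]): draws a dispersion plot (using matplotlib) showing where each word occurs in the text; each vertical line marks one occurrence, positioned by the number of words that precede it from the start of the text
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])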
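What a dispersion plot draws is simply the list of token positions at which each word occurs. A minimal sketch in plain Python, assuming a hand-made token list; the function name word_offsets is hypothetical and is not an NLTK API:

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets


# Toy token list; a real run would pass one of the book texts instead.
tokens = "the cat saw the dog and the dog saw the cat".split()
offsets = word_offsets(tokens, ["cat", "dog"])
print(offsets)  # {'cat': [1, 10], 'dog': [4, 7]}
```

Each number would become one vertical line on the plot, with the x-axis running from the first token to the last.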

# generate(): generates random text in the style of the source text
text3.generate()
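The search functions are not limited to the bundled books: any list of tokens can be wrapped in nltk.Text. The sketch below assumes a made-up sentence and a reasonably recent NLTK release that provides concordance_list, which returns the matches as objects instead of printing them.

```python
import nltk

# Made-up sample sentence, split on whitespace into tokens.
tokens = "the whale is monstrous and the sea is monstrous too".split()
toy_text = nltk.Text(tokens)

# Each hit records the token offset of the match and a formatted context line.
hits = toy_text.concordance_list("monstrous")
for hit in hits:
    print(hit.offset, hit.line)
```

Working with the returned list is convenient when the matches need further processing, while concordance() is enough for reading them on screen.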

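1.3 Counting vocabulary

# Plain Python syntax
len(text3)  # total number of tokens
sorted(set(text3))  # the vocabulary in sorted order
len(set(text3))  # number of distinct tokens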
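The two counts above can be combined into a single measure of how varied a text is: distinct tokens divided by total tokens. A minimal sketch on a toy token list; the function name lexical_diversity is an illustrative choice, not something defined earlier in this post:

```python
def lexical_diversity(tokens):
    """Return the share of distinct tokens among all tokens."""
    return len(set(tokens)) / len(tokens)


tokens = "to be or not to be".split()
vocabulary_size = len(set(tokens))     # 4 distinct tokens: to, be, or, not
diversity = lexical_diversity(tokens)  # 4 distinct / 6 total
print(vocabulary_size, round(diversity, 3))
```

The same function accepts text3 or any other book text, since they behave like lists of tokens.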

1.4 Frequency distributions

# FreqDist(text): returns a dictionary-like object that maps each token in text to the number of times it occurs
fdist1 = FreqDist(text1)

fdist1
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})

print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>

# hapaxes(): returns the words that occur exactly once
len(fdist1.hapaxes())

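# most_common(num): returns the num most frequent samples with their counts (here the top 50)
fdist1.most_common(50)

fdist1.plot(50, cumulative=True)  # cumulative frequency plot of the 50 most frequent tokens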
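The same calls are easier to check by hand on a short token list. A small sketch, assuming a made-up sentence in place of text1:

```python
from nltk import FreqDist

# Made-up sentence with 11 tokens, 8 of them distinct.
tokens = "the cat sat on the mat and the dog sat too".split()
fdist = FreqDist(tokens)

print(fdist.N())                 # total number of tokens counted
print(fdist.B())                 # number of distinct tokens
print(fdist.most_common(2))      # the two most frequent tokens with counts
print(sorted(fdist.hapaxes()))   # tokens that occur exactly once
```

Because FreqDist behaves like a dictionary, fdist["the"] returns the count for a single token.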

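1.5 Fine-grained selection of words

The most frequent and the rarest words carry limited information; looking at the long words of a text is more informative. Borrowing notation from set theory: let P be a property, V the vocabulary, and w an individual word, where P(w) is true if and only if w is longer than 15 characters. The set of interest is {w | w ∈ V & P(w)}.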

V = set(text1)
long_words = [w for w in V if len(w) > 15]
len(long_words)

fdist5 = FreqDist(text5)
sorted(w for w in set(text5) if len(w) > 7 and fdist5[w] > 7)

1.6 Collocations and bigrams

# A pair of adjacent words is called a bigram

# bigrams([word1, word2, word3]): generates the bigrams of a sequence and returns a generator
list(bigrams(["a", "doctor", "with", "him"]))
Out[37]: [('a', 'doctor'), ('doctor', 'with'), ('with', 'him')]

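# In NLTK, collocation_list() returns a text's collocations (word pairs that occur together unusually often); they say a lot about the style of a text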
text4.collocation_list()
text8.collocation_list()
Out[44]: 
['would like',
 'medium build',
 'social drinker',
 'quiet nights',
 'non smoker',
 'long term',
 'age open',
 'Would like',
 'easy going',
 'financially secure',
 'fun times',
 'similar interests',
 'Age open',
 'weekends away',
 'poss rship',
 'well presented',
 'never married',
 'single mum',
 'permanent relationship',
 'slim build']
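To see how bigrams lead toward collocations, the pairs can be counted with FreqDist. This is only a rough sketch on a made-up sentence: it ranks pairs by raw frequency, whereas NLTK's own collocation finder goes further by filtering out common words and scoring pairs statistically.

```python
from nltk import FreqDist, bigrams

# Made-up sentence; "new york" and "york is" each occur twice.
tokens = "new york is big and new york is busy".split()

# Count every pair of adjacent tokens.
pair_counts = FreqDist(bigrams(tokens))

# Keep only the pairs that occur more than once.
repeated_pairs = sorted(pair for pair, count in pair_counts.items() if count > 1)
print(repeated_pairs)
```

On a real text, raw counts are dominated by pairs such as "of the", which is why a statistical score is used in practice.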

1.7 Counting word lengths

# Token lengths in text1 and how often each length occurs
[len(w) for w in text1]

fdist = FreqDist(len(w) for w in text1)

In [47]: fdist
Out[47]: FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399, 8: 9966, 9: 6428, 10: 3528, ...})

In [48]: fdist.most_common(10)
Out[48]: 
[(3, 50223),
 (1, 47933),
 (4, 42345),
 (2, 38513),
 (5, 26597),
 (6, 17111),
 (7, 14399),
 (8, 9966),
 (9, 6428),
 (10, 3528)]

In [49]: fdist.max()
Out[49]: 3

In [50]: fdist[3]
Out[50]: 50223

In [51]: fdist.freq(3)
Out[51]: 0.19255882431878046

In [52]: fdist.freq(1)
Out[52]: 0.18377878912195814

1.8 Function reference

fdist.N()  # total number of outcomes (tokens counted)

In [60]: fdist.freq(3)  # relative frequency of the given sample
Out[60]: 0.19255882431878046


In [55]: fdist.tabulate()  # frequency distribution table
    3     1     4     2     5     6     7     8     9    10    11    12    13    14    15    16    17    18    20
50223 47933 42345 38513 26597 17111 14399  9966  6428  3528  1873  1053   567   177    70    22    12     1     1


fdist.plot()  # frequency distribution plot (Figure 1)
fdist.plot(cumulative=True)  # cumulative frequency distribution plot (Figure 2)

Figure 1

Figure 2
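The relationship between these functions can be checked by hand: freq(sample) is the sample's count divided by N(). The sketch below rebuilds the word-length counts from the tabulate() output above using collections.Counter, so it runs without the book data; the variable names are illustrative.

```python
from collections import Counter

# Word-length counts copied from the tabulate() output for text1.
length_counts = Counter({
    3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111,
    7: 14399, 8: 9966, 9: 6428, 10: 3528, 11: 1873, 12: 1053,
    13: 567, 14: 177, 15: 70, 16: 22, 17: 12, 18: 1, 20: 1,
})

total = sum(length_counts.values())                  # plays the role of fdist.N()
freq_3 = length_counts[3] / total                    # plays the role of fdist.freq(3)
most_frequent = length_counts.most_common(1)[0][0]   # plays the role of fdist.max()

print(total, freq_3, most_frequent)
```

The total of 260819 matches the number of outcomes reported for text1, and the ratio reproduces the value returned by fdist.freq(3).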
