NLTK Built-in Functions

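1. Concordance

The examples use the sample texts text1 ... text9, which are loaded with:

>>> from nltk.book import *

(1) concordance() shows every occurrence of a given word, together with its surrounding context.

>>> text1.concordance('monstrous')

(2) similar() finds words that appear in contexts similar to those of the given word, such as the ___ pictures and the ___ size.

>>> text1.similar("monstrous")

(3) common_contexts() finds the contexts shared by two or more words.

>>> text2.common_contexts(["monstrous", "very"])
be_glad am_glad a_pretty is_pretty a_lucky
>>>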
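To make clear what a concordance contains, here is a minimal sketch in plain Python. It is not NLTK's implementation; the function name `concordance`, the `width` parameter, and the sample sentence are assumptions made for illustration.

```python
def concordance(tokens, target, width=3):
    """Return each occurrence of target with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sample text, not taken from any NLTK corpus
tokens = "the monstrous whale and the monstrous size of the sea".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```

Each printed line shows one occurrence in brackets with its neighbours, which is the same idea that text1.concordance() applies to a full book.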

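2. Lexical dispersion plot

The position of a word in a text is the number of words that precede it, counted from the beginning of the text. A dispersion plot displays this positional information, with one mark per occurrence.

>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>>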
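The data behind a dispersion plot is just a list of word offsets. The following sketch, using only plain Python, collects those offsets; the helper name `offsets` and the sample sentence are hypothetical, and no plotting is done.

```python
def offsets(tokens, targets):
    """Map each target word to the token positions at which it occurs."""
    positions = {word: [] for word in targets}
    for i, tok in enumerate(tokens):
        if tok in positions:
            positions[tok].append(i)
    return positions

# Invented sample text
tokens = "we the citizens value freedom and the citizens defend freedom".split()
print(offsets(tokens, ["citizens", "freedom", "duties"]))
```

Plotting each offset as a tick on one row per word would give a picture like the one dispersion_plot() draws.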

3. Counting words

len() gives the length of a text in tokens, that is, words and punctuation marks.

>>> len(text3)
44764

4. Text --> sorted vocabulary

>>> sorted(set(text3))

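5. Lexical richness

Dividing the number of tokens by the number of distinct words shows how often each word is used on average (about 16 times here).

>>> from __future__ import division  # Python 2 only; not needed on Python 3
>>> len(text3) / len(set(text3))
16.050197203298673
>>>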

6. Count and percentage of a word in a text

>>> text3.count("smote")
5
>>> 100 * text4.count('a') / len(text4)
1.4643016433938312
>>>
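The counting steps in sections 3 to 6 can be tried without any corpus. This is a small sketch on an invented token list; the helper names `lexical_richness` and `percentage` are illustrative, not NLTK functions.

```python
def lexical_richness(tokens):
    """Average number of times each distinct word is used."""
    return len(tokens) / len(set(tokens))

def percentage(word, tokens):
    """Share of the text taken up by `word`, in percent."""
    return 100 * tokens.count(word) / len(tokens)

# Invented sample text: 10 tokens, 8 distinct words
tokens = "in the beginning God created the heaven and the earth".split()

print(len(tokens))               # number of tokens
print(sorted(set(tokens)))       # sorted vocabulary
print(lexical_richness(tokens))
print(percentage("the", tokens))
```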

7. Indexing lists

(1) The number that gives an element's position is called its index.

>>> text1[50]
'grammars'
>>>

(2) Find the index of a word's first occurrence.

>>> text1.index('grammars')
50
>>>

8. Slicing extracts a run of words (a fragment of the text).

>>> text1[100:120]
['and', 'to', 'teach', 'them', 'by', 'what', 'name', 'a', 'whale', '-', 'fish', 'is', 'to', 'be', 'called', 'in', 'our', 'tongue', 'leaving', 'out']
>>>
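An NLTK Text supports the same indexing and slicing as a Python list, so the behaviour can be tried on a plain list. The token list below is invented for illustration.

```python
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago"]

third = tokens[2]                  # element at index 2 (indices start at 0)
where = tokens.index("Ishmael")    # index of the first occurrence
middle = tokens[4:7]               # slice: indices 4, 5, 6 (the end index is excluded)
head = tokens[:2]                  # omitted start means "from the beginning"
tail = tokens[-2:]                 # negative indices count from the end

print(third, where, middle, head, tail)
```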

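9. Functions defined for NLTK's frequency distribution class

Example                        Description
fdist = FreqDist(samples)      create a frequency distribution containing the given samples
fdist[sample] += 1             increment the count for this sample (NLTK 2: fdist.inc(sample))
fdist['monstrous']             number of times the given sample occurred
fdist.freq('monstrous')        relative frequency of the given sample
fdist.N()                      total number of samples
fdist.most_common(n)           the n most frequent samples with their counts (NLTK 2: fdist.keys() was sorted by decreasing frequency)
for sample in fdist:           iterate over the samples (in decreasing frequency order only in NLTK 2)
fdist.max()                    sample with the greatest count
fdist.tabulate()               tabulate the frequency distribution
fdist.plot()                   plot the frequency distribution
fdist.plot(cumulative=True)    plot the cumulative frequency distribution
fdist1 < fdist2                test whether samples in fdist1 occur less frequently than in fdist2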
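In NLTK 3, FreqDist is a subclass of collections.Counter, so most rows of the table can be mimicked with the standard library. The sketch below uses Counter directly on an invented sample; N(), freq() and hapaxes() are FreqDist methods, so their equivalents are spelled out by hand here.

```python
from collections import Counter

# Invented sample text: 10 tokens
samples = "the cat sat on the mat and the dog sat".split()
fdist = Counter(samples)

total = sum(fdist.values())            # what fdist.N() reports
count_the = fdist["the"]               # count of one sample
freq_the = fdist["the"] / total        # what fdist.freq('the') reports
top_two = fdist.most_common(2)         # most frequent samples with counts
most_frequent = top_two[0][0]          # what fdist.max() reports
hapaxes = sorted(w for w, c in fdist.items() if c == 1)  # samples seen once

fdist["bird"] += 1                     # increment the count for a new sample

print(total, count_the, freq_the, top_two, most_frequent, hapaxes)
```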

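text1.concordance("monstrous") # search for a word and show each occurrence in context
text1.similar("monstrous") # find words that occur in similar contexts
text2.common_contexts(["monstrous", "very"]) # contexts shared by two or more words
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"]) # the corpus is concatenated in chronological order, so this plots where the words occur and can be used to study changes in language use over time
text3.generate() # generate random text from the word-sequence statistics of text3 (is this how computer-written SCI papers work?)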

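len(text3) / len(set(text3)) # average word frequency, also called lexical richness
100 * text3.count("smote") / len(text3) # percentage of the text taken up by a specific word
Tokens: all words (every occurrence counts)
Types: unique words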

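FreqDist(text1).most_common(50) # the 50 most frequent words in text1 (NLTK 2: FreqDist(text1).keys()[:50]); FreqDist(list) counts how often each element of a list occurs
FreqDist(text1).hapaxes() # words that occur only once
list(bigrams(['more', 'is', 'said', 'than', 'done'])) # build bigrams: [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
text4.collocations() # print the collocations of the text, i.e. bigrams that occur unusually often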
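A bigram list is simply every pair of adjacent tokens. As a sketch of what the bigrams() call produces, the same pairs can be built with zip; `make_bigrams` is a hypothetical helper name, not part of NLTK.

```python
def make_bigrams(tokens):
    """Pair each token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

pairs = make_bigrams(['more', 'is', 'said', 'than', 'done'])
print(pairs)
```

A list of n tokens yields n - 1 bigrams, and an empty or one-word list yields none.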

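from nltk.corpus import gutenberg, PlaintextCorpusReader
nltk.Text(gutenberg.words('austen-emma.txt')) # wrap the words in a Text object; only then can concordance() and similar methods be used
gutenberg.raw(fileid) # the raw text content as one string
gutenberg.words(fileid) # the text as a list of words; len() of it gives the word count
gutenberg.sents(fileid) # the text as a list of sentences; len() of it gives the sentence count
wordlists = PlaintextCorpusReader(corpus_root, '.*') # load your own corpus from the directory corpus_root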

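Example                                 Description
cfdist = ConditionalFreqDist(pairs)     create a conditional frequency distribution from a list of pairs
cfdist.conditions()                     the conditions (sorted alphabetically in NLTK 2; use sorted() in NLTK 3)
cfdist[condition]                       the frequency distribution for this condition
cfdist[condition][sample]               frequency of the given sample for this condition
cfdist.tabulate()                       tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)    tabulation limited to the specified samples and conditions (passed as keyword arguments)
cfdist.plot()                           plot the conditional frequency distribution
cfdist.plot(samples, conditions)        plot limited to the specified samples and conditions (passed as keyword arguments)
cfdist1 < cfdist2                       test whether samples in cfdist1 occur less frequently than in cfdist2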
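A conditional frequency distribution is one counter per condition. The sketch below builds that structure with the standard library from (condition, sample) pairs; the genre labels and words are invented, and this only approximates what ConditionalFreqDist offers.

```python
from collections import Counter, defaultdict

# Invented (condition, sample) pairs: here the condition is a genre label
pairs = [
    ("news", "the"), ("news", "said"), ("news", "the"),
    ("romance", "love"), ("romance", "the"),
]

cfdist = defaultdict(Counter)
for condition, sample in pairs:
    cfdist[condition][sample] += 1   # one Counter per condition

conditions = sorted(cfdist)          # the conditions, alphabetically
news_dist = cfdist["news"]           # distribution for one condition
news_the = cfdist["news"]["the"]     # count of one sample under that condition

print(conditions, dict(news_dist), news_the)
```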

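Applying conditional probabilities:

# -*- encoding: utf-8 -*-

import nltk

def generate_model(cfdist, word, num=15):
    # Print num words, each time moving on to the most likely next word
    for i in range(num):
        print(word)
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)             # consecutive word pairs
cfd = nltk.ConditionalFreqDist(bigrams)  # condition = first word of each pair

print(cfd['living'])  # distribution of the words that follow 'living'

generate_model(cfd, 'living')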
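The same generation loop can be run without downloading the Genesis corpus. This sketch rebuilds the bigram table with the standard library on an invented sentence and returns the words as a list instead of printing them; `build_cfd` and `generate_words` are hypothetical names.

```python
from collections import Counter, defaultdict

def build_cfd(tokens):
    """For each word, count the words that immediately follow it."""
    cfd = defaultdict(Counter)
    for first, second in zip(tokens, tokens[1:]):
        cfd[first][second] += 1
    return cfd

def generate_words(cfd, word, num=5):
    """Follow the most frequent successor num - 1 times, starting from word."""
    out = []
    for _ in range(num):
        out.append(word)
        followers = cfd.get(word)
        if not followers:            # dead end: the word was never followed by anything
            break
        word = followers.most_common(1)[0][0]
    return out

# Invented sample text
tokens = "the living creature and the living soul and the living creature".split()
cfd = build_cfd(tokens)
print(generate_words(cfd, "living"))
```

Because the loop always picks the single most likely successor, it quickly falls into a cycle, which is also what happens with the full corpus.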

nltk.corpus.stopwords.words('english') # stop words
nltk.corpus.names # personal names

from nltk.corpus import wordnet
wordnet.synsets('car') # synonym sets (synsets) for car
wordnet.lemmas('car') # all lemmas that involve the word car

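Downloading, reading, and processing web text
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib import urlopen
url = "http://www.gutenberg.org/files/2554/2554.txt"
raw = urlopen(url).read().decode('utf8')

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read().decode('utf8')
from bs4 import BeautifulSoup  # nltk.clean_html() is no longer implemented in NLTK 3
raw = BeautifulSoup(html, 'html.parser').get_text() # strips the HTML markup, but navigation text and similar content still remains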
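To show what stripping HTML markup involves, here is a rough sketch using only the standard library's html.parser. It is an assumption-laden stand-in for a real HTML cleaner, not what NLTK or any other library does; the class name `TextExtractor`, the helper `html_to_text`, and the HTML snippet are invented.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Invented HTML snippet
html = ("<html><head><style>p {color: red}</style></head>"
        "<body><h1>Blondes</h1><p>to die out in <b>200</b> years</p></body></html>")
text_only = html_to_text(html)
print(text_only)
```

As the note above says, this kind of cleaning removes tags but cannot tell article text from navigation menus; that still needs manual trimming.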

import feedparser
blog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
blog['feed']['title']
post = blog.entries[2]

tokens = nltk.word_tokenize(raw) # tokenize
text = nltk.Text(tokens) # only after this step can text.collocations() and similar methods be used

# Decoding
import codecs
f = codecs.open(path, encoding='latin2')

# Regular expressions
import re
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') # ==> ['ing']
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processing') # ==> ['processing']

re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes') # ==> [('processe', 's')]
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes') # ==> [('process', 'es')]
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language') # ==> [('language', '')]
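The last pattern above, with a non-greedy stem group and an optional suffix group, is enough for a crude suffix stripper. The following sketch wraps it in a function; the name `stem` is illustrative, and this is far simpler than the Porter or Lancaster stemmers.

```python
import re

SUFFIX_PATTERN = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'

def stem(word):
    """Strip one known suffix, if present, using the non-greedy pattern."""
    stem_part, suffix = re.findall(SUFFIX_PATTERN, word)[0]
    return stem_part

for w in ["processes", "processing", "language"]:
    print(w, "->", stem(w))
```

The non-greedy `.*?` makes the stem as short as possible, so the longest matching suffix is removed; making the suffix optional lets words without a listed suffix pass through unchanged.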

# Finding hypernyms and hyponyms
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories=['hobbies', 'learned']))
hobbies_learned.findall(r"<\w*> <and> <other> <\w*s>")

This yields:
speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;

# Stemming
tokens = nltk.word_tokenize(raw)
porter = nltk.PorterStemmer()
lancaster = nltk.LancasterStemmer()
[porter.stem(t) for t in tokens]

# Lemmatization
wnl = nltk.WordNetLemmatizer()
[wnl.lemmatize(t) for t in tokens]

# Tokenization
nltk.regexp_tokenize(text, pattern) # split text into tokens that match the regular expression pattern
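Regular-expression tokenization amounts to listing the patterns that count as a token. As a rough sketch with the standard re module (the pattern and the sample sentence are assumptions, and real tokenizers handle many more cases such as abbreviations and contractions):

```python
import re

# A token is either a run of word characters or a single punctuation character
TOKEN_PATTERN = r"\w+|[^\w\s]"

def simple_tokenize(text):
    return re.findall(TOKEN_PATTERN, text)

tokens = simple_tokenize("Hello, world! It's 3pm.")
print(tokens)
```

Note how the apostrophe splits "It's" into three tokens; this is the kind of case where a richer pattern is needed.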

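# Procedural vs. declarative style in Python
# Find the longest word(s) in a text

maxlen = max(len(word) for word in text)
[word for word in text if len(word) == maxlen] # an idiom worth learning and using often

# Average sentence length; list() is needed on Python 3, where map() returns an iterator
lengths = list(map(len, nltk.corpus.brown.sents(categories="news")))
avg = sum(lengths) / len(lengths)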
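Both declarative idioms above can be checked on small invented data, with no corpus required:

```python
# Invented word list and sentence list
text = ["the", "quick", "brown", "foxes", "jumped", "over", "fences"]
sents = [["a", "b", "c"], ["d", "e"], ["f", "g", "h", "i"]]

# Longest words: one pass to find the maximum length, one to collect the matches
maxlen = max(len(word) for word in text)
longest = [word for word in text if len(word) == maxlen]

# Average sentence length in tokens
lengths = [len(sent) for sent in sents]
avg = sum(lengths) / len(lengths)

print(maxlen, longest, avg)
```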

set() # sets are indexed (hashed) internally; prefer a set for membership tests whenever possible
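A typical use of set membership is filtering tokens against a vocabulary. In this sketch the vocabulary and tokens are invented; with a set, each `in` test is a hash lookup rather than a scan through a list.

```python
# Invented vocabulary, lower-cased and stored as a set
vocab = set(w.lower() for w in ["The", "whale", "is", "a", "fish"])
tokens = ["the", "whale", "sings", "a", "song"]

# Keep the tokens that are not in the vocabulary
unknown = [w for w in tokens if w not in vocab]
print(unknown)
```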

matplotlib # plotting library
NetworkX # network (graph) visualization
