Python NLTK: Accessing Text Corpora and Lexical Resources
Author: 白寧超
November 7, 2016, 13:15:24
Abstract: NLTK is a natural language toolkit implemented in Python at the University of Pennsylvania's Department of Computer and Information Science. On top of the many public datasets and models it bundles, it offers comprehensive, easy-to-use interfaces covering the main NLP tasks, such as tokenization, part-of-speech tagging (POS tagging), named entity recognition (NER), and syntactic parsing. This article introduces several of the corpora shipped with NLTK (the Natural Language Toolkit) and the basic operations of its built-in modules, including bigrams, stopwords, word-frequency counts, and building your own corpus, all of which are very practical. The focus is on fundamentals; for background on Python itself, see my 【Python五篇慢慢彈】 series of articles. (This article is original; when reposting, please cite the source: Python NLTK 獲取文本語料和詞匯資源.)
Contents
[Python NLP] How to Use the Stanford NLP Toolkit with Python NLTK (1)
[Python NLP] A Survey of Python Natural Language Processing Tools (2)
[Python NLP] Python NLTK Enters the Great Qin Empire (3)
[Python NLP] Accessing Text Corpora and Lexical Resources with Python NLTK (4)
[Python NLP] Processing Raw Text with Python NLTK (5)
1 The Gutenberg Corpus
Get all texts in the corpus directly: nltk.corpus.gutenberg.fileids()
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
Output:
Import the package and get all texts in the corpus
>>> from nltk.corpus import gutenberg
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
Look up a specific text
>>> persuasion = nltk.corpus.gutenberg.words("austen-persuasion.txt")
>>> len(persuasion)
98171
>>> persuasion[:200]
['[', 'Persuasion', 'by', 'Jane', 'Austen', '1818', ...]
Loop over the file identifiers and compute statistics for each text
for fileid in gutenberg.fileids():
    num_char = len(gutenberg.raw(fileid))     # length of the raw text, including spaces and punctuation
    num_words = len(gutenberg.words(fileid))  # number of words
    num_sents = len(gutenberg.sents(fileid))  # number of sentences
    num_vocab = len(set([w.lower() for w in gutenberg.words(fileid)]))  # vocabulary size
    # print the average word length (this counts one trailing space, so a printed value of 4
    # means the words are really 3 letters long on average), the average sentence length,
    # and the average number of times each vocabulary item appears in the text
    print(int(num_char/num_words), int(num_words/num_sents), int(num_words/num_vocab), fileid)
Output:
2 Web and Chat Text
Load the web text corpus
>>> from nltk.corpus import webtext
>>> for fileid in webtext.fileids():
        print(fileid, webtext.raw(fileid))
Output
View information about the web text corpus
>>> for fileid in webtext.fileids():
        print(fileid, len(webtext.words(fileid)), len(webtext.raw(fileid)), len(webtext.sents(fileid)), webtext.encoding(fileid))

firefox.txt 102457 564601 1142 ISO-8859-2
grail.txt 16967 65003 1881 ISO-8859-2
overheard.txt 218413 830118 17936 ISO-8859-2
pirates.txt 22679 95368 1469 ISO-8859-2
singles.txt 4867 21302 316 ISO-8859-2
wine.txt 31350 149772 2984 ISO-8859-2
Instant messaging chat session corpus:
>>> from nltk.corpus import nps_chat
>>> chatroom = nps_chat.posts('10-19-20s_706posts.xml')
>>> chatroom[123]
['i', 'do', "n't", 'want', 'hot', 'pics', 'of', 'a', 'female', ',', 'I', 'can', 'look', 'in', 'a', 'mirror', '.']
3 The Brown Corpus
View corpus information:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
Compare the use of modal verbs across genres:
>>> import nltk
>>> from nltk.corpus import brown
>>> new_texts = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in new_texts])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
        print(m + ':', fdist[m])

can: 94
could: 87
may: 93
might: 38
must: 53
will: 389
The NLTK conditional frequency distribution (ConditionalFreqDist):
>>> cfd = nltk.ConditionalFreqDist((genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
Output:
4 The Reuters Corpus
The corpus contains 10,788 news documents totaling about 1.3 million words. The documents are divided into 90 topics and grouped into a training set and a test set; a document with an identifier such as 'test/14826' belongs to the test set.
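The split is encoded in the file identifiers themselves ('training/...' vs. 'test/...'), so the corpus can be partitioned by prefix. A minimal sketch (not from the original article; the variable names are illustrative):

>>> from nltk.corpus import reuters
>>> train_ids = [f for f in reuters.fileids() if f.startswith('training/')]
>>> test_ids = [f for f in reuters.fileids() if f.startswith('test/')]
>>> len(train_ids), len(test_ids)   # together these account for all 10,788 documents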
>>> from nltk.corpus import reuters
>>> print(reuters.fileids()[:500])
Output:
View the first 100 categories of the corpus (since there are only 90 categories, this returns all of them):
>>> print(reuters.categories()[:100])
View the size of the corpus:
>>> len(reuters.fileids())
10788
View the number of categories:
>>> len(reuters.categories())
90
View the categories of a document with a given identifier:
>>> reuters.categories('training/9865')
['barley', 'corn', 'grain', 'wheat']
View the combined categories of several identifiers:
>>> reuters.categories(['training/9865', 'training/9880'])
['barley', 'corn', 'grain', 'money-fx', 'wheat']
View which documents belong to a given category:
>>> reuters.fileids('barley')
['test/15618', 'test/15649', 'test/15676', 'test/15728', 'test/15871', 'test/15875', 'test/15952', 'test/17767', 'test/17769', 'test/18024', 'test/18263', 'test/18908', 'test/19275', 'test/19668', 'training/10175', 'training/1067', 'training/11208', 'training/11316', 'training/11885', 'training/12428', 'training/13099', 'training/13744', 'training/13795', 'training/13852', 'training/13856', 'training/1652', 'training/1970', 'training/2044', 'training/2171', 'training/2172', 'training/2191', 'training/2217', 'training/2232', 'training/3132', 'training/3324', 'training/395', 'training/4280', 'training/4296', 'training/5', 'training/501', 'training/5467', 'training/5610', 'training/5640', 'training/6626', 'training/7205', 'training/7579', 'training/8213', 'training/8257', 'training/8759', 'training/9865', 'training/9958']
5 The Inaugural Address Corpus
View corpus information:
>>> from nltk.corpus import inaugural
>>> len(inaugural.fileids())
56
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1957-Eisenhower.txt', '1961-Kennedy.txt', '1965-Johnson.txt', '1969-Nixon.txt', '1973-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1985-Reagan.txt', '1989-Bush.txt', '1993-Clinton.txt', '1997-Clinton.txt', '2001-Bush.txt', '2005-Bush.txt', '2009-Obama.txt']
View the year of each address:
>>> [fileid[:4] for fileid in inaugural.fileids()]
['1789', '1793', '1797', '1801', '1805', '1809', '1813', '1817', '1821', '1825', '1829', '1833', '1837', '1841', '1845', '1849', '1853', '1857', '1861', '1865', '1869', '1873', '1877', '1881', '1885', '1889', '1893', '1897', '1901', '1905', '1909', '1913', '1917', '1921', '1925', '1929', '1933', '1937', '1941', '1945', '1949', '1953', '1957', '1961', '1965', '1969', '1973', '1977', '1981', '1985', '1989', '1993', '1997', '2001', '2005', '2009']
Conditional frequency distribution
>>> import nltk
>>> cfd = nltk.ConditionalFreqDist((target, fileid[:4])
        for fileid in inaugural.fileids()
        for w in inaugural.words(fileid)
        for target in ['america', 'citizen']
        if w.lower().startswith(target))
>>> cfd.plot()
Output:
Annotated text corpora: many corpora include linguistic annotations such as part-of-speech tags, named entities, syntactic structure, and semantic roles.
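For instance (a small sketch added here for illustration), the Brown Corpus ships with part-of-speech annotations that can be read directly through the tagged_words() reader method:

>>> import nltk
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]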
Corpora in other languages: in some cases you need to learn how to handle character encodings in Python before using a corpus.
>>> nltk.corpus.cess_esp.words()
['El', 'grupo', 'estatal', 'Electricité_de_France', ...]
Common structures of text corpora:
- isolated collections of texts with no particular structure;
- structured by genre (the Brown Corpus);
- with overlapping categories (the Reuters Corpus);
- varying over time (the Inaugural Address Corpus).
To look up the NLTK corpus reader functions, use help(nltk.corpus.reader).
6 Loading Your Own Corpus
Building your own corpus
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = r'E:\dict'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['dqdg.txt', 'q0.txt', 'q1.txt', 'q10.txt', 'q2.txt', 'q3.txt', 'q5.txt', 'text.txt']
>>> len(wordlists.words('text.txt'))  # if this fails because of the file's format, converting its encoding in Notepad++ fixes it
152389
Corpus information:
Once you have built your own corpus, all of NLTK's built-in functions can be applied to it; in other words, the methods shown for the other corpora work on your corpus as well. The only caveat is that some NLTK methods are designed for English and do not carry over to Chinese text (word segmentation being the typical case). There are many solutions, for example adding plug-ins that give NLTK Chinese support, or using the Stanford NLP toolkit from within NLTK to process your corpus, as covered in the previous article.
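As a quick illustration of that point (a sketch assuming the E:\dict corpus loaded above), the corpus reader's output plugs straight into NLTK's generic tools such as FreqDist and Text:

>>> import nltk
>>> from nltk.corpus import PlaintextCorpusReader
>>> wordlists = PlaintextCorpusReader(r'E:\dict', '.*')
>>> fdist = nltk.FreqDist(wordlists.words('text.txt'))   # word frequencies, just like a built-in corpus
>>> fdist.most_common(10)
>>> text = nltk.Text(wordlists.words('text.txt'))        # wrap it as a Text to use concordance() and similar methods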
7 Conditional Frequency Distributions
A conditional frequency distribution is a collection of frequency distributions, one for each condition; the condition is usually the category of the text.
Conditions and events:
A frequency distribution counts observed events, such as the words occurring in a text. A conditional frequency distribution additionally associates each event with a condition, so instead of processing a sequence of words, we process a sequence of pairs.
Word sequence: text = ['The', 'Fulton', 'County']
Paired sequence: pairs = [('news', 'The'), ('news', 'Fulton')]
Each pair has the form (condition, event). If we process the whole Brown Corpus by genre, there will be 15 conditions (one per genre) and 1,161,192 events (one per word).
Count words by genre:
>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist((genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))
To break this down, consider just two genres: news and romance. For each genre, we loop over every word in its files to produce (genre, word) pairs.
>>> genre_word = [(genre, word)
        for genre in ['news', 'romance']
        for word in brown.words(categories=genre)]
>>> len(genre_word)
170576
Genre-word pairs
>>> genre_word[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
Conditional frequencies:
>>> cfd = nltk.ConditionalFreqDist(genre_word)
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['romance', 'news']
>>> len(cfd['news'])
14394
>>> len(cfd['romance'])
8452
Access the vocabulary under a given condition
>>> from nltk.corpus import brown
>>> import nltk
>>> cfd = nltk.ConditionalFreqDist((genre, word)
        for genre in brown.categories()
        for word in brown.words(categories=genre))
>>> len(list(cfd['romance']))
8452
>>> len(set(cfd['romance']))
8452
>>> cfd['news']['The']
806
Plotting and tabulating distributions
>>> from nltk.corpus import inaugural
>>> cfd = nltk.ConditionalFreqDist((target, fileid[:4])
        for fileid in inaugural.fileids()
        for word in inaugural.words(fileid)
        for target in ['america', 'citizen']
        if word.lower().startswith(target))
>>> cfd.plot(cumulative=True)
Output:
Display the results as a table:
cfd.tabulate(conditions=['English','The'],samples=range(10),cumulative=True)
Output:
conditions=['English','The'] restricts which conditions are displayed
samples=range(10) specifies which samples are displayed
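Note that the conditions and samples passed to tabulate() must match keys that actually occur in the CFD. For the inaugural CFD built above, whose conditions are 'america' and 'citizen' and whose samples are four-digit year strings, a call might look like the following sketch (the years are chosen arbitrarily):

>>> cfd.tabulate(conditions=['america', 'citizen'],
                 samples=['1789', '1861', '1933', '2009'],
                 cumulative=False)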
8 More on Python: Code Reuse
Generating random text with bigrams: the bigrams() function takes a list of words and builds a list of consecutive word pairs.
>>> sent = ['Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', 'of']
>>> nltk.bigrams(sent)
<generator object bigrams at 0x0103C180>
Generating random text: define a program that collects all the bigrams in the text of Genesis, then builds a conditional frequency distribution recording which words are most likely to follow a given word; for example, living can be followed by creature. Define such a function as follows (press Ctrl+N to open the editor and write the script):
import nltk

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word)
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
Press F5 to run the script, then call the function:
========================== RESTART: E:/Python/1.py ==========================
>>> cfd['living']
FreqDist({'creature': 7, 'thing': 4, 'substance': 2, ',': 1, '.': 1, 'soul': 1})
>>> generate_model(cfd, 'living')
Output:
Press Ctrl+N to open the IDLE editor and enter the following module
class MyHello:
    def hello():
        print("Hello Python")

    def bnc():
        print("Hello BNC")

    def add(num1, num2):
        print("The sum is \t", str(num1 + num2))
Press Ctrl+S to save it locally as hello.py, then press F5 to run it
============== RESTART: E:/sourceCode/NLPPython/day_03/hello.py ==============
>>> from hello import *
>>> MyHello.add(1, 2)
The sum is 3
>>> MyHello.hello()
Hello Python
Lexical resources: a lexicon, or lexical resource, is a collection of words and/or phrases together with associated information. Wordlist corpora:
Filtering a text: this program computes the vocabulary of a text and then removes every element that appears in an existing wordlist, leaving only rare or misspelled words. Press Ctrl+N to open the IDLE editor and enter the following module
import nltk

class WordsPro:
    def unusual_words(text):
        text_vocab = set(w.lower() for w in text if w.isalpha())
        english_vocab = set(w.lower() for w in nltk.corpus.words.words())
        unusual = text_vocab.difference(english_vocab)
        return sorted(unusual)
Press Ctrl+S to save it locally as WordsPro.py, then press F5 to run it
========================== RESTART: E:/Python/1.py ==========================
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> len(WordsPro.unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt')))
1601
Stopwords corpus: contains high-frequency words such as the, to, and and.
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
Define a function to compute the fraction of words in a text that are not in the stopword list; press Ctrl+N to open the IDLE editor and add the following to the module
class WordsPro:
    # ... unusual_words as defined above ...
    def content_faction(text):
        stopwords = nltk.corpus.stopwords.words('english')
        content = [w for w in text if w.lower() not in stopwords]
        return len(content) / len(text)
Press Ctrl+S to save it locally as WordsPro.py, then press F5 to run it
>>> import nltk
>>> from nltk.corpus import reuters
>>> WordsPro.content_faction(nltk.corpus.reuters.words())
0.735240435097661
Word puzzle: a 3×3 grid contains 9 different letters; one letter is designated as obligatory, and you build words from the grid letters under the following rules:
1) Each word must be at least 4 letters long, and each letter may be used only once
2) There must be at least one 9-letter word
3) Finding 21 words is good, 32 is very good, and 42 is excellent
Python program:
>>> import nltk
>>> puzzle_letters = nltk.FreqDist('egivrvonl')
>>> obligatory = 'r'  # 'r' is chosen as the obligatory letter
>>> wordlist = nltk.corpus.words.words()
>>> [w for w in wordlist if len(w) >= 6 and obligatory in w and nltk.FreqDist(w) <= puzzle_letters]
Output
Lexical tools: Toolbox and Shoebox
Toolbox download: http://www-01.sil.org/computing/toolbox/
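NLTK can also read Toolbox files directly. A minimal sketch (this assumes the sample Toolbox data has been fetched, for example with nltk.download('toolbox'); rotokas.dic is the example file shipped with NLTK):

>>> from nltk.corpus import toolbox
>>> lexicon = toolbox.entries('rotokas.dic')   # each entry pairs a headword with its (field, value) pairs
>>> lexicon[0]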
9 Python in Practice: Segmenting a Text and Removing Stopwords (stopword package download)
1 Segment the text data into words
2 Build your own stopword corpus
3 Remove the stopwords
>>> from nltk.tokenize.stanford_segmenter import StanfordSegmenter
>>> segmenter = StanfordSegmenter(
        path_to_jar=r"E:\tools\stanfordNLTK\jar\stanford-segmenter.jar",
        path_to_slf4j=r"E:\tools\stanfordNLTK\jar\slf4j-api.jar",
        path_to_sihan_corpora_dict=r"E:\tools\stanfordNLTK\jar\data",
        path_to_model=r"E:\tools\stanfordNLTK\jar\data\pku.gz",
        path_to_dict=r"E:\tools\stanfordNLTK\jar\data\dict-chris6.ser.gz"
    )
>>> with open(r"C:\Users\cuitbnc\Desktop\dqdg.txt", "r+") as f:
        str = f.read()

>>> result = segmenter.segment(str)
>>> with open(r"C:\Users\cuitbnc\Desktop\text1.txt", "w") as wf:
        wf.write(result)

1122469
>>> from nltk.corpus import PlaintextCorpusReader
>>> corpus_root = r'E:\dict\StopWord'
>>> wordlists = PlaintextCorpusReader(corpus_root, '.*')
>>> wordlists.fileids()
['baidu.txt', 'chuangda.txt', 'hagongda.txt', 'zhongwen.txt', '中文停用詞庫.txt', '四川大學機器智能實驗室停用詞庫.txt']
>>> len(wordlists.words('hagongda.txt'))
977
>>> wordlists.words('hagongda.txt')[:100]
['———', '》),', ')÷(', '1', '-', '”,', ')、', '=(', ':', ...]
>>> stopwords = wordlists.words('hagongda.txt')
>>> content = [w for w in result.split() if w not in stopwords]  # the segmenter's output is space-separated, so split it into tokens before filtering