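These are my study notes from learning natural language processing with Python.
Natural language processing in Python relies on the nltk library. I installed nltk with pip:
pip install nltk
Alternatively, download the whl file and install from it. (I recommend pip: it is simple and works well.)
Once installed, the library is ready to use, but the data used in the book still has to be downloaded. Start ipython and enter these two lines: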
>>> import nltk
>>> nltk.download()
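This opens the NLTK downloader window.
Select book, choose a download directory (I chose E:\nltk_data), and download the data.
When the download finishes, you can check whether it succeeded: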
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
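If the text above appears, the data was downloaded successfully.
Before running anything below, make sure the data has been imported (from nltk.book import *).
prac1: Searching text:
1. concordance('word to search for'): shows every occurrence of the word together with its surrounding context.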
>>> text1.concordance('monstrous')
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
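To make the idea behind a concordance concrete, here is a minimal pure-Python sketch. It is not NLTK's implementation: the function name `concordance`, the `width` parameter, and the sample sentence are all made up for illustration. It matches case-insensitively, which is consistent with the output above (both "monstrous" and "Monstrous" are listed).

```python
def concordance(tokens, word, width=30):
    """Return one line per match of `word`, with up to `width` characters of context per side."""
    lines = []
    target = word.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            # Rebuild the text on each side and trim it to a fixed width.
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            # Right-align the left context so the matched words line up in a column.
            lines.append(f"{left:>{width}} {tok} {right}")
    return lines

# Hypothetical sample text, not taken from the NLTK corpora.
sample = "the monstrous whale and a most monstrous size".split()
for line in concordance(sample, "monstrous"):
    print(line)
```

Joining the whole prefix for every match is wasteful on a long text; it is kept here only because it keeps the sketch short.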
2. similar('word'): finds the words that appear in contexts similar to those of the given word:
>>> text1.similar('monstrous')
exasperate imperial gamesome candid subtly contemptible lazy part pitiable delightfully domineering puzzled determined vexatious modifies fearless christian horrible mouldy doleful
>>> text2.similar('monstrous')
very heartily so exceedingly a extremely good great remarkably amazingly sweet as vast
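The results show that text1 and text2 use the word monstrous with completely different meanings; in text2, monstrous carries a positive sense.
3. common_contexts(['word1', 'word2', ...]): examines the contexts shared by two or more words.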
>>> text2.common_contexts(['monstrous', 'very'])
a_lucky be_glad am_glad a_pretty is_pretty
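In the output above, each item such as a_pretty names the word before and the word after the searched word. A rough pure-Python sketch of that idea follows; the helper names and the sample sentence are hypothetical, and NLTK's own implementation may differ in details such as how it ranks the results.

```python
from collections import defaultdict

def contexts(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    ctx = defaultdict(set)
    lowered = [t.lower() for t in tokens]
    for prev, word, nxt in zip(lowered, lowered[1:], lowered[2:]):
        ctx[word].add((prev, nxt))
    return ctx

def common_contexts(tokens, words):
    """Return the contexts shared by every word in `words`, formatted as prev_next."""
    ctx = contexts(tokens)
    shared = set.intersection(*(ctx[w.lower()] for w in words))
    return sorted(f"{p}_{n}" for p, n in shared)

# Hypothetical sample sentence, written to contain one shared context.
sample = ("I am very glad and she is monstrous glad to be "
          "a very pretty and a monstrous pretty girl").split()
print(common_contexts(sample, ["monstrous", "very"]))  # ['a_pretty']
```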
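4. dispersion_plot(): a dispersion plot of word positions. Each row represents the entire text, and each vertical stripe in a row marks one occurrence of that row's word. (PS: this function depends on the numpy and matplotlib libraries.)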
>>>text4.dispersion_plot(['citizens','democracy','freedom','duties','America'])
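The data behind such a plot is simply the position of every occurrence of each word. The sketch below computes those positions without drawing anything; `word_offsets` and the sample tokens are made-up names for illustration, and matching here is case-sensitive by assumption.

```python
def word_offsets(tokens, words):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {w: [] for w in words}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets

# Hypothetical sample tokens.
sample = "citizens value freedom and freedom needs citizens".split()
print(word_offsets(sample, ["citizens", "freedom"]))
# {'citizens': [0, 6], 'freedom': [2, 4]}
```

Plotting each list along the x-axis, one row per word, would give the same kind of picture as dispersion_plot().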
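prac2: Counting vocabulary:
The main functions for counting vocabulary are len(), sorted() for sorting, and set() for obtaining the unique words with duplicates removed. They behave exactly as in plain Python, so I will not go over them again.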
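As a quick reminder of how these built-ins combine, here is a small sketch on a hypothetical token list (with the book's data you would pass text1 or similar instead). The variable names are only illustrative.

```python
# Hypothetical sample tokens.
tokens = "the whale the sea and the ship".split()

total = len(tokens)                         # number of tokens, repeats included
vocab = sorted(set(tokens))                 # unique tokens in alphabetical order
diversity = len(vocab) / total              # distinct tokens per token
share = 100 * tokens.count("the") / total   # percentage of tokens that are "the"

print(total)  # 7
print(vocab)  # ['and', 'sea', 'ship', 'the', 'whale']
print(round(diversity, 2), round(share, 1))
```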
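prac3: Simple statistics:
Many of the functions in this part no longer work as written under Python 3: some need to be adapted, and others cannot be used at all.
1. Frequency distribution: FreqDist(text) counts how often each token occurs in the text. The book's usage needs to be adapted for Python 3.
For example, let us find the 50 most frequent words in text1, Moby Dick:
Original version: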
>>> fdist1 = FreqDist(text1)
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
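This does not work in Python 3, however: keys() returns a view that cannot be sliced, and it is not ordered by frequency. We have to sort the entries ourselves: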
>>> fdist1 = FreqDist(text1)
>>> len(fdist1)
19317
>>> vocabulary1 = sorted(fdist1.items(), key=lambda jj: jj[1], reverse=True)
>>> s = []
>>> for i in range(len(vocabulary1)):
...     s.append(vocabulary1[i][0])
...
>>> print(s[:50])
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
Throughout the book, every example that uses FreqDist needs the same kind of adjustment.
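A shorter route may be available. In NLTK 3, FreqDist is built on Python's collections.Counter, so as far as I know fdist1.most_common(50) returns the 50 most frequent tokens directly. The sketch below shows the same idea with the standard-library Counter on a hypothetical token list, so it runs without the NLTK data.

```python
from collections import Counter

# Hypothetical sample tokens; with the book's data this would be text1.
tokens = "the whale , the sea , and the ship .".split()
counts = Counter(tokens)

# most_common(n) returns (token, count) pairs sorted by descending count.
top = [tok for tok, _ in counts.most_common(2)]
print(top)  # ['the', ',']
```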
2. Fine-grained selection of words: this uses Python list comprehensions.
For example, select the words in text1 that are longer than 15 characters:
>>> V = sorted([w for w in set(text1) if len(w) > 15])
>>> V
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
Other conditions that can be tested: s.startswith(t), s.endswith(t), t in s, s.islower(), s.isupper(), s.isalpha() (all letters), s.isalnum() (letters and digits), s.isdigit(), s.istitle()
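A few of these tests used inside list comprehensions, as a small sketch on a hypothetical word list (with the book's data you would iterate over set(text1) instead):

```python
# Hypothetical sample words.
words = ["Whale", "whale", "WHALE", "1851", "sea-gull", "gnt", "Chapter"]

ends_le = [w for w in words if w.endswith("le")]                  # ['Whale', 'whale']
titles = [w for w in words if w.istitle()]                        # ['Whale', 'Chapter']
alpha_lower = [w for w in words if w.isalpha() and w.islower()]   # ['whale', 'gnt']
digits = [w for w in words if w.isdigit()]                        # ['1851']
has_gnt = [w for w in words if "gnt" in w]                        # ['gnt']

print(ends_le, titles, alpha_lower, digits, has_gnt)
```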
3. Collocations and bigrams: the collocations() function handles this task for us.
For example, to view the collocations in text4:
>>>text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
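A bigram is just a pair of adjacent words, so the simplest approximation of the idea is to count adjacent pairs. The sketch below does only that. It is a crude stand-in, not what collocations() does: as I understand it, NLTK also filters out common words and scores pairs by how strongly they are associated rather than by raw frequency. The name `top_bigrams` and the sample sentence are hypothetical.

```python
from collections import Counter

def top_bigrams(tokens, n=3):
    """Return the n most frequent adjacent word pairs with their counts."""
    pairs = zip(tokens, tokens[1:])
    return Counter(pairs).most_common(n)

# Hypothetical sample tokens.
sample = "United States and the United States of the people".split()
print(top_bigrams(sample, 1))  # [(('United', 'States'), 2)]
```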
Chapter 1 of the book also uses a translation function, babelize_shell(). Calling it raises the following error:
NameError: name 'babelize_shell' is not defined
The reason is that this module is no longer available.
With Python's conditional branches and loops, simple processing of text is easy to do.
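As one small illustration of combining a loop with conditional branches, the sketch below labels each token in a hypothetical list. The label names are made up for this example.

```python
# Hypothetical sample tokens.
tokens = ["Call", "me", "Ishmael", ".", "Some", "years", "ago", "--"]

labels = []
for tok in tokens:
    if not tok.isalpha():
        labels.append((tok, "punctuation or other"))
    elif tok.istitle():
        labels.append((tok, "capitalized word"))
    else:
        labels.append((tok, "other word"))

for tok, label in labels:
    print(tok, "->", label)
```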