Natural Language Processing 1: Language Processing and Python (with errata)

This article is a repost. 2016-08-31 11:03. Natural Language Processing / Python

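I am learning natural language processing with Python, and these are my study notes.

Natural language processing in Python relies on the nltk library. I installed nltk with pip:

pip install nltk

Alternatively, download the whl file and install from that. (I recommend pip: it is simple and it works.)

Once installation finishes the library is ready to use, but you still need to download the data used in the exercises. Start ipython and enter the following two lines: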

>>>import nltk
>>>nltk.download()

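A downloader window then appears.

Select book, choose a download folder (I chose E:\nltk_data), and download the data.

Once the download finishes, you can verify that it succeeded: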

>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

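If the text above appears, the data was downloaded successfully.

Before running any of the operations below, make sure the data has been imported (from nltk.book import *).

prac1: Searching text

1. concordance('word to search for'): shows every occurrence of a word together with its surrounding context.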

>>>text1.concordance('monstrous')
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
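To make the idea concrete, here is a minimal sketch of what a concordance does, written in plain Python rather than with nltk. The function name simple_concordance, the width parameter, and the sample sentence are my own illustrations, not part of NLTK, and the real method formats its output differently.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    target = word.lower()  # match case-insensitively
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [tok] + right))
    return lines


# Tiny hand-made sample, not taken from the NLTK corpora.
sample = "the whale was of a most monstrous size and a monstrous bulk".split()
for line in simple_concordance(sample, "monstrous"):
    print(line)
```

Each printed line is one match with its neighbours, which is the same information the concordance view arranges into aligned columns.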

2. similar('word'): finds which other words appear in similar contexts:

>>>text1.similar('monstrous')
exasperate imperial gamesome candid subtly contemptible lazy part
pitiable delightfully domineering puzzled determined vexatious
modifies fearless christian horrible mouldy doleful
>>>text2.similar('monstrous')
very heartily so exceedingly a extremely good great remarkably
amazingly sweet as vast

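As you can see, text1 and text2 use the word monstrous with completely different meanings; in text2, monstrous has a positive connotation.

3. common_contexts(['word1', 'word2', ...]): examines the contexts shared by two or more words.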

>>>text2.common_contexts(['monstrous','very'])
a_lucky be_glad am_glad a_pretty is_pretty
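The output above lists contexts in the form left_right. A rough sketch of the idea in plain Python follows, assuming a "context" is simply the pair of tokens immediately before and after a word. The helpers contexts and shared_contexts and the sample sentence are hypothetical, and NLTK's own implementation may normalize or rank contexts differently.

```python
def contexts(tokens, word):
    """Collect the (previous, next) token pairs around each occurrence of `word`."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i] == word:
            found.add((tokens[i - 1], tokens[i + 1]))
    return found


def shared_contexts(tokens, words):
    """Return the contexts in which every word in `words` occurs."""
    sets = [contexts(tokens, w) for w in words]
    return set.intersection(*sets) if sets else set()


# Hand-made sample sentence, not from the NLTK corpora.
sample = ("i am very glad and she is very pretty but "
          "he is monstrous pretty and i am monstrous glad").split()
shared = shared_contexts(sample, ["monstrous", "very"])
print(sorted(f"{left}_{right}" for left, right in shared))
```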

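4. dispersion_plot(): lexical dispersion plot. Each vertical stripe marks one occurrence of a word, and each row represents the entire text. (Note: this function depends on the numpy and matplotlib libraries.)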

>>>text4.dispersion_plot(['citizens','democracy','freedom','duties','America'])
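The data behind such a plot is just the list of positions at which each word occurs. The sketch below computes those positions in plain Python and draws a crude text version; word_offsets and the sample sentence are my own illustrations and are not how NLTK draws the figure.

```python
def word_offsets(tokens, words):
    """Map each target word to the token positions where it occurs."""
    targets = set(words)
    offsets = {w: [] for w in words}
    for position, tok in enumerate(tokens):
        if tok in targets:
            offsets[tok].append(position)
    return offsets


# Hand-made sample, not from the Inaugural Address Corpus.
sample = "citizens value freedom and citizens defend freedom and democracy".split()
offsets = word_offsets(sample, ["citizens", "democracy", "freedom", "duties"])
for word, positions in offsets.items():
    # One row per word; '|' marks a position where the word occurs.
    row = "".join("|" if i in positions else "." for i in range(len(sample)))
    print(f"{word:>10} {row}")
```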

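prac2: Counting vocabulary

The main functions for counting vocabulary are len(), sorted() for sorting, and set() for getting the unique words with duplicates removed. They behave exactly as they do elsewhere in Python, so I will not repeat the details here.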
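For reference, here is how the three functions combine on a small token list. The sample sentence is made up, and lexical_diversity is a helper name I am assuming for illustration (the ratio of distinct tokens to total tokens).

```python
def lexical_diversity(tokens):
    """Ratio of distinct tokens to total tokens (0.0 for an empty list)."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


sample = "in the beginning God created the heaven and the earth".split()

total = len(sample)           # number of tokens, duplicates included
distinct = len(set(sample))   # number of unique tokens
ordered = sorted(set(sample)) # unique tokens in sorted order (capitals sort first)

print(total, distinct)
print(ordered)
print(lexical_diversity(sample))
```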

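prac3: Simple statistics

Many of the functions in this part no longer work under Python 3 as written in the book: some usages need adapting, and some are not available at all.

1. Frequency distribution: FreqDist(text) counts how often each token occurs in the text. The usage shown in the book needs adapting on Python 3.

For example, to find the 50 most frequent words in text1 (Moby Dick):

Original version:

>>>fdist1=FreqDist(text1)
>>>vocabulary1=fdist1.keys()
>>>vocabulary1[:50]

But this does not work in Python 3, where keys() cannot be sliced and is not sorted by frequency. We need to sort the items ourselves: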

>>> fdist1 = FreqDist(text1)
>>> len(fdist1)
19317
>>> vocabulary1 = sorted(fdist1.items(), key=lambda jj: jj[1], reverse=True)
>>> s = []
>>> for i in range(len(vocabulary1)):
...     s.append(vocabulary1[i][0])
...
>>> print(s[:50])
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was',
 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one',
'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']

Throughout the book, anything that involves FreqDist needs this kind of adjustment.
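A shorter route may be available. As far as I know, FreqDist in NLTK 3 is built on collections.Counter, so fdist1.most_common(50) should return the top 50 without manual sorting; treat that as an assumption to check against your installed version. The sketch below shows the same idea with Counter itself on a made-up token list.

```python
from collections import Counter

# Hand-made sample, not from text1.
sample = "the whale , the sea , and the ship".split()
counts = Counter(sample)

# most_common(n) returns the n most frequent (token, count) pairs, highest first.
top = counts.most_common(2)
print(top)

# Keep only the tokens, mirroring the list `s` built above.
top_tokens = [token for token, _ in top]
print(top_tokens)
```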

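2. Fine-grained word selection: this uses Python list comprehensions.

For example, to select the words in text1 that are longer than 15 characters:

>>> V = sorted([w for w in set(text1) if len(w) > 15])
>>> V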
['CIRCUMNAVIGATION',
 'Physiognomically',
 'apprehensiveness',
 'cannibalistically',
 'characteristically',
 'circumnavigating',
 'circumnavigation',
 'circumnavigations',
 'comprehensiveness',
 'hermaphroditical',
 'indiscriminately',
 'indispensableness',
 'irresistibleness',
 'physiognomically',
 'preternaturalness',
 'responsibilities',
 'simultaneousness',
 'subterraneousness',
 'supernaturalness',
 'superstitiousness',
 'uncomfortableness',
 'uncompromisedness',
 'undiscriminating',
 'uninterpenetratingly']

Other tests you can use as the condition: s.startswith(t), s.endswith(t), t in s, s.islower(), s.isupper(), s.isalpha() (all letters), s.isalnum() (letters and digits), s.isdigit(), s.istitle().
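Here is a small illustration of these string tests used inside list comprehensions. The word list is made up for the example; with the NLTK data loaded, set(text1) could take its place.

```python
# Hand-made word list for illustration only.
sample = ["Moby", "Dick", "whale", "1851", "SEA", "sea-bird", "ableness", "x2"]

starts_with_s = [w for w in sample if w.startswith("s")]      # case-sensitive prefix test
ends_with_ness = [w for w in sample if w.endswith("ness")]    # suffix test
title_case = [w for w in sample if w.istitle()]               # initial capital, rest lowercase
digits_only = [w for w in sample if w.isdigit()]              # digits only
alphabetic_lower = [w for w in sample if w.isalpha() and w.islower()]
contains_hyphen = [w for w in sample if "-" in w]             # substring test

print(starts_with_s, ends_with_ness, title_case)
print(digits_only, alphabetic_lower, contains_hyphen)
```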

3. Collocations and bigrams: the collocations() function handles this for us.

For example, to view the collocations in text4:

>>>text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties
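A bigram is simply a pair of adjacent tokens. The sketch below builds bigrams with zip and counts the ones that repeat; it uses a made-up sentence and a hypothetical bigrams helper. This is a simplification: as I understand it, collocations() also scores and filters the pairs rather than relying on raw counts alone.

```python
from collections import Counter


def bigrams(tokens):
    """Return the list of adjacent token pairs."""
    return list(zip(tokens, tokens[1:]))


# Hand-made sample, not from text4.
sample = "fellow citizens of the United States , fellow citizens and friends".split()
pair_counts = Counter(bigrams(sample))

# Keep only the pairs seen more than once.
repeated = [pair for pair, n in pair_counts.most_common() if n > 1]
print(repeated)
```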

Chapter 1 of the book also mentions a translation function, babelize_shell(). Calling it raises the following error:
NameError: name 'babelize_shell' is not defined

The reason is that this module is no longer available.

With Python's conditional branches and loops, it is easy to do simple processing of text.
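As a small example of combining a loop with conditional branches, the sketch below labels each token by its form. The describe function, its labels, and the token list are my own illustration.

```python
def describe(tokens):
    """Label each token as lowercase, titlecase, or something else."""
    labels = []
    for token in tokens:
        if token.islower():
            labels.append((token, "lowercase word"))
        elif token.istitle():
            labels.append((token, "titlecase word"))
        else:
            labels.append((token, "punctuation or other"))
    return labels


sample = ["Call", "me", "Ishmael", "."]
for token, label in describe(sample):
    print(token, "is a", label)
```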
