Natural Language Processing (1): NLTK and Python

Preface: My current project is a search engine, which got me curious about natural language processing; I had also long wanted to learn Python but never found the time or opportunity. While browsing Amazon for books I came across "Python Natural Language Processing" (Natural Language Processing with Python), which immediately looked like a great way to get started with both NLP and Python at once. So I will be working through this book over the coming days and writing these notes.

1. NLTK Overview

NLTK modules and their functionality:

Language processing task     NLTK module                  Functionality
Accessing corpora            nltk.corpus                  Standardized interfaces to corpora and lexicons
String processing            nltk.tokenize, nltk.stem     Tokenization, sentence splitting, stemming
Collocation discovery        nltk.collocations            t-test, chi-squared, pointwise mutual information
Part-of-speech tagging       nltk.tag                     n-gram, backoff, Brill, HMM, TnT
Classification               nltk.classify, nltk.cluster  Decision trees, maximum entropy, naive Bayes, EM, k-means
Chunking                     nltk.chunk                   Regular expressions, n-grams, named entities
Parsing                      nltk.parse                   Chart, feature-based, unification, probabilistic, dependency
Semantic interpretation      nltk.sem, nltk.inference     Lambda calculus, first-order logic, model checking
Evaluation metrics           nltk.metrics                 Precision, recall, agreement coefficients
Probability and estimation   nltk.probability             Frequency distributions, smoothed probability distributions
Applications                 nltk.app, nltk.chat          Graphical concordancer, parsers, WordNet browser, chatbots
Linguistic fieldwork         nltk.toolbox                 Manipulating data in SIL Toolbox format

2. Installing NLTK

  My Python version is 2.7.5 and my NLTK version is 2.0.4:

DESCRIPTION
    The Natural Language Toolkit (NLTK) is an open source Python library
    for Natural Language Processing.  A free online book is available.
    (If you use the library for academic research, please cite the book.)

    Steven Bird, Ewan Klein, and Edward Loper (2009).
    Natural Language Processing with Python.  O'Reilly Media Inc.
    http://nltk.org/book

    @version: 2.0.4

The installation steps are the same as at http://www.nltk.org/install.html:

1. Install Setuptools: http://pypi.python.org/pypi/setuptools

  setuptools-5.7.tar.gz is at the very bottom of the page.

2. Install Pip: run sudo easy_install pip (this must be run as root)

3. Install Numpy (optional): run sudo pip install -U numpy

4. Install NLTK: run sudo pip install -U nltk

5. Start Python and enter the following commands:

192:chapter2 rcf$ python
Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()

When the NLTK downloader interface appears, download nltk_data.

Alternatively, you can download the data packages directly from http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml and move them into the download directory. That is what I did.

Finally, run the following command in Python; output like the following means the installation succeeded:

from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

3. First Steps with NLTK

  Now to the main topic. Since I have never studied Python, using NLTK doubles as a way of learning Python. This first session mainly uses the sample data that ships with NLTK, listed in the output above; all of it lives in nltk.book.

3.1 Searching Text

concordance: search for occurrences of monstrous in text1

>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
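Under the hood, concordance just locates each occurrence of the word and prints a fixed-width window of context around it. A minimal sketch of the idea using only the standard library (the function name, token list, and window size here are illustrative, not NLTK APIs):

```python
# A stdlib-only sketch of what concordance does: find each occurrence of a
# word in a token list and collect a fixed window of context around it.
def show_concordance(tokens, word, width=4):
    """Return a context-window string for each occurrence of `word`."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok] + right))
    return hits

tokens = "one was of a most monstrous size and most monstrous bulk".split()
for line in show_concordance(tokens, "monstrous", width=2):
    print(line)
```

The real method also pads each window to a fixed column width so the keyword lines up vertically, as in the output above.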

similar: find the words in text1 that appear in contexts similar to monstrous

>>> text1.similar("monstrous")
Building word-context index...
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
horrible impalpable imperial lamentable lazy loving

dispersion_plot: show where words occur in a text, i.e. their offsets, as a dispersion plot

>>> text4.dispersion_plot(["citizens","democracy","freedom","duties","America"])
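dispersion_plot draws its figure with matplotlib, but the underlying data is just the position of each occurrence of each target word. This stdlib-only sketch computes those offsets for a made-up token list:

```python
# Compute the data behind a dispersion plot: the offset (token position)
# of every occurrence of each target word. Token list is made up.
tokens = "freedom and duties and freedom for citizens".split()
targets = ["freedom", "citizens"]

offsets = {t: [i for i, w in enumerate(tokens) if w == t] for t in targets}
print(offsets)  # {'freedom': [0, 4], 'citizens': [6]}
```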

3.2 Counting Vocabulary

len: get a length, either the number of tokens in a text or the length of a single word

>>> len(text1)      # number of tokens in text1
260819
>>> len(set(text1)) # number of distinct tokens in text1
19317
>>> len(text1[0])   # length of the first token in text1
1

sorted: sorting

>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> sorted(sent1)
['.', 'Call', 'Ishmael', 'me']
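These counts work the same way on any plain Python list of tokens, since a Text is essentially a token sequence. A small self-contained sketch (the token list is made up for illustration):

```python
# Vocabulary counting on a plain token list, mirroring the len/set/sorted
# calls above.
tokens = ['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'Ishmael', '.']

total = len(tokens)        # total number of tokens: 8
vocab = len(set(tokens))   # number of distinct word types: 4
diversity = total / vocab  # average uses per type (lexical diversity): 2.0

print(sorted(set(tokens))) # ['.', 'Call', 'Ishmael', 'me']
```

Note that sorted() places punctuation and capitalized words first, because it orders tokens by their character codes.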

3.3 Frequency Distributions

nltk.probability.FreqDist

>>> fdist1=FreqDist(text1)  # build the frequency distribution of text1
>>> fdist1                  # text1 has 19317 samples and 260819 outcomes in total
<FreqDist with 19317 samples and 260819 outcomes> 
>>> keys=fdist1.keys()       
>>> keys[:50]               # the 50 most frequent samples of text1
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
>>> fdist1.items()[:50]     # sample counts, e.g. ',' occurs 18713 times out of 260819 tokens
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
>>> fdist1.hapaxes()[:50]   # samples that occur only once in text1
['!\'"', '!)"', '!*', '!--"', '"...', "',--", "';", '):', ');--', ',)', '--\'"', '---"', '---,', '."*', '."--', '.*--', '.--"', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '11', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '12', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130']
>>> fdist1['!\'"']
1
>>> fdist1.plot(50,cumulative=True) # plot the cumulative frequency of the 50 most frequent samples
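FreqDist behaves much like collections.Counter from the standard library. This toy sketch mirrors the calls above without requiring NLTK or its data (the token list is made up for illustration):

```python
from collections import Counter

# A Counter stands in for FreqDist on a toy token list.
tokens = "the whale , the whale , the sea".split()
fdist = Counter(tokens)

print(fdist["the"])          # 3: count of a given sample
print(fdist.most_common(1))  # [('the', 3)]: most frequent sample first

# FreqDist.hapaxes() lists samples occurring exactly once; with a Counter
# the same set falls out of a comprehension over the counts:
hapaxes = [w for w, n in fdist.items() if n == 1]
print(hapaxes)               # ['sea']
```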

3.4 Fine-Grained Selection of Words

>>> long_words=[w for w in set(text1) if len(w) > 15]  # words in text1 longer than 15 characters
>>> sorted(long_words)                                 # in lexicographic order
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
>>> fdist1=FreqDist(text1)
>>> sorted([w for w in set(text1) if len(w) > 7 and fdist1[w] > 7])  # words in text1 longer than 7 characters that occur more than 7 times
['American', 'actually', 'afternoon', 'anything', 'attention', 'beautiful', 'carefully', 'carrying', 'children', 'commanded', 'concerning', 'considered', 'considering', 'difference', 'different', 'distance', 'elsewhere', 'employed', 'entitled', 'especially', 'everything', 'excellent', 'experience', 'expression', 'floating', 'following', 'forgotten', 'gentlemen', 'gigantic', 'happened', 'horrible', 'important', 'impossible', 'included', 'individual', 'interesting', 'invisible', 'involved', 'monsters', 'mountain', 'occasional', 'opposite', 'original', 'originally', 'particular', 'pictures', 'pointing', 'position', 'possibly', 'probably', 'question', 'regularly', 'remember', 'revolving', 'shoulders', 'sleeping', 'something', 'sometimes', 'somewhere', 'speaking', 'specially', 'standing', 'starting', 'straight', 'stranger', 'superior', 'supposed', 'surprise', 'terrible', 'themselves', 'thinking', 'thoughts', 'together', 'understand', 'watching', 'whatever', 'whenever', 'wonderful', 'yesterday', 'yourself']
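The same fine-grained filter can be expressed with the standard library alone: keep words above a length cutoff that occur more than a count cutoff. A toy sketch (token list and cutoffs are made up for illustration):

```python
from collections import Counter

# Filter a toy token list by word length and frequency, mirroring the
# comprehension above.
tokens = ("the important whale the important whale the important "
          "beautiful beautiful beautiful rare").split()
fdist = Counter(tokens)

selected = sorted(w for w in set(tokens) if len(w) > 7 and fdist[w] > 2)
print(selected)  # ['beautiful', 'important']
```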

3.5 Collocations and Bigrams

bigrams() builds the list of word pairs (bigrams):

>>> bigrams(['more','is','said','than','done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>> text1.collocations()
Building collocations list
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
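bigrams() simply pairs each token with its successor; the standard-library equivalent is zipping the list against itself shifted by one:

```python
# Build bigrams by pairing each word with the one that follows it.
words = ['more', 'is', 'said', 'than', 'done']
pairs = list(zip(words, words[1:]))
print(pairs)  # [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```

collocations() goes one step further: it ranks these pairs and keeps the ones that occur together more often than chance would predict, ignoring stopwords.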

3.6 Functions Defined for NLTK's Frequency Distributions

Example                     Description
fdist=FreqDist(samples)     Create a frequency distribution containing the given samples
fdist.inc(sample)           Increment the count for this sample
fdist['monstrous']          Count of the number of times a given sample occurred
fdist.freq('monstrous')     Frequency of a given sample
fdist.N()                   Total number of samples
fdist.keys()                The samples sorted in order of decreasing frequency
for sample in fdist:        Iterate over the samples in order of decreasing frequency
fdist.max()                 Sample with the greatest count
fdist.tabulate()            Tabulate the frequency distribution
fdist.plot()                Plot the frequency distribution
fdist.plot(cumulative=True) Plot the cumulative frequency distribution
fdist1 < fdist2             Test if samples in fdist1 occur less frequently than in fdist2

Finally, let us look at the class of text1. type shows a variable's type, and help() shows a class's attributes and methods. Whenever you want to look up a specific method, help() is very handy.

>>> type(text1)
<class 'nltk.text.Text'>
>>> help('nltk.text.Text')
Help on class Text in nltk.text:

nltk.text.Text = class Text(__builtin__.object)
 |  A wrapper around a sequence of simple (string) tokens, which is
 |  intended to support initial exploration of texts (via the
 |  interactive console).  Its methods perform a variety of analyses
 |  on the text's contexts (e.g., counting, concordancing, collocation
 |  discovery), and display the results.  If you wish to write a
 |  program which makes use of these analyses, then you should bypass
 |  the ``Text`` class, and use the appropriate analysis function or
 |  class directly instead.
 |  
 |  A ``Text`` is typically initialized from a given document or
 |  corpus.  E.g.:
 |  
 |  >>> import nltk.corpus
 |  >>> from nltk.text import Text
 |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
 |  
 |  Methods defined here:
 |  
 |  __getitem__(self, i)
 |  
 |  __init__(self, tokens, name=None)
 |      Create a Text object.
 |      
 |      :param tokens: The source text.
 |      :type tokens: sequence of str
 |  
 |  __len__(self)
 |  
 |  __repr__(self)
 |      :return: A string representation of this FreqDist.
 |      :rtype: string
 |  
 |  collocations(self, num=20, window_size=2)
 |      Print collocations derived from the text, ignoring stopwords.
 |      
 |      :seealso: find_collocations
 |      :param num: The maximum number of collocations to print.
 |      :type num: int
 |      :param window_size: The number of tokens spanned by a collocation (default=2)
 |      :type window_size: int
 |  
 |  common_contexts(self, words, num=20)
 |      Find contexts where the specified words appear; list
 |      most frequent common contexts first.
 |      
 |      :param word: The word used to seed the similarity search
 |      :type word: str
 |      :param num: The number of words to generate (default=20)
 |      :type num: int
 |      :seealso: ContextIndex.common_contexts()

4. Language Understanding Technologies

1. Word sense disambiguation

2. Anaphora resolution

3. Generating language output

4. Machine translation

5. Spoken dialogue systems

6. The meaning of text

5. Summary

Although this was my first encounter with Python and NLTK, I already find them convenient and easy to use, and I will keep studying them in depth.

 

