Natural Language Processing (1): NLTK and Python
Preface: My current project is a search engine, which made me curious about natural language processing; I had also long wanted to learn Python but never had the chance or the time. A few days ago, while browsing for books on Amazon, I came across *Natural Language Processing with Python*, and it immediately struck me as a great way to get started with both NLP and Python at once. So I will be working through this book over the coming weeks and writing up these notes.
## 1. NLTK Overview

NLTK modules and their functionality:
| Language processing task | NLTK module | Functionality |
| --- | --- | --- |
| Accessing corpora | nltk.corpus | Standardized interfaces to corpora and lexicons |
| String processing | nltk.tokenize, nltk.stem | Tokenizers, sentence splitters, stemmers |
| Collocation discovery | nltk.collocations | t-test, chi-squared, pointwise mutual information |
| Part-of-speech tagging | nltk.tag | n-gram, backoff, Brill, HMM, TnT |
| Classification | nltk.classify, nltk.cluster | Decision tree, maximum entropy, naive Bayes, EM, k-means |
| Chunking | nltk.chunk | Regular expressions, n-grams, named entities |
| Parsing | nltk.parse | Chart, feature-based, unification, probabilistic, dependency |
| Semantic interpretation | nltk.sem, nltk.inference | Lambda calculus, first-order logic, model checking |
| Evaluation metrics | nltk.metrics | Precision, recall, agreement coefficients |
| Probability and estimation | nltk.probability | Frequency distributions, smoothed probability distributions |
| Applications | nltk.app, nltk.chat | Graphical concordancer, parsers, WordNet browser, chatbots |
| Linguistic fieldwork | nltk.toolbox | Manipulating data in SIL Toolbox format |
## 2. Installing NLTK

I am using Python 2.7.5 and NLTK 2.0.4:
```
DESCRIPTION
    The Natural Language Toolkit (NLTK) is an open source Python library
    for Natural Language Processing. A free online book is available.
    (If you use the library for academic research, please cite the book.)

    Steven Bird, Ewan Klein, and Edward Loper (2009).
    Natural Language Processing with Python. O'Reilly Media Inc.
    http://nltk.org/book

    @version: 2.0.4
```
The installation steps are the same as at http://www.nltk.org/install.html:

1. Install Setuptools from http://pypi.python.org/pypi/setuptools (setuptools-5.7.tar.gz, at the very bottom of the page).
2. Install Pip: run `sudo easy_install pip` (this must be run with root privileges).
3. Install Numpy (optional): run `sudo pip install -U numpy`.
4. Install NLTK: run `sudo pip install -U nltk`.
5. Start python and enter the following commands:
```
192:chapter2 rcf$ python
Python 2.7.5 (default, Mar  9 2014, 22:15:05)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
```
When the downloader window appears, download the nltk_data packages.

Alternatively, you can download the data packages directly from http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml and drop them into the Download Directory. That is what I did.

Finally, run the following in the Python interpreter; if you see this output, the installation succeeded:
```python
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
```
## 3. First Steps with NLTK

Now to the main topic. Since I have never learned Python, using NLTK doubles as learning Python. This first session mainly uses the sample data bundled with NLTK, listed in the output above; it all lives in `nltk.book`.
### 3.1 Searching text

`concordance`: show every occurrence of a word in context; here we search text1 for "monstrous".
```python
>>> text1.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
```
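A concordance is conceptually simple: find every position of the target word and show it with a fixed window of context on each side. A minimal pure-Python sketch of the idea (the function, token list, and window sizes are illustrative, not NLTK's actual implementation; Python 3 syntax):

```python
def concordance(tokens, word, width=30, max_lines=10):
    """Return each occurrence of `word` aligned in a fixed-width context window."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == word.lower():
            left = " ".join(tokens[max(0, i - 10):i])
            right = " ".join(tokens[i + 1:i + 11])
            # pad/truncate so the match column lines up across rows
            lines.append(f"{left[-width:]:>{width}} {tok} {right[:width]}")
            if len(lines) >= max_lines:
                break
    return lines

tokens = "I saw a monstrous whale and then another monstrous shape".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```

Aligning the hits in one column, as NLTK does, makes recurring patterns around the word easy to spot.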
`similar`: find the words in text1 that appear in contexts similar to those of "monstrous".
```python
>>> text1.similar("monstrous")
Building word-context index...
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
horrible impalpable imperial lamentable lazy loving
```
`dispersion_plot`: draw a dispersion plot showing where each word occurs in the text, i.e. its positional offsets (this requires matplotlib to be installed).
```python
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
```
### 3.2 Counting vocabulary

`len`: get a length; this can be the total number of tokens in a text, or the length of a single word.
```python
>>> len(text1)         # total number of tokens in text1
260819
>>> len(set(text1))    # number of distinct tokens in text1
19317
>>> len(text1[0])      # length of the first token of text1
1
```
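The two counts above combine into a common lexical-diversity measure: the proportion of distinct words in a text. A small sketch on an illustrative token list (Python 3 syntax, so `/` is true division):

```python
def lexical_diversity(tokens):
    """Fraction of tokens that are distinct; higher means a richer vocabulary."""
    return len(set(tokens)) / len(tokens)

tokens = ["to", "be", "or", "not", "to", "be"]
print(lexical_diversity(tokens))  # 4 distinct tokens out of 6
```

For text1 this gives 19317 / 260819 ≈ 0.074, i.e. each distinct word is used about 13.5 times on average.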
`sorted`: sort a sequence.

```python
>>> sent1
['Call', 'me', 'Ishmael', '.']
>>> sorted(sent1)
['.', 'Call', 'Ishmael', 'me']
```
### 3.3 Frequency distributions
nltk.probability.FreqDist
```python
>>> fdist1 = FreqDist(text1)   # build the frequency distribution of text1
>>> fdist1                     # text1 has 19317 samples but 260819 outcomes in total
<FreqDist with 19317 samples and 260819 outcomes>
>>> keys = fdist1.keys()
>>> keys[:50]                  # the 50 most frequent samples of text1
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
```
```python
>>> fdist1.items()[:50]   # sample counts: e.g. ',' occurs 18713 times out of 260819 tokens
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
```
```python
>>> fdist1.hapaxes()[:50]  # samples that occur only once in text1 (hapaxes)
['!\'"', '!)"', '!*', '!--"', '"...', "',--", "';", '):', ');--', ',)', '--\'"', '---"', '---,', '."*', '."--', '.*--', '.--"', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '11', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '12', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130']
>>> fdist1['!\'"']
1
```
```python
>>> fdist1.plot(50, cumulative=True)   # cumulative frequency plot of the 50 most frequent samples
```
### 3.4 Fine-grained selection of words
```python
>>> long_words = [w for w in set(text1) if len(w) > 15]   # words in text1 longer than 15 characters, sorted alphabetically
>>> sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
>>> fdist1 = FreqDist(text1)
>>> sorted([w for w in set(text1) if len(w) > 7 and fdist1[w] > 7])   # words longer than 7 characters occurring more than 7 times
['American', 'actually', 'afternoon', 'anything', 'attention', 'beautiful', 'carefully', 'carrying', 'children', 'commanded', 'concerning', 'considered', 'considering', 'difference', 'different', 'distance', 'elsewhere', 'employed', 'entitled', 'especially', 'everything', 'excellent', 'experience', 'expression', 'floating', 'following', 'forgotten', 'gentlemen', 'gigantic', 'happened', 'horrible', 'important', 'impossible', 'included', 'individual', 'interesting', 'invisible', 'involved', 'monsters', 'mountain', 'occasional', 'opposite', 'original', 'originally', 'particular', 'pictures', 'pointing', 'position', 'possibly', 'probably', 'question', 'regularly', 'remember', 'revolving', 'shoulders', 'sleeping', 'something', 'sometimes', 'somewhere', 'speaking', 'specially', 'standing', 'starting', 'straight', 'stranger', 'superior', 'supposed', 'surprise', 'terrible', 'themselves', 'thinking', 'thoughts', 'together', 'understand', 'watching', 'whatever', 'whenever', 'wonderful', 'yesterday', 'yourself']
```
### 3.5 Collocations and bigrams

`bigrams()` produces the list of word pairs (bigrams) in a sequence, and `collocations()` lists frequent word pairs:
```python
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>> text1.collocations()
Building collocations list
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
```
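Conceptually, `bigrams()` just pairs each token with its successor, which plain Python can do with `zip` (a sketch of the idea, not NLTK's implementation):

```python
def bigrams(tokens):
    """Pair every token with the token that immediately follows it."""
    return list(zip(tokens, tokens[1:]))

print(bigrams(['more', 'is', 'said', 'than', 'done']))
# [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```

`collocations()` then goes one step further: among all bigrams, it keeps those that occur together more often than the individual word frequencies would predict.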
### 3.6 Functions defined for NLTK frequency distributions
| Example | Description |
| --- | --- |
| fdist = FreqDist(samples) | Create a frequency distribution containing the given samples |
| fdist.inc(sample) | Increment the count for this sample |
| fdist['monstrous'] | Count of the number of times a given sample occurred |
| fdist.freq('monstrous') | Frequency of a given sample |
| fdist.N() | Total number of samples |
| fdist.keys() | The samples sorted in order of decreasing frequency |
| for sample in fdist: | Iterate over the samples, in order of decreasing frequency |
| fdist.max() | Sample with the greatest count |
| fdist.tabulate() | Tabulate the frequency distribution |
| fdist.plot() | Graphical plot of the frequency distribution |
| fdist.plot(cumulative=True) | Cumulative plot of the frequency distribution |
| fdist1 < fdist2 | Test if samples in fdist1 occur less frequently than in fdist2 |
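Most rows of this table have close analogues in the standard library's `collections.Counter`, which is a convenient way to experiment with the concepts without loading a corpus (the token list below is illustrative; Python 3 syntax):

```python
from collections import Counter

tokens = ['the', 'whale', 'the', 'sea', 'the']
fdist = Counter(tokens)                       # like FreqDist(samples)

assert fdist['the'] == 3                      # fdist['the']: count of a sample
total = sum(fdist.values())                   # fdist.N(): total number of samples
assert total == 5
assert fdist['the'] / total == 0.6            # fdist.freq('the'): relative frequency
assert fdist.most_common(1)[0][0] == 'the'    # fdist.max(): sample with greatest count
print([w for w, _ in fdist.most_common()])    # samples in decreasing frequency, like fdist.keys()
```

Note that `Counter` has no plotting or tabulation methods; those are NLTK additions on top of the basic counting behaviour.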
Finally, let's look at the class of text1. `type` shows a variable's type, and `help()` lists a class's attributes and methods. Whenever you want to look up a specific method, `help()` is very handy.
```python
>>> type(text1)
<class 'nltk.text.Text'>
>>> help('nltk.text.Text')
Help on class Text in nltk.text:

nltk.text.Text = class Text(__builtin__.object)
 |  A wrapper around a sequence of simple (string) tokens, which is
 |  intended to support initial exploration of texts (via the
 |  interactive console).  Its methods perform a variety of analyses
 |  on the text's contexts (e.g., counting, concordancing, collocation
 |  discovery), and display the results.  If you wish to write a
 |  program which makes use of these analyses, then you should bypass
 |  the ``Text`` class, and use the appropriate analysis function or
 |  class directly instead.
 |
 |  A ``Text`` is typically initialized from a given document or
 |  corpus.  E.g.:
 |
 |  >>> import nltk.corpus
 |  >>> from nltk.text import Text
 |  >>> moby = Text(nltk.corpus.gutenberg.words('melville-moby_dick.txt'))
 |
 |  Methods defined here:
 |
 |  __getitem__(self, i)
 |
 |  __init__(self, tokens, name=None)
 |      Create a Text object.
 |
 |      :param tokens: The source text.
 |      :type tokens: sequence of str
 |
 |  __len__(self)
 |
 |  __repr__(self)
 |      :return: A string representation of this FreqDist.
 |      :rtype: string
 |
 |  collocations(self, num=20, window_size=2)
 |      Print collocations derived from the text, ignoring stopwords.
 |
 |      :seealso: find_collocations
 |      :param num: The maximum number of collocations to print.
 |      :type num: int
 |      :param window_size: The number of tokens spanned by a collocation (default=2)
 |      :type window_size: int
 |
 |  common_contexts(self, words, num=20)
 |      Find contexts where the specified words appear; list
 |      most frequent common contexts first.
 |
 |      :param word: The word used to seed the similarity search
 |      :type word: str
 |      :param num: The number of words to generate (default=20)
 |      :type num: int
 |      :seealso: ContextIndex.common_contexts()
```
## 4. Techniques for Language Understanding

1. Word sense disambiguation
2. Anaphora (pronoun) resolution
3. Generating language output
4. Machine translation
5. Spoken dialogue systems
6. Textual entailment
## 5. Summary

Although this is my first contact with Python and NLTK, I already find them convenient and pleasant to use, and I will keep studying them in depth.