Trigrams and bigrams in NLTK


When processing English text, we often need to extract multi-word phrases for topic extraction, especially domain-specific ones, for example in financial annual reports. Mostly this comes down to extracting bigrams and trigrams. So how do we extract them? That is what this article covers.

1. nltk.bigrams(tokens) and nltk.trigrams(tokens)

If you only need to enumerate bigrams or trigrams, you can use nltk's bigrams() or trigrams() functions directly, as the following session shows:

>>> import nltk
>>> text = 'you are my sunshine, and all of things are so beautiful just for you.'
>>> tokens = nltk.wordpunct_tokenize(text)
>>> bigram = nltk.bigrams(tokens)
>>> bigram
<generator object bigrams at 0x025C1C10>
>>> list(bigram)
[('you', 'are'), ('are', 'my'), ('my', 'sunshine'), ('sunshine', ','), (',', 'and'), ('and', 'all'), ('all', 'of'), ('of', 'things'), ('things', 'are'), ('are', 'so'), ('so', 'beautiful'), ('beautiful', 'just'), ('just', 'for'), ('for', 'you'), ('you', '.')]
>>> trigram = nltk.trigrams(tokens)
>>> list(trigram)
[('you', 'are', 'my'), ('are', 'my', 'sunshine'), ('my', 'sunshine', ','), ('sunshine', ',', 'and'), (',', 'and', 'all'), ('and', 'all', 'of'), ('all', 'of', 'things'), ('of', 'things', 'are'), ('things', 'are', 'so'), ('are', 'so', 'beautiful'), ('so', 'beautiful', 'just'), ('beautiful', 'just', 'for'), ('just', 'for', 'you'), ('for', 'you', '.')]

2. nltk.ngrams(tokens, n)

If you need to enumerate four-word or even longer n-grams, use the general function ngrams(tokens, n), where n is the length of the n-gram. Its interface is uniform across lengths:

>>> nltk.ngrams(tokens, 2)
<generator object ngrams at 0x027AAF30>
>>> list(nltk.ngrams(tokens, 2))
[('you', 'are'), ('are', 'my'), ('my', 'sunshine'), ('sunshine', ','), (',', 'and'), ('and', 'all'), ('all', 'of'), ('of', 'things'), ('things', 'are'), ('are', 'so'), ('so', 'beautiful'), ('beautiful', 'just'), ('just', 'for'), ('for', 'you'), ('you', '.')]
>>> list(nltk.ngrams(tokens, 3))
[('you', 'are', 'my'), ('are', 'my', 'sunshine'), ('my', 'sunshine', ','), ('sunshine', ',', 'and'), (',', 'and', 'all'), ('and', 'all', 'of'), ('all', 'of', 'things'), ('of', 'things', 'are'), ('things', 'are', 'so'), ('are', 'so', 'beautiful'), ('so', 'beautiful', 'just'), ('beautiful', 'just', 'for'), ('just', 'for', 'you'), ('for', 'you', '.')]
>>> list(nltk.ngrams(tokens, 4))
[('you', 'are', 'my', 'sunshine'), ('are', 'my', 'sunshine', ','), ('my', 'sunshine', ',', 'and'), ('sunshine', ',', 'and', 'all'), (',', 'and', 'all', 'of'), ('and', 'all', 'of', 'things'), ('all', 'of', 'things', 'are'), ('of', 'things', 'are', 'so'), ('things', 'are', 'so', 'beautiful'), ('are', 'so', 'beautiful', 'just'), ('so', 'beautiful', 'just', 'for'), ('beautiful', 'just', 'for', 'you'), ('just', 'for', 'you', '.')]

3. Classes under nltk.collocations

nltk.collocations provides three finder classes: BigramCollocationFinder, TrigramCollocationFinder, and QuadgramCollocationFinder.

1) BigramCollocationFinder

This is a tool for finding and ranking bigrams. Rather than instantiating it directly, you normally build a finder with the class method from_words(). A finder exposes the following main methods:

above_score(self, score_fn, min_score): returns the n-grams whose score exceeds min_score, sorted from highest score to lowest. For this class those are of course bigrams. There are several definitions of the score, described in detail below.

apply_freq_filter(self, min_freq): removes n-grams that occur fewer than min_freq times.

apply_ngram_filter(self, fn): removes n-grams matching the condition fn. The condition is evaluated on the n-gram as a whole; if fn returns True for the n-gram, the n-gram is removed.

apply_word_filter(self, fn): removes n-grams matching the condition fn. Here the words of the n-gram are tested one by one; if any single word satisfies fn, the whole n-gram is removed.

nbest(self, score_fn, n): returns the n highest-scoring n-grams.

score_ngrams(self, score_fn): returns a sequence of (ngram, score) pairs, sorted from highest score to lowest.

>>> finder = nltk.collocations.BigramCollocationFinder.from_words(tokens)
>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> finder.nbest(bigram_measures.pmi, 10)
[(',', 'and'), ('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('sunshine', ','), ('are', 'my')]
>>> finder.nbest(bigram_measures.pmi, 100)
[(',', 'and'), ('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('sunshine', ','), ('are', 'my'), ('are', 'so'), ('for', 'you'), ('things', 'are'), ('you', '.'), ('you', 'are')]
>>> finder.apply_ngram_filter(lambda w1, w2: w1 in [',', '.'] and w2 in [',', '.'])
>>> finder.nbest(bigram_measures.pmi, 100)
[(',', 'and'), ('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('sunshine', ','), ('are', 'my'), ('are', 'so'), ('for', 'you'), ('things', 'are'), ('you', '.'), ('you', 'are')]
>>> finder.apply_word_filter(lambda x: x in [',', '.'])
>>> finder.nbest(bigram_measures.pmi, 100)
[('all', 'of'), ('and', 'all'), ('beautiful', 'just'), ('just', 'for'), ('my', 'sunshine'), ('of', 'things'), ('so', 'beautiful'), ('are', 'my'), ('are', 'so'), ('for', 'you'), ('things', 'are'), ('you', 'are')]
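The session above can also be condensed into a script. The sketch below (assuming the same example sentence) additionally exercises score_ngrams and apply_freq_filter, which the transcript does not show:

```python
import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

text = 'you are my sunshine, and all of things are so beautiful just for you.'
tokens = nltk.wordpunct_tokenize(text)

finder = BigramCollocationFinder.from_words(tokens)
measures = BigramAssocMeasures()

# score_ngrams returns (ngram, score) pairs sorted from highest to lowest score
scored = finder.score_ngrams(measures.raw_freq)
print(scored[:3])

# apply_freq_filter drops bigrams occurring fewer than min_freq times;
# every bigram in this one-sentence example occurs exactly once,
# so min_freq=2 removes them all
finder.apply_freq_filter(2)
print(finder.nbest(measures.raw_freq, 5))
```

On a single sentence every bigram gets the same raw frequency, so the frequency filter only becomes useful on a real corpus.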

2) TrigramCollocationFinder and QuadgramCollocationFinder

Usage is the same as for BigramCollocationFinder, except that TrigramCollocationFinder builds a trigram finder and QuadgramCollocationFinder a four-gram finder. The methods are the same as above.
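A minimal sketch of the trigram variant, assuming the same example sentence; the punctuation filter mirrors the bigram session above:

```python
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.metrics import TrigramAssocMeasures

text = 'you are my sunshine, and all of things are so beautiful just for you.'
tokens = nltk.wordpunct_tokenize(text)

finder = TrigramCollocationFinder.from_words(tokens)
measures = TrigramAssocMeasures()

# drop any trigram containing punctuation, then rank the rest by PMI
finder.apply_word_filter(lambda w: w in (',', '.'))
top = finder.nbest(measures.pmi, 5)
print(top)
```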

4. Counting n-gram frequencies

>>> sorted(finder.ngram_fd.items(), key=lambda t: (-t[1], t[0]))[:10]
[(('all', 'of'), 1), (('and', 'all'), 1), (('are', 'my'), 1), (('are', 'so'), 1), (('beautiful', 'just'), 1), (('for', 'you'), 1), (('just', 'for'), 1), (('my', 'sunshine'), 1), (('of', 'things'), 1), (('so', 'beautiful'), 1)]

### The key here is the sort criterion: sort first by t[1] (the frequency), where the minus sign gives descending order; ties are then broken by the n-gram itself (t[0]), ascending from a to z by default.
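Since finder.ngram_fd is an nltk FreqDist, most_common() gives the frequency ordering directly; the sorted(...) form above is only needed when you also want alphabetical tie-breaking. A fresh finder (no punctuation filter applied, unlike the continued session above) illustrates both:

```python
import nltk
from nltk.collocations import BigramCollocationFinder

text = 'you are my sunshine, and all of things are so beautiful just for you.'
tokens = nltk.wordpunct_tokenize(text)
finder = BigramCollocationFinder.from_words(tokens)

# ngram_fd is a FreqDist, so most_common gives the count ordering directly
print(finder.ngram_fd.most_common(3))

# sorting by (-count, ngram) additionally breaks ties alphabetically
top = sorted(finder.ngram_fd.items(), key=lambda t: (-t[1], t[0]))[:3]
print(top)
```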

5. Association scores

The NgramAssocMeasures classes (e.g. BigramAssocMeasures and TrigramAssocMeasures, exposed under nltk.collocations) provide several scores:

chi_sq(cls, n_ii, n_ix_xi_tuple, n_xx): scores each n-gram using the chi-square statistic.

pmi(cls, *marginals): scores each n-gram by pointwise mutual information.

likelihood_ratio(cls, *marginals): scores each n-gram by the likelihood ratio.

student_t(cls, *marginals): scores each n-gram with Student's t test, under an independence hypothesis for the individual words.

These are the most commonly used scores; there are many others, such as poisson_stirling, jaccard, fisher, and phi_sq.

>>> bigram_measures = nltk.collocations.BigramAssocMeasures()
>>> bigram_measures.student_t(8, (15828, 4675), 14307668)
0.9999319894802036
>>> bigram_measures.student_t(8, (42, 20), 14307668)
2.828406367705413
>>> bigram_measures.chi_sq(8, (15828, 4675), 14307668)
1.5488692067282201
>>> bigram_measures.chi_sq(59, (67, 65), 571007)
456399.76190356724
>>> bigram_measures.likelihood_ratio(110, (2552, 221), 31777)
270.721876936225
>>> bigram_measures.pmi(110, (2552, 221), 31777)
2.6317398492166078
>>> bigram_measures.pmi
<bound method type.pmi of <class 'nltk.metrics.association.BigramAssocMeasures'>>
>>> bigram_measures.likelihood_ratio
<bound method type.likelihood_ratio of <class 'nltk.metrics.association.BigramAssocMeasures'>>
>>> bigram_measures.chi_sq
<bound method type.chi_sq of <class 'nltk.metrics.association.BigramAssocMeasures'>>
>>> bigram_measures.student_t
<bound method type.student_t of <class 'nltk.metrics.association.BigramAssocMeasures'>>

6. Ranking and correlation

It is useful to consider the results of finding collocations as a ranking, and the rankings output using different association measures can be compared using the Spearman correlation coefficient.

Ranks can be assigned to a sorted list of results trivially by assigning strictly increasing ranks to each result:

>>> from nltk.metrics.spearman import *
>>> results_list = ['item1', 'item2', 'item3', 'item4', 'item5']
>>> print(list(ranks_from_sequence(results_list)))
[('item1', 0), ('item2', 1), ('item3', 2), ('item4', 3), ('item5', 4)]

If scores are available for each result, we may allow sufficiently similar results (differing by no more than rank_gap) to be assigned the same rank:

>>> results_scored = [('item1', 50.0), ('item2', 40.0), ('item3', 38.0),
...                   ('item4', 35.0), ('item5', 14.0)]
>>> print(list(ranks_from_scores(results_scored, rank_gap=5)))
[('item1', 0), ('item2', 1), ('item3', 1), ('item4', 1), ('item5', 4)]

The Spearman correlation coefficient gives a number from -1.0 to 1.0 comparing two rankings. A coefficient of 1.0 indicates identical rankings; -1.0 indicates exact opposite rankings.

>>> print('%0.1f' % spearman_correlation(
...         ranks_from_sequence(results_list),
...         ranks_from_sequence(results_list)))
1.0
>>> print('%0.1f' % spearman_correlation(
...         ranks_from_sequence(reversed(results_list)),
...         ranks_from_sequence(results_list)))
-1.0
>>> results_list2 = ['item2', 'item3', 'item1', 'item5', 'item4']
>>> print('%0.1f' % spearman_correlation(
...        ranks_from_sequence(results_list),
...        ranks_from_sequence(results_list2)))
0.6
>>> print('%0.1f' % spearman_correlation(
...        ranks_from_sequence(reversed(results_list)),
...        ranks_from_sequence(results_list2)))
-0.6

 

