Natural Language Processing with Python + NLTK, Part 6: Categorizing and Tagging Words (1)

Reposted article, dated 2018-04-09 22:07.

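A sentence is made up of words of different kinds: nouns, verbs, adjectives, adverbs, and so on. To understand a sentence, we first need to identify these word classes. Classifying words by their parts of speech (POS) and labeling them accordingly is called part-of-speech tagging.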

POS tagging requires a part-of-speech tagger. The code is as follows:

import nltk

text=nltk.word_tokenize("customer found there are abnormal issue")

print(nltk.pos_tag(text))

This raises an error, because the tagger resource cannot be found:

LookupError:

**********************************************************************

Resource averaged_perceptron_tagger not found.

Please use the NLTK Downloader to obtain the resource:

>>> import nltk

>>> nltk.download('averaged_perceptron_tagger')

Searched in:

- '/home/zhf/nltk_data'

- '/usr/share/nltk_data'

- '/usr/local/share/nltk_data'

- '/usr/lib/nltk_data'

- '/usr/local/lib/nltk_data'

- '/usr/nltk_data'

- '/usr/lib/nltk_data'

**********************************************************************

Run nltk.download to fetch the resource and, if necessary, copy the files into one of the search paths listed in the error message above:

>>> import nltk

>>> nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to

[nltk_data] /root/nltk_data...

[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.

True

Also download the accompanying tagset documentation:

>>> nltk.download('tagsets')

[nltk_data] Downloading package tagsets to /root/nltk_data...

[nltk_data] Unzipping help/tagsets.zip.

True

Running the tagging code again now gives:

[('customer', 'NN'), ('found', 'VBD'), ('there', 'EX'), ('are', 'VBP'), ('abnormal', 'JJ'), ('issue', 'NN')]

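This gives each word together with its part of speech. The tags returned here (NN, VBD, and so on) come from the Penn Treebank tagset. The table below lists a simplified tagset: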

Tag	Meaning	Examples
ADJ	adjective	new, good, high, special, big, local
ADV	adverb	really, already, still, early, now
CNJ	conjunction	and, or, but, if, while, although
DET	determiner	the, a, some, most, every, no
EX	existential	there, there’s
FW	foreign word	dolce, ersatz, esprit, quo, maitre
MOD	modal verb	will, can, would, may, must, should
N	noun	year, home, costs, time, education
NP	proper noun	Alison, Africa, April, Washington
NUM	number	twenty-four, fourth, 1991, 14:24
PRO	pronoun	he, their, her, its, my, I, us
P	preposition	on, of, at, with, by, into, under
TO	the word to	to
UH	interjection	ah, bang, ha, whee, hmpf, oops
V	verb	is, has, get, do, make, see, run
VD	past tense	said, took, told, made, asked
VG	present participle	making, going, playing, working
VN	past participle	given, taken, begun, sung
WH	wh determiner	who, which, when, what, where, how
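The tags in the pos_tag output earlier (NN, VBD, ...) are Penn Treebank tags, not the simplified labels in the table. As a rough illustration of how the two relate, the sketch below maps a handful of Penn Treebank tags onto the table's labels. The names PENN_TO_SIMPLE and simplify are made up for this example, and the mapping only covers the tags in the sample sentence; it is not an NLTK API.

```python
# Hypothetical, partial mapping from Penn Treebank tags to the simplified
# labels in the table above. It only covers the tags in the sample sentence.
PENN_TO_SIMPLE = {
    "NN": "N",     # singular noun
    "VBD": "VD",   # verb, past tense
    "VBP": "V",    # verb, non-3rd-person present
    "JJ": "ADJ",   # adjective
    "EX": "EX",    # existential "there"
}


def simplify(tagged_words):
    """Replace each Penn Treebank tag with its simplified label.

    Tags that are missing from the mapping are left unchanged.
    """
    return [(word, PENN_TO_SIMPLE.get(tag, tag)) for word, tag in tagged_words]


# The tagged sentence from the pos_tag example.
tagged = [("customer", "NN"), ("found", "VBD"), ("there", "EX"),
          ("are", "VBP"), ("abnormal", "JJ"), ("issue", "NN")]
simplified = simplify(tagged)
print(simplified)
```

A full mapping would need an entry for every tag the tagger can emit; unknown tags pass through untouched here so the gaps stay visible.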

If the input consists of word/tag strings, the str2tuple function can parse each one into a (word, tag) tuple. Usage:

[nltk.tag.str2tuple(t) for t in "customer/NN found/VBD there/EX are/VBP abnormal/JJ issue/NN".split()]

Result:

[('customer', 'NN'), ('found', 'VBD'), ('there', 'EX'), ('are', 'VBP'), ('abnormal', 'JJ'), ('issue', 'NN')]
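To make the parsing rule concrete, here is a standard-library sketch of what str2tuple does as I understand it: split at the last separator, upper-case the tag, and return None as the tag when there is no separator. The function name split_tagged_token is hypothetical and is not part of NLTK.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/TAG' into (word, TAG).

    Returns (token, None) when the separator is absent.
    """
    loc = token.rfind(sep)  # split at the LAST separator
    if loc < 0:
        return (token, None)
    return (token[:loc], token[loc + len(sep):].upper())


sentence = "customer/NN found/VBD there/EX are/VBP abnormal/JJ issue/NN"
pairs = [split_tagged_token(t) for t in sentence.split()]
print(pairs)

# A token with no "/" has no tag to extract, so the tag is None.
print(split_tagged_token("customer"))
```

If every tag comes back as None, the usual cause is that the tokens do not contain the separator that was passed in.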

The corpora bundled with NLTK also come with part-of-speech tags, for example the Brown corpus:

nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

With FreqDist and its plotting support, we can plot the frequency distribution of the tags, which makes the structure of the text easier to see:

brown_news_tagged=nltk.corpus.brown.tagged_words(categories='news')

tag_fd=nltk.FreqDist(tag for (word,tag) in brown_news_tagged)

tag_fd.plot(50,cumulative=True)

The result is a cumulative frequency plot of the 50 most frequent tags.
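For readers without the Brown corpus or a plotting backend, the same idea can be sketched with the standard library only. This is not the FreqDist implementation; it counts tags with collections.Counter on the six-word tagged sentence from earlier and prints the running totals that a cumulative plot would draw.

```python
from collections import Counter
from itertools import accumulate

# The tagged sentence from the pos_tag example, standing in for a corpus.
tagged = [("customer", "NN"), ("found", "VBD"), ("there", "EX"),
          ("are", "VBP"), ("abnormal", "JJ"), ("issue", "NN")]

# Count how often each tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for (word, tag) in tagged)

# Most frequent tags first, then the running (cumulative) totals.
ranked = tag_counts.most_common()
cumulative = list(accumulate(count for (tag, count) in ranked))

for (tag, count), running in zip(ranked, cumulative):
    print(f"{tag:4} {count:2} {running:2}")
```

The last cumulative value always equals the number of tokens, which is a quick sanity check on the counts.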

Suppose we are studying a word and want to see how it is used in text, for example which words follow it. The approach below lists the words that follow often:

brown_learned_text=nltk.corpus.brown.words(categories='learned')

ret=sorted(set(b for(a,b) in nltk.bigrams(brown_learned_text) if a=='often'))

print(ret)

This uses the bigrams function, which produces pairs of adjacent words (bigrams).

For example, generating the bigrams of the following text:

for word in nltk.bigrams("customer found there are abnormal issue".split()):
    print(word)

Result:

('customer', 'found')

('found', 'there')

('there', 'are')

('are', 'abnormal')

('abnormal', 'issue')
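The pairing step itself needs nothing beyond the standard library. The sketch below is an illustration of the idea, not NLTK's code: it builds adjacent pairs with zip and then collects the words that follow a target word. The helper names adjacent_pairs and followers, and the sample text, are invented for this example.

```python
def adjacent_pairs(words):
    """Return each pair of neighbouring items, a sliding window of width 2."""
    return list(zip(words, words[1:]))


def followers(words, target):
    """Return the sorted set of words that directly follow `target`."""
    return sorted({b for (a, b) in adjacent_pairs(words) if a == target})


pairs = adjacent_pairs("customer found there are abnormal issue".split())
print(pairs)

# Invented sample text, used only for illustration.
text = "we often test and often fix but we often test again".split()
after_often = followers(text, "often")
print(after_often)
```

Using a set removes duplicates, so "test" is reported once even though it follows the target twice.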

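Knowing which words follow is not enough; we also want to know the parts of speech of those words.

1 First, tag the words, producing (word, tag) pairs

2 Then build bigrams from the tagged words and, wherever the first word is often, collect the tag of the word that follows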

brown_learned_text=nltk.corpus.brown.tagged_words(categories='learned')

tags=[b[1] for (a,b) in nltk.bigrams(brown_learned_text) if a[0]=='often']

fd=nltk.FreqDist(tags)

fd.tabulate()

Result:

VBN  VB VBD  JJ  IN  QL   ,  CS  RB  AP VBG  RP VBZ QLP BEN WRB   .  TO  HV
 15  10   8   5   4   3   3   3   3   1   1   1   1   1   1   1   1   1   1
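The two steps above can be tried on a small scale without downloading the corpus. The sketch below uses a hand-written list of (word, tag) pairs with Brown-style tags; the data is invented for illustration, and collections.Counter stands in for FreqDist.

```python
from collections import Counter

# Invented sample with Brown-style tags, standing in for tagged_words().
tagged = [("it", "PPS"), ("is", "BEZ"), ("often", "RB"), ("used", "VBN"),
          ("and", "CC"), ("often", "RB"), ("seen", "VBN"), ("but", "CC"),
          ("people", "NNS"), ("often", "RB"), ("forget", "VB"), ("it", "PPO")]

# Pair every tagged word with its successor and keep the successor's tag
# whenever the first word is "often".
following_tags = Counter(
    next_tag
    for (word, tag), (next_word, next_tag) in zip(tagged, tagged[1:])
    if word == "often"
)

for tag, count in following_tags.most_common():
    print(tag, count)
```

On this toy data the past participle (VBN) is the most common tag after the target word, which matches the tendency in the real counts above.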

Similarly, to get three-word sequences (trigrams), use the trigrams function.
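Bigrams and trigrams are the same sliding window with different widths. As a minimal sketch, assuming the input is a list, a single helper can produce either one; ngrams here is a plain function written for this example, not the NLTK one.

```python
def ngrams(words, n):
    """Return all runs of n consecutive items as tuples.

    Returns an empty list when the input is shorter than n.
    """
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = "customer found there are abnormal issue".split()
triples = ngrams(words, 3)
for triple in triples:
    print(triple)
```

A sequence of k words yields k - n + 1 n-grams, so the six-word sentence gives four trigrams.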
