[Language Processing and Python] 7.5 Named Entity Recognition / 7.6 Relation Extraction

Reposted article, 2013-05-30 23:26, Python Natural Language Processing

7.5 Named Entity Recognition (NER)

The goal is to identify every textual mention of a named entity (NE).

This breaks down into two subtasks: identifying the boundaries of each NE, and identifying its type.

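NLTK provides a classifier that has already been trained to recognize named entities. If we set the parameter binary=True, named entities are tagged only as NE, with no type label; otherwise each entity also receives a type label such as GPE or PERSON. The code below shows both cases: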

>>> import nltk
>>> sent = nltk.corpus.treebank.tagged_sents()[22]
>>> print(nltk.ne_chunk(sent, binary=True))
(S
  The/DT
  (NE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (NE Brooke/NNP T./NNP Mossman/NNP)
  ...)
>>> print(nltk.ne_chunk(sent))
(S
  The/DT
  (GPE U.S./NNP)
  is/VBZ
  one/CD
  ...
  according/VBG
  to/TO
  (PERSON Brooke/NNP T./NNP Mossman/NNP)
  ...)
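
The two subtasks (boundaries and type) can be read directly off a chunked sentence. The following is a minimal sketch in plain Python, not NLTK's own API: the nested list `chunked` is a hypothetical stand-in for the tree that ne_chunk() returns, containing only the tokens visible in the output above, and `collect_entities` is an illustrative helper name. With a real nltk.Tree one would more likely iterate over tree.subtrees() and use label() and leaves().

```python
# Hypothetical stand-in for an ne_chunk() result (not an nltk.Tree):
# a plain (word, tag) tuple is a token outside any entity;
# a (label, [tokens]) pair is a named-entity chunk.
chunked = [
    ("The", "DT"),
    ("GPE", [("U.S.", "NNP")]),
    ("is", "VBZ"),
    ("one", "CD"),
    ("according", "VBG"),
    ("to", "TO"),
    ("PERSON", [("Brooke", "NNP"), ("T.", "NNP"), ("Mossman", "NNP")]),
]


def collect_entities(items):
    """Return a (type, text) pair for every entity chunk in the sequence."""
    entities = []
    for first, second in items:
        if isinstance(second, list):  # a chunk: its label plus its tokens
            text = " ".join(word for word, _tag in second)
            entities.append((first, text))
    return entities


entities = collect_entities(chunked)
print(entities)
```

The chunk's extent gives the entity boundary and its label gives the type; with binary=True every label would simply be NE.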

7.6 Relation Extraction

Once the named entities in a text have been identified, we can extract the relations that hold between them.

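One way to approach this task is to look for all triples of the form (X, α, Y), where X and Y are named entities of the required types and α is the string of words that occurs between them. We can then use a regular expression to pick out just those instances of α that express the relation we are looking for. The example in this section searches for strings that contain the word in.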

The special regular expression (?!\b.+ing) is a negative lookahead assertion. It lets us disregard strings such as success in supervising the transition of, where in is followed by a gerund.
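
To see what the lookahead does in isolation, the pattern can be tried on a few candidate strings for α using only the standard re module. This is a small sketch: the list `fillers` is made up for illustration from phrases quoted in this section, and the use of match() (which anchors at the start of the string) is an assumption about how the pattern is applied.

```python
import re

# Same pattern as in this section: the word "in", not followed later
# in the string by a word ending in "ing".
IN = re.compile(r'.*\bin\b(?!\b.+ing)')

fillers = [
    "in",
    ", based in",
    "success in supervising the transition of",
]

# Keep only the strings that the pattern accepts.
accepted = [f for f in fillers if IN.match(f)]
print(accepted)
```

The third string is rejected because the text after in contains supervising, so the negative lookahead fails.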

>>> import re
>>> import nltk
>>> IN = re.compile(r'.*\bin\b(?!\b.+ing)')
>>> for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
...     for rel in nltk.sem.extract_rels('ORG', 'LOC', doc,
...                                      corpus='ieer', pattern=IN):
...         # rtuple() is the NLTK 3 name; NLTK 2 called it show_raw_rtuple()
...         print(nltk.sem.rtuple(rel))
[ORG: 'WHYY'] 'in' [LOC: 'Philadelphia']
[ORG: 'McGlashan &AMP; Sarrail'] 'firm in' [LOC: 'San Mateo']
[ORG: 'Freedom Forum'] 'in' [LOC: 'Arlington']
[ORG: 'Brookings Institution'] ', the research group in' [LOC: 'Washington']
[ORG: 'Idealab'] ', a self-described business incubator based in' [LOC: 'Los Angeles']
[ORG: 'Open Text'] ', based in' [LOC: 'Waterloo']
[ORG: 'WGBH'] 'in' [LOC: 'Boston']
[ORG: 'Bastille Opera'] 'in' [LOC: 'Paris']
[ORG: 'Omnicom'] 'in' [LOC: 'New York']
[ORG: 'DDB Needham'] 'in' [LOC: 'New York']
[ORG: 'Kaplan Thaler Group'] 'in' [LOC: 'New York']
[ORG: 'BBDO South'] 'in' [LOC: 'Atlanta']
[ORG: 'Georgia-Pacific'] 'in' [LOC: 'Atlanta']
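
Each line of the output above is one (X, α, Y) triple. The idea of pairing neighbouring entities with the words between them can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not how extract_rels() is implemented: `make_triples` is a hypothetical helper, and `items` is a made-up token sequence that reuses two entity pairs from the output.

```python
def make_triples(items):
    """Pair each entity with the next one, keeping the words between them.

    `items` mixes plain word strings with (label, text) entity tuples.
    Returns a list of (X, alpha, Y) triples.
    """
    triples = []
    left = None   # most recent entity seen so far
    filler = []   # words seen since that entity
    for item in items:
        if isinstance(item, tuple):          # an entity
            if left is not None:
                triples.append((left, " ".join(filler), item))
            left, filler = item, []
        else:                                # an ordinary word
            filler.append(item)
    return triples


# Made-up sequence for illustration only.
items = [
    "The", ("ORG", "Brookings Institution"), ",", "the", "research", "group", "in",
    ("LOC", "Washington"), "and", ("ORG", "WGBH"), "in", ("LOC", "Boston"),
]

triples = make_triples(items)

# Keep only the triples whose entity types are ORG followed by LOC.
org_loc = [(x[1], alpha, y[1])
           for x, alpha, y in triples
           if x[0] == "ORG" and y[0] == "LOC"]
print(org_loc)
```

A pattern such as IN would then be applied to the middle element of each remaining triple.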

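As shown earlier, the Dutch section of the CoNLL2002 named entity corpus contains not just named entity annotation but also part-of-speech tags. This lets us devise patterns that are sensitive to these tags, as in the next example. The show_clause() method (named clause() in NLTK 3) prints the relations in clausal form, where the binary relation symbol is given as the value of the parameter relsym.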

>>> import re
>>> import nltk
>>> from nltk.corpus import conll2002
>>> vnv = """
... (
... is/V|    # 3rd sing present and
... was/V|   # past forms of the verb zijn ('be')
... werd/V|  # and also present
... wordt/V  # past of worden ('become')
... )
... .*       # followed by anything
... van/Prep # followed by van ('of')
... """
>>> VAN = re.compile(vnv, re.VERBOSE)
>>> for doc in conll2002.chunked_sents('ned.train'):
...     for r in nltk.sem.extract_rels('PER', 'ORG', doc,
...                                    corpus='conll2002', pattern=VAN):
...         # clause() is the NLTK 3 name; NLTK 2 called it show_clause()
...         print(nltk.sem.clause(r, relsym="VAN"))
VAN("cornet_d'elzius", 'buitenlandse_handel')
VAN('johan_rottiers', 'kardinaal_van_roey_instituut')
VAN('annie_lennox', 'eurythmics')
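
The tag-sensitive pattern can also be tried on its own against word/tag strings, which shows how re.VERBOSE ignores the whitespace and # comments inside the pattern. This is a sketch with the standard re module only; the strings in `fillers` are hypothetical examples written for illustration, not taken from the corpus.

```python
import re

# With re.VERBOSE, whitespace and "#" comments inside the pattern are
# ignored, so this is equivalent to (is/V|was/V|werd/V|wordt/V).*van/Prep
VAN = re.compile(r"""
    (
    is/V|
    was/V|
    werd/V|
    wordt/V     # one of four forms of zijn / worden, tagged as a verb
    )
    .*          # followed by anything
    van/Prep    # followed by van ('of'), tagged as a preposition
    """, re.VERBOSE)

# Hypothetical word/tag strings for illustration.
fillers = [
    "is/V directeur/N van/Prep",
    "wordt/V voorzitter/N van/Prep",
    "werkt/V bij/Prep",
    "is directeur van",          # no tags, so the pattern cannot match
]

matched = [f for f in fillers if VAN.match(f)]
print(matched)
```

Because the pattern names both the word and its tag, an untagged string or one with a different verb is rejected.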
