Python Natural Language Processing Study Notes: Information Extraction Steps and Chunking

Reposted article. Dated 2015-08-24 19:37.

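1. Information extraction model

　　Information extraction takes five steps. The raw input is an unprocessed string.

Step 1: sentence segmentation, done with nltk.sent_tokenize(text). The result is a list of strings.

Step 2: word tokenization, done with [nltk.word_tokenize(sent) for sent in sentences]. The result is a list of lists of strings.

Step 3: part-of-speech tagging, done with [nltk.pos_tag(sent) for sent in sentences]. The result is a list of lists of tuples.

The first three steps can be wrapped in a single function: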

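>>> import nltk
>>> def ie_preprocess(document):
...     sentences = nltk.sent_tokenize(document)                      # step 1: split into sentences
...     sentences = [nltk.word_tokenize(sent) for sent in sentences]  # step 2: split into words
...     sentences = [nltk.pos_tag(sent) for sent in sentences]        # step 3: tag parts of speech
...     return sentences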

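Step 4: entity detection. This step has to identify both defined entities (established conventional expressions and proper nouns) and undefined entities. The result is a list of trees.

Step 5: relation detection. This step looks for relations between the entities and records each one as a tuple. The final result is a list of tuples.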
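To make the data shapes of steps 4 and 5 concrete, here is a minimal sketch. The tree is written by hand, and find_relations is a hypothetical helper (not an NLTK function) that uses a deliberately naive heuristic: two NP chunks joined by a single preposition count as a relation. Real relation detection is more involved; this only shows a tree going in and tuples coming out.

```python
from nltk import Tree  # Tree itself needs no downloaded corpora or models


def is_np(node):
    """True for an NP chunk subtree, False for a plain (word, tag) leaf."""
    return isinstance(node, Tree) and node.label() == "NP"


def chunk_text(chunk):
    """Join the words of a chunk, dropping the POS tags."""
    return " ".join(word for word, _tag in chunk.leaves())


def find_relations(tree, link_tags=("IN",)):
    """Hypothetical heuristic: collect (NP, linking word, NP) triples."""
    children = list(tree)
    relations = []
    # Slide a window of three adjacent children over the sentence.
    for left, middle, right in zip(children, children[1:], children[2:]):
        if is_np(left) and is_np(right) and not isinstance(middle, Tree):
            word, tag = middle
            if tag in link_tags:
                relations.append((chunk_text(left), word, chunk_text(right)))
    return relations


# Step 4 output for one sentence, written by hand for illustration.
chunked = Tree("S", [
    Tree("NP", [("the", "DT"), ("market", "NN")]),
    ("for", "IN"),
    Tree("NP", [("system-management", "NN"), ("software", "NN")]),
])

# Step 5 output: a list of tuples.
relations = find_relations(chunked)
print(relations)
```

The link_tags parameter is an assumption of this sketch; widening it (for example to verb tags) changes which word sequences count as a relation.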

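2. Chunking

　　Chunking underlies step 4, entity detection. This note covers only one kind of chunk, the noun phrase chunk (NP-chunk). An NP-chunk is usually smaller than a complete noun phrase. For example, "the market for system-management software" is a single noun phrase, but it is split into two NP-chunks: "the market" and "system-management software". Prepositional phrases and subordinate clauses that modify a noun are left out of the NP-chunk, because they almost certainly contain further noun phrases.

　　One way to extract chunks from a tagged sentence is a chunk grammar whose rules are regular expressions over part-of-speech tags. Here is the sample code first: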

import nltk

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""
cp = nltk.RegexpParser(grammar)
sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), 
                 ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
     
>>> print(cp.parse(sentence)) 
(S
  (NP Rapunzel/NNP)
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))
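The same approach can be tried on the phrase from the start of this section. This is a sketch under two assumptions: the POS tags below are assigned by hand rather than produced by a tagger, and the rule is a variant that allows several nouns in a row (<NN>+), which the grammar above does not.

```python
import nltk

# Assumed variant rule: optional determiner, any adjectives, one or more nouns.
grammar = r"NP: {<DT>?<JJ>*<NN>+}"
cp = nltk.RegexpParser(grammar)

# Hand-tagged phrase (the tags are an assumption, not tagger output).
phrase = [("the", "DT"), ("market", "NN"), ("for", "IN"),
          ("system-management", "NN"), ("software", "NN")]

tree = cp.parse(phrase)
print(tree)

# Collect the text of every NP chunk in left-to-right order.
np_chunks = [" ".join(word for word, _tag in subtree.leaves())
             for subtree in tree.subtrees()
             if subtree.label() == "NP"]
print(np_chunks)
```

The preposition "for" matches no part of the rule, so it stays outside both chunks and the single noun phrase comes out as two NP-chunks.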

　　The grammar is a triple-quoted string of the following form:

"""ChunkName: {<tag pattern>...<tag pattern>}
              {...}"""

For example:

grammar = r"""
  NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""

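　　The curly braces enclose a chunking rule; a grammar may have one rule or several. When there is more than one, RegexpParser applies the rules in order, updating the chunk structure after each one, until every rule has been applied. nltk.RegexpParser(grammar) builds a chunk parser from the chunking rules, and cp.parse() runs that parser on the target sentence. The result is a tree: print it with print, or, after storing it as result = cp.parse(sentence), draw it with result.draw().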
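To see what each rule contributes, the sketch below runs the two rules separately and then together on the Rapunzel sentence, and walks the resulting tree with subtrees() instead of printing it. The helper np_phrases and the constant names are made up for this illustration; only RegexpParser, parse, subtrees, label and leaves are NLTK calls.

```python
import nltk

# The two rules from the grammar above, kept as separate strings.
DET_ADJ_NOUN = r"{<DT|PP\$>?<JJ>*<NN>}"   # determiner/possessive, adjectives, noun
PROPER_NOUNS = r"{<NNP>+}"                # sequences of proper nouns

sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
            ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]


def np_phrases(rules, tagged_sentence):
    """Build an NP grammar from the given rules and return the chunk texts."""
    grammar = "NP:\n" + "\n".join("  " + rule for rule in rules)
    tree = nltk.RegexpParser(grammar).parse(tagged_sentence)
    return [" ".join(word for word, _tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]


first_only = np_phrases([DET_ADJ_NOUN], sentence)
second_only = np_phrases([PROPER_NOUNS], sentence)
both = np_phrases([DET_ADJ_NOUN, PROPER_NOUNS], sentence)

print(first_only)
print(second_only)
print(both)
```

With both rules in place, each one adds its own chunk to the structure left by the previous one, which matches the two NP subtrees in the printed tree earlier.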

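　　A chunk grammar can also contain another kind of expression, the chink, which describes patterns we want to exclude from a chunk. It is written with the braces reversed: }tag pattern{. Chinking has three possible outcomes: (1) if the chink pattern matches an entire chunk, the whole chunk is removed; (2) if the match falls in the middle of a chunk, the chunk splits into two smaller chunks; (3) if the match is at the edge of a chunk, the chunk shrinks. Usage is as follows: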

grammar = r"""
  NP:
    {<.*>+}          # Chunk everything
    }<VBD|IN>+{      # Chink sequences of VBD and IN
  """
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
       ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),  ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser(grammar)
     
>>> print(cp.parse(sentence))
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)
  barked/VBD
  at/IN
  (NP the/DT cat/NN))
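The three chinking outcomes can also be checked one at a time. The sketch below assumes a short hand-tagged phrase and a hypothetical helper, chunk_all_then_chink, that first chunks everything and then applies a single chink pattern, in the same style as the grammar above.

```python
import nltk

phrase = [("a", "DT"), ("little", "JJ"), ("dog", "NN")]


def chunk_all_then_chink(tagged_sentence, chink_pattern):
    """Chunk the whole sentence as one NP, chink the pattern, return chunk texts."""
    grammar = "NP:\n  {<.*>+}\n  }" + chink_pattern + "{"
    tree = nltk.RegexpParser(grammar).parse(tagged_sentence)
    return [" ".join(word for word, _tag in subtree.leaves())
            for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")]


# Case 1: the chink matches the entire chunk, so the chunk is removed.
whole = chunk_all_then_chink(phrase, "<DT|JJ|NN>+")

# Case 2: the chink matches the middle, so the chunk splits in two.
middle = chunk_all_then_chink(phrase, "<JJ>")

# Case 3: the chink matches the edge, so the chunk shrinks.
edge = chunk_all_then_chink(phrase, "<NN>")

print(whole)
print(middle)
print(edge)
```

The chinked words are not deleted from the sentence; they remain in the tree as plain (word, tag) leaves outside any NP.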
