Text Tokenization
Earlier we discussed text structure, components, and representation. Specifically, a token is the smallest independent unit of text that carries its own syntactic and semantic meaning. A piece of text or a text document has several components, including sentences that can be further broken down into clauses, phrases, and words. The most popular text tokenization techniques are sentence tokenization and word tokenization, which split a text corpus into sentences and each sentence into words. Text tokenization can therefore be defined as the process of breaking or splitting text data into smaller, meaningful components, i.e., tokens.
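As a quick preview (an illustrative sketch, not from the original text), the snippet below uses NLTK to split a small piece of text first into sentences and then into words; it assumes NLTK and its punkt model are installed:

import nltk
# nltk.download('punkt')  # required once, on the first run

preview_text = ("Python is a really powerful programming language! "
                "It also has a clean, readable syntax.")
sentences = nltk.sent_tokenize(preview_text)              # first-level tokens: sentences
words = [nltk.word_tokenize(sent) for sent in sentences]  # second-level tokens: words
print(sentences)
print(words)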
Sentence Tokenization
Sentence tokenization is the process of splitting a text corpus into sentences, which form the first level of tokens in the corpus. This process is also known as sentence segmentation, since we attempt to split the text into meaningful sentences. Any text corpus is a collection of text in which each paragraph contains multiple sentences.
There are several techniques for sentence tokenization. The basic one is to look for specific delimiters between sentences, such as a period ( . ), a newline ( \n ), or a semicolon ( ; ) (a naive delimiter-based sketch follows the tokenizer list below). We will use the NLTK framework, which provides various interfaces for sentence tokenization, and focus mainly on the following sentence tokenizers:
- sent_tokenize
- PunktSentenceTokenizer
- RegexpTokenizer
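Before turning to these tokenizers, here is a minimal sketch of the naive delimiter-based approach mentioned above, splitting on periods, semicolons, and newlines with the standard re module (an illustrative baseline only; it will break on abbreviations and decimal numbers):

import re

raw_text = "This is one sentence. Here is another; and a third one.\nA fourth on a new line."
# Split on periods, semicolons, or newlines and drop empty fragments.
naive_sentences = [piece.strip()
                   for piece in re.split(r'[.;\n]', raw_text)
                   if piece.strip()]
print(naive_sentences)
# ['This is one sentence', 'Here is another', 'and a third one', 'A fourth on a new line']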
Before splitting text into sentences, we need some text to test the system on. Below we load some sample text as well as part of the Gutenberg corpus available through NLTK. The necessary dependencies can be loaded with the following snippet:
import nltk
from nltk.corpus import gutenberg
from pprint import pprint
Note: if this is your first run, you also need to execute:

import nltk
nltk.download('gutenberg')
This will download the required list of books. Once the download succeeds, run the following code to inspect it:
In [7]: nltk.corpus.gutenberg.fileids()
Out[7]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
If the download fails with an error, it indicates a network problem; use a machine that can reach the external NLTK data servers.
alice = gutenberg.raw(fileids='carroll-alice.txt')
sample_text = 'We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python code which you should remember when writing code! Python is a really powerful programming language!'
We can check the length of the Alice in Wonderland corpus and its first few lines with the following code:
In [12]: print(len(alice))
144395

In [13]: print(alice[0:100])
[Alice's Adventures in Wonderland by Lewis Carroll 1865]

CHAPTER I. Down the Rabbit-Hole

Alice was
The nltk.sent_tokenize function is NLTK's default and recommended sentence tokenizer. Internally it uses an instance of the PunktSentenceTokenizer class. However, it is not just an ordinary object or instance: it has been pretrained on models for several languages and works well on many languages besides English.
The following snippet shows the basic use of this function on our sample texts:
Note: on the first run you also need to execute:

nltk.download('punkt')
default_st = nltk.sent_tokenize
alice_sentences = default_st(text=alice)
sample_sentences = default_st(text=sample_text)

print('Total sentences in sample_text:', len(sample_sentences))
print('Sample text sentences :-')
pprint(sample_sentences)
print('\nTotal sentences in alice:', len(alice_sentences))
pprint(alice_sentences[0:5])
Running the snippet above produces the following output, which gives the total number of sentences and what those sentences look like in the text corpus:
Total sentences in sample_text: 3
Sample text sentences :-
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']

Total sentences in alice: 1625
["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
 'Down the Rabbit-Hole\n'
 '\n'
 'Alice was beginning to get very tired of sitting by her sister on the\n'
 'bank, and of having nothing to do: once or twice she had peeped into the\n'
 'book her sister was reading, but it had no pictures or conversations in\n'
 "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
 "conversation?'",
 'So she was considering in her own mind (as well as she could, for the\n'
 'hot day made her feel very sleepy and stupid), whether the pleasure\n'
 'of making a daisy-chain would be worth the trouble of getting up and\n'
 'picking the daisies, when suddenly a White Rabbit with pink eyes ran\n'
 'close by her.',
 'There was nothing so VERY remarkable in that; nor did Alice think it so\n'
 "VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
 'Oh dear!']
You should now be able to see that the sentence tokenizer is actually quite intelligent: it does not rely on periods alone to split sentences, but also takes other punctuation and word capitalization into account.
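For instance, a naive split on '. ' breaks abbreviations apart, while the pretrained Punkt model usually keeps them inside the same sentence (an illustrative snippet with hypothetical sample text; the exact splits depend on the trained model):

abbrev_text = "Mr. Brown works at A.B.C. Corp. He is rarely late."
print(abbrev_text.split('. '))           # naive split: breaks right after 'Mr.'
pprint(nltk.sent_tokenize(abbrev_text))  # Punkt keeps most abbreviations intact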
We can also tokenize text in other languages into sentences. If we are working with German text, we can either use sent_tokenize, which is already trained, or load a pretrained tokenization model for German into a PunktSentenceTokenizer instance and perform the same operation. The following snippets show sentence tokenization for German.
First, load the German text corpus and inspect it:
Note: on the first run you also need to execute:

nltk.download('europarl_raw')
In [35]: from nltk.corpus import europarl_raw
In [36]: german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')
In [37]: print(len(german_text))
157171
In [38]: print(german_text[0:100])
 Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit
Then, split the text corpus into sentences using both the default sent_tokenize tokenizer and a pretrained German tokenizer loaded from the NLTK resources:
In [40]: german_sentences_def = default_st(text=german_text, language='german')
In [41]: german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')
In [42]: german_sentences = german_tokenizer.tokenize(german_text)
In [43]: print(type(german_tokenizer))
<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>
This shows that german_tokenizer is an instance of PunktSentenceTokenizer dedicated to handling German.
Next, check whether the sentences obtained from the default tokenizer are the same as those obtained from the pretrained tokenizer; ideally this should be True. After that, print a few of the tokenized sample sentences:
In [45]: print(german_sentences_def == german_sentences)
True
In [46]: for sent in german_sentences[0:5]:
   ....:     print(sent)
   ....:
 Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .
The results confirm our earlier assumption: sentences in languages other than English can be tokenized in either of these two ways. Sentence tokenization can also be performed conveniently by instantiating the default PunktSentenceTokenizer class directly, as shown below:
In [47]: punkt_st = nltk.tokenize.PunktSentenceTokenizer()
In [48]: sample_sentences = punkt_st.tokenize(sample_text)
In [49]: pprint(sample_sentences)
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']
As you can see, we get the output we expected. The last sentence tokenization approach covered here uses an instance of the RegexpTokenizer class, which splits text into sentences based on a regular-expression pattern.
The following code shows how to use a regular expression to separate sentences:
In [50]: SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'
In [51]: regex_st = nltk.tokenize.RegexpTokenizer(
   ....:     pattern=SENTENCE_TOKENS_PATTERN,
   ....:     gaps=True)
In [52]: sample_sentences = regex_st.tokenize(sample_text)
In [53]: pprint(sample_sentences)
['We will discuss briefly about the basic syntax, structure and design '
 'philosophies.',
 'There is a defined hierarchical syntax for Python code which you should '
 'remember when writing code!',
 'Python is a really powerful programming language!']
The output above shows that we obtain the same tokenization results as with the other tokenizers.
Word Tokenization
Word tokenization is the process of breaking or splitting a sentence into its constituent words. A sentence is a collection of words, and word tokenization essentially splits a sentence into a list of words, which in turn can be used to rebuild the sentence (a short reconstruction sketch follows the list below). Word tokenization matters in many tasks, especially text cleaning and normalization, where operations such as stemming and lemmatization, which rely on stems and token information, are applied to each individual word. As with sentence tokenization, nltk provides several useful interfaces for word tokenization:
- word_tokenize
- TreebankWordTokenizer
- RegexpTokenizer
- Tokenizers inherited from RegexpTokenizer
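On the claim that the word list can rebuild the sentence: a plain join is only approximate, while TreebankWordDetokenizer, available in newer NLTK releases, reattaches contractions and punctuation more cleanly (a minimal sketch for illustration):

import nltk
from nltk.tokenize.treebank import TreebankWordDetokenizer

tokens = nltk.word_tokenize("The brown fox wasn't that quick and he couldn't win the race")
print(' '.join(tokens))                              # approximate: "was n't" stays split
print(TreebankWordDetokenizer().detokenize(tokens))  # much closer to the original sentence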
We will use the example sentence "The brown fox wasn't that quick and he couldn't win the race" as input to the various tokenizers. The nltk.word_tokenize function is NLTK's default and recommended word tokenizer. This tokenizer is actually an instance or object of the TreebankWordTokenizer class and acts as a wrapper around that core class. The following code illustrates its usage:
In [9]: sentence = "The brown fox wasn't that quick and he couldn't win the race"
In [10]: default_wt = nltk.word_tokenize
In [11]: words = default_wt(sentence)
In [12]: print(words)
['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']
TreebankWordTokenizer is based on the Penn Treebank and uses various regular expressions to tokenize text. A major assumption here, of course, is that sentence tokenization has already been performed. The original tokenizer used by the Penn Treebank is a sed script, which can be downloaded from https://catalog.ldc.upenn.edu/ldc99t42 to get an idea of the patterns it uses to split sentences into words. Some of the main features of this tokenizer include:
- It splits off and separates periods that appear at the end of a sentence.
- It splits off and separates commas and single quotes that are followed by whitespace.
- It separates most punctuation characters into independent tokens.
- It splits standard contractions, for example "don't" into "do" and "n't".
The following snippet shows the use of TreebankWordTokenizer for word tokenization:
In [13]: treebank_wt = nltk.TreebankWordTokenizer()
In [14]: words = treebank_wt.tokenize(sentence)
In [15]: print(words)
['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']
As expected, the output of this snippet is similar to that of word_tokenize(), since both use the same tokenization mechanism.
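The sample sentence above only exercises the contraction rule; a sentence with commas and a trailing period (a hypothetical example, continuing the session above) also shows the punctuation-splitting rules from the earlier list:

In [16]: print(treebank_wt.tokenize("The fox, quick and brown, jumped over the lazy dog."))
['The', 'fox', ',', 'quick', 'and', 'brown', ',', 'jumped', 'over', 'the', 'lazy', 'dog', '.']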
Now let's look at how to tokenize words using the regular-expression-based RegexpTokenizer class. Remember that there are two main parameters in word tokenization: the pattern parameter, used to build the tokenizer, and the gaps parameter, which, if set to True, makes the pattern match the gaps between tokens; otherwise the pattern matches the tokens themselves.
The following snippets show some examples of word tokenization with regular expressions:
In [21]: TOKEN_PATTERN = r'\w+'
In [22]: regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=False)
In [23]: words = regex_wt.tokenize(sentence)
In [24]: print(words)
['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']
In [25]: GAP_PATTERN = r'\s+'
In [26]: regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN, gaps=True)
In [27]: words = regex_wt.tokenize(sentence)
In [28]: print(words)
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
In [29]: word_indices = list(regex_wt.span_tokenize(sentence))
In [30]: print(word_indices)
[(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]
In [31]: print([sentence[start:end] for start, end in word_indices])
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
Besides the base RegexpTokenizer class, there are several derived classes that perform different kinds of word tokenization. WordPunctTokenizer uses the pattern r'\w+|[^\w\s]+' to split a sentence into independent alphabetic and non-alphabetic tokens. WhitespaceTokenizer splits a sentence into words based on whitespace characters such as tabs, newlines, and spaces.
The following code illustrates the usage of these derived classes:
In [32]: wordpunkt_wt = nltk.WordPunctTokenizer()
In [33]: words = wordpunkt_wt.tokenize(sentence)
In [34]: print(words)
['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race']
In [35]: whitespace_wt = nltk.WhitespaceTokenizer()
In [36]: words = whitespace_wt.tokenize(sentence)
In [37]: print(words)
['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
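Finally, as noted above, WordPunctTokenizer is essentially a RegexpTokenizer configured with the pattern r'\w+|[^\w\s]+'; the quick check below (an illustrative sketch continuing the same session) should print True:

In [38]: regex_equiv_wt = nltk.RegexpTokenizer(pattern=r'\w+|[^\w\s]+', gaps=False)
In [39]: print(regex_equiv_wt.tokenize(sentence) == wordpunkt_wt.tokenize(sentence))
True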