1. Text Tokenization

Reposted article; originally posted 2019-08-14 18:30.

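Earlier we discussed the structure, components, and representation of text. Specifically, a token is the smallest independent text component that carries definite syntactic and semantic meaning. A passage or a text document has several components, including sentences, which can be further broken down into clauses, phrases, and words. The most popular tokenization techniques are sentence tokenization and word tokenization, which break a text corpus into sentences and each sentence into words. Text tokenization can therefore be defined as the process of breaking or splitting text data into smaller, meaningful components, namely tokens.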

Sentence Tokenization

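Sentence tokenization is the process of splitting a text corpus into sentences, which form the first level of tokens making up the corpus. The process is also known as sentence segmentation, because we try to segment the text into meaningful sentences. Any text corpus is a body of text in which each paragraph comprises several sentences.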

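There are various techniques for sentence tokenization. The basic ones look for specific delimiters between sentences, such as a period (.), a newline character (\n), or a semicolon (;). We will use the NLTK framework, which provides various interfaces for sentence tokenization, and focus mainly on the following sentence tokenizers: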

sent_tokenize
PunktSentenceTokenizer
RegexpTokenizer
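
Before turning to these NLTK tokenizers, the basic delimiter-based technique mentioned above can be sketched with the standard library alone. This is a minimal illustration, not what NLTK does internally; the function name naive_sentence_split and the sample text are made up for this example.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!', '?' or ';' followed by whitespace, and on newlines."""
    parts = re.split(r'(?<=[.!?;])\s+|\n+', text)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p and p.strip()]

text = "Python is simple. It is also powerful! Is it fast?\nThat depends; benchmarks vary."
sentences = naive_sentence_split(text)
for s in sentences:
    print(s)

# The weakness of pure delimiter matching: abbreviations are split too.
print(naive_sentence_split("Dr. Smith arrived."))
```

The last call returns two fragments ('Dr.' and 'Smith arrived.'), which is exactly the kind of mistake the trained tokenizers discussed below are meant to avoid.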

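Before splitting text into sentences, we need some text to test these tokenizers on. We will load some sample text along with part of the Gutenberg corpus available in NLTK. The following snippet loads the necessary dependencies: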

 
    import nltk
    from nltk.corpus import gutenberg
    from pprint import pprint

Note: on the first run, you also need to download the corpus:

 
    import nltk
    nltk.download('gutenberg')

This downloads the required collection of books. Once the download succeeds, run the following to list the available files:

 
    In [7]: nltk.corpus.gutenberg.fileids()
    Out[7]:
    ['austen-emma.txt',
     'austen-persuasion.txt',
     'austen-sense.txt',
     'bible-kjv.txt',
     'blake-poems.txt',
     'bryant-stories.txt',
     'burgess-busterbrown.txt',
     'carroll-alice.txt',
     'chesterton-ball.txt',
     'chesterton-brown.txt',
     'chesterton-thursday.txt',
     'edgeworth-parents.txt',
     'melville-moby_dick.txt',
     'milton-paradise.txt',
     'shakespeare-caesar.txt',
     'shakespeare-hamlet.txt',
     'shakespeare-macbeth.txt',
     'whitman-leaves.txt']

If you get the following error when running the code:


 
       
        
          
    In [14]: alice = gutenberg.raw(fileids='carrolll-alice.txt')
    ---------------------------------------------------------------------------
    BadZipFile                                Traceback (most recent call last)
    <ipython-input-14-158d1a6a9aa4> in <module>()
    ----> 1 alice = gutenberg.raw(fileids='carrolll-alice.txt')

    /usr/local/lib/python3.6/site-packages/nltk/corpus/util.py in __getattr__(self, attr)
        114             raise AttributeError("LazyCorpusLoader object has no attribute '__bases__'")
        115
    --> 116         self.__load()
        117         # This looks circular, but its not, since __load() changes our
        118         # __class__ to something new:

    /usr/local/lib/python3.6/site-packages/nltk/corpus/util.py in __load(self)
         76         else:
         77             try:
    ---> 78                 root = nltk.data.find('{}/{}'.format(self.subdir, self.__name))
         79             except LookupError as e:
         80                 try: root = nltk.data.find('{}/{}'.format(self.subdir, zip_name))

    /usr/local/lib/python3.6/site-packages/nltk/data.py in find(resource_name, paths)
        653                                       [pieces[i] + '.zip'] + pieces[i:])
        654             try:
    --> 655                 return find(modified_name, paths)
        656             except LookupError:
        657                 pass

    /usr/local/lib/python3.6/site-packages/nltk/data.py in find(resource_name, paths)
        639                 if os.path.exists(p):
        640                     try:
    --> 641                         return ZipFilePathPointer(p, zipentry)
        642                     except IOError:
        643                         # resource not in zipfile

    /usr/local/lib/python3.6/site-packages/nltk/compat.py in _decorator(*args, **kwargs)
        219     def _decorator(*args, **kwargs):
        220         args = (args[0], add_py3_data(args[1])) + args[2:]
    --> 221         return init_func(*args, **kwargs)
        222     return wraps(init_func)(_decorator)
        223

    /usr/local/lib/python3.6/site-packages/nltk/data.py in __init__(self, zipfile, entry)
        486         """
        487         if isinstance(zipfile, string_types):
    --> 488             zipfile = OpenOnDemandZipFile(os.path.abspath(zipfile))
        489
        490         # Normalize the entry string, it should be relative:

    /usr/local/lib/python3.6/site-packages/nltk/compat.py in _decorator(*args, **kwargs)
        219     def _decorator(*args, **kwargs):
        220         args = (args[0], add_py3_data(args[1])) + args[2:]
    --> 221         return init_func(*args, **kwargs)
        222     return wraps(init_func)(_decorator)
        223

    /usr/local/lib/python3.6/site-packages/nltk/data.py in __init__(self, filename)
       1012         if not isinstance(filename, string_types):
       1013             raise TypeError('ReopenableZipFile filename must be a string')
    -> 1014         zipfile.ZipFile.__init__(self, filename)
       1015         assert self.filename == filename
       1016         self.close()

    /usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/zipfile.py in __init__(self, file, mode, compression, allowZip64)
       1106         try:
       1107             if mode == 'r':
    -> 1108                 self._RealGetContents()
       1109             elif mode in ('w', 'x'):
       1110                 # set the modified flag so central directory gets written

    /usr/local/Cellar/python/3.6.4_4/Frameworks/Python.framework/Versions/3.6/lib/python3.6/zipfile.py in _RealGetContents(self)
       1173             raise BadZipFile("File is not a zip file")
       1174         if not endrec:
    -> 1175             raise BadZipFile("File is not a zip file")
       1176         if self.debug > 1:
       1177             print(endrec)

    BadZipFile: File is not a zip file
           
 
        
 
       
     

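this points to a network problem: the corpus archive was most likely not downloaded intact. Use a machine that can reach the overseas download servers and download the corpus again. Once the corpus is available, load the text of Alice in Wonderland and define a short sample text: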
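
One way to confirm that an interrupted download is the cause is to check whether the archive on disk is a valid zip file. The sketch below uses only the standard library; the helper name is_valid_zip is made up for this example, and the two files it creates are stand-ins for an intact and a broken download.

```python
import os
import tempfile
import zipfile

def is_valid_zip(path):
    """Return True only if path exists and looks like a readable zip archive."""
    return os.path.isfile(path) and zipfile.is_zipfile(path)

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for an intact download.
    good = os.path.join(tmp, 'good.zip')
    with zipfile.ZipFile(good, 'w') as zf:
        zf.writestr('hello.txt', 'hello')

    # Stand-in for a broken download (for example an error page saved as .zip).
    bad = os.path.join(tmp, 'bad.zip')
    with open(bad, 'wb') as fh:
        fh.write(b'<html>connection reset</html>')

    good_ok = is_valid_zip(good)
    bad_ok = is_valid_zip(bad)

print(good_ok, bad_ok)
```

To apply the same check to the real corpus, point is_valid_zip at the downloaded archive. On many installations that is ~/nltk_data/corpora/gutenberg.zip, but this location is an assumption; nltk.data.path lists the directories NLTK actually searches. If the check fails, delete the file and run nltk.download('gutenberg') again.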

 
    alice = gutenberg.raw(fileids='carroll-alice.txt')
    sample_text = 'We will discuss briefly about the basic syntax, structure and design philosophies. There is a defined hierarchical syntax for Python code which you should remember when writing code! Python is a really powerful programming language!'

The following code shows the length of the "Alice in Wonderland" corpus and its first few lines:

 
    In [12]: print(len(alice))
    144395

    In [13]: print(alice[0:100])
    [Alice's Adventures in Wonderland by Lewis Carroll 1865]

    CHAPTER I. Down the Rabbit-Hole

    Alice was

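nltk.sent_tokenize is the default sentence tokenization function recommended by nltk. Internally it uses an instance of the PunktSentenceTokenizer class. This is not just an ordinary object or instance, however: it has been pre-trained on several language models and works well on many languages besides English.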

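The following snippet shows the basic use of this function on our sample texts.

Note: on the first run, you first need to download the Punkt models (newer NLTK releases may ask for 'punkt_tab' instead):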

 
    nltk.download('punkt')

 
    default_st = nltk.sent_tokenize
    alice_sentences = default_st(text=alice)
    sample_sentences = default_st(text=sample_text)

    print('Total sentences in sample_text:', len(sample_sentences))
    print('Sample text sentences :-')
    pprint(sample_sentences)
    print('\nTotal sentences in alice:', len(alice_sentences))
    pprint(alice_sentences[0:5])

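Running the snippet produces the following output, which gives the total number of sentences and shows what those sentences look like in each text: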

 
    Total sentences in sample_text: 3
    Sample text sentences :-
    ['We will discuss briefly about the basic syntax, structure and design '
     'philosophies.',
     'There is a defined hierarchical syntax for Python code which you should '
     'remember when writing code!',
     'Python is a really powerful programming language!']

    Total sentences in alice: 1625
    ["[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I.",
     'Down the Rabbit-Hole\n'
     '\n'
     'Alice was beginning to get very tired of sitting by her sister on the\n'
     'bank, and of having nothing to do: once or twice she had peeped into the\n'
     'book her sister was reading, but it had no pictures or conversations in\n'
     "it, 'and what is the use of a book,' thought Alice 'without pictures or\n"
     "conversation?'",
     'So she was considering in her own mind (as well as she could, for the\n'
     'hot day made her feel very sleepy and stupid), whether the pleasure\n'
     'of making a daisy-chain would be worth the trouble of getting up and\n'
     'picking the daisies, when suddenly a White Rabbit with pink eyes ran\n'
     'close by her.',
     'There was nothing so VERY remarkable in that; nor did Alice think it so\n'
     "VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!",
     'Oh dear!']

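As you can see, the sentence tokenizer is quite intelligent: it does not rely on periods alone to delimit sentences, but also considers other punctuation marks and the capitalization of words.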
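
To see why capitalization is a useful signal, here is a deliberately simplified heuristic written with the standard library. It is not the Punkt algorithm, which relies on a trained model rather than a fixed rule; the function name split_on_capitalized_start and the sample text are invented for illustration.

```python
import re

def split_on_capitalized_start(text):
    """Split at whitespace that follows '.', '!' or '?' and precedes a capital letter."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

text = "The price rose to 3.50 dollars, i.e. a lot. Nobody complained! Why not?"
for sentence in split_on_capitalized_start(text):
    print(sentence)
```

The period in "3.50" is not followed by whitespace, and the one in "i.e." is followed by a lowercase word, so neither triggers a split. A fixed rule like this still breaks on titles such as "Mr. Smith", which is one reason a trained tokenizer is preferable in practice.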

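We can also tokenize sentences in other languages. When working with German text, you can either use sent_tokenize, which is already trained, or load a tokenization model pre-trained on German text into a PunktSentenceTokenizer instance and perform the same operation. The following snippets show sentence tokenization for German.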

First, load a German text corpus and inspect it.

Note: on the first run, you need to download the corpus:

 
    nltk.download('europarl_raw')

 
    In [35]: from nltk.corpus import europarl_raw

    In [36]: german_text = europarl_raw.german.raw(fileids='ep-00-01-17.de')

    In [37]: print(len(german_text))
    157171

    In [38]: print(german_text[0:100])
    Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sit

Next, we split the text corpus into sentences using both the default sent_tokenize tokenizer and a pre-trained German tokenizer loaded from the nltk resources:

 
    In [40]: german_sentences_def = default_st(text=german_text, language='german')

    In [41]: german_tokenizer = nltk.data.load(resource_url='tokenizers/punkt/german.pickle')

    In [42]: german_sentences = german_tokenizer.tokenize(german_text)

    In [43]: print(type(german_tokenizer))
    <class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>

This shows that german_tokenizer is an instance of PunktSentenceTokenizer that specializes in German.

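Next, we check whether the sentences obtained from the default tokenizer are the same as those obtained from the pre-trained tokenizer; ideally the result should be True. After that, we print a few of the tokenized sentences: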

 
    In [45]: print(german_sentences_def == german_sentences)
    True

    In [46]: for sent in german_sentences[0:5]:
       ....:     print(sent)
       ....:
    Wiederaufnahme der Sitzungsperiode Ich erkläre die am Freitag , dem 17. Dezember unterbrochene Sitzungsperiode des Europäischen Parlaments für wiederaufgenommen , wünsche Ihnen nochmals alles Gute zum Jahreswechsel und hoffe , daß Sie schöne Ferien hatten .
    Wie Sie feststellen konnten , ist der gefürchtete " Millenium-Bug " nicht eingetreten .
    Doch sind Bürger einiger unserer Mitgliedstaaten Opfer von schrecklichen Naturkatastrophen geworden .
    Im Parlament besteht der Wunsch nach einer Aussprache im Verlauf dieser Sitzungsperiode in den nächsten Tagen .
    Heute möchte ich Sie bitten - das ist auch der Wunsch einiger Kolleginnen und Kollegen - , allen Opfern der Stürme , insbesondere in den verschiedenen Ländern der Europäischen Union , in einer Schweigeminute zu gedenken .

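The results confirm the earlier assumption: there are two ways to tokenize sentences in languages other than English. Sentence tokenization is also easy with a default instance of the PunktSentenceTokenizer class, as shown here: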

 
    In [47]: punkt_st = nltk.tokenize.PunktSentenceTokenizer()

    In [48]: sample_sentences = punkt_st.tokenize(sample_text)

    In [49]: pprint(sample_sentences)
    ['We will discuss briefly about the basic syntax, structure and design '
     'philosophies.',
     'There is a defined hierarchical syntax for Python code which you should '
     'remember when writing code!',
     'Python is a really powerful programming language!']

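As you can see, we get the expected output. The last technique covered in this section on sentence tokenization uses an instance of the RegexpTokenizer class, which splits text into sentences with a regular-expression pattern.

The following code shows how to use a regular expression to separate sentences: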

 
    In [50]: SENTENCE_TOKENS_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

    In [51]: regex_st = nltk.tokenize.RegexpTokenizer(
       ....:     pattern=SENTENCE_TOKENS_PATTERN,
       ....:     gaps=True
       ....: )

    In [52]: sample_sentences = regex_st.tokenize(sample_text)

    In [53]: pprint(sample_sentences)
    ['We will discuss briefly about the basic syntax, structure and design '
     'philosophies.',
     'There is a defined hierarchical syntax for Python code which you should '
     'remember when writing code!',
     'Python is a really powerful programming language!']

The output shows that we obtain the same sentences as with the other tokenizers.
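
The sample text does not exercise the lookbehind parts of the pattern, so the sketch below applies it to a sentence containing abbreviations. It uses re.split from the standard library instead of RegexpTokenizer. The pattern is written out here without the stray "]" in its second group, on the assumption that the bracket is a typo, and the test sentence is invented for illustration.

```python
import re

# Split on one whitespace character that follows '.', '?' or '!', unless the
# text just before it looks like an abbreviation:
#   (?<!\w\.\w.)        e.g. "U.S." or "e.g."
#   (?<![A-Z][a-z]\.)   e.g. "Mr." or "Dr."
#   (?<![A-Z]\.)        e.g. a single initial such as "J."
SENTENCE_GAP = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<![A-Z]\.)(?<=\.|\?|\!)\s'

text = "Mr. Smith lives in the U.S. today. He works, e.g. as a tutor! Does he like it?"
sentences = re.split(SENTENCE_GAP, text)
for s in sentences:
    print(s)
```

Only the whitespace after "today." and "tutor!" is treated as a sentence gap; the whitespace after "Mr.", "U.S." and "e.g." is rejected by the negative lookbehinds.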

Word Tokenization

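Word tokenization is the process of splitting or segmenting a sentence into its constituent words. A sentence is a collection of words, and word tokenization essentially splits a sentence into a list of words from which the sentence can be rebuilt. Word tokenization is important in many processes, especially text cleaning and normalization, where operations such as stemming and lemmatization act on each individual word based on its stem and lemma. As with sentence tokenization, nltk provides various useful interfaces for word tokenization: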

word_tokenize
TreebankWordTokenizer
RegexpTokenizer
Tokenizers inherited from RegexpTokenizer

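We will use the sentence "The brown fox wasn't that quick and he couldn't win the race" as input to the various tokenizers. nltk.word_tokenize is the default word tokenizer recommended by nltk. It is actually a wrapper around an instance of the core TreebankWordTokenizer class. The following code illustrates its usage: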

 
     
      
        
    In [9]: sentence = "The brown fox wasn't that quick and he couldn't win the race"

    In [10]: default_wt = nltk.word_tokenize

    In [11]: words = default_wt(sentence)

    In [12]: print(words)
    ['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']
         
 
      
 
     
   

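TreebankWordTokenizer is based on the Penn Treebank and uses various regular expressions to tokenize text. A key assumption here is that sentence tokenization has already been performed. The original tokenizer used by the Penn Treebank is a sed script, which can be downloaded from https://catalog.ldc.upenn.edu/ldc99t42 to get an idea of the patterns used to split sentences into words. Some of the tokenizer's main features are: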

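Splits off periods that appear at the end of a sentence.
Splits off commas and single quotes that are followed by whitespace.
Separates most punctuation characters into independent tokens.
Splits common contractions, for example "don't" into "do" and "n't".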
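
A few of the rules listed above can be imitated with a handful of regular-expression substitutions. This is only a rough sketch of the idea and covers three rules; it is neither the original sed script nor NLTK's implementation, and the function name treebank_like_tokens is hypothetical.

```python
import re

def treebank_like_tokens(sentence):
    """Very small imitation of three Treebank-style rules."""
    s = sentence
    # Detach commas that are followed by whitespace.
    s = re.sub(r",(?=\s)", " ,", s)
    # Detach the period at the very end of the sentence.
    s = re.sub(r"\.\s*$", " .", s)
    # Split the contraction n't from the word it is attached to.
    s = re.sub(r"(\w)(n't)\b", r"\1 \2", s, flags=re.IGNORECASE)
    return s.split()

print(treebank_like_tokens("The brown fox wasn't that quick, and he couldn't win the race."))
```

Each substitution only inserts a space, so the final whitespace split yields the separated tokens.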

The following snippet shows TreebankWordTokenizer used for word tokenization:

 
     
      
        
    In [13]: treebank_wt = nltk.TreebankWordTokenizer()

    In [14]: words = treebank_wt.tokenize(sentence)

    In [15]: print(words)
    ['The', 'brown', 'fox', 'was', "n't", 'that', 'quick', 'and', 'he', 'could', "n't", 'win', 'the', 'race']
         
 
      
 
     
   

As expected, the output matches that of word_tokenize(), because both use the same tokenization mechanism.

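Now let us see how to tokenize sentences into words with the regex-based RegexpTokenizer class. Keep in mind that it takes two main parameters for word tokenization: pattern and gaps. The pattern parameter is used to build the tokenizer. If gaps is set to True, the pattern is used to find the gaps between tokens; otherwise, it is used to find the tokens themselves.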

The following snippets show some examples of word tokenization using regular expressions:

 
     
      
        
    In [21]: TOKEN_PATTERN = r'\w+'

    In [22]: regex_wt = nltk.RegexpTokenizer(pattern=TOKEN_PATTERN, gaps=False)

    In [23]: words = regex_wt.tokenize(sentence)

    In [24]: print(words)
    ['The', 'brown', 'fox', 'wasn', 't', 'that', 'quick', 'and', 'he', 'couldn', 't', 'win', 'the', 'race']
         
 
      
 
     
   

 
     
      
        
    In [25]: GAP_PATTERN = r'\s+'

    In [26]: regex_wt = nltk.RegexpTokenizer(pattern=GAP_PATTERN, gaps=True)

    In [27]: words = regex_wt.tokenize(sentence)

    In [28]: print(words)
    ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
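
The difference between the two modes of the gaps parameter can be reproduced with the standard re module. As a rough correspondence, and not a drop-in replacement for RegexpTokenizer, gaps=False behaves like re.findall and gaps=True behaves like re.split with empty strings discarded. The variable names below are chosen for this example.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# Pattern describes the tokens themselves (compare gaps=False).
tokens_mode = re.findall(r'\w+', sentence)

# Pattern describes the gaps between tokens (compare gaps=True).
gaps_mode = [tok for tok in re.split(r'\s+', sentence) if tok]

print(tokens_mode)
print(gaps_mode)
```

With the token pattern the apostrophe is dropped and "wasn't" becomes two tokens, whereas the gap pattern keeps every whitespace-delimited chunk intact.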
         
 
      
 
     
   

 
     
      
        
    In [29]: word_indices = list(regex_wt.span_tokenize(sentence))

    In [30]: print(word_indices)
    [(0, 3), (4, 9), (10, 13), (14, 20), (21, 25), (26, 31), (32, 35), (36, 38), (39, 47), (48, 51), (52, 55), (56, 60)]

    In [31]: print([sentence[start:end] for start, end in word_indices])
    ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
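
Token spans like those returned by span_tokenize above can also be computed with re.finditer. The sketch below matches runs of non-whitespace characters, which gives the same boundaries for this sentence as treating whitespace as the gap; it is an illustration and not NLTK's implementation.

```python
import re

sentence = "The brown fox wasn't that quick and he couldn't win the race"

# Each match object knows where it starts and ends in the original string.
spans = [m.span() for m in re.finditer(r'\S+', sentence)]
words = [sentence[start:end] for start, end in spans]

print(spans)
print(words)
```

Keeping the spans rather than only the token strings makes it possible to map each token back to its exact position in the original text, for example when highlighting or annotating it.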
         
 
      
 
     
   

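Besides the base RegexpTokenizer class, several derived classes perform different kinds of word tokenization. WordPunctTokenizer uses the pattern r'\w+|[^\w\s]+' to split a sentence into separate alphabetic and non-alphabetic tokens. WhitespaceTokenizer splits a sentence into words on whitespace characters such as tabs, newlines, and spaces.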

The following code illustrates the use of these derived classes:

 
     
      
        
    In [32]: wordpunkt_wt = nltk.WordPunctTokenizer()

    In [33]: words = wordpunkt_wt.tokenize(sentence)

    In [34]: print(words)
    ['The', 'brown', 'fox', 'wasn', "'", 't', 'that', 'quick', 'and', 'he', 'couldn', "'", 't', 'win', 'the', 'race']
         
 
      
 
     
   

 
     
      
        
    In [35]: whitespace_wt = nltk.WhitespaceTokenizer()

    In [36]: words = whitespace_wt.tokenize(sentence)

    In [37]: print(words)
    ['The', 'brown', 'fox', "wasn't", 'that', 'quick', 'and', 'he', "couldn't", 'win', 'the', 'race']
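
The behavior of the two derived tokenizers can be approximated with the standard library, which helps to see what each pattern does with punctuation runs, tabs, and newlines. This is a sketch for comparison only; the sample text is invented for illustration.

```python
import re

text = "Hello, world!!\tTabs and\nnewlines  count too."

# Same pattern as quoted for WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
word_punct = re.findall(r"\w+|[^\w\s]+", text)

# str.split() with no arguments splits on any run of whitespace.
whitespace = text.split()

print(word_punct)
print(whitespace)
```

The first approach isolates punctuation as its own tokens ("," and "!!"), while the second leaves punctuation attached to the neighboring word.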
         
 
      
 
     
   
