NLTK: the regular-expression tokenizer (nltk.regexp_tokenize)

Reposted article, 2019-05-16 15:45. Tag: NLTK

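Page 121 of 《Python自然語言處理》 (Natural Language Processing with Python) has an example that uses NLTK's built-in regular-expression tokenizer, nltk.regexp_tokenize. The code in the book is: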

    import nltk

    text = 'That U.S.A. poster-print ex-costs-ed $12.40 ... 8% ?  _'
    pattern = r'''(?x)    # set flag to allow verbose regexps
        ([A-Z]\.)+        # abbreviations, e.g. U.S.A.
       |\w+(-\w+)*        # words with optional internal hyphens
       |\$?\d+(\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
       |\.\.\.            # ellipsis
       |(?:[.,;"'?():-_`])  # these are separate tokens; includes ], [
     '''
    print(nltk.regexp_tokenize(text, pattern))

The '8%' and '_' at the end of text are my own additions.

The expected output is:

    ['That', 'U.S.A.', 'poster-print', 'ex-costs-ed', '$12.40', '...', '8%', '?', '_']

But the actual output is:

    [('', '', ''), ('A.', '', ''), ('', '-print', ''), ('', '-ed', ''), ('', '', '.40'), ('', '', ''), ('', '', ''), ('', '', ''), ('', '', '')]

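This happens because nltk.internals.compile_regexp_to_noncapturing() was abandoned in NLTK 3.1 (in earlier versions it still runs, so the book's pattern worked there). Without that conversion, the three parenthesized groups in the pattern stay capturing, and each match is reported as a tuple of the three groups' contents rather than as the whole matched token. The fix is to rewrite those groups in the pattern as non-capturing, (?:...) (reference: https://blog.csdn.net/baimafujinji/article/details/51051505):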

    pattern = r'''(?x)    # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
       |\w+(?:-\w+)*        # words with optional internal hyphens
       |\$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
       |\.\.\.            # ellipsis
       |(?:[.,;"'?():-_`])  # these are separate tokens; includes ], [
     '''
    print(nltk.regexp_tokenize(text, pattern))

The actual output is:

    ['That', 'U.S.A.', 'poster-print', 'ex-costs-ed', '$12.40', '...', '8', '?', '_']
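The difference between capturing and non-capturing groups can be reproduced with Python's standard re module alone. The sketch below rests on an assumption: that regexp_tokenize in recent NLTK versions passes the pattern to re.findall more or less unchanged. The sample string and the variable names are made up for illustration.

```python
import re

sample = 'poster-print $12.40'

# Capturing groups (...): findall returns one tuple per match holding the
# contents of each group, not the text of the whole match.
capturing = re.findall(r'\w+(-\w+)*|\$?\d+(\.\d+)?', sample)

# Non-capturing groups (?:...): findall returns the whole matched text.
non_capturing = re.findall(r'\w+(?:-\w+)*|\$?\d+(?:\.\d+)?', sample)

print(capturing)      # [('-print', ''), ('', '.40')]
print(non_capturing)  # ['poster-print', '$12.40']
```

A group that did not take part in a match shows up as an empty string, which is where all the '' entries in the tuple output come from.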

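The output of the revised pattern still has a problem: '8' should have come out as '8%'. I found that '8%' appears either when the '*' is removed from the word alternative \w+(?:-\w+)* (line 3 of the pattern) or when lines 3 and 4 are swapped, so that the currency alternative comes before the word alternative. Removing the '*' is not a usable fix, though, because the word alternative then only matches words that contain a hyphen. The modified code, using the swap, is: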

    pattern = r'''(?x)    # set flag to allow verbose regexps
        (?:[A-Z]\.)+        # abbreviations, e.g. U.S.A.
       |\$?\d+(?:\.\d+)?%?  # currency and percentages, e.g. $12.40, 82%
       |\w+(?:-\w+)*        # words with optional internal hyphens (moved below currency)
       |\.\.\.            # ellipsis
       |(?:[.,;"'?():-_`])  # these are separate tokens; includes ], [
     '''
    print(nltk.regexp_tokenize(text, pattern))

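The result is now correct. The '*' itself is not the culprit, however. A regex alternation tries its alternatives from left to right and takes the first one that matches. Because \w also matches digits, the word alternative \w+(?:-\w+)* consumes '8' before the currency alternative is ever tried, and the leftover '%' matches none of the alternatives, so it is skipped. Removing the '*' only appears to help: the word alternative then requires a hyphen, so '8' falls through to the currency alternative, but plain words such as 'That' no longer match it either. Placing the currency alternative before the word alternative is the proper fix.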
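The effect of the ordering can be checked without NLTK. This is a minimal sketch using re.findall from the standard library as a stand-in for the tokenizer (an assumption about how regexp_tokenize behaves); the constant names and the tokenize helper are hypothetical, introduced only for this illustration.

```python
import re

text = 'That U.S.A. poster-print ex-costs-ed $12.40 ... 8% ?  _'

# The five alternatives of the pattern, all with non-capturing groups.
ABBREV = r'(?:[A-Z]\.)+'
WORD = r'\w+(?:-\w+)*'
MONEY = r'\$?\d+(?:\.\d+)?%?'
ELLIPSIS = r'\.\.\.'
PUNCT = r'''[.,;"'?():-_`]'''


def tokenize(text, alternatives):
    """Join the alternatives with '|' in the given order and collect all matches."""
    return re.findall('|'.join(alternatives), text)


# Word alternative first: \w+ grabs '8', and the '%' is left unmatched.
word_first = tokenize(text, [ABBREV, WORD, MONEY, ELLIPSIS, PUNCT])
# Currency alternative first: '8%' is matched as one token.
money_first = tokenize(text, [ABBREV, MONEY, WORD, ELLIPSIS, PUNCT])
# Word alternative without the '*': a hyphen becomes mandatory.
no_star = tokenize(text, [ABBREV, r'\w+(?:-\w+)', MONEY, ELLIPSIS, PUNCT])

print(word_first[6])      # 8
print(money_first[6])     # 8%
print('That' in no_star)  # False
```

The last line shows why dropping the '*' is only an apparent fix: '8%' comes out right, but an ordinary word like 'That' is no longer returned as a token.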
