NLTK Stopwords and Rare Words



1. Stopwords

Stopwords are words that are irrelevant to the actual topic at hand and carry essentially no meaning in NLP tasks such as information retrieval and classification. Articles and pronouns are typically treated as stopwords; they are usually unambiguous, so removing them has little impact.

In general, the stopword list for a given language is hand-crafted: a cross-corpus list covering the most common words. A stopword list may be an existing one found online, or it may be generated automatically from a given corpus.
One simple way to generate a stopword list is to use the frequency with which words occur across the documents.
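
As a concrete illustration, here is a minimal sketch (not from the original post; the toy corpus and the 60% document-frequency threshold are illustrative assumptions) of building stopword candidates from document frequency:

from collections import Counter

# Toy corpus: each document is a list of tokens (illustrative assumption)
docs = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'ate', 'the', 'bone'],
    ['a', 'cat', 'and', 'a', 'dog'],
]

# For each word, count how many documents it appears in
doc_freq = Counter(w for doc in docs for w in set(doc))

# Words appearing in at least 60% of the documents become stopword candidates
candidates = [w for w, df in doc_freq.items() if df / len(docs) >= 0.6]
# e.g. ['the', 'cat', 'dog'] (order depends on set iteration)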

The NLTK data used here includes stopword lists for 23 languages (see the output of fileids() below).


Related module:

  • nltk.corpus.stopwords

1.1 Viewing the stopword lists

from nltk.corpus import stopwords  # load the stopwords corpus
stopwords.readme().replace('\n', ' ')  # the corpus README; it contains many '\n' characters, so replace them to make it easier to read
 
'''
    'Stopwords Corpus  This corpus contains lists of stop words for several languages.  These are high-frequency grammatical words which are usually ignored in text retrieval applications.  They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/  The stop words for the Romanian language were obtained from: http://arlc.ro/resources/  The English list has been augmented https://github.com/nltk/nltk_data/issues/22  The German list has been corrected https://github.com/nltk/nltk_data/pull/49  A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52  A Nepali list has been added https://github.com/nltk/nltk_data/pull/83  An Azerbaijani list has been added https://github.com/nltk/nltk_data/pull/100  A Greek list has been added https://github.com/nltk/nltk_data/pull/103  An Indonesian list has been added https://github.com/nltk/nltk_data/pull/112 '

'''

# List the available stopword lists, one per language; there is no Chinese list
stopwords.fileids()
 

'''
    ['arabic',
     'azerbaijani',
     'danish',
     'dutch',
     'english',
     'finnish',
     'french',
     'german',
     'greek',
     'hungarian',
     'indonesian',
     'italian',
     'kazakh',
     'nepali',
     'norwegian',
     'portuguese',
     'romanian',
     'russian',
     'slovene',
     'spanish',
     'swedish',
     'tajik',
     'turkish']
'''
 
# View the raw English stopword list

stopwords.raw('english').replace('\n', ' ')
 
'''
    "i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "

'''



1.2 Filtering stopwords



 
# The sentence being tokenized is not shown in the original post; this one
# (an assumption) reproduces the output below
from nltk.tokenize import word_tokenize

text = 'Browse the latest developer documentation, including API reference, sample code, articles, and tutorials.'
tokens = word_tokenize(text)

test_words = [word.lower() for word in tokens]

# Convert to a set to make intersecting with the stopword set easy
test_words_set = set(test_words)

test_words_set
'''
    {',',
     '.',
     'and',
     'api',
     'articles',
     'browse',
     'code',
     'developer',
     'documentation',
     'including',
     'latest',
     'reference',
     'sample',
     'the',
     'tutorials'}
'''

# Look at the intersection with the English stopword set
stopwords_english = set(stopwords.words('english'))
test_words_set.intersection(stopwords_english)
 
 #   {'and', 'the'}

# Filter out the stopwords
filtered = [w for w in test_words_set if w not in stopwords_english]

filtered
'''
    ['documentation',
     'api',
     'tutorials',
     'articles',
     '.',
     'including',
     'latest',
     'code',
     'sample',
     'developer',
     ',',
     'reference',
     'browse']
'''
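
Note that filtering the set loses the original token order and still keeps punctuation. A small variant (an assumption, not in the original post) filters the ordered token list instead and drops punctuation as well:

import string

# Keep token order and also drop single-character punctuation tokens
filtered_tokens = [w for w in test_words
                   if w not in stopwords_english and w not in string.punctuation]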

2. Rare Words

Rare-word removal strips tokens that are mere noise.
The rules depend on the scenario: HTML tags, overly long names, and so on; see the sketch below.
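
A minimal sketch (the regex, the sample snippet, and the 20-character threshold are illustrative assumptions) that strips HTML tags before tokenizing and then drops overly long tokens:

import re
from nltk.tokenize import word_tokenize

html = '<p>arXiv is a <b>free</b> distribution service.</p>'

# Strip HTML tags with a naive regex (a production pipeline would use a real HTML parser)
plain = re.sub(r'<[^>]+>', ' ', html)

# Drop tokens longer than an arbitrary 20-character threshold
clean_tokens = [t for t in word_tokenize(plain) if len(t) <= 20]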

import nltk
from nltk.tokenize import word_tokenize

# Use 'text' rather than 'str' to avoid shadowing the built-in
text = 'arXiv is a free distribution service and an open-access archive for 1,812,439 scholarly articles. Materials on this site are not peer-reviewed by arXiv.'
tokens = word_tokenize(text)

# Get the frequency distribution of the tokens in the corpus
freq_dist = nltk.FreqDist(tokens)
'''
FreqDist({'arXiv': 2, '.': 2, 'is': 1, 'a': 1, 'free': 1, 'distribution': 1, 'service': 1, 'and': 1, 'an': 1, 'open-access': 1, ...})
'''

# Build a list of the rarest words, then use it to filter the original corpus.
# dict keys are not subscriptable in Python 3, so take the tail of most_common()
rarewords = [w for w, _ in freq_dist.most_common()[-5:]]
rarewords

after_rare_words = [w for w in tokens if w not in rarewords]
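
NLTK's FreqDist also provides hapaxes(), which returns the tokens occurring exactly once; it is often a convenient, threshold-free basis for a rare-word filter:

# hapaxes() lists every token with a count of exactly 1
rare = set(freq_dist.hapaxes())
without_rare = [w for w in tokens if w not in rare]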

