NLP & Python Notes: Corpora

Reposted article. Published 2018-07-19 21:56. Tags: NLP / Python

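What is a corpus? A text corpus is a large, structured collection of text.

NLTK ships with many corpora, including:

(1) The Gutenberg corpus
(2) Web and chat text
(3) The Brown corpus
(4) The Reuters corpus
(5) The Inaugural Address corpus
(6) Annotated text corpora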
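
As a rough illustration of what "structured" means here, the sketch below models a corpus as a set of named files, each holding a list of tokens. `ToyCorpus` and its sample data are hypothetical. Only the method names `fileids()` and `words()` mirror those of NLTK's corpus readers, which load their data from disk rather than from a dictionary.

```python
class ToyCorpus:
    """Hypothetical in-memory stand-in for an NLTK-style corpus reader."""

    def __init__(self, files):
        # files: dict mapping a file id to its list of tokens
        self._files = dict(files)

    def fileids(self):
        # Identifiers of the documents in the corpus, in sorted order
        return sorted(self._files)

    def words(self, fileid=None):
        # Tokens of one file, or of the whole corpus when fileid is None
        if fileid is not None:
            return list(self._files[fileid])
        return [w for fid in self.fileids() for w in self._files[fid]]


# Made-up sample data for illustration only
toy = ToyCorpus({
    "b.txt": ["Hello", "world", "!"],
    "a.txt": ["A", "tiny", "corpus", "."],
})
print(toy.fileids())
print(len(toy.words()))
```

With a real NLTK corpus the same two calls apply, for example `nltk.corpus.gutenberg.fileids()` followed by `nltk.corpus.gutenberg.words('austen-sense.txt')`.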

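Wordlist corpora
(1) Word list: nltk.corpus.words.words()
The Words corpus is the /usr/dict/words file from Unix, which some spell checkers use. The code below filters a text down to its rare or misspelled words by removing every word that appears in the word list.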

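# coding: utf-8
# Requires the corpora used below to be installed, e.g. via nltk.download()
import nltk

def unusual_words(text):
    # Vocabulary of the text: lowercased alphabetic tokens only
    text_vocab = set(w.lower() for w in text if w.isalpha())
    # Lowercased vocabulary of the Words corpus
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    # Set difference: words in the text that are not in the word list
    unusual = text_vocab.difference(english_vocab)
    return sorted(unusual)

print(unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt')))
print(unusual_words(nltk.corpus.nps_chat.words()))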
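
To see the set-difference step in isolation, here is a minimal sketch that runs without any NLTK data. The `vocabulary` parameter and the sample tokens are assumptions made for illustration: they stand in for the Words corpus and for a real text.

```python
def unusual_words(text, vocabulary):
    # Lowercased alphabetic tokens that occur in the text
    text_vocab = {w.lower() for w in text if w.isalpha()}
    # Lowercased reference word list
    known = {w.lower() for w in vocabulary}
    # Keep only the words the reference list does not contain
    return sorted(text_vocab - known)


# Hypothetical sample data, not drawn from any NLTK corpus
tokens = ["The", "cat", "sat", "on", "teh", "mat", ",", "lol", "2018"]
vocabulary = ["the", "cat", "sat", "on", "mat"]
result = unusual_words(tokens, vocabulary)
print(result)
```

The comma and "2018" are dropped by the `isalpha()` check before the comparison, so only "teh" and "lol" remain. Because the filter is an exact match, any inflected form that is missing from the word list would also be reported as unusual.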

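(2) Stopword corpus: nltk.corpus.stopwords.words()
The stopword corpus contains high-frequency words that can be filtered out of a document before processing, which makes texts easier to tell apart. The code below computes the fraction of words in a text that do not appear in the stopword list.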

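import nltk

def content_fraction(text):
    # A set makes each membership test constant-time
    stopwords = set(nltk.corpus.stopwords.words('english'))
    # Keep the tokens that are not stopwords
    content = [w for w in text if w.lower() not in stopwords]
    # "* 1.0" forces float division under Python 2 as well
    return len(content) * 1.0 / len(text)

print(content_fraction(nltk.corpus.reuters.words()))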
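
The same calculation can be tried on a handful of tokens without downloading anything. In this sketch the `stopwords` parameter, the sample tokens, and the empty-text guard are additions for illustration and are not part of the original function. It assumes Python 3, where `/` is true division.

```python
def content_fraction(text, stopwords):
    # A set gives constant-time membership tests
    stopword_set = {w.lower() for w in stopwords}
    content = [w for w in text if w.lower() not in stopword_set]
    # Guard against an empty text to avoid dividing by zero
    return len(content) / len(text) if text else 0.0


# Hypothetical sample data; a real run would pass a full stopword list
tokens = ["The", "quick", "fox", "is", "in", "the", "garden", "."]
stopwords = ["the", "is", "in"]
fraction = content_fraction(tokens, stopwords)
print(fraction)
```

Four of the eight tokens survive the filter, including the final period: punctuation is not in the stopword list, so it counts as content here just as it does in the NLTK version above.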
