An Analysis of the CountVectorizer() Class

This article is a repost. 2018-08-10 12:00 sklearn library

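The following links are the main references:

1. sklearn text feature extraction

2. Computing term weights with scikit-learn tf-idf

3. Official sklearn documentation (Chinese translation)

4. sklearn.feature_extraction.text.CountVectorizer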

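An addendum: how to use the transform() method of the CountVectorizer() class

It converts new text into a feature matrix, except that the features are already fixed. The feature sequence is the one determined by the corpus that was passed earlier to fit_transform(). See the example: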

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> vec = CountVectorizer()
>>> vec.transform(['Something completely new.']).toarray()

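This returns an error: sklearn.exceptions.NotFittedError: CountVectorizer - Vocabulary wasn't fitted. The message means that there is no vocabulary yet, so the text cannot be converted. Because no vocabulary table has been built, there are no matrix column indices against which the words could be counted.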
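The failure described above can be handled explicitly. The following is a minimal sketch, assuming scikit-learn is installed; the helper name safe_transform is hypothetical and is not part of sklearn.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import CountVectorizer


def safe_transform(vectorizer, texts):
    """Hypothetical helper: return the count matrix as a dense array,
    or None when the vectorizer has no vocabulary yet."""
    try:
        return vectorizer.transform(texts).toarray()
    except NotFittedError:
        # transform() was called before fit() or fit_transform()
        return None


unfitted = CountVectorizer()
result = safe_transform(unfitted, ['Something completely new.'])
print(result)  # None, because no vocabulary has been built
```

Returning None is only one possible design. In a real pipeline it may be preferable to let the exception propagate so that the missing fit step is noticed early.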

>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?']
>>> X = vec.fit_transform(corpus)
>>> X.toarray()

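The vocabulary list:

>>> # In scikit-learn 1.0 and later use get_feature_names_out(); get_feature_names() was removed in 1.2
>>> vec.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

The resulting sparse matrix, shown as a dense array by toarray(), is: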

array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)
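To see how the vocabulary and the matrix columns relate, the fitted vectorizer can be inspected directly. This is a small sketch, assuming scikit-learn and SciPy are installed; it uses the same corpus as above and the fitted attribute vocabulary_, which maps each term to its column index.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vec = CountVectorizer()
X = vec.fit_transform(corpus)

# fit_transform() returns a SciPy sparse matrix, one row per document
# and one column per vocabulary term
print(sparse.issparse(X))  # True
print(X.shape)             # (4, 9)

# vocabulary_ maps a term to its column index
second_col = vec.vocabulary_['second']
print(second_col)                  # 5
print(X.toarray()[1][second_col])  # 2, 'second' occurs twice in document 2
```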

Once the vocabulary has been built, transform() can be used to turn new text into a matrix:

>>> vec.transform(['this is']).toarray()
array([[0, 0, 0, 1, 0, 0, 0, 0, 1]], dtype=int64)
>>> vec.transform(['too bad']).toarray()
array([[0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=int64)

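A brief analysis: both words of 'this is' are in the vocabulary, so their counts are recorded in the corresponding columns (index 3 for 'is' and index 8 for 'this') to form the matrix. Neither word of 'too bad' is in the vocabulary, so every entry of that row is 0.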
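The behaviour analysed above can be reproduced with a few lines of plain Python. This is a simplified sketch of the idea, not sklearn's actual implementation: the function names fit_vocabulary and transform are hypothetical, and the tokenizer assumes CountVectorizer's default settings (lowercasing, tokens of two or more word characters).

```python
import re

# Assumed to mirror CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(corpus):
    """Build a term -> column index mapping, sorted alphabetically."""
    terms = sorted({tok for doc in corpus for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}


def transform(vocabulary, texts):
    """Count known terms per text; terms outside the vocabulary are ignored."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for tok in tokenize(text):
            idx = vocabulary.get(tok)
            if idx is not None:
                row[idx] += 1
        rows.append(row)
    return rows


corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vocabulary = fit_vocabulary(corpus)

this_is = transform(vocabulary, ['this is'])
too_bad = transform(vocabulary, ['too bad'])
print(this_is)  # [[0, 0, 0, 1, 0, 0, 0, 0, 1]]
print(too_bad)  # [[0, 0, 0, 0, 0, 0, 0, 0, 0]]
```

The key point is that transform() never adds columns: the width of every row is fixed by the vocabulary learned during fitting.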
