References:
https://www.jianshu.com/p/caa4b923117c
https://blog.csdn.net/papaaa/article/details/78821631
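1. CountVectorizer
CountVectorizer converts the words in a collection of texts into a term-frequency matrix. fit_transform counts how often each word occurs, get_feature_names_out() returns the vocabulary extracted from all the texts (older releases used get_feature_names(), which was removed in scikit-learn 1.2), and toarray() shows the term-frequency matrix in dense form.
The code is as follows:
from sklearn.feature_extraction.text import CountVectorizer

texts = ["dog cat fish", "dog cat cat", "fish bird", "bird"]
cv = CountVectorizer()
# Learn the vocabulary and count each term per document
cv_fit = cv.fit_transform(texts)
print("Vocabulary:\n", cv.get_feature_names_out())
print("Term-frequency matrix:\n", cv_fit.toarray())
print("cv_fit:\n", cv_fit)
fit_transform returns a sparse matrix: printing cv_fit lists only the nonzero entries, each as a (row, column) position with its count.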
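2. TfidfTransformer
TfidfTransformer computes the TF-IDF value of every word from the count matrix produced by a vectorizer. The code is as follows:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

texts = ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig bird"]
cv = CountVectorizer()
cv_fit = cv.fit_transform(texts)
# Re-weight the raw counts into TF-IDF values
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(cv_fit)
print(tfidf.toarray())
The output is a 4x5 matrix of TF-IDF weights, with one row per text and one column per vocabulary word.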
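To make the weights less of a black box, the sketch below recomputes them by hand and compares against TfidfTransformer. It assumes the transformer's default settings (smooth_idf=True, norm="l2", sublinear_tf=False), under which idf = ln((1 + n) / (1 + df)) + 1 and every row is scaled to unit length. The names counts, df, idf and manual are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

texts = ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig bird"]
cv_fit = CountVectorizer().fit_transform(texts)
counts = cv_fit.toarray()

n_docs = counts.shape[0]
# Document frequency: in how many texts each term occurs.
df = (counts > 0).sum(axis=0)
# Smoothed idf, as used by the default settings.
idf = np.log((1 + n_docs) / (1 + df)) + 1
# tf * idf, then L2-normalise each row.
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

tfidf = TfidfTransformer().fit_transform(cv_fit).toarray()
print(np.allclose(manual, tfidf))  # True
```

"dog" occurs in all four texts, so its idf is the minimum value of 1 and it contributes the least weight per occurrence.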
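3. TfidfVectorizer
TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features, which has the same effect as using CountVectorizer followed by TfidfTransformer.
In other words, the TfidfVectorizer class wraps CountVectorizer and TfidfTransformer together.
The code is as follows:
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig bird"]
# Counting and TF-IDF weighting in a single step
tv = TfidfVectorizer(max_features=100, ngram_range=(1, 1), stop_words='english')
X_description = tv.fit_transform(texts)
print(X_description.toarray())
The result is again a 4x5 matrix of TF-IDF weights.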
The output is exactly the same as the result in the previous section.
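The equality can be checked programmatically instead of by eye. This is a small sketch; it relies on none of the sample words being on scikit-learn's built-in English stop-word list, so stop_words='english' removes nothing here. The names two_step, one_step and same are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

texts = ["dog cat fish", "dog cat cat", "dog fish", "dog pig pig bird"]

# Two-step pipeline: raw counts first, then TF-IDF weighting.
two_step = TfidfTransformer().fit_transform(
    CountVectorizer().fit_transform(texts)
)

# One-step equivalent with the parameters used above.
one_step = TfidfVectorizer(
    max_features=100, ngram_range=(1, 1), stop_words="english"
).fit_transform(texts)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)  # True
```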
ngram_range=(1, 1) can be changed, for example to (2, 3), which extracts both 2-grams and 3-grams; to get 2-grams only, use (2, 2).
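The only built-in stop-word list for stop_words is the English one, i.e. "english"; for other languages, pass your own list of words instead.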