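When doing NLP with tf-idf, sklearn provides two classes for the job: CountVectorizer and TfidfTransformer. This post walks through both.
1. Training and testing
Both CountVectorizer and TfidfTransformer use the fit_transform method on the training data and the transform method on the test set. fit is the training step: it learns the vocabulary and the IDF weights from the data it is given, and transform then applies what was learned. Calling fit_transform on the test set as well would re-learn the vocabulary and IDF weights from the test data, so the resulting features would no longer line up with the training features and the results would be wrong.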
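from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Variables: content_train is the training set, content_test is the test set
vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()
# Training: use fit_transform
count_train = vectorizer.fit_transform(content_train)
tfidf = tfidftransformer.fit_transform(count_train)
# Testing: use transform only
count_test = vectorizer.transform(content_test)
test_tfidf = tfidftransformer.transform(count_test)
# tf-idf weights of the test set, as a dense array
test_weight = test_tfidf.toarray()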
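To make the difference concrete, here is a minimal sketch on a toy corpus. The sentences are made-up illustration data, not from any real dataset. It compares the number of feature columns when the fitted vectorizer is reused against the number obtained by refitting on the test set.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy corpora (hypothetical data, for illustration only)
content_train = ["the cat sat on the mat", "the dog chased the cat"]
content_test = ["a bird sat on the fence"]

vectorizer = CountVectorizer()
tfidftransformer = TfidfTransformer()

# Fit on the training set only
count_train = vectorizer.fit_transform(content_train)
tfidf_train = tfidftransformer.fit_transform(count_train)

# Correct: reuse the fitted objects, so test columns match training columns
test_tfidf = tfidftransformer.transform(vectorizer.transform(content_test))

# Incorrect: refitting on the test set learns a different vocabulary
refit_counts = CountVectorizer().fit_transform(content_test)

print("train columns:", tfidf_train.shape[1])
print("test columns (transform):", test_tfidf.shape[1])
print("test columns (refit):", refit_counts.shape[1])
```

The first two numbers agree because transform maps test words onto the training vocabulary; the refit version builds its own vocabulary, so a model trained on the training features could not consume it.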
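2. Saving the tf-idf vocabulary
We almost always need to save the tf-idf vocabulary so that the tf-idf of the test set can be computed later. Note that there are two common ways to persist sklearn objects: pickle and joblib. Here we use pickle.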
import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# segmentWord is the author's word-segmentation helper (not shown in this post)
train_content = segmentWord(X_train)
test_content = segmentWord(X_test)
# decode_error="replace" must be set; this vectorizer holds the training-set features
vectorizer = CountVectorizer(decode_error="replace")
tfidftransformer = TfidfTransformer()
# Training must use vectorizer.fit_transform and tfidftransformer.fit_transform
# Prediction must use vectorizer.transform and tfidftransformer.transform
vec_train = vectorizer.fit_transform(train_content)
tfidf = tfidftransformer.fit_transform(vec_train)

# Save the fitted vectorizer's vocabulary and the fitted tfidftransformer for use at prediction time
# (the models/ directory must already exist)
feature_path = 'models/feature.pkl'
with open(feature_path, 'wb') as fw:
    pickle.dump(vectorizer.vocabulary_, fw)

tfidftransformer_path = 'models/tfidftransformer.pkl'
with open(tfidftransformer_path, 'wb') as fw:
    pickle.dump(tfidftransformer, fw)
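Note: both the vectorizer (here, its vocabulary_) and the tfidftransformer must be saved, and only after fit_transform has been called, so that what gets saved has already been fitted on the training set.
3. Loading tf-idf and transforming new data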
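For comparison, the joblib route mentioned above might look like the sketch below. This is an illustrative alternative, not the method used in this post: it saves the whole fitted vectorizer instead of only vocabulary_, the training sentences are toy data, and the file names and temporary directory are assumptions standing in for the models/ folder.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy training data (hypothetical, for illustration only)
train_content = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = CountVectorizer(decode_error="replace")
tfidftransformer = TfidfTransformer()
tfidf = tfidftransformer.fit_transform(vectorizer.fit_transform(train_content))

# A temporary directory stands in for the models/ folder (assumed file names)
model_dir = tempfile.mkdtemp()
vectorizer_path = os.path.join(model_dir, "vectorizer.joblib")
tfidftransformer_path = os.path.join(model_dir, "tfidftransformer.joblib")

# Save only after fitting, so the files contain the learned state
joblib.dump(vectorizer, vectorizer_path)
joblib.dump(tfidftransformer, tfidftransformer_path)

# Load them back as ready-to-use fitted objects
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_transformer = joblib.load(tfidftransformer_path)
```

Saving the whole vectorizer keeps its constructor settings together with the vocabulary, at the cost of a slightly larger file than the vocabulary dictionary alone.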
import pickle
from sklearn.feature_extraction.text import CountVectorizer

# Load the saved vocabulary (the features)
feature_path = 'models/feature.pkl'
with open(feature_path, 'rb') as fr:
    loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(fr))
# Load the fitted TfidfTransformer
tfidftransformer_path = 'models/tfidftransformer.pkl'
with open(tfidftransformer_path, 'rb') as fr:
    tfidftransformer = pickle.load(fr)
# Use transform at test time; test_content is the test data, a list of strings
test_tfidf = tfidftransformer.transform(loaded_vec.transform(test_content))
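A quick way to gain confidence in the save and load steps is a round-trip check: compute the test tf-idf with the original objects and with the reloaded ones, then compare. The sketch below uses toy sentences and in-memory byte strings in place of the .pkl files; both are assumptions made for the sake of a self-contained example.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy data (hypothetical, for illustration only)
train_content = ["the cat sat on the mat", "the dog chased the cat"]
test_content = ["the cat chased a bird"]

vectorizer = CountVectorizer(decode_error="replace")
tfidftransformer = TfidfTransformer()
tfidftransformer.fit(vectorizer.fit_transform(train_content))

# Serialize to bytes in memory (stand-ins for feature.pkl and tfidftransformer.pkl)
feature_bytes = pickle.dumps(vectorizer.vocabulary_)
transformer_bytes = pickle.dumps(tfidftransformer)

# Rebuild the vectorizer from the saved vocabulary and reload the transformer
loaded_vec = CountVectorizer(decode_error="replace",
                             vocabulary=pickle.loads(feature_bytes))
loaded_transformer = pickle.loads(transformer_bytes)

# tf-idf from the original objects versus the reloaded ones
expected = tfidftransformer.transform(vectorizer.transform(test_content))
restored = loaded_transformer.transform(loaded_vec.transform(test_content))

same = np.allclose(expected.toarray(), restored.toarray())
print("round trip matches:", same)
```

Note that a CountVectorizer rebuilt from a vocabulary only matches the original if it is constructed with the same tokenization options; here both use the defaults plus decode_error="replace".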