Feature extraction: the sklearn.feature_extraction module extracts, from raw data such as text and images, feature vectors that machine learning algorithms can consume directly.
1. Feature extraction method: Loading Features from Dicts
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Fransisco', 'temperature': 18.},
]

from sklearn.feature_extraction import DictVectorizer

vec = DictVectorizer()
print(vec.fit_transform(measurements).toarray())
print(vec.get_feature_names())
# [[  1.   0.   0.  33.]
#  [  0.   1.   0.  12.]
#  [  0.   0.   1.  18.]]
# ['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
2. Feature extraction method: Feature Hashing
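This section has no example, so here is a minimal sketch using sklearn's FeatureHasher. Unlike DictVectorizer, it keeps no vocabulary: each feature name is hashed directly to a column index, so memory use stays constant regardless of how many distinct features appear (at the cost of occasional hash collisions and unreadable feature names). The tiny `n_features=8` value is chosen only for display; in practice a much larger power of two is typical.

```python
from sklearn.feature_extraction import FeatureHasher

# Hash dict features into a fixed-size vector of 8 columns.
hasher = FeatureHasher(n_features=8, input_type='dict')

samples = [
    {'city': 'Dubai', 'temperature': 33.0},
    {'city': 'London', 'temperature': 12.0},
]

# String values are hashed as "key=value" features with weight 1;
# numeric values are hashed on the key and keep their value.
X = hasher.transform(samples)
print(X.shape)  # (2, 8)
print(X.toarray())
```

Note that the sign of each entry may be flipped by the hashing trick (alternate_sign=True by default), which helps collisions cancel out on average.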
3. Feature extraction method: Text Feature Extraction
The bag-of-words representation
# Bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer

# Inspect the default parameters
vectorizer = CountVectorizer(min_df=1)
print(vectorizer)
"""
CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
"""

corpus = ["this is the first document.",
          "this is the second second document.",
          "and the third one.",
          "Is this the first document?"]
x = vectorizer.fit_transform(corpus)
print(x)
"""
  (0, 1)    1
  (0, 2)    1
  (0, 6)    1
  (0, 3)    1
  (0, 8)    1
  (1, 5)    2
  (1, 1)    1
  (1, 6)    1
  (1, 3)    1
  (1, 8)    1
  (2, 4)    1
  (2, 7)    1
  (2, 0)    1
  (2, 6)    1
  (3, 1)    1
  (3, 2)    1
  (3, 6)    1
  (3, 3)    1
  (3, 8)    1
"""
By default, only words of at least 2 characters are extracted as tokens (see token_pattern above), so single-character words like "a" are dropped:
analyze = vectorizer.build_analyzer()
print(analyze("this is a document to analyze.") ==
      (["this", "is", "document", "to", "analyze"]))
# True
Each term found by the analyzer during fit is assigned a unique integer index, which corresponds to a column in the resulting feature matrix:
print(vectorizer.get_feature_names() == (
    ["and", "document", "first", "is", "one", "second", "the", "third", "this"]))
# True
print(x.toarray())
"""
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
"""
Retrieving the column index of a term from the fitted vocabulary:
print(vectorizer.vocabulary_.get('document')) #1
Words that did not appear in the training corpus are simply ignored at transform time, so every column of the resulting vector is 0:
print(vectorizer.transform(["something completely new."]).toarray())
"""
[[0 0 0 0 0 0 0 0 0]]
"""
In the corpus above, the first and last documents contain exactly the same words, only in a different order, so they are encoded as identical feature vectors: the bag-of-words representation discards information about word order. To preserve some local ordering, we can extract bigrams in addition to unigrams:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2),
                                    token_pattern=r"\b\w+\b", min_df=1)
analyze = bigram_vectorizer.build_analyzer()
# Note: lowercase=True by default, so "Bi" becomes "bi".
print(analyze("Bi-grams are cool!") ==
      (['bi', 'grams', 'are', 'cool', 'bi grams', 'grams are', 'are cool']))
# True
x_2 = bigram_vectorizer.fit_transform(corpus).toarray()
print(x_2)
"""
[[0 0 1 1 1 1 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0]
 [0 0 1 0 0 1 1 0 0 2 1 1 1 0 1 0 0 0 1 1 0]
 [1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 1 1 0 0 0]
 [0 0 1 1 1 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 1]]
"""
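To check directly that bigrams recover the local ordering that plain bag of words loses, here is a small sketch comparing just the first and last documents of the corpus: their unigram vectors are identical, but their bigram vectors differ because "this is" and "is this" hash to different columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["this is the first document.",
          "Is this the first document?"]

# Plain bag of words: the two sentences contain the same words,
# so their count vectors are identical and word order is lost.
unigram = CountVectorizer(min_df=1)
x1 = unigram.fit_transform(corpus).toarray()
print((x1[0] == x1[1]).all())   # True

# With bigrams, "this is" and "is this" become distinct features,
# so the two sentences are no longer indistinguishable.
bigram = CountVectorizer(ngram_range=(1, 2),
                         token_pattern=r"\b\w+\b", min_df=1)
x2 = bigram.fit_transform(corpus).toarray()
print((x2[0] == x2[1]).all())   # False
```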