sklearn.feature_extraction.FeatureHasher(n_features=1048576, input_type="dict", dtype=<class 'numpy.float64'>, alternate_sign=True, non_negative=False):
An implementation of feature hashing (the "hashing trick").
This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash function to compute the matrix column corresponding to each name. The hash function employed is the signed 32-bit version of MurmurHash3.
Feature names of type byte string are used as-is. Unicode strings are converted to UTF-8 first, but no Unicode normalization is done. Feature values must be (finite) numbers.
This class is a low-memory alternative to DictVectorizer and CountVectorizer, intended for large-scale (online) learning and for memory-constrained situations, e.g. when running prediction code on embedded devices.
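As a rough sketch of the mechanism (this mirrors the idea, not necessarily the library's exact code path), the column index and sign for a feature name can be derived with the signed MurmurHash3 helper that scikit-learn exposes; the hypothetical helper function hashed_column and the modulo scheme below are assumptions for illustration:

from sklearn.utils import murmurhash3_32

def hashed_column(feature_name, n_features=2 ** 20):
    # Signed 32-bit MurmurHash3 of the (UTF-8 encoded) feature name.
    h = murmurhash3_32(feature_name, seed=0)
    column = abs(h) % n_features    # which output column the feature lands in
    sign = 1 if h >= 0 else -1      # the sign used when alternate_sign=True
    return column, sign

print(hashed_column('dog'))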
n_features: integer
    The number of features (columns) in the output matrix. Too few features are likely to cause hash collisions, while too many inflate the coefficient dimensionality of linear learners.
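For intuition about collisions, here is a deliberately tiny sketch (toy tokens; n_features=2 chosen only to force collisions):

from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=2, input_type='string', alternate_sign=False)
X = h.transform([['apple', 'banana', 'cherry', 'date']]).toarray()
print(X)  # four distinct tokens squeezed into two columns, so counts pile up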
input_type:
    "dict" means each input sample is a dict, as in [{feature_name: value}, …];
    "pair" means each sample is a list of pairs, as in [[(feature_name1, value1), (feature_name2, value2)], …];
    "string" means each sample is a list of raw feature names, as in [[feature_name1, feature_name1, …]]; a sample containing feature_name1 value1 times and feature_name2 value2 times is equivalent to the dict {feature_name1: value1, feature_name2: value2}.
In every case feature_name must be a string and value must be a number. In the "string" case, each occurrence of a feature_name carries an implicit value of 1. Feature names are hashed to compute the matrix column for the feature, and the sign of a value may be flipped in the output (see alternate_sign below); the equivalence of the three input types is shown in the sketch that follows.
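A minimal sketch (toy data; default parameters assumed) showing that the same logical sample expressed in the three input types hashes to the same row:

from sklearn.feature_extraction import FeatureHasher
import numpy as np

as_dict = [{'dog': 1, 'cat': 2}]
as_pair = [[('dog', 1), ('cat', 2)]]
as_str = [['dog', 'cat', 'cat']]  # 'cat' appears twice => implicit value 2

rows = []
for input_type, data in (('dict', as_dict), ('pair', as_pair), ('string', as_str)):
    h = FeatureHasher(n_features=8, input_type=input_type)
    rows.append(h.transform(data).toarray())

assert np.array_equal(rows[0], rows[1])
assert np.array_equal(rows[0], rows[2])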
dtype:
    The type of the feature values. This value is passed to the scipy.sparse matrix constructor as its dtype argument. Do not set this to bool, np.boolean or any unsigned integer type.
alternate_sign:
    When True, an alternating sign is applied based on each feature's hash value (flipping some positive values to negative), so that inner products are approximately preserved in the hashed space. This technique is similar to sparse random projection.
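A quick illustration of the alternating sign (toy tokens assumed; which entries flip depends on the hash values):

from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=4, input_type='string', alternate_sign=True)
X = h.transform([['a', 'b', 'c', 'd', 'e']]).toarray()
# Some counts come out negative; colliding features then partially cancel
# instead of piling up, which keeps inner products roughly unbiased.
print(X)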
non_negative:
    When True, an absolute value is applied to the feature matrix before it is returned. When used together with alternate_sign=True, this significantly degrades the preservation of inner products. (In later scikit-learn releases this parameter was deprecated and eventually removed.)
The methods of this class follow the same conventions as the other feature-extraction classes.
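Since the hasher is stateless, fit learns nothing from the data; a minimal sketch of the shared estimator API (toy input assumed):

from sklearn.feature_extraction import FeatureHasher
import numpy as np

h = FeatureHasher(n_features=8, input_type='dict')
h.fit()                             # stateless: nothing is learned
X1 = h.transform([{'dog': 1}])      # transform works without any fitting
X2 = h.fit_transform([{'dog': 1}])  # fit_transform via TransformerMixin
assert np.array_equal(X1.toarray(), X2.toarray())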
The following code examples are taken from the official sklearn API documentation.
URL: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher
Example 1:
from sklearn.feature_extraction import FeatureHasher

h = FeatureHasher(n_features=10, input_type='string', dtype=int, alternate_sign=False)
# The same data written in the three supported input types; only the last
# assignment of d (the 'string' form) matches input_type='string' above.
d = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
d = [[('dog', 1), ('cat', 2), ('elephant', 4)], [('dog', 2), ('run', 5)]]
d = [['dog', 'cat', 'cat', 'elephant', 'elephant', 'elephant', 'elephant'],
     ['dog', 'dog', 'run', 'run', 'run', 'run', 'run'],
     ['run', 'run']]
f = h.transform(d)
print(f.toarray())
print(h.get_params())
Example 2:
from __future__ import print_function

from collections import defaultdict
import re
import sys
from time import time

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction import DictVectorizer, FeatureHasher
from memory_profiler import profile


def n_nonzero_columns(X):
    """Returns the number of non-zero columns in a CSR matrix X."""
    return len(np.unique(X.nonzero()[1]))


def tokens(doc):
    """Naively split doc into tokens, dropping non-word characters
    and lowercasing everything."""
    return (tok.lower() for tok in re.findall(r"\w+", doc))


def token_freqs(doc):
    """Count term frequencies for the tokens in doc."""
    freq = defaultdict(int)
    for tok in tokens(doc):
        freq[tok] += 1
    return freq


@profile
def dict_vectorizer(raw_data, data_size_mb):
    print("DictVectorizer")
    t0 = time()
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(token_freqs(d) for d in raw_data)
    duration = time() - t0
    print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    print("Found %d unique terms\n" % len(vectorizer.get_feature_names()))
    print("X.shape: ", X.shape)


@profile
def feature_hasher_freq(raw_data, data_size_mb, n_features):
    print("FeatureHasher on frequency dicts")
    t0 = time()
    hasher = FeatureHasher(n_features=n_features)
    X = hasher.transform(token_freqs(d) for d in raw_data)
    duration = time() - t0
    print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    print("Found %d unique terms\n" % n_nonzero_columns(X))
    print("X.shape: ", X.shape)
    del X


@profile
def feature_hasher_terms(raw_data, data_size_mb, n_features):
    print("FeatureHasher on raw tokens")
    t0 = time()
    hasher = FeatureHasher(n_features=n_features, input_type="string")
    X = hasher.transform(tokens(d) for d in raw_data)
    duration = time() - t0
    print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    print("Found %d unique terms" % n_nonzero_columns(X))
    print("X.shape: ", X.shape)
    del X


@profile
def compare():
    # 1. Use only a subset of the categories
    categories = [
        'alt.atheism',
        'comp.graphics',
        'comp.sys.ibm.pc.hardware',
        'misc.forsale',
        'rec.autos',
        'sci.space',
        'talk.religion.misc',
    ]

    print("Usage: %s [n_features_for_hashing]" % sys.argv[0])
    print("    The default number of features is 2**18.\n\n")

    try:
        n_features = int(sys.argv[1])
    except IndexError:
        n_features = 2 ** 18
    except ValueError:
        print("not a valid number of features: %r" % sys.argv[1])
        sys.exit(1)

    print("Loading 20 newsgroups training data")
    # 2. The first run takes a while because the dataset must be downloaded.
    # data_home is where the downloaded files are stored; with
    # download_if_missing=True, missing files are fetched into data_home.
    raw_data = fetch_20newsgroups(data_home=r"D:\學習\sklearn_dataset\20newsbydate",
                                  subset='train',
                                  categories=categories,
                                  download_if_missing=True
                                  ).data

    # 3. Compute the size of the raw text
    data_size_mb = sum(len(s.encode('utf-8')) for s in raw_data) / 1e6
    print("%d documents - %0.3fMB\n" % (len(raw_data), data_size_mb))

    dict_vectorizer(raw_data, data_size_mb)
    feature_hasher_freq(raw_data, data_size_mb, n_features)
    feature_hasher_terms(raw_data, data_size_mb, n_features)


if __name__ == '__main__':
    compare()
Output of example 2:
Usage: D:/Project/nlplearn/sklearn_learn/plot_hashing_vs_dictvectorizer.py [n_features_for_hashing]
    The default number of features is 2**18.


Loading 20 newsgroups training data
3803 documents - 6.245MB

DictVectorizer
done in 16.495944s at 0.379MB/s
Found 47928 unique terms

X.shape:  (3803, 47928)
Filename: D:/Project/nlplearn/sklearn_learn/plot_hashing_vs_dictvectorizer.py

Line #    Mem usage    Increment   Line Contents
================================================
    42     98.9 MiB     98.9 MiB   @profile
    43                             def dict_vectorizer(raw_data, data_size_mb):
    44     98.9 MiB      0.0 MiB       print("DictVectorizer")
    45     98.9 MiB      0.0 MiB       t0 = time()
    46     98.9 MiB      0.0 MiB       vectorizer = DictVectorizer()
    47    130.7 MiB      1.3 MiB       X = vectorizer.fit_transform(token_freqs(d) for d in raw_data)
    48    130.7 MiB      0.0 MiB       duration = time() - t0
    49    130.7 MiB      0.0 MiB       print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    50    130.7 MiB      0.0 MiB       print("Found %d unique terms\n" % len(vectorizer.get_feature_names()))
    51    130.7 MiB      0.0 MiB       print("X.shape: ", X.shape)

FeatureHasher on frequency dicts
done in 8.953512s at 0.697MB/s
Found 43873 unique terms

X.shape:  (3803, 262144)
Filename: D:/Project/nlplearn/sklearn_learn/plot_hashing_vs_dictvectorizer.py

Line #    Mem usage    Increment   Line Contents
================================================
    53    106.5 MiB    106.5 MiB   @profile
    54                             def feature_hasher_freq(raw_data, data_size_mb, n_features):
    55    106.5 MiB      0.0 MiB       print("FeatureHasher on frequency dicts")
    56    106.5 MiB      0.0 MiB       t0 = time()
    57    106.5 MiB      0.0 MiB       hasher = FeatureHasher(n_features=n_features)
    58    116.8 MiB      4.0 MiB       X = hasher.transform(token_freqs(d) for d in raw_data)
    59    116.8 MiB      0.0 MiB       duration = time() - t0
    60    116.8 MiB      0.0 MiB       print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    61    116.8 MiB      0.0 MiB       print("Found %d unique terms\n" % n_nonzero_columns(X))
    62    116.8 MiB      0.0 MiB       print("X.shape: ", X.shape)
    63    106.6 MiB      0.0 MiB       del X

FeatureHasher on raw tokens
done in 9.989571s at 0.625MB/s
Found 43873 unique terms
X.shape:  (3803, 262144)
Filename: D:/Project/nlplearn/sklearn_learn/plot_hashing_vs_dictvectorizer.py

Line #    Mem usage    Increment   Line Contents
================================================
    65    106.6 MiB    106.6 MiB   @profile
    66                             def feature_hasher_terms(raw_data, data_size_mb, n_features):
    67    106.6 MiB      0.0 MiB       print("FeatureHasher on raw tokens")
    68    106.6 MiB      0.0 MiB       t0 = time()
    69    106.6 MiB      0.0 MiB       hasher = FeatureHasher(n_features=n_features, input_type="string")
    70    118.6 MiB      4.0 MiB       X = hasher.transform(tokens(d) for d in raw_data)
    71    118.6 MiB      0.0 MiB       duration = time() - t0
    72    118.6 MiB      0.0 MiB       print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    73    118.6 MiB      0.0 MiB       print("Found %d unique terms" % n_nonzero_columns(X))
    74    118.6 MiB      0.0 MiB       print("X.shape: ", X.shape)
    75    106.7 MiB      0.0 MiB       del X

Filename: D:/Project/nlplearn/sklearn_learn/plot_hashing_vs_dictvectorizer.py

Line #    Mem usage    Increment   Line Contents
================================================
    78     71.5 MiB     71.5 MiB   @profile
    79                             def compare():
    80                                 # 1. Use only a subset of the categories
    81                                 categories = [
    82     71.5 MiB      0.0 MiB           'alt.atheism',
    83     71.5 MiB      0.0 MiB           'comp.graphics',
    84     71.5 MiB      0.0 MiB           'comp.sys.ibm.pc.hardware',
    85     71.5 MiB      0.0 MiB           'misc.forsale',
    86     71.5 MiB      0.0 MiB           'rec.autos',
    87     71.5 MiB      0.0 MiB           'sci.space',
    88     71.5 MiB      0.0 MiB           'talk.religion.misc',
    89                                 ]
    90
    91     71.5 MiB      0.0 MiB       print("Usage: %s [n_features_for_hashing]" % sys.argv[0])
    92     71.5 MiB      0.0 MiB       print("    The default number of features is 2**18.\n\n")
    93
    94     71.5 MiB      0.0 MiB       try:
    95     71.5 MiB      0.0 MiB           n_features = int(sys.argv[1])
    96     71.5 MiB      0.0 MiB       except IndexError:
    97     71.5 MiB      0.0 MiB           n_features = 2 ** 18
    98                                 except ValueError:
    99                                     print("not a valid number of features: %r" % sys.argv[1])
   100                                     sys.exit(1)
   101
   102     71.5 MiB      0.0 MiB       print("Loading 20 newsgroups training data")
   103                                 # 2. The first run takes a while because the dataset must be downloaded.
   104                                 # data_home is where the downloaded files are stored; with
   105                                 # download_if_missing=True, missing files are fetched into data_home.
   106     71.5 MiB      0.0 MiB       raw_data = fetch_20newsgroups(data_home=r"D:\學習\sklearn_dataset\20newsbydate",
   107     71.5 MiB      0.0 MiB                                     subset='train',
   108     71.5 MiB      0.0 MiB                                     categories=categories,
   109     98.0 MiB     26.5 MiB                                     download_if_missing=True
   110                                                               ).data
   111
   112                                 # 3. Compute the size of the raw text
   113     98.9 MiB      0.1 MiB       data_size_mb = sum(len(s.encode('utf-8')) for s in raw_data) / 1e6
   114     98.9 MiB      0.0 MiB       print("%d documents - %0.3fMB\n" % (len(raw_data), data_size_mb))
   115
   116    106.5 MiB      7.6 MiB       dict_vectorizer(raw_data, data_size_mb)
   117    106.6 MiB      0.1 MiB       feature_hasher_freq(raw_data, data_size_mb, n_features)
   118    106.7 MiB      0.1 MiB       feature_hasher_terms(raw_data, data_size_mb, n_features)
From the output we can see that, compared with DictVectorizer:
1. FeatureHasher transforms the data faster. Changing n_features changes FeatureHasher's speed, but it remains faster than DictVectorizer.
2. FeatureHasher ends up with fewer distinct features than DictVectorizer (43873 non-zero columns versus 47928 unique terms): some features were merged by hash collisions.