Feature extraction: sklearn.feature_extraction.FeatureHasher


sklearn.feature_extraction.FeatureHasher(n_features=1048576, input_type="dict", dtype=<class 'numpy.float64'>, alternate_sign=True, non_negative=False):
  An implementation of feature hashing (the "hashing trick").
  This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash function to compute the matrix column corresponding to each name. The hash function employed is the signed 32-bit version of MurmurHash3.

  Feature names of type byte string are used as-is. Unicode strings are first converted to UTF-8, but no Unicode normalization is done. Feature values must be (finite) numbers.
  This class is a low-memory alternative to DictVectorizer and CountVectorizer, intended for large-scale (online) learning and for memory-constrained situations, e.g. when running prediction code on embedded devices.
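
  To make the mapping concrete, here is a minimal sketch of the hashing trick using sklearn.utils.murmurhash3_32 (the signed 32-bit MurmurHash3 that scikit-learn exposes); the seed and the exact index/sign layout below are illustrative assumptions and may not match FeatureHasher's internals bit-for-bit:

    from sklearn.utils import murmurhash3_32

    def hashed_column(feature_name, n_features=2 ** 20):
        """Map a feature name to a (column, sign) pair, as the hashing trick does."""
        h = murmurhash3_32(feature_name, seed=0, positive=False)  # signed 32-bit hash
        column = abs(h) % n_features  # column index in the sparse output matrix
        sign = 1 if h >= 0 else -1    # sign applied when alternate_sign=True
        return column, sign

    print(hashed_column("dog"))
    print(hashed_column("cat"))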

  n_features: integer
    The number of features (columns) in the output matrix. A small number of features is likely to cause hash collisions, while a large number blows up the dimensionality for linear learners.
  input_type:
    "dict" means the input data is a sequence of dicts, [{feature_name: value}, …];
    "pair" means the input data is a sequence of pairs, [[(feature_name1, value1), (feature_name2, value2)], …];
    "string" means the input data is a sequence of raw tokens, [[feature_name1, feature_name1, …], …], in which feature_name1 occurs value1 times, feature_name2 occurs value2 times, and so on.
    In every case feature_name must be a string and value must be a number. With "string" input, each feature_name implicitly has a value of 1. Feature names are hashed to compute the matrix column corresponding to each name. The sign of a value may be flipped in the output data (see alternate_sign below and the sketch after this parameter list).
  dtype:
    The type of the feature values. This value is passed to the scipy.sparse matrix constructor as its dtype argument. This parameter must not be set to bool, np.boolean or any unsigned integer type.
  alternate_sign:
    If True, an alternating sign is added to the hashed features (some become negative), so that inner products are approximately preserved in the hashed space. This approach is similar to sparse random projection.

  non_negative:
    If True, absolute values are applied to the feature matrix before it is returned. When used together with alternate_sign=True, this significantly degrades how well inner products are preserved.
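
  The effect of alternate_sign can be seen directly with a deliberately tiny n_features; a minimal sketch:

    from sklearn.feature_extraction import FeatureHasher

    docs = [["dog", "cat", "elephant", "run"]]
    signed = FeatureHasher(n_features=4, input_type="string", alternate_sign=True)
    unsigned = FeatureHasher(n_features=4, input_type="string", alternate_sign=False)

    # With alternate_sign=True, entries may be negative and colliding features
    # with opposite signs partially cancel; with False, collisions always add up.
    print(signed.transform(docs).toarray())
    print(unsigned.transform(docs).toarray())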

  The methods of this class are consistent with those of the other feature extraction classes.
  The following code examples come from the sklearn API documentation.
  URL: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html#sklearn.feature_extraction.FeatureHasher

Example 1:
    

    from sklearn.feature_extraction import FeatureHasher

    h = FeatureHasher(n_features=10, input_type='string', dtype=int, alternate_sign=False)
    # The same data in 'dict' and 'pair' form, shown for reference only (they
    # would require input_type='dict' / input_type='pair' respectively):
    # d = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
    # d = [[('dog', 1), ('cat', 2), ('elephant', 4)], [('dog', 2), ('run', 5)]]
    # With input_type='string' every token implicitly has value 1, so a value
    # of n is encoded by repeating the token n times:
    d = [['dog', 'cat', 'cat', 'elephant', 'elephant', 'elephant', 'elephant'],
         ['dog', 'dog', 'run', 'run', 'run', 'run', 'run'],
         ['run', 'run']]
    f = h.transform(d)
    print(f.toarray())
    print(h.get_params())
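
    Since alternate_sign=False and dtype=int, f.toarray() is a 3x10 array of non-negative counts, and each row sums to the number of tokens in that document: f.toarray().sum(axis=1) gives [7, 7, 2] no matter which columns the tokens hash to.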

 


Example 2:

from __future__ import print_function
from collections import defaultdict
import re
import sys
from time import time
 
import numpy as np
 
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction import DictVectorizer, FeatureHasher
from memory_profiler import profile
 
 
def n_nonzero_columns(X):
    """Returns the number of non-zero columns in a CSR matrix X."""
    return len(np.unique(X.nonzero()[1]))
 
 
def tokens(doc):
    """
    簡單的將doc拆分成詞語,刪除英文字母外的符號,並且都小寫化
    :param doc:
    :return:
    """
    return (tok.lower() for tok in re.findall(r"\w+", doc))
 
 
def token_freqs(doc):
    """
    對doc中的詞語進行頻率統計
    :param doc:
    :return:
    """
    freq = defaultdict(int)
    for tok in tokens(doc):
        freq[tok] += 1
    return freq
 
@profile
def dict_vectorizer(raw_data, data_size_mb):
    print("DictVectorizer")
    t0 = time()
    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform(token_freqs(d) for d in raw_data)
    duration = time() - t0
    print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    print("Found %d unique terms\n" % len(vectorizer.get_feature_names()))
    print("X.shape: ", X.shape)
 
@profile
def feature_hasher_freq(raw_data, data_size_mb, n_features):
    print("FeatureHasher on frequency dicts")
    t0 = time()
    hasher = FeatureHasher(n_features=n_features)
    X = hasher.transform(token_freqs(d) for d in raw_data)
    duration = time() - t0
    print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    print("Found %d unique terms\n" % n_nonzero_columns(X))
    print("X.shape: ", X.shape)
    del X
 
@profile
def feature_hasher_terms(raw_data, data_size_mb, n_features):
    print("FeatureHasher on raw tokens")
    t0 = time()
    hasher = FeatureHasher(n_features=n_features, input_type="string")
    X = hasher.transform(tokens(d) for d in raw_data)
    duration = time() - t0
    print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    print("Found %d unique terms" % n_nonzero_columns(X))
    print("X.shape: ", X.shape)
    del X
 
 
@profile
def compare():
    # 1. Use only a subset of the newsgroup categories
    categories = [
        'alt.atheism',
        'comp.graphics',
        'comp.sys.ibm.pc.hardware',
        'misc.forsale',
        'rec.autos',
        'sci.space',
        'talk.religion.misc',
    ]
 
    print("Usage: %s [n_features_for_hashing]" % sys.argv[0])
    print("       The default number of features is 2**18.\n\n")
 
    try:
        n_features = int(sys.argv[1])
    except IndexError:
        n_features = 2 ** 18
    except ValueError:
        print("not a valid number of features: %r" % sys.argv[1])
        sys.exit(1)
 
    print("Loading 20 newsgroups training data")
    # 2. On the first run, downloading the dataset takes a while
    # data_home: the directory where the downloaded files are stored
    # if data_home is empty and download_if_missing is True, the files are downloaded to data_home automatically
    raw_data = fetch_20newsgroups(data_home=r"D:\學習\sklearn_dataset\20newsbydate",
                                  subset='train',
                                  categories=categories,
                                  download_if_missing=True
                                  ).data
 
    # 3. Compute the total size of the text in MB
    data_size_mb = sum(len(s.encode('utf-8')) for s in raw_data) / 1e6
    print("%d documents - %0.3fMB\n" % (len(raw_data), data_size_mb))
 
    dict_vectorizer(raw_data, data_size_mb)
    feature_hasher_freq(raw_data, data_size_mb, n_features)
    feature_hasher_terms(raw_data, data_size_mb, n_features)
 
 
if __name__ == '__main__':
    compare()
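
To reproduce the output below, memory_profiler must be installed (e.g. pip install memory_profiler); because profile is imported explicitly, running the script normally prints the line-by-line memory tables. An optional command-line argument overrides the default n_features of 2**18:

    python plot_hashing_vs_dictvectorizer.py 65536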

Example 2 output:
  

Usage: D:/Project/nlplearn/sklearn_learn/plot_hashing_vs_dictvectorizer.py [n_features_for_hashing]
       The default number of features is 2**18.
 
 
Loading 20 newsgroups training data
3803 documents - 6.245MB
 
DictVectorizer
done in 16.495944s at 0.379MB/s
Found 47928 unique terms
 
X.shape:  (3803, 47928)
Filename: D:/Project/nlplearn/sklearn_learn/plot_hashing_vs_dictvectorizer.py
 
Line #    Mem usage    Increment   Line Contents
================================================
    42     98.9 MiB     98.9 MiB   @profile
    43                             def dict_vectorizer(raw_data, data_size_mb):
    44     98.9 MiB      0.0 MiB       print("DictVectorizer")
    45     98.9 MiB      0.0 MiB       t0 = time()
    46     98.9 MiB      0.0 MiB       vectorizer = DictVectorizer()
    47    130.7 MiB      1.3 MiB       X = vectorizer.fit_transform(token_freqs(d) for d in raw_data)
    48    130.7 MiB      0.0 MiB       duration = time() - t0
    49    130.7 MiB      0.0 MiB       print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    50    130.7 MiB      0.0 MiB       print("Found %d unique terms\n" % len(vectorizer.get_feature_names()))
    51    130.7 MiB      0.0 MiB       print("X.shape: ", X.shape)
 
 
FeatureHasher on frequency dicts
done in 8.953512s at 0.697MB/s
Found 43873 unique terms
 
X.shape:  (3803, 262144)
Filename: D:/Project/nlplearn/sklearn_learn/plot_hashing_vs_dictvectorizer.py
 
Line #    Mem usage    Increment   Line Contents
================================================
    53    106.5 MiB    106.5 MiB   @profile
    54                             def feature_hasher_freq(raw_data, data_size_mb, n_features):
    55    106.5 MiB      0.0 MiB       print("FeatureHasher on frequency dicts")
    56    106.5 MiB      0.0 MiB       t0 = time()
    57    106.5 MiB      0.0 MiB       hasher = FeatureHasher(n_features=n_features)
    58    116.8 MiB      4.0 MiB       X = hasher.transform(token_freqs(d) for d in raw_data)
    59    116.8 MiB      0.0 MiB       duration = time() - t0
    60    116.8 MiB      0.0 MiB       print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    61    116.8 MiB      0.0 MiB       print("Found %d unique terms\n" % n_nonzero_columns(X))
    62    116.8 MiB      0.0 MiB       print("X.shape: ", X.shape)
    63    106.6 MiB      0.0 MiB       del X
 
 
FeatureHasher on raw tokens
done in 9.989571s at 0.625MB/s
Found 43873 unique terms
X.shape:  (3803, 262144)
Filename: D:/Project/nlplearn/sklearn_learn/plot_hashing_vs_dictvectorizer.py
 
Line #    Mem usage    Increment   Line Contents
================================================
    65    106.6 MiB    106.6 MiB   @profile
    66                             def feature_hasher_terms(raw_data, data_size_mb, n_features):
    67    106.6 MiB      0.0 MiB       print("FeatureHasher on raw tokens")
    68    106.6 MiB      0.0 MiB       t0 = time()
    69    106.6 MiB      0.0 MiB       hasher = FeatureHasher(n_features=n_features, input_type="string")
    70    118.6 MiB      4.0 MiB       X = hasher.transform(tokens(d) for d in raw_data)
    71    118.6 MiB      0.0 MiB       duration = time() - t0
    72    118.6 MiB      0.0 MiB       print("done in %fs at %0.3fMB/s" % (duration, data_size_mb / duration))
    73    118.6 MiB      0.0 MiB       print("Found %d unique terms" % n_nonzero_columns(X))
    74    118.6 MiB      0.0 MiB       print("X.shape: ", X.shape)
    75    106.7 MiB      0.0 MiB       del X
 
 
Filename: D:/Project/nlplearn/sklearn_learn/plot_hashing_vs_dictvectorizer.py
 
Line #    Mem usage    Increment   Line Contents
================================================
    78     71.5 MiB     71.5 MiB   @profile
    79                             def compare():
    80                                 # 1. Use only a subset of the newsgroup categories
    81                                 categories = [
    82     71.5 MiB      0.0 MiB           'alt.atheism',
    83     71.5 MiB      0.0 MiB           'comp.graphics',
    84     71.5 MiB      0.0 MiB           'comp.sys.ibm.pc.hardware',
    85     71.5 MiB      0.0 MiB           'misc.forsale',
    86     71.5 MiB      0.0 MiB           'rec.autos',
    87     71.5 MiB      0.0 MiB           'sci.space',
    88     71.5 MiB      0.0 MiB           'talk.religion.misc',
    89                                 ]
    90                             
    91     71.5 MiB      0.0 MiB       print("Usage: %s [n_features_for_hashing]" % sys.argv[0])
    92     71.5 MiB      0.0 MiB       print("       The default number of features is 2**18.\n\n")
    93                             
    94     71.5 MiB      0.0 MiB       try:
    95     71.5 MiB      0.0 MiB           n_features = int(sys.argv[1])
    96     71.5 MiB      0.0 MiB       except IndexError:
    97     71.5 MiB      0.0 MiB           n_features = 2 ** 18
    98                                 except ValueError:
    99                                     print("not a valid number of features: %r" % sys.argv[1])
   100                                     sys.exit(1)
   101                             
   102     71.5 MiB      0.0 MiB       print("Loading 20 newsgroups training data")
   103                                 # 2. On the first run, downloading the dataset takes a while
   104                                 # data_home: the directory where the downloaded files are stored
   105                                 # if data_home is empty and download_if_missing is True, the files are downloaded to data_home automatically
   106     71.5 MiB      0.0 MiB       raw_data = fetch_20newsgroups(data_home=r"D:\學習\sklearn_dataset\20newsbydate",
   107     71.5 MiB      0.0 MiB                                     subset='train',
   108     71.5 MiB      0.0 MiB                                     categories=categories,
   109     98.0 MiB     26.5 MiB                                     download_if_missing=True
   110                                                               ).data
   111                             
   112                                 # 3. Compute the total size of the text in MB
   113     98.9 MiB      0.1 MiB       data_size_mb = sum(len(s.encode('utf-8')) for s in raw_data) / 1e6
   114     98.9 MiB      0.0 MiB       print("%d documents - %0.3fMB\n" % (len(raw_data), data_size_mb))
   115                             
   116    106.5 MiB      7.6 MiB       dict_vectorizer(raw_data, data_size_mb)
   117    106.6 MiB      0.1 MiB       feature_hasher_freq(raw_data, data_size_mb, n_features)
   118    106.7 MiB      0.1 MiB       feature_hasher_terms(raw_data, data_size_mb, n_features)

  From the output we can see that, compared with DictVectorizer:
    1. FeatureHasher transforms the data faster. Changing n_features changes FeatureHasher's speed, but it stays faster than DictVectorizer.
    2. FeatureHasher ends up with fewer unique terms than DictVectorizer (43873 vs 47928): some features were merged together by hash collisions, as the sketch below shows.
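
  A minimal sketch showing point 2 directly: as n_features shrinks, more distinct terms collide into the same column, so the number of non-zero columns falls below the true vocabulary size. Exact counts depend on the corpus and, with the default alternate_sign=True, on sign cancellation.

    import numpy as np
    from sklearn.feature_extraction import FeatureHasher

    docs = [{"dog": 1, "cat": 2, "elephant": 4}, {"dog": 2, "run": 5}]
    vocabulary_size = 4  # dog, cat, elephant, run
    for n_features in (2, 16, 2 ** 18):
        X = FeatureHasher(n_features=n_features).transform(docs)
        n_cols = len(np.unique(X.nonzero()[1]))  # same idea as n_nonzero_columns above
        print(n_features, "features ->", n_cols, "non-zero columns of",
              vocabulary_size, "terms")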




