Most data mining algorithms rely on numerical or categorical features: we extract numerical and categorical features from the dataset and select the best of them.
Features are used to build models, and a model represents reality in an approximate way that data mining algorithms can understand.
Another advantage of feature selection is that it reduces the complexity of the real world: a model is easier to manipulate than reality itself.
Feature Selection
The VarianceThreshold transformer in scikit-learn can be used to remove features whose variance does not reach a minimum threshold.
import numpy as np
x = np.arange(30).reshape((10, 3))  # a dataset of 10 individuals with 3 features
print(x)
x[:, 1] = 1  # set every value in the second column to 1
print(x)
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold()  # create a VarianceThreshold transformer and apply it to the dataset
Xt = vt.fit_transform(x)
print(Xt)  # the second column is gone
print(vt.variances_)  # the variance of each column

Result:
[[ 0  1  2]
 [ 3  4  5]
 [ 6  7  8]
 [ 9 10 11]
 [12 13 14]
 [15 16 17]
 [18 19 20]
 [21 22 23]
 [24 25 26]
 [27 28 29]]
[[ 0  1  2]
 [ 3  1  5]
 [ 6  1  8]
 [ 9  1 11]
 [12  1 14]
 [15  1 17]
 [18  1 20]
 [21  1 23]
 [24  1 26]
 [27  1 29]]
[[ 0  2]
 [ 3  5]
 [ 6  8]
 [ 9 11]
 [12 14]
 [15 17]
 [18 20]
 [21 23]
 [24 26]
 [27 29]]
[ 74.25   0.    74.25]
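By default VarianceThreshold only removes features whose variance is exactly zero. A minimal sketch of passing an explicit cutoff (the 0.5 value is an arbitrary choice for illustration, not from the example above):

from sklearn.feature_selection import VarianceThreshold
# Keep only features whose variance is strictly greater than 0.5
# (the threshold value here is chosen arbitrarily for illustration).
vt = VarianceThreshold(threshold=0.5)
Xt = vt.fit_transform(x)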
Example: use the Adult dataset to model the complex real world with features and predict whether a person earns more than $50,000 a year.
import os
import pandas as pd

data_folder = os.path.join(os.getcwd(), 'Data', 'adult')
adult_filename = os.path.join(data_folder, 'adult.data.txt')
adult = pd.read_csv(adult_filename, header=None,
                    names=["Age", "Work-Class", "fnlwgt", "Education",
                           "Education-Num", "Marital-Status", "Occupation",
                           "Relationship", "Race", "Sex", "Capital-gain",
                           "Capital-loss", "Hours-per-week", "Native-Country",
                           "Earnings-Raw"])
# Remove rows with invalid numbers (inplace=True modifies the current
# DataFrame instead of creating a new one).
adult.dropna(how='all', inplace=True)
# print(adult["Work-Class"].unique())  # the unique() method lists all work-class values
# Discretization: turn the continuous hours-per-week value into a categorical feature.
adult["LongHours"] = adult["Hours-per-week"] > 40

# Test how individual features perform on the Adult dataset.
X = adult[["Age", "Education-Num", "Capital-gain", "Capital-loss",
           "Hours-per-week"]].values
y = (adult["Earnings-Raw"] == ' >50K').values

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Initialize a SelectKBest transformer that scores features with the chi-squared statistic.
transformer = SelectKBest(score_func=chi2, k=3)
# fit_transform preprocesses and transforms the same dataset in one call.
Xt_chi2 = transformer.fit_transform(X, y)
print(transformer.scores_)  # the relevance score of each column

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in newer scikit-learn versions
clf = DecisionTreeClassifier(random_state=14)
scores_chi2 = cross_val_score(clf, Xt_chi2, y, scoring='accuracy')
print(scores_chi2)
Result:
[  8.60061182e+03   2.40142178e+03   8.21924671e+07   1.37214589e+06
   6.47640900e+03]
[ 0.82577851  0.82992445  0.83009306]  # accuracy reaches 83%
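For comparison, a minimal sketch (assuming X, y, and clf from the code above are still in scope) that evaluates the same decision tree on all five numeric features, so the effect of keeping only the three best chi-squared features can be judged against a baseline:

from sklearn.model_selection import cross_val_score
# Baseline: the same decision tree, trained on all five numeric features
scores_all = cross_val_score(clf, X, y, scoring='accuracy')
print(scores_all)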
Feature Creation
When features are strongly correlated, or when features are redundant, they make the algorithm's job harder. For this reason we create new features, for example by combining existing ones.
import os
import numpy as np
import pandas as pd

data_folder = os.path.join(os.getcwd(), "Data")
data_filename = os.path.join(data_folder, "adult", "ad.data.txt")

# The first few features are numeric, but pandas reads them as strings. To fix
# this, we write a conversion function that turns strings containing only
# numbers into floats and everything else into NaN.
def convert_number(x):
    try:
        return float(x)
    except ValueError:
        return np.nan

# Register the converter for every feature column up front. (A plain dict is
# used here: the original defaultdict(convert_number) would fail, because a
# default factory is called with no arguments.)
converters = {i: convert_number for i in range(1558)}
# Convert the class column from a string into a numeric value.
converters[1558] = lambda x: 1 if x.strip() == "ad." else 0

ads = pd.read_csv(data_filename, header=None, converters=converters)
# print(ads[:5])
ads.dropna(inplace=True)  # remove rows containing missing values

# Extract the X matrix and y array for the classification algorithm.
X = ads.drop(1558, axis=1).values
y = ads[1558]

from sklearn.decomposition import PCA
# Principal Component Analysis (PCA) finds combinations of features that
# describe the dataset with less information. A model built on the data PCA
# returns not only approximates the original dataset, it can also improve the
# accuracy of classification tasks.
pca = PCA(n_components=5)
Xd = pca.fit_transform(X)
np.set_printoptions(precision=3, suppress=True)
print(pca.explained_variance_ratio_)  # share of the variance explained by each component

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=14)
scores_reduced = cross_val_score(clf, Xd, y, scoring='accuracy')
print(scores_reduced)

# Plot the first two components returned by PCA.
from matplotlib import pyplot as plt
classes = set(y)
colors = ['red', 'green']
for cur_class, color in zip(classes, colors):
    mask = (y == cur_class).values
    plt.scatter(Xd[mask, 0], Xd[mask, 1], marker='o', color=color,
                label=int(cur_class))
plt.legend()
plt.show()

Result:
[ 0.854  0.145  0.001  0.     0.   ]
[ 0.944  0.924  0.925]
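Since the first component alone explains about 85% of the variance here, a minimal sketch (reusing the fitted pca object from above) of checking the cumulative explained-variance ratio, a common way to decide how many components to keep:

import numpy as np
# Cumulative share of the variance explained by the first k components
print(np.cumsum(pca.explained_variance_ratio_))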