Machine Learning: Implementing Dimensionality Reduction

Reposted article, 2019-01-06 18:24, category: MachineLearning

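Approaches to dimensionality reduction

Feature selection
Pick, from the original features, the ones with the greatest influence on the result.
Feature extraction
Project the data from a high-dimensional space into a lower-dimensional one.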
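
To make the distinction concrete, here is a minimal sketch on a made-up 5x4 array. The data and the projection matrix W are illustrative assumptions, not part of the original article. Selection keeps a subset of the original columns unchanged, while extraction builds new columns as combinations of all of them.

```python
import numpy as np

# Toy data: 5 samples, 4 features (arbitrary values, for illustration only)
X = np.arange(20, dtype=float).reshape(5, 4)

# Feature selection: keep columns 0 and 2 exactly as they are
X_selected = X[:, [0, 2]]

# Feature extraction: project onto 2 new axes; each new feature mixes all 4 originals.
# W is a hypothetical projection matrix (a method such as PCA would learn it from the data).
W = np.array([[0.5,  0.5],
              [0.5, -0.5],
              [0.5,  0.5],
              [0.5, -0.5]])
X_extracted = X @ W

print(X_selected.shape, X_extracted.shape)
```

Both results have two columns, but only the selected ones can still be read as original measurements.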

I. Feature selection

Removing low-variance features

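Suppose a feature takes only the values 0 and 1, and in 95% of the input samples its value is 1. Such a feature carries little information, and if 100% of the samples share the same value it carries none. The percentage rule applies directly to discrete features; a continuous feature has to be discretized first, because its raw variance depends on its scale. In practice, features in which more than 95% of the samples share one value are uncommon, so the method is simple but rarely sufficient by itself. It works best as a preprocessing step: first drop the features that barely vary, then apply one of the selection methods described below to what remains.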

import pandas
from sklearn.feature_selection import VarianceThreshold
df = pandas.read_csv('Data/customer_behavior.csv')
X = df[['bachelor','gender','age','salary']]
# Features whose variance does not exceed threshold are removed (default 0)
sel = VarianceThreshold(threshold=0)
X_val = sel.fit_transform(X)
names = X.columns[sel.get_support()] # names of the features that were kept
print(names)
# With threshold=0, only features that never change across the data set are removed; a constant feature cannot affect the result
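
The 95% rule of thumb described above can be turned into a threshold, because a 0/1 feature that equals 1 with probability p has variance p * (1 - p). The sketch below uses synthetic columns (made up for illustration, not the article's customer_behavior.csv) to show which features survive a threshold of 0.95 * (1 - 0.95).

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Synthetic 0/1 data with 100 samples (assumed values, for illustration only)
almost_constant = np.array([1] * 97 + [0] * 3)  # 97% ones -> variance 0.97 * 0.03
balanced = np.array([1, 0] * 50)                # 50% ones -> variance 0.25
constant = np.ones(100, dtype=int)              # never changes -> variance 0
X_demo = np.column_stack([almost_constant, balanced, constant])

# Variance of a 0/1 feature that is 1 with probability p is p * (1 - p)
p = 0.95
selector = VarianceThreshold(threshold=p * (1 - p))
X_kept = selector.fit_transform(X_demo)

print(selector.variances_)     # per-feature variances
print(selector.get_support())  # boolean mask of the features that were kept
```

Only the balanced column is kept here; the other two vary too little to clear the threshold.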

Univariate feature selection

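from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
X = df[['bachelor','gender','age','salary']]
y = df['purchased'].values
clf = SelectKBest(chi2,k=2) # k=2: keep the 2 features most strongly related to the target; chi2 requires non-negative feature values
clf.fit(X,y)
print(clf.scores_) # chi-squared score of each feature against the target
X_new = clf.transform(X) # the selector is already fitted, so transform is enough
print(X_new)  # the 2 selected columns, returned as a NumPy array by default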
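
Because the transformed output above no longer carries column names, it helps to map the selection back to feature names with get_support(). The following is a minimal sketch on synthetic data: the column names a to d and the way the target is generated are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)  # binary target
X_demo = pd.DataFrame({
    'a': 5 * y_demo + rng.integers(0, 2, size=n),        # large when y == 1
    'b': 5 * (1 - y_demo) + rng.integers(0, 2, size=n),  # large when y == 0
    'c': rng.integers(0, 6, size=n),                     # unrelated noise
    'd': rng.integers(0, 6, size=n),                     # unrelated noise
})

# chi2 needs non-negative features, which holds for these integer columns
selector = SelectKBest(chi2, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask aligned with the input columns
selected_names = list(X_demo.columns[selector.get_support()])
print(dict(zip(X_demo.columns, selector.scores_.round(2))))
print(selected_names)
```

The two columns tied to the target receive far higher scores than the noise columns, so they are the ones selected.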

Recursive feature elimination (RFE)

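from sklearn.feature_selection import RFE
from sklearn.svm import SVC
clf = SVC(kernel='linear') # RFE needs an estimator that exposes coef_ or feature_importances_; a linear-kernel SVC does
rfe = RFE(clf,n_features_to_select=1) # eliminate features one at a time until only 1 is left
rfe.fit(X_val,y)
# ranking_[i] is the rank of feature i (1 = kept until the end); print the features from best to worst
for name,rank in sorted(zip(names,rfe.ranking_),key=lambda pair: pair[1]):
    print(name,rank)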
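
RFE is not tied to SVC: any estimator that exposes coef_ or feature_importances_ after fitting can drive the elimination. As a hedged illustration, the sketch below swaps in LogisticRegression on synthetic data from make_classification (both choices are assumptions for the example, not the article's setup) and shows how ranking_ and support_ relate.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumed for illustration): 6 features, 2 of them informative
X_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=2,
                                     n_redundant=0, random_state=0)

# LogisticRegression exposes coef_, which RFE uses to pick the feature to drop each round
rfe_demo = RFE(LogisticRegression(max_iter=1000), n_features_to_select=1, step=1)
rfe_demo.fit(X_demo, y_demo)

print(rfe_demo.ranking_)  # rank 1 = the last feature standing, 6 = dropped first
print(rfe_demo.support_)  # boolean mask of the selected feature(s)
```

With step=1 and a single feature to keep, every feature gets a distinct rank, so ranking_ gives a full ordering.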

Selecting features with a random forest

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10,random_state=123)
clf.fit(X_val,y)
for feature in zip(names,clf.feature_importances_): # clf.feature_importances_ gives the importance of each feature to the prediction
    print(feature)

Visualizing feature importance

import matplotlib.pyplot as plt
plt.title('Feature Importance')
plt.bar(range(0,len(names)),clf.feature_importances_)
plt.xticks(range(0,len(names)),names)
plt.show()
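
Importances like the ones plotted above can also drive the selection automatically. One option in scikit-learn is SelectFromModel, which keeps the features whose importance reaches a threshold. The sketch below uses synthetic data and the 'mean' threshold; both are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (assumed for illustration): 8 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=3,
                                     n_redundant=0, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=123)
# threshold='mean' keeps the features whose importance is at least the mean importance
selector = SelectFromModel(forest, threshold='mean').fit(X_demo, y_demo)

importances = selector.estimator_.feature_importances_  # importances of the fitted forest
mask = selector.get_support()                           # which features were kept
X_selected = selector.transform(X_demo)

print(importances.round(3))
print(mask)
print(X_selected.shape)
```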

II. Feature extraction

Principal component analysis (PCA)

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
iris = load_iris()
X = iris.data
y = iris.target
pca = PCA(n_components=2) # n_components: compress the data down to 2 principal components
pca.fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape)
plt.scatter(X_reduced[:,0],X_reduced[:,1],c=y)
plt.show()
# Show how each principal component is built from the original features
for component in pca.components_:
    print('+'.join("%.3f * %s"%(value,name)
                   for value,name in zip(component,iris.feature_names)))
# 0.362 * sepal length (cm)+-0.082 * sepal width (cm)+0.857 * petal length (cm)+0.359 * petal width (cm)
# 0.657 * sepal length (cm)+0.730 * sepal width (cm)+-0.176 * petal length (cm)+-0.075 * petal width (cm)
# (output as reported; the last digits and the signs may differ between scikit-learn versions)
print(pca.explained_variance_)
print(pca.explained_variance_ratio_)
# explained_variance_ratio_ is [ 0.92461621  0.05301557]: component 1 explains far more of the variance than component 2
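
The explained-variance ratios above also suggest a way to choose the number of components instead of fixing it at 2: keep the smallest number whose cumulative ratio reaches a target. The sketch below assumes a target of 0.95, which is an illustrative choice and not a value from the original article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data

# Fit with all components to inspect how the explained variance accumulates
full_pca = PCA().fit(X_iris)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
print(cumulative)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.95  # assumed target, chosen for illustration
k = int(np.searchsorted(cumulative, target) + 1)
print(k)

# scikit-learn makes the same choice itself when n_components is a float in (0, 1)
auto_pca = PCA(n_components=target, svd_solver='full').fit(X_iris)
print(auto_pca.n_components_)
```

On iris the first component alone stays below 95%, so both approaches settle on two components.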

Singular value decomposition (SVD)

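import numpy as np
from scipy.linalg import svd
U,S,Vh = svd(X,full_matrices=False)  # factor X into three matrices; the third one returned is V transposed
S = np.diag(S) # svd returns the singular values as a 1-D array; rebuild the diagonal matrix
print(U.dot(S).dot(Vh)) # the matrix product of the three factors reproduces X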
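
The full decomposition above reproduces X exactly; the dimensionality reduction comes from discarding the smallest singular values. As a minimal sketch, the code below builds a rank-k approximation with numpy.linalg.svd (which follows the same U, s, Vh return convention as scipy.linalg.svd) on a random matrix that is assumed purely for illustration, and checks the size of the approximation error.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))  # arbitrary demo matrix

U, s, Vh = np.linalg.svd(A, full_matrices=False)

k = 2
# Rank-k approximation: keep only the k largest singular values and their vectors
A_k = (U[:, :k] * s[:k]) @ Vh[:k, :]

# The Frobenius-norm error equals the square root of the sum of the
# discarded squared singular values
error = np.linalg.norm(A - A_k)
expected = np.sqrt(np.sum(s[k:] ** 2))
print(error, expected)
```

The faster the singular values decay, the less is lost by keeping only the first few.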

from sklearn.decomposition import TruncatedSVD
tsvd = TruncatedSVD(n_components=2) # use sklearn's TruncatedSVD to reduce the data to 2 dimensions; named tsvd so it does not shadow scipy's svd
X_new = tsvd.fit_transform(X)
plt.scatter(X_new[:,0],X_new[:,1],c=y)
plt.show()
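
TruncatedSVD and PCA are closely related: PCA centers the data before decomposing it, while TruncatedSVD works on the matrix as given. The sketch below is a hedged check of that relationship on iris; the 'arpack' solver and the tolerance are choices made for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD

X_iris = load_iris().data
X_centered = X_iris - X_iris.mean(axis=0)  # subtract each column's mean

X_pca = PCA(n_components=2).fit_transform(X_iris)
X_tsvd_centered = TruncatedSVD(n_components=2, algorithm='arpack',
                               random_state=0).fit_transform(X_centered)
X_tsvd_raw = TruncatedSVD(n_components=2, algorithm='arpack',
                          random_state=0).fit_transform(X_iris)

# The sign of each component is arbitrary, so compare absolute values
same_after_centering = np.allclose(np.abs(X_pca), np.abs(X_tsvd_centered), atol=1e-6)
same_without_centering = np.allclose(np.abs(X_pca), np.abs(X_tsvd_raw), atol=1e-6)
print(same_after_centering, same_without_centering)
```

On uncentered data the first TruncatedSVD component mostly tracks the mean of the data, which is why the two scatter plots in this article look different.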
