3(3). Feature Selection: Embedded Methods (Feature Importance Evaluation)


I. Regularization

1. L1/Lasso

  L1 regularization yields sparse solutions, so it naturally performs feature selection. Note, however, that a feature L1 does not select is not necessarily unimportant: of two highly correlated features, only one may be kept. To determine which features really matter, follow up with cross-validation under L2 regularization.

Example: the following runs Lasso on the Boston housing data (a small pretty_print_linear helper, defined inline, formats the coefficients). In practice alpha would be tuned via grid search; here it is fixed at 0.3 for illustration.

from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_boston  # note: removed in scikit-learn 1.2

# Helper: format coefficients as a readable linear expression,
# optionally sorted by absolute magnitude
def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name) for coef, name in lst)

boston = load_boston()
scaler = StandardScaler()
X = scaler.fit_transform(boston["data"])
Y = boston["target"]
names = boston["feature_names"]

lasso = Lasso(alpha=.3)
lasso.fit(X, Y)

print("Lasso model:", pretty_print_linear(lasso.coef_, names, sort=True))

  As you can see, many feature coefficients are 0. Increasing alpha makes the model ever sparser: more and more coefficients are driven to exactly 0. However, an L1-regularized model is unstable in the same way an unregularized linear model is: when the feature set contains correlated features, small changes in the data can produce very different models.
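To make the sparsity claim concrete, a quick sketch (reusing X and Y from the Lasso example above; the alpha values are illustrative) counts nonzero coefficients as alpha grows:

import numpy as np

for alpha in (0.3, 1, 3, 10):
    model = Lasso(alpha=alpha).fit(X, Y)
    # larger alpha -> stronger L1 penalty -> fewer surviving features
    print("alpha=%s: %d nonzero coefficients" % (alpha, np.sum(model.coef_ != 0)))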

  

2. L2/Ridge

L2 regularization (Ridge) penalizes the squared magnitude of the coefficients. It does not produce sparse solutions, but it spreads weight across correlated features, which makes the coefficient estimates far more stable. Example: the code below builds three highly correlated features from one shared signal and compares plain linear regression with Ridge over ten random seeds; the linear model's coefficients swing wildly from seed to seed, while Ridge assigns each feature a similar, stable coefficient.

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

size = 100

# Run the comparison 10 times with different random seeds
for i in range(10):
    print("Random seed %s" % i)
    np.random.seed(seed=i)
    X_seed = np.random.normal(0, 1, size)
    # Three highly correlated features: one shared signal plus small noise
    X1 = X_seed + np.random.normal(0, .1, size)
    X2 = X_seed + np.random.normal(0, .1, size)
    X3 = X_seed + np.random.normal(0, .1, size)
    Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
    X = np.array([X1, X2, X3]).T

    lr = LinearRegression()
    lr.fit(X, Y)
    # pretty_print_linear is defined in the Lasso example above
    print("Linear model:", pretty_print_linear(lr.coef_))

    ridge = Ridge(alpha=10)
    ridge.fit(X, Y)
    print("Ridge model:", pretty_print_linear(ridge.coef_))

  

II. Feature Importance from Tree Models

1. RF (Random Forest)
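In scikit-learn, a fitted random forest exposes impurity-based importances (mean decrease in impurity, normalized to sum to 1) through its feature_importances_ attribute; the ExtraTree, AdaBoost, and GBDT models of subsections 2-4 below share the same attribute. A minimal sketch on toy data (the data and hyperparameters here are illustrative assumptions, not from the original):

import numpy as np
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)

rng = np.random.RandomState(0)
X = rng.rand(200, 5)
# Only features 0 and 1 carry signal, so they should dominate the importances
y = X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.1, 200)

for model in (RandomForestRegressor(n_estimators=100, random_state=0),
              ExtraTreesRegressor(n_estimators=100, random_state=0),
              AdaBoostRegressor(n_estimators=100, random_state=0),
              GradientBoostingRegressor(n_estimators=100, random_state=0)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.feature_importances_, 3))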

 

 

2. ExtraTree
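ExtraTrees (extremely randomized trees) differs from RF in that split thresholds are drawn at random rather than optimized; its feature_importances_ attribute works exactly as in the RF sketch above (see ExtraTreesRegressor in that loop).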

 

 

3. AdaBoost
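In scikit-learn, AdaBoost's feature_importances_ is the mean of the base estimators' impurity-based importances, weighted by each boosting round's estimator weight (see AdaBoostRegressor in the RF sketch above).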

 

 

4. GBDT
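scikit-learn's GradientBoostingRegressor/Classifier likewise expose feature_importances_, averaging the impurity-based importances over the trees of every boosting stage (see GradientBoostingRegressor in the RF sketch above).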

 

 

5. XGBoost

XGBoost's base learner can be gbtree or gblinear; feature importance can be computed only when the base learner is gbtree. In the core xgboost module, feature importance is obtained by calling get_score(); in xgboost's sklearn API it is exposed as feature_importances_, which is itself derived from get_score(). In the xgboost implementation, the Booster class's get_score() method outputs the feature importances, and its importance_type parameter supports 5 ways of computing them:
get_score(fmap='', importance_type='weight')

fmap is a txt file containing the feature-name mapping; importance_type selects how importance is computed and takes one of 5 values:

‘weight’: the number of times a feature is used to split the data across all trees.
‘gain’: the average gain across all splits the feature is used in.
‘cover’: the average coverage across all splits the feature is used in.
‘total_gain’: the total gain across all splits the feature is used in.
‘total_cover’: the total coverage across all splits the feature is used in.

[1] importance_type=weight (the default): the number of times a feature is used as a split attribute across all trees (the more often it appears at split nodes, the more valuable it is).

[2] importance_type=gain: the average loss reduction when the feature is used for splitting (the sum of the feature's split gains over all trees, divided by the number of times the feature appears).

[3] importance_type=cover: the average sample coverage when the feature is used for splitting (the sum of the second-order derivatives, i.e. Hessians, over the samples at the feature's split nodes, divided by the total number of times the feature appears).

[4] importance_type=total_gain: same as gain but with average_over_splits=False in the source, i.e. the total gain is not divided by the occurrence count (gain = total_gain / number of occurrences).

[5] importance_type=total_cover: same as cover but with average_over_splits=False, i.e. the total coverage is not divided by the occurrence count.

Looking at the constructor, xgboost's sklearn API defaults to importance_type="gain" when computing feature importances, while the raw get_score() method defaults to importance_type="weight":

def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100,
                 verbosity=1, silent=None, objective="reg:linear", booster='gbtree',
                 n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
                 max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
                 colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
                 base_score=0.5, random_state=0, seed=None, missing=None,
                 # the default importance_type is declared here
                 importance_type="gain", **kwargs):
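To illustrate both interfaces, a minimal sketch on random placeholder data (the data, feature names, and hyperparameters are illustrative assumptions):

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 5)  # toy data, illustrative only
y = np.random.rand(100)

# Core API: Booster.get_score() with each importance_type
dtrain = xgb.DMatrix(X, label=y, feature_names=["f%d" % i for i in range(5)])
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=20)
for itype in ("weight", "gain", "cover", "total_gain", "total_cover"):
    print(itype, booster.get_score(importance_type=itype))

# sklearn API: feature_importances_, derived from get_score()
model = xgb.XGBRegressor(n_estimators=20, importance_type="gain")
model.fit(X, y)
print(model.feature_importances_)  # normalized to sum to 1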

  

 

6. LightGBM
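LightGBM mirrors XGBoost's weight/gain distinction: the core Booster.feature_importance() method accepts importance_type='split' (how many times the feature is used, the default) or 'gain' (the total gain of the feature's splits), and the sklearn wrapper exposes the same choice through its importance_type constructor argument and feature_importances_ attribute. A minimal sketch on placeholder data (values are illustrative):

import numpy as np
import lightgbm as lgb

X = np.random.rand(200, 5)  # toy data, illustrative only
y = np.random.rand(200)

model = lgb.LGBMRegressor(n_estimators=50, importance_type='gain')
model.fit(X, y)
print(model.feature_importances_)  # gain-based, per the constructor argument

booster = model.booster_
print(booster.feature_importance(importance_type='split'))  # raw split counts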

 

 

7. Select the top-k features with RF, AdaBoost, and ExtraTree, then merge the results

 

from sklearn import ensemble
from sklearn.model_selection import GridSearchCV
import pandas as pd

def get_top_k_feature(features, model, top_n_features):
    # Rank features by the best estimator's importances, descending
    feature_imp_sorted = pd.DataFrame({
        'feature': features,
        'importance': model.best_estimator_.feature_importances_
    }).sort_values('importance', ascending=False)
    features_top_n = feature_imp_sorted.head(top_n_features)['feature']
    return features_top_n

def ensemble_model_feature(X, Y, top_n_features):
    features = list(X)  # X is assumed to be a pandas DataFrame; this yields its column names

    # Random Forest
    rf = ensemble.RandomForestRegressor()
    rf_param_grid = {'n_estimators': [900], 'random_state': [2, 4, 6, 8]}
    rf_grid = GridSearchCV(rf, rf_param_grid, cv=10, verbose=1, n_jobs=25)
    rf_grid.fit(X, Y)
    top_n_features_rf = get_top_k_feature(features=features, model=rf_grid, top_n_features=top_n_features)
    print('RF selection done')

    # AdaBoost
    abr = ensemble.AdaBoostRegressor()
    abr_grid = GridSearchCV(abr, rf_param_grid, cv=10, n_jobs=25)
    abr_grid.fit(X, Y)
    top_n_features_abr = get_top_k_feature(features=features, model=abr_grid, top_n_features=top_n_features)
    print('AdaBoost selection done')

    # ExtraTree
    etr = ensemble.ExtraTreesRegressor()
    etr_grid = GridSearchCV(etr, rf_param_grid, cv=10, n_jobs=25)
    etr_grid.fit(X, Y)
    top_n_features_etr = get_top_k_feature(features=features, model=etr_grid, top_n_features=top_n_features)
    print('ExtraTree selection done')

    # Merge the three top-k lists and drop duplicates
    features_top_n = pd.concat([top_n_features_rf, top_n_features_abr, top_n_features_etr],
                               ignore_index=True).drop_duplicates()
    print(features_top_n)
    print(len(features_top_n))
    return features_top_n
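Assuming X is a pandas DataFrame of candidate features and Y the target, usage is a single call, e.g. features_top_n = ensemble_model_feature(X, Y, top_n_features=10). Note that the grid over random_state effectively just tries several seeds and keeps the one with the best CV score, rather than tuning a real hyperparameter.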

  

