I. Regularization
1. L1/Lasso
L1 regularization yields sparse solutions, so it naturally performs feature selection. Note, however, that a feature L1 drops is not necessarily unimportant: when two features are highly correlated, only one of them may be kept. To determine which features really matter, cross-check with L2 regularization as well.
Example: the code below runs Lasso on the Boston housing data, where the parameter alpha was chosen via grid search.
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# Minimal helper (not defined in the original snippet): format a linear model
# as "coef * name" terms, optionally sorted by coefficient magnitude.
def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name) for coef, name in lst)

boston = load_boston()
scaler = StandardScaler()
X = scaler.fit_transform(boston["data"])
Y = boston["target"]
names = boston["feature_names"]

lasso = Lasso(alpha=.3)
lasso.fit(X, Y)
print("Lasso model:", pretty_print_linear(lasso.coef_, names, sort=True))
As you can see, many of the feature coefficients are exactly 0. Increasing alpha makes the model ever sparser, i.e. more and more coefficients are driven to 0. However, like an unregularized linear model, L1 regularization is unstable: when the feature set contains correlated features, small changes in the data can produce very different models.
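This instability is easy to reproduce: fit Lasso under a few different random seeds on nearly identical features and watch which ones keep nonzero coefficients. A minimal sketch (the data-generation scheme anticipates the Ridge example below; which coefficients survive can differ from seed to seed):

import numpy as np
from sklearn.linear_model import Lasso

size = 100
for i in range(3):
    np.random.seed(seed=i)
    X_seed = np.random.normal(0, 1, size)
    # three nearly identical features derived from the same latent variable
    X = np.array([X_seed + np.random.normal(0, .1, size) for _ in range(3)]).T
    Y = X[:, 0] + X[:, 1] + X[:, 2] + np.random.normal(0, 1, size)
    lasso = Lasso(alpha=.3).fit(X, Y)
    print("seed %d, Lasso coefficients: %s" % (i, np.round(lasso.coef_, 3)))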
2. L2/Ridge
Unlike L1, L2 regularization spreads coefficient weight across correlated features instead of zeroing some of them out, which makes the estimates far more stable. Example (reusing the pretty_print_linear helper defined above):
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

size = 100
# We run the method 10 times with different random seeds
for i in range(10):
    print("Random seed %s" % i)
    np.random.seed(seed=i)
    X_seed = np.random.normal(0, 1, size)
    # three highly correlated features built from the same seed variable
    X1 = X_seed + np.random.normal(0, .1, size)
    X2 = X_seed + np.random.normal(0, .1, size)
    X3 = X_seed + np.random.normal(0, .1, size)
    Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
    X = np.array([X1, X2, X3]).T

    lr = LinearRegression()
    lr.fit(X, Y)
    print("Linear model:", pretty_print_linear(lr.coef_))

    ridge = Ridge(alpha=10)
    ridge.fit(X, Y)
    print("Ridge model:", pretty_print_linear(ridge.coef_))
II. Tree-based feature importance
1. RF
2. ExtraTree
3. AdaBoost
4. GBDT
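Models 1-4 all go through the same scikit-learn interface: once fitted, each ensemble exposes its impurity-based importances through the feature_importances_ attribute. A minimal sketch on synthetic data; the dataset and hyperparameters here are placeholders:

from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor, ExtraTreesRegressor,
                              AdaBoostRegressor, GradientBoostingRegressor)

X, y = make_regression(n_samples=500, n_features=8, n_informative=3, random_state=0)
models = {
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "ExtraTree": ExtraTreesRegressor(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostRegressor(n_estimators=100, random_state=0),
    "GBDT": GradientBoostingRegressor(n_estimators=100, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    # impurity-based importances, normalized to sum to 1
    print(name, model.feature_importances_.round(3))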
5. XGBoost
XGBoost's native Booster exposes feature importance via get_score:
get_score(fmap='', importance_type='weight')
fmap is a txt file containing the feature-name mapping; importance_type specifies how importance is computed and takes one of five values:
'weight': the number of times a feature is used to split the data across all trees.
'gain': the average gain across all splits the feature is used in.
'cover': the average coverage across all splits the feature is used in.
'total_gain': the total gain across all splits the feature is used in.
'total_cover': the total coverage across all splits the feature is used in.
[1] importance_type=weight (the default): the number of times the feature is used as a split attribute across all trees (the more often a feature appears at tree nodes, the higher its value).
[2] importance_type=gain: the average loss reduction when the feature is used for splitting (the feature's total split gain across all trees divided by the number of times it is used).
[3] importance_type=cover: the average sample coverage when the feature is used for splitting (the sum of second-order gradients at the feature's split nodes divided by the number of times it is used).
[4] importance_type=total_gain: same as gain but with average_over_splits=False, i.e. gain is total_gain divided by the feature's split count.
[5] importance_type=total_cover: same as cover but with average_over_splits=False, i.e. cover is total_cover divided by the feature's split count.
Looking at the constructor, we can see that the xgboost scikit-learn API defaults to importance_type="gain" when computing feature importance, whereas the native get_score method defaults to importance_type="weight":
def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100,
verbosity=1, silent=None, objective="reg:linear", booster='gbtree',
n_jobs=1, nthread=None, gamma=0, min_child_weight=1,
max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1,
colsample_bynode=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
base_score=0.5, random_state=0, seed=None, missing=None,
# importance_type is declared here, defaulting to "gain"
importance_type="gain", **kwargs):
6. LightGBM
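LightGBM's counterpart takes only two values: importance_type='split' (how many times the feature is used, the default, analogous to xgboost's 'weight') and 'gain' (the total gain of the splits that use the feature, i.e. a total rather than an average). The scikit-learn wrapper LGBMRegressor likewise defaults to 'split'. A minimal sketch with the native API; data and parameters are placeholders:

import numpy as np
import lightgbm as lgb

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = 2 * X[:, 0] + X[:, 1] + rng.rand(200) * 0.1
train_set = lgb.Dataset(X, label=y)
bst = lgb.train({"objective": "regression", "verbosity": -1}, train_set, num_boost_round=20)

print("split:", bst.feature_importance(importance_type="split"))
print("gain: ", bst.feature_importance(importance_type="gain").round(1))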
7. Select the top-k features with RF, AdaBoost, and ExtraTree separately, then merge the results
import pandas as pd
from sklearn import ensemble
from sklearn.model_selection import GridSearchCV

def get_top_k_feature(features, model, top_n_features):
    # rank features by the best estimator's impurity-based importances
    # (note: ascending must be the boolean False, not the string 'False')
    feature_imp_sorted = pd.DataFrame({
        'feature': features,
        'importance': model.best_estimator_.feature_importances_
    }).sort_values('importance', ascending=False)
    features_top_n = feature_imp_sorted.head(top_n_features)['feature']
    return features_top_n

def ensemble_model_feature(X, Y, top_n_features):
    features = list(X)
    # Random Forest
    rf = ensemble.RandomForestRegressor()
    rf_param_grid = {'n_estimators': [900], 'random_state': [2, 4, 6, 8]}
    rf_grid = GridSearchCV(rf, rf_param_grid, cv=10, verbose=1, n_jobs=25)
    rf_grid.fit(X, Y)
    top_n_features_rf = get_top_k_feature(features=features, model=rf_grid, top_n_features=top_n_features)
    print('RF selection done')
    # AdaBoost
    abr = ensemble.AdaBoostRegressor()
    abr_grid = GridSearchCV(abr, rf_param_grid, cv=10, n_jobs=25)
    abr_grid.fit(X, Y)
    top_n_features_abr = get_top_k_feature(features=features, model=abr_grid, top_n_features=top_n_features)
    print('AdaBoost selection done')
    # ExtraTree
    etr = ensemble.ExtraTreesRegressor()
    etr_grid = GridSearchCV(etr, rf_param_grid, cv=10, n_jobs=25)
    etr_grid.fit(X, Y)
    top_n_features_etr = get_top_k_feature(features=features, model=etr_grid, top_n_features=top_n_features)
    print('ExtraTree selection done')
    # merge the three models' selections, dropping duplicates
    features_top_n = pd.concat([top_n_features_rf, top_n_features_abr, top_n_features_etr],
                               ignore_index=True).drop_duplicates()
    print(features_top_n)
    print(len(features_top_n))
    return features_top_n
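Usage is a single call; for example, with a feature DataFrame X and target Y (the value 20 for top_n_features is just an illustration):

features_top_n = ensemble_model_feature(X, Y, top_n_features=20)
X_selected = X[features_top_n]  # keep only the merged top-k features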
