1. EDA on individual features
- For binary and categorical features: train['feature_name'].value_counts().sort_index().plot(kind='bar')
- For continuous numerical features, plot the empirical CDF:
import numpy as np
import matplotlib.pyplot as plt

def cdf_plot(data_series):
    # empirical CDF of a numeric series
    data_size = len(data_series)
    data_set = sorted(set(data_series))
    bins = np.append(data_set, data_set[-1] + 1)
    counts, bin_edges = np.histogram(data_series, bins=bins, density=False)
    counts = counts.astype(float) / data_size
    cdf = np.cumsum(counts)
    plt.plot(bin_edges[0:-1], cdf, linestyle='--', marker="o", color='b')
    plt.ylim((0, 1))
    plt.ylabel("CDF")
    plt.grid(True)
    plt.show()
2. Handling categorical features
https://github.com/scikit-learn-contrib/categorical-encoding
There are three main approaches (a combined toy sketch follows this list):
- Ordinal mapping
If the values of the categorical feature have an ordinal relationship (ordinality), i.e. they can be meaningfully ordered, the feature can be mapped directly to a numerical feature.
- One-hot encoding
The most common treatment.
When one-hot encoding, it is often enough to encode only the values that occur frequently: set a threshold, and either drop values whose count falls below it or map all of them to a single special "rare" value.
(The hashing trick can be used to reduce memory usage.)
- Statistical encoding
The most common statistical encoding is the count feature: count how many times each category appears in the training set (optionally plus the test set).
Encodings based on label statistics, such as Target Encoding or Leave-One-Out Encoding, can cause information leakage if done carelessly.
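A minimal toy sketch of the three encodings above; the column name 'city', the ordinal order, and the threshold are made-up placeholders:
import pandas as pd

df = pd.DataFrame({'city': ['a', 'b', 'a', 'c', 'a', 'b', 'd']})

# 1) ordinal mapping, assuming the categories have a natural order
order = {'a': 0, 'b': 1, 'c': 2, 'd': 3}
df['city_ordinal'] = df['city'].map(order)

# 2) one-hot encoding, keeping only values that appear at least `threshold` times
threshold = 2
counts = df['city'].value_counts()
frequent = counts[counts >= threshold].index
onehot = pd.get_dummies(df['city'].where(df['city'].isin(frequent), 'rare'),
                        prefix='city')
df = df.join(onehot)

# 3) count (frequency) encoding
df['city_count'] = df['city'].map(counts)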
One-hot encoding may produce high-dimensional sparse features:
- LR and linear SVMs learn how much each single feature contributes to the target, i.e. a linear relationship with the prediction target.
- FM and FFM learn the effect of second-order feature interactions.
- Tree models such as GBDT can learn even higher-order representations of the features.
- DeepFM, or GBDT leaf indices fed into LR (or FFM), combine low-order and high-order feature effects (a sketch of the GBDT-leaves + LR variant follows).
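A sketch of the GBDT-leaves-as-features idea with scikit-learn (not the DeepFM setup itself): train a GBDT on one half of the data, one-hot encode the leaf index of every tree, and fit an LR on those sparse features with the other half. The synthetic data and model sizes are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(X, y, test_size=0.5, random_state=0)

# 1) train the GBDT on one half of the data
gbdt = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X_gbdt, y_gbdt)

# 2) map each sample to the leaf index of every tree, then one-hot those indices
enc = OneHotEncoder(handle_unknown='ignore')
leaves_lr = enc.fit_transform(gbdt.apply(X_lr)[:, :, 0])

# 3) fit the LR on the sparse leaf features with the other half
lr = LogisticRegression(max_iter=1000).fit(leaves_lr, y_lr)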
Target Encoding:
import numpy as np
import pandas as pd

def add_noise(series, noise_level):
    return series * (1 + noise_level * np.random.randn(len(series)))

def target_encode(trn_series=None,  # Revised to encode validation series
                  val_series=None,
                  tst_series=None,
                  target=None,
                  min_samples_leaf=1,
                  smoothing=1,
                  noise_level=0):
    """
    Smoothing is computed like in the following paper by Daniele Micci-Barreca
    https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf
    trn_series : training categorical feature as a pd.Series
    tst_series : test categorical feature as a pd.Series
    target : target data as a pd.Series
    min_samples_leaf (int) : minimum samples to take category average into account
    smoothing (int) : smoothing effect to balance categorical average vs prior
    """
    assert len(trn_series) == len(target)
    assert trn_series.name == tst_series.name
    temp = pd.concat([trn_series, target], axis=1)
    # Compute target mean
    averages = temp.groupby(by=trn_series.name)[target.name].agg(["mean", "count"])
    # Compute smoothing
    smoothing = 1 / (1 + np.exp(-(averages["count"] - min_samples_leaf) / smoothing))
    # Apply average function to all target data
    prior = target.mean()
    # The bigger the count the less full_avg is taken into account
    averages[target.name] = prior * (1 - smoothing) + averages["mean"] * smoothing
    averages.drop(["mean", "count"], axis=1, inplace=True)
    # Apply averages to trn and tst series
    ft_trn_series = pd.merge(
        trn_series.to_frame(trn_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=trn_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_trn_series.index = trn_series.index
    ft_val_series = pd.merge(
        val_series.to_frame(val_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=val_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_val_series.index = val_series.index
    ft_tst_series = pd.merge(
        tst_series.to_frame(tst_series.name),
        averages.reset_index().rename(columns={'index': target.name, target.name: 'average'}),
        on=tst_series.name,
        how='left')['average'].rename(trn_series.name + '_mean').fillna(prior)
    # pd.merge does not keep the index so restore it
    ft_tst_series.index = tst_series.index
    return add_noise(ft_trn_series, noise_level), add_noise(ft_val_series, noise_level), add_noise(ft_tst_series, noise_level)
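A minimal usage sketch of target_encode on made-up data, reusing the imports above; the column names 'cat' and 'y' and the hyperparameter values are placeholders:
rng = np.random.RandomState(0)
train = pd.DataFrame({'cat': rng.choice(list('abc'), 100), 'y': rng.randint(0, 2, 100)})
valid = pd.DataFrame({'cat': rng.choice(list('abc'), 30)})
test = pd.DataFrame({'cat': rng.choice(list('abcd'), 30)})  # 'd' is unseen -> falls back to the prior

trn_enc, val_enc, tst_enc = target_encode(
    train['cat'], valid['cat'], test['cat'], target=train['y'],
    min_samples_leaf=10, smoothing=5, noise_level=0.01)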
Bayesian smoothing:
import random

import numpy as np
import pandas as pd
import scipy.special as special

class HyperParam(object):
    def __init__(self, alpha, beta):
        self.alpha = alpha
        self.beta = beta

    def sample_from_beta(self, alpha, beta, num, imp_upperbound):
        # generate sample data
        sample = np.random.beta(alpha, beta, num)
        I = []
        C = []
        for click_ratio in sample:
            imp = random.random() * imp_upperbound
            # imp = imp_upperbound
            click = imp * click_ratio
            I.append(imp)
            C.append(click)
        return pd.Series(I), pd.Series(C)

    def update_from_data_by_FPI(self, tries, success, iter_num, epsilon):
        # update strategy: iterate until alpha and beta converge
        for i in range(iter_num):
            new_alpha, new_beta = self.__fixed_point_iteration(tries, success, self.alpha, self.beta)
            if abs(new_alpha - self.alpha) < epsilon and abs(new_beta - self.beta) < epsilon:
                break
            self.alpha = new_alpha
            self.beta = new_beta

    def __fixed_point_iteration(self, tries, success, alpha, beta):
        # one fixed-point iteration step
        sumfenzialpha = (special.digamma(success + alpha) - special.digamma(alpha)).sum()
        sumfenzibeta = (special.digamma(tries - success + beta) - special.digamma(beta)).sum()
        sumfenmu = (special.digamma(tries + alpha + beta) - special.digamma(alpha + beta)).sum()
        return alpha * (sumfenzialpha / sumfenmu), beta * (sumfenzibeta / sumfenmu)
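A minimal usage sketch of HyperParam above (reusing the class and its imports): fit alpha and beta on simulated impression/click counts, then shrink each raw click-through rate toward the prior with the Beta posterior mean. The sample sizes and initial values are placeholders.
hp = HyperParam(1, 1)
I, C = hp.sample_from_beta(10, 1000, 10000, 1000)         # simulated impressions and clicks
hp.update_from_data_by_FPI(I, C, iter_num=1000, epsilon=1e-10)
smoothed_ctr = (C + hp.alpha) / (I + hp.alpha + hp.beta)  # Beta posterior mean per item
print(hp.alpha, hp.beta)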
3. Feature engineering and feature selection
Train a GBDT or RF and sort the features of the training set by importance, from high to low.
- Direct feature crosses
Multiply or divide pairs of high-importance features.
- Predict features from features
Split the features into two groups; using one group as inputs, predict the value of each feature in the other group in turn, and add the predictions as new features.
- Feature aggregation
Cross-statistics over high-importance features (see the sketch below).
Concretely, pick two high-importance features at a time, use one as the grouping variable, and compute the min/max, mean, median, variance, etc. of the other, e.g. new_feature = features.groupby('feature1')['feature2'].mean()
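A minimal sketch of the importance ranking plus a group-by aggregation feature, on synthetic data; the featureN column names are placeholders:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
features = pd.DataFrame(X, columns=[f'feature{i}' for i in range(10)])

# rank features by RF importance
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(features, y)
importance = pd.Series(rf.feature_importances_, index=features.columns)
print(importance.sort_values(ascending=False).head())

# aggregation feature: group by a (binned) feature, take the mean of another
group = pd.qcut(features['feature1'], 10)   # bin the continuous grouping variable
features['feature1_group_mean_feature2'] = (
    features.groupby(group)['feature2'].transform('mean'))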
Common feature selection methods
- Exhaustive search
Pro: guaranteed to find the best subset of the full feature set; con: O(2^n) time complexity.
- Random search
Heuristic: repeatedly train on a randomly selected subset of features. Much cheaper computationally.
- mRMR feature selection (minimum Redundancy, Maximum Relevance)
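A greedy mRMR sketch (difference form), assuming mutual information as the relevance and redundancy measure; the synthetic data and the choice of k are placeholders:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_select(X, y, k):
    """Greedily add the feature maximizing relevance(f, y) minus its
    mean redundancy with the features already selected."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_score, best_j = -np.inf, None
        for j in range(X.shape[1]):
            if j in selected:
                continue
            redundancy = np.mean([mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_score, best_j = score, j
        selected.append(best_j)
    return selected

X, y = make_classification(n_samples=500, n_features=15, n_informative=5, random_state=0)
print(mrmr_select(X, y, k=5))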
4. XGBoost parameter tuning
- Initialize the parameters:
eta = 0.1, max_depth = 10, subsample = 1.0, min_child_weight = 5,
colsample_bytree = 0.5 (the fraction of features sampled when constructing each tree; a sensible value depends on how many features the data has).
Apart from eta = 0.1, the initial values of the other parameters depend on the specific problem.
- Choose a suitable objective and eval_metric. In xgboost.train(), the obj and feval arguments are a custom objective (loss) function and a custom evaluation function respectively; the maximize argument indicates whether the evaluation metric should be maximized.
- Hold out 20% of the data as a validation set, set a large num_boost_round, and stop training when the validation error starts to rise (see the training sketch at the end of this section).
- Tune max_depth
- Tune subsample (the subsample ratio of the training instances)
- Tune min_child_weight
- Tune colsample_bytree (the subsample ratio of columns when constructing each tree)
- Finally, lower eta to 0.02 and find the best num_boost_round
- Take the parameters obtained from the steps above as a baseline and make small adjustments on top of it, so that the model gets as close as possible to a (local) optimum.
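A minimal training sketch with the initial parameters above and early stopping on a 20% validation split; the synthetic data and the exact objective/eval_metric are placeholders:
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
dtrain = xgb.DMatrix(X_tr, label=y_tr)
dvalid = xgb.DMatrix(X_val, label=y_val)

params = {
    'eta': 0.1,
    'max_depth': 10,
    'subsample': 1.0,
    'min_child_weight': 5,
    'colsample_bytree': 0.5,
    'objective': 'binary:logistic',   # placeholder objective
    'eval_metric': 'auc',             # placeholder metric
}
# large num_boost_round; training stops once the validation metric stops improving
bst = xgb.train(params, dtrain, num_boost_round=5000,
                evals=[(dtrain, 'train'), (dvalid, 'valid')],
                early_stopping_rounds=50, verbose_eval=100)
print(bst.best_iteration)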
5. Ensembling
https://mlwave.com/kaggle-ensembling-guide/
- Voting
Take the result that receives the most votes.
- Uniform voting
A lower correlation between ensemble model members seems to result in an increase in the error-correcting capability.
- Weighted voting (sketched below)
Give a better model more weight: the only way for the inferior models to overrule the best model (the expert) is for them to collectively (and confidently) agree on an alternative.
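A minimal sketch of (weighted) majority voting over hard class predictions; the example predictions and weights are made up:
import numpy as np

def weighted_vote(preds, weights=None):
    # preds: array of shape (n_models, n_samples); weights: one weight per model
    preds = np.asarray(preds)
    n_models, n_samples = preds.shape
    if weights is None:
        weights = np.ones(n_models)
    classes = np.unique(preds)
    # accumulate the total weight each class receives for every sample
    scores = np.zeros((len(classes), n_samples))
    for m in range(n_models):
        for c_idx, c in enumerate(classes):
            scores[c_idx] += weights[m] * (preds[m] == c)
    return classes[np.argmax(scores, axis=0)]

print(weighted_vote([[1, 0, 1], [1, 1, 0], [0, 1, 1]]))  # -> [1 1 1]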
- Averaging
Bagging-style averaging of predictions, which reduces overfitting.
- Plain averaging
Average the submissions from multiple models.
- Rank averaging (sketched below)
First turn the predictions into ranks, then average these ranks.
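A minimal rank-averaging sketch; the example prediction arrays are placeholders:
import numpy as np
from scipy.stats import rankdata

def rank_average(preds):
    # preds: (n_models, n_samples); rank within each model, average, rescale to [0, 1]
    preds = np.asarray(preds, dtype=float)
    ranks = np.vstack([rankdata(p) for p in preds])
    avg = ranks.mean(axis=0)
    return (avg - avg.min()) / (avg.max() - avg.min())

print(rank_average([[0.1, 0.9, 0.5], [0.2, 0.6, 0.3]]))  # -> [0.  1.  0.5]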
- Stacking and blending
- Stacked generalization
The basic idea behind stacked generalization is to use a pool of base classifiers and then another classifier to combine their predictions, with the aim of reducing the generalization error.
The ensembling methods above combine the predictions of different models with a fixed formula or rule; stacking instead learns the combination with another algorithm (a meta-classifier).
- Blending
No cross-validation is done; only a holdout set is used, and the stacker model is trained on the base models' predictions for that holdout data.
Pros and cons compared with stacking:
Pros: simple and fast, and there is no information leakage (e.g. with a 10% holdout, the first level trains on 90% of the data and the second level on the remaining 10%).
Cons: less training data is used, it is easy to overfit the holdout set, and the CV estimate is less reliable.
- Stacking with logistic regression (see the sketch at the end of this section)
- Stacking with non-linear algorithms
Popular non-linear algorithms for stacking are GBM, KNN, NN, RF and ET.
Non-linear stacking with the original features on multiclass problems can give surprising gains.
- Feature-weighted linear stacking
First have each model predict on the engineered features, then use a linear model to learn which base model works best for which kinds of samples, and form the final prediction as a weighted sum of the individual models' predictions.
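A minimal out-of-fold stacking sketch with logistic regression as the meta-model; the synthetic data and the two base models are placeholders:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
base_models = [RandomForestClassifier(n_estimators=100, random_state=0),
               GradientBoostingClassifier(random_state=0)]

# out-of-fold predictions of each base model become the meta-features
oof = np.zeros((len(X), len(base_models)))
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for trn_idx, val_idx in kf.split(X):
    for m, model in enumerate(base_models):
        model.fit(X[trn_idx], y[trn_idx])
        oof[val_idx, m] = model.predict_proba(X[val_idx])[:, 1]

stacker = LogisticRegression()
print(cross_val_score(stacker, oof, y, cv=5, scoring='roc_auc').mean())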