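How the stacking algorithm works
1: For Model1, split the training set D into k folds. For each fold, train the model on the remaining k-1 folds, then predict the held-out fold.
2: Repeat until every fold has been predicted. Together, these out-of-fold predictions form one feature column of the second-level (meta) model's training set.
3: Each of the k fitted models also predicts the test set, giving k sets of test predictions. Averaging them gives the matching feature column of the meta-model's test set.
4: Repeat the above for Model2, Model3, ... With M base models this yields M-dimensional data (one column per model).
5: Choose a meta-model, train it on this data, and predict. Logistic regression (LR) is the usual choice for this final layer.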
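For comparison, scikit-learn ships a built-in `StackingClassifier` that follows the same outline. The sketch below is only an illustration under assumptions not in the original: two base models, 5 folds, fixed `random_state` values, and LR as the final layer. One difference to keep in mind is that `StackingClassifier` refits each base model on the full training set for test-time predictions instead of averaging the k fold models as in step 3.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Level-1 (base) models; the names "rf" and "svc" are arbitrary labels.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("svc", SVC(random_state=0)),
]

# Level-2 (meta) model: LR trained on 5-fold out-of-fold outputs of the base models.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)

stacked_pred = stack.predict(X_test)
print(stacked_pred.shape)
```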
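Pros and cons:
Pros:
1. It is built with cross-validation, so it is robust;
2. It combines the predictions of several models and trains a second-level model on them, which tends to work well;
Cons:
1. It is complex to build and hard to reduce to explicit rules, so it is difficult to explain in commercial use.
Code: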
import numpy as np
from sklearn.model_selection import KFold

def get_stacking(clf, x_train, y_train, x_test, n_folds=10):
    """
    Core of stacking: use cross-validation to build the second-level training set.
    x_train, y_train and x_test should be NumPy arrays (numpy.ndarray).
    Passing a pandas DataFrame will raise an error.
    """
    train_num, test_num = x_train.shape[0], x_test.shape[0]
    second_level_train_set = np.zeros((train_num,))
    second_level_test_set = np.zeros((test_num,))
    test_nfolds_sets = np.zeros((test_num, n_folds))
    kf = KFold(n_splits=n_folds)

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tra, y_tra = x_train[train_index], y_train[train_index]
        x_tst, y_tst = x_train[test_index], y_train[test_index]

        clf.fit(x_tra, y_tra)

        # Out-of-fold predictions fill this fold's slice of the training column.
        second_level_train_set[test_index] = clf.predict(x_tst)
        # The model fitted on this fold also predicts the whole test set.
        test_nfolds_sets[:, i] = clf.predict(x_test)

    # Average the k test predictions (for class labels the mean may be non-integer).
    second_level_test_set[:] = test_nfolds_sets.mean(axis=1)
    return second_level_train_set, second_level_test_set
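The training-set half of get_stacking does the same job as scikit-learn's `cross_val_predict`. The sketch below is a small check of that correspondence, not part of the original script; the decision tree, the 5 shuffled folds, and the fixed `random_state` values are assumptions chosen so that both paths are deterministic and comparable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

# Manual out-of-fold loop, mirroring the training-set half of get_stacking.
manual_oof = np.zeros(len(y))
for train_index, test_index in kf.split(X):
    clf.fit(X[train_index], y[train_index])
    manual_oof[test_index] = clf.predict(X[test_index])

# Library helper: fits a clone of clf per fold and returns predictions in row order.
auto_oof = cross_val_predict(clf, X, y, cv=kf)

# Both arrays hold one out-of-fold prediction per training row.
print(np.array_equal(manual_oof, auto_oof))
```

`cross_val_predict` covers only the training column; the averaged test-set column still needs the explicit loop shown in get_stacking.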
# We use 5 classifiers here; to keep the focus on the stacking idea, all are left at their default parameters
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
rf_model = RandomForestClassifier()
adb_model = AdaBoostClassifier()
gdbc_model = GradientBoostingClassifier()
et_model = ExtraTreesClassifier()
svc_model = SVC()
# Use train_test_split to carve the iris data into a training set and a test set
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
train_x, test_x, train_y, test_y = train_test_split(iris.data, iris.target, test_size=0.2)
train_sets = []
test_sets = []
for clf in [rf_model, adb_model, gdbc_model, et_model, svc_model]:
    train_set, test_set = get_stacking(clf, train_x, train_y, test_x)
    train_sets.append(train_set)
    test_sets.append(test_set)
meta_train = np.concatenate([result_set.reshape(-1,1) for result_set in train_sets], axis=1)
meta_test = np.concatenate([y_test_set.reshape(-1,1) for y_test_set in test_sets], axis=1)
# Use a decision tree as our second-level classifier
from sklearn.tree import DecisionTreeClassifier
dt_model = DecisionTreeClassifier()
dt_model.fit(meta_train, train_y)
df_predict = dt_model.predict(meta_test)
print(df_predict)
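The script prints the stacked predictions but does not score them. The following is a minimal sketch of one way to compare the stacked model with its base models using `accuracy_score`. It makes several assumptions that are not in the original: only two base models, 5 folds via `cross_val_predict`, fixed `random_state` values, and base models refit on the full training set for the test columns (a simplification of the fold averaging used above).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Hypothetical, reduced set of base models for illustration.
base_models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extra_trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

meta_train_cols, meta_test_cols, scores = [], [], {}
for name, model in base_models.items():
    # Out-of-fold predictions become one column of the meta training set.
    meta_train_cols.append(cross_val_predict(model, train_x, train_y, cv=5))
    # Simplification: refit on all training rows to predict the test set.
    model.fit(train_x, train_y)
    test_pred = model.predict(test_x)
    meta_test_cols.append(test_pred)
    scores[name] = accuracy_score(test_y, test_pred)

meta_train = np.column_stack(meta_train_cols)
meta_test = np.column_stack(meta_test_cols)

# Second-level classifier, a decision tree as in the script above.
meta_model = DecisionTreeClassifier(random_state=0)
meta_model.fit(meta_train, train_y)
scores["stacked"] = accuracy_score(test_y, meta_model.predict(meta_test))

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

On a dataset as small as iris the differences between models are likely to be within noise, so the numbers are best read as a sanity check rather than evidence that stacking helps.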