Preface
1. Background
Public bike sharing is low-carbon, environmentally friendly, and healthy, and it solves the "last mile" pain point of urban transportation, so it is becoming more and more popular in cities across the country. The data for this exercise come from several public bike docking stations along streets in two cities. Given the time, weather, and other information, we want to predict how many public bikes are borrowed in that street area within one hour.
2. Task Type
Regression
3. Data Files
train.csv: training set, 273KB
test.csv: prediction set, 179KB
sample_submit.csv: submission example, 97KB
4. Data Variables
The training set contains 10,000 samples; the prediction set contains 7,000 samples.

5. Evaluation Method
The evaluation metric is RMSE (Root Mean Squared Error).
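For reference, over $n$ samples with true values $y_i$ and predictions $\hat{y}_i$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$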

6. Complete Code
For the complete code, see my GitHub. Link: click here
Data Preprocessing
1. Check whether the data contains missing values
print(train.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 7 columns):
city          10000 non-null int64
hour          10000 non-null int64
is_workday    10000 non-null int64
weather       10000 non-null int64
temp_1        10000 non-null float64
temp_2        10000 non-null float64
wind          10000 non-null int64
dtypes: float64(2), int64(5)
memory usage: 547.0 KB
None
We can see that there are 10,000 observations and no missing values.
2. Examine basic descriptive statistics for each variable
print(train.describe())
city hour ... temp_2 wind
count 10000.000000 10000.000000 ... 10000.000000 10000.000000
mean 0.499800 11.527500 ... 15.321230 1.248600
std 0.500025 6.909777 ... 11.308986 1.095773
min 0.000000 0.000000 ... -15.600000 0.000000
25% 0.000000 6.000000 ... 5.800000 0.000000
50% 0.000000 12.000000 ... 16.000000 1.000000
75% 1.000000 18.000000 ... 24.800000 2.000000
max 1.000000 23.000000 ... 46.800000 7.000000
[8 rows x 7 columns]
These statistics already suggest some guesses: for example, city 0 and city 1 can basically be ruled out as southern cities, and the observation record spans a long period that may even include a long holiday, and so on.
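A quick way to sanity-check the city conjecture is to compare per-city temperature statistics; a minimal sketch, assuming the same train frame as in the info() call above:

# A deeply sub-zero minimum temperature argues against a southern city
print(train.groupby('city')['temp_1'].describe())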
3. Check the correlation coefficients
(For readability, entries with absolute value below 0.2 are replaced with NaN.)
import numpy as np

# feature_data holds the seven feature columns of the training set
corr = feature_data.corr()
corr[np.abs(corr) < 0.2] = np.nan
print(corr)
city hour is_workday weather temp_1 temp_2 wind
city 1.0 NaN NaN NaN NaN NaN NaN
hour NaN 1.0 NaN NaN NaN NaN NaN
is_workday NaN NaN 1.0 NaN NaN NaN NaN
weather NaN NaN NaN 1.0 NaN NaN NaN
temp_1 NaN NaN NaN NaN 1.000000 0.987357 NaN
temp_2 NaN NaN NaN NaN 0.987357 1.000000 NaN
wind NaN NaN NaN NaN NaN NaN 1.0
From a correlation standpoint, the hour of use and the temperature at that time have relatively strong relationships with the borrow count y (the matrix shown above only covers the features); temperature and feels-like temperature are strongly positively correlated (collinearity), which matches common sense.
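Given the near-perfect correlation between temp_1 and temp_2, one option (not taken in the baselines below, which keep both columns) is to drop the redundant feature before fitting the linear model; a minimal sketch, assuming the feature_data frame from the correlation code above:

# temp_2 (feels-like temperature) is nearly a copy of temp_1 (r = 0.987), so drop it
feature_data_reduced = feature_data.drop('temp_2', axis=1)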
Model Training and Results
1. Baseline model: simple linear regression
This model's prediction RMSE: 39.132
# -*- coding: utf-8 -*-
# Import modules
from sklearn.linear_model import LinearRegression
import pandas as pd

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Take out the training target y
y_train = train.pop('y')

# Fit a linear regression model
reg = LinearRegression()
reg.fit(train, y_train)
y_pred = reg.predict(test)

# Clip negative predictions to 0 (a list comprehension, not a lazy map, so pandas can consume it)
y_pred = [x if x >= 0 else 0 for x in y_pred]

# Write the predictions to my_LR_prediction.csv
submit['y'] = y_pred
submit.to_csv('my_LR_prediction.csv', index=False)
2. Decision tree regression
This model's prediction RMSE: 28.818
# -*- coding: utf-8 -*-
# Import modules
from sklearn.tree import DecisionTreeRegressor
import pandas as pd

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Take out the training target y
y_train = train.pop('y')

# Fit a decision tree regressor with maximum depth 5
reg = DecisionTreeRegressor(max_depth=5)
reg.fit(train, y_train)
y_pred = reg.predict(test)

# Write the predictions to my_DT_prediction.csv
submit['y'] = y_pred
submit.to_csv('my_DT_prediction.csv', index=False)
3. Xgboost regression
This model's prediction RMSE: 18.947
# -*- coding: utf-8 -*-
# Import modules
from xgboost import XGBRegressor
import pandas as pd

# Read the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
submit = pd.read_csv("sample_submit.csv")

# Drop the id column
train.drop('id', axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

# Take out the training target y
y_train = train.pop('y')

# Fit an xgboost regressor with default parameters
reg = XGBRegressor()
reg.fit(train, y_train)
y_pred = reg.predict(test)

# Write the predictions to my_XGB_prediction.csv
submit['y'] = y_pred
submit.to_csv('my_XGB_prediction.csv', index=False)
4. Xgboost parameter tuning
Related blog post on Xgboost: click here
The general procedure for parameter tuning is as follows (step 1 is sketched in code right after this list):
- 1. Choose a relatively high learning rate. A value of 0.1 usually works, though for different problems the ideal rate can fall anywhere between 0.05 and 0.3. Then find the ideal number of trees for that learning rate. Xgboost has a very useful function, "cv", which runs cross-validation at each boosting iteration and returns the ideal number of trees.
- 2. With the learning rate and number of trees fixed, tune the tree-specific parameters (max_depth, min_child_weight, gamma, subsample, colsample_bytree). Different choices here determine how each tree is grown.
- 3. Tune Xgboost's regularization parameters (lambda, alpha). These reduce model complexity and can thereby improve performance.
- 4. Lower the learning rate and settle on the final parameters.
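As a reference for step 1, here is a minimal sketch of xgboost's cv function with early stopping. The objective name 'reg:squarederror' assumes a recent xgboost (older versions call it 'reg:linear'), and the other settings mirror the grid searches below; this is a sketch, not the post's original code.

import pandas as pd
import xgboost as xgb

train = pd.read_csv("train.csv")
train.drop('id', axis=1, inplace=True)
y = train.pop('y')

dtrain = xgb.DMatrix(train, label=y)
params = {'objective': 'reg:squarederror',  # 'reg:linear' on older xgboost versions
          'eta': 0.1, 'max_depth': 5}

# Grow up to 1000 trees; stop once the CV RMSE has not improved for 50 rounds
cv_results = xgb.cv(params, dtrain, num_boost_round=1000, nfold=5,
                    metrics='rmse', early_stopping_rounds=50, seed=27)
print(len(cv_results))                     # rows kept = the ideal number of trees
print(cv_results['test-rmse-mean'].min())  # the corresponding CV RMSE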
5. Tuning Xgboost with GridSearchCV
5.1 Xgboost's default parameters (as defined in the sklearn wrapper):
def __init__(self, max_depth=3, learning_rate=0.1, n_estimators=100,
             silent=True, objective="reg:linear", booster='gbtree',
             n_jobs=-1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0,
             subsample=1, colsample_bytree=1, colsample_bylevel=1,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
             base_score=0.5, random_state=0, seed=None, missing=None, **kwargs):
5.2 First, tune n_estimators
def xgboost_parameter_tuning(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test1 = {
        'n_estimators': range(100, 1000, 100)
    }
    gsearch1 = GridSearchCV(estimator=xgb.XGBRegressor(
        learning_rate=0.1, max_depth=5,
        min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
        nthread=4, scale_pos_weight=1, seed=27),
        param_grid=param_test1, cv=5
    )
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_
The results (so we choose 200 trees):
{'n_estimators': 200}
0.9013685759002941
5.3 Tuning max_depth and min_child_weight
(max_depth is the maximum tree depth, default 3, range [1, ∞); the deeper the tree, the more closely it fits the data, but typical values are 3-10.)
(min_child_weight is the minimum sum of sample weights in a child node; if a leaf node's sample weight sum falls below min_child_weight, the splitting process stops.)
We tune these two parameters next because they have a large influence on the final result, so I fine-tune them directly over a small range.
def xgboost_parameter_tuning2(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test2 = {
        'max_depth': range(3, 10, 1),
        'min_child_weight': range(1, 6, 1),
    }
    gsearch1 = GridSearchCV(estimator=xgb.XGBRegressor(
        learning_rate=0.1, n_estimators=200
    ), param_grid=param_test2, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_
The results:
{'max_depth': 5, 'min_child_weight': 5}
0.9030852081699604
We searched a fairly wide grid of 35 combinations; the ideal max_depth is 5 and the ideal min_child_weight is 5.
5.4 Tuning gamma
(gamma makes the algorithm more conservative; its effect depends on the loss function, so it should be tuned.)
With the other parameters now fixed, we can tune gamma. Gamma can take values over a wide range; here I try five candidate values (0 to 0.4), though a more precise search is also possible.
def xgboost_parameter_tuning3(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test3 = {
        'gamma': [i / 10.0 for i in range(0, 5)]
    }
    gsearch1 = GridSearchCV(estimator=xgb.XGBRegressor(
        learning_rate=0.1, n_estimators=200, max_depth=5, min_child_weight=5
    ), param_grid=param_test3, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_
The results:
{'gamma': 0.0}
0.9024876500236406
5.5 Tuning subsample and colsample_bytree
(subsample is the fraction of the training samples used to grow each tree; a value of 0.5 means XGBoost randomly draws 50% of the samples to build each tree, which helps prevent overfitting. Its range is (0, 1].)
(colsample_bytree is the fraction of features sampled when building each tree, default 1, range (0, 1].)
The next step is to try different subsample and colsample_bytree values. We do this in two phases, both starting from the values 0.6, 0.7, 0.8, 0.9; a finer second phase is sketched after the results below.
def xgboost_parameter_tuning4(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test4 = {
        'subsample': [i / 10.0 for i in range(6, 10)],
        'colsample_bytree': [i / 10.0 for i in range(6, 10)]
    }
    gsearch1 = GridSearchCV(estimator=xgb.XGBRegressor(
        learning_rate=0.1, n_estimators=200, max_depth=5, min_child_weight=5, gamma=0
    ), param_grid=param_test4, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_
The results:
{'colsample_bytree': 0.9, 'subsample': 0.8}
0.9039011907271065
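The finer second phase is not shown in the original; a sketch of what it could look like, stepping by 0.05 around the phase-one optimum (param_test4b is a hypothetical name; feature_data and label_data as returned by load_data in the full script below):

import xgboost as xgb
from sklearn.model_selection import train_test_split, GridSearchCV

X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
# Hypothetical phase-two grid around subsample=0.8, colsample_bytree=0.9
param_test4b = {
    'subsample': [0.75, 0.8, 0.85],
    'colsample_bytree': [0.85, 0.9, 0.95]
}
gsearch = GridSearchCV(estimator=xgb.XGBRegressor(
    learning_rate=0.1, n_estimators=200, max_depth=5, min_child_weight=5, gamma=0
), param_grid=param_test4b, cv=5)
gsearch.fit(X_train, y_train)
print(gsearch.best_params_, gsearch.best_score_)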
5.6 Tuning the regularization parameters
Because gamma already provides an effective way to reduce overfitting, most people rarely touch reg_alpha, but we can still give it a try.
def xgboost_parameter_tuning5(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test5 = {
        'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]
    }
    gsearch1 = GridSearchCV(estimator=xgb.XGBRegressor(
        learning_rate=0.1, n_estimators=200, max_depth=5, min_child_weight=5, gamma=0.0,
        colsample_bytree=0.9, subsample=0.8), param_grid=param_test5, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_
The results:
{'reg_alpha': 0.01}
0.899800819611995
(This score is slightly below the 0.9039 of the previous step; since every tuning function draws a fresh random train_test_split, scores across steps are not strictly comparable.)
5.7 Combine the best parameters found and train
The code:
def xgboost_train(feature_data, label_data, test_feature, submitfile):
    import numpy as np
    import pandas as pd
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    params = {
        'learning_rate': 0.1,
        'n_estimators': 200,
        'max_depth': 5,
        'min_child_weight': 5,
        'gamma': 0.0,
        'colsample_bytree': 0.9,
        'subsample': 0.8,
        'reg_alpha': 0.01,
    }
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)

    submit = pd.read_csv(submitfile)
    submit['y'] = model.predict(test_feature)
    submit.to_csv('my_xgboost_prediction1.csv', index=False)
Comparing with the results above, the final RMSE is 15.208, an improvement of roughly 3.7 over the default xgboost (18.947).
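Strictly speaking, the 15.208 above is measured on a random local split while 18.947 is the score of the submitted predictions, so they are not computed on the same data. A sketch of a like-for-like comparison on one fixed split (random_state=42 is an arbitrary choice; feature_data and label_data as returned by load_data below):

import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_tr, X_te, y_tr, y_te = train_test_split(feature_data, label_data, test_size=0.23, random_state=42)
tuned = dict(learning_rate=0.1, n_estimators=200, max_depth=5, min_child_weight=5,
             gamma=0.0, colsample_bytree=0.9, subsample=0.8, reg_alpha=0.01)
for name, model in [('default', xgb.XGBRegressor()), ('tuned', xgb.XGBRegressor(**tuned))]:
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    print(name, rmse)  # both RMSEs now come from the identical held-out split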
The complete code is summarized below:
#_*_coding:utf-8_*_
import numpy as np
import pandas as pd


def load_data(trainfile, testfile):
    traindata = pd.read_csv(trainfile)
    testdata = pd.read_csv(testfile)
    print(traindata.shape)   # (10000, 9)
    print(testdata.shape)    # (7000, 8)
    # print(traindata)
    print(type(traindata))
    feature_data = traindata.iloc[:, 1:-1]   # drop the id column and the target
    label_data = traindata.iloc[:, -1]       # the target y
    test_feature = testdata.iloc[:, 1:]      # drop the id column
    return feature_data, label_data, test_feature


def xgboost_train(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    params = {
        'learning_rate': 0.1,
        'n_estimators': 200,
        'max_depth': 5,
        'min_child_weight': 5,
        'gamma': 0.0,
        'colsample_bytree': 0.9,
        'subsample': 0.8,
        'reg_alpha': 0.01,
    }
    model = xgb.XGBRegressor(**params)   # pass in the tuned parameters
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)

    submit = pd.read_csv(submitfile)
    submit['y'] = model.predict(test_feature)
    submit.to_csv('my_xgboost_prediction.csv', index=False)


def xgboost_parameter_tuning1(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test1 = {
        'n_estimators': range(100, 1000, 100)
    }
    gsearch1 = GridSearchCV(estimator=xgb.XGBRegressor(
        learning_rate=0.1, max_depth=5,
        min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
        nthread=4, scale_pos_weight=1, seed=27),
        param_grid=param_test1, cv=5
    )
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_


def xgboost_parameter_tuning2(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test2 = {
        'max_depth': range(3, 10, 1),
        'min_child_weight': range(1, 6, 1),
    }
    gsearch1 = GridSearchCV(estimator=xgb.XGBRegressor(
        learning_rate=0.1, n_estimators=200
    ), param_grid=param_test2, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_


def xgboost_parameter_tuning3(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test3 = {
        'gamma': [i / 10.0 for i in range(0, 5)]
    }
    gsearch1 = GridSearchCV(estimator=xgb.XGBRegressor(
        learning_rate=0.1, n_estimators=200, max_depth=5, min_child_weight=5
    ), param_grid=param_test3, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_


def xgboost_parameter_tuning4(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test4 = {
        'subsample': [i / 10.0 for i in range(6, 10)],
        'colsample_bytree': [i / 10.0 for i in range(6, 10)]
    }
    gsearch1 = GridSearchCV(estimator=xgb.XGBRegressor(
        learning_rate=0.1, n_estimators=200, max_depth=5, min_child_weight=5, gamma=0.0
    ), param_grid=param_test4, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_


def xgboost_parameter_tuning5(feature_data, label_data, test_feature, submitfile):
    import xgboost as xgb
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test5 = {
        'reg_alpha': [0, 0.001, 0.005, 0.01, 0.05]
    }
    gsearch1 = GridSearchCV(estimator=xgb.XGBRegressor(
        learning_rate=0.1, n_estimators=200, max_depth=5, min_child_weight=5, gamma=0.0,
        colsample_bytree=0.9, subsample=0.8), param_grid=param_test5, cv=5)
    gsearch1.fit(X_train, y_train)
    return gsearch1.best_params_, gsearch1.best_score_


if __name__ == '__main__':
    trainfile = 'data/train.csv'
    testfile = 'data/test.csv'
    submitfile = 'data/sample_submit.csv'
    feature_data, label_data, test_feature = load_data(trainfile, testfile)
    xgboost_train(feature_data, label_data, test_feature, submitfile)
6. Random forest regression
This model's prediction RMSE: 18.947
#_*_coding:utf-8_*_
import numpy as np
import pandas as pd


def load_data(trainfile, testfile):
    traindata = pd.read_csv(trainfile)
    testdata = pd.read_csv(testfile)
    feature_data = traindata.iloc[:, 1:-1]
    label_data = traindata.iloc[:, -1]
    test_feature = testdata.iloc[:, 1:]
    return feature_data, label_data, test_feature


def random_forest_train(feature_data, label_data, test_feature, submitfile):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    model = RandomForestRegressor()
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)

    submit = pd.read_csv(submitfile)
    submit['y'] = model.predict(test_feature)
    submit.to_csv('my_random_forest_prediction.csv', index=False)


if __name__ == '__main__':
    trainfile = 'data/train.csv'
    testfile = 'data/test.csv'
    submitfile = 'data/sample_submit.csv'
    feature_data, label_data, test_feature = load_data(trainfile, testfile)
    random_forest_train(feature_data, label_data, test_feature, submitfile)

7. Random forest tuning
Related blog post on random forests: click here
First, let's look at the random forest tuning procedure:

- 1. First tune n_estimators, the parameter that does not add model complexity yet affects the model the most (use a learning curve; see the sketch after this list).
- 2. Once the best value is found, tune max_depth (a single-parameter grid search, or again a learning curve).
- (Generally probe according to the data size: on a small dataset try 1-10 or 1-20; on large data, try depths of 30-50 or even deeper.)
- 3. Then tune the remaining parameters one by one.
- (Note: for large datasets, max_leaf_nodes can be explored starting from 1000, in intervals of 100 leaves, then narrowing the range. For min_samples_split and min_samples_leaf, generally increase from their minimum values in steps of 10 or 20; with high-dimensional, high-sample data you can go straight to 50+, and very large data may need the 200-300 range. If the accuracy simply will not rise no matter what, feel free to try a very large value and strongly restrict the model's complexity.)
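A minimal sketch of the learning-curve idea from step 1, scoring each n_estimators by 5-fold cross-validation (sklearn's default R² scoring for regressors; feature_data and label_data as returned by load_data above):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

grid = list(range(10, 201, 10))
scores = [cross_val_score(RandomForestRegressor(n_estimators=n, random_state=10),
                          feature_data, label_data, cv=5).mean()
          for n in grid]
# Plotting scores against grid gives the learning curve; here we just take the peak
print(grid[int(np.argmax(scores))], max(scores))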
7.1 Use GridSearchCV to find the best n_estimators
def random_forest_parameter_tuning1(feature_data, label_data, test_feature):
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test1 = {
        'n_estimators': range(10, 71, 10)
    }
    model = GridSearchCV(estimator=RandomForestRegressor(
        min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt',
        random_state=10), param_grid=param_test1, cv=5
    )
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_
The results:
{'n_estimators': 70}
0.6573670183811001
This gives the best number of estimators: 70. (Note that 70 is the upper edge of the grid range(10, 71, 10), so extending the range might find a better value.)
7.2 Find the best tree depth max_depth and minimum samples required to split an internal node
Having fixed the number of estimators, we grid-search the maximum tree depth max_depth and the minimum number of samples required to split an internal node, min_samples_split.
def random_forest_parameter_tuning2(feature_data, label_data, test_feature):
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test2 = {
        'max_depth': range(3, 14, 2),
        'min_samples_split': range(50, 201, 20)
    }
    model = GridSearchCV(estimator=RandomForestRegressor(
        n_estimators=70, min_samples_leaf=20, max_features='sqrt', oob_score=True,
        random_state=10), param_grid=param_test2, cv=5
    )
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_
The results:
{'max_depth': 13, 'min_samples_split': 50}
0.7107311632187736
We cannot finalize min_samples_split yet, because it still interacts with the other tree parameters. (Note that max_depth=13 also sits at the top of its search grid.)
7.3 Find the best min_samples_split and min_samples_leaf
Next we tune min_samples_split (minimum samples required to split an internal node) and min_samples_leaf (minimum samples in a leaf node) together.
def random_forest_parameter_tuning3(feature_data, label_data, test_feature):
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test3 = {
        'min_samples_split': range(10, 90, 20),
        'min_samples_leaf': range(10, 60, 10),
    }
    model = GridSearchCV(estimator=RandomForestRegressor(
        n_estimators=70, max_depth=13, max_features='sqrt', oob_score=True,
        random_state=10), param_grid=param_test3, cv=5
    )
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_
The results:
{'min_samples_leaf': 10, 'min_samples_split': 10}
0.7648492269870218
7.4 Find the best max_features
def random_forest_parameter_tuning4(feature_data, label_data, test_feature):
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test4 = {
        'max_features': range(3, 9, 2),
    }
    model = GridSearchCV(estimator=RandomForestRegressor(
        n_estimators=70, max_depth=13, min_samples_split=10, min_samples_leaf=10, oob_score=True,
        random_state=10), param_grid=param_test4, cv=5
    )
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_
The results:
{'max_features': 7}
0.881211719251515
7.5 Combine the best parameters found and train
def random_forest_train(feature_data, label_data, test_feature, submitfile):
    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    params = {
        'n_estimators': 70,
        'max_depth': 13,
        'min_samples_split': 10,
        'min_samples_leaf': 10,
        'max_features': 7
    }
    model = RandomForestRegressor(**params)
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)

    submit = pd.read_csv(submitfile)
    submit['y'] = model.predict(test_feature)
    submit.to_csv('my_random_forest_prediction1.csv', index=False)
The final result: after tuning, the RMSE on the validation split improves from 17.144 to 16.251. The gain is smaller than what tuning achieved with Xgboost, so we ultimately choose Xgboost.
7.6 The complete code:
#_*_coding:utf-8_*_
import numpy as np
import pandas as pd


def load_data(trainfile, testfile):
    traindata = pd.read_csv(trainfile)
    testdata = pd.read_csv(testfile)
    feature_data = traindata.iloc[:, 1:-1]
    label_data = traindata.iloc[:, -1]
    test_feature = testdata.iloc[:, 1:]
    return feature_data, label_data, test_feature


def random_forest_train(feature_data, label_data, test_feature, submitfile):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    params = {
        'n_estimators': 70,
        'max_depth': 13,
        'min_samples_split': 10,
        'min_samples_leaf': 10,
        'max_features': 7
    }
    model = RandomForestRegressor(**params)
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)

    submit = pd.read_csv(submitfile)
    submit['y'] = model.predict(test_feature)
    submit.to_csv('my_random_forest_prediction1.csv', index=False)


def random_forest_parameter_tuning1(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test1 = {
        'n_estimators': range(10, 71, 10)
    }
    model = GridSearchCV(estimator=RandomForestRegressor(
        min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt',
        random_state=10), param_grid=param_test1, cv=5
    )
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_


def random_forest_parameter_tuning2(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test2 = {
        'max_depth': range(3, 14, 2),
        'min_samples_split': range(50, 201, 20)
    }
    model = GridSearchCV(estimator=RandomForestRegressor(
        n_estimators=70, min_samples_leaf=20, max_features='sqrt', oob_score=True,
        random_state=10), param_grid=param_test2, cv=5
    )
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_


def random_forest_parameter_tuning3(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test3 = {
        'min_samples_split': range(10, 90, 20),
        'min_samples_leaf': range(10, 60, 10),
    }
    model = GridSearchCV(estimator=RandomForestRegressor(
        n_estimators=70, max_depth=13, max_features='sqrt', oob_score=True,
        random_state=10), param_grid=param_test3, cv=5
    )
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_


def random_forest_parameter_tuning4(feature_data, label_data, test_feature):
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import GridSearchCV

    X_train, X_test, y_train, y_test = train_test_split(feature_data, label_data, test_size=0.23)
    param_test4 = {
        'max_features': range(3, 9, 2)
    }
    model = GridSearchCV(estimator=RandomForestRegressor(
        n_estimators=70, max_depth=13, min_samples_split=10, min_samples_leaf=10, oob_score=True,
        random_state=10), param_grid=param_test4, cv=5
    )
    model.fit(X_train, y_train)
    # Predict on the held-out split
    y_pred = model.predict(X_test)
    # Compute RMSE
    MSE = mean_squared_error(y_test, y_pred)
    RMSE = np.sqrt(MSE)
    print(RMSE)
    return model.best_score_, model.best_params_


if __name__ == '__main__':
    trainfile = 'data/train.csv'
    testfile = 'data/test.csv'
    submitfile = 'data/sample_submit.csv'
    feature_data, label_data, test_feature = load_data(trainfile, testfile)
    random_forest_train(feature_data, label_data, test_feature, submitfile)
Reference: https://www.jianshu.com/p/748b6c35773d
