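Kaggle study notes
The material here is fragmented, but the steps are always the same, so first remember the overall outline and then add details bit by bit.
Loading the data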
import pandas as pd
# Read the train data (file name assumed to be 'train.csv')
train = pd.read_csv('train.csv')
# Read the test data
test = pd.read_csv('test.csv')
# Print train and test column (variable) names
print('Train columns:', train.columns.tolist())
print('Test columns:', test.columns.tolist())
# Read the sample submission file
sample_submission = pd.read_csv('sample_submission.csv')
# Look at the head() of the sample submission
print(sample_submission.head())
submission
Public vs Private leaderboard
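I did not separate public and private clearly at first. The public leaderboard is scored on a subset of the test data and is visible during the competition; the private leaderboard is scored on the remaining test data and decides the final ranking.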
overfit
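Overfitting: the error on the training set is small while the error on the validation set is large, i.e. the model has fit the training set too closely.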
train_test_split
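The train_test_split function splits the original dataset into a training set and a test set at a given ratio, so that the model is trained on one part and evaluated on the other.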
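A minimal sketch of such a split, assuming a made-up DataFrame and a 20% hold-out (the column names and the ratio are illustrative, not taken from a real competition):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration only
df = pd.DataFrame({
    'store': np.arange(10) % 3,
    'item': np.arange(10) % 5,
    'sales': np.arange(10) * 10,
})

# Hold out 20% of the rows; random_state makes the split reproducible
train_part, test_part = train_test_split(df, test_size=0.2, random_state=123)

print(train_part.shape)  # (8, 3)
print(test_part.shape)   # (2, 3)
```

The two parts never share a row, so the held-out part can serve as an honest check on what the model learned.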
Train and test error
Compare the training-set and test-set errors together to judge whether the model is overfitting.
import xgboost as xgb
from sklearn.metrics import mean_squared_error
dtrain = xgb.DMatrix(data=train[['store', 'item']])
dtest = xgb.DMatrix(data=test[['store', 'item']])
# For each of 3 trained models (trained earlier, not part of this snippet)
for model in [xg_depth_2, xg_depth_8, xg_depth_15]:
    # Make predictions
    train_pred = model.predict(dtrain)
    test_pred = model.predict(dtest)
    # Calculate metrics
    mse_train = mean_squared_error(train['sales'], train_pred)
    mse_test = mean_squared_error(test['sales'], test_pred)
    print('MSE Train: {:.3f}. MSE Test: {:.3f}'.format(mse_train, mse_test))
<script.py> output:
MSE Train: 631.275. MSE Test: 558.522
MSE Train: 183.771. MSE Test: 337.337
MSE Train: 134.984. MSE Test: 355.534
Custom error function
import numpy as np
# Import log_loss from sklearn
from sklearn.metrics import log_loss
# Define your own LogLoss function
def own_logloss(y_true, prob_pred):
    # Find loss for each observation
    terms = y_true * np.log(prob_pred) + (1 - y_true) * np.log(1 - prob_pred)
    # Find mean over all observations
    err = np.mean(terms)
    return -err
print('Sklearn LogLoss: {:.5f}'.format(log_loss(y_classification_true, y_classification_pred)))
print('Your LogLoss: {:.5f}'.format(own_logloss(y_classification_true, y_classification_pred)))
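To check the hand-written function without the course data, here is a self-contained sketch. The labels and probabilities are made up, and the function is the same binary log loss as above, i.e. the negative mean of y*log(p) + (1-y)*log(1-p):

```python
import numpy as np
from sklearn.metrics import log_loss

def own_logloss(y_true, prob_pred):
    # Binary log loss: negative mean log-likelihood of the true labels
    terms = y_true * np.log(prob_pred) + (1 - y_true) * np.log(1 - prob_pred)
    return -np.mean(terms)

# Made-up labels and predicted probabilities of class 1
y_true = np.array([1, 0, 1, 1, 0])
prob_pred = np.array([0.9, 0.2, 0.6, 0.8, 0.3])

own = own_logloss(y_true, prob_pred)
ref = log_loss(y_true, prob_pred)
print(np.isclose(own, ref))  # True
```

Note that this simple version returns inf or nan if a predicted probability is exactly 0 or 1, so in practice the probabilities would need to be clipped first.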
EDA
PLOT
import matplotlib.pyplot as plt
# Create hour feature
train['pickup_datetime'] = pd.to_datetime(train.pickup_datetime)
train['hour'] = train.pickup_datetime.dt.hour
# Find median fare_amount for each hour
hour_price = train.groupby('hour', as_index=False)['fare_amount'].median()
# Plot the line plot
plt.plot(hour_price['hour'], hour_price['fare_amount'], marker='o')
plt.xlabel('Hour of the day')
plt.ylabel('Median fare amount')
plt.title('Fare amount based on day time')
plt.xticks(range(24))
plt.show()
Local validation
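KFold
K-fold cross-validation: split the dataset into n_splits mutually exclusive subsets. Each time, one subset serves as the test set and the remaining (n_splits - 1) subsets serve as the training set. This is repeated n_splits times, giving n_splits results.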
# Import KFold
from sklearn.model_selection import KFold
# Create a KFold object
kf = KFold(n_splits=3, shuffle=True, random_state=123)
# Loop through each split
fold = 0
for train_index, test_index in kf.split(train):
    # Obtain training and testing folds
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold: {}'.format(fold))
    print('CV train shape: {}'.format(cv_train.shape))
    print('Medium interest listings in CV train: {}\n'.format(sum(cv_train.interest_level == 'medium')))
    fold += 1
<script.py> output:
Fold: 0
CV train shape: (666, 9)
Medium interest listings in CV train: 175
Fold: 1
CV train shape: (667, 9)
Medium interest listings in CV train: 165
Fold: 2
CV train shape: (667, 9)
Medium interest listings in CV train: 162
data leakage
Splitting on a time feature
from sklearn.model_selection import TimeSeriesSplit
# Create TimeSeriesSplit object
time_kfold = TimeSeriesSplit(n_splits=3)
# Sort train data by date
train = train.sort_values('date')
# Iterate through each split
fold = 0
for train_index, test_index in time_kfold.split(train):
    cv_train, cv_test = train.iloc[train_index], train.iloc[test_index]
    print('Fold :', fold)
    print('Train date range: from {} to {}'.format(cv_train.date.min(), cv_train.date.max()))
    print('Test date range: from {} to {}\n'.format(cv_test.date.min(), cv_test.date.max()))
    fold += 1
<script.py> output:
Fold : 0
Train date range: from 2017-12-01 to 2017-12-08
Test date range: from 2017-12-08 to 2017-12-16
Fold : 1
Train date range: from 2017-12-01 to 2017-12-16
Test date range: from 2017-12-16 to 2017-12-24
Fold : 2
Train date range: from 2017-12-01 to 2017-12-24
Test date range: from 2017-12-24 to 2017-12-31
Validation error
from sklearn.model_selection import TimeSeriesSplit
import numpy as np
# Sort train data by date
train = train.sort_values('date')
# Initialize 3-fold time cross-validation
kf = TimeSeriesSplit(n_splits=3)
# Get MSE scores for each cross-validation split
# (get_fold_mse is a helper defined outside this snippet)
mse_scores = get_fold_mse(train, kf)
print('Mean validation MSE: {:.5f}'.format(np.mean(mse_scores)))
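The notes do not show what get_fold_mse does. One possible sketch follows, with everything in it assumed for illustration: the default feature and target columns, the use of LinearRegression, and the toy data are hypothetical and not the course's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

def get_fold_mse(train, kf, features=('store', 'item'), target='sales'):
    """Hypothetical helper: fit on each training fold, return the validation MSEs."""
    features = list(features)
    mse_scores = []
    for train_index, test_index in kf.split(train):
        fold_train, fold_test = train.iloc[train_index], train.iloc[test_index]
        # Any regressor could be used here; LinearRegression keeps the sketch small
        model = LinearRegression()
        model.fit(fold_train[features], fold_train[target])
        pred = model.predict(fold_test[features])
        mse_scores.append(mean_squared_error(fold_test[target], pred))
    return mse_scores

# Toy daily sales data, made up for illustration
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'date': pd.date_range('2017-12-01', periods=40, freq='D'),
    'store': rng.integers(1, 4, size=40),
    'item': rng.integers(1, 6, size=40),
})
toy['sales'] = 10 * toy['store'] + 2 * toy['item'] + rng.normal(0, 1, size=40)

# Sort by date first so every validation fold lies after its training fold
toy = toy.sort_values('date')
scores = get_fold_mse(toy, TimeSeriesSplit(n_splits=3))
print('Mean validation MSE: {:.5f}'.format(np.mean(scores)))
```

Averaging the per-fold scores gives a single number that can be compared across models before submitting anything to the leaderboard.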
feature engineering
Arithmetical features
numerical
Numerical features can be combined directly with arithmetic operations.
# Combine the three area features by summing them
train['TotalArea'] = train['TotalBsmtSF'] + train['FirstFlrSF'] + train['SecondFlrSF']
Date features
Extracting date and time features
pd.to_datetime
# Concatenate train and test together
taxi = pd.concat([train, test])
# Convert pickup date to datetime object
taxi['pickup_datetime'] = pd.to_datetime(taxi['pickup_datetime'])
# Create a day of week feature
taxi['dayofweek'] = taxi['pickup_datetime'].dt.dayofweek
# Create an hour feature
taxi['hour'] = taxi['pickup_datetime'].dt.hour
# Split back into train and test
new_train = taxi[taxi['id'].isin(train['id'])]
new_test = taxi[taxi['id'].isin(test['id'])]
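Categorical features: the encoding problem
This is a big topic. Which encoding to use depends on the feature:
- label encoding: the feature has an intrinsic order (ordinal feature)
- one hot encoding: the feature has no intrinsic order and fewer than 4 categories
- target encoding (mean encoding, likelihood encoding, impact encoding): the feature has no intrinsic order and more than 4 categories
- beta target encoding: the feature has no intrinsic order and more than 4 categories, computed with K-fold cross validation
- no encoding (the model encodes the feature itself): CatBoost, lightgbm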
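Text (categorical) features
- ordered categorical features
- unordered categorical features
There are two main ways to handle them: label encoding and one-hot encoding.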
Label encoding
# Concatenate train and test together
houses = pd.concat([train, test])
# Label encoder
For a feature with m categories, label encoding maps each category to an integer between 0 and m-1. Label encoding suits ordinal features (features with an intrinsic order).
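One caveat when the order matters: LabelEncoder assigns integers by sorted class name, which need not match the intrinsic order. The sketch below uses a made-up 'quality' feature (not from the housing data) and compares it with an explicit ordered categorical in pandas:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up ordinal feature: quality grades with a natural order
df = pd.DataFrame({'quality': ['low', 'high', 'medium', 'low', 'high']})

# LabelEncoder sorts the classes alphabetically: high=0, low=1, medium=2
le = LabelEncoder()
df['quality_le'] = le.fit_transform(df['quality'])

# An explicit ordered categorical keeps the intended order: low=0, medium=1, high=2
order = ['low', 'medium', 'high']
df['quality_ord'] = pd.Categorical(df['quality'], categories=order, ordered=True).codes

print(df)
```

For binary features such as CentralAir the two approaches are equivalent up to relabeling, so the distinction only matters with three or more ordered levels.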
```python
# In practice, fit and transform are usually called separately
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Create new features
houses['RoofStyle_enc'] = le.fit_transform(houses['RoofStyle'])
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])
# Look at new features
print(houses[['RoofStyle', 'RoofStyle_enc', 'CentralAir', 'CentralAir_enc']].head())
```
<script.py> output:
  RoofStyle  RoofStyle_enc CentralAir  CentralAir_enc
0     Gable              1          Y               1
1     Gable              1          Y               1
2     Gable              1          Y               1
3     Gable              1          Y               1
4     Gable              1          Y               1
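The remark about calling fit and transform separately could look like the following sketch. The two small frames are made up; only the RoofStyle column name comes from the example above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up train and test frames
train = pd.DataFrame({'RoofStyle': ['Gable', 'Hip', 'Gable', 'Flat']})
test = pd.DataFrame({'RoofStyle': ['Hip', 'Gable']})

le = LabelEncoder()
# Learn the category -> integer mapping from the training data only
le.fit(train['RoofStyle'])
# Apply the same mapping to both sets
train['RoofStyle_enc'] = le.transform(train['RoofStyle'])
test['RoofStyle_enc'] = le.transform(test['RoofStyle'])

print(le.classes_.tolist())            # ['Flat', 'Gable', 'Hip']
print(test['RoofStyle_enc'].tolist())  # [2, 1]
```

With this pattern, transform raises a ValueError for a category that appears only in the test set. Concatenating train and test before encoding, as the notes do, is one way to avoid that.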
one-hot
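For a feature with m categories, one-hot encoding (OHE) produces m binary features, one per category. These m binary features are mutually exclusive: exactly one is active for each row.
One-hot encoding solves the problem that the original feature has no intrinsic order. Its drawback is that for a high-cardinality categorical feature (one with many categories) the encoded feature space becomes very large (PCA can be considered here to reduce dimensionality). In addition, because one-hot features are rather unbalanced, each split in a tree model yields only a small gain, so tree models usually need to grow very deep to reach good accuracy. OHE is therefore generally used when there are fewer than 4 categories.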
# Concatenate train and test together
houses = pd.concat([train, test])
# Label encode binary 'CentralAir' feature
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
houses['CentralAir_enc'] = le.fit_transform(houses['CentralAir'])
# Create One-Hot encoded features
ohe = pd.get_dummies(houses['RoofStyle'], prefix='RoofStyle')
# Concatenate OHE features to houses
houses = pd.concat([houses, ohe], axis=1)
# Look at OHE features
print(houses[[col for col in houses.columns if 'RoofStyle' in col]].head(3))
Target encoding
Mean target encoding
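When using the target variable, it is crucial not to leak any information from the validation set.
All target-encoded features should be computed on the training set only and then merged or joined onto the validation and test sets.
Even though the validation set contains the target variable, it must not be used in any encoding computation; otherwise the validation error estimate will be overly optimistic.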
- Calculate the mean on the train, apply to the test
- Split train into K folds. Calculate the out-of-fold mean for each fold, apply to this particular fold
Encoding the predictor variable
def mean_target_encoding(train, test, target, categorical, alpha=5):
    # train_mean_target_encoding and test_mean_target_encoding are helpers defined outside this snippet
    # Get the train feature
    train_feature = train_mean_target_encoding(train, target, categorical, alpha)
    # Get the test feature
    test_feature = test_mean_target_encoding(train, test, target, categorical, alpha)
    # Return new features to add to the model
    return train_feature, test_feature
mean_target_encoding
I do not understand this part well yet.
K-fold cross-validation
from sklearn.model_selection import KFold
# Create 5-fold cross-validation
kf = KFold(n_splits=5, random_state=123, shuffle=True)
# For each folds split
for train_index, test_index in kf.split(bryant_shots):
    # Copy the folds so that adding a column does not trigger SettingWithCopyWarning
    cv_train, cv_test = bryant_shots.iloc[train_index].copy(), bryant_shots.iloc[test_index].copy()
    # Create mean target encoded feature
    cv_train['game_id_enc'], cv_test['game_id_enc'] = mean_target_encoding(train=cv_train,
                                                                           test=cv_test,
                                                                           target='shot_made_flag',
                                                                           categorical='game_id',
                                                                           alpha=5)
    # Look at the encoding
    print(cv_train[['game_id', 'shot_made_flag', 'game_id_enc']].sample(n=1))
Missing data
Handling missing values
- xgboost and lightGBM can handle missing values automatically, so they do not require imputation
Counting missing values
df.isnull().sum()
Mean imputation
# Import SimpleImputer
from sklearn.impute import SimpleImputer
# Create mean imputer
mean_imputer = SimpleImputer(strategy='mean')
# Price imputation
rental_listings[['price']] = mean_imputer.fit_transform(rental_listings[['price']])
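A small worked example of the two steps above on made-up data (the 'bedrooms' column and all values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up listings with one missing price
rental_listings = pd.DataFrame({'price': [1000.0, np.nan, 3000.0],
                                'bedrooms': [1, 2, 3]})

# Number of missing values per column
print(rental_listings.isnull().sum())

# Replace the missing price with the column mean (2000.0 here)
mean_imputer = SimpleImputer(strategy='mean')
rental_listings[['price']] = mean_imputer.fit_transform(rental_listings[['price']])

print(rental_listings['price'].tolist())  # [1000.0, 2000.0, 3000.0]
```

When a separate test set exists, one option is to call fit_transform on train and only transform on test, so the test data does not influence the imputed value.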
Baseline model
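I plan to work through a full example for this part. The video is a bit vague here, but the Kaggle site does have many useful baselines. Some of the workflow is fixed, so once you have the overall idea you can build on it.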