Kaggle M5 Baseline Study (Part 2): M5 LOFO Importance on GPU via Rapids/Xgboost


Key takeaways first

From studying this feature-importance baseline, I learned the following three things:

1. feature_importance (here via LOFO, Leave One Feature Out)

2. RAPIDS, an open-source framework for GPU-accelerated data processing

3. A review of how XGBoost grows its trees

The original kernel is here:

https://www.kaggle.com/aerdem4/m5-lofo-importance-on-gpu-via-rapids-xgboost

I encourage you to go look at this author's other work. Thanks. He also reused this feature-importance method in the Liverpool competition.

from datetime import datetime, timedelta
import gc
import numpy as np, pandas as pd
import lightgbm as lgb

import cudf
import cu_utils.transform as cutran
cu_utils is a utility module the Kaggler wrote himself:
from numba import cuda, float32
import math


# Group-wise mean: threads cooperatively sum x into shared memory,
# then every row of the group receives the mean.
def cu_mean_transform(x, y_out):
    res = cuda.shared.array(1, dtype=float32)
    res[0] = 0
    cuda.syncthreads()

    for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
        cuda.atomic.add(res, 0, x[i])
    cuda.syncthreads()

    for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
        y_out[i] = res[0] / len(x)


def cu_max_transform(x, y_out):
    res = cuda.shared.array(1, dtype=float32)
    res[0] = -math.inf
    cuda.syncthreads()

    for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
        cuda.atomic.max(res, 0, x[i])
    cuda.syncthreads()

    for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
        y_out[i] = res[0]


def cu_min_transform(x, y_out):
    res = cuda.shared.array(1, dtype=float32)
    res[0] = math.inf
    cuda.syncthreads()

    for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
        cuda.atomic.min(res, 0, x[i])
    cuda.syncthreads()

    for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
        y_out[i] = res[0]


# Returns a kernel that lags x by `shift_by` rows within the group,
# writing null_val where no earlier value exists.
def get_cu_shift_transform(shift_by, null_val=-1):
    def cu_shift_transform(x, y_out):
        for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
            y_out[i] = null_val
            if 0 <= i-shift_by < len(x):
                y_out[i] = x[i-shift_by]
    return cu_shift_transform


# Returns a kernel computing a trailing rolling mean of length `window`
# within the group (null_val until a full window is available).
def get_cu_rolling_mean_transform(window, null_val=-1):
    def cu_rolling_mean_transform(x, y_out):
        for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
            y_out[i] = 0
            if i >= window-1:
                for j in range(cuda.threadIdx.y, window, cuda.blockDim.y):
                    cuda.atomic.add(y_out, i, x[i-j])
                y_out[i] /= window
            else:
                y_out[i] = null_val
    return cu_rolling_mean_transform


def get_cu_rolling_max_transform(window, null_val=-1):
    def cu_rolling_max_transform(x, y_out):
        for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
            y_out[i] = -math.inf
            if i >= window-1:
                for j in range(cuda.threadIdx.y, window, cuda.blockDim.y):
                    cuda.atomic.max(y_out, i, x[i-j])
            else:
                y_out[i] = null_val
    return cu_rolling_max_transform


def get_cu_rolling_min_transform(window, null_val=-1):
    def cu_rolling_min_transform(x, y_out):
        for i in range(cuda.threadIdx.x, len(x), cuda.blockDim.x):
            y_out[i] = math.inf
            if i >= window-1:
                for j in range(cuda.threadIdx.y, window, cuda.blockDim.y):
                    cuda.atomic.min(y_out, i, x[i-j])
            else:
                y_out[i] = null_val
    return cu_rolling_min_transform
The hand-rolled helper kernels

Honestly, I didn't fully understand these helpers at first! Roughly speaking, cuDF's apply_grouped runs the kernel once per group (one CUDA block per group), so each function uses the block's threads, striding from cuda.threadIdx.x by cuda.blockDim.x, to compute a group-wise aggregate, shift, or rolling statistic in parallel.
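To convince myself, here is the CPU equivalent of the shift kernel, written as plain pandas (my own sketch, not the notebook's code):

import pandas as pd

# A toy frame: two ids with five days of sales each.
toy = pd.DataFrame({"id": ["a"] * 5 + ["b"] * 5,
                    "sales": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]})

# CPU equivalent of cutran.get_cu_shift_transform(shift_by=2, null_val=-1):
# within each id group, take the value from 2 rows back, or -1 where
# no earlier value exists.
toy["lag_2"] = toy.groupby("id")["sales"].shift(2).fillna(-1)
print(toy)

The GPU version computes the same thing, except each id group is handed to a CUDA block whose threads fill y_out in parallel.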

h = 28            # forecast horizon: number of days to predict
max_lags = 57     # longest history needed by the lag/rolling features (28 + 28 + 1)
tr_last = 1913    # last day (d_1913) in the training data
fday = datetime(2016, 4, 25)  # first day to forecast (the day after d_1913)
FIRST_DAY = 1000  # earliest day actually used for training
fday

Data preprocessing comes next:

%%time

def create_df(start_day):
    prices = cudf.read_csv("/kaggle/input/m5-forecasting-accuracy/sell_prices.csv")
            
    cal = cudf.read_csv("/kaggle/input/m5-forecasting-accuracy/calendar.csv")
    cal["date"] = cal["date"].astype("datetime64[ms]")
    
    numcols = [f"d_{day}" for day in range(start_day,tr_last+1)]
    catcols = ['id', 'item_id', 'dept_id','store_id', 'cat_id', 'state_id']
    dt = cudf.read_csv("/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv", usecols = catcols + numcols)
    
    # Reshape from wide (one d_### column per day) to long format:
    # one row per (id, day) holding that day's sales.
    dt = cudf.melt(dt,
                  id_vars = catcols,
                  value_vars = [col for col in dt.columns if col.startswith("d_")],
                  var_name = "d",
                  value_name = "sales")
    
    # Attach calendar features, then the matching weekly sell price.
    dt = dt.merge(cal, on= "d")
    dt = dt.merge(prices, on = ["store_id", "item_id", "wm_yr_wk"])
    
    return dt


df = create_df(FIRST_DAY)
df.head()

The sales data runs from d_1 to d_1913, but the Kaggler trains only on d_1000 through d_1913.

Presumably this is because very old history is noisy: shopping habits drift over time, so data from too far back hurts more than it helps.

def transform(data):
    
    nan_features = ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
    for feature in nan_features:
        data[feature].fillna('unknown', inplace = True)
    
    data['id_encode'], _ = data["id"].factorize()
    
    cat = ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
    for feature in cat:
        data[feature], _ = data[feature].factorize()
    
    return data
        
        
df = transform(df)
df.head()

The Kaggler first fills the missing values in ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2'] with 'unknown', then label-encodes ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2'] using factorize.
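As a quick illustration (my own snippet, not from the notebook), factorize simply maps each distinct value to an integer code and returns the codes plus the unique values; cuDF's factorize behaves analogously on the GPU:

import pandas as pd

codes, uniques = pd.factorize(pd.Series(["Sporting", "Cultural", "Sporting", "unknown"]))
print(codes)    # [0 1 0 2]
print(uniques)  # Index(['Sporting', 'Cultural', 'unknown'], dtype='object')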

Personally, I feel this treatment of the event features is not quite right.

import pandas as pd

calendar = pd.read_csv("/kaggle/input/m5-forecasting-accuracy/calendar.csv")

# Count how often each combination of the four event columns occurs.
kv = {}
for k in zip(calendar['event_name_1'].tolist(), calendar['event_type_1'].tolist(),
             calendar['event_name_2'].tolist(), calendar['event_type_2'].tolist()):
    kv[k] = kv.get(k, 0) + 1
kv
[My own code] exploring the event features

The output:

{(nan, nan, nan, nan): 1807,
 ('SuperBowl', 'Sporting', nan, nan): 6,
 ('ValentinesDay', 'Cultural', nan, nan): 6,
 ('PresidentsDay', 'National', nan, nan): 6,
 ('LentStart', 'Religious', nan, nan): 6,
 ('LentWeek2', 'Religious', nan, nan): 6,
 ('StPatricksDay', 'Cultural', nan, nan): 6,
 ('Purim End', 'Religious', nan, nan): 6,
 ('OrthodoxEaster', 'Religious', 'Easter', 'Cultural'): 1,
 ('Pesach End', 'Religious', nan, nan): 6,
 ('Cinco De Mayo', 'Cultural', nan, nan): 5,
 ("Mother's day", 'Cultural', nan, nan): 6,
 ('MemorialDay', 'National', nan, nan): 6,
 ('NBAFinalsStart', 'Sporting', nan, nan): 6,
 ('NBAFinalsEnd', 'Sporting', nan, nan): 4,
 ("Father's day", 'Cultural', nan, nan): 4,
 ('IndependenceDay', 'National', nan, nan): 5,
 ('Ramadan starts', 'Religious', nan, nan): 6,
 ('Eid al-Fitr', 'Religious', nan, nan): 5,
 ('LaborDay', 'National', nan, nan): 5,
 ('ColumbusDay', 'National', nan, nan): 5,
 ('Halloween', 'Cultural', nan, nan): 5,
 ('EidAlAdha', 'Religious', nan, nan): 5,
 ('VeteransDay', 'National', nan, nan): 5,
 ('Thanksgiving', 'National', nan, nan): 5,
 ('Christmas', 'National', nan, nan): 5,
 ('Chanukah End', 'Religious', nan, nan): 5,
 ('NewYear', 'National', nan, nan): 5,
 ('OrthodoxChristmas', 'Religious', nan, nan): 5,
 ('MartinLutherKingDay', 'National', nan, nan): 5,
 ('Easter', 'Cultural', nan, nan): 4,
 ('OrthodoxEaster', 'Religious', nan, nan): 3,
 ('OrthodoxEaster', 'Religious', 'Cinco De Mayo', 'Cultural'): 1,
 ('Easter', 'Cultural', 'OrthodoxEaster', 'Religious'): 1,
 ('NBAFinalsEnd', 'Sporting', "Father's day", 'Cultural'): 2}
Event combination counts

 

I think encoding event_1 and event_2 separately like this can't be right. They should be one-hot encoded instead, which would also remove the need for fillna.
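A minimal sketch of what I mean, using pandas get_dummies (my own code, not the notebook's):

import pandas as pd

calendar = pd.read_csv("/kaggle/input/m5-forecasting-accuracy/calendar.csv")

# One-hot encode the four event columns. get_dummies leaves NaN rows as
# all zeros by default, so the fillna('unknown') step becomes unnecessary.
event_cols = ['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2']
event_onehot = pd.get_dummies(calendar[event_cols], prefix=event_cols)
calendar = pd.concat([calendar.drop(columns=event_cols), event_onehot], axis=1)

This also treats a day with two overlapping events (e.g. OrthodoxEaster plus Cinco De Mayo) as the union of both indicators rather than as a rare standalone category.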

Feature engineering:

def create_fea(data):

    # Per-id lag features on the GPU: within each item id, shift sales by
    # `lag` days, then take rolling means of the lagged series below.
    for lag in [7, 28]:
        out_col = "lag_{}".format(str(lag))
        data[out_col] = data[["id", "sales"]].groupby("id", method='cudf').apply_grouped(cutran.get_cu_shift_transform(shift_by=lag),
                                                                      incols={"sales": 'x'},
                                                                      outcols=dict(y_out=np.float32),
                                                                      tpb=32)["y_out"]
    
        for window in [7, 28]:
            out_col = "rmean_{lag}_{window}".format(lag=lag, window=window)
            data[out_col] = data[["id", "lag_{}".format(lag)]].groupby("id", method='cudf').apply_grouped(cutran.get_cu_rolling_mean_transform(window),
                                                                          incols={"lag_{}".format(lag): 'x'},
                                                                          outcols=dict(y_out=np.float32),
                                                                          tpb=32)["y_out"]

    # time features
    data['date'] = data['date'].astype("datetime64[ms]")
    data['year'] = data['date'].dt.year
    data['month'] = data['date'].dt.month
    data['day'] = data['date'].dt.day
    data['dayofweek'] = data['date'].dt.weekday

    return data    

# define list of features
features = ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id',
            'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 
            'snap_CA', 'snap_TX', 'snap_WI', 'sell_price', 
            'year', 'month', 'day', 'dayofweek',
            'lag_7', 'lag_28', 'rmean_7_7', 'rmean_7_28', 'rmean_28_7', 'rmean_28_28'
           ]


df = create_fea(df)

Features are built with sliding windows: first lag the sales, then compute rolling means over the lagged values.
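Here is a small pandas sketch of my own showing the lag-then-rolling-mean pattern for a single item (the cuDF kernels above compute the same thing per id, on the GPU):

import pandas as pd

# One item's daily sales.
s = pd.Series([10, 12, 9, 11, 14, 13, 8, 15])

# Lag first, then roll over the lagged values (lag 2 and window 3 here,
# standing in for the notebook's lags 7/28 and windows 7/28).
lag = s.shift(2)
rmean = lag.rolling(3).mean()
print(pd.DataFrame({"sales": s, "lag_2": lag, "rmean_2_3": rmean}))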

Computing feature importance with XGBoost via LOFO:

from lofo import LOFOImportance, Dataset, plot_importance
from sklearn.model_selection import KFold
import xgboost

sample_df = df.to_pandas().sample(frac=0.1, random_state=0)
sample_df.sort_values("date", inplace=True)

cv = KFold(n_splits=5, shuffle=False)  # random_state has no effect (and errors on newer sklearn) when shuffle=False

dataset = Dataset(df=sample_df, target="sales", features=features)

# define the validation scheme and scorer
params = {"objective": "count:poisson",
          "learning_rate" : 0.075,
          "max_depth": 6,
          'n_estimators': 200,
          'min_child_weight': 50,
          "tree_method": 'gpu_hist', "gpu_id": 0}
xgb_reg = xgboost.XGBRegressor(**params)
lofo_imp = LOFOImportance(dataset, cv=cv, scoring="neg_mean_squared_error", model=xgb_reg)

# get the mean and standard deviation of the importances in pandas format
importance_df = lofo_imp.get_importance()

Plotting the result:

plot_importance(importance_df, figsize=(12, 20))

Now that we have the feature importances, how should we interpret them?

For each feature, LOFO measures how much the validation score drops when that feature is left out, i.e. importance(f) = score(with all features) - score(without f); the larger the drop, the more important the feature.

When this value is negative, removing the feature actually improves the score, so the feature is noise and should be dropped.
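For example, one could prune such features before retraining (a sketch; I am assuming get_importance returns the usual lofo columns 'feature' and 'importance_mean'):

# Drop features whose mean LOFO importance is negative, i.e. noise.
noise = importance_df.loc[importance_df["importance_mean"] < 0, "feature"].tolist()
features = [f for f in features if f not in noise]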

 

 

To be continued…

 
       

