准備數據與訓練
calendar.csv數據集導入。
該數據數聚包含物品的售賣時間與物品類型
- date: The date in a “y-m-d” format.
- wm_yr_wk: The id of the week the date belongs to.
- weekday: The type of the day (Saturday, Sunday, …, Friday).
- wday: The id of the weekday, starting from Saturday.
- month: The month of the date.
- year: The year of the date.
- event_name_1: If the date includes an event, the name of this event.
- event_type_1: If the date includes an event, the type of this event.
- event_name_2: If the date includes a second event, the name of this event.
- event_type_2: If the date includes a second event, the type of this event.
- snap_CA, snap_TX, and snap_WI: A binary variable (0 or 1) indicating whether the stores of CA, TX or WI allow SNAP purchases on the examined date. 1 indicates that SNAP purchases are allowed.
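Before fixing the dtypes, it can be worth a quick look at the raw file. A minimal sanity-check sketch, assuming the column layout listed above (the `cal` name is used only for this check):

```python
# Quick look at calendar.csv (sketch): date range, event coverage, SNAP day share
import pandas as pd

cal = pd.read_csv("./calendar.csv", parse_dates=["date"])
print(cal["date"].min(), cal["date"].max())           # date range covered by the file
print(cal["event_name_1"].notna().sum(), "days carry at least one named event")
print(cal[["snap_CA", "snap_TX", "snap_WI"]].mean())  # share of SNAP-eligible days per state
```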
# Correct data types for "calendar.csv" calendarDTypes = {"event_name_1": "category", "event_name_2": "category", "event_type_1": "category", "event_type_2": "category", "weekday": "category", 'wm_yr_wk': 'int16', "wday": "int16", "month": "int16", "year": "int16", "snap_CA": "float32", 'snap_TX': 'float32', 'snap_WI': 'float32' } # Read csv file calendar = pd.read_csv("./calendar.csv", dtype = calendarDTypes) calendar["date"] = pd.to_datetime(calendar["date"]) calendar.head(10)
```python
# Transform categorical features into integers
for col, colDType in calendarDTypes.items():
    if colDType == "category":
        calendar[col] = calendar[col].cat.codes.astype("int16")
        calendar[col] -= calendar[col].min()

calendar.head(10)
```
- `calendar[col].cat.codes.astype("int16")` is plain label encoding of the categorical columns. Later we can try switching to one-hot encoding instead.
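For reference, a one-hot version could be built with `pd.get_dummies`. This is only a sketch of the alternative mentioned above, not what the baseline uses; it re-reads the raw frame so it is independent of the label-encoding loop, and it widens the frame considerably:

```python
# One-hot alternative to the label encoding above (sketch): every category value
# becomes its own 0/1 indicator column
calendar_raw = pd.read_csv("./calendar.csv", dtype = calendarDTypes)
calendar_ohe = pd.get_dummies(calendar_raw,
                              columns = ["event_name_1", "event_type_1",
                                         "event_name_2", "event_type_2", "weekday"])
print(calendar_ohe.shape)  # far more columns than the label-encoded frame
```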
sell_prices.csv
File 2: “sell_prices.csv”
This dataset contains the per-unit selling price of each product, per store and week.
- store_id: The id of the store where the product is sold.
- item_id: The id of the product.
- wm_yr_wk: The id of the week.
- sell_price: The price of the product for the given week/store. The price is provided per week (average across seven days). If it is not available, the product was not sold during the examined week. Note that although prices are constant on a weekly basis, they may change over time (in both the training and test sets).
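A quick way to see this "not sold that week" behaviour is to count how many distinct weeks each (store, item) pair actually has a price for. A sketch, assuming the columns listed above (`prices_raw` and `weeks_per_item` are names used only for this check):

```python
# Sketch: rows are weekly; a missing (store_id, item_id, wm_yr_wk) combination
# means the product had no price (was not on sale) in that week
prices_raw = pd.read_csv("./sell_prices.csv")
weeks_per_item = prices_raw.groupby(["store_id", "item_id"])["wm_yr_wk"].nunique()
print(weeks_per_item.describe())  # spread shows some products are priced for far fewer weeks than others
```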
# Correct data types for "sell_prices.csv" priceDTypes = {"store_id": "category", "item_id": "category", "wm_yr_wk": "int16", "sell_price":"float32"} # Read csv file prices = pd.read_csv("./sell_prices.csv", dtype = priceDTypes) prices.head()
```python
# Transform categorical features into integers
for col, colDType in priceDTypes.items():
    if colDType == "category":
        prices[col] = prices[col].cat.codes.astype("int16")
        prices[col] -= prices[col].min()

prices.head()
```
sales_train_validation.csv
File 3: “sales_train_validation.csv”
Contains the historical daily unit sales data per product and store.
- item_id: The id of the product.
- dept_id: The id of the department the product belongs to.
- cat_id: The id of the category the product belongs to.
- store_id: The id of the store where the product is sold.
- state_id: The State where the store is located.
- d_1, d_2, …, d_i, …, d_1941: The number of units sold on day i, starting from 2011-01-29 (the validation file used here stops at d_1913).
```python
firstDay = 250
lastDay = 1913

# Use x sales days (columns) for training
numCols = [f"d_{day}" for day in range(firstDay, lastDay + 1)]

# Define all categorical columns
catCols = ['id', 'item_id', 'dept_id', 'store_id', 'cat_id', 'state_id']

# Define the correct data types for "sales_train_validation.csv"
dtype = {numCol: "float32" for numCol in numCols}
dtype.update({catCol: "category" for catCol in catCols if catCol != "id"})

[(k, v) for k, v in dtype.items()][:10]
```
```python
# Read csv file
ds = pd.read_csv("./sales_train_validation.csv",
                 usecols = catCols + numCols,
                 dtype = dtype)
ds.head()
```
```python
# Transform categorical features into integers
for col in catCols:
    if col != "id":
        ds[col] = ds[col].cat.codes.astype("int16")
        ds[col] -= ds[col].min()

ds = pd.melt(ds,
             id_vars = catCols,
             value_vars = [col for col in ds.columns if col.startswith("d_")],
             var_name = "d",
             value_name = "sales")

# Merge "ds" with "calendar" and "prices" dataframe
ds = ds.merge(calendar, on = "d", copy = False)
ds = ds.merge(prices, on = ["store_id", "item_id", "wm_yr_wk"], copy = False)
ds.head()
```
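To make the reshape explicit, here is a toy illustration of what `pd.melt` does, on a hypothetical two-item frame. Note also that the merge with `prices` uses pandas' default inner join, so days for which a product has no weekly price (i.e., was not on sale) are dropped from `ds`:

```python
# Toy wide-to-long melt (hypothetical mini frame, same pattern as above)
wide = pd.DataFrame({"id": ["A", "B"], "d_1": [3, 0], "d_2": [1, 5]})
long = pd.melt(wide, id_vars=["id"], value_vars=["d_1", "d_2"],
               var_name="d", value_name="sales")
print(long)
#   id    d  sales
# 0  A  d_1      3
# 1  B  d_1      0
# 2  A  d_2      1
# 3  B  d_2      5
```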
The join logic among the three tables is: the melted sales table joins calendar on the day key d, and the result joins prices on store_id, item_id and wm_yr_wk.
Feature engineering:
Feature engineering on sales
1. Construct lag features (an observation window)
```python
dayLags = [7, 28]
lagSalesCols = [f"lag_{dayLag}" for dayLag in dayLags]
for dayLag, lagSalesCol in zip(dayLags, lagSalesCols):
    ds[lagSalesCol] = ds[["id", "sales"]].groupby("id")["sales"].shift(dayLag)
```
This relies on shift: see my earlier blog post on reproducing Hive's lag/lead and first_value/last_value functions with pandas. Note that shift is the equivalent of lag.
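A tiny sketch of that equivalence on a hypothetical single-id series:

```python
# groupby(...)["sales"].shift(n) within each id behaves like Hive's
# LAG(sales, n) OVER (PARTITION BY id ORDER BY day)
toy = pd.DataFrame({"id": ["A"] * 5, "sales": [1, 2, 3, 4, 5]})
toy["lag_2"] = toy.groupby("id")["sales"].shift(2)
print(toy)
#   id  sales  lag_2
# 0  A      1    NaN
# 1  A      2    NaN
# 2  A      3    1.0
# 3  A      4    2.0
# 4  A      5    3.0
```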
```python
windows = [7, 28]
for window in windows:
    for dayLag, lagSalesCol in zip(dayLags, lagSalesCols):
        ds[f"rmean_{dayLag}_{window}"] = ds[["id", lagSalesCol]].groupby("id")[lagSalesCol] \
                                           .transform(lambda x: x.rolling(window).mean())

ds.head()
```
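To read these names: `rmean_{lag}_{window}` is the window-day rolling mean of the `lag_{lag}` column, so for example `rmean_7_7` at day t averages the sales from 13 down to 7 days before t. A toy sketch with a hypothetical smaller window for readability:

```python
# rmean_7_3 here is the 3-day rolling mean of lag_7, i.e. the average of sales
# from 9 to 7 days before the current row, computed per id
toy = pd.DataFrame({"id": ["A"] * 12, "sales": range(12)})
toy["lag_7"] = toy.groupby("id")["sales"].shift(7)
toy["rmean_7_3"] = toy.groupby("id")["lag_7"].transform(lambda x: x.rolling(3).mean())
print(toy.tail())
```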
Questions:
1. Why compute rolling means of the lagged values rather than rolling means of the actual values?
The reason for using lagged values of the target is to reduce the impact of self-propagating errors when the same model is used to predict recursively. The goal is to forecast each series 28 days ahead, so to predict the first day of the horizon you can use the entire history of sales (up to lag 1).
But to predict day 8 you only have actual data up to lag 8, and to predict the last day of the horizon you only have actual data up to lag 28. What people did at the start of the competition was simply to use features built from lag 28 onwards and apply a regressor (e.g. LightGBM).
That is the safest option, because it never requires "predictions about predictions". At the same time, it limits the model's ability to learn from features close to the value being predicted.
In other words, it performs worse when predicting the first days of the horizon, where values more recent than lag 28 could have been used. What this notebook does is strike a balance between "predicting on predictions" and using the most recent available information.
Using features based on lags with some seasonal meaning (lag 7) seems to give good results, and since only two features (lag_7 and rmean_7_7) are subject to self-propagated error, the overfitting problem stays under control.
Feature engineering on dates
dateFeatures = {"wday": "weekday", "week": "weekofyear", "month": "month", "quarter": "quarter", "year": "year", "mday": "day"} for featName, featFunc in dateFeatures.items(): if featName in ds.columns: ds[featName] = ds[featName].astype("int16") else: ds[featName] = getattr(ds["date"].dt, featFunc).astype("int16")

The dataframe at this point (output of ds.info()):

```
<class 'pandas.core.frame.DataFrame'>
Int64Index: 42372682 entries, 0 to 42372681
Data columns (total 31 columns):
id               object
item_id          int16
dept_id          int16
store_id         int16
cat_id           int16
state_id         int16
d                object
sales            float32
date             datetime64[ns]
wm_yr_wk         int16
weekday          int16
wday             int16
month            int16
year             int16
event_name_1     int16
event_type_1     int16
event_name_2     int16
event_type_2     int16
snap_CA          float32
snap_TX          float32
snap_WI          float32
sell_price       float32
lag_7            float32
lag_28           float32
rmean_7_7        float32
rmean_28_7       float32
rmean_7_28       float32
rmean_28_28      float32
week             int16
quarter          int16
mday             int16
dtypes: datetime64[ns](1), float32(11), int16(17), object(2)
memory usage: 4.3+ GB
```
Remove unused columns (features)
```python
# Remove all rows with NaN value
ds.dropna(inplace = True)

# Define columns that need to be removed
unusedCols = ["id", "date", "sales", "d", "wm_yr_wk", "weekday"]
trainCols = ds.columns[~ds.columns.isin(unusedCols)]

X_train = ds[trainCols]
y_train = ds["sales"]
y_train.head()
```
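A quick check (sketch) that the NaNs introduced by the lag and rolling-mean features are gone, and how much history was sacrificed:

```python
# After dropna: no missing values should remain; the earliest usable date moves
# forward by roughly max(lag) + max(window) days relative to firstDay
print(X_train.isna().sum().sum())        # expected: 0
print(ds["date"].min(), ds["date"].max())
```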
Split into training and validation sets
```python
import numpy as np
import lightgbm as lgb

np.random.seed(777)

# Define categorical features
catFeats = ['item_id', 'dept_id', 'store_id', 'cat_id', 'state_id'] + \
           ["event_name_1", "event_name_2", "event_type_1", "event_type_2"]

# Hold out 2,000,000 randomly chosen rows for validation
validInds = np.random.choice(X_train.index.values, 2_000_000, replace = False)
trainInds = np.setdiff1d(X_train.index.values, validInds)

trainData = lgb.Dataset(X_train.loc[trainInds], label = y_train.loc[trainInds],
                        categorical_feature = catFeats, free_raw_data = False)
validData = lgb.Dataset(X_train.loc[validInds], label = y_train.loc[validInds],
                        categorical_feature = catFeats, free_raw_data = False)
```
Release memory (GC):
```python
import gc

del ds, X_train, y_train, validInds, trainInds
gc.collect()
```
Train the model
These are the parameters provided directly by the author of the baseline:
params = { "objective" : "poisson", "metric" :"rmse", "force_row_wise" : True, "learning_rate" : 0.075, "sub_row" : 0.75, "bagging_freq" : 1, "lambda_l2" : 0.1, "metric": ["rmse"], 'verbosity': 1, 'num_iterations' : 1200, 'num_leaves': 128, "min_data_in_leaf": 100, }
Training:
```python
# Train LightGBM model
m_lgb = lgb.train(params, trainData, valid_sets = [validData], verbose_eval = 20)
```
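Note: the `verbose_eval` argument was removed from `lgb.train` in recent LightGBM releases (4.x); if the call above raises a TypeError on a newer version, the equivalent uses a logging callback:

```python
# Equivalent training call for newer LightGBM releases (sketch)
m_lgb = lgb.train(params, trainData, valid_sets=[validData],
                  callbacks=[lgb.log_evaluation(period=20)])
```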
Save the model:
```python
# Save the model
m_lgb.save_model("model.lgb")
```
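If the prediction step runs in a separate session, the saved model can be reloaded from the file:

```python
# Reload the trained model from disk
import lightgbm as lgb
m_lgb = lgb.Booster(model_file="model.lgb")
```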
Prediction:
The test horizon covers days after d_1913. To build the lag and rolling-mean features for any forecast day we need up to 28 days of lag plus a 28-day rolling window on that lag, which is why maxLags below is set to 57 days of history (28 + 28, with a little slack).
```python
# Last day used for training
trLast = 1913
# Maximum lag day
maxLags = 57

# Create dataset for predictions
def create_ds():
    startDay = trLast - maxLags

    numCols = [f"d_{day}" for day in range(startDay, trLast + 1)]
    catCols = ['id', 'item_id', 'dept_id', 'store_id', 'cat_id', 'state_id']

    dtype = {numCol: "float32" for numCol in numCols}
    dtype.update({catCol: "category" for catCol in catCols if catCol != "id"})

    ds = pd.read_csv("./sales_train_validation.csv",
                     usecols = catCols + numCols,
                     dtype = dtype)

    for col in catCols:
        if col != "id":
            ds[col] = ds[col].cat.codes.astype("int16")
            ds[col] -= ds[col].min()

    # Add empty columns for the 28 days to be forecast
    for day in range(trLast + 1, trLast + 28 + 1):
        ds[f"d_{day}"] = np.nan

    ds = pd.melt(ds,
                 id_vars = catCols,
                 value_vars = [col for col in ds.columns if col.startswith("d_")],
                 var_name = "d",
                 value_name = "sales")

    ds = ds.merge(calendar, on = "d", copy = False)
    ds = ds.merge(prices, on = ["store_id", "item_id", "wm_yr_wk"], copy = False)

    return ds

# Same feature engineering as used for training
def create_features(ds):
    dayLags = [7, 28]
    lagSalesCols = [f"lag_{dayLag}" for dayLag in dayLags]
    for dayLag, lagSalesCol in zip(dayLags, lagSalesCols):
        ds[lagSalesCol] = ds[["id", "sales"]].groupby("id")["sales"].shift(dayLag)

    windows = [7, 28]
    for window in windows:
        for dayLag, lagSalesCol in zip(dayLags, lagSalesCols):
            ds[f"rmean_{dayLag}_{window}"] = ds[["id", lagSalesCol]].groupby("id")[lagSalesCol] \
                                               .transform(lambda x: x.rolling(window).mean())

    dateFeatures = {"wday": "weekday",
                    "week": "weekofyear",
                    "month": "month",
                    "quarter": "quarter",
                    "year": "year",
                    "mday": "day"}

    for featName, featFunc in dateFeatures.items():
        if featName in ds.columns:
            ds[featName] = ds[featName].astype("int16")
        else:
            ds[featName] = getattr(ds["date"].dt, featFunc).astype("int16")
```
Finally, the recursive prediction loop: for each of three multipliers (alphas), the 28 days of the horizon are predicted one at a time; each day's prediction is written back into the sales column so that later days' lag features can be built from it, and the three scaled submissions are averaged with equal weights.
```python
from datetime import datetime, timedelta

fday = datetime(2016, 4, 25)

alphas = [1.028, 1.023, 1.018]
weights = [1 / len(alphas)] * len(alphas)
sub = 0.

for icount, (alpha, weight) in enumerate(zip(alphas, weights)):
    te = create_ds()
    cols = [f"F{i}" for i in range(1, 29)]

    # Predict the 28 days one at a time, feeding predictions back in as lag inputs
    for tdelta in range(0, 28):
        day = fday + timedelta(days = tdelta)
        print(tdelta, day)
        tst = te[(te['date'] >= day - timedelta(days = maxLags)) & (te['date'] <= day)].copy()
        create_features(tst)
        tst = tst.loc[tst['date'] == day, trainCols]
        te.loc[te['date'] == day, "sales"] = alpha * m_lgb.predict(tst)  # magic multiplier by kyakovlev

    # Reshape the predicted horizon into the F1..F28 submission format
    te_sub = te.loc[te['date'] >= fday, ["id", "sales"]].copy()
    te_sub["F"] = [f"F{rank}" for rank in te_sub.groupby("id")["id"].cumcount() + 1]
    te_sub = te_sub.set_index(["id", "F"]).unstack()["sales"][cols].reset_index()
    te_sub.fillna(0., inplace = True)
    te_sub.sort_values("id", inplace = True)
    te_sub.reset_index(drop = True, inplace = True)
    te_sub.to_csv(f"submission_{icount}.csv", index = False)

    # Accumulate the weighted average of the three runs
    if icount == 0:
        sub = te_sub
        sub[cols] *= weight
    else:
        sub[cols] += te_sub[cols] * weight

    print(icount, alpha, weight)

# Duplicate the validation rows as evaluation rows for the full submission file
# (on newer pandas, pass regex=True so "$" anchors the end of the id)
sub2 = sub.copy()
sub2["id"] = sub2["id"].str.replace("validation$", "evaluation")
sub = pd.concat([sub, sub2], axis = 0, sort = False)
sub.to_csv("submission.csv", index = False)
```
Output:

The loop prints the 28 forecast days (2016-04-25 through 2016-05-22) once per alpha, followed by the `icount alpha weight` summary line for each run:

```
0 2016-04-25 00:00:00
1 2016-04-26 00:00:00
...
27 2016-05-22 00:00:00
0 1.028 0.3333333333333333
1 1.023 0.3333333333333333
2 1.018 0.3333333333333333
```
To be continued... I will write up improvement ideas tomorrow.