Kaggle Competition: House Prices Prediction (House Prices)


Full code: see the kaggle kernel and Github.


Competition page: https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The gist of this competition: you are given 79 features and must predict the sale price (SalePrice). The features are a mix of categorical and numerical variables, and there are plenty of missing values. Fortunately the organizers provide data_description.txt, which explains the meaning of each feature; once you understand it, most of the missing values can be imputed sensibly.

The first thing to do in any competition is to understand its evaluation metric; get that wrong at the start and all the later work may be wasted. House Prices is evaluated with root mean squared error (RMSE), a standard metric for regression problems (here it is computed on the logarithm of SalePrice, which is why the target is log-transformed later on):

\[\sqrt{\frac{\sum_{i=1}^{N}(y_i-\hat{y_i})^2}{N}} \]


My current score is 0.11421.


The two things that contributed most to my score:

  • Feature engineering: mainly ordinal encoding of categorical variables, feature combinations, and PCA
  • Model ensembling: mainly weighted averaging and stacking

Each is described in detail below.


Contents:

  1. Exploratory Visualization
  2. Data Cleaning
  3. Feature Engineering
  4. Basic Modeling & Evaluation
  5. Hyperparameters Tuning
  6. Ensemble Methods



Exploratory Visualization

Since there are many raw features, only the year built (YearBuilt) is visualized here:

import matplotlib.pyplot as plt
import seaborn as sns

# Boxplot of SalePrice against the year the house was built
plt.figure(figsize=(15,8))
sns.boxplot(x=train.YearBuilt, y=train.SalePrice)

Intuitively, newer houses are more expensive and older ones cheaper, and the plot roughly confirms that trend. Since YearBuilt takes many distinct values (from 1872 to 2010), one-hot encoding it directly would make the data very sparse, so in the feature-engineering step it is label-encoded (LabelEncoder) instead.
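
A minimal sketch of this kind of label encoding (the new column name and the use of train here are illustrative; in the full code the encoding is applied as part of the feature-engineering pipeline):

from sklearn.preprocessing import LabelEncoder

# Sketch: map each distinct construction year to an integer code
# instead of creating ~140 sparse one-hot columns
lab = LabelEncoder()
train["YearBuilt_encoded"] = lab.fit_transform(train["YearBuilt"])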






Data Cleaning

The main job here is handling missing values. First, the number of missing values per feature:

# Count missing values per column; full is the combined train + test dataframe
aa = full.isnull().sum()
aa[aa>0].sort_values(ascending=False)
PoolQC          2908
MiscFeature     2812
Alley           2719
Fence           2346
SalePrice       1459
FireplaceQu     1420
LotFrontage      486
GarageQual       159
GarageCond       159
GarageFinish     159
GarageYrBlt      159
GarageType       157
BsmtExposure      82
BsmtCond          82
BsmtQual          81
BsmtFinType2      80
BsmtFinType1      79
MasVnrType        24
MasVnrArea        23
MSZoning           4
BsmtFullBath       2
BsmtHalfBath       2
Utilities          2
Functional         2
Electrical         1
BsmtUnfSF          1
Exterior1st        1
Exterior2nd        1
TotalBsmtSF        1
GarageCars         1
BsmtFinSF2         1
BsmtFinSF1         1
KitchenQual        1
SaleType           1
GarageArea         1

If you read data_description carefully, many missing values turn out to be meaningful. For example, the first entry above, PoolQC, is the pool quality; a missing value means the house simply has no pool, so it can be filled with "None".

All of the following features can be filled with "None":

cols1 = ["PoolQC" , "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual", "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure", "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    full[col].fillna("None", inplace=True)

The following features mostly represent areas; for example, TotalBsmtSF is the basement area. If a house has no basement, the missing value is filled with 0.

cols=["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2", "BsmtFinSF1", "GarageArea"]
for col in cols:
    full[col].fillna(0, inplace=True)
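
The remaining categorical columns with only one to four missing entries (MSZoning, Utilities, Functional, Electrical, KitchenQual, SaleType, Exterior1st, Exterior2nd) can simply be imputed with their most frequent value; a minimal sketch (the column list is an assumption, see the full code for how each column is actually handled):

cols2 = ["MSZoning", "Utilities", "Functional", "Electrical",
         "KitchenQual", "SaleType", "Exterior1st", "Exterior2nd"]
for col in cols2:
    # Fill with the most frequent value (mode) of the column
    full[col].fillna(full[col].mode()[0], inplace=True)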

LotFrontage is strongly related to LotAreaCut and Neighborhood, so its missing values are imputed with the median within each (LotAreaCut, Neighborhood) group.
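
LotAreaCut is not a raw column: it is a discretized version of LotArea, so that each (LotAreaCut, Neighborhood) group has enough rows to give a meaningful median. A minimal sketch of how it could be built (10 quantile bins is an assumption):

import pandas as pd

# Bin LotArea into 10 quantile-based groups
full["LotAreaCut"] = pd.qcut(full.LotArea, 10)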

full['LotFrontage']=full.groupby(['LotAreaCut','Neighborhood'])['LotFrontage'].transform(lambda x: x.fillna(x.median()))






Feature Engineering

Ordinal encoding of categorical variables

For categorical features the usual approach is pandas' get_dummies, but in this competition that alone may not be enough. The method I use below is: group the data by a feature, compute the mean and median of SalePrice for each level of that feature, and then assign each level an integer rank based on those statistics. An example:

MSSubClass identifies the type of dwelling; group the data by it:

full.groupby(['MSSubClass'])[['SalePrice']].agg(['mean','median','count'])

Rank the levels according to the table:

          '180': 1,
          '30': 2,  '45': 2,
          '190': 3, '50': 3, '90': 3,
          '85': 4,  '40': 4, '160': 4,
          '70': 5,  '20': 5, '75': 5, '80': 5, '150': 5,
          '120': 6, '60': 6

I ranked 20-odd features this way; see the full code for the rest.
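
A minimal sketch of applying such a mapping (the column name oMSSubClass is illustrative; the "o" prefix matches encoded columns such as oMSZoning and oNeighborhood used later):

# Assumes MSSubClass has already been converted to string,
# e.g. full["MSSubClass"] = full["MSSubClass"].astype(str)
full["oMSSubClass"] = full.MSSubClass.map({'180': 1,
                                           '30': 2, '45': 2,
                                           '190': 3, '50': 3, '90': 3,
                                           '85': 4, '40': 4, '160': 4,
                                           '70': 5, '20': 5, '75': 5, '80': 5, '150': 5,
                                           '120': 6, '60': 6})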


Feature combinations

Combining raw features often produces surprisingly effective new ones. With this many raw features, though, it is impossible to try every combination, so I first use Lasso to pick out the more important features and only combine those.

from sklearn.linear_model import Lasso

# Fit Lasso on the scaled features and the log-transformed target,
# then plot the non-zero coefficients as a rough feature-importance ranking
lasso = Lasso(alpha=0.001)
lasso.fit(X_scaled, y_log)
FI_lasso = pd.DataFrame({"Feature Importance": lasso.coef_}, index=data_pipe.columns)

FI_lasso[FI_lasso["Feature Importance"] != 0].sort_values("Feature Importance").plot(kind="barh", figsize=(15,25))
plt.xticks(rotation=90)
plt.show()


These are the features I ended up adding (the list also reflects a lot of other experiments):

from sklearn.base import BaseEstimator, TransformerMixin

class add_feature(BaseEstimator, TransformerMixin):
    def __init__(self, additional=1):
        self.additional = additional

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # additional=1 adds only the two basic combinations;
        # any other value adds the full set of combined features
        if self.additional == 1:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]

        else:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]

            X["+_TotalHouse_OverallQual"] = X["TotalHouse"] * X["OverallQual"]
            X["+_GrLivArea_OverallQual"] = X["GrLivArea"] * X["OverallQual"]
            X["+_oMSZoning_TotalHouse"] = X["oMSZoning"] * X["TotalHouse"]
            X["+_oMSZoning_OverallQual"] = X["oMSZoning"] + X["OverallQual"]
            X["+_oMSZoning_YearBuilt"] = X["oMSZoning"] + X["YearBuilt"]
            X["+_oNeighborhood_TotalHouse"] = X["oNeighborhood"] * X["TotalHouse"]
            X["+_oNeighborhood_OverallQual"] = X["oNeighborhood"] + X["OverallQual"]
            X["+_oNeighborhood_YearBuilt"] = X["oNeighborhood"] + X["YearBuilt"]
            X["+_BsmtFinSF1_OverallQual"] = X["BsmtFinSF1"] * X["OverallQual"]

            X["-_oFunctional_TotalHouse"] = X["oFunctional"] * X["TotalHouse"]
            X["-_oFunctional_OverallQual"] = X["oFunctional"] + X["OverallQual"]
            X["-_LotArea_OverallQual"] = X["LotArea"] * X["OverallQual"]
            X["-_TotalHouse_LotArea"] = X["TotalHouse"] + X["LotArea"]
            X["-_oCondition1_TotalHouse"] = X["oCondition1"] * X["TotalHouse"]
            X["-_oCondition1_OverallQual"] = X["oCondition1"] + X["OverallQual"]

            X["Bsmt"] = X["BsmtFinSF1"] + X["BsmtFinSF2"] + X["BsmtUnfSF"]
            X["Rooms"] = X["FullBath"] + X["TotRmsAbvGrd"]
            X["PorchArea"] = X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
            X["TotalPlace"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"] + X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]

        # Return X from both branches (the original only returned it in the else branch)
        return X
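
A hedged sketch of how this transformer could be used inside a sklearn Pipeline (the other steps are illustrative; the input frame is assumed to already contain the numeric oMSZoning/oNeighborhood/etc. columns):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# additional=2 (anything other than 1) triggers the else branch with the full set of combinations
pipe = Pipeline([
    ("add_feature", add_feature(additional=2)),
    ("scaler", RobustScaler()),
])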

PCA

PCA is a crucial step and contributes a lot to the final score. The features I added are highly correlated with the original ones, which can introduce strong multicollinearity, and PCA removes exactly that correlation. Since the goal here is not dimensionality reduction, n_components is kept close to the original number of dimensions; the exact setting (how many features to add up front, how many dimensions to keep afterwards) came out of repeated experiments.

from sklearn.decomposition import PCA

# Keep almost all components: the goal is decorrelation, not dimensionality reduction
pca = PCA(n_components=410)

X_scaled = pca.fit_transform(X_scaled)
test_X_scaled = pca.transform(test_X_scaled)
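
Since the point is decorrelation rather than compression, it is worth checking that essentially no variance is thrown away:

# With n_components close to the original dimensionality, this should be very close to 1.0
print(pca.explained_variance_ratio_.sum())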






Basic Modeling & Evaluation

First, define a cross-validated RMSE helper:

import numpy as np
from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y):
    # 5-fold CV: negate sklearn's neg_mean_squared_error and take the square root to get RMSE
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse

Thirteen algorithms are evaluated as baselines with 5-fold cross-validation:

  • LinearRegression

  • Ridge

  • Lasso

  • Random Forest

  • Gradient Boosting Tree

  • Support Vector Regression

  • Linear Support Vector Regression

  • ElasticNet

  • Stochastic Gradient Descent

  • BayesianRidge

  • KernelRidge

  • ExtraTreesRegressor

  • XgBoost
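
The models variable used in the loop below is simply the list of these 13 estimators in the same order as names; a hedged sketch (the parameters shown are placeholders, not the ones from the full code):

from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor, BayesianRidge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, ExtraTreesRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.kernel_ridge import KernelRidge
from xgboost import XGBRegressor

# One estimator per entry in names, in the same order
models = [LinearRegression(), Ridge(), Lasso(alpha=0.01, max_iter=10000),
          RandomForestRegressor(), GradientBoostingRegressor(), SVR(),
          LinearSVR(), ElasticNet(alpha=0.001, max_iter=10000),
          SGDRegressor(max_iter=1000, tol=1e-3), BayesianRidge(),
          KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5),
          ExtraTreesRegressor(), XGBRegressor()]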

names = ["LR", "Ridge", "Lasso", "RF", "GBR", "SVR", "LinSVR", "Ela","SGD","Bay","Ker","Extra","Xgb"]
for name, model in zip(names, models):
    score = rmse_cv(model, X_scaled, y_log)
    print("{}: {:.6f}, {:.4f}".format(name,score.mean(),score.std()))

The results are below. Overall, the tree-based models do worse than the linear models, probably because of the sparsity introduced by get_dummies; note that none of these models have been tuned yet. (Plain LinearRegression blows up completely, which is to be expected with this many nearly collinear dummy columns.)
LR: 1026870159.526766, 488528070.4534
Ridge: 0.117596, 0.0091
Lasso: 0.121474, 0.0060
RF: 0.140764, 0.0052
GBR: 0.124154, 0.0072
SVR: 0.112727, 0.0047
LinSVR: 0.121564, 0.0081
Ela: 0.111113, 0.0059
SGD: 0.159686, 0.0092
Bay: 0.110577, 0.0060
Ker: 0.109276, 0.0055
Extra: 0.136668, 0.0073
Xgb: 0.126614, 0.0070

Next, build a small helper class for hyperparameter tuning. Always keep in mind that the evaluation metric is RMSE, so the printed scores should be RMSE as well.

from sklearn.model_selection import GridSearchCV

class grid():
    def __init__(self, model):
        self.model = model

    def grid_get(self, X, y, param_grid):
        # Grid search with 5-fold CV; convert the negative-MSE scores back to RMSE before printing
        grid_search = GridSearchCV(self.model, param_grid, cv=5, scoring="neg_mean_squared_error")
        grid_search.fit(X, y)
        print(grid_search.best_params_, np.sqrt(-grid_search.best_score_))
        grid_search.cv_results_['mean_test_score'] = np.sqrt(-grid_search.cv_results_['mean_test_score'])
        print(pd.DataFrame(grid_search.cv_results_)[['params', 'mean_test_score', 'std_test_score']])

As an example, tuning Lasso:

grid(Lasso()).grid_get(X_scaled,y_log,{'alpha': [0.0002,0.0003,0.0004,0.0005,0.0006,0.0007,0.0008,0.0009],'max_iter':[10000]})
{'max_iter': 10000, 'alpha': 0.0005} 0.111296607965
                                 params  mean_test_score  std_test_score
0  {'max_iter': 10000, 'alpha': 0.0003}         0.111869        0.001513
1  {'max_iter': 10000, 'alpha': 0.0002}         0.112745        0.001753
2  {'max_iter': 10000, 'alpha': 0.0004}         0.111463        0.001392
3  {'max_iter': 10000, 'alpha': 0.0005}         0.111297        0.001339
4  {'max_iter': 10000, 'alpha': 0.0007}         0.111538        0.001284
5  {'max_iter': 10000, 'alpha': 0.0006}         0.111359        0.001315
6  {'max_iter': 10000, 'alpha': 0.0009}         0.111915        0.001206
7  {'max_iter': 10000, 'alpha': 0.0008}         0.111706        0.001229

After many long rounds of testing, I finally settled on these six models:

lasso = Lasso(alpha=0.0005,max_iter=10000)
ridge = Ridge(alpha=60)
svr = SVR(gamma= 0.0004,kernel='rbf',C=13,epsilon=0.009)
ker = KernelRidge(alpha=0.2 ,kernel='polynomial',degree=3 , coef0=0.8)
ela = ElasticNet(alpha=0.005,l1_ratio=0.08,max_iter=10000)
bay = BayesianRidge()
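
With the tuned hyperparameters fixed, the six models can be re-scored with the rmse_cv helper defined earlier, for example:

# Re-evaluate the tuned models with 5-fold CV RMSE
for name, model in zip(["Lasso", "Ridge", "SVR", "Ker", "Ela", "Bay"],
                       [lasso, ridge, svr, ker, ela, bay]):
    score = rmse_cv(model, X_scaled, y_log)
    print("{}: {:.6f}, {:.4f}".format(name, score.mean(), score.std()))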






Ensemble Methods

Weighted averaging

Average the individual models' predictions according to a set of weights:

from sklearn.base import BaseEstimator, RegressorMixin, clone

class AverageWeight(BaseEstimator, RegressorMixin):
    def __init__(self, mod, weight):
        self.mod = mod
        self.weight = weight

    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.mod]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        w = list()
        pred = np.array([model.predict(X) for model in self.models_])
        # For every data point: multiply each model's prediction by its weight, then sum
        for data in range(pred.shape[1]):
            single = [pred[model, data] * weight for model, weight in zip(range(pred.shape[0]), self.weight)]
            w.append(np.sum(single))
        return w

# w1..w6 are the weights chosen for the six models (typically chosen to sum to 1)
weight_avg = AverageWeight(mod=[lasso, ridge, svr, ker, ela, bay], weight=[w1, w2, w3, w4, w5, w6])
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score.mean())           # 0.10768459878025885

The score is 0.10768, better than any single model.

However, a weighted average of just SVR and Kernel Ridge does even better; apparently the other models were dragging the result down.

weight_avg = AverageWeight(mod = [svr,ker],weight=[0.5,0.5])
score = rmse_cv(weight_avg,X_scaled,y_log)
print(score.mean())           # 0.10668349587195189

Stacking

Stacking works in layers. In a two-layer stack with, say, 5 base models in the first layer and a single meta-model in the second, the first-layer models are used to build an \(\mathbb{R}^{n×m}\) matrix of out-of-fold predictions that serves as the training input of the second-layer model, where n is the number of training rows and m is the number of first-layer models.

from sklearn.model_selection import KFold
from sklearn.base import BaseEstimator, RegressorMixin, TransformerMixin, clone

class stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, mod, meta_model):
        self.mod = mod
        self.meta_model = meta_model
        self.kf = KFold(n_splits=5, random_state=42, shuffle=True)

    def fit(self, X, y):
        self.saved_model = [list() for i in self.mod]
        oof_train = np.zeros((X.shape[0], len(self.mod)))

        # For each base model, collect out-of-fold predictions as the meta-features
        for i, model in enumerate(self.mod):
            for train_index, val_index in self.kf.split(X, y):
                renew_model = clone(model)
                renew_model.fit(X[train_index], y[train_index])
                self.saved_model[i].append(renew_model)
                oof_train[val_index, i] = renew_model.predict(X[val_index])

        # Train the second-layer model on the out-of-fold predictions
        self.meta_model.fit(oof_train, y)
        return self

    def predict(self, X):
        # Average each base model's per-fold predictions, then feed them to the meta-model
        whole_test = np.column_stack([np.column_stack([model.predict(X) for model in single_model]).mean(axis=1)
                                      for single_model in self.saved_model])
        return self.meta_model.predict(whole_test)

    def get_oof(self, X, y, test_X):
        # Return the out-of-fold training matrix and the fold-averaged test matrix
        oof = np.zeros((X.shape[0], len(self.mod)))
        test_single = np.zeros((test_X.shape[0], 5))
        test_mean = np.zeros((test_X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):
            for j, (train_index, val_index) in enumerate(self.kf.split(X, y)):
                clone_model = clone(model)
                clone_model.fit(X[train_index], y[train_index])
                oof[val_index, i] = clone_model.predict(X[val_index])
                test_single[:, j] = clone_model.predict(test_X)
            test_mean[:, i] = test_single.mean(axis=1)
        return oof, test_mean

At first I used get_oof to extract the first-layer feature matrix, concatenated it with the original features, and the CV score dropped to 0.1018. On the leaderboard, however, the score got worse, so this approach apparently overfits.

# a, b are the training feature matrix and the log-transformed target as numpy arrays (see the full code);
# stack_model is the stacking ensemble defined below
X_train_stack, X_test_stack = stack_model.get_oof(a, b, test_X_scaled)
X_train_add = np.hstack((a, X_train_stack))
X_test_add = np.hstack((test_X_scaled, X_test_stack))
print(rmse_cv(stack_model, X_train_add, b).mean())    # 0.101824682747

For the final submission I used Lasso, Ridge, SVR, Kernel Ridge, ElasticNet and BayesianRidge as the first-layer models, with Kernel Ridge as the second-layer meta-model.

stack_model = stacking(mod=[lasso, ridge, svr, ker, ela, bay], meta_model=ker)
stack_model.fit(a, b)
# The target was log-transformed, so exponentiate to get back to the original SalePrice scale
pred = np.exp(stack_model.predict(test_X_scaled))

result = pd.DataFrame({'Id': test.Id, 'SalePrice': pred})
result.to_csv("submission.csv", index=False)




