PyTorch 4.9 Hands-On Kaggle Competition: Predicting House Prices



We will use the dataset collected by Bart de Cock in 2011 \([DeCock, 2011]\), which covers house prices in Ames, Iowa over the period 2006-2010, to predict sale prices.

Step 1. Download the dataset

There are two ways to get the dataset:

  1. Register a Kaggle account and download it directly from the Kaggle website: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data (note: accessing Kaggle may require a VPN)
  2. Follow chapter 4, section 10 of Mu Li's 《Dive into Deep Learning》 and download it with Python code: http://zh.d2l.ai/chapter_multilayer-perceptrons/kaggle-house-price.html#id2 (note: run the download in a Jupyter notebook; a minimal sketch is given below)
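As a minimal sketch of option 2, the two CSV files can be fetched from the d2l data mirror with plain requests; the mirror URL and file names below follow the book but should be treated as assumptions (fall back to the Kaggle download if they change):

import os
import requests

DATA_URL = 'http://d2l-data.s3-accelerate.amazonaws.com/'  # d2l's public data mirror
os.makedirs('../data', exist_ok=True)
for fname in ('kaggle_house_pred_train.csv', 'kaggle_house_pred_test.csv'):
    path = os.path.join('../data', fname)
    if not os.path.exists(path):  # skip files that were already downloaded
        r = requests.get(DATA_URL + fname)
        r.raise_for_status()
        with open(path, 'wb') as f:
            f.write(r.content)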

Step 2. Load the dataset (the relative paths here depend on where you saved the files!)

import numpy as np 
import pandas as pd  
import torch 
from torch import nn 
from d2l import torch as d2l   

train_data = pd.read_csv("../data/kaggle_house_pred_train.csv") 
test_data = pd.read_csv("../data/kaggle_house_pred_test.csv")  
train_data.shape, test_data.shape # dataset sizes: ((1460, 81), (1459, 80))
# Peek at the dataset so we know what preprocessing is needed
train_data
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1455 1456 60 RL 62.0 7917 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal
1456 1457 20 RL 85.0 13175 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 2 2010 WD Normal
1457 1458 70 RL 66.0 9042 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv Shed 2500 5 2010 WD Normal
1458 1459 20 RL 68.0 9717 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2010 WD Normal
1459 1460 20 RL 75.0 9937 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal
(1460 rows × 81 columns; the SalePrice values at the right edge were cut off when this output was captured)

Step 3. Preprocess the data: drop unhelpful columns, standardize the numeric features, and fill missing values with 0

# House prices span a wide range, so we standardize the numeric features, just as one standardizes a normal distribution
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:])) 

Here concat merges the two frames. The first column of both the training and test data is dropped because it is just the Id index, which carries no information useful for training; the last column of the training data (SalePrice, the label) is dropped so that its columns line up with the test data. Training: (1460, 81), test: (1459, 80) --> combined: (2919, 79). In short, the first and last columns are removed only to align the two frames on the same 79 features.

# If test data were unavailable, the mean and standard deviation could be computed from the training data alone
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index # the column names of the numeric features, i.e. 'MSSubClass', 'LotFrontage', ...
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))

# After standardization every feature has zero mean, so we can set missing values to 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)
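Concretely, each numeric feature \(x\) is rescaled to zero mean and unit standard deviation:

\[x \leftarrow \frac{x - \mu}{\sigma}\]

where \(\mu\) and \(\sigma\) are the feature's mean and standard deviation over the combined data, so filling a missing entry with 0 amounts to imputing it with the feature's mean.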

# 'dummy_na=True' treats 'na' (missing value) as a valid category and creates an indicator feature for it
all_features = pd.get_dummies(all_features, dummy_na=True)
all_features.shape # (2919, 331)
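To see what get_dummies with dummy_na=True produces, here is a toy sketch (the column and its values are made up for illustration):

demo = pd.DataFrame({'MSZoning': ['RL', 'RM', None]})
pd.get_dummies(demo, dummy_na=True)
# one indicator column per category (MSZoning_RL, MSZoning_RM) plus MSZoning_nan;
# entries are 0/1, or True/False on newer pandas versions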

Step 4. Convert the cleaned data into torch tensors

# Extract the NumPy arrays from the pandas frames and convert them to tensors for training.
n_train = train_data.shape[0]
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(
    train_data.SalePrice.values.reshape(-1, 1), dtype=torch.float32) # this is just the last column of train_data
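One caveat, stated as an assumption about your environment: since pandas 2.0, get_dummies returns bool columns, and the resulting mixed-dtype .values array can make torch.tensor raise an error. If that happens, cast the frame to float before the torch.tensor calls above:

# only needed on newer pandas versions where get_dummies yields bool columns
all_features = all_features.astype('float32')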

Step 5. Define the loss function, K-fold cross-validation, the training function, etc.

loss = nn.MSELoss()
in_features = train_features.shape[1] # preprocessing changed the feature count, so recompute the number of input columns (331)
def get_net():
    net = nn.Sequential(nn.Linear(in_features, 1))
    return net
# For house prices we care about the relative error (y - y_hat) / y, not the absolute error y - y_hat
def log_rmse(net, features, labels):
    # To further stabilize the value when taking the logarithm, clip predictions below 1 up to 1
    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    rmse = torch.sqrt(loss(torch.log(clipped_preds),
                           torch.log(labels)))
    return rmse.item()
# Define the training function
def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # Use the Adam optimizer; it behaves much like SGD but is less sensitive to the learning rate
    optimizer = torch.optim.Adam(net.parameters(),
                                 lr = learning_rate,
                                 weight_decay = weight_decay) 
    for epoch in range(num_epochs): 
        for X,y in train_iter: 
            optimizer.zero_grad()
            l=loss(net(X),y) 
            l.backward()  
            optimizer.step()  
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
# Define K-fold cross-validation
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k  # integer division --> the size of each fold
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)  # Python's built-in slice object; slice() takes positional arguments only
        X_part, y_part = X[idx, :], y[idx]  # see the Q&A below for details
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)  # stack along dim 0 (rows)
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid

def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
           batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        if i == 0:
            d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
                     xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
                     legend=['train', 'valid'], yscale='log')
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k
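As a quick sanity check of the fold bookkeeping (toy shapes chosen arbitrarily): with 10 samples and k=5, each validation fold holds 2 samples and training keeps the remaining 8:

X_toy, y_toy = torch.randn(10, 3), torch.randn(10, 1)
Xt, yt, Xv, yv = get_k_fold_data(5, 2, X_toy, y_toy)
print(Xt.shape, Xv.shape)  # torch.Size([8, 3]) torch.Size([2, 3])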

Step 6. Model selection

k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)
print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
      f'avg valid log rmse: {float(valid_l):f}')

fold 1, train log rmse 0.170172, valid log rmse 0.157228
fold 2, train log rmse 0.162814, valid log rmse 0.191097
fold 3, train log rmse 0.163801, valid log rmse 0.168365
fold 4, train log rmse 0.168104, valid log rmse 0.154445
fold 5, train log rmse 0.162938, valid log rmse 0.182995
5-fold validation: avg train log rmse: 0.165566, avg valid log rmse: 0.170826

Q&A

Q1: How is pd.concat used? It concatenates data.

  • objs: a sequence (list) of Series or DataFrame objects
  • axis: the axis to concatenate along, 0 for rows, 1 for columns
  • join: how to handle the other axis, 'inner' or 'outer'
import pandas as pd  
import numpy as np  

# Build some demo data
df1=pd.DataFrame([['A11','A12','A13','A14'],['A21','A22','A23','A24'],['A31','A32','A33','A34'],['A41','A42','A43','A44']],columns=list('ABCD'))
df2=pd.DataFrame([['B11','B12','B13','B14'],['B21','B22','B23','B24'],['B31','B32','B33','B34'],['B41','B42','B43','B44']],columns=list('ABCD'))
df3=pd.DataFrame([['C11','C12','C13','C14'],['C21','C22','C23','C24'],['C31','C32','C33','C34'],['C41','C42','C43','C44']],columns=list('ABCD'))
df4=pd.DataFrame([['D11','D12','D13','D14'],['D21','D22','D23','D24'],['D31','D32','D33','D34']],columns=list('ABCD'))

frames = [df1,df2,df3]

# The default axis=0 stacks the frames vertically; here we pass axis=1 to join them side by side
pd.concat(objs=frames, axis=1)
  A   B   C   D   A   B   C   D   A   B   C   D
0 A11 A12 A13 A14 B11 B12 B13 B14 C11 C12 C13 C14
1 A21 A22 A23 A24 B21 B22 B23 B24 C21 C22 C23 C24
2 A31 A32 A33 A34 B31 B32 B33 B34 C31 C32 C33 C34
3 A41 A42 A43 A44 B41 B42 B43 B44 C41 C42 C43 C44
# When the frames have different numbers of rows (or columns), the joined table is padded with NaN for the missing entries
pd.concat([df1, df4], axis=1)
  A   B   C   D   A   B   C   D
0 A11 A12 A13 A14 D11 D12 D13 D14
1 A21 A22 A23 A24 D21 D22 D23 D24
2 A31 A32 A33 A34 D31 D32 D33 D34
3 A41 A42 A43 A44 NaN NaN NaN NaN

Q2: How does torch.clamp() work? It restricts values to an interval: values below min become min, values above max become max.

It clamps every element of the input tensor to the interval [min, max] and returns the result as a new tensor.
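A quick illustration with toy values:

t = torch.tensor([-2.0, 0.5, 3.0])
torch.clamp(t, 1, float('inf'))  # tensor([1., 1., 3.])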

Q3: What is the slice() function? slice() constructs a slice object, mainly used to pass slicing parameters into functions that index with them.
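For example (a toy list, just to illustrate):

letters = ['a', 'b', 'c', 'd', 'e']
idx = slice(1, 3)   # equivalent to the literal 1:3; slice() takes positional arguments only
letters[idx]        # ['b', 'c']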

Q4: I don't understand the custom log_rmse function.
For house prices we should not use absolute error; we care about relative error, \(\frac{y - \hat{y}}{y}\). Measuring the discrepancy between logarithms captures exactly this: a bound \(|\log y - \log \hat{y}| \leq \delta\) is equivalent to \(e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^\delta\). This leads to the following root-mean-squared error between the logarithm of the predicted price and the logarithm of the true label:

\[\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log y_i -\log \hat{y}_i\right)^2} \]

Our function takes the logarithms first and then computes the RMSE, so the predictions must first be clamped into a safe range (at least 1), which is what the torch.clamp() call is for.
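A tiny worked example (made-up prices): predicting 110,000 for a house worth 100,000 gives a log discrepancy of \(\log 110000 - \log 100000 = \log 1.1 \approx 0.095\), i.e. roughly a 10% relative error, regardless of the price scale:

y = torch.tensor([[100000.0]])
y_hat = torch.tensor([[110000.0]])
torch.sqrt(loss(torch.log(torch.clamp(y_hat, 1, float('inf'))),
                torch.log(y)))  # tensor(0.0953)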

