Hands-On Kaggle Competition: Predicting House Prices
To predict house prices, we will use the dataset collected by Bart de Cock in 2011 [DeCock, 2011], which covers house prices in Ames, Iowa, over the 2006-2010 period.
Step 1. Download the dataset
There are two ways to get the dataset:
- Register a Kaggle account and download the data directly from the Kaggle site: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data (note: accessing Kaggle may require a VPN)
- Follow Section 4.10 of Mu Li's *Dive into Deep Learning* and download it with Python code: http://zh.d2l.ai/chapter_multilayer-perceptrons/kaggle-house-price.html#id2 (run the download in Jupyter); a sketch follows below.
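A minimal sketch of option 2, assuming the d2l package's DATA_HUB / download utilities from Section 4.10 (the URLs and sha-1 hashes below are as listed in the book; verify them there before relying on this):
import pandas as pd
from d2l import torch as d2l
# Register the two CSV files with d2l's download helper (hashes from the book)
d2l.DATA_HUB['kaggle_house_train'] = (
    d2l.DATA_URL + 'kaggle_house_pred_train.csv',
    '585e9cc93e70b39160e7921475f9bcd7d31219ce')
d2l.DATA_HUB['kaggle_house_test'] = (
    d2l.DATA_URL + 'kaggle_house_pred_test.csv',
    'fa19780a7b011d9b009e8bff8e99922a8ee2eb90')
# download() fetches each file into ../data and returns the local path
train_data = pd.read_csv(d2l.download('kaggle_house_train'))
test_data = pd.read_csv(d2l.download('kaggle_house_test'))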
Step 2. Load the dataset (the relative paths here depend on where you saved the files!)
import numpy as np
import pandas as pd
import torch
from torch import nn
from d2l import torch as d2l
train_data = pd.read_csv("../data/kaggle_house_pred_train.csv")
test_data = pd.read_csv("../data/kaggle_house_pred_test.csv")
train_data.shape, test_data.shape  # dataset sizes: ((1460, 81), (1459, 80))
# Take a look at the dataset to see what preprocessing it needs
train_data
 | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 175000 |
1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 2 | 2010 | WD | Normal | 210000 |
1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2010 | WD | Normal | 142125 |
1459 | 1460 | 20 | RL | 75.0 | 9937 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 147500 |

1460 rows × 81 columns
Step 3. Preprocess the data: drop useless columns, standardize the numeric features, and set missing values to 0
# House prices vary over a wide range, so we standardize the numeric data, just like standardizing a normal distribution
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
Here concat merges the two datasets into one frame for joint preprocessing. The first column of both train and test is dropped because it is just the Id index, which carries no information useful for training; the last column of the training data (SalePrice) is dropped because it is the label, which the test data does not have, so removing it aligns the two column sets. Train: (1460, 81), test: (1459, 80) --> combined: (2919, 79).
# If the test data were unavailable, the mean and std could be computed from the training data alone
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index  # column labels of the numeric (non-object) features, e.g. 'MSSubClass', 'LotFrontage', ...
all_features[numeric_features] = all_features[numeric_features].apply(
lambda x: (x - x.mean()) / (x.std()))
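In formula form, each numeric feature \(x\) is rescaled as \(x \leftarrow \frac{x - \mu}{\sigma}\), where \(\mu\) and \(\sigma\) are that column's mean and standard deviation over all 2919 rows, so every numeric column ends up with zero mean and unit variance.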
# After standardization all means are zero, so we can simply set the missing values to 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)
# dummy_na=True treats 'na' (missing values) as a valid category and creates an indicator feature for it
all_features = pd.get_dummies(all_features, dummy_na=True)
all_features.shape # (2919, 331)
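A toy illustration of what dummy_na=True does (hypothetical mini-frame; recent pandas versions print bool columns, older ones 0/1):
demo = pd.DataFrame({'MSZoning': ['RL', 'RM', None]})
pd.get_dummies(demo, dummy_na=True)
#    MSZoning_RL  MSZoning_RM  MSZoning_nan
# 0         True        False         False
# 1        False         True         False
# 2        False        False          True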
Step 4. Convert the cleaned data into torch tensors
# Extract the NumPy arrays from the pandas frames and convert them to tensors for training.
n_train = train_data.shape[0]
train_features = torch.tensor(all_features[:n_train].values,dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(
    train_data.SalePrice.values.reshape(-1, 1), dtype=torch.float32)  # in effect, just the last column (SalePrice)
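A quick shape check to confirm the split and the conversion:
print(train_features.shape)  # torch.Size([1460, 331])
print(test_features.shape)   # torch.Size([1459, 331])
print(train_labels.shape)    # torch.Size([1460, 1])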
Step 5. Define the loss function, k-fold cross-validation, and the training function
loss = nn.MSELoss()
in_features = train_features.shape[1]  # preprocessing changed the feature count, so recompute the number of input columns
def get_net():
net = nn.Sequential(nn.Linear(in_features,1))
return net
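get_net is deliberately a bare linear-regression baseline. As a hypothetical variant one might experiment with later (the hidden size and dropout rate here are made-up values, not part of the original recipe):
def get_mlp_net(hidden=256, dropout=0.1):
    # hypothetical deeper alternative; not used in the runs below
    return nn.Sequential(nn.Linear(in_features, hidden), nn.ReLU(),
                         nn.Dropout(dropout), nn.Linear(hidden, 1))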
# For house prices we care about the relative error (y - y_hat) / y, not the absolute error y - y_hat
def log_rmse(net, features, labels):
    # To further stabilize the value under the logarithm, clamp predictions below 1 up to 1
clipped_preds = torch.clamp(net(features), 1, float('inf'))
rmse = torch.sqrt(loss(torch.log(clipped_preds),
torch.log(labels)))
return rmse.item()
# Define the training function
def train(net, train_features, train_labels, test_features, test_labels,
num_epochs, learning_rate, weight_decay, batch_size):
train_ls, test_ls = [], []
train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # Adam optimizer: behaves much like SGD but is less sensitive to the learning rate
optimizer = torch.optim.Adam(net.parameters(),
lr = learning_rate,
weight_decay = weight_decay)
for epoch in range(num_epochs):
for X,y in train_iter:
optimizer.zero_grad()
l=loss(net(X),y)
l.backward()
optimizer.step()
train_ls.append(log_rmse(net, train_features, train_labels))
if test_labels is not None:
test_ls.append(log_rmse(net, test_features, test_labels))
return train_ls, test_ls
# Define k-fold cross-validation
def get_k_fold_data(k, i, X, y):
assert k > 1
    fold_size = X.shape[0] // k  # integer division --> size of each fold
X_train, y_train = None, None
for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)  # Python's built-in slice object (positional args only)
        X_part, y_part = X[idx, :], y[idx]  # see the Q&A below for slice() details
if j == i:
X_valid, y_valid = X_part, y_part
elif X_train is None:
X_train, y_train = X_part, y_part
else:
            X_train = torch.cat([X_train, X_part], 0)  # stitch the folds together vertically
            y_train = torch.cat([y_train, y_part], 0)
return X_train, y_train, X_valid, y_valid
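A quick sanity check on get_k_fold_data with toy tensors (hypothetical data): with 10 samples and k=5, each validation fold should hold 2 samples and the training split the remaining 8.
X_toy = torch.arange(20.).reshape(10, 2)
y_toy = torch.arange(10.).reshape(10, 1)
X_tr, y_tr, X_va, y_va = get_k_fold_data(5, 0, X_toy, y_toy)
print(X_tr.shape, X_va.shape)  # torch.Size([8, 2]) torch.Size([2, 2])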
def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
batch_size):
train_l_sum, valid_l_sum = 0, 0
for i in range(k):
data = get_k_fold_data(k, i, X_train, y_train)
net = get_net()
train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
weight_decay, batch_size)
train_l_sum += train_ls[-1]
valid_l_sum += valid_ls[-1]
if i == 0:
d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
legend=['train', 'valid'], yscale='log')
        print(f'fold {i + 1}: train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
return train_l_sum / k, valid_l_sum / k
Step 6. Model selection
k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
weight_decay, batch_size)
print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
      f'avg valid log rmse: {float(valid_l):f}')
fold 1: train log rmse 0.170172, valid log rmse 0.157228
fold 2: train log rmse 0.162814, valid log rmse 0.191097
fold 3: train log rmse 0.163801, valid log rmse 0.168365
fold 4: train log rmse 0.168104, valid log rmse 0.154445
fold 5: train log rmse 0.162938, valid log rmse 0.182995
5-fold validation: avg train log rmse: 0.165566, avg valid log rmse: 0.170826
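After model selection, Section 4.10 of the book retrains on all the training data and writes a Kaggle submission file; a minimal sketch following its train_and_pred:
def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    # no held-out fold here: train on every labeled example
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    print(f'train log rmse {float(train_ls[-1]):f}')
    preds = net(test_features).detach().numpy()
    # attach predictions to the test frame and export the Id/SalePrice pair
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv('submission.csv', index=False)

train_and_pred(train_features, test_features, train_labels, test_data,
               num_epochs, lr, weight_decay, batch_size)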
Q&A
Q1: How do you use pd.concat? It concatenates pandas objects. Main parameters:
- objs: a sequence (e.g. a list) of Series or DataFrame objects
- axis: the axis to concatenate along; 0 joins rows, 1 joins columns
- join: how to align the other axis, 'inner' or 'outer'
import pandas as pd
import numpy as np
# construct sample data
df1=pd.DataFrame([['A11','A12','A13','A14'],['A21','A22','A23','A24'],['A31','A32','A33','A34'],['A41','A42','A43','A44']],columns=list('ABCD'))
df2=pd.DataFrame([['B11','B12','B13','B14'],['B21','B22','B23','B24'],['B31','B32','B33','B34'],['B41','B42','B43','B44']],columns=list('ABCD'))
df3=pd.DataFrame([['C11','C12','C13','C14'],['C21','C22','C23','C24'],['C31','C32','C33','C34'],['C41','C42','C43','C44']],columns=list('ABCD'))
df4=pd.DataFrame([['D11','D12','D13','D14'],['D21','D22','D23','D24'],['D31','D32','D33','D34']],columns=list('ABCD'))
frames = [df1,df2,df3]
# The default is vertical concatenation (axis=0); here we join horizontally instead
pd.concat(objs=frames, axis=1)
 | A | B | C | D | A | B | C | D | A | B | C | D |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | A11 | A12 | A13 | A14 | B11 | B12 | B13 | B14 | C11 | C12 | C13 | C14 |
1 | A21 | A22 | A23 | A24 | B21 | B22 | B23 | B24 | C21 | C22 | C23 | C24 |
2 | A31 | A32 | A33 | A34 | B31 | B32 | B33 | B34 | C31 | C32 | C33 | C34 |
3 | A41 | A42 | A43 | A44 | B41 | B42 | B43 | B44 | C41 | C42 | C43 | C44 |
# When the frames have different numbers of rows, the result pads the gaps with NaN
pd.concat([df1, df4], axis=1)
 | A | B | C | D | A | B | C | D |
---|---|---|---|---|---|---|---|---|
0 | A11 | A12 | A13 | A14 | D11 | D12 | D13 | D14 |
1 | A21 | A22 | A23 | A24 | D21 | D22 | D23 | D24 |
2 | A31 | A32 | A33 | A34 | D31 | D32 | D33 | D34 |
3 | A41 | A42 | A43 | A44 | NaN | NaN | NaN | NaN |
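For contrast, with the default axis=0 the same frames are stacked vertically instead:
pd.concat(frames)  # default axis=0: 12 rows x 4 columns, df1/df2/df3 on top of each other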
Q2: How does torch.clamp() work? It clamps every element of the input tensor into the interval [min, max] and returns the result as a new tensor: values below min become min, values above max become max.
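A quick example:
x = torch.tensor([-2.0, 0.5, 3.0])
torch.clamp(x, 0.0, 1.0)  # tensor([0.0000, 0.5000, 1.0000])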
Q3: What does the slice() function do? slice() builds a slice object, mainly used to pass slicing bounds into functions. Note that it accepts positional arguments only: slice(stop) or slice(start, stop[, step]).
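For example:
nums = list(range(10))
s = slice(2, 6)       # the object form of the literal 2:6
nums[s], nums[2:6]    # ([2, 3, 4, 5], [2, 3, 4, 5]) -- identical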
Q4: What is the custom log_rmse function doing?
For house prices we should measure the relative error \(\frac{\hat{y}}{y}\) rather than the absolute error \(y - \hat{y}\). A convenient way to do this is to work in log space: requiring \(|\log y - \log \hat{y}| \leq \delta\) is equivalent to requiring \(e^{-\delta} \leq \frac{\hat{y}}{y} \leq e^{\delta}\). This leads to the following root-mean-squared error between the logarithm of the predicted price and the logarithm of the true price:

\[\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log y_i - \log \hat{y}_i\right)^2}\]

Because our function takes logarithms before computing the RMSE, predictions must first be clamped into \([1, +\infty)\) to keep the logarithm stable, which is exactly what torch.clamp() does inside log_rmse.
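A tiny numeric illustration using the loss defined in Step 5 (hypothetical prices): overshooting by 10% costs the same log-error whether the house is cheap or expensive.
y = torch.tensor([[100000.], [500000.]])
y_hat = y * 1.1  # both predictions 10% too high
torch.sqrt(loss(torch.log(y_hat), torch.log(y))).item()  # ~0.0953 = log(1.1)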