李宏毅機器學習HW1（梯度下降）

本文轉載自查看原文 2020-09-19 09:40 513 ML / DL

問題引入

作業所給的數據是某地的觀測記錄，每個月取前20天的數據，觀測數據共有18個指標，每小時記錄這18個指標的值，共記錄12個月。

現在從剩下的資料中取出連續的9小時的觀測數據，請預測第10個小時的PM2.5指標的值。

數據處理

先將csv文件內容讀入進來，首先需要注意的是RAINFALL指標還有NR，得把它替換成0。

def read_csv():
     df = pd.read_csv(CSV_file_path)
     df = df.iloc[:, 3:]                    # 前面3列的內容沒用
     df.replace('NR', 0, inplace=True)      # 將降雨量中的NR替換為0
     return df

接下來生成測試集，因為最后預測數據的特征是連續9小時的觀測數據，所以我們每次取連續的10小時觀測數據，前9小時的數據作為特征，最后一小時的PM2.5指標觀測值作為標簽值。

為方便生成數據，先將數據作如下處理：

一共生成12行這樣的數據，注意每個月不能連起來，因為每個月只取了前20天，並不連續。

接下來生成數據即可，代碼如下：

def get_train_data(df):
    data = df.to_numpy()
    month_data = {}

    for month in range(12):
        sample = np.empty([18, 480])       # 一共18個觀測指標，24*20=480
        for day in range(20):              # 將每個月的20天數據連接起來
            sample[:, day*24:(day+1)*24] = data[month*20*18+day*18:month*20*18+(day+1)*18, :]
        month_data[month] = sample

    x_set = np.empty((12*471, 18*9))       # 每10小時可取出一組數據，共471組，一共12個月，所以總的數據量為12*471，屬性值一共有18*9
    y_set = np.empty((12*471, 1))

    for month in range(12):
        for day in range(20):
            for hour in range(24):
                if day == 19 and hour > 14:
                    continue
                x_set[month * 471 + day * 24 + hour, :] = month_data[month][:, day * 24 + hour: day * 24 + hour + 9].reshape(1, -1)   # 將數據　　重組成一行
                y_set[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]   # 只取第9行的PM2.5觀測值
    return x_set, y_set

為了能獲得更好的效果，在梯度下降之前先進行標准化處理。

這里使用的公式為z-score 標准化：

$x_{i}^{r} = \frac{x_{i}^{r}-m_{i}}{\sigma_i}$

其中$x_{i}^{r} $表示第$r$組數據中的第$i$個值，$m_{i}$表示所有樣本中第$i$個值的均值，$\sigma_i$表示所有樣本中第$i$個值的標准差。

def normalization(x_set):
    x_mean = np.mean(x_set, axis=0)
    x_std = np.std(x_set, axis=0)
    for row in range(len(x_set)):
        for col in range(len(x_set[0])):
            if x_std[col] != 0:               # 標准差為0表示數據基本無波動
                x_set[row][col] = (x_set[row][col] - x_mean[col])/x_std[col]
    return x_set, x_mean, x_std

在得到訓練數據后，我們再拆分一部分數據出來作為驗證集。

def split_data(x_set, y_set):
    x_train_set = x_set[: math.floor(len(x_set) * 0.8), :]
    y_train_set = y_set[: math.floor(len(y_set) * 0.8), :]
    x_validation = x_set[math.floor(len(x_set) * 0.8):, :]
    y_validation = y_set[math.floor(len(y_set) * 0.8):, :]
    return x_train_set, y_train_set, x_validation, y_validation

梯度下降

我們首先假設線性回歸的函數為：

$H(x)=w_{0}+w_{1}x_{1}+w_{2}x_{2}+···+w_{n}x_{n}$

損失函數使用均方誤差，李宏毅老師的示例作業里使用了均方根誤差，但是代碼中算梯度的式子我沒看懂，還請懂的人可以告訴我下，所以這里我就用了均方誤差。

$L(w)=\frac{1}{2m}\sum_{i=1}^{m}(H(x^{(i)})-y^{(i)})^{2}$

現在對參數$w_{j}$求偏導，即

$\frac{\partial L(w)}{w_{j}}=\frac{1}{m}\sum_{i=1}^{m}(H(x^{(i)})-y^{(i)})x_{j}^{(i)}$

這式子是可以轉換成矩陣運算的，具體的請看下面的代碼。

在得到梯度之后，需要不斷更新$w$參數的值，在這里使用李宏毅老師所講的Adagrad方法，公式為：

$w^{t+1} = w^{t}-\frac{\eta^{t}}{\sigma^{t}}g^{t}$

其中$\eta^{t}=\frac{\eta}{\sqrt{t+1}}$，這可以使得我們一開始以盡量快得到速度靠近目標，然后逐漸減小學習率，$g^{t}=\frac{\partial L(w)}{w}$，$\sigma^{t}=\sqrt{\frac{1}{t+1}\sum_{i=0}^{t}(g^{i})^2}$。

經過化簡計算，即為：

$w^{t+1} = w^{t}-\frac{\eta}{\sqrt{\sum_{i=0}^{t}(g^{i})^2}}g^{t}$

def training(x_train_set, y_train_set):
    dim = 18 * 9 + 1      # w參數的維度，+1是可以把b也當成一個w
    w = np.ones([dim, 1])
    b = np.ones([len(x_train_set), 1])
    x = np.concatenate((b, x_train_set), axis=1).astype(float)  # 將b初始化為1，加載樣本屬性值的最前面

    learning_rate = 100      # 學習率
    iter_time = 20000        # 迭代次數
    adagrad = np.zeros([dim, 1])
    eps = 0.0000000001       # 新的學習率是learning_rate/sqrt(sum_of_pre_grads**2),而adagrad=sum_of_grads**2,所以處在分母上而迭代時adagrad可能為0，所以加上一個極小數，使其不除0

    for i in range(iter_time):
        loss = np.sum(np.power(np.dot(x, w) - y_train_set, 2)) / len(x_train_set)/2    # 均方誤差
        if i % 100 == 0:                                                               # 每迭代100次輸出loss值
            print(str(i)+':'+str(loss))
        gradient = np.dot(x.T, np.dot(x, w) - y_train_set)/len(x_train_set)            # 計算梯度
        adagrad += gradient ** 2                                                       # 累加adagrad值
        w = w - learning_rate * gradient / np.sqrt(adagrad+eps)                        # 更新參數
    return w

驗證預測

將訓練得到的$w$參數在之前拆分得到的驗證集上進行計算。

def validation(w, x_validation, y_validation):
    b = np.ones([len(x_validation), 1])
    x = np.concatenate((b, x_validation), axis=1).astype(float)
    loss = np.sum(np.power(np.dot(x, w) - y_validation, 2)) / len(x_validation)/2
    print('驗證的loss為' + str(loss))

最后進行預測，讀入test.csv文件，也和訓練數據一樣，先處理一下數據，然后直接np.dot(x,w)即可。

def predict(w, x_mean, x_std):
    df = pd.read_csv(test_file_path, header=None)
    df = df.iloc[:, 2:]
    df.replace('NR', 0, inplace=True)
    data = df.to_numpy()
    pre_data = np.empty((240, 18*9))
    for i in range(240):
        pre_data[i, :] = data[18*i:18*(i+1), :].reshape(1, -1)
    for row in range(len(pre_data)):             # 需要標准化，而且均值和標准差需要使用之前的
        for col in range(len(pre_data[0])):
            if x_std[col] != 0:
                pre_data[row][col] = (pre_data[row][col] - x_mean[col]) / x_std[col]
    b = np.ones([len(pre_data), 1])
    pre_data = np.concatenate((b, pre_data), axis=1).astype(float)
    result = np.dot(pre_data, w)

    file = open('result.csv', 'w')
    for i in range(240):
        file.write('id_' + str(i) + ',' + str(result[i][0]))
        file.write('\n')
    file.close()

預測的結果為：

完整代碼

import numpy as np
import pandas as pd
import math

CSV_file_path = './train.csv'
test_file_path = './test.csv'


def read_csv():
    df = pd.read_csv(CSV_file_path)
    df = df.iloc[:, 3:]
    df.replace('NR', 0, inplace=True)      # 將降雨量中的NR替換為0
    return df


def get_train_data(df):
    data = df.to_numpy()
    month_data = {}

    for month in range(12):
        sample = np.empty([18, 480])       # 一共18個觀測指標，24*20=480
        for day in range(20):              # 將每個月的20天數據連接起來
            sample[:, day*24:(day+1)*24] = data[month*20*18+day*18:month*20*18+(day+1)*18, :]
        month_data[month] = sample

    x_set = np.empty((12*471, 18*9))       # 每10小時可取出一組數據，共471組，一共12個月，所以總的數據量為12*471，屬性值一共有18*9
    y_set = np.empty((12*471, 1))

    for month in range(12):
        for day in range(20):
            for hour in range(24):
                if day == 19 and hour > 14:
                    continue
                x_set[month * 471 + day * 24 + hour, :] = month_data[month][:, day * 24 + hour: day * 24 + hour + 9].reshape(1, -1)   # 將數據重組成一行
                y_set[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]   # 只取第9行的PM2.5觀測值
    return x_set, y_set


def normalization(x_set):
    x_mean = np.mean(x_set, axis=0)
    x_std = np.std(x_set, axis=0)
    for row in range(len(x_set)):
        for col in range(len(x_set[0])):
            if x_std[col] != 0:               # 標准差為0表示數據基本無波動
                x_set[row][col] = (x_set[row][col] - x_mean[col])/x_std[col]
    return x_set, x_mean, x_std


def split_data(x_set, y_set):
    x_train_set = x_set[: math.floor(len(x_set) * 0.8), :]
    y_train_set = y_set[: math.floor(len(y_set) * 0.8), :]
    x_validation = x_set[math.floor(len(x_set) * 0.8):, :]
    y_validation = y_set[math.floor(len(y_set) * 0.8):, :]
    return x_train_set, y_train_set, x_validation, y_validation


def training(x_train_set, y_train_set):
    dim = 18 * 9 + 1      # w參數的維度，+1是可以把b也當成一個w
    w = np.ones([dim, 1])
    b = np.ones([len(x_train_set), 1])
    x = np.concatenate((b, x_train_set), axis=1).astype(float)  # 將b初始化為1，加載樣本屬性值的最前面

    learning_rate = 100      # 學習率
    iter_time = 20000        # 迭代次數
    adagrad = np.zeros([dim, 1])
    eps = 0.0000000001       # 新的學習率是learning_rate/sqrt(sum_of_pre_grads**2),而adagrad=sum_of_grads**2,所以處在分母上而迭代時adagrad可能為0，所以加上一個極小數，使其不除0

    for i in range(iter_time):
        loss = np.sum(np.power(np.dot(x, w) - y_train_set, 2)) / len(x_train_set)/2    # 均方誤差
        if i % 100 == 0:                                                               # 每迭代100次輸出loss值
            print(str(i)+':'+str(loss))
        gradient = np.dot(x.T, np.dot(x, w) - y_train_set)/len(x_train_set)            # 計算梯度
        adagrad += gradient ** 2                                                       # 累加adagrad值
        w = w - learning_rate * gradient / np.sqrt(adagrad+eps)                        # 更新參數
    return w


def validation(w, x_validation, y_validation):
    b = np.ones([len(x_validation), 1])
    x = np.concatenate((b, x_validation), axis=1).astype(float)
    loss = np.sum(np.power(np.dot(x, w) - y_validation, 2)) / len(x_validation)/2
    print('驗證的loss為' + str(loss))


def predict(w, x_mean, x_std):
    df = pd.read_csv(test_file_path, header=None)
    df = df.iloc[:, 2:]
    df.replace('NR', 0, inplace=True)
    data = df.to_numpy()
    pre_data = np.empty((240, 18*9))
    for i in range(240):
        pre_data[i, :] = data[18*i:18*(i+1), :].reshape(1, -1)
    for row in range(len(pre_data)):             # 需要標准化，而且均值和標准差需要使用之前的
        for col in range(len(pre_data[0])):
            if x_std[col] != 0:
                pre_data[row][col] = (pre_data[row][col] - x_mean[col]) / x_std[col]
    b = np.ones([len(pre_data), 1])
    pre_data = np.concatenate((b, pre_data), axis=1).astype(float)
    result = np.dot(pre_data, w)

    file = open('result.csv', 'w')
    for i in range(240):
        file.write('id_' + str(i) + ',' + str(result[i][0]))
        file.write('\n')
    file.close()


if __name__ == '__main__':
    df = read_csv()
    x_set, y_set = get_train_data(df)
    x_set, x_mean, x_std = normalization(x_set)
    x_train_set, y_train_set, x_validation, y_validation = split_data(x_set, y_set)
    w = training(x_train_set, y_train_set)
    validation(w, x_validation, y_validation)
    predict(w, x_mean, x_std)

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 Logistic回歸預測收入----台大李宏毅機器學習作業2(HW2) 【筆記】機器學習 - 李宏毅 -- ELMO、BERT、GPT 【筆記】機器學習 - 李宏毅 -- Transformer 【345】機器學習入門 - 李宏毅機器學習筆記李宏毅機器學習筆記5：CNN卷積神經網絡李宏毅機器學習2020筆記（五）RNN、LSTM 台大李宏毅機器學習公開課2020版登陸B站李宏毅機器學習2020筆記（一）Introduction&Regression 李宏毅2021春機器學習課程筆記——Introduction of Machine/Deep Learning