Softmax: Principles and a Python Implementation



Overview of Softmax Regression

Like logistic regression, softmax regression is used for classification problems. The difference is that logistic regression mainly handles binary classification, and multi-class problems have to be reduced to it with strategies such as OvO, MvM, or OvR, whereas softmax regression tackles multi-class classification directly.

Label Encoding

In binary classification, the label \(y\) can simply be marked with {0, 1}, but in multi-class problems we need another representation. For the categories {infant, child, teenager, young adult, middle-aged, elderly}, it is natural to use {1, 2, 3, 4, 5, 6}, which is reasonable for this example because the classes have an obvious order, so the numbers are meaningful. For {pencil, pen, gel pen}, however, ordered numeric labels make no sense. Therefore, one-hot encoding is usually chosen:

\[y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}. \]
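As a quick illustration, the one-hot vectors above can be produced in a couple of lines of NumPy (a minimal sketch; the labels and class count are made up for the {pencil, pen, gel pen} example):

import numpy as np

labels = np.array([0, 2, 1, 0])   # hypothetical class indices for {pencil, pen, gel pen}
one_hot = np.eye(3)[labels]       # each row is the one-hot encoding of one label
print(one_hot)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]]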

Approach

Like logistic regression, softmax regression is built on a linear model: for each sample it predicts one score per class, converts the scores into "probabilities" with the softmax function, and then picks the class with the largest predicted "probability" as the predicted label.

\[\vec{o_i} = W\vec{x_i} + \vec{b} \]

In non-matrix form this is:
\[\begin{split}\begin{aligned}
o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\
o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\
o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3.
\end{aligned}\end{split}\]
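A minimal NumPy sketch of this scoring step (the sizes and values of W, b, and x are hypothetical, matching the 3-class, 4-feature layout above):

import numpy as np

W = np.random.normal(0, 0.01, (3, 4))    # one row of weights per class
b = np.zeros(3)                          # one bias per class
x = np.array([1.0, 2.0, 3.0, 4.0])       # a single sample with 4 features

o = W @ x + b                            # raw scores o_1, o_2, o_3
print(o.shape)                           # (3,)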

This process is clearer when drawn as a neural network diagram:

[Figure: fully connected single-layer network mapping the input features to one output score per class]

The predicted result is the vector \(\vec{o}\), whose elements range over the whole real line, so we apply softmax to turn them into values that can be interpreted as probabilities:

\[\hat{y_i} = Softmax(\vec{o_i}) \quad \text{where} \quad \hat{y}_j = \frac{\exp(o_j)}{\sum\limits_{a=1}^k \exp(o_a)}, \quad j = 1,2,\dots,k \]

For example, if \(\vec{o_i}={(1,2,3)}\), then \(\hat{y_i}=(\frac e{e+e^2+e^3},\frac {e^2}{e+e^2+e^3},\frac {e^3}{e+e^2+e^3})\). The elements of \(\hat{y_i}\) sum to 1, so they can be interpreted as the probabilities that sample \(i\) belongs to each of the three classes, and we pick the class with the largest probability as the final prediction.
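This small example is easy to check directly with NumPy (just a sketch of the arithmetic, not the implementation used later):

import numpy as np

o = np.array([1.0, 2.0, 3.0])
y_hat = np.exp(o) / np.exp(o).sum()
print(y_hat)           # approximately [0.090, 0.245, 0.665]
print(y_hat.sum())     # 1.0
print(y_hat.argmax())  # 2, i.e. the third class is predicted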

The Softmax Loss Function and Its Optimization

Loss Function

The previous section described the basic idea of softmax regression; what remains is how to compute the parameter matrix \(W\) and the bias vector \(\vec{b}\) (in the derivation and the code below, \(\vec{b}\) is absorbed into \(W\) by appending a constant feature 1 to each sample). Suppose we have obtained a sample matrix \(X_{m\times n}\) with \(m\) samples and \(n\) features, and the corresponding label matrix \(Y_{m\times k}\), where \(k\) is the number of classes. As with logistic regression, maximum likelihood estimation gives the likelihood

\[L=\prod\limits_{i=1}^mP(\vec{y_i}|\vec{x_i}) \]

The negative log-likelihood (log loss) is then

\[-\ln L = \sum\limits_{i=1}^m-\ln P(\vec{y_i}|\vec{x_i})=\sum\limits_{i=1}^m l(\vec{y_i},\hat {y}_i) \]

where

\[l(\vec y_i,\hat y_i)=-\vec y_i \cdot \ln\hat y_i=-\sum\limits_{j=1}^k y_j^{(i)}\ln(\hat y_j^{(i)}) \]

\(\vec y_i\) is the label vector of the \(i\)-th sample, and \(\hat y_j^{(i)}\) is the \(j\)-th entry of the prediction vector for the \(i\)-th sample.

Note that the conditional probability in the log loss is just the predicted probability at the position where the one-hot label equals 1, which is what allows it to be written compactly as \(l(y_i,\hat y_i)\).
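In code the per-sample loss is just the negative log of the predicted probability at the true class (a small illustrative sketch with made-up numbers):

import numpy as np

y = np.array([0.0, 1.0, 0.0])        # one-hot label: the true class is the second one
y_hat = np.array([0.2, 0.7, 0.1])    # predicted probabilities for the three classes
loss = -np.sum(y * np.log(y_hat))    # cross-entropy l(y, y_hat)
print(loss)                          # -ln(0.7) ≈ 0.357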

Derivative of the Loss Function

Simplify \(l(y_i,\hat y_i)\) as follows, where in the last step \(b\) denotes the index of the true class (so \(y_b = 1\) and all other \(y_j = 0\)):

\[\begin{split}l(y_i,\hat y_i)&=-\sum\limits_{j=1}^k y_j\ln\frac{\exp(o_j)}{\sum\limits_{a=1}^k \exp(o_a)}\\ &=\sum\limits_{j=1}^k y_j\ln\frac{\sum\limits_{a=1}^k \exp(o_a)}{\exp(o_j)}\\ &=\sum\limits_{j=1}^k y_j\ln{\sum\limits_{a=1}^k \exp(o_a)}-{\sum\limits_{j=1}^k y_jo_j}\\ &=\ln{\sum\limits_{a=1}^k \exp(o_a)}-{y_bo_b}\end{split} \]

Hence

\[\frac {\partial l(y_i,\hat y_i)}{\partial o_j}=\frac{\exp(o_j)}{\sum\limits_{a=1}^k\exp(o_a)}-y_j=Softmax(o_j)-y_j \]

Rewritten in vector form, this is

\[\frac {\partial l(y_i,\hat y_i)}{\partial \vec o_i}=Softmax(\vec o_i)-\vec y_i \]
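This gradient formula can be sanity-checked numerically with finite differences (an illustrative sketch; softmax, loss, and eps are local helpers defined only for this check):

import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):
    return -np.sum(y * np.log(softmax(o)))

o = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 0.0])
analytic = softmax(o) - y            # gradient from the derivation above

eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(3)[j], y) - loss(o - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-5))   # True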

Since

\[dl_i = tr(\frac{\partial l_i}{\partial \vec o_i}^T d\vec o_i)= tr(\frac{\partial l_i}{\partial \vec o_i}^T dW\vec x_i)= tr((\frac{\partial l_i}{\partial \vec o_i}\vec x_i^T)^T dW) \]

it follows that
\[\frac{\partial l_i}{\partial W}=\frac{\partial l_i}{\partial \vec o_i}\vec x_i^T=[Softmax(W\vec x_i)-\vec y_i]x_i^T \]

Summing over all \(m\) samples gives
\[\frac {\partial (-\ln L)}{\partial W}=\sum\limits_{i=1}^m[Softmax(W\vec x_i)-\vec y_i]x_i^T=[Softmax(WX^T)-y^T]X \]

and the gradient-descent update is
\[W=W-\alpha[Softmax(WX^T)-y^T]X \]

where

\[X=\begin{bmatrix} x_1^T \\ x_2^T \\...\\x_m^T\end{bmatrix},y=\begin{bmatrix} y_1^T \\ y_2^T\\ ...\\y_m^T\end{bmatrix} \]

(\(x_i\) and \(y_i\) are both column vectors.)
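Putting the matrix form together, one full-batch gradient-descent step might look like the sketch below (shapes and data are hypothetical; softmax_cols applies softmax column-wise to WX^T, i.e. over the class dimension):

import numpy as np

def softmax_cols(O):
    # O has shape (k, m); normalize each column over the k classes
    E = np.exp(O - O.max(axis=0, keepdims=True))
    return E / E.sum(axis=0, keepdims=True)

m, n, k = 100, 4, 3                              # samples, features, classes
X = np.random.rand(m, n)                         # each row is x_i^T
Y = np.eye(k)[np.random.randint(k, size=m)]      # each row is y_i^T (one-hot)
W = np.zeros((k, n))
alpha = 0.1

grad = (softmax_cols(W @ X.T) - Y.T) @ X         # [Softmax(WX^T) - y^T] X
W -= alpha * grad                                # W = W - alpha * gradient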

Implementing Softmax

The Image Dataset

Here we use the Fashion-MNIST image dataset from Mu Li's (李沐) course.

import matplotlib.pyplot as plt
%matplotlib inline
import torch
import torchvision
from torch.utils import data
from torchvision import transforms
import warnings
warnings.filterwarnings('ignore')
# ToTensor converts the image data from PIL format to 32-bit floating point
# and divides by 255 so that all pixel values lie between 0 and 1
trans = transforms.ToTensor()
mnist_train = torchvision.datasets.FashionMNIST(root="./data", train=True,
                                                transform=trans,
                                                download=True)
mnist_test = torchvision.datasets.FashionMNIST(root="./data", train=False,
                                               transform=trans, download=True)

After the dataset has been downloaded, we can see that the training set contains 60000 samples. mnist_train[0] returns the first image as a pair: the first item is the image data, a [1, 28, 28] tensor, and the second is the label. plt.imshow can be used to display the image.

len(mnist_train)
60000
len(mnist_train[0])
2
mnist_train[0][0].shape
torch.Size([1, 28, 28])
mnist_train[0][1]
9
plt.imshow(mnist_train[0][0][0])
plt.show()

[Figure: the first training image displayed with plt.imshow (label 9)]

Implementation with sklearn

When the multi_class parameter of sklearn's LogisticRegression class is set to 'multinomial', the model is exactly softmax regression.

from sklearn.linear_model import LogisticRegression

# Load the data
X_train,y_train = next(iter(data.DataLoader(mnist_train,batch_size=len(mnist_train))))
X_train = X_train.reshape((len(mnist_train),-1))
X_test,y_test = next(iter(data.DataLoader(mnist_test,batch_size=len(mnist_test))))
X_test = X_test.reshape((len(mnist_test),-1))

# Train the model
soft_sk = LogisticRegression(multi_class='multinomial').fit(X_train.numpy(),y_train.numpy())

# Score
soft_sk.score(X_train,y_train),soft_sk.score(X_test,y_test)
(0.8659833333333333, 0.8438)

From-Scratch Python Implementation

The parameters are optimized with gradient descent following the derivation in the previous section. Because computing the softmax can easily overflow, a very small learning rate is used. Even so, the accuracy never gets above 80%; suggestions for improving it are welcome.
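The overflow comes from exponentiating large scores. The usual remedy, a variant of which appears in the softmax method below, is to subtract the maximum score before exponentiating, which leaves the result unchanged (a tiny sketch with made-up scores):

import numpy as np

o = np.array([1000.0, 1001.0, 1002.0])
# np.exp(o) would overflow to inf; shifting by the max does not change the softmax
stable = np.exp(o - o.max()) / np.exp(o - o.max()).sum()
print(stable)   # [0.090, 0.245, 0.665], identical to softmax([0, 1, 2])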

import numpy as np
from torch.utils import data
import random
from torchvision import transforms
import torchvision
import pandas as pd

class Softmax:
    def __init__(self, X, y, batch_size=5, epoch=3, alpha=0.00001):
        self.features = np.array(np.insert(X, 0, 1, axis=1))
        self.labels_original = y
        self.labels = pd.get_dummies(self.labels_original).values
        self.batch = batch_size
        self.epoch = epoch
        self.alpha = alpha
        self.n_class = len(y.unique())
        self.n_features = self.features.shape[1]
        self.W = np.random.normal(0, 0.01, (self.n_class, self.n_features))

    def softmax(self, X):
        X = np.array(X)
        X = X - X.max()
        return np.exp(X)/np.sum(np.exp(X), axis=1, keepdims=True)

    def data_iter(self):
        range_list = np.arange(self.features.shape[0])
        random.shuffle(range_list)
        for i in range(0, len(range_list), self.batch):
            batch_indices = range_list[i:min(i + self.batch, len(range_list))]
            yield self.features[batch_indices], self.labels[batch_indices]

    def fit(self):
        for i in range(self.epoch):
            for X, y in self.data_iter():
                self.W -= self.alpha * np.matmul((self.softmax(np.matmul(self.W, X.T))-y.T), X)

    def predict(self, X_pre):
        X_pre = np.array(np.insert(X_pre, 0, 1, axis=1))
        return np.argmax(self.softmax(np.matmul(self.W, X_pre.T)), axis=0)

    def score(self, y_true, y_pre):
        return np.sum(np.ravel(y_true) == np.ravel(y_pre))/len(y_true)


def main():
    trans = transforms.ToTensor()
    mnist_train = torchvision.datasets.FashionMNIST(root="./data", train=True,
                                                    transform=trans,
                                                    download=True)
    mnist_test = torchvision.datasets.FashionMNIST(root="./data", train=False,
                                                   transform=trans, download=True)
    X_train, y_train = next(iter(data.DataLoader(mnist_train, batch_size=len(mnist_train))))
    X_train = X_train.reshape((len(mnist_train), -1))
    X_test, y_test = next(iter(data.DataLoader(mnist_test, batch_size=len(mnist_test))))
    X_test = X_test.reshape((len(mnist_test), -1))

    soft_max = Softmax(X_train, y_train)
    soft_max.fit()
    y_train_pre = soft_max.predict(X_train)
    y_test_pre = soft_max.predict(X_test)
    print(f"Training accuracy: {soft_max.score(y_train,y_train_pre)}")
    print(f"Test accuracy: {soft_max.score(y_test, y_test_pre)}")


if __name__ == '__main__':
    main()
Training accuracy: 0.6503333333333333
Test accuracy: 0.6419
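One possible reason the accuracy stalls below 80% (an observation, not a verified fix): in the softmax method, np.sum(..., axis=1, keepdims=True) normalizes the (n_class, batch) score matrix W @ X.T over the batch dimension rather than over the classes. Normalizing over axis=0, with a column-wise maximum shift, would match the derivation in the previous section and typically tolerates a much larger learning rate. A sketch of that variant:

import numpy as np

def softmax_over_classes(scores):
    # scores has shape (n_class, batch); normalize each column over the classes
    scores = scores - scores.max(axis=0, keepdims=True)   # column-wise stability shift
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=0, keepdims=True)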

Implementation with PyTorch

import torch
import torchvision
from torch.utils import data
from torch import nn
from torchvision import transforms


class SoftmaxPytorch:
    def __init__(self, X, y, batch_size=256, epoch=5, lr=0.1):
        self.features = torch.tensor(X)
        self.labels = torch.tensor(y).reshape(-1, 1)
        self.batch = batch_size
        self.epoch = epoch
        self.lr = lr
        self.n_features = self.features.shape[1]
        self.n_class = len(self.labels.unique())
        self.loss = nn.CrossEntropyLoss()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(self.n_features, self.n_class))
        self.trainer = torch.optim.SGD(self.net.parameters(), self.lr)

    def data_iter(self):
        dataset = data.TensorDataset(self.features, self.labels)
        return data.DataLoader(dataset, self.batch, shuffle=True)

    def init_weights(self, model):
        if type(model) == nn.Linear:
            nn.init.normal_(model.weight, std=0.01)

    def fit(self):
        self.net.apply(self.init_weights)
        for i in range(self.epoch):
            for X, y in self.data_iter():
                y_hat = self.net(X)
                l = self.loss(y_hat, y.ravel())
                self.trainer.zero_grad()
                l.sum().backward()
                self.trainer.step()
            print(f'epoch:{i},loss:{self.loss(self.net(self.features), self.labels.ravel())}')

    def predict(self, X_pre):
        y_hat = self.net(X_pre)
        y_pre = torch.argmax(y_hat, axis=1)
        return y_pre

    def score(self, y_hat, y_true):
        return sum(y_hat.type(y_true.dtype).ravel() == y_true.ravel())/len(y_true)


def main():
    trans = transforms.ToTensor()
    mnist_train = torchvision.datasets.FashionMNIST(root="./data", train=True,
                                                    transform=trans,
                                                    download=True)
    mnist_test = torchvision.datasets.FashionMNIST(root="./data", train=False,
                                                   transform=trans, download=True)
    X_train, y_train = next(iter(data.DataLoader(mnist_train, batch_size=len(mnist_train))))
    X_train = X_train.reshape((len(mnist_train), -1))
    X_test, y_test = next(iter(data.DataLoader(mnist_test, batch_size=len(mnist_test))))
    X_test = X_test.reshape((len(mnist_test), -1))

    sf = SoftmaxPytorch(X_train, y_train)
    sf.fit()
    y_train_pre = sf.predict(X_train)
    train_score = sf.score(y_train_pre, y_train)
    
    y_test_pre = sf.predict(X_test)
    test_score = sf.score(y_test_pre, y_test)
    print(f'Training accuracy: {train_score}')
    print(f'Test accuracy: {test_score}')


if __name__ == '__main__':
    main()
epoch:0,loss:0.6310750842094421
epoch:1,loss:0.5460468530654907
epoch:2,loss:0.5175894498825073
epoch:3,loss:0.49569806456565857
epoch:4,loss:0.473165899515152
Training accuracy: 0.84211665391922
Test accuracy: 0.8271999955177307

