Everything Starts with Logistic Regression

Reposted article. 2016-04-27 21:45. Tags: cross entropy / machine learning / logistic regression

JSong @2016.06.13

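This series is not meant for beginners; it reflects the author's synthesis of various sources and personal understanding. Some mathematical background helps, because mathematicians reach for formulas at the slightest provocation.

The charm of a simple model is that it can be appreciated from many angles. Logistic regression is one of the simplest binary classification models. Binary classification is the most common case in practice (is this email spam, is this a face, is this borrower creditworthy), and probabilistic models are particularly well suited to such problems. This article looks at logistic regression from several angles and shows that it is a probabilistic model, a log-linear model, and a cross-entropy model.

We start with linear regression. Consider the linear relationship between m variables and y: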

\[y\sim x_1+x_2+\cdots +x_m \]

We need to find m+1 regression coefficients θ_i such that

\[\min ||y-\hat{y}||_{F},\quad \hat{y}=h_{\theta}(x)=\sum_{i=0}^m \theta_i x_i=\theta^T x \]

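where x_0=1.

1. Regression from the linear-algebra viewpoint

If each variable is observed on n samples, i.e. x_i ∈ R^n, the problem above amounts to solving the linear system: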

\[X\theta =y, \quad X=(x_0,x_1,\ldots,x_m) \]

In general this system has no exact solution, but when X has full column rank it has a unique least-squares solution:

\[\theta=X^{+}y=(X^TX)^{-1}X^Ty \]
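The closed form above can be checked numerically. The following is a minimal sketch, assuming NumPy and a small synthetic data set (the sizes, the seed, and `theta_true` are made up for illustration); it compares the normal-equation formula with the pseudo-inverse and with a library least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, illustrative only
n, m = 50, 3                            # n samples, m variables (assumed sizes)

# Design matrix X = (x_0, x_1, ..., x_m) with x_0 = 1 as the intercept column.
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, m))])
theta_true = np.array([1.0, 2.0, -3.0, 0.5])   # hypothetical coefficients
y = X @ theta_true + 0.01 * rng.normal(size=n)

# Normal equations: theta = (X^T X)^{-1} X^T y (requires full column rank).
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Pseudo-inverse: theta = X^+ y.
theta_pinv = np.linalg.pinv(X) @ y

# Library least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_normal)
```

All three routes should agree up to rounding; in practice a dedicated solver is usually preferred over forming the inverse of X^T X explicitly.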

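With that, we are ready to hit the road.

2. A brief introduction to binary classification

Classification models are common, for example deciding whether an email is spam. The simplest kind is binary classification, and multi-class problems can to some extent be reduced to binary ones. Suppose we are given a labeled training set

\[(X,y)=\{(x^{(i)}\in \mathbb{R}^n, y^{(i)}\in \{0,1\}): i=1,2,\ldots,m\} \]

Note that from here on m counts the samples and n is the dimension of x, the reverse of Section 1. The figure below shows a randomly generated scatter plot, where red stands for y=1 and blue for y=0.

Our goal is to find a model that separates the two classes (real applications are rarely this clean; this is only a demo). The most direct idea is to find a hyperplane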

\[\theta^T x=0 \]

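such that the red points and the blue points fall on opposite sides of it. For any sample point x^{(i)}, without loss of generality,

\[\theta^T x^{(i)} \]

can stand for the signed distance from the sample to the hyperplane (it equals the true distance up to a positive scaling factor). Suppose the red points lie above the hyperplane, so that their distance is positive; the distance of the blue points is then negative. To avoid treating red and blue points separately, we can use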

\[(2y-1)\theta^T x \]

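instead: this quantity is positive whenever the sample is correctly classified and negative otherwise. A natural loss function is therefore

\[\max(0,-(2y-1)\theta^T x) \]

which is zero for a correctly classified sample and grows with the distance of a misclassified one. Putting this together, the simplest linear classification model is:

\[\arg\min_\theta \sum_{i=1}^{m} \max(0,-(2y^{(i)}-1)\theta^T x^{(i)}) \]

This loss function has many shortcomings (for instance, θ=0 is a trivial minimizer), and it is not differentiable everywhere. Next we introduce the better-behaved logistic regression model, which admits many interpretations.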
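Before moving on, the naive loss can be sketched in a few lines. This is only an illustration, using the convention that a correctly classified point has a positive margin and zero loss; the hyperplane `theta` and the three samples are hypothetical.

```python
def margin(theta, x, y):
    """Signed margin (2y - 1) * theta^T x for a label y in {0, 1}."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return (2 * y - 1) * score


def naive_loss(theta, x, y):
    """Zero for a correctly classified point, |theta^T x| otherwise."""
    return max(0.0, -margin(theta, x, y))


theta = [-0.5, 1.0, 1.0]            # hypothetical hyperplane x1 + x2 = 0.5
samples = [
    ([1.0, 1.0, 1.0], 1),           # x_0 = 1 is the intercept term
    ([1.0, -1.0, 0.0], 0),
    ([1.0, 0.0, 0.0], 1),           # lies on the wrong side of the hyperplane
]
losses = [naive_loss(theta, x, y) for x, y in samples]
print(losses)
```

Only the third sample contributes to the loss, and the kink of `max` at zero is exactly where differentiability fails.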

2.1 The logistic regression model

Let

\[\theta^T x=\theta_0 +\sum_{j=1}^{n} \theta_j x_j \]

and assume

\[P(y=1|x; \theta)=h_{\theta}(x)=g(\theta^T x)=\frac{1}{1+e^{-\theta^T x}} \]

\[P(y=0|x; \theta)=1-h_{\theta}(x) \]

where

\[g(z)=\frac{1}{1+e^{-z}} \]

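is the logistic distribution function. Its graph, together with that of the density function

\[e^{-z}/(1+e^{-z})^2 \]

is shown in the figure below. In principle any continuous distribution function would do, since it maps the real line into the interval between 0 and 1. Why the logistic function in particular is chosen will be discussed later.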
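As a quick sanity check, the density above is the derivative of g and can also be written g(z)(1-g(z)). A minimal sketch using only the standard library (the evaluation points `zs` and the step `h` are arbitrary choices):

```python
import math


def g(z):
    """Logistic distribution function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def density(z):
    """Logistic density e^{-z} / (1 + e^{-z})^2."""
    return math.exp(-z) / (1.0 + math.exp(-z)) ** 2


zs = (-2.0, 0.0, 1.5)
h = 1e-6
for z in zs:
    numeric = (g(z + h) - g(z - h)) / (2 * h)   # central difference for g'(z)
    print(z, g(z), density(z), numeric, g(z) * (1 - g(z)))
```

The last three columns should agree to several decimal places for each z.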

Here we use maximum likelihood estimation. Because y only takes the values 0 and 1, the model assumption above is equivalent to

\[P(y|x;\theta)=(h_{\theta}(x))^y \,(1-h_{\theta}(x))^{1-y} \]

This gives the log-likelihood function:

\[l(\theta)=\log L(\theta)=\log \prod_{i=1}^{m} P(y^{(i)}|x^{(i)};\theta) \]

\[=\sum_{i=1}^m [y^{(i)} \log h_{\theta}(x^{(i)})+ (1-y^{(i)}) \log (1-h_{\theta}(x^{(i)}))] \]

The logistic regression model is therefore

\[\arg_{\theta} \max l(\theta) \]
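The log-likelihood is straightforward to evaluate directly. The sketch below is an illustration only: the four samples and the two candidate parameter vectors are made up, and `h` and `log_likelihood` are names chosen here to mirror the formulas.

```python
import math


def h(theta, x):
    """Model probability P(y=1|x; theta) = 1 / (1 + exp(-theta^T x))."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(theta, X, Y):
    """l(theta) = sum_i [y log h + (1 - y) log(1 - h)]."""
    total = 0.0
    for x, y in zip(X, Y):
        p = h(theta, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Toy data, made up for illustration; the first component x_0 = 1 is the intercept.
X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -2.0]]
Y = [1, 0, 1, 0]

l_zero = log_likelihood([0.0, 0.0], X, Y)   # every sample gets probability 1/2
l_good = log_likelihood([0.0, 1.0], X, Y)   # separates the two classes
print(l_zero, l_good)
```

With θ=0 the value is 4·log(1/2); a θ that puts the samples on the correct sides yields a larger log-likelihood, which is what the arg max is looking for.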

2.2 Geometric meaning (classification by a hyperplane)

Revisiting the model assumption, we obtain

\[\theta^T x=\log \frac{p}{1-p}, \quad p=P(y=1|x) \]

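Likewise, logistic regression classifies with a hyperplane: p > 1/2 exactly when θ^T x > 0. The maximum likelihood problem can be written as

\[\min -l(\theta)= -\sum_{i=1}^{m} \log[1-(2y^{(i)}-1)(y^{(i)}-h_{\theta}(x^{(i)}))]=\sum_{i=1}^{m} \log (1+e^{-(2y^{(i)}-1)\theta^T x^{(i)}}) \]

so the corresponding loss function is

\[\log(1+\exp(-(2y-1)\theta^T x)) \]

whose graph is shown below. From the model assumption, for a given sample the quantity

\[(2y-1)(y-h_{\theta}(x)) \]

is the probability assigned to the wrong class, so the smaller it is the better; for example, when y=1, the closer h_θ(x) is to 1 the better. Correspondingly, (2y−1)θ^T x should be as large as possible. From the discussion at the start of Section 2, when a sample is correctly classified

\[(2y-1)\theta^T x= \mbox{distance between the sample point and the hyperplane}\,\, \theta^T x=0 \]

up to a positive scaling factor. Here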

\[\log(1+\exp(x)) \quad and \quad \max(0,x) \]

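play similar roles, but the former is smooth (differentiable everywhere). Some articles write the logistic regression loss as

\[\log(1+\exp(-y\cdot \theta^T x)) \]

That is because in that formulation the label y takes the values -1 and 1 instead of 0 and 1.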
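The different ways of writing the loss can be compared numerically. The sketch below is an illustration under the convention that z stands for θ^T x; the function names and test values are chosen here. It checks that the negative log-likelihood of one sample, the {0,1}-label form, and the {-1,1}-label form coincide, and that log(1+exp(x)) stays within log 2 of max(0,x).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def nll(y, z):
    """-[y log h + (1 - y) log(1 - h)] with h = sigmoid(z), y in {0, 1}."""
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1.0 - h))


def logistic_loss01(y, z):
    """log(1 + exp(-(2y - 1) z)) for a label y in {0, 1}."""
    return math.log1p(math.exp(-(2 * y - 1) * z))


def logistic_loss_pm1(s, z):
    """log(1 + exp(-s z)) for a label s in {-1, +1}."""
    return math.log1p(math.exp(-s * z))


def softplus(x):
    """Smooth counterpart of max(0, x)."""
    return math.log1p(math.exp(x))


cases = [(y, z) for z in (-3.0, -0.5, 0.0, 2.0) for y in (0, 1)]
for y, z in cases:
    print(y, z, nll(y, z), logistic_loss01(y, z), logistic_loss_pm1(2 * y - 1, z))
```

The three loss columns should match for every row; the gap between `softplus(x)` and `max(0, x)` is largest at x=0, where it equals log 2.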

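2.3 Linear-algebra meaning

In the previous section we obtained the expression

\[\theta^T x=\log \frac{p}{1-p}, \quad p=P(y=1|x) \]

In other words, the linear predictor θ^T x, which ranges over the whole real line, is put in one-to-one correspondence with a probability p in (0,1), and the logarithm of p/(1-p) is linear in x. The ratio p/(1-p) of the probability that the event occurs to the probability that it does not is called the odds of experiencing an event, or simply the odds.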
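One practical consequence of the log-odds being linear is that increasing a feature by one unit multiplies the odds by e^{θ_j}. A minimal sketch, with hypothetical coefficients `theta` for a single feature:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds log(p / (1 - p)), the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


theta = [-1.0, 0.8]     # hypothetical (theta_0, theta_1)


def odds(x1):
    """Odds p / (1 - p) at feature value x1 under the model."""
    p = sigmoid(theta[0] + theta[1] * x1)
    return p / (1.0 - p)


ratio = odds(3.0) / odds(2.0)        # effect of a one-unit increase in x1
print(ratio, math.exp(theta[1]))     # the two numbers should agree
```

The ratio does not depend on the starting value of x1, which is the sense in which the model is linear on the log-odds scale.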

2.4 Information-theoretic interpretation (cross-entropy model)

Let the two distributions be

\[p=(y,1-y), \quad q=(\hat{y},1-\hat{y}) \]

where \hat{y} is the value of y fitted by the model. Their cross entropy is

\[H(p,q)= -\sum_i p_i \log q_i =-[y\log \hat{y}+(1-y)\log(1-\hat{y})] \]

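Entropy is commonly used to measure the amount of information carried by a random variable, while cross entropy measures how far the distribution q is from p (the smaller, the more similar). The expression above shows that maximum likelihood for logistic regression is equivalent to minimizing the cross entropy.

In fact the equivalence between minimizing cross entropy and maximum likelihood estimation holds in general, not just for logistic regression; interested readers can prove it themselves.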
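The general statement can be sketched as follows, assuming i.i.d. samples and writing \hat{p} for the empirical distribution of the training data and p_θ for the model (both symbols are introduced only for this sketch):

\[\frac{1}{m}\log L(\theta)=\frac{1}{m}\sum_{i=1}^{m}\log p_{\theta}(y^{(i)}|x^{(i)})=\sum_{(x,y)}\hat{p}(x,y)\log p_{\theta}(y|x)=-H(\hat{p},p_{\theta}) \]

where \hat{p}(x,y) is the fraction of training samples equal to (x,y). Maximizing the likelihood over θ is therefore the same as minimizing the cross entropy between the empirical distribution and the model; taking p_θ(1|x)=h_θ(x) recovers the logistic regression case.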

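2.5 Neural network model

Not much to add here: logistic regression is the simplest neural network, a single neuron with a sigmoid activation.

2.6 Solving for the parameters by gradient descent

Differentiating the log-likelihood gives:

\[\nabla_{\theta} l(\theta)=\sum_{i=1}^{m} (y^{(i)}-h_{\theta}(x^{(i)}))x^{(i)}= X^T[Y-h_{\theta}(X)] \]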

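where each x^{(i)} is stored as a row of X. This gives the update rule for θ:

\[\theta:=\theta+\alpha X^T[Y-h_{\theta}(X)] \]

where α is the step size. Because we are maximizing l(θ), the update moves along the positive gradient direction (gradient ascent).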
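The gradient formula can be verified against finite differences before trusting it in a training loop. This is a minimal sketch assuming NumPy; the data, labels, and starting point are random and purely illustrative.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def log_likelihood(theta, X, Y):
    p = sigmoid(X @ theta)
    return np.sum(Y * np.log(p) + (1 - Y) * np.log(1 - p))


def gradient(theta, X, Y):
    # X^T [Y - h_theta(X)], with the samples stored as rows of X
    return X.T @ (Y - sigmoid(X @ theta))


rng = np.random.default_rng(0)              # fixed seed, illustrative only
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
Y = (rng.random(20) < 0.5).astype(float)    # arbitrary labels
theta = rng.normal(size=3)

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (log_likelihood(theta + eps * e, X, Y) - log_likelihood(theta - eps * e, X, Y)) / (2 * eps)
    for e in np.eye(3)
])
analytic = gradient(theta, X, Y)

# One gradient-ascent step with a small step size.
alpha = 0.01
theta_new = theta + alpha * analytic
print(numeric, analytic)
```

The numeric and analytic gradients should agree closely, and a single small step along the gradient should increase l(θ).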

2.7 Python code

import time

import numpy as np
import matplotlib.pyplot as plt
# In IPython/Jupyter, run `%pylab inline` first to show the plot inline.


# calculate the sigmoid function
def sigmoid(inX):
    return 1.0 / (1 + np.exp(-inX))


def trainLogRegres(train_x, train_y, opts):
    # train a logistic regression model by batch gradient ascent
    # train_x is a mat datatype, each row stands for one sample
    # train_y is a mat datatype too, each row is the corresponding label
    # opts holds the step size 'alpha' and the iteration count 'maxIter'

    # measure training time
    startTime = time.time()
    train_x = np.asmatrix(train_x)
    train_y = np.asmatrix(train_y)
    numSamples, numFeatures = train_x.shape
    alpha = opts['alpha']
    maxIter = opts['maxIter']
    weights = np.asmatrix(np.ones((numFeatures, 1)))
    for k in range(maxIter):
        # theta := theta + alpha * X^T [Y - h_theta(X)]
        err = train_y - sigmoid(train_x * weights)
        weights = weights + alpha * train_x.T * err
    print('Congratulations, training complete! Took %fs!' % (time.time() - startTime))
    return weights


# show the trained logistic regression model; only works for 2-D data
def showLogRegres(weights, train_x, train_y):
    # convert everything to plain arrays for indexing and plotting
    train_x = np.asarray(train_x)
    labels = np.asarray(train_y).ravel()
    w = np.asarray(weights).ravel()
    numSamples, numFeatures = train_x.shape
    if numFeatures != 3:  # intercept column plus two features
        print("Sorry! I can not draw because the dimension of your data is not 2!")
        return 1

    # draw all samples: red for y=1, blue for y=0
    plt.plot(train_x[labels == 1, 1], train_x[labels == 1, 2], 'or')
    plt.plot(train_x[labels == 0, 1], train_x[labels == 0, 2], 'ob')
    # draw the decision boundary w0 + w1*x1 + w2*x2 = 0
    min_x = train_x[:, 1].min()
    max_x = train_x[:, 1].max()
    y_min_x = (-w[0] - w[1] * min_x) / w[2]
    y_max_x = (-w[0] - w[1] * max_x) / w[2]
    plt.plot([min_x, max_x], [y_min_x, y_max_x], '-g')
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()


# test example
len_samples = 500
x = np.random.randn(len_samples, 2)
idx1 = x[:, 0] + x[:, 1] > 0.5
y = np.zeros((len_samples, 1))
y[idx1, 0] = 1
opt = {'alpha': 0.1, 'maxIter': 1000}
x = np.hstack((np.ones((len_samples, 1)), x))
x = np.asmatrix(x)
y = np.asmatrix(y)
w = trainLogRegres(x, y, opt)
print(w)
showLogRegres(w, x, y)

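Sample output from a Jupyter session (the first line comes from %pylab inline; the data are random, so the timing and weights vary from run to run):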

Populating the interactive namespace from numpy and matplotlib
Congratulations, training complete! Took 0.047000s!
[[-17.37216188]
 [ 34.35470115]
 [ 34.98381818]]
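To see what these weights mean, they can be plugged back into the model. The sketch below copies the three numbers from the sample output; `predict_proba` and `predict` are helper names introduced here, and the query points are arbitrary.

```python
import math

# Weights copied from the sample output above: (theta_0, theta_1, theta_2).
w = [-17.37216188, 34.35470115, 34.98381818]


def predict_proba(x1, x2):
    """Estimated P(y=1 | x1, x2) under the fitted model."""
    z = w[0] + w[1] * x1 + w[2] * x2
    return 1.0 / (1.0 + math.exp(-z))


def predict(x1, x2):
    """Predicted label: 1 if the estimated probability is at least 1/2."""
    return 1 if predict_proba(x1, x2) >= 0.5 else 0


# Intercepts of the fitted boundary with the two axes.
print(-w[0] / w[1], -w[0] / w[2])
print(predict(1.0, 1.0), predict(0.0, 0.0), predict(0.3, 0.1), predict(0.4, 0.2))
```

Both axis intercepts come out close to 0.5, so the fitted boundary is close to the line x1 + x2 = 0.5 that was used to generate the labels.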

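This is only the simplest example. In practice we can also add a penalty term, such as the L_2 norm

\[\min_{w,c} \|w\|_2+ C\sum_{i=1}^{m} \log(1+\exp(-y^{(i)}(w\cdot x^{(i)}+c))) \]

or the L_1 norm

\[\min_{w,c} \|w\|_1+ C\sum_{i=1}^{m} \log(1+\exp(-y^{(i)}(w\cdot x^{(i)}+c))) \]

Penalty terms are very useful when the samples are sparse. When the sample size is large and speed is critical, plain gradient descent should also be replaced by something faster, such as SGD (stochastic gradient descent), quasi-Newton methods, AGD, and so on.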
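A penalized fit can be sketched with plain gradient descent. The code below is an illustration, not a tuned implementation: it assumes the squared penalty ½‖w‖² (a common variant of the L_2 term), labels in {-1, +1}, and hypothetical defaults for the step size `lr` and iteration count `n_iter`. On linearly separable data such as the demo, unpenalized weights keep growing, while the penalty keeps them bounded.

```python
import numpy as np


def fit_l2_logistic(X, y, C=1.0, lr=0.01, n_iter=2000):
    """Gradient descent on 0.5*||w||^2 + C*sum(log(1 + exp(-y*(X@w + c)))).

    X is an (m, n) array and y an (m,) array with labels in {-1, +1}.
    """
    m, n = X.shape
    w = np.zeros(n)
    c = 0.0
    for _ in range(n_iter):
        margins = y * (X @ w + c)
        # derivative of log(1 + exp(-z)) with respect to z is -1 / (1 + exp(z))
        coef = -y / (1.0 + np.exp(margins))
        w -= lr * (w + C * (X.T @ coef))   # penalty gradient + data gradient
        c -= lr * (C * coef.sum())         # the intercept is not penalized
    return w, c


rng = np.random.default_rng(0)             # fixed seed, illustrative only
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0.5, 1.0, -1.0)

w_strong, c_strong = fit_l2_logistic(X, y, C=0.01)   # strong regularization
w_weak, c_weak = fit_l2_logistic(X, y, C=1.0)        # weak regularization

accuracy = np.mean(np.sign(X @ w_weak + c_weak) == y)
print(np.linalg.norm(w_strong), np.linalg.norm(w_weak), accuracy)
```

A smaller C means a stronger penalty and therefore a smaller weight norm. Swapping the full sum for a random mini-batch in each iteration would turn the same loop into a basic SGD variant.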
