First, be clear that logistic regression is a classification algorithm. It builds on the linear regression model: the Sigmoid function maps the linear model's prediction to a probability, which is then thresholded into a discrete class label, so the model can be used for classification problems.
1 Principle of Logistic Regression
Take binary classification as an example to explain how logistic regression works. From the earlier summary of linear regression, the hypothesis function of linear regression is \(h_{\theta}'(x)\); in logistic regression, the Sigmoid function is applied so that the model output lies in the interval (0, 1).
In general, the Sigmoid function is written as:
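\[ g(z)=\frac{1}{1+e^{-z}} \]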
Plotting the Sigmoid function:
# Sigmoid function
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import expit

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def h(theta, X):
    # hypothesis: sigmoid of the linear combination X @ theta
    return expit(np.dot(X, theta))

# Check the sigmoid function by plotting it
myx = np.arange(-10, 10, 0.01)
plt.plot(myx, expit(myx))
plt.title('Sigmoid Function')
plt.show()
As the plot shows, \(g(z)\) tends to 1 as \(z\) tends to positive infinity, and tends to 0 as \(z\) tends to negative infinity.
For logistic regression, let \(z=h_{\theta}'(x)\); then \(g(z)=g\left(h_{\theta}'(x)\right)=\frac{1}{1+e^{-h_{\theta}'(x)}}\), where \(h_{\theta}'(x)=x\theta\). The logistic regression model is therefore:
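\[ h_{\theta}(x)=g(x\theta)=\frac{1}{1+e^{-x\theta}} \]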
Combining this with the Sigmoid function makes it clear how the logistic regression model performs classification. The model output \(h_{\theta}(x)\) is interpreted as a probability, namely the probability that the sample belongs to the positive class. When \(h_{\theta}(x)>0.5\), i.e. \(x\theta>0\), predict \(y=1\); when \(h_{\theta}(x)<0.5\), i.e. \(x\theta<0\), predict \(y=0\). This turns the regression output into a classification decision.
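As a minimal sketch of this decision rule (reusing the h(theta, X) helper defined above; the predict helper is introduced here for illustration and is not part of the original listings):

def predict(theta, X):
    # h_theta(x) >= 0.5 exactly when x . theta >= 0, so thresholding the
    # probability at 0.5 is equivalent to checking the sign of the linear score
    return (h(theta, X) >= 0.5).astype(int)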
The matrix form of the logistic regression model is:
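\[ h_{\theta}(X)=\frac{1}{1+e^{-X\theta}} \]
where \(X\) is the \(m\times n\) sample matrix and the Sigmoid function is applied element-wise.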
Thus the probability that \(y\) takes the value 1 is \(h_{\theta}(x)\), and the probability that it takes the value 0 is \(1-h_{\theta}(x)\); written compactly:
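\[ P(y\mid x;\theta)=\left(h_{\theta}(x)\right)^{y}\left(1-h_{\theta}(x)\right)^{1-y},\qquad y\in\{0,1\} \]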
The likelihood function is:
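\[ L(\theta)=\prod_{i=1}^{m}\left(h_{\theta}(x^{(i)})\right)^{y^{(i)}}\left(1-h_{\theta}(x^{(i)})\right)^{1-y^{(i)}} \]
where \(m\) is the number of samples and \((x^{(i)},y^{(i)})\) is the \(i\)-th sample.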
The log-likelihood function is:
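\[ l(\theta)=\log L(\theta)=\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)})+\left(1-y^{(i)}\right)\log\left(1-h_{\theta}(x^{(i)})\right)\right] \]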
The loss function is the negative log-likelihood (the \(1/m\) averaging factor used in some references is omitted here so that it matches the gradient given below):
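\[ J(\theta)=-l(\theta)=-\sum_{i=1}^{m}\left[y^{(i)}\log h_{\theta}(x^{(i)})+\left(1-y^{(i)}\right)\log\left(1-h_{\theta}(x^{(i)})\right)\right] \]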
The matrix form of the loss function is:
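\[ J(\theta)=-\left(Y^{T}\log h_{\theta}(X)+\left(1-Y\right)^{T}\log\left(1-h_{\theta}(X)\right)\right) \]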
The optimization objective of logistic regression is:
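\[ \min_{\theta}\,J(\theta) \]
that is, find the \(\theta\) that minimizes \(J(\theta)\) (equivalently, maximizes the log-likelihood \(l(\theta)\)).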
2 The Logistic Regression Algorithm
Since \(\frac{\partial J(\theta)}{\partial \theta}=X^{T}\left(h_{\theta}(X)-Y\right)\), the iterative update formula for \(\theta\) in gradient descent is:
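\[ \theta:=\theta-\alpha X^{T}\left(h_{\theta}(X)-Y\right) \]
where \(\alpha\) is the learning rate. Maximizing the log-likelihood instead gives the equivalent gradient ascent update \(\theta:=\theta+\alpha X^{T}\left(Y-h_{\theta}(X)\right)\), which is the form used in the code below.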
3 Code Implementation of Logistic Regression
3.1 Training the algorithm: finding the best parameters with gradient ascent
# Logistic regression gradient ascent optimization
from numpy import *

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        # prepend a constant 1.0 for the intercept term
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)                 # m x n data matrix
    labelMat = mat(classLabels).transpose()     # m x 1 label column vector
    m, n = shape(dataMatrix)
    alpha = 0.001                               # learning rate
    maxCycles = 500                             # number of iterations
    weights = ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)       # m x 1 vector of predictions
        error = labelMat - h                    # ascent step along X^T (Y - h)
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights
dataArr, labelMat = loadDataSet()
print(gradAscent(dataArr, labelMat))
# Output
[[ 4.12414349]
[ 0.48007329]
[-0.6168482 ]]
3.2 Analyzing the data: plotting the decision boundary
# Plot the data set and the best-fit line found by logistic regression
import matplotlib.pyplot as plt

def plotBestFit(wei):
    weights = wei.getA()                 # convert the NumPy matrix to an ndarray
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataMat)[0]                # number of samples
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=20, c='r', marker='s')
    ax.scatter(xcord2, ycord2, s=20, c='g')
    x = arange(-3.0, 3.0, 0.1)           # range of x values for the boundary line
    # decision boundary: w0 + w1*x1 + w2*x2 = 0, so x2 = (-w0 - w1*x1) / w2
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.title('DataSet')
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()
dataArr, labelMat = loadDataSet()
weights = gradAscent(dataArr, labelMat)
plotBestFit(weights)
The best-fit line found by the gradient ascent algorithm after 500 iterations:
3.3 Training the algorithm: stochastic gradient ascent
The gradient ascent algorithm traverses the entire data set every time it updates the regression coefficients. This is acceptable for a data set of around 100 samples, but with billions of samples and tens of thousands of features the computational cost becomes prohibitive. One improvement is to update the coefficients with only one sample at a time; this is called stochastic gradient ascent.
# Stochastic gradient ascent algorithm
def stocGradAscent0(dataMatrix, classLabels):
    m, n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)
    for i in range(m):
        # h and error are scalars here, not vectors as in gradAscent
        h = sigmoid(sum(dataMatrix[i] * weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights
def plotBestFit(weights):
    # same as above, but takes the weights as a plain ndarray (no .getA() needed)
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataMat)[0]                # number of samples
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=20, c='r', marker='s')
    ax.scatter(xcord2, ycord2, s=20, c='g')
    x = arange(-3.0, 3.0, 0.1)           # range of x values for the boundary line
    y = (-weights[0] - weights[1] * x) / weights[2]   # decision boundary line
    ax.plot(x, y)
    plt.title('DataSet')
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.show()
dataMat, labelMat = loadDataSet()
weights = stocGradAscent0(array(dataMat), labelMat)
plotBestFit(weights)
The classifier misclassifies roughly one third of the samples:
Directly comparing the result of stochastic gradient ascent with that of gradient ascent is unfair, because the latter was obtained only after iterating 500 times over the whole data set. A reliable way to judge an optimization algorithm is to check whether it converges, that is, whether the parameters reach stable values or keep fluctuating. To this end, we make some modifications to the stochastic gradient ascent algorithm of Listing 5-3 so that it runs 200 passes over the whole data set, and then plot how the three regression coefficients change over the iterations.
# Adjusted gradient ascent and stochastic gradient ascent that also record the weight history
def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)
    labelMat = mat(classLabels).T
    m, n = shape(dataMatrix)
    alpha = 0.01
    maxCycles = 500
    weights = ones((n, 1))
    weights_array = array([])
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)
        error = labelMat - h
        weights = weights + alpha * dataMatrix.T * error
        weights_array = append(weights_array, weights)   # record one row of weights per iteration
    weights_array = weights_array.reshape(maxCycles, n)
    return weights.getA(), weights_array

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)
    weights_array = array([])
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01                    # shrink alpha at every step
            randIndex = int(random.uniform(0, len(dataIndex)))  # pick a sample index at random
            h = sigmoid(sum(dataMatrix[randIndex] * weights))
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            weights_array = append(weights_array, weights, axis=0)   # record the weights after each update
            del(dataIndex[randIndex])
    weights_array = weights_array.reshape(numIter * m, n)
    return weights, weights_array
# Plot the regression coefficients versus the number of iterations
plt.rcParams['font.sans-serif'] = ['SimHei']   # only needed when plot labels contain CJK characters
plt.rcParams['axes.unicode_minus'] = False

def plotWeights(weights_array1, weights_array2):
    # split the canvas into three rows and two columns
    fig, axs = plt.subplots(nrows=3, ncols=2, sharex=False, sharey=False, figsize=(20, 10))
    x1 = arange(0, len(weights_array1), 1)
    # w0 versus number of iterations (gradient ascent)
    axs[0][0].plot(x1, weights_array1[:, 0])
    axs0_title_text = axs[0][0].set_title('Gradient ascent: regression coefficients vs. iterations')
    axs0_ylabel_text = axs[0][0].set_ylabel('W0')
    plt.setp(axs0_title_text, size=20, weight='bold', color='black')
    plt.setp(axs0_ylabel_text, size=20, weight='bold', color='black')
    # w1 versus number of iterations
    axs[1][0].plot(x1, weights_array1[:, 1])
    axs1_ylabel_text = axs[1][0].set_ylabel('W1')
    plt.setp(axs1_ylabel_text, size=20, weight='bold', color='black')
    # w2 versus number of iterations
    axs[2][0].plot(x1, weights_array1[:, 2])
    axs2_xlabel_text = axs[2][0].set_xlabel('Number of iterations')
    axs2_ylabel_text = axs[2][0].set_ylabel('W2')
    plt.setp(axs2_xlabel_text, size=20, weight='bold', color='black')
    plt.setp(axs2_ylabel_text, size=20, weight='bold', color='black')
    x2 = arange(0, len(weights_array2), 1)
    # w0 versus number of iterations (stochastic gradient ascent)
    axs[0][1].plot(x2, weights_array2[:, 0])
    axs0_title_text = axs[0][1].set_title('Stochastic gradient ascent: regression coefficients vs. iterations')
    axs0_ylabel_text = axs[0][1].set_ylabel('W0')
    plt.setp(axs0_title_text, size=20, weight='bold', color='black')
    plt.setp(axs0_ylabel_text, size=20, weight='bold', color='black')
    # w1 versus number of iterations
    axs[1][1].plot(x2, weights_array2[:, 1])
    axs1_ylabel_text = axs[1][1].set_ylabel('W1')
    plt.setp(axs1_ylabel_text, size=20, weight='bold', color='black')
    # w2 versus number of iterations
    axs[2][1].plot(x2, weights_array2[:, 2])
    axs2_xlabel_text = axs[2][1].set_xlabel('Number of iterations')
    axs2_ylabel_text = axs[2][1].set_ylabel('W2')
    plt.setp(axs2_xlabel_text, size=20, weight='bold', color='black')
    plt.setp(axs2_ylabel_text, size=20, weight='bold', color='black')
    plt.show()
# Plot the coefficient histories
dataMat, labelMat = loadDataSet()
weights1, weights_array1 = gradAscent(dataMat, labelMat)
weights2, weights_array2 = stocGradAscent1(array(dataMat), labelMat)
plotWeights(weights_array1, weights_array2)
# Improved stochastic gradient ascent algorithm
import random

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4 / (1.0 + j + i) + 0.01                    # shrink alpha at every step
            randIndex = int(random.uniform(0, len(dataIndex)))  # pick samples at random to reduce periodic fluctuations
            h = sigmoid(sum(dataMatrix[randIndex] * weights))   # h and error are scalars here; in gradAscent they are vectors
            error = classLabels[randIndex] - h
            weights = weights + alpha * error * dataMatrix[randIndex]
            del(dataIndex[randIndex])
    return weights
dataMat, labelMat = loadDataSet()
weights = stocGradAscent1(array(dataMat), labelMat)
plotBestFit(weights)