Logistic Regression: Determining the Best Regression Coefficients with Optimization Methods


 

Back when I was learning Java I used an IDE called EditPlus. Its code highlighting and overall editing experience are not as good as Eclipse's, but it opens very quickly, and these days I also find it very convenient for reading Python.

 

Training the algorithm: using gradient ascent to find the best parameters

If you have watched Andrew Ng's videos, the algorithm you heard about most is gradient descent. Gradient ascent and gradient descent are essentially the same algorithm; the only difference is the sign in the update step, because one climbs toward a maximum while the other descends toward a minimum.
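Written as update rules (a standard formulation, not something spelled out in the original post), the two differ only in the sign in front of the step size \alpha:

w \leftarrow w + \alpha \nabla_w f(w) \qquad \text{(gradient ascent: maximize } f)
w \leftarrow w - \alpha \nabla_w f(w) \qquad \text{(gradient descent: minimize } f)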

 

from numpy import *

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])  # x0 = 1.0, then x1, x2
        labelMat.append(int(lineArr[2]))                             # class label
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+exp(-inX))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)             #convert to NumPy matrix
    labelMat = mat(classLabels).transpose() #convert to NumPy matrix
    m,n = shape(dataMatrix)
    alpha = 0.001                           #step size
    maxCycles = 500                         #number of iterations
    weights = ones((n,1))
    for k in range(maxCycles):              #heavy on matrix operations
        h = sigmoid(dataMatrix*weights)     #matrix mult
        error = (labelMat - h)              #vector subtraction
        weights = weights + alpha * dataMatrix.transpose()* error #matrix mult
    return weights

The first function opens testSet.txt and reads it line by line. The first two values on each line are x1 and x2, and the third value is the corresponding class label. To make the later computation convenient, the function also sets x0 to 1.0.

The second function is the sigmoid function. At x = 0 its value is 0.5; as x grows the value keeps increasing and approaches 1, and as x decreases it approaches 0.
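Written out (the standard definition, matching the code above):

\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma(0) = \tfrac{1}{2}, \qquad \lim_{x \to +\infty} \sigma(x) = 1, \qquad \lim_{x \to -\infty} \sigma(x) = 0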

The third function takes two arguments. The first is a 2-D array in which each column is a different feature and each row is a training sample; the second is the vector of class labels. We use a simple data set of 100 samples with two features, x1 and x2; together with the 0th feature x0, dataMatIn therefore holds a 100*3 matrix.
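The last line of the loop in gradAscent is the gradient-ascent step on the log-likelihood of logistic regression. As a brief sketch of why (a standard result, not derived in the original post): with h = \sigma(Xw), the gradient of the log-likelihood \ell(w) is

\nabla_w \ell(w) = X^{\top} \left( y - \sigma(Xw) \right) = X^{\top} \cdot \mathrm{error}

which is exactly what weights + alpha * dataMatrix.transpose() * error computes.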

 

 

Analyzing the data: plotting the decision boundary

def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat,labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):                      # split points by class for plotting
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0]-weights[1]*x)/weights[2]   # boundary: 0 = w0 + w1*x1 + w2*x2, solved for x2
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()

 

>>> from numpy import *
>>> reload(logRegres)
<module 'logRegres' from 'D:\Python27\logRegres.pyc'>
>>> dataArr,labelMat=logRegres.loadDataSet()
>>> weights=logRegres.gradAscent(dataArr,labelMat)
>>> logRegres.plotBestFit(weights.getA())

 

Training the algorithm: stochastic gradient ascent

Gradient ascent has to sweep the entire data set every time it updates the regression coefficients. An improvement is to update the coefficients with only one sample at a time; this variant is called stochastic gradient ascent. Because the classifier can be updated incrementally as new samples arrive, stochastic gradient ascent is an online learning algorithm. By contrast, processing all of the data in one go is called batch processing.
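Per sample, the update becomes the single-row version of the same gradient step (a standard rewrite, not stated explicitly in the post):

w \leftarrow w + \alpha \left( y_i - \sigma(x_i \cdot w) \right) x_i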

def stocGradAscent0(dataMatrix, classLabels):
    m,n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)   #initialize to all ones
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i]*weights))   # h is now a single number, not a vector
        error = classLabels[i] - h                # so is the error
        weights = weights + alpha * error * dataMatrix[i]
    return weights

 

>>> from numpy import *
>>> reload(logRegres)
<module 'logRegres' from 'D:\Python27\logRegres.pyc'>
>>> dataArr,labelMat=logRegres.loadDataSet()
>>> weights=logRegres.stocGradAscent0(array(dataArr),labelMat)
>>> logRegres.plotBestFit(weights)

 

 

An improved stochastic gradient ascent algorithm

def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m,n = shape(dataMatrix)
    weights = ones(n)   #initialize to all ones
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            alpha = 4/(1.0+j+i)+0.0001    #alpha decreases with iteration, but never
                                          #goes to 0 because of the constant term
            randIndex = int(random.uniform(0,len(dataIndex)))
            sample = dataIndex[randIndex] #pick a sample not yet used in this pass
            h = sigmoid(sum(dataMatrix[sample]*weights))
            error = classLabels[sample] - h
            weights = weights + alpha * error * dataMatrix[sample]
            del(dataIndex[randIndex])
    return weights

This code adds two improvements. First, alpha is adjusted on every iteration: although it keeps shrinking as the number of iterations grows, it never shrinks to 0 because of the constant term, so later samples still have some influence on the weights.

Second, the regression coefficients are updated using randomly selected samples, which reduces the periodic fluctuations in the weights.
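Written as a formula, the step-size schedule in the inner loop above is

\alpha_{j,i} = \frac{4}{1.0 + j + i} + 0.0001

where j indexes the pass over the data set and i the sample within the pass; the constant 0.0001 is the floor that keeps \alpha from ever reaching 0.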

>>> dataArr,labelMat=logRegres.loadDataSet()
>>> weights=logRegres.stocGradAscent1(array(dataArr),labelMat)
>>> logRegres.plotBestFit(weights)

 

 

Example: predicting horse mortality from colic symptoms

def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5: return 1.0
    else: return 0.0

def colicTest():
    frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):                          # 21 features per sample
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))   # column 21 is the class label
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = (float(errorCount)/numTestVec)
    print "the error rate of this test is: %f" % errorRate
    return errorRate

def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print "after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests))

The first function, classifyVector, returns 1 if the sigmoid value is greater than 0.5 and 0 otherwise.

The second function, colicTest, opens the training and test sets and formats the data, trains the weights with stocGradAscent1, and then measures the error rate on the test set.

The third function, multiTest, calls the second function 10 times and averages the resulting error rates.
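A typical interpreter session, following the pattern of the earlier ones (the printed error rates are omitted here, and they will differ between runs because stocGradAscent1 selects samples at random):

>>> reload(logRegres)
<module 'logRegres' from 'D:\Python27\logRegres.pyc'>
>>> logRegres.multiTest()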


