Back when I was learning Java I used an editor called EditPlus. Its syntax highlighting and overall coding experience don't match Eclipse, but it launches very quickly, and these days I still find it very handy for reading Python code.
Training the algorithm: using gradient ascent to find the best parameters
Anyone who has watched Andrew Ng's videos has mostly heard about gradient descent, but gradient ascent and gradient descent are essentially the same algorithm; the only difference is the sign used when the gradient step is applied.
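In symbols, with \alpha the step size and f(w) the objective being optimized, the two update rules differ only in that sign:

    w \leftarrow w + \alpha \nabla_w f(w)    (gradient ascent, maximizes f)
    w \leftarrow w - \alpha \nabla_w f(w)    (gradient descent, minimizes f)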
from numpy import *

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])  # prepend x0 = 1.0
        labelMat.append(int(lineArr[2]))
    return dataMat,labelMat

def sigmoid(inX):
    return 1.0/(1+exp(-inX))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)             # convert to NumPy matrix
    labelMat = mat(classLabels).transpose() # convert to NumPy matrix
    m,n = shape(dataMatrix)
    alpha = 0.001
    maxCycles = 500
    weights = ones((n,1))
    for k in range(maxCycles):          # heavy on matrix operations
        h = sigmoid(dataMatrix*weights) # matrix mult
        error = (labelMat - h)          # vector subtraction
        weights = weights + alpha * dataMatrix.transpose() * error  # matrix mult
    return weights
The first function opens testSet.txt and reads it line by line. The first two values on each line are x1 and x2, and the third value is the corresponding class label. For computational convenience, the function also sets x0 to 1.0.
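Each line of testSet.txt thus holds two feature values followed by a class label; a line might look like this (the numbers are only illustrative):

    -0.017612    14.053064    0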
The second function is the sigmoid function: at x = 0 its value is 0.5, and as x grows the value keeps increasing and approaches 1 (symmetrically, it approaches 0 as x decreases).
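A quick sanity check of those values, as a minimal sketch run against the sigmoid defined above:

    print sigmoid(0)     # 0.5
    print sigmoid(5)     # ~0.9933, already close to 1
    print sigmoid(-5)    # ~0.0067, already close to 0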
The third function takes two arguments. The first is a 2-D array in which each column represents a different feature and each row represents a training sample. We use a simple dataset of 100 samples with two features, x1 and x2; together with the 0th feature x0, dataMatIn therefore holds a 100*3 matrix.
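The update in gradAscent is the gradient of the log-likelihood: with h = \mathrm{sigmoid}(Xw), the log-likelihood \ell(w) = \sum_i [y_i \log h_i + (1-y_i) \log(1-h_i)] has gradient \nabla_w \ell = X^T (y - h), which is exactly dataMatrix.transpose() * error. Each cycle therefore moves the weights by alpha times this gradient, climbing the likelihood surface.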
Analyzing the data: plotting the decision boundary
def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat,labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i,1]); ycord1.append(dataArr[i,2])
        else:
            xcord2.append(dataArr[i,1]); ycord2.append(dataArr[i,2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0]-weights[1]*x)/weights[2]  # decision boundary: 0 = w0 + w1*x1 + w2*x2
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
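The line being plotted is the decision boundary: sigmoid outputs 0.5 exactly when its input is 0, so the boundary satisfies 0 = w0 + w1*x1 + w2*x2, and solving for x2 gives the expression assigned to y above.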
>>> from numpy import *
>>> reload(logRegres)
<module 'logRegres' from 'D:\Python27\logRegres.pyc'>
>>> dataArr,labelMat=logRegres.loadDataSet()
>>> weights=logRegres.gradAscent(dataArr,labelMat)
>>> logRegres.plotBestFit(weights.getA())
Training the algorithm: stochastic gradient ascent
Gradient ascent has to sweep through the entire dataset every time the regression coefficients are updated. An improvement is to update the coefficients with only one sample at a time, a method known as stochastic gradient ascent. Because the classifier can be updated incrementally as new samples arrive, stochastic gradient ascent is an online learning algorithm. By contrast, processing all of the data at once is called batch processing.
def stocGradAscent0(dataMatrix, classLabels):
    m,n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)   # initialize to all ones
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i]*weights))
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights
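Note how close this is to gradAscent: the update rule is identical, but here everything stays in plain NumPy arrays with no matrix conversions, and since each step looks at a single sample, h and error are now single numbers rather than vectors.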
>>> from numpy import *
>>> reload(logRegres)
<module 'logRegres' from 'D:\Python27\logRegres.pyc'>
>>> dataArr,labelMat=logRegres.loadDataSet()
>>> weights=logRegres.stocGradAscent0(array(dataArr),labelMat)
>>> logRegres.plotBestFit(weights)
An improved stochastic gradient ascent algorithm
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m,n = shape(dataMatrix)
    weights = ones(n)   # initialize to all ones
    for j in range(numIter):
        dataIndex = range(m)    # indices not yet used in this pass
        for i in range(m):
            alpha = 4/(1.0+j+i)+0.0001  # alpha decreases with iteration, but never
                                        # reaches 0 because of the constant term
            randIndex = int(random.uniform(0,len(dataIndex)))
            sampleIndex = dataIndex[randIndex]  # map through the shrinking index list
                                                # so each sample is used once per pass
            h = sigmoid(sum(dataMatrix[sampleIndex]*weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del(dataIndex[randIndex])
    return weights
Two changes are introduced here to improve the algorithm. First, alpha is adjusted on every iteration: although it keeps shrinking as the iteration count grows, it never reaches 0 because of the constant term, so new data always retains some influence.
Second, the regression coefficients are updated using randomly selected samples, which reduces the periodic fluctuations seen in the previous version.
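For example, with the default numIter=150 and our 100-sample dataset, the first update uses alpha = 4/(1.0+0+0)+0.0001 = 4.0001, while the last uses about 4/(1.0+149+99)+0.0001 ≈ 0.0162. So alpha decays toward, but never below, the 0.0001 constant; note also that it is not strictly decreasing, since i resets to 0 at the start of each pass.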
>>> dataArr,labelMat=logRegres.loadDataSet()
>>> weights=logRegres.stocGradAscent1(array(dataArr),labelMat)
>>> logRegres.plotBestFit(weights)
Example: predicting the mortality of sick horses from colic symptoms
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5: return 1.0
    else: return 0.0

def colicTest():
    frTrain = open('horseColicTraining.txt'); frTest = open('horseColicTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split('\t')
        lineArr = []
        for i in range(21):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
            errorCount += 1
    errorRate = (float(errorCount)/numTestVec)
    print "the error rate of this test is: %f" % errorRate
    return errorRate

def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print "after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests))
The first function returns 1 if the sigmoid value is greater than 0.5 and 0 otherwise.
The second function opens the test set and the training set and formats the data appropriately.
The third function calls the second one 10 times and averages the results.
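To run the whole experiment from the interpreter (this assumes horseColicTraining.txt and horseColicTest.txt are in the working directory; the exact error rates vary from run to run because samples are drawn at random):

>>> reload(logRegres)
>>> logRegres.multiTest()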