Logistic regression is a very important and also very basic classification algorithm. It considers prediction accuracy over the sample as a whole: the parameters are estimated from the conditional likelihood of a discriminative model, the samples are assumed to be i.i.d., and the estimate maximizes the probability that each sample's prediction matches its true label. The result is that accuracy on some samples is inevitably traded for accuracy on others. SVM, on the other hand, starts from a local view: it assumes a separating hyperplane, finds the points closest to that plane (the support vectors, of which there are few), and maximizes their distance to the plane; the plane that achieves this is taken as the best separating plane. Viewed from this angle, one might expect logistic regression to be less accurate than SVM. Today's post is mainly about Python code for logistic regression. The derivation of logistic regression is fairly simple:
The first formula is the conditional likelihood estimate: with the unknown constant theta held fixed (the semicolon in p(y|x; theta) reflects the frequentist view), the probability of producing y(i) for each input feature vector x(i) is maximized; the logarithm is taken to make differentiation convenient. The second formula is the derivative of the sigmoid function, worked out explicitly here (the derivation is straightforward, just the chain rule for composite functions). The third formula is the gradient (really the vector of partial derivatives; the directional derivative is largest in the direction of the gradient and smallest in the opposite direction). The linear-algebra meaning of formula 3 is: for a sample of size m (an m x n matrix) and an n x 1 weight column vector, each sample row multiplied by the weight vector gives a predicted value, and the m predictions form an m x 1 column vector (each entry is the sigmoid of the corresponding inner product, 1/(1 + exp(-inX))); this is just the m x n sample matrix times the n x 1 weight vector, and since a matrix is essentially a linear map, it amounts to projecting each sample row onto the weight vector. The column vector of predictions is then subtracted from the true-label vector (transposed as needed) to obtain an error vector. Finally the sample matrix is transposed and multiplied by this error vector, giving an n x 1 vector: the gradient. This computes the batch gradient. The likelihood is maximal exactly when the loss is minimal, so the two move in opposite directions, and the resulting procedure is gradient ascent.
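The three formulas referred to above appeared as images in the original post and are not reproduced here; a reconstruction consistent with the description and with the gradient-ascent update in the code below would be (my notation):

\ell(\theta) = \sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big],
\qquad h_\theta(x)=g(\theta^{T}x),\; g(z)=\frac{1}{1+e^{-z}}

g'(z) = g(z)\,\big(1-g(z)\big)

\nabla_\theta \ell(\theta) = X^{T}\big(y - h_\theta(X)\big),
\qquad \theta := \theta + \alpha\, X^{T}\big(y - h_\theta(X)\big)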
The loss function of logistic regression can also be understood from another angle. From the formulas above we saw that the loss function and the likelihood move in opposite directions; in the usual formulation, the loss function of logistic regression is taken to be the cross-entropy, averaged over the samples.
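Written out (this is the standard form; the original post did not show it explicitly), the averaged cross-entropy loss is just the negative log-likelihood divided by m:

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\theta(x^{(i)})\big)\Big] = -\frac{\ell(\theta)}{m}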
SGD (stochastic gradient descent) does not use all samples in each iteration; instead it randomly picks one sample and takes a gradient ascent (or descent) step on it. In engineering practice this meets the accuracy requirements, and its time complexity is lower than the batch method's. For SGD the choice of the learning rate alpha matters: early on the descent should be fast, while later the parameters may start oscillating around some value and progress slows down. A reasonable learning-rate schedule therefore uses a relatively large rate for the samples visited early and a smaller rate later on.
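As a concrete illustration, here is a minimal sketch of such a decaying schedule; it has the same form as the alpha used inside train_random further below (the helper name decayed_alpha is mine, not part of the module):

# Decaying learning rate: large for the first samples, small later, never exactly zero.
# Same form as the alpha in train_random below; the helper name is hypothetical.
def decayed_alpha(epoch, sample_index, floor=0.0001):
    return 4 / (1.0 + epoch + sample_index) + floor

if __name__ == '__main__':
    for epoch in (0, 1, 50):
        print(epoch, decayed_alpha(epoch, 0))  # 4.0001, 2.0001, then roughly 0.0785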
After training, the best model should be found in the spirit of five-fold cross-validation (this is the parameter-tuning step); in the code below a simple train/validation/test hold-out split is used rather than true five-fold cross-validation. This process is usually measured with RMSE (root mean square error), and a baseline RMSE can also be defined. The RMSE obtained when the best model predicts on the test set is then compared against the baseline RMSE to analyse the result.
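For reference, the two quantities compared in the test code below are (standard definitions, with \bar{y}_{\text{train+val}} the mean label over the training and validation sets):

\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\big(y_i-\hat{y}_i\big)^2},
\qquad
\mathrm{RMSE}_{\text{base}} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\big(y_i-\bar{y}_{\text{train+val}}\big)^2}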
This post focuses on the SGD algorithm for logistic regression and does not consider regularization. In fact, in high-dimensional settings generalization, i.e. regularization, is crucial. Since the lasso was proposed in the 1990s, many methods have been developed to fit it, such as LARS and coordinate descent (CD). The next post will look at the lasso and implement a CD algorithm. Below is the SGD Python code I wrote recently. First the module logisticRegression.py, which defines two classes, LogisticRegressionWithSGD and LRModel, plus the module-level functions RMSE, loadDataSet and sigmoid. After that comes the test code, which is mainly about parameter tuning.
logisticRegression.py:

'''
Created on 2017/03/15
Logistic Regression without lasso
@author: XueQiang Tong
'''
'''
This module implements SGD and batch gradient descent without lasso regularization,
so the generalization ability is relatively weak; a follow-up post will cover a CD
algorithm for the lasso.
'''
from numpy import *
import numpy as np
import math
import matplotlib.pyplot as plt

# load dataset into numpy ndarrays; the first column of dataMat is the constant bias term
def loadDataSet(filepath, seperator="\t"):
    with open(filepath) as fr:
        lines = fr.readlines()
    num_samples = len(lines)
    dimension = len(lines[0].strip().split(seperator))
    dataMat = np.zeros((num_samples, dimension))
    labelMat = []

    index = 0
    for line in lines:
        sample = line.strip().split(seperator)
        feature = list(map(np.float32, sample[:-1]))
        dataMat[index, 1:] = feature
        dataMat[index, 0] = 1.0          # bias column
        labelMat.append(float(sample[-1]))
        index += 1

    return dataMat, array(labelMat)

# sigmoid function
def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

# compute the root mean square error between true and predicted labels
def RMSE(true, predict):
    sub = []
    for r, p in zip(true, predict):
        sub.append(math.pow(r - p, 2))
    return math.sqrt(mean(sub))

def maxstep(dataMat):
    return 2 / abs(np.linalg.det(np.dot(dataMat.transpose(), dataMat)))

class LRModel:
    def __init__(self, weights=np.empty(10), alpha=0.01, iter=150):
        self.weights = weights
        self.alpha = alpha
        self.iter = iter

    # predict the class (0.0/1.0) of a single feature vector
    def predict(self, inX):
        prob = sigmoid(dot(inX, self.weights))
        if prob > 0.5:
            return 1.0
        else:
            return 0.0

    def getWeights(self):
        return self.weights

class LogisticRegressionWithSGD:
    # batch gradient descent (gradient ascent on the log-likelihood)
    @classmethod
    def batchGradDescent(cls, data, maxCycles=500, alpha=0.001):
        dataMatrix = mat(data[0])             # convert to NumPy matrix
        labelMat = mat(data[1]).transpose()   # convert to NumPy matrix
        m, n = shape(dataMatrix)
        weights = ones((n, 1))
        for k in range(maxCycles):
            h = sigmoid(dataMatrix * weights)  # compute predicted values
            error = labelMat - h               # deviation from the true labels
            weights = weights + alpha * dataMatrix.transpose() * error  # update weights

        model = LRModel(weights=weights, alpha=alpha, iter=maxCycles)
        return model

    # draw a scatter plot of the samples together with the decision boundary
    @classmethod
    def viewScatterPlot(cls, weights, dataMat, labelMat):
        dataArr = array(dataMat)
        n = shape(dataArr)[0]
        xcord1 = []; ycord1 = []
        xcord2 = []; ycord2 = []
        for i in range(n):
            if int(labelMat[i]) == 1:
                xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
            else:
                xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
        fig = plt.figure()
        ax = fig.add_subplot(111)
        ax.scatter(xcord1, ycord1, s=30, c='red', marker='s', label='1')
        ax.scatter(xcord2, ycord2, s=30, c='green', label='0')
        plt.legend()
        x = arange(-3.0, 3.0, 0.1)
        y = (-weights[0] - weights[1] * x) / weights[2]  # decision boundary: w0 + w1*x1 + w2*x2 = 0
        ax.plot(x, y)
        plt.xlabel('X1'); plt.ylabel('X2')
        plt.show()

    # one pass over the samples in their original order, fixed learning rate
    @classmethod
    def train_nonrandom(cls, data, alpha=0.01):
        dataMatrix = data[0]
        classLabels = data[1]
        m, n = shape(dataMatrix)
        weights = ones(n)                      # initialize to all ones
        for i in range(m):
            h = sigmoid(sum(dataMatrix[i] * weights))
            error = classLabels[i] - h
            weights = weights + alpha * error * dataMatrix[i]

        model = LRModel(weights=weights, alpha=alpha)
        return model

    # SGD: samples visited in random order, decaying learning rate
    @classmethod
    def train_random(cls, data, numIter=150):
        dataMatrix = data[0]
        classLabels = data[1]
        m, n = shape(dataMatrix)
        weights = ones(n)                      # initialize to all ones
        indices = np.arange(len(dataMatrix))
        for iter in range(numIter):
            np.random.shuffle(indices)
            for index in indices:
                alpha = 4 / (1.0 + iter + index) + 0.0001  # alpha decreases with the iterations but never reaches 0
                h = sigmoid(sum(dataMatrix[index] * weights))
                error = classLabels[index] - h
                weights = weights + alpha * error * array(dataMatrix[index])

        model = LRModel(weights=weights, iter=numIter)
        return model

    # train on the full file and report the training error rate
    @classmethod
    def colicTest(cls):
        data = loadDataSet('G:\\testSet.txt')
        model = cls.train_random(data)
        dataMat = data[0]
        labels = data[1]
        errnums = 0
        for index in range(dataMat.shape[0]):
            preVal = model.predict(dataMat[index, :])
            if preVal != labels[index]:
                errnums += 1
        errorRate = errnums / dataMat.shape[0]
        print('error rate:%.2f' % errorRate)
        return errorRate

    @classmethod
    def multiTest(cls):
        numTests = 10; errorSum = 0.0
        for k in range(numTests):
            errorSum += cls.colicTest()
        print("after %d iterations the average error rate is: %.2f" % (numTests, errorSum / numTests))
The test code saves the best model in an npy file, so that later it can simply be loaded and used without retraining.

from logisticRegression import *
from numpy import *
import math

def main():
    dataMat, labels = loadDataSet('G:\\testSet.txt')
    num_samples = dataMat.shape[0]

    # 60% training, 20% validation, 20% test split
    num_trains = int(num_samples * 0.6)
    num_validations = int(num_samples * 0.2)
    num_tests = int(num_samples * 0.2)

    data_trains = dataMat[:num_trains, :]
    data_validations = dataMat[num_trains:(num_trains + num_validations), :]
    data_tests = dataMat[(num_trains + num_validations):, :]

    label_trains = labels[:num_trains]
    label_validations = labels[num_trains:(num_trains + num_validations)]
    label_tests = labels[(num_trains + num_validations):]
    '''
    # parameter tuning: pick the iteration count with the smallest validation RMSE,
    # save the best weights to an npy file (run once, then leave commented out)
    minrmse = (1 << 31) - 1
    bestModel = LRModel()
    iterList = [10, 20, 30, 80]

    for iter in iterList:
        model = LogisticRegressionWithSGD.train_random((data_trains, label_trains), numIter=iter)
        preVals = zeros(num_validations)
        for i in range(num_validations):
            preVals[i] = model.predict(data_validations[i, :])

        rmse = RMSE(label_validations, preVals)
        if rmse < minrmse:
            minrmse = rmse
            bestModel = model

    print(bestModel.iter, bestModel.weights, minrmse)
    save('D:\\Python\\models\\weights.npy', bestModel.weights)
    '''

    # predict the test set with the best model and compute the RMSE against the true labels
    weights = load('D:\\Python\\models\\weights.npy')

    LogisticRegressionWithSGD.viewScatterPlot(weights, dataMat, labels)  # show the scatter plot
    model = LRModel()
    model.weights = weights

    preVals = zeros(num_tests)
    for i in range(num_tests):
        preVals[i] = model.predict(data_tests[i, :])

    testRMSE = RMSE(label_tests, preVals)  # RMSE of the predictions
    # measure the best model against a baseline that always predicts the mean label
    tavMean = mean(hstack((label_trains, label_validations)))
    baseRMSE = math.sqrt(mean((label_tests - tavMean) ** 2))  # baseline RMSE

    improvement = abs(testRMSE - baseRMSE) / baseRMSE * 100

    print("The best model improves the base line by %% %1.2f" % (improvement))

if __name__ == '__main__':
    main()
Output:
Scatter plot of the sample points:

Best number of iterations, weights and RMSE: 10 [ 12.08509707 1.4723024 -1.86595103] 0.0
The best model improves the base line by % 55.07
Also, since samples are picked at random during training, the weights and the RMSE can differ between runs even for the same number of iterations; what we want is the model with the smallest RMSE!
The algorithm itself is fairly simple, but there are many details to watch out for when implementing it with Python's numpy library. Only after working through the theory carefully yourself and then implementing it in code can you claim a basic grasp of an algorithm. Writing and tuning the code is time-consuming; next I still need to improve the algorithm, deepen the theory, and keep combining theory with programming rather than aiming high while doing little in practice!