1. Analysis of the kNN Algorithm
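The k-Nearest Neighbor (kNN) classification algorithm is arguably one of the simplest machine learning algorithms. It classifies a sample by measuring the distance between feature values. The idea is straightforward: if most of the k samples most similar to a given sample (that is, its k nearest neighbors in feature space) belong to a certain class, then the sample is assigned to that class as well.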
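In kNN, the neighbors that are chosen are all objects that have already been correctly classified. The class of a sample to be classified is decided solely by the classes of its nearest one or few neighbors. Because kNN relies on a limited number of nearby samples rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions cross or overlap heavily.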
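One major weakness of the algorithm in classification is its sensitivity to imbalanced samples: when one class has a very large number of samples and the other classes have very few, the k neighbors of a new sample may be dominated by the large class. Weighting can mitigate this, with neighbors closer to the sample receiving larger weights. A second weakness is the computational cost: for every sample to be classified, the distance to all known samples must be computed before its k nearest neighbors can be found. A common remedy is to edit the known samples in advance, removing those that contribute little to classification. The algorithm is better suited to automatic classification of class regions with many samples; class regions with few samples are more prone to misclassification [see "Top Ten Machine Learning Algorithms"].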
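As a rough illustration of the weighting idea, the sketch below lets each of the k neighbors vote with weight 1/distance instead of a flat count. The function name weighted_knn_classify, the eps parameter, and the toy data are assumptions made for this example, and inverse distance is only one of several possible weighting schemes.

```python
import numpy as np
from collections import defaultdict


def weighted_knn_classify(new_input, data_set, labels, k, eps=1e-9):
    """Hypothetical helper: distance-weighted vote among the k nearest neighbors."""
    data_set = np.asarray(data_set, dtype=float)
    new_input = np.asarray(new_input, dtype=float)

    # Euclidean distance from new_input to every known sample
    distances = np.sqrt(((data_set - new_input) ** 2).sum(axis=1))

    # Indices of the k closest samples
    nearest = np.argsort(distances)[:k]

    # Each neighbor votes with weight 1/distance (eps avoids division by zero)
    votes = defaultdict(float)
    for idx in nearest:
        votes[labels[idx]] += 1.0 / (distances[idx] + eps)

    return max(votes, key=votes.get)


# Imbalanced toy data (made up): class 'A' has 4 samples, class 'B' only 2
train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
           [3.0, 3.0], [3.2, 3.0]]
train_y = ['A', 'A', 'A', 'A', 'B', 'B']

query = [2.9, 2.9]
prediction = weighted_knn_classify(query, train_x, train_y, k=5)
print(prediction)
```

With k = 5 the query's neighbors are three 'A' samples and two 'B' samples, so a flat majority vote would pick 'A'. The two 'B' samples are far closer to the query, so the weighted vote picks 'B'.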
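In short, we start from a data set whose samples are already labeled. When new, unlabeled data comes in, each of its features is compared with the corresponding features of the samples in the data set, and the algorithm extracts the class labels of the most similar (nearest-neighbor) samples. In general only the k most similar samples in the data set are considered, and the class that occurs most often among those k samples is chosen. The algorithm is described as follows: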
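1) Compute the distance between each point in the labeled data set and the current point;
2) Sort the points by increasing distance;
3) Select the k points closest to the current point;
4) Determine how often each class occurs among these k points;
5) Return the most frequent class among these k points as the predicted class of the current point.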
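The five steps can be traced on a tiny data set. What follows is a minimal sketch with made-up points and labels (group, labels, point and k are illustrative names and values), using NumPy and collections.Counter; it is meant to show the intermediate results of each step, not to be a complete classifier.

```python
import numpy as np
from collections import Counter

# Toy labeled data set (hypothetical values): 4 points, 2 features each
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
point = np.array([0.2, 0.1])   # the unlabeled point to classify
k = 3

# 1) distance from every known point to the current point
distances = np.sqrt(((group - point) ** 2).sum(axis=1))
# 2) sort by increasing distance (argsort returns the sorted indices)
order = np.argsort(distances)
# 3) keep the k closest points
k_nearest = order[:k]
# 4) count how often each class occurs among them
counts = Counter(labels[i] for i in k_nearest)
# 5) the most frequent class is the prediction
predicted = counts.most_common(1)[0][0]
print(predicted)
```

Here the three nearest points are the two 'B' points and one 'A' point, so the prediction is 'B'.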
Code:
#########################################
# kNN: k Nearest Neighbors
#
# Input:  newInput: vector to compare to existing dataset (1xN)
#         dataSet:  M known vectors with N features each (MxN)
#         labels:   data set labels (length-M vector)
#         k:        number of neighbors to use for comparison
#
# Output: the most popular class label
#########################################

import os

import numpy as np


# classify using kNN
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] is the number of rows

    ## step 1: calculate Euclidean distance
    # tile(A, reps): construct an array by repeating A reps times;
    # here newInput is copied into numSamples rows to match dataSet
    diff = np.tile(newInput, (numSamples, 1)) - dataSet  # subtract element-wise
    squaredDiff = diff ** 2  # square the differences
    squaredDist = np.sum(squaredDiff, axis=1)  # sum over each row
    distance = squaredDist ** 0.5

    ## step 2: sort the distances
    # argsort() returns the indices that would sort an array in ascending order
    sortedDistIndices = np.argsort(distance)

    classCount = {}  # dictionary mapping label -> number of votes
    for i in range(k):
        ## step 3: choose the k smallest distances
        voteLabel = labels[sortedDistIndices[i]]

        ## step 4: count how many times each label occurs
        # get() returns 0 when voteLabel is not yet in classCount
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    ## step 5: return the class with the most votes
    maxCount = 0
    maxIndex = None
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxIndex = key

    return maxIndex


# convert image to vector
def img2vector(filename):
    rows = 32
    cols = 32
    imgVector = np.zeros((1, rows * cols))
    # each file holds 32 lines of 32 characters ('0' or '1')
    with open(filename) as fileIn:
        for row in range(rows):
            lineStr = fileIn.readline()
            for col in range(cols):
                imgVector[0, row * cols + col] = int(lineStr[col])

    return imgVector


# load dataSet
def loadDataSet():
    ## step 1: Getting training set
    print("---Getting training set...")
    dataSetDir = 'F:/eclipse/workspace/KnnTest/'
    trainingFileList = os.listdir(dataSetDir + 'trainingDigits')  # load the training set
    numSamples = len(trainingFileList)

    train_x = np.zeros((numSamples, 1024))
    train_y = []
    for i in range(numSamples):
        filename = trainingFileList[i]

        # get train_x
        train_x[i, :] = img2vector(dataSetDir + 'trainingDigits/%s' % filename)

        # get label from file name such as "1_18.txt"
        label = int(filename.split('_')[0])  # gives 1
        train_y.append(label)

    ## step 2: Getting testing set
    print("---Getting testing set...")
    testingFileList = os.listdir(dataSetDir + 'testDigits')  # load the testing set
    numSamples = len(testingFileList)

    test_x = np.zeros((numSamples, 1024))
    test_y = []
    for i in range(numSamples):
        filename = testingFileList[i]

        # get test_x
        test_x[i, :] = img2vector(dataSetDir + 'testDigits/%s' % filename)

        # get label from file name such as "1_18.txt"
        label = int(filename.split('_')[0])  # gives 1
        test_y.append(label)

    return train_x, train_y, test_x, test_y


# test hand writing class
def testHandWritingClass():
    ## step 1: load data
    print("step 1: load data...")
    train_x, train_y, test_x, test_y = loadDataSet()

    ## step 2: training (kNN has no training phase, so nothing to do)
    print("step 2: training...")

    ## step 3: testing
    print("step 3: testing...")
    numTestSamples = test_x.shape[0]
    matchCount = 0
    for i in range(numTestSamples):
        predict = kNNClassify(test_x[i], train_x, train_y, 3)
        if predict == test_y[i]:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples

    ## step 4: show the result
    print("step 4: show the result...")
    print('The classification accuracy is: %.2f%%' % (accuracy * 100))
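The tile call in step 1 of kNNClassify exists only to make the shapes match before subtracting. NumPy broadcasting can do the same thing implicitly. The sketch below, on made-up random data (data_set and new_input are illustrative names), suggests that the tile form, the broadcasting form, and np.linalg.norm give the same distances; it is a side note on the design choice, not a change to the listing.

```python
import numpy as np

rng = np.random.default_rng(0)
data_set = rng.random((5, 4))   # 5 known samples, 4 features (made-up data)
new_input = rng.random(4)       # one sample to classify

# tile-based form, as in the listing: copy new_input once per row of data_set
diff_tile = np.tile(new_input, (data_set.shape[0], 1)) - data_set
dist_tile = np.sqrt((diff_tile ** 2).sum(axis=1))

# broadcasting form: NumPy stretches new_input across the rows implicitly
dist_bcast = np.sqrt(((new_input - data_set) ** 2).sum(axis=1))

# the same Euclidean distance via the built-in norm
dist_norm = np.linalg.norm(data_set - new_input, axis=1)

same = np.allclose(dist_tile, dist_bcast) and np.allclose(dist_tile, dist_norm)
print(same)
```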
Then create a separate script, knnTest.py (the import assumes the listing above is saved as KNN.py):
import KNN
KNN.testHandWritingClass()