KNN Classification Algorithm -- Python Implementation


1. kNN Algorithm Analysis

       The k-Nearest Neighbor (KNN) classification algorithm is arguably the simplest machine learning algorithm. It classifies by measuring the distance between feature vectors. The idea is simple: if the majority of the k most similar samples (i.e., the nearest neighbors in feature space) of a given sample belong to a certain class, then that sample belongs to the same class.

      In KNN, the chosen neighbors are objects that have already been correctly classified. The method decides the class of an unclassified sample based only on the class of its nearest one or few samples. Because KNN relies mainly on a limited number of surrounding neighbors rather than on discriminating class regions, it is better suited than other methods to sample sets whose class regions intersect or overlap heavily.

       A major shortcoming of this algorithm is class imbalance: when one class has a very large sample size and the others are small, the large class may dominate the k neighbors of a new sample. This can be mitigated by weighting the votes, giving larger weights to neighbors that are closer to the sample (a short sketch of this idea follows below). Another drawback is the computational cost: for every sample to be classified, its distance to all known samples must be computed in order to find its k nearest neighbors. A common remedy is to prune the known samples beforehand, removing those that contribute little to classification. The algorithm works well for automatically classifying classes with large sample sizes, while classes with few samples are more prone to misclassification [see the top ten machine learning algorithms].
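A minimal sketch of the distance-weighted voting idea mentioned above (an illustration added here, not code from the original article; weighted_vote is a made-up helper name): each of the k nearest neighbors votes with a weight inversely proportional to its distance, so closer neighbors count more.

from numpy import argsort

# Distance-weighted kNN voting (illustrative sketch; not part of the original code).
def weighted_vote(distances, labels, k, eps=1e-8):
    votes = {}
    for i in argsort(distances)[:k]:          # indices of the k nearest neighbors
        weight = 1.0 / (distances[i] + eps)   # closer neighbors get larger weights
        votes[labels[i]] = votes.get(labels[i], 0.0) + weight
    return max(votes, key=votes.get)          # label with the largest total weight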

      In short, we already have a labeled data set. When new, unlabeled data arrives, each feature of the new data is compared with the corresponding features of the samples in the set, and the algorithm extracts the class labels of the most similar (nearest) samples. In general, only the k most similar samples in the data set are considered. Finally, the class that appears most often among those k samples is chosen. The algorithm can be described as follows:

1) Compute the distance between every point in the known-class data set and the current point;

2) Sort the distances in increasing order;

3) Select the k points closest to the current point;

4) Count how often each class occurs among these k points;

5) Return the most frequent class among the k points as the predicted class of the current point.

Code:

#########################################
# kNN: k Nearest Neighbors

# Input:      newInput: vector to compare to the existing data set (1xN)
#             dataSet:  M known vectors, one per row (MxN)
#             labels:   class labels of the data set (M entries)
#             k:        number of neighbors to use for comparison
#
# Output:     the most popular class label among the k neighbors
#########################################

from numpy import *
import os

# classify using kNN
def kNNClassify(newInput, dataSet, labels, k):
    numSamples = dataSet.shape[0]  # shape[0] is the number of rows (samples)

    ## step 1: calculate the Euclidean distance
    # tile(A, reps): construct an array by repeating A reps times;
    # here newInput is copied numSamples times so it matches dataSet
    diff = tile(newInput, (numSamples, 1)) - dataSet  # element-wise difference
    squaredDiff = diff ** 2                 # square the differences
    squaredDist = sum(squaredDiff, axis=1)  # sum along each row
    distance = squaredDist ** 0.5

    ## step 2: sort the distances
    # argsort() returns the indices that would sort the array in ascending order
    sortedDistIndices = argsort(distance)

    classCount = {}  # dictionary mapping label -> vote count
    for i in range(k):
        ## step 3: take the k samples with the smallest distances
        voteLabel = labels[sortedDistIndices[i]]

        ## step 4: count how often each label occurs
        # get() returns 0 when voteLabel is not yet in classCount
        classCount[voteLabel] = classCount.get(voteLabel, 0) + 1

    ## step 5: return the label with the most votes
    maxCount = 0
    maxLabel = None
    for key, value in classCount.items():
        if value > maxCount:
            maxCount = value
            maxLabel = key

    return maxLabel
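
# A quick sanity check for kNNClassify (hypothetical toy data added for illustration;
# toyExample is a made-up helper and is not part of the original article). Four 2-D
# points form classes 'A' and 'B'; the query point [0.1, 0.2] should come out as 'B'.
def toyExample():
    group = array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
    toyLabels = ['A', 'A', 'B', 'B']
    return kNNClassify(array([0.1, 0.2]), group, toyLabels, 3)  # -> 'B'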

# convert a 32x32 text image to a 1x1024 vector
def img2vector(filename):
    rows = 32
    cols = 32
    imgVector = zeros((1, rows * cols))

    with open(filename) as fileIn:
        for row in range(rows):
            lineStr = fileIn.readline()
            for col in range(cols):
                imgVector[0, row * cols + col] = int(lineStr[col])

    return imgVector
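
# Note (added illustration, not in the original article): each digit file is assumed
# to be a 32x32 grid of '0'/'1' characters, named like '1_18.txt' (digit 1, sample 18).
# makeDummyDigit is a hypothetical helper that writes a blank file in that format so
# img2vector can be tried without the real data set.
def makeDummyDigit(filename):
    with open(filename, 'w') as fileOut:
        for row in range(32):
            fileOut.write('0' * 32 + '\n')
    # img2vector(filename) then returns a (1, 1024) vector of zeros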

# load the training and testing sets
def loadDataSet():
    ## step 1: get the training set
    print("---Getting training set...")
    dataSetDir = 'F:/eclipse/workspace/KnnTest/'  # adjust this path to your data set
    trainingFileList = os.listdir(dataSetDir + 'trainingDigits')  # training file names
    numSamples = len(trainingFileList)

    train_x = zeros((numSamples, 1024))
    train_y = []
    for i in range(numSamples):
        filename = trainingFileList[i]

        # get train_x
        train_x[i, :] = img2vector(dataSetDir + 'trainingDigits/%s' % filename)

        # get the label from a file name such as "1_18.txt" -> 1
        label = int(filename.split('_')[0])
        train_y.append(label)

    ## step 2: get the testing set
    print("---Getting testing set...")
    testingFileList = os.listdir(dataSetDir + 'testDigits')  # testing file names
    numSamples = len(testingFileList)
    test_x = zeros((numSamples, 1024))
    test_y = []
    for i in range(numSamples):
        filename = testingFileList[i]

        # get test_x
        test_x[i, :] = img2vector(dataSetDir + 'testDigits/%s' % filename)

        # get the label from a file name such as "1_18.txt" -> 1
        label = int(filename.split('_')[0])
        test_y.append(label)

    return train_x, train_y, test_x, test_y

# test the handwriting classifier
def testHandWritingClass():
    ## step 1: load data
    print("step 1: load data...")
    train_x, train_y, test_x, test_y = loadDataSet()

    ## step 2: training...
    print("step 2: training...")
    pass  # kNN has no explicit training step; the training set is used directly

    ## step 3: testing
    print("step 3: testing...")
    numTestSamples = test_x.shape[0]
    matchCount = 0
    for i in range(numTestSamples):
        predict = kNNClassify(test_x[i], train_x, train_y, 3)
        if predict == test_y[i]:
            matchCount += 1
    accuracy = float(matchCount) / numTestSamples

    ## step 4: show the result
    print("step 4: show the result...")
    print('The classify accuracy is: %.2f%%' % (accuracy * 100))

Additionally, create a script knnTest.py (with the code above saved as KNN.py in the same directory):

import KNN
KNN.testHandWritingClass()
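
For comparison only (this is an addition, not part of the original article), the same arrays returned by loadDataSet could be fed to scikit-learn's KNeighborsClassifier, assuming scikit-learn is installed; it implements the same kind of neighbor voting.

# Sketch of the same test with scikit-learn (assumes scikit-learn is available).
from sklearn.neighbors import KNeighborsClassifier
import KNN

train_x, train_y, test_x, test_y = KNN.loadDataSet()
clf = KNeighborsClassifier(n_neighbors=3)   # k = 3, as in testHandWritingClass
clf.fit(train_x, train_y)
print('The classify accuracy is: %.2f%%' % (clf.score(test_x, test_y) * 100))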

The data set can be downloaded from: http://download.csdn.net/detail/zouxy09/6610571

