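Example:
- Determine which genre an unknown movie belongs to
Algorithm overview:
Steps:
- To determine the class of an unknown instance, use all instances of known class as the reference.
- Choose the parameter K.
- Compute the distance between the unknown instance and every known instance.
- Select the K nearest known instances.
- Following the majority-voting rule, assign the unknown instance to the class that is most common among its K nearest neighbors.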
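The steps above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation given later in this post: the function name `knn_classify` and the toy movie data (number of fight scenes and kiss scenes per movie) are made up for the example, and `math.dist` assumes Python 3.8 or newer.

```python
import math
from collections import Counter


def knn_classify(known, unknown, k):
    """Classify `unknown` by majority vote among its k nearest known instances.

    `known` is a list of (features, label) pairs; `unknown` is a feature tuple.
    """
    # Compute the distance from the unknown instance to every known instance.
    distances = [(math.dist(features, unknown), label) for features, label in known]
    # Keep the k closest known instances.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among the labels of those k neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (fight scenes, kiss scenes) -> genre.
known_movies = [
    ((3, 104), "Romance"),
    ((2, 100), "Romance"),
    ((1, 81), "Romance"),
    ((101, 10), "Action"),
    ((99, 5), "Action"),
    ((98, 2), "Action"),
]

print(knn_classify(known_movies, (18, 90), k=3))  # prints Romance
```

With K = 3, the three nearest neighbors of the unknown movie are all romance movies, so the vote is unanimous.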
Details:
- Choosing K
- Measuring distance: the implementation in this post uses Euclidean distance.
Other distance measures: cosine similarity (cos), correlation, Manhattan distance
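To make the measures listed above concrete, here is a rough sketch of each one in plain Python. The function names are illustrative and not part of the original code. Note that cosine and correlation are similarities (larger means closer), so to use them for KNN one would typically rank neighbors by `1 - similarity`. The sketch assumes both vectors have the same length and are neither all zeros nor constant.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between the two vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def correlation(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a], [y - mean_b for y in b])


print(euclidean((0, 0), (3, 4)))          # prints 5.0
print(manhattan((0, 0), (3, 4)))          # prints 7
print(cosine_similarity((1, 0), (0, 1)))  # prints 0.0
```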
Advantages:
- Simple.
- Easy to understand.
- Easy to implement.
- With a suitable choice of K, it can be robust to noisy data.
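The last point can be illustrated with a small made-up experiment. In the sketch below, the one-dimensional data set and the `predict` helper are hypothetical; one point near the "A" cluster carries the label "B" to simulate noise. With K = 1 the noisy point alone decides the result, while a slightly larger K outvotes it.

```python
from collections import Counter


def predict(train, query, k):
    """Majority vote among the k training points closest to `query` (1-D data)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical 1-D training data: (value, label).
train = [
    (1.0, "A"), (1.2, "A"), (1.4, "A"), (1.6, "A"),
    (1.5, "B"),  # noisy point: labeled "B" but sitting inside the "A" cluster
    (5.0, "B"), (5.2, "B"), (5.4, "B"),
]

query = 1.48
for k in (1, 3, 5):
    print(k, predict(train, query, k))
# prints:
# 1 B
# 3 A
# 5 A
```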
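Disadvantages:
- Requires a large amount of space to store all known instances.
- High computational cost (the instance to classify must be compared with every known instance).
- When the class distribution is imbalanced, for example when one class has far more instances than the others and dominates, a new unknown instance is easily assigned to that dominant class simply because of its sheer number of instances, even though the new instance is not actually close to the samples of that class.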
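The imbalance problem in the last point can be reproduced with a small made-up data set. In this sketch (the data and the `predict` helper are hypothetical), the two points closest to the query belong to the small class, but once K grows past the number of small-class instances, the dominant class wins the vote purely by count.

```python
from collections import Counter


def predict(train, query, k):
    """Majority vote among the k training points closest to `query` (1-D data)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical 1-D data: 2 instances of the small class, 20 of the dominant one.
small_class = [(19, "small"), (21, "small")]
dominant_class = [(value, "dominant") for value in range(25, 45)]
train = small_class + dominant_class

query = 20  # sits right between the two "small" instances
print(predict(train, query, k=3))  # prints small
print(predict(train, query, k=5))  # prints dominant
```

With K = 5 the neighbors are 19, 21, 25, 26 and 27, so the dominant class gets three of the five votes even though its nearest instance is five times farther from the query than either small-class instance.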
KNN code (Python implementation):

    import csv
    import math
    import operator
    import random


    def loadDataset(filename, split, trainingSet=None, testSet=None):
        """Read the CSV file and randomly split its rows into training and test sets."""
        # Avoid mutable default arguments; callers normally pass in their own lists.
        trainingSet = [] if trainingSet is None else trainingSet
        testSet = [] if testSet is None else testSet
        with open(filename, 'r', newline='') as csvfile:
            # Read every row, skipping blank ones such as a trailing empty line.
            dataset = [row for row in csv.reader(csvfile) if row]
        for row in dataset:
            for y in range(4):
                row[y] = float(row[y])  # the first four columns are numeric features
            if random.random() < split:  # each row joins the training set with probability `split`
                trainingSet.append(row)
            else:
                testSet.append(row)
        return trainingSet, testSet


    def euclideanDistance(instance1, instance2, length):
        """Euclidean distance over the first `length` dimensions of two instances."""
        distance = 0
        for x in range(length):
            distance += pow(instance1[x] - instance2[x], 2)  # sum of squared differences
        return math.sqrt(distance)


    def getNeighbors(trainingSet, testInstance, k):
        """Return the k training instances closest to testInstance."""
        distances = []  # (training instance, distance) pairs
        length = len(testInstance) - 1  # number of features (the last column is the label)
        for x in range(len(trainingSet)):
            dist = euclideanDistance(testInstance, trainingSet[x], length)
            distances.append((trainingSet[x], dist))
        # itemgetter(1) sorts by the second element of each pair, i.e. the distance
        distances.sort(key=operator.itemgetter(1))
        return [pair[0] for pair in distances[:k]]


    def getResponse(neighbors):
        """Majority vote: return the most common label among the neighbors."""
        classVotes = {}
        for x in range(len(neighbors)):
            response = neighbors[x][-1]  # the label is the last column
            if response in classVotes:
                classVotes[response] += 1
            else:
                classVotes[response] = 1
        sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
        return sortedVotes[0][0]


    def getAccuracy(testSet, predictions):
        """Percentage of test instances whose predicted label matches the actual label."""
        if not testSet:
            return 0.0
        correct = 0
        for x in range(len(testSet)):
            if testSet[x][-1] == predictions[x]:
                correct += 1
        return (correct / float(len(testSet))) * 100.0


    def main():
        # prepare data
        trainingSet = []
        testSet = []
        split = 0.67  # roughly 67% of the rows go to the training set, the rest to the test set
        loadDataset(r'iris.data.txt', split, trainingSet, testSet)
        print('Train set: ' + repr(len(trainingSet)))
        print('Test set: ' + repr(len(testSet)))
        # generate predictions
        predictions = []
        k = 3
        for x in range(len(testSet)):
            neighbors = getNeighbors(trainingSet, testSet[x], k)
            result = getResponse(neighbors)
            predictions.append(result)
            print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
        accuracy = getAccuracy(testSet, predictions)
        print('Accuracy: ' + repr(accuracy) + '%')


    if __name__ == '__main__':
        main()
Iris data (iris.data.txt):

    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
    4.6,3.4,1.4,0.3,Iris-setosa
    5.0,3.4,1.5,0.2,Iris-setosa
    4.4,2.9,1.4,0.2,Iris-setosa
    4.9,3.1,1.5,0.1,Iris-setosa
    5.4,3.7,1.5,0.2,Iris-setosa
    4.8,3.4,1.6,0.2,Iris-setosa
    4.8,3.0,1.4,0.1,Iris-setosa
    4.3,3.0,1.1,0.1,Iris-setosa
    5.8,4.0,1.2,0.2,Iris-setosa
    5.7,4.4,1.5,0.4,Iris-setosa
    5.4,3.9,1.3,0.4,Iris-setosa
    5.1,3.5,1.4,0.3,Iris-setosa
    5.7,3.8,1.7,0.3,Iris-setosa
    5.1,3.8,1.5,0.3,Iris-setosa
    5.4,3.4,1.7,0.2,Iris-setosa
    5.1,3.7,1.5,0.4,Iris-setosa
    4.6,3.6,1.0,0.2,Iris-setosa
    5.1,3.3,1.7,0.5,Iris-setosa
    4.8,3.4,1.9,0.2,Iris-setosa
    5.0,3.0,1.6,0.2,Iris-setosa
    5.0,3.4,1.6,0.4,Iris-setosa
    5.2,3.5,1.5,0.2,Iris-setosa
    5.2,3.4,1.4,0.2,Iris-setosa
    4.7,3.2,1.6,0.2,Iris-setosa
    4.8,3.1,1.6,0.2,Iris-setosa
    5.4,3.4,1.5,0.4,Iris-setosa
    5.2,4.1,1.5,0.1,Iris-setosa
    5.5,4.2,1.4,0.2,Iris-setosa
    4.9,3.1,1.5,0.2,Iris-setosa
    5.0,3.2,1.2,0.2,Iris-setosa
    5.5,3.5,1.3,0.2,Iris-setosa
    4.9,3.6,1.4,0.1,Iris-setosa
    4.4,3.0,1.3,0.2,Iris-setosa
    5.1,3.4,1.5,0.2,Iris-setosa
    5.0,3.5,1.3,0.3,Iris-setosa
    4.5,2.3,1.3,0.3,Iris-setosa
    4.4,3.2,1.3,0.2,Iris-setosa
    5.0,3.5,1.6,0.6,Iris-setosa
    5.1,3.8,1.9,0.4,Iris-setosa
    4.8,3.0,1.4,0.3,Iris-setosa
    5.1,3.8,1.6,0.2,Iris-setosa
    4.6,3.2,1.4,0.2,Iris-setosa
    5.3,3.7,1.5,0.2,Iris-setosa
    5.0,3.3,1.4,0.2,Iris-setosa
    7.0,3.2,4.7,1.4,Iris-versicolor
    6.4,3.2,4.5,1.5,Iris-versicolor
    6.9,3.1,4.9,1.5,Iris-versicolor
    5.5,2.3,4.0,1.3,Iris-versicolor
    6.5,2.8,4.6,1.5,Iris-versicolor
    5.7,2.8,4.5,1.3,Iris-versicolor
    6.3,3.3,4.7,1.6,Iris-versicolor
    4.9,2.4,3.3,1.0,Iris-versicolor
    6.6,2.9,4.6,1.3,Iris-versicolor
    5.2,2.7,3.9,1.4,Iris-versicolor
    5.0,2.0,3.5,1.0,Iris-versicolor
    5.9,3.0,4.2,1.5,Iris-versicolor
    6.0,2.2,4.0,1.0,Iris-versicolor
    6.1,2.9,4.7,1.4,Iris-versicolor
    5.6,2.9,3.6,1.3,Iris-versicolor
    6.7,3.1,4.4,1.4,Iris-versicolor
    5.6,3.0,4.5,1.5,Iris-versicolor
    5.8,2.7,4.1,1.0,Iris-versicolor
    6.2,2.2,4.5,1.5,Iris-versicolor
    5.6,2.5,3.9,1.1,Iris-versicolor
    5.9,3.2,4.8,1.8,Iris-versicolor
    6.1,2.8,4.0,1.3,Iris-versicolor
    6.3,2.5,4.9,1.5,Iris-versicolor
    6.1,2.8,4.7,1.2,Iris-versicolor
    6.4,2.9,4.3,1.3,Iris-versicolor
    6.6,3.0,4.4,1.4,Iris-versicolor
    6.8,2.8,4.8,1.4,Iris-versicolor
    6.7,3.0,5.0,1.7,Iris-versicolor
    6.0,2.9,4.5,1.5,Iris-versicolor
    5.7,2.6,3.5,1.0,Iris-versicolor
    5.5,2.4,3.8,1.1,Iris-versicolor
    5.5,2.4,3.7,1.0,Iris-versicolor
    5.8,2.7,3.9,1.2,Iris-versicolor
    6.0,2.7,5.1,1.6,Iris-versicolor
    5.4,3.0,4.5,1.5,Iris-versicolor
    6.0,3.4,4.5,1.6,Iris-versicolor
    6.7,3.1,4.7,1.5,Iris-versicolor
    6.3,2.3,4.4,1.3,Iris-versicolor
    5.6,3.0,4.1,1.3,Iris-versicolor
    5.5,2.5,4.0,1.3,Iris-versicolor
    5.5,2.6,4.4,1.2,Iris-versicolor
    6.1,3.0,4.6,1.4,Iris-versicolor
    5.8,2.6,4.0,1.2,Iris-versicolor
    5.0,2.3,3.3,1.0,Iris-versicolor
    5.6,2.7,4.2,1.3,Iris-versicolor
    5.7,3.0,4.2,1.2,Iris-versicolor
    5.7,2.9,4.2,1.3,Iris-versicolor
    6.2,2.9,4.3,1.3,Iris-versicolor
    5.1,2.5,3.0,1.1,Iris-versicolor
    5.7,2.8,4.1,1.3,Iris-versicolor
    6.3,3.3,6.0,2.5,Iris-virginica
    5.8,2.7,5.1,1.9,Iris-virginica
    7.1,3.0,5.9,2.1,Iris-virginica
    6.3,2.9,5.6,1.8,Iris-virginica
    6.5,3.0,5.8,2.2,Iris-virginica
    7.6,3.0,6.6,2.1,Iris-virginica
    4.9,2.5,4.5,1.7,Iris-virginica
    7.3,2.9,6.3,1.8,Iris-virginica
    6.7,2.5,5.8,1.8,Iris-virginica
    7.2,3.6,6.1,2.5,Iris-virginica
    6.5,3.2,5.1,2.0,Iris-virginica
    6.4,2.7,5.3,1.9,Iris-virginica
    6.8,3.0,5.5,2.1,Iris-virginica
    5.7,2.5,5.0,2.0,Iris-virginica
    5.8,2.8,5.1,2.4,Iris-virginica
    6.4,3.2,5.3,2.3,Iris-virginica
    6.5,3.0,5.5,1.8,Iris-virginica
    7.7,3.8,6.7,2.2,Iris-virginica
    7.7,2.6,6.9,2.3,Iris-virginica
    6.0,2.2,5.0,1.5,Iris-virginica
    6.9,3.2,5.7,2.3,Iris-virginica
    5.6,2.8,4.9,2.0,Iris-virginica
    7.7,2.8,6.7,2.0,Iris-virginica
    6.3,2.7,4.9,1.8,Iris-virginica
    6.7,3.3,5.7,2.1,Iris-virginica
    7.2,3.2,6.0,1.8,Iris-virginica
    6.2,2.8,4.8,1.8,Iris-virginica
    6.1,3.0,4.9,1.8,Iris-virginica
    6.4,2.8,5.6,2.1,Iris-virginica
    7.2,3.0,5.8,1.6,Iris-virginica
    7.4,2.8,6.1,1.9,Iris-virginica
    7.9,3.8,6.4,2.0,Iris-virginica
    6.4,2.8,5.6,2.2,Iris-virginica
    6.3,2.8,5.1,1.5,Iris-virginica
    6.1,2.6,5.6,1.4,Iris-virginica
    7.7,3.0,6.1,2.3,Iris-virginica
    6.3,3.4,5.6,2.4,Iris-virginica
    6.4,3.1,5.5,1.8,Iris-virginica
    6.0,3.0,4.8,1.8,Iris-virginica
    6.9,3.1,5.4,2.1,Iris-virginica
    6.7,3.1,5.6,2.4,Iris-virginica
    6.9,3.1,5.1,2.3,Iris-virginica
    5.8,2.7,5.1,1.9,Iris-virginica
    6.8,3.2,5.9,2.3,Iris-virginica
    6.7,3.3,5.7,2.5,Iris-virginica
    6.7,3.0,5.2,2.3,Iris-virginica
    6.3,2.5,5.0,1.9,Iris-virginica
    6.5,3.0,5.2,2.0,Iris-virginica
    6.2,3.4,5.4,2.3,Iris-virginica
    5.9,3.0,5.1,1.8,Iris-virginica