Machine Learning: The K-Nearest Neighbors Algorithm (KNN)

Reposted article, 2018-09-17 11:25. Tags: machine learning / k-nearest neighbors / artificial intelligence / KNN

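I. Overview of the KNN algorithm

KNN is a supervised classification algorithm and one of the simplest machine learning algorithms. As the name suggests, its core idea is to decide which class a sample belongs to from the classes of its nearest neighbors. It requires a training dataset whose samples are already labeled, and the computation has the following three steps:

1. Compute the distance between the test object and every object in the training set. Euclidean distance, cosine distance and others can be used; the simpler Euclidean distance is the most common choice;

2. Take the K objects with the smallest distances from the previous step as the neighbors of the test object;

3. Find the class that occurs most frequently among these K objects; that class is the class of the test object.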
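The three steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name knn_predict and the toy points are assumptions made for the example, not part of the original article.

```python
import math
from collections import Counter

def knn_predict(query, train_points, train_labels, k=3):
    # Step 1: Euclidean distance from the query to every training point.
    distances = [
        math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point)))
        for point in train_points
    ]
    # Step 2: indices of the k training points with the smallest distances.
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Step 3: the most frequent class among those k neighbors.
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration.
train_points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
train_labels = ["A", "A", "B", "B"]
print(knn_predict((0.9, 0.9), train_points, train_labels, k=3))
```

With k=3 the query (0.9, 0.9) has two "A" neighbors and one "B" neighbor, so the vote goes to "A".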

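II. Strengths and weaknesses

1. Strengths

Simple idea, easy to understand and implement; no parameters to estimate and no training required;

Suitable for classifying rare events;

Particularly well suited to multi-class problems.

2. Weaknesses

It is a lazy algorithm: classification is computationally expensive because every training sample must be scanned to compute distances, memory usage is high, and scoring is slow;

When classes are imbalanced, for example when one class has far more samples than the others, the K nearest neighbors of a new sample may be dominated by the large class, which hurts classification;

Poor interpretability: it cannot produce rules the way a decision tree does.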

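III. Points to note

1. Choosing K

If K is too small, classification accuracy drops; if K is too large and the test sample belongs to a class with few training samples, more noise is included and classification gets worse.

K is usually chosen by cross-validation (using K=1 as the baseline).

Rule of thumb: K is generally below the square root of the number of training samples.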
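One way to combine cross-validation with the square-root rule is sketched below. It assumes leave-one-out cross-validation on made-up two-class data; the helper names knn_predict and loo_accuracy are hypothetical, and other cross-validation schemes would work just as well.

```python
import numpy as np
from collections import Counter

def knn_predict(query, X, y, k):
    # Euclidean distances to all rows of X, then a majority vote over the k nearest.
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(y[nearest].tolist()).most_common(1)[0][0]

def loo_accuracy(X, y, k):
    # Leave-one-out: classify each sample using all the other samples.
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        hits += int(knn_predict(X[i], X[mask], y[mask], k) == y[i])
    return hits / len(X)

rng = np.random.default_rng(0)
# Made-up data: two Gaussian blobs of 30 points each.
X = np.vstack([rng.normal(0.0, 1.0, (30, 2)), rng.normal(2.5, 1.0, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

max_k = int(np.sqrt(len(X)))          # rule of thumb: K below sqrt(n)
candidates = range(1, max_k + 1, 2)   # odd K avoids ties with two classes
scores = {k: loo_accuracy(X, y, k) for k in candidates}
best_k = max(scores, key=scores.get)
print("baseline (K=1):", scores[1])
print("best K:", best_k, "accuracy:", scores[best_k])
```

The score for K=1 serves as the baseline, and a larger K is only worth keeping if it beats that baseline.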

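2. Optimizations

Compress the training set;

When deciding the final class, use weighted voting instead of a simple vote: the closer a neighbor is, the higher its weight.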
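A minimal sketch of distance-weighted voting follows. It assumes inverse-distance weights (1/d), which is one common choice; the function name weighted_knn_predict and the toy data are made up for the example.

```python
import numpy as np
from collections import Counter, defaultdict

def weighted_knn_predict(query, X, y, k=3, eps=1e-12):
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    scores = defaultdict(float)
    for i in nearest:
        # Closer neighbors contribute more; eps guards against division by zero.
        scores[y[i]] += 1.0 / (dists[i] + eps)
    return max(scores, key=scores.get)

# Made-up data: one very close "near" point and two more distant "far" points.
X = np.array([[0.0, 0.0], [3.0, 0.0], [3.2, 0.0]])
y = ["near", "far", "far"]
query = np.array([0.5, 0.0])

plain_vote = Counter(y).most_common(1)[0][0]   # with k=3 every point gets one vote
weighted_vote = weighted_knn_predict(query, X, y, k=3)
print(plain_vote, weighted_vote)
```

Here the plain vote picks "far" (two votes to one), while the weighted vote picks "near" because the single close neighbor outweighs the two distant ones.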

IV. Using KNN with scikit-learn in Python

# Calling KNN
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
iris_X = iris.data
iris_y = iris.target
np.unique(iris_y)
# Split iris data in train and test data
# A random permutation, to split the data randomly
np.random.seed(0)
# permutation() returns the indices 0..n-1 in random order
indices = np.random.permutation(len(iris_X))
# Use the shuffled indices to split the data: the last 10 samples for testing, the rest for training
iris_X_train = iris_X[indices[:-10]]
iris_y_train = iris_y[indices[:-10]]
iris_X_test  = iris_X[indices[-10:]]
iris_y_test  = iris_y[indices[-10:]]
# Create and fit a nearest-neighbor classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(iris_X_train, iris_y_train)

# Value echoed by fit() in the original post:
# KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
#            metric_params=None, n_jobs=1, n_neighbors=5, p=2,
#            weights='uniform')

# Print the predictions and the true labels for comparison
print(knn.predict(iris_X_test))
print(iris_y_test)

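KNeighborsClassifier takes 8 parameters (the first two below are the most commonly used):

n_neighbors : int, optional (default = 5): the value of K; the default number of neighbors is 5;

weights: how neighbors are weighted. "uniform" gives every neighbor the same weight, and "distance" weights each neighbor by the inverse of its distance; the default is uniform weights. A user-defined function can also be supplied to compute the weights;

algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, optional: the method used to compute the nearest neighbors; choose as needed;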

leaf_size : int, optional (default = 30)
    Leaf size passed to BallTree or KDTree. This can affect the
    speed of the construction and query, as well as the memory
    required to store the tree. The optimal value depends on the
    nature of the problem.

metric : string or DistanceMetric object (default = 'minkowski')
    the distance metric to use for the tree. The default metric is
    minkowski, and with p=2 is equivalent to the standard Euclidean
    metric. See the documentation of the DistanceMetric class for a
    list of available metrics.

p : integer, optional (default = 2)
    Power parameter for the Minkowski metric. When p = 1, this is
    equivalent to using manhattan_distance (l1), and euclidean_distance
    (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric_params: dict, optional (default = None)
    additional keyword arguments for the metric function.
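The relationship between p and the resulting distance can be checked by hand. The sketch below computes the Minkowski distance directly with NumPy; the helper name minkowski is made up for this example and is not a scikit-learn function.

```python
import numpy as np

def minkowski(a, b, p):
    # (sum_i |a_i - b_i|^p)^(1/p)
    diff = np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))
    return float((diff ** p).sum() ** (1.0 / p))

a, b = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(a, b, 1)   # |3| + |4|
euclidean = minkowski(a, b, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

For these two points p=1 gives 7 and p=2 gives 5, matching the Manhattan and Euclidean distances respectively.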

Output:

The predicted labels are essentially consistent with the true labels.

--------------------------------------------------------------------------------------------------------------------------------------

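I. Introduction

Put simply, the k-nearest neighbors algorithm classifies by measuring the distances between feature values.
Characteristics:

Strengths: high accuracy, insensitive to outliers, no assumptions about the input data
Weaknesses: computationally expensive, high memory usage
Works with: numeric and nominal values

The k-nearest neighbors algorithm is abbreviated kNN. It works as follows. There is a collection of sample data, also called the training set, in which every sample has a label, so the class of each sample is known. When new unlabeled data comes in, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general only the k most similar samples are considered, which is where the k in k-nearest neighbors comes from; k is usually an integer no greater than 20. Finally, the class that occurs most often among these k most similar samples is taken as the class of the new data.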

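II. Example

Movie classification.
Sample data:

Movie title	Fight scenes	Kissing scenes	Genre
California Man	3	104	Romance
He's Not Really into Dudes	2	100	Romance
Beautiful Woman	1	81	Romance
Kevin Longblade	101	10	Action
Robo Slayer 3000	99	5	Action
Amped II	98	2	Action
?	18	90	Unknown

Suppose we compute the distance from each known movie to the unknown movie:

Movie title	Distance to the unknown movie
California Man	20.5
He's Not Really into Dudes	18.9
Beautiful Woman	19.2
Kevin Longblade	115.3
Robo Slayer 3000	117.4
Amped II	118.9

Sorting by increasing distance gives the k nearest movies. With k=3, the three closest movies are, in order:

He's Not Really into Dudes
Beautiful Woman
California Man

kNN decides the genre of the unknown movie from the genres of the three nearest movies: romance.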
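The distances and the final vote can be reproduced with a short NumPy sketch. It assumes Euclidean distance and takes Amped II as (98, 2), the value that is consistent with its listed distance of 118.9; the variable names are made up for the example.

```python
import numpy as np
from collections import Counter

# (fight scenes, kissing scenes, genre) per movie.
# Amped II is taken as (98, 2): an assumption that matches the listed distance 118.9.
movies = {
    "California Man": (3, 104, "romance"),
    "He's Not Really into Dudes": (2, 100, "romance"),
    "Beautiful Woman": (1, 81, "romance"),
    "Kevin Longblade": (101, 10, "action"),
    "Robo Slayer 3000": (99, 5, "action"),
    "Amped II": (98, 2, "action"),
}
unknown = np.array([18, 90])

# Euclidean distance from every known movie to the unknown one.
distances = {
    title: float(np.sqrt(((np.array(values[:2]) - unknown) ** 2).sum()))
    for title, values in movies.items()
}
ranked = sorted(distances, key=distances.get)
for title in ranked:
    print(f"{title}: {distances[title]:.1f}")

k = 3
predicted = Counter(movies[t][2] for t in ranked[:k]).most_common(1)[0][0]
print("predicted genre:", predicted)
```

All three nearest movies are romance films, so the vote is unanimous.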

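III. Working in Python

1. Importing data with Python

from numpy import *
import operator

def createDataSet():
    # Create the dataset and its labels
    group = array([[1.0,1.1],[1.0,1.0],[0,0],[0,0.1]])
    labels = ['A','A','B','B']
    return group , labels

There are four data points here, each with two known attributes (feature values). The labels vector holds the label of each data point, and it has as many elements as the group matrix has rows. Here the point (1, 1.1) is assigned to class A and the point (0, 0.1) to class B. For ease of illustration the values are chosen arbitrarily and no axis labels are given.
Figure: kNN, a simple example with four data points.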

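2. Implementing the kNN classification algorithm

The code flow is:
For each point in the dataset whose class is unknown, do the following in turn:

Compute the distance between every point in the labeled dataset and the current point
Sort the distances in increasing order
Select the k points closest to the current point
Determine how often each class occurs among these k points
Return the most frequent class among the k points as the predicted class of the current point

The classify0 function: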

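def classify0(inX,dataSet,labels,k):
    dataSetSize = dataSet.shape[0]
    # Repeat inX once per training row and subtract to get per-feature differences
    diffMat = tile(inX,(dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    # Euclidean distance from inX to every training point
    distances = sqDistances ** 0.5
    # Indices that sort the distances in increasing order
    sortedDistIndicies = distances.argsort()
    classCount = {}
    # Count the labels of the k nearest points
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0)+1
    # Sort (label, count) pairs by count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]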

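Parameters:

inX: the input vector to classify
dataSet: the training sample set
labels: the vector of labels
k: the number of nearest neighbors to use

The labels vector has as many elements as the dataSet matrix has rows. The program uses the Euclidean distance formula to compute the distance between two vectors xA and xB: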

$$ d = \sqrt{(xA_0 - xB_0)^2 + (xA_1 - xB_1)^2} $$

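After the distances are computed, the data is sorted in increasing order and the majority class among the k elements with the smallest distances is determined; the input k is always a positive integer. Finally, the classCount dictionary is broken into a list of tuples, which is sorted by the second element of each tuple (the count) in descending order using itemgetter from the operator module imported on the second line of the program, and the label with the highest frequency is returned.
Test run: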

group , labels = createDataSet()
print(classify0([0,0],group,labels,3))
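As an alternative formulation, the same classifier can be written with NumPy broadcasting and collections.Counter, which removes the need for tile() and operator.itemgetter. This is only a sketch of an equivalent approach; the name classify_broadcast is made up and is not part of the original code.

```python
import numpy as np
from collections import Counter

def classify_broadcast(in_x, data_set, labels, k):
    # Broadcasting subtracts in_x from every row of data_set, so tile() is not needed.
    distances = np.sqrt(((data_set - np.asarray(in_x)) ** 2).sum(axis=1))
    # Indices of the k smallest distances.
    nearest = distances.argsort()[:k]
    # Counter performs the vote; most_common(1) returns the winning (label, count) pair.
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ["A", "A", "B", "B"]
print(classify_broadcast([0, 0], group, labels, 3))  # prints B
```

For the query [0, 0] the three nearest points are the two "B" points and one "A" point, so the result is B.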


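3. How to test a classifier

The error rate is a common evaluation metric: a perfect classifier has an error rate of 0, and the worst possible error rate is 1.0.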
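The error rate is simply the number of wrong predictions divided by the number of test samples. A minimal sketch, with a made-up helper name and made-up labels:

```python
def error_rate(predicted, actual):
    # Fraction of test samples whose predicted label differs from the true label.
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

print(error_rate([1, 2, 3, 1], [1, 2, 1, 1]))  # one wrong out of four: 0.25
```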

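IV. Example: improving matches on a dating site with kNN

1. Creating a scatter plot with Matplotlib

Prepare a sample dataset.

Frequent flier miles earned per year   Percentage of time spent playing video games   Liters of ice cream consumed per week   Class label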
40920   8.326976    0.953952    3
14488   7.153469    1.673904    2
26052   1.441871    0.805124    1
75136   13.147394   0.428964    1
38344   1.669788    0.134296    1
...

Code:

from numpy import *
import operator

def classify0(inX,dataSet,labels,k):
    dataSetSize = dataSet.shape[0]
    diffMat = tile(inX,(dataSetSize,1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5
    sortedDistIndicies = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel,0)+1
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1),reverse=True)
    return sortedClassCount[0][0]

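def file2matrix(filename):
    # Read all lines; the with-block closes the file afterwards
    with open(filename) as fr:
        arrayOfLines = fr.readlines()
    numberOfLines = len(arrayOfLines)
    # One row per sample, three feature columns
    returnMat = zeros((numberOfLines,3))
    classLabelVector = []
    index = 0
    for line in arrayOfLines:
        line = line.strip()
        listFromLine = line.split('\t')
        # The first three columns are the features
        returnMat[index,:] = listFromLine[0:3]
        # The last column is the integer class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    # The original listing omitted this return, so unpacking the result would fail
    return returnMat,classLabelVector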

datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')

import matplotlib
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:,1],datingDataMat[:,2])
plt.show()

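[Figure: example of the resulting scatter plot]

The sample data can be found by searching online for "datingTestSet2.txt". The scatter plot uses the second and third columns of the datingDataMat matrix, i.e. the features "percentage of time spent playing video games" and "liters of ice cream consumed per week".

Because the class labels are not used, it is hard to see any useful pattern in the plot. In general, color or other markers can be used to distinguish the classes, which makes the data easier to understand. Modify the call as follows: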

ax.scatter(datingDataMat[:,1],datingDataMat[:,2] ,15.0*array(datingLabels),15.0*array(datingLabels))

Using the class labels stored in datingLabels, the scatter plot now draws the points in different colors and sizes.

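2. Preparing the data: normalizing values

Normalization rescales features with different value ranges to a common range, such as 0 to 1 or -1 to 1. The following formula maps a feature value of any range into the interval 0 to 1:

newValue=(oldValue-min)/(max-min)

Here min and max are the smallest and largest values of that feature in the dataset. Although changing the value range adds complexity to the classifier, it is necessary for accurate results. The autoNorm() function below implements the normalization: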

def autoNorm(dataSet):
    # Column-wise minimum and maximum (one value per feature)
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # Subtract the minimum, then divide by the range, feature by feature
    normDataSet = dataSet - tile(minVals,(m,1))
    normDataSet = normDataSet/tile(ranges,(m,1))
    return normDataSet , ranges , minVals

normMat , ranges , minVals = autoNorm(datingDataMat)
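The formula can be checked on the five sample rows shown earlier. The sketch below uses NumPy broadcasting instead of tile(); the variable names are made up for the example. Without this step the flier-miles column, whose values are in the tens of thousands, would dominate any Euclidean distance.

```python
import numpy as np

# The five sample rows listed earlier (features only, class label dropped).
sample = np.array([
    [40920, 8.326976, 0.953952],
    [14488, 7.153469, 1.673904],
    [26052, 1.441871, 0.805124],
    [75136, 13.147394, 0.428964],
    [38344, 1.669788, 0.134296],
])

min_vals = sample.min(axis=0)             # per-feature minimum
ranges = sample.max(axis=0) - min_vals    # per-feature max - min
norm = (sample - min_vals) / ranges       # newValue = (oldValue - min) / (max - min)
print(norm.round(3))
```

After normalization every column has minimum 0 and maximum 1, so each feature contributes on the same scale.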

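3. Testing the algorithm

We usually use 90% of the available data as training samples and the remaining 10% to test the classifier and measure its accuracy. Create a test function: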

def datingClassTest():
    # Fraction of the data held out for testing
    hoRatio = 0.10
    # datingTestSet2.txt has integer labels, which file2matrix expects
    datingDataMat , datingLabels = file2matrix('datingTestSet2.txt')
    normMat,ranges,minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m*hoRatio)
    errorCount = 0.0
    # The first numTestVecs rows are test samples; the rest are training samples
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],
                                    datingLabels[numTestVecs:m],3)
        print ("the classifier came back with : %d,the real answer is :%d"
              %(classifierResult,datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print ("the total error rate is :%f" % (errorCount / float(numTestVecs)))

Usage:

normMat , ranges , minVals = autoNorm(datingDataMat)
datingClassTest()
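datingClassTest() takes the first 10% of the rows as the test set, which is reasonable only if the rows of the file are in no particular order. If that is in doubt, the rows can be shuffled before splitting. A minimal sketch, assuming a hypothetical helper holdout_split and made-up data:

```python
import numpy as np

def holdout_split(data, labels, test_ratio=0.10, seed=0):
    # Shuffle row indices, then hold out the first test_ratio share for testing.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(data))
    n_test = int(len(data) * test_ratio)
    test_idx, train_idx = order[:n_test], order[n_test:]
    labels = np.asarray(labels)
    return data[train_idx], labels[train_idx], data[test_idx], labels[test_idx]

# 20 made-up samples with 2 features each and labels 1..3.
data = np.arange(40, dtype=float).reshape(20, 2)
labels = [i % 3 + 1 for i in range(20)]
X_train, y_train, X_test, y_test = holdout_split(data, labels)
print(X_train.shape, X_test.shape)
```

Every sample ends up in exactly one of the two sets, and a fixed seed makes the split repeatable.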

4. Completing the program into a working whole

def classifyPerson():
    resultList = ['not at all','in small doses','in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat,datingLabels = file2matrix('datingTestSet2.txt')
    normMat,ranges ,minVals=autoNorm(datingDataMat)
    # Feature order must match the columns of the data file
    inArr = array([ffMiles,percentTats,iceCream])
    # Normalize the input with the training min/range before classifying
    classifierResult = classify0((inArr-minVals)/ranges,normMat,datingLabels,3)
    print ("You will probably like this person:",resultList[classifierResult - 1])

classifyPerson()


==========================================================================

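I. Basic idea

The K-nearest neighbors algorithm works as follows: given a training dataset and a new input instance, find the K instances in the training set that are nearest to it; whichever class the majority of these K instances belong to is the class assigned to the input instance.
Put more plainly, you look for your "nearest" companions and infer your own class from theirs. For example, taking personality and past deeds as features, suppose that among the 10 people nearest to you (k=10 here) 8 are doctors and 2 are robbers. You are then more likely to be a doctor, so you are put in the doctor class. That is the idea of K-nearest neighbors.
The idea is very simple and intuitive; it matches human intuition and is easy to understand.
That is all there is to the core idea of the algorithm.
The choice of K, the distance metric, and the classification decision rule are the three basic elements of the K-nearest neighbors method.
It follows from this idea that the algorithm depends on a way to express the "distance" between features.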
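The doctor and robber example corresponds to the majority-vote decision rule. A minimal sketch, with a made-up helper name majority_vote:

```python
from collections import Counter

def majority_vote(neighbor_labels):
    # Return the winning label and its share of the votes.
    label, count = Counter(neighbor_labels).most_common(1)[0]
    return label, count / len(neighbor_labels)

# The 10 nearest neighbors from the example: 8 doctors and 2 robbers.
neighbors = ["doctor"] * 8 + ["robber"] * 2
label, share = majority_vote(neighbors)
print(label, share)
```

The vote returns "doctor" with a share of 0.8, which is the reasoning in the paragraph above expressed as code.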

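II. Hands-on

The dataset in this part is the KNN dating analysis from "Machine Learning in Action"; parts of the code have been rewritten in my own style.

First, the data-reading part (data.py):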

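import numpy as np 

def creatData(filename):
    # Open the file and read all of its lines into a list
    with open(filename) as file:
        lines=file.readlines()
    sizeOfRecord=len(lines)

    # Initialize the dataset matrix and the label list
    group=np.zeros((sizeOfRecord,3))
    labels=[]
    row=0
    # Read the file line by line into the 2-D array (a pattern worth remembering)
    for line in lines:
        line=line.strip()
        tempList=line.split('\t')
        # The first three columns are the features
        group[row,:]=tempList[:3]

        # The last column is the class label (kept as a string)
        labels.append(tempList[-1])
        row+=1
    return group,labels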

Next is the KNN algorithm module (KNN.py):

import numpy as np 
# Classification function (the core)
def classify(testdata,dataset,labels,k):
    dataSize=dataset.shape[0]
    testdata=np.tile(testdata,(dataSize,1))
    # Compute the distances and get the indices that sort them in increasing order
    distance=(((testdata-dataset)**2).sum(axis=1))**0.5
    index=distance.argsort()

    classCount={}
    for i in range(k):
        label=labels[index[i]]
        classCount[label]=classCount.get(label,0)+1

    sortedClassCount=sorted(list(classCount.items()),
        key=lambda d:d[1],reverse=True)

    return sortedClassCount[0][0]


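# Normalization function (expects an already prepared matrix containing only the data)
def norm(dataset):
    # Passing axis 0 to sum/min/max works per column and yields a 1 x M array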
    minValue=dataset.min(0)
    maxValue=dataset.max(0)

    m=dataset.shape[0]
    return (dataset-np.tile(minValue,(m,1)))/np.tile(maxValue-minValue,(m,1))


# Test function
def classifyTest(testdataset,dataset,dataset_labels,
                testdataset_labels,k):
    sampleAmount=testdataset.shape[0]

    # Normalize the test set and the training set
    # (each is scaled by its own min/max, which matches only when both come from the same data, as in run.py)
    testdataset=norm(testdataset)
    dataset=norm(dataset)
    # Run the test
    numOfWrong=0
    for i in range(sampleAmount):
        # Classify once and reuse the result
        result=classify(testdataset[i],dataset,dataset_labels,k)
        print("the real kind is:",testdataset_labels[i])
        print("the result kind is:",result)
        if testdataset_labels[i]==result:
            print("correct!!")
        else:
            print("Wrong!!")
            numOfWrong+=1
        print()

    # Number of misclassified samples
    print(numOfWrong)

Plotting module (drawer.py):

import matplotlib.pyplot as plt
import numpy as np 
from mpl_toolkits.mplot3d import Axes3D
import data 

def drawPlot(dataset,labels):
    fig=plt.figure(1)
    ax=fig.add_subplot(111,projection='3d')
    for i in range(dataset.shape[0]):
        x=dataset[i][0]
        y=dataset[i][1]
        z=dataset[i][2]
        if labels[i]=='largeDoses':
            ax.scatter(x,y,z,c='b',marker='o')
        elif labels[i]=='smallDoses':
            ax.scatter(x,y,z,c='r',marker='s')
        else:
            ax.scatter(x,y,z,c='g',marker='^')
    plt.show()

Test module (run.py):

import data
import KNN 
import drawer

# Note: the test set and the training set here are the same dataset
dataset,labels=data.creatData("datingTestSet.txt")
testdata_set,testdataset_labels=data.creatData("datingTestSet.txt")

print(type(dataset[0][0]))
# Test the classifier with K=10
KNN.classifyTest(testdata_set,dataset,labels,testdataset_labels,10)

# Plot the distribution of the training set
drawer.drawPlot(dataset,labels)


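III. Strengths and weaknesses

As the code above shows, the K-nearest neighbors method has no explicit learning phase: the dataset must first be stored, and classification is then done by comparing against it. In effect, the method uses the training set to partition the feature vector space, and that partition serves as its classification model.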
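The absence of a learning phase can be made concrete with a small class whose fit() does nothing but store the data. This is an illustrative sketch only; the class name LazyKNN and its methods are made up and do not correspond to any library API.

```python
import numpy as np
from collections import Counter

class LazyKNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the data; nothing is estimated here.
        self.X = np.asarray(X, dtype=float)
        self.y = list(y)
        return self

    def predict_one(self, query):
        # All the work happens at prediction time: distances, sort, vote.
        diffs = self.X - np.asarray(query, dtype=float)
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = np.argsort(dists)[: self.k]
        return Counter(self.y[i] for i in nearest).most_common(1)[0][0]

# With k=1 and two stored points, the feature space is split into two regions:
# everything closer to (0, 0) is "B", everything closer to (4, 4) is "A".
model = LazyKNN(k=1).fit([[0, 0], [4, 4]], ["B", "A"])
print(model.predict_one([1, 1]), model.predict_one([3, 3]))
```

The stored points alone determine which region of the feature space maps to which class, which is the sense in which the partition acts as the model.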
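Strengths:

The majority-vote rule is equivalent to empirical risk minimization.
High accuracy, insensitive to outliers, no assumptions about the input data

Weaknesses:

A K that is too small makes the overall model more complex and prone to overfitting, whereas a K that is too large tends to ignore much of the useful information in the training instances, which is also undesirable. In practice one starts with a fairly small value and usually picks the best K by cross-validation.
Computationally expensive, high memory usage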
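Both failure modes of K can be seen on a tiny made-up 1-D dataset. In the sketch below (helper name and data are assumptions for illustration), class 0 has five points plus one mislabeled point at 0.5, and class 1 is a small cluster of three points.

```python
import numpy as np
from collections import Counter

def knn_predict(query, X, y, k):
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(y[i] for i in nearest).most_common(1)[0][0]

# Made-up data: five class-0 points near 0, one noisy point at 0.5 labeled 1,
# and a small class-1 cluster around 3.
X = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [0.5], [3.0], [3.2], [3.4]])
y = [0, 0, 0, 0, 0, 1, 1, 1, 1]

noisy_query = np.array([0.52])     # sits inside the class-0 cluster
minority_query = np.array([3.1])   # sits inside the class-1 cluster

print(knn_predict(noisy_query, X, y, k=1))     # follows the noisy point
print(knn_predict(noisy_query, X, y, k=5))     # the surrounding class-0 points win
print(knn_predict(minority_query, X, y, k=3))  # local neighbors are all class 1
print(knn_predict(minority_query, X, y, k=9))  # every point votes, class 0 wins 5 to 4
```

With K=1 the prediction copies the noisy label, and with K equal to the whole dataset the larger class outvotes the local cluster; intermediate values give the expected answers in both cases.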

This article is excerpted from:

https://blog.csdn.net/xundh/article/details/73611249

https://blog.csdn.net/helloworld6746/article/details/50817427

https://blog.csdn.net/xierhacker/article/details/61914468

