k-Nearest Neighbors Algorithm: Annotated Code and Detailed Explanation

Reposted article, 2018-10-03 16:47. Notes on "Machine Learning in Action".

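The code here is based on Python 3.

The detailed explanation comes first,

and the complete annotated code is attached at the end.

My blog: https://www.cnblogs.com/EvilAnne/

If I stay diligent, I will keep updating until the whole book is covered.

Otherwise, I will give up!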

①dataSetSize = dataSet.shape[0]

This counts how many samples (rows) the data set contains. For example, ([[1,3],[3,4]]) holds two samples, which is equivalent to

| x1 | x2 |

| 1 | 3 |

| 3 | 4 |

dataSetSize = dataSet.shape[0] = 2
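As a quick check of the row count, here is a minimal sketch that builds the same two-sample array with NumPy. It assumes NumPy is installed; the snake_case names are only illustrative and are not the book's.

```python
import numpy as np

# Two samples (rows) with two features (columns) each.
data_set = np.array([[1, 3], [3, 4]])

# shape is (rows, columns); index 0 is the number of samples.
data_set_size = data_set.shape[0]
print(data_set.shape)   # (2, 2)
print(data_set_size)    # 2
```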

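②diffMat = tile(inX,(dataSetSize,1)) - dataSet

This computes the difference between the unknown sample and every known sample. For convenience,

the unknown sample is first expanded into a matrix. Suppose the unknown sample is ([[0,3]]),

that is | x1' | x2' |

| 0 | 3 |

We want to compute 0-1, 3-3; 0-3, 3-4.

This is easier in matrix form:

| 0 3 | - | 1 3 | = | -1 0 | (subtracting dataSet gives the differences)

| 0 3 | | 3 4 | | -3 -1 |

Here tile(inX,(dataSetSize,1)) is the part that

turns [0,3] -> | 0 3 | so that the subtraction lines up row by row.

| 0 3 |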
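The subtraction can be reproduced in a few lines. This is a sketch assuming NumPy; the broadcasting variant at the end is an alternative I added for comparison, not what the walkthrough's code does.

```python
import numpy as np

data_set = np.array([[1, 3], [3, 4]])
in_x = np.array([0, 3])

# Repeat in_x once per row of data_set, then subtract element-wise.
tiled = np.tile(in_x, (data_set.shape[0], 1))
diff_mat = tiled - data_set
print(tiled.tolist())      # [[0, 3], [0, 3]]
print(diff_mat.tolist())   # [[-1, 0], [-3, -1]]

# Broadcasting stretches in_x across the rows without an explicit copy.
diff_broadcast = in_x - data_set
print(np.array_equal(diff_mat, diff_broadcast))  # True
```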

③sqDiffMat = diffMat ** 2

This squares every difference, that is

| -1 0 | ** 2 = | 1 0 |

| -3 -1 | | 9 1 |

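④sqDistances = sqDiffMat.sum(axis=1)

This sums each row, giving the squared distance (before the square root is taken) between the unknown sample and each of the two known samples.

The result is [1,10]: 1 is the squared distance between the unknown sample and known sample ①, and 10 is the squared distance to known sample ②.

⑤distances = sqDistances ** 0.5

Taking the square root of each of the two values gives

the Euclidean distance from the unknown sample to known sample ① and the Euclidean distance to known sample ②.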
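Steps ③ to ⑤ can be checked together. The sketch below assumes NumPy and starts from the difference matrix computed earlier; the comparison with `np.linalg.norm` is an extra sanity check of my own.

```python
import numpy as np

diff_mat = np.array([[-1, 0], [-3, -1]])

sq_diff_mat = diff_mat ** 2              # element-wise square
sq_distances = sq_diff_mat.sum(axis=1)   # add up each row
distances = sq_distances ** 0.5          # square root of each row sum

print(sq_diff_mat.tolist())          # [[1, 0], [9, 1]]
print(sq_distances.tolist())         # [1, 10]
print(distances.round(4).tolist())   # [1.0, 3.1623]

# The row-wise Euclidean norm gives the same distances.
print(np.allclose(distances, np.linalg.norm(diff_mat, axis=1)))  # True
```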

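⑥sortedDistIndicies = distances.argsort()

For example,

[3,5,1] ---> [2,0,1], the indices in ascending order of value.

From smallest to largest the values are 1, 3, 5, and their indices are 2, 0, 1.

This way the loop can walk through the Euclidean distances from smallest to largest.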
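A small sketch of the same example, assuming NumPy: `argsort` returns positions rather than values, and indexing with those positions yields the values in ascending order.

```python
import numpy as np

distances = np.array([3, 5, 1])
sorted_indices = distances.argsort()   # positions that would sort the array

print(sorted_indices.tolist())              # [2, 0, 1]
# Indexing with the result visits the values from smallest to largest.
print(distances[sorted_indices].tolist())   # [1, 3, 5]
```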

------------------------------------------------------------------------------------

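Iterate over the labels that belong to the first k entries of the distance-sorted index list.

⑨classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1

For example, k = {}; k['A'] = k.get('A',1) gives k = {'A':1},

while k = {}; k['A'] = k.get('A',1) + 1 gives k = {'A':2}.

At first I did not quite understand why +1 is needed here,

so let's experiment to see what ⑨ actually does.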

for i in range(k): # k is a value we choose ourselves

    voteIlabel = labels[sortedDistIndicies[i]] # the label of the i-th nearest sample, looked up through the distance-sorted index list

    classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

Suppose labels = ['A','A','B','B']

Suppose sortedDistIndicies = [0,3,2,1]

Let k = 4

---------

When i = 0,

voteIlabel = labels[sortedDistIndicies[0]]

= labels[0]

= 'A'

classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

↓

classCount['A'] = classCount.get('A', 0) + 1

So classCount = {'A':1}

---------

i = 1

voteIlabel = labels[sortedDistIndicies[1]]

= labels[3]

= 'B'

classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

↓

classCount['B'] = classCount.get('B', 0) + 1

So classCount = {'A':1,'B':1}

---------

i = 2

voteIlabel = labels[sortedDistIndicies[2]]

= labels[2]

= 'B'

classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

↓

classCount['B'] = classCount.get('B', 0) + 1

So classCount = {'A':1,'B':2}

---------

i = 3

voteIlabel = labels[sortedDistIndicies[3]]

= labels[1]

= 'A'

classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1

↓

classCount['A'] = classCount.get('A', 0) + 1

So classCount = {'A':2,'B':2}

---------

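As the trace shows, what gets accumulated is how many times each class appears.

So ⑨ tallies the frequency of each label among the first k entries of the distance-sorted index list.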
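The experiment can be run directly. This sketch uses the assumed labels and index list from the trace (with snake_case names of my own) and records the dictionary after every vote. It also shows why the `+ 1` is there: `get(label, 0)` returns the count so far, or 0 for a label not seen yet, and adding 1 registers the current vote. `collections.Counter` from the standard library is shown only as an equivalent shortcut, not as the book's approach.

```python
from collections import Counter

labels = ['A', 'A', 'B', 'B']
sorted_dist_indicies = [0, 3, 2, 1]
k = 4

class_count = {}
history = []
for i in range(k):
    vote_label = labels[sorted_dist_indicies[i]]
    # Count so far (0 if unseen) plus one for the current vote.
    class_count[vote_label] = class_count.get(vote_label, 0) + 1
    history.append(dict(class_count))   # snapshot after this vote

for step in history:
    print(step)
print(class_count)   # {'A': 2, 'B': 2}

# Counter performs the same tally in one call.
counter = Counter(labels[idx] for idx in sorted_dist_indicies[:k])
print(dict(counter) == class_count)   # True
```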

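Suppose classCount = {'A':2,'B':2}

classCount.items() returns a dict_items object.

operator.itemgetter() returns a function.

Used as the sort key, operator.itemgetter(1) makes sorted order the tuples by their second element; reverse=True reverses the order, so the result runs from largest to smallest.

itemgetter(1) means: take the item at index 1 (counting from 0) of each entry in dict_items.

So what sorted does here is:

classCount.items() breaks the classCount dictionary into (key, value) tuples,

that is, from

{'A':2,'B':2}

to

('A',2), ('B',2),

and sorted returns them as a list ordered by the second element from largest to smallest.

Because both counts here are 2, the effect is not obvious; try changing the numbers yourself.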
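To make the ordering visible, here is a sketch with different counts. The dictionary contents are made up for illustration; `operator.itemgetter` and `sorted` are used as in the statement being explained.

```python
import operator

class_count = {'A': 1, 'B': 3, 'C': 2}   # illustrative counts only

pairs = list(class_count.items())        # (key, value) tuples
print(pairs)                             # [('A', 1), ('B', 3), ('C', 2)]

get_count = operator.itemgetter(1)       # callable returning element 1
print(get_count(('B', 3)))               # 3

sorted_class_count = sorted(class_count.items(), key=get_count, reverse=True)
print(sorted_class_count)         # [('B', 3), ('C', 2), ('A', 1)]
print(sorted_class_count[0][0])   # B
```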

Finally, return sortedClassCount[0][0]

This is easy to understand: it is the class label that appears most often.
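Putting the steps together, the classifier might look like the sketch below. It assumes NumPy; the function name `classify0`, the snake_case variable names and the toy data are my own choices for illustration, so treat it as one possible assembly of the steps above rather than the original listing.

```python
import operator

import numpy as np


def classify0(in_x, data_set, labels, k):
    """Return the majority label among the k rows of data_set nearest to in_x."""
    data_set_size = data_set.shape[0]
    # Difference between in_x and every known sample.
    diff_mat = np.tile(in_x, (data_set_size, 1)) - data_set
    # Euclidean distance per row.
    sq_distances = (diff_mat ** 2).sum(axis=1)
    distances = sq_distances ** 0.5
    # Indices of the samples from nearest to farthest.
    sorted_dist_indicies = distances.argsort()
    # Tally the labels of the k nearest samples.
    class_count = {}
    for i in range(k):
        vote_label = labels[sorted_dist_indicies[i]]
        class_count[vote_label] = class_count.get(vote_label, 0) + 1
    # Most frequent label first.
    sorted_class_count = sorted(class_count.items(),
                                key=operator.itemgetter(1), reverse=True)
    return sorted_class_count[0][0]


# Toy data assumed for illustration.
group = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify0([0.0, 0.0], group, labels, 3))   # B
print(classify0([1.0, 1.2], group, labels, 3))   # A
```

On a tie such as {'A':2,'B':2}, `sorted` is stable, so the label that entered the dictionary first (the label of the nearest sample) is returned.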

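file2matrix: data preprocessing function

string.strip([chars])

strip: removes characters from both ends of a string; by default these are whitespace characters (including \n, \r, \t and ' ', i.e. newline, carriage return, tab and space).

The chars argument is optional; when it is omitted, whitespace (including \n, \r, \t and ' ') is removed from both ends of the string.

split([char])

split(): splits a string at the given separator and returns the pieces as a list of strings.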
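A short sketch of both methods on a tab-separated line. The sample line and its field values are invented for illustration and are not taken from the data file.

```python
line = "  40920\t8.326976\t0.953952\tlargeDoses\n"   # invented sample line

stripped = line.strip()          # drops the leading spaces and the newline
fields = stripped.split('\t')    # cut at every tab
print(fields)   # ['40920', '8.326976', '0.953952', 'largeDoses']

print("xxhixx".strip('x'))    # hi
print("a,b,,c".split(','))    # ['a', 'b', '', 'c']
print("  a  b ".split())      # ['a', 'b']
```

Note that strip only touches the two ends of the string, so the tabs between fields survive until split is called.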

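At first I did not understand the row assignment in this function.

Instead of searching Baidu right away, I experimented with it myself.

The experiment shows that returnMat[index,:], with index = 0, selects the first row of the matrix. Because the matrix was initialized with zeros,

the row is [0,0,0].

As for the slice on the other side of the assignment,

the experiment shows that it takes the first three values of the line,

that is, it reads the feature vector.

So

the statement as a whole means:

store the feature vector read from the file in that row.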
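The experiment can be repeated without a data file. In this sketch the two tab-separated records, the three-feature layout and all variable names are assumptions made for illustration. The explicit `float` conversion is my own addition to keep the types obvious.

```python
import numpy as np

lines = ["1\t2\t3\tA", "4\t5\t6\tB"]     # hypothetical records: 3 features + label

return_mat = np.zeros((len(lines), 3))   # one row per record, all zeros at first
print(return_mat[0, :].tolist())         # [0.0, 0.0, 0.0]

class_labels = []
for index, line in enumerate(lines):
    list_from_line = line.strip().split('\t')
    # Left side: row `index` of the matrix. Right side: the first three fields.
    return_mat[index, :] = [float(v) for v in list_from_line[0:3]]
    class_labels.append(list_from_line[-1])

print(return_mat.tolist())   # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print(class_labels)          # ['A', 'B']
```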

This string method returns True if the string contains only digits, and False otherwise.
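The description matches Python's built-in `str.isdigit()`. Since the method name is missing from the text, that this is the method meant is an assumption. The helper `parse_label` and its `name_to_code` mapping are hypothetical, added only to show how such a check could decide between a numeric label and a named one.

```python
print("3".isdigit())            # True
print("42".isdigit())           # True
print("largeDoses".isdigit())   # False
print("-1".isdigit())           # False (the minus sign is not a digit)
print("3.5".isdigit())          # False (neither is the decimal point)
print("".isdigit())             # False (an empty string has no digits)


def parse_label(field, name_to_code):
    """Return an int label from a digit string, else look the name up."""
    if field.isdigit():
        return int(field)
    return name_to_code[field]


codes = {'smallDoses': 2, 'largeDoses': 3}   # hypothetical mapping
print(parse_label("3", codes))            # 3
print(parse_label("largeDoses", codes))   # 3
```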

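img2vector

I did not understand the indexing statement here at first; experimenting shows that it addresses element (32*i+j) of row 0.

We know that returnVect is a 1 * 1024 zero matrix,

and linestr is each line of the file as it is read in turn.

So overall, the code converts the original 32 * 32 data

into 1 * (32*32) = 1 * 1024 data.

For example, take

| 1 2 3 |

| 4 5 6 |

| 7 8 9 |

| 9 9 9 |

this 4 * 3 matrix.

I convert it into a 1 * (4*3) = 1 * 12 vector,

which is [1,2,3,4,5,6,7,8,9,9,9,9]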
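The index arithmetic is easier to see on the small 4 * 3 case. This sketch assumes NumPy and feeds the rows in as strings of digit characters, similar to lines read from a text file; the names are illustrative. With 32 columns the position formula becomes 32*i+j.

```python
import numpy as np

rows = ["123", "456", "789", "999"]   # 4 lines of 3 digit characters each
n_rows, n_cols = 4, 3

return_vect = np.zeros((1, n_rows * n_cols))
for i in range(n_rows):
    line_str = rows[i]
    for j in range(n_cols):
        # Row i, column j lands at position n_cols*i + j of the flat vector.
        return_vect[0, n_cols * i + j] = int(line_str[j])

print(return_vect[0].astype(int).tolist())
# [1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 9, 9]

# reshape flattens in the same row-by-row order.
grid = np.array([[int(ch) for ch in row] for row in rows])
print(np.array_equal(grid.reshape(1, -1), return_vect))   # True
```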

handwritingClassTest function

Let me explain the following code.

In fact, looking at test is enough to understand it.

Complete code and comments:
