k-means: principle and Python implementation


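k-means is an unsupervised clustering algorithm.

k is the number of clusters, i.e. the data is to be grouped into k clusters; "means" refers to the mean, which is the rule used to update the cluster centers on each iteration.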
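To illustrate why the mean is a sensible update rule, here is a minimal sketch, assuming Euclidean distance: for a fixed set of points, the mean is the center that minimizes the sum of squared distances to them. The helper name `sse` and the random sample points are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 2))  # one hypothetical cluster of 2-D samples


def sse(center, pts):
    """Sum of squared Euclidean distances from pts to center."""
    return float(np.sum((pts - center) ** 2))


mean_center = points.mean(axis=0)
best = sse(mean_center, points)

# Moving the center away from the mean in any direction increases the cost.
for offset in ([0.1, 0.0], [0.0, -0.1], [0.3, 0.3]):
    shifted = mean_center + np.array(offset)
    print(offset, sse(shifted, points) > best)
```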

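The idea behind the k-means algorithm:

(1) Randomly choose k cluster centers (usually picked from the sample set, although they can also be generated randomly);

(2) Compute the distance from every sample to each of the k cluster centers, and assign the sample to the cluster whose center is nearest;

(3) Update the centers: for each of the k clusters, compute the mean of the samples assigned to it and use that as the new center;

(4) Repeat (2) and (3) until the cluster centers stop changing, or the change in center position falls within a threshold, or a maximum number of iterations is reached.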
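The four steps above can be sketched in vectorized NumPy roughly as follows. This is only an illustration under stated assumptions: the function name `kmeans_sketch` and the parameters `tol`, `max_iter` and `seed` are hypothetical, Euclidean distance is assumed, and an empty cluster simply keeps its previous center.

```python
import numpy as np


def kmeans_sketch(data, k, tol=1e-4, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance.

    Returns (centers, labels), where labels come from the last assignment step.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # (1) pick k distinct samples as the initial centers
    centers = data[rng.choice(len(data), size=k, replace=False)].copy()
    labels = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):  # (4) cap on the number of iterations
        # (2) distance from every sample to every center, shape (n, k)
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) move each center to the mean of its members
        new_centers = centers.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centers - centers, axis=1).max()
        centers = new_centers
        if shift <= tol:  # (4) centers moved less than the threshold
            break
    return centers, labels


# Usage on two well-separated hypothetical blobs
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=(5.0, 5.0), scale=0.3, size=(20, 2))
centers, labels = kmeans_sketch(np.vstack([blob_a, blob_b]), k=2)
print(centers)
print(labels)
```

Stopping on the largest center shift covers both the "centers no longer change" case (a shift of zero) and the threshold case from step (4).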

Python implementation:

A simple k-means example:

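import numpy as np

# 30 random 2-D integer points with coordinates in [1, 10)
data = np.random.randint(1, 10, (30, 2))
k = 4
# Initial centers: shuffle the samples and take the first k.
# Copy as float so that updating a center neither overwrites the samples
# nor truncates the mean to an integer.
np.random.shuffle(data)
cent = data[:k].astype(float)
# distance[i, j] is the distance from sample i to center j
distance = np.zeros((data.shape[0], k))
# Assignments from the previous iteration; -1 means "not assigned yet"
last_near = np.full(data.shape[0], -1)
n = 0
while True:
    n = n + 1
    print(n)
    for i in range(data.shape[0]):
        for j in range(cent.shape[0]):
            distance[i, j] = np.sqrt(np.sum((data[i] - cent[j]) ** 2))
    nearest = np.argmin(distance, axis=1)
    # Stop once no sample changes cluster
    if (last_near == nearest).all():
        break
    # Update each center to the mean of its members
    for ele_cen in range(k):
        members = data[nearest == ele_cen]
        if len(members) > 0:  # keep the old center if a cluster is empty
            cent[ele_cen] = members.mean(axis=0)
    last_near = nearest
print(cent)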
The following example uses a distance measure suited to selecting anchor boxes for YOLOv3:

import numpy as np


def iou(box, clusters):
    """
    Calculates the Intersection over Union (IoU) between a box and k clusters.
    :param box: tuple or array, shifted to the origin (i.e. width and height)
    :param clusters: numpy array of shape (k, 2) where k is the number of clusters
    :return: numpy array of shape (k,) where k is the number of clusters
    """
    x = np.minimum(clusters[:, 0], box[0])
    y = np.minimum(clusters[:, 1], box[1])
    if np.count_nonzero(x == 0) > 0 or np.count_nonzero(y == 0) > 0:
        raise ValueError("Box has no area")
    intersection = x * y
    box_area = box[0] * box[1]
    cluster_area = clusters[:, 0] * clusters[:, 1]
    iou_ = intersection / (box_area + cluster_area - intersection)
    return iou_

def kmeans(boxes, k, dist=np.median):
    """
    Calculates k-means clustering with the Intersection over Union (IoU) metric.
    :param boxes: numpy array of shape (r, 2), where r is the number of rows
    :param k: number of clusters
    :param dist: function used to recompute a cluster center from its members (np.median by default)
    :return: numpy array of shape (k, 2)
    """
    rows = boxes.shape[0]

    # distances[i, j] holds the distance from sample i to cluster center j
    distances = np.empty((rows, k))
    # cluster index assigned to each sample in the previous iteration
    last_clusters = np.zeros((rows,))

    np.random.seed()

    # the Forgy method will fail if the whole array contains the same rows
    # pick k distinct samples at random as the initial cluster centers
    clusters = boxes[np.random.choice(rows, k, replace=False)]

    while True:
        for row in range(rows):
            # distance = 1 - IoU, the measure used when selecting YOLOv3 anchor boxes
            distances[row] = 1 - iou(boxes[row], clusters)
        nearest_clusters = np.argmin(distances, axis=1)  # index of the closest cluster
        if (last_clusters == nearest_clusters).all():  # stop when no assignment changes
            break
        for cluster in range(k):  # update the cluster centers
            # new center = dist (median by default) of the boxes assigned to this cluster
            clusters[cluster] = dist(boxes[nearest_clusters == cluster], axis=0)
        last_clusters = nearest_clusters
    return clusters
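To make the 1 - IoU distance more concrete, here is a small self-contained sketch that computes it for a few boxes against two candidate anchors, and then scores the anchors by the average best IoU. The helpers `iou_matrix` and `avg_iou` and the box sizes are hypothetical, introduced only for this illustration; like the code above, they assume every box is given as (width, height) with one corner at the origin.

```python
import numpy as np


def iou_matrix(boxes, anchors):
    """IoU between every (w, h) box and every (w, h) anchor, corner-aligned at the origin."""
    boxes = np.asarray(boxes, dtype=float)
    anchors = np.asarray(anchors, dtype=float)
    # Overlap of two origin-aligned boxes is min(width) * min(height)
    inter_w = np.minimum(boxes[:, None, 0], anchors[None, :, 0])
    inter_h = np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    inter = inter_w * inter_h
    area_b = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_a = (anchors[:, 0] * anchors[:, 1])[None, :]
    return inter / (area_b + area_a - inter)


def avg_iou(boxes, anchors):
    """Mean over boxes of the IoU with the best-matching anchor."""
    return float(iou_matrix(boxes, anchors).max(axis=1).mean())


# Hypothetical (width, height) pairs
boxes = np.array([[10, 20], [12, 18], [50, 40], [48, 44]])
anchors = np.array([[11, 19], [49, 42]])

print(1 - iou_matrix(boxes, anchors))  # distance matrix used in the assignment step
print(avg_iou(boxes, anchors))         # closer to 1 means the anchors fit the boxes better
```

The measure depends only on how well the shapes overlap, so a large box and a small box with the same relative mismatch to their anchors get similar distances, which is not the case with Euclidean distance on width and height.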

Hopefully this gives you some ideas if you have been puzzling over this!

