Machine Learning: The Mean Shift Clustering Algorithm

2018-07-09 15:38 | Machine Learning / Mean Shift / Python

This article was written by ChardLau. If you repost it, please include a link to the original: https://www.chardlau.com/mean-shift/

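Today's article shows how to cluster data using the basic form of the Mean Shift algorithm. Computing the shift vector with a kernel function is outside the scope of this article. Beyond clustering, Mean Shift is also used in areas such as computer vision. For the theory behind the algorithm, see the article linked in the original post.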

How the `Mean Shift` algorithm works

[Figure: how the Mean Shift algorithm computes the shift vector]

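The key operation in Mean Shift is to compute the shift vector of a center point from how the data density varies inside the region of interest, then move the center point and iterate again, until it reaches the location of maximum density (where the center point stops moving). This procedure can be started from every data point. Along the way, the algorithm counts how many times each data point falls inside a region of interest, and these counts serve as the basis for the final cluster assignment.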
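
As a minimal sketch of a single shift step, the snippet below computes the shift vector for one center point. The toy points, the starting center, and the radius are made-up values for illustration and do not come from the article's data set.

```python
import numpy as np

# Hypothetical toy data: four points near the origin and one far away.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
center = np.array([0.0, 0.0])
radius = 2.0

# Region of interest: every point within `radius` of the current center.
in_region = np.linalg.norm(points - center, axis=1) <= radius
neighbors = points[in_region]

# Shift vector: the mean of the vectors from the center to each neighbor.
shift = (neighbors - center).mean(axis=0)
new_center = center + shift  # same as neighbors.mean(axis=0)

print(shift)       # [0.5 0.5]
print(new_center)  # [0.5 0.5]
```

The far-away point at (5, 5) lies outside the region and has no influence, so the center moves toward the dense group of four points.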

Unlike K-Means, Mean Shift determines the number of clusters automatically. Like K-Means, it moves each center point to the mean of the data points in a set.
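
To make the comparison concrete, here is a rough sketch of one center update in each style. The four points, the two initial K-Means centers, and the radius are illustrative assumptions. This is a single update step, not a full implementation of either algorithm.

```python
import numpy as np

# Hypothetical 2-D points lying on the x-axis.
points = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])

# K-Means style update: the number of centers (here 2) is fixed in advance,
# and each center moves to the mean of the points assigned to it.
centers = np.array([[0.0, 0.0], [1.0, 0.0]])
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)
kmeans_centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])

# Mean Shift style update: no cluster count is given. A center moves to
# the mean of the points that fall inside its window of the given radius.
radius = 1.5
center = np.array([0.0, 0.0])
window = points[np.linalg.norm(points - center, axis=1) <= radius]
mean_shift_center = window.mean(axis=0)

print(kmeans_centers)
print(mean_shift_center)  # [0.5 0. ]
```

In both cases the new center is a mean. The difference is which set is averaged: the points assigned to a center in K-Means, and the points inside the window in Mean Shift.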

Algorithm steps

The steps of the Mean Shift clustering algorithm are as follows:

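1. Randomly pick one of the unmarked data points as the starting center point, center.
2. Find all data points that lie within distance radius of center and call this set M. Treat these points as belonging to one cluster C, and increment by 1 the count of how many times each of them has appeared in that cluster.
3. With center as the origin, compute the vector from center to each element of M and average these vectors to obtain the vector shift.
4. center = center + shift. That is, center moves along the direction of shift by a distance of ||shift||.
5. Repeat steps 2, 3, and 4 until shift becomes very small (that is, until the iteration converges), and remember center at that point. Note that every point encountered during this iteration should be assigned to cluster C.
6. If, at convergence, the distance between the center of the current cluster C and the center of an existing cluster C2 is below a threshold, merge C2 and C and merge their point counts accordingly. Otherwise, keep C as a new cluster.
7. Repeat steps 1 through 6 until every point has been marked as visited.
8. Classification: for each point, compare how often each cluster visited it and assign the point to the cluster with the highest visit frequency.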
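
Steps 2 through 5 for a single starting point can be sketched as follows. The helper name `shift_to_mode`, the tolerance `tol`, the iteration cap `max_iter`, and the toy data are all assumptions made for this illustration. The steps only say that shift should become "very small", so the tolerance value is a guess.

```python
import numpy as np


def shift_to_mode(data, start, radius, tol=1e-6, max_iter=100):
    """Move one center until the shift is smaller than `tol`.

    `tol` and `max_iter` are illustrative assumptions, not values from the text.
    Returns the converged center and the per-point visit counts.
    """
    center = np.asarray(start, dtype=float)
    visits = np.zeros(len(data), dtype=int)
    for _ in range(max_iter):
        # Step 2: collect the points in the window and count the visit.
        in_window = np.linalg.norm(data - center, axis=1) <= radius
        if not in_window.any():
            break  # empty window: nothing to average
        visits[in_window] += 1
        # Steps 3 and 4: shift by the mean offset of the points in the window.
        shift = (data[in_window] - center).mean(axis=0)
        center = center + shift
        # Step 5: stop once the shift is negligible.
        if np.linalg.norm(shift) < tol:
            break
    return center, visits


# Hypothetical toy data: a tight group of four points and one outlier.
data = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [8.0, 8.0]])
center, visits = shift_to_mode(data, start=data[0], radius=2.0)
print(center)  # [0.5 0.5]
print(visits)  # [2 2 2 2 0]
```

The visit counts returned here are what steps 6 and 8 would merge across clusters and compare when assigning each point to a cluster.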

Implementation

The following Python code implements the basic form of the Mean Shift algorithm:

import numpy as np
import matplotlib.pyplot as plt

# Input data set
X = np.array([
    [-4, -3.5], [-3.5, -5], [-2.7, -4.5],
    [-2, -4.5], [-2.9, -2.9], [-0.4, -4.5],
    [-1.4, -2.5], [-1.6, -2], [-1.5, -1.3],
    [-0.5, -2.1], [-0.6, -1], [0, -1.6],
    [-2.8, -1], [-2.4, -0.6], [-3.5, 0],
    [-0.2, 4], [0.9, 1.8], [1, 2.2],
    [1.1, 2.8], [1.1, 3.4], [1, 4.5],
    [1.8, 0.3], [2.2, 1.3], [2.9, 0],
    [2.7, 1.2], [3, 3], [3.4, 2.8],
    [3, 5], [5.4, 1.2], [6.3, 2]
])


def mean_shift(data, radius=2.0):
    clusters = []
    for i in range(len(data)):
        cluster_centroid = data[i]
        cluster_frequency = np.zeros(len(data))

        # Search points in circle
        while True:
            temp_data = []
            for j in range(len(data)):
                v = data[j]
                # Count every point that falls inside the circle
                if np.linalg.norm(v - cluster_centroid) <= radius:
                    temp_data.append(v)
                    cluster_frequency[j] += 1

            # Update centroid
            old_centroid = cluster_centroid
            new_centroid = np.average(temp_data, axis=0)
            cluster_centroid = new_centroid
            # Find the mode
            if np.array_equal(new_centroid, old_centroid):
                break

        # Combined 'same' clusters
        has_same_cluster = False
        for cluster in clusters:
            if np.linalg.norm(cluster['centroid'] - cluster_centroid) <= radius:
                has_same_cluster = True
                cluster['frequency'] = cluster['frequency'] + cluster_frequency
                break

        if not has_same_cluster:
            clusters.append({
                'centroid': cluster_centroid,
                'frequency': cluster_frequency
            })

    print('clusters (', len(clusters), '): ', clusters)
    clustering(data, clusters)
    show_clusters(clusters, radius)


# Clustering data using frequency
def clustering(data, clusters):
    t = []
    for cluster in clusters:
        cluster['data'] = []
        t.append(cluster['frequency'])
    t = np.array(t)
    # Clustering
    for i in range(len(data)):
        column_frequency = t[:, i]
        cluster_index = np.where(column_frequency == np.max(column_frequency))[0][0]
        clusters[cluster_index]['data'].append(data[i])


# Plot clusters
def show_clusters(clusters, radius):
    colors = 10 * ['r', 'g', 'b', 'k', 'y']
    plt.figure(figsize=(5, 5))
    plt.xlim((-8, 8))
    plt.ylim((-8, 8))
    plt.scatter(X[:, 0], X[:, 1], s=20)
    theta = np.linspace(0, 2 * np.pi, 800)
    for i in range(len(clusters)):
        cluster = clusters[i]
        data = np.array(cluster['data'])
        plt.scatter(data[:, 0], data[:, 1], color=colors[i], s=20)
        centroid = cluster['centroid']
        plt.scatter(centroid[0], centroid[1], color=colors[i], marker='x', s=30)
        x, y = np.cos(theta) * radius + centroid[0], np.sin(theta) * radius + centroid[1]
        plt.plot(x, y, linewidth=1, color=colors[i])
    plt.show()


mean_shift(X, 2.5)
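
As a side note on the `clustering` function above, its per-point lookup of the most frequent cluster matches what `argmax` does along the cluster axis. The frequency table below is a made-up example, not output from the code above.

```python
import numpy as np

# Hypothetical frequency table: one row per cluster, one column per point.
frequency = np.array([
    [5, 3, 0, 1],
    [0, 3, 4, 2],
])

# Loop form, mirroring the lookup used in `clustering`.
loop_labels = []
for i in range(frequency.shape[1]):
    column = frequency[:, i]
    loop_labels.append(int(np.where(column == np.max(column))[0][0]))

# Vectorized form: argmax also returns the first index when there is a tie.
argmax_labels = frequency.argmax(axis=0)

print(loop_labels)             # [0, 0, 1, 1]
print(argmax_labels.tolist())  # [0, 0, 1, 1]
```

Both forms send a tied point (the second column here) to the cluster that was created first.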


[Figure: result of running the code above]

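Other notes

There is much more to Mean Shift than this article covers, such as dynamically computing the radius of the region of interest and using a kernel function to compute the shift vector. As an introductory guide, this article covers only the basics for now.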
