1. Algorithm Steps
- Randomly pick k sample points to serve as the initial centers of the k clusters;
- Compute the distance from every sample point to each cluster center, then assign each point to the nearest cluster;
- Recompute each cluster center from the points currently assigned to it;
- Repeat the assignment and update steps until the cluster centers stop changing, or change only very slightly.
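The steps above can be traced by hand on a tiny dataset. The sketch below (the points and initial centers are made-up values for illustration) runs one assignment/update round:

```python
import numpy as np

# toy 2D dataset: two obvious groups (made-up values for illustration)
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
centers = np.array([[0.0, 0.5], [5.0, 5.5]])  # step 1: pick initial centers

# step 2: assign each point to its nearest center (squared Euclidean distance)
dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
labels = np.argmin(dists, axis=1)
print(labels)  # [0 0 1 1]

# step 3: recompute each center as the mean of its assigned points
new_centers = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(new_centers)  # [[0.  0.5] [5.  5.5]]
```

Here the recomputed centers equal the old ones, so step 4 would stop immediately; on real data several rounds are usually needed.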
2. Manual Python Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in modern scikit-learn

n_data = 400
n_cluster = 4
# generate training data
X, y = make_blobs(n_samples=n_data, centers=n_cluster, cluster_std=0.60, random_state=0)
# randomly initialize the cluster centers
centers = np.random.rand(n_cluster, 2) * 5
EPOCH = 10
tol = 1e-5
for epoch in range(EPOCH):
    labels = np.zeros(n_data, dtype=int)  # np.int was removed in NumPy 1.24; use the builtin int
    # compute the distance from each point to every cluster center and assign a label
    for i in range(n_data):
        distance = np.sum(np.square(X[i] - centers), axis=1)
        labels[i] = np.argmin(distance)
    # recompute the cluster centers
    new_centers = centers.copy()
    for i in range(n_cluster):
        indices = np.where(labels == i)[0]  # indices of the points in cluster i
        if len(indices) > 0:                # leave an empty cluster's center unchanged
            new_centers[i, :] = np.mean(X[indices], axis=0)  # update the center of cluster i
    # stop early once the centers barely move
    if np.sum(np.square(new_centers - centers)) < tol:
        break
    centers = new_centers
plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')
plt.show()
Output: (note: when the cluster centers are initialized poorly, the result may be wrong)
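One common way to reduce the risk of a bad start is to draw the initial centers from the data points themselves rather than uniformly at random. A minimal sketch (the helper name below is hypothetical, not part of the code above):

```python
import numpy as np

def init_centers_from_data(X, k, seed=None):
    """Pick k distinct data points as the initial cluster centers (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices].copy()

# example: 400 random 2D points, 4 initial centers
X = np.random.default_rng(0).normal(size=(400, 2))
centers = init_centers_from_data(X, 4, seed=0)
print(centers.shape)  # (4, 2)
```

Every initial center then lies inside the data cloud, so no cluster starts far from all of the points. scikit-learn's own default goes further and uses the k-means++ seeding scheme.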
3. Calling sklearn's KMeans
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in modern scikit-learn

# Generate some data
X, y = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

# kmeans clustering
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(X)               # fit the model
labels = kmeans.predict(X)  # predict the cluster of each point

plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')
plt.show()
Output:
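After fitting, the KMeans object also exposes the learned centers (`cluster_centers_`) and the inertia (`inertia_`, the sum of squared distances from each point to its nearest center), which is what the "elbow" heuristic plots when choosing k. A short sketch:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

kmeans = KMeans(n_clusters=4, random_state=0).fit(X)
print(kmeans.cluster_centers_.shape)  # (4, 2)
print(kmeans.inertia_)                # total within-cluster sum of squares

# elbow heuristic: inertia drops sharply until k reaches the true number of clusters
inertias = [KMeans(n_clusters=k, random_state=0).fit(X).inertia_ for k in range(1, 8)]
```

Plotting `inertias` against k and looking for the bend is a simple way to sanity-check the choice of 4 clusters for this dataset.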

