Clustering Evaluation Metrics (1) | Adjusted Rand Index and Silhouette Coefficient
1 Adjusted Rand Index (ARI requires ground-truth labels)
Rand index
Given a set of \(n\) objects \(S=\left\{O_{1}, O_{2}, \ldots, O_{n}\right\}\), let \(U=\left\{u_{1}, \ldots, u_{R}\right\}\) and \(V=\left\{v_{1}, \ldots, v_{C}\right\}\) be two different partitions of \(S\), satisfying \(\bigcup_{i=1}^{R} u_{i}=S=\bigcup_{j=1}^{C} v_{j}\) and \(u_{i} \cap u_{i^{*}}=\emptyset=v_{j} \cap v_{j^{*}}\) for all \(1 \leq i \neq i^{*} \leq R\), \(1 \leq j \neq j^{*} \leq C\).
Suppose \(U\) is the external reference, i.e. the true labels, and \(V\) is the clustering result. Define four statistics:
- \(a\): the number of pairs of points that are in the same class in \(U\) and in the same cluster in \(V\)
- \(b\): the number of pairs of points that are in the same class in \(U\) but in different clusters in \(V\)
- \(c\): the number of pairs of points that are in different classes in \(U\) but in the same cluster in \(V\)
- \(d\): the number of pairs of points that are in different classes in \(U\) and in different clusters in \(V\)
The Rand index is then
$$
\mathrm{RI}=\frac{a+d}{a+b+c+d}=\frac{a+d}{\binom{n}{2}}.
$$
Its value lies in \([0,1]\), and it equals 1 when the clustering matches the reference partition perfectly.
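To make the four pair counts concrete, here is a minimal brute-force sketch (the helper rand_index below is our own illustration, not a library function):

from itertools import combinations

def rand_index(labels_true, labels_pred):
    # count the pair statistics a, b, c, d by enumerating all C(n, 2) pairs
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_u = labels_true[i] == labels_true[j]
        same_v = labels_pred[i] == labels_pred[j]
        if same_u and same_v:
            a += 1  # same class in U, same cluster in V
        elif same_u:
            b += 1  # same class in U, different clusters in V
        elif same_v:
            c += 1  # different classes in U, same cluster in V
        else:
            d += 1  # different in both partitions
    return (a + d) / (a + b + c + d)

print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 0.6666666666666666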
Adjusted Rand index
The problem with the Rand index is that for two random partitions its value is not a constant close to 0. Hubert and Arabie proposed the adjusted Rand index in 1985. It assumes the (generalized hypergeometric) model of randomness as its null model: the partitions \(U\) and \(V\) are drawn at random, subject to the number of points in each class and in each cluster being fixed.
Let \(n_{ij}\) denote the number of points that belong to both class \(u_{i}\) and cluster \(v_{j}\), let \(n_{i\cdot}\) be the number of points in class \(u_{i}\), and \(n_{\cdot j}\) the number of points in cluster \(v_{j}\), as in the following contingency table:

| | \(v_1\) | \(v_2\) | \(\cdots\) | \(v_C\) | sums |
|---|---|---|---|---|---|
| \(u_1\) | \(n_{11}\) | \(n_{12}\) | \(\cdots\) | \(n_{1C}\) | \(n_{1\cdot}\) |
| \(u_2\) | \(n_{21}\) | \(n_{22}\) | \(\cdots\) | \(n_{2C}\) | \(n_{2\cdot}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
| \(u_R\) | \(n_{R1}\) | \(n_{R2}\) | \(\cdots\) | \(n_{RC}\) | \(n_{R\cdot}\) |
| sums | \(n_{\cdot 1}\) | \(n_{\cdot 2}\) | \(\cdots\) | \(n_{\cdot C}\) | \(n\) |
The adjusted Rand index is:
$$
\mathrm{ARI}=\frac{\sum_{i,j}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2}+\sum_{j}\binom{n_{\cdot j}}{2}\right]-\left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}
$$
ARI is a mean-centered, normalized version of RI, i.e. \(\mathrm{ARI}=(\mathrm{RI}-E[\mathrm{RI}])/(\max\mathrm{RI}-E[\mathrm{RI}])\); note that the count \(a\) in RI can be written as \(\sum_{i, j}\binom{n_{ij}}{2}\). ARI \(\in[-1,1]\), and larger values mean the clustering result agrees better with the ground truth. In a broader sense, ARI measures how well two assignments of the data agree.
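To see the formula at work, the following sketch builds the contingency table \(n_{ij}\) and evaluates ARI directly, then checks the result against sklearn (ari_from_contingency is a helper written only for this illustration):

import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score

def ari_from_contingency(labels_true, labels_pred):
    # build the R x C contingency table n_ij
    _, class_idx = np.unique(labels_true, return_inverse=True)
    _, cluster_idx = np.unique(labels_pred, return_inverse=True)
    n_ij = np.zeros((class_idx.max() + 1, cluster_idx.max() + 1), dtype=int)
    for i, j in zip(class_idx, cluster_idx):
        n_ij[i, j] += 1
    sum_ij = comb(n_ij, 2).sum()              # sum_{i,j} C(n_ij, 2)
    sum_i = comb(n_ij.sum(axis=1), 2).sum()   # class (row) marginals
    sum_j = comb(n_ij.sum(axis=0), 2).sum()   # cluster (column) marginals
    expected = sum_i * sum_j / comb(len(labels_true), 2)
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(ari_from_contingency(labels_true, labels_pred))  # 0.24242424...
print(adjusted_rand_score(labels_true, labels_pred))   # same value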
- Advantages:
  - For any number of cluster centers and samples, a random clustering has an ARI very close to 0;
  - The value lies in \([-1, 1]\); negative values indicate a poor result, and the closer to 1 the better;
  - It can be used to compare different clustering algorithms.
- Disadvantage:
  - ARI requires the ground-truth labels.
Python implementation
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# basic usage
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# independent of the label names
labels_pred = [1, 1, 0, 0, 3, 3]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# symmetric in its two arguments
score = metrics.adjusted_rand_score(labels_pred, labels_true)
print(score)  # 0.24242424242424246

# values close to 1 are best
labels_pred = labels_true[:]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 1.0
References
https://blog.csdn.net/qq_42887760/article/details/105728101
https://blog.csdn.net/sinat_30203515/article/details/82634778
2 Silhouette Coefficient (ground-truth labels unknown)
The silhouette coefficient is suitable when the true class labels are unknown. For a single sample, let \(a\) be the mean distance to the other samples in its own cluster, and \(b\) the mean distance to the samples in the nearest other cluster. The sample's silhouette coefficient is
$$
s=\frac{b-a}{\max(a, b)}
$$
For a set of samples, the silhouette coefficient is the mean of the per-sample values. It ranges over \([-1, 1]\): the closer together same-cluster samples are and the farther apart different-cluster samples are, the higher the score.
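A rough per-sample sketch of this definition (our own helper, not the optimized sklearn implementation):

import numpy as np

def silhouette_sample(X, labels, i):
    # s(i) = (b - a) / max(a, b), straight from the definition
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X - X[i], axis=1)   # distances from sample i to all samples
    own = labels == labels[i]
    own[i] = False                            # exclude the sample itself
    if not own.any():
        return 0.0                            # singleton cluster: score 0 by convention
    a = dist[own].mean()                      # mean intra-cluster distance
    b = min(dist[labels == k].mean()          # mean distance to the nearest other cluster
            for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)

Averaging silhouette_sample over all i should reproduce metrics.silhouette_score(X, labels).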
import numpy as np
from sklearn import metrics  # needed for silhouette_score below
from sklearn.cluster import KMeans

X = [[1, 2, 3], [1, 2, 3], [4, 6, 7], [4, 5, 6], [2, 3, 5]]
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(labels)  # [2 2 1 1 0]
print(metrics.silhouette_score(X, labels, metric='euclidean'))  # 0.6371196617847901
References
https://cloud.tencent.com/developer/article/1010857
Clustering Evaluation Metrics (2) | Adjusted Mutual Information and Homogeneity, Completeness and V-measure
Adjusted Mutual Information (requires the true labels of the data points)
Principle
https://blog.csdn.net/qq_42122496/article/details/106193859
https://cloud.tencent.com/developer/article/1010857
Mutual information (MI) is another way to measure how well two assignments of the data agree. MI-based measures require the true class labels. NMI is normalized to \([0, 1]\) and AMI takes values in \([-1, 1]\) (raw MI is non-negative but not bounded by 1); for all of them, larger values mean the clustering result agrees better with the ground truth.
Code
from sklearn.metrics.cluster import entropy, mutual_info_score, normalized_mutual_info_score, adjusted_mutual_info_score

MI = lambda x, y: mutual_info_score(x, y)
# NMI and AMI are both computed with the arithmetic mean here; the log base is e (natural log)
NMI = lambda x, y: normalized_mutual_info_score(x, y, average_method='arithmetic')
AMI = lambda x, y: adjusted_mutual_info_score(x, y, average_method='arithmetic')
A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
#print(entropy(A))
#print(MI(A, B))
print(NMI(A, B))#0.36456177185718985
print(AMI(A, B))#0.2601812253892505
C = [1, 1, 2, 2, 3, 3, 3]
D = [1, 1, 1, 2, 1, 1, 1]
print(NMI(C, D))#0.28483386264113447
print(AMI(C, D))#0.05674883175532439
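Since average_method='arithmetic' is used above, NMI should equal MI divided by the arithmetic mean of the two entropies; a quick sanity check of that relation:

from sklearn.metrics.cluster import entropy, mutual_info_score

A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
# NMI = MI / mean(H(A), H(B)) under arithmetic averaging
print(mutual_info_score(A, B) / ((entropy(A) + entropy(B)) / 2))  # 0.36456..., matching NMI(A, B) above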
Homogeneity, completeness and V-measure
https://www.jianshu.com/p/e1ee1f336d35
- homogeneity: float in \([0.0, 1.0]\); 1.0 stands for a perfectly homogeneous labeling
- completeness: float in \([0.0, 1.0]\); 1.0 stands for a perfectly complete labeling
- v_measure: float; the harmonic mean of the first two
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class. A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster. Both scores have positive values between 0.0 and 1.0, larger values being desirable.
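Concretely, these scores are defined through conditional entropies: \(h = 1 - H(C\mid K)/H(C)\), \(c = 1 - H(K\mid C)/H(K)\), and the V-measure is their harmonic mean. A from-scratch sketch of these standard formulas (the helper names are ours):

import numpy as np

def entropy_nats(x):
    # Shannon entropy in nats, from the label counts
    counts = np.unique(x, return_counts=True)[1]
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def conditional_entropy(x, y):
    # H(X | Y) in nats, accumulated over the values of Y
    x, y = np.asarray(x), np.asarray(y)
    h = 0.0
    for yv in np.unique(y):
        sub = x[y == yv]
        counts = np.unique(sub, return_counts=True)[1]
        p = counts / counts.sum()
        h += (len(sub) / len(x)) * float(-np.sum(p * np.log(p)))
    return h

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
h = 1 - conditional_entropy(labels_true, labels_pred) / entropy_nats(labels_true)
c = 1 - conditional_entropy(labels_pred, labels_true) / entropy_nats(labels_pred)
v = 2 * h * c / (h + c)
print(h, c, v)  # should match the three sklearn scores printed below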
from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(metrics.homogeneity_score(labels_true, labels_pred))
print(metrics.completeness_score(labels_true, labels_pred))
print(metrics.v_measure_score(labels_true, labels_pred))
# beta defaults to 1.0; beta < 1 gives more weight to homogeneity,
# while beta > 1 gives more weight to completeness
print(metrics.v_measure_score(labels_true, labels_pred, beta=0.6))
print(metrics.v_measure_score(labels_true, labels_pred, beta=1.8))
# all three scores can be computed in one call
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
labels_pred = [0, 0, 0, 1, 2, 2]
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
Clustering Evaluation Metrics (3) | Fowlkes-Mallows Score and Calinski-Harabasz Index
Fowlkes-Mallows scores
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fowlkes_mallows_score.html
The Fowlkes-Mallows index (FMI) is defined as the geometric mean of the pairwise precision and recall:

FMI = TP / sqrt((TP + FP) * (TP + FN))

The score ranges from 0 to 1. A high value indicates a good similarity between the two partitions.
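Here TP, FP, and FN are counted over pairs of samples, not individual samples; a brute-force sketch of that definition (fowlkes_mallows is our own illustrative helper):

from itertools import combinations
from math import sqrt

def fowlkes_mallows(labels_true, labels_pred):
    # TP/FP/FN are pair counts: a 'positive' is a pair placed in the same group
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        tp += int(same_true and same_pred)      # together in both partitions
        fp += int(same_pred and not same_true)  # together only in the prediction
        fn += int(same_true and not same_pred)  # together only in the ground truth
    return tp / sqrt((tp + fp) * (tp + fn))

print(fowlkes_mallows([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0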
from sklearn.metrics.cluster import fowlkes_mallows_score
print(fowlkes_mallows_score([0, 0, 1, 1], [0, 0, 1, 1]))#1.0
print(fowlkes_mallows_score([0, 0, 1, 1], [1, 1, 0, 0]))#1.0
Calinski-Harabasz Index (ground-truth labels unknown)
When the true cluster labels are unknown, the Calinski-Harabasz index can be used as an evaluation metric: the smaller the within-cluster covariance and the larger the between-cluster covariance, the higher the score.
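In formula form, with \(n\) samples and \(k\) clusters, \(\mathrm{CH} = \dfrac{\mathrm{tr}(B_k)/(k-1)}{\mathrm{tr}(W_k)/(n-k)}\), where \(B_k\) and \(W_k\) are the between- and within-cluster dispersion matrices. A from-scratch sketch of this formula (our own helper; it should agree with metrics.calinski_harabasz_score):

import numpy as np

def calinski_harabasz(X, labels):
    # CH = [tr(B_k) / (k - 1)] / [tr(W_k) / (n - k)]
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for lab in np.unique(labels):
        cluster = X[labels == lab]
        center = cluster.mean(axis=0)
        between += len(cluster) * np.sum((center - overall_mean) ** 2)  # tr(B_k)
        within += np.sum((cluster - center) ** 2)                       # tr(W_k)
    return (between / (k - 1)) / (within / (n - k))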
import numpy as np
from sklearn import metrics  # needed for calinski_harabasz_score below
from sklearn.cluster import KMeans

X = [[1, 2, 3], [1, 2, 5], [2, 4, 7], [1, 2, 8]]
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(metrics.calinski_harabasz_score(X, labels))  # 4.125
Clustering Evaluation Metrics (4) | Summary and Practice
Summary

| Metric | Needs true labels | Range |
|---|---|---|
| Adjusted Rand Index (ARI) | yes | \([-1, 1]\) |
| Silhouette Coefficient | no | \([-1, 1]\) |
| NMI / AMI | yes | \([0, 1]\) / \([-1, 1]\) |
| Homogeneity / Completeness / V-measure | yes | \([0, 1]\) |
| Fowlkes-Mallows score (FMI) | yes | \([0, 1]\) |
| Calinski-Harabasz Index | no | \([0, \infty)\) |

For every metric above, larger values indicate a better clustering.
Practice
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)
# #############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50, random_state=0).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
% metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
% metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels, metric='sqeuclidean'))
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.close('all')
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()