Implementing 6 Clustering Evaluation Metrics in Python


Clustering Evaluation Metrics (1) | Adjusted Rand Index and Silhouette Coefficient

1 Adjusted Rand Index (ARI, requires ground-truth labels)

Rand Index

Given a set of \(n\) objects \(S=\left\{O_{1}, O_{2}, \ldots, O_{n}\right\}\), suppose \(U=\left\{u_{1}, \ldots, u_{R}\right\}\) and \(V=\left\{v_{1}, \ldots, v_{C}\right\}\) are two different partitions of \(S\) satisfying \(\bigcup_{i=1}^{R} u_{i}=S=\bigcup_{j=1}^{C} v_{j}\) and \(u_{i} \cap u_{i^{*}}=\emptyset=v_{j} \cap v_{j^{*}}\), where \(1 \leq i \neq i^{*} \leq R\) and \(1 \leq j \neq j^{*} \leq C\).

Take U to be the external reference (the ground-truth labels, true_label) and V to be the clustering result. Define four pair statistics:

  • a: the number of pairs of points that are in the same class in U and in the same cluster in V
  • b: the number of pairs of points that are in the same class in U but in different clusters in V
  • c: the number of pairs of points that are in different classes in U but in the same cluster in V
  • d: the number of pairs of points that are in different classes in U and in different clusters in V


The Rand index is then

\[RI = \frac{a+d}{a+b+c+d} \]

The Rand index takes values in [0, 1] and equals 1 when the clustering result matches the reference partition perfectly.
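To make the four pair counts concrete, here is a minimal brute-force sketch (the helper name rand_index is ours, not a library function):

from itertools import combinations

def rand_index(labels_u, labels_v):
    # Count the four pair statistics a, b, c, d over all pairs of points.
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_u)), 2):
        same_u = labels_u[i] == labels_u[j]
        same_v = labels_v[i] == labels_v[j]
        if same_u and same_v:
            a += 1  # same class in U, same cluster in V
        elif same_u:
            b += 1  # same class in U, different clusters in V
        elif same_v:
            c += 1  # different classes in U, same cluster in V
        else:
            d += 1  # different in both partitions
    return (a + d) / (a + b + c + d)

print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 0.6666666666666666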

Adjusted Rand Index

The problem with the Rand index is that for two random partitions its value is not a constant close to 0. Hubert and Arabie proposed the adjusted Rand index in 1985. It assumes the generalized hypergeometric distribution as the model of randomness: the partitions \(U\) and \(V\) are chosen at random, subject to the fixed numbers of points in each class and each cluster.
Let \(n_{ij}\) denote the number of points that fall in both class \(u_{i}\) and cluster \(v_{j}\), \(n_{i\cdot}\) the number of points in class \(u_{i}\), and \(n_{\cdot j}\) the number of points in cluster \(v_{j}\), as in the following contingency table:

U \ V    v_1     v_2     ...    v_C     sums
u_1      n_11    n_12    ...    n_1C    n_1.
u_2      n_21    n_22    ...    n_2C    n_2.
...      ...     ...     ...    ...     ...
u_R      n_R1    n_R2    ...    n_RC    n_R.
sums     n_.1    n_.2    ...    n_.C    n

The adjusted Rand index is:

\[A R I=\frac{R I-E(R I)}{\max (R I)-E(R I)} \]

ARI is a mean-centered, normalized version of this index. The quantity \(a\) in RI (the number of pairs grouped together in both partitions) can be written as \(\sum_{i, j}\binom{n_{ij}}{2}\), and under the random model

\[\begin{gathered} E(R I)=E\left(\sum_{i, j}\binom{n_{i j}}{2}\right)=\left[\sum_{i}\binom{n_{i \cdot}}{2} \sum_{j}\binom{n_{\cdot j}}{2}\right] \Big/ \binom{n}{2} \\ \max (R I)=\frac{1}{2}\left[\sum_{i}\binom{n_{i \cdot}}{2}+\sum_{j}\binom{n_{\cdot j}}{2}\right] \end{gathered} \]

ARI ∈ [−1, 1]; larger values indicate better agreement between the clustering result and the ground truth. More broadly, ARI measures how well two partitions of the same data agree.
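As a sanity check on these formulas, the following sketch (the helper name ari_from_contingency is ours; scipy is assumed available for the binomial coefficients) reproduces sklearn.metrics.adjusted_rand_score:

from scipy.special import comb
from sklearn.metrics.cluster import contingency_matrix

def ari_from_contingency(labels_true, labels_pred):
    n_ij = contingency_matrix(labels_true, labels_pred)   # table of n_ij counts
    index = comb(n_ij, 2).sum()                           # sum_ij C(n_ij, 2)
    sum_rows = comb(n_ij.sum(axis=1), 2).sum()            # sum_i C(n_i., 2)
    sum_cols = comb(n_ij.sum(axis=0), 2).sum()            # sum_j C(n_.j, 2)
    expected = sum_rows * sum_cols / comb(n_ij.sum(), 2)  # E(index)
    max_index = (sum_rows + sum_cols) / 2                 # max(index)
    return (index - expected) / (max_index - expected)

print(ari_from_contingency([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 0.24242424242424246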

  • Advantages:
    for any number of clusters and samples, a random clustering has an ARI very close to 0;
    values lie in [−1, 1], where negative values indicate a poor result and values closer to 1 are better;
    it can be used to compare clustering algorithms.
  • Disadvantages:
    ARI requires ground-truth labels.

Python implementation

from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Basic usage
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# Independent of the label names used
labels_pred = [1, 1, 0, 0, 3, 3]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# Symmetric in its two arguments
score = metrics.adjusted_rand_score(labels_pred, labels_true)
print(score)  # 0.24242424242424246

# A score close to 1 is best
labels_pred = labels_true[:]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 1.0

References

https://blog.csdn.net/qq_42887760/article/details/105728101

https://blog.csdn.net/sinat_30203515/article/details/82634778

2 Silhouette Coefficient (for when true class information is unknown)

The silhouette coefficient applies when the true class information is unknown. For a single sample, let \(a\) be the mean distance between it and the other samples in the same cluster, and \(b\) the mean distance between it and the samples in the nearest other cluster. Its silhouette coefficient is

\[s=\frac{b-a}{\max (a, b)} \]

For a set of samples, the silhouette coefficient is the mean of the individual samples' coefficients. It takes values in \([-1,1]\); the score is higher when samples in the same cluster are close together and samples in different clusters are far apart.

from sklearn import metrics
from sklearn.cluster import KMeans

X = [[1, 2, 3], [1, 2, 3], [4, 6, 7], [4, 5, 6], [2, 3, 5]]

kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(labels)  # [2 2 1 1 0]
print(metrics.silhouette_score(X, labels, metric='euclidean'))  # 0.6371196617847901
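Beyond the aggregate score, scikit-learn also exposes per-sample values through metrics.silhouette_samples; their mean is exactly silhouette_score:

# Per-sample silhouette values; averaging them reproduces silhouette_score.
sample_values = metrics.silhouette_samples(X, labels, metric='euclidean')
print(sample_values)
print(sample_values.mean())  # 0.6371196617847901, the same value as above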

References

https://cloud.tencent.com/developer/article/1010857

Clustering Evaluation Metrics (2) | Adjusted Mutual Information and Homogeneity, Completeness and V-measure

Adjusted Mutual Information (requires ground-truth labels)

Principle

https://blog.csdn.net/qq_42122496/article/details/106193859

https://cloud.tencent.com/developer/article/1010857

Mutual information (MI) also measures how well two label distributions agree. MI-based evaluation requires the true class information. MI and NMI take values in [0, 1], and AMI in [−1, 1]; for all three, larger values indicate better agreement between the clustering result and the ground truth.

Code

from sklearn.metrics.cluster import entropy, mutual_info_score, normalized_mutual_info_score, adjusted_mutual_info_score

MI = lambda x, y: mutual_info_score(x, y)
# NMI and AMI are computed here with the arithmetic mean; the log base is the natural logarithm e.
NMI = lambda x, y: normalized_mutual_info_score(x, y, average_method='arithmetic')
AMI = lambda x, y: adjusted_mutual_info_score(x, y, average_method='arithmetic')

A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
# print(entropy(A))
# print(MI(A, B))
print(NMI(A, B))  # 0.36456177185718985
print(AMI(A, B))  # 0.2601812253892505

C = [1, 1, 2, 2, 3, 3, 3]
D = [1, 1, 1, 2, 1, 1, 1]
print(NMI(C, D))  # 0.28483386264113447
print(AMI(C, D))  # 0.05674883175532439
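To connect mutual_info_score to its definition, here is a from-scratch sketch (the helper name mi_from_contingency is ours) that computes \(MI=\sum_{i,j} p_{ij}\log\frac{p_{ij}}{p_{i} p_{j}}\) with the natural logarithm, matching the library:

import numpy as np
from sklearn.metrics.cluster import contingency_matrix, mutual_info_score

def mi_from_contingency(x, y):
    p_ij = contingency_matrix(x, y) / len(x)   # joint distribution p(i, j)
    p_i = p_ij.sum(axis=1, keepdims=True)      # marginal over y's labels
    p_j = p_ij.sum(axis=0, keepdims=True)      # marginal over x's labels
    nz = p_ij > 0                              # skip empty cells to avoid log(0)
    return (p_ij[nz] * np.log(p_ij[nz] / (p_i * p_j)[nz])).sum()

print(mi_from_contingency(A, B))   # same value as mutual_info_score(A, B)
print(mutual_info_score(A, B))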

Homogeneity, completeness and V-measure

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.homogeneity_completeness_v_measure.html

https://www.jianshu.com/p/e1ee1f336d35

homogeneity: float
score between 0.0 and 1.0; 1.0 stands for perfectly homogeneous labeling

completeness: float
score between 0.0 and 1.0; 1.0 stands for perfectly complete labeling

v_measure: float
harmonic mean of the first two

A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class.

A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster.

Both scores have positive values between 0.0 and 1.0, larger values being desirable.
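These scores follow the Rosenberg-Hirschberg conditional-entropy definitions that scikit-learn implements: homogeneity = 1 − H(C|K)/H(C) and completeness = 1 − H(K|C)/H(K). A minimal from-scratch sketch (the helper name homogeneity_completeness is ours; degenerate single-class edge cases are ignored):

import numpy as np
from sklearn.metrics.cluster import contingency_matrix

def homogeneity_completeness(labels_true, labels_pred):
    # Joint distribution p(c, k) of true classes and predicted clusters.
    p_ck = contingency_matrix(labels_true, labels_pred) / len(labels_true)
    p_c = p_ck.sum(axis=1)  # class marginal p(c)
    p_k = p_ck.sum(axis=0)  # cluster marginal p(k)
    nz = p_ck > 0           # restrict to nonzero cells to avoid log(0)
    h_c = -(p_c * np.log(p_c)).sum()  # entropy H(C)
    h_k = -(p_k * np.log(p_k)).sum()  # entropy H(K)
    h_c_given_k = -(p_ck[nz] * np.log((p_ck / p_k[np.newaxis, :])[nz])).sum()  # H(C|K)
    h_k_given_c = -(p_ck[nz] * np.log((p_ck / p_c[:, np.newaxis])[nz])).sum()  # H(K|C)
    return 1 - h_c_given_k / h_c, 1 - h_k_given_c / h_k

# ≈ (0.6667, 0.4206), matching homogeneity_score / completeness_score below
print(homogeneity_completeness([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))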

from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

print(metrics.homogeneity_score(labels_true, labels_pred))

print(metrics.completeness_score(labels_true, labels_pred))

print(metrics.v_measure_score(labels_true, labels_pred))

# beta defaults to 1.0; beta < 1 weights homogeneity more strongly, beta > 1 weights completeness more strongly:
print(metrics.v_measure_score(labels_true, labels_pred, beta=0.6))

print(metrics.v_measure_score(labels_true, labels_pred, beta=1.8))

# All three scores can be computed at once:
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))

labels_pred = [0, 0, 0, 1, 2, 2]
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))


Clustering Evaluation Metrics (3) | Fowlkes-Mallows Score and Calinski-Harabasz Index

Fowlkes-Mallows scores

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fowlkes_mallows_score.html

The Fowlkes-Mallows index (FMI) is defined as the geometric mean of the pairwise precision and recall:

FMI = TP / sqrt((TP + FP) * (TP + FN))

The score ranges from 0 to 1. A high value indicates a good similarity between the two clusterings.
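In the pair-counting notation of section 1, TP corresponds to a (pairs grouped together in both labelings), FP to b (together only in labels_true), and FN to c (together only in labels_pred). A brute-force sketch of this definition (the helper name fmi_from_pairs is ours):

from itertools import combinations
from math import sqrt

def fmi_from_pairs(labels_true, labels_pred):
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1  # pair grouped together in both labelings
        elif same_true:
            fp += 1  # together in labels_true only
        elif same_pred:
            fn += 1  # together in labels_pred only
    return tp / sqrt((tp + fp) * (tp + fn))

print(fmi_from_pairs([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0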

from sklearn.metrics.cluster import fowlkes_mallows_score
print(fowlkes_mallows_score([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0

print(fowlkes_mallows_score([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0

Calinski-Harabasz Index (for when true cluster labels are unknown)

When the true cluster labels are unknown, the Calinski-Harabasz index can serve as an evaluation metric. The smaller the covariance within clusters and the larger the covariance between clusters, the higher the Calinski-Harabasz score will be.
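Concretely, the score scikit-learn computes is the ratio of between-cluster dispersion to within-cluster dispersion, rescaled by the degrees of freedom:

\[s=\frac{\operatorname{tr}\left(B_{k}\right)}{\operatorname{tr}\left(W_{k}\right)} \times \frac{n-k}{k-1} \]

where \(n\) is the number of samples, \(k\) the number of clusters, \(B_{k}\) the between-cluster dispersion matrix, and \(W_{k}\) the within-cluster dispersion matrix.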

from sklearn import metrics
from sklearn.cluster import KMeans

X = [[1, 2, 3], [1, 2, 5], [2, 4, 7], [1, 2, 8]]
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(metrics.calinski_harabasz_score(X, labels))  # 4.125

Clustering Evaluation Metrics (4) | Summary and Hands-on Example

Summary

Metric                                      Needs true labels    Range
Adjusted Rand Index (ARI)                   yes                  [-1, 1]
Mutual Information (MI / NMI / AMI)         yes                  [0, 1] for MI and NMI; [-1, 1] for AMI
Homogeneity / Completeness / V-measure      yes                  [0, 1]
Fowlkes-Mallows Score (FMI)                 yes                  [0, 1]
Silhouette Coefficient                      no                   [-1, 1]
Calinski-Harabasz Index                     no                   unbounded above

For all six metrics, larger values indicate a better clustering.

Hands-on example

https://scikit-learn.org/stable/auto_examples/cluster/plot_affinity_propagation.html#sphx-glr-auto-examples-cluster-plot-affinity-propagation-py

from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)

# #############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50, random_state=0).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels, metric='sqeuclidean'))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()


