Clustering Evaluation Metrics (1) | Adjusted Rand Index and Silhouette Coefficient
1 Adjusted Rand Index (ARI requires ground-truth labels)
Rand index
Given a set of \(n\) objects \(S=\left\{O_{1}, O_{2}, \ldots, O_{n}\right\}\), let \(U=\left\{u_{1}, \ldots, u_{R}\right\}\) and \(V=\left\{v_{1}, \ldots, v_{C}\right\}\) be two different partitions of \(S\), satisfying \(\bigcup_{i=1}^{R} u_{i}=S=\bigcup_{j=1}^{C} v_{j}\) and \(u_{i} \cap u_{i^{*}}=\emptyset=v_{j} \cap v_{j^{*}}\) for all \(1 \leq i \neq i^{*} \leq R\), \(1 \leq j \neq j^{*} \leq C\).
Suppose \(U\) is the external reference, i.e. the true labels, and \(V\) is the clustering result. Define four statistics:
- \(a\): the number of pairs of points that are in the same class in \(U\) and in the same cluster in \(V\)
- \(b\): the number of pairs of points that are in the same class in \(U\) but in different clusters in \(V\)
- \(c\): the number of pairs of points that are in different classes in \(U\) but in the same cluster in \(V\)
- \(d\): the number of pairs of points that are in different classes in \(U\) and in different clusters in \(V\)
The Rand index is then
$$
\mathrm{RI}=\frac{a+d}{a+b+c+d}=\frac{a+d}{\binom{n}{2}}.
$$
Its value lies in \([0,1]\), and it equals 1 when the clustering matches the reference partition perfectly.
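To make the four pair counts concrete, here is a minimal brute-force sketch (the helper rand_index below is our own illustration, not a library function):

from itertools import combinations

def rand_index(labels_true, labels_pred):
    # count the pair statistics a, b, c, d by enumerating all C(n, 2) pairs
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_u = labels_true[i] == labels_true[j]
        same_v = labels_pred[i] == labels_pred[j]
        if same_u and same_v:
            a += 1  # same class in U, same cluster in V
        elif same_u:
            b += 1  # same class in U, different clusters in V
        elif same_v:
            c += 1  # different classes in U, same cluster in V
        else:
            d += 1  # different in both partitions
    return (a + d) / (a + b + c + d)

print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 0.6666666666666666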
Adjusted Rand index
The problem with the Rand index is that for two random partitions its value is not a constant close to 0. Hubert and Arabie proposed the adjusted Rand index in 1985. It assumes the (generalized hypergeometric) model of randomness as its null model: the partitions \(U\) and \(V\) are drawn at random, subject to the number of points in each class and in each cluster being fixed.
Let \(n_{ij}\) denote the number of points that belong to both class \(u_{i}\) and cluster \(v_{j}\), let \(n_{i\cdot}\) be the number of points in class \(u_{i}\), and \(n_{\cdot j}\) the number of points in cluster \(v_{j}\), as in the following contingency table:

| | \(v_1\) | \(v_2\) | \(\cdots\) | \(v_C\) | sums |
|---|---|---|---|---|---|
| \(u_1\) | \(n_{11}\) | \(n_{12}\) | \(\cdots\) | \(n_{1C}\) | \(n_{1\cdot}\) |
| \(u_2\) | \(n_{21}\) | \(n_{22}\) | \(\cdots\) | \(n_{2C}\) | \(n_{2\cdot}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
| \(u_R\) | \(n_{R1}\) | \(n_{R2}\) | \(\cdots\) | \(n_{RC}\) | \(n_{R\cdot}\) |
| sums | \(n_{\cdot 1}\) | \(n_{\cdot 2}\) | \(\cdots\) | \(n_{\cdot C}\) | \(n\) |
The adjusted Rand index is:
$$
\mathrm{ARI}=\frac{\sum_{i,j}\binom{n_{ij}}{2}-\left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2}+\sum_{j}\binom{n_{\cdot j}}{2}\right]-\left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}
$$
ARI is a mean-centered, normalized version of RI, i.e. \(\mathrm{ARI}=(\mathrm{RI}-E[\mathrm{RI}])/(\max\mathrm{RI}-E[\mathrm{RI}])\); note that the count \(a\) in RI can be written as \(\sum_{i, j}\binom{n_{ij}}{2}\). ARI \(\in[-1,1]\), and larger values mean the clustering result agrees better with the ground truth. In a broader sense, ARI measures how well two assignments of the data agree.
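To see the formula at work, the following sketch builds the contingency table \(n_{ij}\) and evaluates ARI directly, then checks the result against sklearn (ari_from_contingency is a helper written only for this illustration):

import numpy as np
from scipy.special import comb
from sklearn.metrics import adjusted_rand_score

def ari_from_contingency(labels_true, labels_pred):
    # build the R x C contingency table n_ij
    _, class_idx = np.unique(labels_true, return_inverse=True)
    _, cluster_idx = np.unique(labels_pred, return_inverse=True)
    n_ij = np.zeros((class_idx.max() + 1, cluster_idx.max() + 1), dtype=int)
    for i, j in zip(class_idx, cluster_idx):
        n_ij[i, j] += 1
    sum_ij = comb(n_ij, 2).sum()              # sum_{i,j} C(n_ij, 2)
    sum_i = comb(n_ij.sum(axis=1), 2).sum()   # class (row) marginals
    sum_j = comb(n_ij.sum(axis=0), 2).sum()   # cluster (column) marginals
    expected = sum_i * sum_j / comb(len(labels_true), 2)
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(ari_from_contingency(labels_true, labels_pred))  # 0.24242424...
print(adjusted_rand_score(labels_true, labels_pred))   # same value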
- Advantages:
  - For any number of cluster centers and samples, a random clustering has an ARI very close to 0;
  - The value lies in \([-1, 1]\); negative values indicate a poor result, and the closer to 1 the better;
  - It can be used to compare different clustering algorithms.
- Disadvantage:
  - ARI requires the ground-truth labels.
Python implementation
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# basic usage
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# independent of the label names
labels_pred = [1, 1, 0, 0, 3, 3]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# symmetric in its two arguments
score = metrics.adjusted_rand_score(labels_pred, labels_true)
print(score)  # 0.24242424242424246

# values close to 1 are best
labels_pred = labels_true[:]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 1.0
References
https://blog.csdn.net/qq_42887760/article/details/105728101
https://blog.csdn.net/sinat_30203515/article/details/82634778
2 Silhouette Coefficient (ground-truth labels unknown)
The silhouette coefficient is suitable when the true class labels are unknown. For a single sample, let \(a\) be the mean distance to the other samples in its own cluster, and \(b\) the mean distance to the samples in the nearest other cluster. The sample's silhouette coefficient is
$$
s=\frac{b-a}{\max(a, b)}
$$
For a set of samples, the silhouette coefficient is the mean of the per-sample values. It ranges over \([-1, 1]\): the closer together same-cluster samples are and the farther apart different-cluster samples are, the higher the score.
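A rough per-sample sketch of this definition (our own helper, not the optimized sklearn implementation):

import numpy as np

def silhouette_sample(X, labels, i):
    # s(i) = (b - a) / max(a, b), straight from the definition
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    dist = np.linalg.norm(X - X[i], axis=1)   # distances from sample i to all samples
    own = labels == labels[i]
    own[i] = False                            # exclude the sample itself
    if not own.any():
        return 0.0                            # singleton cluster: score 0 by convention
    a = dist[own].mean()                      # mean intra-cluster distance
    b = min(dist[labels == k].mean()          # mean distance to the nearest other cluster
            for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)

Averaging silhouette_sample over all i should reproduce metrics.silhouette_score(X, labels).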
import numpy as np
from sklearn import metrics  # needed for silhouette_score below
from sklearn.cluster import KMeans

X = [[1, 2, 3], [1, 2, 3], [4, 6, 7], [4, 5, 6], [2, 3, 5]]
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(labels)  # [2 2 1 1 0]
print(metrics.silhouette_score(X, labels, metric='euclidean'))  # 0.6371196617847901
References
https://cloud.tencent.com/developer/article/1010857
Clustering Evaluation Metrics (2) | Adjusted Mutual Information and Homogeneity, Completeness and V-measure
Adjusted Mutual Information (requires the true labels of the data points)
Principle
https://blog.csdn.net/qq_42122496/article/details/106193859
https://cloud.tencent.com/developer/article/1010857
Mutual information (MI) is another way to measure how well two assignments of the data agree. MI-based measures require the true class labels. NMI is normalized to \([0, 1]\) and AMI takes values in \([-1, 1]\) (raw MI is non-negative but not bounded by 1); for all of them, larger values mean the clustering result agrees better with the ground truth.
Code
from sklearn.metrics.cluster import entropy, mutual_info_score, normalized_mutual_info_score, adjusted_mutual_info_score

MI = lambda x, y: mutual_info_score(x, y)
# NMI and AMI are both computed with the arithmetic mean here; the log base is e (natural log)
NMI = lambda x, y: normalized_mutual_info_score(x, y, average_method='arithmetic')
AMI = lambda x, y: adjusted_mutual_info_score(x, y, average_method='arithmetic')
A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
#print(entropy(A))
#print(MI(A, B))
print(NMI(A, B))#0.36456177185718985
print(AMI(A, B))#0.2601812253892505
C = [1, 1, 2, 2, 3, 3, 3]
D = [1, 1, 1, 2, 1, 1, 1]
print(NMI(C, D))#0.28483386264113447
print(AMI(C, D))#0.05674883175532439
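Since average_method='arithmetic' is used above, NMI should equal MI divided by the arithmetic mean of the two entropies; a quick sanity check of that relation:

from sklearn.metrics.cluster import entropy, mutual_info_score

A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
# NMI = MI / mean(H(A), H(B)) under arithmetic averaging
print(mutual_info_score(A, B) / ((entropy(A) + entropy(B)) / 2))  # 0.36456..., matching NMI(A, B) above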
Homogeneity, completeness and V-measure
https://www.jianshu.com/p/e1ee1f336d35
- homogeneity: float in \([0.0, 1.0]\); 1.0 stands for a perfectly homogeneous labeling
- completeness: float in \([0.0, 1.0]\); 1.0 stands for a perfectly complete labeling
- v_measure: float; the harmonic mean of the first two
A clustering result satisfies homogeneity if all of its clusters contain only data points which are members of a single class. A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster. Both scores have positive values between 0.0 and 1.0, larger values being desirable.
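Concretely, these scores are defined through conditional entropies: \(h = 1 - H(C\mid K)/H(C)\), \(c = 1 - H(K\mid C)/H(K)\), and the V-measure is their harmonic mean. A from-scratch sketch of these standard formulas (the helper names are ours):

import numpy as np

def entropy_nats(x):
    # Shannon entropy in nats, from the label counts
    counts = np.unique(x, return_counts=True)[1]
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def conditional_entropy(x, y):
    # H(X | Y) in nats, accumulated over the values of Y
    x, y = np.asarray(x), np.asarray(y)
    h = 0.0
    for yv in np.unique(y):
        sub = x[y == yv]
        counts = np.unique(sub, return_counts=True)[1]
        p = counts / counts.sum()
        h += (len(sub) / len(x)) * float(-np.sum(p * np.log(p)))
    return h

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
h = 1 - conditional_entropy(labels_true, labels_pred) / entropy_nats(labels_true)
c = 1 - conditional_entropy(labels_pred, labels_true) / entropy_nats(labels_pred)
v = 2 * h * c / (h + c)
print(h, c, v)  # should match the three sklearn scores printed below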
from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(metrics.homogeneity_score(labels_true, labels_pred))
print(metrics.completeness_score(labels_true, labels_pred))
print(metrics.v_measure_score(labels_true, labels_pred))
# beta defaults to 1.0; beta < 1 gives more weight to homogeneity,
# while beta > 1 gives more weight to completeness
print(metrics.v_measure_score(labels_true, labels_pred, beta=0.6))
print(metrics.v_measure_score(labels_true, labels_pred, beta=1.8))
# all three scores can be computed in one call
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
labels_pred = [0, 0, 0, 1, 2, 2]
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
Clustering Evaluation Metrics (3) | Fowlkes-Mallows Score and Calinski-Harabasz Index
Fowlkes-Mallows scores
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fowlkes_mallows_score.html
The Fowlkes-Mallows index (FMI) is defined as the geometric mean of the pairwise precision and recall:

FMI = TP / sqrt((TP + FP) * (TP + FN))

The score ranges from 0 to 1. A high value indicates a good similarity between the two partitions.
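Here TP, FP, and FN are counted over pairs of samples, not individual samples; a brute-force sketch of that definition (fowlkes_mallows is our own illustrative helper):

from itertools import combinations
from math import sqrt

def fowlkes_mallows(labels_true, labels_pred):
    # TP/FP/FN are pair counts: a 'positive' is a pair placed in the same group
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        tp += int(same_true and same_pred)      # together in both partitions
        fp += int(same_pred and not same_true)  # together only in the prediction
        fn += int(same_true and not same_pred)  # together only in the ground truth
    return tp / sqrt((tp + fp) * (tp + fn))

print(fowlkes_mallows([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0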
from sklearn.metrics.cluster import fowlkes_mallows_score
print(fowlkes_mallows_score([0, 0, 1, 1], [0, 0, 1, 1]))#1.0
print(fowlkes_mallows_score([0, 0, 1, 1], [1, 1, 0, 0]))#1.0
Calinski-Harabasz Index (ground-truth labels unknown)
When the true cluster labels are unknown, the Calinski-Harabasz index can be used as an evaluation metric: the smaller the within-cluster covariance and the larger the between-cluster covariance, the higher the score.
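In formula form, with \(n\) samples and \(k\) clusters, \(\mathrm{CH} = \dfrac{\mathrm{tr}(B_k)/(k-1)}{\mathrm{tr}(W_k)/(n-k)}\), where \(B_k\) and \(W_k\) are the between- and within-cluster dispersion matrices. A from-scratch sketch of this formula (our own helper; it should agree with metrics.calinski_harabasz_score):

import numpy as np

def calinski_harabasz(X, labels):
    # CH = [tr(B_k) / (k - 1)] / [tr(W_k) / (n - k)]
    X, labels = np.asarray(X, dtype=float), np.asarray(labels)
    n, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for lab in np.unique(labels):
        cluster = X[labels == lab]
        center = cluster.mean(axis=0)
        between += len(cluster) * np.sum((center - overall_mean) ** 2)  # tr(B_k)
        within += np.sum((cluster - center) ** 2)                       # tr(W_k)
    return (between / (k - 1)) / (within / (n - k))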
import numpy as np
from sklearn import metrics  # needed for calinski_harabasz_score below
from sklearn.cluster import KMeans

X = [[1, 2, 3], [1, 2, 5], [2, 4, 7], [1, 2, 8]]
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(metrics.calinski_harabasz_score(X, labels))  # 4.125
Clustering Evaluation Metrics (4) | Summary and Practice
Summary

| Metric | Needs true labels | Range |
|---|---|---|
| Adjusted Rand Index (ARI) | yes | \([-1, 1]\) |
| Silhouette Coefficient | no | \([-1, 1]\) |
| NMI / AMI | yes | \([0, 1]\) / \([-1, 1]\) |
| Homogeneity / Completeness / V-measure | yes | \([0, 1]\) |
| Fowlkes-Mallows score (FMI) | yes | \([0, 1]\) |
| Calinski-Harabasz Index | no | \([0, \infty)\) |

For every metric above, larger values indicate a better clustering.
Practice
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)
# #############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50, random_state=0).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
% metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
% metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels, metric='sqeuclidean'))
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.close('all')
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()