Clustering Evaluation Metrics (1) | Adjusted Rand Index and Silhouette Coefficient
1 Adjusted Rand Index (ARI; requires ground-truth labels)
Rand Index
Given a set of \(n\) objects \(S=\left\{O_{1}, O_{2}, \ldots, O_{n}\right\}\), let \(U=\left\{u_{1}, \ldots, u_{R}\right\}\) and \(V=\left\{v_{1}, \ldots, v_{C}\right\}\) be two different partitions of \(S\) satisfying \(\bigcup_{i=1}^{R} u_{i}=S=\bigcup_{j=1}^{C} v_{j}\) and \(u_{i} \cap u_{i^{*}}=\emptyset=v_{j} \cap v_{j^{*}}\) for \(1 \leq i \neq i^{*} \leq R\), \(1 \leq j \neq j^{*} \leq C\).
Suppose U is the external reference, i.e. the true labels (true_label), and V is the clustering result. Define four pair statistics:
- a: the number of pairs of points that are in the same class in U and in the same cluster in V
- b: the number of pairs of points that are in the same class in U but in different clusters in V
- c: the number of pairs of points that are in different classes in U but in the same cluster in V
- d: the number of pairs of points that are in different classes in U and in different clusters in V
The Rand index is then

$$
RI = \frac{a+d}{a+b+c+d} = \frac{a+d}{\binom{n}{2}}
$$

RI lies in \([0, 1]\) and equals 1 when the clustering result perfectly matches the reference partition.
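To make the four pair counts concrete, here is a brute-force sketch that enumerates all \(\binom{n}{2}\) pairs; the helper rand_index is our own, for illustration only:

from itertools import combinations

def rand_index(labels_true, labels_pred):
    # Count the four pair statistics a, b, c, d by enumerating all pairs.
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_u = labels_true[i] == labels_true[j]
        same_v = labels_pred[i] == labels_pred[j]
        if same_u and same_v:
            a += 1  # same class in U, same cluster in V
        elif same_u:
            b += 1  # same class in U, different clusters in V
        elif same_v:
            c += 1  # different classes in U, same cluster in V
        else:
            d += 1  # different classes in U, different clusters in V
    return (a + d) / (a + b + c + d)

print(rand_index([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]))  # 0.6666666666666666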
Adjusted Rand Index (ARI)
The problem with the Rand index is that for two random partitions its value is not a constant close to 0. Hubert and Arabie proposed the adjusted Rand index in 1985. ARI takes the generalized hypergeometric distribution as the null model: the partitions U and V are assumed to be random, subject to the constraint that the number of data points in each class and in each cluster is fixed.
Let \(n_{ij}\) denote the number of data points that are in both class \(u_{i}\) and cluster \(v_{j}\), let \(n_{i\cdot}\) be the number of points in class \(u_{i}\), and \(n_{\cdot j}\) the number of points in cluster \(v_{j}\), as in the following contingency table:

| | \(v_1\) | \(v_2\) | \(\cdots\) | \(v_C\) | Sums |
| --- | --- | --- | --- | --- | --- |
| \(u_1\) | \(n_{11}\) | \(n_{12}\) | \(\cdots\) | \(n_{1C}\) | \(n_{1\cdot}\) |
| \(u_2\) | \(n_{21}\) | \(n_{22}\) | \(\cdots\) | \(n_{2C}\) | \(n_{2\cdot}\) |
| \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
| \(u_R\) | \(n_{R1}\) | \(n_{R2}\) | \(\cdots\) | \(n_{RC}\) | \(n_{R\cdot}\) |
| Sums | \(n_{\cdot 1}\) | \(n_{\cdot 2}\) | \(\cdots\) | \(n_{\cdot C}\) | \(n\) |
The adjusted Rand index is:

$$
ARI = \frac{\sum_{i,j}\binom{n_{ij}}{2} - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_{i}\binom{n_{i\cdot}}{2}+\sum_{j}\binom{n_{\cdot j}}{2}\right] - \left[\sum_{i}\binom{n_{i\cdot}}{2}\sum_{j}\binom{n_{\cdot j}}{2}\right]\Big/\binom{n}{2}}
$$
ARI is thus a chance-corrected form of RI, normalized as (index − expected index) / (max index − expected index). Note that the quantity \(a\) in RI, the number of pairs grouped together in both partitions, can be written as \(\sum_{i, j}\binom{n_{ij}}{2}\).
ARI ∈ [−1, 1]. Larger values mean the clustering result agrees better with the ground truth. More broadly, ARI measures how well two label assignments agree.
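The formula can be checked directly against sklearn. A minimal sketch that computes ARI from the contingency table (the helper ari is ours; math.comb requires Python 3.8+):

from math import comb
from collections import Counter
from sklearn.metrics import adjusted_rand_score

def ari(labels_true, labels_pred):
    n = len(labels_true)
    n_ij = Counter(zip(labels_true, labels_pred))  # contingency cells n_ij
    n_i = Counter(labels_true)                     # row sums n_i.
    n_j = Counter(labels_pred)                     # column sums n_.j
    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_i = sum(comb(v, 2) for v in n_i.values())
    sum_j = sum(comb(v, 2) for v in n_j.values())
    expected = sum_i * sum_j / comb(n, 2)          # expected index under the random model
    max_index = (sum_i + sum_j) / 2
    return (sum_ij - expected) / (max_index - expected)

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(ari(labels_true, labels_pred))                  # ~0.2424
print(adjusted_rand_score(labels_true, labels_pred))  # same value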
- Advantages:
  - For any number of clusters and samples, a random labeling gets an ARI very close to 0;
  - bounded in [-1, 1]: negative values indicate a poor result, and the closer to 1 the better;
  - can be used to compare different clustering algorithms.
- Disadvantage:
  - ARI requires the ground-truth labels.
Python implementation
from sklearn import metrics

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]

# Basic usage
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# Independent of the label names
labels_pred = [1, 1, 0, 0, 3, 3]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 0.24242424242424246

# Symmetric in its arguments
score = metrics.adjusted_rand_score(labels_pred, labels_true)
print(score)  # 0.24242424242424246

# A perfect match scores 1
labels_pred = labels_true[:]
score = metrics.adjusted_rand_score(labels_true, labels_pred)
print(score)  # 1.0
References
https://blog.csdn.net/qq_42887760/article/details/105728101
https://blog.csdn.net/sinat_30203515/article/details/82634778
2 Silhouette Coefficient (for when true class information is unknown)
The silhouette coefficient applies when the true class information is unknown. For a single sample, let a be its mean distance to the other samples in the same cluster, and b its mean distance to the samples in the nearest other cluster; the silhouette coefficient is

$$
s = \frac{b-a}{\max(a, b)}
$$

For a set of samples, the silhouette coefficient is the mean of the per-sample silhouette coefficients. It ranges over \([-1, 1]\); the closer together same-cluster samples are and the farther apart different-cluster samples are, the higher the score.
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

X = [[1,2,3],[1,2,3],[4,6,7],[4,5,6],[2,3,5]]
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(labels)  # [2 2 1 1 0]
print(metrics.silhouette_score(X, labels, metric='euclidean'))  # 0.6371196617847901
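To see where the aggregate score comes from, silhouette_samples returns the per-sample values \(s=(b-a)/\max(a,b)\), whose mean is exactly silhouette_score; this snippet reuses X and labels from the block above:

from sklearn.metrics import silhouette_samples

per_sample = silhouette_samples(X, labels, metric='euclidean')
print(per_sample)         # one value in [-1, 1] per sample
print(per_sample.mean())  # equals the silhouette_score above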
References
https://cloud.tencent.com/developer/article/1010857
Clustering Evaluation Metrics (2) | Adjusted Mutual Information and Homogeneity, Completeness and V-measure
Adjusted Mutual Information (requires the true labels of the data points)
Theory
https://blog.csdn.net/qq_42122496/article/details/106193859
https://cloud.tencent.com/developer/article/1010857
Mutual Information (MI) also measures how well two label assignments agree. MI-based measures require the true class information. MI is non-negative, NMI is normalized into [0, 1], and AMI ranges over [-1, 1]; for all of them, a larger value means the clustering result agrees better with the ground truth.
Code
from sklearn.metrics.cluster import entropy, mutual_info_score, normalized_mutual_info_score, adjusted_mutual_info_score

MI = lambda x, y: mutual_info_score(x, y)
# NMI and AMI are computed here with the arithmetic average; logs are natural (base e).
NMI = lambda x, y: normalized_mutual_info_score(x, y, average_method='arithmetic')
AMI = lambda x, y: adjusted_mutual_info_score(x, y, average_method='arithmetic')
A = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
B = [1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 1, 1, 3, 3, 3]
#print(entropy(A))
#print(MI(A, B))
print(NMI(A, B))#0.36456177185718985
print(AMI(A, B))#0.2601812253892505
C = [1, 1, 2, 2, 3, 3, 3]
D = [1, 1, 1, 2, 1, 1, 1]
print(NMI(C, D))#0.28483386264113447
print(AMI(C, D))#0.05674883175532439
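With the arithmetic averaging used above, the normalization is simply NMI = MI / ((H(x) + H(y)) / 2); AMI additionally subtracts the mutual information expected by chance before normalizing. A quick check with the A and B defined above:

# NMI (arithmetic averaging) is MI divided by the average entropy.
print(MI(A, B) / ((entropy(A) + entropy(B)) / 2))  # matches NMI(A, B)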
Homogeneity, completeness and V-measure
https://www.jianshu.com/p/e1ee1f336d35
- homogeneity (float): score between 0.0 and 1.0; 1.0 stands for a perfectly homogeneous labeling
- completeness (float): score between 0.0 and 1.0; 1.0 stands for a perfectly complete labeling
- v_measure (float): the harmonic mean of the first two
A clustering result satisfies homogeneity if all of its clusters contain only data points that are members of a single class. A clustering result satisfies completeness if all the data points that are members of a given class are elements of the same cluster. Both scores take values between 0.0 and 1.0, with larger values being desirable.
from sklearn import metrics
labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
print(metrics.homogeneity_score(labels_true, labels_pred))
print(metrics.completeness_score(labels_true, labels_pred))
print(metrics.v_measure_score(labels_true, labels_pred))
# beta defaults to 1.0; beta < 1 weights homogeneity more strongly, beta > 1 weights completeness more strongly:
print(metrics.v_measure_score(labels_true, labels_pred, beta=0.6))
print(metrics.v_measure_score(labels_true, labels_pred, beta=1.8))
# All three scores can be computed at once
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
labels_pred = [0, 0, 0, 1, 2, 2]
print(metrics.homogeneity_completeness_v_measure(labels_true, labels_pred))
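These scores have simple entropy-based definitions: homogeneity = 1 − H(C|K)/H(C), completeness = 1 − H(K|C)/H(K), and with beta = 1 the V-measure is their harmonic mean. Below is a minimal sketch using the identity H(C|K) = H(C) − I(C;K); the helper homogeneity_completeness is ours, and zero-entropy edge cases are ignored:

from sklearn.metrics.cluster import entropy, mutual_info_score

def homogeneity_completeness(labels_true, labels_pred):
    # homogeneity = 1 - H(C|K)/H(C) = I(C;K)/H(C); completeness analogously.
    mi = mutual_info_score(labels_true, labels_pred)
    h = mi / entropy(labels_true)
    c = mi / entropy(labels_pred)
    return h, c

labels_true = [0, 0, 0, 1, 1, 1]
labels_pred = [0, 0, 1, 1, 2, 2]
h, c = homogeneity_completeness(labels_true, labels_pred)
print(h, c, 2 * h * c / (h + c))  # matches the three sklearn scores for this pair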
Clustering Evaluation Metrics (3) | Fowlkes-Mallows Score and Calinski-Harabasz Index
Fowlkes-Mallows scores
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.fowlkes_mallows_score.html
The Fowlkes-Mallows index (FMI) is defined as the geometric mean of the pairwise precision and recall:
FMI = TP / sqrt((TP + FP) * (TP + FN))
The score ranges from 0 to 1. A high value indicates good similarity between the two clusterings.
from sklearn.metrics.cluster import fowlkes_mallows_score
print(fowlkes_mallows_score([0, 0, 1, 1], [0, 0, 1, 1]))#1.0
print(fowlkes_mallows_score([0, 0, 1, 1], [1, 1, 0, 0]))#1.0
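The pair counting behind the formula can be written out directly (a brute-force sketch; the helper fmi is ours, not part of sklearn):

from math import sqrt
from itertools import combinations

def fmi(labels_true, labels_pred):
    # TP: pairs together in both labelings; FP: together only in labels_true;
    # FN: together only in labels_pred.
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true and same_pred:
            tp += 1
        elif same_true:
            fp += 1
        elif same_pred:
            fn += 1
    return tp / sqrt((tp + fp) * (tp + fn))

print(fmi([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0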
Calinski-Harabasz Index (no ground-truth labels needed)
When the true cluster labels are unknown, the Calinski-Harabasz index can serve as an evaluation metric: the smaller the within-cluster covariance and the larger the between-cluster covariance, the higher the Calinski-Harabasz score.
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans

X = [[1,2,3],[1,2,5],[2,4,7],[1,2,8]]
kmeans_model = KMeans(n_clusters=3, random_state=1).fit(X)
labels = kmeans_model.labels_
print(metrics.calinski_harabasz_score(X, labels))  # 4.125
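As a cross-check of the definition, here is a minimal re-implementation from the within- and between-cluster dispersions, reusing X and labels from the block above (the helper calinski_harabasz is ours):

def calinski_harabasz(X, labels):
    # CH = (between-group dispersion / (k - 1)) / (within-group dispersion / (n - k)).
    X = np.asarray(X, dtype=float)
    n, k = len(X), len(np.unique(labels))
    overall_mean = X.mean(axis=0)
    between = within = 0.0
    for lab in np.unique(labels):
        cluster = X[labels == lab]
        center = cluster.mean(axis=0)
        between += len(cluster) * ((center - overall_mean) ** 2).sum()
        within += ((cluster - center) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))

print(calinski_harabasz(X, labels))  # should match metrics.calinski_harabasz_score(X, labels)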
Clustering Evaluation Metrics (4) | Summary and Worked Example
Summary

| Metric | Needs ground-truth labels? | Range |
| --- | --- | --- |
| Adjusted Rand Index (ARI) | Yes | [-1, 1] |
| Adjusted Mutual Information (AMI) | Yes | [-1, 1] |
| Homogeneity / Completeness / V-measure | Yes | [0, 1] |
| Fowlkes-Mallows score (FMI) | Yes | [0, 1] |
| Silhouette Coefficient | No | [-1, 1] |
| Calinski-Harabasz Index | No | non-negative, unbounded |

For all of these metrics, a higher value indicates a better (or better-matching) clustering.
Worked example
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs
# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)
# #############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50, random_state=0).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
% metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
% metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
% metrics.silhouette_score(X, labels, metric='sqeuclidean'))
# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle
plt.close('all')
plt.figure(1)
plt.clf()
colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()