[Scikit-learn] 2.1 Clustering - Variational Bayesian Gaussian Mixture


The most important question: why does the Bayesian GMM fit better?

A passage in PRML explains this:

 

Ref: http://freemind.pluskid.org/machine-learning/deciding-the-number-of-clusterings/ 

The linked post also covers several other unsupervised clustering approaches.

 

From: http://scikit-learn.org/stable/modules/mixture.html#variational-bayesian-gaussian-mixture

Due to its Bayesian nature, the variational algorithm needs more hyperparameters than expectation-maximization, the most important of these being the concentration parameter weight_concentration_prior.

  • Specifying a low value for the concentration prior will make the model put most of the weight on a few components and set the remaining components' weights very close to zero.
  • High values of the concentration prior will allow a larger number of components to be active in the mixture.
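A minimal sketch of this contrast (the toy data and parameter values below are my own illustration, not taken from the scikit-learn documentation):

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Hypothetical toy data: two well-separated 2-D clusters.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + [6, 6]])

for prior in (0.01, 1000.0):      # low vs. high concentration prior
    model = BayesianGaussianMixture(
        n_components=5,
        weight_concentration_prior_type='dirichlet_distribution',
        weight_concentration_prior=prior,
        random_state=0,
    ).fit(X)
    # low prior -> a few dominant components; high prior -> weight spread over many
    print(prior, np.round(model.weights_, 3))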

 

The examples below compare Gaussian mixture models with a fixed number of components to variational Gaussian mixture models with a Dirichlet process prior. Here, a classical Gaussian mixture is fitted with 5 components on a dataset composed of 2 clusters.

We can see that the variational Gaussian mixture with a Dirichlet process prior is able to limit itself to only 2 components whereas the Gaussian mixture fits the data with a fixed number of components that has to be set a priori by the user. In this case the user has selected n_components=5 which does not match the true generative distribution of this toy dataset. Note that with very few observations, the variational Gaussian mixture models with a Dirichlet process prior can take a conservative stand, and fit only one component.
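Roughly what that comparison looks like in code (a sketch with made-up two-cluster data; the weights it prints will not match the numbers quoted below exactly):

import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

# Hypothetical two-cluster toy data.
rng = np.random.RandomState(1)
X = np.vstack([rng.randn(300, 2), rng.randn(300, 2) + [5, 0]])

gmm = GaussianMixture(n_components=5, random_state=1).fit(X)
bgmm = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type='dirichlet_process',
    random_state=1,
).fit(X)

print('GMM weights:         ', np.round(gmm.weights_, 3))   # all 5 components carry weight
print('Bayesian GMM weights:', np.round(bgmm.weights_, 3))  # weight concentrates on ~2 components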

 

The Dirichlet distribution performs a kind of automatic component selection, keeping only the components that actually matter.

Weights of the 5 components, GMM:          [ 0.1258077   0.23638361  0.23330578  0.26361639  0.14088652]
Weights of the 5 components, Bayesian GMM: [ 0.001019    0.00101796  0.49948856  0.47955123  0.01892325]

 

Now the question:

Why does the Dirichlet prior push three of the component weights toward near zero while the plain GMM does not? Is it just a difference in convergence speed?

It should have nothing to do with speed. Once the prior is added, the posterior over the mixing weights is again a Dirichlet, so the parameter estimates inherit the Dirichlet's sparsity-inducing properties.
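A rough sketch of the mechanism, in my own notation (following the standard variational treatment, e.g. PRML Chapter 10, rather than anything specific to this post): with a symmetric Dirichlet prior $\mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_0, \ldots, \alpha_0)$ on the mixing weights, the variational posterior over the weights is again a Dirichlet,

$$
q(\boldsymbol{\pi}) = \mathrm{Dir}(\boldsymbol{\pi} \mid \alpha_0 + N_1, \ldots, \alpha_0 + N_K),
\qquad
\mathbb{E}[\pi_k] = \frac{\alpha_0 + N_k}{K\alpha_0 + N},
$$

where $N_k$ is the expected number of points assigned to component $k$ and $N = \sum_k N_k$. A component that explains almost no data has $N_k \approx 0$, so for small $\alpha_0$ its expected weight is pushed toward zero, while a maximum-likelihood GMM has no such term and keeps every component alive.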

The raw data

Our data set will be the classic Old Faithful dataset
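One way the data variable could be prepared (the file name faithful.csv is a placeholder; any source with 'eruptions' and 'waiting' columns works):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical local copy of the Old Faithful data.
data = pd.read_csv('faithful.csv')   # columns: 'eruptions', 'waiting'
print(data.head())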

plt.scatter(data['eruptions'], data['waiting'], alpha=0.5)
plt.xlabel('eruptions')
plt.ylabel('waiting')

How do we fit the model?

from sklearn.mixture import BayesianGaussianMixture

mixture_model = BayesianGaussianMixture(
    n_components=10,
    random_state=5,  # control the pseudo-random initialization
    weight_concentration_prior_type='dirichlet_distribution',
    weight_concentration_prior=1.0,  # parameter of the Dirichlet component prior
    max_iter=200,  # choose this to be big in case it takes a long time to fit
)
mixture_model.fit(data)
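A quick check of what the fit produced (not in the original post): the learned weights show how many of the 10 components actually survive.

import numpy as np

print(np.round(mixture_model.weights_, 3))
# components with non-negligible weight are the "effective" clusters
print('effective components:', int(np.sum(mixture_model.weights_ > 0.01)))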

 

Ref: http://scikit-learn.org/stable/auto_examples/mixture/plot_concentration_prior.html

The routines below can be called directly:

 

import numpy as np
import matplotlib as mpl

def plot_ellipses(ax, weights, means, covars):
    """
    Given a list of mixture component weights, means, and covariances,
    plot ellipses to show the orientation and scale of the Gaussian
    mixture dispersal.
    """
    for n in range(means.shape[0]):
        # eigendecomposition of the n-th covariance matrix
        eig_vals, eig_vecs = np.linalg.eigh(covars[n])
        unit_eig_vec = eig_vecs[0] / np.linalg.norm(eig_vecs[0])
        angle = np.arctan2(unit_eig_vec[1], unit_eig_vec[0])
        angle = 180 * angle / np.pi                    # Ellipse needs degrees
        eig_vals = 2 * np.sqrt(2) * np.sqrt(eig_vals)  # scale eigenvalues to ellipse axes
        ell = mpl.patches.Ellipse(means[n], eig_vals[0], eig_vals[1],
                                  angle=180 + angle, edgecolor=None)
        ell2 = mpl.patches.Ellipse(means[n], eig_vals[0], eig_vals[1],
                                   angle=180 + angle, edgecolor='black',
                                   fill=False, linewidth=1)
        ell.set_clip_box(ax.bbox)
        ell2.set_clip_box(ax.bbox)
        ell.set_alpha(weights[n])                      # weight controls transparency
        ell.set_facecolor('#56B4E9')
        ax.add_artist(ell)
        ax.add_artist(ell2)

# Called from plot_results() below as:
#   plot_ellipses(ax1, model.weights_, model.means_, model.covariances_)

 

import matplotlib.pyplot as plt
from matplotlib import gridspec

def plot_results(model, x, y, title, plot_title=False):
    fig, ax = plt.subplots(3, 1, sharex=False)
    # ax above is unused; ax1 and ax2 are redefined below
    gs = gridspec.GridSpec(3, 1)            # custom sub-plot layout
    ax1 = plt.subplot(gs[0:2, 0])

    # the next four lines are standard boilerplate
    ax1.set_title(title)
    ax1.scatter(x, y, s=5, marker='o', alpha=0.8)
    ax1.set_xticks(())
    ax1.set_yticks(())

    n_components = model.get_params()['n_components']

    # ax1: draw the ellipses
    plot_ellipses(ax1, model.weights_, model.means_, model.covariances_)

    # ax2: draw the weights
    ax2 = plt.subplot(gs[2, 0])
    ax2.get_xaxis().set_tick_params(direction='out')
    ax2.yaxis.grid(True, alpha=0.7)
    for k, w in enumerate(model.weights_):
        ax2.bar(k, w, width=0.9, color='#56B4E9', zorder=3,
                align='center', edgecolor='black')
        ax2.text(k, w + 0.007, "%.1f%%" % (w * 100.),
                 horizontalalignment='center')
    ax2.set_xlim(-.6, n_components - .4)
    ax2.set_ylim(0., 1.1)
    ax2.tick_params(axis='y', which='both', left=False, right=False,
                    labelleft=False)
    ax2.tick_params(axis='x', which='both', top=False)
    if plot_title:
        ax1.set_ylabel('Estimated Mixtures')
        ax2.set_ylabel('Weight of each component')

plot_results(mixture_model, data['eruptions'], data['waiting'],
             'weight_concentration_prior={}'.format(1.0))

 

Watching the fitting process:

lower_bounds = []
mixture_model = BayesianGaussianMixture(
    n_components=10,
    covariance_type='full',
    max_iter=1,                  # one EM step per call to fit()
    random_state=2,
    weight_concentration_prior_type='dirichlet_distribution',
    warm_start=True,             # keep parameters between successive fit() calls
)

# With max_iter=1 and warm_start=True, each fit() runs a single iteration.
for i in range(200):
    mixture_model.fit(data)
    if mixture_model.converged_:
        break
    lower_bounds.append(mixture_model.lower_bound_)
    if i % 5 == 0 and i < 60:
        plt.figure()
        plot_results(mixture_model, data['eruptions'], data['waiting'],
                     'EM step={}, lower_bound={}'.format(i, mixture_model.lower_bound_))

plt.figure()
plt.plot(lower_bounds)
plt.gca().set_xlabel('step')
plt.gca().set_ylabel('lower bound')
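As an aside (not from the original post), a similar view of the optimization can be had without the warm_start trick: BayesianGaussianMixture accepts verbose and verbose_interval, and with verbose=2 it prints the change in the lower bound as it fits.

BayesianGaussianMixture(
    n_components=10,
    covariance_type='full',
    weight_concentration_prior_type='dirichlet_distribution',
    max_iter=200,
    random_state=2,
    verbose=2,            # print progress during fitting
    verbose_interval=10,  # report every 10 iterations
).fit(data)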

 

 

 

 

The lower bound increases monotonically as the EM steps proceed.

 

Comparing the results of different initializations:

for seed in range(6, 11):
    lower_bounds = []
    mixture_model = BayesianGaussianMixture(
        n_components=10,
        covariance_type='full',
        max_iter=1,
        random_state=seed,
        weight_concentration_prior_type='dirichlet_distribution',
        warm_start=True,
    )
    for i in range(200):
        mixture_model.fit(data)
        if mixture_model.converged_:
            break
        lower_bounds.append(mixture_model.lower_bound_)
    plt.plot(lower_bounds)

plt.gca().set_xlabel('step')
plt.gca().set_ylabel('lower bound')
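Different seeds converge to different local optima of the lower bound. A simple way to choose among them (a sketch of the same idea that n_init implements internally) is to keep the run with the highest final lower bound:

best_seed, best_lb = None, float('-inf')
for seed in range(6, 11):
    model = BayesianGaussianMixture(
        n_components=10,
        covariance_type='full',
        weight_concentration_prior_type='dirichlet_distribution',
        max_iter=200,
        random_state=seed,
    ).fit(data)
    if model.lower_bound_ > best_lb:
        best_seed, best_lb = seed, model.lower_bound_
print('best seed:', best_seed, 'lower bound:', best_lb)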

Result: 

 

