Overview
Reference:
A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a well-trained prior model.
The Gaussian mixture model is a classic probabilistic/generative model, commonly used in pattern recognition applications such as speaker (voiceprint) recognition and speech recognition. It is usually trained (i.e., its parameters are estimated) by maximum likelihood, implemented in practice with the Expectation-Maximization (EM) algorithm. The algorithm, in outline:
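As a brief reference (the standard formulation, consistent with the quotation above), the GMM density is a weighted sum of $M$ component Gaussian densities:
$$ p\left( \mathbf{x} \mid \lambda \right) = \sum\limits_{i = 1}^{M} w_i\, g\left( \mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \right), \qquad \sum\limits_{i = 1}^{M} w_i = 1, $$
where $g$ is the multivariate Gaussian density and $\lambda = \{ w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \}$ collects the model parameters. EM alternates an E-step, which computes each component's responsibility for each sample under the current parameters, with an M-step, which re-estimates $w_i$, $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ from those responsibilities; each iteration does not decrease the likelihood.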

The sklearn.mixture module implements mixture modeling algorithms. It provides GaussianMixture and BayesianGaussianMixture, both of which inherit from BaseMixture.
GaussianMixture
Probability distribution of a Gaussian mixture model, and estimation of its parameters.
See sklearn.mixture.GaussianMixture and its source code.
class sklearn.mixture.GaussianMixture(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)
Initialization
Initialize the GaussianMixture class. Usage:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=20, max_iter=200, covariance_type='diag', n_init=3)
This GMM consists of 20 Gaussian components; the EM algorithm runs for at most 200 iterations during training; the covariance type is diag (each component has its own diagonal covariance matrix); and 3 initializations are performed, keeping the best result.
The parameters (weights, means, precisions) are initialized with the kmeans method by default. precisions_init defaults to None; when supplied, its shape is determined by covariance_type, e.g. (n_components, n_features) for diag.
warm_start defaults to False.
Note: n_samples >= n_components is required.
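As a minimal sketch (the sizes and values below are purely illustrative), an explicit precisions_init for a diag model must have shape (n_components, n_features):
import numpy as np
from sklearn.mixture import GaussianMixture

n_components, n_features = 20, 13            # illustrative sizes
# unit precisions (inverse variances) for every component and feature
prec_init = np.ones((n_components, n_features))
gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                      precisions_init=prec_init)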
.fit
Estimate model parameters with the EM algorithm. The method fits the model n_init times and sets the parameters with which the model has the largest likelihood or lower bound. Within each trial, the method iterates between E-step and M-step for max_iter times until the change of likelihood or lower bound is less than tol, otherwise, a ConvergenceWarning is raised.
If warm_start is True, then n_init is ignored and a single initialization is performed upon the first call. Upon consecutive calls, training starts where it left off.
See the source code. Usage:
gmm.fit(datas)
Here datas is array-like of shape (n_samples, n_features): a list of n_features-dimensional data points, where each row corresponds to a single data point.
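For example (a sketch on synthetic data; in practice datas would be your feature matrix, e.g. spectral features):
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
datas = rng.standard_normal((500, 13))   # 500 samples, 13 features (illustrative)

gmm = GaussianMixture(n_components=20, max_iter=200,
                      covariance_type='diag', n_init=3)
gmm.fit(datas)                 # runs EM n_init times, keeps the best run
print(gmm.converged_)          # True if EM reached tol within max_iter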
.score
Compute the per-sample average log-likelihood of the given data X.
See the source code. Usage:
ll_score = gmm.score(test_datas)
Here test_datas is array-like of shape (n_samples, n_dimensions): a list of n_features-dimensional data points, where each row corresponds to a single data point.
The returned ll_score is a float: the per-sample average log-likelihood of test_datas under the Gaussian mixture model.
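A short sketch (synthetic data, illustrative shapes only):
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_datas = rng.standard_normal((500, 13))
test_datas = rng.standard_normal((100, 13))   # same n_features as the training data

gmm = GaussianMixture(n_components=5, covariance_type='diag').fit(train_datas)
ll_score = gmm.score(test_datas)   # float: per-sample average log-likelihood
print(ll_score)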
preprocessing
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For example, many elements used in the objective function of a learning algorithm (such as the RBF kernel of SVMs, or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance of the same order.
If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
The sklearn.preprocessing module includes scaling, centering, normalization, and binarization methods.
.scale
Before training a Gaussian mixture model, features are usually standardized. This can be done with sklearn.preprocessing.scale, which standardizes a dataset along any axis: center to the mean and component-wise scale to unit variance.
See sklearn.preprocessing.scale and its source code. Usage:
from sklearn import preprocessing
X_tr = preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
Note that the input X is an array-like or sparse matrix of shape (n_samples, n_features): the data to center and scale.
axis is the axis along which the means and standard deviations are computed: 0 standardizes each feature independently; 1 standardizes each sample.
The return value X_tr is an ndarray or sparse matrix of shape (n_samples, n_features) containing the transformed data.
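A small demonstration (values are illustrative) that scale with the default axis=0 gives every feature zero mean and unit variance:
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])
X_tr = preprocessing.scale(X)    # axis=0: standardize each feature (column)
print(X_tr.mean(axis=0))         # approximately 0 for every feature
print(X_tr.std(axis=0))          # 1 for every feature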
Warning: risk of data leakage
Do not use scale unless you know what you are doing. A common mistake is to apply it to the entire data before splitting into training and test sets. This will bias the model evaluation because information would have leaked from the test set to the training set. In general, we recommend using StandardScaler within a Pipeline in order to prevent most risks of data leaking: pipe = make_pipeline(StandardScaler(), LogisticRegression()).
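In the GMM setting, the same pattern might look like the following sketch (arrays and sizes are illustrative); StandardScaler learns its statistics only from the data passed to fit, so nothing leaks from the held-out set:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_datas = rng.standard_normal((500, 13))   # illustrative train/test split
test_datas = rng.standard_normal((100, 13))

pipe = make_pipeline(StandardScaler(), GaussianMixture(n_components=5))
pipe.fit(train_datas)          # scaler statistics come from train_datas only
ll = pipe.score(test_datas)    # test data is scaled with the training statistics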
Not used?
.StandardScaler
Standardize features by removing the mean and scaling to unit variance. Computed as $z = \left( x - \mu \right)/\sigma$.
See sklearn.preprocessing.StandardScaler:
class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)
Both with_mean and with_std default to True, so the scaler centers the input data and scales it to unit standard deviation.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
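A minimal sketch of that fit/transform split (data is illustrative):
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = 3.0 + 5.0 * rng.standard_normal((100, 4))   # illustrative data
X_test = 3.0 + 5.0 * rng.standard_normal((20, 4))

scaler = StandardScaler().fit(X_train)   # stores mean_ and scale_ from X_train
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # reuses the training statistics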
