Overview
Reference:
A Gaussian Mixture Model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speaker recognition system. GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a well-trained prior model.
The Gaussian mixture model is a classic probabilistic/generative model, commonly used in pattern recognition applications such as speaker (voiceprint) recognition and speech recognition. It is usually trained (i.e., its parameters are estimated) by maximum likelihood, implemented in practice with the Expectation-Maximization (EM) algorithm. The algorithm, in outline:
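As a brief reference (the standard formulation, consistent with the quotation above), the GMM density is a weighted sum of $M$ component Gaussian densities:
$$ p\left( \mathbf{x} \mid \lambda \right) = \sum\limits_{i = 1}^{M} w_i\, g\left( \mathbf{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \right), \qquad \sum\limits_{i = 1}^{M} w_i = 1, $$
where $g$ is the multivariate Gaussian density and $\lambda = \{ w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i \}$ collects the model parameters. EM alternates an E-step, which computes each component's responsibility for each sample under the current parameters, with an M-step, which re-estimates $w_i$, $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ from those responsibilities; each iteration does not decrease the likelihood.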

The sklearn.mixture module implements mixture modeling algorithms. It provides GaussianMixture and BayesianGaussianMixture, both of which inherit from BaseMixture.
GaussianMixture
Probability distribution of a Gaussian mixture model, and estimation of its parameters.
See sklearn.mixture.GaussianMixture and its source code.
class sklearn.mixture.GaussianMixture(n_components=1, *, covariance_type='full', tol=0.001, reg_covar=1e-06, max_iter=100, n_init=1, init_params='kmeans', weights_init=None, means_init=None, precisions_init=None, random_state=None, warm_start=False, verbose=0, verbose_interval=10)
Initialization
Initialize the GaussianMixture class. Usage:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=20, max_iter=200, covariance_type='diag', n_init=3)
This GMM consists of 20 Gaussian components; the EM algorithm runs for at most 200 iterations during training; the covariance type is diag (each component has its own diagonal covariance matrix); and 3 initializations are performed, keeping the best result.
The parameters (weights, means, precisions) are initialized with the kmeans method by default. precisions_init defaults to None; when supplied, its shape is determined by covariance_type, e.g. (n_components, n_features) for diag.
warm_start defaults to False.
Note: n_samples >= n_components is required.
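As a minimal sketch (the sizes and values below are purely illustrative), an explicit precisions_init for a diag model must have shape (n_components, n_features):
import numpy as np
from sklearn.mixture import GaussianMixture

n_components, n_features = 20, 13            # illustrative sizes
# unit precisions (inverse variances) for every component and feature
prec_init = np.ones((n_components, n_features))
gmm = GaussianMixture(n_components=n_components, covariance_type='diag',
                      precisions_init=prec_init)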
.fit
Estimate model parameters with the EM algorithm. The method fits the model n_init times and sets the parameters with which the model has the largest likelihood or lower bound. Within each trial, the method iterates between E-step and M-step for max_iter times until the change of likelihood or lower bound is less than tol, otherwise, a ConvergenceWarning is raised.
If warm_start is True, then n_init is ignored and a single initialization is performed upon the first call. Upon consecutive calls, training starts where it left off.
See the source code. Usage:
gmm.fit(datas)
Here datas is array-like of shape (n_samples, n_features): a list of n_features-dimensional data points, where each row corresponds to a single data point.
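For example (a sketch on synthetic data; in practice datas would be your feature matrix, e.g. spectral features):
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
datas = rng.standard_normal((500, 13))   # 500 samples, 13 features (illustrative)

gmm = GaussianMixture(n_components=20, max_iter=200,
                      covariance_type='diag', n_init=3)
gmm.fit(datas)                 # runs EM n_init times, keeps the best run
print(gmm.converged_)          # True if EM reached tol within max_iter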
.score
Compute the per-sample average log-likelihood of the given data X.
See the source code. Usage:
ll_score = gmm.score(test_datas)
Here test_datas is array-like of shape (n_samples, n_dimensions): a list of n_features-dimensional data points, where each row corresponds to a single data point.
The returned ll_score is a float: the per-sample average log-likelihood of test_datas under the Gaussian mixture model.
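A short sketch (synthetic data, illustrative shapes only):
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_datas = rng.standard_normal((500, 13))
test_datas = rng.standard_normal((100, 13))   # same n_features as the training data

gmm = GaussianMixture(n_components=5, covariance_type='diag').fit(train_datas)
ll_score = gmm.score(test_datas)   # float: per-sample average log-likelihood
print(ll_score)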
preprocessing
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For example, many elements used in the objective function of a learning algorithm (such as the RBF kernel of SVMs, or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance of the same order.
If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
The sklearn.preprocessing module includes scaling, centering, normalization, and binarization methods.
.scale
Before training a Gaussian mixture model, features are usually standardized. This can be done with sklearn.preprocessing.scale, which standardizes a dataset along any axis: center to the mean and component-wise scale to unit variance.
See sklearn.preprocessing.scale and its source code. Usage:
from sklearn import preprocessing
X_tr = preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
Note that the input X is an array-like or sparse matrix of shape (n_samples, n_features): the data to center and scale.
axis is the axis along which the means and standard deviations are computed: 0 standardizes each feature independently; 1 standardizes each sample.
The return value X_tr is an ndarray or sparse matrix of shape (n_samples, n_features) containing the transformed data.
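A small demonstration (values are illustrative) that scale with the default axis=0 gives every feature zero mean and unit variance:
import numpy as np
from sklearn import preprocessing

X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])
X_tr = preprocessing.scale(X)    # axis=0: standardize each feature (column)
print(X_tr.mean(axis=0))         # approximately 0 for every feature
print(X_tr.std(axis=0))          # 1 for every feature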
Warning: risk of data leakage
Do not use scale unless you know what you are doing. A common mistake is to apply it to the entire data before splitting into training and test sets. This will bias the model evaluation because information would have leaked from the test set to the training set. In general, we recommend using StandardScaler within a Pipeline in order to prevent most risks of data leaking: pipe = make_pipeline(StandardScaler(), LogisticRegression()).
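In the GMM setting, the same pattern might look like the following sketch (arrays and sizes are illustrative); StandardScaler learns its statistics only from the data passed to fit, so nothing leaks from the held-out set:
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train_datas = rng.standard_normal((500, 13))   # illustrative train/test split
test_datas = rng.standard_normal((100, 13))

pipe = make_pipeline(StandardScaler(), GaussianMixture(n_components=5))
pipe.fit(train_datas)          # scaler statistics come from train_datas only
ll = pipe.score(test_datas)    # test data is scaled with the training statistics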
Not used?
.StandardScaler
Standardize features by removing the mean and scaling to unit variance. Computed as $z = \left( x - \mu \right)/\sigma$.
See sklearn.preprocessing.StandardScaler:
class sklearn.preprocessing.StandardScaler(*, copy=True, with_mean=True, with_std=True)
Both with_mean and with_std default to True, so the scaler centers the input data and scales it to unit standard deviation.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
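A minimal sketch of that fit/transform split (data is illustrative):
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = 3.0 + 5.0 * rng.standard_normal((100, 4))   # illustrative data
X_test = 3.0 + 5.0 * rng.standard_normal((20, 4))

scaler = StandardScaler().fit(X_train)   # stores mean_ and scale_ from X_train
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # reuses the training statistics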
