Two types of transformations are available: quantile transforms and power transforms. Both are based on monotonic transformations of the features and thus preserve the rank of the values along each feature.
By performing a rank transformation, a quantile transform smooths out unusual distributions and is less influenced by outliers than scaling methods. It does, however, distort correlations and distances within and across features.
Power transforms are a family of parametric transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible.
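As a quick illustration of this rank-preserving property, here is a minimal sketch (the synthetic lognormal data and variable names are our own, not part of the guide):

>>> import numpy as np
>>> from sklearn import preprocessing
>>> rng = np.random.RandomState(0)
>>> X = rng.lognormal(size=(100, 1))  # heavy-tailed, strictly positive data
>>> qt = preprocessing.QuantileTransformer(n_quantiles=100, random_state=0)
>>> pt = preprocessing.PowerTransformer()
>>> # Both transforms are monotonic, so the ordering of the samples is preserved:
>>> np.array_equal(np.argsort(X[:, 0]), np.argsort(qt.fit_transform(X)[:, 0]))
True
>>> np.array_equal(np.argsort(X[:, 0]), np.argsort(pt.fit_transform(X)[:, 0]))
True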
1 Mapping to a uniform distribution
The QuantileTransformer class and the quantile_transform function provide a non-parametric transformation, based on the quantile function, to map the data to a uniform distribution with values between 0 and 1:
>>> import numpy as np
>>> from sklearn import preprocessing
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> quantile_transformer = preprocessing.QuantileTransformer(random_state=0)
>>> X_train_trans = quantile_transformer.fit_transform(X_train)
>>> X_test_trans = quantile_transformer.transform(X_test)
>>> np.percentile(X_train[:, 0], [0, 25, 50, 75, 100])
array([ 4.3, 5.1, 5.8, 6.5, 7.9])
This feature corresponds to the sepal length in cm. Once the quantile transform is applied, those landmarks approach closely the previously defined percentiles:
>>> np.percentile(X_train_trans[:, 0], [0, 25, 50, 75, 100])
array([ 0.00..., 0.24..., 0.49..., 0.73..., 0.99...])
This can be confirmed on an independent test set with similar remarks:
>>> np.percentile(X_test[:, 0], [0, 25, 50, 75, 100])
array([ 4.4 , 5.125, 5.75 , 6.175, 7.3 ])
>>> np.percentile(X_test_trans[:, 0], [0, 25, 50, 75, 100])
array([ 0.01..., 0.25..., 0.46..., 0.60... , 0.94...])
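A practical consequence of this rank-based mapping, sketched below by reusing the fitted quantile_transformer (the extreme input values are our own illustration): unseen values that fall outside the range seen during fitting are clipped to the bounds of the uniform output.

>>> # 0 lies below and 100 above every fitted feature range, so the outputs
>>> # are clipped to the bounds of the [0, 1] uniform distribution:
>>> X_extreme = np.array([[0., 0., 0., 0.], [100., 100., 100., 100.]])
>>> quantile_transformer.transform(X_extreme)
array([[0., 0., 0., 0.],
       [1., 1., 1., 1.]])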
2 Mapping to a Gaussian distribution
In many modeling scenarios, normality of the features in a dataset is desirable. Power transforms are a family of parametric, monotonic transformations that aim to map data from any distribution to as close to a Gaussian distribution as possible, in order to stabilize variance and minimize skewness.
The PowerTransformer class currently provides two such power transformations, the Yeo-Johnson transform and the Box-Cox transform.
Yeo-Johnson transform:

$$x_i^{(\lambda)} =
\begin{cases}
[(x_i + 1)^\lambda - 1] / \lambda & \text{if } \lambda \neq 0, x_i \geq 0 \\
\ln(x_i + 1) & \text{if } \lambda = 0, x_i \geq 0 \\
-[(-x_i + 1)^{2 - \lambda} - 1] / (2 - \lambda) & \text{if } \lambda \neq 2, x_i < 0 \\
-\ln(-x_i + 1) & \text{if } \lambda = 2, x_i < 0
\end{cases}$$

Box-Cox transform:

$$x_i^{(\lambda)} =
\begin{cases}
\dfrac{x_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0 \\
\ln(x_i) & \text{if } \lambda = 0
\end{cases}$$
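To make the Box-Cox definition concrete, here is a minimal sketch of our own (the fixed λ = 0.5 is arbitrary) checking the λ ≠ 0 branch of the formula against scipy.stats.boxcox:

>>> import numpy as np
>>> from scipy.stats import boxcox
>>> x = np.array([0.5, 1.0, 2.0, 4.0])  # strictly positive, as Box-Cox requires
>>> lmbda = 0.5                          # fixed lambda, for illustration only
>>> manual = (x ** lmbda - 1) / lmbda    # the lambda != 0 branch of the formula
>>> np.allclose(manual, boxcox(x, lmbda=lmbda))
True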
Box-Cox can only be applied to strictly positive data. In both methods, the transformation is parameterized by λ, which is determined through maximum likelihood estimation. Here is an example of using Box-Cox to map samples drawn from a lognormal distribution to a normal distribution:
>>> pt = preprocessing.PowerTransformer(method='box-cox', standardize=False)
>>> X_lognormal = np.random.RandomState(616).lognormal(size=(3, 3))
>>> X_lognormal
array([[1.28..., 1.18..., 0.84...],
       [0.94..., 1.60..., 0.38...],
       [1.35..., 0.21..., 1.09...]])
>>> pt.fit_transform(X_lognormal)
array([[ 0.49...,  0.17..., -0.15...],
       [-0.05...,  0.58..., -0.57...],
       [ 0.69..., -0.84...,  0.10...]])
While the above example sets the standardize option to False, PowerTransformer will by default apply zero-mean, unit-variance normalization to the transformed output.
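A minimal sketch of that default behaviour, reusing X_lognormal from the example above (the allclose check is our own):

>>> pt_std = preprocessing.PowerTransformer(method='box-cox')  # standardize=True by default
>>> X_std = pt_std.fit_transform(X_lognormal)
>>> # Each transformed column now has (approximately) zero mean and unit variance:
>>> bool(np.allclose(X_std.mean(axis=0), 0) and np.allclose(X_std.std(axis=0), 1))
True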
Below, Box-Cox and Yeo-Johnson are applied to various probability distributions. Note that when applied to certain distributions, the power transforms achieve very Gaussian-like results, but on others they are ineffective. This highlights the importance of visualizing the data before and after transformation.
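One simple way to do such a visual check, sketched here with matplotlib and synthetic chi-squared data (both our own choices, not part of the guide):

import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.RandomState(0)
X = rng.chisquare(df=3, size=(1000, 1))        # strongly right-skewed input
X_trans = PowerTransformer().fit_transform(X)  # Yeo-Johnson by default

# Compare the marginal distribution before and after the transform
fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(8, 3))
ax0.hist(X[:, 0], bins=30)
ax0.set_title('before power transform')
ax1.hist(X_trans[:, 0], bins=30)
ax1.set_title('after power transform')
plt.show()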
It is also possible to map data to a normal distribution using QuantileTransformer by setting output_distribution='normal'. Here it is applied to the iris dataset:
>>> quantile_transformer = preprocessing.QuantileTransformer(
...     output_distribution='normal', random_state=0)
>>> X_trans = quantile_transformer.fit_transform(X)
>>> quantile_transformer.quantiles_
array([[4.3, 2. , 1. , 0.1],
       [4.4, 2.2, 1.1, 0.1],
       [4.4, 2.2, 1.2, 0.1],
       ...,
       [7.7, 4.1, 6.7, 2.5],
       [7.7, 4.2, 6.7, 2.5],
       [7.9, 4.4, 6.9, 2.5]])
Thus the median of the input becomes the mean of the output, centered at 0. The normal output is clipped so that the input's minimum and maximum (which correspond to the 1e-7 and 1 - 1e-7 quantiles respectively) do not become infinite under the transformation.
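A small sketch verifying those clipping bounds, reusing X_trans from the example above (the comparison against scipy.stats.norm.ppf is our own check):

>>> from scipy.stats import norm
>>> # The extreme outputs equal the normal quantiles at 1e-7 and 1 - 1e-7,
>>> # so the transform never produces +/- infinity:
>>> bool(np.allclose([X_trans.min(), X_trans.max()], norm.ppf([1e-7, 1 - 1e-7])))
True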
API
class sklearn.preprocessing.QuantileTransformer(*, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)
Transform features using quantile information.
This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Feature values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
Read more in the User Guide.
New in version 0.19.
Parameters
- n_quantiles : int, default=1000 or n_samples
  Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples, as a larger number of quantiles does not give a better approximation of the cumulative distribution function estimator.
- output_distribution : {'uniform', 'normal'}, default='uniform'
  Marginal distribution for the transformed data. The choices are 'uniform' (default) or 'normal'.
- ignore_implicit_zeros : bool, default=False
  Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.
- subsample : int, default=1e5
  Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.
- random_state : int, RandomState instance or None, default=None
  Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See Glossary.
- copy : bool, default=True
  Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).
Attributes
- n_quantiles_ : int
  The actual number of quantiles used to discretize the cumulative distribution function.
- quantiles_ : ndarray of shape (n_quantiles, n_features)
  The values corresponding to the quantiles of reference.
- references_ : ndarray of shape (n_quantiles,)
  Quantiles of references.
Methods
| fit(X[, y]) | Compute the quantiles used for transforming. |
| fit_transform(X[, y]) | Fit to data, then transform it. |
| get_params([deep]) | Get parameters for this estimator. |
| inverse_transform(X) | Back-projection to the original space. |
| set_params(**params) | Set the parameters of this estimator. |
| transform(X) | Feature-wise transformation of the data. |
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import QuantileTransformer
>>> rng = np.random.RandomState(0)
>>> X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X)
array([...])
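As a short usage note continuing the example above (a sketch of our own), inverse_transform maps the transformed values back to the original space:

>>> X_trans = qt.fit_transform(X)
>>> X_back = qt.inverse_transform(X_trans)
>>> # The round trip recovers the original values up to interpolation
>>> # and floating-point error:
>>> bool(np.allclose(X, X_back))
True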
API
class sklearn.preprocessing.PowerTransformer(method='yeo-johnson', *, standardize=True, copy=True)
Apply a power transform featurewise to make data more Gaussian-like.
Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive and negative data.
By default, zero-mean, unit-variance normalization is applied to the transformed data.
Read more in the User Guide.
New in version 0.20.
Parameters
- method : {'yeo-johnson', 'box-cox'}, default='yeo-johnson'
  The power transform method. Available methods are:
  - 'yeo-johnson', works with positive and negative values
  - 'box-cox', only works with strictly positive values
- standardize : bool, default=True
  Set to True to apply zero-mean, unit-variance normalization to the transformed output.
- copy : bool, default=True
  Set to False to perform inplace computation during transformation.
Attributes
- lambdas_ : ndarray of float of shape (n_features,)
  The parameters of the power transformation for the selected features.
Methods
| fit(X[, y]) | Estimate the optimal parameter lambda for each feature. |
| fit_transform(X[, y]) | Fit to data, then transform it. |
| get_params([deep]) | Get parameters for this estimator. |
| inverse_transform(X) | Apply the inverse power transformation using the fitted lambdas. |
| set_params(**params) | Set the parameters of this estimator. |
| transform(X) | Apply the power transform to each feature using the fitted lambdas. |
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import PowerTransformer
>>> pt = PowerTransformer()
>>> data = [[1, 2], [3, 2], [4, 5]]
>>> print(pt.fit(data))
PowerTransformer()
>>> print(pt.lambdas_)
[ 1.386... -3.100...]
>>> print(pt.transform(data))
[[-1.316... -0.707...]
 [ 0.209... -0.707...]
 [ 1.106...  1.414...]]
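Similarly, as a short usage note of our own continuing the example above: inverse_transform undoes the power transform (and the default standardization) using the fitted lambdas:

>>> X_back = pt.inverse_transform(pt.transform(data))
>>> # The analytic inverse recovers the original data up to floating-point error:
>>> bool(np.allclose(X_back, data))
True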