Python has packages that implement feature selection directly, i.e., measuring how strongly each feature relates to the target variable. In this post we'll look at how to use the chi-square test for feature selection.
1. First, import the packages and load the sample data:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.datasets import load_iris

iris = load_iris()  # load the IRIS dataset
iris.data  # inspect the feature data
Output:
array([[ 5.1,  3.5,  1.4,  0.2],
       [ 4.9,  3. ,  1.4,  0.2],
       [ 4.7,  3.2,  1.3,  0.2],
       [ 4.6,  3.1,  1.5,  0.2],
       [ 5. ,  3.6,  1.4,  0.2],
       [ 5.4,  3.9,  1.7,  0.4],
       [ 4.6,  3.4,  1.4,  0.3],
       ...
2. Use the chi-square test to select features
model1 = SelectKBest(chi2, k=2)  # select the k best features
model1.fit_transform(iris.data, iris.target)
# iris.data is the feature matrix, iris.target the labels;
# fit_transform returns the k selected feature columns
Output:
array([[ 1.4, 0.2],
[ 1.4, 0.2],
[ 1.3, 0.2],
[ 1.5, 0.2],
[ 1.4, 0.2],
[ 1.7, 0.4],
[ 1.4, 0.3],
       ...
As you can see, the chi-square test selected the last two features. If we also want to view the chi-square p-values and scores, proceed to step 3.
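To confirm exactly which columns were kept (rather than eyeballing the values), the fitted selector's get_support method returns a boolean mask over the input features. A minimal sketch, reusing the iris data from above:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
selector = SelectKBest(chi2, k=2)
selector.fit(iris.data, iris.target)

# Boolean mask over the four iris features; True marks a kept column
mask = selector.get_support()
kept = [name for name, keep in zip(iris.feature_names, mask) if keep]
print(kept)  # the two petal measurements are selected
```

Mapping the mask back onto iris.feature_names shows that the selected columns are petal length and petal width.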
3. Inspect the p-values and scores
model1.scores_  # scores
The score output is:
array([ 10.81782088, 3.59449902, 116.16984746, 67.24482759])
The last two features have the highest scores, consistent with the result in step 2.
model1.pvalues_ #p-values
The p-value output is:
array([ 4.47651499e-03, 1.65754167e-01, 5.94344354e-26, 2.50017968e-15])
The last two features have the smallest p-values, i.e., the strongest statistical significance, consistent with the earlier results.
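The same scores and p-values can also be computed without a selector: calling chi2(X, y) directly returns the pair of arrays that SelectKBest stores in scores_ and pvalues_ after fitting. A short sketch:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import chi2

iris = load_iris()
# chi2 returns (scores, p-values) as two arrays of shape (n_features,)
scores, pvalues = chi2(iris.data, iris.target)
print(scores)   # highest for the third feature (petal length)
print(pvalues)  # smallest for the third feature
```

This is handy when you want to rank or plot the statistics yourself instead of letting SelectKBest pick the top k.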
API
class sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)

Select features according to the k highest scores.

Read more in the User Guide.

Parameters:
- score_func : callable, default=f_classif
  Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues) or a single array with scores. Default is f_classif (see "See Also" below). The default function only works with classification tasks.
  New in version 0.18.
- k : int or "all", default=10
  Number of top features to select. The "all" option bypasses selection, for use in a parameter search.

Attributes:
- scores_ : array-like of shape (n_features,)
  Scores of features.
- pvalues_ : array-like of shape (n_features,)
  p-values of feature scores, None if score_func returned only scores.
>>> from sklearn.datasets import load_digits
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> X, y = load_digits(return_X_y=True)
>>> X.shape
(1797, 64)
>>> X_new = SelectKBest(chi2, k=20).fit_transform(X, y)
>>> X_new.shape
(1797, 20)
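The API notes above mention that score_func may return only a single array of scores, in which case pvalues_ is None. A sketch illustrating this, assuming scikit-learn's mutual_info_classif (which returns scores but no p-values) is available:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

iris = load_iris()
# mutual_info_classif returns only scores, so after fitting,
# pvalues_ is None as described in the API docs above
selector = SelectKBest(mutual_info_classif, k=2).fit(iris.data, iris.target)
print(selector.scores_)   # one mutual-information score per feature
print(selector.pvalues_)  # None
```

Any callable with the signature score_func(X, y) can be plugged in the same way, which is what makes SelectKBest reusable beyond the chi-square test.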