sklearn ships a lot of machine-learning utilities, and if you never take the time to understand them, you end up flailing the moment you actually need one. So I am organizing them here, in the order I personally want to learn them.
If this toolkit is completely new to you, a good place to start reading is the official documentation.
2. sklearn.model_selection
sklearn has very thorough official documentation (https://scikit-learn.org/stable/).
2.1 Splitter Functions
2.1.1 train_test_split: split a dataset into training and test sets
# train_test_split
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=100,
                           n_features=20,
                           shuffle=True,
                           random_state=SEED)
print("Before split:", X.shape, y.shape)
# Note the return order: X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=SEED)
print("After split:", X_train.shape, y_train.shape, X_test.shape, y_test.shape)
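One option worth knowing about: train_test_split also takes a stratify argument. A minimal sketch, assuming a binary y, showing that stratify=y keeps the class proportions roughly equal in both subsets:

# A minimal sketch: stratify=y preserves the class proportions of y
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=20, random_state=666)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=666)
# the share of positive labels is (almost) the same in both subsets
print("positive ratio - train:", y_train.mean(), "test:", y_test.mean())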
2.1.2 check_cv: a simple five-fold split
- check_cv returns a KFold instance.
- check_cv does not shuffle, so the folds are contiguous index blocks; e.g., 100 samples split five ways always yield the index ranges (0-19), (20-39), (40-59), (60-79), (80-99).
# check_cv
from sklearn.model_selection import check_cv
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=100,
                           n_features=20,
                           shuffle=True,
                           random_state=SEED)
print("Before split:", X.shape, y.shape)
aKFold = check_cv(cv=5, y=y, classifier=False)  # returns a KFold instance
for train_index, test_index in aKFold.split(X):  # train_index, test_index are index arrays
    # print("%s %s" % (train_index, test_index))
    X_train, y_train, X_test, y_test = X[train_index], y[train_index], X[test_index], y[test_index]
    print("After split:", X_train.shape, y_train.shape, X_test.shape, y_test.shape)
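A detail worth confirming: with classifier=True and a classification target y, check_cv returns a StratifiedKFold rather than a KFold. A small sketch:

# check_cv chooses the splitter type from the classifier flag and the target y
from sklearn.model_selection import check_cv
import numpy as np

y = np.array([0, 1] * 50)  # a classification target
print(type(check_cv(cv=5, y=y, classifier=False)))  # KFold
print(type(check_cv(cv=5, y=y, classifier=True)))   # StratifiedKFold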
2.2 Splitter Classes
There are 15 dataset splitters here, built to cover all sorts of splitting needs flexibly. All the variants gave me a headache at first, and at one point I even wondered whether studying them was a waste of time. Often you only understand why a particular splitter exists once you have a concrete use case for it, so I group them below, starting from the simple ones.
2.2.1 K-fold split -- KFold
- The default is a five-fold split, without shuffling and without replacement.
- With shuffle=True the folds are no longer fixed; set the random seed random_state to make the split reproducible.
# KFold
# K-fold cross-validation: the dataset is cut into K folds; in each round one fold
# is the test set and the other K-1 folds form the training set.
# Shuffled multi-fold split (five folds by default); shuffle=True shuffles the samples first.
from sklearn.model_selection import KFold
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=100,
                           n_features=20,
                           shuffle=True,
                           random_state=SEED)
print("Before split:", X.shape, y.shape)
aKFold = KFold(n_splits=5, shuffle=True, random_state=SEED)  # a KFold instance; with shuffle=True the indices are no longer fixed blocks
for train_index, test_index in aKFold.split(X):  # train_index, test_index are index arrays
    print("%s %s" % (train_index, test_index))
    X_train, y_train, X_test, y_test = X[train_index], y[train_index], X[test_index], y[test_index]
    print("After split:", X_train.shape, y_train.shape, X_test.shape, y_test.shape)
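In everyday use you rarely loop over the indices by hand; any of these splitter instances can be passed directly as the cv argument of helpers such as cross_val_score. A minimal sketch (the LogisticRegression model is just a stand-in for illustration):

# Passing a KFold instance as the cv argument of cross_val_score
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=20, random_state=666)
cv = KFold(n_splits=5, shuffle=True, random_state=666)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores)         # one score per fold
print(scores.mean())  # the usual cross-validated estimate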
2.2.2 K-fold split -- GroupKFold
- GroupKFold(n_splits=5): returns a GroupKFold instance.
- GroupKFold.get_n_splits(self, X=None, y=None, groups=None): returns the number of folds.
- split(self, X, y=None, groups=None): returns an iterator over (train, test) index pairs. The split is driven by the third argument, groups: every group lands entirely in the training set or entirely in the test set, so the same group never appears on both sides of a split.
# GroupKFold
# Group-aware multi-fold split (five folds by default); requires groups, and each
# group is kept entirely in either the training set or the test set.
from sklearn.model_selection import GroupKFold
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
# help(GroupKFold)

SEED = 666
X, y = make_classification(n_samples=100,
                           n_features=20,
                           shuffle=True,
                           n_classes=2,
                           n_clusters_per_class=1,
                           n_informative=18,
                           weights=[0.1, 0.9],
                           random_state=SEED)
print("Before split:", X.shape, y.shape)
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
group_kfold = GroupKFold(n_splits=2)  # n_splits must not exceed the number of distinct groups
# group_kfold.get_n_splits(X, y, y)
# Here groups=y, so each class label is treated as a group and each fold's
# test set is exactly one whole class.
for train_index, test_index in group_kfold.split(X, y, groups=y):
    print("Split --------------------------------------------------")
    print("Training-set class counts:\n", pd.DataFrame(y[train_index]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_index]).value_counts())
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print("After split:", X_train.shape, y_train.shape, X_test.shape, y_test.shape)
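In real projects groups usually carries something like a patient or session ID rather than the class label. A sketch with a made-up group ID (the groups array below is invented for illustration):

# GroupKFold with an invented group ID: five groups of 20 samples each
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=20, random_state=666)
groups = np.arange(100) // 20  # hypothetical group labels 0..4
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=groups):
    # each test set is exactly one whole group; no group straddles the split
    print("test groups:", np.unique(groups[test_idx]))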
2.2.3 K-fold split -- StratifiedKFold
- Generates test sets such that all of them contain the same distribution of classes, or as close as possible.
- Is invariant to the class labels themselves: relabelling y = ["Happy", "Sad"] to y = [1, 0] should not change the indices produced.
- Preserves order dependencies in the dataset ordering when shuffle=False: all samples from class k in some test set were contiguous in y, or separated in y by samples from classes other than k.
- Generates test sets whose sizes differ by at most one sample.
# StratifiedKFold
# Unlike KFold, split() needs y as well; the folds are built so that the class
# proportions of y are (approximately) preserved in every fold.
import numpy as np
import pandas as pd
from sklearn.model_selection import *
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=200,
                           n_features=20,
                           shuffle=True,
                           n_classes=3,
                           n_clusters_per_class=1,
                           n_informative=18,
                           random_state=SEED)
skf = StratifiedKFold(n_splits=5, shuffle=False, random_state=None)
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
for train_idx, test_idx in skf.split(X, y):
    print("Split --------------------------------------------------")
    print("Training-set class counts:\n", pd.DataFrame(y[train_idx]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_idx]).value_counts())
2.2.4 K-fold split -- StratifiedGroupKFold
# StratifiedGroupKFold
# split() needs X, y and groups. Each group is kept intact in a single fold
# (as in GroupKFold) while the class distribution of y is preserved in every
# fold as far as possible (as in StratifiedKFold). groups must have the same
# length as X and y, and there must be at least n_splits distinct groups.
import numpy as np
import pandas as pd
from sklearn.model_selection import *
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=30,
                           n_features=20,
                           shuffle=True,
                           n_classes=2,
                           n_clusters_per_class=1,
                           n_informative=18,
                           random_state=SEED)
sgk = StratifiedGroupKFold(n_splits=3, shuffle=False, random_state=None)
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
groups = np.hstack((np.zeros(10), np.ones(10), np.ones(10) + 1))  # three groups of 10
for train_idx, test_idx in sgk.split(X, y, groups):
    print("TRAIN:", train_idx, "TEST:", test_idx)
    print("Training-set class counts:\n", pd.DataFrame(y[train_idx]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_idx]).value_counts())
2.2.5 K-fold split -- RepeatedKFold
# RepeatedKFold
# Repeats an n_splits-fold KFold split n_repeats times, so the total number of
# splits is n_splits * n_repeats.
import numpy as np
import pandas as pd
from sklearn.model_selection import *
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=20,
                           n_features=20,
                           shuffle=True,
                           n_classes=6,
                           n_clusters_per_class=1,
                           n_informative=18,
                           weights=[0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
                           random_state=SEED)
rkf = RepeatedKFold(n_splits=4, n_repeats=2, random_state=SEED)
for train_idx, test_idx in rkf.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
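A quick sanity check on the n_splits * n_repeats claim (a toy sketch, independent of the data above):

# RepeatedKFold yields n_splits * n_repeats train/test pairs in total
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(40).reshape(20, 2)  # 20 toy samples
rkf = RepeatedKFold(n_splits=4, n_repeats=2, random_state=666)
print(rkf.get_n_splits())            # 4 * 2 = 8
print(sum(1 for _ in rkf.split(X)))  # also 8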
2.2.6 K-fold split -- RepeatedStratifiedKFold
# RepeatedStratifiedKFold
# Repeats an n_splits-fold StratifiedKFold split n_repeats times, so the total
# number of splits is n_splits * n_repeats.
import numpy as np
import pandas as pd
from sklearn.model_selection import *
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=30,
                           n_features=20,
                           shuffle=True,
                           n_classes=2,
                           n_clusters_per_class=1,
                           n_informative=18,
                           random_state=SEED)
rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=SEED)
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
for train_idx, test_idx in rskf.split(X, y):
    print("Training-set class counts:\n", pd.DataFrame(y[train_idx]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_idx]).value_counts())
2.2.7 Random split -- ShuffleSplit
# ShuffleSplit
# Unlike K-fold splitting, ShuffleSplit lets you choose the number of splits and
# the test-set fraction of each split independently. With both test_size and
# train_size left as None, test_size defaults to 0.1. Set random_state to make
# the splits reproducible.
import numpy as np
import pandas as pd
from sklearn.model_selection import *
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=100,
                           n_features=20,
                           shuffle=True,
                           n_classes=6,
                           n_clusters_per_class=1,
                           n_informative=18,
                           weights=[0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
                           random_state=SEED)
ss = ShuffleSplit(n_splits=10, test_size=None, train_size=None, random_state=None)
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
for train_idx, test_idx in ss.split(X, y):
    print("Split --------------------------------------------------")
    print("Training-set class counts:\n", pd.DataFrame(y[train_idx]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_idx]).value_counts())
2.2.8 Random split -- GroupShuffleSplit
# GroupShuffleSplit
# Lets you choose the number of splits and the test-set fraction; requires a
# groups argument and splits by whole groups.
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=100,
                           n_features=20,
                           shuffle=True,
                           n_classes=4,
                           n_clusters_per_class=1,
                           n_informative=18,
                           weights=[0.1, 0.6, 0.2, 0.1],
                           random_state=SEED)
gss = GroupShuffleSplit(n_splits=5, test_size=0.2, train_size=None, random_state=SEED)
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
# The split is driven entirely by the groups argument (here groups=y)
for train_idx, test_idx in gss.split(X, y, groups=y):
    print("Split --------------------------------------------------")
    print("Training-set class counts:\n", pd.DataFrame(y[train_idx]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_idx]).value_counts())
2.2.9 Random split -- StratifiedShuffleSplit
# StratifiedShuffleSplit
# Lets you choose the number of splits and the test-set fraction; requires X and y,
# and keeps the class proportions of y similar across the resulting subsets.
import numpy as np
import pandas as pd
from sklearn.model_selection import *
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=200,
                           n_features=20,
                           shuffle=True,
                           n_classes=3,
                           n_clusters_per_class=1,
                           n_informative=18,
                           random_state=SEED)
sss = StratifiedShuffleSplit(n_splits=3, test_size=None, train_size=None, random_state=SEED)
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
for train_idx, test_idx in sss.split(X, y):
    print("Split --------------------------------------------------")
    print("Training-set class counts:\n", pd.DataFrame(y[train_idx]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_idx]).value_counts())
2.2.10 Leave-one-out -- LeaveOneOut
# LeaveOneOut
# Each split holds out exactly one sample as the test set, so n samples yield n splits.
import numpy as np
import pandas as pd
from sklearn.model_selection import *
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=100,
                           n_features=20,
                           shuffle=True,
                           n_classes=6,
                           n_clusters_per_class=1,
                           n_informative=18,
                           weights=[0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
                           random_state=SEED)
loo = LeaveOneOut()
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
for train_idx, test_idx in loo.split(X, y):
    print("Split --------------------------------------------------")
    print("Training-set class counts:\n", pd.DataFrame(y[train_idx]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_idx]).value_counts())
2.2.11 Leave-one-out -- LeaveOneGroupOut
# LeaveOneGroupOut
# Groups the samples by the groups argument, then holds out one whole group per split.
import numpy as np
import pandas as pd
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=100,
                           n_features=20,
                           shuffle=True,
                           n_classes=6,
                           n_clusters_per_class=1,
                           n_informative=18,
                           weights=[0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
                           random_state=SEED)
logo = LeaveOneGroupOut()
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
for train_idx, test_idx in logo.split(X, y, groups=y):
    print("Split --------------------------------------------------")
    print("Training-set class counts:\n", pd.DataFrame(y[train_idx]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_idx]).value_counts())
2.2.12 Leave-P-out -- LeavePOut
# LeavePOut
# Holds out every possible combination of p samples as the test set, so the
# number of splits is C(n_samples, p). That count explodes combinatorially
# (LeavePOut(10) on 100 samples is roughly 1.7e13 splits), so keep n and p small.
import numpy as np
import pandas as pd
from sklearn.model_selection import *
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=20,
                           n_features=20,
                           shuffle=True,
                           n_classes=6,
                           n_clusters_per_class=1,
                           n_informative=18,
                           weights=[0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
                           random_state=SEED)
lpo = LeavePOut(2)  # C(20, 2) = 190 splits
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
for train_idx, test_idx in lpo.split(X, y):
    print("Split --------------------------------------------------")
    print("Training-set class counts:\n", pd.DataFrame(y[train_idx]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_idx]).value_counts())
2.2.13 Leave-P-out -- LeavePGroupsOut
# LeavePGroupsOut
# Holds out every possible combination of p whole groups as the test set.
import numpy as np
import pandas as pd
from sklearn.model_selection import *
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=100,
                           n_features=20,
                           shuffle=True,
                           n_classes=6,
                           n_clusters_per_class=1,
                           n_informative=18,
                           weights=[0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
                           random_state=SEED)
lpgo = LeavePGroupsOut(2)  # with groups=y, each pair of classes is held out in turn
print("Class counts before split:")
print(pd.DataFrame(y).value_counts())
for train_idx, test_idx in lpgo.split(X, y, groups=y):
    print("Split --------------------------------------------------")
    print("Training-set class counts:\n", pd.DataFrame(y[train_idx]).value_counts())
    print("Test-set class counts:\n", pd.DataFrame(y[test_idx]).value_counts())
2.2.14 Predefined split -- PredefinedSplit
# PredefinedSplit
# Splits the dataset according to a pre-specified fold assignment. For example,
# if test_fold contains the three fold labels 0, 1 and 2, there will be three
# splits, each using one fold as the test set. Samples whose entry is -1 are
# never placed in a test set; they stay in the training set of every split.
import numpy as np
import pandas as pd
from sklearn.model_selection import *
from sklearn.datasets import make_classification

SEED = 666
X, y = make_classification(n_samples=100,
                           n_features=20,
                           shuffle=True,
                           n_classes=6,
                           n_clusters_per_class=1,
                           n_informative=18,
                           weights=[0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
                           random_state=SEED)
# Three fold labels 0, 1, 2; the samples set to -1 are always in the training set
test_fold = np.hstack((np.zeros(20), np.ones(40), np.ones(10) + 1, np.zeros(30) - 1))
print(test_fold)
pres = PredefinedSplit(test_fold)
for train_idx, test_idx in pres.split():
    print("TRAIN:", train_idx, "TEST:", test_idx)
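A small sketch to confirm the -1 convention on a toy test_fold: samples marked -1 appear in every training set and never in a test set.

# PredefinedSplit: entries equal to -1 are excluded from every test set
import numpy as np
from sklearn.model_selection import PredefinedSplit

test_fold = np.array([0, 0, 1, 1, -1, -1])
ps = PredefinedSplit(test_fold)
for train_idx, test_idx in ps.split():
    print("TRAIN:", train_idx, "TEST:", test_idx)
# TRAIN: [2 3 4 5] TEST: [0 1]
# TRAIN: [0 1 4 5] TEST: [2 3]
# indices 4 and 5 (the -1 entries) sit in both training sets, never in a test set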
2.2.15 Time-window split -- TimeSeriesSplit
Splitting for time series: training indices always come before the corresponding test indices, so the model is never trained on the future.
# TimeSeriesSplit
# Time-series split, similar to an expanding window: the first n samples form the
# training set and the samples right after them form the test set.
import numpy as np
import pandas as pd
from sklearn.model_selection import *

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])
tss = TimeSeriesSplit(n_splits=5, max_train_size=None, test_size=None, gap=0)
for train_idx, test_idx in tss.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
'''
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
'''
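TimeSeriesSplit also supports a rolling window via max_train_size and a gap of skipped samples between each training and test block; a small sketch with toy parameter values:

# TimeSeriesSplit with a bounded (rolling) training window and a gap
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(16).reshape(8, 2)  # 8 toy samples in time order
tss = TimeSeriesSplit(n_splits=3, max_train_size=3, gap=1)
for train_idx, test_idx in tss.split(X):
    # the training window holds at most 3 samples, and the 1 sample right
    # before each test block is skipped
    print("TRAIN:", train_idx, "TEST:", test_idx)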
2.2.x Appendix: notes
# All 15 splitters
# --------------------------- K-fold splits ------------------------------------
# K-fold cross-validation: cut the data into K folds; in each round one fold is
# the test set and the other K-1 folds are the training set.
# Shuffled multi-fold split (five folds by default); shuffle=True shuffles the samples
KFold(n_splits=5, shuffle=True, random_state=SEED)
for train_index, test_index in aKFold.split(X)
# Group-aware multi-fold split (five folds by default); requires groups, and each
# group lands entirely in either the training set or the test set
GroupKFold(n_splits=5)
for train_index, test_index in group_kfold.split(X, y, groups=y)
# Unlike KFold, split() also needs y; the class proportions of y are preserved in every fold
StratifiedKFold(n_splits=5, shuffle=False, random_state=None)
# split() needs X, y and groups; keeps each group intact in one fold while preserving
# the class distribution of y as far as possible; groups has the same length as X and y,
# and there must be at least n_splits distinct groups
StratifiedGroupKFold(n_splits=3, shuffle=False, random_state=None)
# Repeats an n_splits-fold KFold split n_repeats times: n_splits * n_repeats splits in total
RepeatedKFold(n_splits=4, n_repeats=2, random_state=666)
# Repeats an n_splits-fold StratifiedKFold split n_repeats times: n_splits * n_repeats splits in total
RepeatedStratifiedKFold(n_splits=3, n_repeats=2, random_state=SEED)
# ------------------- ShuffleSplit ----------------------------------
# Unlike K-fold splitting, ShuffleSplit lets you pick the number of splits and the
# test-set fraction; set random_state to make the splits reproducible
ShuffleSplit(n_splits=5, test_size=0.25, train_size=None, random_state=666)
# Number of splits and test fraction selectable; needs X and y, and keeps the class
# proportions of y similar across the resulting subsets
StratifiedShuffleSplit(n_splits=3, test_size=None, train_size=None, random_state=SEED)
# Number of splits and test fraction selectable; needs groups and splits by whole groups
GroupShuffleSplit(n_splits=5, test_size=None, train_size=None, random_state=None)
for train_idx, test_idx in gss.split(X, y=None, groups=y)
# ------------------------- Leave-one-out -----------------------------------------
# Leave-one-out and its leave-P-out extension: 1 (or P) samples (or groups) form the
# test set and the rest form the training set; the number of splits follows from the
# sample (or group) count, so it is not specified
# Each split holds out exactly one sample as the test set; n samples yield n splits
LeaveOneOut()
for train_idx, test_idx in loo.split(X)
# Group-wise leave-one-out: groups the samples by groups, then holds out one group per split
LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups=y)
# Extension of leave-one-out; LeavePOut(1) is identical to LeaveOneOut()
LeavePOut(p)
# Extension of leave-one-group-out; LeavePGroupsOut(1) is identical to LeaveOneGroupOut()
LeavePGroupsOut(p)
# -------------------------- Predefined split -----------------------------------------
# Splits according to a pre-specified fold assignment: if test_fold contains the three
# labels 0, 1, 2, there are three splits, each using one label as the test set;
# samples set to -1 always stay in the training set
test_fold = np.hstack((np.zeros(20), np.ones(40), np.ones(10) + 1, np.zeros(30) - 1))
pres = PredefinedSplit(test_fold)
# ------------------------- Time-series split -----------------------------------
# Time-series split, like an expanding window: the first n samples are the training set
# and the following samples are the test set
TimeSeriesSplit(n_splits=5, max_train_size=None, test_size=None, gap=0)