sklearn.model_selection.train_test_split randomly splits a dataset into a training set and a test set according to a given proportion.
1. Usage:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_data, train_target, test_size=0.2, random_state=0)
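2. Parameters:
train_data: the feature set of the samples to split
train_target: the labels corresponding to those samples
test_size: size of the test set. A float between 0 and 1 is the fraction of the dataset assigned to the test set; an integer is the absolute number of test samples
random_state: seed for the random number generator. On the same dataset, the same seed always yields the same split, while different seeds generally yield different splits
X_train, y_train: the features and labels of the training set
X_test, y_test: the features and labels of the test set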
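The parameter descriptions above can be checked with a short experiment. The following is a minimal sketch, not part of the original post: the toy data (X, y) and the variable names with suffixes such as _n, _again, and _other are made up for illustration. It shows test_size given as a fraction and as a count, and the effect of reusing a seed.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 two-dimensional samples; the label is the parity
# of the first feature, so feature/label alignment is easy to verify later.
X = [[i, i * 2] for i in range(100)]
y = [i % 2 for i in range(100)]

# test_size as a float: fraction of the dataset that goes to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print("fraction:", len(X_train), len(X_test))

# test_size as an int: absolute number of test samples.
X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(
    X, y, test_size=15, random_state=0)
print("count:   ", len(X_train_n), len(X_test_n))

# Same seed on the same data: the split is reproduced exactly.
_, X_test_again, _, y_test_again = train_test_split(
    X, y, test_size=0.2, random_state=0)
print("same seed, same split:", X_test == X_test_again)

# Different seed: the split will generally differ.
_, X_test_other, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42)
print("different seed, same split:", X_test == X_test_other)

# Features and labels are shuffled together, so each row keeps its own label.
print("aligned:", all(row[0] % 2 == label for row, label in zip(X_test, y_test)))
```

Fixing random_state is mainly useful for reproducible experiments; leaving it unset gives a different split on each run.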
3. Example:
Generate a dataset of 100 samples and randomly split off 20% of it as the test set
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 3.6

# Older scikit-learn releases used:
# from sklearn.cross_validation import train_test_split
from sklearn.model_selection import train_test_split

# Generate 100 records: 100 two-dimensional feature vectors with 100 matching labels
X = [["feature ", "one "]] * 50 + [["feature ", "two "]] * 50
y = [1] * 50 + [2] * 50

# Randomly draw 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("train:", len(X_train), "test:", len(X_test))

# Inspect the samples that ended up in the test set
for i in range(len(X_test)):
    print("".join(X_test[i]), y_test[i])

'''
train: 80 test: 20
feature two 2
feature two 2
feature one 1
feature two 2
feature two 2
feature one 1
feature one 1
feature two 2
feature two 2
feature two 2
feature two 2
feature one 1
feature two 2
feature two 2
feature two 2
feature one 1
feature one 1
feature one 1
feature two 2
feature one 1
'''
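In the output listed above, the test set holds 12 samples of class 2 and 8 of class 1, even though the full dataset is balanced 50/50. A plain random split does not guarantee class proportions. As a hedged aside that goes beyond the original example, the sketch below counts the labels in the test set and compares this with the stratify parameter of train_test_split, which preserves the label proportions. The variable names y_test_random and y_test_strat are made up for illustration.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Same data as in the example: 50 samples of class 1, 50 of class 2.
X = [["feature ", "one "]] * 50 + [["feature ", "two "]] * 50
y = [1] * 50 + [2] * 50

# Plain random split: the class counts in the test set are left to chance.
_, _, _, y_test_random = train_test_split(
    X, y, test_size=0.2, random_state=1)
print("random:    ", Counter(y_test_random))

# Stratified split: the class proportions of y are kept in both subsets.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
print("stratified:", Counter(y_test_strat))
```

With a balanced dataset and a 20-sample test set, the stratified split puts 10 samples of each class into the test set. Whether stratification is appropriate depends on the task; it is mostly relevant for classification with small or imbalanced datasets.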