【R語言學習筆記】7. 將數據划分為訓練集、驗證集和測試集

本文轉載自查看原文 2020-01-29 00:56 6491 R語言/ 數據處理

1. 目的：介紹將數據集划分為訓練集、驗證集和測試集的方法。

2. 數據來源：github https://github.com/reisanar/datasets/blob/master/WestRoxbury.csv

3. 此博客主要介紹划分數據的方法，因此不對變量做過多介紹。

4. 划分方法

4.1 將變量划分為訓練集、驗證集和測試集

Method 1：

## partitioning into training (50%), validation (30%), and test (20%) sets

# randomly sample 50% of the row IDs for training
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5)

# sample 30% of the row IDs into the validation set, drawing only from records not already in the training set
# use setdiff() to find records not already in the training set
valid.rows <- sample(setdiff(rownames(housing.df), train.rows), dim(housing.df)[1]*0.3)

# assign the remaining 20% row IDs serve as test
test.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows))

# create the 3 data frames by collecting all columns from the appropriate rows 
housing.train <- housing.df[train.rows, ]
housing.valid <- housing.df[valid.rows, ]
housing.test <- housing.df[test.rows, ]

Method 2:

## alternative

train.rows <- sample(1:nrow(housing.df), dim(housing.df)[1]*0.5)
housing.train <- housing.df[train.rows,]

remain <- housing.df[-train.rows,]
valid.rows <- sample(1:nrow(remain), dim(housing.df)[1]*0.3) #dim(housing.df)[1]*0.3 -> be careful!
housing.valid <- remain[valid.rows,]
housing.test <- remain[-valid.rows,]

4.2 將數據划分為訓練集和測試集

Method 1:

## partitioning into training (60%) and validation (40%)
set.seed(1) ## to get the same sequence of numbers

# randomly sample 60% of the row IDs for training; the remaining 40% serve as validation
train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.6)

# collect all the columns with training row ID into training set:
housing.train <- housing.df[train.rows, ]

# assign row IDs that are not already in the training set, into validation 
valid.rows <- setdiff(rownames(housing.df), train.rows) 
housing.valid <- housing.df[valid.rows, ]

Method 2:

## alternative 1
# collect all the columns without training row ID into validation set 
train.rows <- sample(1:nrow(housing.df), dim(housing.df)[1]*0.6)
housing.train <- housing.df[train.rows, ]
housing.valid <- housing.df[-train.rows, ]

Method 3:

## alternative 2: generate random numbers
gp <- runif(nrow(housing.df)) # generate uniform random numbers
housing.train <- housing.df[gp < 0.6,]
housing.test <- housing.df[gp >=0.6,]

Method 4:

## alternative 3
n_obs <- nrow(housing.df) # get the number of observations
permuted_rows <- sample(n_obs) # shuffle row indices: permuted_rows
housing_shuffled <- housing.df[permuted_rows,] # randomly order data: Sonar
split <- round(n_obs * 0.6) # identify row to split on: split
housing.train <- housing_shuffled[1: split,] # create train
housing.test <- housing_shuffled[(split+1):nrow(housing_shuffled),] # create test

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 將數據划分為訓練集和測試集；縮放特征區間關於數據集的划分--訓練集、驗證集和測試集關於訓練集,驗證集,測試集的划分關於訓練集,驗證集,測試集的划分 Python數據預處理—訓練集和測試集數據划分 10-Python實現數據集划分（訓練集/驗證集/測試集）訓練集、驗證集和測試集的概念及划分原則 python將圖像划分成訓練集，驗證集和測試集划分訓練集與測試集如何把數據集划分成訓練集和測試集