1. 目的:介紹將數據集划分為訓練集、驗證集和測試集的方法。
2. 數據來源:github https://github.com/reisanar/datasets/blob/master/WestRoxbury.csv
3. 此博客主要介紹划分數據的方法,因此不對變量做過多介紹。
4. 划分方法
4.1 將變量划分為訓練集、驗證集和測試集
Method 1:
## partitioning into training (50%), validation (30%), and test (20%) sets # randomly sample 50% of the row IDs for training train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.5) # sample 30% of the row IDs into the validation set, drawing only from records not already in the training set # use setdiff() to find records not already in the training set valid.rows <- sample(setdiff(rownames(housing.df), train.rows), dim(housing.df)[1]*0.3) # assign the remaining 20% row IDs serve as test test.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows)) # create the 3 data frames by collecting all columns from the appropriate rows housing.train <- housing.df[train.rows, ] housing.valid <- housing.df[valid.rows, ] housing.test <- housing.df[test.rows, ]
Method 2:
## alternative train.rows <- sample(1:nrow(housing.df), dim(housing.df)[1]*0.5) housing.train <- housing.df[train.rows,] remain <- housing.df[-train.rows,] valid.rows <- sample(1:nrow(remain), dim(housing.df)[1]*0.3) #dim(housing.df)[1]*0.3 -> be careful! housing.valid <- remain[valid.rows,] housing.test <- remain[-valid.rows,]
4.2 將數據划分為訓練集和測試集
Method 1:
## partitioning into training (60%) and validation (40%) set.seed(1) ## to get the same sequence of numbers # randomly sample 60% of the row IDs for training; the remaining 40% serve as validation train.rows <- sample(rownames(housing.df), dim(housing.df)[1]*0.6) # collect all the columns with training row ID into training set: housing.train <- housing.df[train.rows, ] # assign row IDs that are not already in the training set, into validation valid.rows <- setdiff(rownames(housing.df), train.rows) housing.valid <- housing.df[valid.rows, ]
Method 2:
## alternative 1 # collect all the columns without training row ID into validation set train.rows <- sample(1:nrow(housing.df), dim(housing.df)[1]*0.6) housing.train <- housing.df[train.rows, ] housing.valid <- housing.df[-train.rows, ]
Method 3:
## alternative 2: generate random numbers gp <- runif(nrow(housing.df)) # generate uniform random numbers housing.train <- housing.df[gp < 0.6,] housing.test <- housing.df[gp >=0.6,]
Method 4:
## alternative 3 n_obs <- nrow(housing.df) # get the number of observations permuted_rows <- sample(n_obs) # shuffle row indices: permuted_rows housing_shuffled <- housing.df[permuted_rows,] # randomly order data: Sonar split <- round(n_obs * 0.6) # identify row to split on: split housing.train <- housing_shuffled[1: split,] # create train housing.test <- housing_shuffled[(split+1):nrow(housing_shuffled),] # create test