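When modeling data in R that contains categorical variables (factors), most modeling functions automatically convert them into dummy variables. A few functions do not, however; the neuralnet function in the neuralnet package, for example, performs no such preprocessing. If you pass it the raw data, it fails with the error "requires numeric/complex matrix/vector arguments".
In that case, short of dropping those variables, the only option is to convert each factor variable by hand into dummy variables that take the values 0 and 1. The functions usually used for this are model.matrix() and class.ind() from the nnet package.
The UCI German credit data serves as the example below.
First, download the german.data dataset from the UCI website and use the str function to get a quick overview of it.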
- download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data",
- "./german.data")
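- # stringsAsFactors = TRUE keeps the string columns as factors (the default is FALSE since R 4.0.0)
- data <- read.table("./german.data", stringsAsFactors = TRUE)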
- str(data)
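As the str output below shows, the data has 21 variables: V21 is the target variable, and V1 to V20 are a mix of integer and factor columns. The models below use the categorical variable V1 (with 4 levels) and the three numeric variables V2, V5 and V8 as explanatory variables.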
- ## 'data.frame': 1000 obs. of 21 variables:
- ## $ V1 : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
- ## $ V2 : int 6 48 12 42 24 36 24 36 12 30 ...
- ## $ V3 : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
- ## $ V4 : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
- ## $ V5 : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
- ## $ V6 : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
- ## $ V7 : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
- ## $ V8 : int 4 2 2 2 3 2 3 2 2 4 ...
- ## $ V9 : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
- ## $ V10: Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
- ## $ V11: int 4 2 3 4 4 4 4 2 4 2 ...
- ## $ V12: Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
- ## $ V13: int 67 22 49 45 53 35 53 35 61 28 ...
- ## $ V14: Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
- ## $ V15: Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
- ## $ V16: int 2 1 1 1 2 1 1 1 1 2 ...
- ## $ V17: Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
- ## $ V18: int 1 1 2 2 2 2 1 1 1 1 ...
- ## $ V19: Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
- ## $ V20: Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
- ## $ V21: int 1 2 1 1 2 1 1 1 1 2 ...
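To see at a glance which columns would need encoding, the factor columns can be picked out programmatically. The sketch below uses a small made-up data frame (`toy`, a hypothetical stand-in, not the German credit data) so that it runs without the download; only the column names mimic the real ones.

```r
# Toy stand-in for the German credit data (values are made up for illustration)
toy <- data.frame(
  V1  = factor(c("A11", "A12", "A14", "A11", "A13", "A14")),
  V2  = c(6L, 48L, 12L, 42L, 24L, 36L),
  V19 = factor(c("A192", "A191", "A191", "A191", "A191", "A192")),
  V21 = c(1L, 2L, 1L, 1L, 2L, 1L)
)

# Logical vector: TRUE for every factor column
isFactor <- sapply(toy, is.factor)
factorCols <- names(toy)[isFactor]

# Number of levels per factor column
nLevels <- sapply(toy[factorCols], nlevels)
print(nLevels)
```

On the real data, replacing `toy` with `data` should list the same factor columns that str reports.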
First, load the neuralnet package and try a model that uses only the numeric variables; it runs without error.
- library("neuralnet")
- NNModelAllNum <- neuralnet(V21 ~ V2 + V5 + V8, data)
- NNModelAllNum
- ## Call: neuralnet(formula = V21 ~ V2 + V5 + V8, data = data)
- ##
- ## 1 repetition was calculated.
- ##
- ## Error Reached Threshold Steps
- ## 1 104.9993578 0.005128715177 55
Adding V1 to the explanatory variables produces the following error:
- NNModel <- neuralnet(V21 ~ V1 + V2 + V5 + V8, data)
- ## Error: requires numeric/complex matrix/vector arguments
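One way to catch this before fitting is to check which variables in the formula are not numeric. The helper below, `nonNumericVars`, is a hypothetical name introduced here for illustration, and the data frame is a made-up stand-in; it is a minimal sketch that assumes the formula names its variables explicitly (no `.` shorthand).

```r
# Return the names of formula variables that are not numeric in `data`.
# Assumes every variable is written out in the formula (no "." shorthand).
nonNumericVars <- function(formula, data) {
  vars <- all.vars(formula)
  vars[!sapply(data[vars], is.numeric)]
}

# Made-up stand-in for the real data
toy <- data.frame(
  V1  = factor(c("A11", "A12", "A14", "A11")),
  V2  = c(6, 48, 12, 42),
  V21 = c(1, 2, 1, 1)
)

print(nonNumericVars(V21 ~ V1 + V2, toy))  # variables that still need encoding
print(nonNumericVars(V21 ~ V2, toy))       # empty when everything is numeric
```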
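Here, the model.matrix function can convert V1 into three dummy variables, V1A12, V1A13 and V1A14. The first level, A11, serves as the baseline: its rows have 0 in all three columns.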
- dummyV1 <- model.matrix(~V1, data)
- head(cbind(dummyV1, data$V1))
- ## (Intercept) V1A12 V1A13 V1A14
- ## 1 1 0 0 0 1
- ## 2 1 1 0 0 2
- ## 3 1 0 0 1 4
- ## 4 1 0 0 0 1
- ## 5 1 0 0 0 1
- ## 6 1 0 0 1 4
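The baseline behaviour can be made explicit with a small sketch. It assumes R's default treatment contrasts for unordered factors and uses a made-up factor rather than the real V1 column; `V1b` is a name introduced here only for illustration.

```r
toy <- data.frame(V1 = factor(c("A11", "A12", "A14", "A11", "A13", "A14")))

# Default treatment coding: the first level (A11) is the baseline
dummies <- model.matrix(~ V1, toy)[, -1, drop = FALSE]  # drop "(Intercept)"
print(colnames(dummies))

# Baseline rows are the ones where every dummy is 0
baselineRows <- rowSums(dummies) == 0
print(toy$V1[baselineRows])

# relevel() picks a different baseline, here A14
toy$V1b <- relevel(toy$V1, ref = "A14")
releveled <- model.matrix(~ V1b, toy)[, -1, drop = FALSE]
print(colnames(releveled))
```

Choosing the baseline does not change what the dummies encode, only which level is represented by the all-zero row.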
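model.matrix passes numeric variables through unchanged and expands only factors (a two-level factor simply becomes a single 0/1 column), so all the variables, including the target V21, can be passed to it together to build a new dataset, modelData, which can then be used for modeling.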
- modelData <- model.matrix(~V1 + V2 + V5 + V8 + V21, data)
- head(modelData)
- ## (Intercept) V1A12 V1A13 V1A14 V2 V5 V8 V21
- ## 1 1 0 0 0 6 1169 4 1
- ## 2 1 1 0 0 48 5951 2 2
- ## 3 1 0 0 1 12 2096 2 1
- ## 4 1 0 0 0 42 7882 2 1
- ## 5 1 0 0 0 24 4870 3 2
- ## 6 1 0 0 1 36 9055 2 1
- NNModel <- neuralnet(V21 ~ V1A12 + V1A13 + V1A14 + V2 + V5 + V8, modelData)
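With more factors, typing every dummy name into the formula gets tedious. The sketch below builds the formula from the column names with reformulate() and converts the matrix to a data frame first, a cautious choice since neuralnet documents its data argument as a data frame. The data are made up, and `modelDF`, `predictors` and `nnFormula` are names introduced here for illustration.

```r
# Toy stand-in for the German credit data (made-up values, subset of columns)
toy <- data.frame(
  V1  = factor(c("A11", "A12", "A14", "A11", "A13", "A14")),
  V2  = c(6, 48, 12, 42, 24, 36),
  V19 = factor(c("A192", "A191", "A191", "A191", "A191", "A192")),
  V21 = c(1, 2, 1, 1, 2, 1)
)

mm <- model.matrix(~ V1 + V2 + V19 + V21, toy)

# Drop the constant intercept column and convert the matrix to a data frame
modelDF <- as.data.frame(mm[, colnames(mm) != "(Intercept)", drop = FALSE])

# Build the formula from the column names instead of typing each dummy by hand
predictors <- setdiff(names(modelDF), "V21")
nnFormula <- reformulate(predictors, response = "V21")
print(nnFormula)

# The fit itself would then be, for example:
# NNModel <- neuralnet::neuralnet(nnFormula, modelDF)
```

Note that the two-level factor V19 contributes a single column, V19A192, to the result.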
The other approach is the class.ind function from the nnet package.
- library("nnet")
- dummyV12 <- class.ind(data$V1)
- head(dummyV12)
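As the output below shows, the result differs slightly from that of model.matrix: it produces four dummy variables, one per level. Note that, to avoid multicollinearity, a categorical variable with n levels needs only n-1 of its dummy variables, and any n-1 will do.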
- ## A11 A12 A13 A14
- ## [1,] 1 0 0 0
- ## [2,] 0 1 0 0
- ## [3,] 0 0 0 1
- ## [4,] 1 0 0 0
- ## [5,] 1 0 0 0
- ## [6,] 0 0 0 1
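The collinearity comes from the fact that the four indicator columns always sum to 1 in every row, which duplicates a constant term. A minimal sketch of dropping one column and combining the rest with a numeric variable follows; the values are made up, and the `V1` prefix on the column names is added here only to mirror the model.matrix naming.

```r
library(nnet)  # provides class.ind()

v1 <- factor(c("A11", "A12", "A14", "A11", "A13", "A14"))
v2 <- c(6, 48, 12, 42, 24, 36)

# One indicator column per level: A11, A12, A13, A14
ind <- class.ind(v1)

# Each row contains exactly one 1, so the columns sum to a constant
print(rowSums(ind))

# Keep n-1 columns; dropping the first one makes A11 the baseline
indReduced <- ind[, -1, drop = FALSE]
colnames(indReduced) <- paste0("V1", colnames(indReduced))

# Combine the dummies with a numeric predictor
modelDF <- data.frame(indReduced, V2 = v2)
print(names(modelDF))
```

Dropping a different column works equally well; it only changes which level acts as the baseline.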