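The reshape2 family of functions in R
Preface
In the R presentation I posted a few days ago, I said that learning R is mostly a matter of getting familiar with its functions (learning statistical models is a separate matter... I personally prefer not to lean on software to learn theory; you can read the code directly, but derivations with pen and paper remain an indispensable basic skill). Once the basic functions are familiar, the many prepackaged functions become powerful tools for shortening your code.
Excel has its magical "pivot table", and a lot of the time it really is impressive... Still, I prefer R, and I like how simply R writes out csv or xlsx files. As for "kneading" data (the formal term seems to be data wrangling), I also prefer writing the steps out as code rather than working directly on the result as in Excel. It just feels less error-prone.
Kneading data, as the name suggests, means taking data in its existing format and turning it into another shape, for example turning a long time series into a wider one. That can of course be done with the base reshape() function. I was not satisfied, though, and wanted something easier to use, so I naturally ended up at the reshape2 package.
The essence of this package is melt() and *cast(). To be honest, melt() took me a while to understand, especially why you need to melt first and then cast... Later I found that this two-step approach is remarkably powerful: almost any shape becomes easier to knead.
Warm-up over; back to the topic: how do you knead data with reshape2? Although reshape2 supports arrays, lists and data frames, I usually work with data.frame, so that is what I will cover. The first step is to call melt(). Do not worry about the format of your input; the function handles arrays, lists and data frames alike. Next, you tell it which variables (uniquely) identify an individual record. What does that mean? Look at the arguments of melt() first: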
melt(data, id.vars, measure.vars,
variable.name = "variable", ..., na.rm = FALSE,
value.name = "value")
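Here id.vars names the identifying variables, and measure.vars can then be left out, in which case every remaining column is treated as a measured variable. The new data keeps all the id.vars columns and adds two new columns, variable and value: one stores the name of the measured variable, the other its value. This is equivalent to the long format of panel data. Here is an example copied directly from the package author:
Original data (the column names are converted to lower case first, which is why they appear that way in the output):
names(airquality) <- tolower(names(airquality))
head(airquality)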
ozone solar.r wind temp month day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
dim(airquality)
[1] 153 6
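Next we use month and day as the variables that identify each record, calling melt(airquality, id=c("month", "day")) (id is matched to the id.vars argument by R's partial argument matching):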
require(reshape2)
head(melt(airquality, id=c("month", "day")))
month day variable value
1 5 1 ozone 41
2 5 2 ozone 36
3 5 3 ozone 12
4 5 4 ozone 18
5 5 5 ozone NA
6 5 6 ozone 28
dim(melt(airquality, id=c("month", "day")))
[1] 612 4
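As a quick sanity check on those dimensions: melting stacks each measured column on top of the others, so the molten data should have nrow(data) times the number of measured columns as its row count. A minimal sketch on the built-in airquality data (the object names aq, aq_long and n_measured are only illustrative):

```r
library(reshape2)

aq <- airquality
names(aq) <- tolower(names(aq))

# month and day identify a record; the other four columns are measured
aq_long <- melt(aq, id.vars = c("month", "day"))

n_measured <- ncol(aq) - 2                # ozone, solar.r, wind, temp
nrow(aq_long) == nrow(aq) * n_measured    # 153 * 4 = 612 rows
levels(aq_long$variable)                  # names of the measured columns
```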
So now the data is long. After that you can cast it any way you like... dcast() returns data in wide format. For example, to make day the only identifier:
names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)
head(dcast(aqm, day ~ variable+month))
day ozone_5 ozone_6 ozone_7 ozone_8 ozone_9 solar.r_5 solar.r_6 solar.r_7 solar.r_8 solar.r_9 wind_5 wind_6 wind_7
1 1 41 NA 135 39 96 190 286 269 83 167 7.4 8.6 4.1
2 2 36 NA 49 9 78 118 287 248 24 197 8.0 9.7 9.2
3 3 12 NA 32 16 73 149 242 236 77 183 12.6 16.1 9.2
4 4 18 NA NA 78 91 313 186 101 NA 189 11.5 9.2 10.9
5 5 NA NA 64 35 47 NA 220 175 NA 95 14.3 8.6 4.6
6 6 28 NA 40 66 32 NA 264 314 NA 92 14.9 14.3 10.9
wind_8 wind_9 temp_5 temp_6 temp_7 temp_8 temp_9
1 6.9 6.9 67 78 84 81 91
2 13.8 5.1 72 74 85 81 92
3 7.4 2.8 74 67 81 82 93
4 6.9 4.6 62 84 84 86 93
5 7.4 7.4 56 85 83 85 87
6 4.6 15.5 66 79 83 87 84
Or compute the mean for each month:
head(dcast(aqm, month ~ variable, mean, margins = c("month", "variable")))
month ozone solar.r wind temp (all)
1 5 23.61538 181.2963 11.622581 65.54839 68.70696
2 6 29.44444 190.1667 10.266667 79.10000 87.38384
3 7 59.11538 216.4839 8.941935 83.90323 93.49748
4 8 59.96154 171.8571 8.793548 83.96774 79.71207
5 9 31.44828 167.4333 10.180000 76.90000 71.82689
6 (all) 42.12931 185.9315 9.957516 77.88235 80.05722
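Of course there is also the more powerful acast(), which returns a vector, matrix or array rather than a data frame, used here together with the . function: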
library(plyr) # needed to access . function
acast(aqm, variable ~ month, mean, subset = .(variable == "ozone"))
5 6 7 8 9
ozone 23.61538 29.44444 59.11538 59.96154 31.44828
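So that is basically how data can be kneaded back and forth... Doesn't it feel a bit like a pivot table? It is just more flexible, and you can supply your own aggregation functions.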
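The point about custom functions can be sketched as follows: any function that reduces a vector to a single value can serve as the aggregation function. The helper value_range below is a made-up name for illustration, not part of reshape2. The guard for empty input is there because the cast step may call the function on a zero-length vector to work out a fill value.

```r
library(reshape2)

aq <- airquality
names(aq) <- tolower(names(aq))
aqm <- melt(aq, id.vars = c("month", "day"), na.rm = TRUE)

# Hypothetical helper: largest minus smallest value.
# Returns a numeric NA for empty input so that max()/min() do not warn.
value_range <- function(x) {
  if (length(x) == 0) return(NA_real_)
  max(x) - min(x)
}

# One row per month, one column per variable, each cell is max - min
monthly_range <- dcast(aqm, month ~ variable, fun.aggregate = value_range)
monthly_range
```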
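There is also recast(), which melts and casts in a single call, and colsplit(), which splits variable names into separate columns... There are not many functions, but each one is well chosen.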
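A short sketch of those two helpers on the built-in airquality data. Note that recast() spells its arguments id.var and measure.var (without the trailing s used by melt()). The strings passed to colsplit() are invented to mimic the variable_month column names seen earlier.

```r
library(reshape2)

# recast(): melt and cast in one call; extra arguments such as na.rm
# are passed on to the aggregation function
monthly_mean <- recast(airquality, Month ~ variable,
                       fun.aggregate = mean, na.rm = TRUE,
                       id.var = c("Month", "Day"))
monthly_mean

# colsplit(): split strings such as "ozone_5" into separate columns
parts <- colsplit(c("ozone_5", "ozone_6", "wind_5"),
                  pattern = "_", names = c("variable", "month"))
parts
```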
Example_1
# code_1
require(reshape2)
x = data.frame(subject = c("John", "Mary"),
time = c(1,1),
age = c(33,NA),
weight = c(90, NA),
height = c(2,2))
x
subject time age weight height
1 John 1 33 90 2
2 Mary 1 NA NA 2
------------------------------------------------------
# code_2
molten = melt(x, id = c("subject", "time"))
molten
subject time variable value
1 John 1 age 33
2 Mary 1 age NA
3 John 1 weight 90
4 Mary 1 weight NA
5 John 1 height 2
6 Mary 1 height 2
------------------------------------------------------
# code_3
molten = melt(x, id = c("subject", "time"), na.rm = TRUE)
molten
subject time variable value
1 John 1 age 33
3 John 1 weight 90
5 John 1 height 2
6 Mary 1 height 2
------------------------------------------------------
# Statements
dcast(molten, formula = time + subject ~ variable)
dcast(molten, formula = subject + time ~ variable)
dcast(molten, formula = subject ~ variable)
dcast(molten, formula = ... ~ variable)
# Results
> dcast(molten, formula = time + subject ~ variable)
time subject age weight height
1 1 John 33 90 2
2 1 Mary NA NA 2
> dcast(molten, formula = subject + time ~ variable)
subject time age weight height
1 John 1 33 90 2
2 Mary 1 NA NA 2
> dcast(molten, formula = subject ~ variable)
subject age weight height
1 John 33 90 2
2 Mary NA NA 2
> dcast(molten, formula = ... ~ variable)
subject time age weight height
1 John 1 33 90 2
2 Mary 1 NA NA 2
------------------------------------------------------
# Statement
acast(molten, formula = subject ~ time ~ variable)
# Result
> acast(molten, formula = subject ~ time ~ variable)
, , age
1
John 33
Mary NA
, , weight
1
John 90
Mary NA
, , height
1
John 2
Mary 2
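The slices printed above are the third dimension of a single array. A small sketch, rebuilding the toy data from code_1 so that it runs on its own, makes the shape explicit and shows how to address one cell by name (arr is an illustrative name):

```r
library(reshape2)

x <- data.frame(subject = c("John", "Mary"), time = c(1, 1),
                age = c(33, NA), weight = c(90, NA), height = c(2, 2))
molten <- melt(x, id.vars = c("subject", "time"), na.rm = TRUE)

arr <- acast(molten, subject ~ time ~ variable)
dim(arr)                      # subjects x time points x variables
dimnames(arr)
arr["John", "1", "weight"]    # a single cell, addressed by name
```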
------------------------------------------------------
# Melt French Fries dataset
data(french_fries)
head(french_fries)
ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)
head(ffm)
# Aggregate examples - all 3 yield the same result
dcast(ffm, treatment ~ .)
dcast(ffm, treatment ~ ., function(x) length(x))
dcast(ffm, treatment ~ ., length)
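# Passing further arguments through ... to the aggregation function
# (mean() is used here because, unlike sum(), it has a trim argument)
dcast(ffm, treatment ~ ., mean)
dcast(ffm, treatment ~ ., mean, trim = 0.1)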
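A caveat when passing extra arguments through ...: they are handed to the aggregation function as-is, so they only have the intended effect if that function understands them. This base R sketch shows the difference for trim, which mean() accepts but sum() does not (vals is just an illustrative vector):

```r
vals <- c(1, 2, 3, 100)

# mean() has a trim argument: the most extreme values are dropped first
mean(vals)                 # plain mean
mean(vals, trim = 0.25)    # drops one value from each end, averages 2 and 3

# sum() has no trim argument, so trim = 0.1 is simply one more number to add
sum(vals)
sum(vals, trim = 0.1)      # 0.1 larger than sum(vals)
```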
------------------------------------------------------
Example_2
Converting data between wide and long format
Problem
You want to convert data from a wide format to a long format, or the other way around.
Many functions in R expect data to be in a long format rather than a wide format. Programs like SPSS, however, often use wide-formatted data.
Solution
There are two sets of methods that are explained below:
- gather() and spread() from the tidyr package. tidyr is a newer package by the author of reshape2 that covers the same ground for data frames.
- melt() and dcast() from the reshape2 package.
There are a number of other methods which aren’t covered here, since they are not as easy to use:
- The reshape() function, which is confusingly not part of the reshape2 package; it is part of the base install of R.
- stack() and unstack()
Sample data
These data frames hold the same data, but in wide and long formats. They will each be converted to the other format below.
olddata_wide <- read.table(header=TRUE, text='
subject sex control cond1 cond2
1 M 7.9 12.3 10.7
2 F 6.3 10.6 11.1
3 F 9.5 13.1 13.8
4 M 11.5 13.4 12.9
')
# Make sure the subject column is a factor
olddata_wide$subject <- factor(olddata_wide$subject)
olddata_long <- read.table(header=TRUE, text='
subject sex condition measurement
1 M control 7.9
1 M cond1 12.3
1 M cond2 10.7
2 F control 6.3
2 F cond1 10.6
2 F cond2 11.1
3 F control 9.5
3 F cond1 13.1
3 F cond2 13.8
4 M control 11.5
4 M cond1 13.4
4 M cond2 12.9
')
# Make sure the subject column is a factor
olddata_long$subject <- factor(olddata_long$subject)
tidyr
From wide to long
Use gather():
olddata_wide
#> subject sex control cond1 cond2
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
library(tidyr)
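# The arguments to gather():
# - data: Data object
# - key: Name of new key column (made from names of data columns)
# - value: Name of new value column
# - ...: Names of source columns that contain values
# - factor_key: Treat the new key column as a factor (instead of character vector),
#   which the level renaming and sorting further down rely on
data_long <- gather(olddata_wide, condition, measurement, control:cond2, factor_key=TRUE)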
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 2 F control 6.3
#> 3 3 F control 9.5
#> 4 4 M control 11.5
#> 5 1 M cond1 12.3
#> 6 2 F cond1 10.6
#> 7 3 F cond1 13.1
#> 8 4 M cond1 13.4
#> 9 1 M cond2 10.7
#> 10 2 F cond2 11.1
#> 11 3 F cond2 13.8
#> 12 4 M cond2 12.9
In this example, the source columns that are gathered are specified with control:cond2. This means to use all the columns, positionally, between control and cond2. Another way of doing it is to name the columns individually, as in:
gather(olddata_wide, condition, measurement, control, cond1, cond2)
If you need to use gather() programmatically, you may need to use variables containing column names. To do this, you should use the gather_() function instead, which takes strings instead of bare (unquoted) column names.
keycol <- "condition"
valuecol <- "measurement"
gathercols <- c("control", "cond1", "cond2")
gather_(olddata_wide, keycol, valuecol, gathercols)
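In more recent tidyr releases the underscore variants such as gather_() are deprecated. As an alternative sketch, and assuming tidyr 1.0 or later is installed, the same programmatic reshaping can be done with pivot_longer(), a different function from the one used in this section. It takes the new column names as strings, and a character vector of source columns can be wrapped in all_of():

```r
library(tidyr)

olddata_wide <- read.table(header = TRUE, text = '
 subject sex control cond1 cond2
       1   M     7.9  12.3  10.7
       2   F     6.3  10.6  11.1
')

keycol <- "condition"
valuecol <- "measurement"
gathercols <- c("control", "cond1", "cond2")

# Column names are held in variables; all_of() selects exactly those columns
data_long <- pivot_longer(olddata_wide,
                          cols = tidyselect::all_of(gathercols),
                          names_to = keycol,
                          values_to = valuecol)
data_long
```

Unlike gather(), the rows come out grouped by subject rather than by condition.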
Optional: Rename the factor levels of the variable column, and sort.
# Rename factor names from "cond1" and "cond2" to "first" and "second"
levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first"
levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second"
# Sort by subject first, then by condition
data_long <- data_long[order(data_long$subject, data_long$condition), ]
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 5 1 M first 12.3
#> 9 1 M second 10.7
#> 2 2 F control 6.3
#> 6 2 F first 10.6
#> 10 2 F second 11.1
#> 3 3 F control 9.5
#> 7 3 F first 13.1
#> 11 3 F second 13.8
#> 4 4 M control 11.5
#> 8 4 M first 13.4
#> 12 4 M second 12.9
From long to wide
Use spread():
olddata_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 1 M cond1 12.3
#> 3 1 M cond2 10.7
#> 4 2 F control 6.3
#> 5 2 F cond1 10.6
#> 6 2 F cond2 11.1
#> 7 3 F control 9.5
#> 8 3 F cond1 13.1
#> 9 3 F cond2 13.8
#> 10 4 M control 11.5
#> 11 4 M cond1 13.4
#> 12 4 M cond2 12.9
library(tidyr)
# The arguments to spread():
# - data: Data object
# - key: Name of column containing the new column names
# - value: Name of column containing values
data_wide <- spread(olddata_long, condition, measurement)
data_wide
#> subject sex cond1 cond2 control
#> 1 1 M 12.3 10.7 7.9
#> 2 2 F 10.6 11.1 6.3
#> 3 3 F 13.1 13.8 9.5
#> 4 4 M 13.4 12.9 11.5
Optional: A few things to make the data look nicer.
# Rename cond1 to first, and cond2 to second
names(data_wide)[names(data_wide)=="cond1"] <- "first"
names(data_wide)[names(data_wide)=="cond2"] <- "second"
# Reorder the columns
data_wide <- data_wide[, c(1,2,5,3,4)]
data_wide
#> subject sex control first second
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.
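Changing the level order before reshaping might look like the sketch below. It builds a small long data frame of its own (long and wide are illustrative names) and sets the levels of condition explicitly, so that the columns come out in the desired order without reordering afterwards:

```r
library(tidyr)

long <- data.frame(
  subject     = factor(rep(1:2, each = 3)),
  condition   = rep(c("control", "cond1", "cond2"), times = 2),
  measurement = c(7.9, 12.3, 10.7, 6.3, 10.6, 11.1)
)

# Set the level order explicitly before reshaping
long$condition <- factor(long$condition,
                         levels = c("control", "cond1", "cond2"))

wide <- spread(long, condition, measurement)
names(wide)   # the value columns follow the level order
```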
reshape2
From wide to long
Use melt():
olddata_wide
#> subject sex control cond1 cond2
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
library(reshape2)
# Specify id.vars: the variables to keep but not split apart on
melt(olddata_wide, id.vars=c("subject", "sex"))
#> subject sex variable value
#> 1 1 M control 7.9
#> 2 2 F control 6.3
#> 3 3 F control 9.5
#> 4 4 M control 11.5
#> 5 1 M cond1 12.3
#> 6 2 F cond1 10.6
#> 7 3 F cond1 13.1
#> 8 4 M cond1 13.4
#> 9 1 M cond2 10.7
#> 10 2 F cond2 11.1
#> 11 3 F cond2 13.8
#> 12 4 M cond2 12.9
There are options for melt() that can make the output a little easier to work with:
data_long <- melt(olddata_wide,
# ID variables - all the variables to keep but not split apart on
id.vars=c("subject", "sex"),
# The source columns
measure.vars=c("control", "cond1", "cond2" ),
# Name of the destination column that will identify the original
# column that the measurement came from
variable.name="condition",
value.name="measurement"
)
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 2 F control 6.3
#> 3 3 F control 9.5
#> 4 4 M control 11.5
#> 5 1 M cond1 12.3
#> 6 2 F cond1 10.6
#> 7 3 F cond1 13.1
#> 8 4 M cond1 13.4
#> 9 1 M cond2 10.7
#> 10 2 F cond2 11.1
#> 11 3 F cond2 13.8
#> 12 4 M cond2 12.9
If you leave out measure.vars, melt() will automatically use all the variables that are not in id.vars as the measure variables. The reverse is true if you leave out id.vars: every variable not listed in measure.vars is treated as an id variable.
If you don’t specify variable.name, it will name that column "variable", and if you leave out value.name, it will name that column "value".
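A minimal sketch of those defaults, using a cut-down copy of the sample data (wide, by_id and by_measure are illustrative names). Giving only one of id.vars or measure.vars is enough, and the new columns get the default names:

```r
library(reshape2)

wide <- data.frame(subject = factor(1:2), sex = c("M", "F"),
                   control = c(7.9, 6.3), cond1 = c(12.3, 10.6))

# Only id.vars given: every other column is treated as a measure variable
by_id <- melt(wide, id.vars = c("subject", "sex"))
names(by_id)

# Only measure.vars given: every other column is treated as an id variable
by_measure <- melt(wide, measure.vars = c("control", "cond1"))
names(by_measure)
```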
Optional: Rename the factor levels of the variable column.
# Rename factor names from "cond1" and "cond2" to "first" and "second"
levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first"
levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second"
# Sort by subject first, then by condition
data_long <- data_long[ order(data_long$subject, data_long$condition), ]
data_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 5 1 M first 12.3
#> 9 1 M second 10.7
#> 2 2 F control 6.3
#> 6 2 F first 10.6
#> 10 2 F second 11.1
#> 3 3 F control 9.5
#> 7 3 F first 13.1
#> 11 3 F second 13.8
#> 4 4 M control 11.5
#> 8 4 M first 13.4
#> 12 4 M second 12.9
From long to wide
The following code uses dcast() to reshape the data. This function returns a data frame; if you want a vector, matrix, or array as the result, use acast() instead.
olddata_long
#> subject sex condition measurement
#> 1 1 M control 7.9
#> 2 1 M cond1 12.3
#> 3 1 M cond2 10.7
#> 4 2 F control 6.3
#> 5 2 F cond1 10.6
#> 6 2 F cond2 11.1
#> 7 3 F control 9.5
#> 8 3 F cond1 13.1
#> 9 3 F cond2 13.8
#> 10 4 M control 11.5
#> 11 4 M cond1 13.4
#> 12 4 M cond2 12.9
# From the source:
# "subject" and "sex" are columns we want to keep the same
# "condition" is the column that contains the names of the new column to put things in
# "measurement" holds the measurements
library(reshape2)
data_wide <- dcast(olddata_long, subject + sex ~ condition, value.var="measurement")
data_wide
#> subject sex cond1 cond2 control
#> 1 1 M 12.3 10.7 7.9
#> 2 2 F 10.6 11.1 6.3
#> 3 3 F 13.1 13.8 9.5
#> 4 4 M 13.4 12.9 11.5
Optional: A few things to make the data look nicer.
# Rename cond1 to first, and cond2 to second
names(data_wide)[names(data_wide)=="cond1"] <- "first"
names(data_wide)[names(data_wide)=="cond2"] <- "second"
# Reorder the columns
data_wide <- data_wide[, c(1,2,5,3,4)]
data_wide
#> subject sex control first second
#> 1 1 M 7.9 12.3 10.7
#> 2 2 F 6.3 10.6 11.1
#> 3 3 F 9.5 13.1 13.8
#> 4 4 M 11.5 13.4 12.9
The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.
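As a sketch of how level order carries through a round trip: melt() makes the variable column a factor whose levels follow the original column order, so casting that result back restores the columns in their original order with no manual reordering. The data frame below is a cut-down copy of the sample data, and wide, long and back are illustrative names:

```r
library(reshape2)

wide <- data.frame(subject = factor(1:3), sex = c("M", "F", "F"),
                   control = c(7.9, 6.3, 9.5),
                   cond1 = c(12.3, 10.6, 13.1))

long <- melt(wide, id.vars = c("subject", "sex"),
             variable.name = "condition", value.name = "measurement")
levels(long$condition)    # follows the column order of the wide data

back <- dcast(long, subject + sex ~ condition, value.var = "measurement")
names(back)
all.equal(back$cond1, wide$cond1)
```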