R Language: Data Reshaping with the reshape2 Package


The reshape2 Function Family in R

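Preface

  In the R demo I put out a few days ago, I mentioned that learning R is mostly a matter of getting familiar with its functions (learning statistical models is a different story... personally I would rather not lean on software to learn theory; you can read the code directly, but working through derivations with pen and paper is still an indispensable basic skill). Once the basic functions are familiar, many of the packaged functions become great tools for shortening your code.

  Excel has its magical "pivot table", and a lot of the time it really is magical... But I still prefer R, and the simplicity of writing straight out to csv or xlsx. For kneading data (the formal name seems to be data wrangling), I also prefer having the steps written out as code rather than facing only the result, as in Excel. It just feels less error-prone.

  Kneading data, as the name suggests, means producing other layouts from the existing data format. For example, turning a long time series into a wider one. This can of course be done with the base reshape() function. I was not satisfied, though, and wanted something easier to use, which naturally led me to the reshape2 package.

  The essence of this package lies in melt() and *cast(). To be honest, melt() took me a while to understand, especially why you need to melt first and cast afterwards... Later I found that this step is simply unbeatable: data of any shape becomes much easier to knead. Big thumbs up.

  Warm-up over, back to the main topic: how do you reshape data with reshape2? Although reshape2 supports arrays, lists and data frames, I usually work with data.frame, so that is what I will cover. The first step is to call melt(). Don't worry about the format of your input: melt() handles arrays, lists and data frames alike. Then you need to tell it which variables (uniquely) identify an individual record. What does that mean? Let's look at the arguments of melt() first: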

	 melt(data, id.vars, measure.vars,
	    variable.name = "variable", ..., na.rm = FALSE,
	    value.name = "value")

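  Here id.vars can specify a set of variables, and measure.vars can then be left empty. The resulting data keeps all the id.vars columns and adds two new columns, variable and value: one stores the variable name, the other stores the variable's value. This is equivalent to the long format of panel data. Here is an example copied straight from the package author:

  Original data: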

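	# The built-in column names are capitalised (Ozone, Solar.R, ...);
	# lower-case them first so the calls below match
	names(airquality) <- tolower(names(airquality))
	head(airquality)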
	  ozone solar.r wind temp month day
	1    41     190  7.4   67     5   1
	2    36     118  8.0   72     5   2
	3    12     149 12.6   74     5   3
	4    18     313 11.5   62     5   4
	5    NA      NA 14.3   56     5   5
	6    28      NA 14.9   66     5   6
	
	dim(airquality)
	[1] 153   6

  Then we take month and day as the variables that identify an individual record, and call melt(airquality, id=c("month", "day")):

	require(reshape2)
	head(melt(airquality, id=c("month", "day")))
	  month day variable value
	1     5   1    ozone    41
	2     5   2    ozone    36
	3     5   3    ozone    12
	4     5   4    ozone    18
	5     5   5    ozone    NA
	6     5   6    ozone    28
	
	dim(melt(airquality, id=c("month", "day")))
	[1] 612   4
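
  A quick sanity check on the numbers above, as a minimal sketch: each of the 153 records contributes one row per measured variable (4 of them), which gives 153 × 4 = 612 rows, and na.rm = TRUE drops exactly the rows whose value is NA. The names aq, measured, long and long_complete are introduced here only for illustration.

```r
library(reshape2)

# Work on a local copy (hypothetical name `aq`) with lower-case column names
aq <- datasets::airquality
names(aq) <- tolower(names(aq))

# measure.vars is left out, so every non-id column is melted
long <- melt(aq, id.vars = c("month", "day"))
measured <- setdiff(names(aq), c("month", "day"))

# One row per (record, variable) pair
nrow(long) == nrow(aq) * length(measured)

# The melted column names become the levels of `variable`, in column order
levels(long$variable)

# na.rm = TRUE removes exactly the rows whose value is NA
long_complete <- melt(aq, id.vars = c("month", "day"), na.rm = TRUE)
nrow(long) - nrow(long_complete) == sum(is.na(long$value))
```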

  There, the data is now long. After that you can cast it however you like... dcast() gives wide-format data. For example, if we want day to be the only identifier:

	names(airquality) <- tolower(names(airquality))
	aqm <- melt(airquality, id=c("month", "day"), na.rm=TRUE)
	head(dcast(aqm, day ~ variable+month))
	
	  day ozone_5 ozone_6 ozone_7 ozone_8 ozone_9 solar.r_5 solar.r_6 solar.r_7 solar.r_8 solar.r_9 wind_5 wind_6 wind_7
	1   1      41      NA     135      39      96       190       286       269        83       167    7.4    8.6    4.1
	2   2      36      NA      49       9      78       118       287       248        24       197    8.0    9.7    9.2
	3   3      12      NA      32      16      73       149       242       236        77       183   12.6   16.1    9.2
	4   4      18      NA      NA      78      91       313       186       101        NA       189   11.5    9.2   10.9
	5   5      NA      NA      64      35      47        NA       220       175        NA        95   14.3    8.6    4.6
	6   6      28      NA      40      66      32        NA       264       314        NA        92   14.9   14.3   10.9
	  wind_8 wind_9 temp_5 temp_6 temp_7 temp_8 temp_9
	1    6.9    6.9     67     78     84     81     91
	2   13.8    5.1     72     74     85     81     92
	3    7.4    2.8     74     67     81     82     93
	4    6.9    4.6     62     84     84     86     93
	5    7.4    7.4     56     85     83     85     87
	6    4.6   15.5     66     79     83     87     84
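
  One way to read a cast formula: the variables to the left of ~ define the rows, and the variables to the right become columns. As a rough sketch of that idea, melting and then casting with month + day on the left should give back the original wide table. The object names below are made up for this example, and NAs are kept on purpose so no record disappears.

```r
library(reshape2)

aq <- datasets::airquality
names(aq) <- tolower(names(aq))

# Keep the NAs here so that every (month, day) record survives the round trip
aqm_all <- melt(aq, id.vars = c("month", "day"))

# Left of ~ : variables that define rows; right of ~ : variables that become columns
wide_again <- dcast(aqm_all, month + day ~ variable, value.var = "value")

# Same values as the original, up to column order and integer/double storage
all.equal(wide_again[, names(aq)], aq, check.attributes = FALSE)
```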

  Or compute the mean for each month:

	 head(dcast(aqm, month ~ variable, mean, margins = c("month", "variable")))
	
	  month    ozone  solar.r      wind     temp    (all)
	1     5 23.61538 181.2963 11.622581 65.54839 68.70696
	2     6 29.44444 190.1667 10.266667 79.10000 87.38384
	3     7 59.11538 216.4839  8.941935 83.90323 93.49748
	4     8 59.96154 171.8571  8.793548 83.96774 79.71207
	5     9 31.44828 167.4333 10.180000 76.90000 71.82689
	6 (all) 42.12931 185.9315  9.957516 77.88235 80.05722

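  There is also acast(), which returns an array or matrix instead of a data frame; here it is combined with the . function from plyr: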

	library(plyr) # needed to access . function
	acast(aqm, variable ~ month, mean, subset = .(variable == "ozone"))
	
	             5        6        7        8        9
	ozone 23.61538 29.44444 59.11538 59.96154 31.44828

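  So basically data can be kneaded back and forth like this... haha. Doesn't it feel a bit like a pivot table? Only more flexible, and you can plug in your own functions.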
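
  As an illustration of plugging in your own function, here is a minimal sketch that computes the spread (max minus min) of each variable per month. value_range is a hypothetical helper written for this example, not something reshape2 provides; it returns NA for an empty input because the aggregation function may also be called on a zero-length vector to work out a fill value.

```r
library(reshape2)

aq <- datasets::airquality
names(aq) <- tolower(names(aq))
aqm <- melt(aq, id.vars = c("month", "day"), na.rm = TRUE)

# Hypothetical helper: spread between the largest and smallest value in a cell.
# The length check guards against being called on an empty vector.
value_range <- function(x) {
  if (length(x) == 0) return(NA_real_)
  max(x) - min(x)
}

monthly_range <- dcast(aqm, month ~ variable, fun.aggregate = value_range)
monthly_range
```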

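  There is also recast(), which melts and casts in one step and returns the cast result (a data frame), and colsplit(), which splits combined variable names into separate columns... Not many functions, but every one of them earns its place.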
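
  A small sketch of both helpers, under the assumption that the installed reshape2 matches the documented signatures recast(data, formula, ..., id.var, measure.var) and colsplit(string, pattern, names). The object names and the combined strings are invented for the example.

```r
library(reshape2)

aq <- datasets::airquality
names(aq) <- tolower(names(aq))

# recast() = melt() followed by dcast(); note the argument is id.var, not id.vars.
# fun.aggregate and na.rm are passed on through ... to the cast step.
monthly_mean <- recast(aq, month ~ variable, id.var = c("month", "day"),
                       fun.aggregate = mean, na.rm = TRUE)
class(monthly_mean)

# colsplit() splits combined names such as "ozone_5" back into separate columns
combined <- c("ozone_5", "ozone_6", "temp_5")
parts <- colsplit(combined, pattern = "_", names = c("variable", "month"))
parts
```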

Example_1

	# code_1
	require(reshape2)
	x = data.frame(subject = c("John", "Mary"), 
	               time = c(1,1),
	               age = c(33,NA),
	               weight = c(90, NA),
	               height = c(2,2))
	x
	  subject time age weight height
	1    John    1  33     90      2
	2    Mary    1  NA     NA      2

	------------------------------------------------------
	
	# code_2
	molten = melt(x, id = c("subject", "time"))
	molten
	  subject time variable value
	1    John    1      age    33
	2    Mary    1      age    NA
	3    John    1   weight    90
	4    Mary    1   weight    NA
	5    John    1   height     2
	6    Mary    1   height     2
	
	------------------------------------------------------		

	# code_3
	molten = melt(x, id = c("subject", "time"), na.rm = TRUE)
	molten
	  subject time variable value
	1    John    1      age    33
	3    John    1   weight    90
	5    John    1   height     2
	6    Mary    1   height     2

	------------------------------------------------------
	
	# Statements
	dcast(molten, formula = time + subject ~ variable)
	dcast(molten, formula = subject + time  ~ variable)
	dcast(molten, formula = subject  ~ variable)
	dcast(molten, formula = ...  ~ variable)
	
	# Results
	> dcast(molten, formula = time + subject ~ variable)
	  time subject age weight height
	1    1    John  33     90      2
	2    1    Mary  NA     NA      2
	> dcast(molten, formula = subject + time  ~ variable)
	  subject time age weight height
	1    John    1  33     90      2
	2    Mary    1  NA     NA      2
	> dcast(molten, formula = subject  ~ variable)
	  subject age weight height
	1    John  33     90      2
	2    Mary  NA     NA      2
	> dcast(molten, formula = ...  ~ variable)
	  subject time age weight height
	1    John    1  33     90      2
	2    Mary    1  NA     NA      2
	
	------------------------------------------------------
	
	# Statement
	acast(molten, formula = subject  ~ time ~ variable)
	
	# Result
	> acast(molten, formula = subject  ~ time ~ variable)
	, , age
	
	      1
	John 33
	Mary NA
	
	, , weight
	
	      1
	John 90
	Mary NA
	
	, , height
	
	     1
	John 2
	Mary 2
	
	------------------------------------------------------

	# Melt French Fries dataset
	data(french_fries)
	head(french_fries)
	ffm <- melt(french_fries, id = 1:4, na.rm = TRUE)
	head(ffm)
	
	# Aggregate examples - all 3 yield the same result
	dcast(ffm, treatment ~ .)
	dcast(ffm, treatment ~ ., function(x) length(x))
	dcast(ffm, treatment ~ ., length) 
	
	# Passing further arguments through ...
	# (trim is an argument of mean(), not sum(), so mean is used here)
	dcast(ffm, treatment ~ ., mean)
	dcast(ffm, treatment ~ ., mean, trim = 0.1)

	------------------------------------------------------
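
  To make the aggregation behaviour concrete, here is a toy sketch (the readings table is made up): when a formula leaves more than one value in a cell, dcast() falls back to length() and prints a message, so it is usually clearer to name the aggregation function yourself.

```r
library(reshape2)

# Toy molten data, invented for illustration: John has two weight readings
readings <- data.frame(subject  = c("John", "John", "Mary"),
                       time     = c(1, 1, 1),
                       variable = c("weight", "weight", "weight"),
                       value    = c(90, 94, 70))

# No aggregation function given: dcast() defaults to length() and says so
counts <- suppressMessages(dcast(readings, subject ~ variable, value.var = "value"))

# An explicit function states the intent; extra arguments are handed to it
means <- dcast(readings, subject ~ variable, value.var = "value",
               fun.aggregate = mean, na.rm = TRUE)
counts
means
```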

Example_2

Converting data between wide and long format

Problem

  You want to convert data from a wide format to a long format, or the other way around.

  Many functions in R expect data to be in a long format rather than a wide format. Programs like SPSS, however, often use wide-formatted data.

Solution

  There are two sets of methods that are explained below:

  • gather() and spread() from the tidyr package. This is a newer interface to the reshape2 package.
  • melt() and dcast() from the reshape2 package.

  There are a number of other methods which aren’t covered here, since they are not as easy to use:

  • The reshape() function, which is confusingly not part of the reshape2 package; it is part of the base install of R.
  • stack() and unstack()
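
  For comparison only, a minimal sketch of the base reshape() route on a cut-down table (two subjects and two conditions, invented here for brevity). The number of arguments it needs gives a feel for why it is considered less convenient.

```r
# Small wide table, invented for illustration
wide <- data.frame(subject = c(1, 2),
                   control = c(7.9, 6.3),
                   cond1   = c(12.3, 10.6))

long <- reshape(wide,
                direction = "long",
                idvar     = "subject",               # identifies each individual
                varying   = c("control", "cond1"),   # columns to stack
                v.names   = "measurement",           # name of the stacked value column
                timevar   = "condition",             # column that records the source column
                times     = c("control", "cond1"))   # labels written into that column
rownames(long) <- NULL
long
```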

Sample data

  These data frames hold the same data, but in wide and long formats. They will each be converted to the other format below.

	olddata_wide <- read.table(header=TRUE, text='
	 subject sex control cond1 cond2
	       1   M     7.9  12.3  10.7
	       2   F     6.3  10.6  11.1
	       3   F     9.5  13.1  13.8
	       4   M    11.5  13.4  12.9
	')
	# Make sure the subject column is a factor
	olddata_wide$subject <- factor(olddata_wide$subject)

	olddata_long <- read.table(header=TRUE, text='
	 subject sex condition measurement
	       1   M   control         7.9
	       1   M     cond1        12.3
	       1   M     cond2        10.7
	       2   F   control         6.3
	       2   F     cond1        10.6
	       2   F     cond2        11.1
	       3   F   control         9.5
	       3   F     cond1        13.1
	       3   F     cond2        13.8
	       4   M   control        11.5
	       4   M     cond1        13.4
	       4   M     cond2        12.9
	')
	# Make sure the subject column is a factor
	olddata_long$subject <- factor(olddata_long$subject)		

tidyr

From wide to long

Use gather:

	olddata_wide
	#>   subject sex control cond1 cond2
	#> 1       1   M     7.9  12.3  10.7
	#> 2       2   F     6.3  10.6  11.1
	#> 3       3   F     9.5  13.1  13.8
	#> 4       4   M    11.5  13.4  12.9
	
	library(tidyr)
	
	# The arguments to gather():
	# - data: Data object
	# - key: Name of new key column (made from names of data columns)
	# - value: Name of new value column
	# - ...: Names of source columns that contain values
	# - factor_key: Treat the new key column as a factor (instead of character vector)
	data_long <- gather(olddata_wide, condition, measurement, control:cond2, factor_key=TRUE)
	data_long
	#>    subject sex condition measurement
	#> 1        1   M   control         7.9
	#> 2        2   F   control         6.3
	#> 3        3   F   control         9.5
	#> 4        4   M   control        11.5
	#> 5        1   M     cond1        12.3
	#> 6        2   F     cond1        10.6
	#> 7        3   F     cond1        13.1
	#> 8        4   M     cond1        13.4
	#> 9        1   M     cond2        10.7
	#> 10       2   F     cond2        11.1
	#> 11       3   F     cond2        13.8
	#> 12       4   M     cond2        12.9

  In this example, the source columns that are gathered are specified with control:cond2. This means to use all the columns, positionally, between control and cond2. Another way of doing it is to name the columns individually, as in:

	gather(olddata_wide, condition, measurement, control, cond1, cond2)

  If you need to use gather() programmatically, you may need to use variables containing column names. To do this, you should use the gather_() function instead, which takes strings instead of bare (unquoted) column names.

	keycol <- "condition"
	valuecol <- "measurement"
	gathercols <- c("control", "cond1", "cond2")
	
	gather_(olddata_wide, keycol, valuecol, gathercols)
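
  Newer tidyr releases (1.0.0 and later) document pivot_longer() as the successor to gather() and mark the underscore variants as deprecated. A hedged sketch of the same programmatic reshaping with that interface follows; it assumes such a tidyr version is installed, uses a two-row copy of the sample data, and note that the rows come out grouped by subject rather than by condition.

```r
library(tidyr)

# Two-row copy of the sample data, kept short for illustration
wide <- read.table(header = TRUE, text = '
 subject sex control cond1 cond2
       1   M     7.9  12.3  10.7
       2   F     6.3  10.6  11.1
')

keycol     <- "condition"
valuecol   <- "measurement"
gathercols <- c("control", "cond1", "cond2")

# names_to / values_to accept strings directly;
# all_of() selects the columns named in a character vector
long <- pivot_longer(wide,
                     cols      = all_of(gathercols),
                     names_to  = keycol,
                     values_to = valuecol)
long
```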

  Optional: Rename the factor levels of the variable column, and sort.

	# Rename factor names from "cond1" and "cond2" to "first" and "second"
	levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first"
	levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second"
	
	# Sort by subject first, then by condition
	data_long <- data_long[order(data_long$subject, data_long$condition), ]
	data_long
	#>    subject sex condition measurement
	#> 1        1   M   control         7.9
	#> 5        1   M     first        12.3
	#> 9        1   M    second        10.7
	#> 2        2   F   control         6.3
	#> 6        2   F     first        10.6
	#> 10       2   F    second        11.1
	#> 3        3   F   control         9.5
	#> 7        3   F     first        13.1
	#> 11       3   F    second        13.8
	#> 4        4   M   control        11.5
	#> 8        4   M     first        13.4
	#> 12       4   M    second        12.9

From long to wide

Use spread:

	olddata_long
	#>    subject sex condition measurement
	#> 1        1   M   control         7.9
	#> 2        1   M     cond1        12.3
	#> 3        1   M     cond2        10.7
	#> 4        2   F   control         6.3
	#> 5        2   F     cond1        10.6
	#> 6        2   F     cond2        11.1
	#> 7        3   F   control         9.5
	#> 8        3   F     cond1        13.1
	#> 9        3   F     cond2        13.8
	#> 10       4   M   control        11.5
	#> 11       4   M     cond1        13.4
	#> 12       4   M     cond2        12.9
	
	library(tidyr)
	
	# The arguments to spread():
	# - data: Data object
	# - key: Name of column containing the new column names
	# - value: Name of column containing values
	data_wide <- spread(olddata_long, condition, measurement)
	data_wide
	#>   subject sex cond1 cond2 control
	#> 1       1   M  12.3  10.7     7.9
	#> 2       2   F  10.6  11.1     6.3
	#> 3       3   F  13.1  13.8     9.5
	#> 4       4   M  13.4  12.9    11.5

  Optional: A few things to make the data look nicer.

	# Rename cond1 to first, and cond2 to second
	names(data_wide)[names(data_wide)=="cond1"] <- "first"
	names(data_wide)[names(data_wide)=="cond2"] <- "second"
	
	# Reorder the columns
	data_wide <- data_wide[, c(1,2,5,3,4)]
	data_wide
	#>   subject sex control first second
	#> 1       1   M     7.9  12.3   10.7
	#> 2       2   F     6.3  10.6   11.1
	#> 3       3   F     9.5  13.1   13.8
	#> 4       4   M    11.5  13.4   12.9

  The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.

reshape2

From wide to long

Use melt:

	olddata_wide
	#>   subject sex control cond1 cond2
	#> 1       1   M     7.9  12.3  10.7
	#> 2       2   F     6.3  10.6  11.1
	#> 3       3   F     9.5  13.1  13.8
	#> 4       4   M    11.5  13.4  12.9
	
	library(reshape2)
	
	# Specify id.vars: the variables to keep but not split apart on
	melt(olddata_wide, id.vars=c("subject", "sex"))
	#>    subject sex variable value
	#> 1        1   M  control   7.9
	#> 2        2   F  control   6.3
	#> 3        3   F  control   9.5
	#> 4        4   M  control  11.5
	#> 5        1   M    cond1  12.3
	#> 6        2   F    cond1  10.6
	#> 7        3   F    cond1  13.1
	#> 8        4   M    cond1  13.4
	#> 9        1   M    cond2  10.7
	#> 10       2   F    cond2  11.1
	#> 11       3   F    cond2  13.8
	#> 12       4   M    cond2  12.9

  There are options for melt that can make the output a little easier to work with:

	data_long <- melt(olddata_wide,
	        # ID variables - all the variables to keep but not split apart on
	    id.vars=c("subject", "sex"),
	        # The source columns
	    measure.vars=c("control", "cond1", "cond2" ),
	        # Name of the destination column that will identify the original
	        # column that the measurement came from
	    variable.name="condition",
	    value.name="measurement"
	)
	data_long
	#>    subject sex condition measurement
	#> 1        1   M   control         7.9
	#> 2        2   F   control         6.3
	#> 3        3   F   control         9.5
	#> 4        4   M   control        11.5
	#> 5        1   M     cond1        12.3
	#> 6        2   F     cond1        10.6
	#> 7        3   F     cond1        13.1
	#> 8        4   M     cond1        13.4
	#> 9        1   M     cond2        10.7
	#> 10       2   F     cond2        11.1
	#> 11       3   F     cond2        13.8
	#> 12       4   M     cond2        12.9

  If you leave out measure.vars, melt will automatically use all the variables that are not id.vars as the measure.vars. The reverse is true if you leave out id.vars: every variable not listed in measure.vars is treated as an id variable.

  If you don’t specify variable.name, it will name that column "variable", and if you leave out value.name, it will name that column "value".
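
  A short sketch of these defaults on a cut-down copy of the sample data (the wide table here is made up for the example): only measure.vars is given, so the remaining columns become the id variables, and the two new columns get the default names.

```r
library(reshape2)

wide <- data.frame(subject = factor(1:2),
                   sex     = c("M", "F"),
                   control = c(7.9, 6.3),
                   cond1   = c(12.3, 10.6))

# Only measure.vars given: the remaining columns (subject, sex) become id.vars
long <- melt(wide, measure.vars = c("control", "cond1"))

# Default names of the two new columns
names(long)
```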

  Optional: Rename the factor levels of the variable column.

	# Rename factor names from "cond1" and "cond2" to "first" and "second"
	levels(data_long$condition)[levels(data_long$condition)=="cond1"] <- "first"
	levels(data_long$condition)[levels(data_long$condition)=="cond2"] <- "second"
	
	# Sort by subject first, then by condition
	data_long <- data_long[ order(data_long$subject, data_long$condition), ]
	data_long
	#>    subject sex condition measurement
	#> 1        1   M   control         7.9
	#> 5        1   M     first        12.3
	#> 9        1   M    second        10.7
	#> 2        2   F   control         6.3
	#> 6        2   F     first        10.6
	#> 10       2   F    second        11.1
	#> 3        3   F   control         9.5
	#> 7        3   F     first        13.1
	#> 11       3   F    second        13.8
	#> 4        4   M   control        11.5
	#> 8        4   M     first        13.4
	#> 12       4   M    second        12.9

From long to wide

  The following code uses dcast to reshape the data. This function returns a data frame; if you want an array or matrix as the result, use acast instead.

	olddata_long
	#>    subject sex condition measurement
	#> 1        1   M   control         7.9
	#> 2        1   M     cond1        12.3
	#> 3        1   M     cond2        10.7
	#> 4        2   F   control         6.3
	#> 5        2   F     cond1        10.6
	#> 6        2   F     cond2        11.1
	#> 7        3   F   control         9.5
	#> 8        3   F     cond1        13.1
	#> 9        3   F     cond2        13.8
	#> 10       4   M   control        11.5
	#> 11       4   M     cond1        13.4
	#> 12       4   M     cond2        12.9
	
	# From the source:
	# "subject" and "sex" are columns we want to keep the same
	# "condition" is the column that contains the names of the new column to put things in
	# "measurement" holds the measurements
	library(reshape2)
	
	data_wide <- dcast(olddata_long, subject + sex ~ condition, value.var="measurement")
	data_wide
	#>   subject sex cond1 cond2 control
	#> 1       1   M  12.3  10.7     7.9
	#> 2       2   F  10.6  11.1     6.3
	#> 3       3   F  13.1  13.8     9.5
	#> 4       4   M  13.4  12.9    11.5
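
  To see the difference between the two cast functions, here is a minimal sketch on a four-row long table (invented for the example). Both take the same molten data frame; what changes is the type of the result.

```r
library(reshape2)

long <- data.frame(subject     = factor(c(1, 1, 2, 2)),
                   condition   = c("control", "cond1", "control", "cond1"),
                   measurement = c(7.9, 12.3, 6.3, 10.6))

as_df  <- dcast(long, subject ~ condition, value.var = "measurement")
as_mat <- acast(long, subject ~ condition, value.var = "measurement")

class(as_df)        # data frame: subject stays an ordinary column
dim(as_mat)         # matrix: subject moves into the row names
rownames(as_mat)
```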

  Optional: A few things to make the data look nicer.

	# Rename cond1 to first, and cond2 to second
	names(data_wide)[names(data_wide)=="cond1"] <- "first"
	names(data_wide)[names(data_wide)=="cond2"] <- "second"
	
	# Reorder the columns
	data_wide <- data_wide[, c(1,2,5,3,4)]
	data_wide
	#>   subject sex control first second
	#> 1       1   M     7.9  12.3   10.7
	#> 2       2   F     6.3  10.6   11.1
	#> 3       3   F     9.5  13.1   13.8
	#> 4       4   M    11.5  13.4   12.9

  The order of factor levels determines the order of the columns. The level order can be changed before reshaping, or the columns can be re-ordered afterward.
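
  A sketch of the first option, changing the level order before reshaping, on a small made-up long table. With the levels set explicitly, the cast columns should come out in that order, so no renaming or reordering is needed afterward.

```r
library(reshape2)

long <- data.frame(subject     = factor(c(1, 1, 1, 2, 2, 2)),
                   condition   = rep(c("control", "cond1", "cond2"), times = 2),
                   measurement = c(7.9, 12.3, 10.7, 6.3, 10.6, 11.1))

# Put the levels in the order the columns should appear
long$condition <- factor(long$condition, levels = c("control", "cond1", "cond2"))

wide <- dcast(long, subject ~ condition, value.var = "measurement")
names(wide)
```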
