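A Sankey diagram is a specific type of flow diagram in which the width of each branch is proportional to the size of the flow it represents. It is commonly used to visualize data in areas such as energy, material composition, and finance.
It became well known through the "energy efficiency of a steam engine" figure drawn by Matthew Henry Phineas Riall Sankey in 1898, and has been called the Sankey diagram after him ever since.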
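I Load the R packages and data
This post uses the clinical data of LIHC (liver hepatocellular carcinoma) from the TCGA dataset as the example; you can prepare your own clinical data in the same format. You can also reply "R-桑基圖" to the official account to get the example data and R code.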
#install.packages("ggalluvial")
library(ggalluvial)
library(ggplot2)
library(dplyr)
# read the LIHC clinical data
LIHC <- read.csv("TCGA_lihc.csv", header = TRUE)
# preview the first rows
head(LIHC)
    PATIENT_ID    AGE    SEX AJCC_PATHOLOGIC_TUMOR_STAGE OS_STATUS
1 TCGA-XR-A8TE less50   Male                   STAGE III    LIVING
2 TCGA-5R-AA1D less50 Female                   STAGE III    LIVING
3 TCGA-DD-A1EC less50 Female                     STAGE I    LIVING
4 TCGA-ED-A7PY less50 Female                    STAGE II    LIVING
5 TCGA-RC-A6M5 less50 Female                    STAGE IV    LIVING
6 TCGA-DD-A1EH less50   Male                   STAGE III    LIVING
summary(LIHC)
The data behind a Sankey diagram must supply the nodes and their weights. ggalluvial accepts input in either long or wide format.
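Before plotting, it can help to check whether a table is in a shape ggalluvial recognizes. The following is a minimal sketch on a made-up table; `toy_wide` and its values are hypothetical and only stand in for real clinical data. It uses `is_alluvia_form()` from ggalluvial, which tests the wide format.

```r
library(ggalluvial)

# Hypothetical wide-format table: one row per combination of categories,
# plus a weight column (count). Not taken from the LIHC data.
toy_wide <- data.frame(
  STAGE  = c("I", "I", "II", "II"),
  SEX    = c("Female", "Male", "Female", "Male"),
  STATUS = c("LIVING", "LIVING", "DECEASED", "LIVING"),
  count  = c(5, 3, 2, 4)
)

# Check that the first three columns can serve as axes of a wide-format plot
wide_ok <- is_alluvia_form(toy_wide, axes = 1:3, silent = TRUE)
print(wide_ok)
```

If the check fails on your own data, look first for rows that repeat the same combination of axis values.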
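II Draw the Sankey diagram
1 Wide-format example
Apply some simple processing to the clinical data to get the frequency of each combination of the last four variables, arranged as wide data. See the link for more on the processing steps below.
# count the patients in each combination of the four variables
LIHCData <- LIHC %>%
  group_by(AGE, SEX, AJCC_PATHOLOGIC_TUMOR_STAGE, OS_STATUS) %>%
  summarise(count = n(), .groups = "drop")
# inspect the wide-format data
head(LIHCData)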
  AGE    SEX    AJCC_PATHOLOGIC_TUMOR_STAGE OS_STATUS count
  <fct>  <fct>  <fct>                       <fct>     <int>
1 50to70 Female STAGE I                     DECEASED     11
2 50to70 Female STAGE I                     LIVING       16
3 50to70 Female STAGE II                    DECEASED      3
4 50to70 Female STAGE II                    LIVING       11
5 50to70 Female STAGE III                   DECEASED      8
6 50to70 Female STAGE III                   LIVING        9
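The frequency table can also be built in a single call with `dplyr::count()`. This is a small sketch on a made-up data frame (`toy` and its values are hypothetical, not the LIHC data), assuming only that each row is one patient.

```r
library(dplyr)

# Hypothetical stand-in for a clinical table: one row per patient
toy <- data.frame(
  AGE       = c("less50", "less50", "50to70", "50to70", "50to70"),
  SEX       = c("Male", "Male", "Female", "Female", "Male"),
  OS_STATUS = c("LIVING", "LIVING", "LIVING", "DECEASED", "LIVING")
)

# count() groups, tallies, and ungroups in one step;
# name = "count" sets the name of the frequency column
toy_counts <- count(toy, AGE, SEX, OS_STATUS, name = "count")
print(toy_counts)
```

The result is an ungrouped table with one row per observed combination, which is the wide format used for plotting in this section.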
Draw the Sankey diagram
ggplot(as.data.frame(LIHCData),
       aes(axis1 = AJCC_PATHOLOGIC_TUMOR_STAGE, axis2 = SEX, axis3 = AGE,
           y = count)) +
  scale_x_discrete(limits = c("AJCC_STAGE", "SEX", "AGE"), expand = c(.1, .05)) +
  geom_alluvium(aes(fill = OS_STATUS)) +
  geom_stratum() +
  # label each stratum with its category name
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  theme_minimal() +
  ggtitle("Patients in the TCGA-LIHC cohort",
          "stratified by demographics and survival")
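- The axis aesthetics (axis1, axis2, axis3) set the nodes to display (the columns);
- geom_alluvium draws the flows that connect the columns, here filled by survival status.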
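For readers without the TCGA file, the same layers can be tried on a made-up table. This is only a sketch: `toy_wide` and its values are hypothetical, and it uses two axes instead of three to keep it short.

```r
library(ggplot2)
library(ggalluvial)

# Hypothetical wide-format table (not from the LIHC data)
toy_wide <- data.frame(
  STAGE  = c("I", "I", "II", "II"),
  SEX    = c("Female", "Male", "Female", "Male"),
  STATUS = c("LIVING", "LIVING", "DECEASED", "LIVING"),
  count  = c(5, 3, 2, 4)
)

p <- ggplot(toy_wide, aes(axis1 = STAGE, axis2 = SEX, y = count)) +
  # flows between the two columns, colored by status
  geom_alluvium(aes(fill = STATUS)) +
  # the columns (strata) themselves
  geom_stratum() +
  # category names printed on the strata
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  # name the two axes on the x scale
  scale_x_discrete(limits = c("STAGE", "SEX"), expand = c(.1, .05)) +
  theme_minimal()
p
```

Changing `fill = STATUS` to another column changes which variable colors the flows, while the axis aesthetics decide which columns appear.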
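2 Long-format example
ggplot2 usually works with long-format tables; the to_lodes_form function converts the wide table into one.
# to_lodes_form adds the alluvium and stratum columns; the axis variable names go into the column named by key
LIHC_long <- to_lodes_form(data.frame(LIHCData),
                           key = "Demographic",
                           axes = 1:3)
head(LIHC_long)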
  OS_STATUS count alluvium Demographic stratum
1  DECEASED    11        1         AGE  50to70
2    LIVING    16        2         AGE  50to70
3  DECEASED     3        3         AGE  50to70
4    LIVING    11        4         AGE  50to70
5  DECEASED     8        5         AGE  50to70
6    LIVING     9        6         AGE  50to70
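To see what the conversion does row by row, here is a minimal sketch on a made-up table (`toy_wide` and its values are hypothetical). It converts to long format, checks the result with `is_lodes_form()`, and converts back with `to_alluvia_form()`; the column names passed to those two functions are the ones created by `to_lodes_form()`.

```r
library(ggalluvial)

# Hypothetical wide-format table (not from the LIHC data)
toy_wide <- data.frame(
  STAGE  = c("I", "I", "II", "II"),
  SEX    = c("Female", "Male", "Female", "Male"),
  STATUS = c("LIVING", "LIVING", "DECEASED", "LIVING"),
  count  = c(5, 3, 2, 4)
)

# Wide to long: each input row becomes one alluvium with one row per axis
toy_long <- to_lodes_form(toy_wide, axes = 1:2, key = "Demographic")
print(toy_long)

# Confirm that the long table is valid input for the long-format plot
long_ok <- is_lodes_form(toy_long,
                         key = "Demographic", value = "stratum",
                         id = "alluvium", silent = TRUE)

# Long back to wide: one row per alluvium again
toy_back <- to_alluvia_form(toy_long,
                            key = "Demographic", value = "stratum",
                            id = "alluvium")
```

The long table has one row per alluvium per axis, so four input rows and two axes give eight rows; columns that are not axes (here STATUS and count) are repeated on each of them.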
# draw the Sankey diagram
ggplot(data = LIHC_long,
       aes(x = Demographic, stratum = stratum, alluvium = alluvium,
           y = count, label = stratum)) +
  geom_alluvium(aes(fill = OS_STATUS)) +
  geom_stratum() + geom_text(stat = "stratum") +
  theme_minimal() +
  ggtitle("Patients in the TCGA-LIHC cohort",
          "stratified by demographics and survival")
3 Trends in changing states
vaccinations is a dataset built into the package; it shows how the response of the same subject changes from one survey to the next.
data(vaccinations)
# reverse the order of the response levels without relabelling the data
vaccinations <- transform(vaccinations,
                          response = factor(response, rev(levels(response))))
ggplot(vaccinations,
       aes(x = survey, stratum = response, alluvium = subject,
           y = freq,
           fill = response, label = response)) +
  scale_x_discrete(expand = c(.1, .1)) +
  geom_flow() +
  geom_stratum(alpha = .5) +
  geom_text(stat = "stratum", size = 3) +
  theme(legend.position = "none") +
  ggtitle("vaccination survey responses at three points in time")
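When reversing the stacking order of a factor for a plot like this, it is worth knowing that two similar-looking base R idioms behave differently. The sketch below uses a made-up factor (`resp` and its values are hypothetical) to show the difference.

```r
# Hypothetical factor with two levels
resp <- factor(c("Always", "Never", "Always"), levels = c("Always", "Never"))

# Assigning to levels() renames the labels in place,
# so every observation changes its meaning
relabelled <- resp
levels(relabelled) <- rev(levels(relabelled))
print(as.character(relabelled))

# Rebuilding the factor with levels = ... only changes the order
# of the levels; the observations keep their values
reordered <- factor(resp, levels = rev(levels(resp)))
print(as.character(reordered))
print(levels(reordered))
```

Only the second form keeps each subject's answer intact while changing the order in which the strata are drawn.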
4 More details
vignette(topic = "ggalluvial", package = "ggalluvial")
That is a brief introduction to drawing Sankey diagrams with the R package ggalluvial; now you can try it on your own data 🤭.