基因或蛋白富集分析，gsea與ssgsea

本文轉載自查看原文 2020-07-05 12:40 4712 蛋白組學

大家應該對通路富集分析都很熟悉，如DAVID、超幾何富集分析等。都是在大量文章中常見的通路富集方法，給大家介紹一個更加復雜的通路富集分析的前期數據處理包GSVA（gene set variation analysis）與gsea選擇。GSVA是一種非參數的無監督分析方法，主要用來評估芯片核轉錄組的基因集富集結果。主要是通過將基因在不同樣品間的表達量矩陣轉化成基因集在樣品間的表達量矩陣，從而來評估不同的通路在不同樣品間是否富集。具體的一個分析流程如下：

1. 超幾何檢驗GO、KEGG基因富集分析

這是相對簡單粗暴一些的基因富集分析方法，不需要輸入基因的表達值，只需要通過自己設定的閾值(|log2FC| >1 和Pvalue < 0.05)篩選得到的差異基因名。進行統計檢驗后返回顯著富集的功能基因集。如DESeq2 (獲得差異基因信息)，clusterProfiler（ID轉換+富集分析+作圖一站式神包！）

2. GSEA富集分析

相比於第一種簡單粗暴的用硬閾值截取+往籃子里塞雞蛋的方法，GSEA同時考慮了基因在整個表達譜中所處的FoldChange rank以及同一基因集中的基因在表達譜rank中的距離。通俗來講，GSEA基於如下假設：一個基因集中的基因如果在表達譜中所處的rank越極端（高/低FoldChange），而且基因之間的距離越短（rank相近），則這個基因集越可能顯著。所以GSEA需要輸入的是【所有納入差異分析的完整基因list和它們的FC值，並已經預先排序（pre-ranked）】。

具體的原理參考，https://cloud.tencent.com/developer/article/1426130

3. GSVA/ssGSEA分析

ssGSEA顧名思義是一種特殊的GSEA，它主要針對單樣本無法做GSEA而提出的一種實現方法，原理上與GSEA是類似的，不同的是GSEA需要准備表達譜文件即gct，根據表達譜文件計算每個基因的rank值，再進行后續的統計分析。ssGSEA是為無重復的樣本進行，geneset enrichment analysis准備的，所以不同於上方以組別為單位（cancer vs normal）的GSEA分析，通過ssGSEA，每個樣本都可以得到相應基因集的評分。

a. 首先假設有一個樣本的表達數據，那么他應該是這樣的，第一列為基因，第二列為表達值，這樣的兩列的數據矩陣。

首先對樣本的所有基因的表達水平進行排序獲得其在所有基因中的秩次rank，這些基因的集合為BG。

b. 假設要對其進行KEGG的分析，首先我們需要在GSEA官網找到KEGG對應的gmt文件。

gmt文件主要格式是：每行表示一個通路，第一列為通路ID，第二列為通路對應的描述，第三列開始到最后一列為該通路中的基因。

c. 那么對於任意的一個通路A，我們可以拿到這個通路的基因列表GL，從GL中尋找BG里存在的基因並計數為NC，並將這些基因的表達水平加和為SG。

開始計算ES：對於任意一個表達譜中的基因 G，如果G是集合GL中的基因則他的ES等於該基因的表達水平除以SG，否則記該基因的ES等於1除以（基因集合BG總個數減去NC）

依次計算每個BG中的基因的ES值，找到其中絕對值最大的ES作為通路A的A.ES

d. 到此通路A的ES計算完畢，需要一個統計學方法來評估該ES是否是顯著的，即非隨機的。

按照上述計算ES的方法，先隨機打亂表達譜中基因的表達順序，然后再依次計算ES值，如此重復一千次，得到一千個ES值，我們根據這一千個ES值的分布，來計算A.ES在這個分布中所處的位置及出現在該位置時的概率即得到了p值。依次分別計算每個通路的ES及p值，然后使用多重檢驗矯正得到每個通路的FDR

此外，對於多個樣本或多組樣本，也可以利用T檢驗或方差檢驗對結果進行p值的計算與相應的FDR值。

以上即是整個ssGSEA算法的整體思路。

R語言，GSVA包（進行GSVA/ssGSEA分析），limma包（差異pathway的篩選），pheatmap包（熱圖繪制）。

library(GSEABase)

library(GSVA)

#讀取基因集文件

geneSets <- getGmt("test.geneset")

#讀取表達量文件並去除重復

mydata <- read.table(file = "all.genes.fpkm.xls",header=T)

name=mydata[,1]

index <- duplicated(mydata[,1])

fildup=mydata[!index,]

exp=fildup[,-1]

row.names(exp)=name

#將數據框轉換成矩陣

mydata= as.matrix(exp)

#使用gsva方法進行分析，默認mx.diff=TRUE，min.sz=1,max.zs=Inf，這里是設置最小值和最大值

res_es <- gsva(mydata, geneSets, min.sz=10, max.sz=500, verbose=FALSE, parallel.sz=1)

pheatmap(res_es)

Usage
## S4 method for signature 'ExpressionSet,list'
gsva(expr, gset.idx.list, annotation,method=c("gsva", "ssgsea", "zscore", "plage"),
kcdf=c("Gaussian", "Poisson", "none"),abs.ranking=FALSE,min.sz=1,max.sz=Inf,
parallel.sz=0,parallel.type="SOCK",mx.diff=TRUE,tau=switch(method, gsva=1, ssgsea=0.25, NA),ssgsea.norm=TRUE,verbose=TRUE)
Arguments
expr
Gene expression data which can be given either as an ExpressionSet object or as a matrix of expression values where rows correspond to genes and columns correspond to samples.

gset.idx.list
Gene sets provided either as a list object or as a GeneSetCollection object.

annotation
In the case of calling gsva() with expression data in a matrix and gene sets as a GeneSetCollection object, the annotation argument can be used to supply the name of the Bioconductor package that contains annotations for the class of gene identifiers occurring in the row names of the expression data matrix. By default gsva() will try to match the identifiers in expr to the identifiers in gset.idx.list just as they are, unless the annotation argument is set.

method
Method to employ in the estimation of gene-set enrichment scores per sample. By default this is set to gsva (Hänzelmann et al, 2013) and other options are ssgsea (Barbie et al, 2009), zscore (Lee et al, 2008) or plage (Tomfohr et al, 2005). The latter two standardize first expression profiles into z-scores over the samples and, in the case of zscore, it combines them together as their sum divided by the square-root of the size of the gene set, while in the case of plage they are used to calculate the singular value decomposition (SVD) over the genes in the gene set and use the coefficients of the first right-singular vector as pathway activity profile.

kcdf
Character string denoting the kernel to use during the non-parametric estimation of the cumulative distribution function of expression levels across samples when method="gsva". By default, kcdf="Gaussian" which is suitable when input expression values are continuous, such as microarray fluorescent units in logarithmic scale, RNA-seq log-CPMs, log-RPKMs or log-TPMs. When input expression values are integer counts, such as those derived from RNA-seq experiments, then this argument should be set to kcdf="Poisson".

abs.ranking
Flag used only when mx.diff=TRUE. When abs.ranking=FALSE (default) a modified Kuiper statistic is used to calculate enrichment scores, taking the magnitude difference between the largest positive and negative random walk deviations. When abs.ranking=TRUE the original Kuiper statistic that sums the largest positive and negative random walk deviations, is used. In this latter case, gene sets with genes enriched on either extreme (high or low) will be regarded as 'highly' activated.

min.sz
Minimum size of the resulting gene sets.

max.sz
Maximum size of the resulting gene sets.

parallel.sz
Number of processors to use when doing the calculations in parallel. This requires to previously load either the parallel or the snow library. If parallel is loaded and this argument is left with its default value (parallel.sz=0) then it will use all available core processors unless we set this argument with a smaller number. If snow is loaded then we must set this argument to a positive integer number that specifies the number of processors to employ in the parallel calculation.

parallel.type
Type of cluster architecture when using snow.

mx.diff
Offers two approaches to calculate the enrichment statistic (ES) from the KS random walk statistic. mx.diff=FALSE: ES is calculated as the maximum distance of the random walk from 0. mx.diff=TRUE (default): ES is calculated as the magnitude difference between the largest positive and negative random walk deviations.

tau
Exponent defining the weight of the tail in the random walk performed by both the gsva (Hänzelmann et al., 2013) and the ssgsea (Barbie et al., 2009) methods. By default, this tau=1 when method="gsva" and tau=0.25 when method="ssgsea" just as specified by Barbie et al. (2009) where this parameter is called alpha.

ssgsea.norm
Logical, set to TRUE (default) with method="ssgsea" runs the SSGSEA method from Barbie et al. (2009) normalizing the scores by the absolute difference between the minimum and the maximum, as described in their paper. When ssgsea.norm=FALSE this last normalization step is skipped.

verbose
Gives information about each calculation step. Default: FALSE.

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 GSEABase | clusterProfiler | GSEA富集分析 | MSigDB 批處理！功能富集分析、基因ID轉換、查找同源基因、SNP注釋一站式服務【蛋白質基因組】Proteogenomics方法介紹及分析思路功能富集分析概述 GO富集分析信號通路 excel之富集分析圖制作用網絡圖展示富集分析富集分析DAVID、Metascape、Enrichr、ClueGO RNA-seq(9):功能富集分析 GO富集分析柱狀圖