手動計算富集分析

本文轉載自查看原文 2020-03-17 14:14 802 工具

富集分析非常常見，用於判斷抽樣的結果是否顯著。

例子1：一個工廠總共有N件產品，其中M件次品，現在從中抽取n件做檢查，抽到k件次品的概率分布服從超幾何分布。

例子2：一個細胞有N個基因，其中在pathway A里面有M個基因，現在從中抽取n個基因，抽到k個pathway A里基因的概率分布服從超幾何分布。

最靠譜的富集分析當屬clusterProfiler，里面的enrichGO可以做富集分析。

現在我有個性化的需求，所以要自己做，想借鑒一下里面的代碼。

查看enrichGO代碼，發現是里面的enricher_internal在做這件事，去GitHub查，發現enricher_internal是DOSE的函數，繼續去GitHub查，定位到enricher_internal。

代碼一覽無余：

termID2ExtID <- termID2ExtID[idx]
qTermID2ExtID <- qTermID2ExtID[idx]
qTermID <- unique(names(qTermID2ExtID))

## prepare parameter for hypergeometric test
k <- sapply(qTermID2ExtID, length)
k <- k[qTermID]
M <- sapply(termID2ExtID, length)
M <- M[qTermID]

N <- rep(length(extID), length(M))
## n <- rep(length(gene), length(M)) ## those genes that have no annotation should drop.
n <- rep(length(qExtID2TermID), length(M))
args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
                      numW=M,        ## White balls
                      numB=N-M,      ## Black balls
                      numDrawn=n)    ## balls drawn


## calcute pvalues based on hypergeometric model
pvalues <- apply(args.df, 1, function(n)
                 phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
                 )

## gene ratio and background ratio
GeneRatio <- apply(data.frame(a=k, b=n), 1, function(x)
                   paste(x[1], "/", x[2], sep="", collapse="")
                   )
BgRatio <- apply(data.frame(a=M, b=N), 1, function(x)
                 paste(x[1], "/", x[2], sep="", collapse="")
                 )


Over <- data.frame(ID = as.character(qTermID),
                   GeneRatio = GeneRatio,
                   BgRatio = BgRatio,
                   pvalue = pvalues,
                   stringsAsFactors = FALSE)

p.adj <- p.adjust(Over$pvalue, method=pAdjustMethod)
qobj <- tryCatch(qvalue(p=Over$pvalue, lambda=0.05, pi0.method="bootstrap"), error=function(e) NULL)

if (class(qobj) == "qvalue") {
    qvalues <- qobj$qvalues
} else {
    qvalues <- NA
}

geneID <- sapply(qTermID2ExtID, function(i) paste(i, collapse="/"))
geneID <- geneID[qTermID]
Over <- data.frame(Over,
                   p.adjust = p.adj,
                   qvalue = qvalues,
                   geneID = geneID,
                   Count = k,
                   stringsAsFactors = FALSE)

Description <- TERM2NAME(qTermID, USER_DATA)

if (length(qTermID) != length(Description)) {
    idx <- qTermID %in% names(Description)
    Over <- Over[idx,]
}
Over$Description <- Description
nc <- ncol(Over)
Over <- Over[, c(1,nc, 2:(nc-1))]


Over <- Over[order(pvalues),]


Over$ID <- as.character(Over$ID)
Over$Description <- as.character(Over$Description)

row.names(Over) <- as.character(Over$ID)

x <- new("enrichResult",
         result         = Over,
         pvalueCutoff   = pvalueCutoff,
         pAdjustMethod  = pAdjustMethod,
         qvalueCutoff   = qvalueCutoff,
         gene           = as.character(gene),
         universe       = extID,
         geneSets       = geneSets,
         organism       = "UNKNOWN",
         keytype        = "UNKNOWN",
         ontology       = "UNKNOWN",
         readable       = FALSE
         )
return (x)

核心就是這個代碼了：

args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
                  numW=M,        ## White balls
                  numB=N-M,      ## Black balls
                  numDrawn=n)    ## balls drawn


## calcute pvalues based on hypergeometric model
pvalues <- apply(args.df, 1, function(n)
             phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
             )

phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)

其中：

第一個參數：q
vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.

第二個參數：m
the number of white balls in the urn.

第三個參數：n
the number of black balls in the urn.

第四個參數：k
the number of balls drawn from the urn.

ID Description GeneRatio BgRatio pvalue p.adjust qvalue
GO:0008380 RNA splicing 68/854 364/23210 6.07E-29 2.70E-25 2.28E-25

關鍵的只有兩個數：

GeneRatio：我輸入的基因數854（n），其中在pathway A里的有68個（k）

BgRatio：總共背景（有注釋）基因數23210（N），其中pathway A里的基因數364個（M）

phyper(k-1, M, N-M, n, lower.tail = TRUE, log.p = FALSE)

帶入數字：