手動計算富集分析


富集分析非常常見,用於判斷抽樣的結果是否顯著。

 

例子1:一個工廠總共有N件產品,其中M件次品,現在從中抽取n件做檢查,抽到k件次品的概率分布服從超幾何分布。

例子2:一個細胞有N個基因,其中在pathway A里面有M個基因,現在從中抽取n個基因,抽到k個pathway A里基因的概率分布服從超幾何分布。

 

最靠譜的富集分析當屬clusterProfiler,里面的enrichGO可以做富集分析。

現在我有個性化的需求,所以要自己做,想借鑒一下里面的代碼。

 

查看enrichGO代碼,發現是里面的enricher_internal在做這件事,去GitHub查,發現enricher_internal是DOSE的函數,繼續去GitHub查,定位到enricher_internal。

代碼一覽無余:

termID2ExtID <- termID2ExtID[idx]
qTermID2ExtID <- qTermID2ExtID[idx]
qTermID <- unique(names(qTermID2ExtID))

## prepare parameter for hypergeometric test
k <- sapply(qTermID2ExtID, length)
k <- k[qTermID]
M <- sapply(termID2ExtID, length)
M <- M[qTermID]

N <- rep(length(extID), length(M))
## n <- rep(length(gene), length(M)) ## those genes that have no annotation should drop.
n <- rep(length(qExtID2TermID), length(M))
args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
                      numW=M,        ## White balls
                      numB=N-M,      ## Black balls
                      numDrawn=n)    ## balls drawn


## calcute pvalues based on hypergeometric model
pvalues <- apply(args.df, 1, function(n)
                 phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
                 )

## gene ratio and background ratio
GeneRatio <- apply(data.frame(a=k, b=n), 1, function(x)
                   paste(x[1], "/", x[2], sep="", collapse="")
                   )
BgRatio <- apply(data.frame(a=M, b=N), 1, function(x)
                 paste(x[1], "/", x[2], sep="", collapse="")
                 )


Over <- data.frame(ID = as.character(qTermID),
                   GeneRatio = GeneRatio,
                   BgRatio = BgRatio,
                   pvalue = pvalues,
                   stringsAsFactors = FALSE)

p.adj <- p.adjust(Over$pvalue, method=pAdjustMethod)
qobj <- tryCatch(qvalue(p=Over$pvalue, lambda=0.05, pi0.method="bootstrap"), error=function(e) NULL)

if (class(qobj) == "qvalue") {
    qvalues <- qobj$qvalues
} else {
    qvalues <- NA
}

geneID <- sapply(qTermID2ExtID, function(i) paste(i, collapse="/"))
geneID <- geneID[qTermID]
Over <- data.frame(Over,
                   p.adjust = p.adj,
                   qvalue = qvalues,
                   geneID = geneID,
                   Count = k,
                   stringsAsFactors = FALSE)

Description <- TERM2NAME(qTermID, USER_DATA)

if (length(qTermID) != length(Description)) {
    idx <- qTermID %in% names(Description)
    Over <- Over[idx,]
}
Over$Description <- Description
nc <- ncol(Over)
Over <- Over[, c(1,nc, 2:(nc-1))]


Over <- Over[order(pvalues),]


Over$ID <- as.character(Over$ID)
Over$Description <- as.character(Over$Description)

row.names(Over) <- as.character(Over$ID)

x <- new("enrichResult",
         result         = Over,
         pvalueCutoff   = pvalueCutoff,
         pAdjustMethod  = pAdjustMethod,
         qvalueCutoff   = qvalueCutoff,
         gene           = as.character(gene),
         universe       = extID,
         geneSets       = geneSets,
         organism       = "UNKNOWN",
         keytype        = "UNKNOWN",
         ontology       = "UNKNOWN",
         readable       = FALSE
         )
return (x)

  

核心就是這個代碼了:

args.df <- data.frame(numWdrawn=k-1, ## White balls drawn
                  numW=M,        ## White balls
                  numB=N-M,      ## Black balls
                  numDrawn=n)    ## balls drawn


## calcute pvalues based on hypergeometric model
pvalues <- apply(args.df, 1, function(n)
             phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE)
             )

  

phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)

其中:

第一個參數:q
vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.

第二個參數:m
the number of white balls in the urn.

第三個參數:n
the number of black balls in the urn.

第四個參數:k
the number of balls drawn from the urn.

ID Description GeneRatio BgRatio pvalue p.adjust qvalue
GO:0008380 RNA splicing 68/854 364/23210 6.07E-29 2.70E-25 2.28E-25

關鍵的只有兩個數:

GeneRatio:我輸入的基因數854(n),其中在pathway A里的有68個(k)

BgRatio:總共背景(有注釋)基因數23210(N),其中pathway A里的基因數364個(M)

 

phyper(k-1, M, N-M, n, lower.tail = TRUE, log.p = FALSE)

 

帶入數字:

phyper(67, 364, 23210-364, 854, lower.tail = TRUE, log.p = FALSE) = 6.07163831922482e-29

結果一致。

 

進階:

  • lower.tail是什么
  • 為什么第一個參數要減1

 

參考:

超幾何分布檢驗(hypergeometric test)

https://github.com/YuLab-SMU/DOSE/blob/5be8e21c56242d58b5576c20412b3457bc61dae7/R/enricher_internal.R

不靠譜的富集軟件太多了!


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM