富集分析非常常見,用於判斷抽樣的結果是否顯著。
例子1:一個工廠總共有N件產品,其中M件次品,現在從中抽取n件做檢查,抽到k件次品的概率分布服從超幾何分布。
例子2:一個細胞有N個基因,其中在pathway A里面有M個基因,現在從中抽取n個基因,抽到k個pathway A里基因的概率分布服從超幾何分布。
最靠譜的富集分析當屬clusterProfiler,里面的enrichGO可以做富集分析。
現在我有個性化的需求,所以要自己做,想借鑒一下里面的代碼。
查看enrichGO代碼,發現是里面的enricher_internal在做這件事,去GitHub查,發現enricher_internal是DOSE的函數,繼續去GitHub查,定位到enricher_internal。
代碼一覽無余:
termID2ExtID <- termID2ExtID[idx] qTermID2ExtID <- qTermID2ExtID[idx] qTermID <- unique(names(qTermID2ExtID)) ## prepare parameter for hypergeometric test k <- sapply(qTermID2ExtID, length) k <- k[qTermID] M <- sapply(termID2ExtID, length) M <- M[qTermID] N <- rep(length(extID), length(M)) ## n <- rep(length(gene), length(M)) ## those genes that have no annotation should drop. n <- rep(length(qExtID2TermID), length(M)) args.df <- data.frame(numWdrawn=k-1, ## White balls drawn numW=M, ## White balls numB=N-M, ## Black balls numDrawn=n) ## balls drawn ## calcute pvalues based on hypergeometric model pvalues <- apply(args.df, 1, function(n) phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE) ) ## gene ratio and background ratio GeneRatio <- apply(data.frame(a=k, b=n), 1, function(x) paste(x[1], "/", x[2], sep="", collapse="") ) BgRatio <- apply(data.frame(a=M, b=N), 1, function(x) paste(x[1], "/", x[2], sep="", collapse="") ) Over <- data.frame(ID = as.character(qTermID), GeneRatio = GeneRatio, BgRatio = BgRatio, pvalue = pvalues, stringsAsFactors = FALSE) p.adj <- p.adjust(Over$pvalue, method=pAdjustMethod) qobj <- tryCatch(qvalue(p=Over$pvalue, lambda=0.05, pi0.method="bootstrap"), error=function(e) NULL) if (class(qobj) == "qvalue") { qvalues <- qobj$qvalues } else { qvalues <- NA } geneID <- sapply(qTermID2ExtID, function(i) paste(i, collapse="/")) geneID <- geneID[qTermID] Over <- data.frame(Over, p.adjust = p.adj, qvalue = qvalues, geneID = geneID, Count = k, stringsAsFactors = FALSE) Description <- TERM2NAME(qTermID, USER_DATA) if (length(qTermID) != length(Description)) { idx <- qTermID %in% names(Description) Over <- Over[idx,] } Over$Description <- Description nc <- ncol(Over) Over <- Over[, c(1,nc, 2:(nc-1))] Over <- Over[order(pvalues),] Over$ID <- as.character(Over$ID) Over$Description <- as.character(Over$Description) row.names(Over) <- as.character(Over$ID) x <- new("enrichResult", result = Over, pvalueCutoff = pvalueCutoff, pAdjustMethod = pAdjustMethod, qvalueCutoff = qvalueCutoff, gene = as.character(gene), universe = extID, geneSets = geneSets, organism = "UNKNOWN", keytype = "UNKNOWN", ontology = "UNKNOWN", readable = FALSE ) return (x)
核心就是這個代碼了:
args.df <- data.frame(numWdrawn=k-1, ## White balls drawn numW=M, ## White balls numB=N-M, ## Black balls numDrawn=n) ## balls drawn ## calcute pvalues based on hypergeometric model pvalues <- apply(args.df, 1, function(n) phyper(n[1], n[2], n[3], n[4], lower.tail=FALSE) )
phyper(q, m, n, k, lower.tail = TRUE, log.p = FALSE)
其中:
第一個參數:q
vector of quantiles representing the number of white balls drawn without replacement from an urn which contains both black and white balls.
第二個參數:m
the number of white balls in the urn.
第三個參數:n
the number of black balls in the urn.
第四個參數:k
the number of balls drawn from the urn.
ID Description GeneRatio BgRatio pvalue p.adjust qvalue
GO:0008380 RNA splicing 68/854 364/23210 6.07E-29 2.70E-25 2.28E-25
關鍵的只有兩個數:
GeneRatio:我輸入的基因數854(n),其中在pathway A里的有68個(k)
BgRatio:總共背景(有注釋)基因數23210(N),其中pathway A里的基因數364個(M)
phyper(k-1, M, N-M, n, lower.tail = TRUE, log.p = FALSE)
帶入數字:
phyper(67, 364, 23210-364, 854, lower.tail = TRUE, log.p = FALSE) = 6.07163831922482e-29
結果一致。
進階:
- lower.tail是什么
- 為什么第一個參數要減1
參考: