DiffBind是基於peak的差異分析包,peaks由其他peak caller軟件生成,如MACS2、HOMER.一個peak可能表示一個染色質開放區域、蛋白質結合位點等。call出來的peak是包含它的染色體、開始和結束位置信息的,然后可以通過bam文件根絕位置信息獲取在peak上的read數量。進一步通過DESeq2或edgeR進行核心的差異分析。
官方文檔:http://bioconductor.org/packages/release/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf
本文是根據官方文檔所寫的一點筆記。詳細內容請移步官網文檔。
DiffBind安裝
在R里使用以下命令安裝
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("DiffBind")
要是報錯的話,就去搜一下報錯信息,可能R版本不對,或者沒權限等問題。
基本使用
初始輸入信息
DiffBind輸入的文件需要一個sample sheet
,可以從預先填寫好的csv文件中讀取,或者構建一個dataframe
。
完整的字段包括這些
不過構建csv文件時,也不需要全部都填上。SampleID,bamReads(bam文件地址),Peaks(peak文件地址)填好,其他的根據實際情況填寫或者默認就好。
在ATAC-Seq數據中沒有control
,controlID
和bamControl
刪去就好。PeakCaller
和PeakFormat
用一個就好,支持的格式有
下面以這個sample.test.csv
為例做演示:
1 SampleID,Factor,Replicate,bamReads,Peaks,PeakCaller,Tissue
2 tt.e11.5.Rep1,tt.e11.5,1,/home/user/project/atac_seq/bam/tt.e11.5.rep1.bt2.mm10.sort.filter.shift.bam,/home/user/project/atac_seq/callPeak/tt.e11.5.rep1_peaks.narrowPeak,narrow,tt
3 tt.e11.5.Rep2,tt.e11.5,2,/home/user/project/atac_seq/bam/tt.e11.5.rep2.bt2.mm10.sort.filter.shift.bam,/home/user/project/atac_seq/callPeak/tt.e11.5.rep2_peaks.narrowPeak,narrow,tt
4 tt.e15.5.Rep1,tt.e15.5,1,/home/user/project/atac_seq/bam/tt.e15.5.rep1.bt2.mm10.sort.filter.shift.bam,/home/user/project/atac_seq/callPeak/tt.e15.5.rep1_peaks.narrowPeak,narrow,tt
5 tt.e15.5.Rep2,tt.e15.5,2,/home/user/project/atac_seq/bam/tt.e15.5.rep2.bt2.mm10.sort.filter.shift.bam,/home/user/project/atac_seq/callPeak/tt.e15.5.rep2_peaks.narrowPeak,narrow,tt
Reading in the peaksets:
> dbObj <- dba(sampleSheet="./sample.test.csv")
Counting reads:
> dbObj <- dba.count(dbObj , bUseSummarizeOverlaps=TRUE)
額,作者說參數bUseSummarizeOverlaps
用更加標准的方法來counting。
bUseSummarizeOverlaps
logical indicating that summarizeOverlaps should be used for counting instead of the built-in counting code.
This option is slower but uses the more standard counting function. If TRUE, all read files must be BAM (.bam extension),
with associated index files (.bam.bai extension). The insertLength parameter must absent.
See notes for when the bRemoveDuplicates parameter is set to TRUE, where the built-in counting code
may not correctly handle certain cases and bUseSummarizeOverlaps should be set to TRUE.
Establishing a contrast:
這是建立一個分組,指定哪些樣本是一組的。
可以使用參數categories
自動構建分組,DBA_FACTOR是DiffBind中定義的常量,類似的常量都是以"DBA_"開頭,如DBA_ID
, DBA_TISSUE
,DBA_CONDITION
minMembers
是每組最小成員數目。
## 將根據factor分為兩組
> dbObj <- dba.contrast(dbObj, categories=DBA_FACTOR, minMembers = 2)
> dbObj
4 Samples, 28477 sites in matrix:
ID Tissue Factor Replicate Caller Intervals FRiP
1 tt.e11.5.Rep1 tt tt.e11.5 1 counts 28477 0.11
2 tt.e11.5.Rep2 tt tt.e11.5 2 counts 28477 0.12
3 tt.e15.5.Rep1 tt tt.e15.5 1 counts 28477 0.19
4 tt.e15.5.Rep2 tt tt.e15.5 2 counts 28477 0.12
1 Contrast:
Group1 Members1 Group2 Members2
1 tt.e11.5 2 tt.e15.5 2
Performing the differential analysis
dbObj <- dba.analyze(dbObj, method = DBA_DESEQ2)
默認使用DESeq2進行計算,可以選擇DBA_EDGER(edgR),或者兩個都要DBA_ALL_METHODS
韋恩圖繪制
> dba.plotVenn(dbObj, mask=dbObj$masks$nt)
mask是用於選擇哪些樣本用於繪圖。
## 內置的一些mask
> dbObj$masks
$nt
nt.e11.5.Rep1 nt.e11.5.Rep2 nt.e15.5.Rep1 nt.e15.5.Rep2
TRUE TRUE TRUE TRUE
$nt.e11.5
nt.e11.5.Rep1 nt.e11.5.Rep2 nt.e15.5.Rep1 nt.e15.5.Rep2
TRUE TRUE FALSE FALSE
$nt.e15.5
nt.e11.5.Rep1 nt.e11.5.Rep2 nt.e15.5.Rep1 nt.e15.5.Rep2
FALSE FALSE TRUE TRUE
......
> dba.plotVenn(dbObj, mask=dbObj$masks$Replicate.2)
保存結果
write.csv(dbObj.report, file="~/tmp/test.csv")