根據motif binding來確定target gene | HOMER | FIMO

根據motif binding來確定target gene | HOMER | FIMO | MEME

本文轉載自查看原文 2019-12-08 18:27 399 轉錄調控

主流的motif數據庫

JASPAR

dbcorrdb - SCENIC使用的

TRANSFAC® 7.0 Public 2005 and TRANSCompel 7.0 Public 2005 - 現在最新版要收費了

motif格式問題

我關注的這個motif (ENCSR000ARI) 是Ezh2的，並沒有被收錄在常規的TF數據庫里，所以Homer里面沒有。

dbcorrdb這個數據庫里有，但是很老的JASPAR格式，可以轉成meme格式。[dbcorrdb__EZH2__ENCSR000ARI_1__m5.png]

jaspar2meme -pfm -bundle ENCSR000ARI.all.sites > ENCSR000ARI.all.meme

可以用meme的Ceqlogo畫出logo

ceqlogo -i1 ENCSR000ARI.m5.meme -o logo.png -f PNG

然后meme的FIMO可以直接把motif比對到指定的fasta

fimo --alpha 1 --max-strand -oc target ENCSR000ARI.all.meme target.promt.TSS.fasta
fimo --alpha 1 --max-strand -oc random ENCSR000ARI.all.meme target.promt.TSS.random.fasta

這里就有個問題了，怎么從genome中提取出核心區域
因為promoter的定義不是很完善，這里我用的區域是TSS區 [-400, 100]

其實Homer里面都寫好了，但因為里面的perl過於復雜，最后還是決定用R來做（核心就是防止位置溢出）

library(GenomicFeatures)

# library(TxDb.Hsapiens.UCSC.hg19.knownGene)
# library(BSgenome.Hsapiens.UCSC.hg19)

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(BSgenome.Mmusculus.UCSC.mm10)

# PR <- promoters(TxDb.Mmusculus.UCSC.mm10.knownGene, upstream=2000, downstream=0)

# entrez geneID for a cell cycle control transcription
# factor, chr6 on the plus strand
# tmp.gene <- "429"  

# transcriptCoordsByGene.GRangesList <- transcriptsBy (TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene") [tmp.gene]
# a GrangesList of length one, describing three transcripts

# promoter.seqs <- getPromoterSeq (head(PR), Hsapiens, upstream=10, downstream=0)

trs <- transcriptsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")

# trs <- keepStandardChromosomes(trs)

prom <- getPromoterSeq(trs, Mmusculus, upstream = 500, downstream=100)

head(prom)

gene ID轉換問題

這里的list的name就是NCBI的genename，即Entrez ID（長見識了，同一個基因在不同物種中的entrez ID是不同的，所以你找的genecard上肯定是跟mouse的對不上）

以下是常見的geneID轉換的腳本，ensemble ID轉Entrez ID

mart <- useDataset("mmusculus_gene_ensembl", useMart("ensembl"))
att <- listAttributes(mart)

transcripts <- getBM(attributes=c("chromosome_name", "transcript_start", "transcript_end", 
                                  "entrezgene_id", "ensembl_gene_id","ensembl_transcript_id"),
                      filters = "biotype",
                      values  = c("protein_coding"),
                      mart    = mart)

tmp.names <- select(TxDb.Mmusculus.UCSC.mm10.knownGene, keys = names(prom), 
                    columns=c("TXNAME"), keytype="GENEID")

tmp.names$transcriptName <- sapply(strsplit(tmp.names$TXNAME,"\\."), function(x) x[1])

tmp.names$entrezID <- transcripts[tmp.names$transcriptName,]$entrezgene_id

最終畫出了這個圖：

成圖代碼

options(repr.plot.width=6, repr.plot.height=6)
g <- ggplot(motif.bind, aes(x=pos, y=-log10(p.value))) + 
    facet_wrap( ~ gene, ncol=1) +
    geom_point() +
    geom_vline(xintercept=0, linetype="dashed", color = "blue") +
    geom_linerange(aes(x=pos, ymax=-log10(p.value), ymin=0),
                 position = position_jitter(height = 0L, seed = 1L), size=0.1) +
    theme_bw() +
    theme(strip.background = element_blank(), strip.text = element_text(face="italic", size = 15, color = "black")) +
    theme(axis.text.x  = element_text(face="plain", angle=0, size = 8, color = "black", vjust=-1),
            axis.text.y  = element_text(face="plain", size = 10, color = "black"),
            axis.title =element_text(size = 15)) +
    theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
          panel.background = element_blank()) +
    theme(legend.title = element_blank()) +
    labs(x = "\nRelative distance to TSS (bp)",y = "- log10(motif binding P-value)\n", title = " ") +
    scale_y_continuous(limits = c(0, 7), expand = c(0, 0)) #+
    #scale_fill_manual(values = brewer.pal(8,"Set1")) 
g

其實要是Homer有對應的TF motif數據，可以做得更加高效。不用提fasta，不用id轉換，一行代碼搞定。

findMotifs.pl  genelist.txt mouse motifResults/ -start -400 -end 100 -len 8,10,12 -find test.motif > result.txt

Homer和meme的對比

Homer用perl寫得，一個腳本包含了很多功能，你想提取其中一個子功能很難 vs meme是C++的，執行效率更高，功能都拆分了【一鍵化用Homer；精細化快速化用meme】
Homer似乎不受待見，JASPAR里下載格式就沒有Homer格式的，反而有meme的

展望：

這里只考慮了TSS flanking region

有ChIP-seq和ATAC-seq peak的數據可以confirm這個denovo的結果

再加入Hi-C等三維基因組的數據就可以整合promoter和enhancer了。　　

下一專題：motif格式專題

了解一下數據分析的生物學基礎：

『珍藏版』Nature綜述 | 基因表達調控的新模型

Organization and regulation of gene transcription

全基因組（~20000個基因）批量掃描1541個motif

# download motif
wget http://www.jstacs.de/downloads/motifs_jaspar.txt

# change format in sublime, must have ',' at the end
# motif_(\d+)
# motif_$1 ,
# example
>CTCF_ENCSR000AKB-1_motif_1 ,
10474 6567 9974 12621 4732 79 32529 2514 7204 39703 37 18365 2206 427 1630 7094 19180 4523 6659 16417
11903 15167 9425 5070 36198 46507 1697 26396 19496 1126 22 46 1279 18 1817 33939 2120 23129 16325 11761
12421 7303 16734 21187 2705 14 5847 16524 3318 3237 46565 28151 26304 46054 39302 1153 23209 14621 4807 13486
11854 17615 10519 7774 3017 52 6579 1218 16634 2586 28 90 16863 153 3903 4466 2143 4379 18861 4988

# get wanted
# now run all motif: 1541

# to meme format
jaspar2meme -pfm -bundle motifs_jaspar.sublime.txt > jaspar.all.meme
# check
cat jaspar.all.meme  | grep 'MOTIF' | wc -l

# align to all promoter
fimo --alpha 1 --max-strand -oc all.TF.targets jaspar.all.meme all.hs.hg19.TSS.up400.down100..fasta

# submit job
echo "fimo --alpha 1 --max-strand -oc all.TF.targets.mm10 jaspar.all.meme all.Mm.mm10.TSS.up400.down100.fasta" | qsub -V -N Mm10_motif -q small_ext -l nodes=1:ppn=2,walltime=60:00:00,mem=10gb

數據庫問題：

上面的DBcorrDB數據庫里只有179個TF

想要更多可以查JASPAR數據庫

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。