數據來源:Cytosolic acetyl-CoA promotes histone acetylation predominantly at H3K27 in Arabidopsis;GSE79524
我只試做了轉錄組分析那一部分。簡單概括就是為了評估乙酰化對基因表達的影響,對野生型和突變體都做了轉錄組分析(基因差異表達分析和GO注釋)。
1. 數據獲取及質控
#1.腳本查看
$ cat dir_6.txt
SRR3286802
SRR3286803
SRR3286804
SRR3286805
SRR3286806
SRR3286807
$ cat 1.sh
dir=$(cut -f1 dir_6.txt)
for sample in $dir
do
prefetch $sample
done
$ cat 2.sh
for i in `seq 2 7`
do
fastq-dump --gzip --split-3 --defline-qual '+' --defline-seq '@$ac-$si/$ri' ~/ncbi/public/sra/SRR328680${i}.sra -O /ifs1/Grp3/huangsiyuan/learn_rnaseq/mRNA-seq/data/ &
done
#2.下載及解壓
sh 1.sh
sh 2.sh
#解壓之后是這樣的,可以看出是雙端測序
$ ls
1.sh SRR3286802_1.fastq.gz SRR3286803_2.fastq.gz SRR3286805_1.fastq.gz SRR3286806_2.fastq.gz
2.sh SRR3286802_2.fastq.gz SRR3286804_1.fastq.gz SRR3286805_2.fastq.gz SRR3286807_1.fastq.gz
dir_6.txt SRR3286803_1.fastq.gz SRR3286804_2.fastq.gz SRR3286806_1.fastq.gz SRR3286807_2.fastq.gz
#3.質量檢測:fastqc,multiqc
ls *fastq.gz | while read id
do
fastqc ${id} &
done
multiqc *fastqc*
#4.過濾接頭序列及低質量reads片段
##就這組數據來說,質量檢測的結果表明數據質量很好,因此這里省略的過濾這一步。
2. 下載gff/gtf注釋文件並提取出感興趣的基因/轉錄本區間
一個基因可能對應不同的轉錄本,不同的轉錄本可能對應不同的功能。
以擬南芥的gff注釋文件為例:
#提取基因
$ less Arabidopsis_thaliana.TAIR10.42.gff3 | awk '{ if($3=="gene") print $0 }' | cut -d ";" -f1 | head -n 5
1 araport11 gene 3631 5899 . + . ID=gene:AT1G01010
1 araport11 gene 6788 9130 . - . ID=gene:AT1G01020
1 araport11 gene 11649 13714 . - . ID=gene:AT1G01030
1 araport11 gene 23121 31227 . + . ID=gene:AT1G01040
1 araport11 gene 31170 33171 . - . ID=gene:AT1G01050
#提取轉錄本
$ less Arabidopsis_thaliana.TAIR10.42.gff3 | awk '{ if($3=="mRNA") print $0 }' | cut -d ";" -f1 | head -n 10
1 araport11 mRNA 3631 5899 . + . ID=transcript:AT1G01010.1
1 araport11 mRNA 6788 8737 . - . ID=transcript:AT1G01020.6
1 araport11 mRNA 6788 8737 . - . ID=transcript:AT1G01020.2
1 araport11 mRNA 6788 9130 . - . ID=transcript:AT1G01020.3
1 araport11 mRNA 6788 9130 . - . ID=transcript:AT1G01020.5
1 araport11 mRNA 6788 9130 . - . ID=transcript:AT1G01020.4
1 araport11 mRNA 6788 9130 . - . ID=transcript:AT1G01020.1
1 araport11 mRNA 11649 13714 . - . ID=transcript:AT1G01030.1
1 araport11 mRNA 11649 13714 . - . ID=transcript:AT1G01030.2
1 araport11 mRNA 23121 31227 . + . ID=transcript:AT1G01040.1
從ID可以看出AT1G01020基因有6個轉錄本。這篇文獻比較的是基因層面的表達差異,所以這里我提取的是基因gff,算上線粒體和葉綠體上的基因一共27655個。
$ less Arabidopsis_thaliana.TAIR10.42.gff3 | awk '{ if($3=="gene") print $0 }' > gene27655.gff
3. 將reads比對到參考基因組
3.1 為參考基因組建立索引
nohup ~/mysoft/hisat2-2.1.0/hisat2-build Arabidopsis_thaliana.TAIR10.dna.toplevel.fa Arabidopsis_thaliana &
3.2 比對
$ cat 3.sh
for i in `seq 2 7`
do
hisat2 -x ~/learn_rnaseq/mRNA-seq/ref/Arabidopsis_thaliana -p 8 \
-1 ~/learn_rnaseq/mRNA-seq/data/SRR328680${i}_1.fastq.gz \
-2 ~/learn_rnaseq/mRNA-seq/data/SRR328680${i}_2.fastq.gz \
-S ~/learn_rnaseq/mRNA-seq/map/SRR328680${i}.sam
done
$ nohup sh 3.sh &
從報告文件來看比對率都挺高的,97%以上。
3.3 sam轉bam, 並排序
$ cat 1.sh
for i in `seq 2 7`
do
/ifs1/Software/biosoft/samtools-1.9/samtools view -@ 8 -Sb SRR328680${i}.sam > SRR328680${i}.bam
/ifs1/Software/biosoft/samtools-1.9/samtools sort -@ 8 -n SRR328680${i}.bam > SRR328680${i}.n.bam
done
$ nohup sh 1.sh &
4. 計算表達量
Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ...
nohup ~/mysoft/subread-1.6.3-source/bin/featureCounts -F GTF -t gene -g gene_id -T 4 -a ~/learn_rnaseq/mRNA-seq/ref_ht/gene27655.gff -o ~/learn_rnaseq/mRNA-seq/count/test ~/learn_rnaseq/mRNA-seq/map/SRR328680*.n.bam &
#上面這一步運行完之后會多出test.summary,test這兩個文件
nohup ~/mysoft/subread-1.6.3-source/bin/featureCounts -F GTF -t gene -g gene_id -T 4 -p -a ~/learn_rnaseq/mRNA-seq/ref_ht/gene27655.gff -o ~/learn_rnaseq/mRNA-seq/count/test2 ~/learn_rnaseq/mRNA-seq/map/SRR328680*.n.bam 2> log2 &
#上面這行只多了-p選項,為了看看在雙端測序中統計fragments和reads有什么區別。運行完了多出test2.summary,test2兩個文件。
從test2和text兩個矩陣文件中,可以看出加了p之后求的是fragments的計數,數值大概是reads計數的1/2,這很好理解,因為fragment的定義中就是包含了一對reads的。
$ lsx test2 |head -n 20 | tail -n 5
AT1G01140 1 64166 67774 - 3609 2309 2073 2323 1522 1742 1254
AT1G01150 1 69911 72138 - 2228 3 0 0 1 2 4
AT1G01160 1 72339 74096 + 1758 766 730 857 745 819 747
AT1G01170 1 73931 74737 - 807 1796 1758 1748 1061 1540 1340
AT1G01180 1 75390 76845 + 1456 374 315 311 358 284 312
$ lsx test |head -n 20 | tail -n 5
AT1G01140 1 64166 67774 - 3609 4587 4118 4613 3025 3469 2484
AT1G01150 1 69911 72138 - 2228 6 0 0 2 4 8
AT1G01160 1 72339 74096 + 1758 1449 1383 1637 1412 1538 1409
AT1G01170 1 73931 74737 - 807 3220 3121 3169 1914 2771 2369
AT1G01180 1 75390 76845 + 1456 745 619 619 708 558 621
將test2文件中的這7列提取出來即為表達矩陣。
grep "^AT" test2 | cut -f1,7,8,9,10,11,12 > 6sample_count_matrix.txt
5. 差異表達分析
使用DESeq2 包。
a <- as.matrix(read.table("6sample_count_matrix.txt",sep="\t",header = T,row.names=1))
condition <- factor(c(rep("WT",3),rep("acc1.5",3)), levels = c("WT","acc1.5"))
colData <- data.frame(row.names=colnames(a), condition)
#查看一下
> colData
condition
WT1 WT
WT2 WT
WT3 WT
acc1.51 acc1.5
acc1.52 acc1.5
acc1.53 acc1.5
#確保
> all(rownames(colData) == colnames(a))
[1] TRUE
dds <- DESeqDataSetFromMatrix(a, colData, design= ~ condition)
dds <- DESeq(dds)
#運行結束會報告
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
#得到初步結果並查看
res <- results(dds, contrast=c("condition", "acc1.5", "WT"))
##log2(fold change)也是一個衡量差異顯著性的指標
resLFC <- lfcShrink(dds, coef="condition_acc1.5_vs_WT", type="apeglm")
resLFC
##按照p值由小到大排列
resOrdered <- res[order(res$pvalue),]
resOrdered
##矯正p值小於0.001的有多少個差異基因
sum(res$padj < 0.001, na.rm=TRUE)
##畫個MA-plot看看大體趨勢
plotMA(res, ylim=c(-2,2))
plotMA(resLFC, ylim=c(-2,2))
#按照自定義的閾值提取差異基因並導出
diff_gene_deseq2 <-subset(res, padj < 0.001 & abs(log2FoldChange) > 1)
write.csv(diff_gene_deseq2,file= "DEG_acc1.5_vs_WT.csv")
6. 簡單的GO注釋
首選clusterProfiler包。
library(”clusterProfiler“)
library("org.At.tair.db")
e <- read.table("DEG_acc1.5_vs_WT.csv", header = T,sep = ",")
f <- e[,1]
#轉換ID並將ENTREZID編號作為后面的識別ID
ids <- bitr(f, fromType="TAIR", toType=c("SYMBOL","ENTREZID"), OrgDb="org.At.tair.db")
f <- ids[,3]
按照GO系統給基因分類
ggo <- groupGO(gene = f,
OrgDb = org.At.tair.db,
ont = "CC",
level = 3,
readable = TRUE)
> head(ggo)
ID Description Count GeneRatio
GO:0005886 GO:0005886 plasma membrane 237 237/718
GO:0005628 GO:0005628 prospore membrane 0 0/718
GO:0005789 GO:0005789 endoplasmic reticulum membrane 3 3/718
GO:0019867 GO:0019867 outer membrane 6 6/718
GO:0031090 GO:0031090 organelle membrane 63 63/718
GO:0034357 GO:0034357 photosynthetic membrane 23 23/718
富集分析
ego2 <- enrichGO(gene = ids$TAIR,
OrgDb = org.At.tair.db,
keyType = 'TAIR',
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05,
readable = TRUE)
> head(ego2)
ID Description GeneRatio BgRatio pvalue p.adjust qvalue
GO:0009505 GO:0009505 plant-type cell wall 20/654 274/24998 3.989815e-05 0.002907596 0.002713546
GO:0048046 GO:0048046 apoplast 22/654 328/24998 5.995043e-05 0.002907596 0.002713546
geneID
GO:0009505 ATBXL2/ATPMEPCRA/LRX1/AtWAK1/ATPME2/GLL22/LRX2/SS3/ATPAP1/ATC4H/CASP1/BGLU15/RHS12/BGAL1/ATGRP-5/ATPCB/EARLI1/ATPGIP2/FLA13/PME5
GO:0048046 iPGAM1/ATDHAR1/AOAT1/ANN1/ATPEPC1/GDPDL1/CLE12/AGT/ENO2/GOX1/ATPCB/AtBG2/CRK9/CORI3/PMDH2/BETA/UDG4/ATSBT4.12/LTP3/AtPRX71/ATBXL4/ANNAT2
Count
GO:0009505 20
GO:0048046 22
GO基因集富集分析
geneList = 2^e[,3]
names(geneList)=as.character(ids[,3])
geneList = sort(geneList, decreasing = TRUE)
ego3 <- gseGO(geneList = geneList,
OrgDb = org.At.tair.db,
ont = "CC",
nPerm = 1000,
minGSSize = 100,
maxGSSize = 500,
pvalueCutoff = 0.05,
verbose = FALSE)
head(ego3)
enrichGO和gseGO這兩個都用得比較多。
可視化
barplot(ggo, drop=TRUE, showCategory=12)
dotplot(ego2)
自己的分析略顯簡單,沒有過多的細化,后面隨着學習的深入再來補充。注:第2步提取gene_feature這一步沒有必要,因為軟件可以自動識別feature。——2019.8.10
reference
Statistical analysis and visualization of functional profiles for genes and gene clusters:
http://www.bioconductor.org/packages/release/bioc/vignettes/clusterProfiler/inst/doc/clusterProfiler.html
Analyzing RNA-seq data with DESeq2:
http://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
因水平有限,有錯誤的地方,歡迎批評指正!