一個全基因組重測序分析實戰


 

這里選取的是 GATK best practice 是目前認可度最高的全基因組重測序分析流程,尤其適用於 人類研究。

PS:其實本文應該屬於直播我的基因組系列,有兩個原因把它單獨拿出來,

  1. 首先,直播我的基因組閱讀量太低了,可能是大家覺得錯過了前面的,后面的看起來沒有必要,這里我可以肯定的告訴大家,這一講是獨立的,而且是全流程,你學好了這個,整個直播我的基因組就可以不用看了。

  2. 其次,最近有一些朋友寫了一些GATK的教程,但是大多不合我意,作為回應,我也寫一個,秀出我的教程風格。

流程介紹

bwa(MEM alignment)

picard(SortSam)

picard(MarkDuplicates)

picard(FixMateInfo)

GATK(RealignerTargetCreator)

GATK(IndelRealigner)

GATK(BaseRecalibrator)

GATK(PrintReads)

GATK(HaplotypeCaller)

GATK(GenotypeGVCFs)

在本文,我將會把我的 全基因組重測序數據走完上面所有的流程,並給出代碼和時間消耗情況。

准備工作

首先是軟件安裝

  1. ## Download and install BWA

  2. cd ~/biosoft

  3. mkdir bwa &&  cd bwa

  4. #http://sourceforge.net/projects/bio-bwa/files/

  5. wget https://sourceforge.net/projects/bio-bwa/files/bwa-0.7.15.tar.bz2

  6. tar xvfj bwa-0.7.15.tar.bz2 # x extracts, v is verbose (details of what it is doing), f skips prompting for each individual file, and j tells it to unzip .bz2 files

  7. cd bwa-0.7.15

  8. make

  9.  

  10. ## Download and install samtools

  11. ## http://samtools.sourceforge.net/

  12. ## http://www.htslib.org/doc/samtools.html

  13. cd ~/biosoft

  14. mkdir samtools &&  cd samtools

  15. wget https://github.com/samtools/samtools/releases/download/1.3.1/samtools-1.3.1.tar.bz2

  16. tar xvfj samtools-1.3.1.tar.bz2

  17. cd samtools-1.3.1

  18. ./configure --prefix=/home/jianmingzeng/biosoft/myBin

  19. make

  20. make install

  21. ~/biosoft/myBin/bin/samtools --help

  22. ~/biosoft/myBin/bin/plot-bamstats --help

  23. cd htslib-1.3.1

  24. ./configure --prefix=/home/jianmingzeng/biosoft/myBin

  25. make

  26. make install

  27. ~/biosoft/myBin/bin/tabix

  28.  

  29.  

  30. ## Download and install picardtools

  31. ## https://sourceforge.net/projects/picard/

  32. ## https://github.com/broadinstitute/picard

  33. cd ~/biosoft

  34. mkdir picardtools &&  cd picardtools

  35. wget http://ncu.dl.sourceforge.net/project/picard/picard-tools/1.119/picard-tools-1.119.zip

  36. unzip picard-tools-1.119.zip

  37. mkdir 2.9.2 && cd 2.9.2

  38. wget https://github.com/broadinstitute/picard/releases/download/2.9.2/picard.jar

  39.  

  40. ## GATK 需要自行申請下載,不能公開

其次是必備數據的下載:

  1. cd ~/reference

  2. mkdir -p  genome/human_g1k_v37  && cd genome/human_g1k_v37

  3. # http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

  4. nohup wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz  &

  5. gunzip human_g1k_v37.fasta.gz

  6. wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai

  7. wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/README.human_g1k_v37.fasta.txt

  8. java -jar ~/biosoft/picardtools/picard-tools-1.119/CreateSequenceDictionary.jar R=human_g1k_v37.fasta O=human_g1k_v37.dict

  9.  

  10. cd ~/reference

  11. mkdir -p index/bwa && cd index/bwa   ~/reference/index/bwa/human_g1k_v37  ~/reference/genome/human_g1k_v37/human_g1k_v37.fasta 1>human_g1k_v37.bwa_index.log 2>&1   &

  12.  

  13. mkdir -p ~/biosoft/GATK/resources/bundle/b37

  14. cd ~/biosoft/GATK/resources/bundle/b37

  15. wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.gz

  16. wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/1000G_phase1.indels.b37.vcf.idx.gz

  17. wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz

  18. wget ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz

  19. gunzip 1000G_phase1.indels.b37.vcf.idx.gz

  20. gunzip 1000G_phase1.indels.b37.vcf.gz

  21. gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.gz

  22. gunzip Mills_and_1000G_gold_standard.indels.b37.vcf.idx.gz

  23.  

  24. mkdir -p ~/annotation/variation/human/dbSNP

  25. cd ~/annotation/variation/human/dbSNP

  26. ## https://www.ncbi.nlm.nih.gov/projects/SNP/

  27. ## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh38p2/

  28. ## ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/

  29. nohup wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160601.vcf.gz &

  30. wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b147_GRCh37p13/VCF/All_20160601.vcf.gz.tbi

只有當軟件安裝完畢,還有參考基因組等必備文件准備齊全了,才能正式進入全基因組重測序分析流程!

全基因組重測序數據介紹

上面是我的全基因組數據fastq文件的截圖,測序分成了5條lane,每條lane的數據量不一致。

數據分析

fastq2bam

首先給出代碼:

  1. module load java/1.8.0_91

  2. GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

  3. INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37

  4. GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar

  5. PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar

  6. DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz

  7. SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz

  8. INDEL=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf.gz

  9. TMPDIR=/home/jianmingzeng/tmp/software

  10. ## samtools and bwa are in the environment

  11. ## samtools Version: 1.3.1 (using htslib 1.3.1)

  12. ## bwa Version: 0.7.15-r1140

  13.  

  14. : '

  15. '

  16. ## please keep the confige in three columns format, which are fq1 fq2 sampe

  17. cat $1 |while read id

  18. do

  19.    arr=($id)

  20.    fq1=${arr[0]}

  21.    fq2=${arr[1]}

  22.    sample=${arr[2]}

  23.    #####################################################

  24.    ################ Step 1 : Alignment #################

  25.    #####################################################

  26.    echo bwa `date`

  27.    bwa mem -t 5 -R "@RG\tID:$sample\tSM:$sample\tLB:WGS\tPL:Illumina" $INDEX $fq1 $fq2 > $sample.sam

  28.    echo bwa `date`

  29.    #####################################################

  30.    ################ Step 2: Sort and Index #############

  31.    #####################################################

  32.    echo SortSam `date`

  33.    java -Djava.io.tmpdir=$TMPDIR    -Xmx40g -jar $PICARD SortSam SORT_ORDER=coordinate INPUT=$sample.sam OUTPUT=$sample.bam

  34.    samtools index $sample.bam

  35.    echo SortSam `date`

  36.  

  37.    #####################################################

  38.    ################ Step 3: Basic Statistics ###########

  39.    #####################################################

  40.    echo stats `date`

  41.    samtools flagstat $sample.bam > ${sample}.alignment.flagstat

  42.    samtools stats  $sample.bam > ${sample}.alignment.stat

  43.    echo plot-bamstats -p ${sample}_QC  ${sample}.alignment.stat

  44.    echo stats `date`

  45.  

  46.    #####################################################

  47.    ####### Step 4: multiple filtering for bam files ####

  48.    #####################################################

  49.  

  50.    ###MarkDuplicates###

  51.    echo MarkDuplicates `date`

  52.    java -Djava.io.tmpdir=$TMPDIR    -Xmx40g -jar $PICARD MarkDuplicates \

  53.    INPUT=$sample.bam OUTPUT=${sample}_marked.bam METRICS_FILE=$sample.metrics  

  54.    echo MarkDuplicates `date`

  55.  

  56.  

  57.    ###FixMateInfo###

  58.    echo FixMateInfo `date`

  59.    java -Djava.io.tmpdir=$TMPDIR    -Xmx40g -jar $PICARD FixMateInformation \

  60.    INPUT=${sample}_marked.bam OUTPUT=${sample}_marked_fixed.bam SO=coordinate  

  61.    samtools index ${sample}_marked_fixed.bam

  62.    echo FixMateInfo `date`

  63.  

  64.    echo ${sample}_marked_fixed.bam >>files.bamlist

  65.  

  66.    rm $sample.sam $sample.bam ${sample}_marked.bam

  67. done

  68. samtools merge -@ 5  -b files.bamlist  merged.bam

  69. samtools index merged.bam

上面的代碼有一點長,希望大家能用心的來理解,其實就是一個批量處理,對5條lane的測序數據循環處理,其實正式流程里面我一般是並行的,而不是循環,這里是為了給大家秀一下時間消耗情況,讓大家對全基因組重測序分析有一個感性的認知。

時間消耗如下:

對L1來說,時間消耗如下:

  1. [main] Real time: 15870.794 sec; CPU: 77463.156 sec

  2. picard.sam.SortSam done. Elapsed time: 45.60 minutes.

  3. picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 64.20 minutes.

  4. picard.sam.FixMateInformation done. Elapsed time: 58.05 minutes.

總共耗時約7.2小時,僅僅是對10G的fastq完成比對壓縮排序去PCR重復。

如果是其它文件大小的fastq輸入數據,那么這個流程耗時如下:

  1. [main] Real time: 9527.240 sec; CPU: 47758.233 sec

  2. [main] Real time: 16000.325 sec; CPU: 80595.629 sec

  3. [main] Real time: 29286.523 sec; CPU: 147524.841 sec

  4. [main] Real time: 28104.568 sec; CPU: 141519.377 sec

  5.  

  6. picard.sam.SortSam done. Elapsed time: 29.02 minutes.

  7. picard.sam.SortSam done. Elapsed time: 61.26 minutes.

  8. picard.sam.SortSam done. Elapsed time: 98.39 minutes.

  9. picard.sam.SortSam done. Elapsed time: 117.16 minutes.

  10.  

  11. picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 35.52 minutes.

  12. picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 54.41 minutes.

  13. picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 90.40 minutes.

  14. picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 93.03 minutes.

  15.  

  16. picard.sam.FixMateInformation done. Elapsed time: 35.92 minutes.

  17. picard.sam.FixMateInformation done. Elapsed time: 66.31 minutes.

  18. picard.sam.FixMateInformation done. Elapsed time: 131.65 minutes.

  19. picard.sam.FixMateInformation done. Elapsed time: 122.31 minutes.

前面我們說過,這5條lane的數據其實是可以並行完成這幾個步驟的,最長耗時約12小時。 每個數據處理我都分配了 5個線程, 40G的內存

GATK重新處理bam文件

主要是針對上一個步驟合並了5個lane之后的 merge.bam文件

  1. -rw-rw-r-- 1 jianmingzeng jianmingzeng  57G Jun  7 11:32 merged.bam

  2. -rw-rw-r-- 1 jianmingzeng jianmingzeng 8.4M Jun  7 12:05 merged.bam.bai

代碼是:

  1. module load java/1.8.0_91

  2. GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

  3. INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37

  4. GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar

  5. PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar

  6. DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz

  7.  

  8. KG_SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz

  9. Mills_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf

  10. KG_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf

  11.  

  12. TMPDIR=/home/jianmingzeng/tmp/software

  13. ## samtools and bwa are in the environment

  14. ## samtools Version: 1.3.1 (using htslib 1.3.1)

  15. ## bwa Version: 0.7.15-r1140

  16.  

  17. sample='merge'

  18.  

  19. ###RealignerTargetCreator###

  20. echo RealignerTargetCreator `date`

  21. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T RealignerTargetCreator \

  22. -I ${sample}.bam -R $GENOME -o ${sample}_target.intervals \

  23. -known $Mills_indels -known $KG_indels -nt 5

  24. echo RealignerTargetCreator `date`

  25.  

  26.  

  27. ###IndelRealigner###

  28. echo IndelRealigner `date`

  29. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T IndelRealigner \

  30. -I ${sample}.bam -R $GENOME -targetIntervals ${sample}_target.intervals \

  31. -o ${sample}_realigned.bam -known $Mills_indels -known $KG_indels

  32. echo IndelRealigner `date`

  33.  

  34.  

  35. ###BaseRecalibrator###

  36. echo BaseRecalibrator `date`

  37. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T BaseRecalibrator \

  38. -I ${sample}_realigned.bam -R $GENOME -o ${sample}_temp.table -knownSites $DBSNP

  39. echo BaseRecalibrator `date`

  40.  

  41.  

  42. ###PrintReads###

  43. echo PrintReads `date`

  44. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T PrintReads \

  45. -R $GENOME -I ${sample}_realigned.bam -o ${sample}_recal.bam -BQSR ${sample}_temp.table

  46. samtools index ${sample}_recal.bam

  47. echo PrintReads `date`

  48.  

  49. ###delete_intermediate_files###

對L1樣本來說,時間消耗如下:

  1. INFO  15:50:24,097 ProgressMeter - Total runtime 1165.34 secs, 19.42 min, 0.32 hours

  2. INFO  17:21:00,917 ProgressMeter - Total runtime 4265.44 secs, 71.09 min, 1.18 hours

  3. INFO  19:58:23,969 ProgressMeter - Total runtime 9436.69 secs, 157.28 min, 2.62 hours

  4. INFO  23:41:00,540 ProgressMeter - Total runtime 13349.77 secs, 222.50 min, 3.71 hours

可以看到最耗費時間的步驟是最后一個 PrintReads

如果是對5條lane合並的merged.bam來說,消耗時間如下:

  1. INFO  17:54:59,396 ProgressMeter - Total runtime 5194.10 secs, 86.57 min, 1.44 hours

  2. INFO  02:04:10,907 ProgressMeter - Total runtime 22558.84 secs, 375.98 min, 6.27 hours

  3. ···························

  4. ···························

可以看到時間消耗與輸入的bam文件大小有關,merged.bam是L1.bam的6倍大小,當然,時間上並沒有成正比。總之,對這個全基因組數據來說,時間消耗太誇張了,以至於我寫完這篇文章這GATK的4個bam操作還沒跑完。對L1需要約8小時,那么對merge.bam應該是需要40個小時。

variant calling by gatk hc

代碼是:

  1. module load java/1.8.0_91

  2. GENOME=/home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta

  3. INDEX=/home/jianmingzeng/reference/index/bwa/human_g1k_v37

  4. GATK=/home/jianmingzeng/biosoft/GATK/GenomeAnalysisTK.jar

  5. PICARD=/home/jianmingzeng/biosoft/picardtools/2.9.2/picard.jar

  6. DBSNP=/home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz

  7.  

  8. KG_SNP=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf.gz

  9. Mills_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf

  10. KG_indels=/home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf

  11.  

  12. TMPDIR=/home/jianmingzeng/tmp/software

  13. ## samtools and bwa are in the environment

  14. ## samtools Version: 1.3.1 (using htslib 1.3.1)

  15. ## bwa Version: 0.7.15-r1140

  16.  

  17. fq1=P_jmzeng_DHG09057_AH33KVALXX_L1_1.clean.fq.gz

  18. fq2=P_jmzeng_DHG09057_AH33KVALXX_L1_2.clean.fq.gz

  19. sample='merge'

  20.  

  21. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T HaplotypeCaller  \

  22. -R $GENOME -I ${sample}_recal.bam --dbsnp $DBSNP  \

  23. -stand_emit_conf 10 -o  ${sample}_recal_raw.snps.indels.vcf

  24.  

  25. java -Djava.io.tmpdir=$TMPDIR   -Xmx40g -jar $GATK -T HaplotypeCaller  \

  26. -R $GENOME -I ${sample}_realigned.bam --dbsnp $DBSNP  \

  27. -stand_emit_conf 10 -o  ${sample}_realigned_raw.snps.indels.vcf

時間消耗如下:

  1. INFO  20:40:49,063 ProgressMeter - Total runtime 39243.88 secs, 654.06 min, 10.90 hours

  2. INFO  08:53:17,633 ProgressMeter - Total runtime 43939.69 secs, 732.33 min, 12.21 hours

可以看到對 recal.bam的處理比 recal.bam時間上要少2個小時,但是時間均消耗很長。

全部流程走完輸出的文件如下(僅顯示L1的流程數據):

流程探究

如果只給代碼,那么這個教程意義不大,如果給出了input和output,還給出了時間消耗情況,那么這個教程可以說是中上水平了,讀者只需要拿到數據就可以自己重復出來,既能估算硬件配置又能對大致的時間消耗有所了解。

但,這仍然不夠,對我來說,我還可以介紹為什么要走每一個流程,以及每一個流程到底做了什么。可以這么說,你看完下面的流程探究,基本上就相當於你自己做過了一個全基因組重測序分析實戰

我這里就對 L1樣本進行解密:

首先的BWA

這個沒什么好說的,基因組數據的比對首選,耗時取決於fastq文件的reads數目。

  1. CMD: bwa mem -t 5 -R @RG\tID:L1\tSM:L1\tLB:WGS\tPL:Illumina /home/jianmingzeng/reference/index/bwa/human_g1k_v37 P_jmzeng_DHG09057_AH33KVALXX_L1_1.clean.fq.gz P_jmzeng_DHG09057_AH33KVALXX_L1_2.clean.fq.gz

  2. [main] Real time: 15870.794 sec; CPU: 77463.156 sec

接下來是PICARD

共3個步驟用到了這個軟件,消耗時間及內存分別如下:

  1. picard.sam.SortSam done. Elapsed time: 44.15 minutes. Runtime.totalMemory()=13184794624

  2. picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 53.71 minutes. Runtime.totalMemory()=39832256512

  3. picard.sam.FixMateInformation done. Elapsed time: 53.79 minutes. Runtime.totalMemory()=9425649664

比對得到的都是sam格式數據,文件占硬盤空間太大,一般需要壓縮成二進制的bam格式文件,用的是 SortSam 至於 FixMateInformation步驟僅僅是對bam文件增加了MC和MQ這兩個tags

  1. add MC (CIGAR string for mate) and MQ (mapping quality of the mate/next segment) tags

而 markduplicates 步驟就比較復雜了,因為沒有選擇 REMOVE_DUPLICATES=True 所以並不會去除reads,只是標記一下而已,就是把sam文件的第二列改一下。

  1. Read 119776742 records.

  2. INFO    2017-06-05 10:57:22     MarkDuplicates  Marking 14482525 records as duplicates.

  3. INFO    2017-06-05 10:57:22     MarkDuplicates  Found 943146 optical duplicate clusters.

下面列出了部分被改變的flag值,可以去下面的網頁去查看每個flag的含義。

  1. # https://broadinstitute.github.io/picard/explain-flags.html

  2. # diff  -y -W 50   |grep '|'

  3. 163              | 1187

  4. 83              | 1107

  5. 99              | 1123

  6. 163              | 1187

  7. 147              | 1171

  8. 83              | 1107

  9. 99              | 1123

  10. 99              | 1123

  11. 147              | 1171

  12. 147              | 1171

  13. 99              | 1123

  14. 147              | 1171

  15. 163              | 1187

  16. 83              | 1107

最后是GATK

SplitNCigarReads 這個步驟對基因組數據來說可以略去,主要是針對於轉錄組數據的

命令是:

  1. Program Args: -T SplitNCigarReads -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta \

  2. -I L1_marked_fixed.bam -o L1_marked_fixed_split.bam \

  3. -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

程序運行的log日志是:

  1. INFO  13:04:52,813 ProgressMeter - Total runtime 2398.74 secs, 39.98 min, 0.67 hours

  2. INFO  13:04:52,854 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 120614036 total reads (0.00%)

  3. INFO  13:04:52,854 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter

  4. INFO  13:04:52,854 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter

  5. INFO  13:04:52,855 MicroScheduler -   -> 0 reads (0.00% of total) failing ReassignOneMappingQualityFilter

可以看到,對全基因組測序數據來說,這個步驟毫無效果,而且還耗時40分鍾,應該略去。

然后是indel區域的重排,需要結合 RealignerTargetCreator 和 IndelRealigner

命令是:

  1. Program Args: -T RealignerTargetCreator -I L1_marked_fixed_split.bam \

  2. -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -o L1_target.intervals \

  3. -known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \

  4. -known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf -nt 5

程序運行的log日志是:

  1. INFO  15:50:24,097 ProgressMeter - Total runtime 1165.34 secs, 19.42 min, 0.32 hours

  2. INFO  15:50:24,097 MicroScheduler - 22094746 reads were filtered out during the traversal out of approximately 120826819 total reads (18.29%)

  3. INFO  15:50:24,104 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter

  4. INFO  15:50:24,104 MicroScheduler -   -> 1774279 reads (1.47% of total) failing BadMateFilter

  5. INFO  15:50:24,104 MicroScheduler -   -> 14006627 reads (11.59% of total) failing DuplicateReadFilter

  6. INFO  15:50:24,104 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

  7. INFO  15:50:24,104 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter

  8. INFO  15:50:24,104 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter

  9. INFO  15:50:24,105 MicroScheduler -   -> 6313840 reads (5.23% of total) failing MappingQualityZeroFilter

  10. INFO  15:50:24,105 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter

  11. INFO  15:50:24,105 MicroScheduler -   -> 0 reads (0.00% of total) failing Platform454Filter

  12. INFO  15:50:24,105 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

命令是:

  1. Program Args: -T IndelRealigner -I L1_marked_fixed_split.bam \

  2. -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta \

  3. -targetIntervals L1_target.intervals -o L1_realigned.bam \

  4. -known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.vcf \

  5. -known /home/jianmingzeng/biosoft/GATK/resources/bundle/b37/1000G_phase1.indels.b37.vcf

程序運行的log日志是:

  1. INFO  17:21:00,917 ProgressMeter - Total runtime 4265.44 secs, 71.09 min, 1.18 hours

  2. INFO  17:21:00,920 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 120614036 total reads (0.00%)

  3. INFO  17:21:00,920 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter

  4. INFO  17:21:00,920 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter

最后是鹼基質量的矯正,需要結合 BaseRecalibrator 和 PrintReads

命令是:

  1. Program Args: -T BaseRecalibrator -I L1_realigned.bam \

  2. -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -o L1_temp.table \

  3. -knownSites /home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz

程序運行的log日志是:

  1. INFO  19:58:23,969 ProgressMeter - Total runtime 9436.69 secs, 157.28 min, 2.62 hours

  2. INFO  19:58:23,970 MicroScheduler - 21179430 reads were filtered out during the traversal out of approximately 120614036 total reads (17.56%)

  3. INFO  19:58:23,970 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter

  4. INFO  19:58:23,970 MicroScheduler -   -> 14073643 reads (11.67% of total) failing DuplicateReadFilter

  5. INFO  19:58:23,970 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

  6. INFO  19:58:23,970 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter

  7. INFO  19:58:23,971 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter

  8. INFO  19:58:23,971 MicroScheduler -   -> 7105787 reads (5.89% of total) failing MappingQualityZeroFilter

  9. INFO  19:58:23,971 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter

  10. INFO  19:58:23,971 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

命令是:

  1. Program Args: -T PrintReads -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -I L1_realigned.bam -o L1_recal.bam -BQSR L1_temp.table

程序運行的log日志是:

  1. INFO  23:41:00,540 ProgressMeter - Total runtime 13349.77 secs, 222.50 min, 3.71 hours

  2. INFO  23:41:00,542 MicroScheduler - 0 reads were filtered out during the traversal out of approximately 120614036 total reads (0.00%)

  3. INFO  23:41:00,542 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter

  4. INFO  23:41:00,542 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter

可以看到這個步驟非常的耗時,而且bam文件的大小近乎翻倍了。

最后是GATK真正的功能,variant-calling

我這里不僅僅是對最后recal的bam進行variant-calling 步驟,同時也對realign的bam做了,所以下面顯示兩個時間消耗的記錄,因為GATK的 BaseRecalibrator 步驟太耗費時間,而且極大的增加了bam文件的存儲,所以有必要確認這個步驟是否有必要。

命令是:

  1. Program Args: -T HaplotypeCaller -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -I L1_recal.bam --dbsnp /home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz -stand_emit_conf 10 -o L1_recal_raw.snps.indels.vcf

程序運行的log日志是:

  1. INFO  20:40:49,062 ProgressMeter -            done    3.101804739E9    10.9 h           12.0 s      100.0%    10.9 h       0.0 s

  2. INFO  20:40:49,063 ProgressMeter - Total runtime 39243.88 secs, 654.06 min, 10.90 hours

  3. INFO  20:40:49,064 MicroScheduler - 22384946 reads were filtered out during the traversal out of approximately 119776742 total reads (18.69%)

  4. INFO  20:40:49,064 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter

  5. INFO  20:40:49,064 MicroScheduler -   -> 13732328 reads (11.46% of total) failing DuplicateReadFilter

  6. INFO  20:40:49,065 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

  7. INFO  20:40:49,065 MicroScheduler -   -> 8652618 reads (7.22% of total) failing HCMappingQualityFilter

  8. INFO  20:40:49,066 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter

  9. INFO  20:40:49,066 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter

  10. INFO  20:40:49,066 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter

  11. INFO  20:40:49,067 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

命令是:

  1. Program Args: -T HaplotypeCaller -R /home/jianmingzeng/reference/genome/human_g1k_v37/human_g1k_v37.fasta -I L1_realigned.bam --dbsnp /home/jianmingzeng/annotation/variation/human/dbSNP/All_20160601.vcf.gz -stand_emit_conf 10 -o L1_realigned_raw.snps.indels.vcf

程序運行的log日志是:

  1. INFO  08:53:17,633 ProgressMeter -            done    3.101804739E9    12.2 h           14.0 s      100.0%    12.2 h       0.0 s

  2. INFO  08:53:17,633 ProgressMeter - Total runtime 43939.69 secs, 732.33 min, 12.21 hours

  3. INFO  08:53:17,634 MicroScheduler - 22384946 reads were filtered out during the traversal out of approximately 119776742 total reads (18.69%)

  4. INFO  08:53:17,634 MicroScheduler -   -> 0 reads (0.00% of total) failing BadCigarFilter

  5. INFO  08:53:17,635 MicroScheduler -   -> 13732328 reads (11.46% of total) failing DuplicateReadFilter

  6. INFO  08:53:17,635 MicroScheduler -   -> 0 reads (0.00% of total) failing FailsVendorQualityCheckFilter

  7. INFO  08:53:17,635 MicroScheduler -   -> 8652618 reads (7.22% of total) failing HCMappingQualityFilter

  8. INFO  08:53:17,636 MicroScheduler -   -> 0 reads (0.00% of total) failing MalformedReadFilter

  9. INFO  08:53:17,636 MicroScheduler -   -> 0 reads (0.00% of total) failing MappingQualityUnavailableFilter

  10. INFO  08:53:17,636 MicroScheduler -   -> 0 reads (0.00% of total) failing NotPrimaryAlignmentFilter

  11. INFO  08:53:17,637 MicroScheduler -   -> 0 reads (0.00% of total) failing UnmappedReadFilter

 

如果想了解更多的全基因組重測序分析內容,歡迎點擊閱讀原文查看,也歡迎把此文轉載給有需要的朋友。

對我的全基因組重測序數據來說,處理這個GATK最佳實踐,耗時約12+40+15=67小時。

還有關於GATK調用多線程來加快處理步驟的事情,我就不多說了,你其實可以去GATK官網查看詳細閱讀說明。

猜你喜歡

工作資訊 | 學習課程 | 好書分享

 

菜鳥入門

Linux | Perl | R語言

 

數據分析

ChIP-seq(上)ChIP-seq(下)RNA-seq

WGS,WES,RNA-seq組與ChIP-seq之間的異同

 

編程實踐

第0題 | 探索人類基因組序列

 

直播基因組分析

我的基因組 | 解惑帖

一個標准的基因檢測報告目錄

生信技能樹


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM