SAMTOOLS使用 SAM BAM文件處理


 【怪毛匠子 整理】

samtools學習及使用范例,以及官方文檔詳解

 

 

#第一步:把sam文件轉換成bam文件,我們得到map.bam文件  

system"samtools view -bS map.sam > map.bam";  

#第二步:sort 一下 BAM 文件,得到map.sorted.bam  

system"samtools sort map.b/am map.sorted";  

#第三步:創建一個關於bam的索引文件,我們得到一個map.sorted.bam.bai的文件  

system"samtools index map.sorted.bam";  

#第四步:snp,這里用的是sort以后的bam文件,如果不是,就會不斷的報錯  

system"samtools mpileup -ugf TAIR10.fas map.sorted.bam | bcftools view -vcg -D100 ->snp.vcf"  //snp位點

perl 語言

如果我們要獲取全部的位點的信息,而不是僅僅snp位點,那么我們只需要把最后一行的-v去掉就可以了

 

  1. system"samtools mpileup -ugf TAIR10.fas map.sorted.bam | bcftools view -cg -D100 ->snp.vcf"  //所有位點

再下面有詳細的解釋-v的作用:Output variant sites only (force -c):這里有-v這個選項就只輸出snp位點,如果沒有-v那么就是輸出所有的位點(測序所包含的)

 

 

 

samtools 功能介紹

 (2012-10-25 22:06:30)

 

標簽: 

samtools

 

高通量測序

 

比對

 

雜談

分類: 生物信息

samtools  輸入bam文件,導出sam文件。同時可以進行排序,合並,建立索引等功能,並支持從特定區域內查找reads。

 

samtools 可以應用於linux的命令管道里。以“-”作為標准輸入或輸出。

 

view  從bam/sam文件中提取/打印部分比對結果。默認為所有的區域,也可以染色體區域(1-based,須sort並index)。

 

例如:

samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

samtools view aln.sorted.bam chr2:20,100,000-20,200,000

 

tview

 

mpileup 列舉每條reads比對的indel,SNP等信息。

 

“.”表示match;

“,”表示反向match;

“<”或“>”表示reference skip;

"ATCGN"表示正向mismatch;

"atcgn"表示反向mismatch;

‘\+[0-9]+[ACGTNacgtn]+’ insertion;

‘-[0-9]+[ACGTNacgtn]+’ 表示deletion;

“^”標記reads起始;

“$”標記reads segment結尾;

 

samtools pileup -vcf ref.fasta aln.sorted.bam

samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam

 

merge 合並bam文件

samtools merge out.bam in1.bam in2.bam in3.bam

 

tview 圖形化的界面查看mapping的結果

? -help

g -到達指定的染色體區域

samtools tview aln.sorted.bam ref.fasta

 

sort 將比對結果排序

samtools sort aln.bam aln.sorted

 

reheader 將bam文件的頭文件替換。

 

index 為sort后的bam文件建立所以,這是在 view 比對結果時指定染色體區域必須做的。

samtools index aln.sorted.bam

 

rmdup 去除可能的PCR重復。將比對到相同位置的reads去重,僅保留比對質量最好的一個。一般只對paired end reads(fr)有效,並須指定ISIZE。

 

phase Call and phase heterozygous SNPs

 

 

 

samtools常用命令詳解

2014年01月26日 ⁄ Bioinformatics ⁄ 字號 小 中 大 ⁄ 評論 1 條 ⁄ 閱讀 12,457 次 [點擊加入在線收藏夾]

samtools的說明文檔:http://samtools.sourceforge.net/samtools.shtml

samtools是一個用於操作sam和bam文件的工具合集。包含有許多命令。以下是常用命令的介紹

1. view

view命令的主要功能是:將sam文件轉換成bam文件;然后對bam文件進行各種操作,比如數據的排序(不屬於本命令的功能)和提取(這些操作是對bam文件進行的,因而當輸入為sam文件的時候,不能進行該操作);最后將排序或提取得到的數據輸出為bam或sam(默認的)格式。

bam文件優點:bam文件為二進制文件,占用的磁盤空間比sam文本文件小;利用bam二進制文件的運算速度快。

view命令中,對sam文件頭部的輸入(-t或-T)和輸出(-h)是單獨的一些參數來控制的。

01

Usage: samtools view [options] <in.bam>|<in.sam> [region1 [...]]

 

02

默認情況下不加 region,則是輸出所有的 region.

 

03

 

 

04

Options: -b       output BAM

 

05

                  默認下輸出是 SAM 格式文件,該參數設置輸出 BAM 格式

 

06

         -h       print header for the SAM output

 

07

                  默認下輸出的 sam 格式文件不帶 header,該參數設定輸出sam文件時帶 header 信息

 

08

         -H       print header only (no alignments)

 

09

         -S       input is SAM

 

10

                  默認下輸入是 BAM 文件,若是輸入是 SAM 文件,則最好加該參數,否則有時候會報錯。

 

11

         -u       uncompressed BAM output (force -b)

 

12

                  該參數的使用需要有-b參數,能節約時間,但是需要更多磁盤空間。

 

13

         -c       Instead of printing the alignments, only count them and print the

 

14

                  total number. All filter options, such as ‘-f’, ‘-F’ and ‘-q’ ,

 

15

                  are taken into account.

 

16

         -1       fast compression (force -b)

 

17

         -x       output FLAG in HEX (samtools-C specific)

 

18

         -X       output FLAG in string (samtools-C specific)

 

19

         -c       print only the count of matching records

 

20

         -L FILE  output alignments overlapping the input BED FILE [null]

 

21

         -t FILE  list of reference names and lengths (force -S) [null]

 

22

                  使用一個list文件來作為header的輸入

 

23

         -T FILE  reference sequence file (force -S) [null]

 

24

                  使用序列fasta文件作為header的輸入

 

25

         -o FILE  output file name [stdout]

 

26

         -R FILE  list of read groups to be outputted [null]

 

27

         -f INT   required flag, 0 for unset [0]

 

28

         -F INT   filtering flag, 0 for unset [0]

 

29

                  Skip alignments with bits present in INT [0]

 

30

                  數字4代表該序列沒有比對到參考序列上

 

31

                  數字8代表該序列的mate序列沒有比對到參考序列上

 

32

         -q INT   minimum mapping quality [0]

 

33

         -l STR   only output reads in library STR [null]

 

34

         -r STR   only output reads in read group STR [null]

 

35

         -s FLOAT fraction of templates to subsample; integer part as seed [-1]

 

36

         -?       longer help

例子:

#將sam文件轉換成bam文件 $ samtools view -bS abc.sam > abc.bam $ samtools view -b -S abc.sam -o abc.bam

01

#提取比對到參考序列上的比對結果

 

02

$ samtools view -bF 4 abc.bam > abc.F.bam

 

03

 

 

04

#提取paired reads中兩條reads都比對到參考序列上的比對結果,只需要把兩個4+8的值12作為過濾參數即可

 

05

$ samtools view -bF 12 abc.bam > abc.F12.bam

 

06

 

 

07

#提取沒有比對到參考序列上的比對結果

 

08

$ samtools view -bf 4 abc.bam > abc.f.bam

 

09

 

 

10

#提取bam文件中比對到caffold1上的比對結果,並保存到sam文件格式

 

11

$ samtools view abc.bam scaffold1 > scaffold1.sam

 

12

 

 

13

#提取scaffold1上能比對到30k到100k區域的比對結果

 

14

$ samtools view abc.bam scaffold1:30000-100000 > scaffold1_30k-100k.sam

 

15

 

 

16

#根據fasta文件,將 header 加入到 sam 或 bam 文件中

 

17

$ samtools view -T genome.fasta -h scaffold1.sam > scaffold1.h.sam

2. sort

sort對bam文件進行排序。

1

Usage: samtools sort [-n] [-m <maxMem>] <in.bam> <out.prefix> 

 

2

-m 參數默認下是 500,000,000 即500M(不支持K,M,G等縮寫)。對於處理大數據時,如果內存夠用,則設置大點的值,以節約時間。

 

3

-n 設定排序方式按short reads的ID排序。默認下是按序列在fasta文件中的順序(即header)和序列從左往右的位點排序。

例子:

$ samtools sort abc.bam abc.sort $ samtools view abc.sort.bam | less -S

3.merge

2個或2個以上的已經sort了的bam文件融合成一個bam文件。融合后的文件不需要則是已經sort過了的。

01

Usage:   samtools merge [-nr] [-h inh.sam] <out.bam> <in1.bam> <in2.bam>[...]

 

02

 

 

03

Options: -n       sort by read names

 

04

         -r       attach RG tag (inferred from file names)

 

05

         -u       uncompressed BAM output

 

06

         -f       overwrite the output BAM if exist

 

07

         -1       compress level 1

 

08

         -R STR   merge file in the specified region STR [all]

 

09

         -h FILE  copy the header in FILE to <out.bam> [in1.bam]

 

10

 

 

11

Note: Samtools' merge does not reconstruct the @RG dictionary in the header. Users

 

12

      must provide the correct header with -h, or uses Picard which properly maintains

 

13

      the header dictionary in merging.

4.index

必須對bam文件進行默認情況下的排序后,才能進行index。否則會報錯。

建立索引后將產生后綴為.bai的文件,用於快速的隨機處理。很多情況下需要有bai文件的存在,特別是顯示序列比對情況下。比如samtool的tview命令就需要;gbrowse2顯示reads的比對圖形的時候也需要。

Usage: samtools index <in.bam> [out.index]

例子:

1

#以下兩種命令結果一樣

 

2

<span style="font-weight: inherit; font-style: inherit; color: #ff6600;">$ samtools index abc.sort.bam</span>

 

3

$ samtools index abc.sort.bam abc.sort.bam.bai

5. faidx

fasta文件建立索引,生成的索引文件以.fai后綴結尾。該命令也能依據索引文件快速提取fasta文件中的某一條(子)序列

1

Usage: samtools faidx <in.bam> [ [...]]

 

2

 

 

3

對基因組文件建立索引

 

4

$ samtools faidx genome.fasta

 

5

#生成了索引文件genome.fasta.fai,是一個文本文件,分成了5列。第一列是子序列的名稱;第二列是子序列的長度;個人認為“第三列是序列所在的位置”,因為該數字從上往下逐漸變大,最后的數字是genome.fasta文件的大小;第4和5列不知是啥意思。於是通過此文件,可以定位子序列在fasta文件在磁盤上的存放位置,直接快速調出子序列。

 

6

 

 

7

#由於有索引文件,可以使用以下命令很快從基因組中提取到fasta格式的子序列

 

8

$ samtools faidx genome.fasta scffold_10 > scaffold_10.fasta

6. tview

tview能直觀的顯示出reads比對基因組的情況,和基因組瀏覽器有點類似。

01

Usage: samtools tview <aln.bam> [ref.fasta]

 

02

 

 

03

當給出參考基因組的時候,會在第一排顯示參考基因組的序列,否則,第一排全用N表示。

 

04

按下 g ,則提示輸入要到達基因組的某一個位點。例子“scaffold_10:1000"表示到達第

 

05

10號scaffold的第1000個鹼基位點處。

 

06

使用H(左)J(上)K(下)L(右)移動顯示界面。大寫字母移動快,小寫字母移動慢。

 

07

使用空格建向左快速移動(和 L 類似),使用Backspace鍵向左快速移動(和 H 類似)。

 

08

Ctrl+H 向左移動1kb鹼基距離; Ctrl+L 向右移動1kb鹼基距離

 

09

可以用顏色標注比對質量,鹼基質量,核苷酸等。30~40的鹼基質量或比對質量使用白色表示;

 

10

20~30黃色;10~20綠色;0~10藍色。

 

11

使用點號'.'切換顯示鹼基和點號;使用r切換顯示read name等

 

12

還有很多其它的使用說明,具體按 ? 鍵來查看。

7. flagstat

給出BAM文件的比對結果

01

Usage: samtools flagstat <in.bam>

 

02

 

 

03

$ samtools flagstat example.bam

 

04

11945742 + 0 in total (QC-passed reads + QC-failed reads)

 

05

#總共的reads數

 

06

0 + 0 duplicates

 

07

7536364 + 0 mapped (63.09%:-nan%)

 

08

#總體上reads的匹配率

 

09

11945742 + 0 paired in sequencing

 

10

#有多少reads是屬於paired reads

 

11

5972871 + 0 read1

 

12

#reads1中的reads數

 

13

5972871 + 0 read2

 

14

#reads2中的reads數

 

15

6412042 + 0 properly paired (53.68%:-nan%)

 

16

#完美匹配的reads數:比對到同一條參考序列,並且兩條reads之間的距離符合設置的閾值

 

17

6899708 + 0 with itself and mate mapped

 

18

#paired reads中兩條都比對到參考序列上的reads數

 

19

636656 + 0 singletons (5.33%:-nan%)

 

20

#單獨一條匹配到參考序列上的reads數,和上一個相加,則是總的匹配上的reads數。

 

21

469868 + 0 with mate mapped to a different chr

 

22

#paired reads中兩條分別比對到兩條不同的參考序列的reads數

 

23

243047 + 0 with mate mapped to a different chr (mapQ>=5)

#同上一個,只是其中比對質量>=5的reads的數量

7. depth

得到每個鹼基位點的測序深度,並輸出到標准輸出。

1

Usage: bam2depth [-r reg] [-q baseQthres] [-Q mapQthres] [-b in.bed] <in1.bam> [...]

8. 其它有用的命令

reheader 替換bam文件的頭

1

$ samtools reheader <in.header.sam> <in.bam>

cat 連接多個bam文件,適用於非sorted的bam文件

1

$ samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ]

idxstats 統計一個表格,4列,分別為”序列名,序列長度,比對上的reads數,unmapped reads number”。第4列應該是paired reads中有一端能匹配到該scaffold上,而另外一端不匹配到任何scaffolds上的reads數。

1

$ samtools idxstats <aln.bam>

9. 將bam文件轉換為fastq文件

有時候,我們需要提取出比對到一段參考序列的reads,進行小范圍的分析,以利於debug等。這時需要將bam或sam文件轉換為fastq格式。

該網站提供了一個bam轉換為fastq的程序:http://www.hudsonalpha.org/gsl/information/software/bam2fastq

1

$ wget http://www.hudsonalpha.org/gsl/static/software/bam2fastq-1.1.0.tgz

 

2

$ tar zxf bam2fastq-1.1.0.tgz

 

3

$ cd bam2fastq-1.1.0

 

4

$ make

 

5

$ ./bam2fastq <in.bam>

10. mpileup

samtools還有個非常重要的命令mpileup,以前為pileup。該命令用於生成bcf文件,再使用bcftools進行SNP和Indel的分析。bcftools是samtool中附帶的軟件,在samtools的安裝文件夾中可以找到。

最常用的參數有2:

-f 來輸入有索引文件的fasta參考序列;

-g 輸出到bcf格式。用法和最簡單的例子如下

1

Usage: samtools mpileup [-EBug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

 

2

 

 

3

$ samtools mpileup -f genome.fasta abc.bam > abc.txt

 

4

$ samtools mpileup -gSDf genome.fasta abc.bam > abc.bcf

 

5

$ samtools mpileup -guSDf genome.fasta abc.bam | \

 

6

 bcftools view -cvNg - > abc.vcf

mpileup不使用-u或-g參數時,則不生成二進制的bcf文件,而生成一個文本文件(輸出到標准輸出)。該文本文件統計了參考序列中每個鹼基位點的比對情況;該文件每一行代表了參考序列中某一個鹼基位點的比對結果。比如:

01

scaffold_1      2841    A       11      ,,,...,....     BHIGDGIJ?FF

 

02

scaffold_1      2842    C       12      ,$,,...,....^I. CFGEGEGGCFF+

 

03

scaffold_1      2843    G       11      ,,...,.....     FDDDDCD?DD+

 

04

scaffold_1      2844    G       11      ,,...,.....     FA?AAAA<AA+

 

05

scaffold_1      2845    G       11      ,,...,.....     F656666166*

 

06

scaffold_1      2846    A       11      ,,...,.....     (1.1111)11*

 

07

scaffold_1      2847    A       11      ,,+9acggtgaag.+9ACGGTGAAT.+9ACGGTGAAG.+9ACGGTGAAG,+9acggtgaag.+9ACGGTGAAG.+9ACGGTGAAG.+9ACGGTGAAG.+9ACGGTGAAG.+9ACGGTGAAG       %.+....-..)

 

08

scaffold_1      2848    N       11      agGGGgGGGGG     !!$!!!!!!!!

 

09

scaffold_1      2849    A       11      c$,...,.....    !0000000000

 

10

scaffold_1      2850    A       10      ,...,.....      353333333

mpileup生成的結果包含6行:參考序列名;位置;參考鹼基;比對上的reads數;比對情況;比對上的鹼基的質量。其中第5列比較復雜,解釋如下:

1 ‘.’代表與參考序列正鏈匹配。

2 ‘,’代表與參考序列負鏈匹配。

3 ‘ATCGN’代表在正鏈上的不匹配。

4 ‘atcgn’代表在負鏈上的不匹配。

5 ‘*’代表模糊鹼基

6 ‘^’代表匹配的鹼基是一個read的開始;’^'后面緊跟的ascii碼減去33代表比對質量;這兩個符號修飾的是后面的鹼基,其后緊跟的鹼基(.,ATCGatcgNn)代表該read的第一個鹼基。

7 ‘$’代表一個read的結束,該符號修飾的是其前面的鹼基。

8 正則式’\+[0-9]+[ACGTNacgtn]+’代表在該位點后插入的鹼基;比如上例中在scaffold_1的2847后插入了9個長度的鹼基acggtgaag。表明此處極可能是indel。

9 正則式’-[0-9]+[ACGTNacgtn]+’代表在該位點后缺失的鹼基;

pileup具體的參數如下:

01

#輸入參數

 

02

-6 Assume the quality is in the Illumina 1.3+ encoding. -A Do not skip anomalous read pairsin variant calling.

 

03

-B Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.

 

04

-b FILE List of input BAM files, one file per line [null]

 

05

-C INT Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50. [0]

 

06

-d INT At a position, read maximally INT reads per input BAM. [250]

 

07

-E Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt specificity a little bit.

 

08

-f FILE The faidx-indexed reference file in the FASTA format. The file can be optionally compressed by razip. [null]

 

09

-l FILE BED or position list file containing a list of regions or sites where pileup or BCF should be generated [null]

 

10

-M INT       cap mapping quality at INT [60]

 

11

-q INT Minimum mapping quality for an alignment to be used [0]

 

12

-Q INT Minimum base quality for a base to be considered [13]

 

13

-r STR Only generate pileup in region STR [all sites]

 

14

 

 

15

輸出參數

 

16

-D Output per-sample read depth (require -g/-u)

 

17

-g Compute genotype likelihoods and output them in the binary call format (BCF).

 

18

-S Output per-sample Phred-scaled strand bias P-value (require -g/-u)

 

19

-u Similar to -g except that the output is uncompressed BCF, which is preferred for piping.

 

20

 

 

21

Options for Genotype Likelihood Computation (for -g or -u):

 

22

-e INT Phred-scaled gap extension sequencing error probability. Reducing INT leads to longer indels. [20]

 

23

-h INT Coefficient for modeling homopolymer errors. Given an l-long homopolymer run, the sequencing error of an indel of size s is modeled as INT*s/l. [100]

 

24

-I Do not perform INDEL calling

 

25

-L INT Skip INDEL calling if the average per-sample depth is above INT. [250]

 

26

-o INT Phred-scaled gap open sequencing error probability. Reducing INT leads to more indel calls. [40]

 

27

-P STR Comma dilimited list of platforms (determined by @RG-PL) from which indel candidates are obtained. It is recommended to collect indel candidates from sequencing technologies that have low indel error rate such as ILLUMINA. [all]

11. 使用bcftools

bcftools和samtools類似,用於處理vcf(variant call format)文件和bcf(binary call format)文件。前者為文本文件,后者為其二進制文件。

bcftools使用簡單,最主要的命令是view命令,其次還有index和cat等命令。index和cat命令和samtools中類似。此處主講使用view命令來進行SNP和Indel calling。該命令的使用方法和例子為:

1

$ bcftools view [-AbFGNQSucgv] [-D seqDict] [-l listLoci] [-s listSample]

 

2

          [-i gapSNPratio] [-t mutRate] [-p varThres] [-P prior]

 

3

          [-1 nGroup1] [-d minFrac] [-U nPerm] [-X permThres]

 

4

          [-T trioType] in.bcf [region]

 

5

 

 

6

$ bcftools view -cvNg abc.bcf > snp_indel.vcf

生成的結果文件為vcf格式,有10列,分別是:

1 參考序列名;

2 variant所在的left-most位置;

3 variant的ID(默認未設置,用’.'表示);

4 參考序列的allele;

5 variant的allele(有多個alleles,則用’,'分隔);

6 variant/reference QUALity;

7 FILTers applied;

8 variant的信息,使用分號隔開;

9 FORMAT of the genotype fields, separated by colon (optional);

10 SAMPLE genotypes and per-sample information (optional)。

例如:

1

scaffold_1      2847    .       A       AACGGTGAAG      194     .       INDEL;DP=11;VDB=0.0401;AF1=1;AC1=2;DP4=0,0,8,3;MQ=35;FQ=-67.5   GT:PL:GQ        1/1:235,33,0:63

 

2

scaffold_1      3908    .       G       A       111     .       DP=13;VDB=0.0085;AF1=1;AC1=2;DP4=0,0,5,7;MQ=42;FQ=-63   GT:PL:GQ        1/1:144,36,0:69

 

3

scaffold_1      4500    .       A       G       31.5    .       DP=8;VDB=0.0034;AF1=1;AC1=2;DP4=0,0,1,3;MQ=42;FQ=-39    GT:PL:GQ        1/1:64,12,0:21

 

4

scaffold_1      4581    .       TGGNGG  TGG     145     .       INDEL;DP=8;VDB=0.0308;AF1=1;AC1=2;DP4=0,0,0,8;MQ=42;FQ=-58.5    GT:PL:GQ        1/1:186,24,0:45

 

5

scaffold_1      4644    .       G       A       195     .       DP=21;VDB=0.0198;AF1=1;AC1=2;DP4=0,0,10,10;MQ=42;FQ=-87 GT:PL:GQ        1/1:228,60,0:99

 

6

scaffold_1      4827    .       NACAAAGA        NA      4.42    .       INDEL;DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=40;FQ=-37.5       GT:PL:GQ        0/1:40,3,0:3

 

7

scaffold_1      4854    .       A       G       48      .       DP=6;VDB=0.0085;AF1=1;AC1=2;DP4=0,0,2,1;MQ=41;FQ=-36    GT:PL:GQ        1/1:80,9,0:16

 

8

scaffold_1      5120    .       A       G       85      .       DP=8;VDB=0.0355;AF1=1;AC1=2;DP4=0,0,5,3;MQ=42;FQ=-51    GT:PL:GQ        1/1:118,24,0:45

8列中顯示了對variants的信息描述,比較重要,其中的 Tag 的描述如下:

01

Tag   Format   Description

 

02

AF1   double   Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele

 

03

DP int   Raw read depth (without quality filtering)

 

04

DP4   int[4]   # high-quality reference forward bases, ref reverse, alternate for and alt rev bases

 

05

FQ int   Consensus quality. Positive: sample genotypes different; negative: otherwise

 

06

MQ int   Root-Mean-Square mapping quality of covering reads

 

07

PC2   int[2]   Phred probability of AF in group1 samples being larger (,smaller) than ingroup2

 

08

PCHI2 double   Posterior weighted chi^2 P-value between group1 and group2 samples

 

09

PV4   double[4]   P-value for strand bias, baseQ bias, mapQ bias and tail distance bias

 

10

QCHI2 int   Phred-scaled PCHI2

 

11

RP int   # permutations yielding a smaller PCHI2

 

12

CLR   int   Phred log ratio of genotype likelihoods with and without the trio/pair constraint

 

13

UGT   string   Most probable genotype configuration without the trio constraint

 

14

CGT   string   Most probable configuration with the trio constraint

bcftools view 的具體參數如下:

01

Input/Output Options:

 

02

-A Retain all possible alternate alleles at variant sites. By default, the view commanddiscards unlikely alleles.

 

03

-b Output in the BCF format. The default is VCF.

 

04

-D FILE Sequence dictionary (list of chromosome names) for VCF->BCF conversion [null]

 

05

-F Indicate PL is generated by r921 or before (ordering is different).

 

06

-G Suppress all individual genotype information.

 

07

-l FILE List of sites at which information are outputted [all sites]

 

08

-N Skip sites where the REF field is not A/C/G/T

 

09

-Q Output the QCALL likelihood format

 

10

-s FILE List of samples to use. The first column in the input gives the sample names and the second gives the ploidy, which can only be 1 or 2. When the 2nd column is absent, the sample ploidy is assumed to be 2. In the output, the ordering of samples will be identical to the one in FILE. [null]

 

11

-S The input is VCF instead of BCF.

 

12

-u Uncompressed BCF output (force -b).

 

13

 

 

14

Consensus/Variant Calling Options:

 

15

-c Call variants using Bayesian inference. This option automatically invokes option -e.

 

16

-d FLOAT When -v is in use, skip loci where the fraction of samples covered by reads is below FLOAT. [0]

 

17

        當有多個sample用於variants calling時,比如多個轉錄組數據或多個重測序

 

18

        數據需要比對到參考基因組上,設置該值,表明至少有該<float 0~1>比例的

 

19

        samples在該位點都有覆蓋才計算入variant.所以對於只有一個sample的情況

 

20

        下,該值設置在0~1之間沒有意義,大於1則得不到任何結果。

 

21

-e Perform max-likelihood inference only, including estimating the site allele frequency, testing Hardy-Weinberg equlibrium and testing associations with LRT.

 

22

-g Call per-sample genotypes at variant sites (force -c)

 

23

-i FLOAT Ratio of INDEL-to-SNP mutation rate [0.15]

 

24

-p FLOAT A site is considered to be a variant if P(ref|D)

 

25

-t FLOAT Scaled muttion rate for variant calling [0.001]

 

26

-T STR Enable pair/trio calling. For trio calling, option -s is usually needed to be applied to configure the trio members and their ordering. In the file supplied to the option -s, the first sample must be the child, the second the father and the third the mother. The valid values of STR are ‘pair’, ‘trioauto’, ‘trioxd’ and ‘trioxs’, where ‘pair’ calls differences between two input samples, and ‘trioxd’ (‘trioxs’) specifies that the input is from the X chromosome non-PAR regions and the child is a female (male). [null]

 

27

-v Output variant sites only (force -c)

 

28

 

 

29

Contrast Calling and Association Test Options:

 

30

-1 INT   Number of group-1 samples. This option is used for dividing the samples into twogroups for contrast SNP calling or association test. When this option is in use, the following VCF INFO will be outputted: PC2, PCHI2 and QCHI2. [0]

 

31

-U INT   Number of permutations for association test (effective only with -1) [0]

 

32

-X FLOAT Only perform permutations for P(chi^2)

使用bcftools得到variant calling結果后。需要對結果再次進行過濾。主要依據比對結果中第8列信息。其中的 DP4 一行尤為重要,提供了4個數據:1 比對結果和正鏈一致的reads數、2 比對結果和負鏈一致的reads數、3 比對結果在正鏈的variant上的reads數、4 比對結果在負鏈的variant上的reads數。可以設定 (value3 + value4)大於某一閾值,才算是variant。比如:

1

$ perl -ne 'print $_ if /DP4=(\d+),(\d+),(\d+),(\d+)/ && ($3+$4)>=10 && ($3+$4)/($1+$2+$3+$4)>=0.8' snp_indel.vcf > snp_indel.final.vcf

12. samtools rmdup

NGS上機測序前需要進行PCR一步,使一個模板擴增出一簇,從而在上機測序的時候表現出為1個點,即一個reads。若一個模板擴增出了多簇,結果得到了多個reads,這些reads的坐標(coordinates)是相近的。在進行了reads比對后需要將這些由PCR duplicates獲得的reads去掉,並只保留最高比對質量的read。使用rmdup命令即可完成.

1

Usage:  samtools rmdup [-sS] 

 

2

-s 對single-end reads。默認情況下,只對paired-end reads

 

3

-S 將Paired-end reads作為single-end reads處理。

 

4

 

 

5

$ samtools input.sorted.bam output.bam

本文來自:http://www.chenlianfu.com/?p=1399

 

網址如下:http://samtools.sourceforge.net/samtools.shtml

Manual Reference Pages  - 

NAME

samtools - Utilities for the Sequence Alignment/Map (SAM) format

bcftools - Utilities for the Binary Call Format (BCF) and VCF

CONTENTS

Synopsis

Description

Samtools Commands And Options

Bcftools Commands And Options

Sam Format

Vcf Format

Examples

Limitations

Author

See Also

SYNOPSIS(大綱):這個大綱其實詳細的說明了運行的命令,如果沒有特殊要求就可以直接采用了。下面的東西都是針對這個的描述。

samtools view -bt ref_list.txt -o aln.bam aln.sam.gz

samtools sort aln.bam aln.sorted  :這個是sort的命令,需要的是aln.bam時你要sort的文件,后面跟的是你可以自己命名的最好和前面保持一致

samtools index aln.sorted.bam

     :sort以后要用建立一個索引文件就直接用這個命令

samtools idxstats aln.sorted.bam

samtools view aln.sorted.bam chr2:20,100,000-20,200,000

samtools merge out.bam in1.bam in2.bam in3.bam

samtools faidx ref.fasta

samtools pileup -vcf ref.fasta aln.sorted.bam

samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam

     :我們再上面用過的最后snp的提取里

samtools tview aln.sorted.bam ref.fasta

bcftools index in.bcf

bcftools view in.bcf chr2:100-200 > out.vcf

bcftools view -vc in.bcf > out.vcf 2> out.afs

DESCRIPTION

Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows to retrieve reads in any regions swiftly.

Samtools is designed to work on a stream. It regards an input file ‘-’ as the standard input (stdin) and an output file ‘-’ as the standard output (stdout). Several commands can thus be combined with Unix pipes. Samtools always output warning and error messages to the standard error output (stderr).

Samtools is also able to open a BAM (not SAM) file on a remote FTP or HTTP server if the BAM file name starts with ‘ftp://’ or ‘http://’. Samtools checks the current working directory for the index file and will download the index upon absence. Samtools does not retrieve the entire alignment file unless it is asked to do so.

SAMTOOLS COMMANDS AND OPTIONS

view

samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]]

Extract/print all or sub alignments in SAM or BAM format. If no region is specified, all the alignments will be printed; otherwise only alignments overlapping the specified regions will be output. An alignment may be given multiple times if it is overlapping several regions. A region can be presented, for example, in the following format: ‘chr2’ (the whole chr2), ‘chr2:1000000’ (region starting from 1,000,000bp) or ‘chr2:1,000,000-2,000,000’ (region between 1,000,000 and 2,000,000bp including the end points). The coordinate is 1-based.

OPTIONS:

tview

samtools tview <in.sorted.bam> [ref.fasta]

Text alignment viewer (based on the ncurses library). In the viewer, press ‘?’ for help and press ‘g’ to check the alignment start from a region in the format like ‘chr10:10,000,000’ or ‘=10,000,000’ when viewing the same reference sequence.

這個命令是查看的命令,看到的是map以后覆蓋度的文件,samtools tview .bam文件 .ref文件

mpileup

samtools mpileup [-EBug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ][-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]]

Generate BCF or pileup for one or multiple BAM files. Alignment records are grouped by sample identifiers in @RG header lines. If sample identifiers are absent, each input file is regarded as one sample.

In the pileup format (without -uor-g), each line represents a genomic position, consisting of chromosome name, coordinate, reference base, read bases, read qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column. At this column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, a ’>’ or ’<’ for a reference skip, ‘ACGTN’ for a mismatch on the forward strand and ‘acgtn’ for a mismatch on the reverse strand. A pattern ‘\+[0-9]+[ACGTNacgtn]+’ indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern ‘-[0-9]+[ACGTNacgtn]+’ represents a deletion from the reference. The deleted bases will be presented as ‘*’ in the following lines. Also at the read base column, a symbol ‘^’ marks the start of a read. The ASCII of the character following ‘^’ minus 33 gives the mapping quality. A symbol ‘$’ marks the end of a read segment.

Input Options:

reheader

samtools reheader <in.header.sam> <in.bam>

Replace the header in in.bam with the header in in.header.sam. This command is much faster than replacing the header with a BAM->SAM->BAM conversion.

cat

samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ]

Concatenate BAMs. The sequence dictionary of each input BAM must be identical, although this command does not check this. This command uses a similar trick toreheader which enables fast BAM concatenation.

sort

samtools sort [-no] [-m maxMem] <in.bam> <out.prefix>

Sort alignments by leftmost coordinates. File <out.prefix>.bam will be created. This command may also create temporary files <out.prefix>.%d.bam when the whole alignment cannot be fitted into memory (controlled by option -m).

OPTIONS:

merge

samtools merge [-nur1f] [-h inh.sam] [-R reg] <out.bam> <in1.bam> <in2.bam> [...]

Merge multiple sorted alignments. The header reference lists of all the input BAM files, and the @SQ headers of inh.sam, if any, must all refer to the same set of reference sequences. The header reference list and (unless overridden by -h) ‘@’ headers of in1.bam will be copied to out.bam, and the headers of other files will be ignored.

OPTIONS:

index

samtools index <aln.bam>

Index sorted alignment for fast random access. Index file <aln.bam>.bai will be created.

idxstats

samtools idxstats <aln.bam>

Retrieve and print stats in the index file. The output is TAB delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads.

faidx

samtools faidx <ref.fasta> [region1 [...]]

Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create<ref.fasta>.fai on the disk. If regions are speficified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the RAZF format.

fixmate

samtools fixmate <in.nameSrt.bam> <out.bam>

Fill in mate coordinates, ISIZE and mate related flags from a name-sorted alignment.

rmdup

samtools rmdup [-sS] <input.srt.bam> <out.bam>

Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads).

OPTIONS:

calmd

samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta>

Generate the MD tag. If the MD tag is already present, this command will give a warning if the MD tag generated is different from the existing tag. Output SAM by default.

OPTIONS:

targetcut

samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam>

This command identifies target regions by examining the continuity of read depth, computes haploid consensus sequences of targets and outputs a SAM with each sequence corresponding to a target. When option -f is in use, BAQ will be applied. This command is only designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)].

 

 

 

 

 

 

 

-b

Output in the BAM format.我們第一步把sam轉換成bam的中-bS中-b表示的就是要輸出bam的文件

-f INT

Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]

-F INT

Skip alignments with bits present in INT [0]

-h

Include the header in the output.(再輸出文件中包含頭文件)

-H

Output the header only.(只輸出頭文件)

-l STR

Only output reads in library STR [null]

-o FILE

Output file [stdout]

-q INT

Skip alignments with MAPQ smaller than INT [0]

-r STR

Only output reads in read group STR [null]

-R FILE

Output reads in read groups listed in FILE [null]

-S

Input is in SAM. If @SQ header lines are absent, the ‘-t’ option is required.這里S表示的就是輸入的是SAM的格式,如果sam中沒有頭文件,那么就要用到-t的選項

-c

Instead of printing the alignments, only count them and print the total number. All filter options, such as ‘-f’, ‘-F’ and ‘-q’ , are taken into account.

-t FILE

This file is TAB-delimited. Each line must contain the reference name and the length of the reference, one line for each distinct reference; additional fields are ignored. This file also defines the order of the reference sequences in sorting. If you run ‘samtools faidx <ref.fa>’, the resultant index file <ref.fa>.fai can be used as this <in.ref_list> file.

-u

Output uncompressed BAM. This option saves time spent on compression/decomprssion and is thus preferred when the output is piped to another samtools command.

 

 

 

 

 

 

 

-6

Assume the quality is in the Illumina 1.3+ encoding. -A Do not skip anomalous read pairs in variant calling.

-B

Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments.

-b FILE

List of input BAM files, one file per line [null]

-C INT

Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50. [0]

-d INT

At a position, read maximally INT reads per input BAM. [250]

-E

Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt specificity a little bit.

-f FILE

The faidx-indexed reference file in the FASTA format. The file can be optionally compressed by razip. [null]:要有一個參考序列

-l FILE

BED or position list file containing a list of regions or sites where pileup or BCF should be generated [null]

-q INT

Minimum mapping quality for an alignment to be used [0]

-Q INT

Minimum base quality for a base to be considered [13]

-r STR

Only generate pileup in region STR [all sites]

Output Options:輸出選項

 

 

-D

Output per-sample read depth 讀取的深度,可以設定值比如-D100

-g

Compute genotype likelihoods and output them in the binary call format (BCF).

-S

Output per-sample Phred-scaled strand bias P-value

-u

Similar to -g except that the output is uncompressed(未壓縮的) BCF, which is preferred for piping.

Options for Genotype Likelihood Computation (for -g or -u):

 

 

-e INT

Phred-scaled gap extension sequencing error probability. Reducing INTleads to longer indels. [20]

-h INT

Coefficient for modeling homopolymer errors. Given an l-long homopolymer run, the sequencing error of an indel of size s is modeled as INT*s/l. [100]

-I

Do not perform INDEL calling

-L INT

Skip INDEL calling if the average per-sample depth is above INT. [250]

-o INT

Phred-scaled gap open sequencing error probability. Reducing INT leads to more indel calls. [40]

-P STR

Comma dilimited list of platforms (determined by @RG-PL) from which indel candidates are obtained. It is recommended to collect indel candidates from sequencing technologies that have low indel error rate such as ILLUMINA. [all]

 

 

 

 

 

 

 

-o

Output the final alignment to the standard output.

-n

Sort by read names rather than by chromosomal coordinates

-m INT

Approximately the maximum required memory. [500000000]

 

 

 

 

 

 

 

-1

Use zlib compression level 1 to comrpess the output

-f

Force to overwrite the output file if present.

-h FILE

Use the lines of FILE as ‘@’ headers to be copied to out.bam, replacing any header lines that would otherwise be copied from in1.bam. (FILE is actually in SAM format, though any alignment records it may contain are ignored.)

-n

The input alignments are sorted by read names rather than by chromosomal coordinates

-R STR

Merge files in the specified region indicated by STR [null]

-r

Attach an RG tag to each alignment. The tag value is inferred from file names.

-u

Uncompressed BAM output

 

 

 

 

 

 

 

-s

Remove duplicate for single-end reads. By default, the command works for paired-end reads only.

-S

Treat paired-end reads and single-end reads.

 

 

 

 

 

 

 

-A

When used jointly with -r this option overwrites the original base quality.

-e

Convert a the read base to = if it is identical to the aligned reference base. Indel caller does not support the = bases at the moment.

-u

Output uncompressed BAM

-b

Output compressed BAM

-S

The input is SAM with header lines

-C INT

Coefficient to cap mapping quality of poorly mapped reads. See the pileupcommand for details. [0]

-r

Compute the BQ tag (without -A) or cap base quality by BAQ (with -A).

-E

Extended BAQ calculation. This option trades specificity for sensitivity, though the effect is minor.

 

 

 

phase

samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam>

Call and phase heterozygous SNPs. OPTIONS:

 

 

 

 

 

 

 

-A

Drop reads with ambiguous phase.

-b STR

Prefix of BAM output. When this option is in use, phase-0 reads will be saved in fileSTR.0.bam and phase-1 reads in STR.1.bam. Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads with switch errors will be saved inSTR.chimeric.bam. [null]

-F

Do not attempt to fix chimeric reads.

-k INT

Maximum length for local phasing. [13]

-q INT

Minimum Phred-scaled LOD to call a heterozygote. [40]

-Q INT

Minimum base quality to be used in het calling. [13]

 

 

BCFTOOLS COMMANDS AND OPTIONS

view

bcftools view [-AbFGNQSucgv] [-D seqDict] [-l listLoci] [-s listSample] [-igapSNPratio] [-t mutRate] [-p varThres] [-P prior] [-1 nGroup1] [-d minFrac] [-UnPerm] [-X permThres] [-T trioType] in.bcf [region]

Convert between BCF and VCF, call variant candidates and estimate allele frequencies.

index

bcftools index in.bcf

Index sorted BCF for random access.

 

 

 

Input/Output Options: 

-A

 

Retain all possible alternate alleles at variant sites. By default, the view command discards unlikely alleles.

 

 

 

-b

Output in the BCF format. The default is VCF.

-D FILE

Sequence dictionary (list of chromosome names) for VCF->BCF conversion [null]

-F

Indicate PL is generated by r921 or before (ordering is different).

-G

Suppress all individual genotype information.

-l FILE

List of sites at which information are outputted [all sites]

-N

Skip sites where the REF field is not A/C/G/T

-Q

Output the QCALL likelihood format

-s FILE

List of samples to use. The first column in the input gives the sample names and the second gives the ploidy, which can only be 1 or 2. When the 2nd column is absent, the sample ploidy is assumed to be 2. In the output, the ordering of samples will be identical to the one in FILE. [null]

-S

The input is VCF instead of BCF.

-u

Uncompressed BCF output (force -b).

Consensus/Variant Calling Options: 

-c

 

Call variants using Bayesian inference. This option automatically invokes option -e.

-d FLOAT

When -v is in use, skip loci where the fraction of samples covered by reads is below FLOAT. [0]

-e

Perform max-likelihood inference only, including estimating the site allele frequency, testing Hardy-Weinberg equlibrium and testing associations with LRT.

-g

Call per-sample genotypes at variant sites (force -c)

-i FLOAT

Ratio of INDEL-to-SNP mutation rate [0.15]

-p FLOAT

A site is considered to be a variant if P(ref|D)<FLOAT [0.5]

-P STR

Prior or initial allele frequency spectrum. If STR can be full, cond2,flat or the file consisting of error output from a previous variant calling run.

-t FLOAT

Scaled muttion rate for variant calling [0.001]

-T STR

Enable pair/trio calling. For trio calling, option -s is usually needed to be applied to configure the trio members and their ordering. In the file supplied to the option -s, the first sample must be the child, the second the father and the third the mother. The valid values of STR are ‘pair’, ‘trioauto’, ‘trioxd’ and ‘trioxs’, where ‘pair’ calls differences between two input samples, and ‘trioxd’ (‘trioxs’) specifies that the input is from the X chromosome non-PAR regions and the child is a female (male). [null]

-v

Output variant sites only (force -c):這里有-v這個選項就只輸出snp位點,如果沒有-v那么就是輸出所有的位點(測序所包含的)

Contrast Calling and Association Test Options: 

-1 INT

 

Number of group-1 samples. This option is used for dividing the samples into two groups for contrast SNP calling or association test. When this option is in use, the following VCF INFO will be outputted: PC2, PCHI2 and QCHI2. [0]

-U INT

Number of permutations for association test (effective only with -1) [0]

-X FLOAT

Only perform permutations for P(chi^2)<FLOAT (effective only with -U) [0.01]

 

 

 

cat

bcftools cat in1.bcf ["in2.bcf "[..."]]]"

Concatenate BCF files. The input files are required to be sorted and have identical samples appearing in the same order.

 

 

SAM FORMAT

Sequence Alignment/Map (SAM) format is TAB-delimited. Apart from the header lines, which are started with the ‘@’ symbol, each alignment line consists of:

Col

Field

Description

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1

QNAME

Query template/pair NAME

 

 

 

2

FLAG

bitwise FLAG

 

 

 

3

RNAME

Reference sequence NAME

 

 

 

4

POS

1-based leftmost POSition/coordinate of clipped sequence

 

 

 

5

MAPQ

MAPping Quality (Phred-scaled)

 

 

 

6

CIAGR

extended CIGAR string

 

 

 

7

MRNM

Mate Reference sequence NaMe (‘=’ if same as RNAME)

 

 

 

8

MPOS

1-based Mate POSistion

 

 

 

9

TLEN

inferred Template LENgth (insert size)

 

 

 

10

SEQ

query SEQuence on the same strand as the reference

 

 

 

11

QUAL

query QUALity (ASCII-33 gives the Phred base quality)

 

 

 

12+

OPT

variable OPTional fields in the format TAG:VTYPE:VALUE

 

 

 

Each bit in the FLAG field is defined as:

Flag

Chr

Description

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

0x0001

p

the read is paired in sequencing

 

 

 

0x0002

P

the read is mapped in a proper pair

 

 

 

0x0004

u

the query sequence itself is unmapped

 

 

 

0x0008

U

the mate is unmapped

 

 

 

0x0010

r

strand of the query (1 for reverse)

 

 

 

0x0020

R

strand of the mate

 

 

 

0x0040

1

the read is the first read in a pair

 

 

 

0x0080

2

the read is the second read in a pair

 

 

 

0x0100

s

the alignment is not primary

 

 

 

0x0200

f

the read fails platform/vendor quality checks

 

 

 

0x0400

d

the read is either a PCR or an optical duplicate

 

 

 

where the second column gives the string representation of the FLAG field.

VCF FORMAT

The Variant Call Format (VCF) is a TAB-delimited format with each data line consists of the following fields:

Col

Field

Description

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1

CHROM

CHROMosome name

 

 

 

2

POS

the left-most POSition of the variant

 

 

 

3

ID

unique variant IDentifier

 

 

 

4

REF

the REFerence allele

 

 

 

5

ALT

the ALTernate allele(s), separated by comma

 

 

 

6

QUAL

variant/reference QUALity

 

 

 

7

FILTER

FILTers applied

 

 

 

8

INFO

INFOrmation related to the variant, separated by semi-colon

 

 

 

9

FORMAT

FORMAT of the genotype fields, separated by colon (optional)

 

 

 

10+

SAMPLE

SAMPLE genotypes and per-sample information (optional)

 

 

 

The following table gives the INFO tags used by samtools and bcftools.

Tag

Format

Description

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

AF1

double

Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele

 

 

 

DP

int

Raw read depth (without quality filtering)

 

 

 

DP4

int[4]

# high-quality reference forward bases, ref reverse, alternate for and alt rev bases

 

 

 

FQ

int

Consensus quality. Positive: sample genotypes different; negative: otherwise

 

 

 

MQ

int

Root-Mean-Square mapping quality of covering reads

 

 

 

PC2

int[2]

Phred probability of AF in group1 samples being larger (,smaller) than in group2

 

 

 

PCHI2

double

Posterior weighted chi^2 P-value between group1 and group2 samples

 

 

 

PV4

double[4]

P-value for strand bias, baseQ bias, mapQ bias and tail distance bias

 

 

 

QCHI2

int

Phred-scaled PCHI2

 

 

 

RP

int

# permutations yielding a smaller PCHI2

 

 

 

CLR

int

Phred log ratio of genotype likelihoods with and without the trio/pair constraint

 

 

 

UGT

string

Most probable genotype configuration without the trio constraint

 

 

 

CGT

string

Most probable configuration with the trio constraint

 

 

 

EXAMPLES

o

Import SAM to BAM when @SQ lines are present in the header:

samtools view -bS aln.sam > aln.bam

If @SQ lines are absent:

samtools faidx ref.fa 

samtools view -bt ref.fa.fai aln.sam > aln.bam

where ref.fa.fai is generated automatically by the faidx command.

o

Attach the RG tag while merging sorted alignments:

perl -e ’print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"’ > rg.txt 

samtools merge -rh rg.txt merged.bam ga.bam 454.bam

The value in a RG tag is determined by the file name the read is coming from. In this example, in the merged.bam, reads from ga.bam will be attached RG:Z:ga, while reads from454.bam will be attached RG:Z:454.

o

Call SNPs and short INDELs for one diploid individual:

samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf 

bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf

The -D option of varFilter controls the maximum read depth, which should be adjusted to about twice the average read depth. One may consider to add -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usually helps BWA-short but may not other mappers.

o

Generate the consensus sequence for one diploid individual:

samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq

o

Call somatic mutations from a pair of samples:

samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair - > var.bcf

In the output INFO field, CLR gives the Phred-log ratio between the likelihood by treating the two samples independently, and the likelihood by requiring the genotype to be identical. This CLR is effectively a score measuring the confidence of somatic calls. The higher the better.

o

Call de novo and somatic mutations from a family trio:

samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair -s samples.txt - > var.bcf

File samples.txt should consist of three lines specifying the member and order of samples (in the order of child-father-mother). Similarly, CLR gives the Phred-log likelihood ratio with and without the trio constraint. UGT shows the most likely genotype configuration without the trio constraint, and CGT gives the most likely genotype configuration satisfying the trio constraint.

o

Phase one individual:

samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out

The calmd command is used to reduce false heterozygotes around INDELs.

o

Call SNPs and short indels for multiple diploid individuals:

samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bcvg - > var.raw.bcf 

bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf

Individuals are identified from the SM tags in the @RG header lines. Individuals can be pooled in one alignment file; one individual can also be separated into multiple files. The-P option specifies that indel candidates should be collected only from read groups with the @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads sequenced by an indel-prone technology may affect the performance of indel calling.

o

Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals:

samtools mpileup -Igf ref.fa *.bam > all.bcf 

bcftools view -bl sites.list all.bcf > sites.bcf 

bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs 

bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs 

bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs 

......

where sites.list contains the list of sites with each line consisting of the reference sequence name and position. The following bcftools commands estimate AFS by EM.

o

Dump BAQ applied alignment for other SNP callers:

samtools calmd -bAr aln.bam > aln.baq.bam

It adds and corrects the NM and MD tags at the same time. The calmd command also comes with the -C option, the same as the one in pileup and mpileup. Apply if it helps.

 

 

LIMITATIONS

o

Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c.

o

Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan reads or ends mapped to different chromosomes). If this is a concern, please use Picard’s MarkDuplicate which correctly handles these cases, although a little slower.

 

 

AUTHOR

Heng Li from the Sanger Institute wrote the C version of samtools. Bob Handsaker from the Broad Institute implemented the BGZF library and Jue Ruan from Beijing Genomics Institute wrote the RAZF library. John Marshall and Petr Danecek contribute to the source code and various people from the 1000 Genomes Project have contributed to the SAM format specification.

SEE ALSO

Samtools website: <http://samtools.sourceforge.net>

samtools-0.1.17

samtools (1)

05 July 2011

 

 

 

上面的內容參考了一下網址的內容

http://www.plob.org/2012/02/04/1703.html

http://www.plob.org/2012/02/04/1700.html

http://www.hudsonalpha.org/gsl/software/bam2fastq.php

http://liucheng.name/670/

http://samtools.sourceforge.net/samtools.shtml

http://blog.sina.com.cn/s/blog_4af3f0d20100xvq1.html

閱讀排行

評論排行


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM