【怪毛匠子 整理】
#第一步:把sam文件轉換成bam文件,我們得到map.bam文件
system"samtools view -bS map.sam > map.bam";
#第二步:sort 一下 BAM 文件,得到map.sorted.bam
system"samtools sort map.b/am map.sorted";
#第三步:創建一個關於bam的索引文件,我們得到一個map.sorted.bam.bai的文件
system"samtools index map.sorted.bam";
#第四步:找snp,這里用的是sort以后的bam文件,如果不是,就會不斷的報錯
system"samtools mpileup -ugf TAIR10.fas map.sorted.bam | bcftools view -vcg -D100 ->snp.vcf" //snp位點
perl 語言
如果我們要獲取全部的位點的信息,而不是僅僅snp位點,那么我們只需要把最后一行的-v去掉就可以了
- system"samtools mpileup -ugf TAIR10.fas map.sorted.bam | bcftools view -cg -D100 ->snp.vcf" //所有位點
再下面有詳細的解釋-v的作用:Output variant sites only (force -c):這里有-v這個選項就只輸出snp位點,如果沒有-v那么就是輸出所有的位點(測序所包含的)
samtools 功能介紹
(2012-10-25 22:06:30)
| 標簽: samtools
高通量測序
比對
雜談 |
分類: 生物信息 |
samtools 輸入bam文件,導出sam文件。同時可以進行排序,合並,建立索引等功能,並支持從特定區域內查找reads。
samtools 可以應用於linux的命令管道里。以“-”作為標准輸入或輸出。
view 從bam/sam文件中提取/打印部分比對結果。默認為所有的區域,也可以染色體區域(1-based,須sort並index)。
例如:
samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
tview
mpileup 列舉每條reads比對的indel,SNP等信息。
“.”表示match;
“,”表示反向match;
“<”或“>”表示reference skip;
"ATCGN"表示正向mismatch;
"atcgn"表示反向mismatch;
‘\+[0-9]+[ACGTNacgtn]+’ insertion;
‘-[0-9]+[ACGTNacgtn]+’ 表示deletion;
“^”標記reads起始;
“$”標記reads segment結尾;
samtools pileup -vcf ref.fasta aln.sorted.bam
samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
merge 合並bam文件
samtools merge out.bam in1.bam in2.bam in3.bam
tview 圖形化的界面查看mapping的結果
? -help
g -到達指定的染色體區域
samtools tview aln.sorted.bam ref.fasta
sort 將比對結果排序
samtools sort aln.bam aln.sorted
reheader 將bam文件的頭文件替換。
index 為sort后的bam文件建立所以,這是在 view 比對結果時指定染色體區域必須做的。
samtools index aln.sorted.bam
rmdup 去除可能的PCR重復。將比對到相同位置的reads去重,僅保留比對質量最好的一個。一般只對paired end reads(fr)有效,並須指定ISIZE。
phase Call and phase heterozygous SNPs
samtools常用命令詳解
2014年01月26日 ⁄ Bioinformatics ⁄ 字號 小 中 大 ⁄ 評論 1 條 ⁄ 閱讀 12,457 次 [點擊加入在線收藏夾]
samtools的說明文檔:http://samtools.sourceforge.net/samtools.shtml
samtools是一個用於操作sam和bam文件的工具合集。包含有許多命令。以下是常用命令的介紹
1. view
view命令的主要功能是:將sam文件轉換成bam文件;然后對bam文件進行各種操作,比如數據的排序(不屬於本命令的功能)和提取(這些操作是對bam文件進行的,因而當輸入為sam文件的時候,不能進行該操作);最后將排序或提取得到的數據輸出為bam或sam(默認的)格式。
bam文件優點:bam文件為二進制文件,占用的磁盤空間比sam文本文件小;利用bam二進制文件的運算速度快。
view命令中,對sam文件頭部的輸入(-t或-T)和輸出(-h)是單獨的一些參數來控制的。
| 01 |
Usage: samtools view [options] <in.bam>|<in.sam> [region1 [...]] |
| 02 |
默認情況下不加 region,則是輸出所有的 region. |
| 03 |
|
| 04 |
Options: -b output BAM |
| 05 |
默認下輸出是 SAM 格式文件,該參數設置輸出 BAM 格式 |
| 06 |
-h print header for the SAM output |
| 07 |
默認下輸出的 sam 格式文件不帶 header,該參數設定輸出sam文件時帶 header 信息 |
| 08 |
-H print header only (no alignments) |
| 09 |
-S input is SAM |
| 10 |
默認下輸入是 BAM 文件,若是輸入是 SAM 文件,則最好加該參數,否則有時候會報錯。 |
| 11 |
-u uncompressed BAM output (force -b) |
| 12 |
該參數的使用需要有-b參數,能節約時間,但是需要更多磁盤空間。 |
| 13 |
-c Instead of printing the alignments, only count them and print the |
| 14 |
total number. All filter options, such as ‘-f’, ‘-F’ and ‘-q’ , |
| 15 |
are taken into account. |
| 16 |
-1 fast compression (force -b) |
| 17 |
-x output FLAG in HEX (samtools-C specific) |
| 18 |
-X output FLAG in string (samtools-C specific) |
| 19 |
-c print only the count of matching records |
| 20 |
-L FILE output alignments overlapping the input BED FILE [null] |
| 21 |
-t FILE list of reference names and lengths (force -S) [null] |
| 22 |
使用一個list文件來作為header的輸入 |
| 23 |
-T FILE reference sequence file (force -S) [null] |
| 24 |
使用序列fasta文件作為header的輸入 |
| 25 |
-o FILE output file name [stdout] |
| 26 |
-R FILE list of read groups to be outputted [null] |
| 27 |
-f INT required flag, 0 for unset [0] |
| 28 |
-F INT filtering flag, 0 for unset [0] |
| 29 |
Skip alignments with bits present in INT [0] |
| 30 |
數字4代表該序列沒有比對到參考序列上 |
| 31 |
數字8代表該序列的mate序列沒有比對到參考序列上 |
| 32 |
-q INT minimum mapping quality [0] |
| 33 |
-l STR only output reads in library STR [null] |
| 34 |
-r STR only output reads in read group STR [null] |
| 35 |
-s FLOAT fraction of templates to subsample; integer part as seed [-1] |
| 36 |
-? longer help |
例子:
#將sam文件轉換成bam文件 $ samtools view -bS abc.sam > abc.bam $ samtools view -b -S abc.sam -o abc.bam
| 01 |
#提取比對到參考序列上的比對結果 |
| 02 |
$ samtools view -bF 4 abc.bam > abc.F.bam |
| 03 |
|
| 04 |
#提取paired reads中兩條reads都比對到參考序列上的比對結果,只需要把兩個4+8的值12作為過濾參數即可 |
| 05 |
$ samtools view -bF 12 abc.bam > abc.F12.bam |
| 06 |
|
| 07 |
#提取沒有比對到參考序列上的比對結果 |
| 08 |
$ samtools view -bf 4 abc.bam > abc.f.bam |
| 09 |
|
| 10 |
#提取bam文件中比對到caffold1上的比對結果,並保存到sam文件格式 |
| 11 |
$ samtools view abc.bam scaffold1 > scaffold1.sam |
| 12 |
|
| 13 |
#提取scaffold1上能比對到30k到100k區域的比對結果 |
| 14 |
$ samtools view abc.bam scaffold1:30000-100000 > scaffold1_30k-100k.sam |
| 15 |
|
| 16 |
#根據fasta文件,將 header 加入到 sam 或 bam 文件中 |
| 17 |
$ samtools view -T genome.fasta -h scaffold1.sam > scaffold1.h.sam |
2. sort
sort對bam文件進行排序。
| 1 |
Usage: samtools sort [-n] [-m <maxMem>] <in.bam> <out.prefix> |
| 2 |
-m 參數默認下是 500,000,000 即500M(不支持K,M,G等縮寫)。對於處理大數據時,如果內存夠用,則設置大點的值,以節約時間。 |
| 3 |
-n 設定排序方式按short reads的ID排序。默認下是按序列在fasta文件中的順序(即header)和序列從左往右的位點排序。 |
例子:
$ samtools sort abc.bam abc.sort $ samtools view abc.sort.bam | less -S
3.merge
將2個或2個以上的已經sort了的bam文件融合成一個bam文件。融合后的文件不需要則是已經sort過了的。
| 01 |
Usage: samtools merge [-nr] [-h inh.sam] <out.bam> <in1.bam> <in2.bam>[...] |
| 02 |
|
| 03 |
Options: -n sort by read names |
| 04 |
-r attach RG tag (inferred from file names) |
| 05 |
-u uncompressed BAM output |
| 06 |
-f overwrite the output BAM if exist |
| 07 |
-1 compress level 1 |
| 08 |
-R STR merge file in the specified region STR [all] |
| 09 |
-h FILE copy the header in FILE to <out.bam> [in1.bam] |
| 10 |
|
| 11 |
Note: Samtools' merge does not reconstruct the @RG dictionary in the header. Users |
| 12 |
must provide the correct header with -h, or uses Picard which properly maintains |
| 13 |
the header dictionary in merging. |
4.index
必須對bam文件進行默認情況下的排序后,才能進行index。否則會報錯。
建立索引后將產生后綴為.bai的文件,用於快速的隨機處理。很多情況下需要有bai文件的存在,特別是顯示序列比對情況下。比如samtool的tview命令就需要;gbrowse2顯示reads的比對圖形的時候也需要。
Usage: samtools index <in.bam> [out.index]
例子:
| 1 |
#以下兩種命令結果一樣 |
| 2 |
<span style="font-weight: inherit; font-style: inherit; color: #ff6600;">$ samtools index abc.sort.bam</span> |
| 3 |
$ samtools index abc.sort.bam abc.sort.bam.bai |
5. faidx
對fasta文件建立索引,生成的索引文件以.fai后綴結尾。該命令也能依據索引文件快速提取fasta文件中的某一條(子)序列
| 1 |
Usage: samtools faidx <in.bam> [ [...]] |
| 2 |
|
| 3 |
對基因組文件建立索引 |
| 4 |
$ samtools faidx genome.fasta |
| 5 |
#生成了索引文件genome.fasta.fai,是一個文本文件,分成了5列。第一列是子序列的名稱;第二列是子序列的長度;個人認為“第三列是序列所在的位置”,因為該數字從上往下逐漸變大,最后的數字是genome.fasta文件的大小;第4和5列不知是啥意思。於是通過此文件,可以定位子序列在fasta文件在磁盤上的存放位置,直接快速調出子序列。 |
| 6 |
|
| 7 |
#由於有索引文件,可以使用以下命令很快從基因組中提取到fasta格式的子序列 |
| 8 |
$ samtools faidx genome.fasta scffold_10 > scaffold_10.fasta |
6. tview
tview能直觀的顯示出reads比對基因組的情況,和基因組瀏覽器有點類似。
| 01 |
Usage: samtools tview <aln.bam> [ref.fasta] |
| 02 |
|
| 03 |
當給出參考基因組的時候,會在第一排顯示參考基因組的序列,否則,第一排全用N表示。 |
| 04 |
按下 g ,則提示輸入要到達基因組的某一個位點。例子“scaffold_10:1000"表示到達第 |
| 05 |
10號scaffold的第1000個鹼基位點處。 |
| 06 |
使用H(左)J(上)K(下)L(右)移動顯示界面。大寫字母移動快,小寫字母移動慢。 |
| 07 |
使用空格建向左快速移動(和 L 類似),使用Backspace鍵向左快速移動(和 H 類似)。 |
| 08 |
Ctrl+H 向左移動1kb鹼基距離; Ctrl+L 向右移動1kb鹼基距離 |
| 09 |
可以用顏色標注比對質量,鹼基質量,核苷酸等。30~40的鹼基質量或比對質量使用白色表示; |
| 10 |
20~30黃色;10~20綠色;0~10藍色。 |
| 11 |
使用點號'.'切換顯示鹼基和點號;使用r切換顯示read name等 |
| 12 |
還有很多其它的使用說明,具體按 ? 鍵來查看。 |
7. flagstat
給出BAM文件的比對結果
| 01 |
Usage: samtools flagstat <in.bam> |
| 02 |
|
| 03 |
$ samtools flagstat example.bam |
| 04 |
11945742 + 0 in total (QC-passed reads + QC-failed reads) |
| 05 |
#總共的reads數 |
| 06 |
0 + 0 duplicates |
| 07 |
7536364 + 0 mapped (63.09%:-nan%) |
| 08 |
#總體上reads的匹配率 |
| 09 |
11945742 + 0 paired in sequencing |
| 10 |
#有多少reads是屬於paired reads |
| 11 |
5972871 + 0 read1 |
| 12 |
#reads1中的reads數 |
| 13 |
5972871 + 0 read2 |
| 14 |
#reads2中的reads數 |
| 15 |
6412042 + 0 properly paired (53.68%:-nan%) |
| 16 |
#完美匹配的reads數:比對到同一條參考序列,並且兩條reads之間的距離符合設置的閾值 |
| 17 |
6899708 + 0 with itself and mate mapped |
| 18 |
#paired reads中兩條都比對到參考序列上的reads數 |
| 19 |
636656 + 0 singletons (5.33%:-nan%) |
| 20 |
#單獨一條匹配到參考序列上的reads數,和上一個相加,則是總的匹配上的reads數。 |
| 21 |
469868 + 0 with mate mapped to a different chr |
| 22 |
#paired reads中兩條分別比對到兩條不同的參考序列的reads數 |
| 23 |
243047 + 0 with mate mapped to a different chr (mapQ>=5) |
#同上一個,只是其中比對質量>=5的reads的數量
7. depth
得到每個鹼基位點的測序深度,並輸出到標准輸出。
| 1 |
Usage: bam2depth [-r reg] [-q baseQthres] [-Q mapQthres] [-b in.bed] <in1.bam> [...] |
8. 其它有用的命令
reheader 替換bam文件的頭
| 1 |
$ samtools reheader <in.header.sam> <in.bam> |
cat 連接多個bam文件,適用於非sorted的bam文件
| 1 |
$ samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ] |
idxstats 統計一個表格,4列,分別為”序列名,序列長度,比對上的reads數,unmapped reads number”。第4列應該是paired reads中有一端能匹配到該scaffold上,而另外一端不匹配到任何scaffolds上的reads數。
| 1 |
$ samtools idxstats <aln.bam> |
9. 將bam文件轉換為fastq文件
有時候,我們需要提取出比對到一段參考序列的reads,進行小范圍的分析,以利於debug等。這時需要將bam或sam文件轉換為fastq格式。
該網站提供了一個bam轉換為fastq的程序:http://www.hudsonalpha.org/gsl/information/software/bam2fastq
| 1 |
$ wget http://www.hudsonalpha.org/gsl/static/software/bam2fastq-1.1.0.tgz |
| 2 |
$ tar zxf bam2fastq-1.1.0.tgz |
| 3 |
$ cd bam2fastq-1.1.0 |
| 4 |
$ make |
| 5 |
$ ./bam2fastq <in.bam> |
10. mpileup
samtools還有個非常重要的命令mpileup,以前為pileup。該命令用於生成bcf文件,再使用bcftools進行SNP和Indel的分析。bcftools是samtool中附帶的軟件,在samtools的安裝文件夾中可以找到。
最常用的參數有2:
-f 來輸入有索引文件的fasta參考序列;
-g 輸出到bcf格式。用法和最簡單的例子如下
| 1 |
Usage: samtools mpileup [-EBug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ] [-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]] |
| 2 |
|
| 3 |
$ samtools mpileup -f genome.fasta abc.bam > abc.txt |
| 4 |
$ samtools mpileup -gSDf genome.fasta abc.bam > abc.bcf |
| 5 |
$ samtools mpileup -guSDf genome.fasta abc.bam | \ |
| 6 |
bcftools view -cvNg - > abc.vcf |
mpileup不使用-u或-g參數時,則不生成二進制的bcf文件,而生成一個文本文件(輸出到標准輸出)。該文本文件統計了參考序列中每個鹼基位點的比對情況;該文件每一行代表了參考序列中某一個鹼基位點的比對結果。比如:
| 01 |
scaffold_1 2841 A 11 ,,,...,.... BHIGDGIJ?FF |
| 02 |
scaffold_1 2842 C 12 ,$,,...,....^I. CFGEGEGGCFF+ |
| 03 |
scaffold_1 2843 G 11 ,,...,..... FDDDDCD?DD+ |
| 04 |
scaffold_1 2844 G 11 ,,...,..... FA?AAAA<AA+ |
| 05 |
scaffold_1 2845 G 11 ,,...,..... F656666166* |
| 06 |
scaffold_1 2846 A 11 ,,...,..... (1.1111)11* |
| 07 |
scaffold_1 2847 A 11 ,,+9acggtgaag.+9ACGGTGAAT.+9ACGGTGAAG.+9ACGGTGAAG,+9acggtgaag.+9ACGGTGAAG.+9ACGGTGAAG.+9ACGGTGAAG.+9ACGGTGAAG.+9ACGGTGAAG %.+....-..) |
| 08 |
scaffold_1 2848 N 11 agGGGgGGGGG !!$!!!!!!!! |
| 09 |
scaffold_1 2849 A 11 c$,...,..... !0000000000 |
| 10 |
scaffold_1 2850 A 10 ,...,..... 353333333 |
mpileup生成的結果包含6行:參考序列名;位置;參考鹼基;比對上的reads數;比對情況;比對上的鹼基的質量。其中第5列比較復雜,解釋如下:
1 ‘.’代表與參考序列正鏈匹配。
2 ‘,’代表與參考序列負鏈匹配。
3 ‘ATCGN’代表在正鏈上的不匹配。
4 ‘atcgn’代表在負鏈上的不匹配。
5 ‘*’代表模糊鹼基
6 ‘^’代表匹配的鹼基是一個read的開始;’^'后面緊跟的ascii碼減去33代表比對質量;這兩個符號修飾的是后面的鹼基,其后緊跟的鹼基(.,ATCGatcgNn)代表該read的第一個鹼基。
7 ‘$’代表一個read的結束,該符號修飾的是其前面的鹼基。
8 正則式’\+[0-9]+[ACGTNacgtn]+’代表在該位點后插入的鹼基;比如上例中在scaffold_1的2847后插入了9個長度的鹼基acggtgaag。表明此處極可能是indel。
9 正則式’-[0-9]+[ACGTNacgtn]+’代表在該位點后缺失的鹼基;
pileup具體的參數如下:
| 01 |
#輸入參數 |
| 02 |
-6 Assume the quality is in the Illumina 1.3+ encoding. -A Do not skip anomalous read pairsin variant calling. |
| 03 |
-B Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments. |
| 04 |
-b FILE List of input BAM files, one file per line [null] |
| 05 |
-C INT Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50. [0] |
| 06 |
-d INT At a position, read maximally INT reads per input BAM. [250] |
| 07 |
-E Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt specificity a little bit. |
| 08 |
-f FILE The faidx-indexed reference file in the FASTA format. The file can be optionally compressed by razip. [null] |
| 09 |
-l FILE BED or position list file containing a list of regions or sites where pileup or BCF should be generated [null] |
| 10 |
-M INT cap mapping quality at INT [60] |
| 11 |
-q INT Minimum mapping quality for an alignment to be used [0] |
| 12 |
-Q INT Minimum base quality for a base to be considered [13] |
| 13 |
-r STR Only generate pileup in region STR [all sites] |
| 14 |
|
| 15 |
輸出參數 |
| 16 |
-D Output per-sample read depth (require -g/-u) |
| 17 |
-g Compute genotype likelihoods and output them in the binary call format (BCF). |
| 18 |
-S Output per-sample Phred-scaled strand bias P-value (require -g/-u) |
| 19 |
-u Similar to -g except that the output is uncompressed BCF, which is preferred for piping. |
| 20 |
|
| 21 |
Options for Genotype Likelihood Computation (for -g or -u): |
| 22 |
-e INT Phred-scaled gap extension sequencing error probability. Reducing INT leads to longer indels. [20] |
| 23 |
-h INT Coefficient for modeling homopolymer errors. Given an l-long homopolymer run, the sequencing error of an indel of size s is modeled as INT*s/l. [100] |
| 24 |
-I Do not perform INDEL calling |
| 25 |
-L INT Skip INDEL calling if the average per-sample depth is above INT. [250] |
| 26 |
-o INT Phred-scaled gap open sequencing error probability. Reducing INT leads to more indel calls. [40] |
| 27 |
-P STR Comma dilimited list of platforms (determined by @RG-PL) from which indel candidates are obtained. It is recommended to collect indel candidates from sequencing technologies that have low indel error rate such as ILLUMINA. [all] |
11. 使用bcftools
bcftools和samtools類似,用於處理vcf(variant call format)文件和bcf(binary call format)文件。前者為文本文件,后者為其二進制文件。
bcftools使用簡單,最主要的命令是view命令,其次還有index和cat等命令。index和cat命令和samtools中類似。此處主講使用view命令來進行SNP和Indel calling。該命令的使用方法和例子為:
| 1 |
$ bcftools view [-AbFGNQSucgv] [-D seqDict] [-l listLoci] [-s listSample] |
| 2 |
[-i gapSNPratio] [-t mutRate] [-p varThres] [-P prior] |
| 3 |
[-1 nGroup1] [-d minFrac] [-U nPerm] [-X permThres] |
| 4 |
[-T trioType] in.bcf [region] |
| 5 |
|
| 6 |
$ bcftools view -cvNg abc.bcf > snp_indel.vcf |
生成的結果文件為vcf格式,有10列,分別是:
1 參考序列名;
2 variant所在的left-most位置;
3 variant的ID(默認未設置,用’.'表示);
4 參考序列的allele;
5 variant的allele(有多個alleles,則用’,'分隔);
6 variant/reference QUALity;
7 FILTers applied;
8 variant的信息,使用分號隔開;
9 FORMAT of the genotype fields, separated by colon (optional);
10 SAMPLE genotypes and per-sample information (optional)。
例如:
| 1 |
scaffold_1 2847 . A AACGGTGAAG 194 . INDEL;DP=11;VDB=0.0401;AF1=1;AC1=2;DP4=0,0,8,3;MQ=35;FQ=-67.5 GT:PL:GQ 1/1:235,33,0:63 |
| 2 |
scaffold_1 3908 . G A 111 . DP=13;VDB=0.0085;AF1=1;AC1=2;DP4=0,0,5,7;MQ=42;FQ=-63 GT:PL:GQ 1/1:144,36,0:69 |
| 3 |
scaffold_1 4500 . A G 31.5 . DP=8;VDB=0.0034;AF1=1;AC1=2;DP4=0,0,1,3;MQ=42;FQ=-39 GT:PL:GQ 1/1:64,12,0:21 |
| 4 |
scaffold_1 4581 . TGGNGG TGG 145 . INDEL;DP=8;VDB=0.0308;AF1=1;AC1=2;DP4=0,0,0,8;MQ=42;FQ=-58.5 GT:PL:GQ 1/1:186,24,0:45 |
| 5 |
scaffold_1 4644 . G A 195 . DP=21;VDB=0.0198;AF1=1;AC1=2;DP4=0,0,10,10;MQ=42;FQ=-87 GT:PL:GQ 1/1:228,60,0:99 |
| 6 |
scaffold_1 4827 . NACAAAGA NA 4.42 . INDEL;DP=1;AF1=1;AC1=2;DP4=0,0,1,0;MQ=40;FQ=-37.5 GT:PL:GQ 0/1:40,3,0:3 |
| 7 |
scaffold_1 4854 . A G 48 . DP=6;VDB=0.0085;AF1=1;AC1=2;DP4=0,0,2,1;MQ=41;FQ=-36 GT:PL:GQ 1/1:80,9,0:16 |
| 8 |
scaffold_1 5120 . A G 85 . DP=8;VDB=0.0355;AF1=1;AC1=2;DP4=0,0,5,3;MQ=42;FQ=-51 GT:PL:GQ 1/1:118,24,0:45 |
第8列中顯示了對variants的信息描述,比較重要,其中的 Tag 的描述如下:
| 01 |
Tag Format Description |
| 02 |
AF1 double Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele |
| 03 |
DP int Raw read depth (without quality filtering) |
| 04 |
DP4 int[4] # high-quality reference forward bases, ref reverse, alternate for and alt rev bases |
| 05 |
FQ int Consensus quality. Positive: sample genotypes different; negative: otherwise |
| 06 |
MQ int Root-Mean-Square mapping quality of covering reads |
| 07 |
PC2 int[2] Phred probability of AF in group1 samples being larger (,smaller) than ingroup2 |
| 08 |
PCHI2 double Posterior weighted chi^2 P-value between group1 and group2 samples |
| 09 |
PV4 double[4] P-value for strand bias, baseQ bias, mapQ bias and tail distance bias |
| 10 |
QCHI2 int Phred-scaled PCHI2 |
| 11 |
RP int # permutations yielding a smaller PCHI2 |
| 12 |
CLR int Phred log ratio of genotype likelihoods with and without the trio/pair constraint |
| 13 |
UGT string Most probable genotype configuration without the trio constraint |
| 14 |
CGT string Most probable configuration with the trio constraint |
bcftools view 的具體參數如下:
| 01 |
Input/Output Options: |
| 02 |
-A Retain all possible alternate alleles at variant sites. By default, the view commanddiscards unlikely alleles. |
| 03 |
-b Output in the BCF format. The default is VCF. |
| 04 |
-D FILE Sequence dictionary (list of chromosome names) for VCF->BCF conversion [null] |
| 05 |
-F Indicate PL is generated by r921 or before (ordering is different). |
| 06 |
-G Suppress all individual genotype information. |
| 07 |
-l FILE List of sites at which information are outputted [all sites] |
| 08 |
-N Skip sites where the REF field is not A/C/G/T |
| 09 |
-Q Output the QCALL likelihood format |
| 10 |
-s FILE List of samples to use. The first column in the input gives the sample names and the second gives the ploidy, which can only be 1 or 2. When the 2nd column is absent, the sample ploidy is assumed to be 2. In the output, the ordering of samples will be identical to the one in FILE. [null] |
| 11 |
-S The input is VCF instead of BCF. |
| 12 |
-u Uncompressed BCF output (force -b). |
| 13 |
|
| 14 |
Consensus/Variant Calling Options: |
| 15 |
-c Call variants using Bayesian inference. This option automatically invokes option -e. |
| 16 |
-d FLOAT When -v is in use, skip loci where the fraction of samples covered by reads is below FLOAT. [0] |
| 17 |
當有多個sample用於variants calling時,比如多個轉錄組數據或多個重測序 |
| 18 |
數據需要比對到參考基因組上,設置該值,表明至少有該<float 0~1>比例的 |
| 19 |
samples在該位點都有覆蓋才計算入variant.所以對於只有一個sample的情況 |
| 20 |
下,該值設置在0~1之間沒有意義,大於1則得不到任何結果。 |
| 21 |
-e Perform max-likelihood inference only, including estimating the site allele frequency, testing Hardy-Weinberg equlibrium and testing associations with LRT. |
| 22 |
-g Call per-sample genotypes at variant sites (force -c) |
| 23 |
-i FLOAT Ratio of INDEL-to-SNP mutation rate [0.15] |
| 24 |
-p FLOAT A site is considered to be a variant if P(ref|D) |
| 25 |
-t FLOAT Scaled muttion rate for variant calling [0.001] |
| 26 |
-T STR Enable pair/trio calling. For trio calling, option -s is usually needed to be applied to configure the trio members and their ordering. In the file supplied to the option -s, the first sample must be the child, the second the father and the third the mother. The valid values of STR are ‘pair’, ‘trioauto’, ‘trioxd’ and ‘trioxs’, where ‘pair’ calls differences between two input samples, and ‘trioxd’ (‘trioxs’) specifies that the input is from the X chromosome non-PAR regions and the child is a female (male). [null] |
| 27 |
-v Output variant sites only (force -c) |
| 28 |
|
| 29 |
Contrast Calling and Association Test Options: |
| 30 |
-1 INT Number of group-1 samples. This option is used for dividing the samples into twogroups for contrast SNP calling or association test. When this option is in use, the following VCF INFO will be outputted: PC2, PCHI2 and QCHI2. [0] |
| 31 |
-U INT Number of permutations for association test (effective only with -1) [0] |
| 32 |
-X FLOAT Only perform permutations for P(chi^2) |
使用bcftools得到variant calling結果后。需要對結果再次進行過濾。主要依據比對結果中第8列信息。其中的 DP4 一行尤為重要,提供了4個數據:1 比對結果和正鏈一致的reads數、2 比對結果和負鏈一致的reads數、3 比對結果在正鏈的variant上的reads數、4 比對結果在負鏈的variant上的reads數。可以設定 (value3 + value4)大於某一閾值,才算是variant。比如:
| 1 |
$ perl -ne 'print $_ if /DP4=(\d+),(\d+),(\d+),(\d+)/ && ($3+$4)>=10 && ($3+$4)/($1+$2+$3+$4)>=0.8' snp_indel.vcf > snp_indel.final.vcf |
12. samtools rmdup
NGS上機測序前需要進行PCR一步,使一個模板擴增出一簇,從而在上機測序的時候表現出為1個點,即一個reads。若一個模板擴增出了多簇,結果得到了多個reads,這些reads的坐標(coordinates)是相近的。在進行了reads比對后需要將這些由PCR duplicates獲得的reads去掉,並只保留最高比對質量的read。使用rmdup命令即可完成.
| 1 |
Usage: samtools rmdup [-sS] |
| 2 |
-s 對single-end reads。默認情況下,只對paired-end reads |
| 3 |
-S 將Paired-end reads作為single-end reads處理。 |
| 4 |
|
| 5 |
$ samtools input.sorted.bam output.bam |
本文來自:http://www.chenlianfu.com/?p=1399
網址如下:http://samtools.sourceforge.net/samtools.shtml
Manual Reference Pages -
NAME
samtools - Utilities for the Sequence Alignment/Map (SAM) format
bcftools - Utilities for the Binary Call Format (BCF) and VCF
CONTENTS
SYNOPSIS(大綱):這個大綱其實詳細的說明了運行的命令,如果沒有特殊要求就可以直接采用了。下面的東西都是針對這個的描述。
samtools view -bt ref_list.txt -o aln.bam aln.sam.gz
samtools sort aln.bam aln.sorted :這個是sort的命令,需要的是aln.bam時你要sort的文件,后面跟的是你可以自己命名的最好和前面保持一致
samtools index aln.sorted.bam
:sort以后要用建立一個索引文件就直接用這個命令
samtools idxstats aln.sorted.bam
samtools view aln.sorted.bam chr2:20,100,000-20,200,000
samtools merge out.bam in1.bam in2.bam in3.bam
samtools faidx ref.fasta
samtools pileup -vcf ref.fasta aln.sorted.bam
samtools mpileup -C50 -gf ref.fasta -r chr3:1,000-2,000 in1.bam in2.bam
:我們再上面用過的最后snp的提取里
samtools tview aln.sorted.bam ref.fasta
bcftools index in.bcf
bcftools view in.bcf chr2:100-200 > out.vcf
bcftools view -vc in.bcf > out.vcf 2> out.afs
DESCRIPTION
Samtools is a set of utilities that manipulate alignments in the BAM format. It imports from and exports to the SAM (Sequence Alignment/Map) format, does sorting, merging and indexing, and allows to retrieve reads in any regions swiftly.
Samtools is designed to work on a stream. It regards an input file ‘-’ as the standard input (stdin) and an output file ‘-’ as the standard output (stdout). Several commands can thus be combined with Unix pipes. Samtools always output warning and error messages to the standard error output (stderr).
Samtools is also able to open a BAM (not SAM) file on a remote FTP or HTTP server if the BAM file name starts with ‘ftp://’ or ‘http://’. Samtools checks the current working directory for the index file and will download the index upon absence. Samtools does not retrieve the entire alignment file unless it is asked to do so.
SAMTOOLS COMMANDS AND OPTIONS
| view |
samtools view [-bchuHS] [-t in.refList] [-o output] [-f reqFlag] [-F skipFlag] [-q minMapQ] [-l library] [-r readGroup] [-R rgFile] <in.bam>|<in.sam> [region1 [...]] Extract/print all or sub alignments in SAM or BAM format. If no region is specified, all the alignments will be printed; otherwise only alignments overlapping the specified regions will be output. An alignment may be given multiple times if it is overlapping several regions. A region can be presented, for example, in the following format: ‘chr2’ (the whole chr2), ‘chr2:1000000’ (region starting from 1,000,000bp) or ‘chr2:1,000,000-2,000,000’ (region between 1,000,000 and 2,000,000bp including the end points). The coordinate is 1-based. OPTIONS: |
| tview |
samtools tview <in.sorted.bam> [ref.fasta] Text alignment viewer (based on the ncurses library). In the viewer, press ‘?’ for help and press ‘g’ to check the alignment start from a region in the format like ‘chr10:10,000,000’ or ‘=10,000,000’ when viewing the same reference sequence. 這個命令是查看的命令,看到的是map以后覆蓋度的文件,samtools tview .bam文件 .ref文件 |
| mpileup |
samtools mpileup [-EBug] [-C capQcoef] [-r reg] [-f in.fa] [-l list] [-M capMapQ][-Q minBaseQ] [-q minMapQ] in.bam [in2.bam [...]] Generate BCF or pileup for one or multiple BAM files. Alignment records are grouped by sample identifiers in @RG header lines. If sample identifiers are absent, each input file is regarded as one sample. In the pileup format (without -uor-g), each line represents a genomic position, consisting of chromosome name, coordinate, reference base, read bases, read qualities and alignment mapping qualities. Information on match, mismatch, indel, strand, mapping quality and start and end of a read are all encoded at the read base column. At this column, a dot stands for a match to the reference base on the forward strand, a comma for a match on the reverse strand, a ’>’ or ’<’ for a reference skip, ‘ACGTN’ for a mismatch on the forward strand and ‘acgtn’ for a mismatch on the reverse strand. A pattern ‘\+[0-9]+[ACGTNacgtn]+’ indicates there is an insertion between this reference position and the next reference position. The length of the insertion is given by the integer in the pattern, followed by the inserted sequence. Similarly, a pattern ‘-[0-9]+[ACGTNacgtn]+’ represents a deletion from the reference. The deleted bases will be presented as ‘*’ in the following lines. Also at the read base column, a symbol ‘^’ marks the start of a read. The ASCII of the character following ‘^’ minus 33 gives the mapping quality. A symbol ‘$’ marks the end of a read segment. Input Options: |
| reheader |
samtools reheader <in.header.sam> <in.bam> Replace the header in in.bam with the header in in.header.sam. This command is much faster than replacing the header with a BAM->SAM->BAM conversion. |
| cat |
samtools cat [-h header.sam] [-o out.bam] <in1.bam> <in2.bam> [ ... ] Concatenate BAMs. The sequence dictionary of each input BAM must be identical, although this command does not check this. This command uses a similar trick toreheader which enables fast BAM concatenation. |
| sort |
samtools sort [-no] [-m maxMem] <in.bam> <out.prefix> Sort alignments by leftmost coordinates. File <out.prefix>.bam will be created. This command may also create temporary files <out.prefix>.%d.bam when the whole alignment cannot be fitted into memory (controlled by option -m). OPTIONS: |
| merge |
samtools merge [-nur1f] [-h inh.sam] [-R reg] <out.bam> <in1.bam> <in2.bam> [...] Merge multiple sorted alignments. The header reference lists of all the input BAM files, and the @SQ headers of inh.sam, if any, must all refer to the same set of reference sequences. The header reference list and (unless overridden by -h) ‘@’ headers of in1.bam will be copied to out.bam, and the headers of other files will be ignored. OPTIONS: |
| index |
samtools index <aln.bam> Index sorted alignment for fast random access. Index file <aln.bam>.bai will be created. |
| idxstats |
samtools idxstats <aln.bam> Retrieve and print stats in the index file. The output is TAB delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads. |
| faidx |
samtools faidx <ref.fasta> [region1 [...]] Index reference sequence in the FASTA format or extract subsequence from indexed reference sequence. If no region is specified, faidx will index the file and create<ref.fasta>.fai on the disk. If regions are speficified, the subsequences will be retrieved and printed to stdout in the FASTA format. The input file can be compressed in the RAZF format. |
| fixmate |
samtools fixmate <in.nameSrt.bam> <out.bam> Fill in mate coordinates, ISIZE and mate related flags from a name-sorted alignment. |
| rmdup |
samtools rmdup [-sS] <input.srt.bam> <out.bam> Remove potential PCR duplicates: if multiple read pairs have identical external coordinates, only retain the pair with highest mapping quality. In the paired-end mode, this command ONLY works with FR orientation and requires ISIZE is correctly set. It does not work for unpaired reads (e.g. two ends mapped to different chromosomes or orphan reads). OPTIONS: |
| calmd |
samtools calmd [-EeubSr] [-C capQcoef] <aln.bam> <ref.fasta> Generate the MD tag. If the MD tag is already present, this command will give a warning if the MD tag generated is different from the existing tag. Output SAM by default. OPTIONS: |
| targetcut |
samtools targetcut [-Q minBaseQ] [-i inPenalty] [-0 em0] [-1 em1] [-2 em2] [-f ref] <in.bam> This command identifies target regions by examining the continuity of read depth, computes haploid consensus sequences of targets and outputs a SAM with each sequence corresponding to a target. When option -f is in use, BAQ will be applied. This command is only designed for cutting fosmid clones from fosmid pool sequencing [Ref. Kitzman et al. (2010)]. |
|
|
|
|
|
|
|
|
|
| -b |
Output in the BAM format.我們第一步把sam轉換成bam的中-bS中-b表示的就是要輸出bam的文件 |
| -f INT |
Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0] |
| -F INT |
Skip alignments with bits present in INT [0] |
| -h |
Include the header in the output.(再輸出文件中包含頭文件) |
| -H |
Output the header only.(只輸出頭文件) |
| -l STR |
Only output reads in library STR [null] |
| -o FILE |
Output file [stdout] |
| -q INT |
Skip alignments with MAPQ smaller than INT [0] |
| -r STR |
Only output reads in read group STR [null] |
| -R FILE |
Output reads in read groups listed in FILE [null] |
| -S |
Input is in SAM. If @SQ header lines are absent, the ‘-t’ option is required.這里S表示的就是輸入的是SAM的格式,如果sam中沒有頭文件,那么就要用到-t的選項 |
| -c |
Instead of printing the alignments, only count them and print the total number. All filter options, such as ‘-f’, ‘-F’ and ‘-q’ , are taken into account. |
| -t FILE |
This file is TAB-delimited. Each line must contain the reference name and the length of the reference, one line for each distinct reference; additional fields are ignored. This file also defines the order of the reference sequences in sorting. If you run ‘samtools faidx <ref.fa>’, the resultant index file <ref.fa>.fai can be used as this <in.ref_list> file. |
| -u |
Output uncompressed BAM. This option saves time spent on compression/decomprssion and is thus preferred when the output is piped to another samtools command. |
|
|
|
|
|
|
|
|
|
| -6 |
Assume the quality is in the Illumina 1.3+ encoding. -A Do not skip anomalous read pairs in variant calling. |
| -B |
Disable probabilistic realignment for the computation of base alignment quality (BAQ). BAQ is the Phred-scaled probability of a read base being misaligned. Applying this option greatly helps to reduce false SNPs caused by misalignments. |
| -b FILE |
List of input BAM files, one file per line [null] |
| -C INT |
Coefficient for downgrading mapping quality for reads containing excessive mismatches. Given a read with a phred-scaled probability q of being generated from the mapped position, the new mapping quality is about sqrt((INT-q)/INT)*INT. A zero value disables this functionality; if enabled, the recommended value for BWA is 50. [0] |
| -d INT |
At a position, read maximally INT reads per input BAM. [250] |
| -E |
Extended BAQ computation. This option helps sensitivity especially for MNPs, but may hurt specificity a little bit. |
| -f FILE |
The faidx-indexed reference file in the FASTA format. The file can be optionally compressed by razip. [null]:要有一個參考序列 |
| -l FILE |
BED or position list file containing a list of regions or sites where pileup or BCF should be generated [null] |
| -q INT |
Minimum mapping quality for an alignment to be used [0] |
| -Q INT |
Minimum base quality for a base to be considered [13] |
| -r STR |
Only generate pileup in region STR [all sites] |
| Output Options:輸出選項 |
|
|
|
|
| -D |
Output per-sample read depth 讀取的深度,可以設定值比如-D100 |
| -g |
Compute genotype likelihoods and output them in the binary call format (BCF). |
| -S |
Output per-sample Phred-scaled strand bias P-value |
| -u |
Similar to -g except that the output is uncompressed(未壓縮的) BCF, which is preferred for piping. |
| Options for Genotype Likelihood Computation (for -g or -u): |
|
|
|
|
| -e INT |
Phred-scaled gap extension sequencing error probability. Reducing INTleads to longer indels. [20] |
| -h INT |
Coefficient for modeling homopolymer errors. Given an l-long homopolymer run, the sequencing error of an indel of size s is modeled as INT*s/l. [100] |
| -I |
Do not perform INDEL calling |
| -L INT |
Skip INDEL calling if the average per-sample depth is above INT. [250] |
| -o INT |
Phred-scaled gap open sequencing error probability. Reducing INT leads to more indel calls. [40] |
| -P STR |
Comma dilimited list of platforms (determined by @RG-PL) from which indel candidates are obtained. It is recommended to collect indel candidates from sequencing technologies that have low indel error rate such as ILLUMINA. [all] |
|
|
|
|
|
|
|
|
|
| -o |
Output the final alignment to the standard output. |
| -n |
Sort by read names rather than by chromosomal coordinates |
| -m INT |
Approximately the maximum required memory. [500000000] |
|
|
|
|
|
|
|
|
|
| -1 |
Use zlib compression level 1 to comrpess the output |
| -f |
Force to overwrite the output file if present. |
| -h FILE |
Use the lines of FILE as ‘@’ headers to be copied to out.bam, replacing any header lines that would otherwise be copied from in1.bam. (FILE is actually in SAM format, though any alignment records it may contain are ignored.) |
| -n |
The input alignments are sorted by read names rather than by chromosomal coordinates |
| -R STR |
Merge files in the specified region indicated by STR [null] |
| -r |
Attach an RG tag to each alignment. The tag value is inferred from file names. |
| -u |
Uncompressed BAM output |
|
|
|
|
|
|
|
|
|
| -s |
Remove duplicate for single-end reads. By default, the command works for paired-end reads only. |
| -S |
Treat paired-end reads and single-end reads. |
|
|
|
|
|
|
|
|
|
| -A |
When used jointly with -r this option overwrites the original base quality. |
| -e |
Convert a the read base to = if it is identical to the aligned reference base. Indel caller does not support the = bases at the moment. |
| -u |
Output uncompressed BAM |
| -b |
Output compressed BAM |
| -S |
The input is SAM with header lines |
| -C INT |
Coefficient to cap mapping quality of poorly mapped reads. See the pileupcommand for details. [0] |
| -r |
Compute the BQ tag (without -A) or cap base quality by BAQ (with -A). |
| -E |
Extended BAQ calculation. This option trades specificity for sensitivity, though the effect is minor. |
|
|
|
| phase |
samtools phase [-AF] [-k len] [-b prefix] [-q minLOD] [-Q minBaseQ] <in.bam> Call and phase heterozygous SNPs. OPTIONS: |
|
|
|
|
|
|
|
|
|
| -A |
Drop reads with ambiguous phase. |
| -b STR |
Prefix of BAM output. When this option is in use, phase-0 reads will be saved in fileSTR.0.bam and phase-1 reads in STR.1.bam. Phase unknown reads will be randomly allocated to one of the two files. Chimeric reads with switch errors will be saved inSTR.chimeric.bam. [null] |
| -F |
Do not attempt to fix chimeric reads. |
| -k INT |
Maximum length for local phasing. [13] |
| -q INT |
Minimum Phred-scaled LOD to call a heterozygote. [40] |
| -Q INT |
Minimum base quality to be used in het calling. [13] |
|
|
|
BCFTOOLS COMMANDS AND OPTIONS
| view |
bcftools view [-AbFGNQSucgv] [-D seqDict] [-l listLoci] [-s listSample] [-igapSNPratio] [-t mutRate] [-p varThres] [-P prior] [-1 nGroup1] [-d minFrac] [-UnPerm] [-X permThres] [-T trioType] in.bcf [region] Convert between BCF and VCF, call variant candidates and estimate allele frequencies. |
| index |
bcftools index in.bcf Index sorted BCF for random access. |
|
|
|
| Input/Output Options: -A |
|
|
|
Retain all possible alternate alleles at variant sites. By default, the view command discards unlikely alleles. |
|
|
|
| -b |
Output in the BCF format. The default is VCF. |
| -D FILE |
Sequence dictionary (list of chromosome names) for VCF->BCF conversion [null] |
| -F |
Indicate PL is generated by r921 or before (ordering is different). |
| -G |
Suppress all individual genotype information. |
| -l FILE |
List of sites at which information are outputted [all sites] |
| -N |
Skip sites where the REF field is not A/C/G/T |
| -Q |
Output the QCALL likelihood format |
| -s FILE |
List of samples to use. The first column in the input gives the sample names and the second gives the ploidy, which can only be 1 or 2. When the 2nd column is absent, the sample ploidy is assumed to be 2. In the output, the ordering of samples will be identical to the one in FILE. [null] |
| -S |
The input is VCF instead of BCF. |
| -u |
Uncompressed BCF output (force -b). |
| Consensus/Variant Calling Options: -c |
|
|
|
Call variants using Bayesian inference. This option automatically invokes option -e. |
| -d FLOAT |
When -v is in use, skip loci where the fraction of samples covered by reads is below FLOAT. [0] |
| -e |
Perform max-likelihood inference only, including estimating the site allele frequency, testing Hardy-Weinberg equlibrium and testing associations with LRT. |
| -g |
Call per-sample genotypes at variant sites (force -c) |
| -i FLOAT |
Ratio of INDEL-to-SNP mutation rate [0.15] |
| -p FLOAT |
A site is considered to be a variant if P(ref|D)<FLOAT [0.5] |
| -P STR |
Prior or initial allele frequency spectrum. If STR can be full, cond2,flat or the file consisting of error output from a previous variant calling run. |
| -t FLOAT |
Scaled muttion rate for variant calling [0.001] |
| -T STR |
Enable pair/trio calling. For trio calling, option -s is usually needed to be applied to configure the trio members and their ordering. In the file supplied to the option -s, the first sample must be the child, the second the father and the third the mother. The valid values of STR are ‘pair’, ‘trioauto’, ‘trioxd’ and ‘trioxs’, where ‘pair’ calls differences between two input samples, and ‘trioxd’ (‘trioxs’) specifies that the input is from the X chromosome non-PAR regions and the child is a female (male). [null] |
| -v |
Output variant sites only (force -c):這里有-v這個選項就只輸出snp位點,如果沒有-v那么就是輸出所有的位點(測序所包含的) |
| Contrast Calling and Association Test Options: -1 INT |
|
|
|
Number of group-1 samples. This option is used for dividing the samples into two groups for contrast SNP calling or association test. When this option is in use, the following VCF INFO will be outputted: PC2, PCHI2 and QCHI2. [0] |
| -U INT |
Number of permutations for association test (effective only with -1) [0] |
| -X FLOAT |
Only perform permutations for P(chi^2)<FLOAT (effective only with -U) [0.01] |
|
|
|
| cat |
bcftools cat in1.bcf ["in2.bcf "[..."]]]" Concatenate BCF files. The input files are required to be sorted and have identical samples appearing in the same order. |
|
|
|
SAM FORMAT
Sequence Alignment/Map (SAM) format is TAB-delimited. Apart from the header lines, which are started with the ‘@’ symbol, each alignment line consists of:
| Col |
Field |
Description |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
QNAME |
Query template/pair NAME |
|
|
|
| 2 |
FLAG |
bitwise FLAG |
|
|
|
| 3 |
RNAME |
Reference sequence NAME |
|
|
|
| 4 |
POS |
1-based leftmost POSition/coordinate of clipped sequence |
|
|
|
| 5 |
MAPQ |
MAPping Quality (Phred-scaled) |
|
|
|
| 6 |
CIAGR |
extended CIGAR string |
|
|
|
| 7 |
MRNM |
Mate Reference sequence NaMe (‘=’ if same as RNAME) |
|
|
|
| 8 |
MPOS |
1-based Mate POSistion |
|
|
|
| 9 |
TLEN |
inferred Template LENgth (insert size) |
|
|
|
| 10 |
SEQ |
query SEQuence on the same strand as the reference |
|
|
|
| 11 |
QUAL |
query QUALity (ASCII-33 gives the Phred base quality) |
|
|
|
| 12+ |
OPT |
variable OPTional fields in the format TAG:VTYPE:VALUE |
|
|
|
Each bit in the FLAG field is defined as:
| Flag |
Chr |
Description |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 0x0001 |
p |
the read is paired in sequencing |
|
|
|
| 0x0002 |
P |
the read is mapped in a proper pair |
|
|
|
| 0x0004 |
u |
the query sequence itself is unmapped |
|
|
|
| 0x0008 |
U |
the mate is unmapped |
|
|
|
| 0x0010 |
r |
strand of the query (1 for reverse) |
|
|
|
| 0x0020 |
R |
strand of the mate |
|
|
|
| 0x0040 |
1 |
the read is the first read in a pair |
|
|
|
| 0x0080 |
2 |
the read is the second read in a pair |
|
|
|
| 0x0100 |
s |
the alignment is not primary |
|
|
|
| 0x0200 |
f |
the read fails platform/vendor quality checks |
|
|
|
| 0x0400 |
d |
the read is either a PCR or an optical duplicate |
|
|
|
where the second column gives the string representation of the FLAG field.
VCF FORMAT
The Variant Call Format (VCF) is a TAB-delimited format with each data line consists of the following fields:
| Col |
Field |
Description |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
CHROM |
CHROMosome name |
|
|
|
| 2 |
POS |
the left-most POSition of the variant |
|
|
|
| 3 |
ID |
unique variant IDentifier |
|
|
|
| 4 |
REF |
the REFerence allele |
|
|
|
| 5 |
ALT |
the ALTernate allele(s), separated by comma |
|
|
|
| 6 |
QUAL |
variant/reference QUALity |
|
|
|
| 7 |
FILTER |
FILTers applied |
|
|
|
| 8 |
INFO |
INFOrmation related to the variant, separated by semi-colon |
|
|
|
| 9 |
FORMAT |
FORMAT of the genotype fields, separated by colon (optional) |
|
|
|
| 10+ |
SAMPLE |
SAMPLE genotypes and per-sample information (optional) |
|
|
|
The following table gives the INFO tags used by samtools and bcftools.
| Tag |
Format |
Description |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| AF1 |
double |
Max-likelihood estimate of the site allele frequency (AF) of the first ALT allele |
|
|
|
| DP |
int |
Raw read depth (without quality filtering) |
|
|
|
| DP4 |
int[4] |
# high-quality reference forward bases, ref reverse, alternate for and alt rev bases |
|
|
|
| FQ |
int |
Consensus quality. Positive: sample genotypes different; negative: otherwise |
|
|
|
| MQ |
int |
Root-Mean-Square mapping quality of covering reads |
|
|
|
| PC2 |
int[2] |
Phred probability of AF in group1 samples being larger (,smaller) than in group2 |
|
|
|
| PCHI2 |
double |
Posterior weighted chi^2 P-value between group1 and group2 samples |
|
|
|
| PV4 |
double[4] |
P-value for strand bias, baseQ bias, mapQ bias and tail distance bias |
|
|
|
| QCHI2 |
int |
Phred-scaled PCHI2 |
|
|
|
| RP |
int |
# permutations yielding a smaller PCHI2 |
|
|
|
| CLR |
int |
Phred log ratio of genotype likelihoods with and without the trio/pair constraint |
|
|
|
| UGT |
string |
Most probable genotype configuration without the trio constraint |
|
|
|
| CGT |
string |
Most probable configuration with the trio constraint |
|
|
|
EXAMPLES
| o |
Import SAM to BAM when @SQ lines are present in the header: samtools view -bS aln.sam > aln.bam If @SQ lines are absent: samtools faidx ref.fa samtools view -bt ref.fa.fai aln.sam > aln.bam where ref.fa.fai is generated automatically by the faidx command. |
| o |
Attach the RG tag while merging sorted alignments: perl -e ’print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"’ > rg.txt samtools merge -rh rg.txt merged.bam ga.bam 454.bam The value in a RG tag is determined by the file name the read is coming from. In this example, in the merged.bam, reads from ga.bam will be attached RG:Z:ga, while reads from454.bam will be attached RG:Z:454. |
| o |
Call SNPs and short INDELs for one diploid individual: samtools mpileup -ugf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf The -D option of varFilter controls the maximum read depth, which should be adjusted to about twice the average read depth. One may consider to add -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usually helps BWA-short but may not other mappers. |
| o |
Generate the consensus sequence for one diploid individual: samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq |
| o |
Call somatic mutations from a pair of samples: samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair - > var.bcf In the output INFO field, CLR gives the Phred-log ratio between the likelihood by treating the two samples independently, and the likelihood by requiring the genotype to be identical. This CLR is effectively a score measuring the confidence of somatic calls. The higher the better. |
| o |
Call de novo and somatic mutations from a family trio: samtools mpileup -DSuf ref.fa aln.bam | bcftools view -bvcgT pair -s samples.txt - > var.bcf File samples.txt should consist of three lines specifying the member and order of samples (in the order of child-father-mother). Similarly, CLR gives the Phred-log likelihood ratio with and without the trio constraint. UGT shows the most likely genotype configuration without the trio constraint, and CGT gives the most likely genotype configuration satisfying the trio constraint. |
| o |
Phase one individual: samtools calmd -AEur aln.bam ref.fa | samtools phase -b prefix - > phase.out The calmd command is used to reduce false heterozygotes around INDELs. |
| o |
Call SNPs and short indels for multiple diploid individuals: samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bcvg - > var.raw.bcf bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf Individuals are identified from the SM tags in the @RG header lines. Individuals can be pooled in one alignment file; one individual can also be separated into multiple files. The-P option specifies that indel candidates should be collected only from read groups with the @RG-PL tag set to ILLUMINA. Collecting indel candidates from reads sequenced by an indel-prone technology may affect the performance of indel calling. |
| o |
Derive the allele frequency spectrum (AFS) on a list of sites from multiple individuals: samtools mpileup -Igf ref.fa *.bam > all.bcf bcftools view -bl sites.list all.bcf > sites.bcf bcftools view -cGP cond2 sites.bcf > /dev/null 2> sites.1.afs bcftools view -cGP sites.1.afs sites.bcf > /dev/null 2> sites.2.afs bcftools view -cGP sites.2.afs sites.bcf > /dev/null 2> sites.3.afs ...... where sites.list contains the list of sites with each line consisting of the reference sequence name and position. The following bcftools commands estimate AFS by EM. |
| o |
Dump BAQ applied alignment for other SNP callers: samtools calmd -bAr aln.bam > aln.baq.bam It adds and corrects the NM and MD tags at the same time. The calmd command also comes with the -C option, the same as the one in pileup and mpileup. Apply if it helps. |
|
|
|
LIMITATIONS
| o |
Unaligned words used in bam_import.c, bam_endian.h, bam.c and bam_aux.c. |
| o |
Samtools paired-end rmdup does not work for unpaired reads (e.g. orphan reads or ends mapped to different chromosomes). If this is a concern, please use Picard’s MarkDuplicate which correctly handles these cases, although a little slower. |
|
|
|
AUTHOR
Heng Li from the Sanger Institute wrote the C version of samtools. Bob Handsaker from the Broad Institute implemented the BGZF library and Jue Ruan from Beijing Genomics Institute wrote the RAZF library. John Marshall and Petr Danecek contribute to the source code and various people from the 1000 Genomes Project have contributed to the SAM format specification.
SEE ALSO
Samtools website: <http://samtools.sourceforge.net>
| samtools-0.1.17 |
samtools (1) |
05 July 2011 |
上面的內容參考了一下網址的內容
http://www.plob.org/2012/02/04/1703.html
http://www.plob.org/2012/02/04/1700.html
http://www.hudsonalpha.org/gsl/software/bam2fastq.php
http://liucheng.name/670/
http://samtools.sourceforge.net/samtools.shtml
http://blog.sina.com.cn/s/blog_4af3f0d20100xvq1.html
閱讀排行
- 目睹一個程序員精神失常的經歷(37572)
- 請把你的時間精力用於創造價值(35838)
- samtools學習及使用范例,以及官方文檔詳解(8562)
- 用R的基礎作圖系統和ggplot2做常用圖(6329)
- vim自動補全,vim自動補全括號等總結及應用(4747)
- 歐拉計划:第21題計算10000以下所有親和數之和,22題文件中所有名字的得分之和(4585)
- R語言-Knitr包的詳細使用說明(4452)
- TexLive 2013 安裝使用過程(4050)
- TeXmacs對中文的支持(3297)
- R作圖之 annotation詳解!(2485)
評論排行
