VCF文件處理工具PyVCF

本文轉載自查看原文 2018-07-18 20:52 1529

vcf格式示例

##fileformat=VCFv4.1

##FILTER=<ID=LowQual,Description=”Low quality”>

##FORMAT=<ID=AD,Number=.,Type=Integer,Description=”Allelic depths for the ref and alt alleles in the order listed”>

##FORMAT=<ID=DP,Number=1,Type=Integer,Description=”Approximate read depth (reads with MQ=255 or with bad mates are filtered)”>

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=”Genotype Quality”>

##FORMAT=<ID=GT,Number=1,Type=String,Description=”Genotype”>

##FORMAT=<ID=PL,Number=G,Type=Integer,Description=”Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification”>

##INFO=<ID=AC,Number=A,Type=Integer,Description=”Allele count in genotypes, for each ALT allele, in the same order as listed”>

##INFO=<ID=AF,Number=A,Type=Float,Description=”Allele Frequency, for each ALT allele, in the same order as listed”>

##INFO=<ID=AN,Number=1,Type=Integer,Description=”Total number of alleles in called genotypes”>

##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description=”Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities”>

##INFO=<ID=DP,Number=1,Type=Integer,Description=”Approximate read depth; some reads may have been filtered”>

##INFO=<ID=DS,Number=0,Type=Flag,Description=”Were any of the samples downsampled?”>

##INFO=<ID=Dels,Number=1,Type=Float,Description=”Fraction of Reads Containing Spanning Deletions”>

##INFO=<ID=FS,Number=1,Type=Float,Description=”Phred-scaled p-value using Fisher’s exact test to detect strand bias”>

##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description=”Consistency of the site with at most two segregating haplotypes”>

##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description=”Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation”>

##INFO=<ID=MLEAC,Number=A,Type=Integer,Description=”Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed”>

##INFO=<ID=MLEAF,Number=A,Type=Float,Description=”Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed”>

##INFO=<ID=MQ,Number=1,Type=Float,Description=”RMS Mapping Quality”>

##INFO=<ID=MQ0,Number=1,Type=Integer,Description=”Total Mapping Quality Zero Reads”>

##INFO=<ID=MQRankSum,Number=1,Type=Float,Description=”Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities”>

##INFO=<ID=QD,Number=1,Type=Float,Description=”Variant Confidence/Quality by Depth”>

##INFO=<ID=RPA,Number=.,Type=Integer,Description=”Number of times tandem repeat unit is repeated, for each allele (including reference)”>

##INFO=<ID=RU,Number=1,Type=String,Description=”Tandem repeat unit (bases)”>

##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description=”Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias”>

##INFO=<ID=SOR,Number=1,Type=Float,Description=”Symmetric Odds Ratio of 2x2 contingency table to detect strand bias”>

##INFO=<ID=STR,Number=0,Type=Flag,Description=”Variant is a short tandem repeat”>

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0

0:48:1:51,51 1

0:48:8:51,51 1/1:43:5:.,.

vcf字段詳解

CHROM： 表示變異位點是在哪個contig 里call出來的，如果是人類全基因組的話那就是chr1…chr22，chrX,Y,M。

POS： 變異位點相對於參考基因組所在的位置，如果是indel，就是第一個鹼基所在的位置。

ID： 如果call出來的SNP存在於dbSNP數據庫里，就會顯示相應的dbSNP里的rs編號。

REF和REF： 在這個變異位點處，參考基因組中所對應的鹼基和研究對象基因組中所對應的鹼基。

QUAL： 可以理解為所call出來的變異位點的質量值。Q=-10lgP，Q表示質量值；P表示這個位點發生錯誤的概率。因此，如果想把錯誤率從控制在90%以上，P的閾值就是1/10，那lg（1/10）=-1，Q=（-10）*（-1）=10。同理，當Q=20時，錯誤率就控制在了0.01。

FILTER： 理想情況下，QUAL這個值應該是用所有的錯誤模型算出來的，這個值就可以代表正確的變異位點了，但是事實是做不到的。因此，還需要對原始變異位點做進一步的過濾。無論你用什么方法對變異位點進行過濾，過濾完了之后，在FILTER一欄都會留下過濾記錄，如果是通過了過濾標准，那么這些通過標准的好的變異位點的FILTER一欄就會注釋一個PASS，如果沒有通過過濾，就會在FILTER這一欄提示除了PASS的其他信息。如果這一欄是一個“.”的話，就說明沒有進行過任何過濾。

FORMAT標簽詳解

GT: 表示這個樣本的基因型，對於一個二倍體生物，GT值表示的是這個樣本在這個位點所攜帶的兩個等位基因。0表示跟REF一樣；1表示表示跟ALT一樣；2表示第二個ALT。當只有一個ALT 等位基因的時候，0/0表示純和且跟REF一致；0/1表示雜合，兩個allele一個是ALT一個是REF；1/1表示純和且都為ALT； The most common format subfield is GT (genotype) data. If the GT subfield is present, it must be the first subfield. In the sample data, genotype alleles are numeric: the REF allele is 0, the first ALT allele is 1, and so on. The allele separator is ‘/’ for unphased genotypes and ‘

’ for phased genotypes.

0 - reference call

1 - alternative call 1

2 - alternative call 2

AD: 對應兩個以逗號隔開的值，這兩個值分別表示覆蓋到REF和ALT鹼基的reads數，相當於支持REF和支持ALT的測序深度。

DP: 覆蓋到這個位點的總的reads數量，相當於這個位點的深度（並不是多有的reads數量，而是大概一定質量值要求的reads數）。

PL: 對應3個以逗號隔開的值，這三個值分別表示該位點基因型是0/0，0/1，1/1的沒經過先驗的標准化Phred-scaled似然值（L）。如果轉換成支持該基因型概率（P）的話，由於L=-10lgP，那么P=10^（-L/10），因此，當L值為0時，P=10^0=1。因此，這個值越小，支持概率就越大，也就是說是這個基因型的可能性越大。

GQ: 表示最可能的基因型的質量值。表示的意義同QUAL。

#CHROM  POS ID      REF ALT QUAL    FILTER  INFO    FORMAT  NA12878
chr1    873762  .       T   G   5231.78 PASS    AC=1;AF=0.50;AN=2;DP=315;Dels=0.00;HRun=2;HaplotypeScore=15.11;MQ=91.05;MQ0=15;QD=16.61;SB=-1533.02;VQSLOD=-1.5473 GT:AD:DP:GQ:PL   0/1:173,141:282:99:255,0,255
chr1    877664  rs3828047   A   G   3931.66 PASS    AC=2;AF=1.00;AN=2;DB;DP=105;Dels=0.00;HRun=1;HaplotypeScore=1.59;MQ=92.52;MQ0=4;QD=37.44;SB=-1152.13;VQSLOD= 0.1185 GT:AD:DP:GQ:PL  1/1:0,105:94:99:255,255,0
chr1    899282  rs28548431  C   T   71.77   PASS    AC=1;AF=0.50;AN=2;DB;DP=4;Dels=0.00;HRun=0;HaplotypeScore=0.00;MQ=99.00;MQ0=0;QD=17.94;SB=-46.55;VQSLOD=-1.9148 GT:AD:DP:GQ:PL  0/1:1,3:4:25.92:103,0,26
chr1    974165  rs9442391   T   C   29.84   LowQual AC=1;AF=0.50;AN=2;DB;DP=18;Dels=0.00;HRun=1;HaplotypeScore=0.16;MQ=95.26;MQ0=0;QD=1.66;SB=-0.98 GT:AD:DP:GQ:PL  0/1:14,4:14:60.91:61,0,255

我們就可以解釋上面的例子：

chr1：873762 是一個新發現的T/G變異，並且有很高的可信度（qual=5231.78）。

chr1：877664 是一個已知的變異為A/G 的SNP位點，名字rs3828047，並且具有很高的可信度（qual=3931.66）。

chr1：899282 是一個已知的變異為C/T的SNP位點，名字rs28548431，但可信度較低（qual=71.77）。在這個位點，GT=0/1，也就是說這個位點的基因型是C/T；GQ=25.92，質量值並不算太高，可能是因為cover到這個位點的reads數太少，DP=4，也就是說只有4條reads支持這個地方的變異；AD=1,3，也就是說支持REF的read有一條，支持ALT的有3條；在PL里，這個位點基因型的不確定性就表現的更突出了，0/1的PL值為0，雖然支持0/1的概率很高；但是1/1的PL值只有26，也就是說還有10^(-2.6)=0.25%的可能性是1/1；但幾乎不可能是0/0，因為支持0/0的概率只有10^(-10.3)=5*10^-11。

chr1：974165 是一個已知的變異為T/C的SNP位點，名字rs9442391，但是這個位點的質量值很低，被標成了“LowQual”，在后續分析中可以被過濾掉。

VCF格式文件操作工具PyVCF

1.安裝

$ pip install pyvcf

2.使用

a.作為python模塊使用

>>> import vcf
>>> vcf_reader = vcf.Reader(open('vcf/test/example-4.0.vcf', 'r'))
>>> for record in vcf_reader:
...     print record
Record(CHROM=20, POS=14370, REF=G, ALT=[A])
Record(CHROM=20, POS=17330, REF=T, ALT=[A])
Record(CHROM=20, POS=1110696, REF=A, ALT=[G, T])
Record(CHROM=20, POS=1230237, REF=T, ALT=[None])
Record(CHROM=20, POS=1234567, REF=GTCT, ALT=[G, GTACT])

b.Shell命令行使用

pyvcf自帶兩個兩個命令腳本vcf_filter.py和vcf_melt,前一個是過濾腳本，后面一個使vcf格式轉換為tab分割字段的文件的腳本

$ vcf_melt < vcf/test/gatk.vcf
SAMPLE      AD      DP      GQ      GT      PL      FILTER  CHROM   POS     REF     ALT     ID      info.AC info.AF info.AN info.BaseQRankSum       info.DB info.DP info.DS info.Dels       info.FS info.HRun       info.HaplotypeScore     info.InbreedingCoeff    info.MQ info.MQ0        info.MQRankSum  info.QD info.ReadPosRankSum
BLANK       6,0     6       18.04   0/0     0,18,211        .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59
NA12878     138,107 250     99.0    0/1     1961,0,3049     .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59
NA12891     169,77  250     99.0    0/1     1038,0,3533     .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59
NA12892     249,0   250     99.0    0/0     0,600,5732      .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59
NA19238     248,1   250     99.0    0/0     0,627,6191      .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59
NA19239     250,0   250     99.0    0/0     0,615,5899      .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59
NA19240     250,0   250     99.0    0/0     0,579,5674      .       chr22   42522392        G       [A]     rs28371738      2       0.143   14      0.375   True    1506    True    0.0     0.0     0       123.5516                253.92  0       0.685   5.9     0.59
BLANK       13,4    17      62.64   0/1     63,0,296        .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731
NA12878     118,127 246     99.0    0/1     2396,0,1719     .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731
NA12891     241,0   244     99.0    0/0     0,459,4476      .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731
NA12892     161,85  246     99.0    0/1     1489,0,2353     .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731
NA19238     110,132 242     99.0    0/1     2561,0,1488     .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731
NA19239     106,135 242     99.0    0/1     2613,0,1389     .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731
NA19240     116,126 243     99.0    0/1     2489,0,1537     .       chr22   42522613        G       [C]     rs1135840       6       0.429   14      16.289  True    1518    True    0.03    0.0     0       142.5716                242.46  0       2.01    9.16    -1.731

參考文章

http://blog.sina.com.cn/s/blog_12d5e3d3c0101qv1u.html

PyVCF

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 文本處理工具 - TextBlob Java身份證處理工具 C# Excel處理工具 java 時間處理工具類字符串處理工具StringUtils python中數字類型與處理工具 VisionPro CogImageConvertTool工具圖像處理工具 VisionPro CogIPOneImageTool工具圖像處理工具 iOS垃圾文件清理工具圖片處理——圖片處理工具ImageJ(轉)