Variant Annotation with ANNOVAR

Introduction to ANNOVAR

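ANNOVAR is an annotation tool written by Kai Wang. It annotates SNPs and indels and can also filter and prioritize variants.

ANNOVAR uses up-to-date data to analyze genetic variants in a variety of genomes. It offers three annotation methods: gene-based annotation, region-based annotation, and filter-based annotation.

ANNOVAR is written in Perl.

Pros: many databases are available for direct download, multiple input formats are supported, and the annotations are easy to read.

Cons: species for which no database is available cannot be annotated.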

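ANNOVAR Structure

ANNOVAR
│  annotate_variation.pl #main program: downloads databases and runs the three annotation types
│  coding_change.pl #infers the mutated protein sequence
│  convert2annovar.pl #converts various formats to .avinput
│  retrieve_seq_from_fasta.pl #used to build transcript sequences for other species
│  table_annovar.pl #annotation program that runs all three annotation types in one pass
│  variants_reduction.pl #customizes the filtering and annotation pipeline more flexibly
│
├─example #example files
│
└─humandb #human annotation databases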

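Downloading ANNOVAR Databases

Example command

[kaiwang@biocluster ~/]$ perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
# -buildver hg19: the genome build version
# -downdb: download a database
# -webfrom annovar: download from the mirror provided by ANNOVAR; without this argument, the database's own source is used
# humandb/: store the download in the humandb/ directory

The official ANNOVAR documentation lists the downloadable databases with their versions and update dates; the list can also be retrieved with -downdb avdblist.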


ANNOVAR Input Format

[kaiwang@biocluster ~/]$ cat example/ex1.avinput
1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss

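ANNOVAR uses the .avinput format shown above. Columns are tab-separated, and the first five are the essential ones:

Chromosome
Start position
End position
Reference allele
Alternative allele
The remaining columns are optional comments.

ANNOVAR relies mainly on these five fields to look up each variant in a database and annotate it.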
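To make the five-column layout concrete, here is a minimal Python sketch that splits one .avinput line into its fields. The names `AvinputRecord` and `parse_avinput_line` are hypothetical helpers for illustration and are not part of ANNOVAR; the sketch assumes the file really is tab-separated, as described above.

```python
from typing import NamedTuple


class AvinputRecord(NamedTuple):
    """One .avinput row (hypothetical helper type, not part of ANNOVAR)."""
    chrom: str
    start: int
    end: int
    ref: str
    alt: str
    extra: str  # optional trailing columns, kept verbatim


def parse_avinput_line(line: str) -> AvinputRecord:
    """Split a tab-separated .avinput line into its five core fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(fields)}")
    chrom, start, end, ref, alt = fields[:5]
    # Anything after the fifth column is treated as free-form comment text.
    return AvinputRecord(chrom, int(start), int(end), ref, alt, "\t".join(fields[5:]))


record = parse_avinput_line(
    "1\t948921\t948921\tT\tC\tcomments: rs15842, a SNP in 5' UTR of ISG15"
)
print(record.chrom, record.start, record.end, record.ref, record.alt)
```

Keeping the trailing columns as one untouched string mirrors how the comment text is carried through to the output files shown later.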

ANNOVAR Format Conversion

Example command

$ convert2annovar.pl -format vcf4 example/ex2.vcf > ex2.avinput
# -format vcf4: the input file is in VCF format

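ANNOVAR performs format conversion with convert2annovar.pl. The converted file is a slimmed-down version that mainly holds the five columns described above. To carry all of the original file's content over into the .avinput file, use the -includeinfo argument; to write a separate .avinput file for each sample, use the -allsample argument; other options are available as well.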
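The coordinate change between a VCF record and the five .avinput columns can be sketched as follows. This is a simplified illustration, not the logic of convert2annovar.pl itself: the function name `vcf_to_avinput` is hypothetical, it handles a single ALT allele only, and it assumes that an insertion is placed on the base just before the inserted sequence and that an empty allele is written as `-`. The anchor bases in the usage lines are made up for the example.

```python
def vcf_to_avinput(chrom: str, pos: int, ref: str, alt: str):
    """Convert one single-ALT VCF record to (chrom, start, end, ref, alt).

    Simplified sketch: strips the leading bases shared by REF and ALT
    (VCF indels carry an anchor base) and writes an empty allele as '-'.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    # Remove the shared prefix, moving the position forward accordingly.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    if not ref:
        # Insertion: assumed to be anchored on the base before the inserted bases.
        return (chrom, pos - 1, pos - 1, "-", alt)
    end = pos + len(ref) - 1
    return (chrom, pos, end, ref, alt or "-")


# SNV: coordinates are unchanged.
print(vcf_to_avinput("16", 50745926, "C", "T"))
# Deletion with a hypothetical anchor base 'A' in front of the deleted 'TC'.
print(vcf_to_avinput("1", 13211292, "ATC", "A"))
# Insertion with a hypothetical anchor base 'T' in front of the inserted 'AT'.
print(vcf_to_avinput("1", 11403596, "T", "TAT"))
```

For real data, convert2annovar.pl remains the tool to use; the sketch only shows why indel coordinates in .avinput differ from those in the VCF.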

ANNOVAR also supports conversion from the following formats:

SAMtools pileup format
Complete Genomics format
GFF3-SOLiD calling format
SOAPsnp calling format
MAQ calling format
CASAVA calling format

ANNOVAR Annotation Functions

Annotating with `table_annovar.pl` (all three annotation types in one run)

Example command

[kaiwang@biocluster ~/]$ table_annovar.pl example/ex1.avinput humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,genomicSuperDups,esp6500siv2_all,1000g2014oct_all,1000g2014oct_afr,1000g2014oct_eas,1000g2014oct_eur,snp138,ljb26_all -operation g,r,r,f,f,f,f,f,f,f -nastring . -csvout
# -buildver hg19: use the hg19 build
# -out myanno: use myanno as the output file prefix
# -remove: delete the temporary files created during annotation
# -protocol: the databases to annotate with, comma-separated; the order matters
# -operation: the annotation type of each database, in the same order (g = gene-based, r = region-based, f = filter-based), comma-separated
# -nastring .: write a dot for missing values
# -csvout: write the final output as a .csv file

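The output .csv file contains the five main input columns plus the annotations from each database. In addition, table_annovar.pl can annotate a VCF file directly, without format conversion (using the -vcfinput argument); in that case the annotations are written to the INFO field of the VCF file.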
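Because -protocol and -operation must line up position by position, one way to avoid mismatches is to keep each database name together with its operation code and derive both arguments from the same list. The sketch below does this in Python; `build_table_annovar_args` is a hypothetical helper, and it only assembles an argument list using the flags shown in the command above, without running anything.

```python
VALID_OPERATIONS = {"g", "r", "f"}  # gene-based, region-based, filter-based


def build_table_annovar_args(avinput, dbdir, buildver, out_prefix, annotations):
    """Build a table_annovar.pl argument list from (database, operation) pairs."""
    for name, op in annotations:
        if op not in VALID_OPERATIONS:
            raise ValueError(f"unknown operation {op!r} for database {name!r}")
    # Both lists come from the same pairs, so their order always matches.
    protocols = ",".join(name for name, _ in annotations)
    operations = ",".join(op for _, op in annotations)
    return [
        "table_annovar.pl", avinput, dbdir,
        "-buildver", buildver,
        "-out", out_prefix,
        "-remove",
        "-protocol", protocols,
        "-operation", operations,
        "-nastring", ".",
        "-csvout",
    ]


args = build_table_annovar_args(
    "example/ex1.avinput", "humandb/", "hg19", "myanno",
    [("refGene", "g"), ("cytoBand", "r"), ("snp138", "f")],
)
print(" ".join(args))
```

The resulting list could be passed to `subprocess.run` on a machine where ANNOVAR and the listed databases are installed.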

Gene-based Annotation

Gene-based annotation reveals the relationship between a variant and known genes, along with its functional consequence. It requires a database intended for gene-based annotation.

Example command

[kaiwang@biocluster ~/]$ annotate_variation.pl -geneanno -dbtype refGene -out ex1 -build hg19 example/ex1.avinput humandb/
# -geneanno: use gene-based annotation
# -dbtype refGene: use the "refGene" database
# -out ex1: use ex1 as the output file prefix

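Because annotate_variation.pl defaults to gene-based annotation with the refGene database, -geneanno -dbtype refGene can be omitted from the command above.

The command produces three files:

ex1.variant_function: the gene and location of every variant
ex1.exonic_variant_function: for exonic variants, the detailed functional consequence, variant type, and amino acid change
ex1.ann.log: the log file, containing the command line, run-time messages, and the database files used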

`ex1.variant_function`

The first file ends in .variant_function. Its main content is as follows:

[kaiwang@biocluster ~/]$ cat ex1.variant_function
UTR5 ISG15(NM_005101:c.-33T>C) 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
UTR3 ATAD3C(NM_001039211:c.*91G>T) 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
splicing NPHP4(NM_001291593:exon19:c.1279-2T>A,NM_001291594:exon18:c.1282-2T>A,NM_015102:exon22:c.2818-2T>A) 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
intronic DDR2 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
intronic DNASE2B 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
intergenic LOC645354(dist=11566),LOC391003(dist=116902) 1 13211293 13211294 TC - comments: rs59770105, a 2-bp deletion
intergenic UBIAD1(dist=55105),PTCHD2(dist=135699) 1 11403596 11403596 - AT comments: rs35561142, a 2-bp insertion
intergenic LOC100129138(dist=872538),NONE(dist=NONE) 1 105492231 105492231 A ATAAA comments: rs10552169, a block substitution
exonic IL23R 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
exonic ATG16L1 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
exonic NOD2 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
exonic NOD2 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
exonic NOD2 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
exonic GJB2 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
exonic CRYL1,GJB6 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss

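The annotated output is again tab-separated. The first column gives the type of region the variant falls in, such as exonic, UTR5, or UTR3 (the official documentation has the full list of types).

If the first column is exonic, intronic, or ncRNA, the second column gives the corresponding gene name (multiple gene names are separated by commas); otherwise, as in the intergenic rows above, the second column gives the two neighboring genes and the distance to each.

Columns 3 through 7 are the five main input fields, and the remaining columns are the comments.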
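As an illustration of this column layout, the following Python sketch reads one .variant_function line and, for intergenic rows, pulls the neighboring genes and distances out of the second column. The function name `parse_variant_function_line` and the returned dictionary keys are hypothetical choices for this example; the sketch assumes tab-separated input.

```python
import re

# Matches entries such as "UBIAD1(dist=55105)" in the second column.
NEIGHBOR_PATTERN = re.compile(r"([^,()]+)\(dist=([^)]+)\)")


def parse_variant_function_line(line: str) -> dict:
    """Split a .variant_function row into region, gene detail, and variant."""
    fields = line.rstrip("\n").split("\t")
    region, detail = fields[0], fields[1]
    chrom, start, end, ref, alt = fields[2:7]
    parsed = {
        "region": region,
        "detail": detail,  # raw second column, kept as written
        "variant": (chrom, int(start), int(end), ref, alt),
        "neighbors": None,
    }
    if region == "intergenic":
        # List of (gene, distance) string pairs; distance may be "NONE".
        parsed["neighbors"] = NEIGHBOR_PATTERN.findall(detail)
    return parsed


row = parse_variant_function_line(
    "intergenic\tUBIAD1(dist=55105),PTCHD2(dist=135699)\t1\t11403596\t11403596\t-\tAT"
    "\tcomments: rs35561142, a 2-bp insertion"
)
print(row["region"], row["neighbors"], row["variant"])
```

Distances are left as strings because a missing neighbor is reported as `NONE(dist=NONE)`, as in the block-substitution row above.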

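Note that when a variant matches several annotations, ANNOVAR compares them and reports only the highest-priority one, using the precedence exonic = splicing > ncRNA > UTR5/UTR3 > intron > upstream/downstream > intergenic. To have ANNOVAR list all annotations for a variant, use the --separate argument.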
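The precedence rule can be sketched as a small ranking table. This is an illustration of the ordering described above, not ANNOVAR's own code: the names `PRECEDENCE`, `RANK`, and `top_priority` are hypothetical, the label `intronic` is used because that is how the region appears in the output file, and the finer-grained ncRNA labels are collapsed into a single `ncRNA` entry for simplicity.

```python
# Tiers from highest to lowest priority; labels in the same tier are tied.
PRECEDENCE = [
    ("exonic", "splicing"),
    ("ncRNA",),
    ("UTR5", "UTR3"),
    ("intronic",),
    ("upstream", "downstream"),
    ("intergenic",),
]
RANK = {label: tier for tier, labels in enumerate(PRECEDENCE) for label in labels}


def top_priority(labels):
    """Return the labels that share the highest-priority tier, in input order."""
    best = min(RANK[label] for label in labels)
    return [label for label in labels if RANK[label] == best]


print(top_priority(["intronic", "UTR3"]))
print(top_priority(["splicing", "intergenic", "exonic"]))
```

Returning every label in the winning tier reflects the "exonic = splicing" tie in the precedence list.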

`ex1.exonic_variant_function`

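The second output file ends in .exonic_variant_function. It lists only the exonic variants, together with their effect on the amino acid sequence. Its main content is as follows: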

[kaiwang@biocluster ~/]$ cat ex1.exonic_variant_function
line9 nonsynonymous SNV IL23R:NM_144701:exon9:c.G1142A:p.R381Q, 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
line10 nonsynonymous SNV ATG16L1:NM_001190267:exon9:c.A550G:p.T184A,ATG16L1:NM_017974:exon8:c.A841G:p.T281A,ATG16L1:NM_001190266:exon9:c.A646G:p.T216A,ATG16L1:NM_030803:exon9:c.A898G:p.T300A,ATG16L1:NM_198890:exon5:c.A409G:p.T137A, 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease
line11 nonsynonymous SNV NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_001293557:exon3:c.C2023T:p.R675W, 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
line12 nonsynonymous SNV NOD2:NM_022162:exon8:c.G2722C:p.G908R,NOD2:NM_001293557:exon7:c.G2641C:p.G881R, 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
line13 frameshift insertion NOD2:NM_022162:exon11:c.3017dupC:p.A1006fs,NOD2:NM_001293557:exon10:c.2936dupC:p.A979fs, 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
line14 frameshift deletion GJB2:NM_004004:exon2:c.35delG:p.G12fs, 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
line15 frameshift deletion GJB6:NM_001110221:wholegene,GJB6:NM_001110220:wholegene,GJB6:NM_001110219:wholegene,CRYL1:NM_015974:wholegene,GJB6:NM_006783:wholegene, 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss

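The first column of this file is the line number of the variant in the .variant_function file. The second column is the functional consequence of the variant, such as nonsynonymous SNV, synonymous SNV, or frameshift insertion (the official documentation again has the full list of types). The third column contains the gene name, the transcript identifier, and the sequence change in each corresponding transcript. The input file content starts at the fourth column.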
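The third column packs one colon-separated entry per transcript into a comma-separated list, with a trailing comma. A minimal Python sketch for unpacking it follows; `parse_transcript_changes` and the dictionary keys are hypothetical names for this example. Entries such as `GJB6:NM_001110221:wholegene` have fewer parts, so everything after the transcript identifier is kept as a list rather than forced into fixed fields.

```python
def parse_transcript_changes(column: str):
    """Unpack the third column of .exonic_variant_function into per-transcript dicts."""
    changes = []
    for entry in column.split(","):
        if not entry:
            continue  # skip the empty piece left by the trailing comma
        parts = entry.split(":")
        changes.append({
            "gene": parts[0],
            "transcript": parts[1],
            # e.g. ["exon4", "c.C2104T", "p.R702W"] or ["wholegene"]
            "details": parts[2:],
        })
    return changes


changes = parse_transcript_changes(
    "NOD2:NM_022162:exon4:c.C2104T:p.R702W,NOD2:NM_001293557:exon3:c.C2023T:p.R675W,"
)
for change in changes:
    print(change["gene"], change["transcript"], change["details"])
```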

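Region-based Annotation

Filter-based annotation matches a query variant exactly against the records in a database: the two are considered a match only if they have the same chromosome, start position, end position, reference allele, and alternative allele. Region-based annotation works more like a query over a region (which may be a single position) in a database; it requires neither an exact positional match nor matching nucleotides.

Region-based annotation reveals the relationship between a variant and specific genomic segments, for example whether the variant falls in a known conserved genomic region. Databases for region-based annotation are generally provided by UCSC.

Example command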

[kaiwang@biocluster ~/]$ annotate_variation.pl -regionanno -build hg19 -out ex1 -dbtype phastConsElements46way example/ex1.avinput humandb/
# -regionanno: use region-based annotation
# -dbtype phastConsElements46way: use the "phastConsElements46way" database; note that a region-based database is required

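The output file is ex1.hg19_phastConsElements46way; as the name shows, region-based annotation produces an output file whose suffix is the annotation database. Its main content is as follows: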

[kaiwang@biocluster ~/]$ cat ex1.hg19_phastConsElements46way
phastConsElements46way Score=387;Name=lod=50 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
phastConsElements46way Score=420;Name=lod=68 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
phastConsElements46way Score=385;Name=lod=49 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
phastConsElements46way Score=395;Name=lod=54 13 20763686 20763686 G - comments: rs1801002 (del35G), a frameshift mutation in GJB2, associated with hearing loss
phastConsElements46way Score=545;Name=lod=218 13 20797176 21105944 0 - comments: a 342kb deletion encompassing GJB6, associated with hearing loss

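The first column of the output file is "phastConsElements46way", the annotation type; the phastCons 46-way alignments annotate conserved genomic regions. The second column holds a score and a name. The score comes from UCSC, and --score_threshold and --normscore_threshold can be used to filter out low-scoring variants; "Name=lod=x" is the name of the region. The remaining columns are the input file content.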
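The second column can be unpacked with a few lines of Python, for example to apply a score cutoff after the fact. This is a post-processing sketch and is not claimed to reproduce what --score_threshold does inside ANNOVAR; `parse_region_annotation` and `keep_high_scoring` are hypothetical helper names, and the rows are assumed to be tab-separated.

```python
def parse_region_annotation(column: str) -> dict:
    """Turn 'Score=387;Name=lod=50' into {'Score': '387', 'Name': 'lod=50'}."""
    parsed = {}
    for item in column.split(";"):
        # Split on the first '=' only, since the name itself contains '='.
        key, _, value = item.partition("=")
        parsed[key] = value
    return parsed


def keep_high_scoring(rows, min_score: int):
    """Keep region-annotation rows whose Score is at least min_score."""
    kept = []
    for row in rows:
        annotation = parse_region_annotation(row.split("\t")[1])
        if int(annotation["Score"]) >= min_score:
            kept.append(row)
    return kept


rows = [
    "phastConsElements46way\tScore=387;Name=lod=50\t1\t67705958\t67705958\tG\tA",
    "phastConsElements46way\tScore=420;Name=lod=68\t16\t50756540\t50756540\tG\tC",
    "phastConsElements46way\tScore=545;Name=lod=218\t13\t20797176\t21105944\t0\t-",
]
for row in keep_high_scoring(rows, 400):
    print(row)
```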

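Filter-based Annotation

The main difference between filter-based and region-based annotation is that filter-based annotation targets the mutation (the nucleotide change), whereas region-based annotation targets the position on the chromosome. For example, region-based annotation matches chr1:1000-1000, while filter-based annotation matches the A->G change at chr1:1000-1000.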
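The two matching rules can be contrasted in a short Python sketch. The functions `region_match` and `filter_match` are hypothetical illustrations of the idea, not ANNOVAR internals; a variant is represented as a (chromosome, start, end, ref, alt) tuple and a region as (chromosome, start, end).

```python
def region_match(variant, region) -> bool:
    """Region-based: same chromosome and overlapping coordinates; alleles are ignored."""
    chrom, start, end, _ref, _alt = variant
    region_chrom, region_start, region_end = region
    return chrom == region_chrom and start <= region_end and end >= region_start


def filter_match(variant, record) -> bool:
    """Filter-based: chromosome, start, end, ref, and alt must all be identical."""
    return variant == record


query = ("1", 1000, 1000, "A", "G")

# Any region covering position 1000 on chromosome 1 matches, whatever the alleles.
print(region_match(query, ("1", 900, 1100)))
# Same position but a different ALT allele: no filter-based match.
print(filter_match(query, ("1", 1000, 1000, "A", "T")))
# Identical in all five fields: filter-based match.
print(filter_match(query, ("1", 1000, 1000, "A", "G")))
```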

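Filter-based annotation uses different filter databases to report a range of information about a variant. For variant frequencies in whole-genome data, databases such as 1000g2015aug and kaviar_20150923 can be used; for frequencies in whole-exome data, exac03, esp6500siv2, and similar; for frequencies in isolated or under-represented populations, databases such as ajews. (The official ANNOVAR documentation describes these in detail.)

Example command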

[kaiwang@biocluster ~/]$ annotate_variation.pl -filter -dbtype 1000g2012apr_eur -buildver hg19 -out ex1 example/ex1.avinput humandb/
# -filter: use filter-based annotation
# -dbtype 1000g2012apr_eur: use the "1000g2012apr_eur" database

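After the command runs, variants that are known to the database are written to a file ending in *dropped, and variants not found in the database are written to a file ending in *filtered. The *dropped file holds the results we want. Its content is as follows: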

[kaiwang@biocluster ~/]$ cat ex1.hg19_EUR.sites.2012_04_dropped
1000g2012apr_eur 0.04 1 1404001 1404001 G T comments: rs149123833, a SNP in 3' UTR of ATAD3C
1000g2012apr_eur 0.87 1 162736463 162736463 C T comments: rs1000050, a SNP in Illumina SNP arrays
1000g2012apr_eur 0.81 1 5935162 5935162 A T comments: rs1287637, a splice site variant in NPHP4
1000g2012apr_eur 0.06 1 67705958 67705958 G A comments: rs11209026 (R381Q), a SNP in IL23R associated with Crohn's disease
1000g2012apr_eur 0.54 1 84875173 84875173 C T comments: rs6576700 or SNP_A-1780419, a SNP in Affymetrix SNP arrays
1000g2012apr_eur 0.96 1 948921 948921 T C comments: rs15842, a SNP in 5' UTR of ISG15
1000g2012apr_eur 0.05 16 50745926 50745926 C T comments: rs2066844 (R702W), a non-synonymous SNP in NOD2
1000g2012apr_eur 0.01 16 50756540 50756540 G C comments: rs2066845 (G908R), a non-synonymous SNP in NOD2
1000g2012apr_eur 0.01 16 50763778 50763778 - C comments: rs2066847 (c.3016_3017insC), a frameshift SNP in NOD2
1000g2012apr_eur 0.53 2 234183368 234183368 A G comments: rs2241880 (T300A), a SNP in the ATG16L1 associated with Crohn's disease

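As in the region-based output, the first column of the *dropped file is the database name. The second column is the allele frequency; the -maf 0.05 argument can be used to filter out variants with a frequency below 0.05. The input file content starts at the third column.

Note that -maf 0.05 -reverse can likewise be used to filter out variants with a frequency above 0.05. For filtering on the ALT allele frequency, however, the -score_threshold argument is the recommended choice.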
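The frequency column of a *dropped file can also be examined after annotation. The Python sketch below partitions rows around a cutoff; it is a post-processing illustration only and is not claimed to reproduce the behavior of -maf, -reverse, or -score_threshold. The name `split_by_frequency` is hypothetical, and rows are assumed to be tab-separated with the frequency in the second column.

```python
def split_by_frequency(lines, threshold: float):
    """Partition *dropped rows into (at_or_above, below) by the frequency column."""
    at_or_above, below = [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        frequency = float(fields[1])  # second column holds the allele frequency
        if frequency >= threshold:
            at_or_above.append(fields)
        else:
            below.append(fields)
    return at_or_above, below


dropped_lines = [
    "1000g2012apr_eur\t0.04\t1\t1404001\t1404001\tG\tT",
    "1000g2012apr_eur\t0.87\t1\t162736463\t162736463\tC\tT",
    "1000g2012apr_eur\t0.01\t16\t50756540\t50756540\tG\tC",
]
common, rare = split_by_frequency(dropped_lines, 0.05)
print(len(common), "at or above 0.05;", len(rare), "below 0.05")
```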

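Other ANNOVAR Programs

The ANNOVAR package also includes three programs that are not covered here:

Variants_Reduction: prioritizing causal variants
Coding_Change: Infer mutated protein sequence
Retrieve_Seq_from_FASTA: Retrieve nucleotide/protein sequences

For details, see the Accessory Programs section of the official documentation.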
