之前的文章:構建NCBI本地BLAST數據庫 (NR NT等) | blastx/diamond使用方法 | blast構建索引 | makeblastdb
本地運行blast時,需要指定out format。
常見的網頁版blast結果可以參照:Blast結果的詳細解析
*** Formatting options -outfmt <String> alignment view options: 0 = Pairwise, 1 = Query-anchored showing identities, 2 = Query-anchored no identities, 3 = Flat query-anchored showing identities, 4 = Flat query-anchored no identities, 5 = BLAST XML, 6 = Tabular, 7 = Tabular with comment lines, 8 = Seqalign (Text ASN.1), 9 = Seqalign (Binary ASN.1), 10 = Comma-separated values, 11 = BLAST archive (ASN.1), 12 = Seqalign (JSON), 13 = Multiple-file BLAST JSON, 14 = Multiple-file BLAST XML2, 15 = Single-file BLAST JSON, 16 = Single-file BLAST XML2, 18 = Organism Report Options 6, 7 and 10 can be additionally configured to produce a custom format specified by space delimited format specifiers. The supported format specifiers are: qseqid means Query Seq-id qgi means Query GI qacc means Query accesion qaccver means Query accesion.version qlen means Query sequence length sseqid means Subject Seq-id sallseqid means All subject Seq-id(s), separated by a ';' sgi means Subject GI sallgi means All subject GIs sacc means Subject accession saccver means Subject accession.version sallacc means All subject accessions slen means Subject sequence length qstart means Start of alignment in query qend means End of alignment in query sstart means Start of alignment in subject send means End of alignment in subject qseq means Aligned part of query sequence sseq means Aligned part of subject sequence evalue means Expect value bitscore means Bit score score means Raw score length means Alignment length pident means Percentage of identical matches nident means Number of identical matches mismatch means Number of mismatches positive means Number of positive-scoring matches gapopen means Number of gap openings gaps means Total number of gaps ppos means Percentage of positive-scoring matches frames means Query and subject frames separated by a '/' qframe means Query frame sframe means Subject frame btop means Blast traceback operations (BTOP) staxid means Subject Taxonomy ID ssciname means Subject Scientific Name scomname means Subject Common Name sblastname means Subject Blast Name sskingdom means Subject Super Kingdom staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order) sscinames means unique Subject Scientific Name(s), separated by a ';' scomnames means unique Subject Common Name(s), separated by a ';' sblastnames means unique Subject Blast Name(s), separated by a ';' (in alphabetical order) sskingdoms means unique Subject Super Kingdom(s), separated by a ';' (in alphabetical order) stitle means Subject Title salltitles means All Subject Title(s), separated by a '<>' sstrand means Subject Strand qcovs means Query Coverage Per Subject qcovhsp means Query Coverage Per HSP qcovus means Query Coverage Per Unique Subject (blastn only) When not provided, the default value is: 'qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore', which is equivalent to the keyword 'std' Default = `0'
默認是0,也就是會輸出比對的結果。
但是這樣的結果顯然不適合批量處理,批量處理的文件格式顯然必須是dataframe。
所以網上有人推薦“outfmt 7 or 10 works perfect”,所以一般就選10吧。
Diamond的輸出格式:
--outfmt (-f) output format 0 = BLAST pairwise 5 = BLAST XML 6 = BLAST tabular 100 = DIAMOND alignment archive (DAA) 101 = SAM Value 6 may be followed by a space-separated list of these keywords: qseqid means Query Seq - id qlen means Query sequence length sseqid means Subject Seq - id sallseqid means All subject Seq - id(s), separated by a ';' slen means Subject sequence length qstart means Start of alignment in query qend means End of alignment in query sstart means Start of alignment in subject send means End of alignment in subject qseq means Aligned part of query sequence sseq means Aligned part of subject sequence evalue means Expect value bitscore means Bit score score means Raw score length means Alignment length pident means Percentage of identical matches nident means Number of identical matches mismatch means Number of mismatches positive means Number of positive - scoring matches gapopen means Number of gap openings gaps means Total number of gaps ppos means Percentage of positive - scoring matches qframe means Query frame btop means Blast traceback operations(BTOP) staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order) stitle means Subject Title salltitles means All Subject Title(s), separated by a '<>' qcovhsp means Query Coverage Per HSP qtitle means Query title Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
diamond就選6吧,便於批量處理。
diamond 比對轉錄本到Pfam庫的部分結果,可以看到,格式6非常適合做批量處理。
wgs.RNAseq_00000687 M1AFS7.1 60.9 179 65 2 337 861 102 279 8.1e-55 221.5 wgs.RNAseq_00000687 A0A0B2PIQ0.1 61.0 177 66 2 337 861 319 494 4.0e-54 219.2 wgs.RNAseq_00000687 A0A0L9UG27.1 61.2 178 65 3 337 861 315 491 4.0e-54 219.2 wgs.RNAseq_00000687 A0A0L9UG27.1 63.2 38 14 0 6 119 250 287 1.0e-04 55.1 wgs.RNAseq_00000688 A0A0D2QMP5.1 65.8 114 35 2 218 550 317 429 1.7e-34 153.3 wgs.RNAseq_00000688 A0A0D2QMP5.1 53.2 62 26 1 36 221 256 314 8.7e-10 71.2 wgs.RNAseq_00000688 A0A0D2TR34.1 65.8 114 35 2 218 550 322 434 1.7e-34 153.3 wgs.RNAseq_00000688 A0A0D2TR34.1 53.2 62 26 1 36 221 261 319 8.7e-10 71.2 wgs.RNAseq_00000688 F6H0G0.1 65.8 111 36 2 221 550 320 429 5.6e-33 148.3
每一列是什么呢? BLASTn output format 6
Column headers:qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
1. | qseqid | query (e.g., gene) sequence id |
2. | sseqid | subject (e.g., reference genome) sequence id |
3. | pident | percentage of identical matches |
4. | length | alignment length |
5. | mismatch | number of mismatches |
6. | gapopen | number of gap openings |
7. | qstart | start of alignment in query |
8. | qend | end of alignment in query |
9. | sstart | start of alignment in subject |
10. | send | end of alignment in subject |
11. | evalue | expect value |
12. | bitscore | bit score |
比對結果的第三列和第四列非常有用,尤其是在鑒別軟件stringtie等預測出來的轉錄本是否為有效轉錄本時,其實預測出來的轉錄本大部分都是沒有意義的,但是又能部分hit到蛋白上,這時我們就只能選出比對最長的那個轉錄本,其余的可以看作是無效的轉錄本。
操作比較簡單,先對比對長度排個序,從長到短。
cat extract_no_N_200.fasta.diamond.nr | sort -n -r -k4 > extract_no_N_200.fasta.diamond.nr.sort
其次就是利用python的panda模塊去冗余就好了。
import pandas as pd infile = "extract_no_N_200.fasta.diamond.nr.sort" #infile = "test.sort" df = pd.read_csv(infile, sep="\t", header=None) df.columns = ["l1","l2","l3","l4","l5","l6","l7","l8","l9","l10","l11","l12"] df1 = df.sort_values(['l4'],ascending=False) df2 = df1.drop_duplicates(subset=[df1.columns[1]], keep = 'first') df2.to_csv("rm_dup_protein_"+infile, sep="\t", index=False, header=False) df3 = df2.sort_values(['l3'],ascending=False) df4 = df3.drop_duplicates(subset=[df3.columns[0]], keep = 'first') df4.to_csv("unique_protein_"+infile, sep="\t", index=False, header=False)
這樣我們得到的轉錄本就基本上是無冗余的了。
然后對每個轉錄本取identical match比例最高的蛋白就好了。其實信息是可以全部保留的。
blast和diamond怎么只輸出最優的hit?
blastn -query transcripts.fa -out transcripts.blast.txt -task megablast -db refseq_rna -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 7 -perc_identity 50 -max_target_seqs 1
待續~