blast | diamond 輸出結果選擇和解析 | 比對


之前的文章:構建NCBI本地BLAST數據庫 (NR NT等) | blastx/diamond使用方法 | blast構建索引 | makeblastdb

本地運行blast時,需要指定out format。

常見的網頁版blast結果可以參照:Blast結果的詳細解析

*** Formatting options
 -outfmt <String>
   alignment view options:
     0 = Pairwise,
     1 = Query-anchored showing identities,
     2 = Query-anchored no identities,
     3 = Flat query-anchored showing identities,
     4 = Flat query-anchored no identities,
     5 = BLAST XML,
     6 = Tabular,
     7 = Tabular with comment lines,
     8 = Seqalign (Text ASN.1),
     9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    18 = Organism Report

   Options 6, 7 and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
   	    qseqid means Query Seq-id
   	       qgi means Query GI
   	      qacc means Query accesion
   	   qaccver means Query accesion.version
   	      qlen means Query sequence length
   	    sseqid means Subject Seq-id
   	 sallseqid means All subject Seq-id(s), separated by a ';'
   	       sgi means Subject GI
   	    sallgi means All subject GIs
   	      sacc means Subject accession
   	   saccver means Subject accession.version
   	   sallacc means All subject accessions
   	      slen means Subject sequence length
   	    qstart means Start of alignment in query
   	      qend means End of alignment in query
   	    sstart means Start of alignment in subject
   	      send means End of alignment in subject
   	      qseq means Aligned part of query sequence
   	      sseq means Aligned part of subject sequence
   	    evalue means Expect value
   	  bitscore means Bit score
   	     score means Raw score
   	    length means Alignment length
   	    pident means Percentage of identical matches
   	    nident means Number of identical matches
   	  mismatch means Number of mismatches
   	  positive means Number of positive-scoring matches
   	   gapopen means Number of gap openings
   	      gaps means Total number of gaps
   	      ppos means Percentage of positive-scoring matches
   	    frames means Query and subject frames separated by a '/'
   	    qframe means Query frame
   	    sframe means Subject frame
   	      btop means Blast traceback operations (BTOP)
   	    staxid means Subject Taxonomy ID
   	  ssciname means Subject Scientific Name
   	  scomname means Subject Common Name
   	sblastname means Subject Blast Name
   	 sskingdom means Subject Super Kingdom
   	   staxids means unique Subject Taxonomy ID(s), separated by a ';'
   			 (in numerical order)
   	 sscinames means unique Subject Scientific Name(s), separated by a ';'
   	 scomnames means unique Subject Common Name(s), separated by a ';'
   	sblastnames means unique Subject Blast Name(s), separated by a ';'
   			 (in alphabetical order)
   	sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
   			 (in alphabetical order)
   	    stitle means Subject Title
   	salltitles means All Subject Title(s), separated by a '<>'
   	   sstrand means Subject Strand
   	     qcovs means Query Coverage Per Subject
   	   qcovhsp means Query Coverage Per HSP
   	    qcovus means Query Coverage Per Unique Subject (blastn only)
   When not provided, the default value is:
   'qaccver saccver pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'

默認是0,也就是會輸出比對的結果。

但是這樣的結果顯然不適合批量處理,批量處理的文件格式顯然必須是dataframe。

所以網上有人推薦“outfmt 7 or 10 works perfect”,所以一般就選10吧。

Diamond的輸出格式:

--outfmt (-f)          output format
	0   = BLAST pairwise
	5   = BLAST XML
	6   = BLAST tabular
	100 = DIAMOND alignment archive (DAA)
	101 = SAM

	Value 6 may be followed by a space-separated list of these keywords:

	qseqid means Query Seq - id
	qlen means Query sequence length
	sseqid means Subject Seq - id
	sallseqid means All subject Seq - id(s), separated by a ';'
	slen means Subject sequence length
	qstart means Start of alignment in query
	qend means End of alignment in query
	sstart means Start of alignment in subject
	send means End of alignment in subject
	qseq means Aligned part of query sequence
	sseq means Aligned part of subject sequence
	evalue means Expect value
	bitscore means Bit score
	score means Raw score
	length means Alignment length
	pident means Percentage of identical matches
	nident means Number of identical matches
	mismatch means Number of mismatches
	positive means Number of positive - scoring matches
	gapopen means Number of gap openings
	gaps means Total number of gaps
	ppos means Percentage of positive - scoring matches
	qframe means Query frame
	btop means Blast traceback operations(BTOP)
	staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
	stitle means Subject Title
	salltitles means All Subject Title(s), separated by a '<>'
	qcovhsp means Query Coverage Per HSP
	qtitle means Query title

	Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

diamond就選6吧,便於批量處理。

 

diamond 比對轉錄本到Pfam庫的部分結果,可以看到,格式6非常適合做批量處理。

wgs.RNAseq_00000687     M1AFS7.1        60.9    179     65      2       337     861     102     279     8.1e-55 221.5
wgs.RNAseq_00000687     A0A0B2PIQ0.1    61.0    177     66      2       337     861     319     494     4.0e-54 219.2
wgs.RNAseq_00000687     A0A0L9UG27.1    61.2    178     65      3       337     861     315     491     4.0e-54 219.2
wgs.RNAseq_00000687     A0A0L9UG27.1    63.2    38      14      0       6       119     250     287     1.0e-04 55.1
wgs.RNAseq_00000688     A0A0D2QMP5.1    65.8    114     35      2       218     550     317     429     1.7e-34 153.3
wgs.RNAseq_00000688     A0A0D2QMP5.1    53.2    62      26      1       36      221     256     314     8.7e-10 71.2
wgs.RNAseq_00000688     A0A0D2TR34.1    65.8    114     35      2       218     550     322     434     1.7e-34 153.3
wgs.RNAseq_00000688     A0A0D2TR34.1    53.2    62      26      1       36      221     261     319     8.7e-10 71.2
wgs.RNAseq_00000688     F6H0G0.1        65.8    111     36      2       221     550     320     429     5.6e-33 148.3  

 

每一列是什么呢? BLASTn output format 6

Column headers:
qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

 

 1.  qseqid  query (e.g., gene) sequence id
 2.  sseqid  subject (e.g., reference genome) sequence id
 3.  pident  percentage of identical matches
 4.  length  alignment length
 5.  mismatch  number of mismatches
 6.  gapopen  number of gap openings
 7.  qstart  start of alignment in query
 8.  qend  end of alignment in query
 9.  sstart  start of alignment in subject
 10.  send  end of alignment in subject
 11.  evalue  expect value
 12.  bitscore  bit score

 

 

比對結果的第三列和第四列非常有用,尤其是在鑒別軟件stringtie等預測出來的轉錄本是否為有效轉錄本時,其實預測出來的轉錄本大部分都是沒有意義的,但是又能部分hit到蛋白上,這時我們就只能選出比對最長的那個轉錄本,其余的可以看作是無效的轉錄本。

操作比較簡單,先對比對長度排個序,從長到短。

cat extract_no_N_200.fasta.diamond.nr | sort -n -r -k4 > extract_no_N_200.fasta.diamond.nr.sort

其次就是利用python的panda模塊去冗余就好了。

import pandas as pd

infile = "extract_no_N_200.fasta.diamond.nr.sort"
#infile = "test.sort"

df = pd.read_csv(infile, sep="\t", header=None)
df.columns = ["l1","l2","l3","l4","l5","l6","l7","l8","l9","l10","l11","l12"]

df1 = df.sort_values(['l4'],ascending=False)

df2 = df1.drop_duplicates(subset=[df1.columns[1]], keep = 'first')

df2.to_csv("rm_dup_protein_"+infile, sep="\t", index=False, header=False)

df3 = df2.sort_values(['l3'],ascending=False)

df4 = df3.drop_duplicates(subset=[df3.columns[0]], keep = 'first')

df4.to_csv("unique_protein_"+infile, sep="\t", index=False, header=False)

  

這樣我們得到的轉錄本就基本上是無冗余的了。

 

然后對每個轉錄本取identical match比例最高的蛋白就好了。其實信息是可以全部保留的。

blast和diamond怎么只輸出最優的hit?

 

blastn -query transcripts.fa -out transcripts.blast.txt -task megablast -db refseq_rna -num_threads 12 -evalue 1e-10 -best_hit_score_edge 0.05 -best_hit_overhang 0.25 -outfmt 7 -perc_identity 50 -max_target_seqs 1

  

 

待續~


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM