構建NCBI本地BLAST數據庫 (NR NT等) | blastx/diamond使用方法 | blast構建索引 | makeblastdb


參考鏈接
FTP README
如何下載 NCBI NR NT數據庫?
下載blast:ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+
先了解BLAST Databases:BLAST FTP Site
 

如何下載NCBI blast數據庫?

NCBI提供了一個非常智能化的腳本update_blastdb.pl來自動下載所有blast數據庫。
腳本使用方法:
perl update_blastdb.pl nr

有哪些可供下載的blast數據庫?

perl update_blastdb.pl --showall
該命令會顯示所有可供下載的blast數據庫,請自行選擇:
16SMicrobial
cdd_delta
env_nr
env_nt
est
est_human
est_mouse
est_others
gss
gss_annot
htgs
human_genomic
landmark
nr
nt
other_genomic
pataa
patnt
pdbaa
pdbnt
ref_prok_rep_genomes
ref_viroids_rep_genomes
ref_viruses_rep_genomes
refseq_genomic
refseq_protein
refseq_rna
refseqgene
sts
swissprot
taxdb
tsa_nr
tsa_nt
vector
這里我選擇的是nr數據庫。
nohup perl update_blastdb.pl --decompress nr >out.log 2>&1 &
自動在后台下載,然后自動解壓。(下載到一半斷網了,在運行會接着下載,而不會覆蓋已經下載好的文件)

blast如何使用?

這里只演示blastx的使用方法。
image
剛才下載的nr庫就是蛋白庫,blastx就是用來將核酸序列比對到蛋白庫上的。(nt就是核酸庫)
因為我們下載的是已經建好索引的數據庫,所以省去了makeblastdb的過程。
常見的命令有下面幾個:
-query <File_In> 要查詢的核酸序列
-db <String> 數據庫名字
-out <File_Out> 輸出文件
-evalue <Real> evalue閾值
-outfmt <String> 輸出的格式

blast構建索引 | makeblastdb

makeblastdb -in mature.fa -input_type fasta -dbtype nucl -title miRBase -parse_seqids -out miRBase -logfile File_Name

-in 后接輸入文件,你要格式化的fasta序列
-dbtype 后接序列類型,nucl為核酸,prot為蛋白
-title 給數據庫起個名,好看~~(不能用在后面搜索時-db的參數)
-parse_seqids 推薦加上,現在有啥原因還沒搞清楚
-out 后接數據庫名,自己起一個有意義的名字,以后blast+搜索時要用到的-db的參數
-logfile 日志文件,如果沒有默認輸出到屏幕

資源消耗 

blastx -query test.merged.transcript.fasta -db nr -out test.blastx.out

其中fasta文件只有19938行。

可是運行起來耗費了很多資源:

平均內存消耗:51.45G;峰值:115.37G

cpu:1個

運行時間:06:00:24(你敢信?這才是一個小小的test)

所以我強烈推薦用diamond替代blast來做數據庫搜索。

blast結果解讀

每一個合格的序列比對都會給出一個這樣的結果(一個query sequence比對到多個就有多個結果):

>AAB70410.1 Similar to Schizosaccharomyces CCAAT-binding factor (gb|U88525).
EST gb|T04310 comes from this gene [Arabidopsis thaliana]
Length=208

 Score = 238 bits (607),  Expect = 7e-76, Method: Compositional matrix adjust.
 Identities = 116/145 (80%), Positives = 127/145 (88%), Gaps = 2/145 (1%)
 Frame = +1

Query  253  FWASQYQEIEQTSDFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR  432
            FW +Q++EIE+T+DFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR
Sbjct  39   FWENQFKEIEKTTDFKNHSLPLARIKKIMKADEDVRMISAEAPVVFARACEMFILELTLR  98

Query  433  SWNHTEENKRRTLQKNDIAAAITRNEIFDFLVDIVPREDLKDEVLASIPRGTLPMGAPTE  612
            SWNHTEENKRRTLQKNDIAAA+TR +IFDFLVDIVPREDL+DEVL SIPRGT+P  A
Sbjct  99   SWNHTEENKRRTLQKNDIAAAVTRTDIFDFLVDIVPREDLRDEVLGSIPRGTVPEAA-AA  157

Query  613  GLPYYYMQPQHAPQVGAPGMFMGKP  687
            G PY Y+    AP +G PGM MG P
Sbjct  158  GYPYGYLPAGTAP-IGNPGMVMGNP  181  

結果解讀網上很多,這里不啰嗦了。

以下是我在同樣條件下測試的diamond:

平均內存消耗:11.01G;峰值:12.44G

cpu:1個(571.17%)也就是會自動占用5-6個cpu

運行時間:00:26:15

而且diamond注明了,它的優勢是處理>1M 的query,量越大速度越快。

diamond的簡單用法:

diamond makedb --in nr.fa -d nr
diamond blastx -d nr -q test.merged.transcript.fasta -o test.matches.m8

 但是diamond使用有限制,只能用於比對蛋白數據庫。

以下是OrfPredictor推薦的參數設置:

To minimize the file size of BLASTX output for loading, the following parameters are recommended if the BLASTX in the 'NCBI-blastall' package is used: "-v 1 -b 1 -e 1e-5" (Note: we used version 2.2.19 - earlier or later versions may not work properly).
 

下面是詳細的blastx幫助文檔,以供查閱:

$ blastx -help
USAGE
  blastx [-h] [-help] [-import_search_strategy filename]
    [-export_search_strategy filename] [-task task_name] [-db database_name]
    [-dbsize num_letters] [-gilist filename] [-seqidlist filename]
    [-negative_gilist filename] [-negative_seqidlist filename]
    [-entrez_query entrez_query] [-db_soft_mask filtering_algorithm]
    [-db_hard_mask filtering_algorithm] [-subject subject_input_file]
    [-subject_loc range] [-query input_file] [-out output_file]
    [-evalue evalue] [-word_size int_value] [-gapopen open_penalty]
    [-gapextend extend_penalty] [-qcov_hsp_perc float_value]
    [-max_hsps int_value] [-xdrop_ungap float_value] [-xdrop_gap float_value]
    [-xdrop_gap_final float_value] [-searchsp int_value]
    [-sum_stats bool_value] [-max_intron_length length] [-seg SEG_options]
    [-soft_masking soft_masking] [-matrix matrix_name]
    [-threshold float_value] [-culling_limit int_value]
    [-best_hit_overhang float_value] [-best_hit_score_edge float_value]
    [-window_size int_value] [-ungapped] [-lcase_masking] [-query_loc range]
    [-strand strand] [-parse_deflines] [-query_gencode int_value]
    [-outfmt format] [-show_gis] [-num_descriptions int_value]
    [-num_alignments int_value] [-line_length line_length] [-html]
    [-max_target_seqs num_sequences] [-num_threads int_value] [-remote]
    [-comp_based_stats compo] [-use_sw_tback] [-version]

DESCRIPTION
   Translated Query-Protein Subject BLAST 2.7.1+

OPTIONAL ARGUMENTS
 -h
   Print USAGE and DESCRIPTION;  ignore all other parameters
 -help
   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
 -version
   Print version number;  ignore other arguments

 *** Input query options
 -query <File_In>
   Input file name
   Default = `-'
 -query_loc <String>
   Location on the query sequence in 1-based offsets (Format: start-stop)
 -strand <String, `both', `minus', `plus'>
   Query strand(s) to search against database/subject
   Default = `both'
 -query_gencode <Integer, values between: 1-6, 9-16, 21-25>
   Genetic code to use to translate query (see user manual for details)
   Default = `1'

 *** General search options
 -task <String, Permissible values: 'blastx' 'blastx-fast' >
   Task to execute
   Default = `blastx'
 -db <String>
   BLAST database name
    * Incompatible with:  subject, subject_loc
 -out <File_Out>
   Output file name
   Default = `-'
 -evalue <Real>
   Expectation value (E) threshold for saving hits
   Default = `10'
 -word_size <Integer, >=2>
   Word size for wordfinder algorithm
 -gapopen <Integer>
   Cost to open a gap
 -gapextend <Integer>
   Cost to extend a gap
 -max_intron_length <Integer, >=0>
   Length of the largest intron allowed in a translated nucleotide sequence
   when linking multiple distinct alignments
   Default = `0'
 -matrix <String>
   Scoring matrix name (normally BLOSUM62)
 -threshold <Real, >=0>
   Minimum word score such that the word is added to the BLAST lookup table
 -comp_based_stats <String>
   Use composition-based statistics:
       D or d: default (equivalent to 2 )
       0 or F or f: No composition-based statistics
       1: Composition-based statistics as in NAR 29:2994-3005, 2001
       2 or T or t : Composition-based score adjustment as in Bioinformatics
   21:902-911,
       2005, conditioned on sequence properties
       3: Composition-based score adjustment as in Bioinformatics 21:902-911,
       2005, unconditionally
   Default = `2'

 *** BLAST-2-Sequences options
 -subject <File_In>
   Subject sequence(s) to search
    * Incompatible with:  db, gilist, seqidlist, negative_gilist,
   negative_seqidlist, db_soft_mask, db_hard_mask
 -subject_loc <String>
   Location on the subject sequence in 1-based offsets (Format: start-stop)
    * Incompatible with:  db, gilist, seqidlist, negative_gilist,
   negative_seqidlist, db_soft_mask, db_hard_mask, remote

 *** Formatting options
 -outfmt <String>
   alignment view options:
     0 = Pairwise,
     1 = Query-anchored showing identities,
     2 = Query-anchored no identities,
     3 = Flat query-anchored showing identities,
     4 = Flat query-anchored no identities,
     5 = BLAST XML,
     6 = Tabular,
     7 = Tabular with comment lines,
     8 = Seqalign (Text ASN.1),
     9 = Seqalign (Binary ASN.1),
    10 = Comma-separated values,
    11 = BLAST archive (ASN.1),
    12 = Seqalign (JSON),
    13 = Multiple-file BLAST JSON,
    14 = Multiple-file BLAST XML2,
    15 = Single-file BLAST JSON,
    16 = Single-file BLAST XML2,
    18 = Organism Report

   Options 6, 7 and 10 can be additionally configured to produce
   a custom format specified by space delimited format specifiers.
   The supported format specifiers are:
           qseqid means Query Seq-id
              qgi means Query GI
             qacc means Query accesion
          qaccver means Query accesion.version
             qlen means Query sequence length
           sseqid means Subject Seq-id
        sallseqid means All subject Seq-id(s), separated by a ';'
              sgi means Subject GI
           sallgi means All subject GIs
             sacc means Subject accession
          saccver means Subject accession.version
          sallacc means All subject accessions
             slen means Subject sequence length
           qstart means Start of alignment in query
             qend means End of alignment in query
           sstart means Start of alignment in subject
             send means End of alignment in subject
             qseq means Aligned part of query sequence
             sseq means Aligned part of subject sequence
           evalue means Expect value
         bitscore means Bit score
            score means Raw score
           length means Alignment length
           pident means Percentage of identical matches
           nident means Number of identical matches
         mismatch means Number of mismatches
         positive means Number of positive-scoring matches
          gapopen means Number of gap openings
             gaps means Total number of gaps
             ppos means Percentage of positive-scoring matches
           frames means Query and subject frames separated by a '/'
           qframe means Query frame
           sframe means Subject frame
             btop means Blast traceback operations (BTOP)
           staxid means Subject Taxonomy ID
         ssciname means Subject Scientific Name
         scomname means Subject Common Name
       sblastname means Subject Blast Name
        sskingdom means Subject Super Kingdom
          staxids means unique Subject Taxonomy ID(s), separated by a ';'
                (in numerical order)
        sscinames means unique Subject Scientific Name(s), separated by a ';'
        scomnames means unique Subject Common Name(s), separated by a ';'
       sblastnames means unique Subject Blast Name(s), separated by a ';'
                (in alphabetical order)
       sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
                (in alphabetical order)
           stitle means Subject Title
       salltitles means All Subject Title(s), separated by a '<>'
          sstrand means Subject Strand
            qcovs means Query Coverage Per Subject
          qcovhsp means Query Coverage Per HSP
           qcovus means Query Coverage Per Unique Subject (blastn only)
   When not provided, the default value is:
   'qaccver saccver pident length mismatch gapopen qstart qend sstart send
   evalue bitscore', which is equivalent to the keyword 'std'
   Default = `0'
 -show_gis
   Show NCBI GIs in deflines?
 -num_descriptions <Integer, >=0>
   Number of database sequences to show one-line descriptions for
   Not applicable for outfmt > 4
   Default = `500'
    * Incompatible with:  max_target_seqs
 -num_alignments <Integer, >=0>
   Number of database sequences to show alignments for
   Default = `250'
    * Incompatible with:  max_target_seqs
 -line_length <Integer, >=1>
   Line length for formatting alignments
   Not applicable for outfmt > 4
   Default = `60'
 -html
   Produce HTML output?

 *** Query filtering options
 -seg <String>
   Filter query sequence with SEG (Format: 'yes', 'window locut hicut', or
   'no' to disable)
   Default = `12 2.2 2.5'
 -soft_masking <Boolean>
   Apply filtering locations as soft masks
   Default = `false'
 -lcase_masking
   Use lower case filtering in query and subject sequence(s)?

 *** Restrict search or results
 -gilist <String>
   Restrict search of database to list of GI's
    * Incompatible with:  negative_gilist, seqidlist, negative_seqidlist,
   remote, subject, subject_loc
 -seqidlist <String>
   Restrict search of database to list of SeqId's
    * Incompatible with:  gilist, negative_gilist, negative_seqidlist, remote,
   subject, subject_loc
 -negative_gilist <String>
   Restrict search of database to everything except the listed GIs
    * Incompatible with:  gilist, seqidlist, remote, subject, subject_loc
 -negative_seqidlist <String>
   Restrict search of database to everything except the listed SeqIDs
    * Incompatible with:  gilist, seqidlist, remote, subject, subject_loc
 -entrez_query <String>
   Restrict search with the given Entrez query
    * Requires:  remote
 -db_soft_mask <String>
   Filtering algorithm ID to apply to the BLAST database as soft masking
    * Incompatible with:  db_hard_mask, subject, subject_loc
 -db_hard_mask <String>
   Filtering algorithm ID to apply to the BLAST database as hard masking
    * Incompatible with:  db_soft_mask, subject, subject_loc
 -qcov_hsp_perc <Real, 0..100>
   Percent query coverage per hsp
 -max_hsps <Integer, >=1>
   Set maximum number of HSPs per subject sequence to save for each query
 -culling_limit <Integer, >=0>
   If the query range of a hit is enveloped by that of at least this many
   higher-scoring hits, delete the hit
    * Incompatible with:  best_hit_overhang, best_hit_score_edge
 -best_hit_overhang <Real, (>0 and <0.5)>
   Best Hit algorithm overhang value (recommended value: 0.1)
    * Incompatible with:  culling_limit
 -best_hit_score_edge <Real, (>0 and <0.5)>
   Best Hit algorithm score edge value (recommended value: 0.1)
    * Incompatible with:  culling_limit
 -max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep
   Not applicable for outfmt <= 4
   Default = `500'
    * Incompatible with:  num_descriptions, num_alignments

 *** Statistical options
 -dbsize <Int8>
   Effective length of the database
 -searchsp <Int8, >=0>
   Effective length of the search space
 -sum_stats <Boolean>
   Use sum statistics

 *** Search strategy options
 -import_search_strategy <File_In>
   Search strategy to use
    * Incompatible with:  export_search_strategy
 -export_search_strategy <File_Out>
   File name to record the search strategy used
    * Incompatible with:  import_search_strategy

 *** Extension options
 -xdrop_ungap <Real>
   X-dropoff value (in bits) for ungapped extensions
 -xdrop_gap <Real>
   X-dropoff value (in bits) for preliminary gapped extensions
 -xdrop_gap_final <Real>
   X-dropoff value (in bits) for final gapped alignment
 -window_size <Integer, >=0>
   Multiple hits window size, use 0 to specify 1-hit algorithm
 -ungapped
   Perform ungapped alignment only?

 *** Miscellaneous options
 -parse_deflines
   Should the query and subject defline(s) be parsed?
 -num_threads <Integer, (>=1 and =<24)>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote
 -remote
   Execute search remotely?
    * Incompatible with:  gilist, seqidlist, negative_gilist,
   negative_seqidlist, subject_loc, num_threads
 -use_sw_tback
   Compute locally optimal Smith-Waterman alignments?
 

以下是copy的詳細英文教學:

1. Quick Start

  • Get all numbered files for a database with the same base name: Each of these files represents a subset (volume) of that database, and all of them are needed to reconstitute the database.
  • After extraction, there is no need to concatenate the resulting files:Call the database with the base name, for nr database files, use "-db nr". 這些數據庫是已經預先進行過makeblastdb命令的,下載后可以直接使用
  • For easy download, use the update_blastdb.pl script from the blast+ package.
  • Incremental update is not available.

2. General Introduction

BLAST search pages under the Basic BLAST section of the NCBI BLAST home page(http://blast.ncbi.nlm.nih.gov/) use a standard set of BLAST databases for nucleotide, protein, and translated BLAST searches.  These databases are made
available as compressed archives of pre-formatted form) and can be donwloaed from the /db directory of the BLAST ftp site (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). The FASTA files reside under the /FASTA subdirectory.

The pre-formatted databases offer the following advantages:

  • Pre-formatting removes the need to run makeblastdb; 無需再運行建庫命令行
  • Species-level taxonomy ids are included for each database entry;
  • Databases are broken into smaller-sized volumes and are therefore easier to download;
  • Sequences in FASTA format can be generated from the pre-formatted databases by using the blastdbcmd utility;可以從這些數據庫文件中導出FASTA文件
  • A convenient script (update_blastdb.pl) is available in the blast+ package to download the pre-formatted databases. 可用該腳本升級數據庫

Pre-formatted databases must be downloaded using the update_blastdb.pl script or via FTP in binary mode. Documentation for this script can be obtained by running the script without any arguments; Perl installation is required.

The compressed files downloaded must be inflated with gzip or other decompress tools. The BLAST database files can then be extracted out of the resulting tar file using the tar utility on Unix/Linux, or WinZip and StuffIt Expander on
Windows and Macintosh platforms, respectively.  下載的數據庫為壓縮包,要解壓縮

Large databases are formatted in multiple one-gigabyte volumes, which are named using the basename.##.tar.gz convention. All volumes with the same base name are required. An alias file is provided to tie individual volumes together so that the database can be called using the base name (without the .nal or .pal extension). For example, to call the est database, simply use "-db est" option in the command line (without the quotes). 大的數據庫通常分為多個壓縮包,例如nr庫有11個壓縮包。所有的相關壓縮包都要下載,解壓。解壓縮會生成對應的庫文件,同時生成一個nr.pal文件。檢索nr庫時輸入-d nr 即可。

Additional BLAST databases that are not provided in pre-formatted formats may be available in the FASTA subdirectory. For other genomic BLAST databases, please check the genomes ftp directory at:    ftp://ftp.ncbi.nlm.nih.gov/genomes/

3. Contents of the /blast/db/ directory

The pre-formatted BLAST databases are archived in this directory. The names of these databases and their contents are listed below.

+-----------------------------+------------------------------------------------+
 File Name        #  Content Description   
+-----------------------------+------------------------------------------------+
16SMicrobial.tar.gz          #  Bacterial and Archaeal 16S rRNA sequences from BioProjects 33175 and 33117
FASTA/        #  Subdirectory for FASTA formatted sequences
README        #  README for this subdirectory (this file)
Representative_Genomes.*tar.gz        #  Representative bacterial/archaeal genomes database
cdd_delta.tar.gz          #  Conserved Domain Database sequences for use with stand alone deltablast
cloud/          #  Subdirectory of databases for BLAST AMI; see http://1.usa.gov/TJAnEt
env_nr.*tar.gz        #  Protein sequences for metagenomes
env_nt.*tar.gz        #  Nucleotide sequences for metagenomes
est.tar.gz        #  This file requires est_human.*.tar.gz, est_mouse.*.tar.gz, and est_others.*.tar.gz files to function. It contains the est.nal alias so that searches against est (-db est) will include est_human, est_mouse and est_others. 
est_human.*.tar.gz        #  Human subset of the est database from the est division of GenBank, EMBL and DDBJ.
est_mouse.*.tar.gz        #  Mouse subset of the est databasae
est_others.*.tar.gz           #  Non-human and non-mouse subset of the est database
gss.*tar.gz           #  Sequences from the GSS division of GenBank, EMBL, and DDBJ
htgs.*tar.gz          #  Sequences from the HTG division of GenBank, EMBL,and DDBJ
human_genomic.*tar.gz         #  Human RefSeq (NC_) chromosome records with gap adjusted concatenated NT_ contigs
nr.*tar.gz        #  Non-redundant protein sequences from GenPept, Swissprot, PIR, PDF, PDB, and NCBI RefSeq
nt.*tar.gz        #  Partially non-redundant nucleotide sequences from all traditional divisions of GenBank, EMBL, and DDBJ excluding GSS,STS, PAT, EST, HTG, and WGS.
other_genomic.*tar.gz         #  RefSeq chromosome records (NC_) for non-human organisms
pataa.*tar.gz         #  Patent protein sequences
patnt.*tar.gz         #  Patent nucleotide sequences. Both patent databases are directly from the USPTO, or from the EPO/JPO via EMBL/DDBJ
pdbaa.*tar.gz         #  Sequences for the protein structure from the Protein Data Bank
pdbnt.*tar.gz         #  Sequences for the nucleotide structure from the Protein Data Bank. They are NOT the protein coding sequences for the corresponding pdbaa entries.
refseq_genomic.*tar.gz        #  NCBI genomic reference sequences
refseq_protein.*tar.gz        #  NCBI protein reference sequences
refseq_rna.*tar.gz        #  NCBI Transcript reference sequences
sts.*tar.gz           #  Sequences from the STS division of GenBank, EMBL,and DDBJ
swissprot.tar.gz          #  Swiss-Prot sequence database (last major update)
taxdb.tar.gz          #  Additional taxonomy information for the databases listed here providing common and scientific names
tsa_nt.*tar.gz        #  Sequences from the TSA division of GenBank, EMBL,and DDBJ
vector.tar.gz         #  Vector sequences from 2010, see Note 2 in section 4.
wgs.*tar.gz           #  Sequences from Whole Genome Shotgun assemblies
+-----------------------------+------------------------------------------------+

4. Contents of the /blast/db/FASTA directory

This directory contains FASTA formatted sequence files. The file names and database contents are listed below. These files must be unpacked and processed through blastdbcmd before they can be used by the BLAST programs.

+-----------------------+-----------------------------------------------------+
File Name          #  Content Description         # 
+-----------------------+-----------------------------------------------------+
alu.a.gz        #  translation of alu.n repeats
alu.n.gz        #  alu repeat elements (from 2003)
drosoph.aa.gz           #  CDS translations from drosophila.nt  
drosoph.nt.gz           #  genomic sequences for drosophila (from 2003)
env_nr.gz*          #  Protein sequences for metagenomes, taxid 408169
env_nt.gz*          #  Nucleotide sequences for metagenomes, taxid 408169
est_human.gz*           #  human subset of the est database (see Note 1)
est_mouse.gz*           #  mouse subset of the est database
est_others.gz*          #  non-human and non-mouse subset of the est database
gss.gz*         #  sequences from the GSS division of GenBank, EMBL,   and DDBJ
htgs.gz*        #  sequences from the HTG division of GenBank, EMBL,   and DDBJ 
human_genomic.gz*           #  human RefSeq (NC_) chromosome records  with gap adjusted concatenated NT_ contigs 
igSeqNt.gz          #  human and mouse immunoglobulin variable region   nucleotide sequences
igSeqProt.gz        #  human and mouse immunoglobulin variable region   protein sequences
mito.aa.gz          #  CDS translations of complete mitochondrial genomes
mito.nt.gz          #  complete mitochondrial genomes
nr.gz*          #  non-redundant protein sequence database with entries from GenPept, Swissprot, PIR, PDF, PDB, and RefSeq
nt.gz*          #  nucleotide sequence database, with entries from all traditional divisions of GenBank, EMBL, and DDBJ; excluding bulk divisions (gss, sts, pat, est, htg) and wgs entries. Partially non-redundant.
other_genomic.gz*           #  RefSeq chromosome records (NC_) for organisms other than human
pataa.gz*           #  patent protein sequences
patnt.gz*           #  patent nucleotide sequences. Both patent sequence   files are from the USPTO, or EPO/JPO via EMBL/DDBJ
pdbaa.gz*           #  protein sequences from pdb protein structures
pdbnt.gz*           #  nucleotide sequences from pdb nucleic acid structures. They are NOT the protein coding sequences for the corresponding pdbaa entries.
sts.gz*         #  database for sequence tag site entries 
swissprot.gz*           #  swiss-prot database (last major release)
vector.gz           #  vector sequences from 2010. (See Note 2)
wgs.gz*         #  whole genome shotgun genome assemblies
yeast.aa.gz         #  protein translations from yeast.nt
yeast.nt.gz         #  yeast genomes (from 2003)
+-----------------------+---------------------------------------------------+


NOTE:
(1) NCBI does not provide the complete est database in FASTA format. One  needs to get all three subsets (est_human, est_mouse, and est_others and concatenate them into the complete est fasta database).
(2) For screening for vector contamination, use the UniVec database: ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/
*  marked files have pre-formatted counterparts.

5. Database updates

The BLAST databases are updated regularly. There is no established incremental pdate scheme. We recommend downloading the complete databases regularly to keep their content current.

6. Non-redundant defline syntax

The non-redundant databases are nr, nt (partially) and pataa. In them, identical sequences are merged into one entry. To be merged two sequences must have identical lengths and every residue at every position must be the
same.  The FASTA deflines for the different entries that belong to one record are separated by control-A characters invisible to most programs. In the example below both entries gi|1469284 and gi|1477453 have the same sequence, in every respect:

>gi|3023276|sp|Q57293|AFUC_ACTPL   Ferric transport ATP-binding protein afuC 
^Agi|1469284|gb|AAB05030.1|   afuC gene product ^Agi|1477453|gb|AAB17216.1|   
afuC [Actinobacillus pleuropneumoniae]
MNNDFLVLKNITKSFGKATVIDNLDLVIKRGTMVTLLGPSGCGKTTVLRLVAGLENPTSGQIFIDGEDVT
KSSIQNRDICIVFQSYALFPHMSIGDNVGYGLRMQGVSNEERKQRVKEALELVDLAGFADRFVDQISGGQ
QQRVALARALVLKPKVLILDEPLSNLDANLRRSMREKIRELQQRLGITSLYVTHDQTEAFAVSDEVIVMN
KGTIMQKARQKIFIYDRILYSLRNFMGESTICDGNLNQGTVSIGDYRFPLHNAADFSVADGACLVGVRPE
AIRLTATGETSQRCQIKSAVYMGNHWEIVANWNGKDVLINANPDQFDPDATKAFIHFTEQGIFLLNKE

The syntax of sequence header lines used by the NCBI BLAST server depends on the database from which each sequence was obtained.  The table at http://www.ncbi.nlm.nih.gov/toolkit/doc/book/ch_demo/?report=objectonly#ch_demo.T5 lists the supported FASTA identifiers. 有些BLAST數據庫沒有提供預先建庫的文件,這些數據庫可以從FASTA文件夾里下載

For databases whose entries are not from official NCBI sequence databases, such as Trace database, the gnl| convention is used. For custom databases, this convention should be followed and the id for each sequence must be
unique, if one would like to take the advantage of indexed database, which enables specific sequence retrieval using blastdbcmd program included in the blast executable package.  One should refer to documents distributed in the standalone BLAST package for more details.

7. Formatting a FASTA file into a BLASTable database

FASTA files need to be formatted with makeblastdb before they can be used in local blast search. For those from NCBI, the following makeblastdb commands are recommended:

For nucleotide fasta file:  

makeblastdb -in input_db -dbtype nucl -parse_seqids

For protein fasta file:     

makeblastdb -in input_db -dbtype prot -parse_seqids

 

                         


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM