QIIME1 OTU Picking


QIIME itself does not implement any clustering algorithm; it is only a wrapper around other OTU-picking tools.
Based on the clustering strategy, there are three approaches:
de novo:            pick_de_novo_otus.py
closed-reference:   pick_closed_reference_otus.py
open-reference:     pick_open_reference_otus.py
 
Pros and cons of the three approaches:
de novo: pick_de_novo_otus.py
Pros: all reads are clustered
Cons: cannot run in parallel, so it is slow; with more than 10M reads it becomes very slow
Use case: studying uncommon marker genes
 
closed-reference: pick_closed_reference_otus.py
Reads are aligned against a reference database, and reads that do not hit the database are discarded. The reference sequences carry taxonomy annotations, which makes taxonomy assignment straightforward.
Pros: fully parallelizable, so it is fast; better trees and taxonomy, since the OTUs in the reference database are already well characterized
Cons: cannot detect taxa that are absent from the reference database
Because reads that don’t hit the reference sequence collection are discarded, your analyses only focus on the diversity that you “already know about”
 
open-reference OTU: pick_open_reference_otus.py
Reads are first aligned against the reference database; the reads that fail to hit it are then clustered de novo.
Open-reference OTU picking is the recommended strategy.
Pros: all reads are clustered; partially parallelizable, so it is reasonably fast
Cons: when many novel taxa are present, it can become slow
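For orientation, minimal invocations of the three scripts might look like the following sketch (seqs.fna and refseqs.fna are placeholder file names; closed-reference picking requires a reference FASTA via -r):

# de novo: every read is clustered, no reference database needed
pick_de_novo_otus.py -i seqs.fna -o denovo_otus/
# closed-reference: reads that miss the reference are discarded
pick_closed_reference_otus.py -i seqs.fna -r refseqs.fna -o closedref_otus/
# open-reference (recommended): closed-reference first, then de novo on the failures
pick_open_reference_otus.py -i seqs.fna -r refseqs.fna -o openref_otus/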
 
The strategy we use most often is open-reference OTU picking, via the script pick_open_reference_otus.py.
It can be viewed as a pipeline with 6 steps: the first 4 steps perform the OTU clustering, and the last 2 produce the OTU table and the clustering tree.
 
Step 1) Prefiltering and picking closed reference OTUs
The first step is an optional prefiltering of the input fasta file to remove
sequences that do not hit the reference database with a given sequence
identity (PREFILTER_PERCENT_ID). This step can take a very long time, so is
disabled by default. The prefilter parameters can be changed with the options:
--prefilter_refseqs_fp
--prefilter_percent_id
This filtering is accomplished by picking closed reference OTUs at the specified
prefilter percent id to produce:
prefilter_otus/seqs_otus.log
prefilter_otus/seqs_otus.txt
prefilter_otus/seqs_failures.txt
prefilter_otus/seqs_clusters.uc
Next, the seqs_failures.txt file is used to remove these failed sequences from
the original input fasta file to produce:
prefilter_otus/prefiltered_seqs.fna
This prefiltered_seqs.fna file is then considered to contain the reads
of the marker gene of interest, rather than spurious reads such as host
genomic sequence or sequencing artifacts.
 
First, the input sequences are optionally prefiltered: given an alignment identity threshold, closed-reference OTU picking is used to discard input sequences that do not hit the reference database.
If the prefilter is run, it produces prefilter_otus/prefiltered_seqs.fna; if not, the original input.fasta goes directly into the next step.
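If you do want the prefilter, the two options listed above control it. A hedged sketch (0.60 is only an illustrative identity threshold, prefilter_refs.fna is a placeholder, and when --prefilter_refseqs_fp is omitted the main reference is presumably reused):

pick_open_reference_otus.py -i seqs.fna -r refseqs.fna -o otus/ --prefilter_percent_id 0.60 --prefilter_refseqs_fp prefilter_refs.fna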
 
If prefiltering is applied, this step progresses with the prefiltered_seqs.fna.
Otherwise it progresses with the input file. The Step 1 closed reference OTU
picking is done against the supplied reference database. This command produces:
step1_otus/_clusters.uc
step1_otus/_failures.txt
step1_otus/_otus.log
step1_otus/_otus.txt
 
Closed-reference OTU picking is then performed against the reference database.
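The *_otus.txt files are OTU maps: one OTU per line, the OTU identifier followed by the tab-separated IDs of the reads assigned to it. A made-up two-OTU example for illustration:

ref_otu_1    sample1_12    sample1_87    sample2_3
ref_otu_2    sample2_41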
 
The representative sequence for each of the Step 1 picked OTUs are selected to
produce:
step1_otus/step1_rep_set.fna
 
Next, the sequences that failed to hit the reference database in Step 1 are
filtered from the Step 1 input fasta file to produce:
step1_otus/failures.fasta
 
Then the failures.fasta file is randomly subsampled to PERCENT_SUBSAMPLE of
the sequences to produce:
step1_otus/subsampled_failures.fna.
Modifying PERCENT_SUBSAMPLE can have a big effect on run time for this workflow,
but will not alter the final OTUs.
 
Reads that fail to hit the reference database end up in step1_otus/failures.fasta; a random subsample of these reads is also drawn to produce step1_otus/subsampled_failures.fna.
Adjusting the PERCENT_SUBSAMPLE parameter can shorten the run time (it does not change the final OTUs).
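The subsample fraction is set on the command line; assuming the flag is -s/--percent_subsample (as in the QIIME 1.9 script), a sketch that clusters only 10% of the Step 1 failures de novo in Step 2:

pick_open_reference_otus.py -i seqs.fna -o otus/ -s 0.1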
 
 
 
Step 2) The subsampled_failures.fna are next clustered de novo, and each cluster
centroid is then chosen as a "new reference sequence" for use as the reference
database in Step 3, to produce:
step2_otus/subsampled_seqs_clusters.uc
step2_otus/subsampled_seqs_otus.log
step2_otus/subsampled_seqs_otus.txt
step2_otus/step2_rep_set.fna
 
The step1_otus/subsampled_failures.fna file produced in Step 1 is clustered de novo, and the cluster centroids become the new reference sequences.
 
Step 3) Pick Closed Reference OTUs against Step 2 de novo OTUs
Closed reference OTU picking is performed using the failures.fasta file created
in Step 1 against the 'reference' de novo database created in Step 2 to produce:
step3_otus/failures_seqs_clusters.uc
step3_otus/failures_seqs_failures.txt
step3_otus/failures_seqs_otus.log
step3_otus/failures_seqs_otus.txt
 
step1_otus/failures.fasta is aligned against step2_otus/step2_rep_set.fna (closed-reference picking against the Step 2 de novo OTUs).
 
Assuming the user has NOT passed the --suppress_step4 flag:
The sequences which failed to hit the reference database in Step 3 are removed
from the Step 3 input fasta file to produce:
step3_otus/failures_failures.fasta
 
Sequences that still fail to hit this reference end up in step3_otus/failures_failures.fasta.
 
 
Step 4) Additional de novo OTU picking
It is assumed by this point that the majority of sequences have been assigned
to an OTU, and thus the sequence count of failures_failures.fasta is small
enough that de novo OTU picking is computationally feasible. However, depending
on the sequences being used, it might be that the failures_failures.fasta file
is still prohibitively large for de novo clustering, and the jobs might take
too long to finish. In this case it is likely that the user would want to pass
the --suppress_step4 flag to avoid this additional de novo step.
 
A final round of de novo OTU picking is done on the failures_failures.fasta file
to produce:
step4_otus/failures_failures_cluster.uc
step4_otus/failures_failures_otus.log
step4_otus/failures_failures_otus.txt
 
The failures_failures.fasta file produced in Step 3 is clustered de novo one more time.
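If failures_failures.fasta is still too large for de novo clustering, the flag mentioned above skips this step; a sketch:

pick_open_reference_otus.py -i seqs.fna -o otus/ --suppress_step4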
 
Step 5) Produce the final OTU map and rep set
If Step 4 is completed, the OTU maps from Step 1, Step 3, and Step 4 are
concatenated to produce:
final_otu_map.txt
 
If Step 4 was run, the OTU maps from Steps 1, 3, and 4 are concatenated to produce final_otu_map.txt.
 
If Step 4 was not completed, the OTU maps from Steps 1 and Step 3 are
concatenated together to produce:
final_otu_map.txt
 
If Step 4 was not run, the OTU maps from Steps 1 and 3 are concatenated to produce final_otu_map.txt.
 
Next, the minimum OTU size required to keep an OTU is specified with
the --min_otu_size flag. For example, if the user leaves --min_otu_size at the
default value of 2, requiring each OTU to contain at least 2 sequences, any
OTUs which fail to meet this criterion are removed from the
final_otu_map.txt to produce:
final_otu_map_mc2.txt
 
If --min_otu_size 10 was passed, it would produce:
final_otu_map_mc10.txt
 
The final_otu_map_mc2.txt is used to build the final representative set:
rep_set.fna
 
--min_otu_size filters the OTUs, producing final_otu_map_mc2.txt and the corresponding representative sequence file rep_set.fna.
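The threshold is set on the command line; for example, keeping only OTUs with at least 10 reads (a sketch; the mcN suffix in the output file names tracks the threshold, as described above):

pick_open_reference_otus.py -i seqs.fna -o otus/ --min_otu_size 10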
 
Step 6) Making the OTU tables and trees
An OTU table is built using the final_otu_map_mc2.txt file to produce:
otu_table_mc2.biom
 
The OTU table otu_table_mc2.biom is built from final_otu_map_mc2.txt.
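To sanity-check the resulting table (per-sample read counts, number of OTUs), the biom command-line tool can summarize it; a sketch assuming the biom-format 2.x tool shipped with QIIME 1.9:

biom summarize-table -i otu_table_mc2.biom -o otu_table_mc2_summary.txt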
 
As long as the --suppress_taxonomy_assignment flag is NOT passed,
then taxonomy will be assigned to each of the representative sequences
in the final rep_set produced in Step 5, producing:
rep_set_tax_assignments.log
rep_set_tax_assignments.txt
This taxonomic metadata is then added to the otu_table_mc2.biom to produce:
otu_table_mc_w_tax.biom
 
Taxonomy is assigned to the OTU representative sequences, producing otu_table_mc_w_tax.biom.
 
As long as the --suppress_align_and_tree is NOT passed, then the rep_set.fna
file will be used to align the sequences and build the phylogenetic tree,
which includes the de novo OTUs. Any sequences that fail to align are
omitted from the OTU table and tree to produce:
otu_table_mc_no_pynast_failures.biom
rep_set.tre
 
The OTU representative sequences are multiply aligned and a phylogenetic tree is built, producing rep_set.tre.
 
If both --suppress_taxonomy_assignment and --suppress_align_and_tree are
NOT passed, the script will produce:
otu_table_mc_w_tax_no_pynast_failures.biom
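The final table and tree feed directly into downstream diversity analyses; a sketch using core_diversity_analyses.py, where map.txt (the sample mapping file) and the rarefaction depth 10000 are placeholders:

core_diversity_analyses.py -i otu_table_mc_w_tax_no_pynast_failures.biom -o core_div/ -m map.txt -t rep_set.tre -e 10000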
 
It is important to remember that with a large workflow script like this that
the user can jump into intermediate steps. For example, imagine that for some
reason the script was interrupted on Step 2, and the user did not want to go
through the process of re-picking OTUs as was done in Step 1. They can simply
rerun the script and pass in the:
--step_1_otu_map_fp
--step1_failures_fasta_fp
parameters, and the script will continue with Steps 2 - 4.
 
For a large workflow script like this, it is useful to be able to jump back in at an intermediate step without re-running the earlier ones.
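A hedged sketch of such a restart, using the option names exactly as quoted above (the paths are placeholders for the files written by the interrupted run):

pick_open_reference_otus.py -i seqs.fna -o otus/ --step_1_otu_map_fp otus/step1_otus/seqs_otus.txt --step1_failures_fasta_fp otus/step1_otus/failures.fasta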
 
**Note:** If most or all of your sequences are failing to hit the reference
during the prefiltering or closed-reference OTU picking steps, your sequences
may be in the reverse orientation with respect to your reference database. To
address this, you should add the following line to your parameters file
(creating one, if necessary) and pass this file as -p:
 
pick_otus:enable_rev_strand_match True
 
Be aware that this doubles the amount of memory used in these steps of the
workflow.
 
If a large fraction of the input sequences fail to hit the reference database, a likely cause is that they are in the reverse orientation (reverse complement) relative to the reference sequences; adding the parameter pick_otus:enable_rev_strand_match True addresses this.
Note, however, that this parameter doubles the memory usage of these steps.
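A parameters file is just a plain-text file of script:option value lines; a minimal sketch of creating one and passing it via -p:

echo "pick_otus:enable_rev_strand_match True" > rev_strand_params.txt
pick_open_reference_otus.py -i seqs.fna -o otus/ -p rev_strand_params.txt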
 
Basic usage:
pick_open_reference_otus.py -i $PWD/seqs1.fna -r $PWD/refseqs.fna -o $PWD/ucrss_sortmerna_sumaclust/ -p $PWD/ucrss_smr_suma_params.txt -m sortmerna_sumaclust

-i : input raw sequences, FASTA format
-r : reference sequences, FASTA format; defaults to the Greengenes reference /usr/local/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta
-o : output directory
-p : parameters file
-m : clustering method; choices are 'uclust', 'usearch61', 'sortmerna_sumaclust'; default is uclust
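Since -r has the Greengenes 97% reference as its default, the shortest form needs only the input and output (a sketch):

pick_open_reference_otus.py -i seqs.fna -o uclust_openref_otus/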

