微生物16S的OTU聚類工具有很多,最常用的就是 usearch、cdhit-OTU、mothur。
這些工具大多都是針對二代測序平台的,usearch的64bit版本是收費的。
如果要跑PacBio的OTU聚類,目前就只能用 mothur 了。
mothur有着非常詳細的說明文檔!
- General operations
- Sequence processing
- OTU-based approaches
- Hypothesis testing approaches
- Frequently asked questions
- Phylotype analysis
- Gui Tutorial
以下轉載一篇教程:原文鏈接:http://blog.sina.com.cn/s/blog_80572f5d0101ac2v.html
mothur 用來處理高通量測序結果或者克隆文庫的序列處理,這里進行OTU分類,就是排除克隆測序結果的重復序列,以97%的 相似度進行分類,即cutoff=0.03
版本:mothur v.1.25.1 Last updated: 5/14/2012
mothur > unique.seqs(fasta=*.fasta)
Output File Names:
*.unique.fasta
*.names
mothur > align.seqs(candidate=*.unique.fasta,template=tmp.fasta,flip=T,processors=2)
Reading in the tmp.fasta template sequences... DONE.
It took 0 to read 1 sequences.
Aligning sequences from *.unique.fasta ...
Reading in the tmp.fasta template sequences... DONE.
It took 0 to read 1 sequences.
Some of your sequences generated alignments that eliminated too many bases, a list is provided in *.unique.flip.accnos. If the reverse compliment proved to be better it was reported.
It took 1 secs to align *** sequences.
Output File Names:
*.unique.align
*.unique.align.report
*.unique.flip.accnos
mothur > filter.seqs(fasta=*.unique.align)
Length of filtered alignment: ***
Number of columns removed: 0
Length of the original alignment: ***
Number of sequences used to construct filter: ***
Output File Names:
*.filter
*.unique.filter.fasta
mothur > dist.seqs(fasta=*.unique.filter.fasta,calc=onegap,countends=F,cutoff=0.03,output=lt)
Output File Name: g5gta.unique.filter.phylip.dist
It took 0 to calculate the distances for *** sequences.
mothur > cluster(phylip=*.unique.filter.phylip.dist,method=furthest,cutoff=0.03)
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************
unique 2 124 2
0.01 3 89 15 3
0.02 5 69 7 10 0 3
0.03 8 60 7 7 3 1 0 0 2
Output File Names:
*.unique.filter.phylip.fn.sabund
*.unique.filter.phylip.fn.rabund
*.unique.filter.phylip.fn.list
It took 0 seconds to cluster
mothur > bin.seqs(fasta=*.fasta,name=*.names)
Using *.unique.filter.phylip.fn.list as input file for the list parameter.
unique
0.01
0.02
0.03
Output File Names:
*.unique.filter.phylip.fn.unique.fasta
*.unique.filter.phylip.fn.0.01.fasta
*.unique.filter.phylip.fn.0.02.fasta
*.unique.filter.phylip.fn.0.03.fasta
mothur > get.oturep(phylip=*.unique.filter.phylip.dist,fasta=*.fasta,list=*.unique.filter.phylip.fn.list,label=0.03)
********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************
0.03 80
Output File Names:
*.unique.filter.phylip.fn.0.03.rep.names
*.unique.filter.phylip.fn.0.03.rep.fasta
總結
mothur下載地址:http://www.mothur.org/wiki/Download_mothur
支持Mac,Windows以及Linux操作系統
首先要打開mothur -----cmd,進入命令行界面cd xxx 進入mothur.exe目錄,輸入mothur.exe 回車
假設該目錄內fasta格式文件叫做 Great.fasta
那么使用如下命令處理:
1.mothur > unique.seqs(fasta=Great.fasta) 2.mothur > dist.seqs(fasta=Great.unique.fasta,calc=onegap,countends=F,cutoff=0.03,output=lt) 3.mothur > cluster(phylip=Great.unique.phylip.dist,method=furthest,cutoff=0.03) 4.mothur > bin.seqs(fasta=Great.fasta,name=Great.names) 5.mothur > get.oturep(phylip=Great.unique.phylip.dist,fasta=Great.unique.fasta,list=Great.unique.phylip.fn.list,label=0.03)
最后打開Great.phylip.fn.0.03.rep.fasta 查看各OTU 的數目即可,即每個序列名最后的數字。
另外,還可以看到有多少個OTUs
Great.unique.phylip.fn.0.03.rep.names 這個文件顯示了每個OTU都有哪些序列,可以用記事本打開
Great.unique.phylip.fn.list 這個文件顯示了unique_0.01_0.02_0.03 分別有哪些類型的OTU,每一組用空格或Tab隔開,每組OTU內的各序列用"," 隔開。
注意,cluster命令能讀取phylip矩陣,也能讀取column矩陣,如果讀取的是column,還需要提供一個names文件。 另外,用unique.seqs命令,是為了生成names文件
如果在運行第1步的時候,提示[ERROR]:your sequences are not the same length, aborting.
那么需要運行一下命令:
先提供一個template fasta文件,以以上fasta文件中序列長度最普遍的某個為序列模板,新建一個fasta,命名為temp.fasta
a. mothur > unique.seqs(fasta=Great.fasta) b. mothur > align.seqs(candidate=Great.unique.fasta,template=temp.fasta,flip=T,processors=2) c. mothur > filter.seqs(fasta=Great.unique.align)
再把這里生成的Great.unique.filter.fasta文件更名為Great.fasta 進行以上5步處理。(盡量在一個新的mothur目錄內,打開一個新的mothur窗口,以免文件名沖突)
命令詳情
unique.seqs
identify the unique sequences in a collection and generate a names file
找出fasta文件中獨一無二的序列,即去冗余,在二代中有很多一模一樣的冗余序列,需要首先去掉,而在三代中這一步可以不跑。
這一步生成 All_Sample.unique.fasta 和 All_Sample.names
align.seqs
align sequences against a reference alignment 比對到ref上,等於是多重比對
The default threshold is 0.50, meaning if 50% of the bases are removed in the alignment process. The flip parameter is used to specify whether or not you want mothur to try the reverse complement of a sequence if the sequence falls below the threshold. The default is false. If the flip parameter is set to true the reverse complement of the sequence is aligned and the better alignment is reported.
template means ref
這一步生成 All_Sample.unique.align 、 All_Sample.unique.align.report 、 All_Sample.unique.flip.accnos
filter.seqs
filter positions out of an alignment 過濾
we can probably remove a lot of positions from the alignment to speed up our distance calculations.
加速距離計算
dist.seqs
generate a pairwise distance matrix,生成雙序列比對距離矩陣
This approach is better than the commonly used DNADIST because the distances are not stored in RAM, rather they are printed directly to a file. Furthermore, it is possible to ignore "large" distances that one might not be interested in.
得到了序列兩兩之間的距離
cluster
clustering sequences into OTUs based on genetic distance 遺傳距離
有多種聚類方法,選哪一種好呢?
The methods available in mothur include opticlust (opti), average neighbor (average), furthest neighbor (furthest), nearest neighbor (nearest), Vsearch agc (agc), Vsearch dgc (dgc). By default cluster() uses the opticlust algorithm.
bin.seqs
identify the OTU that each sequence belongs to
prints out a fasta-formatted file where sequences are ordered according to the OTU that they belong to. Such an output may be helpful for generating primers specific to an OTU or for classification of sequences.
get.oturep
identify a representative sequence from each OTU
While the bin.seqs command reports the OTU number for all sequences, the get.oturep command generates a fasta-formatted sequence file containing only a representative sequence for each OTU.
