使用mothur進行OTU聚類

本文轉載自查看原文 2017-06-09 15:40 4197 16S

微生物16S的OTU聚類工具有很多，最常用的就是 usearch、cdhit-OTU、mothur。

這些工具大多都是針對二代測序平台的，usearch的64bit版本是收費的。

如果要跑PacBio的OTU聚類，目前就只能用 mothur 了。

mothur有着非常詳細的說明文檔！

以下轉載一篇教程：原文鏈接：http://blog.sina.com.cn/s/blog_80572f5d0101ac2v.html

mothur 用來處理高通量測序結果或者克隆文庫的序列處理，這里進行OTU分類，就是排除克隆測序結果的重復序列，以97%的相似度進行分類，即cutoff=0.03

版本：mothur v.1.25.1 Last updated: 5/14/2012

mothur > unique.seqs(fasta=*.fasta)

Output File Names:
*.unique.fasta
*.names

mothur > align.seqs(candidate=*.unique.fasta,template=tmp.fasta,flip=T,processors=2)

Reading in the tmp.fasta template sequences... DONE.
It took 0 to read 1 sequences.
Aligning sequences from *.unique.fasta ...

Reading in the tmp.fasta template sequences... DONE.
It took 0 to read 1 sequences.
Some of your sequences generated alignments that eliminated too many bases, a list is provided in *.unique.flip.accnos. If the reverse compliment proved to be better it was reported.
It took 1 secs to align *** sequences.

Output File Names:
*.unique.align
*.unique.align.report
*.unique.flip.accnos

mothur > filter.seqs(fasta=*.unique.align)

Length of filtered alignment: ***

Number of columns removed: 0

Length of the original alignment: ***

Number of sequences used to construct filter: ***

Output File Names:
*.filter
*.unique.filter.fasta

mothur > dist.seqs(fasta=*.unique.filter.fasta,calc=onegap,countends=F,cutoff=0.03,output=lt)

Output File Name: g5gta.unique.filter.phylip.dist

It took 0 to calculate the distances for *** sequences.

mothur > cluster(phylip=*.unique.filter.phylip.dist,method=furthest,cutoff=0.03)

********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************
unique 2 124 2
0.01 3 89 15 3
0.02 5 69 7 10 0 3
0.03 8 60 7 7 3 1 0 0 2

Output File Names:
*.unique.filter.phylip.fn.sabund
*.unique.filter.phylip.fn.rabund
*.unique.filter.phylip.fn.list

It took 0 seconds to cluster

mothur > bin.seqs(fasta=*.fasta,name=*.names)

Using *.unique.filter.phylip.fn.list as input file for the list parameter.
unique
0.01
0.02
0.03

Output File Names:
*.unique.filter.phylip.fn.unique.fasta
*.unique.filter.phylip.fn.0.01.fasta
*.unique.filter.phylip.fn.0.02.fasta
*.unique.filter.phylip.fn.0.03.fasta

mothur > get.oturep(phylip=*.unique.filter.phylip.dist,fasta=*.fasta,list=*.unique.filter.phylip.fn.list,label=0.03)

Output File Names:
*.unique.filter.phylip.fn.0.03.rep.names
*.unique.filter.phylip.fn.0.03.rep.fasta

總結

mothur下載地址：http://www.mothur.org/wiki/Download_mothur
支持Mac，Windows以及Linux操作系統
首先要打開mothur -----cmd，進入命令行界面cd xxx 進入mothur.exe目錄，輸入mothur.exe 回車
假設該目錄內fasta格式文件叫做 Great.fasta
那么使用如下命令處理：

1.mothur > unique.seqs(fasta=Great.fasta)
2.mothur > dist.seqs(fasta=Great.unique.fasta,calc=onegap,countends=F,cutoff=0.03,output=lt)
3.mothur > cluster(phylip=Great.unique.phylip.dist,method=furthest,cutoff=0.03)
4.mothur > bin.seqs(fasta=Great.fasta,name=Great.names)
5.mothur > get.oturep(phylip=Great.unique.phylip.dist,fasta=Great.unique.fasta,list=Great.unique.phylip.fn.list,label=0.03)

最后打開Great.phylip.fn.0.03.rep.fasta 查看各OTU 的數目即可，即每個序列名最后的數字。
另外，還可以看到有多少個OTUs
Great.unique.phylip.fn.0.03.rep.names 這個文件顯示了每個OTU都有哪些序列，可以用記事本打開

Great.unique.phylip.fn.list 這個文件顯示了unique_0.01_0.02_0.03 分別有哪些類型的OTU,每一組用空格或Tab隔開，每組OTU內的各序列用"，" 隔開。

注意，cluster命令能讀取phylip矩陣，也能讀取column矩陣，如果讀取的是column，還需要提供一個names文件。另外，用unique.seqs命令,是為了生成names文件
如果在運行第1步的時候，提示[ERROR]:your sequences are not the same length, aborting.

那么需要運行一下命令：

先提供一個template fasta文件，以以上fasta文件中序列長度最普遍的某個為序列模板，新建一個fasta，命名為temp.fasta

a. mothur > unique.seqs(fasta=Great.fasta)
b. mothur > align.seqs(candidate=Great.unique.fasta,template=temp.fasta,flip=T,processors=2)
c. mothur > filter.seqs(fasta=Great.unique.align)

再把這里生成的Great.unique.filter.fasta文件更名為Great.fasta 進行以上5步處理。（盡量在一個新的mothur目錄內，打開一個新的mothur窗口，以免文件名沖突）

命令詳情

unique.seqs

identify the unique sequences in a collection and generate a names file

找出fasta文件中獨一無二的序列，即去冗余，在二代中有很多一模一樣的冗余序列，需要首先去掉，而在三代中這一步可以不跑。

這一步生成 All_Sample.unique.fasta 和 All_Sample.names

align.seqs

align sequences against a reference alignment 比對到ref上，等於是多重比對

The default threshold is 0.50, meaning if 50% of the bases are removed in the alignment process. The flip parameter is used to specify whether or not you want mothur to try the reverse complement of a sequence if the sequence falls below the threshold. The default is false. If the flip parameter is set to true the reverse complement of the sequence is aligned and the better alignment is reported.

template means ref

這一步生成 All_Sample.unique.align 、 All_Sample.unique.align.report 、 All_Sample.unique.flip.accnos

filter.seqs

filter positions out of an alignment 過濾

we can probably remove a lot of positions from the alignment to speed up our distance calculations.

加速距離計算

dist.seqs

generate a pairwise distance matrix，生成雙序列比對距離矩陣

This approach is better than the commonly used DNADIST because the distances are not stored in RAM, rather they are printed directly to a file. Furthermore, it is possible to ignore "large" distances that one might not be interested in.

得到了序列兩兩之間的距離

cluster

clustering sequences into OTUs based on genetic distance 遺傳距離

有多種聚類方法，選哪一種好呢？

The methods available in mothur include opticlust (opti), average neighbor (average), furthest neighbor (furthest), nearest neighbor (nearest), Vsearch agc (agc), Vsearch dgc (dgc). By default cluster() uses the opticlust algorithm.

bin.seqs

identify the OTU that each sequence belongs to

prints out a fasta-formatted file where sequences are ordered according to the OTU that they belong to. Such an output may be helpful for generating primers specific to an OTU or for classification of sequences.

get.oturep

identify a representative sequence from each OTU

While the bin.seqs command reports the OTU number for all sequences, the get.oturep command generates a fasta-formatted sequence file containing only a representative sequence for each OTU.

免責聲明！

本站轉載的文章為個人學習借鑒使用，本站對版權不負任何法律責任。如果侵犯了您的隱私權益，請聯系本站郵箱yoyou2525@163.com刪除。

猜您在找 【轉】使用scipy進行層次聚類和k-means聚類使用sklearn進行K_Means聚類算法使用K-means進行聚類，用calinski_harabaz_score評價聚類效果使用訓練好的word2vector進行文本聚類使用K-means和高斯混合模型對圖像進行聚類 python用sklearn進行聚類實踐使用 k-means 聚類算法對多維屬性數據進行分類在Python中使用K-Means聚類和PCA主成分分析進行圖像壓縮 [Go]使用Golang對鳶尾花數據集進行k-means聚類使用K均值算法進行聚類分析實戰數據集（注釋全）