Genome Assembly Tools: How to Use SOAPdenovo

Reposted article, 2016-06-22 11:01. Tag: genome assembly

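SOAPdenovo is a novel short-read assembly method that can build a de novo draft assembly for genomes as large as the human genome.

The program is designed specifically to assemble Illumina GA short reads. The newer version reduces memory consumption during graph construction, resolves more repeat regions during contig assembly, increases coverage and length in scaffold construction, improves gap closing, and is better suited to large genomes.

(SOAPdenovo was designed for assembling large plant and animal genomes, but it also works for bacteria and fungi. Assembling a genome of human size may require about 150 GB of memory.)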

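1. Configuration file

A large genome assembly project usually involves several libraries. The configuration file tells the assembler where each library's read files are, along with other per-library information.

It consists of a global section followed by one section per library.

Global section: max_rd_len. Any read longer than this value is cut to this length.

Each library section starts with the tag [LIB] and contains the following items: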

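1) avg_ins

The average insert size of the library, or the peak of its insert-size distribution. (Background: insert sizes are not strictly controlled; in theory they follow a roughly normal distribution.)

2) reverse_seq

Takes the value 0 or 1 and tells the assembler whether the read sequences need to be reverse-complemented. Illumina GA produces two types of paired-end libraries: forward-reverse and reverse-forward. Set "reverse_seq" as follows: 0 for forward-reverse (generated from fragmented DNA ends, with typical insert sizes under 500 bp); 1 for reverse-forward (generated from circularized libraries, with typical insert sizes of 2 kb or more).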

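3) asm_flags

Decides in which stage(s) the reads of this library are used: 1 (contig assembly only); 2 (scaffold assembly only); 3 (both contig and scaffold assembly); 4 (gap closure only).

4) rd_len_cutoff

The assembler trims the reads of the current library to this length.

5) rank

An integer that decides the order in which libraries are used during scaffold assembly. Libraries with the same rank are used at the same time.

6) pair_num_cutoff

The cutoff on the number of read pairs required for a reliable connection between two contigs or pre-scaffolds. The minimum is 3 for paired-end reads and 5 for mate-pair reads.

7) map_len

Takes effect in the "map" step. It is the minimum alignment length between a read and a contig required for a reliable read location. The minimum is 32 for paired-end reads and 35 for mate-pair reads.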

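The assembler accepts reads in three formats: FASTA, FASTQ and BAM.

Mate-pair relationship: a pair is either the two reads at the same position in two separate files, or two adjacent reads in a single FASTA file. BAM files are handled as a special case.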
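Because the pairing of two FASTQ files relies purely on read order, it can help to confirm that both files hold the same number of reads before assembling. The following is a minimal sketch, not part of SOAPdenovo itself: the function names fastq_reads and check_pair and the demo file names are hypothetical, and it assumes the common four-lines-per-record FASTQ layout.

```shell
# Count reads in a FASTQ file, assuming exactly 4 lines per record.
fastq_reads() {
  echo $(( $(wc -l < "$1") / 4 ))
}

# Succeed only when both mate files hold the same number of reads.
check_pair() {
  [ "$(fastq_reads "$1")" -eq "$(fastq_reads "$2")" ]
}

# Tiny hypothetical mate files so the sketch is self-contained.
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n' > demo_1.fq
printf '@r1/2\nTGCA\n+\nIIII\n@r2/2\nCCAT\n+\nIIII\n' > demo_2.fq

if check_pair demo_1.fq demo_2.fq; then
  echo "pairs consistent: $(fastq_reads demo_1.fq) reads per file"
else
  echo "read counts differ; the two files cannot be paired by position" >&2
fi
```

An equal read count is only a necessary condition; it does not prove that the reads are in the same order.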

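In the configuration file, single-end files are given as "f=/path/filename" for FASTA or "q=/path/filename" for FASTQ.

Paired-end reads stored in two FASTA files are given as "f1=" and "f2="; in two FASTQ files, as "q1=" and "q2=".

Paired-end reads stored together in a single FASTA file use the "p=" item; reads in a BAM file use "b=".

Most of the items above are optional. If you are unsure how to set one, leave it out and the assembler will use its default value.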
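To make the items above concrete, here is a sketch of a two-library configuration written from the shell. The file name soap.config, all paths, and the insert sizes are hypothetical placeholders; the values for pair_num_cutoff and map_len are simply the minimums quoted above, not tuned recommendations.

```shell
# Write a hypothetical two-library configuration file.
cat > soap.config <<'EOF'
# global setting: reads longer than this are cut to this length
max_rd_len=100

[LIB]
# short-insert paired-end library (forward-reverse)
avg_ins=200
reverse_seq=0
asm_flags=3
rd_len_cutoff=100
rank=1
pair_num_cutoff=3
map_len=32
q1=/path/to/pe200_1.fq
q2=/path/to/pe200_2.fq

[LIB]
# long-insert mate-pair library (reverse-forward), used for scaffolding only
avg_ins=2000
reverse_seq=1
asm_flags=2
rank=2
pair_num_cutoff=5
map_len=35
q1=/path/to/mp2k_1.fq
q2=/path/to/mp2k_2.fq
EOF

# Quick sanity check: count the library sections that were written.
grep -c '^\[LIB\]' soap.config
```

Giving the mate-pair library a higher rank means it is brought in after the short-insert library during scaffolding, which is the usual reason for ranking libraries by insert size.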

2. Commands and parameters

The usual one-step run:

${bin} all -s config_file -K 63 -R -o graph_prefix 1>ass.log 2>ass.err

Running in four steps:

${bin} pregraph -s config_file -K 63 -R -o graph_prefix 1>pregraph.log 2>pregraph.err
OR
${bin} sparse_pregraph -s config_file -K 63 -z 5000000000 -R -o graph_prefix 1>pregraph.log 2>pregraph.err

${bin} contig -g graph_prefix -R 1>contig.log 2>contig.err

${bin} map -s config_file -g graph_prefix 1>map.log 2>map.err

${bin} scaff -g graph_prefix -F 1>scaff.log 2>scaff.err
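When running step by step, each step depends on the output of the previous one, so it is worth stopping as soon as a step fails. The wrapper below is only a sketch: run_step and DRY_RUN are hypothetical names, and the binary name SOAPdenovo-63mer is an assumption about your installation. By default it prints the commands instead of running them.

```shell
# With DRY_RUN=1 (the default here) each command is printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
bin="${bin:-SOAPdenovo-63mer}"   # assumed binary name; adjust to your installation

# Run one assembly step, logging to <step>.log / <step>.err, and abort on failure.
run_step() {
  name=$1
  shift
  if [ "$DRY_RUN" = "1" ]; then
    echo "[$name] $*"
  else
    "$@" 1>"$name.log" 2>"$name.err" || {
      echo "$name failed, see $name.err" >&2
      exit 1
    }
  fi
}

run_step pregraph "$bin" pregraph -s config_file -K 63 -R -o graph_prefix
run_step contig   "$bin" contig -g graph_prefix -R
run_step map      "$bin" map -s config_file -g graph_prefix
run_step scaff    "$bin" scaff -g graph_prefix -F
```

Setting DRY_RUN=0 executes the four steps for real, using the same log file names as the commands above.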

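Parameters for all (pregraph-contig-map-scaff):

  -s <string>    configuration file

  -o <string>    output graph: prefix of the output graph file names

  -K <int>       kmer (min 13, max 63/127): kmer size, [23]

  -p <int>       number of CPUs to use, [8]

  -a <int>       initial memory assumption: memory reserved up front to avoid later reallocation, unit G, [0]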

  -d <int>       KmerFreqCutoff: kmers with frequency no larger than KmerFreqCutoff will be deleted, [0]

  -R (optional)  resolve repeats by reads, [NO]


  -D <int>       EdgeCovCutoff: edges with coverage no larger than EdgeCovCutoff will be deleted, [1]

  -M <int>       mergeLevel(min 0, max 3): the strength of merging similar sequences during contiging, [1]

  -m <int>       max k when using multi kmer

  -e <int>       weight to filter arc when linearize two edges(default 0)

  -r (optional)  keep available read(*.read)

  -E (optional)  merge clean bubble before iterate

  -f (optional)  output gap related reads in map step for using SRkgf to fill gap, [NO]

  -k <int>       kmer_R2C(min 13, max 63): kmer size used for mapping read to contig, [K]

  -F (optional)  fill gaps in scaffold, [NO]

  -u (optional)  un-mask contigs with high/low coverage before scaffolding, [mask]

  -w (optional)  keep contigs weakly connected to other contigs in scaffold, [NO]

  -G <int>       gapLenDiff: allowed length difference between estimated and filled gap, [50]

  -L <int>       minContigLen: shortest contig for scaffolding, [K+2]

  -c <float>     minContigCvg: minimum contig coverage (c*avgCvg), contigs shorter than 100bp with coverage smaller than c*avgCvg will be masked before scaffolding unless -u is set, [0.1]

  -C <float>     maxContigCvg: maximum contig coverage (C*avgCvg), contigs with coverage larger than C*avgCvg or contigs shorter than 100bp with coverage larger than 0.8*C*avgCvg will be masked before scaffolding unless -u is set, [2]

  -b <float>     insertSizeUpperBound: (b*avg_ins) will be used as upper bound of insert size for large insert size ( > 1000) when handling pair-end connections between contigs if b is set to larger than 1, [1.5]

  -B <float>     bubbleCoverage: remove the contig with lower coverage in a bubble structure if the coverage of both contigs is smaller than bubbleCoverage*avgCvg, [0.6]

  -N <int>       genome size, [0]

  -V (optional)  output visualization information for the assembly, [NO]
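The "max 63/127" limit on -K reflects the fact that the maximum k-mer size depends on which build of the program is used. The sketch below picks a binary from the chosen K and prints the resulting command. It assumes the installation provides binaries named SOAPdenovo-63mer and SOAPdenovo-127mer; pick_bin, soap.config and the thread count are hypothetical.

```shell
# Choose a SOAPdenovo binary for a given k-mer size.
# Assumption: SOAPdenovo-63mer and SOAPdenovo-127mer are both installed.
pick_bin() {
  k=$1
  if [ "$k" -lt 13 ] || [ "$k" -gt 127 ]; then
    echo "K must be between 13 and 127" >&2
    return 1
  elif [ "$k" -le 63 ]; then
    echo "SOAPdenovo-63mer"
  else
    echo "SOAPdenovo-127mer"
  fi
}

K=63
threads=8   # hypothetical value; equals the documented default of -p
bin=$(pick_bin "$K")

# Print the command rather than run it, so the sketch works without SOAPdenovo installed.
echo "$bin all -s soap.config -K $K -p $threads -R -o graph_prefix"
```

The smaller build is preferred whenever K fits within 63; this is a design choice of the sketch, made on the assumption that the 127-mer build needs more memory per k-mer.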


