BWA MEM算法


現在BWA大家基本上只用其mem算法了,無論是二代還是三代比對到參考基因組上,BWA應用得最多的就是在重測序方面。

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM - arXiv:1303.3997v2

摘要

BWA-MEM is a new alignment algorithm for aligning sequence reads or assembly contigs against a large reference
genome such as human.  MEM是bwa中最新的算法,基本取代了前兩種,目的是將測序reads或組裝的contig比對到Reference上。

It automatically chooses between local and end-to-end alignments, supports paired-end reads and performs chimeric alignment.   MEM自動選擇局部或全局比對,支持paired-end和嵌合體比對。

The algorithm is robust to sequencing errors and applicable to a wide range of sequence lengths from 70bp to a few megabases.   MEM算法很健壯,支持多種錯誤類型,可以應用到一系列長度的reads比對上,簡單來說就是可以比對短reads,又可以比對長的contig,還可以比對長的PB reads,它們的錯誤率都是不同的。(但是最好是比對到Reference上,而且不能三代比三代,不能二代比三代,也不能二代比二代)

For mapping 100bp sequences, BWA-MEM shows better performance than several state-of-art read aligners to date.

引言

Most short-reads mappers for next-generation sequencing (NGS) data were developed when sequence reads were about 36bp in length. For 36bp reads, it is reasonable to require end-to-end alignment (i.e. every read base to be aligned to the reference) and to only report hits within certain hamming or edit distance.  2013年,大多數的NGS的比對工具都是為了36bp開發的,36bp基本就都是全局比對了,因此只輸出特定編輯距離的hits(現在不是了,最少也有50bp吧,長一點的都300bp了)

However, with emerging technologies and improved chemistry, NGS reads are not short any more, which poses new challenges to read alignment. 和現在的PacBio一樣,reads的長度一變,比對的工具也要大革新

For 100bp or longer reads, it becomes more important to allow long gaps under the affine-gap penalty and to report multiple nonoverlapping local hits potentially caused by structural variations or misassemblies in the reference genome. 重點來了:對於長reads,在affine-gap penalty(防射空位罰分,即區分了gap open和gap extend) 允許長 gaps 非常重要,而且需要輸出SV和Ref錯誤組裝引起的多個不重疊的局部hits。參見:防射空位罰分

之前的許多算法不支持長reads比對,有些成熟的算法支持sanger序列比對,但他們很慢,而且不適合分析大規模的NGS數據,所以急需開發針對NGS的長reads比對。

目前,已有一些long-read alignment algorithms:BWA-SW、Bowtie2、Cushaw2、GEM。但都有不足:

  1. BWA-SW is slower than bowtie2 for 100bp reads at a comparable accuracy and less accurate then Cushaw2 at a comparable speed.
  2. Bowtie2 and Cushaw2 are slower for 600bp reads (see Results).
  3. GEM is both fast and accurate for up to approximately 1000bp reads, it mandates end-to-end alignment and does not perform affine-gap alignment, which limits its uses for long-read alignment.

Even for typical sequence reads ranged from 100bp to 1000bp in length, no mappers work well across the full spectrum.  這就是開發BWA-MEM的目的,全面適應100bp to 1000bp 的reads的比對。

方法

Aligning a single query sequence

1.Seeding and re-seeding

BWA-MEM follows the canonical seed-and-extend paradigm. It initially seeds an alignment with supermaximal exact matches (SMEMs) using an algorithm we found previously (Li, 2012, Algorithm 5), which essentially finds at each query position the longest exact match covering the position.  BWA依舊沿用了之前的 seed-and-extend 策略,它使用之前的算法,用 supermaximal exact matches (SMEMs)來seeds一個比對,找到每個查詢位點的 longest exact match 來覆蓋該位點。

Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly (SMEMs)

However, occasionally the true alignment may not contain any SMEMs. To reduce mismappings caused by missing seeds, we introduce re-seeding. 然而,真實的比對不一定包含任何SMEMs,為了減少丟失seeds引起的錯誤比對,我們引入了re-seeding過程。

Suppose we have a SMEM of length l with k occurrences in the reference genome. If l is too large (over 28bp by default), we re-seed with the longest exact matches that cover the middle base of the SMEM and occur at least k + 1 times in the genome. Such seeds can be found by requiring a minimum occurrence in the original SMEM algorithm.  假設我們有一個長l、出現k次的SMEM ,如果l太長,我們會re-seed longest exact matches,使其包含SMEM中間的鹼基且最少出現k+1次。這樣,seeds就會需要一個最小的出現次數在原來的SMEM算法中。(完全不知所雲)

2.Chaining and chain filtering

We call a group of seeds that are colinear and close to each other as a chain. We greedily chain the seeds while seeding and then filter out short chains that are largely contained in a long chain and are much worse than the long chain (by default, both 50% and 38bp shorter than the long chain). Chain filtering aims to reduce unsuccessful seed extension at a later step. Each chain may not always correspond to a final hit. Chains detected here do not need to be accurate.  我們稱一群共線性且彼此接近的seeds為一個chain,我們貪婪的將seeds鏈在一起,在seeding的時候,然后過濾掉短的chains。 過濾chain的目的在於減少下一步中的不成功的seed extension。每一個chain不一定總是對應一個最終的hit。這個時候的chains的檢測不一定要非常准確。

3.Seed extension

We rank a seed by the length of the chain it belongs to and then by the seed length. For each seed in the ranked list,
from the best to the worst, we drop the seed if it is already contained in an alignment found before, or extend the seed with a banded affine-gap-penalty dynamic programming (DP) if it potentially leads to a new alignment.

BWA-MEM’s seed extension differs from the standard seed extension in two aspects.

Firstly, suppose at a certain extension step we come to reference position x with the best extension score achieved at query position y. BWAMEM will stop extension if the difference between the best extension score and the score at (x, y) is larger than Z+|x−y|×pgapExt, where pgapExt is the gap extension penalty and Z is an arbitrary cutoff. This heuristic avoids extension through a poorly aligned region with good flanking alignment. It is similar to the X-dropoff heuristic in BLAST (Altschul et al., 1997), but does not penalize long gaps in one of the reference or the query sequences.

Secondly, while extending a seed, BWA-MEM tries to keep track of the best extension score reaching the end of the query sequence. If the difference between the best score reaching the end and the best local alignment score
is below a threshold, the local alignment will be rejected even if it has a higher score. BWA-MEMuses this strategy to automatically choose between local and end-to-end alignments. It may align through true variations towards
the end of a read and thus reduces reference bias, while avoids introducing excessive mismatches and gaps which may happen when we force a chimeric read into an end-to-end alignment.

Paired-end mapping

1.Rescuing missing hits

 

2.Pairing

 

結果和討論


免責聲明!

本站轉載的文章為個人學習借鑒使用,本站對版權不負任何法律責任。如果侵犯了您的隱私權益,請聯系本站郵箱yoyou2525@163.com刪除。



 
粵ICP備18138465號   © 2018-2025 CODEPRJ.COM