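Most people know the surface-level difference between H and S: S means soft and H means hard. After a soft clip, the clipped bases are still kept in the read's sequence; after a hard clip, they are not.
------------------- Update: you can skip everything below. H and S are not fundamentally different, and alignment software cannot detect chimeras -------------------
But that is only the surface. At a deeper level, what is the essential difference between H and S?
First, we need the concept of a chimera:
A chimera is two different sequences that were wrongly joined together, so a single read aligns to two separate places on the reference (this is distinct from multi-mapping and from secondary alignments).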
Example of extended CIGAR and the pileup output.
(a) Alignments of one pair of reads and three single-end reads.
(b) The corresponding SAM file. The ‘@SQ’ line in the header section gives the order of reference sequences. Notably, r001 is the name of a read pair. According to FLAG 163 (=1 + 2 + 32 + 128), the read mapped to position 7 is the second read in the pair (128) and regarded as properly paired (1 + 2); its mate is mapped to 37 on the reverse strand (32). Read r002 has three soft-clipped (unaligned) bases. The coordinate shown in SAM is the position of the first aligned base. The CIGAR string for this alignment contains a P (padding) operation which correctly aligns the inserted sequences. Padding operations can be absent when an aligner does not support multiple sequence alignment. The last six bases of read r003 map to position 9, and the first five to position 29 on the reverse strand. The hard clipping operation H indicates that the clipped sequence is not present in the sequence field. The NM tag gives the number of mismatches. Read r004 is aligned across an intron, indicated by the N operation.
(c) Simplified pileup output by SAMtools. Each line consists of reference name, sorted coordinate, reference base, the number of reads covering the position and read bases. In the fifth field, a dot or a comma denotes a base identical to the reference; a dot or a capital letter denotes a base from a read mapped on the forward strand, while a comma or a lowercase letter on the reverse strand.
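The FLAG arithmetic in the caption (163 = 1 + 2 + 32 + 128) can be checked with a few lines of Python. This is only a minimal sketch: the bit values are the ones defined in the SAM specification, while the `FLAG_BITS` table and the `decode_flag` helper are names made up for this illustration.

```python
# SAM FLAG bits and their meaning (bit values as defined in the SAM specification).
# The short names are illustrative labels, not official identifiers.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG, lowest bit first."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(163))
# ['paired', 'proper_pair', 'mate_reverse', 'second_in_pair']
```

The same helper is handy for the 0x100 (secondary) and 0x800 (supplementary) bits that BWA sets on the extra pieces of a chimeric read.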
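A clipped alignment means that not all of the read's sequence was used in the alignment: the bases at the ends of the read were cut off (clipped or trimmed). The layout below is a clipped alignment.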
Alignment:
Read:      ACGGTTGCGTTAA-TCCGCCACG
               ||||||||| ||||||
Reference: TAACTTGCGTTAAATCCGCCTGG
The counterpart of a clipped alignment is a spliced alignment, in which the middle of the read is left unaligned while its two ends align. It looks like this:
Alignment:
Read:      ACGGTTGCGTTAAGCTCATCCGCCACG
           |||||||||||||     |||||||||
Reference: ACGGTTGCGTTAA.....TCCGCCACG
A clipped alignment can be written in CIGAR in two ways: S (soft clip) and H (hard clip).
The BWA release notes say: If the read has a chimeric alignment, the paired or the top hit uses soft clipping and is marked with neither 0x800 nor 0x100 bits. All the other hits part of the chimeric alignment will use hard clipping and be marked with 0x800 if option “-M” is not in use, or marked with 0x100 otherwise.
In other words, when a chimeric alignment is found, the best alignment (the top hit) is soft-clipped and the remaining hits are hard-clipped.
With a hard clip, the clipped part does not appear in the read's SEQ field in the SAM file (clipped sequences not present in SEQ); with a soft clip, it does (clipped sequences present in SEQ).
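The relationship between the two operations can be sketched in Python: turning a soft-clipped record into a hard-clipped one amounts to renaming S to H and dropping the clipped bases from SEQ and QUAL. The function name `soft_to_hard` and the toy read are assumptions for illustration, not part of BWA or SAMtools, and the sketch assumes the input CIGAR contains no hard clips yet.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_to_hard(cigar, seq, qual):
    """Turn soft clips into hard clips and trim SEQ/QUAL to match.

    Assumes the input CIGAR has no H operations.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Soft clips can only appear at the two ends of the CIGAR.
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    new_cigar = "".join(f"{n}{'H' if op == 'S' else op}" for n, op in ops)
    end = len(seq) - right
    return new_cigar, seq[left:end], qual[left:end]


# Toy 10 bp read (made up): 3 clipped bases, 5 aligned, 2 clipped.
print(soft_to_hard("3S5M2S", "AACCGGTTAC", "IIIIIIIIII"))
# ('3H5M2H', 'CGGTT', 'IIIII')
```

The reverse conversion is not possible from the record alone, because the hard-clipped bases are no longer stored anywhere in it.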
Understand?
Ref:https://github.com/lh3/bwa/blob/master/NEWS.md
Reposted from: http://wp.zxzyl.com/?p=131
Interpretation 1:
Hard masked bases do not appear in the SEQ string, soft masked bases do.
So, if your CIGAR is `10H10M10H`, the SEQ will only be 10 bases long.
If your CIGAR is `10S10M10S`, the SEQ and base qualities will be 30 bases long.
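First, the records look different. With 10H10M10H, the sequence in column 10 (SEQ) shows only 10 bp; with 10S10M10S it shows 30 bp, even though the 20 bp at the start and end did not align.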
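The two lengths can be reproduced by summing only the CIGAR operations whose bases are stored in SEQ (M, I, S, = and X in the SAM specification); H is deliberately left out. `seq_length` is a hypothetical helper name used for this sketch.

```python
import re

# Operations whose bases are stored in SEQ; H (hard clip) is not among them.
SEQ_OPS = set("MIS=X")


def seq_length(cigar):
    """Length of the SEQ field implied by a CIGAR string."""
    return sum(int(n)
               for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)
               if op in SEQ_OPS)


print(seq_length("10H10M10H"))  # 10
print(seq_length("10S10M10S"))  # 30
```

Applied to the CIGAR strings in the example records, the same function gives 101 for 69M32S and 35 for 66H35M, matching the lengths noted next to their sequences.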
In the case of soft-masking, even though the SEQ is present, it is not used by variant callers and not displayed when you view your data in a viewer. In either case, masked bases should not be used in calculating coverage.
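With soft clipping, the displayed sequence is longer than with hard clipping, but those extra bases are still ignored when calling variants or visualizing the alignment. In both cases, the clipped bases are not counted when computing coverage.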
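A small sketch of why clipped bases add nothing to coverage: only operations that place read bases on the reference contribute, and S and H never advance the reference coordinate. The positions and CIGAR strings below are made up, `covered_positions` is a hypothetical helper, and the sketch assumes that only aligned bases (M, =, X) count toward depth; whether deleted positions count differs between tools.

```python
import re
from collections import Counter


def covered_positions(pos, cigar):
    """Yield the 1-based reference positions covered by aligned bases."""
    ref = pos  # POS is the position of the first aligned base
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op in "M=X":        # aligned bases: count them and advance
            yield from range(ref, ref + length)
            ref += length
        elif op in "DN":       # deletion / skipped region: advance only
            ref += length
        # S, H, I and P neither cover nor advance the reference


depth = Counter()
# Hypothetical alignments: (POS, CIGAR)
for pos, cigar in [(100, "10S10M10S"), (100, "10H10M10H"), (105, "10M")]:
    depth.update(covered_positions(pos, cigar))

print(depth[99], depth[100], depth[105], depth[110], depth[115])
# 0 2 3 1 0
```

The soft-clipped and hard-clipped records contribute exactly the same ten positions, 100 to 109.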
Example:
20692128 97 viral_genome 21417 60 69M32S chr7 101141242 0 TACATCTTCTCCCTCTCTCACGACACAAGAATTAGTCACATAGGGATGTTCTCGTAAATCTACATTATCTTACAAAAACATTTTTTAAAAATTTGCTAGGT (101bp) GGGGGGGGGGGGGGEGGEGGGGGGGGGFGGGGGGGGGGGGGEGFFGGGGGGGFGGFGGGGEGGGGGGGGGGGEGEFFGGGFEGGGGGFGCGGGFBGGGBG@ NM:i:4 MD:Z:6G34G6C5C14 AS:i:49 XS:i:0 SA:Z:chr7,101141091,+,66S35M,60,0;
20692128 353 chr7 101141091 60 66H35M = 101141242 252 ATCTTACAAAAACATTTTTTAAAAATTTGCTAGGT (35bp)GGGGGGEGEFFGGGFEGGGGGFGCGGGFBGGGBG@ NM:i:0 MD:Z:35 AS:i:35 XS:i:23 SA:Z:gi|224020395|ref|NC_001664.2|,21417,+,69M32S,60,4;
20692128 145 chr7 101141242 60 101M gi|224020395|ref|NC_001664.2| 21417 0 GCAACAGAGCGAGACCCTATATTCATGAGTGTTGCAATGAGCCAAGTAGTGGAGGTTGGCTTTTGAAGGCAGAAAAGGACTGAGAAAAGCTAACACAGAGA FEGCGGGGGCGEFCDEEEEGGGGGGGGGGGGGGGEGGGGGGFGGGEGGG
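Interpretation 2:
When a single read aligns to different chromosomes (a chimeric read), the additional alignment is reported with hard clipping. In the example above, R1 aligns to both the viral genome and chr7 (its first 69 bp to the viral genome, as 69M32S, and its last 35 bp to chr7, as 66H35M), while R2 aligns to chr7.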
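To see how the two pieces of R1 fit together, the SA tag of the first record can be parsed and each CIGAR's clipping turned into read coordinates. This sketch assumes the SA layout described in the SAM optional-fields specification (rname,pos,strand,CIGAR,mapQ,NM, with entries separated by semicolons). `parse_sa` and `read_interval` are hypothetical helpers, and the interval logic assumes both pieces lie on the same strand, as they do here.

```python
import re


def parse_sa(tag):
    """Parse an 'SA:Z:...' tag into a list of dicts, one per listed hit."""
    value = tag.split(":", 2)[2]
    hits = []
    for entry in filter(None, value.split(";")):
        rname, pos, strand, cigar, mapq, nm = entry.split(",")
        hits.append({"rname": rname, "pos": int(pos), "strand": strand,
                     "cigar": cigar, "mapq": int(mapq), "nm": int(nm)})
    return hits


def read_interval(cigar):
    """0-based half-open interval of the read covered by the aligned part."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    # Full read length, counting hard-clipped bases as well.
    total = sum(n for n, op in ops if op in "MIS=XH")
    lead = 0
    for n, op in ops:
        if op not in "SH":
            break
        lead += n
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return lead, total - trail


primary = read_interval("69M32S")
hit = parse_sa("SA:Z:chr7,101141091,+,66S35M,60,0;")[0]
other = read_interval(hit["cigar"])
print(primary, other, read_interval("66H35M"))
# (0, 69) (66, 101) (66, 101)
```

The SA entry writes the chr7 piece as 66S35M while its own record uses 66H35M; both describe the same part of the read. The two pieces overlap by 3 bp (69 + 35 = 104 on a 101 bp read).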